Abstract
Within the broader concept of big data, data about living systems is referred to here as medical data. It contains crucial details for analysis, and applying suitable analysis methods lets biologists reach results in bioinformatics more efficiently. This research outlines data sequencing and analysis methods using details of DNA. It also shows the differences between two analysis approaches: statistical and medical. The work is based on classical statistical analysis, which shows which nucleotides, the building blocks of DNA, occur in a DNA dataset; from these counts an analyst can generate a histogram and analyze the sequence. Programming languages such as Python and R are used by analysts for prediction and for preparing reports. In this research, Python is used to work with GenBank data. The focus is purely on hypothesis testing, which relies on the structured nature of the dataset, even though the dataset is large. The research shows how Python works with big data to complete the analysis, on the basis of which an analyst can judge whether a DNA sequence is good or bad.
Introduction
Big data is commonly described by four attributes: veracity, variety, volume, and velocity, also called the 4 V's [2]. It also implies a style of data analysis that is hypothesis-generating rather than hypothesis-testing. Big data focuses mainly on the stability of associations rather than on causal connections, and assumptions about the underlying probability distribution are not required. Medical big data, as material to be analyzed, has various attributes that distinguish it not only from big data in other fields but also from traditional clinical epidemiology. Big data technology has many areas of application in healthcare, such as predictive modeling and clinical decision support, disease or safety surveillance, public health, and research. The recent rapid growth in the generation of digital data and in computational science enables us to extract new results from huge data sets, known as big data, in various fields such as internet business and finance. In healthcare, discovering new actionable results has not been as common, although several success stories have been published in academic journals as well as in the media. This delayed progress of big data technology in healthcare is somewhat odd, considering an earlier prediction that the application of big data technology was unavoidable and that healthcare would be one of the sectors expected to benefit most from it [1]. For this kind of analysis, analysts combine big data concepts with suitable techniques; in this work Python IDLE 3.7 is used, and Python is the main programming language for the analysis.
Literature Review
Zhao, W., Zeng, X., and Xiao (2015) used the Thermococcus eurythermalis GenBank record in their research. The GenBank record contains nucleotide data: adenine, guanine, thymine, and cytosine. They used this record for analysis purposes only. [10]
Choong Ho Lee and Hyung-Jin Yoon (2017) considered two types of analysis: medical big data analysis and classical statistical analysis. They show how hypothesis generation and hypothesis testing are used by data mining algorithms. [9] They also review both types of analysis and identify the differences between the two techniques. Kidney Res Clin Pract. (2017) uses executable data mining algorithms in the data analysis. [8]
Shad arf (2018) used the Python language for analysis. He shows how Python can be used with huge datasets and how it works with the concept of big data.
Medical Big Data
Medical big data have distinctive features that differ from big data in other fields. Medical big data are sometimes difficult to access, and most investigators in the medical arena are hesitant to practice open data science, because of the risk that other parties could misuse the data and because of the lack of data-sharing incentives [2]. Medical big data can be classified into three common forms: huge n and small p (n = number of samples, p = number of parameters); small n and huge p; and huge n and huge p [3]. Data with huge n and small p can be handled with classical statistical methods. One example of this kind of data is supervised claim data. Because this kind of data tends to be incomplete, noisy, and inconsistent, data cleaning, such as defining the cases to be analyzed, is not trivial, and understanding the context of data collection is essential.
Analysis of Medical big data
Big data analysis uses various data mining algorithms. Data mining can be defined as the automatic extraction of useful, often previously unknown details from huge databases or datasets, using advanced search techniques and algorithms to discover patterns and connections in existing data [4]. The tasks of data mining can be summarized as description (finding human-interpretable patterns and associations) and prediction (predicting some response of interest) [5]. Clinical data mining can be defined as the application of data mining to a clinical problem [6].
Data mining algorithms are classified as supervised, unsupervised, and semi-supervised learning. Supervised learning aims to predict a known target output, using a training set of already classified data to draw inferences for classifying prospective test data. In unsupervised learning there is no output to predict, so analysts try to find naturally occurring patterns or groupings within unlabeled data. Semi-supervised learning aims to balance performance and accuracy by using a small set of labeled data together with a large collection of unlabeled data [7].
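The contrast between learning with and without labels can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the readings and labels are hypothetical, the nearest-centroid rule and the one-dimensional two-means grouping are chosen here for brevity and are not methods taken from the cited works, and `two_means` assumes the input contains at least two distinct values.

```python
from statistics import mean

def fit_centroids(values, labels):
    """Supervised step: learn one centroid (mean) per known class label."""
    grouped = {}
    for value, label in zip(values, labels):
        grouped.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in grouped.items()}

def predict(centroids, value):
    """Assign a new value to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def two_means(values, iterations=10):
    """Unsupervised step: split unlabeled values into two groups (1-D k-means, k=2).

    Assumes at least two distinct values, so neither group is ever empty.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)

# Hypothetical biomarker readings; labels are available only in the supervised case.
readings = [1.0, 1.2, 0.9, 4.8, 5.1, 5.3]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

centroids = fit_centroids(readings, labels)
print(predict(centroids, 4.5))   # classify a new, unseen reading
print(two_means(readings))       # groups found without using any labels
```

In the supervised part the labels drive the rule that is learned; in the unsupervised part the same readings are grouped purely by how close they lie to each other.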
The analytical goals of medical big data are prediction, modeling, and inference. Classification, clustering, and regression are common methods in these contexts [3]. Classification is a kind of supervised learning and can be thought of as predictive modeling in which the output variable is categorical. Classification means constructing a rule that assigns objects to one of a pre-specified set of classes (the outcome variable) based on a vector of measurements taken on these objects. Classification techniques include decision trees, logistic regression, naive Bayesian methods, neural networks, Bayesian networks, and support vector machines. [4] Classification performance can be judged by various performance metrics evaluated on a test set or an independent validation set. These techniques can be used to develop a decision support system that assigns a diagnosis among several possible diagnoses, or to build models that predict a prognosis based on data from the analysis of many biomarkers. [3] Among regression methods, linear regression is the most frequently used technique. Examples of its applications include ongoing analysis of patient data and decision support systems. [3]
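As an illustration of judging a classifier on a test set, the sketch below computes three common metrics (accuracy, precision, and recall) from true and predicted labels. The label lists are hypothetical, and the choice of these three metrics is an assumption made for the example; the text does not say which metrics are meant.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives, treating one class as positive."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(actual, predicted, positive):
    """Return accuracy, precision, and recall for the chosen positive class."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test-set labels and the predictions of some classifier.
actual = ["disease"] * 3 + ["healthy"] * 5
predicted = ["disease", "disease", "healthy", "healthy",
             "healthy", "healthy", "disease", "healthy"]

print(classification_metrics(actual, predicted, positive="disease"))
```

Accuracy alone can mislead when one class is rare, which is why precision and recall for the class of interest are reported alongside it here.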
Classical statistical analysis
Classical statistical analysis is hypothesis-testing: it tries to demonstrate a causal connection. For this type of analysis, the dataset is selected from a single source with a clearly specified data collection. The data are in a structured format and their quality is controllable. Programming languages such as Python and R can be used for this type of analysis, which is also applied in the field of bioinformatics. Python is the main language used for the bioinformatics analysis in this work. The analytic goal of classical statistical analysis is to generate a statistical score contrasted against random chance. [3]
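One way to picture a statistical score contrasted against random chance is a chi-square goodness-of-fit test on nucleotide counts. The sketch below is an assumption made for illustration, not the test used in this research: the null hypothesis is that A, C, G, and T are equally frequent, the toy sequence is hypothetical, and the critical value 7.815 is the standard chi-square threshold for 3 degrees of freedom at the 0.05 level. The test also assumes independent observations, which neighbouring bases in a real genome do not strictly satisfy.

```python
from collections import Counter

# Chi-square critical value for 3 degrees of freedom at the 0.05 level.
CHI2_CRITICAL_DF3_ALPHA05 = 7.815

def chi_square_uniform(sequence, alphabet="ACGT"):
    """Chi-square statistic for H0: all bases in `alphabet` are equally frequent."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in alphabet)
    if total == 0:
        raise ValueError("sequence contains none of the expected bases")
    expected = total / len(alphabet)
    return sum((counts[base] - expected) ** 2 / expected for base in alphabet)

# Hypothetical GC-rich toy sequence (not taken from the GenBank record).
toy_sequence = "G" * 40 + "C" * 40 + "A" * 10 + "T" * 10

statistic = chi_square_uniform(toy_sequence)
print(statistic)
print("reject H0" if statistic > CHI2_CRITICAL_DF3_ALPHA05 else "do not reject H0")
```

A statistic above the critical value means the observed counts are unlikely under equal frequencies, which is the hypothesis-testing step described above.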
One of the major tasks of classical statistical analysis is to determine the empirical frequency distribution, which gives the absolute or relative frequency of each possible result of repeatedly measuring a property of an object or a class of objects when only a finite number of different outcomes is possible (the discrete case). [4] The alternative is to think of an infinitely repeated and arbitrarily detailed measurement in which every outcome is different. Classical statistical analysis is based on repeatedly measuring properties of objects, and it aims to estimate the frequency with which certain results will occur when the measuring operation is repeated at random, that is, stochastically. [5]
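For the discrete case, an empirical frequency distribution is simply a table of absolute and relative frequencies per outcome. A minimal sketch, using hypothetical blood-type observations as the repeated measurement (the data and the function name are assumptions for illustration):

```python
from collections import Counter

def empirical_distribution(observations):
    """Map each outcome to (absolute frequency, relative frequency)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {outcome: (count, count / total)
            for outcome, count in sorted(counts.items())}

# Hypothetical repeated measurement with a finite set of possible outcomes.
observed = ["A", "O", "O", "B", "A", "O", "AB", "O"]

for outcome, (absolute, relative) in empirical_distribution(observed).items():
    print(outcome, absolute, relative)
```

The relative frequencies sum to 1 and serve as estimates of how often each outcome would occur if the measurement were repeated.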
To demonstrate this type of analysis in bioinformatics, I take a sample DNA dataset. From it we can calculate the frequency of each nucleotide and the ratio of the nucleotides present in the dataset, and generate a histogram that helps the analyst make a decision.
Classical statistical analysis Implementation
This section presents the actual execution of the model in Python IDLE 3.7, which takes a DNA dataset as input, plots a graph, and shows the ratio of DNA nucleotides so that they can be easily analyzed. The Python method counts nucleotides and performs a statistical analysis of nucleotide frequency, which varies between species. DNA contains four types of nucleotide in its structure: adenine (A), thymine (T), guanine (G), and cytosine (C). Some species tend to have a higher proportion of guanine and cytosine than of thymine and adenine, and hence a higher GC% than AT%. For this reason, the Thermococcus eurythermalis GenBank record is used (GenBank ID: CP008887.1).
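The GC% and AT% mentioned above can be computed directly from base counts. The following is a minimal sketch; the function name and the short test sequence are hypothetical, and characters other than A, T, G, and C (for example N) are left out of the denominator by assumption.

```python
def gc_at_percent(sequence):
    """Return (GC%, AT%) computed over the A, T, G and C bases only."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, T, G or C bases")
    return 100.0 * gc / total, 100.0 * at / total

# Hypothetical short sequence, not taken from the GenBank record.
gc_percent, at_percent = gc_at_percent("GGCCGATA")
print(gc_percent, at_percent)
```

For a genome with a higher proportion of guanine and cytosine, the first value returned is larger than the second.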
In this research, the DNA dataset is taken from the nucleotide database, which contains sequences from sources including GenBank and PDB. These data are useful for biomedical as well as bioinformatics research.
- Python Method: A variable holds the DNA data, count() is used to count the nucleotides, and if statements select each nucleotide. A histogram is generated with the matplotlib library, and a function reports the percentage ratio of each nucleotide. This helps in analyzing the DNA structure.
- Dataset: The Thermococcus eurythermalis GenBank record is used. The dataset is 20 MB in size and is in FASTA format (Thermococcus eurythermalis.fasta) (Source: https://www.ncbi.nlm.nih.gov/nuccore/?term=Thermococcus+eurythermalis)
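The method described in the list above can be sketched as follows. This is not the original script, which is not shown in the text: the function names are hypothetical, the sample lines stand in for the real FASTA file, all records in the file are assumed to be concatenated into one sequence, and a bar chart is used as the histogram of counts. The matplotlib import is placed inside the plotting function so that counting still works where matplotlib is not installed.

```python
def read_fasta_sequence(lines):
    """Join the sequence lines of a FASTA source, skipping '>' header lines."""
    return "".join(line.strip() for line in lines
                   if not line.startswith(">")).upper()

def count_nucleotides(sequence):
    """Count each of the four nucleotides with str.count()."""
    return {base: sequence.count(base) for base in "ATGC"}

def percentages(counts):
    """Convert absolute counts into percentage ratios."""
    total = sum(counts.values())
    return {base: 100.0 * n / total for base, n in counts.items()} if total else {}

def plot_counts(counts, title="Nucleotide frequency"):
    """Draw the counts as a bar chart (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.bar(list(counts.keys()), list(counts.values()))
    plt.xlabel("Nucleotide")
    plt.ylabel("Count")
    plt.title(title)
    plt.show()

# In-memory stand-in for a FASTA file (hypothetical header and sequence).
sample_lines = [">example record\n", "ATGCGC\n", "GGTA\n"]

sequence = read_fasta_sequence(sample_lines)
counts = count_nucleotides(sequence)
print(counts)
print(percentages(counts))
```

With the real dataset, the file can be opened and passed in place of `sample_lines`, for example `with open("Thermococcus eurythermalis.fasta") as handle: sequence = read_fasta_sequence(handle)`, where the path is an assumption about where the download was saved; calling `plot_counts(counts)` then displays the chart.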
Conclusion
In the last few years, the field of data analysis has grown alongside medical fields such as biomedical science and bioinformatics. The scenario presented here shows how Python helps in data analysis to predict results in classical statistical analysis. Classical statistical analysis helps in bioinformatics: using it, an analyst can infer the DNA structure from GenBank. It shows the ratio of the nucleotides present in the DNA dataset, on the basis of which an analyst can judge whether the DNA is good or bad; such DNA can be helpful in forming hybrid genes. This type of analysis is used rather than traditional analysis.
References
- Sinha A, Hripcsak G, Markatou M. Large datasets in biomedicine: a discussion of salient analytic issues. J Am Med Inform Assoc. 2009;16:759–767. doi: 10.1197/jamia.M2780.
- Lavecchia A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov Today. 2015;20:318–331. doi: 10.1016/j.drudis.2014.10.012.
- Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77:81–97. doi: 10.1016/j.ijmedinf.2006.11.006.
- Iavindrasana J, Cohen G, Depeursinge A, Müller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inform. 2009:121–133.
- Deo RC. Machine learning in medicine. Circulation. 2015;132:1920–1930. doi: 10.1161/CIRCULATIONAHA.115.001593.
- Kidney Res Clin Pract. Medical data analysis. 2017 Mar;36(1):3–11. doi: 10.23876/j.krcp.2017.36.1.3.
- Choong Ho Lee and Hyung-Jin Yoon. Challenges in medical big data. 2017. The Korean Society of Nephrology.
- Zhao, W., Zeng, X., and Xiao. Biomed Inform Insights. (2015). doi: 10.4137/BII.S31559.
- Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA. 2013;309:1351–1352. doi: 10.1001/jama.2013.393.
- Scruggs SB, Watson K, Su AI, Hermjakob H, Yates JR 3rd, Lindsey ML, Ping P. Harnessing the heart of big data. Circ Res. 2015;116:1115–1119. doi: 10.1161/CIRCRESAHA.115.306013.