In this paper, we introduce genomics, the study of the structure, function, evolution, mapping, and editing of genomes. Genomics determines the sequence of the molecules that make up the DNA of an organism, and the genome plays a central role in it. A genome can be encoded as well as decoded; here we examine how to decode a genome using machine learning. Machine learning, the field concerned with the development and application of computer algorithms that improve with experience, provides tools for analysing genome sequencing data sets. Genomic sequencing data from each cell can be classified in different ways. We focus mainly on large collections of genomic data and on a methodology to decode them. Decoding a genome can also help to decipher genome-based diseases. This paper provides general guidance to assist in the selection of machine learning methods for analysing and decoding genomic data sets.
Genomics is a field of study that relies on bioinformatics. The genome is the complete set of functional and structural information encoded in the DNA sequence of a living cell. Genomes are found in cells, the microscopic structures that make up all organisms. A gene is a small part of the genome and is the fundamental unit of heredity. DNA is the molecule that carries the hereditary material in all living cells. Genome sequencing determines the order of DNA nucleotides, or bases, in a genome. Studying the entire genome sequence helps scientists understand how the genome works and how genes work together to direct the growth, development and maintenance of an entire organism.
Since 1953, when the double-helix structure of DNA was described, the DNA molecule has been understood as the physical medium in which genetic information is stored. By 2001, the Human Genome Project had produced a first draft sequence of a typical human genome. The human genome has about 20,000 protein-coding genes and about 25,000 noncoding genes. Some genes are crucial for life, while others are crucial for health. The genome is basically a DNA biopolymer: a long molecule that carries the basic hereditary and functional units of life. It is present in every cell in the body and is made up of four basic units, symbolized by A, C, G and T, arranged in a specific sequence. In 2003, the full human genome sequence was obtained: the entire 3 billion letters that make up human DNA. It took about 10 years to produce the first draft consensus human genome.
A genome sequence does contain some clues about where genes are, even though scientists are only beginning to learn how to interpret these clues. Genes account for less than 25 percent of the DNA in the genome, so knowing the entire genome sequence will help scientists study the parts of the genome outside the genes. In this work, the genome sequence model is implemented in PRISM, a probabilistic logic programming language and machine learning system, with generic algorithms for decoding.
Michael K.K. Leung describes the use of machine learning to interpret the genome. This approach considers only the genome within the cell, from which phenotypes and maps of DNA can be estimated more easily. The concept simplifies the view for both biologists and medical researchers. Machine learning plays a central role in making use of high-throughput measurements. The work also describes the cell variables and how the corresponding genome locations are obtained.
Gene expression and transcription are also addressed in that paper. A gene in the DNA is transcribed into a precursor RNA molecule, which gives rise to messenger RNA (mRNA). Translation creates a protein molecule (an amino-acid chain) by reading the three-letter 'codes' in the mRNA sequence. Other processes include polyadenylation, wherein adenine bases are appended to the end of the mRNA; mRNA stabilization, wherein the mRNA molecule is processed so as to make it less likely to degrade; mRNA localization, wherein the mRNA is moved to a location suitable for translation; and protein localization, wherein the protein is moved to a specific type of location in the cell.
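The transcription and translation steps described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, treats the input as the coding strand (so transcription reduces to replacing T with U), and ignores processing steps such as polyadenylation. The function names `transcribe` and `translate` are illustrative and do not come from the cited paper.

```python
from itertools import product

# Standard genetic code, laid out in the conventional U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCAAGTAA")
print(mrna)             # AUGGCCAAGUAA
print(translate(mrna))  # MAK
```

The one-letter output stands for methionine, alanine and lysine; the final codon UAA is a stop codon and ends the chain.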
Maxwell W. Libbrecht and William Stafford Noble introduce machine learning applications in genetics and genomics. They describe the stages involved in machine learning and the new applications that can be used in genomics. The output discussed here is a gene-finding model: a simplified model that captures the basic properties of a protein-coding gene. The model takes the DNA sequence of a chromosome, or a portion thereof, as input and produces detailed gene annotations as output.
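To make the input/output behaviour of a gene finder concrete (DNA sequence in, annotations out), the sketch below scans the forward strand for open reading frames. This is a far simpler technique than the model in the cited paper and is swapped in purely for illustration; the function name `find_orfs`, the `min_codons` parameter and the tuple layout of the annotations are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Positions are 0-based and `end` is exclusive. An ORF runs from an ATG
    to the next in-frame stop codon; `min_codons` counts both of them.
    """
    dna = dna.upper()
    annotations = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    annotations.append((start, i + 3, frame))
                start = None
    return sorted(annotations)


print(find_orfs("CCATGGCCAAGTAACC"))  # [(2, 14, 2)]
```

Because each reading frame is scanned independently and only the first ATG before a stop is kept, this sketch reports one candidate per stop codon and says nothing about splicing or alternative isoforms.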
However, this simplified model cannot identify overlapping genes or multiple isoforms of the same gene. The process of encoding as well as road mapping is supported, but the model does not support the latest machine learning techniques.
The drawback of the two papers above is that neither addresses genome decoding, through which personalized medicines could be obtained and genome-based diseases deciphered.
Proposed work/ Implementation
The aim is to show how machine learning methods applied to large-scale biological data can dramatically transform our understanding of the genome. This takes us one step closer to the goal of personalized medicine, using machine learning methods to decode the genome and decipher the genomic basis of disease. Today, human genome sequencing is growing rapidly. Genome sequencing determines the order of DNA nucleotides, or bases, in a genome: the order of the As, Cs, Gs, and Ts that make up an organism's DNA. No two individuals are alike, and the differences at specific locations in the genome are referred to as genetic variants. Genetic variants that are statistically associated with a disease make it possible to pinpoint the positions in the genome linked to it. A genome sequence does contain some clues about where genes are, but sequencing alone is not enough to know the meaning of any particular letter in the genome, such as 'A', or of any particular word or group of letters, such as 'CCAGAGGC'. This work focuses on decoding genome function. Genome sequencing is often compared to 'decoding'.
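The notion of a genetic variant as a difference at a specific location can be sketched in a few lines of Python. The example assumes two sequences that are already aligned and of equal length, and it reports substitutions only (no insertions or deletions); the function name `find_variants` and the sample sequence are made up for illustration.

```python
def find_variants(reference: str, sample: str):
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 0-based. Both sequences must already be aligned.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


reference = "CCAGAGGC"   # the example word used in the text
sample = "CCAGTGGC"      # hypothetical individual with one substitution
print(find_variants(reference, sample))  # [(4, 'A', 'T')]
```

Listing the variant tells us where two genomes differ, but not what the difference means, which is the gap that decoding genome function is meant to close.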
Decoding can be defined as interpreting the genome in the context of its cell types. There are thousands of cell types and tissues in the human body, and all have exactly the same genome. Yet one genome sequence gives rise to a diversity of functions, because no cell uses the whole genome; each uses a specific part of it. How, then, can we interpret the genome? Within the genomic sequence there are various types of functional elements, chief among them genes. Genes encode proteins, which are the workhorses of the cell. Other elements, called control elements, tell the cell which genes should be activated and which should not. Knowing the precise location and identity of these control elements and genes in different cell types is essentially what allows us to decode the genome. The principle is simple: cells use various kinds of biochemical markers, which can be thought of as red, yellow and green flags. Each of these biochemical markers marks a different type of element, including the control elements as well as the genes.
The NIH funded large-scale projects for this purpose, carried out by two consortia called ENCODE and Roadmap Epigenomics. They measured hundreds of biochemical markers genome-wide in hundreds of different cell types. The result is a data cube holding information about hundreds of biochemical markers across hundreds of cell types, and it makes sense to apply machine learning methods to transform this massive data cube into a functional annotation of the genome. To see what the data cube looks like, consider a specific region of the genome containing a very interesting gene called PAX5, which has an important function in immune system biology. Regions coloured red and yellow are active control elements, green marks an active gene, and grey marks a repressed region. In embryonic stem cells, the locus is largely grey and is therefore inactive. In blood cells, specifically in a set of T cells, the gene becomes active and a number of control elements are activated as well. In various other cell types, the region is again repressed. From this one can say that the PAX5 locus potentially has a very specific role in immune function. Why is this relevant? After performing this transformation across the whole genome, integrating the entire data cube, we identified two million novel control elements in the human genome, and these 2 million elements control just 20,000 genes. This is an extremely complicated control circuitry for 20,000 genes, and the elements have highly cell-type-specific activity.
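How such an annotation is read across cell types can be sketched with a toy slice of the data cube. Everything in the example is hypothetical: the bin labels, the cell type names, the 0.5 threshold and the function names are made up, and the sketch shows only the final comparison across cell types, not the machine learning step that produces the labels.

```python
# Toy slice of the data cube: one locus split into four bins, with one
# annotation label per bin and cell type. All values are made up.
LOCUS_STATES = {
    "embryonic stem cell": ["repressed", "repressed", "repressed", "repressed"],
    "T cell": ["active_control", "active_gene", "active_gene", "active_control"],
    "liver cell": ["repressed", "repressed", "repressed", "repressed"],
}
ACTIVE_LABELS = {"active_control", "active_gene"}


def active_fraction(states):
    """Fraction of bins at the locus that carry an active label."""
    return sum(state in ACTIVE_LABELS for state in states) / len(states)


def active_cell_types(locus_states, threshold=0.5):
    """Cell types in which at least `threshold` of the locus is active."""
    return sorted(
        cell_type
        for cell_type, states in locus_states.items()
        if active_fraction(states) >= threshold
    )


print(active_cell_types(LOCUS_STATES))  # ['T cell']
```

A locus that is active in only one or a few cell types, as in this toy input, is the pattern the text describes as cell-type-specific activity.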
Methodology: The genome sequence model is implemented in PRISM, a probabilistic logic programming language and machine learning system, with generic algorithms for decoding. PRISM serves as a flexible modelling platform. The use of probabilistic logic programming for evaluating sequence models, such as those used by contemporary gene locators, has been demonstrated.
Given a sequence of genome predictions sorted by position, a path through the model that emits this prediction sequence represents a classification of the predictions into presumed true positives, emitted from the frame states, and presumed false positives, emitted from the delete state. A path with optimal probability represents the most likely hypothesis about the classification of the predictions into positives and negatives. This can be verified with a machine learning algorithm.
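The search for a path with optimal probability can be sketched as a small dynamic program in Python. This is an illustration under stated assumptions, not the PRISM implementation: each prediction is reduced to a (frame, score symbol) pair, a kept prediction is emitted by the frame state matching its own frame, the transition into a frame state depends on the most recently kept frame, and the delete state's emission over frames is folded into `emit_delete`. All names, parameters and probabilities are hypothetical.

```python
import math


def classify_predictions(preds, start, trans, p_delete, emit_keep, emit_delete):
    """Return one boolean per prediction: True = kept, False = deleted.

    `preds` is a position-sorted list of (frame, score_symbol) pairs.
    All probabilities are assumed to be strictly positive.
    """
    # best[last_frame] = (log-probability, decisions) of the best partial
    # path whose most recently kept prediction is in last_frame
    # (None means nothing has been kept yet).
    best = {None: (0.0, [])}
    for frame, symbol in preds:
        new_best = {}

        def offer(key, logp, decisions):
            if key not in new_best or logp > new_best[key][0]:
                new_best[key] = (logp, decisions)

        for last, (logp, decisions) in best.items():
            # Option 1: delete state; the last kept frame is unchanged.
            offer(last,
                  logp + math.log(p_delete) + math.log(emit_delete[symbol]),
                  decisions + [False])
            # Option 2: frame state, conditioned on the last kept frame.
            step = start[frame] if last is None else trans[last][frame]
            offer(frame,
                  logp + math.log(1 - p_delete) + math.log(step)
                  + math.log(emit_keep[symbol]),
                  decisions + [True])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


# Hypothetical parameters: two frames, two score symbols.
frames = ["+1", "-1"]
start = {f: 0.5 for f in frames}
trans = {a: {b: 0.5 for b in frames} for a in frames}
emit_keep = {"low": 0.2, "high": 0.8}
emit_delete = {"low": 0.8, "high": 0.2}
preds = [("+1", "high"), ("-1", "low"), ("+1", "high")]

print(classify_predictions(preds, start, trans, 0.5, emit_keep, emit_delete))
# [True, False, True]
```

With these toy numbers the low-scoring middle prediction is assigned to the delete state, while the two high-scoring predictions are kept.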
For each genome prediction, a state emits a score symbol and a frame. The score symbol is a symbolic value representing a range of scores for the predictions that the genome locator makes on an input sequence (such as ATTCG…). In this model, the transition probability is conditioned on the previous frame state instead of the previous state. The frame state transition probabilities are assumed to reflect the probability of a gene lying in a particular reading frame given the reading frame of the previous gene. To explore gene reading frame sequences further, there is also a second-order version of Frameseq. The transition probabilities between the frame states are estimated as the relative frequencies of adjacent genes in the various frames, as observed in the set of verified genes.
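Estimating the frame transition probabilities as relative frequencies can be sketched as follows. The sketch assumes the verified set is available as a position-sorted list of frame labels; the labels, the toy data and the function name `estimate_frame_transitions` are hypothetical, and no smoothing is applied to unseen transitions.

```python
from collections import Counter, defaultdict


def estimate_frame_transitions(frame_sequence):
    """Relative frequency of each next frame, given the previous frame."""
    counts = defaultdict(Counter)
    for previous, current in zip(frame_sequence, frame_sequence[1:]):
        counts[previous][current] += 1
    return {
        previous: {
            current: count / sum(following.values())
            for current, count in following.items()
        }
        for previous, following in counts.items()
    }


# Hypothetical frames of consecutive verified genes, sorted by position.
verified_frames = ["+1", "+1", "+2", "+1", "-1", "+1", "+1"]
probs = estimate_frame_transitions(verified_frames)
print(probs["+1"])  # {'+1': 0.5, '+2': 0.25, '-1': 0.25}
```

Each inner dictionary sums to one, so it can be read directly as the conditional distribution over the next frame.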
The probability of a transition to the delete state gives the probability that a genome decoding prediction is a false positive:

$$P(\text{delete}) = 1 - \frac{TP}{TP + FP}$$

where $TP$ is the number of true positives predicted by the genome locator and $FP$ is the number of false positives.
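As a quick numerical check of the delete-state probability, the formula can be written as a small Python function. The function name and the counts in the example are hypothetical.

```python
def delete_probability(tp: int, fp: int) -> float:
    """P(delete) = 1 - TP / (TP + FP), the share of false positives."""
    if tp + fp == 0:
        raise ValueError("at least one prediction is required")
    return 1 - tp / (tp + fp)


# Hypothetical locator with 80 true and 20 false positive predictions.
p = delete_probability(tp=80, fp=20)
print(round(p, 3))  # 0.2
```

The same value can be obtained as FP / (TP + FP), so a locator with no false positives gives a delete probability of zero.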
This probability is directly related to the specificity of the genome finder, and varying it allows decoding at different sensitivity/specificity trade-offs. The frame state transition probabilities are relative frequencies, which are interpreted as conditional probabilities. The score ranges are selected to ensure that each score symbol corresponds to an equal proportion of genome predictions. The number of ranges (n) is a parameter. If n is high, the model can better exploit the scores from the genome locator, but more emission probabilities must then be estimated, and more data is needed to estimate them reliably. The emission probabilities of the delete state are estimated as the relative frequency of each of the possible score symbols among all false positive predictions:
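$$P(\delta_i \mid \text{state} = \text{delete}) = \frac{FP_{\delta_i}}{FP}$$

where $FP_{\delta_i}$ is the number of false positive predictions with score symbol $\delta_i$ and $FP$ is the total number of false positive predictions.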
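The two estimation steps above, choosing score ranges that each cover an equal proportion of predictions and computing the delete-state emission probabilities, can be sketched in Python. The rank-based binning is one possible way to obtain equally frequent symbols and is an assumption, as are the function names and the toy scores; tied scores may fall into different symbols in this simple version.

```python
from collections import Counter


def equal_frequency_symbols(scores, n):
    """Map each score to a symbol 0..n-1 so symbols are (nearly) equally frequent."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    symbols = [0] * len(scores)
    for rank, index in enumerate(order):
        symbols[index] = rank * n // len(scores)
    return symbols


def delete_emission_probs(symbols, is_false_positive):
    """Relative frequency of each score symbol among false positive predictions."""
    fp_symbols = [s for s, fp in zip(symbols, is_false_positive) if fp]
    counts = Counter(fp_symbols)
    return {symbol: count / len(fp_symbols) for symbol, count in counts.items()}


# Hypothetical scores and labels for eight predictions.
scores = [0.1, 0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6]
is_fp = [True, False, True, False, True, False, True, False]

symbols = equal_frequency_symbols(scores, n=4)
print(symbols)                                # [0, 3, 1, 3, 1, 2, 0, 2]
print(delete_emission_probs(symbols, is_fp))  # {0: 0.5, 1: 0.5}
```

In this toy input all false positives have low scores, so the delete state only ever emits the two lowest score symbols; with real data, symbols that never occur among false positives would need some form of smoothing.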
We obtained a technique to decode the human genome. The methodology is new compared with the previous papers. Through new computational techniques of machine learning, large genomic data sets can be decoded, so that a single letter or a group of letters in the sequence can be interpreted in terms of genome function.