Introduction and Background
With the popular accessibility of genome sequencing, interpreting genetic variation has become a central challenge in medical genetics. Identifying disease-associated variation within a vast background of neutral variation is key to both gene discovery and diagnostics. This is of particular importance in the study of newly arising variants and functionally essential protein domains. Variants which cause protein sequence changes play an important role in disease, alongside other factors such as aberrant regulation of gene and protein activity. Several analytical approaches have been applied to identify functionally important sequence variants and shed light on how they may exert their effects (Hon, Zhang, Kaminker & Zhang, 2009). The presence of a high number of mutations at the same protein residue, also known as ‘‘hot spot’’, can be used as a clear indicator of functional significance. As a result, a variety of statistical methods have been developed to detect elevated mutation frequency at the level of individual genes, protein domains, and gene sets (Yue et al., 2010). Although it is becoming more accessible and affordable for individual laboratories to generate next-generation sequencing (NGS) data, the major hurdle in utilising these sequences lie in how to interpret the hundreds of genetic variants that each harbours, especially in genomic medicine settings (Lyon and Wang, 2012).
In 2015, the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) published updated standards and guidelines for the clinical interpretation of sequence variants concerning human diseases on the basis of 28 criteria (Richards et al., 2015). With the rapid development and adoption of NGS, variant interpretation has become more complex, and new challenges in the clinical interpretation of Mendelian and complex diseases have emerged. The process of identifying disease-causing or disease-contributing variants among the thousands of genetic variants within an individual’s genome generally involves some steps, such as variant filtering, variant curation, in silico prediction, and clinical interpretation by human experts. The ACMG-AMP guideline has provided a consistent process to evaluate the pathogenic potential of genetic variants using informed criteria and also updated the understanding of the clinical significance of any given sequence variant.
The curation of clinically relevant genetic variants in the rare disease space is difficult and time-consuming. Following the ACMG guidelines, a key aspect of variant curation is the identification and evaluation of functionally essential protein regions (Richards et al., 2015). Based on the guidelines, if a variant is located in a mutational hotspot or critical and well-established functional domain without benign variation, the variant will be assigned a moderate evidence of pathogenicity or PM1. The value of identifying functionally essential protein regions or ‘hotspots’ has been previously supported by other groups which showed that functionally significant protein alterations are enriched at key analogous residues in related proteins (Leipe, Koonin and Aravind, 2003; Davis et al., 2010). However, the process of identifying the functionally essential regions is not well-established (Maxwell et al., 2016). Studies have shown that criteria between different laboratories share little ascertainment in common of whether a region is important or not (Amendola et al., 2016). In addition, to date there are few tools for the prediction of functionally essential regions in germline diseases. This results in additional variant clarification for applying the guidelines in curation processes.
The aim of this project is to use human variation databases (such as ClinVar) to identify the functionally essential protein regions in the genome. In brief, the most important steps are detecting homology regions and merging domain hot spots affecting conserved residues by Protein family analysis and multiple sequence alignment. Every detected region will have its confidence level. After that, the mutations and predicted hot spot would be aligned into human protein sequences. Since some functionally significant protein mutations will inevitably not fall within mutation clusters, functionally essential regions could be identified by partitioning the protein sequence into regions of a specific size (bins). The identification of domain hot spots can make an important contribution to the interpretation of data generated by next-generation sequencing and genetic association studies. By building this tool to the standards required in the clinical setting, it could help to accelerate the speed of clinical variant curation.
The main objectives for this project are:
- to implement a tool which could identify and visualize potential functionally essential protein regions and
- to assign these regions a confidence score which can be applied in clinical variant curation.
The project is based on two Hypothesis:
- A mutation occurred in functionally essential regions are more likely to be pathogenic.
- Each position of amino acids is equally likely to be mutated (this hypothesis will be reviewed in the nearby future as it may not be the best approach).
Some functionally significant mutations will inevitably not fall within mutation clusters. Hence it is difficult to detect based on the multiple sequence alignment by Pfam (protein families database) analysis (Bateman, 2000). Besides, a cluster will become apparent only when enough member mutations have been discovered, however, at present, there seems to be lack of reliable data. ClinVar (Landrum et al., 2015) catalogs the clinical significance of variants which reported directly from submitters. However, these databases often contain variants that are incorrectly classified without a primary review of the evidence, and they sometimes have contradictory records on the assessment of pathogenicity. How to filter the data and cut down the background noise is another challenge needed to be considered. Moreover, partitioning the protein sequence into regions of a specific size is an important approach for analysing whole genome sequence. However, the size of selected regions is also a challenging problem which needed further discussion.
Approach and methods
Background Data (ClinVar)
ClinVar (Landrum et al., 2015) is a freely accessible, public archive of reports of the relationships among genomic variants and phenotypes. To facilitate evaluation of the clinical significance of each variant, the database aggregates submissions of the same variant, displays supporting data from each submission, and determines if the submitted clinical interpretations are conflicting or concordant. Although the ClinVar database archives the clinical significance of variants reported directly from submitters, these data often contain variants that are incorrectly classified without a primary review of evidence, and they sometimes have contradictory records on the assessment of pathogenicity. This might due to the submission time and difference on lab references (Yang et al., 2017). Put in statistical terms, clinical significance represents an estimate of the true population significance, and current estimates are based on limited, often private datasets.
In this project, the data of missense mutations from ClinVar should be filtered before used as a reliable source of pathogenic mutations occurred in protein sequence. For example, based on its number of submissions, a suitable statistical distribution model could be fitted to exam the clinical significance of reported mutations. Or based on the significance type, age of submission and submitter expertise category to filter the outdated or incomplete submissions, then summarize these metrics to assess the consistency of the clinical significance of variants (Butler III & Gejman, 2018). Once its overall clinical significance is determined, these mutations could be aligned to protein sequence as reliable pathogenic missense mutations occurred in human genome.
Detect Essential Protein Regions
In order to detect which of the residues have similar properties, proteins of the same family have determined by the Pfam database (Punta et al., 2018) will be aligned together by multiple sequence alignment (MSA) to specify the homology regions residue by residue. For the residues which are identified as evolutionarily conserved or structurally related, the number of pathogenic missense variants will be aggregated to determine whether this residue is enriched in pathogenic variants. By extending this approach to nearby residues, the region which is enriched in pathogenic variants could be identified. The statistical method to determine enrichment of pathogenic variants per residue and region is yet to be defined. This is the first step to detect functionally essential regions. With this approach, this project will be able to predict potential functionally essential regions, even in the absence of pathogenic mutations in one or more proteins members within the protein family. The regions with zero or low level of mutations will be marked as potential essential regions with lower confidence level. Finally, since some mutations may occur outside the protein domains, re-alignment could help to complete the detection of cluster. Processing each sequence by analyzing equal length of residues each time to detect the existence of missense mutations. And merge the regions when several mutations occurred in close distance. Only in this way, missense mutations occurred inside and outside the domain could be aligned together.
Probability Distribution Function
Probability is a key concept in bioinformatics, essential to the interpretation of biomolecular data. Statistics is concerned with the treatment of variability and uncertainty. It has traditionally been seen as an important feature of biology science because the variability between individual organism is fundamental to the theory of evolution (Sokal & Rohlf, 1969). Probability is a number between 0 and 1 reflecting how likely something (‘an event’) is to happen. It can be interpreted in a frequentist way, where a probability of 0.5is taken to mean that if the same ‘trial’ were carried out many times the event would happen in approximately half of the trials.
A probability distribution is a mathematical function giving the probability that a variable will take a particular value or fall within a range of values. For example, the binomial distribution, which concerns binomial trials, can have only two possible outcomes. The normal distribution is illustrated in below graph. It is obvious that the characteristic bell-shaped curve. Two parameters can be varied to give different normal distributions, the mean and standard deviation.
In this project, the number of submissions for each missense mutation’s clinical significance data from database is needed to be filtered. It could be hard to determine the overall clinical significance when the sample size is large, and the clinical significance data from different labs are not fallen in one side (benign or pathogenic). The normality of data distribution can be checked using the Shapiro-Wilk normality test. If the significant value is bigger than 0.05, it indicates that no significant differences from a normal distribution were detected (Zhang, 2015). Then we could fit two normal distributions in each side and examine the standard deviation. Smaller standard deviation means a narrow curve and better fits (Ruklisa, Ware, Walsh, Balding & Cook, 2015). The statistical distribution for fitting the data is yet to be determined, if the data fail to pass the normality test, other statistical models such as Binomial or Poisson distribution should be considered.
Sequence Alignment in Protein
Protein sequences are constructed from an alphabet of 20 naturally occurring amino acids. Like nucleic acid sequences, protein sequences can be aligned to maximize the number of identically matching pairs within the alignment (Russell & Barton, 1992). By aligning sequences, the closely related regions could be found. In this case, the contribution of every aligned pair of identical amino acids to the alignment score is one, and the contribution of every aligned pair of non-identical amino acids is zero. The substitution score matrix is used to show scores for amino acid substitutions (Thompson, Higgins & Gibson, 1994). For example, like the PAM250 substitution matrix, its elements are scores for alignment of each possible pair of amino acids. For instance, the score for aligning the similar amino acids L and I are high, reflecting the high likelihood of this substitution as an evolutionary process.
In sequence alignment, sequence similarity and homology regions are different terms. Any sets of sequences can exhibit similarity. Sequence are homologous only if they evolved by divergence from a common ancestor (Thompson, Higgins & Gibson, 1994). It is correct to say that two sequence share 50% similarity and that indicates probable homology. Sequence similarity searches are used commonly to predict gene or protein function. Similar sequences are likely to be homologous and therefore to have similar functions (Russell & Barton, 1992). If a biological function is essential for an organism to survive and reproduce, then the loss of that function will be selected against by evolution, and this is the reason why homologous genes might be expected to retain the same function in different organisms (Andreatta & Nielsen, 2015).
PAM stands for ‘accepted point mutations’ (the P and A are swapped to make it easier to read). PAM250 refers to an evolutionary distance of 250 accepted point mutations (PAMs) per 100 amino acid residues. PAM matrices are divided by counting observed evolutionary changes in closely related protein sequences and then extrapolating the observed transition probabilities to longer evolutionary distances (Russell & Barton, 1992). It is possible to derive PAM matrices for any evolutionary distance, but in practice, the most commonly used matrices are PAM120 and PAM 250 (Agrafiotis, 2008). PAM matrices with smaller numbers represent shorter evolutionary distances. Because it is rarely possible to know what the evolutionary distance is, the PAM250 matrix usually produces reasonable alignments (Pei, 2008).
Evolution changes nucleic acid sequences, but protein coding sequences evolve under strong functional constraints. Changes to these sequences generally survive only if they do not have a deleterious effect on the structure and function of the protein. Because the loss of function of a protein is usually a disadvantage to the organism. An obvious exception to this occurs after a gene duplication event (McClure, Vasi & Fitch, 1994). When one copy of the gene can evolve very quickly, perhaps to become a pseudogene, or perhaps to gain a new and useful function. Usually, however, protein-coding genes evolve much more slowly than most other parts of any genome, because of the need to maintain protein structure and function (Andreatta & Nielsen, 2015).
However, when evolutionary changes do occur in protein sequence, they tend to involve substitutions between amino acids with similar properties, because such changes are less likely to affect the structure and function of the protein (Thompson, Higgins & Gibson, 1994). For instance, hydrophobic amino acids of similar size tend to substitute for each other quite well, because more often than not these occupy positions within the hydrophobic core of the protein, where tight packing and hydrophobicity strongly affect the stability of the protein structure. Examples of such variants would be changes between LEU, ILE and VAL. Protein sequences from within the same evolutionary family usually show substitutions between amino acids with similar physicochemical properties (Russell & Barton, 1992). The picture below shows physicochemical relationships between amino acids.
Summary of Relevant Methods
There are three tools which are similar to what this project want to achieve. Mutation aligner (Gauthier et al., 2015) has discovered and explored somatic mutation hotspots in protein domains. mCluster (Yue et al., 2010) analyzed around 9000 somatic cancer mutations, 19 000 germline disease mutations and 17 000 other annotations of amino acid residues in the context of sequence alignments of domains. And InterVar (Li & Wang, 2017) has combined some of databases, such as ClinVar, SIFT (Ng and Henikoff, 2001, 2003) and PolyPhen-2 ('PolyPhen-2: prediction of functional effects of human nsSNPs', 2018), together to offer a one-stop web server for human interpreters to derive a final score for genetic variants.
These relevant methods both use domains, rather than individual genes as the basis for the discovery of functionally relevant regions. The protein domain refers to the evolutionarily conserved and structurally related functional regions encoded in the primary sequence of protein (Samocha et al., 2017). Protein domain analysis enhances the statistical power to detect disease-relevant mutation. Moreover, it also links mutations to specific biological functions encoded in protein domains (Decker et al., 2016). Analysis of protein domains is a powerful method to assess the common biological functions of genes. Besides, based on the studies, proteins of same family with shared domains have high degree of sequence similarity (Lawrence et al., 2014). This is detected by multiple sequence analysis and hidden Markov models. Domain hotspots refer to the mutations affecting conserved residues in 3D structures or conserved positions in multiple sequence alignments of protein domains (Gauthier et al., 2015). By associating rare mutations of unknown function with well-characterized mutations in homologous domain positions, these tools may find a novel hypothesis generation, which is aggregating not only mutations in homologous genes but also mutations across sets of genes sharing a common domain. Some of these methods are implemented by multiple sequence alignments of protein domains in the human genome.
Mutation aligner combines the mutation data from TCGA Somatic Mutations database with protein domain position in Pfam-A database. All the mutations are map to protein sequences and protein domains. Then using multiple sequence alignments, this tool identifies the mutation hotspots at domain level. This method uses binomial statistical test to identify significant domain hotspots. The test takes the length and total number of mutations observed in the domain, then generated a P-value by comparing the number of mutations observed at that domain position to what would be observed by chance assuming a random distribution of mutations. All P-values were adjusted for multiple hypothesis testing using the stringent Bonferroni correction method. However, the number of expected mutations does not follow a random distribution, but it depends on sequence context and other gene-features. Therefore, a better approach will be determined to calculate the number of expected missense mutations per gene. The missense mutations will be then mapped to protein sequences and protein domains. Since it is a statistical test, a cut-off value is needed. The protein domain will be excluded if there is no missense mutation, or the Pfam-A expectancy score (e-value) was greater than 1e-5. Furthermore, if the domain is only present as one instance in the human genome, this domain will not be included as well. Domain sequence alignments are analyzed based on at least two missense mutations occurred at each position, and more than 3/4 of the domain alignments are non-gaps at each position.
The Mutation Aligner offers the possibility to draw new and unexpected links between observations of mutations across different genes as mutations may often affect equivalent residues of domains of paralogous genes. Analyzing mutations across different user-selected tumor types also offers the ability to transfer knowledge from one context to another, particularly when recurrent mutations in one cancer type overlaps with more infrequent mutations in other cancer types. By taking a domain-centric approach, some new mutation hotspots in domains of genes not previously associated with cancer could be noticed and studied. In this case, it could predict the functional role of many rare mutations.
mCluster uses binomial statistics to identify domain hotspots, which refers to clusters. It performs cluster detection using a step-down approach which could analyze the position with the largest number of mutations first followed by the next largest and so forth ('Mathematical and statistical methods for genetic analysis', 1998). Protein mutations with functional consequences are enriched in clusters at conserved positions across related proteins. Since the mutation hotspot occurs at domain level, observing many mutations occurred at the same ‘hotspot’ is a clear indicator of functional significance(Rubin & Green, 2007). The goal of this method is to identify which novel mutation is most likely to be a disease driver. By using protein family database and multiple sequence alignment, mutation clusters were identified assets of mutations located in the same column in a Pfam domain alignment.
The size of a cluster is the total number of observed mutations at that location. For cancer mutations, this means the total number of mutant samples. For germline disease mutations it means the number of unique mutations. The mCluster score is calculated by the following formula. In this formula, L stands for the number of available positions in a domain and n is the total number of mutations observed, where b is the probability of observing a cluster with a size less than k in a domain in binomial distribution models. In this case, each position is considered to have same probability of mutation.
The mCluster approach is based on the hypothesis that A mutation occurred in functionally essential regions are more likely to be disease-causing. Disease mutations are by definition functionally significant, and cancer mutations should also be enriched for functional mutations because driver mutations are under positive selection. The value of cluster analysis depends upon the domain hotspot hypothesis that functionally significant protein alterations are enriched at key analogous residues in related proteins. However, there is a problem that some functionally significant mutations will inevitably not fall within mutation clusters.