Mark F Rogers1, Tom R Gaunt2, Colin Campbell3.
Abstract
Sequencing technologies have led to the identification of many variants in the human genome which could act as disease-drivers. As a consequence, a variety of bioinformatics tools have been proposed for predicting which variants may drive disease, and which may be causatively neutral. After briefly reviewing generic tools, we focus on a subset of these methods specifically geared toward predicting which variants in the human cancer genome may act as enablers of unregulated cell proliferation. We consider the resultant view of the cancer genome indicated by these predictors and discuss ways in which these types of prediction tools may be progressed by further research.
Keywords: cancer; machine learning; variant prediction
Year: 2021 PMID: 33094325 PMCID: PMC8293831 DOI: 10.1093/bib/bbaa250
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Table 1
Some commonly used tools for predicting the pathogenic impact of variants in the human genome. Except for Eigen and Eigen-PC, most methods use supervised learning. Most methods use data integration, utilizing conservation measures, functional annotations and other feature groups to optimize prediction accuracy.
| Name | Method and features used | Reference |
|---|---|---|
| | Logistic regression model trained with a wide variety of genomic features. Uses proxy neutrals estimated from the last human-ape genome divide and simulated | Kircher et al. [ |
| | Deep neural network using conservation measures, epigenomics and genomic data. | Quang et al. [ |
| | Unsupervised learning methods using genomics, functional annotations and epigenomics. | Ionita-Laza et al. [ |
| | Multiple kernel learning and a later gradient boosting method using conservation measures, genomic and epigenomic features. | Shihab et al. [ |
| | Naive Bayes classifier using conservation measures, regulatory and genomic features. | Schwarz et al. [ |
| | Naive Bayes classifier using sequence and structure-based features. | Adzhubei et al. [ |
| | Random Forest classifier, scoring amino acid substitutions as pathogenic, neutral or unknown, using conservation, functional and structural annotations. | Niroula et al. [ |
| | Alignment scores based on sequence homology. | Choi et al. [ |
| | Position-specific scoring matrix derived from sequence homology. | Ng et al. [ |
| | Random Forest method using conservation measures, protein structural measures, genomic and amino acid features. | Carter et al. [ |
Table 2
Some further generic tools for predicting the pathogenic impact of variants in the human genome. These are examples of methods which build on pre-existing prediction methods to improve performance, either by using further data or by using an ensemble of pre-existing tools, prospectively weighted by relative accuracy.
| Name | Method and features used | Reference |
|---|---|---|
| DEOGEN2 | Random Forest classifier using | Raimondi et al. [ |
| GAVIN | Using a gene-specific calibration approach enhances test accuracy of | Van de Velde et al. [ |
| M-CAP | Gradient boosting tree classifier using 9 pre-existing tools ( | Jagadeesh et al. [ |
| MutPred2 | Random Forest based method using | Li et al. [ |
| REVEL | Random Forest classifier using 13 established tools such as | Ioannidis et al. [ |
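The idea of an ensemble weighted by relative accuracy can be illustrated with a minimal sketch. This is not how any tool in the table above is implemented (those train Random Forest, gradient boosting or similar meta-classifiers on the component scores); it is a plain accuracy-weighted average, and the tool names, scores and accuracies are hypothetical.

```python
def weighted_ensemble_score(scores, accuracies):
    """Combine per-tool pathogenicity scores (each in [0, 1]) into one score.

    Weights are proportional to each tool's accuracy above chance (0.5),
    so a tool performing at chance contributes nothing. Illustrative only.
    """
    weights = {tool: max(accuracies[tool] - 0.5, 0.0) for tool in scores}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no tool performs better than chance")
    return sum(weights[tool] * scores[tool] for tool in scores) / total


# Hypothetical scores for one variant and hypothetical held-out accuracies.
scores = {"tool_a": 0.9, "tool_b": 0.6, "tool_c": 0.2}
accuracies = {"tool_a": 0.9, "tool_b": 0.7, "tool_c": 0.5}
combined = weighted_ensemble_score(scores, accuracies)
```

A learned meta-classifier can capture interactions between component scores that a fixed weighting cannot, which is one reason the listed tools take that route.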
Table 3
A set of prediction tools specialized to predicting the disease-driver status of variants in the human cancer genome. Although the generic predictors of Tables 1 and 2 have been used successfully for variant prediction in the cancer genome, more specialized methods would be expected to achieve higher test accuracy. As for generic predictors, some methods are trained directly from data, while others, such as CanDrA and TransFIC, use predictions from pre-existing variant effect predictors.
| Name | Method and features used | Reference |
|---|---|---|
| | Random Forest method using evolutionary and structural features. | Carter et al. [ |
| | An evolving suite of informatics tools for mutation interpretation and impact prediction. | Masica et al. [ |
| | Gradient boosting (sequential learner) using evolutionary and genomic features. | Rogers et al. [ |
| | Similar to | Rogers et al. [ |
| | Using evolutionary data, a predecessor to | Shihab et al. [ |
| | Scoring scheme, using conservation, regulatory and other measures. Prioritizes cancer somatic variants, especially for regulatory noncoding mutations. | Fu et al. [ |
| CanDrA | Support Vector Machine method using 10 published predictors ( | Mao et al. [ |
| TransFIC | Scoring method utilizing | Gonzalez-Perez [ |
Table 4
Some typical feature groups which may be informative for discriminating SNV-drivers from neutrals in the context of cancer. Some feature groups may only be informative for coding, or alternatively for noncoding, regions: for example, an indicated amino acid substitution under Consequence is only relevant to coding regions. During an additive sequential learning process, some feature groups may be discarded because they are only weakly informative or because their information is implicit in data already learnt. In constructing our CScape predictor, only four groups of data within the ENCODE feature group yielded discriminatory information among variants in noncoding regions, and none did so in coding regions. However, this does not indicate that this data source is inherently uninformative: in this case, sparse coverage of data across the genome appeared to limit its use.
| Feature group | Description |
|---|---|
| Conservation | Variants within highly conserved regions are more likely to be disease-drivers relative to variants within regions with high variability across species. Multispecies comparison can be achieved using a variety of evolutionary conservation scores derived, for example, from |
| Sequence | Sequence comparison of |
| Genomic context | Covering GC content, repeat regions, measures of region uniqueness and other genomic context measures. |
| Consequence | Covering the consequences of a variant, such as a resultant amino acid substitution, or the truncation of a transcript. In [ |
| ENCODE | Information from the |
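As an illustration of how one genomic context feature might be derived, the following is a minimal sketch that computes GC content in a window around a variant position. The window size, the 0-based coordinate convention and the toy reference sequence are assumptions made for the example, not details taken from any of the predictors discussed.

```python
def gc_content(sequence, position, flank=5):
    """Fraction of G/C bases in a window centered on a 0-based position.

    The window spans `flank` bases either side of the position and is
    clipped at the sequence ends. Non-ACGT symbols (e.g. N) are ignored.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    start = max(0, position - flank)
    end = min(len(sequence), position + flank + 1)
    window = sequence[start:end].upper()
    called = [base for base in window if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


# Toy reference sequence (hypothetical), variant at 0-based position 4.
reference = "ATGCGCGTTAACCGGTA"
gc_at_variant = gc_content(reference, position=4, flank=2)
```

In practice such a value would be one column in a feature vector alongside conservation scores and the other groups in the table above.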
Figure 1
The y-axis shows the proportion of correct predictions for disease-drivers (positives, light gray) and neutrals (negatives, dark gray); the x-axis shows the P-score, the confidence that an SNV is a driver. Proportions are evaluated on unseen test data. These predictions are for SNVs in coding regions of the human cancer genome, and the methodology behind this plot is more fully described in Rogers et al. [42].
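A plot of this kind can be approximated with a short sketch. The convention used here is an assumption, not the methodology of Rogers et al. [42]: a P-score at or above a cutoff `tau` is called a driver, a P-score at or below `1 - tau` is called neutral, anything in between is left uncalled, and "proportion correct" is measured among the calls made. The scores and labels are hypothetical.

```python
def accuracy_at_confidence(p_scores, labels, tau):
    """Return (driver_accuracy, neutral_accuracy, fraction_called) at cutoff tau.

    labels: 1 for a true driver, 0 for a true neutral.
    Only confident calls are counted: p >= tau is called driver and
    p <= 1 - tau is called neutral. An accuracy is None if no call is made.
    """
    if not 0.5 < tau <= 1.0:
        raise ValueError("tau must lie in (0.5, 1.0]")
    driver_calls = [y for p, y in zip(p_scores, labels) if p >= tau]
    neutral_calls = [y for p, y in zip(p_scores, labels) if p <= 1.0 - tau]

    def proportion(correct, total):
        return correct / total if total else None

    driver_acc = proportion(sum(driver_calls), len(driver_calls))
    neutral_acc = proportion(
        len(neutral_calls) - sum(neutral_calls), len(neutral_calls)
    )
    called = (len(driver_calls) + len(neutral_calls)) / len(p_scores)
    return driver_acc, neutral_acc, called


# Hypothetical held-out test scores and true labels.
p_scores = [0.95, 0.85, 0.7, 0.55, 0.45, 0.3, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
strict = accuracy_at_confidence(p_scores, labels, tau=0.8)
lenient = accuracy_at_confidence(p_scores, labels, tau=0.6)
```

Raising `tau` typically trades coverage for accuracy: fewer variants receive a call, but the calls made are more often correct.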
Figure 2
Four well-known cancer genes which can be labeled common drivers due to a higher incidence of predicted embedded SNV-drivers and an influence across multiple cancers. The y-axis gives the percentage incidence of at least one predicted high-confidence embedded SNV-driver (from use of CScape and using an FDR of 5%). The x-axis gives the typecodes of the 25 cancer types considered. These typecodes are matched with cancer names in supplementary table 1 of [55]. The figures for TP53 and KRAS are reproduced from [55] under the Creative Commons License [64].
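The percentage incidence on the y-axis can be sketched as follows. The input layout, the function name and the fixed score cutoff are hypothetical; in the figure, the cutoff is the one corresponding to an FDR of 5%, which this sketch does not attempt to derive.

```python
from collections import defaultdict


def driver_incidence(calls, samples_per_type, gene, cutoff):
    """Percentage of samples per cancer type with at least one confident
    predicted SNV-driver in `gene`.

    calls: iterable of (sample_id, typecode, gene, p_score) tuples.
    samples_per_type: dict mapping typecode to total samples of that type.
    """
    hit_samples = defaultdict(set)
    for sample_id, typecode, call_gene, p_score in calls:
        if call_gene == gene and p_score >= cutoff:
            # A set ensures each sample is counted once per cancer type.
            hit_samples[typecode].add(sample_id)
    return {
        typecode: 100.0 * len(hit_samples[typecode]) / total
        for typecode, total in samples_per_type.items()
        if total > 0
    }


# Hypothetical predictions and cohort sizes.
calls = [
    ("s1", "THCA", "BRAF", 0.97),
    ("s1", "THCA", "BRAF", 0.93),
    ("s2", "THCA", "BRAF", 0.60),
    ("s3", "SKCM", "BRAF", 0.95),
    ("s4", "SKCM", "TP53", 0.99),
]
samples_per_type = {"THCA": 4, "SKCM": 2, "COAD": 5}
incidence = driver_incidence(calls, samples_per_type, gene="BRAF", cutoff=0.9)
```

Counting samples rather than variants means a sample carrying several predicted drivers in the same gene contributes only once.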
Figure 3
The long noncoding RNA gene TTN-AS1 (white filled peaks) appears with a frequency equal to, or slightly lower than, that for TTN (black filled peaks), in terms of percentage incidence of predicted embedded SNV-drivers (at an FDR of 5%). TTN-AS1 is transcribed from the antisense strand of TTN, with the latter expressing the complex muscle protein Titin.
Figure 4
Two genes which have a more selective influence in certain contexts. BRAF (top) has at least one high-confidence embedded SNV-driver at significant percentage incidence levels for thyroid cancer (typecode: THCA), skin cutaneous melanoma (SKCM) and colon adenocarcinoma (COAD). IDH1 (bottom) has a significant influence in brain lower grade glioma (LGG). Both genes are well documented within the cancer literature.