Irantzu Anzar, Angelina Sverchkova, Richard Stratford, Trevor Clancy.
Abstract
BACKGROUND: Accurate screening of tumor genomic landscapes for somatic mutations using high-throughput sequencing is a crucial step in precise clinical diagnosis and targeted therapy. However, the complex inherent features of cancer tissue, especially intra-tumor genetic heterogeneity, coupled with sequencing and alignment artifacts, make somatic variant calling a challenging task. Current variant filtering strategies, such as rule-based filtering and consensus voting across different algorithms, have helped to increase specificity, although this comes at the cost of sensitivity.
Keywords: Cancer genomics; Machine learning; Precision medicine; Somatic variant detection
Year: 2019 PMID: 31096972 PMCID: PMC6524241 DOI: 10.1186/s12920-019-0508-5
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
List of GIAB datasets used for NeoMutate benchmarking. The data was downloaded in BAM format and converted back to fastq in order to fully test all the functionalities of the workflow. (WES: whole-exome sequencing; PE: paired-end)
| Sample ID | Lab | Project | Library type | Read length | Insert size | File format | Number of reads before trimming | Number of reads after trimming | % kept reads |
|---|---|---|---|---|---|---|---|---|---|
| NA12878 | Broad Institute | CEU Trio Analysis (son) | WES, PE | 76 bp | 155 bp | BAM | 118,969,048 | 89,151,231 | 74.94 |
| NA12891 | Broad Institute | CEU Trio Analysis (father) | WES, PE | 76 bp | 155 bp | BAM | 116,639,621 | 88,079,244 | 75.51 |
| NA24631 | Oslo University Hospital | Asian (Han Chinese) Trio (son) | WES, PE | 125 bp | 202 bp | BAM | 61,001,625 | 60,852,682 | 99.76 |
Fig. 1 In silico variant simulation workflow on real data using BamSurgeon: the NA12878, NA12891 and NA24631 real datasets were spiked with non-overlapping variant subsets at different allele frequencies (ranging from 0.01 to 1) extracted from the COSMIC database for the S1, S2 and S3 simulation experiments. An additional simulation was performed using the NA24631 dataset and random non-COSMIC mutations with VAF ≤ 0.2. Text boxes with blue borders represent steps embedded in the NeoMutate workflow
Simulation experiments design overview
| Simulation ID | Sample ID | is_COSMIC | VAF range | All | SNV | Insertions | Deletions |
|---|---|---|---|---|---|---|---|
| S1 | NA12878 | yes | 0.01–1 | 3600 | 3000 | 300 | 300 |
| S2 | NA12891 | yes | 0.01–1 | 6000 | 5000 | 500 | 500 |
| S3 | NA24631 | yes | 0.01–1 | 6000 | 4000 | 1000 | 1000 |
| S4 | NA24631 | no | 0.01–0.2 | 5000 | 3000 | 1012 | 988 |
List of individual somatic variant callers embedded in NeoMutate
| Tool | Version | Methodology |
|---|---|---|
| MuTect2 | 3.8 | Bayesian classifier |
| Strelka2 | 2.8.4 | Bayesian model of admixture |
| VarScan2 | 2.4.3 | Heuristic methodology with statistical test |
| VarDict | 1.4 | Combined heuristic and statistical algorithm |
| SomaticSniper | 1.0.5.0 | Bayesian approach for estimating genotype probabilities |
| Freebayes | 0.1.2 | Bayesian model with error probabilities |
| Lancet | 1.0.5 | Colored de Bruijn graphs |
List of biological and sequencing features selected for downstream ML analysis
| Feature name | Source | Group | Description |
|---|---|---|---|
| indel_or_snp | BAM | 3 | Is the given variant a SNP, insertion or deletion? |
| ts_or_tv | BAM | 3 | Transition or transversion |
| depth_TUM | BAM | 1 | Coverage in tumor sample for the given variant position |
| alt_counts_TUM | BAM | 1 | Alternative read counts (number of reads supporting the variant) |
| alt_avg_MQ_TUM | BAM | 2 | Average mapping quality of reads containing the variant. Quantification of the probability that a read is misplaced. |
| alt_avg_BQ_TUM | BAM | 2 | Average base quality of the reads containing the variant. Accuracy of a base sequenced by the sequencing machine. |
| alt_plus_TUM | BAM | 1 | Number of reads on the plus/forward strand supporting the variant |
| alt_minus_TUM | BAM | 1 | Number of reads on the minus/reverse strand supporting the variant |
| ref_plus_TUM | BAM | 1 | Number of reads on the plus/forward strand supporting the reference allele |
| ref_minus_TUM | BAM | 1 | Number of reads on the minus/reverse strand supporting the reference allele |
| VAF | BAM | 1 | Variant allele frequency |
| depth_WT | BAM | 1 | Coverage in normal sample for the given variant position |
| alt_counts_WT | BAM | 1 | Number of reads supporting the variant in normal sample (germline risk) |
| ref_counts_WT | BAM | 1 | Number of reads supporting the reference in normal sample |
| num_of_indels_closeby | BAM | 3 | Are there indels closeby? (false positive risk factor) |
| GC_content | BAM | 3 | Number of GC bases relative to the total number of bases located within ± 20 bp of the given variant position |
| shannon_entropy | BAM | 3 | A mathematical measure of the degree of randomness in a set of data. The smaller the entropy value, the less complex the sequence is. |
| detection_status | VCF | 4 | Classification status (“somatic” or “non somatic”) for the given variant caller |
| “Tool”_F | VCF | 4 | Quality tag in FILTER column (“PASS” or “non PASS”) |
| “Tool”_alt_counts | VCF | 1, 4 | Number of reads supporting the variant reported by the specific tool |
| “Tool”_ref_counts | VCF | 1, 4 | Number of reads supporting the reference reported by the specific tool |
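
Three of the sequence-context and read-count features in the table above (`GC_content`, `shannon_entropy` and `VAF`) can be illustrated with a minimal Python sketch. The function names, the per-base definition of entropy, and the definition of VAF as alternative reads divided by depth are assumptions made for illustration; the source does not state how NeoMutate computes these values internally.

```python
import math
from collections import Counter


def gc_content(window: str) -> float:
    """Fraction of G/C bases in a sequence window (e.g. variant position +/- 20 bp)."""
    window = window.upper()
    if not window:
        return 0.0
    return sum(base in "GC" for base in window) / len(window)


def shannon_entropy(window: str) -> float:
    """Shannon entropy in bits of the base composition of a window.

    Assumption: entropy is taken over single-base frequencies.
    Lower values indicate a less complex (more repetitive) sequence.
    """
    window = window.upper()
    if not window:
        return 0.0
    total = len(window)
    counts = Counter(window)
    # p * log2(1/p) summed over observed bases; always >= 0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def vaf(alt_counts: int, depth: int) -> float:
    """Variant allele frequency, assumed here to be alt reads / total depth."""
    return alt_counts / depth if depth else 0.0
```

For example, a homopolymer window such as `"AAAA"` has entropy 0, while a window with all four bases in equal proportion reaches the maximum of 2 bits, which matches the table's remark that smaller entropy means a less complex sequence.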
Definition of selected performance metrics used for algorithm evaluation. The four variables present in a 2 × 2 contingency table: true positive (TP) (variants predicted and validated), true negative (TN) (variants not predicted and not validated), false positive (FP) (variants predicted but failed in validation), and false negative (FN) (variants not predicted but validated) are used to calculate the metrics and assess model performance
| Metric | Formula | Definition |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | The ratio of correct calls out of the total number of positions. |
| Precision | $\frac{TP}{TP + FP}$ | The ratio of correct variant calls out of the total number of variant calls. |
| Recall | $\frac{TP}{TP + FN}$ | The ratio of correct variant calls out of the total number of variant positions. |
| False discovery rate (FDR) | $\frac{FP}{TP + FP}$ | The ratio of incorrect calls out of the total number of variant calls. |
| F1-Score | $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ | Harmonic mean of precision and recall, where 1 is the best score and 0 the worst. |
| Matthews correlation coefficient (MCC) | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ | A measure of the quality of binary (two-class) classifications. The MCC represents the correlation coefficient between the observed and predicted binary classifications, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct classifier. |
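
The six metrics defined above follow directly from the four cells of the contingency table. The sketch below uses the standard textbook formulas; the function name, the zero-denominator convention (returning 0.0) and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation metrics from contingency-table counts."""

    def ratio(num: float, den: float) -> float:
        # Assumption: undefined ratios (zero denominator) are reported as 0.0
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fdr": ratio(fp, tp + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow
metrics = classification_metrics(tp=90, tn=890, fp=10, fn=10)
```

Note that FDR is the complement of precision (FDR = 1 − precision), so the two always sum to 1 whenever at least one variant is called.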
Fig. 2 NeoMutate workflow: This figure illustrates the main steps executed within the NeoMutate framework, where raw reads from nearly any sequencing technology platform are transformed into an accurate list of prioritized somatic variants. Its modular architecture consists of quality control of the raw data, alignment and BAM post-processing, ensemble variant calling and a machine learning boosted variant filtering step. Seven machine learning models are trained using the ensemble calls plus a set of relevant biological and sequencing features. Each algorithm provides a mutational status classification per variant, yielding a high-confidence somatic mutation call set
Fig. 3 Comprehensive performance evaluation of different approaches. Only approaches with a sensible recall (> 0.5) were chosen for the comparison. a) Evaluation of the raw results of individual variant callers. b) Evaluation of standard filtering results. m2s2: intersection of mutect2 and strelka2 calls; m2s2_HQ: intersection of mutect2 and strelka2 HQ calls (only variants tagged as `PASS`); cons_n: consensus voting (intersection) of at least n tools; cons_2_HQ: consensus voting of the HQ call sets of at least 2 tools. c) Evaluation of ML results. LRC: logistic regression classifier; SVMl: support vector machine classifier with linear kernel; DT: decision tree; GNB: Gaussian Naïve Bayes; RFC: random forest classifier; GBDT: gradient boosting decision tree; NN: neural network
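
The consensus voting labels used in Fig. 3 (cons_n) can be read as a simple vote count over the per-tool call sets. The following minimal sketch assumes each variant is represented as a `(chrom, pos, ref, alt)` tuple; the function name and the toy calls are hypothetical and do not reproduce NeoMutate's actual implementation.

```python
from collections import Counter


def consensus_calls(calls_by_tool: dict, min_tools: int) -> set:
    """Return variants reported by at least `min_tools` callers (cons_n)."""
    votes = Counter()
    for variants in calls_by_tool.values():
        # set() ensures each tool contributes at most one vote per variant
        votes.update(set(variants))
    return {variant for variant, n in votes.items() if n >= min_tools}


# Hypothetical toy call sets for three callers
calls = {
    "mutect2": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    "strelka2": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "CT")],
    "vardict": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
}
cons_2 = consensus_calls(calls, min_tools=2)
cons_3 = consensus_calls(calls, min_tools=3)
```

An HQ variant such as cons_2_HQ would follow the same pattern after first restricting each tool's list to variants tagged `PASS`. Raising `min_tools` shrinks the call set, which is the specificity versus sensitivity trade-off described in the abstract.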
Fig. 4 Correlation matrix plot: Pairwise correlation matrix heatmap of the results of selected methods. One half of the heatmap is shown with colors and the other half with numbers (both halves represent the same values). For variant callers that report a FILTER column (“toolname”_HQ), only variants with the “PASS” tag were selected. The methods were divided into three main strategies for better visualization: machine learning based (blue), standard filtering based (red) and individual tool results (green)
Fig. 5 a) F1-Scores of each evaluated method for each VAF range. b) F1-Scores of each evaluated method according to variant type. The methods were divided into three main strategies on the x-axis: machine learning based (blue), standard filtering based (red) and individual tool results (green). Each VAF range considered is represented with a different color
Fig. 6 Precision-recall (PR) and receiver operating characteristic (ROC) curves for GBDT according to variant type and variant allele frequency (VAF). Corresponding AUC values were computed for all the curves
Fig. 7 Bar chart of the GBDT feature importance ranking under 5-fold cross-validation (5-CV). The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. The features contributing most to the prediction variable are shown on the left of the plot, with the highest relative importance scores. The error bars represent the standard deviation across the 5 folds
F1-Scores obtained for the best standard filtering approach and the best ML classifier across the different simulations. The consensus call of at least four of the seven tools performed best in the first category, and the GBDT model in the second
| Method | Category | Description | S1 | S2 | S3 | S4 |
|---|---|---|---|---|---|---|
| GBDT | ML | Gradient boosting decision tree | 0.9742 | 0.9762 | 0.9658 | 0.8748 |
| cons_4 | Standard filtering | Consensus of ≥ 4 tools | 0.9139 | 0.9326 | 0.9203 | 0.8044 |