Yanmei Dou, Minseok Kwon, Rachel E Rodin, Isidro Cortés-Ciriano, Ryan Doan, Lovelace J Luquette, Alon Galor, Craig Bohrson, Christopher A Walsh, Peter J Park.
Abstract
Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80-90% of the mosaic single-nucleotide variants and 60-80% of indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.
Year: 2020 PMID: 31907404 PMCID: PMC7065972 DOI: 10.1038/s41587-019-0368-8
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Fig. 1: Framework of MosaicForecast to detect mosaic SNVs from bulk sequencing data.
(a) Candidate mosaics were classified as ‘hap=2’, ‘hap=3’ or ‘hap>3’ by read-based phasing, and a Random Forest (RF) model was trained to predict the phasing by using 25 read-level features as covariates. The model was then applied to non-phasable sites to predict their genotypes. Given a list of experimentally evaluated sites, the model could be further improved by an additional genotype-refinement step. (b) The relative importance of the features from the RF model for the brain WGS data, with four examples of read-level features. (c) 483 phasable sites were orthogonally evaluated by single cell, trio, and targeted sequencing data. After genotype refinement, the phasable sites classified as ‘hap=2’, ‘hap=3’ and ‘hap>3’ were converted to ‘het’, ‘mosaic’, ‘repeat/CNV’ and ‘refhom’ for training. (d) We applied MosaicForecast to non-phasable MuTect2 candidate mosaics and evaluated them in single cell, trio, and targeted sequencing data. In non-repeat regions, the precision increased from 8.9% (MuTect2) to 76% for the Phasing prediction model and 85% for the Refined genotypes prediction model; in RepeatMasker regions, it increased from 1% (MuTect2) to 50% for the Phasing prediction model and 77% for the Refined genotypes prediction model.
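The ‘hap=2’, ‘hap=3’ and ‘hap>3’ labels in panel (a) can be illustrated with a minimal sketch. It assumes the labels reflect the number of distinct haplotypes observed on reads that span both a nearby germline heterozygous SNP and the candidate site; the caption does not spell this out. The function name `classify_by_phasing`, the `min_reads` noise threshold, and the toy read counts are all hypothetical and are not taken from MosaicForecast itself.

```python
from collections import Counter

def classify_by_phasing(read_pairs, min_reads=2):
    """Label a candidate site by the number of supported haplotypes.

    read_pairs: one (het_snp_allele, candidate_allele) tuple per read that
        spans both a nearby germline het SNP and the candidate site.
    min_reads: hypothetical threshold; allele combinations seen on fewer
        reads are treated as sequencing noise and ignored.
    """
    counts = Counter(read_pairs)
    n_hap = sum(1 for n in counts.values() if n >= min_reads)
    if n_hap == 2:
        return "hap=2"   # consistent with a germline heterozygous variant
    if n_hap == 3:
        return "hap=3"   # consistent with a mosaic variant on one haplotype
    if n_hap > 3:
        return "hap>3"   # more haplotypes than a diploid mosaic can explain
    return None          # fewer than two haplotypes: not informative

# Toy data (illustrative only): the alt allele sits on a subset of "G" reads.
mosaic_like = [("A", "ref")] * 10 + [("G", "ref")] * 6 + [("G", "alt")] * 4
germline_like = [("A", "ref")] * 10 + [("G", "alt")] * 10
print(classify_by_phasing(mosaic_like))    # hap=3
print(classify_by_phasing(germline_like))  # hap=2
```

In this sketch, labels assigned to phasable sites would serve as training targets, while the 25 read-level features mentioned in the caption would be the model inputs; neither the features nor the model are reproduced here.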
Fig. 2: Comparison among algorithms.
(a) Candidate mosaics (both phasable and non-phasable) in the three individuals with single cell data were evaluated (see Methods). (b) Precision and recall are plotted separately for the non-repeat and repeat regions (as defined by RepeatMasker) and for each individual.
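As a rough illustration of how precision and recall could be reported separately for repeat and non-repeat regions, the sketch below compares a set of called sites with a set of validated sites. The helper names, the `in_repeat` predicate, and the coordinates are hypothetical toy values; in practice the repeat annotation would come from RepeatMasker and the truth set from the orthogonal validation data.

```python
def precision_recall(calls, truth):
    """Return (precision, recall) for two collections of site identifiers."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)  # called sites that were also validated
    precision = tp / len(calls) if calls else float("nan")
    recall = tp / len(truth) if truth else float("nan")
    return precision, recall

def precision_recall_by_region(calls, truth, in_repeat):
    """Split sites with the in_repeat predicate and score each subset."""
    results = {}
    for name, want_repeat in (("non-repeat", False), ("repeat", True)):
        sub_calls = {s for s in calls if in_repeat(s) == want_repeat}
        sub_truth = {s for s in truth if in_repeat(s) == want_repeat}
        results[name] = precision_recall(sub_calls, sub_truth)
    return results

# Toy (chromosome, position) sites, for illustration only.
calls = {("chr1", 100), ("chr1", 200), ("chr2", 300), ("chr2", 400)}
truth = {("chr1", 100), ("chr2", 300), ("chr2", 500)}
repeat_sites = {("chr2", 300), ("chr2", 400), ("chr2", 500)}

scores = precision_recall_by_region(calls, truth, lambda s: s in repeat_sites)
print(scores)
```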
Fig. 3: Impact of read depth on sensitivity and detection of mosaic indels.
(a) At each coverage, a different RF model was trained on the phasable sites and predictions were made on non-phasable sites. Amplicon-sequencing data were used for validation. Although fewer true mosaics were identified at lower coverages, the sensitivity did not drop significantly (e.g., at 50X, MosaicForecast was able to detect ~80% of real variants identified at 250X). (b) Similar to (a) but using simulated data. The sensitivity was ~70% at 50X. (c) >70% of mosaic deletions called by MosaicForecast were validated by IonTorrent; the ‘hap=3’ sites and non-phasable sites had similar validation rates. (d) Similar to (c) but for mosaic insertions.