Laxmi Parida1, Claudia Haferlach2, Kahn Rhrissorrakrai1, Filippo Utro1, Chaya Levovitz1, Wolfgang Kern2, Niroshan Nadarajah2, Sven Twardziok2, Stephan Hutter2, Manja Meggendorfer2, Wencke Walter2, Constance Baer2, Torsten Haferlach2.
Abstract
The confluence of deep sequencing and powerful machine learning is providing an unprecedented peek at the darkest of the dark genomic matter, the non-coding genomic regions lacking any functional annotation. While deep sequencing uncovers rare tumor variants, the heterogeneity of the disease confounds the best of machine learning (ML) algorithms. Here we set out to answer whether the dark matter of the genome encompasses signals that can distinguish the fine subtypes of disease that are otherwise genomically indistinguishable. We introduce a novel stochastic regularization, ReVeaL, that empowers ML to discriminate subtle cancer subtypes even from the same 'cell of origin'. Analogous to heritability, which is implicitly defined on the whole genome, we use predictability (F1 score), which is definable on portions of the genome. In an effort to distinguish cancer subtypes using dark-matter DNA, we applied ReVeaL to a new WGS dataset from 727 patient samples with seven forms of hematological cancers and assessed the predictivity over several genomic regions including genic, non-dark, non-coding, non-genic, and dark. ReVeaL enabled improved discrimination of cancer subtypes for all segments of the genome. The non-genic, non-coding and dark-matter regions had the highest F1 scores, with dark matter having the highest level of predictability. Based on ReVeaL's predictability of different genomic regions, dark matter contains enough signal to significantly discriminate fine subtypes of disease. Hence, the agglomeration of rare variants, even in the hitherto unannotated and ill-understood regions of the genome, may play a substantial role in the disease etiology and deserves much more attention.
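As an illustration of the predictability measure named in the abstract, the following is a minimal sketch of a per-class F1 score for a multi-class subtype classifier. The abstract does not say how F1 is aggregated across the seven diseases, so the macro average here is an assumption, and the function names and subtype labels are hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    tp = {label: 0 for label in labels}  # true positives per label
    fp = {label: 0 for label in labels}  # false positives per label
    fn = {label: 0 for label in labels}  # false negatives per label
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (an assumed aggregation, not from the paper)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels for two subtypes.
y_true = ["subtype_a", "subtype_a", "subtype_b", "subtype_b"]
y_pred = ["subtype_a", "subtype_b", "subtype_b", "subtype_b"]
scores = per_class_f1(y_true, y_pred)
```

Computing such a score separately for features drawn from each genomic region (genic, non-coding, dark, and so on) is what would make predictability "definable on portions of the genome".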
Year: 2019 PMID: 31469830 PMCID: PMC6742441 DOI: 10.1371/journal.pcbi.1007332
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Partitions of genomic regions based on Ensembl annotation and their predictability of blood cancer WGS data.
(A) The partition of the genomic region. (B-G) Median F1 values for the respective regions and their permuted controls when using off-the-shelf ML (B and E), ReVeaL on the original genomic areas (C and F) and ReVeaL on genomic areas normalized by length (D and G). See S2 Table for the F1 values.
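The permuted controls mentioned in the caption can be pictured with a generic label-permutation sketch: the disease labels are shuffled, the model is refit and rescored, and the observed score is compared against the resulting null scores. This is not the authors' pipeline; `fit_and_score` is a hypothetical callable standing in for any train-and-evaluate routine, and the number of permutations and the seed are arbitrary.

```python
import random


def permutation_control(features, labels, fit_and_score, n_permutations=20, seed=0):
    """Return (observed_score, null_scores) under label permutation.

    fit_and_score(features, labels) is a hypothetical user-supplied function
    that trains a model and returns a score such as F1.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    observed = fit_and_score(features, labels)
    null_scores = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # breaks any real feature-label association
        null_scores.append(fit_and_score(features, shuffled))
    return observed, null_scores


# Toy stand-in scorer: fraction of samples whose feature equals its label.
def toy_score(features, labels):
    return sum(f == l for f, l in zip(features, labels)) / len(labels)


observed, null_scores = permutation_control([0, 0, 1, 1], [0, 0, 1, 1], toy_score)
```

A region carries signal in this sense when its observed score sits clearly above the distribution of scores obtained with shuffled labels.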
Fig 2. Disease-by-disease ReVeaL analysis.
(A) F1 scores for genomic sectors for each disease are averaged over all 10 replicate analyses per chromosome, and the maximum F1 score is reported for that disease. ReVeaL scores on disease-label permutations are shown as overlaid hatched bars. The gray bar represents the mean over all diseases. (B-C) Boxplots of f (shingle values representing the four moments of the distribution) across samples per disease, with diseases ordered by decreasing median f, for the top 2 ReVeaL features. The line above each boxplot represents the shingle, with the yellow interval marking the portion of the segment that is masked. (D-G) t-SNE visualization (perplexity = 40, iterations = 300) using the top 50 shingle f values (B and C) and mutational load l, the number of mutations in a given window of the genomic region for a given patient (D and E), respectively, in exonic and dark sectors.
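The mutational load l defined in the caption (the number of mutations in a given window of a genomic region for a given patient) can be sketched as a simple windowed count. The fixed-width, non-overlapping windows, the half-open coordinate convention, and the function name are assumptions made for illustration; the caption does not specify how windows are laid out.

```python
def mutational_load(positions, region_start, region_end, window_size):
    """Count one patient's mutations per fixed-width window.

    The region is treated as the half-open interval [region_start, region_end);
    positions outside it are ignored. The last window may be shorter.
    """
    span = region_end - region_start
    n_windows = -(-span // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // window_size] += 1
    return counts


# Hypothetical mutation positions for one patient in a 100 bp region.
load = mutational_load([5, 12, 17, 95, 250], region_start=0, region_end=100, window_size=10)
```

Stacking such vectors across patients gives a patients-by-windows matrix, which is the kind of input a t-SNE embedding of mutational load would take.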