Jacob Schreiber, Ritambhara Singh, Jeffrey Bilmes, William Stafford Noble.
Abstract
Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
Keywords: Epigenomics; Genomics; Machine learning
Year: 2020 PMID: 33213499 PMCID: PMC7678316 DOI: 10.1186/s13059-020-02177-y
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 The performance of neural network models of varying complexity in three predictive settings on two tasks. Schematic diagrams of a cross-chromosome, b cross-cell type, and c hybrid cross-cell type/cross-chromosome model evaluation schemes. d–f Average precision (AP) of a machine learning model predicting gene expression as a function of model complexity, evaluated via d cross-chromosome, e cross-cell type, and f combined cross-chromosome and cross-cell type validation. In each panel, each point represents the test set performance of a single trained model. g–i Same as d–f, but predicting TAD boundaries rather than gene expression
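The three evaluation schemes in the caption can be sketched as train/test splits over a grid of (cell type, chromosome) examples. The sketch below is illustrative only: the cell type and chromosome names, the choice of held-out sets, and the `split` helper are assumptions, not the authors' code. It shows why the cross-cell type scheme is vulnerable to memorization (train and test share the same loci) while the hybrid scheme shares neither loci nor cell types.

```python
import itertools

# Hypothetical example names; any cell types / chromosomes would do.
cell_types = ["GM12878", "K562", "HepG2", "IMR-90"]
chromosomes = [f"chr{i}" for i in range(1, 23)]

# Each (cell type, chromosome) pair stands in for all loci on that
# chromosome measured in that cell type.
examples = list(itertools.product(cell_types, chromosomes))

test_cell_types = {"IMR-90"}
test_chroms = {"chr20", "chr21", "chr22"}

def split(scheme):
    """Return (train, test) example lists for one evaluation scheme."""
    if scheme == "cross-chromosome":
        # Panel a: same cell types in train and test; loci differ.
        test = [(c, ch) for c, ch in examples if ch in test_chroms]
        train = [(c, ch) for c, ch in examples if ch not in test_chroms]
    elif scheme == "cross-cell-type":
        # Panel b: same loci in train and test; cell types differ.
        # Here a model can score well by memorizing each locus's
        # average activity across the training cell types.
        test = [(c, ch) for c, ch in examples if c in test_cell_types]
        train = [(c, ch) for c, ch in examples if c not in test_cell_types]
    elif scheme == "hybrid":
        # Panel c: held-out cell types AND held-out chromosomes, so the
        # test set shares neither loci nor cell types with training.
        test = [(c, ch) for c, ch in examples
                if c in test_cell_types and ch in test_chroms]
        train = [(c, ch) for c, ch in examples
                 if c not in test_cell_types and ch not in test_chroms]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return train, test

for scheme in ("cross-chromosome", "cross-cell-type", "hybrid"):
    train, test = split(scheme)
    shared_loci = {ch for _, ch in train} & {ch for _, ch in test}
    print(scheme, len(train), len(test), "shared loci:", bool(shared_loci))
```

Running the loop makes the pitfall concrete: only the cross-cell-type split reports shared loci between train and test, which is exactly the condition under which per-locus memorization can inflate apparent performance.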