Fábio C P Navarro, Hussein Mohsen, Chengfei Yan, Shantao Li, Mengting Gu, William Meyerson, Mark Gerstein.
Abstract
Data science allows the extraction of practical insights from large-scale data. Here, we contextualize it as an umbrella term, encompassing several disparate subdomains. We focus on how genomics fits as a specific application subdomain, in terms of well-known 3 V data and 4 M process frameworks (volume-velocity-variety and measurement-mining-modeling-manipulation, respectively). We further analyze the technical and cultural "exports" and "imports" between genomics and other data-science subdomains (e.g., astronomy). Finally, we discuss how data value, privacy, and ownership are pressing issues for data science applications, in general, and are especially relevant to genomics, due to the persistent nature of DNA.
Year: 2019; PMID: 31142351; PMCID: PMC6540394; DOI: 10.1186/s13059-019-1724-1
Source DB: PubMed; Journal: Genome Biol; ISSN: 1474-7596; Impact factor: 13.583
Fig. 1. A holistic view of biomedical data science. (a) Biomedical data science emerged at the confluence of large-scale datasets connecting genomics, metabolomics, wearable devices, proteomics, health records, and imaging to statistics and computer science. (b) The 4 M processes framework. (c) The 5 V data framework.
Fig. 2. Data volume growth in genomics versus other disciplines. (a) Data volume growth in genomics in the context of other domains and data infrastructure (computing power and network throughput). Continuous lines indicate the amount of data archived in public repositories in genomics (SRA), astronomy (Earth Data, NASA), and sociology (Harvard Dataverse). Data infrastructure such as computing power (TOP500 SuperComputers) and network throughput (IPTraffic) are also included. Dashed lines indicate projections of future growth in data volume and infrastructure capacity for the next decade. (b) Cumulative number of datasets generated for whole genome sequencing (WGS) and whole exome sequencing (WES), compared with molecular structure datasets such as X-ray and electron microscopy (EM). PDB: Protein Data Bank; SRA: Sequence Read Archive.
Fig. 3. Variety of sequencing assays. Number of new sequencing protocols published per year. Popular protocols are highlighted in their year of publication and their connection to omes.
Fig. 4. Technical exchanges between genomics and other data science subdisciplines. The background area displays the total number of publications per year for the terms: (a) hidden Markov model, (b) scale-free network, (c) latent Dirichlet allocation. Continuous lines indicate the fraction of papers related to topics in genomics and in other disciplines.
Fig. 5. Open source adoption in genomics and other data science subdisciplines. The number of GitHub commits (upper panel) and new GitHub repositories (lower panel) per year for a variety of subfields. Subfield repositories were selected by GitHub topics such as genomics, astronomy, geography, molecular dynamics (Mol. Dynamics), quantum chemistry (Quantum Chem.), and ecology.