Piotr Grabowski, Juri Rappsilber.
Abstract
High-throughput methodologies and machine learning have been central in developing systems-level perspectives in molecular biology. Unfortunately, performing such integrative analyses has traditionally been reserved for bioinformaticians. This is now changing with the appearance of resources to help bench-side biologists become skilled at computational data analysis and handling large omics data sets. Here, we show an entry route into the field of omics data analytics. We provide information about easily accessible data sources and suggest some first steps for aspiring computational data analysts. Moreover, we highlight how machine learning is transforming the field and how it can help make sense of biological data. Finally, we suggest good starting points for self-learning and hope to convince readers that computational data analysis and programming are not intimidating.
Keywords: data integration; data science; functional genomics; machine learning; systems biology
Year: 2018 PMID: 30522862 PMCID: PMC6318833 DOI: 10.1016/j.tibs.2018.10.010
Source DB: PubMed Journal: Trends Biochem Sci ISSN: 0968-0004 Impact factor: 13.807
Figure 1. Basic High-Level Flow of Omics Data Analytics in the Life Sciences Field.
Summary of Large Data Repositories for Omics Analytics
| Repository | Data type | Link |
|---|---|---|
| Gene Expression Omnibus | Gene expression, noncoding RNA profiling, epigenetics, genome variation profiling | |
| ENCODE | Epigenetics, gene expression, computational predictions | |
| ArrayExpress | DNA sequencing, gene and protein expression, epigenetics | |
| European Genome-Phenome Archive | Various omics with phenotype data (biomedical studies) | |
| | Proteomics, protein expression, post-translational modifications | |
| 1000 Genomes | Genome sequences, sequence variants | |
| MetaboLights | Metabolomics | |
| GTEx | Gene expression (microarrays and RNA-seq), genome sequences | |
| National Institutes of Health/National Cancer Institute (NIH/NCI) Genomic Data Commons | Gene expression, epigenetics, miRNA-seq (focus on cancer) | |
| NIH dbGaP | Genotypes, gene expression, epigenetics, phenotypes | |
| cBioPortal | Focused on cancer, contains data on gene copy numbers, gene and protein expression, DNA methylation, and clinical data | |
| Single Cell Expression Atlas | Single-cell gene expression (RNA-seq) | |
| RIKEN SCPortalen | Single-cell gene expression (RNA-seq) | |
Needs granted access for individual-level data.
Figure I. Planning a Machine Learning-Based Analysis Requires Careful Consideration at Each Stage of the Analysis. We list the most general elements of designing such a workflow, using a mitochondrial protein classification task as an example. The same thinking patterns also apply to regression tasks and to feature importance analysis.
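The caption does not spell out the individual stages, so the following is only a minimal sketch under assumptions: it uses a common sequence (assemble labelled data, hold out a test set, fit a model, evaluate on unseen examples). The data are synthetic, the two features are hypothetical stand-ins for real protein descriptors, and a simple nearest-centroid classifier is swapped in for illustration; none of these choices come from the article.

```python
import random

def make_synthetic_proteins(n_per_class, seed=0):
    """Toy (features, label) pairs; label 1 stands for 'mitochondrial' (made-up data)."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            # Two hypothetical numeric features per protein.
            features = [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
            data.append((features, label))
    return data

def stratified_split(data, test_fraction, seed=0):
    """Hold out the same fraction of each class so the test set stays balanced."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({y for _, y in data}):
        group = [item for item in data if item[1] == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def fit_centroids(train):
    """'Training' here is just computing the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centre):
        return sum((a - b) ** 2 for a, b in zip(features, centre))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, dataset):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(centroids, x) == y for x, y in dataset)
    return correct / len(dataset)

data = make_synthetic_proteins(n_per_class=100)
train, test = stratified_split(data, test_fraction=0.25)
centroids = fit_centroids(train)          # fit on training data only
test_accuracy = accuracy(centroids, test)  # evaluate on held-out data only
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The point of the sketch is the separation of stages: the test set is never touched during fitting, so the reported accuracy estimates performance on unseen proteins. For a regression task, the same structure would hold with a different model and error metric.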
Summary of Annotation Databases and Postanalysis Tools Helpful in Making Sense of Results in Computational Analytics
| Annotation database/Tool name | Description | Link |
|---|---|---|
| UniProt | Comprehensive proteomics knowledge base (functions, pathways, sequences, modifications, literature references, ID conversion). | |
| BioMart | Gene-centric database with ID conversion, genomic features (such as exons, introns, untranslated regions), sequences, positions of genes in the genome. | |
| NCBI Genome Data Viewer | A Web tool for exploration and analysis of eukaryotic genome assemblies. | |
| UCSC Genome Browser | A collection of tools for analysis of genomes with a plethora of available data ‘tracks’ such as epigenetic signals and genomic features. | |
| StringDB | A database of known and predicted protein–protein interactions. Integrates functional relationship data from various sources. | |
| BioGRID | Curated database of physical and genetic interactions based on various experimental sources. | |
| DAVID | Gene Ontology and pathway analysis Web tool for calculation of functional enrichments in lists of genes or proteins. | |
| Enrichr | Web tool for calculating various functional enrichments in lists of genes or proteins. | |
| g:Profiler | Web tools for functional profiling of groups of genes or proteins. Contains useful ID conversion and orthology mapping tools. | |
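To give a feel for what "functional enrichment" means in the tools listed above, here is a minimal sketch of an over-representation test based on the hypergeometric distribution (equivalent to a one-sided Fisher's exact test). This is a common basis for such analyses, but it is not necessarily the exact statistic that any particular tool in the table implements, and the function name and the gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """Probability of seeing at least n_overlap annotated genes when drawing
    n_selected genes at random, without replacement, from n_background genes
    of which n_annotated carry the annotation of interest."""
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, 200 annotated with a term,
# 100 genes in the list of interest, 10 of which carry the term.
p_value = hypergeom_enrichment_p(20000, 200, 100, 10)
print(f"enrichment p-value: {p_value:.3g}")
```

By chance alone one would expect about 1 annotated gene in a list of 100 here (100 × 200 / 20,000), so an overlap of 10 yields a small p-value. In practice many terms are tested at once, so the resulting p-values need a multiple-testing correction before interpretation.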