Lavanya Kannan, Marcel Ramos, Angela Re, Nehme El-Hachem, Zhaleh Safikhani, Deena M A Gendoo, Sean Davis, David Gomez-Cabrero, Robert Castelo, Kasper D Hansen, Vincent J Carey, Martin Morgan, Aedín C Culhane, Benjamin Haibe-Kains, Levi Waldron.
Abstract
Molecular interrogation of a biological sample through DNA sequencing, RNA and microRNA profiling, proteomics and other assays has the potential to provide a systems-level approach to predicting treatment response and disease progression, and to developing precision therapies. Large publicly funded projects have generated extensive and freely available multi-assay data resources; however, bioinformatic and statistical methods for the analysis of such experiments are still nascent. We review multi-assay genomic data resources in clinical oncology, pharmacogenomics and other perturbation experiments, population genomics, regulatory genomics and other areas, as well as tools for data acquisition. Finally, we review bioinformatic tools that are explicitly geared toward integrative genomic data visualization and analysis. This review provides starting points for accessing publicly available data and tools to support development of needed integrative methods.
Keywords: bioconductor; cancer; integrative genomics; multiple assays (multi-assays); omics; pharmacogenomics; public data
Year: 2015 PMID: 26463000 PMCID: PMC4945830 DOI: 10.1093/bib/bbv080
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Data and cancer types provided by TCGA. The top barplot shows the number of data types available for each of the 36 cancer types (key provided as Supplementary Table S1) as of January 2015. Cancer types with fewer data types are still in the process of data collection. The lower barplot shows the number of cancer types for which each data type is available (key provided as Supplementary Table S2).
TCGA data acquisition tools
| Name and citation | Description | Download type | Data analysis integration | Data level | Software implementation |
|---|---|---|---|---|---|
| RTCGAToolbox [ | R package for downloading preprocessed data | Bulk | High | 3–4 | 'RTCGAToolbox' - Bioconductor Package |
| firehose_get [ | Unix command line tool | Bulk | Low | 1–4 | Command line, wget |
| Linked TCGA [ | 5 star-linked open data via SPARQL endpoints | Bulk | Low | 3 | Resource Description Framework (RDF) and SPARQL query endpoints |
| MSKCC cBioPortal [ | R package and Web interface to the MSKCC Cancer Genomics Data Server | Limited | High | 3–4 | 'cgdsr'—R Package |
| UCSC Cancer Genomics Hub [ | Restricted access tool to raw data files | Bulk | Low | 1 | GeneTorrent client (gtdownload) |
| TCGA Assembler [ | R script files for downloading preprocessed data | Bulk | Medium | 1, 3 | Collection of R scripts |
| Synapse client [ | Download within R using Synapse syntax (credentials required) | Limited | Medium | 1–4 | 'synapseClient'—R Package |
| TCGA Data Portal [ | Bulk, table and HTTP-link-based repository | Variable | Low | 1–4 | Web site and Web client |
| TCIA Imaging Archive [ | Repository for medical images of cancer in DICOM format | Bulk | – | 1 | Web site download |
‘Data levels’ are defined by TCGA, varying from 1 (raw data) to 4 (data analysis resulting from multiple samples, such as regions of common copy number variation).
Figure 2. Overlap across publicly available pharmacogenomic data sets. (A) Cell lines that have been molecularly and/or pharmacologically profiled in each study. (B) Drug compounds screened in each study. The substantial overlap across large pharmacogenomic studies using different molecular and pharmacological profiling assays enables integrative analysis to define more robust biomarkers of drug response.
Multi-assay pharmacogenomic and perturbation cell line data sets
| Program name | Number of unique cell lines | Number of unique tissues of origin | Assay | Number of drugs tested | FDA-approved drugs |
|---|---|---|---|---|---|
| CMAP | 5 | 4 | GE array | 1,309 | 576 |
| L1000 | 77 | 15 | GE array | 20,431 | 851 |
| NCI60 | 60 | 9 | GE array, SNP array, RPPA | 49,938 | 201 |
| CGP | 727 | 32 | GE array, WXS; SNP array | 140 | 29 |
| CCLE | 1036 | 24 | GE array and RNA-seq, WXS/WGS; SNP array | 24 | 8 |
| CTRP | 242 | 17 | See CCLE | 354 | 36 |
| CTD2 | 243 | 20 | DNA-seq, GE array, RPPA, perturbation-based screens, Comparative Genomic Hybridizations | 355 | 35 |
| GDSC | 714 | 14 | GE array, genetic mutations | 142 | |
| Achilles | 216 | | GE array, genetic mutations, phenotypic information | 54,020 | |
| LINCS | 356 | | RNAseq, Proteomics | 5,943 | |
Multi-assay data sets in GEO as of April 2015. Data types provided by every GEO series were queried using the ‘GEOmetadb’ Bioconductor package [61].
| Number of data types | Number of GEO series |
|---|---|
| 1 | 53 066 |
| 2 | 3382 |
| 3 | 456 |
| 4 | 64 |
| 5 | 8 |
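A tally of this kind could be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes the GEOmetadb SQLite file exposes a `gse` table with a `gse` accession column and a `type` column listing data types separated by semicolons. The rows are invented toy data so that the block runs without any download, and `count_types` and `tally_series` are hypothetical helper names.

```python
import sqlite3
from collections import Counter

def count_types(type_field):
    """Number of distinct data types in a semicolon-delimited field."""
    return len({t.strip() for t in (type_field or "").split(";") if t.strip()})

def tally_series(conn):
    """Map 'number of data types' to 'number of GEO series'."""
    rows = conn.execute("SELECT gse, type FROM gse").fetchall()
    return Counter(count_types(type_field) for _, type_field in rows)

# Toy stand-in for GEOmetadb (assumed schema, invented accessions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gse (gse TEXT, type TEXT)")
conn.executemany("INSERT INTO gse VALUES (?, ?)", [
    ("GSE_A", "Expression profiling by array"),
    ("GSE_B", "Expression profiling by array; Non-coding RNA profiling by array"),
    ("GSE_C", "Expression profiling by high throughput sequencing;\t"
              "Genome binding/occupancy profiling by high throughput sequencing"),
])
tally = tally_series(conn)  # one series with 1 type, two series with 2 types
```

Pointing `sqlite3.connect` at a real GEOmetadb file instead of the in-memory toy table would be the natural next step, provided the schema matches the assumption above.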
Figure 3. The growth of multi-assay genomic data sets in GEO. The GEOmetadb Bioconductor package [61] was used to identify all GEO series providing two or more data types. Using this subset of GEO series, the number of each of these data types was counted per year, and the six most common types are shown. The majority of multi-assay data sets in GEO include expression profiling, noncoding RNA profiling and/or genome binding/occupancy profiling, each by array or high-throughput sequencing, with the number of sequencing experiments catching up to arrays in 2014.
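The counting procedure described in this caption can be sketched as follows. The sketch is illustrative only and rests on assumptions: a `gse` table with an ISO-formatted `submission_date` column and a semicolon-delimited `type` column, toy rows with invented accessions, and the hypothetical helpers `multi_assay_type_counts` and `top_types`.

```python
import sqlite3
from collections import Counter

def multi_assay_type_counts(conn):
    """Count (year, data type) pairs over series with two or more data types."""
    counts = Counter()
    query = "SELECT submission_date, type FROM gse"
    for date, type_field in conn.execute(query):
        types = {t.strip() for t in (type_field or "").split(";") if t.strip()}
        if len(types) >= 2 and date:
            year = int(date[:4])  # assumes dates start with a 4-digit year
            counts.update((year, t) for t in types)
    return counts

def top_types(counts, n=6):
    """The n data types with the highest totals summed over all years."""
    totals = Counter()
    for (_, data_type), k in counts.items():
        totals[data_type] += k
    return [data_type for data_type, _ in totals.most_common(n)]

# Toy stand-in for GEOmetadb (assumed schema, invented accessions).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE gse (gse TEXT, submission_date TEXT, type TEXT)")
db.executemany("INSERT INTO gse VALUES (?, ?, ?)", [
    ("X1", "2013-05-01",
     "Expression profiling by array; Non-coding RNA profiling by array"),
    ("X2", "2014-02-10",
     "Expression profiling by array; "
     "Genome binding/occupancy profiling by high throughput sequencing"),
    ("X3", "2014-07-30", "Expression profiling by array"),  # single assay
])
yearly = multi_assay_type_counts(db)
most_common_type = top_types(yearly, n=1)
```

Single-assay series such as the toy `X3` are filtered out before counting, which mirrors the restriction to series providing two or more data types.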
Figure 4. Overview of clinical annotation in the curatedOvarian data package. Clinicopathological characteristics of patients (columns) are represented across 25 gene expression data sets (rows). For each data set, the percentage of patients in that data set that are annotated by a certain clinical characteristic is represented. Patient treatment by platin, taxol or neoadjuvant therapy is presented as pltx, tax and neo, respectively.
R packages for integrative data analysis
| R/Bioconductor package name | Description | Repository |
|---|---|---|
| PMA [ | Penalized multivariate analysis (sparse CCA, PCA) | CRAN |
| mixOmics [ | rCCA, sPLS, sPLS-DA, rGCCA | CRAN |
| made4 [ | Coinertia analysis | Bioconductor |
| MCIA [ | Multi-CIA | Bioconductor |
| RGCCA [ | rGCCA, sparse GCCA for multi-block data analysis | CRAN |
| CNAmet [ | Signal-to-noise ratio statistic, permutation test | csbi.ltdk.helsinki.fi/CNAmet/ |
| Rtopper [ | Gene set enrichment | Bioconductor |
| iClusterPlus [ | Joint latent variable regression model | Bioconductor |
| STATegra ( | PCA, clustering | Bioconductor |