Abstract
Advances in genomics, sequencing, and high-throughput technologies have led to the creation of large volumes of diverse datasets for drug discovery. Analyzing these datasets to better understand disease and discover new drugs is becoming more common. Recent open data initiatives in basic and clinical research have dramatically increased the types of data available to the public. The past few years have witnessed successful use of big data in many sectors across the whole drug discovery pipeline. In this review, we highlight the state of the art in leveraging big data to identify new targets, drug indications, and drug response biomarkers in this era of precision medicine.
Year: 2016 PMID: 26659699 PMCID: PMC4785018 DOI: 10.1002/cpt.318
Source DB: PubMed Journal: Clin Pharmacol Ther ISSN: 0009-9236 Impact factor: 6.875
Common data types for drug discovery
| Data type | Description | Common techniques | Public availability |
|---|---|---|---|
| SNP | A single nucleotide variation in a genetic sequence | SNP array: most widely used | **** |
| | | Whole genome sequencing | |
| CNV | Variation of the number of copies of a particular gene in the genetic sequence | SNP array: most widely used; less sample DNA required; high probe density and coverage | **** |
| | | Comparative genome hybridization: high sensitivity and specificity; low spatial resolution | |
| | | Whole genome sequencing: can detect smaller CNVs and novel types (e.g., inversions) | |
| Mutation | A permanent change of the nucleotide sequence of the DNA; mostly somatic mutation that occurs in any of the cells except the germ cells | Whole exome sequencing: most widely used | **** |
| | | Whole genome sequencing: more expensive and more coverage | |
| Gene expression | Mostly expression of mRNA but also includes expression of other transcripts | Microarray: most widely used | ***** |
| | | RNA‐Seq: can detect novel transcripts, low abundant transcripts and isoforms | |
| | | Fluorescent | |
| | | RT‐PCR: frequently used to confirm expression for a small number of genes | |
| Protein expression | Can be expression of multiple isoforms or variations due to posttranslational modifications | Western blot: widely used to quantify protein expression for a small number of proteins | *** |
| | | ELISA: widely used to detect and quantitatively measure a protein in samples | |
| | | Immunohistochemistry: can detect intracellular localization for a small number of proteins | |
| | | Reverse phase protein array: can detect expression for a few hundred proteins | |
| | | Mass spectrometry: can detect expression for a wide range of proteins | |
| Protein‐protein interaction | Physical interactions between two or more proteins | Two‐hybrid screening: low‐tech; high false‐positive rate | **** |
| | | Mass spectrometry | |
| Protein‐DNA interaction | Binding of a protein to a molecule of DNA | ChIP‐seq: combines chromatin immunoprecipitation with massively parallel DNA sequencing to identify the binding sites of DNA‐associated proteins | *** |
| Gene silencing | Effect of loss of gene function | RNAi: established method; knocks gene down at mRNA or non‐coding RNA level; can have transient effect (siRNA) or long‐term effect (shRNA) | ** |
| | | CRISPR‐Cas9: new method; modifies gene (via knockout/knockin) at the DNA level; causes permanent and heritable changes in the genome | |
| Gene overexpression | Effect of gain of gene function | cDNAs/ORFs: provide clones of sequence | * |
| Drug efficacy | Effect of drug treatment; primarily represented as IC50/EC50/GI50 | HTS: rapidly assess the activity of a large number of compounds in biochemical assays or cell‐based assays | *** |
| | | MTT assay: often used to confirm activity for a small number of compounds | |
| Drug‐target interaction | Physical interaction between a drug and a protein target | Affinity chromatography with mass spectrometry: most sensitive and unbiased method | *** |
| | | SPR | |
| EMR/EHR | Patient response upon interventions | Digitalization | * |
CNV, copy number variation; CRISPR, clustered regularly interspaced short palindromic repeats; ELISA, enzyme‐linked immunosorbent assay; EMR/EHR, electronic medical/health records; HTS, high throughput screening; MTT, methylthiazol tetrazolium; RT‐PCR, real‐time polymerase chain reaction; SNP, single‐nucleotide polymorphism; SPR, surface plasmon resonance.
The number of asterisks in the "Public availability" column indicates the degree of public availability. For example, ***** means that researchers can easily access this type of data via public portals.
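The drug efficacy row above names IC50/EC50/GI50 as the usual readouts. As a minimal sketch of what such a value means, the following Python example simulates a dose-response curve with a Hill model and estimates the concentration at which the response falls to 50% of the untreated control. The function names, the Hill parameters, and the log-linear interpolation are illustrative assumptions, not a method described in the review; real screening pipelines typically fit the full curve instead.

```python
import math


def hill_response(conc, ic50, hill_slope=1.0, top=1.0, bottom=0.0):
    """Response (e.g., fraction of viable cells) under a four-parameter Hill model.

    Hypothetical helper used only to generate example data.
    """
    if conc <= 0:
        return top
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill_slope)


def estimate_ic50(concs, responses, threshold=0.5):
    """Estimate where the response crosses `threshold`.

    Interpolates linearly in log10(concentration) between the two tested
    doses that bracket the threshold. Returns None when the measured
    responses never cross it (an assumption of this sketch; other
    conventions report the highest tested dose instead).
    """
    points = sorted(zip(concs, responses))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        if r_lo >= threshold >= r_hi and r_lo != r_hi:
            frac = (r_lo - threshold) / (r_lo - r_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Example: a simulated compound with a true IC50 of 2.0 (arbitrary units).
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
viability = [hill_response(d, ic50=2.0) for d in doses]
print(estimate_ic50(doses, viability))
```

Because the interpolation only uses the two bracketing doses, the estimate is approximate whenever the true IC50 lies between tested concentrations; denser dose series or a fitted curve would narrow the error.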
Common public databases for drug discovery
| Database | Description (as of October 2015) | URL |
|---|---|---|
| dbSNP | SNPs for a wide range of organisms, including >150M human reference SNPs. | |
| dbVar | Genomic structural variations (primarily CNVs) generated mostly by published studies of various organisms, including >2.1M human CNVs. | |
| COSMIC | Primarily somatic mutations from expert curation and genome‐wide screening, including >3.5M coding mutations. | |
| 1000 Genomes Project | Genomes of a large number of people to provide a comprehensive resource on human genetic variation, including >2.5K samples. | |
| TCGA | Genomics and functional genomics data repository for >30 cancers across >10K samples. Primary data types include mutation, copy number, mRNA, and protein expression. | |
| GEO | Functional genomics data repository hosted by NCBI, including >1.6M samples. | |
| ArrayExpress | Functional genomics data repository hosted by EBI, including >1.8M samples. | |
| GTEx | Transcriptomic profiles of normal tissues, including >7K samples across >45 tissue types. | |
| CCLE | Genetic and pharmacologic characterization of >1,000 cancer cell lines. | |
| Human Protein Atlas | Expression of >17K unique proteins in cell lines, normal, and cancer tissues. | |
| Human Proteome Map | Expression of >30K proteins in normal tissues. | |
| StringDB | Protein‐protein interactions for >9M proteins from >2K organisms. | |
| ENCODE | Protein‐DNA interactions, including >1.4K ChIP‐Seq experiments across ∼200 cell lines. | |
| Project Achilles | Genetic vulnerabilities across >100 genomically characterized cancer cell lines by genome‐wide genetic perturbation reagents (shRNAs or Cas9/sgRNAs), including >11.2K genes. | |
| LINCS | Cellular responses upon the treatment of chemical/genetic perturbagen, including >1M gene expression profiles representing >5,000 compounds and >3,500 genes (shRNA and overexpression) in >15 cell lines. | |
| Genomics of Drug Sensitivity in Cancer project | Drug sensitivity data of 140 drugs in >700 cancer cell lines. | |
| ChEMBL | Bioactivities for drug‐like small molecules, including >10K targets, >1.7M distinct compounds, and >13.5M activities. | |
| PubChem | Chemical compounds and bioassay experiments, including >60M unique chemical compounds and >1.1M assays. | |
| CMap | >6,000 drug gene expression profiles representing 1,309 compounds tested in 3 main cell lines. | |
| CTRP | Links genetic, lineage, and other cellular features of cancer cell lines to small‐molecule sensitivity, including 860 cell lines and 461 compounds. | |
| ImmPort | Clinical assessments in immunology along with molecular profiles, including 143 clinical studies/trials and 799 experiments on >22.4K subjects. | |
| ClinicalTrials.gov | Registry and results database of publicly and privately supported clinical studies, including >201.7K studies. | |
| PharmGKB | Genetic variations on drug response, including >3K diseases, >27K genes, and >3K drugs. | |
CCLE, Cancer Cell Line Encyclopedia; CMap, Connectivity Map; CNVs, copy number variants; COSMIC, Catalogue of Somatic Mutations in Cancer; CTRP, Cancer Therapeutics Response Portal; dbSNP, Single Nucleotide Polymorphism Database; dbVar, database of genomic structural variation; EBI, European Bioinformatics Institute; ENCODE, Encyclopedia of DNA Elements; GEO, Gene Expression Omnibus; GTEx, Genotype‐Tissue Expression; ImmPort, Immunology Database and Analysis Portal; LINCS, Library of Integrated Network‐based Cellular Signatures; NCBI, National Center for Biotechnology Information; SNPs, single‐nucleotide polymorphisms; TCGA, The Cancer Genome Atlas.
Figure 1. Public datasets can be leveraged to identify new targets, drug indications, and drug response biomarkers.
Figure 2. An illustration of big data approaches to identifying new targets.
Figure 3. An illustration of big data approaches to identifying new drug indications.
Figure 4. An illustration of big data approaches to identifying new drug response biomarkers.