Tsung-Jung Wu, Amirhossein Shamsaddini, Yang Pan, Krista Smith, Daniel J Crichton, Vahan Simonyan, Raja Mazumder.
Abstract
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information, along with variation-related annotation, can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), has been developed for storing, analyzing, computing and curating NGS data and associated metadata. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu.
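The core check described in the abstract, whether an nsSNV falls on an annotated functional site, can be illustrated with a minimal sketch. This is not the BioMuta or HIVE implementation: the record types, the `affected_sites` helper, the `DEMO_*` accessions and all positions are hypothetical, and the sketch assumes variants have already been mapped to 1-based protein coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalSite:
    """A curated sequence feature on a protein (1-based, inclusive range)."""
    accession: str  # protein accession (hypothetical values below)
    start: int
    end: int
    kind: str       # e.g. "active site", "binding site", "PTM site"


@dataclass(frozen=True)
class NsSNV:
    """A non-synonymous variation expressed in protein coordinates."""
    accession: str
    position: int   # 1-based residue position
    ref_aa: str
    alt_aa: str


def affected_sites(variant, sites):
    """Return the functional sites whose range contains the variant position."""
    return [
        s for s in sites
        if s.accession == variant.accession
        and s.start <= variant.position <= s.end
    ]


# Hypothetical records, for illustration only (not taken from BioMuta).
sites = [
    FunctionalSite("DEMO_1", 45, 45, "active site"),
    FunctionalSite("DEMO_1", 100, 110, "binding site"),
    FunctionalSite("DEMO_2", 45, 45, "PTM site"),
]
variants = [
    NsSNV("DEMO_1", 45, "H", "R"),
    NsSNV("DEMO_1", 60, "A", "V"),
    NsSNV("DEMO_1", 105, "K", "N"),
]

for v in variants:
    kinds = [s.kind for s in affected_sites(v, sites)]
    label = ", ".join(kinds) if kinds else "no functional site"
    print(f"{v.accession} {v.ref_aa}{v.position}{v.alt_aa}: {label}")
```

A linear scan is enough for a handful of records; at the scale of whole NGS data sets, an interval index keyed by accession would likely be a better fit.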
Year: 2014 PMID: 24667251 PMCID: PMC3965850 DOI: 10.1093/database/bau022
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. nsSNV data from various sources are collected, filtered and mapped to the UniProtKB/Swiss-Prot-defined complete human proteome and integrated into BioMuta.
Figure 2. HIVE interface showing the result obtained from SNV profiling of short sequence reads mapped to the nucleotide sequence surrounding a variation site. (A) Overall coverage result, with position 121 485 241 showing variation. (B) Reads mapped to the reference sequence, with the column of interest highlighted in yellow. (C) Only variations are shown in this panel.
Twenty-six cancer types and 322 882 (small-scale: 13 896; large-scale: 308 986) associated variations in BioMuta
| Cancer types | COSMIC | | UniProt | | ClinVar | | Manual | | CSR-TCGA | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Small-scale | Large-scale | Small-scale | Large-scale | Small-scale | Large-scale | Small-scale | Large-scale | Small-scale | Large-scale |
| Lung (LUAD) | 121 | 80 006 | 105 | |||||||
| Colon (COAD) | 486 | 68 249 | 235 | 20 | ||||||
| Breast (BRCA) | 176 | 7386 | 342 | 1 | 3314 | 16 | 31 979 | |||
| Esophageal (ESCA) | 43 | 25 980 | 1 | |||||||
| Ovarian (OV) | 1229 | 16 411 | 31 | 1276 | 4 | |||||
| Skin (SKCM) | 496 | 17 041 | 2 | |||||||
| Prostate (PRAD) | 77 | 10 920 | 1 | |||||||
| Head and neck (HNSC) | 716 | 11 838 | 1 | |||||||
| Rectum (READ) | 9760 | 10 | ||||||||
| Lymphoid (DLBC) | 1710 | 7006 | ||||||||
| Adrenocortical (ACC) | 1000 | 4515 | 1 | |||||||
| Pancreatic (PAAD) | 896 | 3164 | 3 | |||||||
| Brain (LGG) | 773 | 2383 | ||||||||
| Uterine (UCEC) | 490 | 1414 | 1 | |||||||
| Kidney (KIRC) | 893 | 115 | ||||||||
| Liver (LIHC) | 1224 | 1023 | 14 | 3 | ||||||
| Glioblastoma (GBM) | 776 | |||||||||
| Acute myeloid (LAML) | 409 | 8 | ||||||||
| Thyroid (THCA) | 513 | 7 | 3 | |||||||
| Bladder (BLCA) | 450 | 2 | ||||||||
| Lung (LUSC) | 256 | |||||||||
| Stomach (STAD) | 89 | |||||||||
| Kidney renal (KIRP) | 33 | 42 | ||||||||
| Kidney chromo (KICH) | 57 | |||||||||
| Non-small lung (NSCLC) | 4 | 8 | ||||||||
| Cervical (CESC) | 1 | 5 | ||||||||
| Other | 5319 | |||||||||
Abbreviations: LUAD, lung adenocarcinoma; COAD, colon adenocarcinoma; BRCA, breast invasive carcinoma; ESCA, esophageal carcinoma; OV, ovarian serous cystadenocarcinoma; SKCM, skin cutaneous melanoma; PRAD, prostate adenocarcinoma; HNSC, head and neck squamous cell carcinoma; READ, rectum adenocarcinoma; DLBC, lymphoid neoplasm diffuse large B-cell lymphoma; ACC, adrenocortical carcinoma; PAAD, pancreatic adenocarcinoma; LGG, brain lower grade glioma; UCEC, uterine corpus endometrial carcinoma; KIRC, kidney renal clear cell carcinoma; LIHC, liver hepatocellular carcinoma; GBM, glioblastoma multiforme; LAML, acute myeloid leukemia; THCA, thyroid carcinoma; BLCA, bladder urothelial carcinoma; LUSC, lung squamous cell carcinoma; STAD, stomach adenocarcinoma; KIRP, kidney renal papillary cell carcinoma; KICH, kidney chromophobe; NSCLC, non-small cell lung cancer; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma.
Small-scale: SNVs associated with publications that report <1000 SNVs.
Large-scale: SNVs associated with publications that report >1000 SNVs, or SNVs identified using computational pipelines from existing NGS data.
Other: cancer types not specified or not well defined.
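The small-scale and large-scale definitions can be written down as a short sketch, which also checks that the two totals in the table caption add up to the reported 322 882. The function name `classify_study` and the `from_pipeline` flag are hypothetical. Because the definitions use <1000 and >1000, a publication reporting exactly 1000 SNVs is not covered, and the sketch raises an error for that case rather than guessing.

```python
# Totals as stated in the table caption.
SMALL_SCALE_TOTAL = 13_896
LARGE_SCALE_TOTAL = 308_986
REPORTED_TOTAL = 322_882


def classify_study(n_snvs: int, from_pipeline: bool = False) -> str:
    """Classify a data source as small- or large-scale.

    n_snvs: number of SNVs reported by the publication.
    from_pipeline: True when the SNVs come from a computational pipeline
    run on existing NGS data (treated as large-scale regardless of count).
    """
    if from_pipeline or n_snvs > 1000:
        return "large-scale"
    if n_snvs < 1000:
        return "small-scale"
    # Exactly 1000 falls between the two stated thresholds.
    raise ValueError("a count of exactly 1000 SNVs is not covered by the definitions")


# The caption's breakdown is internally consistent.
assert SMALL_SCALE_TOTAL + LARGE_SCALE_TOTAL == REPORTED_TOTAL

print(classify_study(16))                          # a manually curated publication
print(classify_study(31_979, from_pipeline=True))  # pipeline-derived variations
```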
Example PubMed search terms and results
| Search terms | Total articles | Positive articles |
|---|---|---|
| SNP, biomarker, cancer | 702 | 60 |
| Biomarker, cancer, single-nucleotide-polymorphism | 1986 | 43 |
| Polymorphism, biomarker, cancer | 5215 | 20 |
| SNP, exon, cancer | 394 | 16 |
| Gene name | 20 | 4 |
| Total | | 143 |
Total articles: total number of articles retrieved using the search terms.
Positive articles: articles from which data were extracted for inclusion in BioMuta.
Gene name: targeted curation of specific genes, e.g. MTA1, MTA2, SULF2, SHBG, DLX4, etc.
Articles and annotations that pass the validation step are retained.
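As a quick arithmetic check on the search table, the sketch below recomputes the total of 143 positive articles and the fraction of retrieved articles that yielded data for each search. The numbers are copied from the table; the variable names are illustrative, and the yield fraction is a derived figure that the source does not report.

```python
# (search terms, total articles, positive articles), copied from the table.
searches = [
    ("SNP, biomarker, cancer", 702, 60),
    ("Biomarker, cancer, single-nucleotide-polymorphism", 1986, 43),
    ("Polymorphism, biomarker, cancer", 5215, 20),
    ("SNP, exon, cancer", 394, 16),
    ("Gene name", 20, 4),
]

# Sum of the "Positive articles" column.
total_positive = sum(positive for _, _, positive in searches)

# Fraction of retrieved articles that contributed data, per search.
yields = {terms: positive / total for terms, total, positive in searches}
best = max(yields, key=yields.get)

print(f"Positive articles in total: {total_positive}")
for terms, fraction in yields.items():
    print(f"{terms}: {fraction:.1%}")
print(f"Highest yield: {best}")
```

On these figures, the targeted gene-name searches give the highest yield per retrieved article, while the broad keyword searches contribute most of the positive articles in absolute terms.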
Figure 3. BioMuta data flow and utility in evaluating variations obtained from various cancers.
Figure 4. Loss of functional sites (PTM sites, active and binding sites). (A) Top six cancer types with the highest number of records in BioMuta: lung adenocarcinoma (LUAD), colon adenocarcinoma (COAD), breast invasive carcinoma (BRCA), esophageal carcinoma (ESCA), ovarian serous cystadenocarcinoma (OV) and skin cutaneous melanoma (SKCM). (B) Statistical analysis of loss of functional sites shows that, for some cancer types, specific functional sites are less susceptible to variation (colored graph area almost touching the perimeter, where the perimeter represents a P-value close to 0).