Computational resources associating diseases with genotypes, phenotypes and exposures.

Wenliang Zhang¹, Haiyue Zhang¹, Huan Yang¹, Miaoxin Li¹, Zhi Xie², Weizhong Li¹.

Abstract

The causes of a disease and its therapies are not only related to genotypes, but also associated with other factors, including phenotypes, environmental exposures, drugs and chemical molecules. Distinguishing disease-related factors from many neutral factors is critical as well as difficult. Over the past two decades, bioinformaticians have developed many computational resources to integrate the omics data and discover associations among these factors. However, researchers and clinicians are experiencing difficulties in choosing appropriate resources from hundreds of relevant databases and software tools. Here, in order to assist the researchers and clinicians, we systematically review the public computational resources of human diseases related to genotypes, phenotypes, environment factors, drugs and chemical exposures. We briefly describe the development history of these computational resources, followed by the details of the relevant databases and software tools. We finally conclude with a discussion of current challenges and future opportunities as well as prospects on this topic.

Keywords: database; disease phenotype; environmental exposure; genotype; software tool; web platform

Year: 2019 PMID: 30102366 PMCID: PMC6954426 DOI: 10.1093/bib/bby071

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

As advances in sequencing and other high-throughput technologies produce big omics data for medical research, utilizing and analyzing these data to understand human diseases has become increasingly challenging. Whole exome sequencing or whole genome sequencing can uncover hundreds of thousands to millions of variants, of which only a few may be disease-causing or disease-related [1-4]; identifying disease-causing genes and pathogenic variants is therefore critical in human genetics studies. The focus of the genetics field has shifted from the production of genotypic data to the annotation and interpretation of analysis results. The causes of a disease and its therapies are related not only to genotypes but also to other factors, such as phenotypes, environmental exposures, drugs and chemical molecules. Distinguishing disease-related factors from the many neutral factors is both critical and difficult. Misassigning pathogenicity to factors may result in inaccurate disease-risk assessments and diagnoses along with unsuitable treatments. An individual’s phenotype, broadly defined as any observable characteristic of that individual [5], arises from complex interactions among these factors. Accurate interpretation of the relationships between these factors is fundamentally important for the investigation of human disease mechanisms. Over the past two decades, bioinformaticians have developed more than 100 computational resources to integrate the omics data and discover associations among genotypes, phenotypes, environmental exposures, drugs and chemical molecules.
These computational resources, including databases such as Online Mendelian Inheritance in Man (OMIM) [6], ClinVar [7] and dbGaP [8-10], software tools such as PolyPhen [11], ANNOVAR [12], Eigen [13], DeepSEA [14] and PhenIX [15], and web platforms such as Open Targets [16] and DisGeNET [17], offer online and standalone applications to prioritize genotype–phenotype associations (GPAs), phenotype-drug/chemical-target associations and other associations. These resources have facilitated research in the life sciences and greatly supported the development of precision clinical medicine. However, researchers and clinicians have difficulty choosing appropriate resources from hundreds of relevant databases and software tools. A critical review of the disease-related databases and tools is therefore needed, not only for life scientists but also for medical researchers and clinicians. Here we systematically review the public computational resources for human diseases related to genotypes, phenotypes, environmental factors, drugs and chemical exposures. We begin with the development history of computational resources for human diseases, followed by a description of the relevant databases and a comparison of their data scales and scopes of use. We then summarize and compare the software tools and web platforms that support a deeper understanding of the associations between multiple disease-related factors. Finally, we conclude with a discussion of current challenges, future opportunities and prospects on this topic.

Development of the computational resources

Disease-related data, including phenotypes, genotypes, environment factors and drug/chemical exposures, were mainly generated by a range of international projects or research programs and have been stored and integrated in different public computational resources, freely available to the public (Figure 1 and Supplementary S-Table 1).

Figure 1

Development history of disease-related computational resources. The development of disease-related databases, software tools and web platforms is depicted over the timeline. According to scopes and applications, the computational resources are classified into different groups.

OMIM is the first established database to provide a catalog of human genes and genetic disorders [6], followed by the start of the Human Genome Project in 1990. Five years later, the Human Gene Mutation Database (HGMD) was published to handle data on human gene mutations [18, 19], followed by the construction of dbSNP [20] and Orphanet [21] in the late 1990s to integrate data on single nucleotide polymorphisms (SNPs) and rare diseases based on protein-coding genes. Since 2000, several model organisms have been developed, and the databases for these model species are available not only for life science studies but also for medical research, e.g. Mouse Genome Database (MGD) [22], MouseNet [23], Rat Genome Database (RGD) [24] and Zebrafish Model Organism Database (ZFIN) [25]. In the 2000s, databases of drug targets and chemical molecules, such as PharmGKB [26], DrugBank [27] and PubChem [28], were established to accelerate the development of molecular drugs. Since the late 2000s, noncoding RNAs have been found to be important in the development of diseases [29-33], and databases have therefore been constructed to classify relationships between noncoding RNAs and human diseases, for example, NONCODE [34], miR2Disease [35] and LncRNADisease [36]. At the same time, international projects and research programs in population genomics, including 1000 Genomes [37], TCGA [38, 39], ICGC [40] and UK10K [41], have produced biomedical big data for the life and medical science communities to share, analyze and utilize.
Environmental factors (EFs), drugs and chemicals also play critical roles in the development of diseases, and their associations are captured in databases such as the Comparative Toxicogenomics Database (CTD) [42], LncEnvironmentDB [43] and Exposome-Explorer [44]. With the rapid growth of data in these databases, data mining and analysis have become another challenge. Since 2003, at least 30 tools have been developed to annotate, predict and prioritize the functional effects of genomic variants, as well as to identify genomic variants of uncertain significance (Figure 1 and Supplementary S-Table 1), e.g. SIFT [45], PolyPhen [11], ANNOVAR [12], VAAST [46] and GWAVA [47]. Additionally, several ontology-driven computational tools, such as PhenIX [15] and Phevor [48], have been developed to facilitate the clinical interpretation of genomic variants based on functional prediction and deep phenotype annotation. Moreover, machine learning technologies (including deep learning) have recently been applied to predict variants and their biological effects, for example, CADD [49], Eigen [13], DeepSEA [14] and DeepVariant [50]. Furthermore, several web platforms, such as DisGeNET [17], MalaCards [51], Monarch Initiative [52] and Open Targets Platform [16], have been established to comprehensively integrate a variety of disease-related data sources with computational tools, allowing easy and simultaneous data access and analysis.

Databases

Dozens of public databases have been developed to store, retrieve and manage disease-related data. According to their scopes and data associations, the databases can be categorized into seven groups (Table 1). The group for coding genes includes data resources that primarily provide association information between protein-coding genes and phenotypes of human diseases, while the group for ncRNAs contains information on noncoding RNAs associated with diseases. The group for genomic variations associates genomic variant information with disease phenotypes. The group for population genomic data focuses on worldwide clinical genomic variation and allele frequencies in various populations. The group of genetical organism models stores association information between genotypes and phenotypes/diseases of laboratory organisms. The group of environmental exposures offers toxicogenomic relationships among exposure factors, genes, proteins, phenotypes and diseases. The treatment group provides information on target drugs, drug resistance mutations, diseases and their associations. All of these databases offer access to their data through web browsers, and some of them also offer web Application Programming Interfaces (APIs). Table 1 summarizes the groups according to their scopes and associations. Table 2 gives the current status of the databases, and Supplementary S-Table 2 gives their nomenclature standards. The URLs of the databases can also be found in the supplementary file.

Table 1

Comparison of different disease-related data resources

Data resource name	Phenotype/Disease: Mendelian and Rare	Phenotype/Disease: Complex and Trait	Phenotype/Disease: Organism model of disorder	Genotype: Coding	Genotype: Non-coding	Genotype: Function annotation of variant	Environmental factors	Drugs/chemicals	Association: Types	Association: Score
Coding genes
OMIM	√(M)	√(F)	√	√(M)	√(F)				GPAs
Orphanet	√			√(M)	√(F)			√	GPAs, PDAs
DIDA	√			√(digenic)					GPAs
DiseaseMeth		√(Cancer)		√					GPAs
Noncoding RNAs
miR2Disease		√			√(miR)				GPAs
HMDD v2.0		√			√(miR)				GPAs
NONCODE		√			√(lnc)				GPAs
LncRNADisease		√			√(lnc)				GPAs
Lnc2Cancer		√(Cancer)			√(lnc)				GPAs
NSDNA		√(NSDs)			√(ncR)				GPAs
circRNADisease		√			√(circ)				GPAs
MNDR	√(F)	√(M)			√(ncR)				GPAs	√
Genomic variants
HGMD	√	√		√(M)	√(F)	√			GPAs	√
ClinVar	√	√		√(M)	√(F)	√			GPAs	√
VarCards	√	√		√(M)	√(F)	√			GPAs	√
GWAS Catalog		√		√(M)	√(F)	√			GPAs	√
GWAS Central		√		√(M)	√(F)	√			GPAs	√
GWASdb		√		√(M)	√(F)	√		√	GPAs, GDAs	√
COSMIC		√(Cancer)		√(M)	√(F)	√		√	GPAs, GDAs	√
CIViC		√(Cancer)		√		√		√	GPAs, GDAs	√
Denovo-db		√(NSDs)		√(M)	√(F)	√			GPAs	√
miRdSNP		√			√(miR)	√			GPAs
LincSNP		√			√(lnc)	√			GPAs	√
LncRNASNP		√			√(lnc)	√			GPAs	√
Population genomic data
dbSNP				√	√	√				√
ESP	√	√		√		√			GPAs
ExAC				√		√
1000Genome				√	√	√
Kaviar				√	√	√
FINDbase				√	√	√
Genetical organism models
MGD	√	√	√(Mouse)	√					GPAs
MTB		√(Cancer)	√(Mouse)	√					GPAs
RGD	√	√	√(Rat)	√					GPAs
ZFIN	√	√	√(zebrafish)	√				√	GPAs
Environmental exposures
CTD	√	√		√			√		GPAs, GEFAs, PEFAs
miREnvironment		√			√(miR)		√		GPEFAs
SM2miR		√			√(miR)		√	√	GEFAs
LncEnvironmentDB					√(lnc)		√		GEFAs	√
DLREFD		√			√(lnc)		√	√	GPEFAs

Table 2

Summary of disease-related databases

Database	Scope and scale	Date of statistic
Coding genes
OMIM [6]	15 919 gene descriptions, 8670 phenotypes and 3928 genes with association to 1 or more phenotype(s)	22 June 2018^w
Orphanet [53]	6949 associations between genes and rare diseases	Aug 2016^w
Gene2phenotype [54]	2285 GPAs in developmental disorders	Oct 2017^w
DIDA [55]	213 digenic combination-disease associations in 44 digenic diseases	Oct 2015^p
DiseaseMeth v2.0 [56]	679 602 aberrant DNA methylation-disease associations in 88 diseases, especially in various cancers	Nov 2016^p
Noncoding RNAs
NONCODE [57]	1110 lncRNAs associated with 284 diseases	Nov 2016^p
miR2Disease [35]	3273 associations between 349 miRNAs and 169 diseases	Jun 2018^w
HMDD v2.0 [58]	10 368 associations between 572 miRNAs and 378 diseases	Jun 2013^p
LncRNADisease [36]	3000 associations between 914 lncRNAs and 329 diseases	July 2017^w
Lnc2Cancer [59]	1488 associations between 666 lncRNAs and 97 cancers	July 2016^w
NSDNA [60]	26 128 associations between 8736 ncRNAs and 144 nervous system diseases	May 2017^w
circRNADisease [61]	354 associations between 330 circRNAs and 48 diseases	Apr 2018^p
MNDR v2.0 [62]	8824 lncRNA-disease, 70 381 miRNA-disease, 118 piRNA-disease and 67 snoRNA-disease experimental associations across 6 mammals	Nov 2017^p
Genomic variants and population genomics
ClinVar [7]	428 435 genomic variant-disease associations across 30 181 genes	Jun 2018^w
HGMD [63]	224 642 disease related variants on 8784 genes	Jan 2018^w
Denovo-db [64]	32 991 de novo genetic variants in neurodevelopmental disorders	July 2016^p
VarCards [65]	110 154 363 artificially generated SNVs and 1 223 370 reported indels in coding regions and splicing sites	Oct 2017^p
LOVD 2.0 [66]	3 334 104 (2 400 084 unique) variants in 248 807 individuals in 86 LOVD installations	Dec 2015^p
MITOMAP [67]	1746 variants on mitochondrial DNA	Dec 2015^p
COSMIC [68]	208 368 associations between somatic mutations and cancer	Nov 2016^p
CIViC [69]	1678 interpretations of clinical relevance for 713 variants affecting 283 genes associated with 209 cancer subtypes and 291 drugs	Feb 2017^p
GWAS Catalog v2 [70]	∼60 000 associations between SNPs and traits/diseases	Apr 2018^w
GWASdb v2.0 [71]	252 530 associations between SNPs and traits/diseases	Nov 2015^p
GWAS Central [72]	69 986 326 associations between 2 974 961 SNPs and 829 traits/diseases	Nov 2017^w
LincSNP2.0 [73]	371 647 associations between lncRNA SNPs and diseases, and 1 266 485 Linkage disequilibrium (LD)-SNPs	Oct 2016^p
LncRNASNP2 [74]	697 lncRNA-disease associations; 602 GWAS-SNPs and 2 859 147 SNPs in LD regions	Oct 2017^p
miRdSNP [75]	786 associations between 630 unique disease-associated SNPs and 204 disease types	2012^p
miRNASNP [76]	2257 SNPs in 1596 human pre-miRNAs; 706 SNPs in miRNA mature regions and 227 SNPs in miRNA seed regions	Jan 2015^p
dbSNP [77]	A genomic variation database including 660 773 127 SNPs of Homo sapiens.	Mar 2018^w
ExAC [78]	Variations from 130 000 subject exome sequencing data from a wide variety of large-scale sequencing projects	Aug 2016^p
ESP [79]	1 788 563 variants of 6700 exome sequencing data from heart-, lung- and blood-related diseases and traits	Oct 2016^p
1000 Genomes [80-82]	Over 88 million variants of 2504 whole genome sequencing data from 26 populations	Oct 2015^p
Kaviar [83]	Over 162 million variants from 35 projects encompassing 13 200 genomes and 64 600 exomes	Feb 2016^w
Genetically modified organism models
MGD [84]	5021 associations between mouse genetic models and human diseases	Nov 2016^p
MouseNet v2 [85]	788 080 functional gene network associations for laboratory mouse and eight other model vertebrates	Nov 2015^p
MTB [86]	6057 associations between mouse genetic models-human cancer; 2288 associations between specific genes-cancers	Oct 2014^p
RGD [87]	2998 associations between rat genetic models-human diseases	Nov 2016^p
ZFIN [88]	11 348 associations between zebrafish genetic models-human diseases	Nov 2016^p
Environmental exposures
CTD [89]	1 379 105 chemical-gene associations, 202 085 chemical-disease associations and 33 583 gene-disease associations	Sep 2016^p

Notes to Table 1: According to scopes and data associations, the databases can be categorized into major groups, but some of them could be included in multiple groups. The symbol ‘√’ indicates the relevant information provided in each database. The following are the name abbreviations: NSDs: nervous system diseases; M: majority; F: few; lnc: lncRNA; miR: miRNA; circ: circRNA; ncR: ncRNAs, including lncRNA, miRNA, piRNA, siRNA and snoRNA etc.; GPAs: genotype–phenotype associations; GDAs: genotype-drug associations; PDAs: phenotype-drug associations; GPEFAs: genotype–phenotype-EF associations; GEFAs: genotype-environmental factor associations; PDTAs: phenotype-drug-target associations; DTAs: drug-target associations.

Notes to Table 2: Scope refers to the major focus of the databases. The number of associations or items currently provided in the database is given. In the date of statistic, p indicates the Month-Year of statistic from journal publications; w refers to the Month-Year of statistic from official websites.

Coding genes

Approximately 50 databases provide disease-related phenotype information associated with genotypes. Several of them focus on the associations between protein-coding genes and phenotypes (Table 1). One of the most widely used is OMIM, which is manually curated from the peer-reviewed literature and other medical information and offers a broad and powerful compilation of knowledge about human genes, genetic phenotypes and the relationships between them [6]. The latest OMIM release contains 15 919 gene descriptions, 8670 phenotypes and 3928 genes associated with one or more phenotypes [6] (Table 2). A similar resource is Orphanet [53]. Rather than targeting Mendelian disorders, Orphanet focuses on easy access to accurate and specific recommendations for the management of rare diseases. It links the classification of rare diseases, textual data and the appropriate services for patients and healthcare professionals. It has been argued that many diseases classically considered monogenic may be better described by more complex inheritance, such as oligogenic mechanisms [101]. Gazzo et al. published the DIDA database as a Nucleic Acids Research breakthrough article in 2016, offering for the first time detailed information on the genes and associated genetic variants involved in digenic disorders, the simplest form of oligogenic inheritance [55]. The current DIDA database includes 213 digenic combination-disease associations involved in 44 digenic diseases (Table 2). The publication of DIDA may stimulate further data annotation and tool development for deciphering more complex inheritance, such as polygenic disorders. Complex diseases generally involve alterations at multiple levels, such as epigenetic and transcriptomic alterations [102, 103].
The human disease methylation database (DiseaseMeth), first published in 2012 [104], associates aberrant DNA methylation with human diseases, especially various cancers. Data in DiseaseMeth are manually or computationally extracted from experimental studies and high-throughput methylome data. The current DiseaseMeth database [56] contains over 679 000 aberrant DNA methylation-disease associations across 88 diseases (Table 2). To identify correlations between DNA methylation and RNA expression, another methylation-related database, MethHC, provides a large collection of DNA methylation data combined with mRNA/microRNA expression profiles in human cancer [105]. These resources provide coding gene-disease associations that are of great utility for various research and clinical purposes, including investigating the causes of specific human diseases and interpreting the clinical significance of genetic dysfunctions in coding genes. We recommend OMIM for studies of Mendelian inheritance, Orphanet for rare disorders, DIDA for digenic disorders and DiseaseMeth for disease-related methylation.

Noncoding RNAs

A large portion of the human genome is transcribed into noncoding RNAs (ncRNAs), particularly long noncoding RNAs (lncRNAs), microRNAs (miRNAs) and circular RNAs (circRNAs), potentially representing another layer of epigenetic regulation [33, 106]. Accumulating evidence shows that ncRNAs play critical roles in many important biological processes [32] and that their deregulation may be related to a broad spectrum of diseases [29-33]. ncRNAs have thus become a novel class of potential biomarkers and targets for disease diagnosis, therapy and prognosis. Owing to their functional and clinical significance, several databases have been established since 2005, including miRBase [107] for miRNAs, and NONCODE [57], LNCipedia [108] and lncRNAdb [109] for lncRNAs. These databases connect ncRNAs to diseases and also integrate annotation data on sequences, functions, expression, related targets and cellular locations. For example, the latest NONCODE [57] has annotated 167 150 human lncRNA sequences, of which 1110 are associated with 284 diseases [36] (Table 2). Several databases focus on the association between ncRNA dysregulation and human diseases (Table 1 and Table 2). For example, miR2Disease [35] and the Human MicroRNA Disease Database (HMDD) [58] provide miRNA dysregulation-human disease associations and miRNA-target associations. The current release of HMDD has integrated 10 368 associations between 572 miRNAs and 378 diseases. Similarly, LncRNADisease [36] and Lnc2Cancer [59] contain manually curated entries of experimentally supported lncRNA-disease associations and lncRNA-target associations, and the latter focuses on association data for cancer research. Unlike LncRNADisease and Lnc2Cancer, the Nervous System Disease NcRNAome Atlas (NSDNA) [60] aims to offer a comprehensive, high-quality and specialized resource of NSD-related ncRNA dysregulation.
It manually collects experimentally supported associations between nervous system diseases (NSDs) and different types of ncRNAs, including miRNAs, lncRNAs, piRNAs, siRNAs and snoRNAs. The latest NSDNA [60] contains 26 128 associations between 8736 ncRNAs and 144 NSDs (Table 2). The MNDR database [110] integrates experimentally supported and predicted ncRNA-disease associations from 14 resources such as HMDD [58], Lnc2Cancer [59], NSDNA and LncDisease [111]. Moreover, some databases store predicted circRNA-disease associations, such as Circ2Traits [112], while others store circRNA-disease associations manually curated from peer-reviewed papers, such as circRNADisease [61]. Currently, circRNADisease provides 354 curated associations between 330 circRNAs and 48 diseases, including cancers, neurodegeneration and cerebrovascular diseases [61]. Each association carries comprehensive annotation such as circRNA name, expression pattern, associated partners, associated diseases, experimental detection techniques and publication reference. The above resources of ncRNA-disease relationships can be used together to discover and predict associations between novel ncRNAs and diseases, and to facilitate the interpretation of the clinical significance of dysfunctions in ncRNAs. Lnc2Cancer is preferable for studying cancer-related lncRNAs, and NSDNA for NSD-related ncRNAs.

Genomic variations

Many genetic and complex diseases are associated with genomic variations, and many genotype–phenotype databases therefore store and curate germline and somatic variations in single genes across the majority of genetic diseases, including Mendelian disorders, rare diseases and complex traits (Table 2). HGMD [63] is a representative repository for the clinical annotation of genetic mutations, manually curated from more than 2600 peer-reviewed journals. HGMD is distributed in two versions: the public version is freely available to users from academic institutions and non-profit organizations, while the subscription version is available to all users under a commercial license provided by QIAGEN Inc. Another representative repository is ClinVar [7], which provides clinical annotation of genomic variation data. Data in ClinVar are submitted by clinical laboratory users and integrated from a variety of curated resources, including HGMD. Compared to HGMD, the freely available database LOVD provides not only gene-centric collection and web search of nuclear DNA variations, but also patient-centric data storage and storage of NGS data, even of variants outside of genes [66]. Moreover, MITOMAP reports 1746 human mitochondrial variants associated with diseases [67]. To standardize annotation and improve the accessibility of genomic variants, Li et al. developed VarCards, which artificially generates all possible human single nucleotide variants (SNVs) in coding regions and splicing sites and classifies all reported insertions and deletions (indels) [65]. VarCards annotates variants using more than 60 genomic data sources, covering disease-associated knowledge, functional effects, drug–gene interactions, consequences predicted by different in silico algorithms and allele frequencies in different populations [65]. VarCards currently maintains over 110 million possible SNVs and more than 1.2 million reported indels (Table 2).
Additionally, several other databases cover genomic variations from genome-wide association studies (GWASs), such as GWAS Catalog [70], GWASdb [71] and GWAS Central [72], and somatic variations in cancer, such as the Catalogue of Somatic Mutations in Cancer (COSMIC) [68]. In recent years, abundant de novo variants and non-coding variants have been discovered in studies of complex diseases [64]. Variants that are present in an individual but in neither of his/her parents are termed de novo [113]. To facilitate better use of de novo variant data, many databases have been established to integrate, characterize and annotate disease-related human de novo variants, including Denovo-db [64], NPdenovo [114] and Developmental Brain Disorder [115]. On the other hand, a few other databases focus on disease/trait-related variants in human ncRNAs, noncoding regions or transcription factor binding sites (TFBSs), e.g. lncRNASNP [74], SNP2TFBS [116], miRdSNP [75], miRNASNP [117] and LincSNP 2.0 [73]. LincSNP specifically integrates and annotates disease-associated SNPs in human lncRNAs and TFBSs [73]. Similarly, miRNASNP [117] collects polymorphisms that alter miRNA target sites, in order to identify miRNA-related SNPs among GWAS SNPs and eQTLs. The current miRNASNP [76] has integrated multiple filters to prioritize functional SNPs and experimentally supported miRNA-mRNA pairs, and provides expression level annotation and correlation of miRNAs and target genes in various tissues. A common limitation of the above resources is that they lack a mechanism for rapidly improving the content and annotation of genomic variants. To address this, Griffith et al. have recently developed the CIViC knowledgebase, in which biocurators annotate the clinical interpretation of variants in cancer, covering the predisposing, diagnostic, therapeutic and prognostic relevance of somatic and germline variants of all types [69].
CIViC currently provides 1678 interpretations of clinical relevance for 713 variants affecting 283 genes associated with 209 cancer subtypes and 291 drugs. The variants in CIViC are annotated with the provenance of supporting evidence, allowing users to transparently generate current and accurate variant interpretations [69]. Altogether, these comprehensive resources of genomic variants with disease-related annotations are not only valuable for investigating the functions and mechanisms of coding genes and ncRNAs in human diseases, but also helpful for developing computational tools that functionally predict and interpret the clinical significance of genomic variants in exome and genome sequencing data. On the basis of maturity and annotation quality, HGMD, ClinVar, CIViC and COSMIC are highly recommended in this category.
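
The definition of a de novo variant given above (present in an individual but in neither parent) can be illustrated with a minimal sketch. The genotype representation and the function name `is_de_novo` are assumptions made here for illustration; they are not taken from Denovo-db or any other resource discussed in this review, and real pipelines would also weigh sequencing depth, genotype quality and mosaicism.

```python
from typing import Tuple

# A genotype is modelled as the pair of alleles observed at one site,
# e.g. ("A", "G"). This representation is an assumption of the sketch.
Genotype = Tuple[str, str]


def is_de_novo(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """Return True if the child carries an allele seen in neither parent."""
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


# The child carries "G", which neither parent carries: candidate de novo variant.
print(is_de_novo(("A", "G"), ("A", "A"), ("A", "A")))
# The "G" allele is inherited from the mother: not de novo.
print(is_de_novo(("A", "G"), ("A", "G"), ("A", "A")))
```

The check only compares allele sets at a single site, so it is a simplification of what trio-based callers do in practice.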

Population genomic data

Population genomics examines genomic variations within and among various populations. NCBI’s dbSNP is the first published population genomic database [20]; it stores SNPs and other classes of minor genetic variation, including indels, copy number variations (CNVs) and structural variations, from multiple resources [77]. With NGS technology being widely adopted, several international projects have been launched to construct and integrate large numbers of genomic databases associated with population phenotypes and features. These projects include the National Heart, Lung and Blood Institute Exome Sequencing Project (NHLBI ESP), the Exome Aggregation Consortium (ExAC), 1000 Genomes and Kaviar (Table 2). NHLBI ESP [79] has offered unprecedented depth for identifying rare variants located in protein-coding regions from about 6500 individuals who have been clinically diagnosed with heart, lung and blood disorders. Similarly, ExAC [118] has discovered rare variants from over 130 000 subjects whose exomes have been sequenced as part of various disease-specific and population genetic studies. Compared to NHLBI ESP and ExAC, the 1000 Genomes project provides a comprehensive resource of over 88 million human genomic variants in 2504 individuals from 26 populations [80-82]. 1000 Genomes also offers freely available RNA expression data from RNA sequencing and expression arrays, which can be explored to determine whether genomic variants are associated with changes in gene expression at the RNA level [119]. Another consolidated database for allele frequencies is Kaviar [83], which contains genotype information on over 162 million variants from 35 projects, encompassing 13 200 genomes and 64 600 exomes. dbSNP is recommended for its quality annotation and maturity, Kaviar for its large scale of data in both genomes and exomes, and 1000 Genomes for studying diseases associated with different populations.
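
One common use of the population allele frequencies described above is to retain only rare variants as disease candidates. The following is a minimal sketch of that idea under stated assumptions: the variant identifiers, population labels, frequencies and the 1% threshold are all hypothetical, and the code does not reproduce the interface or data format of dbSNP, ExAC, 1000 Genomes or Kaviar.

```python
from typing import Dict, List

# Hypothetical records: variant identifier -> allele frequency per population.
# Identifiers, population labels and values are illustrative only.
FREQUENCIES: Dict[str, Dict[str, float]] = {
    "var1": {"AFR": 0.12, "EUR": 0.20},
    "var2": {"AFR": 0.0004, "EUR": 0.0},
    "var3": {},  # never observed in the population resource
}


def max_population_af(freqs: Dict[str, float]) -> float:
    """Highest allele frequency across populations (0.0 if never observed)."""
    return max(freqs.values(), default=0.0)


def rare_variants(table: Dict[str, Dict[str, float]],
                  threshold: float = 0.01) -> List[str]:
    """Keep variants whose frequency is below the threshold in every population."""
    return sorted(v for v, f in table.items() if max_population_af(f) < threshold)


print(rare_variants(FREQUENCIES))  # ['var2', 'var3']
```

Taking the maximum across populations, rather than a pooled frequency, is a design choice of this sketch: it avoids keeping a variant that is rare overall but common in one population.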

Genetical organism models

Despite recent success in identifying causative associations between genetic alterations and disorders, GPAs remain undiscovered for many diseases. For example, almost half of the known genetic disorders recorded in the OMIM knowledgebase still lack identified causative genes [120]. With advanced gene modification and gene editing technologies such as RNAi, zinc-finger nucleases, TALENs and the CRISPR/Cas system, a number of genetically modified organism models have been constructed to investigate genetic mechanisms in human diseases and to identify GPAs. The disease-associated information of genetically modified organism models is annotated and available from different databases, such as MGD [84], MouseNet [85], Mouse Tumor Biology (MTB) [86], RGD [87] and ZFIN [88, 121] (Table 2). MGD is a highly integrated and curated database, housing comprehensive knowledge about mouse genes, genetic markers and genomic features as well as associations with various human diseases [84]. MGD also provides the Human-Mouse Disease Connection portal to facilitate the investigation of phenotypic similarity between mouse models and human patients. Similarly, RGD is a comprehensive data repository for the laboratory rat, covering genomic and genetic variants as well as disease data [87]. The disease portals at RGD are entry points to data and tools related to 12 classes of diseases, including cancer, diabetes, aging and cardiovascular disease. In contrast to MGD, MTB is a database for mining data on tumor development and patterns of metastasis [86]. It can facilitate the selection of strains in cancer research. In addition, zebrafish (Danio rerio) is another useful model organism for investigating human disease, especially developmental disorders. ZFIN is a central resource for zebrafish genomic, genetic, phenotypic and developmental data [88].
MGD, MTB, RGD and ZFIN house thousands of disease associations between the model species and human beings, involving cancer, mutation, congenic and transgenic constructions, etc. Other specialized organism model resources for rhesus monkey [122], dog [123], chicken [124], Drosophila [125] and Caenorhabditis elegans [126, 127] have also integrated confirmed association information between genetic markers and disorders. Thus, genetical organism models associated with diseases are useful resources for demonstrating and identifying the relationships between genetic alterations and phenotypes of human diseases.

Environmental exposures

In addition to genetic factors, accumulating evidence suggests that EFs contribute greatly to the development of many diseases, especially complex disorders such as cancer and cardiovascular diseases [128-131]. Moreover, complex interactions between genetic factors and environmental exposures play critical roles in developing the phenotypes of diseases. Several databases have been established to associate environmental factors with protein-coding genes and phenotypes of diseases [44, 90, 132–134] (Table 2). For example, the CTD [89] is a comprehensive repository of interactions between chemicals and gene products, as well as their relationships to diseases. The latest CTD contains over 30.5 million toxicogenomic relationships for chemical-gene, chemical-disease and gene-disease interactions [89]. Different from CTD, the Exposome Explorer database focuses on annotating biomarkers of exposure to environmental risk factors and diet [44]. Recently, it has been suggested that miRNAs, lncRNAs and other types of ncRNAs, like other genetic factors, also have complex interactions with a wide spectrum of exposure factors such as drugs [135], stress [136], alcohol [137], cigarette [138], virus [139], radiation [140], air pollution [141] and diet [142] in the development of diseases. With the rapid growth of interaction data between ncRNAs, environmental exposures and the development of diseases, a number of databases have been generated to describe their relationships, such as SM2miR [91], miREnvironment [92], DLREFD [93] and LncEnvironmentDB [43] (Table 2). SM2miR is the first established database to provide experimentally validated effects of small molecules on miRNA expression and hosts manually curated association data between miRNAs and small molecules across 17 species [91]. 
Compared to SM2miR, miREnvironment not only provides manually curated information on environmental exposures and miRNA expression, but also offers phenotypes associated with miRNAs and EFs [92]. Different from SM2miR and miREnvironment for miRNAs, DLREFD [93] and LncEnvironmentDB [43] focus on the lncRNAs that are experimentally or computationally associated with environmental exposures and disease-related phenotypes. These environment-related databases (Table 2) are valuable data resources for investigating the impacts of EFs on the development of human diseases at the molecular level as well as at the network level. Due to the large numbers of associations, CTD is highly recommended for coding genes associated with environmental and chemical exposures in this category.

Drugs and their targets

To facilitate successful medicine research with comprehensive information across the drug discovery and development process, several public repositories have been established that are dedicated to associations across phenotypes, drugs, chemicals and their targets (Table 2). Therapeutic Target Database (TTD) is the earliest repository [143] to provide information about drugs, targets and their associations with specific pathways. DrugBank [95] and DrugCentral [96] are the other two main databases, hosting comprehensive drug-target interactions and drug action information captured and integrated from online non-commercial resources, e.g. the US Food and Drug Administration (FDA), the European Medicines Agency and the Japan Pharmaceutical and Medical Devices Agency, as well as curated data from published research articles and drug labels. DrugBank and DrugCentral have become the reference drug data source for a number of well-known public databases such as PubChem [144], ChEMBL [94], PharmGKB [98], UniProt [145] and SuperTarget [146]. Moreover, TTD, DrugBank and DrugCentral link drugs to targets and pathways to support in silico drug discovery efforts. Other notable databases include PharmGKB [98] for the impact of human genetic variations on drug responses, and the Drug-Gene Interaction Database (DGIdb) [99] for drug–gene interactions and gene druggability. Moreover, several databases have integrated drug-target information with special medical indications, such as cancer [100, 147, 148], side effects [149], pharmacophores [150] and special metabolic pathways [151]. These drug-disease data resources enable the investigation of drug effects in specific genetic contexts and provide new insights into drug action at the molecular level. Owing to their maturity and data quality, ChEMBL and DrugBank are recommended for drug annotation in this category. On the other hand, PharmGKB is recommended for interpreting the impact of human genetic variations on drug responses.

Software tools and web platforms

Software tools and web platforms are another type of computational resource, enabling a deeper understanding of associations between multiple disease-related factors. Most of the available public software tools used to bridge the gaps between biology, medicine and clinic are driven by either genomic features or ontologies. These tools can be downloaded and used to analyze data on a standalone computer. For online analysis, several web platforms have been constructed that include interactive applications comprehensively integrating a variety of disease-related data sources and software tools to prioritize disease-related associations spanning genotypes, phenotypes and treatments.

Genomic feature-driven tools

To facilitate clinical interpretation of genetic and genomic factors, many computational tools have been developed based on various features including evolutionary conservation, sequence homology and genomic and epigenetic annotations (Table 3). These computational tools have been widely used to annotate, predict and prioritize the functional effects of various genomic variants from high-throughput sequencing data, including KGGSeq [152, 153], ANNOVAR [12] and wANNOVAR [154] for functional annotation of genetic variants, VEST3 [155] and REVEL [156] for prioritization of rare missense variants, GWAVA [47] and DeepSEA [14] for prioritization of noncoding variants, and MutationTaster [157], VAAST [46], CADD [49], DANN [158], FATHMM-MKL [159] and Eigen [13] for prediction of the functional consequences of both coding and non-coding variants (Table 3). Some past research has attempted to compare the usage and performance of these tools. It has been shown that Eigen has better discriminatory ability than CADD when tested on disease-related variants and putatively benign variants in both noncoding and coding regions [13]. Moreover, M-CAP [160] and InterVar [161] were developed to eliminate the majority of variants of uncertain significance and facilitate interpretation of the clinical significance of variants (Table 3). Furthermore, SIFT [45], LRT [162], PolyPhen2 [11], MutationAssessor [163], PROVEAN [164], FATHMM [165], MetaSVM [166] and IMHOTEP [167] have been developed to predict the functional impacts of amino acid substitutions (Table 3). In predicting polymorphisms and mutations for variants causing single amino acid substitutions, MutationTaster2 [168] had the highest accuracy compared to SIFT, PolyPhen-2 and PROVEAN. Different from all the above tools, ClinLabGeneticist [169] was established to manage clinical genetic variants from whole exome sequencing based on extensive variant annotation data (Table 3). 
ClinLabGeneticist contains information of data entry, distribution of work assignments and selection of variants for validation, report generation and communications between various personnel, and the entire workflow of ClinLabGeneticist has been integrated into a single data management platform.
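The scoring tools above each assign per-variant values that are typically filtered and ranked downstream. The following is a minimal sketch of that step, not the procedure of any listed tool: the `Variant` record, its fields, the `prioritize` function and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str      # hypothetical identifier
    consequence: str     # e.g. "missense", "synonymous"
    allele_freq: float   # population allele frequency (assumed field)
    score: float         # generic pathogenicity score in [0, 1] (assumed field)


def prioritize(variants, max_freq=0.01, min_score=0.5):
    """Keep rare, high-scoring variants and rank them by descending score.

    Both cut-offs are illustrative defaults, not recommended values.
    """
    kept = [v for v in variants
            if v.allele_freq <= max_freq and v.score >= min_score]
    return sorted(kept, key=lambda v: v.score, reverse=True)


# Toy input: invented variants, not real data.
variants = [
    Variant("v1", "missense", 0.0001, 0.92),
    Variant("v2", "missense", 0.2500, 0.95),    # common, filtered by frequency
    Variant("v3", "synonymous", 0.0005, 0.10),  # low score, filtered
    Variant("v4", "missense", 0.0020, 0.71),
]

ranked = prioritize(variants)
print([v.variant_id for v in ranked])  # ['v1', 'v4']
```

In practice the score column would come from one of the tools in Table 3, and suitable thresholds depend on the tool and the study design.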

Table 3

Genomic feature-driven tools for annotation and evaluation of clinical significance of variants

Application	Year of first deployment: tool name	Regular update	Based on
Functional annotation of genomic variants	2010: ANNOVAR [12]	Yes, annually since 2015	Functional annotation of genetic variants from high-throughput sequencing data
	2012: wANNOVAR [154]	Yes	Functional annotation of genetic variants from high-throughput sequencing data
	2012: KGGSeq [152, 153]	Yes, bugs fixed monthly	Three different levels: genetic level, variant-gene level and knowledge level
Prediction of functional impact of amino acid substitutions	2003: SIFT [45]	Last update in Aug 2011	Sequence homology based on PSI-BLAST
	2009: LRT [162]	Last update in Nov 2009	Sequence homology
	2010: PolyPhen2 [11]	Last update in 2016	Eight sequence-based and three structure-based predictive features
	2011: MutationAssessor [163]	Last update in Dec 2015	Sequence homology of protein families and subfamilies between species
	2012: PROVEAN [164]	Last update in Jan 2015	Sequence homology
	2013: FATHMM [165]	Last update in May 2015	Sequence homology
	2015: MetaSVM [166]	Last update in 2016	9 prediction scores and allele frequencies in 1000Genomes
	2017: IMHOTEP [167]	Unknown	9 popular predicted tools
Prioritization of rare missense variants	2013: VEST3 [155]	Yes, quarterly	86 sequence features
	2016: REVEL [156]	Last update in 2016	13 popular predicted tools
	2016: M-CAP [160]	Last update in 2016	Pathogenicity likelihood scores and direct measures of evolutionary conservation, the cross-species analog to frequency within the human population
Prioritization of noncoding variants	2014: GWAVA [47]	Last update in 2014	Various genomic and epigenomic annotations
Prioritization of noncoding variants	2015: DeepSEA [14]	Yes, annually	Regulatory sequence code
Prediction of functional consequences for both coding and non-coding variants	2010: MutationTaster [157]	Yes	Conservation, splice site, mRNA features, protein features and regulatory features
	2011: VAAST [46]	Last update in Sep 2016	Variant frequency data with AAS effect information on a feature-by-feature basis
	2014: CADD [49]	Last update in Apr 2018	63 annotations including 949 sequence features
	2015: DANN [158]	Last update in 2015	63 annotations including 949 sequence features, the same as CADD
	2015: FATHMM-MKL [159]	Last update in 2015	1281 sequence features
	2016: Eigen [13]	Last update in 2016	Functional, evolutionary conservation and regulatory annotations
Interpretation of clinical significance of variants	2017: InterVar [161]	Yes, last update in Jan. 2018	The 2015 ACMG-AMP guidelines
Interpretation of clinical significance of variants	2015: ClinLabGeneticist [169]	Last update in 2014	Extensive variant annotation data source and prioritization of variants

The tools are classified into different categories according to their uses.


Ontology-driven tools

The ontology databases in life science, such as Human Phenotype Ontology (HPO) [170-174], Mammalian Phenotype Ontology [175], Disease Ontology [176], Gene Ontology (GO) [177] and Experimental Factor Ontology (EFO) [178], provide standard terminologies and controlled vocabularies to describe and classify molecules, diseases, genotypic and phenotypic features, etc. The ontologies can be utilized to support computational tools that allow sophisticated search and analysis routines. For example, HPO offers standard terminologies for phenotypic features and diseases, to bridge the gap between genome biology and clinical medicine [179]. Several tools use phenotypic ontologies to enable deep interpretation of the analysis results of NGS data, including eXtasy [180], PhenIX [15], Exomiser [181], Phen-Gen [182], Phevor [48] and PhenogramViz [183] (Table 4). eXtasy, the earliest of these tools, ranks the damaging impacts of nonsynonymous single-nucleotide variants (nSNVs) by genomic data fusion. PhenIX evaluates and prioritizes the impacts of SNVs, splice sites and short indels in the exome sequencing data of Mendelian diseases based on the pathogenicity of variants and the semantic similarity of HPO-based phenotypes [15]. Compared to PhenIX, Phen-Gen implements an exome-centric approach to rank the impacts of coding mutations, and a genome-wide approach to prioritize the pathogenicity of non-coding variants (Table 4). Similar to Phen-Gen, the recently developed tool Exomiser [184] integrates a number of algorithms, including HiPHIVE [185], PHIVE [186], ExomeWalker [187] and OWLSim [188], to enable the clinical interpretation of variants in exome and genome sequencing data. Instead of postulating a set of fixed associations between genes, diseases and phenotypes, Phevor dynamically integrates knowledge from multiple biomedical ontologies into the variant-ranking process [48]. 
This enables Phevor to improve its accuracy not only of established gene-disease-phenotype associations but also of previously atypical and undescribed disease statements. PhenogramViz focuses on the interpretation of candidate CNVs and their pathogenicity prioritization from the data analyses of array comparative genome hybridization (aCGH) and NGS [183].
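Phenotype-driven ranking by semantic similarity, as described above, can be illustrated on a toy ontology. The sketch below uses Resnik similarity (the information content of the most informative common ancestor) as an illustrative stand-in; it is not claimed to be the exact measure implemented by any tool listed here. The term names, term frequencies and disease profiles are invented and are not real HPO identifiers or annotations.

```python
import math

# Toy ontology (hypothetical terms): term -> list of parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

# Assumed fraction of annotated diseases carrying each term or a descendant.
FREQ = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25, "B1": 0.25}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Rarer terms are more informative: IC = log2(1 / frequency)."""
    return math.log2(1.0 / FREQ[term])


def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def profile_similarity(patient_terms, disease_terms):
    """Average, over patient terms, of the best match among disease terms."""
    best = [max(resnik(p, d) for d in disease_terms) for p in patient_terms]
    return sum(best) / len(best)


patient = ["A1", "B1"]
diseases = {"disease_1": ["A2", "B1"], "disease_2": ["B"]}
ranking = sorted(diseases,
                 key=lambda name: profile_similarity(patient, diseases[name]),
                 reverse=True)
print(ranking)  # ['disease_1', 'disease_2']
```

The tools in Table 4 combine a phenotype score of this general kind with variant-level evidence such as frequency and predicted pathogenicity; how the components are weighted differs between tools.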

Table 4

Comparison of phenotype-driven tools for interpretation of clinical significance of variants

Year: tool	Availability	Operating system	Requirements	Algorithms implemented	Input data and parameters	Application scopes
2013: eXtasy [185]	Online & Standalone	Linux	Ruby; Tabix; Bedtools; R Statistical Framework with randomForest; RobustRankAggreg libraries	Random Forests; Phenomizer	VCF file; TSV for HPO term(s)	Mendelian and oligogenic disorders; nSNVs; Exome analysis; (Only 1 sample per run)
2014: Phen-Gen [187]	Online & Standalone	Linux (Ubuntu, CentOS, & RHEL)	Perl	Bayesian framework; Random walk–with–restart; Variant-predicted pathogenicity score; Phenomizer	VCF file; text file for HPO term(s); Pedigree(PED) file; Inheritance models; Type of prediction-genomic or coding; Discard de novo and Stringency	Rare disorders; nSNVs, splice-sites and short indels and non-coding variants; Genome and Exome analysis; (Only 1 family or 1 sample per run)
2014: PhenIX [15]	Online	-	-	Semantic similarity score; Variant-frequency score; Variant-predicted pathogenicity score	VCF file; HPO term(s); Inheritance modes; Frequency sources; Number of candidates to show	Mendelian diseases; SNVs, splice-sites and short indels; Exome analysis; (Only 1 sample per run)
2014: Phevor [54]	Online	-	-	Disease-gene association score; Variant-prioritization score	VAAST simple or Table for variants; Ontology Term(s); Ontologies to link to HPO	Rare disorders; SNVs; Exome analysis; (Only 1 sample per run)
2016: Exomiser [186]	Standalone	Linux; Mac OS X; Windows	∼4GB RAM for an exome analysis and ∼12GB RAM for a genome analysis; >3 GB free RAM (8 GB preferred); Java 8 or above	HiPHIVE; PHIVE; PhenIX; ExomeWalker; OWLSim; Logistic regression	YML file that includes VCF file name; HPO term(s); PED file name; inheritance modes, Probands; Frequency sources; Pathogenicity sources and other alternative parameters	Mendelian, oligogenic and multigenic disorders; SNVs, splice-sites, short indels and non-coding variants; Genome and Exome analysis; (Multiple samples or families per run)
2014: PhenogramViz [188]	Cytoscape app	Windows	Cytoscape Version 3.1.0. and above	Phenogram-score (PHS); NAG, OBE, OPA, HI score	Enter symptom(s) directly for symptoms or create file with HPO term(s); Lists of CNVs (include types, Chromosome, Start, End); Lists of genes	Mendelian disorders; CNVs; aCGH and exome analysis; (Only 1 sample per run)

The availabilities, the requirements and the use of these tools are detailed in the table.

In terms of performance in causal gene identification, previous research indicates that Phen-Gen gains a 13∼58% improvement in sensitivity over eXtasy, Phevor, PHIVE and the earlier version of Exomiser [182]. Bone et al. [181] suggest that Exomiser is slightly favorable compared to Phen-Gen in causal gene identification for autosomal dominant and autosomal recessive disorders as well as in the detection of novel variant-disease associations. Moreover, Exomiser can analyse multiple samples or families per run for both Mendelian and multigenic disorders, while Phen-Gen can only handle a single sample or family per run for Mendelian disorders (Table 4). eXtasy and Phen-Gen have both online and standalone versions. The standalone eXtasy depends on many libraries for bioinformatics, statistics and machine learning algorithms (Table 4). Exomiser has a standalone version only, while PhenIX and Phevor have online versions only. PhenogramViz can be downloaded, installed as an application in Cytoscape [189], and used through the Cytoscape interface. The standalone tools can be installed locally and run within hospital firewalls, thereby relieving concerns about the privacy and security of patient information. On the other hand, the online tools are more accessible and usable for the many biological researchers and clinicians who lack bioinformatics and computing skills. In terms of timing, the online eXtasy takes about 15 min to analyze a whole exome sample with ∼82 000 variants, while the online PhenIX takes about 100 s to complete the same analysis. Exomiser [184] takes about 10∼15 min to analyze an exome and genome sample or family, approximately 5–15 min faster than the online Phen-Gen (http://54.173.20.191). 
Moreover, Exomiser [184] produces HTML, tab-delimited and VCF format files that can be incorporated into many bioinformatic workflows. Taken together, the standalone versions of Phen-Gen and Exomiser are recommended to skilled bioinformaticians for the interpretation of SNVs, splice-sites, short indels and non-coding variants from exome and genome data. Exomiser is also suggested for the analysis of multiple samples or families. Phevor is recommended for prioritizing variant pathogenicity related to previously atypical and undescribed disease statements, and PhenogramViz for interpreting CNV pathogenicity.

Interactive platforms

To tackle the hurdles in utilising disease-related data resources, several web platforms have implemented a number of analysis software tools that allow users to search, analyze and visualize the resources through web interfaces and APIs (Table 5). Most of these platforms, such as DisGeNET [190], Open Targets [16], Monarch Initiative [52] and MalaCards [51], target human Mendelian and complex diseases, involving data on genotypes, phenotypes, genetically modified organism models, drug targets and chemical molecules.

Table 5

Summary of different biomedical data and analysis web platforms

Name	Scope and scale (Date of statistic)	Applications/Tools Available	Sources
DisGeNET [190]	GPAs (May 2017)^w: 429 036 associations between 17 381 genes and 15 093 human diseases; 72 870 associations between 46 589 SNPs and 6356 human diseases/phenotypes	Web interface, DisGeNET Cytoscape plugin, Disgenet2r R package, DisGeNET-RDF	UniProt, dbSNP, GDA, CTD, MGD, OMIM, ClinVar, RGD, GWAS Catalog, Orphanet, HPO, UMLS, MeSH, DO, ICD9-CM, HGNC, dbSNP, CTD in total 22 resources
Monarch Initiative [52]	Genetically modified model support GPA (Nov 2016)^p: 237 531 gene-phenotype associations in human; 1 489 573 variant-phenotype associations in human; 19 783 disease models	Web interface, Phenotypes Analyzer, PhenoGrid, Text annotator, Exomiser	ClinVar, CTD, GeneReviews, OMIM, HPO, Orphanet, GWAS Catalog, MGI, ZFIN, NCBI, UCSC, HGNC, MeSH, OMIM, ORDO, HPO, EFO, UMLS in total 53 resources
Open Targets Platform [16]	Genotype–phenotype-drug association (Apr 2018)^w: 2 336 807 associations between genes/variants/drugs and diseases/phenotypes/targets	Web interface, Phylogenetic tree and HEART, Application programming interface	GWAS Catalog, UniProt, Expression Atlas, ChEMBL, Reactome, PhenoDigm, UMLS, MeSH, GO, ECO, HPO, MP, OMIM, ICD9-CM in total 21 resources
MalaCards [51]	Genotype–phenotype-drug association (Nov 2016)^p: 10 198 genes associated with 13 619 disease entries; 966 338 associations between 8005 distinct diseases and 3017 distinct drugs	Web interface, Tgex, GeneAnalytics, VarElect GeneALaCart, PathCards	Clinvar, Cosmic, dbSNP, DGIdb, DrugBank, FDA, HGMD, OMIM, PharmGKB, ICD10, MeSH, MGI, UMLS, UniProt in total 68 resources
MARRVEL [191]	GPA (June 2017)^p: 12.3 million variants; 6.95 million genotype–phenotype relationships	Web interface, Mutalyzer Position Converter, OMIM API, DIOPT, GTEx	ExAC, gnomAD, IMPC, Monarch, ClinVar, Geno2MP, DGV, DECIPHER, DIOPT, Mutalyzer, SGD, PomBase, WormBase, FlyBase, ZFin, MGI and RGD in total 17 resources

Scope refers to the major focus of the web platform. Scale is the number of associations and items currently provided in the platform. Each platform has integrated multiple tools/applications. Sources refer to the original data resources that have been integrated in the platform. In the date of statistic, p indicates the Month-Year of statistic from journal publications; w refers to the Month-Year of statistic from official websites.

The distinctions between the platforms are reflected in their different focuses and applications. DisGeNET [190] is designed to collate GPAs and to offer tool applications for medical and biological research. It can be plugged into Cytoscape to visualise and explore gene-disease associations in bipartite networks [17] (Table 5). Open Targets and MalaCards not only integrate GPA information from OMIM, GWAS Catalog, ClinVar, UniProtKB and disease model databases, but also offer target-disease information related to approved drugs, clinical candidates, biological pathways and RNA expression (Table 5). Owing to their comprehensive knowledgebases, sophisticated web technologies and user experience designs, Open Targets and MalaCards have been considered effective platforms for medicine research. For instance, Open Targets provides two types of workflows for different purposes: the disease-centric workflow to identify targets (such as genes, variants, proteins and chemicals) associated with a specific disease, and the target-centric workflow to identify diseases associated with a specific target [16]. Moreover, Monarch Initiative semantically integrates genotype–phenotype resources from many species for exploring their relationships across species [52]. 
Based on its broad genotype–phenotype information, many tool applications have been developed on Monarch Initiative, including PhenoGrid for phenotype analysis [52], text annotators [52] for text annotation of genes, diseases and phenotypes, and Exomiser [181] for inferring causative variants (Table 5). MARRVEL [191] is another publicly available platform integrating multiple model organism resources for rare variant exploration. It improves the accessibility of data collections and facilitates the analysis of human genes and variants by aggregating about 18 million public data records (Table 5). Altogether, these platforms have not only facilitated research in the life sciences, but also greatly supported the development of precision clinical medicine. They can be used for the investigation of the causes of specific human diseases and their comorbidities, the discovery of therapeutic action and adverse effects, the validation of computationally predicted phenotypes and genotypes and the evaluation of the performance of text-mining methods.
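The disease-centric and target-centric workflows described for Open Targets amount to querying the same association records from two directions. The sketch below shows that idea with two in-memory indexes; it does not use or reproduce the Open Targets API, and the record layout, the names `GENE_A` or `disease_X` and the scores are placeholders invented for illustration.

```python
from collections import defaultdict

# Hypothetical (target, disease, score) records; names and scores are invented.
ASSOCIATIONS = [
    ("GENE_A", "disease_X", 0.9),
    ("GENE_B", "disease_X", 0.4),
    ("GENE_A", "disease_Y", 0.7),
]


def build_indexes(associations):
    """Index the same records by disease and by target, best score first."""
    by_disease = defaultdict(list)
    by_target = defaultdict(list)
    for target, disease, score in associations:
        by_disease[disease].append((target, score))
        by_target[target].append((disease, score))
    for index in (by_disease, by_target):
        for hits in index.values():
            hits.sort(key=lambda hit: hit[1], reverse=True)
    return by_disease, by_target


by_disease, by_target = build_indexes(ASSOCIATIONS)

# Disease-centric question: which targets are associated with disease_X?
print(by_disease["disease_X"])  # [('GENE_A', 0.9), ('GENE_B', 0.4)]

# Target-centric question: which diseases are associated with GENE_A?
print(by_target["GENE_A"])      # [('disease_X', 0.9), ('disease_Y', 0.7)]
```

The same pair of indexes also describes a bipartite gene-disease network of the kind DisGeNET visualises in Cytoscape, with each record forming one edge.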

Discussion

The computational resources have facilitated deeper understanding of disease mechanisms, easier assessment of disease risks and more accurate diagnoses, and have also helped to guide clinical therapies as well as to evaluate prognosis. However, challenges remain in many aspects, such as building complex networks of associations, database design for bigger data, data analysis with more effective tools and platforms, data interpretation in consistent and standard manners, result representation with user-friendly interfaces and so on. Phenotype plays a central role in connection with other disease-related factors in the current network (Figure 2). The focus of software and database development is shifting from the connection between genotypes and phenotypes to the association among multiple factors. As wider collaborations have been made to establish interoperable systems across international projects, much bigger data are being generated from the complete genomes of whole populations. Difficulties exist in connecting these much more complex and multi-dimensional data. Moreover, additional data types including multi-omics results, extensive environmental contexts and life styles of patients need to be integrated and associated in the current network. More effective algorithms and software tools are therefore greatly needed to take more related factors, additional data types and larger data sizes into account.

Figure 2

Framework of a comprehensive web platform. A comprehensive web platform should integrate various disease-related information including genotypes, phenotypes, environmental factors, life styles and so on. The available information in the platform should be homogeneously annotated by controlled vocabularies and community-driven ontologies, such as GenBank, dbSNP and miRBase for genotypes, HPO and DO for phenotypes, EFO and ChEBI for environmental factors and life styles, DrugBank and PubChem for drugs. Moreover, the platform should have solid scoring models to prioritize associations between different factors, such as genotype-phenotype associations (GPAs), environmental factor-phenotype associations (EFPAs), genotype-environmental factor-phenotype associations (GEFPAs), phenotype-treatment associations (PTAs), genotype-treatment associations (GTAs) and genotype-phenotype-treatment associations (GPTAs).

Although the approaches of deep phenotyping are helpful for clinical diagnosis in Mendelian disorders and rare diseases, patients with similar features or at the same stage of illness often have various clinical outcomes in cancer and many complex diseases [2]. The existing spectrum of phenotype states is not optimally captured by current phenotypic ontology systems. Therefore, substantial efforts are required to better integrate the ontologies and enable the full interpretation of clinical outcomes of genetic mutations that may lead to the precision management of diseases. Currently, there are abundant biomedical resources that cover disease information involving genotypes, phenotypes, environmental exposures and their associations. However, most of the popular resources only represent a fraction of the available information. Therefore, more comprehensive platforms are needed to integrate other ever-growing biomedical information, such as noncoding genetic factors, multi-omics and extensive environmental contexts and life styles (Figure 2). 
In addition, these platforms should integrate the clinical and environmental contexts and life styles of patients to enable reliable and useful diagnoses and discoveries, and also make data fully accessible and easily interpreted through highly graphical representation. Moreover, the available information in the majority of databases is represented and annotated by heterogeneous vocabularies (Supplementary S-Table 2). Thus, better platforms are needed to comprehensively integrate the available information with controlled vocabularies and community-driven ontologies and present analysis results in a consistent and standard manner (Figure 2). Recently, MNDR has been updated to offer a confidence score for each ncRNA-disease association based on a simple classification of supporting evidence [62]. However, to better support translational research and precision medicine, there is a great need to develop solid scoring models or to refine current models based on experimental evidence to assist the prioritization of associations, such as GPAs, EF-phenotype associations, genotype-EF-phenotype associations, phenotype-treatment associations, genotype-treatment associations and genotype–phenotype-treatment associations (Figure 2). In this review, we detail the human disease-related computational resources, including databases, software tools and online platforms. These resources are classified by disparate data types with a focus on associations among genotypes, phenotypes, EFs, organism models, drugs and chemical molecules. We also outline some of the resulting needs and requirements that should be regarded as imperative for the development of databases, tools and platforms (Figure 2). From the view of precision medicine, better services of computational resources and more training on these services will accelerate medical research and improve clinical diagnoses and treatments. 
Life scientists, bioinformaticians and clinicians are suggested to cooperate to develop more comprehensive databases, more accurate software tools and more practical platform systems to facilitate the goals of precision medicine, enabling reliable and useful diagnoses and discoveries.
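As one illustration of the evidence-based scoring called for above, the sketch below combines independent pieces of supporting evidence with a noisy-OR rule, so that each extra piece of evidence raises confidence but never beyond 1. This is an assumed scheme chosen for simplicity; it is not the model used by MNDR or by any resource reviewed here, and the evidence classes and weights are invented.

```python
# Assumed reliability weight per evidence class (illustrative values only).
EVIDENCE_WEIGHTS = {"experimental": 0.8, "curated": 0.6, "text_mined": 0.3}


def confidence(evidence_types):
    """Noisy-OR combination: 1 minus the product of (1 - weight).

    Returns 0.0 when no evidence supports the association.
    """
    remaining_doubt = 1.0
    for kind in evidence_types:
        remaining_doubt *= 1.0 - EVIDENCE_WEIGHTS[kind]
    return 1.0 - remaining_doubt


# Hypothetical associations with their supporting evidence classes.
associations = {
    ("GENE_A", "disease_X"): ["experimental", "text_mined"],
    ("GENE_B", "disease_X"): ["text_mined"],
    ("GENE_C", "disease_Y"): [],
}

scores = {pair: confidence(kinds) for pair, kinds in associations.items()}
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 2))
```

A production model would need weights calibrated against experimentally validated associations, which is the refinement step the discussion identifies as still missing.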

Key Points

The present study is a comprehensive review of available computational resources of human diseases, including databases, software tools and interactive platforms, to assist in the appropriate selection and use of relevant resources.

Bioinformaticians have developed more than 100 computational resources to integrate omics data and discover associations among genotypes, phenotypes, environmental exposures, drugs and chemical molecules.

According to scopes and data associations, the databases can be categorized into seven groups, including coding genes, noncoding RNAs, genomic variations, population genomic data, genetic organism models, environmental exposures and treatments.

Most of the available public software tools used to bridge the gaps between biology, medicine and clinic are driven by either genomic features or ontologies.

Table 1

(continued)

Data resource name	Phenotype/Disease			Genotype			Environmental factors	Drugs/chemicals	Association
Data resource name	Mendelian and Rare	Complex and Trait	Organism model of disorder	Coding	Non-coding	Function annotation of variant	Environmental factors	Drugs/chemicals	Types	Score
Treatments (drugs and their targets)
ChEMBL				√(Target)		√(F)		√	PDTAs
DrugBank	√	√		√(Target)		√(F)		√	PDTAs
DrugCentral	√	√		√(Target)				√	PDTAs
TTD	√	√		√(Target)		√		√	PDTAs
PharmGKB	√	√		√(Target)		√		√	GDAs	√
DGIdb				√(Target)				√	DTAs	√
CancerPPD		√(Cancer)		√(Target)				√(Peptides)	DTAs

According to scopes and data associations, the databases can be categorised into major groups, but some of them could be included in multiple groups. The symbol ‘√’ indicates the relevant information provided in each database. The following are the name abbreviations: NSDs: nervous system diseases; M: majority; F: few; lnc: lncRNA; mi: miRNA; circ: circRNA; ncR: ncRNAs, including lncRNA, miRNA, piRNA, siRNA and snoRNA etc.; GPAs: genotype–phenotype associations; GDAs: genotype-drug associations; PDAs: phenotype-drug associations; GPEFAs: genotype–phenotype-EF associations; GEFAs: genotype-environmental factor associations; PDTAs: phenotype-drug-target associations; DTAs: drug-target associations.

Table 2

(continued)

Database	Scope and scale	Date of statistic
ExposomeExplorer [44]	8034 concentrations correspond to dietary biomarkers (488) for 50 foods and 78 food compounds	Oct 2016^p
CEBS [90]	Over 11 000 exposure agents and over 8000 exposure studies	Nov 2016^p
SM2miR [91]	5161 associations between 1681 miRNAs and 255 small molecules	Apr 2015^p
miREnvironment [92]	3857 associations between 1242 miRNAs, EFs and 305 phenotypes	Sep 2012^w
DLREFD [93]	835 associations between 475 lncRNAs, 153 EFs and 124 phenotypes	Oct 2016^p
Drug/chemical exposures
ChEMBL [94]	Over 1.6 million distinct compound structures and 14 million activity values from over 1.2 million assays; ∼11 000 drug targets including 9052 proteins	Nov 2016^p
DrugBank 4.0 [95]	2037 FDA-approved small molecule drugs and 241 FDA-approved biotech (protein/peptide) drugs; over 6000 experimental drugs and over 201 SNP-associated drug effects, and 4661 drug targets	Nov 2013^p
DrugCentral [96]	2021 FDA drugs, 2423 drugs approved outside US, 3799 small molecules, 239 peptides, 294 other drugs; 10 427 human protein targets including 837 drug efficacy targets	Oct 2016^p
TTD [97]	2071 approved drugs, 7291 clinical trial drugs, 357 preclinical drugs, 17 803 experimental drugs; 397 successful targets, 723 clinical trial targets, 1469 research targets	Nov 2015^p
PharmGKB [98]	20 017 associations between SNPs and drugs, and 65 important pharmacogenes	Jun 2018^w
DGIdb [99]	40 017 clinically relevant associations mined between 2644 genes and 11 215 drugs	Nov 2015^p
CancerPPD [100]	3491 experimentally verified anticancer peptides and 121 proteins spanning 21 tissues	Sep 2014^p

Scope refers to the major focus of each database. The number of associations or items currently provided in the database is given. In the ‘Date of statistic’ column, the superscript p indicates that the month and year of the statistic are taken from a journal publication; w indicates that they are taken from the official website.

