Yu Wang, Fang-Yuan Shi, Yu Liang, Ge Gao.
Abstract
More than 90% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database of over 11.8 million experimentally tested noncoding variants with expression-modulating potential. We provide 2424 functional annotations that can be used to pinpoint the plausible regulatory mechanisms of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality, experimentally tested expression-modulating variants with extensive functional annotations, which will be useful to the noncoding variant community. REVA is freely available at http://reva.gao-lab.org.
Keywords: Benchmark; Database; Expression-modulating variant; Massively parallel reporter assay; Noncoding variant
Year: 2021 PMID: 34224878 PMCID: PMC9040024 DOI: 10.1016/j.gpb.2021.06.001
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409
Properties of involved computational tools
| Tool | Type | Features | Prediction target | Website |
| --- | --- | --- | --- | --- |
| FunSeq2 | Knowledge-based | Evolutionary parameters; ENCODE summaries; PWMs; likely target genes; biological networks; recurrent elements across cancer samples | Cancer driver mutations | http://funseq2.gersteinlab.org/ |
| CADD | Supervised learning | Evolutionary parameters; ENCODE summaries; population frequencies; transcript information; protein-level scores | Functional variants | https://cadd.gs.washington.edu/ |
| GWAVA | Supervised learning | Evolutionary parameters; ENCODE summaries; population frequencies | Disease-related variants | https://www.sanger.ac.uk/science/tools/gwava |
| Eigen | Unsupervised learning | Evolutionary parameters; ENCODE summaries; population frequencies | Functional variants | http://www.columbia.edu/~ii2135/eigen.html |
| DeepSEA | Supervised learning (DL) | Local sequences; evolutionary parameters | Functional variants | http://deepsea.princeton.edu/ |
| EnsembleExpr | Ensemble-based | Features used by DeepSEA, DeepBind, KSM, and ChromHMM | Expression-modulating variants | http://ensembleexpr.csail.mit.edu/ |
| ExPecto | Supervised learning (DL) | Local sequences | Expression-modulating variants | https://hb.flatironinstitute.org/expecto/ |
Note: ENCODE, Encyclopedia of DNA Elements; PWM, position weight matrix; DL, deep learning; KSM, k-mer set memory.
Figure 1. Overview of the structure of REVA. Manually curated noncoding variant data, together with supplementary information, were stored in the database at two levels: accession and variant data. The accession level contained information about the publication, and the variant data level contained all related information about the variant. A web interface was built for users to access the data in the database. TF, transcription factor; SNP, single nucleotide polymorphism.
Variant information extracted during the data collection process
| Field | Description |
| --- | --- |
| Genome location | Genome location of the variant in both GRCh37 and GRCh38. Strand information was also included |
| Reference SNP ID | Reference SNP ID of the variant |
| Reference allele | Reference allele of the variant |
| Alternative allele | Alternative allele of the variant |
| Raw P value | Raw P value |
| Adjusted P value | If the publication did not provide adjusted P value … |
| Cutoff | The cutoff for the adjusted P value |
| Label | Given based on the cutoff for the adjusted P value |
| Effect size | Effect size provided by the publication |
| Fragment effect | The effect of the fragment carrying the variant, given based on the effect size: activation, repression, or no effect |
| Experimental cell line | The cell line used to conduct the experiment |
| Genomic region | The genomic region in which the variant was located, such as the particular gene and intron |
| TF | TF related to the variant |
| TF effect | The effect of the aforementioned TF: activation or repression |
Note: SNP, single nucleotide polymorphism; TF, transcription factor.
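As an illustration of how the fields in this table fit together, the following is a minimal Python sketch of a single variant record. The class name, attribute names, and example values are hypothetical and do not reflect REVA's actual schema. The labeling rule (positive when the adjusted P value falls below the cutoff) is an assumption made here; the table only states that the label is given based on the cutoff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    """Hypothetical container mirroring some of the fields listed in the table."""
    chrom: str
    pos_grch37: int
    pos_grch38: int
    strand: str
    rsid: Optional[str]
    ref: str
    alt: str
    raw_p: Optional[float]
    adjusted_p: Optional[float]
    cutoff: float
    effect_size: Optional[float]
    cell_line: str

    def label(self) -> Optional[str]:
        # Assumption: a variant is "positive" when its adjusted P value
        # is below the cutoff, and "negative" otherwise.
        if self.adjusted_p is None:
            return None
        return "positive" if self.adjusted_p < self.cutoff else "negative"


# Made-up example values, for illustration only.
record = VariantRecord(
    chrom="chr1", pos_grch37=1000, pos_grch38=1100, strand="+",
    rsid=None, ref="A", alt="G",
    raw_p=0.001, adjusted_p=0.01, cutoff=0.05,
    effect_size=0.8, cell_line="K562",
)
print(record.label())
```

Keeping the cutoff on each record, rather than as a global constant, matches the table's separate "Cutoff" field.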
Figure 2. Annotation of the variants in REVA. A. Distribution of positive and negative variants in the human genome. B. Density distribution of positive and negative variants on chromosomes. A two-sided Fisher's exact test with Benjamini and Hochberg correction [42] was used in the analysis of the chromosome distribution of variants. The cutoff for the adjusted P value was set to 0.05. The density distribution plot was constructed with the karyoploteR package [43] in R. No variants were located on the Y chromosome.
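The Benjamini and Hochberg correction mentioned in this caption can be sketched in a few lines of plain Python. This is a generic illustration of the procedure, not the authors' code; the function name and the example P values are made up, and the Fisher's exact test that would produce the raw P values is not shown.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values, sorted from smallest to largest P value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw P values, e.g. one per chromosome.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
# Apply the 0.05 cutoff stated in the caption.
significant = [p < 0.05 for p in adj]
print(adj, significant)
```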
Figure 3. Performance of involved tools on the benchmarking dataset. A. Performance comparison of involved tools. Bubbles are colored by F1 scores. The tools are ordered by F1 scores. B. The ROC curves for involved tools. C. Performance comparison of involved tools except for EnsembleExpr on variants that were also included in the GWAS Catalog. D. Performance comparison of involved tools except for EnsembleExpr on variants with different phastCons100way scores. E. Performance comparison of involved tools except for EnsembleExpr on variants from different cell lines. “All” represents the F1 score shown in (A). F. Performance comparison of involved tools except for EnsembleExpr on variants that were also included in HGMD. “All” represents the F1 score shown in (A). “All HGMD” represents the F1 score on all variants that were also included in HGMD. “DM?”, “DP”, “FP”, and “DFP” refer to the classes of related variants documented in HGMD. AUROC, area under the receiver operating characteristic curve; N.A., not available.
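For readers unfamiliar with the metrics in this figure, the sketch below shows how sensitivity, precision, and the F1 score are computed from binary labels and binary predictions. It is a generic illustration with made-up data and hypothetical function names, not the benchmarking code used in the study; how each tool's score is thresholded into a binary call is not specified here and is left out.

```python
from typing import Dict, Sequence


def binary_metrics(truth: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity, precision, and F1 for 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    # Guard against empty denominators.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


# Made-up example: 1 = positive (expression-modulating), 0 = negative.
truth = [1, 1, 1, 0, 0]
predicted = [1, 0, 0, 0, 1]
metrics = binary_metrics(truth, predicted)
print(metrics)
```

In this toy example the tool recovers only one of three positive variants, so sensitivity is low even though precision is moderate, which is the kind of pattern the abstract describes as limited sensitivity.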
Figure 4. Illustration of the web interface of REVA. A. Chromatin profile feature plot in the “Annotation” module of the variant detail page. Chromatin features are presented by category. Users can hover the mouse over an outlier or a box to show more information. To the right of the boxplot is a table showing detailed information. Users can click the boxplot to show the corresponding category. B. Chromatin profile feature heatmap in the “Annotation” module of the variant detail page. The heatmap is presented by cell line, and each row in the heatmap corresponds to one category. Users can click the “Cell line / Tissue” list to the right of the heatmap to render annotations in the target cell line / tissue, and hover the mouse over a block in the heatmap to show feature information. Both (A) and (B) were retrieved from . IQR, interquartile range.