Hu Zhao, Zhuo Tu, Yinmeng Liu, Zhanxiang Zong, Jiacheng Li, Hao Liu, Feng Xiong, Jinling Zhan, Xuehai Hu, Weibo Xie.
Abstract
Characterizing the regulatory effects of genomic variants in plants remains a challenge. Although several tools based on deep-learning models and large-scale chromatin-profiling data are available to predict regulatory elements and variant effects, no dedicated tools or web services have been reported for plants. Here, we present PlantDeepSEA, a deep learning-based web service that predicts the regulatory effects of genomic variants in multiple tissues of six plant species (including four crops). PlantDeepSEA provides two main functions. The first, Variant Effector, predicts the effects of sequence variants on chromatin accessibility. The second, Sequence Profiler, performs 'in silico saturated mutagenesis' analysis to discover high-impact sites (e.g., cis-regulatory elements) within a sequence. When validated on independent test sets, the area under the receiver operating characteristic curve of the deep learning models in PlantDeepSEA ranges from 0.93 to 0.99. We demonstrate the usability of the web service with two examples. PlantDeepSEA could help to prioritize causal regulatory variants and might improve our understanding of their mechanisms of action in different plant tissues. PlantDeepSEA is available at http://plantdeepsea.ncpgr.cn/.
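The idea behind Variant Effector can be illustrated with a minimal sketch: score the reference sequence and the variant-carrying sequence with the same model, then take the difference in predicted open-chromatin probability. Everything named here is an assumption for illustration only: `toy_predict` is a hypothetical stand-in for a trained model (it simply checks for a motif string), and `apply_snv` and `variant_effect` are not part of PlantDeepSEA.

```python
from typing import Callable, List, Tuple


def apply_snv(sequence: str, offset: int, ref: str, alt: str) -> str:
    """Return `sequence` with the base at 0-based `offset` changed from ref to alt."""
    if sequence[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {sequence[offset]!r}")
    return sequence[:offset] + alt + sequence[offset + 1:]


def variant_effect(sequence: str, offset: int, ref: str, alt: str,
                   predict: Callable[[str], float]) -> float:
    """Predicted probability for the alternative allele minus that for the reference."""
    return predict(apply_snv(sequence, offset, ref, alt)) - predict(sequence)


def toy_predict(sequence: str) -> float:
    """Hypothetical model: high probability only when an illustrative motif is present."""
    return 0.9 if "TGGCCC" in sequence else 0.2


window = "AATGGCCCTA"

# A single variant that disrupts the motif lowers the predicted accessibility.
effect = variant_effect(window, 4, "G", "A", toy_predict)

# Prioritize several variants by the magnitude of their predicted effect.
variants: List[Tuple[int, str, str]] = [(0, "A", "C"), (4, "G", "A"), (9, "A", "T")]
ranked = sorted(
    variants,
    key=lambda v: abs(variant_effect(window, *v, toy_predict)),
    reverse=True,
)
print(effect, ranked[0])
```

In a real setting the scoring function would be a trained network evaluated on a much longer window around each variant; only the reference-versus-alternative comparison carries over from this sketch.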
Year: 2021 PMID: 34037796 PMCID: PMC8262748 DOI: 10.1093/nar/gkab383
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of PlantDeepSEA. (A) Workflow of PlantDeepSEA. First, we collected high-quality chromatin accessibility data from multiple representative tissues of six plant species. Second, we obtained credible open chromatin regions (OCRs) for each species through sequence alignment, quality control (QC), and OCR identification steps. Third, we implemented a high-performance deep learning model, DeepSEA (8), using the Selene SDK (13), and trained it on the chromatin accessibility data. Fourth, we built PlantDeepSEA (http://plantdeepsea.ncpgr.cn) with tools such as Django and Bokeh. PlantDeepSEA can be used to identify high-impact sites or prioritize causal variants. (B) Boxplot of the area under the curve (AUC) for each deep neural network model. Each point represents the AUC of one sample. (C) The two main functions of PlantDeepSEA. We designed two tools, ‘Variant Effector’ and ‘Sequence Profiler’; their accepted inputs and outputs are listed in the panel.
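For readers unfamiliar with the AUC reported in panel (B), the following sketch computes the area under the ROC curve directly from its pairwise definition: the fraction of (open, closed) region pairs in which the open region receives the higher predicted score, with ties counted as half. The labels and scores are made-up illustrative values, not data from the paper, and `roc_auc` is a hypothetical helper rather than the evaluation code used by the authors.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 = region is an OCR, 0 = region is not.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of examples, so it is suitable only for small illustrations; rank-based implementations give the same value more efficiently.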
Summary statistics of ATAC-seq data used in PlantDeepSEA
| Species | Tissue number | Sample number | Total Q30 read number (a) | Mean TSS enrichment (b) | Mean OCR number (c) |
|---|---|---|---|---|---|
| | 6 | 14 | 458 734 749 | 12.1 | 25 947 |
| | 5 | 9 | 187 359 453 | 11.5 | 44 370 |
| | 6 | 15 | 625 034 398 | 12.0 | 75 670 |
| | 6 | 15 | 521 213 434 | 13.9 | 72 567 |
| | 5 | 9 | 624 666 196 | 7.2 | 72 230 |
| | 7 | 14 | 818 482 967 | 9.9 | 82 166 |
| | 8 | 19 | 856 301 588 | 11.0 | 74 257 |

(a) The total number of reads per sample aligned to the reference genome (mapping quality >30).
(b) Mean TSS enrichment score for each sample.
(c) Mean of the number of OCRs identified by MACS2 in each sample.
Figure 2. Two case studies. (A) Prioritization of causal variants in the DEP1 promoter region. We prepared a VCF file of nine variants in the DEP1 promoter region and used the ‘Variant Effector’ tool to prioritize them. The results showed that vg0916410299 ranked as the most likely causal variant among those provided. (B) Analysis of high-impact sites around the SNP vg0916410299. We ran the ‘Sequence Profiler’ tool with the chromosome and position of vg0916410299 as input. The in silico saturated mutagenesis map showed that the sequence TGGCCC (overlapping vg0916410299) might be a cis-regulatory element. (C) Analysis of high-impact sites for different haplotypes of the QTL UPA2 using the ‘Sequence Profiler’ tool. The in silico saturated mutagenesis maps of the CIMMYT 8759 haplotype (upper) and the W22 haplotype (lower) showed that the sequence AGTGTG might be a cis-regulatory element, which is consistent with the results of Tian et al. (39). The loss score at a site is the maximum decrease, over all mutations at that site, in the predicted probability that the allele belongs to open chromatin relative to the reference nucleotide; the gain score is the corresponding maximum increase.
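The loss and gain scores described in the Figure 2 caption can be sketched as follows: each position is mutated to the three alternative nucleotides, the change in predicted probability relative to the reference sequence is recorded, and the largest decrease and largest increase are kept. This is a minimal sketch under stated assumptions: `toy_predict` is a hypothetical stand-in for a trained model, the sequences are invented, and clipping the scores at zero when no substitution moves the prediction in that direction is an assumption not stated in the source.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"


def saturated_mutagenesis(sequence: str,
                          predict: Callable[[str], float]) -> List[Tuple[float, float]]:
    """Return a (loss, gain) pair for every position of `sequence`."""
    p_ref = predict(sequence)
    scores = []
    for i, ref_base in enumerate(sequence):
        # Change in predicted probability for each of the three substitutions.
        deltas = [
            predict(sequence[:i] + alt + sequence[i + 1:]) - p_ref
            for alt in BASES
            if alt != ref_base
        ]
        loss = max(0.0, -min(deltas))  # largest decrease (assumed clipped at 0)
        gain = max(0.0, max(deltas))   # largest increase (assumed clipped at 0)
        scores.append((loss, gain))
    return scores


def toy_predict(sequence: str) -> float:
    """Hypothetical model: high probability only when an illustrative motif is present."""
    return 0.9 if "AGTGTG" in sequence else 0.2


# Motif intact: every position inside the motif shows a high loss score.
intact = saturated_mutagenesis("TTAGTGTGTT", toy_predict)

# Motif one substitution away: the mismatched position shows a high gain score.
broken = saturated_mutagenesis("TTAGTGTCTT", toy_predict)

for position, (loss, gain) in enumerate(intact):
    print(position, round(loss, 2), round(gain, 2))
```

Plotting the per-position loss and gain values along the sequence gives a map in which runs of high loss scores mark candidate high-impact sites.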