Guan-Dong Shang1,2, Zhou-Geng Xu1,2, Mu-Chun Wan1,3, Fu-Xiang Wang1,2, Jia-Wei Wang4,5,6.
Abstract
BACKGROUND: Transcription factors (TFs) play central roles in regulating gene expression. With the rapid growth in the use of high-throughput sequencing methods, there is a need to develop a comprehensive data processing and analysis framework for inferring influential TFs based on ChIP-seq/ATAC-seq datasets.
Keywords: ATAC-seq; ChIP-seq; Chromatin accessibility; Gene regulation; R package; Transcription factor
Year: 2022 PMID: 35392802 PMCID: PMC8988339 DOI: 10.1186/s12864-022-08506-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The components of FindIT2. A sketch of the FindIT2 components is shown. FindIT2 provides a complete framework for annotating ChIP-seq/ATAC-seq peaks, identifying TF targets by combining ChIP-seq and RNA-seq datasets, and inferring influential TFs from different types of input data. The mmAnno module accepts BED or BED-like files, such as narrowPeak or broadPeak, that contain the coordinates of regions of interest. mmAnno builds peak-gene links to annotate peaks according to the genomic coordinates of features. The peakGeneCor module uses the genomic coordinates and a count matrix to calculate correlations between features, which yields more robust peak-gene links. The calcRP module accepts the peak count matrix and the GRanges object produced by the mm_geneScan function of the mmAnno module to calculate regulatory potential (RP). Alternatively, it accepts bigWig files or TF ChIP-seq peaks to calculate RP. The data frame of RP values calculated by calcRP_TFHit in the calcRP module can be integrated with differential gene expression to rank TF targets using the integrate_ChIP_RNA function in the find_influential_Target module. The find_influential_TF module provides several methods for inferring influential TFs, depending on the analysis purpose and annotation. For example, findIT_regionRP accepts the GRanges object from calcRP_region to infer influential TFs of a gene set of interest, and findIT_enrichFisher accepts a public TF ChIP-seq database to find influential TFs of a peak set of interest
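The caption names regulatory potential (RP) but this excerpt does not give its formula. As a rough illustration only, the sketch below assumes a commonly used distance-decay form in which each peak near a gene contributes a weight that halves every fixed number of base pairs from the transcription start site. The function name `regulatory_potential`, the parameter `half_decay_bp`, and its default of 1000 bp are hypothetical and are not taken from FindIT2.

```python
def regulatory_potential(peak_distances_bp, half_decay_bp=1000.0):
    """Sum distance-decayed weights for the peaks linked to one gene.

    ASSUMPTION: each peak contributes 2 ** (-d / half_decay_bp), where d is
    the absolute peak-to-TSS distance in bp. The excerpt does not confirm
    that FindIT2 uses exactly this weighting.
    """
    return sum(2.0 ** (-abs(d) / half_decay_bp) for d in peak_distances_bp)


# Three hypothetical peaks at 0 bp, 1 kb and 2 kb from the TSS.
rp = regulatory_potential([0, 1000, 2000])
print(rp)  # 1.75  (1 + 0.5 + 0.25)
```

Under this assumed weighting, a gene with many peaks close to its TSS scores higher than a gene with a few distant peaks, which matches the intuition behind a summed RP column.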
Major FindIT2 functions
| Function | Description |
|---|---|
| loadPeakFile | read peak file and transform it into GRanges object |
| mm_nearestGene | annotate peaks using nearest gene mode |
| mm_geneScan | annotate peaks using gene scan mode |
| mm_geneBound | search related peaks of interesting genes |
| plot_annoDistance | plot the distance distribution |
| peakGeneCor | calculate correlation between gene and peak |
| enhancerPromoterCor | calculate correlation between enhancer and promoter |
| getAssocPairNumber | get associated peak number of gene and vice versa |
| plot_peakGeneAlias_summary | plot the distribution of associated feature number |
| plot_peakGeneCor | plot correlation between two features |
| shinyParse_peakGeneCor | explore feature relationship interactively |
| calcRP_coverage | calculate RP using bigWig files |
| calcRP_region | calculate RP based on mm_geneScan and peak count matrix |
| calcRP_TFHit | calculate RP based on ChIP-Seq peak data |
| integrate_ChIP_RNA | integrate ChIP-Seq and RNA-Seq data to find TF target genes |
| findIT_TTPair | find influential TF of input genes based on public TF-target data |
| findIT_TFHit | find influential TF of input genes based on public ChIP-seq or motif scan |
| findIT_enrichFisher | find influential TF of input peaks based on public ChIP-seq or motif scan |
| findIT_enrichWilcox | find influential TF of input peaks based on public ChIP-seq or motif scan |
| findIT_regionRP | find influential TF of input genes based on RP and public ChIP-seq or motif scan. |
| findIT_MARA | infer TF activity based on motif scan and peak count matrix |
| jaccard_findIT_enrichFisher | calculate jaccard index based on findIT_enrichFisher |
| jaccard_findIT_TTpair | calculate jaccard index based on findIT_TTPair |
| integrate_replicates | integrate value from replicates |
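Several functions in the table rank TFs by enrichment in a peak set. To make the idea behind a Fisher-type enrichment concrete, here is a minimal sketch of a one-sided Fisher's exact test written with the Python standard library. It assumes a 2x2 layout of input versus background peaks, each either overlapping a TF's binding sites or not. The function name and argument names are hypothetical, and this is not the implementation used by findIT_enrichFisher.

```python
from math import comb


def fisher_enrichment_pvalue(hit_in_input, input_total,
                             hit_in_background, background_total):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= hit_in_input) when X follows the hypergeometric
    distribution implied by the 2x2 table margins.
    ASSUMPTION: input and background peak sets are disjoint.
    """
    total = input_total + background_total
    hits = hit_in_input + hit_in_background
    denominator = comb(total, input_total)
    upper = min(hits, input_total)
    numerator = sum(
        comb(hits, k) * comb(total - hits, input_total - k)
        for k in range(hit_in_input, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts: 8 of 10 input peaks overlap the TF,
# versus 2 of 10 background peaks.
p = fisher_enrichment_pvalue(8, 10, 2, 10)
print(round(p, 6))  # 0.011507
```

Repeating such a test for every TF in a binding-site collection and sorting by p-value gives one simple way to rank candidate TFs for a peak set.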
Fig. 2 The functional test of the mmAnno module. A Distribution of the number of peaks linked to a gene inferred by mm_nearestGene. The result was plotted by plot_peakGeneAlias_summary. B Distribution of the number of peaks linked to a gene inferred by enhancerPromoterCor. The result was plotted by plot_peakGeneAlias_summary. The original result is shown on the left. The filtered result is given on the right. Threshold: p-value < 0.01 and cor > 0.8. C Dot plot of the distal enhancer and promoter accessibility of peak-to-gene links located within 20 kb of AT1G80840. This plot was generated by plot_peakGeneCor. D The ATAC-seq track of AT1G80840. The genomic region is shown and the selected gene is highlighted in black. The locations of the ATAC-seq peaks are indicated by purple rectangles. The related distal enhancers and promoter are shaded, and the promoter is marked by an asterisk
Fig. 3 Analysis result of the AT3G14440 gene locus and the enhancerPromoterCor function. A The ATAC-seq track of AT3G14440. B Distribution of the number of genes linked per peak. This result was produced by enhancerPromoterCor. The left panel shows the original result, while the right panel shows the result filtered according to the threshold: p-value < 0.01 and cor > 0.8. This plot was generated by plot_peakGeneAlias_summary
The top 10 target genes of LEC2
| gene_id | withPeakN | sumRP | RP_rank | log2FoldChange | padj | diff_rank | rankProduct | rankOf_rankProduct | gene_category | gene symbol |
|---|---|---|---|---|---|---|---|---|---|---|
| AT2G30470 | 7 | 7.026484 | 2 | 3.699854 | 5.27E-64 | 2 | 4 | 1 | up | HSI2 |
| AT5G08460 | 6 | 7.046663 | 1 | 4.554615 | 3.60E-51 | 8 | 8 | 2 | up | NA |
| AT1G11170 | 2 | 3.464122 | 54 | 3.26064 | 2.76E-62 | 3 | 162 | 3 | up | NA |
| AT3G43270 | 4 | 2.352608 | 194 | 4.30147 | 1.03E-108 | 1 | 194 | 4 | up | NA |
| AT5G15830 | 2 | 3.873782 | 33 | 6.779295 | 1.88E-56 | 6.5 | 214.5 | 5 | up | AtbZIP3 |
| AT2G13810 | 3 | 5.147604 | 10 | 4.602503 | 1.83E-21 | 42 | 420 | 6 | up | ALD1 |
| AT5G23360 | 5 | 3.082765 | 88 | 3.312275 | 1.88E-56 | 6.5 | 572 | 7 | up | NA |
| AT5G07550 | 5 | 3.759361 | 36 | 8.162229 | 3.32E-36 | 16 | 576 | 8 | up | ATGRP19 |
| AT5G57785 | 6 | 5.970996 | 6 | 5.74865 | 2.69E-10 | 100 | 600 | 9 | up | NA |
| AT3G59850 | 2 | 2.594872 | 156 | 4.139314 | 1.55E-59 | 4 | 624 | 10 | up | NA |
The “withPeakN” column represents the number of peaks located in the scan region. The “sumRP” column represents the RP calculated by calcRP_TFHit. The “RP_rank” column represents the rank of the gene’s RP. The “log2FoldChange” and “padj” columns represent the expression fold change and adjusted p-value, respectively. The “diff_rank” column represents the rank of the gene’s padj. The “rankProduct” column represents the product of “RP_rank” and “diff_rank”. The “rankOf_rankProduct” column represents the rank of the “rankProduct” column. The “gene_category” column gives the gene group according to expression trend (up, down or static) upon induction of LEC2. The “gene symbol” column represents the gene symbol in the TAIR database. NA, not available
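The relationship between the rank columns can be checked directly from the table. The sketch below multiplies the published “RP_rank” and “diff_rank” values and ranks the products in ascending order. The helper name `rank_products` is hypothetical, and ties in the product are broken by input order here, which may differ from how the package handles ties.

```python
# (gene_id, RP_rank, diff_rank) copied from the top 10 LEC2 targets above.
rows = [
    ("AT2G30470", 2, 2),
    ("AT5G08460", 1, 8),
    ("AT1G11170", 54, 3),
    ("AT3G43270", 194, 1),
    ("AT5G15830", 33, 6.5),
    ("AT2G13810", 10, 42),
    ("AT5G23360", 88, 6.5),
    ("AT5G07550", 36, 16),
    ("AT5G57785", 6, 100),
    ("AT3G59850", 156, 4),
]


def rank_products(rows):
    """Return (gene_id, rankProduct, rankOf_rankProduct), best gene first."""
    scored = [(gene, rp_rank * diff_rank) for gene, rp_rank, diff_rank in rows]
    scored.sort(key=lambda item: item[1])  # smaller product = stronger target
    return [(gene, product, position + 1)
            for position, (gene, product) in enumerate(scored)]


result = rank_products(rows)
for gene, product, rank in result:
    print(gene, product, rank)
```

The products reproduce the “rankProduct” column (for example 54 × 3 = 162 for AT1G11170), so a gene must rank well on both binding evidence and differential expression to appear near the top.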
Fig. 4 The findIT_influential_TF module recovers LEC2 as a top influential TF. A The z-score rank distribution of all TFs. The z-scores are obtained by converting the rank of each TF using an inverse normal transformation. LEC2 is marked with a red dot. B The resulting rank of LEC2 for each function (x-axis) of the findIT_influential_TF module. The ranking results for each function are given
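The caption does not specify which inverse normal transformation is used. As a sketch under stated assumptions, the code below maps rank r out of n to the standard normal quantile of (r - 0.5) / n and flips the sign so that the best-ranked TF receives the largest z-score. Both the 0.5 offset and the sign convention are assumptions, and `ranks_to_zscores` is a hypothetical name.

```python
from statistics import NormalDist


def ranks_to_zscores(ranks):
    """Convert ranks 1..n into z-scores via an inverse normal transformation.

    ASSUMPTION: quantile (r - 0.5) / n with the sign flipped, so rank 1
    maps to the highest z-score. The source does not state these details.
    """
    n = len(ranks)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [-standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


z = ranks_to_zscores([1, 2, 3, 4, 5])
print([round(value, 3) for value in z])
```

Because the transformation depends only on ranks, it puts the outputs of methods with very different score scales on one comparable axis.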
Fig. 5 The findIT_regionRP function provides detailed, multi-dimensional information. A The TF ranking result produced by findIT_regionRP. The ATAC-seq dataset at E5 0 h was used. The y-axis represents the -log10(p-value), while the x-axis represents the rank order of all TFs. B The ATAC-seq track of AT2G8610. The genomic region is shown and the selected gene is highlighted in black. The locations of the ATAC-seq peaks are indicated by purple rectangles. The peaks hit by LEC2 are shaded. C The interface of shinyParse_findIT_regionRP
Fig. 6 Inference of TF activities over time during SE by the findIT_MARA function. The top 40 highly variable TFs are given. Seven time points along SE are shown