Xinbin Dai, Ji He, Xuechun Zhao.
Abstract
Identifying transcription factor target genes (TFTGs) is a vital step towards understanding the regulatory mechanisms of gene expression. Methods for the de novo identification of TFTGs are generally based on screening for novel DNA binding sites. However, experimental screening of new binding sites is technically challenging, laborious and time-consuming, while computational methods still lack accuracy. We propose a novel systematic computational approach for predicting TFTGs directly on a genome scale. Utilizing gene co-expression data, we modeled the prediction problem as a 'yes' or 'no' classification task by converting biological sequences into novel reverse-complementary position-sensitive n-gram profiles and implemented the classifiers with support vector machines. Our approach does not necessarily predict new DNA binding sites, which other studies have shown to be difficult and inaccurate. We applied the proposed approach to predict auxin-response factor target genes from published Arabidopsis thaliana co-expression data and obtained satisfactory results. In ten-fold cross-validation, the area under the receiver operating characteristic curve (AUC) reaches approximately 0.73.
Year: 2007 PMID: 17576669 PMCID: PMC1935008 DOI: 10.1093/nar/gkm454
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The classification of 7720 genes represented by probes on Affymetrix 8K AG-Chips
| | Number of IAA/BL-affected genes | Number of IAA/BL-unaffected genes | Sum |
|---|---|---|---|
| Number of genes with TGTCTC/GAGACA in their 1000 bp upstream regions | 186 | 2601 | 2787 |
| Number of genes without TGTCTC/GAGACA in their 1000 bp upstream regions | 451 | 4482 | 4933 |
| Sum | 637 | 7083 | 7720 |
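The counts for genes carrying the motif can be cross-checked against the marginal totals in the table. The following is a minimal sketch of that arithmetic; the variable names are illustrative and do not come from the paper.

```python
# Marginal totals and the motif-free row, as given in the table.
total_affected, total_unaffected, total_genes = 637, 7083, 7720
no_motif_affected, no_motif_unaffected, no_motif_sum = 451, 4482, 4933

# Counts for genes with TGTCTC/GAGACA follow by subtraction.
motif_affected = total_affected - no_motif_affected
motif_unaffected = total_unaffected - no_motif_unaffected
motif_sum = total_genes - no_motif_sum

print(motif_affected, motif_unaffected, motif_sum)
```

The first two results correspond to the 186 ARF-related and 2601 ARF-unrelated genes used in the four-gram table.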
Figure 1. An example of constructing a reverse-complementary position-sensitive four-gram profile. The ‘TGTCTC’ highlighted in red is marked as the core motif. List of reverse-complementary position-sensitive 4-grams of the given sequence: Box-a: TAGT/ACTA|+1; Box-1: GCTA/TAGC|-1; Box-b: GTAG/CTAC|+2; Box-2: CTAG/CTAG|-2; Box-c: AGTA/TACT|+3; Box-3: TAGA/TCTA|-3; Box-d: TAGT/ACTA|+4; Box-4: AGAT/ATCT|-4.
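The general idea of pairing each n-gram with its reverse complement and tagging it with a signed position relative to the core motif might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the sorted ordering within each pair, the three-digit zero-padded position, and the way the factor `p` bins distances (ceiling division) are all hypothetical choices, and the offset convention is not claimed to reproduce the box numbering in Figure 1.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(gram):
    """Label an n-gram with its reverse complement, smaller string first (assumed order)."""
    first, second = sorted((gram, reverse_complement(gram)))
    return f"{first}/{second}"

def position_sensitive_ngrams(seq, core="TGTCTC", n=4, p=1):
    """List 'GRAM/REVCOMP|+BIN' or '|-BIN' features around the first core motif."""
    seq = seq.upper()
    start = seq.find(core)
    if start < 0:
        return []  # no core motif, so there is nothing to anchor positions on
    end = start + len(core)
    features = []
    # Downstream: the d-th window starts d - 1 bases after the motif ends.
    for d, i in enumerate(range(end, len(seq) - n + 1), start=1):
        bin_index = -(-d // p)  # ceiling division groups p neighbouring offsets
        features.append(f"{canonical(seq[i:i + n])}|+{bin_index:03d}")
    # Upstream: the d-th window ends d - 1 bases before the motif starts.
    for d, i in enumerate(range(start - n, -1, -1), start=1):
        bin_index = -(-d // p)
        features.append(f"{canonical(seq[i:i + n])}|-{bin_index:03d}")
    return features

print(position_sensitive_ngrams("AAAATGTCTCGGGG"))
```

Only the forward-strand motif is searched here to keep the sketch short; handling GAGACA as well would need an extra lookup.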
Figure 2. Vector representation of DNA sequence using the featured reverse-complementary position-sensitive n-grams.
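One plausible way to turn such feature lists into fixed-length vectors is to index every distinct feature and count its occurrences per sequence. The sketch below assumes count weighting (the record does not say whether counts or presence/absence were used), and the two toy profiles are hypothetical: the feature labels are borrowed from the four-gram table, but their assignment to genes is made up.

```python
from collections import Counter

def build_vocabulary(profiles):
    """Sorted list of every feature seen in any profile."""
    return sorted({feature for profile in profiles for feature in profile})

def to_vector(profile, vocabulary):
    """Count how often each vocabulary feature occurs in one profile."""
    counts = Counter(profile)
    return [counts[feature] for feature in vocabulary]

# Hypothetical profiles for two genes.
profiles = [
    ["AAAT/ATTT|-018", "TAGA/TCTA|-048"],
    ["TAGA/TCTA|-048", "CTAC/GTAG|+065", "TAGA/TCTA|-048"],
]
vocabulary = build_vocabulary(profiles)
vectors = [to_vector(profile, vocabulary) for profile in profiles]
print(vocabulary)
print(vectors)
```

Vectors of this shape could then be passed to any classifier that accepts numeric feature matrices, such as a support vector machine.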
Top 10 significant four-grams screened by information gain value*
| No. | Position-sensitive four-grams | Number of occurrences in 186 ARF-related genes | Number of occurrences in 2601 ARF-unrelated genes | Information gain value |
|---|---|---|---|---|
| 1 | AAAT/ATTT\|-018 | 13 | 37 | 0.00146 |
| 2 | AAAG/CTTT\|-060 | 11 | 29 | 0.00132 |
| 3 | AAGT/ACTT\|-088 | 9 | 19 | 0.00128 |
| 4 | TAGA/TCTA\|-048 | 8 | 15 | 0.00124 |
| 5 | CCCA/TGGG\|-048 | 6 | 8 | 0.00114 |
| 6 | CTAC/GTAG\|+065 | 5 | 5 | 0.00109 |
| 7 | ACTA/TAGT\|-074 | 8 | 18 | 0.00109 |
| 8 | ACAT/ATGT\|-068 | 8 | 19 | 0.00104 |
| 9 | ATTC/GAAT\|-063 | 9 | 26 | 0.00099 |
| 10 | TACA/TGTA\|+024 | 7 | 15 | 0.00098 |
*The reverse-complementary position-sensitive 4-gram profile was generated from 1000 bp upstream regions of the 186 ARF-target genes and the 2601 ARF-non-target genes with position-sensitive factor P = 1.
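Information gain for a single feature can be computed from the four counts in a table row. The sketch below makes two assumptions that the record does not state: each "occurrence" is treated as one gene containing the feature, and entropies use base-10 logarithms, a choice made only because it appears consistent with the tabulated values. Function names are illustrative.

```python
import math

def entropy(pos, neg):
    """Entropy of a two-class split, using base-10 logarithms (assumed base)."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            fraction = count / total
            h -= fraction * math.log10(fraction)
    return h

def information_gain(pos_with, neg_with, pos_total, neg_total):
    """Class entropy minus the entropy remaining after splitting on feature presence."""
    n = pos_total + neg_total
    n_with = pos_with + neg_with
    pos_without = pos_total - pos_with
    neg_without = neg_total - neg_with
    conditional = (n_with / n) * entropy(pos_with, neg_with) \
        + ((n - n_with) / n) * entropy(pos_without, neg_without)
    return entropy(pos_total, neg_total) - conditional

# Row 1 of the table: 13 of 186 ARF-related and 37 of 2601 ARF-unrelated genes.
ig_row1 = information_gain(13, 37, 186, 2601)
print(round(ig_row1, 5))
```

Under these assumptions the result for row 1 should come out close to the tabulated 0.00146; a different logarithm base would rescale every value by a constant without changing the ranking.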
Figure 3. Receiver operating characteristic (ROC) curve of optimal models.
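The area under a ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that calculation follows; the labels and scores are hypothetical toy values, not outputs of the classifiers described here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical decision scores for five genes (1 = target, 0 = non-target).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

In a ten-fold cross-validation, a value like this would be computed on the held-out fold of each split. The pairwise loop is quadratic in the number of examples, which is fine for illustration but slow for large gene sets.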
Figure 4. (a) AUC value versus N number of n-grams (C); (b) AUC value versus position-sensitivity factor (P).