Song Cui, Eunseog Youn, Joohyun Lee, Stephan J Maas.
Abstract
Biological prediction of transcription factor binding sites and their corresponding transcription factor target genes (TFTGs) makes a great contribution to understanding gene regulatory networks. However, these approaches rely on laborious and time-consuming biological experiments. Numerous computational approaches have shown great potential to circumvent laborious biological methods, but the majority of these algorithms provide limited performance and fail to consider the structural properties of the datasets. We proposed a refined systematic computational approach for predicting TFTGs. Building on previous work on identifying auxin response factor target genes from Arabidopsis thaliana co-expression data, we adopted a novel reverse-complementary distance-sensitive n-gram profile algorithm. This algorithm converts each upstream sub-sequence into a high-dimensional vector data point and transforms the prediction task into a classification problem solved with a support vector machine-based classifier. Our approach showed significant improvement over other computational methods, as measured by the area under the receiver operating characteristic curve in 10-fold cross-validation. In addition, in light of the highly skewed structure of the dataset, we also evaluated other metrics and their associated curves, such as precision-recall curves and cost curves, which provided highly satisfactory results.
Year: 2014 PMID: 24743548 PMCID: PMC3990533 DOI: 10.1371/journal.pone.0094519
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
An example of a reverse-complementary distance-sensitive n-gram profile (RCDSNGP) representation with n = 4, 5, and 6 for a given sequence (AAGCTT) with the reference subsequence marked in bold*.
| Length of n-gram (n) | Reverse-complementary distance-sensitive n-gram | Frequency count |
| 4 | {AAGC, GCTT, 1} | 1 |
| 4 | {AGCT, AGCT, 2} | 2 |
| 4 | {AAGC, GCTT, 3} | 1 |
| 4 | {CAGC, GCTG, 1} | 1 |
| 5 | {AAGCT, AGCTT, 1} | 1 |
| 5 | {AAGCT, AGCTT, 2} | 1 |
| 5 | {AGCTG, CAGCT, 1} | 1 |
| 6 | {AAGCTT, AAGCTT, 1} | 1 |
*Given an m-length sequence $s = s_1, s_2, \ldots, s_i, \ldots, s_{i+j-1}, \ldots, s_m$, the RCDSNGP of $s$ with respect to a j-length reference subsequence $x = s_i \ldots s_{i+j-1}$ is a set of $K$ 2-tuples, denoted as $\mathrm{RCDSNGP}(s) = \{(\{f_1, r_1, d_1\}, c_1), (\{f_2, r_2, d_2\}, c_2), \ldots, (\{f_K, r_K, d_K\}, c_K)\}$, where $f_k$ is a distinct n-gram, $r_k$ is the reverse complement of $f_k$, $d_k$ is the relative distance parameter, and $c_k$ is the sum of the frequency counts of $f_k$ and $r_k$ with the same $d_k$ relative to $x$ in $s$. Each set $\{f_k, r_k, d_k\}$ in a 2-tuple is a reverse-complementary distance-sensitive n-gram (RCDSNG), or a feature in our study. This RCDSNGP representation was adopted for all training sequences. For testing, each sequence was first converted to its RCDSNGP and then represented according to the RCDSNGs selected from the training datasets, including those with zero count.
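The profile defined in the footnote can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the placeholder reference motif, and above all the distance convention are assumptions: here an n-gram directly adjacent to the reference subsequence on either flank gets d = 1, and n-grams overlapping the reference are skipped. Each n-gram and its reverse complement are merged under one key, ordered alphabetically as in the example table.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def rcdsngp(seq, ref_start, ref_len, n_values=(4, 5, 6), d_max=None):
    """Count {f, r, d} features around a reference subsequence.

    ASSUMPTION: d is the gap to the nearest edge of the reference plus one,
    measured the same way on both flanks. The source does not spell this out.
    """
    profile = Counter()
    ref_end = ref_start + ref_len  # half-open interval [ref_start, ref_end)
    for n in n_values:
        for start in range(len(seq) - n + 1):
            end = start + n
            if end <= ref_start:
                d = ref_start - end + 1      # n-gram on the upstream flank
            elif start >= ref_end:
                d = start - ref_end + 1      # n-gram on the downstream flank
            else:
                continue                     # overlaps the reference: skip
            if d_max is not None and d > d_max:
                continue
            f = seq[start:end]
            r = reverse_complement(f)
            # Merge an n-gram with its reverse complement under one key.
            profile[(min(f, r), max(f, r), d)] += 1
    return profile

def to_vector(profile, selected_features):
    """Represent a profile on a fixed feature list, keeping zero counts."""
    return [profile.get(feature, 0) for feature in selected_features]

# Toy sequence: AAGC + placeholder reference motif + GCTT (all hypothetical).
toy_profile = rcdsngp("AAGC" + "TGTCTC" + "GCTT", ref_start=4, ref_len=6,
                      n_values=(4,))
toy_vector = to_vector(toy_profile,
                       [("AAGC", "GCTT", 1), ("ACGT", "ACGT", 2)])
```

In the toy sequence, AAGC sits directly upstream of the reference and its reverse complement GCTT directly downstream, so both fall into the single feature {AAGC, GCTT, 1}. `to_vector` mirrors the testing step described above: a test sequence is projected onto the features selected during training, and features it lacks receive a zero.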
Confusion matrix for performance evaluation with positive class label (+1) denoting transcription factor target gene (TFTG) and negative class label (-1) denoting non-TFTG.
| | | True class label | |
| | | +1 | −1 |
| Predicted class label | +1 | TP | FP |
| | −1 | FN | TN |
TP, FP, FN, and TN denote true positive, false positive, false negative, and true negative, respectively.
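The metrics reported in the tables that follow all derive from the four cells of this confusion matrix. The sketch below uses the standard definitions. The function name is hypothetical, and the example counts are not reported in the source: they were back-calculated from the class sizes (186 TFTGs, 2601 non-TFTGs) and the RCDSNGP metrics at d = 150, so treat them as an inference.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from one confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined (0/0) when no sequence is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    # F1 written as 2TP / (2TP + FP + FN), which is 0 whenever TP is 0.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else math.nan
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# ASSUMED counts, inferred from the reported metrics (not given in the source).
inferred = classification_metrics(tp=94, fp=19, fn=92, tn=2582)

# A classifier that never predicts the positive class on the same class sizes.
all_negative = classification_metrics(tp=0, fp=0, fn=186, tn=2601)
```

Rounded to four decimals, the inferred counts give accuracy 0.9602, precision 0.8319, recall 0.5054, and F1 0.6288. The second call shows how an undefined precision arises alongside zero recall and zero F1, while accuracy stays high because the negative class dominates. That is why accuracy alone is a weak criterion on a dataset this skewed.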
Number of unique features (union of selected features with p-value <0.01 based on 10-fold cross validation) and classification performances [evaluated as accuracy, precision, recall, F1, area under the curve (AUC) value of receiver operating characteristic (ROC) curve, and AUC value of precision-recall (PR) curve] affected by the maximum distance on each flank from the central binding site (d) based on 10-fold cross-validation on transcription factor target gene prediction using reverse-complementary distance-sensitive n-gram profile algorithm with n = 4–9 and support vector machine with Gaussian radial basis function kernel.
| dmax | Unique Feature | Accuracy | Precision | Recall | F1 | ROC-AUC | PR-AUC |
| 25 | 893 | 0.9476 | 0.6923 | 0.3871 | 0.4966 | 0.7202 | 0.4180 |
| 50 | 1870 | 0.9541 | 0.7544 | 0.4624 | 0.5733 | 0.7562 | 0.5286 |
| 75 | 2732 | 0.9559 | 0.7739 | 0.4785 | 0.5914 | 0.7739 | 0.5554 |
| 100 | 3502 | 0.9566 | 0.7519 | 0.5215 | 0.6159 | 0.7720 | 0.5808 |
| 125 | 4255 | 0.9580 | 0.7899 | 0.5054 | 0.6164 | 0.7640 | 0.5646 |
| 150 | 4949 | 0.9602 | 0.8319 | 0.5054 | 0.6288 | 0.7626 | 0.5690 |
| 175 | 5622 | 0.9587 | 0.8034 | 0.5054 | 0.6205 | 0.7673 | 0.5639 |
| 200 | 6136 | 0.9580 | 0.7805 | 0.5161 | 0.6214 | 0.7664 | 0.5879 |
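The Figure 2 caption refers to d = 150 as the optimal maximum distance, but this record does not state which criterion was used. As a quick check, the sketch below copies the rows of the table above and reports which d maximizes each metric. The variable and function names are hypothetical.

```python
METRICS = ("accuracy", "precision", "recall", "f1", "roc_auc", "pr_auc")

# d_max -> metric values, copied from the table (feature counts omitted).
ROWS = {
    25:  (0.9476, 0.6923, 0.3871, 0.4966, 0.7202, 0.4180),
    50:  (0.9541, 0.7544, 0.4624, 0.5733, 0.7562, 0.5286),
    75:  (0.9559, 0.7739, 0.4785, 0.5914, 0.7739, 0.5554),
    100: (0.9566, 0.7519, 0.5215, 0.6159, 0.7720, 0.5808),
    125: (0.9580, 0.7899, 0.5054, 0.6164, 0.7640, 0.5646),
    150: (0.9602, 0.8319, 0.5054, 0.6288, 0.7626, 0.5690),
    175: (0.9587, 0.8034, 0.5054, 0.6205, 0.7673, 0.5639),
    200: (0.9580, 0.7805, 0.5161, 0.6214, 0.7664, 0.5879),
}

def best_d(metric):
    """Return the d_max whose row has the highest value for `metric`."""
    idx = METRICS.index(metric)
    return max(ROWS, key=lambda d: ROWS[d][idx])

best = {metric: best_d(metric) for metric in METRICS}
```

Accuracy, precision, and F1 all peak at d = 150, whereas recall peaks at 100, ROC-AUC at 75, and PR-AUC at 200. The choice of 150 is therefore consistent with a threshold-based criterion such as F1, although that reading is an inference from the numbers rather than something the source states.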
The 20 reverse-complementary distance-sensitive n-grams (RCDSNGs; n = 4–9) with the smallest p-values, selected as representative features, with their information gain (IG) values and p-values in the 1000-bp upstream regions of 186 transcription factor target genes (TFTGs) and 2601 non-TFTGs, when the maximum distance (half of the window size) on each flank of the central transcription factor binding site is d = 150.
| Ranking | RCDSNG (feature) | IG value | p-value |
| 1 | {ACACGT, ACGTGT, 4} | 0.001885 | 0 |
| 2 | {CGAGAA, TTCTCG, 82} | 0.001884 | 0 |
| 3 | {AATATAA, TTATATT, 52} | 0.001884 | 0 |
| 4 | {ACTTCC, GGAAGT, 30} | 0.001880 | 0 |
| 5 | {ACACC, GGTGT, 44} | 0.001880 | 0 |
| 6 | {GTAC, GTAC, 39} | 0.001701 | 0 |
| 7 | {CAAACA, TGTTTG, 149} | 0.001707 | 0 |
| 8 | {AAAAATA, TATTTTT, 44} | 0.001707 | 0 |
| 9 | {AGTAT, ATACT, 51} | 0.001713 | 0 |
| 10 | {ATGATTA, TAATCAT, 130} | 0.001656 | 0 |
| 11 | {ACTTC, GAAGT, 30} | 0.001516 | 0 |
| 12 | {CTAAC, GTTAG, 91} | 0.001476 | 0 |
| 13 | {ACAAATA, TATTTGT, 71} | 0.001463 | 0 |
| 14 | {ATACG, CGTAT, 49} | 0.001463 | 0 |
| 15 | {AAAACC, GGTTTT, 75} | 0.001463 | 0 |
| 16 | {AAAGACA, TGTCTTT, 117} | 0.001463 | 0 |
| 17 | {TAAAACA, TGTTTTA, 85} | 0.001463 | 0 |
| 18 | {AGTATA, TATACT, 124} | 0.001458 | 0 |
| 19 | {AATGTG, CACATT, 43} | 0.001412 | 0 |
| 20 | {ATACCC, GGGTAT, 16} | 0.001412 | 0 |
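This record does not describe how the IG values in the table were computed. The sketch below uses the standard definition, IG = H(class) − H(class | feature), and assumes purely for illustration that each RCDSNG is reduced to present or absent in a sequence. The function names and the example counts are hypothetical. Only the class sizes (186 and 2601) come from the caption.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(pos_with, pos_without, neg_with, neg_without):
    """IG of a binary feature with respect to a binary class label.

    pos_with / pos_without: positive sequences with / without the feature.
    neg_with / neg_without: negative sequences with / without the feature.
    """
    total = pos_with + pos_without + neg_with + neg_without
    h_class = entropy([pos_with + pos_without, neg_with + neg_without])
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    h_conditional = (
        (n_with / total) * entropy([pos_with, neg_with])
        + (n_without / total) * entropy([pos_without, neg_without])
    )
    return h_class - h_conditional

# HYPOTHETICAL feature: present in 12 of 186 TFTGs and 20 of 2601 non-TFTGs.
example_ig = information_gain(pos_with=12, pos_without=174,
                              neg_with=20, neg_without=2581)
class_entropy = entropy([186, 2601])
```

IG is bounded above by the class entropy, which is itself small when one class makes up only about 7% of the data. A feature that occurs in few sequences can therefore have a small IG and still differ significantly between the classes.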
Figure 1. Performances of transcription factor target gene prediction affected by the maximum distance on each flank from the binding site (d) based on 10-fold cross-validation using reverse-complementary distance-sensitive n-gram profile algorithm with n = 4–9 and support vector machine with Gaussian radial basis function kernel.
(A) Performance evaluation metric (accuracy, precision, recall, and F1) values versus d on each flank from the central binding site. (B) Area under the curve (AUC) value of receiver operating characteristic (ROC) curve and precision-recall (PR) curve versus d on each flank from the central binding site.
Figure 2. Classification performances using the optimal maximum distance on each flank from the binding site (d = 150).
(A) Receiver operating characteristic (ROC) curve, (B) precision-recall (PR) curve, and (C) cost curve of the 10-fold cross-validation on transcription factor target gene prediction using reverse-complementary distance-sensitive n-gram profile (RCDSNGP) algorithm with d = 150 and n = 4–9 based on different support vector machine (SVM) kernels, reverse-complementary position-sensitive n-gram profile (RCPSNP) algorithm using linear-kernel SVM, and Position-Specific Scoring Matrices (PSSM)-based approach.
Classification performances [evaluated as accuracy, precision, recall, F1, area under the curve (AUC) value of receiver operating characteristic (ROC) curve, and AUC value of precision-recall (PR) curve] using different feature encoding algorithms with optimal parameter settings for SVM and d = 150, including the reverse-complementary distance-sensitive n-gram profile (RCDSNGP), the reverse-complementary position-sensitive n-gram profile (RCPSNP), and a Position-Specific Scoring Matrices (PSSM)-based algorithm.
| Feature Encoding Algorithm | Accuracy | Precision | Recall | F1 | ROC-AUC | PR-AUC |
| RCDSNGP | 0.9602 | 0.8319 | 0.5054 | 0.6288 | 0.7626 | 0.5690 |
| RCPSNP | 0.9300 | 0.2000 | 0.0161 | 0.0299 | 0.5055 | 0.0773 |
| PSSM | 0.9332 | NAN | 0 | 0 | 0.5569 | 0.0801 |
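To put the two AUC columns in context, it helps to know their chance levels. ROC-AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, so a classifier that ranks at random scores about 0.5. For the PR curve, a random ranking has an expected precision equal to the positive-class prevalence. The sketch below computes both. The function name and the toy scores are hypothetical, and the class sizes are those given in the feature table caption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as P(score of a positive > score of a negative).

    Labels follow the document's convention: +1 for TFTG, -1 for non-TFTG.
    This is the normalised Mann-Whitney U statistic; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Class sizes from the document: 186 TFTGs and 2601 non-TFTGs.
N_POS, N_NEG = 186, 2601
prevalence = N_POS / (N_POS + N_NEG)  # chance-level precision, about 0.067

# Toy example (hypothetical scores): three of the four pairs are ordered correctly.
toy_auc = roc_auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

Against these baselines, the RCPSNP and PSSM rows (ROC-AUC 0.5055 and 0.5569, PR-AUC 0.0773 and 0.0801) sit close to chance, whereas the RCDSNGP row (0.7626 and 0.5690) is well above it.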