Ravi Gupta, Priyankara Wikramasinghe, Anirban Bhattacharyya, Francisco A Perez, Sharmistha Pal, Ramana V Davuluri.
Abstract
BACKGROUND: Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massive parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and other genomic regions due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context.
Year: 2010 PMID: 20122241 PMCID: PMC3009539 DOI: 10.1186/1471-2105-11-S1-S65
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance statistics of classification models based on 10-fold cross-validation (39 features; Promoters = 8793, NonPromoters = 34686)
| Method | # of true positives | # of false negatives | # of true negatives | # of false positives | Sensitivity (%) | Positive predictive value (PPV, %) | Matthews correlation coefficient | True positive cost | ROC area |
|---|---|---|---|---|---|---|---|---|---|
| | 7603 | 1190 | 34354 | 332 | 86.47 | 95.82 | 0.89 | 0.04 | 0.97 |
| | 7638 | 1155 | 34252 | 434 | 86.86 | 94.62 | 0.88 | 0.06 | 0.97 |
| | 7626 | 1167 | 34921 | 395 | 86.73 | 95.08 | 0.89 | 0.05 | 0.97 |
| | 7153 | 1640 | 34198 | 488 | 81.35 | 93.61 | 0.84 | 0.07 | 0.96 |
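
The derived columns of the cross-validation table can be recomputed from the four confusion-matrix counts. The following is a minimal sketch, not the authors' code: the function name `classification_metrics` is hypothetical, sensitivity, PPV and the Matthews correlation coefficient use their standard definitions, and "true positive cost" is assumed to be FP / TP, a definition inferred from the tabulated values and not stated in the text.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Derive summary statistics from confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)  # percent of true promoters recovered
    ppv = 100.0 * tp / (tp + fp)          # percent of predictions that are correct
    # Matthews correlation coefficient (standard definition).
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    # Assumption: "true positive cost" = FP / TP, inferred from the table values.
    tp_cost = fp / tp
    return {
        "sensitivity": sensitivity,
        "ppv": ppv,
        "mcc": mcc,
        "tp_cost": tp_cost,
    }


# Counts taken from the first data row of the 10-fold cross-validation table.
first_row = classification_metrics(tp=7603, fn=1190, tn=34354, fp=332)
for name, value in first_row.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these counts give the sensitivity (86.47), PPV (95.82), Matthews correlation coefficient (0.89) and true positive cost (0.04) reported in that row.
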
Performance statistics of classification models and other existing programs based on independent test set (Promoters = 2980, NonPromoters = 11481)
| Method | # of true positives | # of false negatives | # of true negatives | # of false positives | Sensitivity (%) | Positive predictive value (PPV, %) | Matthews correlation coefficient | True positive cost |
|---|---|---|---|---|---|---|---|---|
| | 2593 | 387 | 11385 | 96 | 87.01 | 96.43 | 0.9 | 0.04 |
| | 2594 | 386 | 11356 | 125 | 87.05 | 95.4 | 0.89 | 0.05 |
| | 2599 | 381 | 11349 | 132 | 87.21 | 95.17 | 0.89 | 0.05 |
| | 2391 | 589 | 11332 | 149 | 80.23 | 94.13 | 0.84 | 0.06 |
| | 2493 | 487 | 11064 | 417 | 86.91 | 85.67 | 0.81 | 0.17 |
| | 2581 | 399 | 9633 | 1848 | 87.01 | 58.28 | 0.62 | 0.72 |
| | 2563 | 417 | 8817 | 2664 | 86.01 | 49.03 | 0.53 | 1.04 |
| | 1714 | 1226 | 11402 | 79 | 57.52 | 95.6 | 0.70 | 0.05 |
Figure 1. ROC curve for four classification models. The ROC curve obtained by 10-fold cross-validation test for the four different classification methods.
Figure 2. Variable Importance table. Top ranking feature variables selected by Random Forest and their mean decrease in accuracy and mean decrease in Gini measure in discriminating Pol-II enriched promoter regions and Pol-II enriched non-promoter regions. The mean decrease in accuracy/Gini measure was an average of 100 runs of RF.
Figure 3. ROC curve for comparison of our method with existing programs. The ROC curve obtained on the test set using our method and other existing programs: EP3, Eponine, FirstEF and ProSOM.
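
For readers unfamiliar with how an ROC curve and its area are derived from classifier scores, here is a minimal sketch. It is illustrative only: the function names `roc_points` and `auc` are hypothetical, the six labels and scores are made-up toy values and not data from the study, and the area is computed with the trapezoidal rule, which may differ from the procedure used to produce the figures.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering a score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score so each step admits lower-scoring examples.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all examples sharing one score so ties yield a single point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data (hypothetical): 1 = promoter, 0 = non-promoter.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(f"toy ROC area: {toy_auc:.3f}")
```

In this toy set, 8 of the 9 promoter/non-promoter pairs are ranked correctly, so the area equals 8/9.
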
Summary of prediction and annotation of Pol-II promoters from published ChIP-seq datasets
| Stage | D0 | D1 | D2 | D3 | D4 | D6 | ES Cell |
|---|---|---|---|---|---|---|---|
| | 5252311 | 5252311 | 5252311 | 5252311 | 5252311 | 5252311 | 2688589 |
| | 108416 | 134674 | 153097 | 140137 | 159599 | 88606 | 13942 |
| | 24888 | 25179 | 24510 | 25101 | 22374 | 15838 | 5889 |
| | 10645 | 10632 | 10349 | 10539 | 9701 | 8153 | 5034 |
| | 1039 | 1095 | 1088 | 1101 | 1029 | 708 | 313 |
| | 11684 | 13452 | 13673 | 13461 | 11644 | 6977 | 542 |