Rami N Mahdi, Eric C Rouchka.
Abstract
Accurate identification of promoter regions and transcription start sites (TSS) in genomic DNA allows for a more complete understanding of the structure of genes and gene regulation within a given genome. Many recently published methods have achieved high TSS identification accuracy. However, more accurate models of promoters and TSS are still needed. A novel method for identifying transcription start sites is proposed that improves on the TSS recognition accuracy of recently published methods. This method incorporates a metric feature based on oligonucleotide positional frequencies, taking into account the nature of promoters. A radial basis function neural network for identifying transcription start sites (RBF-TSS) is proposed and employed as the classification algorithm. Using non-overlapping chunks (windows) of size 50 and 500 on the human genome, the proposed method achieves an area under the Receiver Operating Characteristic curve (auROC) of 94.75% and 95.08%, respectively, providing increased performance over existing TSS prediction methods.
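The chunk-based evaluation mentioned in the abstract can be illustrated with a small sketch. The abstract does not spell out how per-position predictions are turned into chunk-level predictions, so the rule used here (a chunk is positive if it contains a known TSS, and its score is the maximum per-position score inside it) is an assumption, and `chunk_scores` is a hypothetical helper name.

```python
def chunk_scores(position_scores, tss_positions, chunk_size):
    """Collapse per-position scores into non-overlapping chunks.

    Assumed rule (not taken from the abstract): a chunk is labeled 1 if any
    known TSS falls inside it, and its score is the maximum score within it.
    Returns a list of (score, label) pairs, one per chunk.
    """
    tss = set(tss_positions)
    chunks = []
    for start in range(0, len(position_scores), chunk_size):
        end = min(start + chunk_size, len(position_scores))
        score = max(position_scores[start:end])
        label = 1 if any(start <= t < end for t in tss) else 0
        chunks.append((score, label))
    return chunks
```

With a chunk size of 50 there are roughly ten times as many chunks as with a size of 500, and most of the additional chunks are negatives, which is one reason the two settings are reported separately.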
Year: 2009 PMID: 19287502 PMCID: PMC2654504 DOI: 10.1371/journal.pone.0004878
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Training sequences are divided around the TSS into overlapping sub-regions.
The subdivision shown corresponds to the feature 7 settings described in Table 1.
Table 1. Sub-regions and oligonucleotide lengths considered for feature extraction.
| Feature | Sub-region ranges (relative to the TSS) | Oligonucleotide length |
| --- | --- | --- |
| 1 | (−500,−230),(−300,−50),(−100,20),(−20,99) | 4 mer |
| 2 | (−500,−220),(−310,−40),(−110,30),(−30,99) | 4 mer |
| 3 | (−500,−240),(−290,−60),(−90,10),(−10,99) | 4 mer |
| 4 | (−600,−230),(−280,−40),(−70,70),(40,199) | 4 mer |
| 5 | (−600,−240),(−270,−50),(−60,60),(50,199) | 4 mer |
| 6 | (−600,−280),(−330,−110),(−150,20),(−20,149) | 4 mer |
| 7 | (−600,−230),(−280,−40),(−70,70),(40,249) | 4 mer |
| 8 | (−650,−490),(−550,−400),(−450,−310),(−350,−220),(−260,−140),(−170,−60),(−90,10),(10,70),(50,150),(120,229) | 3 mer |
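As a rough illustration of how sub-regions like those in Table 1 might be turned into oligonucleotide frequency vectors, the sketch below counts overlapping k-mers inside each sub-region of a sequence. It is not the authors' metric feature, whose exact definition is not given in this record. The function name `kmer_frequencies`, the treatment of ranges as inclusive offsets from a 0-based TSS index, and the normalization to relative frequencies are all assumptions.

```python
from collections import Counter
from itertools import product

# Feature 7 sub-regions from Table 1, read here as inclusive offsets
# relative to the TSS (the inclusive reading is an assumption).
FEATURE_7_REGIONS = [(-600, -230), (-280, -40), (-70, 70), (40, 249)]


def kmer_frequencies(seq, tss_index, regions, k):
    """Return one dict of relative k-mer frequencies per sub-region.

    seq is an upper-case DNA string and tss_index is the 0-based index of
    the TSS in seq. Windows containing characters outside ACGT simply
    contribute nothing for the affected k-mers.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    features = []
    for start, end in regions:
        lo = tss_index + start
        hi = tss_index + end  # inclusive upper bound
        if lo < 0 or hi >= len(seq):
            raise ValueError("sequence does not cover the sub-region")
        window = seq[lo:hi + 1]
        counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
        total = sum(counts[m] for m in all_kmers)
        features.append(
            {m: (counts[m] / total if total else 0.0) for m in all_kmers}
        )
    return features
```

For feature 7 with 4-mers this yields four vectors of 256 frequencies each; how these are combined into a single score is left open here.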
Figure 2. Typical Radial Basis Function network topology.
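For readers unfamiliar with the topology, a generic radial basis function network maps an input vector through a hidden layer of Gaussian units and then takes a weighted sum. The following is a minimal sketch of that textbook forward pass, not the RBF-TSS implementation. The Gaussian form, the per-unit widths, and the name `rbf_forward` are assumptions.

```python
import math


def rbf_forward(x, centers, widths, weights, bias=0.0):
    """Compute the output of a single-output Gaussian RBF network.

    Each hidden unit j has activation exp(-||x - c_j||^2 / (2 * sigma_j^2));
    the output is the bias plus the weighted sum of the activations.
    """
    activations = []
    for center, sigma in zip(centers, widths):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        activations.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return bias + sum(w * a for w, a in zip(weights, activations))
```

An input that coincides with a center fully activates that unit, while units with distant centers contribute almost nothing, which is what makes the hidden layer act as a set of local prototypes.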
Figure 3. Average scores at positions around the true TSS vs. average scores of negative examples in validation data.
The x-axis shows the position relative to the true TSS within the positive examples.
Figure 4. ROC curve for chunk sizes 50 and 500.
Both axes use a base-10 logarithmic scale to highlight the difference.
Figure 5. PRC curve for chunk sizes of 50 and 500.
Area under the curve for RBF-TSS and ARTS.
| Method | auROC % (chunk size 50) | auROC % (chunk size 500) | auPRC % (chunk size 50) | auPRC % (chunk size 500) |
| --- | --- | --- | --- | --- |
| RBF-TSS | 94.75 | 95.08 | 24.08 | 54.64 |
| ARTS | 92.77 | 93.44 | 26.18 | 57.19 |
| Eponine | 88.48 | 91.51 | 11.79 | 40.80 |
| McPromoter | 92.55 | 93.59 | 6.32 | 24.23 |
| FirstEF | 71.29 | 90.25 | 6.54 | 40.89 |
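The two summary metrics in the table can be computed from any list of scored, labeled chunks. The sketch below is only meant to make the metrics concrete: it computes auROC as the probability that a positive outscores a negative (ties count one half) and approximates auPRC by average precision. The record does not say which estimator the authors used for auPRC, so that choice is an assumption, as are the function names.

```python
def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison (O(n^2), for clarity)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Average precision, one common estimate of the area under the PR curve.

    Tied scores are ranked in input order, which is an arbitrary choice.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            precisions.append(true_positives / rank)
    if not precisions:
        raise ValueError("need at least one positive")
    return sum(precisions) / len(precisions)
```

Because precision depends on the ratio of positives to negatives while the ROC curve does not, auPRC is far more sensitive to class imbalance than auROC. That is consistent with the table, where auPRC values are much lower at chunk size 50 than at 500 even though auROC changes only slightly.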