Jia Zeng, Reda Alhajj, Douglas J Demetrick.
Abstract
BACKGROUND: Translational initiation site (TIS) prediction is an important and actively studied topic in bioinformatics. To support comparative analysis, it is desirable to have several benchmark data sets that can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens.
Year: 2009 PMID: 19573244 PMCID: PMC2712473 DOI: 10.1186/1471-2105-10-206
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Statistics regarding the cardinality and size portion of the data sets
| Phase I Card. | Phase I Ratio | Phase II Card. | Phase II Ratio | Phase III Card. | Phase III Ratio |
| 25697 | 100% | 8006 | 31% | 2498 | 10% |
| 21031 | 100% | 6723 | 32% | 2504 | 12% |
| 23758 | 100% | 7883 | 33% | 1342 | 7% |
| 20096 | 100% | 5900 | 29% | 1670 | 8% |
Card. denotes the cardinality of a data set at a particular phase; Ratio denotes the portion of the Phase I collection that has been retained in that phase.
Figure 1. MW-pI plot for the Phase I data sets. Each sequence entry's corresponding protein is represented by a (MW, pI) tuple, so the entire collection can be plotted in a two-dimensional space. To highlight differences in population density among regions of the graph, a variable called density controls the color of the scattered points. The plot space is viewed as a grid of small cells, each 10 kDa long along the X-axis and 0.1 pH wide along the Y-axis. Each cell's local population (the number of points that fall into it) is recorded and its ratio to the entire population is calculated. Since most of the ratios are very small numbers, they are multiplied by 10000 before plotting.
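The density variable described in this caption can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function name `density_grid`, its parameters, and the choice to index cells by integer (column, row) pairs are all assumptions.

```python
import math
from collections import Counter

def density_grid(points, mw_step=10.0, pi_step=0.1, scale=10000):
    """Bin (MW in kDa, pI) points into grid cells and return scaled population ratios.

    Hypothetical helper: cell size and scale follow the caption
    (10 kDa x 0.1 pH, ratios multiplied by 10000).
    """
    counts = Counter()
    for mw, pi in points:
        # round() before floor() guards against floating-point error such as 0.3 / 0.1
        col = math.floor(round(mw / mw_step, 6))
        row = math.floor(round(pi / pi_step, 6))
        counts[(col, row)] += 1
    total = len(points)
    # Ratio of each cell's local population to the entire population, scaled
    return {cell: scale * n / total for cell, n in counts.items()}

# Toy input, not data from the paper
example_points = [(25.0, 6.05), (28.0, 6.02), (45.0, 7.31), (52.0, 8.9)]
example_density = density_grid(example_points)
```

Each point would then be colored by the value stored for its cell.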
Figure 2. MW-pI plot for the Phase II data sets. Compared with Figure 1, narrower value ranges are used: MW between 20 and 70 kDa, and pI between 5 and 9.
Figure 3. MW-pI plot for the Phase III data sets. More restrictions are applied in obtaining the data sets at this phase, making the distribution of scattered points even sparser than in Figure 2.
Statistics of the proteins' molecular weight
| [0, 20) | [20, 40) | [40, 60) | [60, 80) | [80, 100) | [100, 120) | [120, 140) | [140, +∞) | ||||||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 13% | 0% | 0% | 0% | 0% | 8% | 0% | 5% | 0% | 3% | 0% | 7% | 0% | ||||
| II | 0% | 100% | 55% | 39% | 15% | 64% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| III | 0% | 100% | 83% | 83% | 14% | 89% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| I | 11% | 0% | 0% | 0% | 0% | 8% | 0% | 5% | 0% | 3% | 0% | 6% | 0% | ||||
| II | 0% | 100% | 59% | 38% | 15% | 63% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| III | 0% | 100% | 82% | 79% | 14% | 87% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| I | 19% | 0% | 0% | 0% | 0% | 5% | 0% | 3% | 0% | 2% | 0% | 3% | 0% | ||||
| II | 0% | 100% | 56% | 39% | 13% | 60% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| III | 0% | 100% | 92% | 89% | 13% | 93% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| I | 14% | 0% | 0% | 0% | 0% | 8% | 0% | 5% | 0% | 3% | 0% | 8% | 0% | ||||
| II | 0% | 100% | 55% | 39% | 17% | 64% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
| III | 0% | 100% | 85% | 84% | 14% | 91% | 0% | 100% | 0% | 100% | 0% | 100% | 0% | 100% | |||
This table reports statistics on the molecular weight of the proteins corresponding to the transcript sequences included in the Phase I, II, and III data sets (unit: kDa). To facilitate the analysis, the entire MW value range has been divided into several smaller ranges (e.g., 0 to 20 kDa, 20 to 40 kDa, etc.). Two measures are used throughout the table. PR, short for Population Ratio, is a horizontal comparison: it compares the protein population within a group against the other groups in the same phase. LR, short for Loss Ratio, is a vertical comparison: for a given group, it considers the portion of the protein population that has been lost through phase transitions. More specifically, given a particular aspect X under investigation (e.g., MW, pI, etc.), PR is the number of proteins whose value on X is within a particular range (e.g., [20, 40) kDa, [8, 9) pH, etc.) divided by the entire protein population in the same phase. LR is the number of eliminated proteins whose value on X is within a particular range Y divided by the original (Phase I) protein population within that range Y. Values highlighted in bold italic font are the outstanding numbers within their row.
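The two measures can be sketched in a few lines. The function names are hypothetical, and two assumptions are made that the caption does not state: a later phase is a filtered subset of Phase I (so the eliminated count is the Phase I count minus the retained count), and LR is reported as 0 for a range that is empty in Phase I.

```python
def population_ratio(phase_values, lo, hi):
    """PR: share of a phase's proteins whose value on X falls in [lo, hi)."""
    if not phase_values:
        return 0.0
    in_range = sum(lo <= v < hi for v in phase_values)
    return in_range / len(phase_values)

def loss_ratio(phase1_values, phase_values, lo, hi):
    """LR: share of Phase I proteins in [lo, hi) eliminated by the given phase.

    Assumes phase_values is a subset of phase1_values.
    """
    original = sum(lo <= v < hi for v in phase1_values)
    if original == 0:
        return 0.0  # assumption: undefined LR is reported as 0
    retained = sum(lo <= v < hi for v in phase_values)
    return (original - retained) / original

# Toy molecular weights in kDa, not data from the paper
phase1_mw = [15, 25, 30, 35, 45, 150]
phase2_mw = [25, 30, 45]
pr = population_ratio(phase2_mw, 20, 40)       # 2 of 3 Phase II proteins
lr = loss_ratio(phase1_mw, phase2_mw, 20, 40)  # 1 of 3 Phase I proteins lost
```

With these definitions LR is always 0 for Phase I itself, which matches the Phase I rows of the tables.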
Analysis of the data sets at Phases I, II, and III by isoelectric point (Unit: pH)
| [0, 5) | [5, 6) | [6, 8) | [8, 9) | [9, 14] | |||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 8% | 0% | 0% | 0% | 0% | 26% | 0% | ||||
| II | 0% | 100% | 27% | 53% | 52% | 25% | 51% | 1% | 98% | ||
| III | 0% | 100% | 30% | 83% | 86% | 25% | 84% | 1% | 99% | ||
| I | 8% | 0% | 0% | 0% | 0% | 27% | 0% | ||||
| II | 0% | 100% | 26% | 51% | 50% | 28% | 50% | 1% | 98% | ||
| III | 0% | 100% | 28% | 80% | 82% | 27% | 82% | 1% | 99% | ||
| I | 10% | 0% | 0% | 0% | 0% | 28% | 0% | ||||
| II | 0% | 100% | 28% | 45% | 46% | 26% | 49% | 1% | 98% | ||
| III | 0% | 100% | 30% | 90% | 90% | 24% | 92% | 0% | 100% | ||
| I | 9% | 0% | 0% | 0% | 0% | 27% | 0% | ||||
| II | 0% | 100% | 27% | 53% | 55% | 24% | 55% | 1% | 98% | ||
| III | 0% | 100% | 28% | 86% | 88% | 25% | 87% | 1% | 99% | ||
Statistics regarding the hydropathy plot
| [0, 1) | [1, 2) | [2, 3) | [3, 4) | 4+ | |||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 1% | 0% | 11% | 0% | 0% | 0% | 2% | 0% | |||
| II | 0% | 100% | 7% | 80% | 63% | 34% | 70% | 0% | 100% | ||
| III | 0% | 100% | 8% | 92% | 88% | 33% | 91% | 0% | 100% | ||
| I | 1% | 0% | 10% | 0% | 0% | 0% | 3% | 0% | |||
| II | 0% | 100% | 6% | 80% | 61% | 37% | 70% | 2% | 78% | ||
| III | 0% | 100% | 8% | 90% | 85% | 34% | 89% | 1% | 96% | ||
| I | 1% | 0% | 11% | 0% | 0% | 0% | 3% | 0% | |||
| II | 0% | 100% | 7% | 78% | 58% | 38% | 69% | 1% | 88% | ||
| III | 0% | 100% | 8% | 95% | 92% | 36% | 95% | 1% | 98% | ||
| I | 1% | 0% | 11% | 0% | 0% | 0% | 3% | 0% | |||
| II | 0% | 100% | 7% | 81% | 64% | 34% | 72% | 0% | 100% | ||
| III | 0% | 100% | 9% | 93% | 89% | 31% | 92% | 0% | 100% | ||
This table reports statistics from the analysis of the corresponding proteins' hydropathy plots using the Kyte-Doolittle scale. The parameter investigated is the maximal hydrophobicity, which corresponds to the global maximum of a continuous hydropathy plot.
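A hydropathy plot and its global maximum can be sketched as below. The per-residue values are the published Kyte-Doolittle indices; the sliding-window averaging, the window size of 9, and the function names are assumptions, since the page does not state how the plots were computed.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Sliding-window mean of Kyte-Doolittle values (window size is an assumption)."""
    scores = [KD[aa] for aa in protein.upper()]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        # Slide the window by one residue: add the new score, drop the oldest
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile

def maximal_hydrophobicity(profile):
    """Global maximum of the hydropathy plot."""
    return max(profile)

# Toy sequence with a short window so the values are easy to check by hand
toy_profile = hydropathy_profile("IIIRRR", window=3)
toy_max = maximal_hydrophobicity(toy_profile)
```

A different window size would smooth the plot differently and shift the maxima, so the binning in the table depends on that choice.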
Statistics regarding the analysis of the corresponding proteins' hydropathy plot
| Below -4 | [-4, -3) | [-3, -2) | [-2, -1) | [-1, 0) | |||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 5% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | |||
| II | 1% | 93% | 68% | 28% | 65% | 0% | 100% | 0% | 0% | ||
| III | 1% | 98% | 90% | 32% | 87% | 0% | 100% | 0% | 0% | ||
| I | 4% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | |||
| II | 1% | 92% | 67% | 33% | 62% | 0% | 100% | 0% | 0% | ||
| III | 1% | 97% | 88% | 33% | 85% | 0% | 100% | 0% | 0% | ||
| I | 3% | 0% | 0% | 0% | 2% | 0% | 0% | 0% | |||
| II | 1% | 88% | 64% | 32% | 66% | 0% | 100% | 0% | 0% | ||
| III | 1% | 98% | 93% | 30% | 94% | 0% | 100% | 0% | 0% | ||
| I | 6% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | |||
| II | 1% | 95% | 70% | 28% | 65% | 0% | 100% | 0% | 0% | ||
| III | 1% | 98% | 91% | 28% | 90% | 0% | 100% | 0% | 0% | ||
This table focuses on the analysis of minimal hydrophobicity, which corresponds to the global minimal value of a continuous hydrophobicity plot.
Statistics of proteins' hydropathy plots at different phases
| [0, 1) | [1, 2) | [2, 3) | [3, 4) | 4+ | |||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 0% | 19% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| II | 66% | 13% | 78% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| III | 89% | 14% | 92% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| I | 0% | 24% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| II | 65% | 18% | 76% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| III | 86% | 17% | 91% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| I | 0% | 28% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| II | 61% | 16% | 81% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| III | 93% | 15% | 96% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| I | 0% | 18% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| II | 67% | 11% | 82% | 0% | 0% | 0% | 0% | 0% | 0% | ||
| III | 90% | 11% | 94% | 0% | 0% | 0% | 0% | 0% | 0% | ||
The table reports the statistics of the data sets at different phases with the focus on the proteins' hydropathy plot. The particular metric called average positive hydrophobicity is computed by averaging all the hydrophobicity index values that are higher than or equal to 0 within one plot.
A summary of the data on average negative hydrophobicity
| Below -4 | [-4, -3) | [-3, -2) | [-2, -1) | [-1, 0) | |||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 35% | 0% | ||
| II | 0% | 0% | 0% | 0% | 0% | 0% | 72% | 42% | 62% | ||
| III | 0% | 0% | 0% | 0% | 0% | 0% | 91% | 42% | 88% | ||
| I | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 37% | 0% | ||
| II | 0% | 0% | 0% | 0% | 0% | 0% | 71% | 43% | 62% | ||
| III | 0% | 0% | 0% | 0% | 0% | 0% | 89% | 43% | 86% | ||
| I | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 38% | 0% | ||
| II | 0% | 0% | 0% | 0% | 0% | 0% | 70% | 43% | 62% | ||
| III | 0% | 0% | 0% | 0% | 0% | 0% | 94% | 39% | 94% | ||
| I | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 31% | 0% | ||
| II | 0% | 0% | 0% | 0% | 0% | 0% | 74% | 41% | 61% | ||
| III | 0% | 0% | 0% | 0% | 0% | 0% | 92% | 38% | 89% | ||
This parameter is obtained by averaging all the hydrophobicity values that are less than 0 within one plot.
Statistics of hydrophobic residues within the context of hydropathy plots
| [0, 20) | [20, 40) | [40, 60) | [60, 80) | [80, 100] | |||||||
| PR | LR | PR | LR | PR | LR | PR | LR | PR | LR | ||
| I | 7% | 0% | 0% | 0% | 6% | 0% | 0% | 0% | |||
| II | 2% | 91% | 67% | 61% | 1% | 94% | 0% | 0% | |||
| III | 3% | 95% | 90% | 87% | 2% | 96% | 0% | 0% | |||
| I | 6% | 0% | 0% | 0% | 10% | 0% | 0% | 0% | |||
| II | 2% | 89% | 67% | 59% | 5% | 84% | 0% | 0% | |||
| III | 3% | 94% | 87% | 85% | 4% | 95% | 0% | 0% | |||
| I | 7% | 0% | 0% | 0% | 10% | 0% | 0% | 0% | |||
| II | 2% | 90% | 63% | 59% | 2% | 93% | 0% | 0% | |||
| III | 3% | 97% | 93% | 93% | 3% | 98% | 0% | 0% | |||
| I | 7% | 0% | 0% | 0% | 4% | 0% | 0% | 0% | |||
| II | 2% | 91% | 70% | 63% | 1% | 92% | 0% | 0% | |||
| III | 3% | 96% | 91% | 90% | 1% | 97% | 0% | 0% | |||
Given one continuous plot, the percentage is calculated by dividing the number of nucleotide residues that correspond to a positive (0 included) hydropathy index by that in a complete ORF.
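The average positive hydrophobicity, average negative hydrophobicity, and hydrophobic percentage described in these captions can be sketched from a list of plot values. The function names are hypothetical, and the sketch assumes one hydropathy index per plotted position, so the percentage's denominator is the number of plot positions rather than an ORF length.

```python
def average_positive(profile):
    """Mean of all plot values greater than or equal to 0, or None if there are none."""
    positive = [v for v in profile if v >= 0]
    return sum(positive) / len(positive) if positive else None

def average_negative(profile):
    """Mean of all plot values less than 0, or None if there are none."""
    negative = [v for v in profile if v < 0]
    return sum(negative) / len(negative) if negative else None

def hydrophobic_percentage(profile):
    """Percentage of positions whose hydropathy index is positive (0 included)."""
    if not profile:
        return 0.0
    return 100.0 * sum(v >= 0 for v in profile) / len(profile)

# Toy plot values, not data from the paper
toy_profile = [4.5, 1.5, -1.5, -4.5]
summary = (
    average_positive(toy_profile),
    average_negative(toy_profile),
    hydrophobic_percentage(toy_profile),
)
```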
The population density of the proteins having a pI value within range [7.5, 7.9] or [8.1, 8.5]
| Phase | N0 | N1 | R1 | N2 | R2 | |
| II | 8006 | 365 | 4.56% | 548 | 6.84% | |
| III | 2498 | 95 | 3.80% | 163 | 6.53% | |
| II | 6723 | 306 | 4.55% | 507 | 7.54% | |
| III | 2504 | 95 | 3.79% | 188 | 7.51% | |
| II | 7883 | 337 | 4.28% | 577 | 7.32% | |
| III | 1342 | 57 | 4.25% | 93 | 6.93% | |
| II | 5900 | 246 | 4.17% | 401 | 6.80% | |
| III | 1670 | 51 | 3.05% | 103 | 6.17% | |
This table provides an illustration of the population density of the proteins that have a pI value within range [7.5, 7.9] or [8.1, 8.5] for Phases II, III data sets. N0 denotes the total count of the proteins in Phase II or Phase III data set; N1 refers to the number of proteins whose pI is in [7.5, 7.9], R1 is equal to N1 over N0; and N2 denotes the population of the proteins having an isoelectric point in range [8.1, 8.5] and R2 is its corresponding ratio.
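As a small sketch of how N1, R1, N2 and R2 relate to N0, the hypothetical helper below counts the proteins whose pI lies in a closed range and divides by the total.

```python
def range_density(pi_values, lo, hi):
    """Return (count, ratio) of proteins whose pI lies in the closed range [lo, hi]."""
    n0 = len(pi_values)
    n = sum(lo <= v <= hi for v in pi_values)
    return n, (n / n0 if n0 else 0.0)

# Toy pI values, not data from the paper
toy_pi = [5.2, 7.6, 7.9, 8.0, 8.3, 9.1, 6.4, 8.5]
n1, r1 = range_density(toy_pi, 7.5, 7.9)
n2, r2 = range_density(toy_pi, 8.1, 8.5)

# The first row of the table is consistent with R1 = N1 / N0 and R2 = N2 / N0
r1_first_row = round(100 * 365 / 8006, 2)
r2_first_row = round(100 * 548 / 8006, 2)
```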
Figure 4. MW-pI plots for the C. elegans II data sets in three phases. These plots show the distribution of the Phase I, II, and III data sets for C. elegans II and demonstrate the applicability of the proposed data set generation method in the absence of an a priori parameter configuration scheme. We suggest that the user divide the entire range of each parameter (e.g., MW and pI) into subgroups and use the proposed algorithm to handle each subgroup individually. A representative data set is then obtained by combining all of the subgroups. The plots show that eventually (in Phase III) the most populated regions converge to certain ranges, i.e., MW between 20 and 60 kDa and pI between 4 and 10. A bimodal pattern can also be observed.
Contingency Matrix
| Classified as True | Classified as False | |
| Actual True | TP | FN |
| Actual False | FP | TN |
Experimental results using six existing TIS predictors on our transcript collections
| Data Set | Approach | Sn | Sp | AA | OA |
| NetStart | 83.59% | 69.35% | 76.47% | 70.00% | |
| TISHunter | 94.99% | 99.79% | 97.39% | 99.57% | |
| CUB | 80.14% | 99.04% | 89.54% | 98.17% | |
| StartScan | 96.72% | 61.29% | 79.01% | 62.92% | |
| GENSCAN | 79.02% | 99.11% | 89.07% | 98.19% | |
| GlimmerHMM | 68.29% | 98.39% | 83.34% | 96.94% | |
| NetStart | 83.43% | 68.00% | 75.71% | 68.70% | |
| TISHunter | 70.42% | 98.76% | 84.59% | 97.48% | |
| CUB | 79.11% | 99.01% | 89.06% | 98.11% | |
| GENSCAN | 84.07% | 99.35% | 91.71% | 98.66% | |
| CUB | 81.30% | 99.28% | 90.29% | 98.62% | |
| CUB | 83.65% | 99.19% | 91.42% | 98.45% | |
| CUB | 79.99% | 99.19% | 89.59% | 98.45% | |
Four metrics are employed in our experiments: sensitivity (Sn), specificity (Sp), adjusted accuracy (AA) and overall accuracy (OA).
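The four metrics can be computed from the contingency matrix counts. The sketch below assumes the standard definitions of sensitivity, specificity and overall accuracy, which this page does not spell out; taking AA as the mean of Sn and Sp is consistent with the reported rows (for example, 83.59% and 69.35% give 76.47%). The function name `tis_metrics` is hypothetical.

```python
def tis_metrics(tp, fn, fp, tn):
    """Compute Sn, Sp, AA and OA from contingency-matrix counts (assumed definitions)."""
    sn = tp / (tp + fn)                   # sensitivity: true TISs recovered
    sp = tn / (tn + fp)                   # specificity: non-TISs correctly rejected
    aa = (sn + sp) / 2                    # adjusted accuracy: mean of Sn and Sp
    oa = (tp + tn) / (tp + fn + fp + tn)  # overall accuracy: all correct calls
    return {"Sn": sn, "Sp": sp, "AA": aa, "OA": oa}

# Toy counts, not data from the paper
toy_metrics = tis_metrics(tp=80, fn=20, fp=10, tn=90)
```

Because true TISs are typically far outnumbered by non-TIS candidates, OA tends to track Sp closely, which is why AA is reported alongside it.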
Experimental results using three existing TIS predictors on benchmark collections
| Data Set | Approach | Sn | Sp | AA | OA |
| NetStart | 82.25% | 87.80% | 85.02% | 86.44% | |
| CUB | 89.58% | 96.61% | 93.10% | 94.89% | |
| GENSCAN | 0.24% | 90.25% | 45.24% | 68.17% | |
| NetStart | 97.32% | 88.79% | 93.06% | 90.97% | |
| CUB | 91.78% | 97.18% | 94.48% | 95.80% | |
| GENSCAN | 0.57% | 89.31% | 44.94% | 66.65% | |
| NetStart | 88.00% | 69.93% | 78.97% | 71.78% | |
| CUB | 80.00% | 97.72% | 88.86% | 95.91% | |
| GENSCAN | 64.00% | 98.41% | 81.20% | 94.89% | |
This table reports the results using three TIS predictors on vertebrates (vert.), Arabidopsis thaliana (Arab.) and TIS+50 data sets.