Antonio Lijoi, Ramsés H Mena, Igor Prünster.
Abstract
BACKGROUND: Expressed sequence tag (EST) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of a given size and to determine the gene discovery rate. These estimates are the basis for deciding whether to proceed with sequencing the library and, if so, serve as a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library.
Year: 2007 PMID: 17868445 PMCID: PMC2220008 DOI: 10.1186/1471-2105-8-339
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
EST surveys information clustered into levels of expression
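| Library | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Naegleria gruberi aerobic | 346 | 57 | 19 | 12 | 9 | 5 | 4 | 2 | 4 | 5 | 4 |
| Naegleria gruberi anaerobic | 491 | 72 | 30 | 9 | 13 | 5 | 3 | 1 | 2 | 0 | 1 |
| Mastigamoeba non-normalized | 378 | 33 | 21 | 9 | 6 | 1 | 3 | 1 | 1 | 1 | 0 |
| Mastigamoeba normalized | 200 | 21 | 14 | 4 | 3 | 3 | 1 | 0 | 1 | 0 | 0 |

| Library | 12 | 13 | 14 | 15 | 16 | | | | | Distinct genes | Sample size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Naegleria gruberi aerobic | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 473 | 959 |
| Naegleria gruberi anaerobic | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 631 | 969 |
| Mastigamoeba non-normalized | 0 | 1 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 460 | 715 |
| Mastigamoeba normalized | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 248 | 363 |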
Source: Susko and Roger [6]
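The table can be checked for internal consistency with a short sketch. It assumes that the first sixteen count columns give the number of genes represented by 1, 2, ..., 16 ESTs, and that the last two columns are the number of distinct genes and the sample size. Under that reading, the three rows with no entries beyond the sixteenth column reproduce their totals exactly. The library names and the `summarize` helper are labels introduced here for illustration.

```python
# counts[r - 1] = number of genes observed exactly r times (r = 1..16).
# Assumption: table columns correspond to expression levels 1, 2, ..., 16.
libraries = {
    "Naegleria anaerobic": (
        [491, 72, 30, 9, 13, 5, 3, 1, 2, 0, 1, 0, 1, 3, 0, 0], 631, 969),
    "Mastigamoeba non-normalized": (
        [378, 33, 21, 9, 6, 1, 3, 1, 1, 1, 0, 0, 1, 0, 5, 0], 460, 715),
    "Mastigamoeba normalized": (
        [200, 21, 14, 4, 3, 3, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0], 248, 363),
}


def summarize(counts):
    """Return (distinct genes, total ESTs) implied by the frequency counts."""
    distinct = sum(counts)
    total = sum(r * m for r, m in enumerate(counts, start=1))
    return distinct, total


for name, (counts, distinct, total) in libraries.items():
    print(name, summarize(counts) == (distinct, total))
```

The aerobic Naegleria row is left out because it has genes at expression levels whose column labels are not given here.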
Figure 1. Expected number of new genes: comparison with Good-Toulmin estimator. Expected number of new genes in an additional sample for the Naegleria gruberi aerobic and anaerobic libraries arising from the application of the Good-Toulmin estimator and of the Bayesian nonparametric estimator.
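For orientation, the Good-Toulmin estimator in its standard textbook form predicts the number of new genes in an additional sample of size m = t·n as the alternating series Σ_r (−1)^(r+1) t^r m_r, where m_r is the number of genes observed exactly r times. The sketch below is a minimal implementation of that standard form, not necessarily the exact variant used for the figure. The function name is hypothetical, and the counts are the anaerobic Naegleria row of the table above, assuming its columns are expression levels 1 to 16.

```python
def good_toulmin(counts, t):
    """Standard Good-Toulmin estimate of new genes in an extra sample.

    counts: dict mapping r -> number of genes observed exactly r times.
    t:      size of the additional sample as a fraction of the initial one.
    """
    return sum((-1) ** (r + 1) * (t ** r) * m for r, m in counts.items())


# Anaerobic Naegleria library (assumed expression levels 1..16, zeros omitted).
anaerobic = {1: 491, 2: 72, 3: 30, 4: 9, 5: 13, 6: 5, 7: 3,
             8: 1, 9: 2, 11: 1, 13: 1, 14: 3}

for t in (0.5, 1.0):
    print(t, good_toulmin(anaerobic, t))
```

Because the terms grow like t^r, the series is known to become unstable for t larger than 1, so the sketch is only evaluated up to an additional sample of the same size as the initial one.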
Estimates for the Mastigamoeba libraries
Non-normalized library

| % of initial sample | Additional sample size | Expected number of new genes | Probability of discovering a new gene |
|---|---|---|---|
| 50 | 358 | 180 ∈ (158, 204) | 0.481 ∈ (0.466, 0.498) |
| 100 | 715 | 346 ∈ (312, 382) | 0.452 ∈ (0.434, 0.470) |
| 150 | 1072 | 503 ∈ (458, 550) | 0.430 ∈ (0.411, 0.449) |
| 200 | 1430 | 654 ∈ (599, 711) | 0.412 ∈ (0.393, 0.433) |
| 250 | 1788 | 799 ∈ (734, 866) | 0.398 ∈ (0.379, 0.419) |
| 300 | 2145 | 939 ∈ (865, 1015) | 0.386 ∈ (0.367, 0.407) |

Normalized library

| % of initial sample | Additional sample size | Expected number of new genes | Probability of discovering a new gene |
|---|---|---|---|
| 50 | 182 | 94 ∈ (79, 111) | 0.493 ∈ (0.475, 0.512) |
| 100 | 363 | 180 ∈ (156, 206) | 0.456 ∈ (0.434, 0.479) |
| 150 | 544 | 260 ∈ (229, 293) | 0.428 ∈ (0.406, 0.452) |
| 200 | 726 | 336 ∈ (299, 375) | 0.406 ∈ (0.384, 0.430) |
| 250 | 908 | 408 ∈ (365, 453) | 0.389 ∈ (0.366, 0.412) |
| 300 | 1089 | 477 ∈ (428, 528) | 0.374 ∈ (0.351, 0.398) |
Non-normalized and normalized Mastigamoeba libraries: the first column provides the size of the additional sample in % of the size of the initial sample, the second the actual size of the additional survey, the third presents the expected number of new genes and the fourth the discovery probability. The estimates in the third and fourth columns are accompanied by the 95% highest posterior density intervals.
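The relation between the first two columns can be reproduced with a one-line computation. The sketch below assumes initial sample sizes of 715 (non-normalized) and 363 (normalized), which are the sizes listed at 100%. The tabulated values happen to agree with round-half-to-even, which is what Python's built-in `round` does. Whether that rounding rule was the one actually used is an assumption, and the function name is hypothetical.

```python
def additional_sample_size(percent, n):
    """Size of an additional sample that is `percent` % of an initial sample of n.

    Uses Python's round(), i.e. round-half-to-even (an assumption about the
    rounding rule; it matches the tabulated sizes).
    """
    return round(percent * n / 100)


percents = (50, 100, 150, 200, 250, 300)
for n in (715, 363):
    print(n, [additional_sample_size(p, n) for p in percents])
```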
Estimates for the Naegleria libraries
Aerobic library

| % of initial sample | Additional sample size | Expected number of new genes | Probability of discovering a new gene |
|---|---|---|---|
| 50 | 480 | 162 ∈ (138, 188) | 0.318 ∈ (0.307, 0.329) |
| 100 | 959 | 307 ∈ (271, 345) | 0.290 ∈ (0.277, 0.303) |
| 150 | 1438 | 441 ∈ (394, 488) | 0.270 ∈ (0.257, 0.282) |
| 200 | 1918 | 566 ∈ (510, 624) | 0.254 ∈ (0.241, 0.267) |
| 250 | 2398 | 685 ∈ (619, 751) | 0.242 ∈ (0.229, 0.255) |
| 300 | 2877 | 798 ∈ (725, 873) | 0.231 ∈ (0.219, 0.244) |

Anaerobic library

| % of initial sample | Additional sample size | Expected number of new genes | Probability of discovering a new gene |
|---|---|---|---|
| 50 | 484 | 231 ∈ (206, 258) | 0.450 ∈ (0.440, 0.461) |
| 100 | 969 | 440 ∈ (402, 478) | 0.412 ∈ (0.400, 0.424) |
| 150 | 1454 | 632 ∈ (583, 683) | 0.384 ∈ (0.371, 0.397) |
| 200 | 1938 | 812 ∈ (753, 873) | 0.362 ∈ (0.349, 0.375) |
| 250 | 2422 | 983 ∈ (915, 1053) | 0.344 ∈ (0.332, 0.357) |
| 300 | 2907 | 1146 ∈ (1069, 1225) | 0.330 ∈ (0.317, 0.342) |
Naegleria aerobic and anaerobic libraries: the first column provides the size of the additional sample in % of the size of the initial sample, the second the actual size of the additional survey, the third presents the expected number of new genes and the fourth the discovery probability. The estimates in the third and fourth columns are accompanied by the 95% highest posterior density intervals.
Figure 2. Discovery rate. Bayesian nonparametric estimates of the discovery rate associated with the non-normalized and normalized Mastigamoeba libraries.
Figure 3. Expected number of new genes. Expected number of new genes in an additional sample and corresponding 95% highest posterior density intervals for the Naegleria gruberi aerobic and anaerobic libraries arising from the application of the Bayesian nonparametric method.
Figure 4. Contour plots of Pitman's sampling formula. Contour plots of Pitman's sampling formula, as a function of (σ, θ), corresponding to the two Naegleria gruberi datasets: aerobic (right) and anaerobic (left).
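A surface of this kind can be sketched by evaluating the log of Pitman's sampling formula on a grid of (σ, θ) values. The code below uses the standard two-parameter form, with 0 ≤ σ < 1 and θ > −σ, written in terms of the counts m_r of genes observed r times. That this matches the exact parameterization behind the figure is an assumption. The function name, the coarse grid, and the use of the anaerobic counts (read as expression levels 1 to 16) are illustrative choices.

```python
import math


def pitman_log_likelihood(counts, sigma, theta):
    """Log of Pitman's sampling formula for frequency counts.

    counts: dict mapping r -> m_r, the number of genes observed exactly r times.
    P = n! * prod_{i=1}^{k-1}(theta + i*sigma) / (theta + 1)_{n-1}
           * prod_r [ ((1 - sigma)_{r-1} / r!)^{m_r} / m_r! ]
    where (a)_j is the rising factorial, n = sum r*m_r and k = sum m_r.
    """
    if not 0 <= sigma < 1 or theta <= -sigma:
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")
    n = sum(r * m for r, m in counts.items())
    k = sum(counts.values())
    log_p = math.lgamma(n + 1)
    log_p += sum(math.log(theta + i * sigma) for i in range(1, k))
    log_p -= math.lgamma(theta + n) - math.lgamma(theta + 1)
    for r, m in counts.items():
        log_rising = math.lgamma(r - sigma) - math.lgamma(1 - sigma)
        log_p += m * (log_rising - math.lgamma(r + 1)) - math.lgamma(m + 1)
    return log_p


# Anaerobic Naegleria counts (assumed expression levels 1..16, zeros omitted).
anaerobic = {1: 491, 2: 72, 3: 30, 4: 9, 5: 13, 6: 5, 7: 3,
             8: 1, 9: 2, 11: 1, 13: 1, 14: 3}

# Coarse illustrative grid; a contour plot would use a much finer one.
grid = [(s / 20, th) for s in range(0, 20) for th in range(50, 2001, 50)]
best = max(grid, key=lambda p: pitman_log_likelihood(anaerobic, *p))
print("grid maximizer (sigma, theta):", best)
```

Working on the log scale with `math.lgamma` avoids overflow in the factorial and rising-factorial terms, which would be unavoidable at sample sizes near 1000.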