Luis Fernando García-Ortega, Octavio Martínez.
Abstract
RNA-seq experiments estimate the number of genes expressed in a transcriptome as well as their relative frequencies. However, an undetermined number of genes can remain undetected due to their low expression relative to the sample size (sequence depth). Estimation of the true number of genes expressed in a transcriptome is essential in order to determine which genes are exclusively expressed in specific tissues or under particular conditions. A reliable estimate of the true number of expressed genes is also required to accurately measure transcriptome changes and to predict the sequencing depth needed to increase the proportion of detected genes. This problem is analogous to ecological sampling problems such as estimating the number of species at a given site. Here we present a non-parametric estimator for the number of undetected genes as well as for the extra sample size needed to detect a given proportion of the undetected genes. Our estimators are superior to ones already published by having smaller standard errors and biases. We applied our method to a set of 32 publicly available RNA-seq experiments, including the evaluation of 311 individually sequenced libraries. We found that in the majority of the cases more than one thousand genes are undetected, and that on average approximately 6% of the expressed genes per accession remain undetected. This figure increases to approximately 10% if individual sequencing libraries are analyzed. Our method is also applicable to metagenomic experiments. Using our method, the number of undetected genes as well as the sample size needed to detect them can be calculated, leading to more accurate and complete gene expression studies.
Year: 2015 PMID: 26107654 PMCID: PMC4479379 DOI: 10.1371/journal.pone.0130262
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of estimators.
| Estimator of $f_0$ | Standard error | %se(Ch1) | $r^2$ | Error min. | Error median | Error mean | Error max. |
|---|---|---|---|---|---|---|---|
| Chao1 | 384 | 100.00 | 0.9664 | -3723 | -22 | -140 | 57 |
| iChao1 | 306 | 79.57 | 0.9651 | -3268 | -7 | -91 | 111 |
| Medial | 141 | 36.59 | 0.9665 | -1857 | 54 | 61 | 551 |
| $h_6$ | 85 | 22.04 | 0.9897 | -1563 | 3 | -3 | 438 |
Comparison of the Chao1, iChao1, Medial and $h_6$ estimators of $f_0$, evaluated in B = 100,000 bootstrap replicates of the complete dataset (accession GSE1581) using random sample sizes uniformly distributed between 1 and 160.5 million tags. For each of the four estimators the table presents the estimated standard error, the percentage of that standard error relative to the standard error of Chao1 (%se(Ch1)), the estimated coefficient of determination between $\hat{f}_0$ and $f_0$ ($r^2$), and statistics for the errors (minimum, median, mean and maximum).
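For orientation, the Chao1 baseline in this comparison has a simple closed form based on the number of genes seen exactly once ($f_1$) and exactly twice ($f_2$). The sketch below is a minimal illustration of the classic Chao1 estimate of $f_0$, not the authors' code. The function name `chao1_undetected` and the list-of-counts input are assumptions made here, and the iChao1, Medial and $h_6$ estimators are not reproduced because their formulas do not appear in this section.

```python
from collections import Counter


def chao1_undetected(counts):
    """Classic Chao1 estimate of the number of undetected genes (f0).

    counts: iterable of per-gene tag counts observed in one sample
            (hypothetical input format, chosen for illustration).
    """
    # Frequency of frequencies: how many genes were seen exactly k times.
    freq = Counter(c for c in counts if c > 0)
    f1 = freq.get(1, 0)  # genes seen exactly once
    f2 = freq.get(2, 0)  # genes seen exactly twice
    if f2 > 0:
        return f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here only to avoid dividing by zero.
    return f1 * (f1 - 1) / 2.0
```

For example, the counts `[1, 1, 1, 1, 2, 2, 5, 10]` have $f_1 = 4$ and $f_2 = 2$, so the estimate is 4 undetected genes.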
Fig 1. Scatterplot of true ($f_0$; X axis) and estimated ($\hat{f}_0$; Y axis) values for four estimators.
The plot shows 10,000 pairs of true and estimated values, ($f_0$, $\hat{f}_0$), obtained with four estimators (harmonic of degree 6, $h_6$, in red; Medial in dark green; Chao1 in blue; and iChao1 in brown) in random samples of the complete dataset (accession GSE1581). Sample sizes vary uniformly between 1 and 160.5 million tags. Panel A presents the plot over the complete intervals, while panel B presents a close-up of a restricted range of values.
Statistics for three RNA-seq datasets.
| Accession (dataset) | $N$ | $g$ | $f_1$ | $\hat{f}_0$ (Chao1) | $\hat{f}_0$ (Medial) | $\hat{f}_0$ ($h_6$) | SE (Chao1) | SE (Medial) | SE ($h_6$) |
|---|---|---|---|---|---|---|---|---|---|
| Human MPSS | 31,411,949 | 22,935 | 3 | 0 | 1 | 0 | 929 | 498 | 358 |
| E-GEOD-38298 | 35,973,307 | 6,096 | 9 | 1 | 2 | 7 | 41 | 27 | 25 |
| E-GEOD-46953 | 415,562,392 | 18,752 | 40 | 8 | 16 | 25 | 248 | 111 | 95 |
Statistics for three RNA-seq datasets, including estimated standard errors, for the Chao1, Medial and $h_6$ estimators. The table presents sample sizes, $N$; the observed number of genes, $g$; values of $f_1$; values of $f_0$ estimated in the datasets (columns 5 to 7, one per estimator, rounded figures); and the standard errors of $\hat{f}_0$ for Chao1, Medial and $h_6$, obtained from B = 100,000 bootstrap replicates (columns 8 to 10). See text and details in S1 Table.
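A bootstrap standard error of this kind can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure of the paper, whose details are in its text and S1 Table: tags are resampled with replacement in proportion to the observed counts, the estimator is recomputed on each replicate, and the standard deviation of the replicates is reported. The names `bootstrap_se` and `chao1_f0`, the default of 200 replicates and the use of classic Chao1 as the plugged-in estimator are all choices made here for illustration.

```python
import random
import statistics
from collections import Counter


def chao1_f0(counts):
    """Classic Chao1 estimate of undetected genes (illustrative estimator)."""
    freq = Counter(c for c in counts if c > 0)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    return f1 * f1 / (2.0 * f2) if f2 > 0 else f1 * (f1 - 1) / 2.0


def bootstrap_se(counts, estimator, n_boot=200, seed=0):
    """Bootstrap standard error of `estimator` applied to per-gene counts.

    Assumption: each replicate redraws sum(counts) tags with replacement,
    with gene probabilities proportional to the observed counts.
    """
    rng = random.Random(seed)
    genes = range(len(counts))
    n_tags = sum(counts)
    estimates = []
    for _ in range(n_boot):
        # One replicate: redraw every tag, then recount tags per gene.
        resampled = Counter(rng.choices(genes, weights=counts, k=n_tags))
        estimates.append(estimator(list(resampled.values())))
    return statistics.stdev(estimates)
```

A call such as `bootstrap_se(counts, chao1_f0, n_boot=1000)` returns a single non-negative number. Redrawing every tag individually is slow at the sample sizes in the table, so a real analysis would need a vectorised multinomial sampler.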
Statistics for the ‘total’ libraries for 31 accessions from different organisms.
| Row | Accession | Organism | $N$ | $g$ | $h_6$ | se($h_6$) | 95% CI lower | 95% CI upper | % |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GSE1581 | | 160.6 | 23,332 | 0 | 5 | 0 | 9 | 0 |
| 2 | HumanMPSS | | 31.4 | 22,935 | 1 | 2 | 0 | 5 | 0 |
| 3 | E-GEOD-38298 | | 36.0 | 6,096 | 7 | 6 | 0 | 20 | 0 |
| 4 | Sunflower | | 579.7 | 36,314 | 23 | 5 | 13 | 33 | 0 |
| 5 | E-GEOD-46953 | | 415.6 | 18,752 | 25 | 8 | 9 | 41 | 0 |
| 6 | E-GEOD-48862 | | 404.8 | 22,534 | 38 | 17 | 5 | 71 | 0 |
| 7 | E-GEOD-38435 | | 150.5 | 24,293 | 39 | 8 | 23 | 55 | 0 |
| 8 | E-GEOD-43667 | | 258.0 | 22,419 | 53 | 23 | 8 | 98 | 0 |
| 9 | E-MTAB-1178 | | 496.9 | 27,982 | 137 | 12 | 113 | 161 | 0 |
| 10 | E-GEOD-51091 | | 101.4 | 9,269 | 289 | 21 | 248 | 331 | 3 |
| 11 | E-GEOD-34914 | | 314.0 | 20,422 | 291 | 25 | 242 | 340 | 1 |
| 12 | E-GEOD-27971 | | 64.6 | 23,770 | 383 | 24 | 336 | 429 | 2 |
| 13 | E-GEOD-44171 | | 228.5 | 20,857 | 521 | 31 | 460 | 583 | 2 |
| 14 | E-GEOD-48147 | | 108.8 | 17,677 | 1,250 | 44 | 1,163 | 1,337 | 7 |
| 15 | E-GEOD-40285 | | 30.8 | 19,885 | 1,576 | 53 | 1,471 | 1,680 | 7 |
| 16 | E-GEOD-45474 | | 371.0 | 20,998 | 1,613 | 52 | 1,511 | 1,715 | 7 |
| 17 | E-GEOD-37544 | | 38.2 | 16,920 | 1,680 | 57 | 1,569 | 1,791 | 9 |
| 18 | E-GEOD-53024 | | 141.9 | 32,471 | 1,760 | 56 | 1,651 | 1,870 | 5 |
| 19 | E-GEOD-56890 | | 53.2 | 17,424 | 1,761 | 54 | 1,656 | 1,867 | 9 |
| 20 | E-GEOD-42960 | | 89.9 | 18,593 | 1,881 | 56 | 1,772 | 1,991 | 9 |
| 21 | E-GEOD-47735 | | 54.4 | 21,370 | 1,914 | 61 | 1,794 | 2,034 | 8 |
| 22 | E-MTAB-651 | | 191.6 | 18,429 | 2,050 | 61 | 1,931 | 2,168 | 10 |
| 23 | E-GEOD-29992 | | 28.1 | 21,446 | 2,429 | 63 | 2,305 | 2,553 | 10 |
| 24 | E-GEOD-29162 | | 31.9 | 39,013 | 2,752 | 66 | 2,624 | 2,881 | 7 |
| 25 | E-GEOD-16868 | | 10.0 | 21,602 | 3,421 | 76 | 3,272 | 3,571 | 14 |
| 26 | E-GEOD-16789 | | 5.4 | 24,743 | 4,270 | 83 | 4,107 | 4,433 | 15 |
| 27 | E-GEOD-29163 | | 257.3 | 54,644 | 4,295 | 84 | 4,130 | 4,460 | 7 |
| 28 | GSE54123 | | 8.0 | 34,066 | 4,786 | 113 | 4,565 | 5,008 | 12 |
| 29 | E-GEOD-29134 | | 103.8 | 48,306 | 5,403 | 95 | 5,217 | 5,588 | 10 |
| 30 | E-GEOD-33793 | | 2.4 | 16,331 | 5,588 | 111 | 5,370 | 5,807 | 25 |
| 31 | E-GEOD-44384 | | 546.2 | 31,375 | 7,131 | 111 | 6,913 | 7,349 | 19 |
| | Minimum | | 2.39 | 6,096 | 0 | 2 | 0 | 5 | 0 |
| | Median | | 103.84 | 21,602 | 1,613 | 53 | 1,511 | 1,715 | 7 |
| | Average | | 171.44 | 24,331 | 1,851 | 48 | 1,757 | 1,944 | 6 |
| | Maximum | | 579.73 | 54,644 | 7,131 | 113 | 6,913 | 7,349 | 25 |
| | Standard deviation | | 172.31 | 10,065 | 1,980 | 34 | 1,915 | 2,043 | 6 |
$N$: sample size in millions; $g$: number of genes detected; $h_6$: estimated number of missing genes; se($h_6$): estimated standard error of $h_6$; approximate 95% confidence interval for the number of missing genes (lower and upper bounds); %: estimated percentage of missing genes, %$h_6/G$.
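The interval and percentage columns can be approximated from $g$, $h_6$ and its standard error. The sketch below rests on two assumptions that this section does not state explicitly: that $G = g + h_6$ (detected plus estimated missing genes), and that the interval is a normal approximation, $h_6 \pm 1.96 \cdot se$, truncated at zero. The function name `summarize_missing` is hypothetical.

```python
def summarize_missing(g, h6, se, z=1.96):
    """Approximate 95% interval and percentage of missing genes.

    Assumptions (not confirmed by the source): G = g + h6, and a
    normal-approximation interval h6 +/- z * se, truncated at zero.
    """
    lower = max(0.0, h6 - z * se)
    upper = h6 + z * se
    total = g + h6  # assumed total number of expressed genes, G
    pct = 100.0 * h6 / total if total > 0 else 0.0
    return lower, upper, pct
```

For row 14 ($g$ = 17,677, $h_6$ = 1,250, se = 44) this gives roughly 1,164 to 1,336 and about 6.6%, close to the tabulated 1,163 to 1,337 and 7%. The small differences are consistent with the table's standard errors being rounded.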
Statistics for individual libraries of 32 accessions grouped by organism.
| Row | Organism | #Acc. | #Lib. | avg($g$) | $h_6$ min. | $h_6$ avg. | $h_6$ max. | % min. | % avg. | % max. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 1 | 7 | 32,735 | 524 | 775 | 1,235 | 2 | 2 | 4 |
| 2 | | 1 | 2 | 65,645 | 336 | 1,430 | 2,523 | 2 | 2 | 4 |
| 3 | | 1 | 4 | 6,059 | 108 | 136 | 158 | 2 | 2 | 3 |
| 4 | | 1 | 8 | 21,168 | 19,809 | 21,482 | 23,145 | 50 | 50 | 52 |
| 5 | | 1 | 6 | 20,518 | 559 | 1,291 | 1,510 | 6 | 6 | 7 |
| 6 | | 1 | 8 | 23,639 | 3,056 | 3,680 | 4,020 | 13 | 13 | 15 |
| 7 | | 1 | 5 | 8,868 | 247 | 396 | 474 | 4 | 4 | 5 |
| 8 | | 1 | 3 | 7,747 | 265 | 3,115 | 6,887 | 22 | 22 | 38 |
| 9 | | 2 | 35 | 12,733 | 1,360 | 1,804 | 3,817 | 13 | 13 | 33 |
| 10 | | 2 | 4 | 21,492 | 3,086 | 4,063 | 4,863 | 16 | 16 | 18 |
| 11 | | 3 | 36 | 18,876 | 44 | 1,108 | 2,619 | 6 | 6 | 15 |
| 12 | | 3 | 17 | 40,901 | 2,663 | 5,136 | 7,280 | 11 | 11 | 15 |
| 13 | | 6 | 77 | 15,391 | 0 | 1,747 | 10,247 | 7 | 7 | 29 |
| 14 | | 8 | 99 | 14,317 | 14 | 1,871 | 11,466 | 10 | 10 | 50 |
| Total | | 32 | 311 | 17,501 | 0 | 2,433 | 23,145 | 10 | 10 | 52 |
#Acc.: number of accessions; #Lib.: number of libraries; avg($g$): average number of detected genes per library; and minimum (min.), average (avg.) and maximum (max.) of the estimated number of missing genes, $h_6$, and of the estimated percentage of missing genes, %$h_6/G$.