Shanrong Zhao, Li Xi, Baohong Zhang.
Abstract
In recent years, RNA-seq has emerged as a powerful technology for estimating gene and/or transcript expression, and RPKM (Reads Per Kilobase per Million reads) is widely used to represent the relative abundance of mRNAs for a gene. Methods for gene quantification fall largely into two categories: the transcript-based approach and the 'union exon'-based approach. The transcript-based approach is intrinsically more difficult because different isoforms of a gene typically share a high proportion of genomic overlap. The 'union exon'-based approach, by contrast, is much simpler and thus widely used in RNA-seq gene quantification. Biologically, a gene is expressed as one or more transcript isoforms, so the transcript-based approach is logically more meaningful than the 'union exon'-based approach. Although gene quantification is a fundamental task in most RNA-seq studies, it remains unclear whether the 'union exon'-based approach is good practice. In this paper, we carried out a side-by-side comparison of the 'union exon'-based approach and the transcript-based method in RNA-seq gene quantification. We found that gene expression levels are significantly underestimated by the 'union exon'-based approach: the average RPKM from the 'union exon'-based method is less than 50% of the mean expression obtained from the transcript-based approach. The difference between the two approaches is primarily affected by the number of transcripts in a gene. We performed differential analysis at both the gene and transcript levels and found that isoform-level differential analysis yields additional insights, such as isoform switches. The accuracy of isoform quantification would improve if the read coverage pattern and exon-exon spanning reads were taken into account and incorporated into the EM (Expectation Maximization) algorithm.
Our investigation discourages the use of the 'union exon'-based approach in gene quantification despite its simplicity.
Year: 2015 PMID: 26559532 PMCID: PMC4641603 DOI: 10.1371/journal.pone.0141910
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
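The contrast drawn in the abstract between the two quantification strategies can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: the gene, its exon coordinates, the per-isoform read counts, and the library size are all hypothetical, and the assignment of reads to isoforms is assumed to be given (in practice it would come from an EM-based tool). It uses the standard RPKM definition, reads × 10^9 / (length in bp × total mapped reads).

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def union_length(exons):
    """Total length of the union of half-open [start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical gene with two isoforms that share only the first exon.
isoforms = {
    "tx_a": {"exons": [(0, 500), (1000, 1500)], "reads": 900},
    "tx_b": {"exons": [(0, 500), (2000, 4000)], "reads": 100},
}
total_mapped = 1_000_000  # hypothetical library size

# 'Union exon' approach: all gene reads over the merged exon length.
all_exons = [exon for tx in isoforms.values() for exon in tx["exons"]]
gene_reads = sum(tx["reads"] for tx in isoforms.values())
union_rpkm = rpkm(gene_reads, union_length(all_exons), total_mapped)

# Transcript-based approach: sum of per-isoform RPKMs.
tx_sum_rpkm = sum(
    rpkm(tx["reads"], union_length(tx["exons"]), total_mapped)
    for tx in isoforms.values()
)

print(f"union-exon RPKM: {union_rpkm:.2f}")
print(f"transcript-sum RPKM: {tx_sum_rpkm:.2f}")
```

In this toy case the union-exon length (3000 bp) is longer than the length of the dominant isoform (1000 bp), so dividing all reads by it dilutes the estimate relative to the per-isoform sum.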
The STAR mapping and featureCounts counting summaries.
| Sample | STAR: Total_reads | STAR: Unique(%) | STAR: Unmap(%) | featureCounts: Mapped_Reads | featureCounts: Gene(%) | featureCounts: Ambiguity(%) | featureCounts: No_Feature(%) |
|---|---|---|---|---|---|---|---|
| HBRR_C4 | 97730535 | 87.98 | 12.02 | 85957548 | 88.95 | 2.11 | 8.93 |
| HBRR_C6 | 94064211 | 92.26 | 7.74 | 86759046 | 88.48 | 2.05 | 9.47 |
| UHRR_C1 | 83374339 | 88.26 | 11.74 | 73564305 | 87.12 | 2.49 | 10.39 |
| UHRR_C2 | 84897013 | 89.43 | 10.57 | 75901362 | 87.69 | 2.56 | 9.75 |
The total number of counted reads by featureCounts and RSEM.
| Reads | All genes: HBRR_C4 | All genes: HBRR_C6 | All genes: UHRR_C1 | All genes: UHRR_C2 | Filtered genes (11634): HBRR_C4 | Filtered genes (11634): HBRR_C6 | Filtered genes (11634): UHRR_C1 | Filtered genes (11634): UHRR_C2 |
|---|---|---|---|---|---|---|---|---|
| fc_count | 64024052 | 64658363 | 59538166 | 62146717 | 43404482 | 43704815 | 39448095 | 41261435 |
| rsem_count | 62857039 | 63494011 | 58624841 | 61238508 | 41642175 | 41967121 | 37811679 | 39553948 |
| Ratio | 1.019 | 1.018 | 1.016 | 1.015 | 1.042 | 1.041 | 1.043 | 1.043 |
*1 Genes originating from the mitochondrion are excluded.
*2 Ratio = fc_count/rsem_count
The rsem_rpkm and rsem_txSum_rpkm for genes RP11-6N17.4 and HAMP and their isoforms.
| Measurement | RP11-6N17.4 (ENSG00000264920.1): HBRR_C4 | RP11-6N17.4: HBRR_C6 | RP11-6N17.4: UHRR_C1 | RP11-6N17.4: UHRR_C2 | HAMP (ENSG00000105697.3): HBRR_C4 | HAMP: HBRR_C6 | HAMP: UHRR_C1 | HAMP: UHRR_C2 |
|---|---|---|---|---|---|---|---|---|
| Ratio | 7.70 | 7.32 | 6.93 | 6.88 | 4.06 | 3.47 | 4.33 | 4.47 |
| rsem_rpkm | 1.64 | 1.22 | 0.98 | 1.01 | 3.81 | 3.05 | 3.27 | 3.27 |
| rsem_txSum_rpkm | 12.71 | 8.97 | 6.86 | 7.03 | 15.49 | 10.61 | 14.17 | 14.65 |
| Isoform 1 | 11.94 | 7.85 | 6.38 | 6.39 | 2.26 | 1.81 | 6.70 | 3.55 |
| Isoform 2 | 0.54 | 0.86 | 0.24 | 0.38 | 11.98 | 7.31 | 6.84 | 10.46 |
| Isoform 3 | 0.23 | 0.25 | 0.24 | 0.26 | 1.26 | 1.49 | 0.63 | 0.64 |
* Ratio = rsem_txSum_rpkm/rsem_rpkm.
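The ratio defined in the footnote can be checked by hand from the HBRR_C4 column. The sketch below assumes a particular reading of the table: that the last three value rows are per-isoform RPKMs whose sum is rsem_txSum_rpkm, and that the row with values 1.64 and 3.81 is rsem_rpkm. The arithmetic supports that reading, but the row assignment is an inference rather than something stated in the record.

```python
# HBRR_C4 values copied from the table above; the row interpretation
# (rsem_rpkm vs. per-isoform RPKM) is an assumption checked by the arithmetic.
genes = {
    "RP11-6N17.4": {
        "rsem_rpkm": 1.64,
        "isoform_rpkm": [11.94, 0.54, 0.23],
        "reported_ratio": 7.70,
    },
    "HAMP": {
        "rsem_rpkm": 3.81,
        "isoform_rpkm": [2.26, 11.98, 1.26],
        "reported_ratio": 4.06,
    },
}


def tx_sum_and_ratio(rsem_rpkm, isoform_rpkm):
    """Return (rsem_txSum_rpkm, rsem_txSum_rpkm / rsem_rpkm)."""
    tx_sum = sum(isoform_rpkm)
    return tx_sum, tx_sum / rsem_rpkm


results = {
    name: tx_sum_and_ratio(g["rsem_rpkm"], g["isoform_rpkm"])
    for name, g in genes.items()
}

for name, (tx_sum, ratio) in results.items():
    reported = genes[name]["reported_ratio"]
    print(f"{name}: txSum={tx_sum:.2f} ratio={ratio:.2f} (reported {reported})")
```

Small differences from the tabulated ratios are expected, since the inputs here are already rounded to two decimals.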
The average RPKM for those 11634 filtered genes.
| RPKM | HBRR_C4 | HBRR_C6 | UHRR_C1 | UHRR_C2 |
|---|---|---|---|---|
| | 19.87 | 19.68 | 21.73 | 21.95 |
| | 19.81 | 19.63 | 21.71 | 21.93 |
| | 39.14 | 38.32 | 47.02 | 47.51 |
Differential analysis results and read counts for genes ENSG00000185963.9 and ENSG00000122126.11, and their isoforms.
| Type | Ensembl ID | log2Ratio | FDR | HBRR_C4 | HBRR_C6 | UHRR_C1 | UHRR_C2 |
|---|---|---|---|---|---|---|---|
| Gene | ENSG00000185963.9 | -0.434 | 1.8896E-05 | 4438 | 4638 | 3496 | 3860 |
| Transcript | ENST00000375512.3 | 1.509 | 0.0088750 | 643 | 685 | 2170 | 2319 |
| Transcript | ENST00000356884.6 | -1.685 | 0.0106644 | 3795 | 3953 | 1326 | 1541 |
| Gene | ENSG00000122126.11 | -0.544 | 6.7632E-07 | 5892 | 5881 | 4246 | 4593 |
| Transcript | ENST00000357121.5 | 2.599 | 0.0383466 | 751 | 406 | 3775 | 4170 |
| Transcript | ENST00000463271.1 | 1.409 | 0.3034412 | 5 | 2 | 8 | 14 |
| Transcript | ENST00000486673.1 | 0.685 | 0.5620808 | 0 | 0 | 0 | 1 |
| Transcript | ENST00000371113.4 | -3.858 | 0.0069851 | 5136 | 5474 | 462 | 407 |
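The isoform switches mentioned in the abstract show up in this table as transcripts of the same gene that change in opposite directions. The sketch below flags such genes from the transcript rows. The FDR cutoff of 0.05 and the helper name `find_isoform_switches` are assumptions made for illustration, not criteria taken from the paper; the sign convention (positive log2Ratio meaning higher in UHRR) is read off the counts in the table.

```python
# (gene, transcript, log2Ratio, FDR) copied from the transcript rows above.
transcript_rows = [
    ("ENSG00000185963.9", "ENST00000375512.3", 1.509, 0.0088750),
    ("ENSG00000185963.9", "ENST00000356884.6", -1.685, 0.0106644),
    ("ENSG00000122126.11", "ENST00000357121.5", 2.599, 0.0383466),
    ("ENSG00000122126.11", "ENST00000463271.1", 1.409, 0.3034412),
    ("ENSG00000122126.11", "ENST00000486673.1", 0.685, 0.5620808),
    ("ENSG00000122126.11", "ENST00000371113.4", -3.858, 0.0069851),
]


def find_isoform_switches(rows, fdr_cutoff=0.05):
    """Flag genes with significant isoforms changing in opposite directions.

    fdr_cutoff is a hypothetical threshold chosen for this illustration.
    """
    significant = {}
    for gene, transcript, log2_ratio, fdr in rows:
        if fdr < fdr_cutoff:
            significant.setdefault(gene, []).append((transcript, log2_ratio))

    switches = {}
    for gene, hits in significant.items():
        up = [tx for tx, l2r in hits if l2r > 0]
        down = [tx for tx, l2r in hits if l2r < 0]
        if up and down:
            switches[gene] = {"up": up, "down": down}
    return switches


switches = find_isoform_switches(transcript_rows)
for gene, calls in switches.items():
    print(gene, "up:", calls["up"], "down:", calls["down"])
```

Both genes have a modest gene-level log2Ratio (-0.434 and -0.544) while their individual isoforms move strongly in opposite directions, which is the pattern a gene-level analysis alone would not reveal.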