Yuchao Xia, Fugui Wang, Minping Qian, Zhaohui Qin, Minghua Deng.
Abstract
BACKGROUND: RNA-Seq is a powerful new technology for comprehensively analyzing the transcriptome of any given cell. An important task in RNA-Seq data analysis is quantifying the expression levels of all transcripts. Although many methods have been introduced and much progress has been made, a satisfactory solution remains elusive.
Year: 2015 PMID: 26044773 PMCID: PMC4460722 DOI: 10.1186/1755-8794-8-S2-S14
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Illustration of 3 datasets
| Dataset | Subdataset | Sample | Platform | Read length | Mapping |
|---|---|---|---|---|---|
| Dataset 1 | Wold | w1:Brain | Illumina/Solexa | 25 bp | Seqmap |
| | Burge | b1:Group 1 | Illumina/Solexa | 32 bp | Seqmap |
| | Grimmond | g1:EB g2:ES | SOLiD | 35 bp | SOC |
| Dataset 2 | Synthetic spike-in RNA-Seq | | Illumina/Solexa | 36 bp | Bowtie |
| Dataset 3 | Two samples from human kidney | | Illumina | 36 bp | Bowtie |
| Dataset 4 | Three samples from mouse liver | | SOLiD | 25 bp | Seqmap |
R2 for 3 models in 8 RNA-Seq samples in Dataset 1.
| Subdataset | Sample | | | | |
|---|---|---|---|---|---|
| Wold | Brain | 0.51 | 0.65 | 0.70 | 0.68 |
| | Liver | 0.50 | 0.64 | 0.70 | 0.66 |
| | Muscle | 0.46 | 0.56 | 0.59 | 0.60 |
| Burge | Group 1 | 0.42 | 0.49 | 0.52 | 0.53 |
| | Group 2 | 0.35 | 0.42 | 0.46 | 0.50 |
| | Group 3 | 0.42 | 0.50 | 0.54 | 0.52 |
| Grimmond | EB | 0.40 | 0.54 | 0.58 | 0.58 |
| | ES | 0.37 | 0.54 | 0.54 | 0.56 |
Eight sub-datasets were used to compute the R2 of the three models. The sequencing preference was estimated from 40 surrounding nucleotides (20 upstream and 19 downstream of the read start). For each row, the number in bold indicates the highest R2 among the methods for that dataset.
1 No cross-validation.
2 Cross-validation.
∗ PL: the Poisson-Linear model;
∗ EB: embryoid bodies;
∗ ES: Undifferentiated mouse embryonic stem cells.
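The 40-nucleotide window described in the caption above (the read start position together with 20 nucleotides upstream and 19 downstream) can be illustrated with a minimal sketch. The function name `window_around_start`, the 0-based coordinate convention, and the `N` padding character are assumptions made for this example; they are not taken from the PDEGEM software.

```python
def window_around_start(seq, start, upstream=20, downstream=19, pad="N"):
    """Return the nucleotides surrounding a read start.

    `start` is a 0-based index into `seq` (assumed convention).
    The window holds `upstream` bases before the start, the start base
    itself, and `downstream` bases after it: 20 + 1 + 19 = 40 by default.
    """
    left = start - upstream
    right = start + downstream + 1  # exclusive end of the window

    # Pad with `pad` where the window runs past either end of the sequence,
    # so every window has the same length.
    prefix = pad * max(0, -left)
    suffix = pad * max(0, right - len(seq))
    return prefix + seq[max(0, left):min(len(seq), right)] + suffix


if __name__ == "__main__":
    transcript = "ACGT" * 25  # toy 100-nt sequence, not real data
    print(window_around_start(transcript, 50))
```

Padding keeps the window length fixed for reads that start near a transcript end; whether such positions should instead be discarded is a modelling choice this sketch does not settle.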
Figure 1. The stacking energy of PDEGEM in 8 different samples of Dataset 1. The x-axis represents the 16 dinucleotides AA, AC, ..., and TT, while the y-axis indicates the stacking energies of the dinucleotides. Lines with different colors indicate different datasets. w1, w2 and w3 represent Wold data, b1, b2 and b3 stand for Burge data, while g1 and g2 indicate Grimmond data.
Figure 2. The positional weight of PDEGEM in 8 different samples of Dataset 1. We chose 40 surrounding nucleotides (20 upstream and 19 downstream) to fit the model. The x-axis is the relative position around the starting point of a read. The y-axis indicates the positional weight. Red lines (w1, w2, and w3) represent Wold data, green lines (b1, b2, and b3) Burge data, and blue lines (g1 and g2) Grimmond data.
Figure 3. Fitting counts for the mouse Rp19 gene. Black vertical lines represent counts (experimental values or fitted values) along the Grimmond EB Rp19 gene (with the UTR and a further 100 nucleotides truncated). We use the other 99 genes of the top 100 genes to train the three models and then predict the counts for the Rp19 gene. (a) Counts of reads (true values) in the Grimmond EB data. (b) Counts of fitted reads using mseq. (c) Counts of fitted reads using MART. (d) Counts of fitted reads using PDEGEM.
Consistency between transcript abundance estimated by different methods and gold standards in Dataset 2.
| RPKM | MART | PDEGEM | |
|---|---|---|---|
| 0.30 | |||
| SRCC2 | 0.8341 | 0.8501 |
Spearman's rank correlation coefficient of RPKM, MART, PDEGEM with the true transcript abundance measured by transcript concentration in the experiment [25]. We compared four different methods using a synthetic RNA-Seq dataset with 90 isoforms. The number in bold indicates the highest R2 among three methods for the dataset.
1 RPKM has no R2.
2 Spearman's rank correlation coefficient.
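As an illustration of the consistency measure used here, the following minimal sketch computes Spearman's rank correlation coefficient as the Pearson correlation of average ranks. The function names are hypothetical and the input values in the usage example are toy numbers, not data from the paper; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy example: estimated abundances vs. known concentrations.
    estimated = [1.2, 3.4, 2.2, 8.9]
    known = [10.0, 100.0, 50.0, 1000.0]
    print(spearman(estimated, known))
```

Because only ranks enter the calculation, the coefficient is unchanged by any monotone rescaling of the abundance estimates, which makes it suitable for comparing methods that report expression on different scales.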
R2 for 3 models in Dataset 3.
| Sample | Poisson-Linear | MART | PDEGEM |
|---|---|---|---|
| SRX000571¹ | 0.15 | 0.50 | |
| SRX000604¹ | 0.12 | 0.48 | |
| SRX000605² | 0.15 | 0.53 | |
| SRX000606² | 0.13 | 0.52 | |
Goodness-of-fit measured by R2 for the Poisson-Linear model, MART and PDEGEM in four RNA-Seq samples in Dataset 3. The four samples came from human kidney and liver. For each row, the number in bold indicates the highest R2 among the methods for that dataset.
1 Illumina sequencing of human liver transcripts.
2 Illumina sequencing of human kidney transcripts.
Consistency between transcript abundance estimated by different methods and gold standards in Dataset 3.
| Sample | N¹ | RPKM | PL² | MART | PDEGEM |
|---|---|---|---|---|---|
| SRX000571³ | 4857 | 0.474 | 0.474 | 0.471 | |
| SRX000604³ | 4880 | 0.460 | 0.458 | 0.460 | |
| SRX000605⁴ | 5309 | 0.527 | 0.527 | 0.530 | |
| SRX000606⁴ | 5293 | 0.442 | 0.411 | 0.452 | |
Spearman's rank correlation coefficient of RPKM, Poisson-Linear model, MART, PDEGEM with the "true" gene expression measured by microarray in Dataset 3.
1 Number of transcripts used to calculate the correlation coefficients.
2 PL: the Poisson-Linear model.
3 Illumina sequencing of Human liver transcripts.
4 Illumina sequencing of Human kidney transcripts.
Pearson's correlation coefficients of %ASex.
| | Tissue | AS events | PCC¹ by uniform model | PCC by MART | PDEGEM |
|---|---|---|---|---|---|
| 1 | Brain | 699 | 0.36 | 0.40 | |
| | Liver | 472 | 0.48 | 0.50 | |
| | Muscle | 451 | 0.40 | 0.45 | |
| 2 | Brain | 298 | 0.44 | 0.50 | |
| | Liver | 228 | 0.60 | 0.60 | |
| | Muscle | 194 | 0.48 | 0.51 | |
Pearson's correlation coefficients (PCC) of the uniform model, MART and PDEGEM with the isoform expression measured by microarray in Dataset 4.
1 Pearson's correlation coefficient.