Alicia Oshlack, Matthew J Wakefield.
Abstract
BACKGROUND: Several recent studies have demonstrated the effectiveness of deep sequencing for transcriptome analysis (RNA-seq) in mammals. As RNA-seq becomes more affordable, whole-genome transcriptional profiling is likely to become the platform of choice for species with good genomic sequences. As yet, a rigorous analysis methodology has not been developed and we are still in the stages of exploring the features of the data.
Year: 2009 PMID: 19371405 PMCID: PMC2678084 DOI: 10.1186/1745-6150-4-14
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Figure 1. Differential expression as a function of transcript length. The data are binned according to transcript length, and the percentage of transcripts called differentially expressed using a statistical cut-off is plotted (points), together with a linear regression (lines). Panels a–e use all the data from RNA-seq and the microarrays from studies [4-6] respectively. Panels f and g plot the 33% of genes with the highest expression levels (blue crosses) and the 33% of genes with the lowest expression (red triangles), taken from the microarray data for genes that appear on both platforms in [6]. The regression gives a significant trend in the percentage of differential expression with transcript length for a, c, d and f, and for the lowly expressed genes in g. Note that this figure illustrates common data features between disparate experiments and is not a comparison between platforms, methods or experiments.
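The binning and regression described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `percent_de_by_length` and `fit_line`, the choice of ten equal-sized bins, and the synthetic data (in which the chance of being called differentially expressed rises with length) are all assumptions made for the example. It also fits the line only and does not test the trend for significance.

```python
import random
from statistics import mean, median

def percent_de_by_length(lengths, is_de, n_bins=10):
    """Sort transcripts by length, split them into n_bins groups of
    near-equal size, and return (median length, percent called DE) per group."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    size, extra = divmod(len(order), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        stop = start + size + (1 if b < extra else 0)
        idx = order[start:stop]
        pct = 100.0 * sum(is_de[i] for i in idx) / len(idx)
        bins.append((median(lengths[i] for i in idx), pct))
        start = stop
    return bins

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic data (hypothetical): P(called DE) grows linearly with length.
rng = random.Random(0)
lengths = [rng.randint(200, 10000) for _ in range(5000)]
is_de = [rng.random() < 0.1 + 0.4 * (n - 200) / 9800 for n in lengths]

bins = percent_de_by_length(lengths, is_de)
slope, intercept = fit_line([b[0] for b in bins], [b[1] for b in bins])
print(f"slope: {slope:.5f} percentage points per base")
```

A positive slope on real data would correspond to the trend the caption reports for panels a, c, d and f.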
Figure 2. Mean-variance relationship. Panel a shows the sample variance across lanes in the liver sample from the Marioni et al. [6] data, plotted as a function of the mean for each gene. Panel b shows the same data after the tag counts for each gene are divided by the length of the gene. The red line is a linear fit between the mean and variance for the shortest third of genes, while the blue line is the linear fit for the longest genes. In panel a the fits are very close to the line of equality between mean and variance (black line), which is what would be expected from a Poisson process. In panel b the short genes have higher variance for a given expression level than long genes.
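The contrast between the two panels follows from the Poisson assumption itself. A short derivation, where the symbols $X_g$ (tag count for gene $g$), $L_g$ (its length), $\lambda_g$ (its Poisson rate) and $Y_g$ (its length-scaled count) are notation introduced here for illustration:

```latex
\begin{align*}
X_g &\sim \mathrm{Poisson}(\lambda_g)
  &&\Rightarrow\quad \mathrm{E}[X_g] = \mathrm{Var}[X_g] = \lambda_g \\
Y_g &= \frac{X_g}{L_g}
  &&\Rightarrow\quad \mathrm{E}[Y_g] = \frac{\lambda_g}{L_g}, \qquad
     \mathrm{Var}[Y_g] = \frac{\lambda_g}{L_g^{2}} = \frac{\mathrm{E}[Y_g]}{L_g}
\end{align*}
```

Raw counts therefore sit on the line of equality regardless of length, whereas for a fixed length-scaled mean the variance is inversely proportional to $L_g$, so shorter genes show higher variance at the same expression level.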
Overrepresented KEGG pathways using microarrays.
| Term | Count | Pop Hits | PValue | Benjamini |
| hsa04610:Complement and coagulation cascades | 54 | 68 | 2.36E-10 | 5.44E-08 |
| hsa00980:Metabolism of xenobiotics by cytochrome P450 | 45 | 65 | 6.97E-06 | 5.37E-04 |
| hsa00190:Oxidative phosphorylation | 74 | 121 | 5.83E-06 | 6.73E-04 |
| hsa00120:Bile acid biosynthesis | 25 | 36 | 0.00126 | 0.0702 |
| hsa00260:Glycine, serine and threonine metabolism | 29 | 45 | 0.00246 | 0.107 |
| hsa00591:Linoleic acid metabolism | 20 | 31 | 0.01496 | 0.252 |
| hsa00380:Tryptophan metabolism | 35 | 60 | 0.00764 | 0.255 |
| hsa05010:Alzheimer's disease | 19 | 29 | 0.0149 | 0.271 |
| hsa00363:Bisphenol A degradation | 11 | 14 | 0.0188 | 0.287 |
| hsa00020:Citrate cycle (TCA cycle) | 18 | 27 | 0.0148 | 0.291 |
| hsa04514:Cell adhesion molecules (CAMs) | 65 | 126 | 0.0108 | 0.300 |
| hsa03320:PPAR signaling pathway | 39 | 70 | 0.0125 | 0.305 |
| hsa00650:Butanoate metabolism | 26 | 45 | 0.0280 | 0.374 |
| hsa00280:Valine, leucine and isoleucine degradation | 25 | 44 | 0.03995 | 0.425 |
| hsa00903:Limonene and pinene degradation | 18 | 29 | 0.0360 | 0.432 |
| hsa00230:Purine metabolism | 69 | 143 | 0.0472 | 0.462 |
| hsa00071:Fatty acid metabolism | 25 | 45 | 0.0536 | 0.488 |
| hsa00620:Pyruvate metabolism | 23 | 42 | 0.0776 | 0.526 |
| hsa00010:Glycolysis/Gluconeogenesis | 31 | 59 | 0.0666 | 0.531 |
| hsa00860:Porphyrin and chlorophyll metabolism | 21 | 38 | 0.0857 | 0.535 |
| hsa00410:beta-Alanine metabolism | 15 | 25 | 0.0838 | 0.541 |
| hsa00052:Galactose metabolism | 18 | 32 | 0.0996 | 0.554 |
Categories in bold are not found to be overrepresented in the RNA-seq data.
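The Count and Pop Hits columns are the ingredients of an overrepresentation test. This section does not state which test produced the PValue column, so the sketch below uses a one-sided hypergeometric test, a common choice, purely as an illustration. The function name `hypergeom_upper_tail` and the totals `pop_total` and `list_total` are hypothetical, since the table does not report them; only the 54 and 68 come from the first row.

```python
from math import comb

def hypergeom_upper_tail(count, list_total, pop_hits, pop_total):
    """P(X >= count) when list_total genes are drawn without replacement
    from pop_total genes, of which pop_hits belong to the pathway."""
    upper = min(list_total, pop_hits)
    denom = comb(pop_total, list_total)
    num = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(count, upper + 1)
    )
    return num / denom

# First table row: 54 differentially expressed genes out of 68 pathway genes.
# The two totals below are assumed values, not taken from the source.
pop_total = 4000    # hypothetical number of background genes
list_total = 2400   # hypothetical number of differentially expressed genes
p = hypergeom_upper_tail(54, list_total, 68, pop_total)
print(f"one-sided p-value under the assumed totals: {p:.3g}")
```

Because the p-value depends strongly on the two assumed totals, the printed number is not expected to match the table.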
Overrepresented KEGG pathways using Illumina sequencing.
| Term | Count | Pop Hits | PValue | Benjamini |
| hsa04610:Complement and coagulation cascades | 60 | 68 | 3.08E-08 | 7.11E-06 |
| hsa00020:Citrate cycle (TCA cycle) | 25 | 27 | 2.23E-04 | 0.0170 |
| hsa00120:Bile acid biosynthesis | 31 | 36 | 4.23E-04 | 0.0242 |
| hsa00071:Fatty acid metabolism | 37 | 45 | 5.35E-04 | 0.0244 |
| hsa00980:Metabolism of xenobiotics by cytochrome P450 | 50 | 65 | 7.29E-04 | 0.0277 |
| hsa00190:Oxidative phosphorylation | 85 | 121 | 0.001155 | 0.0374 |
| hsa00650:Butanoate metabolism | 35 | 45 | 0.00448 | 0.109 |
| hsa00010:Glycolysis/Gluconeogenesis | 43 | 59 | 0.0110 | 0.177 |
| hsa00230:Purine metabolism | 94 | 143 | 0.0128 | 0.180 |
| hsa00280:Valine, leucine and isoleucine degradation | 33 | 44 | 0.0147 | 0.183 |
| hsa05010:Alzheimer's disease | 23 | 29 | 0.0200 | 0.228 |
| hsa00620:Pyruvate metabolism | 31 | 42 | 0.0262 | 0.253 |
| hsa00260:Glycine, serine and threonine metabolism | 33 | 45 | 0.0241 | 0.256 |
| hsa04514:Cell adhesion molecules (CAMs) | 82 | 126 | 0.0285 | 0.262 |
| hsa00220:Urea cycle and metabolism of amino groups | 23 | 30 | 0.0359 | 0.297 |
| hsa00052:Galactose metabolism | 24 | 32 | 0.0448 | 0.315 |
| hsa00903:Limonene and pinene degradation | 22 | 29 | 0.0488 | 0.319 |
| hsa03320:PPAR signaling pathway | 47 | 70 | 0.0549 | 0.327 |
| hsa00380:Tryptophan metabolism | 41 | 60 | 0.0535 | 0.328 |
| hsa00591:Linoleic acid metabolism | 23 | 31 | 0.0596 | 0.333 |
| hsa00410:beta-Alanine metabolism | 19 | 25 | 0.0717 | 0.356 |
| hsa00363:Bisphenol A degradation | 12 | 14 | 0.0691 | 0.360 |
| hsa00860:Porphyrin and chlorophyll metabolism | 27 | 38 | 0.0752 | 0.363 |
Categories in bold are not found to be overrepresented in the microarray data.
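The Benjamini column presumably holds p-values adjusted for multiple testing with the Benjamini-Hochberg procedure. A minimal sketch of that adjustment is given below, with `benjamini_hochberg` as a hypothetical helper name. The adjusted values in the tables depend on the total number of pathways tested, which is not given here, so the example uses made-up p-values rather than trying to reproduce the column.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, not taken from the tables.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)
print([round(a, 5) for a in adj])
```

Note how the third p-value inherits the adjusted value of the fourth: the raw scaling would give it a larger adjusted value than its neighbour, and the running minimum removes that inversion.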
Figure 3. Length of genes found in KEGG pathways significantly overrepresented among differentially expressed genes. The first box in the plot represents the length of genes found in the four categories significant on both platforms. The second box is the length of genes found in categories significant only in the sequencing data. The third box is the length of all genes common to both technologies. Categories unique to the sequencing data tend to have longer transcripts.