| Literature DB >> 27220689 |
Montserrat Torres-Oliva1,2, Isabel Almudi3,4, Alistair P McGregor3, Nico Posnien5,6.
Abstract
BACKGROUND: RNA-seq based on short reads generated by next generation sequencing technologies has become the main approach to study differential gene expression. Until now, the main applications of this technique have been to study the variation of gene expression in a whole organism, tissue or cell type under different conditions or at different developmental stages. However, RNA-seq also has a great potential to be used in evolutionary studies to investigate gene expression divergence in closely related species.Entities:
Keywords: Annotation; Closely related species; DESeq2; Differential gene expression; Drosophila; EXONERATE; Emerging model systems; Length bias; Limma; RNA-seq; RPKM; Voom
Mesh:
Year: 2016 PMID: 27220689 PMCID: PMC4877740 DOI: 10.1186/s12864-016-2646-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Number of genes obtained by each annotation method
| Method |
|
|
| Comparable |
|---|---|---|---|---|
| Published annotation | 13,676 | 11,837 (86.55 %) | 12,005 (87.78 %) | 9,994 (73.08 %) |
| after filtering | 8,810 (64.42 %) | |||
| Direct re-annotation | 13,676 | 13,436 (98.24 %) | 13,401 (97.99 %) | 13,328 (97.45 %) |
| after filtering | 12,334 (90.19 %) | |||
| Reciprocal re-annotation | 13,457 (98.40 %) | 13,373 (97.78 %) | 13,346 (97.59 %) | 13,311 (97.33 %) |
| after filtering | 13,239 (96.80 %) |
The last column contains the number of genes for which 1:1 orthologs were identified in the three species. “after filtering” indicates the remaining common genes after filtering out genes with length difference larger than 49 bp. Percentages in brackets are always given in relation to the total number of gene models in D. melanogaster (r5.55; 13,676 gene models)
Fig. 1Pairwise length difference between orthologous genes. Bars indicate the number of genes with that difference in length (calculated in number of nucleotides in the annotated transcripts) for each pair of species. Green shades indicate differences lower than 50 bp while orange to red indicate larger differences. The comparison is showed for (a) the published annotations, b the direct re-annotation of the published genomes and c the reciprocal re-annotation of the published genomes
Differentially expressed genes and correlation between calculated log2-fold changes and length difference between orthologous genes. Results are shown for the four applied methods, the three studied annotation strategies and the two described pairwise species comparisons
| Method | Annotation | Species | # Common genes | # Differentially expressed genes (% of common genes)a | Spearman’s | FDR 0.05 | Spearman’s rhob |
|---|---|---|---|---|---|---|---|
| DESeq2 | Published |
| 11,503 | 2,438 (21.2) | 6.52e-33 | *** | −0.36 |
|
| 10,023 | 2,974(29.7) | 2.15e-20 | *** | −0.33 | ||
| Direct |
| 13,401 | 2,665 (19.9) | 3.35e-20 | *** | −0.33 | |
|
| 13,328 | 3,710 (27.8) | 5.90e-58 | *** | −0.57 | ||
| Reciprocal |
| 13,331 | 2,501 (18.8) | 5.12e-02 | n.s. | −0.23 | |
|
| 13,320 | 3,508 (26.3) | 1.48e-01 | n.s. | −0.29 | ||
| DESeq2 + length matrix | Published |
| 11,503 | 1,192 (10.4) | 2.24e-05 | *** | −0.13 |
|
| 10,023 | 1,545 (15.4) | 3.20e-02 | a | −0.08 | ||
| Direct |
| 13,401 | 1,259 (9.4) | 1.07e-02 | a | −0.09 | |
|
| 13,328 | 1,957 (14.7) | 5.24e-04 | ** | −0.13 | ||
| Reciprocal |
| 13,331 | 1,215 (9.1) | 7.03e-01 | n.s. | −4.6e-02 | |
|
| 13,320 | 1,910 (14.3) | 7.34e-01 | n.s. | −0.07 | ||
| RPKM + limma | Published |
| 11,503 | 1,904 (16.6) | 4.42e-04 | *** | −0.11 |
|
| 10,023 | 2,427 (24.2) | 1.06e-03 | ** | −0.12 | ||
| Direct |
| 13,401 | 1,890 (14.1) | 5.68e-03 | a | −0.10 | |
|
| 13,328 | 2,795 (21,0) | 4.49e-04 | *** | −0.14 | ||
| Reciprocal |
| 13,331 | 1,830 (13.7) | 5.92e-01 | n.s. | −6.4e-02 | |
|
| 13,320 | 2,738 (20.6) | 2.83e-01 | n.s. | −0.22 | ||
| RPKM + voom + limma | Published |
| 11,503 | 1,853(16.1) | 9.39e-04 | *** | −0.10 |
|
| 10,023 | 2,204(22.0) | 4.63e-02 | a | −0.07 | ||
| Direct |
| 13,401 | 1,899(14.2) | 1.01e-02 | a | −0.10 | |
|
| 13,328 | 2,607(19.6) | 5.92e-03 | a | −0.11 | ||
| Reciprocal |
| 13,331 | 1,819(13.6) | 5.79e-01 | n.s. | −0.07 | |
|
| 13,320 | 2,519(18.9) | 4.06e-01 | n.s. | −0.17 |
aFDR 0.05
bSpearman’s rank correlation is measured between log2FC and length difference of genes showing more than 49 bp length difference
Published annotation: D. mau vs. D.mel: 1,038 genes/D.mau vs. D.sim: 764 genes; Direct annotation: D.mau vs. D.mel: 716 genes/D.mau vs. D.sim: 658 genes; Reciprocal annotation: D.mau vs. D.mel: 71 genes/D.mau vs. D.sim: 26 genes
*p < 0.05; **p < 0.005; ***p < 0.0005
Fig. 2Length differences between orthologous genes introduce gene expression biases. Relation between length differences and the log2-fold change in the RNA-seq experiment between D. mauritiana and D. simulans using the direct re-annotation of these species as mapping references. Dots represent genes with length difference > 49 bp in these annotations (658 genes). Genes significantly differentially expressed in the presented analysis (padj < 0.05) are shown in red. A negative log2-fold change indicates higher expression in D. mauritiana. A positive length difference indicates that the ortholog of D. mauritiana is longer. The p-value and rho of the Spearman’s rank correlation are indicated on the lower right side of the plots. a Results of DESeq2 without length correction. b Results of DESeq2 applying length normalization factor matrix. c Results of applying RPKM normalization and limma to call differentially expressed genes. d Results of applying voom normalization followed by a length normalization matrix and limma to call differentially expressed genes
Fig. 3Schematic representation of length bias in inter-species differential expression analysis and our reciprocal re-annotation strategy to correct it. a Length bias in the analysis of a non-differentially expressed gene. Coloured rectangles represent the part of the transcript which is included as reference for the RNA-seq reads to map to, while unfilled rectangles are regions of the transcript which are omitted and to which RNA-seq reads cannot be mapped. Red “N”s represent sequencing errors that prevent the complete annotation of a transcript. Mapped reads are shown as thin black lines and the number bellow indicates the total of reads mapped. (upper panel) If one transcript is shorter in one of the references compared to its orthologs, for the same expression levels fewer reads will map to it. This can result in false positives in the analysis of differential expression. (lower panel) Our strategy to correct this bias is to shorten the orthologs in the other references to match the length of the shorter sequence. b Pipeline of reciprocal transcriptome re-annotation method. Black numbers in white circles represent genome annotation steps using the “est2genome” command of Exonerate [58]. Grey numbers in grey circles represent conversion of the resulting GFF file into a new transcript set. Filled horizontal bars represent the annotated set of transcripts; non-filled horizontal bars at the start/end of the transcripts represent parts of the transcript that cannot be correctly annotated in one reference and are therefore eliminated from the transcript set. The boxes with red frame indicate the transcript sets that will be used as reference for RNA-seq read mapping (after confirmation by reciprocal blast). Step 1: the transcript set of the best annotated genomes (D. melanogaster in our study) is used to annotate one of the other genomes (D. simulans in our study) and generate a new transcript set for this species. Due to sequencing errors, some transcripts will be shorter. Step 2: the new transcript set form D. simulans is used to annotate the last genome (D. mauritiana in our study). The gene set generated contains shorter transcripts due to sequencing errors in D. mauritiana but also in D. simulans. Step 3: the transcript set from D. mauritiana is used to re-annotate the previously generated set from D. simulans to integrate the information from the D. mauritiana assembly. Step 4: the second transcript set from D. simulans is used to annotate the D. melanogaster set in order to integrate the information from D. simulans and D. mauritiana
Analysis of differential expression
| Published transcriptomes | Reciprocally re-annotated transcriptomes | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Gene | qPCR | Gene length (# nucl.) | DESeq2 | DESeq2 + length matrix | RPKM + limma | RPKM + voom + limma | Gene length (# nucl.) | DESeq2 | ||
|
|
|
|
|
|
|
|
|
|
| |
| lace | −0.19 | 1791 | 903 | 1.40*** | 0.38 | 0.41 | 0.15 | 902 | 902 | 0.03 |
| CG3558 | 0.08 | 3147 | 1956 | 1.50*** | 0.67 | 0.75 | 0.26* | 3150 | 3135 | 0.16 |
| dac | −0.29 | 3243 | 1878 | 1.47*** | 0.57 | 0.65 | 0.18 | 1887 | 1878 | 0.46 |
| RAF2 | 1.0e-03 | 3351 | 1854 | 1.77*** | 0.84 | 0.94 | 0.31 | 1959 | 1966 | 0.33 |
| Cp110 | −0.18 | 1998 | 1218 | 2.31*** | 1.38* | 1.4** | 0.55** | 2000 | 1998 | 0.11 |
| CBP | −0.21 | 1653 | 894 | 1.42*** | 0.35 | 0.54 | 0.14 | 1656 | 1653 | −0.24 |
| CG6766 | −0.41 | 1575 | 852 | 1.81*** | 0.79 | 0.88 | 0.25* | 855 | 855 | 0.31 |
| piwi | −2.60** | 2529 | 2526 | −2.48*** | −2.54*** | −1.99** | −1.08** | 2532 | 2529 | −2.48*** |
| alrm | −2.37*** | 1413 | 1413 | −6.54*** | −6.67*** | −4.93*** | −2.68*** | 1416 | 1416 | −6.49*** |
| Nplp1 | 1.04 | 1461 | 1461 | 3.85*** | 3.63*** | 3.06*** | 1.50*** | 1464 | 1464 | 3.80*** |
Expression comparison is for D. mauritiana vs. D. melanogaster, thus a positive log2-fold change (log2FC) indicates higher expression in D. melanogaster and vice versa. *p < 0.05; **p < 0.005; ***p < 0.0005
Fig. 4qPCR results. Boxplot of normalized Ct values (reference gene: actin 79B) For each studied gene (one colour) boxplot is showed for Ct values in D. melanogaster OreR (“D. mel”) and D. mauritiana TAM16 (“D. mau”). (Significance calculated by t-test (for genes with homogeneous distribution of variances) or t-Welch-test (for genes with not homogeneous distribution of variances); *p < 0.05, **p < 0.005; ***p < 0.0005)
| List of RNA-seq samples and the percentage and number of mapped reads to different reference transcriptomes
| Sample | Original read typea | Published transcriptomes | Reciprocally re-annotated transcriptomes | ||
|---|---|---|---|---|---|
| Percentage | Total number of mapped reads | Percentage | Total number of mapped reads | ||
|
| SE 50 bp | 58.86 % | 28,486,024 | 57.33 % | 27,744,730 |
|
| SE 50 bp | 44.23 % | 17,675,472 | 43.19 % | 17,260,775 |
|
| SE 50 bp | 65.51 % | 25,316,846 | 63.91 % | 24,699,746 |
|
| SE 50 bp | 40.70 % | 16,575,011 | 43.31 % | 17,639,874 |
|
| SE 50 bp | 56.17 % | 31,884,442 | 60.07 % | 34,100,435 |
|
| SE 50 bp | 53.01 % | 23,653,723 | 56.98 % | 25,425,486 |
|
| PE 100 bp | 56.06 % | 111,643,922 | 61.07 % | 121,610,905 |
|
| PE 100 bp | 54.28 % | 130,638,956 | 59.51 % | 143,226,939 |
|
| PE 100 bp | 60.90 % | 144,541,354 | 66.21 % | 157,165,639 |
|
| PE 100 bp | 62.26 % | 118,272,529 | 66.71 % | 126,741,807 |
|
| PE 100 bp | 57.90 % | 138,364,665 | 62.56 % | 149,508,494 |
|
| PE 100 bp | 56.32 % | 150,692,651 | 60.98 % | 163,168,587 |
a Reads originally 100 bp paired-end (PE) were split in half to be 50 bp each and treated as single-end (SE) reads