Athanasia C Tzika, Raphaël Helaers, Gerrit Schramm, Michel C Milinkovitch.
Abstract
BACKGROUND: Reptiles are largely under-represented in comparative genomics despite the fact that they are substantially more diverse in many respects than mammals. Given the high divergence of reptiles from classical model species, next-generation sequencing of their transcriptomes is an approach of choice for gene identification and annotation.
Year: 2011 PMID: 21943375 PMCID: PMC3192992 DOI: 10.1186/2041-9139-2-19
Source DB: PubMed Journal: Evodevo ISSN: 2041-9139 Impact factor: 2.250
Figure 1. Our sequence data analysis pipeline, including (a) contig assembly and homology assignment and (b and c) consensus building.
Statistics of the 454 sequencing: number of plates, raw reads, discarded reads, and average read length
| | | | | | | All |
|---|---|---|---|---|---|---|
| Plates | 1.5 | 2 | 1.5 | 2 | 2 | 9 |
| Raw reads | 558,538 | 523,785 | 554,054 | 884,080 | 613,632 | 3,134,089 |
| Discarded | 13,484 (2.4%) | 42,284 (8.1%) | 9,139 (1.7%) | 15,591 (1.8%) | 30,320 (4.9%) | 110,818 (3.5%) |
| Av read length | 191 | 181 | 207 | 191 | 164 | 187 |
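The discard percentages and the "All" column follow directly from the raw and discarded counts. A minimal Python sketch of that arithmetic is given here; the species names are not present in this extract, so the five columns are identified by position only, and the helper name `percent` is purely illustrative.

```python
# Raw and discarded read counts copied from the table above.
# Species names are not given in this extract, so columns are indexed by position.
raw = [558_538, 523_785, 554_054, 884_080, 613_632]
discarded = [13_484, 42_284, 9_139, 15_591, 30_320]


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Per-column discard rate and the pooled rate for the "All" column.
per_column = [percent(d, r) for d, r in zip(discarded, raw)]
total_raw = sum(raw)
total_discarded = sum(discarded)
overall = percent(total_discarded, total_raw)

print(per_column)
print(total_raw, total_discarded, overall)
```

Recomputed values can differ from the printed ones in the last digit: the third column works out to roughly 1.65%, which the table reports as 1.7%.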
Statistics of NGen assembly (mt: mitochondrial DNA)
| | | | | | |
|---|---|---|---|---|---|
| Contigs generated | 39,723 | 36,088 | 25,819 | 52,348 | 37,498 |
| Contigs without mt | 36,809 (92.7%) | 34,013 (94.2%) | 22,983 (89%) | 48,838 (93.3%) | 34,592 (92.2%) |
| Singletons | 184,139 | 171,709 | 217,290 | 263,428 | 168,075 |
| Singletons without mt | 65,066 (35.3%) | 77,684 (45.2%) | 56,705 (26.1%) | 85,666 (32.5%) | 69,968 (41.6%) |
| Total | 223,862 | 207,797 | 243,109 | 315,776 | 205,573 |
| Total without mt | 101,875 (46%) | 111,697 (54%) | 79,688 (33%) | 134,504 (43%) | 104,560 (51%) |
| Av. contig length | 375 | 415 | 424 | 407 | 360 |
| Max contig length | 4,255 | 7,513 | 5,317 | 6,063 | 4,841 |
| Cumul. contig length | 15.0 Mb | 15.2 Mb | 10.1 Mb | 21.6 Mb | 13.8 Mb |
| Cumul. total length | 48.2 Mb | 39.9 Mb | 53.9 Mb | 69.0 Mb | 37.6 Mb |
| Average reads/contig | 9.5 | 9.6 | 13.2 | 11.9 | 11.9 |
| Greater than 500 bp | 7,080 | 7,796 | 5,206 | 10,709 | 6,081 |
| Greater than 1 kb | 1,570 | 1,805 | 1,269 | 2,792 | 1,386 |
| Av. sequencing depth | 3.2 | 2.9 | 3.1 | 3.7 | 3.2 |
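The "Total" rows in the assembly table are the sum of contigs and singletons, and the percentages in "Total without mt" are the share of sequences left after mitochondrial ones are removed. The following Python sketch reproduces that bookkeeping from the tabulated counts; variable names are illustrative and columns are again positional because species labels are missing from this extract.

```python
# Counts copied from the NGen assembly table above (five unlabeled species columns).
contigs = [39_723, 36_088, 25_819, 52_348, 37_498]
singletons = [184_139, 171_709, 217_290, 263_428, 168_075]
contigs_no_mt = [36_809, 34_013, 22_983, 48_838, 34_592]
singletons_no_mt = [65_066, 77_684, 56_705, 85_666, 69_968]

# "Total" rows: contigs plus singletons, with and without mitochondrial sequences.
total = [c + s for c, s in zip(contigs, singletons)]
total_no_mt = [c + s for c, s in zip(contigs_no_mt, singletons_no_mt)]

# Share of all sequences kept after removing mitochondrial ones, in whole percent.
share_no_mt = [round(100 * kept / everything)
               for kept, everything in zip(total_no_mt, total)]

print(total)
print(total_no_mt)
print(share_no_mt)
```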
Figure 2. (a) Percentages of contigs/singletons with a BLAST hit against each of the databases searched with 'LANE runner'. The central number within each pie-chart is the number of contigs and singletons used in BLAST searches; (b) Percentage of each of the seven reference species used for anchoring input transcriptome sequences and building consensuses (results obtained against the Ensembl Coding and Unigene databases are grouped, and the central number gives the total number of consensuses); (c) Distribution of sequenced species against which 'orphan sequences' exhibited a hit (= fourth BLAST round in Figure 1a).
Consensus statistics: number of sequences (total, contigs, and singletons) with a BLAST hit against reference databases, and number of consensus sequences generated
| | | | | | |
|---|---|---|---|---|---|
| Input seq with BLAST hit | 88,754 | 45,773 | 36,241 | 54,284 | 37,180 |
| Contigs | 35,330 | 18,505 | 13,530 | 24,981 | 17,534 |
| Singletons | 53,424 | 27,268 | 22,711 | 29,303 | 19,646 |
| Consensus sequences | 31,021 | 24,676 | 20,016 | 26,203 | 20,897 |
| One-to-one consensus | 17,885 | 19,617 | 15,701 | 18,802 | 17,728 |
| >50% coverage | 3,505 | 7,114 | 1,233 | 2,372 | 1,346 |
Comparisons with other transcriptome datasets: ubiquitously expressed genes [31] and the Mouse Brain Atlas [57]
| | | | | | |
|---|---|---|---|---|---|
| Ensembl genes hits | 17,346 | 18,407 | 17,335 | 20,964 | 15,101 |
| Human homologs | 10,425 | 8,167 | 8,658 | 9,964 | 6,940 |
| 3,716 human homologs were found in all species | |||||
| 7,750 ubiquitously expressed genes | 5,822 | 4,804 | 5,097 | 5,752 | 4,124 |
| 2,595 genes were found in all species | |||||
| Mouse homologs | 11,238 | 10,068 | 10,162 | 11,697 | 8,970 |
| 4,926 mouse homologs were found in all species | |||||
| 15,112 Mouse Brain Atlas genes | 8,873 | 7,844 | 7,907 | 9,020 | 6,922 |
| 3,928 genes were found in all species | |||||
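One way to read the comparison table is as coverage of each reference gene set: the number of genes recovered per species divided by the size of the set (7,750 ubiquitously expressed genes, 15,112 Mouse Brain Atlas genes). The Python sketch below computes those fractions from the tabulated counts. The `coverage` helper is hypothetical, and these percentages are derived here for illustration rather than quoted from the paper.

```python
def coverage(found, reference_size):
    """Percentage of a reference gene set recovered in each column (one decimal)."""
    return [round(100 * n / reference_size, 1) for n in found]


# Counts copied from the comparison table above (five unlabeled species columns).
ubiquitous_found = [5_822, 4_804, 5_097, 5_752, 4_124]
brain_atlas_found = [8_873, 7_844, 7_907, 9_020, 6_922]

ubiquitous_cov = coverage(ubiquitous_found, 7_750)
brain_cov = coverage(brain_atlas_found, 15_112)

# Genes recovered in all five species, as a share of each reference set.
shared_ubiquitous = round(100 * 2_595 / 7_750, 1)
shared_brain = round(100 * 3_928 / 15_112, 1)

print(ubiquitous_cov, shared_ubiquitous)
print(brain_cov, shared_brain)
```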
Figure 3. Gene ontology low-level categories associated with central-nervous system biological processes for transcripts identified in each sequenced species. Only the significantly over- and under-represented categories (False Discovery Rate <0.05) are shown, marked in green and red, respectively (non-significant cases are in grey).
Figure 4. Phylogenomic analyses. (a) Amino-acid sequences from 4,689 genes (for 7 Sauropsida, 3 mammals, and 2 outgroup taxa) analyzed with RAxML (WAG model and approximate rate heterogeneity) after removal of excessively gapped positions (final dataset size = 2,012,759 characters/species), as well as with RAxML (GTR model and approximate rate heterogeneity) and MetaPIGA-2 (GTR, gamma-rate heterogeneity) after removal of all gapped positions (final size of dataset = 1,612 characters per species); labels on nodes indicate bootstrap proportions under RAxML for the 2 million and 1,612 aa datasets as well as posterior probabilities generated by MetaPIGA for the 1,612 aa dataset; branch lengths are indicated for the MetaPIGA analysis. (b) Amino-acid sequences from 4,689 genes (for 5 Sauropsida lineages, including 2 hybrid sequences, 3 mammals, and 1 hybrid outgroup) after removal of all gapped positions (final dataset size = 24,071 characters/species) analyzed with RAxML (GTR, approximate rate heterogeneity) and MetaPIGA (GTR, gamma-rate heterogeneity); bootstrap proportions and posterior probabilities are 100% for all branches; MetaPIGA branch lengths are indicated. (c) Amino-acid sequences from 1,139 genes devoid of known paralogs (for 7 Sauropsida species, 3 mammals and 2 outgroup taxa) analyzed with MetaPIGA after removal of excessively gapped positions (final size of dataset = 246,208 characters/species); labels on nodes indicate posterior probabilities for analyses under GTR with/without gamma rate heterogeneity; analysis of this dataset under RAxML still generated long-branch attraction: (corn snake + bearded dragon) and (crocodile + turtle).