Athanasia C Tzika, Asier Ullate-Agote, Djordje Grbic, Michel C Milinkovitch.
Abstract
Despite the availability of deep-sequencing techniques, genomic and transcriptomic data remain unevenly distributed across phylogenetic groups. For example, reptiles are poorly represented in sequence databases, hindering functional evolutionary and developmental studies in these lineages, which are substantially more diverse than mammals. In addition, different studies use different assembly and annotation protocols, inhibiting meaningful comparisons. Here, we present the "Reptilian Transcriptomes Database 2.0," which provides extensive annotation of transcriptomes and genomes from species covering the major reptilian lineages. To this end, we sequenced normalized complementary DNA libraries of multiple adult tissues and various embryonic stages of the leopard gecko and the corn snake and gathered published reptilian sequence data sets from representatives of the four extant orders of reptiles: Squamata (snakes and lizards), the tuatara, crocodiles, and turtles. The LANE runner 2.0 software was implemented to annotate all assemblies within a single integrated pipeline. We show that this approach increases the annotation completeness of the assembled transcriptomes/genomes. We then built large concatenated protein alignments of single-copy genes and inferred phylogenetic trees that support the positions of turtles and the tuatara as sister groups of Archosauria and Squamata, respectively. The Reptilian Transcriptomes Database 2.0 resource will be updated to include selected new data sets as they become available, thus making it a reference for differential expression studies, comparative genomics and transcriptomics, linkage mapping, molecular ecology, and phylogenomic analyses involving reptiles. The database is available at www.reptilian-transcriptomes.org and can be queried using a wwwblast server installed at the University of Geneva.
Keywords: Archosauria; deep sequencing; reptiles; squamates; transcriptomes; turtles
Year: 2015 PMID: 26133641 PMCID: PMC4494049 DOI: 10.1093/gbe/evv106
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Figure. Chronogram among the selected reptilian and reference species used for annotation. The letters between parentheses after the species names indicate the data type (T, transcriptome; G, genome; G, genome of a reference species). The underlined species were newly sequenced in our laboratory for this study. The tree topology and divergence times are based on the “TimeTree of Life” estimates (Hedges et al. 2006).
Figure. Assembly pipeline used to combine the “454” (in blue) and Illumina (in green) reads into nonredundant contigs (in red). Framed values correspond to those obtained for the E. macularius mix data set (multiple developmental stages and multiple adult organs), provided as an example. Dashed boxes delineate major steps of the assembly: (A) The “454” reads are de novo assembled, (B) the Illumina reads are aligned to the “454” assembly, and (C–E) iterative building of nonredundant Illumina contigs. The final assembly (F) includes both the “454” assembly and the Illumina contigs.
Reptilian Transcriptomes and Genomes Considered for Annotation
| Species | Vernacular Name | Data Type | Source | Sequencing/Assembly | Size |
|---|---|---|---|---|---|
| | Ocellated skink | cDNA | | Illumina assembly from uterus | 300,966 contigs |
| | Tuatara | cDNA | | Illumina assembly from embryos | 32,911 contigs |
| | Burmese python | cDNA | | “454” assembly from liver and heart | 37,245 contigs |
| | Common chameleon | cDNA | | SOLiD assembly multitissue | 164,692 contigs |
| | Garter snake | cDNA | | “454” assembly multitissue and multi-individual | 188,940 contigs and singletons |
| | Painted turtle | gDNA | | Genome assembly 3.0.1 | 6,080 scaffolds |
| | Gharial | gDNA | | Draft genome assembly | 47,351 scaffolds |
| | American alligator | gDNA | | Draft genome assembly | 8,897 scaffolds |
| | Saltwater crocodile | gDNA | | Draft genome assembly | 23,365 scaffolds |
| | American alligator | cDNA | | Brain “454” reads | 438,029 reads |
| | Corn snake | cDNA | | VNO “454” and Illumina libraries | 343,062 and 54.4M reads from “454” and Illumina |
| | Corn snake | cDNA | This study | Adults testes, brain and kidneys “454” and Illumina paired-end libraries | 135,630 and 145M reads from “454” and Illumina |
| | Corn snake | cDNA | This study | Embryonic “454” and Illumina paired-end libraries | 45,417 and 129.8M reads from “454” and Illumina |
| | Leopard gecko | cDNA | This study | Adults testes, brain and kidneys “454” and Illumina paired-end libraries | 112,760 and 128M reads from “454” and Illumina |
| | Leopard gecko | cDNA | This study | Embryonic “454” and Illumina paired-end libraries | 79,437 and 129.8M reads from “454” and Illumina |
Figure. Outline of the transcriptome annotation pipeline. All steps included in the outer dash-framed box are performed in LANE runner. The steps of a single iteration (i.e., using one reference species) are grouped in the inner dashed frame. The reference species iteratively considered for annotation are listed in the inset. Query sequences having a hit are indicated with a yellow mark and those having an RBBH with a green mark.
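The caption above distinguishes query sequences that merely have a hit from those that have an RBBH, which we read here as the usual reciprocal best BLAST hit. The following is a minimal sketch of that distinction, assuming each search result has already been reduced to `(query_id, subject_id, bit_score)` tuples. The function names, the identifiers, and the scores are hypothetical illustrations, not part of LANE runner.

```python
def best_hits(hits):
    """Return {query_id: subject_id}, keeping the highest-scoring subject per query.

    `hits` is an iterable of (query_id, subject_id, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def classify_queries(forward_hits, reverse_hits):
    """Label each query that has a forward hit as 'RBBH' or plain 'hit'.

    A query is an RBBH when its best subject, searched back against the
    query set, returns that same query as its own best hit.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {
        query: "RBBH" if reverse.get(subject) == query else "hit"
        for query, subject in forward.items()
    }


# Toy data, invented for illustration only.
forward = [
    ("contig1", "geneA", 250.0),
    ("contig1", "geneB", 90.0),
    ("contig2", "geneA", 120.0),
]
reverse = [
    ("geneA", "contig1", 245.0),
    ("geneA", "contig2", 118.0),
    ("geneB", "contig1", 88.0),
]
labels = classify_queries(forward, reverse)
print(labels)
```

In this toy case `contig1` and `geneA` pick each other and form an RBBH, whereas `contig2` hits `geneA` without being its best match in return, so it is only a hit.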
NGen Assembly Workflow Statistics
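| Statistic | Corn snake: Adults | Corn snake: Embryos | Corn snake: VNO | Corn snake: Mix | Leopard gecko: Adults | Leopard gecko: Embryos | Leopard gecko: Mix |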
|---|---|---|---|---|---|---|---|
| Number of plates | 1 | 1 | 1 | 3 | 1 | 1 | 2 |
| 454 reads | 135,630 | 45,417 | 343,062 | 524,109 | 112,760 | 79,437 | 192,197 |
| 454 discarded | 8,557 (6.3%) | 4,374 (9.6%) | 1,912 (0.6%) | 15,071 (2.9%) | 29,056 (25.8%) | 4,466 (5.6%) | 33,522 (17.4%) |
| 454 contigs | 17,570 | 6,133 | 38,666 | 45,955 | 10,635 | 6,595 | 17,876 |
| 454 singletons | 27,826 | 17,069 | 56,632 | 98,265 | 26,200 | 13,434 | 28,215 |
| Av. contig length | 556 | 499 | 447 | 497 | 393 | 545 | 712 |
| Greater than 500 bp | 9,585 | 2,951 | 12,369 | 17,075 | 3,192 | 3,439 | 13,122 |
| Greater than 1 Kb | 1,471 | 343 | 1,325 | 3,667 | 265 | 595 | 3,404 |
| 454 assembly | 45,396 | 23,202 | 95,298 | 144,220 | 36,835 | 20,029 | 46,091 |
| Number of lanes | 0.5 | 0.5 | 0.5 | 1.5 | 0.5 | 0.5 | 1 |
| Illumina reads | 145M | 129.8M | 54.4M | 329.2M | 128M | 129.8M | 257.8M |
| Aligned to 454 | 92M (63%) | 62.8M (48%) | 32.3M (59%) | 216.9M (66%) | 82M (64%) | 94.8M (73%) | 194.2M (75%) |
| Illumina contigs | 63,227 | 86,371 | 28,453 | 134,457 | 50,737 | 36,254 | 64,902 |
| Av. contig length | 423 | 394 | 575 | 396 | 410 | 454 | 389 |
| Greater than 500 bp | 20,805 | 25,892 | 14,891 | 39,300 | 15,131 | 13,668 | 18,577 |
| Greater than 1 Kb | 2,738 | 3,570 | 2,652 | 3,589 | 953 | 2,097 | 1,831 |
| Number iterations | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| Final assembly | 108,623 | 109,573 | 123,751 | 278,677 | 87,572 | 56,283 | 110,993 |
| After adaptor removal | 108,678 | 109,589 | 124,012 | 279,699 | 87,703 | 56,302 | 111,237 |
Note: The first two shaded rows correspond to the total number of contigs/singletons obtained from "454" and Illumina reads, and the third shaded row corresponds to the total number of contigs/singletons after removal of adaptors.
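The totals in the table can be cross-checked arithmetically. The sketch below assumes two relationships that are inferred from the numbers rather than stated in the text: the "454 assembly" row is the sum of "454 contigs" and "454 singletons", and the "Final assembly" row is the sum of "454 assembly" and "Illumina contigs". Values are copied from the table in column order, and the variable names are illustrative.

```python
# Rows copied from the table above, one value per data set, in column order.
contigs_454 = [17570, 6133, 38666, 45955, 10635, 6595, 17876]
singletons_454 = [27826, 17069, 56632, 98265, 26200, 13434, 28215]
assembly_454 = [45396, 23202, 95298, 144220, 36835, 20029, 46091]
illumina_contigs = [63227, 86371, 28453, 134457, 50737, 36254, 64902]
final_assembly = [108623, 109573, 123751, 278677, 87572, 56283, 110993]

# Assumed relationship 1: 454 assembly = 454 contigs + 454 singletons.
assembly_ok = [
    c + s == a for c, s, a in zip(contigs_454, singletons_454, assembly_454)
]

# Assumed relationship 2: final assembly = 454 assembly + Illumina contigs.
final_ok = [
    a + i == f for a, i, f in zip(assembly_454, illumina_contigs, final_assembly)
]

print("454 assembly rows consistent:", all(assembly_ok))
print("Final assembly rows consistent:", all(final_ok))
```

Both relationships hold for all seven columns, which suggests the table values survived extraction intact.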
Figure. Pie charts showing the percentage of contigs/singletons annotated at each step of the pipeline. The number of input sequences for each transcriptome is indicated in the middle of each graph.
Figure. Pie charts showing the percentage of nonannotated contigs/singletons that match with the other annotated reptilian transcriptomes. The total number of hits is indicated in the middle of each graph.
Figure. Pie charts showing the percentage of consensus sequences annotated with each reference species in the Ensembl or UniGene databases. The total number of consensuses is indicated in the middle of each graph.
Figure. Completeness of the annotated transcriptomes assessed with four reference data sets: Ramskold ubiquitously expressed genes in human (blue bars), CEGMA core human genes (red bars), OrthoDB7 BUSCOs from the vertebrate (green bars), and the metazoan (purple bars) radiation nodes. The species are ordered from highest to lowest overlap with the Ramskold data set and Anolis is shown as reference.