Ana Riesgo, Sónia C S Andrade, Prashant P Sharma, Marta Novo, Alicia R Pérez-Porro, Varpu Vahtera, Vanessa L González, Gisele Y Kawauchi, Gonzalo Giribet.
Abstract
INTRODUCTION: Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa.
Year: 2012 PMID: 23190771 PMCID: PMC3538665 DOI: 10.1186/1742-9994-9-33
Source DB: PubMed Journal: Front Zool ISSN: 1742-9994 Impact factor: 3.172
Figure 1. Phylogenetic position of the higher taxonomic ranks of the species selected for this study, and accessory pictures of the living animals. a. Petrosia ficiformis. b. Crella elegans. c. Cerebratulus marginatus. d. Cephalothrix hongkongiensis. e. Chiton olivaceus. f. Octopus vulgaris. g. Sipunculus nudus. h. Hormogaster samnitica. i. Metasiro americanus. j. Alipes grandidieri. (Pictures taken by Ana Riesgo (a), Alicia R. Pérez-Porro (b), Gonzalo Giribet (c, f, j), Sichun Sun (d), Jiri Nóvak (e), Gisele Kawauchi (g), Marta Novo (h), and Prashant Sharma (i).)
Collecting information for the 10 species used for this study
| Porifera | Demospongiae, Haplosclerida | Punta Santa Anna, Blanes, Girona, Spain | DNA105722* | Entire animal | LN2/-80°C | |
| Demospongiae, Poecilosclerida | Tossa de Mar, Girona, Spain | DNA105740* | Entire animal | RNA | ||
| Nemertea | Anopla, Paleonemertea | Akkeshi, Hokkaido, Japan | DNA106145* | Entire animal | RNAlater | |
| Anopla, Heteronemertea | False Bay, San Juan Island, Washington, USA | DNA105590* | Entire animal | LN2/-80°C | ||
| Mollusca | Polyplacophora, Chitonida | Tossa de Mar, Girona, Spain | DNA106012* | Entire animal | RNA | |
| Cephalopoda, Octopoda | Blanes Bay, Blanes, Girona, Spain | DNA106283* | Fragment of arm | RNA | ||
| Sipuncula | Sipunculidae | Fort Pierce, Florida, USA | DNA106878* | Distal fragment of animal | LN2/-80°C | |
| Annelida | Oligochaeta, Opisthopora | Gello, Toscana, Italy | GEL6** | Distal fragment of animal | RNA | |
| Arthropoda | Arachnida, Opiliones | Kingfisher Pond, Savannah, Georgia, USA | DNA101532* | Entire animal | LN2/-80°C | |
| Chilopoda, Scolopendromorpha | Tanzania; pet supplier ( | DNA106771* | Mid part of body | LN2/-80°C |
Voucher numbers refer to specimens collected in the same area as the one used for the nucleic acid extraction, since in most cases the entire animal (or the entire collected piece of animal) was processed. A single asterisk refers to voucher numbers in the Museum of Comparative Zoology, Harvard University, and a double asterisk to those deposited in the Department of Zoology and Physical Anthropology, Universidad Complutense de Madrid. In all cases only one specimen was used for extraction, except for Metasiro americanus, which also had embryos in several developmental stages.
Assembly parameters
| 49,758,556 | 32,612,454* | 34.5 | 65.4 | 28,439,277 | 67,423 | 29.9 | 443.3 | 370.7 | 7,377 | 503 | 926.8 | 496.6 | |
| 26,513,534 | 25,951,906* | 2.1 | 93.1 | 16,464,495 | 71,524 | 26.7 | 372.7 | 261.7 | 4,637 | 437 | 682.1 | 333.1 | |
| 51,091,244 | 26,631,980* | 47.9 | 79.8 | 14,447,555 | 76,507 | 28.8 | 376.7 | 242.7 | 5,198 | 390 | 652.8 | 300.1 | |
| 51,711,276 | 46,967,592* | 9.2 | 73.8 | 22,977,409 | 109,947 | 57.1 | 518.0 | 394.2 | 7,731 | 559 | 991.0 | 521.6 | |
| 46,265,184 | 40,889,060* | 11.6 | 98.5 | 32,085,523 | 207,559 | 75.9 | 366.0 | 238.6 | 9,374 | 372 | 627.0 | 305.3 | |
| 16,431,468 | 15,422,631* | 6.1 | 125.0 | 11,670,780 | 77,383 | 41.7 | 540.0 | 125.0 | 16,472 | 599 | 1122.9 | 660.5 | |
| 45,973,825 | 43,842,184** | 4.6 | 100.5 | 25,679,520 | 71,960 | 31.2 | 431.7 | 228.0 | 3,032 | 437 | 676.2 | 262.5 | |
| 50,789,952 | 47,857,894** | 5.8 | 96.5 | 32,511,666 | 190,189 | 75.9 | 399.8 | 312.5 | 7,319 | 423 | 766.6 | 426.8 | |
| 24,943,641 | 23,959,711** | 3.9 | 129.6 | 19,735,275 | 101,929 | 43.9 | 439.5 | 423.0 | 10,407 | 477 | 1,010.3 | 621.7 | |
| 32,294,430 | 31,561,359** | 2.3 | 134.8 | 25,457,734 | 162,326 | 59.9 | 380.9 | 306.9 | 9,323 | 377 | 710.7 | 443.4 |
Grey background indicates libraries sequenced for 150 bp; otherwise they are 100 bp. Abbreviations: N, number; BT, before thinning and trimming; AT, after thinning and trimming; NMRC, number of reads matched to contigs; Mb, megabases; bp, base pairs; avg., average; L, length; SD, standard deviation; *, thinning limit of 0.05; **, thinning limit of 0.005.
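As an illustration of how the BT and AT read counts relate to each other, the following minimal Python sketch recomputes the share of reads discarded during thinning and trimming. It assumes that the third column of the table above is the percentage of reads removed, which is consistent with the first two rows; the function name `percent_discarded` is hypothetical and is not part of the original workflow.

```python
def percent_discarded(reads_before: int, reads_after: int) -> float:
    """Percentage of reads removed between the BT and AT stages.

    Assumption: the percentage column in the table is (BT - AT) / BT * 100.
    """
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    if not 0 <= reads_after <= reads_before:
        raise ValueError("reads_after must lie between 0 and reads_before")
    return (reads_before - reads_after) / reads_before * 100


# Read counts taken from the first two rows of the table above.
first_row = percent_discarded(49_758_556, 32_612_454)
second_row = percent_discarded(26_513_534, 25_951_906)
print(round(first_row, 1), round(second_row, 1))
```

Rounded to one decimal place, these two values agree with the percentages listed for the same rows.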
Figure 2. Workflow followed for the transcriptome analysis.
Coverage for the selected assemblies per species, estimated as the number of reads per bp and number of reads used to build the contigs (average value and maximum and minimum values)
| | ||||||
|---|---|---|---|---|---|---|
| 64.7 | 31926.9 (309) | 421.7 | 2 | 113,180 | 9 | |
| 72.7 | 88692.0 (238) | 230.2 | 2 | 317,465 | 5 | |
| 48.7 | 74756.8 (337) | 172.5 | 2 | 173,829 | 6 | |
| 36.2 | 56724.0 (657) | 208.9 | 2 | 307,273 | 5 | |
| 45.2 | 91002.5 (217) | 124.3 | 2 | 168,082 | 3 | |
| 38.4 | 27963.1 (490) | 151.0 | 2 | 65,985 | 3 | |
| 92.1 | 123567.7 (463) | 355.0 | 2 | 412,174 | 10 | |
| 40.6 | 85181.4 (273) | 171.3 | 2 | 543,848 | 3 | |
| 61.3 | 58777.3 (201) | 186.2 | 1 | 89,980 | 2 | |
| 65.3 | 98893.9 (211) | 161.8 | 2 | 153,215 | 2 | |
The minimum number of reads used to build contigs longer than 300 bp is also given. Abbreviations: N, number; SD, standard deviation; bp, base pairs.
Number of transcripts with blast hits and associated Gene Ontology (GO) terms for each transcriptome
| 26,291 | 9,069 | 5,380 | |
| 17,719 | 13,984 | 7,288 | |
| 22,035 | 14,251 | 9,778 | |
| 69,803 | 11,062 | 5,722 | |
| 69,384 | 24,495 | 12,533 | |
| 37,851 | 18,881 | 9,165 | |
| 40,946 | 9,322 | 4,942 | |
| 65,247 | 25,681 | 8,806 | |
| 29,382 | 18,056 | 9,720 | |
| 49,511 | 16,688 | 9,691 |
Figure 3. Size distribution of a. short contigs (between 300 and 2,000 bp) and b. long contigs (from 2,001 to >6,000 bp) without blast hit (light grey), with blast hit (dark grey), and with annotation or GO assignment (black). Asterisks represent species for which datasets were obtained using a read length of 150 bp.
Protein names and lengths (in amino acids, aa) for the five most redundant hits in each transcriptome
| | | | | |
| x9 | PREDICTED: hypothetical protein LOC100641198 [ | - | 673 | XP_003382742 |
| x9 | PREDICTED: hypothetical protein LOC100639583 [ | yes | 1768 | XP_003390293 |
| x10 | PREDICTED: RING finger protein 213-like [ | - | 5361 | XP_003389786 |
| x12 | ankyrin 2,3/unc44 [ | - | 789 | XP_001649474 |
| x16 | PREDICTED: hypothetical protein LOC100637079 [ | - | 41943 | XP_003386025 |
| | | | | |
| x25 | Collagen protein [ | - | 282 | CAC81019 |
| x36 | aggregation factor protein 3, form C [ | - | 2205 | AAC33162 |
| x38 | PREDICTED: deleted in malignant brain tumors 1 protein-like [ | - | 3131 | XP_003389240 |
| x46 | PREDICTED: hypothetical protein LOC100640736 [ | - | 5715 | XP_003383871 |
| x193 | PREDICTED: hypothetical protein LOC100637079 [ | - | 41943 | XP_003386025 |
| | | | | |
| x14 | pol-like protein [ | yes | 1235 | BAC82623 |
| x14 | pol-like protein [ | yes | 1263 | BAC82626 |
| x15 | PREDICTED: similar to ORF2-encoded protein, partial [ | yes | 372 | XP_002155414 |
| x15 | PREDICTED: Pao retrotransposon peptidase family protein-like [ | - | 1559 | XP_002731015 |
| x23 | putative zinc finger protein [ | - | 486 | CCD80531 |
| | | | | |
| x9 | PREDICTED: hypothetical protein LOC497165 [ | yes | 2265 | XP_003200870 |
| x11 | ORF2-encoded protein [ | yes | 1027 | BAE46429 |
| x11 | PREDICTED: similar to ORF2-encoded protein, partial [ | yes | 1117 | XP_001187755 |
| x11 | PREDICTED: similar to ORF2-encoded protein [ | yes | 1124 | XP_001189850 |
| x11 | PREDICTED: hypothetical protein LOC100535924 [ | - | 1448 | XP_003199942 |
| | | | | |
| x38 | PREDICTED: hypothetical protein LOC100609033 [ | yes | 255 | XP_003317434 |
| x44 | PREDICTED: hypothetical protein LOC100597269 [ | yes | 220 | XP_003276349 |
| x57 | PREDICTED: hypothetical protein LOC100414382, partial [ | yes | 178 | XP_002762361 |
| x57 | PREDICTED: zinc finger protein 91-like [ | - | 818 | XP_003243211 |
| x90 | PREDICTED: hypothetical protein LOC100608502, partial [ | yes | 211 | XP_003315526 |
| | | | | |
| x16 | predicted protein [ | yes | 1079 | XP_001630327 |
| x17 | PREDICTED: similar to tyrosine recombinase [ | - | 461 | XP_001183896 |
| x22 | pol-like protein [ | yes | 1222 | ABN58714 |
| x29 | hypothetical protein EAI_13357 [ | - | 172 | EFN88744 |
| x48 | PREDICTED: similar to ORF2-encoded protein, partial [ | yes | 372 | XP_002155414 |
| | | | | |
| x7 | dopamine beta hydroxylase-like protein, partial [ | - | 504 | ADB11406 |
| x7 | pol-like protein [ | yes | 1263 | BAC82626 |
| x7 | PREDICTED: similar to transposase [ | yes | 1312 | XP_001193486 |
| x9 | pol-like protein [ | yes | 1235 | BAC82623 |
| x11 | lectin 1B [ | - | 243 | ADO22714 |
| | | | | |
| x15 | leechCAM [ | - | 858 | AAC47655 |
| x15 | pannexin 4 [ | - | 413 | NP_001191576 |
| x16 | predicted protein [ | - | 2047 | XP_001624963 |
| x19 | hypothetical protein CBG_27119 [ | - | 224 | CAR99373 |
| x24 | tractin [ | - | 1880 | AAC47654 |
| | | | | |
| x14 | transglutaminase [ | - | 764 | 2012342A |
| x15 | putative reverse transcriptase [ | yes | 851 | AAK58879 |
| x30 | hypothetical protein BRAFLDRAFT_210900 [ | - | 489 | XP_002611360 |
| x39 | hypothetical protein BRAFLDRAFT_79800 [ | - | 512 | XP_002597956 |
| x53 | hypothetical protein BRAFLDRAFT_89523 [ | - | 396 | XP_002590717 |
| | | | | |
| x55 | PREDICTED: similar to predicted protein [ | yes | 1371 | XP_002161911 |
| x56 | Transposable element Tcb1 transposase [ | yes | 281 | ACN11475 |
| x57 | hypothetical protein TcasGA2_TC002110 [ | yes | 346 | EEZ99596 |
| x58 | hypothetical protein EAG_05969 [ | yes | 282 | EFN71217 |
| x123 | hypothetical protein TcasGA2_TC000717 [ | yes | 346 | EEZ98274 |
Their putative transposable element nature is indicated, as well as the GenBank accession number for each protein.
Figure 4. Number of sequences that resulted in unique hits (only one contig matching to each protein) or redundant hits (two or more blast hits matching to each protein) for each species.
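The distinction between unique and redundant hits can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each contig is assumed to have a single best blast hit identified by its protein accession, and the function name `count_unique_and_redundant` and the contig identifiers are hypothetical, not taken from the original analysis.

```python
from collections import Counter


def count_unique_and_redundant(best_hit_by_contig: dict[str, str]) -> tuple[int, int]:
    """Count contigs with unique versus redundant hits.

    A hit is 'unique' when exactly one contig matches a protein and
    'redundant' when two or more contigs match the same protein.
    """
    contigs_per_protein = Counter(best_hit_by_contig.values())
    unique = sum(
        1 for protein in best_hit_by_contig.values()
        if contigs_per_protein[protein] == 1
    )
    redundant = len(best_hit_by_contig) - unique
    return unique, redundant


# Hypothetical contig names mapped to accessions that appear in the table above.
hits = {
    "contig_1": "XP_003386025",
    "contig_2": "XP_003386025",
    "contig_3": "CAC81019",
    "contig_4": "AAC33162",
}
print(count_unique_and_redundant(hits))
```

In this toy input two contigs share one accession and the other two each match a different protein, so the counts are split evenly between the two classes.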
Figure 5. Paired comparison per phylum of the percentages of sequences mapped to given gene ontology (GO) terms.
Individual searches for our transcriptome datasets (no background color) and the JGI genomes of a sponge (pink), a mollusk (violet), and an annelid (green)
| | | | | | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 578 | 247 | 680 | 346 | 234 | 510 | 157 | | | | | |
| 472 | 247 | 247 | 307 | 85 | 300 | 147 | | | | | |
| 1667/614 | 539 | 1320 | 413/370/279 | 290 | 656 | 454 | | | | | |
| 272/173/103 | 139 | - | 137 | 343/309/110 | 205 | 168/89 | | | | | |
| 286/495 | 101 | 131 | 140 | 358/109/167 | 271 | 233 | | | | | |
| 498 | 358 | - | 123 | 289/174/96 | 168 | 65 | | | | | |
| 916 | 217 | - | - | 324/247 | 445 | 101 | | | | | |
| 2404 | 724 | 1245 | 350 | 231 | 549 | 230 | | | | | |
| 451/232/241 | - | - | - | - | - | - | | | | | |
| 464/546/684/456 | 521/268/260/388/314/170 | - | 200/197 | 350/314/238/173/133/108 | 521/482 | 171/952 | | | | | |
| 2580/2612/2673/2985 | 785 | 1204 | 207 | 307/199/141 | 459/445 | 610 | | | | | |
| 600 | 780 | 552 | 340 | 58/179/344 | 493 | - | | | | | |
| 196/203 | 151 | 115/416 | 66/82 | 80/294 | 273 | 415 | | | | | |
| | | | | | | | | | | | |
| | | ||||||||||
| 435 | - | 230 | - | 186 | - | 184 | - | - | 102 | | |
| 90 | - | 408 | 98 | 190 | - | 150 | - | - | 140 | | |
| 371 | - | 408/412/181 | - | - | - | 1035 | - | - | 413 | | |
| 251 | 118 | 151 | - | 307 | - | 208 | - | 114 | - | | |
| 413 | 120 | 473 | - | 266 | - | 172 | - | 114 | - | | |
| 109 | 285 | 77 | - | 242 | 181 | 110 | - | 152 | | | |
| - | 500 | 302 | - | - | 97 | 107 | 308 | - | 679 | | |
| 516 | 523/576 | 466 | 428 | - | 406 | 332 | 381 | 104 | - | | |
| - | 59 | - | - | - | - | 153 | - | - | - | | |
| 446/317 | 192 | 472 | - | 417 | 104 | 226 | 96 | - | - | | |
| 511 | 471/429 | 309 | | 452 | 339 | 1 | 239 | - | - | | |
| 362/374 | 407/71 | - | 287 | - | 351 | 340/613 | 117 | 411 | 160 | | |
| 425/507 | - | 117 | - | 94 | 173 | 113 | 480 | - | - | | |
| | |||||||||||
| | |||||||||||
| 1212 | - | 288/252/133 | 343 | 165 | 180 | 94 | 239 | 231 | 253 | 383/422 | |
| 327 | - | 165/156 | 517 | 218 | 267 | 377/162 | 139 | 172/128 | 460 | 374 | |
| 1184 | - | 300/289/275 | 143 | 228 | 184 | 439/385 | 840 | 359 | 249 | 508/498 | |
| - | - | 199 | - | 141 | 264 | - | 698 | - | 191/79 | 152 | |
| - | - | 100 | | 248 | 393 | 100 | 705 | 38 | 331 | 255 | |
| 303 | - | 465 | 105 | 235 | 499 | 247 | 421 | 121 | - | 190 | |
| - | 145 | 590/305/221 | - | 79 | 509 | 332 | 815 | 60 | 109 | 335 | |
| 355 | 527 | 879/572/489 | 1493 | 252 | 521 | 410 | 770 | 273 | 469 | 510 | |
| 91 | - | - | - | 222 | 325 | 247 | - | 243/113 | 333 | 509 | |
| 386/301/127 | 555/536/107 | 838 | 695 | 150 | 178 | 402/106 | 585 | 213 | 230/133 | 458 | |
| 329 | 1465 | 589/597/591 | 235 | 240 | 479 | 393 | 826 | 364 | 463 | 534 | |
| 236 | 670 | 75/577 | 597 | 236 | 210 | 261/182 | 766 | 213 | 207 | 101 | |
| 285 | 132 | 66 | 681 | 235 | 289 | 107/78 | 525 | 202 | 124 | 431 | |
Lengths of protein sequences are given in amino acids. Asterisks indicate the presence of hedgling instead of hedgehog. Abbreviations: JAG/SER, jagged and serrate; HES, hairy enhancer of split; Su(H), suppressor of hairless; Dx, deltex; TGF-β1, transforming growth factor β; ACV, activin; Smad, mothers against decapentaplegic; dpp, decapentaplegic; BMP, bone morphogenetic protein; SMO/FZD, smoothened and frizzled; Ci/Gli, cubitus interruptus/GLI; TPI, triosephosphate isomerase; ATPB, ATP synthase subunit b vesicular; MAT, methionine adenosyl transferase; PFK, phosphofructokinase; FBA, fructose bisphosphate aldolase; EF-1α, elongation factor-1α; CAT, catalase.
Figure 6. Compared abundances of PFAM domains for selected domains.
Figure 7. Phylogenetic reconstruction of metazoans using the gene methionine adenosyl transferase. Only bootstrap support values above 50% are shown. Sequences derived from our transcriptomes are shown in red. GenBank accession numbers for all sequences used can be found in Additional file 9.
Figure 8. Ortholog hit ratio (OHR) analysis showing the median (solid line), the mean (dotted line) and the 95th and 5th percentiles for all species.
Figure 9. Assembly of the transcriptome datasets through sequential addition of 5 million reads. a. N50 and b. total number of contigs, plotted against the different assemblies obtained for each species. Note that the final values in this figure are different from those in Table 2 because we used a newer version of CLC Genomics Workbench (v. 5.1).
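For readers unfamiliar with the N50 statistic, the sketch below computes it from a list of contig lengths using the commonly used definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. It is an assumption that CLC Genomics Workbench applies exactly this definition, and the function name `n50` and the example lengths are illustrative only.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly under the common definition.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total length 1,500 bp, so the halfway point is 750 bp.
# 500 + 400 = 900 >= 750, hence the N50 is 400.
print(n50([100, 200, 300, 400, 500]))
```

Recomputing this value after each batch of reads is added would give the kind of curve described for panel a, although the exact procedure used in the original study is not specified here.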