Katrina M Dlugosch, Zhao Lai, Aurélie Bonin, José Hierro, Loren H Rieseberg.
Abstract
Transcriptome sequences are becoming more broadly available for multiple individuals of the same species, providing opportunities to derive population genomic information from these datasets. Using the 454 Life Science Genome Sequencer FLX and FLX-Titanium next-generation platforms, we generated 11-430 Mbp of sequence for normalized cDNA for 40 wild genotypes of the invasive plant Centaurea solstitialis, yellow starthistle, from across its worldwide distribution. We examined the impact of sequencing effort on transcriptome recovery and overlap among individuals. To do this, we developed two novel publicly available software pipelines: SnoWhite for read cleaning before assembly, and AllelePipe for clustering of loci and allele identification in assembled datasets with or without a reference genome. AllelePipe is designed specifically for cases in which read depth information is not appropriate or available to assist with disentangling closely related paralogs from allelic variation, as in transcriptome or previously assembled libraries. We find that modest applications of sequencing effort recover most of the novel sequences present in the transcriptome of this species, including single-copy loci and a representative distribution of functional groups. In contrast, the coverage of variable sites, observation of heterozygosity, and overlap among different libraries are all highly dependent on sequencing effort. Nevertheless, the information gained from overlapping regions was informative regarding coarse population structure and variation across our small number of population samples, providing the first genetic evidence in support of hypothesized invasion scenarios.
Keywords: 454 GS FLX Titanium; allele clustering; invasive species; normalized ESTs; yellow starthistle
Year: 2013 PMID: 23390612 PMCID: PMC3564996 DOI: 10.1534/g3.112.003871
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Sequencing effort and assembly information for C. solstitialis normalized Roche 454 transcriptome libraries
| Sample, tissue [Latitude, Longitude] | Platform | Plates | Raw Mb | Cleaned Mb | Cleaned read No. | Cleaned median bp | Assembled Mb | Unigene No. | Assembled median bp | Contig No. | UCO No. | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| North America (introduced) | ||||||||||||||
| CA-1-1S [N 41° 59’, W 122° 36’] | FLX | 0.25 | 17.2 | 13.0 | 69939 | 211 | 2.4 | 9783 | 242 | 8769 | 10 | |||
| CA-1-2S [N 41° 59’, W 122° 36’] | FLX | 0.25 | 25.1 | 19.0 | 101468 | 213 | 3.2 | 12810 | 242 | 11237 | 18 | |||
| CA-2-2S [N 40° 25’, W 122° 16’] | FLX | 0.25 | 20.1 | 15.9 | 82739 | 216 | 2.8 | 10596 | 255 | 9236 | 25 | |||
| CA-2-4S [N 40° 25’, W 122° 16’] | FLX | 0.25 | 19.9 | 14.3 | 78770 | 210 | 2.4 | 9853 | 238 | 8551 | 14 | |||
| CA-3-1S [N 39° 12’, W 121° 06’] | FLX | 0.5 | 58.0 | 45.6 | 229635 | 224 | 7.1 | 23671 | 268 | 19905 | 77 | |||
| CA-3-2S [N 39° 12’, W 121° 06’] | FLX | 0.5 | 58.1 | 39.9 | 212773 | 219 | 5.8 | 21578 | 254 | 18377 | 28 | |||
| CA-4-1S [N 38° 31’, W 121° 45’] | FLX | 0.5 | 80.1 | 66.8 | 321955 | 226 | 10.1 | 32206 | 270 | 27121 | 109 | |||
| CA-4-3S [N 38° 31’, W 121° 45’] | FLX | 0.25 | 20.0 | 13.2 | 72154 | 214 | 2.3 | 9524 | 238 | 8655 | 11 | |||
| CA-4-4S [N 38° 31’, W 121° 45’] | Ti | 0.75 | 206.3 | 193.2 | 579859 | 366 | 24.8 | 42042 | 537 | 32786 | 252 | |||
| CA-5-3S [N 38° 16’, W 121° 49’] | FLX | 0.5 | 80.2 | 65.5 | 325569 | 223 | 9.9 | 32743 | 266 | 27222 | 111 | |||
| CA-5-4S [N 38° 16’, W 121° 49’] | FLX | 0.25 | 21.1 | 14.7 | 79481 | 215 | 2.3 | 9172 | 249 | 8112 | 13 | |||
| South America (introduced) | ||||||||||||||
| AR-1-24L [S 36° 26’, W 64° 17’] | Ti | 0.5 | 234.7 | 225.4 | 542881 | 462 | 33.4 | 46538 | 632 | 38346 | 299 | |||
| AR-1-25CL [S 36° 26’, W 64° 17’] | Ti | 0.56 | 153.7 | 132 | 369437 | 412 | 21.1 | 39579 | 532 | 29944 | 177 | |||
| AR-6-13L [S 37° 39’, W 64° 08’] | Ti | 0.5 | 256.4 | 247.1 | 596060 | 473 | 28.5 | 38572 | 630 | 31082 | 287 | |||
| AR-6-26L [S 37° 39’, W 64° 08’] | Ti | 0.5 | 181.7 | 169.2 | 485202 | 390 | 17.3 | 30641 | 537 | 23376 | 185 | |||
| AR-8-15L [S 38° 11’, W 64° 04’] | Ti | 0.5 | 166.1 | 154.6 | 443044 | 391 | 20.4 | 35628 | 549 | 26419 | 230 | |||
| AR-8-19L [S 38° 11’, W 64° 04’] | Ti | 0.5 | 259.3 | 249.9 | 600878 | 458 | 28.7 | 37199 | 673 | 30305 | 303 | |||
| AR-13-24L [S 36° 18’, W 65° 40’] | Ti | 0.5 | 274.6 | 264.7 | 622982 | 491 | 28.4 | 37408 | 679 | 29348 | 311 | |||
| AR-13-28L [S 36° 18’, W 65° 40’] | Ti | 0.63 | 183.5 | 171.5 | 488746 | 405 | 21.1 | 38160 | 540 | 28579 | 206 | |||
| Western Europe (putative ancient expansion) | ||||||||||||||
| SP-1-5L [N 39° 50’, W 2° 30’] | Ti | 0.5 | 152.5 | 135.6 | 393803 | 398 | 19.5 | 38699 | 519 | 27841 | 172 | |||
| SP-1-10L [N 39° 50’, W 2° 30’] | Ti | 0.5 | 269.7 | 258.7 | 604697 | 483 | 29.5 | 43236 | 637 | 32884 | 292 | |||
| SP-2-2L [N 41° 45’, W 4° 5′] | Ti | 0.63 | 181.9 | 162.1 | 522160 | 341 | 22.8 | 46071 | 474 | 34707 | 184 | |||
| SP-2-6L [N 41° 45’, W 4° 5′] | Ti | 1 | 332.6 | 315.9 | 815538 | 459 | 41.6 | 62444 | 583 | 48230 | 311 | |||
| Eastern Europe (native) | ||||||||||||||
| GA-1-1S [N 41° 56’, E 45° 27’] | FLX | 0.25 | 19.1 | 12.5 | 71584 | 209 | 2.0 | 8901 | 233 | 7823 | 8 | |||
| GA-2-1S [N 41° 56’, E 45° 35’] | FLX | 0.25 | 18.9 | 12.9 | 73656 | 210 | 2.0 | 8921 | 233 | 7659 | 11 | |||
| GA-3-4S [N 41° 44’, E 45° 12’] | FLX | 0.25 | 11.5 | 7.5 | 42595 | 212 | 1.3 | 5257 | 240 | 4704 | 3 | |||
| GA-4-3S [N 41° 43’, E 45° 16’] | FLX | 0.25 | 13.0 | 7.7 | 43567 | 215 | 1.3 | 5299 | 240 | 4811 | 11 | |||
| GA-5-4S [N 41° 38’, E 45° 38’] | FLX | 0.25 | 12.8 | 8.6 | 48417 | 214 | 1.5 | 6073 | 243 | 5407 | 4 | |||
| GA-5-24L [N 41° 38’, E 45° 38’] | Ti | 0.63 | 153.5 | 126.5 | 367110 | 390 | 17.4 | 33086 | 522 | 25382 | 165 | |||
| HU-1-8L [N 46° 58’, E 18° 41’] | Ti | 0.5 | 185.3 | 172.4 | 465081 | 421 | 22.3 | 39661 | 559 | 28967 | 213 | |||
| HU-2-10L [N 47° 19’, E 21° 02’] | Ti | 0.5 | 297.9 | 286.6 | 661490 | 489 | 32.5 | 40448 | 699 | 32643 | 311 | |||
| RO-1-6L [N 47° 42’, E 26° 2’] | Ti | 0.75 | 189.6 | 170.8 | 518480 | 376 | 22.6 | 43834 | 510 | 32820 | 182 | |||
| RO-5-10L [N 46° 15’, E 27° 39’] | Ti | 0.5 | 292.5 | 279.1 | 663787 | 479 | 33.7 | 52108 | 602 | 39456 | 293 | |||
| TK-1-3S [N 39° 45’, E 29° 06’] | FLX | 0.5 | 24.7 | 17.3 | 89230 | 229 | 3.2 | 11053 | 268 | 9841 | 26 | |||
| TK-1-5L [N 39° 45’, E 29° 06’] | Ti | 0.63 | 159.5 | 145.6 | 494855 | 327 | 20.2 | 41229 | 462 | 30361 | 207 | |||
| TK-2-3S [N 37° 02’, E 29° 47’] | FLX | 0.25 | 17.1 | 10.6 | 63668 | 201 | 1.8 | 8085 | 223 | 7106 | 4 | |||
| TK-2-4S [N 37° 02’, E 29° 47’] | FLX | 0.25 | 15.7 | 10.0 | 59734 | 204 | 1.5 | 7087 | 225 | 6191 | 6 | |||
| TK-3-2S [N 37° 50’, E 27° 51’] | FLX | 0.25 | 11.4 | 6.2 | 38205 | 198 | 1.0 | 4873 | 210 | 4373 | 2 | |||
| TK-5-4S [N 37° 01’, E 30° 22’] | Ti/FLX | 2.25 | 430.7 | 389.4 | 1258319 | 352 | 35.1 | 71045 | 418 | 48608 | 278 | |||
| TK-5-9L [N 37° 01’, E 30° 22’] | Ti | 0.5 | 252.1 | 241.7 | 554227 | 481 | 29.8 | 40147 | 671 | 32773 | 283 | |||
| Pseudo-reference | | | | | | | 39.6 | 43717 | 811 | | 260 | | | |
| Pseudo-reference filtered for polymorphic single loci | | | | | | | 21.2 | 22687 | 840 | | 154 | | | |
Tissues included are whole seedlings (S) or leaves (L). FLX, 454 Life Science Genome Sequencer FLX; Ti, FLX-Titanium.
Figure 1 AllelePipe workflow for identifying alleles without a reference genome. Unigenes from all individuals are pooled and clustered by similarity. Clustered sequences are aligned and consensus sequences are generated, providing a pseudo-reference genome. Unigenes from the same and/or different individuals are aligned to the reference, and SNPs are identified. Multilocus SNP information is used to construct a minimum set of haplotypes for each individual, and clusters in which individuals are represented by an excess number of putative alleles are flagged as potential multigene clusters.
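The final flagging step in this workflow can be illustrated with a minimal sketch. It is not AllelePipe's code: the function name, the input layout, and the diploid threshold of two alleles per individual are all assumptions made for illustration, and the caption does not state the criteria AllelePipe actually applies.

```python
from collections import defaultdict

# Assumption: a diploid individual carries at most two alleles at a single locus.
PLOIDY = 2


def flag_multigene_clusters(haplotypes, ploidy=PLOIDY):
    """Return the clusters in which any individual has more haplotypes than ploidy allows.

    haplotypes: iterable of (cluster_id, individual_id, haplotype_string) tuples,
    a hypothetical layout chosen for this sketch.
    """
    distinct = defaultdict(set)
    for cluster_id, individual_id, haplotype in haplotypes:
        distinct[(cluster_id, individual_id)].add(haplotype)

    flagged = set()
    for (cluster_id, _individual_id), seen in distinct.items():
        if len(seen) > ploidy:
            flagged.add(cluster_id)
    return flagged


# Toy records: one individual shows two haplotypes in cluster1 and three in cluster2.
records = [
    ("cluster1", "CA-1-1S", "AG"),
    ("cluster1", "CA-1-1S", "AT"),
    ("cluster2", "CA-1-1S", "CC"),
    ("cluster2", "CA-1-1S", "CT"),
    ("cluster2", "CA-1-1S", "TT"),
]
print(flag_multigene_clusters(records))  # {'cluster2'}
```

Here cluster2 is flagged because a single individual contributes three distinct haplotypes, which is more than one diploid locus can explain.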
Figure 2 Contig numbers in transcriptome libraries of native (triangles), naturalized (dashes), and invading (circles) genotypes, as a function of total sequence effort after cleaning by SnoWhite. A logarithmic fit is shown to variation across all individuals.
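A logarithmic fit of this kind can be sketched as an ordinary least-squares regression of contig number on the natural log of sequencing effort. The example below is illustrative only: it uses six rows taken from the "Cleaned Mb" and "Contig No." columns of the table above rather than all 40 libraries, and the model form y = a + b ln(x) and the function name are assumptions, since the caption does not specify how the fit was computed.

```python
import math


def fit_log(x, y):
    """Least-squares fit of y = a + b * ln(x); returns (a, b)."""
    log_x = [math.log(value) for value in x]
    n = len(log_x)
    mean_lx = sum(log_x) / n
    mean_y = sum(y) / n
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    sxy = sum((lx - mean_lx) * (yi - mean_y) for lx, yi in zip(log_x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_lx
    return intercept, slope


# Subset of the table: cleaned sequence (Mb) and contig number for
# TK-3-2S, CA-1-1S, CA-3-1S, CA-4-1S, CA-4-4S, and SP-2-6L.
cleaned_mb = [6.2, 13.0, 45.6, 66.8, 193.2, 315.9]
contigs = [4373, 8769, 19905, 27121, 32786, 48230]

intercept, slope = fit_log(cleaned_mb, contigs)
print(f"contigs = {intercept:.0f} + {slope:.0f} * ln(cleaned Mb)")
```

A positive slope on the log scale means that each doubling of cleaned sequence adds a roughly constant number of contigs, which is the diminishing-returns pattern the figure describes.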
Figure 3 Coverage of SNP positions identified across the dataset. (A) Proportion of SNPs that were sequenced in each individual, and (B) frequency of observed heterozygous loci among sequenced SNPs within native (triangles), naturalized (dashes), and invading (circles) individuals, relative to total sequencing effort after read cleaning. Linear fits are shown to native (gray line) and invading (black line) individuals.
Figure 4 Pairwise overlap in observed SNPs. Isoclines reflect the proportion of all SNP positions that are observed in pairwise comparisons of 40 transcriptome libraries, as a function of sequencing effort in both samples.
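The overlap quantity plotted here can be sketched as the number of SNP positions observed in both libraries of a pair, divided by the total number of SNP positions in the dataset. The function name, the input layout, and the toy positions below are hypothetical; this is one plausible reading of the caption, not the authors' script.

```python
from itertools import combinations


def pairwise_overlap(observed, total_positions):
    """Proportion of all SNP positions seen in both libraries, for every pair.

    observed: dict mapping library name -> set of SNP position identifiers
    (a hypothetical layout for this sketch).
    total_positions: number of SNP positions identified across the dataset.
    """
    return {
        (a, b): len(observed[a] & observed[b]) / total_positions
        for a, b in combinations(sorted(observed), 2)
    }


# Toy data: five SNP positions in total, unevenly covered by three libraries.
observed = {
    "lib1": {1, 2, 3, 4},
    "lib2": {3, 4, 5},
    "lib3": {4},
}
total = len(set().union(*observed.values()))

for pair, proportion in pairwise_overlap(observed, total).items():
    print(pair, proportion)
```

The shallowly covered library (lib3) limits every comparison it takes part in, which mirrors the dependence of overlap on sequencing effort in both samples.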
Figure 5 The distribution of GO annotations to A. thaliana as a function of transcriptome sequencing effort.
Figure 6 STRUCTURE populations inferred from SNP variation. Vertical bars show the population assignment (color) for each individual by region. (A) Two major genetic groups (blue and orange) are supported for the native range, based upon genotypes from Hungary (HU), Romania (RO), Turkey (TK) and the Republic of Georgia (GA). (B) Admixture of these sources is suggested for both putatively naturalized genotypes in Spain (SP) and invading genotypes from California (CA) and Argentina (AR), when the two native genetic groups are fixed as potential source populations. SNP frequencies were based upon 2568 positions observed in at least five individuals per continent.