Ning Jiang1, Fengjun Zhang1,2, Jinhua Wu1, Yue Chen3, Xiaohua Hu1, Ou Fang1, Lindsey J Leach3, Di Wang4, Zewei Luo5,6.
Abstract
KEY MESSAGE: This optimized approach provides both a computational tool and a library construction protocol that maximize the number of genomic sequence reads uniformly covering a plant genome and minimize the number of sequence reads representing chloroplast DNA and rRNA genes. Researchers can use the computational tool to design their own RAD-seq experiment and achieve the expected coverage of sequence variant markers in large plant populations, using the genome sequence and, ideally though not necessarily, information on the distribution of sequence polymorphisms in the genome.

The advent of next-generation sequencing techniques has motivated recent interest in sequence-based identification and genotyping of genome-wide genetic variants in large populations, with RAD-seq being a typical example. Because chloroplast and rRNA genes may account for up to 60 % of the resulting sequence reads, a RAD-seq design that does not properly account for them can be very inefficient for plant and crop species. We present here a generic computational tool to optimize RAD-seq design in any plant species, and we tested the optimized design experimentally by using it to screen for and genotype sequence variants in four plant populations of diploid and autotetraploid Arabidopsis and potato (Solanum tuberosum). Sequence data from the optimized RAD-seq experiments show that the undesirable reads contributed by chloroplast and rRNA sequences can be held to 3-10 %. Additionally, the optimized RAD-seq method allows the required uniformity and density of high-quality polymorphic sequence markers over the genome of interest to be designed in advance, and it allows large plant or crop populations to be genotyped at a cost competitive with other mainstream methods in the literature.
Year: 2016 PMID: 27316437 PMCID: PMC4983294 DOI: 10.1007/s00122-016-2736-9
Source DB: PubMed Journal: Theor Appl Genet ISSN: 0040-5752 Impact factor: 5.699
Fig. 1 Workflow of modified RAD-seq library construction. a shearing the cellular DNA into fragments, b ligating the adapters to fragment ends, c pooling of samples and fragment size selection, d second round of digestion to remove the DNA fragments from rRNA genes and chloroplast sequence, e PCR amplification, f second round of fragment size selection
The number of sheared DNA fragments based on different RE combinations
| Restriction enzyme combinations | Total fragments (coverage) | Selected fragments* (coverage) | Detectable variants | Fragments in rRNA and chloroplast regions |
|---|---|---|---|---|
| (a) | ||||
| EcoRI, MseI** | 68,867 (9.24 %) | 11,080 (2.8 %) | 994 | 46 |
| EcoRI, MspI | 53,005 (37.5 %) | 9145 (2.4 %) | 814 | 38 |
| HindIII, MseI | 118,721 (14.4 %) | 17,566 (4.5 %) | 1592 | 15 |
| HindIII, MspI | 77,638 (44.9 %) | 14,624 (3.9 %) | 1371 | 16 |
| EcoRI, MseI, MspI | 69,680 (7.5 %) | 9219 (2.3 %) | 782 | 33 |
| HindIII, MseI, MspI | 121,112 (12.1 %) | 14,903 (3.7 %) | 1353 | 11 |
| EcoRI, HindIII, MseI | 176,921 (20.6 %) | 25,314 (6.42 %) | 2288 | 55 |
The RE combinations recommended in the current study are shown in bold
* Size range from 224 to 424 bp
** Recommended in (Alonso-Blanco et al. 1998; Truong et al. 2012)
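The fragment counts in the table above come from digesting the genome sequence in silico and keeping fragments inside the size window. The following is a minimal sketch of that idea, not the authors' tool: the function names and the toy sequence are hypothetical, the input is assumed to be an uppercase DNA string, and every fragment is counted regardless of which enzyme produced its two ends (a real design may restrict this). The recognition sites and cut offsets are the standard ones for these four enzymes.

```python
import re

# Recognition sequences (5'->3') and top-strand cut offsets for the enzymes
# listed in the table; e.g. EcoRI cuts G^AATTC, so its offset is 1.
# All four sites are palindromic, so scanning one strand is sufficient.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "MseI": ("TTAA", 1),
    "MspI": ("CCGG", 1),
}


def cut_positions(seq, enzyme_names):
    """Return sorted top-strand cut coordinates for all named enzymes."""
    positions = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        # A lookahead pattern also reports overlapping recognition sites.
        for match in re.finditer(f"(?={site})", seq):
            positions.add(match.start() + offset)
    return sorted(positions)


def digest(seq, enzyme_names):
    """Split seq at every cut position and return the fragments in order."""
    bounds = [0] + cut_positions(seq, enzyme_names) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def select_by_size(fragments, min_len=224, max_len=424):
    """Keep fragments whose length falls inside the size-selection window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy sequence (not real genomic data): one EcoRI site and one MseI site.
toy = "A" * 10 + "GAATTC" + "C" * 300 + "TTAA" + "G" * 50
fragments = digest(toy, ["EcoRI", "MseI"])
selected = select_by_size(fragments)
print([len(f) for f in fragments], [len(f) for f in selected])
```

Running the same functions over each chromosome, the chloroplast genome and the rRNA gene sequences separately would give per-region fragment counts comparable in kind to the table's columns, under the simplifications stated above.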
Predicted proportions of sequence reads to be generated from the 20,789 selected DNA fragments
| | Diploid | | | Tetraploid | | |
|---|---|---|---|---|---|---|
| | Genome | rRNA | Chloroplast | Genome | rRNA | Chloroplast |
| Number of selected DNA fragments per haploid genome | 20,745 | 4 | 40 | 20,745 | 4 | 40 |
| Number of copies | 2 | 2 × 700 | 1200 | 4 | 4 × 700 | 1200 |
| Number of selected DNA fragments per cell | 41,490 | 5600 | 48,000 | 82,980 | 11,200 | 48,000 |
| Total number of selected DNA fragments per cell | 95,090 | | | 142,180 | | |
| Proportion of reads mapped to different regions | 43.6 % | 5.9 % | 50.5 % | 58.4 % | 7.9 % | 33.7 % |
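The arithmetic behind this table can be reproduced in a few lines. The sketch below assumes, as the table's layout implies, that the expected share of reads from each region is proportional to the number of selected fragment copies per cell; the function and variable names are illustrative only.

```python
def predicted_read_proportions(fragments_per_haploid, copies_per_cell):
    """Return (per-region read proportions, total selected fragments per cell).

    Assumes reads are sampled in proportion to fragment copy number per cell.
    """
    per_cell = {
        region: fragments_per_haploid[region] * copies_per_cell[region]
        for region in fragments_per_haploid
    }
    total = sum(per_cell.values())
    proportions = {region: n / total for region, n in per_cell.items()}
    return proportions, total


# Values taken from the table above.
fragments = {"genome": 20745, "rRNA": 4, "chloroplast": 40}
diploid_copies = {"genome": 2, "rRNA": 2 * 700, "chloroplast": 1200}
tetraploid_copies = {"genome": 4, "rRNA": 4 * 700, "chloroplast": 1200}

diploid_props, diploid_total = predicted_read_proportions(fragments, diploid_copies)
tetraploid_props, tetraploid_total = predicted_read_proportions(fragments, tetraploid_copies)

for label, props in (("diploid", diploid_props), ("tetraploid", tetraploid_props)):
    print(label, {region: f"{100 * p:.1f} %" for region, p in props.items()})
```

Under this model, fewer than half of the reads in a diploid sample would come from the nuclear genome unless the rRNA and chloroplast fragments are removed, which is the motivation for the second digestion step.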
The optimized combination of REs and their cut sites in each of three types of Arabidopsis and potato DNA fragments
| RE combination | Cut sites within rRNA fragments | Cut sites within chloroplast fragments | Cut sites within genomic DNA fragments | Genomic DNA fragments intact | Annotated polymorphisms |
|---|---|---|---|---|---|
| (a) | | | | | |
| | 4 | 40 | 12,720 | 8025 | 953 |
| AvaII, TfiI | 4 | 40 | 13,669 | 7070 | 817 |
| Sse9I | 4 | 40 | 17,488 | 3215 | 376 |
Summary of 6 pooled DNA sequencing libraries constructed using the optimized RAD-seq design (the first four) and the corresponding control design (the last two) and the number of sequence reads expected from each of the pooled sequence libraries
| | Arabidopsis | | Potato | | Arabidopsis (control) | Potato (control) |
|---|---|---|---|---|---|---|
| | Diploid | Tetraploid | Diploid | Tetraploid | | |
| Sequenced samples | 2 parents + 10 offspring | 2 parents + 10 offspring | 2 parents + 10 offspring | 2 parents + 10 offspring | 2 parental lines + 1 offspring (diploid and tetraploid) | 2 parental lines + 1 offspring (diploid and tetraploid) |
| 1st digestion | EcoRI, HindIII, MspI | EcoRI, HindIII, MspI | EcoRI, | EcoRI, | EcoRI, | EcoRI, |
| 2nd digestion | SnaBI, | SnaBI, | HhaI, | HhaI, | – | – |
| Sequencing platforms | Illumina MiSeq | Illumina MiSeq | Illumina HiSeq 2000 | Illumina HiSeq 2000 | Illumina MiSeq | Illumina HiSeq 2000 |
| Sequencing lengths (bp) | 2 × 150 | 2 × 150 | 2 × 100 | 2 × 100 | 2 × 150 | 2 × 100 |
| Expected coverage | 125 | 125 | 200 | 200 | 60 | 100 |
| Expected number of reads per sample | 1.25 M | 1.25 M | 2.00 M | 2.00 M | 0.50 M | 1.00 M |
| Total number of reads | 15 M | 15 M | 24 M | 24 M | 3 M | 6 M |
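The last two rows of this table are linked by simple multiplication: total reads per pool equal the number of barcoded samples times the expected reads per sample. The sketch below checks that relationship for all six libraries. The library labels are illustrative, and the control sample count of 6 is an assumption based on reading "2 parental lines + 1 offspring (diploid and tetraploid)" as three samples at each ploidy level.

```python
def total_reads_millions(n_samples, reads_per_sample_millions):
    """Total reads in a pooled library, in millions."""
    return n_samples * reads_per_sample_millions


# (label, samples in pool, expected reads per sample in millions)
# Sample counts: 2 parents + 10 offspring = 12; controls assumed to be 6.
libraries = [
    ("optimized MiSeq diploid", 12, 1.25),
    ("optimized MiSeq tetraploid", 12, 1.25),
    ("optimized HiSeq diploid", 12, 2.00),
    ("optimized HiSeq tetraploid", 12, 2.00),
    ("control MiSeq", 6, 0.50),
    ("control HiSeq", 6, 1.00),
]

totals = {label: total_reads_millions(n, reads) for label, n, reads in libraries}
for label, total in totals.items():
    print(f"{label}: {total:g} M reads")
```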
Fig. 2 Distribution of the number of short reads across 12 barcoded samples in each pooled RAD-seq dataset. The red dashed line shows the average number of paired reads per sample
Percentage of RAD-seq short reads aligning to different genome regions in Arabidopsis and potato
| | Arabidopsis | | | | Potato | | | |
|---|---|---|---|---|---|---|---|---|
| | Chloroplast and rRNA fragments unremoved | | Chloroplast and rRNA fragments removed | | Chloroplast and rRNA fragments unremoved | | Chloroplast and rRNA fragments removed | |
| | Diploid | Tetraploid | Diploid | Tetraploid | Diploid | Tetraploid | Diploid | Tetraploid |
| Genomic | | | | | | | | |
| Chloroplast | 31.4 | 27.0 | 6.1 | 6.1 | 64.5 | 61.1 | 5.5 | 2.7 |
| rRNAs | 11.9 | 11.6 | 3.0 | 3.1 | 0.7 | 1.2 | 0.1 | 0.1 |
| Unmapped | 9.3 | 8.2 | 9.0 | 8.7 | 7.8 | 7.4 | 11.0 | 14.4 |
Fig. 3 Length distribution of sequenced DNA fragments in each pooled RAD-seq dataset
Coverage of Arabidopsis RAD-seq reads in Mbp across the whole genome and in selected regions (2.0 Mbp)
| Sample ID | Diploid sample pools | | | | Tetraploid sample pools | | | |
|---|---|---|---|---|---|---|---|---|
| | Whole genome | | Selected regions | | Whole genome | | Selected regions | |
| | Covered* | Deep** | Covered* | Deep** | Covered* | Deep** | Covered* | Deep** |
| F2_1 | 6.95 | 2.73 | 1.86 | 1.69 | 5.74 | 2.88 | 1.89 | 1.76 |
| F2_2 | 8.51 | 3.17 | 1.93 | 1.83 | 8.38 | 3.08 | 1.91 | 1.80 |
| F2_3 | 7.36 | 2.75 | 1.83 | 1.69 | 5.50 | 2.88 | 1.84 | 1.68 |
| F2_4 | 7.69 | 2.81 | 1.76 | 1.68 | 6.97 | 2.60 | 1.88 | 1.63 |
| F2_5 | 4.96 | 2.23 | 1.68 | 1.60 | 7.76 | 2.39 | 1.87 | 1.55 |
| F2_6 | 5.97 | 3.18 | 1.84 | 1.74 | 5.94 | 2.13 | 1.84 | 1.70 |
| F2_7 | 5.55 | 2.55 | 1.82 | 1.63 | 7.60 | 2.68 | 1.82 | 1.62 |
| F2_8 | 8.68 | 2.61 | 1.82 | 1.61 | 5.10 | 2.53 | 1.84 | 1.60 |
| F2_9 | 7.41 | 2.64 | 1.81 | 1.66 | 8.28 | 2.87 | 1.90 | 1.72 |
| F2_10 | 5.79 | 3.00 | 1.81 | 1.69 | 5.69 | 2.76 | 1.88 | 1.69 |
| P1 | 6.05 | 2.54 | 1.87 | 1.53 | 5.34 | 2.23 | 1.83 | 1.69 |
| P2 | 5.57 | 2.81 | 1.57 | 1.46 | 5.63 | 2.86 | 1.54 | 1.42 |
* At least 2 reads uniquely mapped
** At least 10 reads uniquely mapped
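The two footnote thresholds define how the "Covered" and "Deep" columns are counted. As a minimal sketch, assuming a per-base list of uniquely mapped read depths is already available for a sample (the input format and function names here are hypothetical, not the authors' pipeline), the breadth at each threshold could be tallied like this:

```python
def coverage_breadth_bp(depths, covered_min=2, deep_min=10):
    """Count bases meeting the 'covered' and 'deep' depth thresholds.

    depths: iterable of per-base counts of uniquely mapped reads.
    Returns (covered_bp, deep_bp).
    """
    covered_bp = 0
    deep_bp = 0
    for depth in depths:
        if depth >= covered_min:
            covered_bp += 1
        if depth >= deep_min:
            deep_bp += 1
    return covered_bp, deep_bp


def to_mbp(n_bases):
    """Convert a base count to megabase pairs."""
    return n_bases / 1_000_000


# Toy depth profile (not real data).
toy_depths = [0, 1, 2, 5, 10, 12]
covered_bp, deep_bp = coverage_breadth_bp(toy_depths)
print(covered_bp, deep_bp)
```

Applying `to_mbp` to the two counts for a whole genome, or for the bases inside the selected fragments only, would give figures in the same units as the table.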
The number of genetic variants detected from the optimized Arabidopsis and potato RAD-seq datasets
| | Arabidopsis | | Potato | |
|---|---|---|---|---|
| | Diploid | Tetraploid | Diploid* | Tetraploid* |
| Candidate variants | 39,313 | 40,076 | 174,447 (1058) | 125,291 (1087) |
| SNPs | 28,779 | 28,649 | 158,272 (679) | 114,407 (797) |
| INDELs | 10,534 | 11,427 | 16,175 (379) | 10,885 (290) |
| Verified variants | 848 | 786 | 1779 | 5045 |
| SNPs | 601 | 554 | 1673 | 4688 |
| INDELs | 247 | 232 | 106 | 357 |
* The numbers in parentheses refer to tri- and tetra- allelic genetic markers
Fig. 4 Distribution of detectable genetic markers in the Arabidopsis and potato genomes. The black bars below each chromosome indicate the centromere regions
Fig. 5 Distribution of genotype frequency in offspring samples. Distribution of three possible genotypes (homozygote, heterozygote and homozygote) at 846 and 786 detected SNP sites from the diploid Arabidopsis F2 samples (a) and in tetraploid Arabidopsis F2 samples (b) respectively. The correlation between marker homozygous genotype frequency and distance to the centromere region in Arabidopsis tetraploid F2 samples (c). Distribution of three possible genotypes (homozygote, heterozygote and homozygote) at 1269 and 3499 detected SNP sites from the diploid potato F1 samples (d) and in tetraploid potato F1 samples (e). The correlation between marker homozygous genotype frequency and distance to the centromere region in potato tetraploid F1 samples (f)