Shairul Izan, Danny Esselink, Richard G F Visser, Marinus J M Smulders, Theo Borm.
Abstract
Whole Genome Shotgun (WGS) sequences of plant species often contain an abundance of reads that are derived from the chloroplast genome. Up to now, these reads have generally been identified and assembled into chloroplast genomes based on homology to chloroplasts from related species. This re-sequencing approach may select against structural differences between the genomes, especially in non-model species for which no close relatives have been sequenced before. The alternative approach is to de novo assemble the chloroplast genome from total genomic DNA sequences. In this study, we used k-mer frequency tables to identify and extract the chloroplast reads from the WGS reads and assembled these using a highly integrated and automated custom pipeline. Our strategy includes steps aimed at optimizing assemblies and filling gaps that are left due to coverage variation in the WGS dataset. To demonstrate the universality of our approach, we have successfully de novo assembled three complete chloroplast genomes from plant species with a range of nuclear genome sizes: Solanum lycopersicum (0.9 Gb), Aegilops tauschii (4 Gb) and Paphiopedilum henryanum (25 Gb). We also highlight the need to optimize the choice of k and the amount of data used. This new and cost-effective method for de novo short read assembly will facilitate the study of complete chloroplast genomes with more accurate analyses and inferences, especially in non-model plant genomes.
Keywords: Aegilops; DNA sequencing; Paphiopedilum; Solanum; chloroplast genome; de novo assembly; k-mer analysis; whole genome shotgun sequencing
Year: 2017 PMID: 28824658 PMCID: PMC5539191 DOI: 10.3389/fpls.2017.01271
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
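The abstract describes using k-mer frequency tables to pick chloroplast reads out of a WGS dataset. The sketch below illustrates the general idea only, and it is not the authors' pipeline: because chloroplast DNA is abundant in the reads, its k-mers tend to occur far more often than nuclear ones, so reads made up mostly of high-frequency k-mers can be kept. The function names, the `min_count` threshold, the `min_fraction` cutoff and the toy sequences are all assumptions chosen for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_table(reads, k):
    """Count how often each k-mer occurs across all reads."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


def select_frequent_reads(reads, k, min_count, min_fraction=0.8):
    """Keep reads in which at least min_fraction of the k-mers
    occur min_count times or more in the whole dataset.
    Both thresholds are hypothetical tuning parameters."""
    table = build_kmer_table(reads, k)
    selected = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:  # read is shorter than k
            continue
        frequent = sum(1 for km in read_kmers if table[km] >= min_count)
        if frequent / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected


# Toy data: one sequence present five times (standing in for high-copy
# chloroplast DNA) and two single-copy sequences (standing in for
# nuclear DNA). These sequences are invented for the example.
organelle_like = "ACGTTGCAAGGCTA"
nuclear_like = ["CCATGGTTAACCGG", "TTTTCCCCGGGGAA"]
reads = [organelle_like] * 5 + nuclear_like

kept = select_frequent_reads(reads, k=5, min_count=3)
print(len(kept), "of", len(reads), "reads kept")
```

A real dataset would need a dedicated k-mer counter, reverse-complement handling and a threshold derived from the observed k-mer frequency distribution. The abstract's remark about optimizing the choice of k and the amount of data applies to exactly these parameters.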
Species used in the study and their SRA number.
| Species (n) | Haploid genome size (bases) | Group | NCBI SRA number |
|---|---|---|---|
| Solanum lycopersicum (1) | 950 Mb | Dicot | SRR404081 |
| Aegilops tauschii (2) | 4–5 Gb | Monocot | SRR124187 |
| Paphiopedilum henryanum (3) | 25–35 Gb | Monocot | Own data |
Summary statistics before and after the fetching of the chloroplast reads.
| | Case study 1 | Case study 2 | Case study 3 |
|---|---|---|---|
| Genome size | 950 Mb | 4–5 Gb | 25–35 Gb |
| Total no. of raw reads (pairs) | 198 264 041 | 86 067 571 | 15 142 939 |
| Total no. of reads after stage 2 (pairs) | 32 701 410 | 51 717 173 | 6 172 495 |
| Total no. of reads after stage 3 (pairs) | 14 855 294 | 1 582 279 | 213 669 |
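To make the read counts easier to compare across the three case studies, the short calculation below expresses the stage 2 and stage 3 counts as fractions of the raw read pairs. The numbers are copied from the table above; the dictionary layout and the `retained_fraction` helper are assumptions introduced for this example.

```python
# Read-pair counts copied from the summary statistics table.
read_pairs = {
    "Case study 1": {"raw": 198_264_041, "stage 2": 32_701_410, "stage 3": 14_855_294},
    "Case study 2": {"raw": 86_067_571, "stage 2": 51_717_173, "stage 3": 1_582_279},
    "Case study 3": {"raw": 15_142_939, "stage 2": 6_172_495, "stage 3": 213_669},
}


def retained_fraction(counts, stage):
    """Fraction of the raw read pairs still present after the given stage."""
    return counts[stage] / counts["raw"]


for case, counts in read_pairs.items():
    s2 = retained_fraction(counts, "stage 2")
    s3 = retained_fraction(counts, "stage 3")
    print(f"{case}: {s2:.1%} of raw pairs after stage 2, {s3:.2%} after stage 3")
```

After stage 3, roughly 7.5%, 1.8% and 1.4% of the raw read pairs remain for case studies 1, 2 and 3, respectively.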
Comparison of the SOAPdenovo assembly and de novo assembly derived after stages 4 and 5 from the proposed pipeline.
| Case study | Method | Number of scaffolds | Number of gaps | Total assembly length | Total reference length |
|---|---|---|---|---|---|
| 1 | SOAPdenovo | 3 | 0 | 130 181∗ | 155 461a |
| | Our approach | 1 | 0 | 155 461 | |
| 2 | SOAPdenovo | 9 | 4 | 114 806∗ | 135 685b |
| | Our approach | 2 | 2 | 135 760 | |
| 3 | SOAPdenovo | 12 | 3 | 122 051∗ | 174 417c |
| | Our approach | 4 | 1 | 156 087 | |
Variant calling for case studies 1 and 2.
| Case study | Type | Variants | Position in the assembled genome |
|---|---|---|---|
| Case study 1 | Mismatch | G (ref) > T (ass) | 127404 |
| Case study 2 | Insertion | AGGTACCTAA | 7653–7662 |
| | Insertion | Homopolymer T region | 18272–18274 |
| | Insertion | Homopolymer A region | 18614 |
| | Insertion | Homopolymer A region | 34160 |
| | Insertion | CT | 43329–43330 |
| | Insertion | Homopolymer A region | 56672–56673 |
| | Mismatch | CTCTC (ref) > TCTCT (ass) | 76298–76302 |
| | Deletion | Homopolymer A region | 78860 |
| | Insertion | TTTACTTTTATGTTTTATTTG | 107322–107342 |
| | Insertion | GCAATAATCTACTAAAAAAA | 109678–109697 |
| | Mismatch | G (ref) > N (ass) | 109894 |
| | Mismatch | T (ref) > N (ass) | 109893 |