Isheng J Tsai, Martin Hunt, Nancy Holroyd, Thomas Huckvale, Matthew Berriman, Taisei Kikuchi.
Abstract
Advances in both high-throughput sequencing and whole-genome amplification (WGA) protocols have allowed genomes to be sequenced from femtograms of DNA, for example from individual cells or from precious clinical and archived samples. Using the highly curated Caenorhabditis elegans genome as a reference, we have sequenced and identified errors and biases associated with Illumina library construction, library insert size, different WGA methods and genome features such as GC bias and simple repeat content. Detailed analysis of the reads from amplified libraries revealed characteristics suggesting that the majority of amplified fragment ends are identical but inverted versions of each other. Read coverage in amplified libraries is correlated with both tandem and inverted repeat content, while GC content only influences sequencing in long-insert libraries. Nevertheless, single nucleotide polymorphism (SNP) calls and assembly metrics from reads in amplified libraries show comparable results with unamplified libraries. To realize the full potential of WGA in addressing questions of real biological interest, this article highlights the importance of recognizing additional sources of error in amplified sequence reads and discusses the potential implications for downstream analyses.
Keywords: Illumina; SNPs; chimeric DNA; genome assembly; whole-genome amplification
Year: 2013 PMID: 24353264 PMCID: PMC4060946 DOI: 10.1093/dnares/dst054
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Mapping statistics of sequenced reads from unamplified and amplified libraries
| Method | Replicate | Platform | Library type | Total reads | Mapped (%) | Duplicates^a (%) | Proper^b (%) | Both mapped (%) | Median insert (bp) | Median coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| Unamplified | 1 | HiSeq | Short | 41 688 676 | 99.7 | 0.8 | 98.1 | 99.5 | 285 | 38 |
| Unamplified | 2 | HiSeq | Short | 16 333 732 | 89.4 | 0.6 | 88 | 89.3 | 357 | 13 |
| Phi | 1 | HiSeq | Short | 48 501 916 | 99.4 | 0.7 | 98 | 98.8 | 224 | 46 |
| Phi | 2 | HiSeq | Short | 20 656 296 | 95.2 | 0.8 | 88.6 | 93.7 | 349 | 18 |
| Tre | 1 | HiSeq | Short | 44 481 270 | 99.3 | 1.1 | 97.6 | 98.7 | 237 | 40 |
| Tre | 2 | HiSeq | Short | 25 188 788 | 96.2 | 0.7 | 92 | 95.3 | 328 | 22 |
| Rap | 1 | HiSeq | Short | 26 277 398 | 89.3 | 2.8 | 71.6 | 80.1 | 248 | 15 |
| Rap | 2 | HiSeq | Short | 22 278 134 | 82.5 | 35.7 | 60 | 73.7 | 308 | 3 |
| Unamplified | 1 | HiSeq | Long | 60 856 860 | 99.5 | 9 | 97.2 | 99.1 | 2631 | 41 |
| Unamplified | 2 | MiSeq | Long | 2 551 720 | 86.4 | 1 | 79.1 | 85.4 | 2136 | 2 |
| Phi | 1 | HiSeq | Long | 61 735 210 | 99.5 | 3.5 | 81.5 | 99.2 | 2576 | 39 |
| Phi | 2 | MiSeq | Long | 2 760 576 | 99.3 | 1.2 | 80.2 | 98.9 | 2025 | 2 |
| Tre | 1 | HiSeq | Long | 55 842 586 | 99.4 | 3.4 | 82.6 | 99 | 2285 | 29 |
| Tre | 2 | MiSeq | Long | 2 999 856 | 99.4 | 1 | 73 | 98.9 | 2094 | 2 |
| Rap | 1 | HiSeq | Long | 58 443 656 | 99.4 | 7.6 | 92.1 | 99 | 2591 | 35 |
| Rap | 2 | MiSeq | Long | 3 914 622 | 99 | 6.6 | 91.8 | 98.7 | 2121 | 2 |
All percentages are relative to the total number of reads in each replicate.
^a Reads that are identical copies of other reads and have exact mapped coordinates on the genome.
^b Reads mapped in the correct orientation and at a distance corresponding to that predicted by the fragment library size.
Summary statistics of assembly and scaffolding data from different libraries
| Metric | Contig assembly: Unamplified | Contig assembly: Phi | Contig assembly: Tre | Contig assembly: Rap | Scaffolding: Unamplified | Scaffolding: Phi | Scaffolding: Tre | Scaffolding: Rap |
|---|---|---|---|---|---|---|---|---|
| Assembly size (bp) | 94 641 187 | 94 028 877 | 94 541 978 | 88 411 985 | 96 571 590 | 96 849 620 | 96 382 191 | 96 869 541 |
| Contig number | 13 386 | 14 661 | 21 721 | 34 073 | 9415 | 8744 | 9068 | 9247 |
| Contig average (kb) | 7.1 | 6.4 | 4.4 | 2.6 | 10.3 | 11.1 | 10.6 | 10.5 |
| Largest contig (kb) | 167.7 | 147.8 | 116.8 | 41.8 | 167.7 | 187.5 | 167.7 | 167.7 |
| N50 (kb) | 16.6 | 15.7 | 10.8 | 4.2 | 17.6 | 24.1 | 20.7 | 18.4 |
| N50 (number) | 1525 | 1597 | 2109 | 5897 | 1533 | 1067 | 1258 | 1482 |
| GAGE assessment | | | | | | | | |
| Corrected N50 (kb) | 15.1 | 14.1 | 9.5 | 3.5 | 16.8 | 22.7 | 19.7 | 17.7 |
| Corrected N50 (number) | 1721 | 1825 | 2431 | 7493 | 1642 | 1141 | 1354 | 1577 |
| Missing reference (%) | 0.09 | 0.09 | 4.43 | 0.14 | 0.09 | 0.09 | 0.09 | 0.09 |
| Inversion | 13 | 21 | 38 | 50 | 12 (−1) | 17 (+4) | 13 | 15 (+2) |
| Relocation | 7 | 7 | 11 | 22 | 17 (+10) | 13 (+6) | 19 (+12) | 11 (+4) |
| Translocation | 12 | 16 | 37 | 30 | 12 | 12 | 12 | 11 (−1) |
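The "N50 (kb)" and "N50 (number)" rows can be illustrated with a short calculation. The sketch below assumes the conventional definition (the length of the contig at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the assembly size, and the number of contigs needed to get there). The function name `n50` is hypothetical, and the sketch does not reproduce the GAGE "corrected" values, which depend on breaking contigs at detected errors.

```python
def n50(lengths):
    """Return (N50 length, N50 count) for a list of contig lengths.

    Assumes the conventional N50 definition; this is not taken from
    the source, which only reports the resulting values.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50; 'count' is the number of contigs used.
            return length, count
    raise ValueError("no contigs supplied")


# Toy example: total length is 30, so half is 15.
# The two longest contigs (10 + 8 = 18) are enough to pass it.
print(n50([10, 8, 6, 4, 2]))
```

A larger N50 length together with a smaller N50 count points to a more contiguous assembly, which is the direction the scaffolding columns move in relative to the contig columns.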
Mapping statistics of improperly paired sequenced reads from unamplified and amplified libraries
| Method | Replicate | Library type | Singletons (%) | Interchromosomal^a (%) | Outies/innies^b (%) | Wrong orientation^c (%) | Incorrect insert size (%) |
|---|---|---|---|---|---|---|---|
| Unamplified | 1 | Short | 0.2 | 0.5 | 0.5 | 0.2 | 0.2 |
| Unamplified | 2 | Short | 0.2 | 0.6 | 0.3 | 0.1 | 0.3 |
| Phi | 1 | Short | 0.6 | 0.15 | 0.1 | 0.55 | 0.0 |
| Phi | 2 | Short | 1.5 | 0.4 | 0.3 | 4.1 | 0.3 |
| Tre | 1 | Short | 0.6 | 0.2 | 0.1 | 0.8 | 0.0 |
| Tre | 2 | Short | 1 | 0.4 | 0.3 | 2.5 | 0.1 |
| Rap | 1 | Short | 9.1 | 4.2 | 0.3 | 3.7 | 0.3 |
| Rap | 2 | Short | 8.8 | 0.5 | 1.7 | 3.1 | 8.4 |
| Unamplified | 1 | Long | 0.4 | 1 | 0.5 | 0.3 | 0.1 |
| Unamplified | 2 | Long | 1 | 3.9 | 1.4 | 0.8 | 0.2 |
| Phi | 1 | Long | 0.3 | 2 | 1.7 | 13.3 | 0.7 |
| Phi | 2 | Long | 0.4 | 5.1 | 3.1 | 10.0 | 0.5 |
| Tre | 1 | Long | 0.4 | 1.9 | 1.0 | 12.9 | 0.6 |
| Tre | 2 | Long | 0.4 | 8.1 | 4.8 | 12.4 | 0.6 |
| Rap | 1 | Long | 0.4 | 1.3 | 0.5 | 4.5 | 0.6 |
| Rap | 2 | Long | 0.3 | 3 | 1.3 | 2.2 | 0.4 |
All percentages are relative to the total number of reads in each replicate shown in Table 1.
^a Reads with mates mapped to different chromosomes.
^b Reads with mates mapped to the same chromosome in an incorrect orientation, facing either outwards (‘←→’; outies for short-insert libraries) or inwards (‘→←’; innies for long-insert libraries).
^c Reads with mates mapped to the same chromosome but in the same orientation, i.e. ‘←←’ or ‘→→’. In long-insert libraries, chimera formation is one cause of these reads.
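The categories in the table above can be read as a decision procedure applied to each read pair. The following is a minimal sketch of that procedure, not the authors' pipeline: the `Read` record, the `classify_pair` function, the `read_len` default and the insert-size bounds are all hypothetical, and the order in which the checks are applied is an assumption. Following footnote b, inward-facing pairs are treated as expected for short-insert libraries and outward-facing pairs as expected for long-insert libraries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    chrom: Optional[str]  # None if the read is unmapped
    pos: int              # leftmost mapped coordinate
    reverse: bool         # True if mapped to the reverse strand


def classify_pair(r1, r2, library, min_insert, max_insert, read_len=100):
    """Assign a pair to one category; thresholds are illustrative only."""
    if r1.chrom is None and r2.chrom is None:
        return "unmapped"
    if r1.chrom is None or r2.chrom is None:
        return "singleton"
    if r1.chrom != r2.chrom:
        return "interchromosomal"
    if r1.reverse == r2.reverse:
        return "wrong_orientation"  # same strand: '←←' or '→→'
    left, right = sorted((r1, r2), key=lambda r: r.pos)
    facing_inward = not left.reverse  # leftmost read on forward strand: '→←'
    expected_inward = library == "short"
    if facing_inward != expected_inward:
        return "outie" if library == "short" else "innie"
    insert = right.pos + read_len - left.pos
    if not (min_insert <= insert <= max_insert):
        return "incorrect_insert_size"
    return "proper"


# Short-insert pair facing inward at a plausible distance.
print(classify_pair(Read("I", 100, False), Read("I", 300, True),
                    "short", 100, 600))
```

In practice the strand, position and mapping status would come from alignment records, and the insert-size window would be derived from the observed library distribution rather than fixed by hand.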
Figure 1. (A, B and D) Types of chimeric rearrangements. Each DNA sequence is represented by two or three adjacent segments. Arrows indicate directions of amplified fragments relative to the DNA sequence. (A) and (B) Segment a is copied, b is deleted and c is copied and reverse complemented. (D) The first part of the sequence is copied twice, with unknown sequence placed between the two copies. (C) Insert size distribution plot of wrong-orientation reads in Phi amplified libraries.
Figure 2. A plot of genome coverage against normalized average depth. Deviation from the theoretical curve (red) indicates less even distribution of coverage depth across the genome. Different protocols are plotted in different colours as listed in the legend, and dashed lines indicate read coverage from Replicate 1 of the long-insert libraries.
Figure 3. Normalized coverage of 10 kb windows on Chr 1 of C. elegans. Red and blue depict the coverage of Replicates 1 and 2, respectively.
Figure 4. Scatterplots showing relationships between (A) inverted and (B) tandem repeat content and normalized read coverage in 10 kb windows of C. elegans.
Figure 5. Distribution of GC content in sequenced reads of (A) short- and (B) long-insert libraries.
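Figures 3 and 4 report normalized read coverage in 10 kb windows. The record does not state how the normalization was done, so the sketch below makes an assumption: each window's mean depth is divided by the mean depth of the whole sequence, so that a value of 1.0 means average coverage. The function name `normalized_window_coverage` and the per-base depth list used as input are hypothetical.

```python
def normalized_window_coverage(depth, window=10_000):
    """Mean depth per window divided by the overall mean depth.

    'depth' is a list of per-base read depths for one chromosome.
    The normalization scheme is an assumption, not the source's method.
    """
    if not depth:
        return []
    overall = sum(depth) / len(depth)
    if overall == 0:
        raise ValueError("overall depth is zero; cannot normalize")
    result = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]  # last window may be shorter
        result.append(sum(chunk) / len(chunk) / overall)
    return result


# Toy example with a window of 4 bases: the first window sits at half
# the average depth and the second at one and a half times the average.
print(normalized_window_coverage([10] * 4 + [30] * 4, window=4))
```

Values computed this way can be compared between replicates, or plotted against per-window repeat content, without the result depending on the total sequencing yield of each library.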
Summary of variant calls
| Protocols | Phi | Tre | Rap |
|---|---|---|---|
| (A) | | | |
| Homozygous SNPs (643) | | | |
| No/low coverage | 105 | 91 | 192 |
| Not called the same | 28 | 30 | 28 |
| Also called in one replicate | 150 | 154 | 423 |
| Called in both replicates | 360 | 368 | NA |
| Heterozygous SNPs (2117) | | | |
| No/low coverage | 132 | 85 | 291 |
| Not called the same | 692 | 650 | 832 |
| Also called in one replicate | 705 | 813 | 994 |
| Called in both replicates | 588 | 569 | NA |
| (B) | | | |
| Homozygous SNPs | | | |
| Called differently in both unamplified replicates | 37 | 36 | 107 |
| Called in one replicate | 14 | 28 | 134 |
| Heterozygous SNPs | | | |
| Called differently in both unamplified replicates | 105 | 158 | 528 |
| Called in one replicate | 44 | 46 | 1465 |
(A) Fate of 643 homozygous and 2117 heterozygous SNP calls from both unamplified replicates; (B) fate of additional homozygous and heterozygous SNP calls from amplified replicates.