Sven Warris1, Elio Schijlen2, Henri van de Geest2,3, Rahulsimham Vegesna4,5,6, Thamara Hesselink2, Bas Te Lintel Hekkert2, Gabino Sanchez Perez2,3, Paul Medvedev6,7,8,9, Kateryna D Makova9,10, Dick de Ridder11.
Abstract
BACKGROUND: Next-generation sequencing requires sufficient DNA to be available. If limited, whole-genome amplification is applied to generate additional amounts of DNA. Such amplification often results in many chimeric DNA fragments, in particular artificial palindromic sequences, which limit the usefulness of long sequencing reads.
Keywords: Chimeric reads; High molecular weight DNA; Long read sequencing; Palindromes; Read mapping; Whole-genome amplification; de novo assembly
Year: 2018 PMID: 30400848 PMCID: PMC6218980 DOI: 10.1186/s12864-018-5164-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The introduction of palindromes by whole-genome amplification (WGA) and correction of these sequences with Pacasus. The colored squares in this figure indicate the four different nucleotides. In whole-genome amplification a DNA-polymerase binds to the DNA and starts making a copy of that strand (left side of the image). Palindromes are introduced when during WGA the DNA-polymerase continues with elongation (indicated by the arrow) along an already created WGA product (a), generating a palindrome. In this example this incorrect elongation occurs several times (b), resulting in a DNA fragment containing four copies of the original fragment (c), which is sequenced. Pacasus detects the palindrome sequence by aligning the read’s reverse complement to itself (d) and splits the read in two smaller reads at the center of the alignment (split 1). This process is repeated and splits the two resulting reads again (split 2), yielding four separate, ‘clean’, reads. The full set of reads, corrected and left intact, is then used in, for example, read mapping or de novo assembly
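The detect-and-split loop described in the caption can be sketched in a few lines of Python. This is a minimal illustration and not the Pacasus implementation: the plain Smith-Waterman routine, the scoring values, and the `min_score` and `min_len` thresholds are all assumptions chosen for readability, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def local_align(a, b, match=2, mismatch=-3, gap=-4):
    """Plain Smith-Waterman with linear gap costs (illustrative scoring).

    Returns (score, start, end): the best local alignment score and the
    half-open interval a[start:end] that takes part in that alignment.
    """
    best_score, best_start, best_end = 0, 0, 0
    prev = [(0, 0)] * (len(b) + 1)  # each cell holds (score, start in a)
    for i in range(1, len(a) + 1):
        cur = [(0, i)]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = (
                (prev[j - 1][0] + sub, prev[j - 1][1]),  # match or mismatch
                (prev[j][0] + gap, prev[j][1]),          # gap in b
                (cur[j - 1][0] + gap, cur[j - 1][1]),    # gap in a
            )
            score, start = max(candidates, key=lambda c: c[0])
            if score <= 0:
                score, start = 0, i  # restart the local alignment here
            cur.append((score, start))
            if score > best_score:
                best_score, best_start, best_end = score, start, i
        prev = cur
    return best_score, best_start, best_end


def split_palindromes(read, min_score=20, min_len=5):
    """Recursively split a read at the center of its self-alignment.

    min_score and min_len are hypothetical thresholds: reads scoring below
    min_score are left intact, pieces shorter than min_len are dropped.
    """
    score, start, end = local_align(read, reverse_complement(read))
    if score < min_score:
        return [read]
    center = (start + end) // 2
    pieces = []
    for piece in (read[:center], read[center:]):
        if len(piece) >= min_len:
            pieces.extend(split_palindromes(piece, min_score, min_len))
    return pieces
```

Applied to a read built as `x + reverse_complement(x)` repeated twice, the sketch splits once at the middle and then once more in each half, mirroring split 1 and split 2 in the figure. The quadratic pure-Python alignment is only practical for short toy sequences.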
Datasets used for the performance analysis of Pacasus
| Species | Sample | WGA | Illumina HiSeq2000 reads | Illumina HiSeq2000 read length | PacBio RSII reads | PacBio RSII avg. read length |
|---|---|---|---|---|---|---|
| A. thaliana | Ath-WGA1 | yes | 31,233,196 | 100 | 462,138 | 9326 |
| A. thaliana | Ath-WGA2 | yes | 43,810,780 | 100 | 447,364 | 8544 |
| A. thaliana | Ath-Ctrl | no | | | 940,162 | 5680 |
| Gorilla | GorY-WGA | yes | 279,601,852 | 150 | 3,596,236 | 5468 |
Effect of correcting palindromes
| Sample | Before cleaning: number of reads | Before cleaning: average length (b) | Reads with detectable palindromes: number of reads | Reads with detectable palindromes: number of reads (%) | After correcting: number of reads | After correcting: average length (b) |
|---|---|---|---|---|---|---|
| Ath-WGA1 | 462,138 | 9326 | 221,001 | 47.8 | 869,826 | 4660 |
| Ath-WGA2 | 447,364 | 8544 | 195,263 | 43.6 | 769,027 | 4721 |
| Ath-Ctrl | 940,162 | 5680 | 4714 | 0.5 | 938,196 | 5681 |
| GorY-WGA | 3,596,236 | 5468 | 426,188 | 11.8 | 4,546,488 | 4234 |
Effect of correcting palindromes on the number of reads and the average lengths of these reads. Note: the Ath-Ctrl shows a small increase in average read length after correction and a lower number of reads. This is because Pacasus removes very short reads from the output
Fig. 2 %GC density plot of Ath-Ctrl (green), Ath-WGA (blue), Ath-Clean (red) and the A. thaliana reference genome (black). The curves for Ath-WGA and Ath-Clean overlap completely. All three read sets do not show biases towards a certain GC-content when compared to the reference genome
Read mapping statistics
| | Reads mapped (%) | | | Avg. coverage | | | Avg. read length | | |
|---|---|---|---|---|---|---|---|---|---|
| Alignment filter | – | 80% | 95% | – | 80% | 95% | – | 80% | 95% |
| Ath-WGA | 99 | 44 | 34 | 40.5 | 17.4 | 13.4 | 8987 | 5568 | 5397 |
| Ath-Clean | 99 | 81 | 66 | 55.1 | 48.1 | 41.2 | 4690 | 4532 | 4697 |
| Ath-Ctrl | 95 | 72 | 57 | 34.2 | 29.2 | 24.8 | 5799 | 5583 | 5852 |
Statistics of read mappings with BLASR to the TAIR10 reference genome, calculated without filtering for a minimum read alignment length (‘-’) and after filtering for reads aligned with at least 80% or 95% nucleotide identity
Statistics on the de novo assemblies
| | PacBio-only (Canu) | | | Hybrid (DBG2OLC/Sparse) | | |
|---|---|---|---|---|---|---|
| Read set | Ath-Ctrl (C1) | Ath-WGA (C2) | Ath-Clean (C3) | Ath-Ctrl (D1) | Ath-WGA (D2) | Ath-Clean (D3) |
| No. contigs | 852 | 2128 | 1015 | 476 | 4818 | 1753 |
| Ass. length (Mbp) | 115.6 | 116.8 | 123.9 | 110.9 | 108.9 | 131.0 |
| Longest contig (Kbp) | 1181 | 655 | 3402 | 5667 | 246 | 2239 |
| GC (%) | 36.0 | 36.2 | 36.12 | 35.97 | 36.57 | 36.21 |
| N50 (Kbp) | 293 | 73 | 302 | 823 | 32 | 278 |
| L50 | 117 | 479 | 109 | 33 | 951 | 113 |
| Covered (%) | 86.6 | 91.2 | 97.3 | 85.1 | 49.7 | 96.3 |
| Dupl. ratio | 1.09 | 1.07 | 1.06 | 1.08 | 1.34 | 1.13 |
Statistics on the PacBio-only and hybrid assemblies of the various datasets. Note that the TAIR10 reference genome is 119.7 Mb, with the full genome thought to be approximately 135 Mb [21]
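For readers unfamiliar with the N50 and L50 rows, the following sketch shows the usual definition: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly length, while L50 is the number of contigs needed to get there. The function name `nx_lx` and the example lengths are hypothetical; the assembly tools named in the table may handle ties or reference-based variants differently.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the cumulative length of the
    contigs, sorted longest first, reaches `fraction` of the total.
    Lx is the number of contigs needed to reach that point.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    return lengths[-1], len(lengths)  # only reached if fraction > 1


# Toy example with made-up contig lengths (total 30).
n50, l50 = nx_lx([2, 10, 6, 8, 4])
n20, l20 = nx_lx([2, 10, 6, 8, 4], fraction=0.2)
```

A higher N50 together with a lower L50 indicates that fewer, longer contigs make up half of the assembly.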
Fig. 3 Contig length distributions. Contig length (y-axis) distribution of the published gorilla Y chromosome (GorY), the contigs underlying this assembly (GorY contigs), the de novo assembly based on the raw PacBio data set (GorY-WGA) and the de novo assembly of the cleaned reads (GorY-Clean). The x-axis shows the fraction of the assembly (e.g. the N20, N50, etc.)
Human and gorilla Y-chromosome assembly statistics
| Assembly | Assembly size (Mbp) | Ns (Mbp) | Non-Ns (Mbp) | No. sequences | N50 (Kbp) | Longest seq. (Kbp) |
|---|---|---|---|---|---|---|
| HumY | 57.2 | 30.4 | 26.8 | 1 | 57,227 | |
| GorY, contigs | 23.0 | 0 | 23.0 | 3001 | 18 | 143 |
| GorY, scaffolds | 25.4 | 2.4 | 23.0 | 697 | 98 | 486 |
| GorY-WGA, contigs | 26.5 | 0 | 26.5 | 1128 | 32 | 256 |
| GorY-Clean, contigs | 24.3 | 0 | 24.3 | 1062 | 42 | 494 |
Assembly statistics for the published human and gorilla Y chromosome assemblies and the new assemblies
Read mapping statistics on the human and different GorY assemblies
| Assembly | Length (Mbp) | HiSeq genome coverage (Mbp) | HiSeq genome coverage (%) | HiSeq read cov. | GorY-WGA genome coverage (Mbp) | GorY-WGA genome coverage (%) | GorY-WGA read cov. | GorY-Clean genome coverage (Mbp) | GorY-Clean genome coverage (%) | GorY-Clean read cov. |
|---|---|---|---|---|---|---|---|---|---|---|
| HumY | 26.8 | 22.3 | 83 | 1897 | 18.2 | 68 | 58.87 | 18.5 | 69 | 74.84 |
| GorY | 23.0 | 18.3 | 80 | 1169 | 21.1 | 92 | 71.33 | 21.1 | 92 | 99.15 |
| GorY-WGA | 26.5 | 24.9 | 94 | 1353 | 26.5 | 100 | 73.15 | 26.5 | 100 | 97.08 |
| GorY-Clean | 24.3 | 22.4 | 92 | 1586 | 24.3 | 100 | 83.21 | 24.3 | 100 | 109.67 |
Mapping of HiSeq, PacBio WGA and PacBio cleaned reads on the human Y chromosome (HumY), the gorilla Y chromosome (GorY) and the newly created gorilla Y assemblies (GorY-WGA, GorY-Clean). The read coverage is the average number of reads that a nucleotide has aligned to it
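The two coverage measures in the table can be illustrated with a small sketch that works on alignment intervals. It assumes half-open `(start, end)` intervals on a single reference sequence and averages read depth over all reference positions, including uncovered ones; whether the published figures average over all positions or only covered ones is not stated here, so treat that choice as an assumption. The function name `coverage_stats` is hypothetical.

```python
def coverage_stats(ref_length, alignments):
    """Compute genome coverage and mean read depth from alignment intervals.

    alignments: iterable of half-open (start, end) positions on a reference
    of ref_length bases. Returns (covered_bases, covered_percent, mean_depth).
    """
    # Difference array: +1 where an alignment starts, -1 where it ends.
    delta = [0] * (ref_length + 1)
    for start, end in alignments:
        delta[max(start, 0)] += 1
        delta[min(end, ref_length)] -= 1

    covered = 0
    total_depth = 0
    depth = 0
    for pos in range(ref_length):
        depth += delta[pos]
        total_depth += depth
        if depth > 0:
            covered += 1
    return covered, 100.0 * covered / ref_length, total_depth / ref_length


# Toy example: two overlapping alignments on a 10-base reference.
covered, percent, mean_depth = coverage_stats(10, [(0, 5), (3, 8)])
```

In this toy case eight of the ten bases are covered by at least one read, and the ten aligned bases spread over ten reference positions give a mean depth of 1.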
Repeat content
| Assembly | Overall repeat content (% of assembly) |
|---|---|
| Canu Ath-Ctrl (C1) | 15.73 |
| Canu Ath-WGA (C2) | 15.49 |
| Canu Ath-Clean (C3) | 16.76 |
| TAIR10 | 16.88 |
Repeat content found by RepeatMasker in the different A. thaliana assemblies
Fig. 4 Illumina read depth of known palindromes. Illumina read depth of the known palindrome sequences P1-P8 and the X-degenerated gene (XDG) region in the GorY assembly (a) and GorY-Clean (b). Overall read depth is decreased in GorY-Clean; however, in both assemblies the median read depths for P1-P7 are twice that of XDG