| Literature DB >> 27586129 |
Binghang Liu1, Chi-Man Liu1, Dinghua Li1, Yingrui Li1, Hing-Fung Ting1, Siu-Ming Yiu1, Ruibang Luo2, Tak-Wah Lam3.
Abstract
BACKGROUND: De novo genome assembly using NGS data remains a computation-intensive task especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads.Entities:
Mesh:
Year: 2016 PMID: 27586129 PMCID: PMC5009518 DOI: 10.1186/s12864-016-2829-5
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1Overview of the whole assembly method. There are five steps for one direction extension. Firstly, we choose an initial read by order and find an initial seed in this read. Then we use bi-directional BWT to get the SA ranges of this seed using backward exact matching. Thirdly, we build up a backward extension tree by adding bases to continue the backward matching. After removing erroneous branches and heterozygosis branches, we obtain the consensus sequence of the extended region. Finally, we continue to find a new seed in the extended region and extend iteratively
Fig. 2Remove branches in backward extension tree. In the backward extension tree, we try to remove erroneous branches, repetitive branches and heterozygosis branches to obtain the consensus sequences of the extended region. As an example, we meet node v with two child node a and b. Firstly, combined with L(v), we obtained TL(v) for a and GL(v) for b to detect erroneous branches between a and b. We incrementally calculate the depth of sub-sequences of a(sub-a with length i): T, TA, TAT, …, and b(sub-b with length i): G, GA, GAT, … until the depth of sub-a is less than user-defined threshold τ. At the same time, if Dep(sub-a ) is significantly smaller than Dep(sub-a ), Dep(sub-a ) is significantly smaller than di and Dep(sub-b ) is significantly larger than Dep(sub-a ), then branch a will be treated as a erroneous branch or repetitive branch. When there is no erroneous signal, we will further try to remove the branch, which might be caused by heterozygosis. After obtaining two sequences representing the consensus sequences of the sub-trees rooted at a and b respectively, we compare the two sequences to find the matched region and get the depth of it. Then we use this depth to calculate base depth and compare to the base depth calculated by depth of initial seed. If the two sequences have high similarity and the two base depths are similar to each other, we will treat a as heterozygous branch if W(a) is smaller than W(b)
Contig assembly of deeply sequenced bacterial genomes
| Tools | Parameters | Correct N50 | Misatch/Indel | Aligned rate | Coverage | Time(sec) | |
|---|---|---|---|---|---|---|---|
|
| SPAdes | 51,63,85 | 299,305 | 134/6 | 99.79 % | 100.00 % | 1239 |
| SOAPdenovo2 | 87-95 | 82,495 | 0/0 | 99.84 % | 99.27 % | 25;16 | |
| SGA | 29;91 | 74,584 | 7/0 | 99.81 % | 99.98 % | 1228;1149 | |
| BASE | 4 | 92,706 | 0/0 | 100.00 % | 99.97 % | 161; 93 | |
|
| SPAdes | 33,55,65,75,85,99 | 169,978 | 118/45 | 99.97 % | 99.97 % | 4616 |
| SOAPdenovo2 | 125 | 88,858 | 23/30 | 99.98 % | 99.98 % | 110;1 | |
| SGA | 29;149 | 95,711 | 58/26 | 99.80 % | 99.97 % | 2478;2884 | |
| BASE | 4 | 159,715 | 29/29 | 100.00 % | 99.75 % | 676; 388 |
S.aureus MW2 has its real reference with length 2.8 Mb and V.para has its species’ reference with length 5.1 Mb and two chromosomes. Both of these two bacteria are sequenced up to 240X. GAGE validation pipeline was used to calculate the corrected contig N50, base errors, structural errors, contig aligned rate and reference coverage. Except BASE used single thread for contig assembly part, and other the assemblies were all performed with 24 threads. The time before semicolon is for index building and after semicolon is for assembly. For SGA, indexing time contains the time used in the indexing after error correction and filtering; assembly time contains the time used in the overlap and assembly
Performance for human genome assembly
| SOAPdenovo2 | BASE | ||||||
|---|---|---|---|---|---|---|---|
| Wall time (h) | CPU time (h) | Max memory (GB) | Wall time (h) | CPU time (h) | Max memory (GB) | ||
| YH, 100 bp, 36X | Index | 4 | 46 | 163 | 5 | 18 | 200 |
| Contig | 2 | 2 | 41 | 4 | 53 | 140 | |
| Total | 6 | 48 | 163 | 9 | 71 | 200 | |
| YH, 150 bp, 64X | Index | 6 | 75 | 201 | 11 | 36 | 192 |
| Contig | 1 | 1 | 33 | 5 | 80 | 225 | |
| Total | 7 | 76 | 201 | 16 | 116 | 225 | |
| NA12878D, 150 bp | Index | 9 | 141 | 477 | 9 | 34 | 194 |
| Contig | 1 | 1 | 24 | 7 | 144 | 142 | |
| Total | 10 | 141 | 477 | 16 | 178 | 194 | |
For X Ten data, we used a different machine with larger memory to finish SOAPdenovo2 and BASE assembly, so it is improper to compare the time usage of this dataset to other dataset. Other dataset are all performed in the same machine with 24 CPU
Summary of human contig assembly
| YH, 100 bp | YH, 150 bp | NA12878, 150 bp | NA12878, 250 bp | |||||
|---|---|---|---|---|---|---|---|---|
| SOAPdenovo2,k = 41 | BASE | SOAPdenovo2,k = 61 | BASE | SOAPdenovo2,k = 41 | BASE | SOAPdenovo2,k = 61 | BASE | |
| Contig num | 3,420,897 | 3,319,617 | 2,279,026 | 2,145,792 | 8,068,278 | 1,934,261 | 1,416,658 | 1,511,270 |
| Contig size | 2.67E + 09 | 2.88E + 09 | 2.76E + 09 | 2.95E + 09 | 2.44E + 09 | 2.90E + 09 | 2.60E + 09 | 2.94E + 09 |
| Contig N50 | 2,244 | 2,279 | 3,008 | 3,126 | 1,140 | 3,823 | 3,368 | 4,199 |
| Contig aligned rate | 99.10 % | 97.07 % | 98.87 % | 95.96 % | 99.40 % | 97.62 % | 99.34 % | 96.33 % |
| Genome coverage | 90.36 % | 93.76 % | 93.12 % | 93.90 % | 84.11 % | 95.58 % | 89.55 % | 94.09 % |
| RepeatMasked coverage | 97.05 % | 96.13 % | 97.28 % | 95.32 % | 93.94 % | 97.38 % | 95.60 % | 95.99 % |
| Exon coverage | 93.76 % | 91.51 % | 95.73 % | 94.13 % | 91.48 % | 96.84 % | 93.90 % | 91.49 % |
| Mismatch base | 2,735,141 | 3,479,046 | 2,911,990 | 3,839,110 | 2,301,111 | 3,459,648 | 2,544,785 | 3,751,887 |
| Mismatch ratio | 0.103 % | 0.121 % | 0.105 % | 0.130 % | 0.094 % | 0.119 % | 0.098 % | 0.128 % |
| Indel num | 340,930 | 327,469 | 358,358 | 334,989 | 259,190 | 322,214 | 327,695 | 372,941 |
| Indel base | 1,412,005 | 1,587,265 | 1,692,213 | 1,741,947 | 1,086,014 | 1,602,240 | 1,400,230 | 1,953,311 |
| Indel ratio | 0.053 % | 0.057 % | 0.062 % | 0.061 % | 0.045 % | 0.057 % | 0.054 % | 0.069 % |
We mapped the raw contigs to Hg19. Aligned rate is the contig-aligned length divided by total contig length. To calculate genome coverage, the length of gap regions in Hg19 has been removed. For unique coverage, the repetitive regions have been further removed. For SOAPdenovo2 contig assembly, we all used single-kmer method and M1 to treat heterozygous regions
Fig. 317mer depth distribution of three human sequencing dataset. We count the depth of all 17mers in the sequencing reads, and calculate the frequency of each depth. About 35 % 17mers of YH 100 bp reads, 30 % 17mers of YH 150 bp reads and 53 % 17mers of NA12878D XTen reads having depth no more than 3. Then we say the NA12878D XTen reads should have more sequencing errors left after raw data filter and correction than YH dataset
Mismatches analysis for human genome assembly
| Dataset | Assembler | Whole genome mismatches | Exon region mismatches | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Total | Variant Call | Public SNP | Novel | Total | Variant Call | Public SNP | Novel | ||
| YH_100bp | SOAPdenovo2 | 2,735,141 | 2,423,482 | 108,044 | 203,615 | 47,926 | 40,701 | 1,590 | 5,635 |
| BASE | 3,479,046 | 2,872,396 | 208,351 | 398,299 | 47,515 | 42,124 | 1,724 | 3,667 | |
| YH_150bp | SOAPdenovo2 | 2,911,990 | 2,613,670 | 113,037 | 185,283 | 46,002 | 43,151 | 1,148 | 1,703 |
| BASE | 3,839,110 | 3,075,913 | 256,217 | 506,980 | 52,561 | 46,420 | 1,822 | 4,319 | |
| NA12878_150bp | SOAPdenovo2 | 2,301,111 | 2,025,220 | 109,497 | 166,394 | 39,702 | 35,644 | 1,740 | 2,318 |
| BASE | 3,459,648 | 3,052,269 | 129,933 | 277,446 | 49,711 | 45,361 | 1,151 | 3,199 | |
| NA12878_250bp | SOAPdenovo2 | 2,544,785 | 2,122,144 | 130,806 | 291,835 | 42,744 | 35,890 | 1,936 | 4,918 |
| BASE | 3,751,887 | 2,613,065 | 604,805 | 534,017 | 48,635 | 37,853 | 5,068 | 5,714 | |
We mapped the assembled contigs to Hg19 and got the mismatches between each contig and reference sequence. Then we used the detected SNPs and SNPs from published SNP databases to analysis these mismatches in whole genome and exon regions respecitively
Acceleration performance of GPU on BWT construction
| Read_num | Read_length | With GPU | Without GPU | Time fold | |
|---|---|---|---|---|---|
|
| 6,720,000 | 100 | 133 s | 184 s | 1.38 |
|
| 4,896,000 | 250 | 329 s | 514 s | 1.56 |
| YH, 100 bp | 1,057,750,382 | 100 | 5 h | 19 h | 3.80 |
| NA12878, 150 bp | 770,960,980 | 150 | 7 h | 30 h | 4.29 |
To evaluate the acceleration performance of GPU on BWT construction, we used two versions, one of which used GPU and the other not, to build bi-directional BWT on four datasets. The results showed that especially in large genome dataset, compared with GPU version, version without GPU costs ~4 fold more time to construct BWT