Ergude Bao, Tao Jiang, Thomas Girke.
Abstract
MOTIVATION: De novo assemblies of genomes remain one of the most challenging applications in next-generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species.
Year: 2014 PMID: 24932000 PMCID: PMC4058956 DOI: 10.1093/bioinformatics/btu291
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the AlignGraph algorithm. The outline on the top (A) shows AlignGraph in the context of common genome assembly workflows, and the one on the bottom (B) illustrates its three main processing steps. (A) In Step 1, the PE reads from a target genome are assembled by a de novo assembler into contigs (here and c3). Subsequently (Step 2), the contigs can be extended (blue) and joined by AlignGraph (e1 and e2). (B) The workflow of AlignGraph consists of three main steps. (i) The PE reads are aligned to the reference genome and to the contigs, and the contigs are also aligned to the reference genome. (ii) The PE multipositional de Bruijn graph is built from the alignment results, where the red and blue subpaths correspond to the aligned contigs and sequences from PE reads, respectively. (iii) The extended and/or joined contigs (here e1 and e2) are generated by traversing the graph.
Problems the PE multipositional de Bruijn graph solves in comparison with the conventional de Bruijn graph
| Problem | Consequence | Solution |
|---|---|---|
| Repeat sequences | Branched paths | Distinguishes paths for repetitive regions by incorporating PE read and alignment position information |
| Sequencing errors | False-positive paths | Corrects paths from erroneous reads with correct reads aligned to the same position |
| Low sequencing depth | Incomplete paths | Builds paths from reads in low coverage areas supported by reference |
Fig. 2. Advantages of the PE multipositional de Bruijn graph compared with the positional de Bruijn graph. In the target genome given on the top, A and and and and are repetitive regions. Each PE read of length 2 × 4 bp is sequenced with one pair from region and the other from the corresponding position of region (the pair from is omitted for simplicity). In comparison with the target genome, the reference genome has a repeat-free region ABC similar to and a region similar to . The reads from region are assembled with a de novo assembler into a contig starting from , but regions A and B are not assembled because of low sequencing depth, repeats or other problems. When aligning the contig to the reference genome, the repetitive regions C and are both aligned to C in the reference genome, and the insertion D is assigned to the end of the reference. In (A) reads are aligned directly to the reference genome to build the initial positional de Bruijn graph, and in (B–D) the reads are aligned to the preassembled contigs and then aligned to the reference to build first the extended positional de Bruijn graph and then the PE multipositional de Bruijn graph. (A) The initial positional de Bruijn graph is built here with 3-mers. Some reads cannot be aligned to the reference genome because of sequence differences in the target genome, as indicated here by 3-mers with −1 as alignment position. The repetitive regions A and (or C and ) are collapsed into one path in red in the graph. (B) The initial positional de Bruijn graph is constructed with help from the read-to-contig alignment information. The read-to-reference genome alignment information yields a more complete positional de Bruijn graph, but the repetitive regions A and (or C and ) are still collapsed, resulting in branch points. (C) An extended positional de Bruijn graph is built by incorporating into each 3-mer the read alignment position to the contig.
As a result of this operation, the repetitive regions C and can be distinguished into two paths, where the 3-mers have different alignment positions in the contig, but A and are still collapsed. (D) The PE multipositional de Bruijn graph is constructed by incorporating into each 3-mer its PE read alignment position to the reference genome (the right three bases and their alignment position to the contig are omitted here). With this information the repeats A and can be distinguished into two paths, as the 3-mers have different PE alignment positions in the reference genome. The final graph contains only a single path, which allows an extended contig to be output that corresponds to the region in the target genome.
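The effect of attaching alignment positions to k-mers can be illustrated with a toy sketch. This is not AlignGraph's implementation: the sequence, the helper names and the use of exact, error-free positions are assumptions made for illustration, and the PE and contig position information described in the caption is left out. The sketch only shows why a repeat creates a branch point when nodes are bare 3-mers and why the branch disappears when each node also carries a position.

```python
from collections import defaultdict

def positional_kmers(seq, k):
    """Return (k-mer, start position) pairs for every k-mer in seq."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def build_graph(nodes):
    """Link each node to the node that follows it in the sequence."""
    graph = defaultdict(set)
    for current, following in zip(nodes, nodes[1:]):
        graph[current].add(following)
    return graph

def branch_points(graph):
    """Nodes with more than one successor, i.e. ambiguous continuations."""
    return {node for node, successors in graph.items() if len(successors) > 1}

# Hypothetical toy target: "ACGT" occurs twice and acts as a repeat.
target = "ACGTTGACGTCA"
k = 3

with_positions = positional_kmers(target, k)
without_positions = [kmer for kmer, _ in with_positions]

conventional = build_graph(without_positions)  # nodes are bare 3-mers
positional = build_graph(with_positions)       # nodes are (3-mer, position)

print("conventional branch points:", branch_points(conventional))
print("positional branch points:", branch_points(positional))
```

In the conventional graph the repeated 3-mer CGT has two successors (GTT and GTC), whereas every (3-mer, position) node has at most one, so the positional graph is a single path.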
Performance evaluation of AlignGraph
| Upstream assembler | Contig set | N contigs⁴ | N50⁵ | N covered bases⁶ | Average length⁷ | Maximum length⁸ | MPMB⁹ | Average identity¹⁰ (%) |
|---|---|---|---|---|---|---|---|---|
| (a) Contigs of A. thaliana | | | | | | | | |
| Velvet | All¹ | 30 037 | 3515 | 82 844 417 | 2668 | 27 792 | 22.2 | 95.2 |
| | Extendable² | 8615 | 4148 | 28 007 451 | 3262 | 27 398 | 0.3 | 97.6 |
| | Extendable + AlignGraph³ | 5751 | 7876 | 32 467 110 | 5521 | 49 768 | 1.6 | 94.8 |
| ABySS | All | 30 972 | 2559 | 69 432 667 | 2206 | 29 760 | 13.4 | 97.2 |
| | Extendable | 11 693 | 2820 | 28 885 212 | 2454 | 16 343 | 0.5 | 98.7 |
| | Extendable + AlignGraph | 8427 | 5484 | 35 859 786 | 4151 | 25 321 | 1.1 | 95.8 |
| (b) Contigs of human chromosome 14 | | | | | | | | |
| ALLPATHS-LG | All | 4383 | 38 590 | 83 849 397 | 19 201 | 240 764 | 0.3 | 98.9 |
| | Extendable | 1674 | 39 851 | 35 746 095 | 20 806 | 200 495 | 0.1 | 98.9 |
| | Extendable + AlignGraph | 785 | 71 847 | 36 441 001 | 45 358 | 305 880 | 0.0 | 97.5 |
| ALLPATHS-LGc | All | 3856 | 43 856 | 83 860 939 | 21 818 | 275 446 | 0.2 | 99.3 |
| | Extendable | 1296 | 45 719 | 31 457 201 | 24 346 | 275 446 | 0.1 | 99.5 |
| | Extendable + AlignGraph | 608 | 86 613 | 34 614 465 | 54 406 | 294 615 | 0.0 | 96.9 |
| SOAPdenovo | All | 10 865 | 16 855 | 80 135 941 | 7623 | 147 494 | 5.9 | 94.9 |
| | Extendable | 5613 | 17 412 | 45 246 077 | 8223 | 141 981 | 0.9 | 96.4 |
| | Extendable + AlignGraph | 3469 | 32 881 | 52 861 640 | 15 271 | 219 841 | 0.5 | 95.0 |
| MaSuRCA | All | 19 034 | 5767 | 75 497 302 | 3802 | 53 837 | 13.9 | 98.9 |
| | Extendable | 9241 | 6047 | 38 842 517 | 4199 | 51 249 | 0.2 | 99.2 |
| | Extendable + AlignGraph | 5665 | 11 590 | 43 930 184 | 7666 | 66 758 | 0.4 | 98.1 |
| CABOG | All | 3118 | 46 523 | 84 989 190 | 27 401 | 296 888 | 0.3 | 97.3 |
| | Extendable | 1692 | 45 669 | 46 499 763 | 27 089 | 296 888 | 0.0 | 98.7 |
| | Extendable + AlignGraph | 701 | 101 907 | 50 527 605 | 70 362 | 443 952 | 0.1 | 97.6 |
| Bambus2 | All | 11 219 | 8378 | 64 011 072 | 5764 | 449 449 | 3.1 | 89.9 |
| | Extendable | 6995 | 7521 | 37 857 989 | 5439 | 62 798 | 0.3 | 97.6 |
| | Extendable + AlignGraph | 2722 | 19 989 | 39 147 357 | 14 176 | 86 154 | 0.5 | 96.5 |
| (c) Scaffolds of human chromosome 14 | | | | | | | | |
| SOAPdenovo | All | 3902 | 391 693 | 85 417 248 | 24 397 | 1 852 152 | 1.0 | 82.9 |
| | Extendable | 901 | 387 309 | 40 296 035 | 47 526 | 1 019 659 | 0.1 | 84.5 |
| | Extendable + AlignGraph | 767 | 544 209 | 47 823 279 | 63 525 | 2 246 638 | 0.1 | 81.0 |
| MaSuRCA | All | 721 | 580 822 | 65 433 305 | 63 876 | 2 943 966 | 1.3 | 57.2 |
| | Extendable | 101 | 289 703 | 5 554 781 | 52 820 | 1 516 804 | 0.0 | 81.9 |
| | Extendable + AlignGraph | 78 | 316 946 | 6 986 224 | 86 552 | 1 573 741 | 0.0 | 83.4 |
| CABOG | All | 471 | 387 876 | 81 163 688 | 176 590 | 1 944 475 | 0.1 | 91.9 |
| | Extendable | 146 | 358 688 | 29 372 033 | 200 539 | 1 905 529 | 0.0 | 98.2 |
| | Extendable + AlignGraph | 67 | 906 407 | 33 708 925 | 481 712 | 2 051 503 | 0.0 | 94.1 |
| Bambus2 | All | 569 | 319 334 | 64 378 693 | 116 582 | 1 477 847 | 0.1 | 77.4 |
| | Extendable | 66 | 272 436 | 6 949 338 | 119 858 | 641 463 | 0.0 | 92.0 |
| | Extendable + AlignGraph | 80 | 377 905 | 8 963 132 | 114 852 | 812 353 | 0.1 | 85.4 |
(a) Genomic PE reads from A. thaliana were assembled with Velvet and ABySS. The resulting contigs were extended with AlignGraph using the genome sequence of A. lyrata as reference. (b–c) The subsequent panels contain assembly results for the human chromosome 14 sample from the GAGE project, where the chimpanzee genome served as reference. (b) Contig assembly results are given for the de novo assemblers ALLPATHS-LG, ALLPATHS-LGc (in cheat mode), SOAPdenovo, MaSuRCA, CABOG and Bambus2. (c) Scaffolded assembly results are given for SOAPdenovo, MaSuRCA, CABOG and Bambus2. The results are organized row-wise as follows: the initial contigs obtained by each de novo assembler¹, the ‘extendable’ subset of the initial contigs that AlignGraph was able to improve², and the extension results obtained with AlignGraph³. The additional columns give the number of contigs⁴, the N50 values⁵, the number of covered bases⁶, the average⁷ and maximum⁸ length of the contigs, the number of misassemblies per million base pairs (MPMB)⁹ and the average identity between the true contigs and the target genome¹⁰. More details on these performance criteria are provided in Section 3.1.5.
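For readers unfamiliar with N50, the following minimal sketch computes it under the commonly used definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. Whether Section 3.1.5 of the paper uses exactly this variant is an assumption, and the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (common definition)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths, not taken from the paper.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Joining contigs raises this value because the same total length is distributed over fewer, longer sequences, which is the pattern seen in the "Extendable + AlignGraph" rows.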
Performance with reference genomes of variable similarity
| Percentage of | Chimpanzee | Gorilla | Orangutan | Gibbon | Macaque |
|---|---|---|---|---|---|
| Aligned readsᵃ | 94.5% | 91.6% | 88.9% | 49.9% | 24.9% |
| Extendable contigsᵇ | 51.0% | 36.4% | 24.9% | 6.7% | — |
| Improved N50ᶜ | 109.9% | 84.0% | 73.2% | 65.3% | — |
The tests were performed on the human chromosome 14 sample, where the listed primate genomes served as reference. The results include the percentage values of ᵃalignable reads, ᵇextendable contigs relative to the initial set and ᶜimprovements of the N50 values relative to the extendable contigs. Because of space limitations, the latter two rows contain averaged percentage values for the five assemblers ALLPATHS-LG, SOAPdenovo, MaSuRCA, CABOG and Bambus2.
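The averaged values in the chimpanzee column can be cross-checked against panel (b) of the performance evaluation table. The sketch below assumes that "averaged" means the unweighted mean of the per-assembler percentages and that the N50 improvement is the relative gain over the extendable contigs; both are assumptions about the authors' procedure, and the variable names are illustrative. The numbers are copied from the contig rows for the five assemblers named above.

```python
# Per assembler: (N contigs All, N contigs Extendable,
#                 N50 Extendable, N50 Extendable + AlignGraph),
# copied from panel (b) of the performance evaluation table.
panel_b = {
    "ALLPATHS-LG": (4383, 1674, 39851, 71847),
    "SOAPdenovo": (10865, 5613, 17412, 32881),
    "MaSuRCA": (19034, 9241, 6047, 11590),
    "CABOG": (3118, 1692, 45669, 101907),
    "Bambus2": (11219, 6995, 7521, 19989),
}

def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Share of initial contigs that AlignGraph could extend, per assembler.
extendable_pct = mean(
    100 * extendable / total for total, extendable, _, _ in panel_b.values()
)

# Relative N50 gain of the extended contigs over the extendable subset.
n50_gain_pct = mean(
    100 * (after - before) / before for _, _, before, after in panel_b.values()
)

print(f"Extendable contigs: {extendable_pct:.1f}%")
print(f"Improved N50: {n50_gain_pct:.1f}%")
```

Under these assumptions the two means come out at about 51.0% and 109.9%, which agrees with the chimpanzee column.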
Improvements to published genome
| Contig set | N contigs | N50 | N total basesᵃ | Average length | Maximum length |
|---|---|---|---|---|---|
| All | 1676 | 341 653 | 112 578 343 | 67 170 | 2 930 180 |
| Extendable | 462 | 448 682 | 57 574 961 | 124 621 | 2 930 180 |
| Extendable + AlignGraph | 368 | 837 458 | 62 216 675 | 169 067 | 3 168 537 |
The published scaffolds from Landsberg erecta were extended with AlignGraph using the A. thaliana genome as reference. The rows and columns are arranged the same way as in Table 2, but several columns are missing here because it is not possible to compute the corresponding performance measures in a meaningful manner without having access to a ‘true’ target genome sequence. ᵃIn addition, we report here the total number of bases in the contigs.
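The columns of this table are internally consistent, which a few lines of arithmetic can confirm. The sketch assumes that the average length is the total number of bases divided by the number of contigs and rounded down, and that the N50 gain is measured relative to the extendable set; both are assumptions, and the variable names are illustrative.

```python
# Per contig set: (N contigs, N50, N total bases, reported average length),
# copied from the table above.
rows = {
    "All": (1676, 341653, 112578343, 67170),
    "Extendable": (462, 448682, 57574961, 124621),
    "Extendable + AlignGraph": (368, 837458, 62216675, 169067),
}

# Recompute the average length with integer (floor) division.
for name, (n_contigs, _, total_bases, reported_avg) in rows.items():
    recomputed = total_bases // n_contigs
    print(f"{name}: recomputed {recomputed}, reported {reported_avg}")

# Relative N50 gain of the extended scaffolds over the extendable subset.
n50_before = rows["Extendable"][1]
n50_after = rows["Extendable + AlignGraph"][1]
n50_gain_pct = 100 * (n50_after - n50_before) / n50_before
print(f"N50 gain: {n50_gain_pct:.1f}%")
```

With these inputs the recomputed averages equal the reported ones, and the N50 of the extendable scaffolds grows by roughly 87% after extension.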