Antoine Limasset1, Bastien Cazaux2,3, Eric Rivals2,3, Pierre Peterlongo4.
Abstract
BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Practical tools for mapping reads on such graphs are currently lacking.
Keywords: Assembly; De Bruijn graph; Genomics; Hamiltonian path; NGS; NP-complete; Read mapping; Sequence graph; path
Year: 2016 PMID: 27306641 PMCID: PMC4910249 DOI: 10.1186/s12859-016-1103-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Illustration of the gadget used in the proof of Theorem 1. Encoding a directed graph into a DBG of order 2. The directed graph G (top) admits the same words as the 2-DBG G′ (bottom), if we ignore the numbers
Fig. 2 A toy example of a DBG of order k with k=4 (top) and its compacted version (bottom)
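The compaction step illustrated in Fig. 2 can be sketched in a few lines. This is a minimal illustration, not the construction used in the paper (which relies on BCALM): it assumes a DBG whose nodes are the given k-mers, with an edge whenever two k-mers overlap by k-1 characters, and it ignores reverse complements. The function names `build_dbg` and `compact` and the toy sequences are hypothetical.

```python
def build_dbg(kmers):
    """Successor and predecessor maps of the DBG whose nodes are `kmers`."""
    kmers = set(kmers)
    succ = {u: [] for u in kmers}
    pred = {u: [] for u in kmers}
    for u in kmers:
        for c in "ACGT":
            v = u[1:] + c          # v overlaps u by k-1 characters
            if v in kmers:
                succ[u].append(v)
                pred[v].append(u)
    return succ, pred


def compact(kmers):
    """Merge every maximal non-branching path into one unitig string."""
    succ, pred = build_dbg(kmers)

    def starts_unitig(u):
        # u can be glued to its predecessor only if that link is unambiguous
        return not (len(pred[u]) == 1 and len(succ[pred[u][0]]) == 1)

    def walk(u, seen):
        seq, cur = u, u
        seen.add(u)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            seq += nxt[-1]         # extend the unitig by one character
            seen.add(nxt)
            cur = nxt
        return seq

    seen, unitigs = set(), []
    for u in sorted(succ):
        if starts_unitig(u):
            unitigs.append(walk(u, seen))
    for u in sorted(succ):         # leftovers belong to isolated cycles
        if u not in seen:
            unitigs.append(walk(u, seen))
    return unitigs


def kmers_of(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Toy graph of order k=4: two sequences that share a prefix, then diverge.
toy_kmers = kmers_of("TTACCGGA", 4) + kmers_of("TTACCTTG", 4)
toy_unitigs = compact(toy_kmers)
```

On this toy input the eight 4-mers collapse into three unitigs, and consecutive unitigs still overlap by k-1 = 3 characters, which is the property the mapping procedure relies on.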
Fig. 3 Unitig construction, as used in the proposed experiments (upper part of the figure), and GGMAP pipeline. Reads to be mapped can be distinct from reads used for building the graph. Long unitigs are unitigs longer than the reads. The tools BCALM and BOWTIE2 are described in [7] and [21], respectively
Time and memory footprints of BGREAT and BOWTIE2
| CDBG Id | Mapped set (nb reads) | BGREAT wall clock time | BGREAT CPU time | BGREAT memory | BOWTIE2 wall clock time | BOWTIE2 CPU time | BOWTIE2 memory |
|---|---|---|---|---|---|---|---|
| | SRR959239 (5,128,790) | 28 s | 1m40 | 19 MB | 1m17 | 3m53 | 29 MB |
| | SRR065390 (67,155,743) | 19m21 | 72m31 | 975 MB | 8m12 | 33m | 1.66 GB |
| | ′′ | 13m03 | 51m28 | 336 MB | 17m49 | 72m31 | 493 MB |
| | SRR1522085 (22,509,110) | 1m54 | 7m13 | 336 MB | 3m29 | 14m12 | 493 MB |
| Human | SRR345593, SRR345594 (2,967,536,821) | 4h30 | 87 h | 9.7 GB | 4h38 | 90h15 | 21 GB |
Indicated wall clock times use four cores, except for the human samples for which 20 cores were used
Fig. 4 Representation of the mapping of a read (top sequence) on a CDBG, whose nodes are represented on lines 2, 3, and 4. (step 1) the overlaps of the graph that are also present in the read are found (here TACAC, GCTGC, and AGCTA, represented on line 1). (step 2) unitigs that map the beginning and the end of the read are found (those represented on line 2). (step 3) cover the rest of the read, guided by the overlaps (here with unitigs represented on lines 3 and 4)
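The three steps of Fig. 4 can be illustrated with a simplified sketch. It is not BGREAT's implementation: it assumes exact matching only (no mismatches), ignores reverse complements, and assumes every unitig is at least k characters long and overlaps its neighbours by k-1 characters. The function name `map_read` and the toy unitigs are hypothetical.

```python
from collections import defaultdict


def map_read(read, unitigs, k):
    """Return a list of unitig indices spelling `read`, or None if unmapped."""
    ov = k - 1
    starts_with = defaultdict(list)   # (k-1)-prefix -> unitigs starting with it
    ends_with = defaultdict(list)     # (k-1)-suffix -> unitigs ending with it
    for i, u in enumerate(unitigs):
        starts_with[u[:ov]].append(i)
        ends_with[u[-ov:]].append(i)

    # Easy case: the read lies entirely inside one unitig.
    for i, u in enumerate(unitigs):
        if read in u:
            return [i]

    # Step 1: positions of the read where a graph overlap occurs.
    anchors = [p for p in range(len(read) - ov + 1)
               if read[p:p + ov] in starts_with]

    def cover(p, path):
        # Step 3: starting at the overlap found at position p, pick a unitig
        # that agrees with the read, then jump to the next overlap. The last
        # unitig chosen is the one that covers the end of the read.
        for i in starts_with.get(read[p:p + ov], ()):
            u = unitigs[i]
            n = min(len(u), len(read) - p)
            if u[:n] != read[p:p + n]:
                continue
            if p + len(u) >= len(read):
                return path + [i]
            found = cover(p + len(u) - ov, path + [i])
            if found:
                return found
        return None

    for p in anchors:
        if p == 0:                    # the read starts exactly on an overlap
            found = cover(0, [])
            if found:
                return found
            continue
        # Step 2: a unitig whose suffix spells the beginning of the read.
        head = read[:p + ov]
        for i in ends_with.get(read[p:p + ov], ()):
            if unitigs[i].endswith(head):
                found = cover(p, [i])
                if found:
                    return found
    return None


# Toy CDBG of order k=4 (unitigs overlap by 3 characters).
toy_unitigs = ["TTACC", "ACCGGA", "ACCTTG"]
path_a = map_read("TACCGG", toy_unitigs, 4)
path_b = map_read("TTACCTTG", toy_unitigs, 4)
path_c = map_read("TACCAA", toy_unitigs, 4)
```

Here the overlap ACC is the only anchor in each read. The first two reads are spelled by a path of two unitigs, while the third diverges from both branches after the overlap and is reported as unmapped.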
CDBG used in this study
| CDBG Id | Reads Id | k | c | Number of unitigs | Mean length of unitigs |
|---|---|---|---|---|---|
| | SRR959239 | 31 | 3 | 42,843 | 134 |
| C.elegans_norm | SRR065390 | 31 | 3 | 1,627,335 | 93 |
| C.elegans_cpx | SRR065390 | 21 | 2 | 8,273,338 | 34 |
| Human | SRR345593, SRR345594 | 31 | 10 | 69,932,343 | 70 |
C.elegans_cpx and C.elegans_norm are two distinct graphs, constructed from the same C. elegans read set. The suffixes norm and cpx stand for “normal” (using c=3 and k=31) and “complex” (using a low threshold c=2 and a small value k=21), respectively
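To see why lowering c and k yields a larger, more complex graph, the following sketch filters k-mers by abundance. It assumes that the threshold c is a minimum k-mer count (a k-mer is kept only if it occurs at least c times in the reads); the text above only calls c a threshold, so this reading is an assumption. The function name `solid_kmers` and the toy reads are hypothetical.

```python
from collections import Counter


def solid_kmers(reads, k, c):
    """Keep the k-mers that occur at least c times across all reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= c}


# Toy read set: the third read carries a variant seen only once.
toy_reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
kept_c2 = solid_kmers(toy_reads, k=4, c=2)
kept_c3 = solid_kmers(toy_reads, k=4, c=3)
```

With c=2 three k-mers survive, with c=3 only one does. A lower threshold keeps more rare k-mers, hence more nodes and more branching in the resulting graph.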
Percentage of mapped reads, either mapping on contigs (here obtained thanks to the Minia assembler) or mapping on CDBG with GGMAP
| Set | % mapped on contigs | % mapped on CDBG |
|---|---|---|
| | 95.57 | 97.16 |
| | 80.60 | 93.24 |
| | 56.33 | 89.15 |
| Human | 63.16 | 85.70 |
Fig. 5 GGMAP mapping results for the different read sets. In the “C.Elegans_norm (SRR1522085)” case, reads from SRR1522085 are mapped on the CDBG obtained using reads from read set SRR065390. For all other results, the same read set was used both for constructing the CDBG and during the mapping
GGMAP mapping results on simulated reads from the reference of the human chromosome 1 with default parameters (columns give the percentage of reads at each distance to optimum)
| % errors in simulated reads | Distance 0 | Distance 1 | Distance 2 | Distance 3 | Distance ≥4 |
|---|---|---|---|---|---|
| 0 | 100 | 0 | 0 | 0 | 0 |
| 0.1 | 99.31 | 0.52 | 0.09 | 0.04 | 0.04 |
| 0.2 | 98.79 | 0.91 | 0.21 | 0.07 | 0.02 |
| 0.5 | 97.2 | 2.17 | 0.41 | 0.17 | 0.05 |
| 1 | 94.88 | 3.72 | 0.92 | 0.41 | 0.07 |
| 2 | 90.85 | 6.43 | 1.79 | 0.83 | 0.1 |
Results show the recall of GGMAP and the quality of BGREAT mapping, as represented by the “distance to optimum” value. For instance, with 1% errors in the simulated reads, 94.88% of the reads were mapped at the optimum (distance 0), 3.72% were mapped with a distance to the optimum of one, and so on. Because of approximate repeats in human chromosome 1, the reported distance to optimum is an upper bound