Leena Salmela, Kingshuk Mukherjee, Simon J Puglisi, Martin D Muggli, Christina Boucher.
Abstract
MOTIVATION: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding of assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although another method for error-correcting Rmap data exists, named cOMet, it does not scale to even moderately large genomes. The challenge in error correction is determining pairs of Rmaps that originate from the same region of the same genome.
Year: 2020 PMID: 31504206 PMCID: PMC7005598 DOI: 10.1093/bioinformatics/btz663
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The cut site representation (above) and the fragment length representation (below).
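The two representations named in Fig. 1 carry the same information, so converting between them comes down to differences and cumulative sums. The following is a minimal Python sketch, assuming the cut sites are given as sorted positions that include both ends of the Rmap. The function names are illustrative and are not taken from Elmeri or cOMet.

```python
from itertools import accumulate
from typing import List


def cut_sites_to_fragments(cut_sites: List[float]) -> List[float]:
    """Return the distances between consecutive cut sites.

    Assumes `cut_sites` is sorted in increasing order.
    """
    return [right - left for left, right in zip(cut_sites, cut_sites[1:])]


def fragments_to_cut_sites(fragments: List[float], start: float = 0) -> List[float]:
    """Return cut site positions as cumulative sums of fragment lengths.

    `start` is the position of the first cut site (an assumption of this sketch).
    """
    return list(accumulate(fragments, initial=start))


cut_sites = [0, 5, 12, 14, 20]
fragments = cut_sites_to_fragments(cut_sites)   # [5, 7, 2, 6]
round_trip = fragments_to_cut_sites(fragments)  # [0, 5, 12, 14, 20]
```

The round trip recovers the original positions only up to the choice of `start`, since fragment lengths do not record where the Rmap begins.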
Fig. 2. Extraction of (a) k-mers for k = 4, (b) ℓ-mers for , (c) (ℓ, k)-mers for and k = 3 and (d) spaced (ℓ, k)-mers for , k = 3 and spacing pattern S = 1111110000111111111111111000111100111111 from the Rmap . For (ℓ, k)-mers and spaced (ℓ, k)-mers we use dotted lines to show the extension beyond ℓ positions to include at least k fragments. For spaced (ℓ, k)-mers missing line segments denote spaces in the spacing pattern. The extracted k-mers, ℓ-mers, (ℓ, k)-mers and spaced (ℓ, k)-mers are shown on the right. To extract all k-mers from an Rmap, we first extract the k-mer containing the k leftmost fragments. To get the next k-mer, the leftmost fragment is dropped and the next fragment on the right is added. To extract all ℓ-mers and (spaced) (ℓ, k)-mers we first consider the ℓ-length subsegment of the line positioned at the leftmost position. To get the next ℓ-mer or (spaced) (ℓ, k)-mer, the subsegment is shifted to the right until a cut site enters or exits (the solid part of) the subsegment.
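The extraction rules in the Fig. 2 caption can be illustrated with a short Python sketch. The k-mer function follows the caption directly. The (ℓ, k)-mer function is a simplification and an assumption of this sketch: for each starting fragment it takes the shortest run of whole fragments whose total length is at least ℓ and that contains at least k fragments, instead of sliding an ℓ-length window over cut site positions. The function names and the example fragment lengths are hypothetical.

```python
from typing import List, Sequence, Tuple


def kmers(fragments: Sequence[float], k: int) -> List[Tuple[float, ...]]:
    """All runs of k consecutive fragments, sliding one fragment at a time."""
    return [tuple(fragments[i:i + k]) for i in range(len(fragments) - k + 1)]


def lk_mers(fragments: Sequence[float], ell: float, k: int) -> List[Tuple[float, ...]]:
    """Simplified (ell, k)-mers (an assumption, not Elmeri's exact procedure).

    For each start fragment, extend to the right until the run covers at
    least `ell` in total length and holds at least `k` fragments.
    Runs that reach the end of the Rmap without meeting both conditions
    are skipped.
    """
    result = []
    n = len(fragments)
    for i in range(n):
        total = 0.0
        j = i
        while j < n and (total < ell or j - i < k):
            total += fragments[j]
            j += 1
        if total >= ell and j - i >= k:
            result.append(tuple(fragments[i:j]))
    return result


fragments = [5, 7, 2, 6, 9, 3]
print(kmers(fragments, 4))        # [(5, 7, 2, 6), (7, 2, 6, 9), (2, 6, 9, 3)]
print(lk_mers(fragments, 10, 3))  # [(5, 7, 2), (7, 2, 6), (2, 6, 9), (6, 9, 3)]
```

Under this simplification, setting k = 1 gives a length-only seed comparable to an ℓ-mer. The length condition lets seeds span more fragments where cut sites are dense, and the fragment-count condition keeps seeds from becoming too short in sparse regions.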
Fig. 3. Overview of the error correction process in Elmeri.
Fig. 4. Transforming the Rmap to a binary string. The block size in this example is 5.
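One plausible reading of the transformation in Fig. 4 is sketched below in Python. It assumes that the Rmap is divided into consecutive blocks of equal size and that a block becomes 1 when at least one cut site falls inside it and 0 otherwise. The caption does not spell out this encoding, so treat it as an assumption. The second function shows how a spacing pattern such as S in Fig. 2 could mask the binary string. All names and example values are hypothetical.

```python
import math
from typing import Sequence


def rmap_to_binary(cut_sites: Sequence[float], rmap_length: float, block_size: float) -> str:
    """Encode an Rmap as a binary string, one character per block.

    Assumption of this sketch: a block is '1' if it contains at least one
    cut site. Positions and `block_size` share the same unit.
    """
    n_blocks = math.ceil(rmap_length / block_size)
    bits = ["0"] * n_blocks
    for pos in cut_sites:
        # A cut site exactly at the right end is clamped into the last block.
        index = min(int(pos // block_size), n_blocks - 1)
        bits[index] = "1"
    return "".join(bits)


def apply_spacing_pattern(bits: str, pattern: str) -> str:
    """Keep only the positions of `bits` where `pattern` has a '1'."""
    return "".join(b for b, p in zip(bits, pattern) if p == "1")


binary = rmap_to_binary([3, 12, 14, 27], rmap_length=30, block_size=5)
print(binary)                                   # 101001
print(apply_spacing_pattern(binary, "110011"))  # 1001
```

With this encoding, two cut sites in the same block collapse into a single 1, so small sizing differences between related Rmaps do not change the string. Positions masked by the spacing pattern can hold a spurious or missing cut site without affecting the seed.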
Datasets used in the experiments
| Dataset | Genome | Genome size | Number of Rmaps |
|---|---|---|---|
| Ecoli1 | E. coli | 4.6 Mbp | 123 251–157 743 |
| Ecoli2 | E. coli | 4.6 Mbp | 2504 |
| Human | Chinese individual (HX1) | 3.2 Gbp | 793 199 |
| AnaTes | A. testudineus | 0.66 Gbp | 3 121 480 |
Note: The Ecoli1 dataset contains eight simulated E. coli datasets with varying error rates, so the number of Rmaps also varies. The genome size of A. testudineus is an estimate since no reference genome exists.
Fig. 5. Comparison of the performance of the different indexing schemes. The precision and recall of the different indexing schemes when k or ℓ is varied are shown on the left, and the runtime and memory usage of the different indexing schemes on the right. The performance of the spaced (ℓ, k)-mer index with the default parameters is shown with a black rectangle.
The accuracy, runtime and memory usage of Elmeri and cOMet on simulated E. coli data when the number of additional cut sites introduced per 100 kbp and the rate of missing cut sites are varied
| Added cut sites per 100 kbp | Deleted cut site rate | Percent of Rmaps with improved S-score (Elmeri) | Percent of Rmaps with improved S-score (cOMet) | Mean increase in S-score (Elmeri) | Mean increase in S-score (cOMet) | CPU time (hours) (Elmeri) | CPU time (hours) (cOMet) | Peak memory (GB) (Elmeri) | Peak memory (GB) (cOMet) |
|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 15 | 93.22 | | | 12.46 | | 24.50 | 13.67 | |
| 1 | 5 | | 87.36 | | 6.30 | | 35.85 | 16.11 | |
| 1 | 15 | | 94.01 | | 13.19 | | 28.15 | 16.27 | |
| 1 | 25 | | 96.09 | | 17.20 | | 42.20 | 15.68 | |
| 2 | 5 | | 89.23 | | 7.93 | | 55.01 | 20.20 | |
| 2 | 15 | | 92.99 | | 13.36 | | 25.40 | 20.14 | |
| 2 | 25 | | 93.02 | | 14.70 | | 66.01 | 19.46 | |
| 5 | 15 | | 81.35 | | 6.71 | | 143.15 | 29.52 | |
Note: For each measured quantity we have highlighted the best result.
The accuracy, runtime and memory usage of Elmeri and cOMet on human Bionano data and on A. testudineus genome
| Dataset | Percent of Rmaps with improved S-score (Elmeri) | Percent of Rmaps with improved S-score (cOMet) | Mean increase in S-score (Elmeri) | Mean increase in S-score (cOMet) | CPU time (hours) (Elmeri) | CPU time (hours) (cOMet) | Peak memory (GB) (Elmeri) | Peak memory (GB) (cOMet) |
|---|---|---|---|---|---|---|---|---|
| Human | 69.21 | | | 2.69 | | 236.80 | 101.39 | |
| AnaTes | 59.15 | | | 5.00 | | 7430.99 | 399.60 | |
Note: For each measured quantity we have highlighted the best result.
Assembly results for uncorrected Rmaps, Rmaps corrected by cOMet and Elmeri, and error-free Rmaps
| Rmap status | Number of assembled maps | Genome coverage | Missing/added cut sites |
|---|---|---|---|
| Uncorrected | 5 | 81.2 | 47 |
| Corrected by cOMet | 2 | 82.2 | 34 |
| Corrected by Elmeri | | | 10 |
| Error free | 3 | 79.6 | |
Note: The results for uncorrected Rmaps, Rmaps corrected by cOMet and error-free Rmaps are taken directly from Mukherjee et al. We have highlighted the best results for each column.