Abstract
MOTIVATION: PacBio single molecule real-time sequencing is a third-generation sequencing technique that produces long reads, but with comparatively lower throughput and a higher error rate. The errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space.
Year: 2014 PMID: 25165095 PMCID: PMC4253826 DOI: 10.1093/bioinformatics/btu538
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An example of a short read de Bruijn graph (DBG) of order k = 3. For simplicity, reverse complement k-mers are ignored
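A graph like the one in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_dbg`, the toy reads and the `min_count` parameter (standing in for a solid k-mer threshold) are hypothetical, and reverse complements are ignored, as in the figure.

```python
from collections import Counter

def build_dbg(reads, k=3, min_count=1):
    """Return (nodes, edges) of a node-centric de Bruijn graph of order k.

    nodes: set of k-mers seen at least min_count times in the reads.
    edges: dict mapping each node to the nodes that overlap it by k-1 bases.
    Reverse complements are ignored (a simplification, as in Fig. 1).
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    nodes = {kmer for kmer, n in counts.items() if n >= min_count}
    edges = {
        u: [u[1:] + b for b in "ACGT" if u[1:] + b in nodes]
        for u in nodes
    }
    return nodes, edges

# Toy short reads (hypothetical, not taken from the paper).
reads = ["ACGTA", "CGTAC"]
nodes, edges = build_dbg(reads, k=3)
for kmer in sorted(nodes):
    print(kmer, "->", edges[kmer])
```

Raising `min_count` drops rare k-mers, which are more likely to contain sequencing errors, at the cost of possibly disconnecting the graph.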
Fig. 2. Long read correction method. (a) A long read is partitioned into weak and solid regions (respectively, lines and rectangles) according to the short read DBG. Weak regions starting or ending the long read are called the head or the tail, respectively, while other weak regions are inner regions. Circles in solid regions represent k-mers of the DBG. k-mers around a weak region serve as source and target nodes to search paths in the DBG. Several source/target pairs are used for each weak inner region. (b) On the second inner region, a bridging path between nodes s1 and t1 is found in the DBG to correct this region. On the third region, the path search fails to find a path between nodes s2 and t2. For the tail, an extension path is sought and found from node s3 toward the end. Once found, the corrective sequence of the path is aligned to the tail to determine the optimal substring (thick dotted arrow)
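The partition and bridging steps described in Fig. 2 can be illustrated with a small Python sketch. It rests on several simplifying assumptions that are not from the paper: the function names are hypothetical, a base counts as solid when at least one DBG k-mer of the read covers it, the path search is a plain breadth-first search that returns the shortest bridging path within a length bound, and head/tail extension, multiple source/target pairs and alignment-based path selection are left out.

```python
from collections import deque

def partition(read, nodes, k):
    """Split a read into ('solid' | 'weak', start, end) half-open intervals.

    Assumption: a base is solid if some k-mer of the read that covers it
    is a node of the DBG.
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in nodes:
            for j in range(i, i + k):
                covered[j] = True
    regions, start = [], 0
    for pos in range(1, len(read) + 1):
        if pos == len(read) or covered[pos] != covered[start]:
            regions.append(("solid" if covered[start] else "weak", start, pos))
            start = pos
    return regions

def find_bridging_path(nodes, source, target, max_len):
    """Breadth-first search from source to target k-mer (assumes k >= 2).

    Returns the sequence spelled by the shortest path, or None if no path
    of at most max_len bases exists.
    """
    k = len(source)
    queue = deque([source])
    while queue:
        seq = queue.popleft()
        if len(seq) > k and seq[-k:] == target:
            return seq
        if len(seq) >= max_len:
            continue
        for base in "ACGT":
            if seq[-(k - 1):] + base in nodes:
                queue.append(seq + base)
    return None

# Toy data (hypothetical): the DBG spells ACGTTGCA, the long read has one error.
k = 3
nodes = {"ACG", "CGT", "GTT", "TTG", "TGC", "GCA"}
read = "ACGTAGCA"
regions = partition(read, nodes, k)

corrected = read
for kind, start, end in regions:
    if kind == "weak" and start >= k and end + k <= len(read):  # inner region
        source, target = read[start - k:start], read[end:end + k]
        path = find_bridging_path(nodes, source, target, max_len=2 * (end - start) + 2 * k)
        if path is not None:
            corrected = read[:start - k] + path + read[end + k:]
print(regions)
print(corrected)
```

The loop handles a single inner weak region, which is all the toy read contains; a read with several weak regions would need the offsets adjusted after each replacement.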
Fig. 3. Effect of parameters on the runtime and gain of our method. We varied k, solid k-mer threshold, branching limit, maximum error rate and number of target k-mers one at a time, while other parameters were kept constant
Runtime, memory, disk usage and accuracy statistics as reported by Error Correction Toolkit for the error correction tools on the E. coli (top), yeast (middle) and parrot (bottom) datasets
| Data | Method | CPU time | Elapsed time | Memory (GB) | Disk (GB) | FP | TP | FN | Sensitivity | Gain |
|---|---|---|---|---|---|---|---|---|---|---|
| E. coli | PacBioToCA | 45 h 18 min | 3 h 12 min | 9.91 | 13.59 | NA | NA | NA | NA | NA |
| E. coli | LSC | 39 h 48 min | 2 h 56 min | 8.21 | 8.51 | 695773 | 3149629 | 7845597 | 0.2865 | 0.2232 |
| E. coli | LoRDEC | 2 h 16 min | 10 min | 0.96 | 0.41 | 102427 | 9994561 | 1000665 | 0.9090 | 0.8997 |
| Yeast | PacBioToCA | 792 h 41 min | 21 h 57 min | 13.88 | 214 | NA | NA | NA | NA | NA |
| Yeast | LSC | 1200 h 46 min | 130 h 16 min | 24.04 | 517 | 7766700 | 38741658 | 80597251 | 0.3246 | 0.2596 |
| Yeast | LoRDEC | 56 h 8 min | 3 h 37 min | 0.97 | 1.63 | 2784685 | 100568850 | 18770059 | 0.8427 | 0.8194 |
| Parrot | LoRDEC | 568 h 48 min | 29 h 7 min | 4.61 | 74.85 | 10591097 | 226996640 | 26296446 | 0.8962 | 0.8544 |
Note. Memory and disk usage are in gigabytes. The statistics could not be computed for reads corrected by PacBioToCA because PacBioToCA only reports trimmed and split reads.
a: Run in parallel on six servers. Memory usage is for one server.
b: Run in parallel on three servers. Memory usage is for one server.
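The Sensitivity and Gain columns are consistent with the usual definitions of these measures, sensitivity = TP / (TP + FN) and gain = (TP - FP) / (TP + FN). The short Python check below recomputes both from the FP, TP and FN counts in the table; the function and variable names are illustrative only, and the formulas are stated here as an assumption about how the toolkit derives the two columns.

```python
def sensitivity(tp, fn):
    """Fraction of erroneous positions that were corrected."""
    return tp / (tp + fn)

def gain(tp, fp, fn):
    """Net fraction of errors removed; newly introduced errors (FP) count against it."""
    return (tp - fp) / (tp + fn)

# (data, method): (FP, TP, FN, reported sensitivity, reported gain), copied from the table.
rows = {
    ("E. coli", "LSC"): (695773, 3149629, 7845597, 0.2865, 0.2232),
    ("E. coli", "LoRDEC"): (102427, 9994561, 1000665, 0.9090, 0.8997),
    ("Yeast", "LSC"): (7766700, 38741658, 80597251, 0.3246, 0.2596),
    ("Yeast", "LoRDEC"): (2784685, 100568850, 18770059, 0.8427, 0.8194),
    ("Parrot", "LoRDEC"): (10591097, 226996640, 26296446, 0.8962, 0.8544),
}

recomputed = {
    key: (round(sensitivity(tp, fn), 4), round(gain(tp, fp, fn), 4))
    for key, (fp, tp, fn, _, _) in rows.items()
}
for key, (sens, g) in recomputed.items():
    print(key, sens, g)
```

Because false positives are subtracted in the numerator, gain is never higher than sensitivity, and it becomes negative when a tool introduces more errors than it removes.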
Alignment statistics of the reads corrected by different tools on the E. coli (top), yeast (middle) and parrot (bottom) datasets
| Data | Method | Size | Aligned | Identity | Genome coverage (expected) | Genome coverage (observed) |
|---|---|---|---|---|---|---|
| E. coli | Original | 1.0000 | 0.8800 | 0.9486 | 1.0000 | 0.9768 |
| E. coli | PacBioToCA | 0.7759 | 0.9965 | 0.9988 | 1.0000 | 0.9936 |
| E. coli | LSC (full) | 0.8946 | 0.9269 | 0.9579 | 1.0000 | 1.0000 |
| E. coli | LSC (trim) | 0.6824 | 0.9611 | 0.9725 | 1.0000 | 1.0000 |
| E. coli | LoRDEC (full) | 0.9318 | 0.8934 | 0.9952 | 1.0000 | 1.0000 |
| E. coli | LoRDEC (trim) | 0.8692 | 0.9419 | 0.9968 | 1.0000 | 1.0000 |
| E. coli | LoRDEC (trim + split) | 0.8184 | 0.9950 | 0.9997 | 1.0000 | 0.9979 |
| Yeast | Original | 1.0000 | 0.7900 | 0.9276 | 1.0000 | 0.9834 |
| Yeast | PacBioToCA | 0.7620 | 0.9887 | 0.9934 | 1.0000 | 0.9986 |
| Yeast | LSC (full) | 0.8760 | 0.8570 | 0.9420 | 1.0000 | 0.9988 |
| Yeast | LSC (trim) | 0.7020 | 0.9277 | 0.9544 | 1.0000 | 0.9992 |
| Yeast | LoRDEC (full) | 0.9771 | 0.8138 | 0.9741 | 1.0000 | 0.9995 |
| Yeast | LoRDEC (trim) | 0.9270 | 0.8492 | 0.9758 | 1.0000 | 0.9996 |
| Yeast | LoRDEC (trim + split) | 0.7412 | 0.9790 | 0.9928 | 1.0000 | 0.9984 |
| Parrot | Original | 1.0000 | 0.5060 | 0.9258 | 0.9235 | 0.8406 |
| Parrot | LoRDEC (full) | 0.9719 | 0.7633 | 0.9826 | 0.9769 | 0.9103 |
| Parrot | LoRDEC (trim) | 0.8423 | 0.8678 | 0.9838 | 0.9756 | 0.9085 |
| Parrot | LoRDEC (trim + split) | 0.7453 | 0.9782 | 0.9884 | 0.9773 | 0.9042 |
Note. The Size column shows the ratio between the size of the read set and the size of the original read set. The Aligned column shows the ratio between the size of the aligned regions of the reads and the size of the read set. The Identity column shows the alignment identity of the aligned regions. The last two columns give the expected and observed genome coverage by aligned reads, i.e. the proportion of the reference sequence covered by at least one read.
Fig. 4. Percentage of the parrot genome covered by raw and corrected reads as a function of read depth. The percentages (y-axis in log scale) are plotted for the true alignments (in black) and under the assumption that the alignments are uniformly distributed over the genome (in white). Raw reads are represented by squares and corrected reads by circles. The curves for corrected reads dominate those of raw reads, as correction increases the number of reads mapped. The black curves adopt similar shapes, suggesting that correction is not seriously impacted by repeats; their distances to the white curves suggest that a bias related to genomic location is already present in the raw reads