Yan Gao, Bo Liu, Yadong Wang, Yi Xing.
Abstract
MOTIVATION: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies can produce long reads of up to tens of kilobases, but with high error rates. To reduce sequencing error, Rolling Circle Amplification (RCA) has been used to improve library preparation by amplifying circularized template molecules. The linear products of RCA contain multiple tandem copies of the template molecule. With additional in silico processing steps, these tandem sequences can be collapsed into a consensus sequence with higher accuracy than the original raw reads. Existing pipelines that use alignment-based methods to discover tandem repeat patterns in long reads are either inefficient or lack sensitivity.
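To make the idea of collapsing tandem copies into a consensus concrete, here is a minimal Python sketch. It assumes the repeat period is already known and that errors are substitutions only, so the copies can be stacked and resolved by a column-wise majority vote. Real long reads also contain insertions and deletions, so the copies would need to be aligned before voting. The function name `consensus_from_copies` and the toy sequences are illustrative, not taken from the paper.

```python
from collections import Counter


def consensus_from_copies(read: str, period: int) -> str:
    """Column-wise majority vote over consecutive, equal-length copies.

    Assumption: substitution errors only, so copy i starts at i * period.
    """
    copies = [read[i:i + period]
              for i in range(0, len(read) - period + 1, period)]
    consensus = []
    for column in zip(*copies):
        # Pick the most frequent base seen at this position across copies.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


template = "ACGTTGCA"
# Three copies of the template, each with one substitution at a different position.
noisy_read = "ACGTTGCT" + "AGGTTGCA" + "ACGTAGCA"
print(consensus_from_copies(noisy_read, len(template)))
```

Because each error falls in a different column, the two correct copies outvote the erroneous one at every position, which is the intuition behind the accuracy gain described above.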
Year: 2019; PMID: 31510677; PMCID: PMC6612900; DOI: 10.1093/bioinformatics/btz376
Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937
Fig. 1. Chaining of tandem repeat anchors. The three arrows represent three copies of a template sequence. Each vertical line represents the seed of a k-mer, and seeds drawn at the same height are identical k-mers. Each horizontal line represents a tandem repeat hit between identical k-mers. Solid and dashed lines mark hits whose distances are likely and unlikely, respectively, to equal the true repeat pattern size. After dynamic programming, the optimal chain is expected to consist of anchors whose hit distances are close to the repeat pattern size.
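The chaining idea in this caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's actual scoring scheme. A hit links each k-mer to its previous occurrence, and a quadratic-time dynamic program then finds the longest chain of hits that keep the same order and have similar hit distances. The names `tandem_hits`, `best_chain`, `estimate_period` and the parameter `max_dist_diff` are hypothetical.

```python
from collections import Counter


def tandem_hits(seq, k):
    """Link every k-mer occurrence to the previous occurrence of the same k-mer."""
    last_seen = {}
    hits = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if kmer in last_seen:
            hits.append((last_seen[kmer], pos))  # (start, end) of one hit
        last_seen[kmer] = pos
    return hits


def best_chain(hits, max_dist_diff=2):
    """Longest chain of co-linear hits with similar hit distances (O(n^2) DP)."""
    hits = sorted(hits, key=lambda h: (h[1], h[0]))
    if not hits:
        return []
    score = [1] * len(hits)
    back = [-1] * len(hits)
    for i, (s_i, e_i) in enumerate(hits):
        for j in range(i):
            s_j, e_j = hits[j]
            same_order = s_j < s_i and e_j < e_i
            similar = abs((e_i - s_i) - (e_j - s_j)) <= max_dist_diff
            if same_order and similar and score[j] + 1 > score[i]:
                score[i] = score[j] + 1
                back[i] = j
    # Trace back from the best-scoring hit.
    i = max(range(len(hits)), key=lambda idx: score[idx])
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = back[i]
    return chain[::-1]


def estimate_period(chain):
    """Most common hit distance among the chained anchors."""
    distances = Counter(end - start for start, end in chain)
    return distances.most_common(1)[0][0]


unit = "ACGTTGCAGT"
read = unit + "ACGTTCCAGT" + unit  # the second copy carries one substitution
chain = best_chain(tandem_hits(read, k=4))
print(estimate_period(chain))
```

In this toy read, the substitution destroys some k-mers in the second copy, so a few k-mers in the third copy link back to the first copy with twice the true distance. Those correspond to the dashed hits in the figure, and the chain of hits at the true distance outscores them.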
Fig. 2. Searching for the repeat unit boundary based on global alignment. s and e are the current repeat boundaries. Two anchors, (s1, e1) and (s2, e2), are selected because their starting positions are the closest to e. The two subsequences from s1 to s2 and from e1 to e2 are extracted and aligned end to end with a global alignment. The next repeat unit boundary is calculated from the alignment result. In this example, the base G at e is matched with a G in the subsequence from e1 to e2, whose coordinate is then taken as the putative next boundary.
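The boundary search described in this caption can be illustrated with a small Python sketch. It assumes a plain Needleman-Wunsch global alignment with unit scores, which is a stand-in and not necessarily the scoring used by the authors. The subsequence from s1 to s2 is aligned against the subsequence from e1 to e2, and the position of e in the first is projected onto the second. The function names and the toy read are hypothetical.

```python
def global_alignment_map(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch; return {index in a: index in b} for aligned base pairs
    (matches and mismatches, not gaps)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner (end-to-end alignment).
    mapping = {}
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping


def next_boundary(read, e, anchor1, anchor2):
    """Project boundary e through the alignment of read[s1:s2] and read[e1:e2]."""
    (s1, e1), (s2, e2) = anchor1, anchor2
    mapping = global_alignment_map(read[s1:s2], read[e1:e2])
    offset = mapping.get(e - s1)  # None if e falls opposite a gap
    return None if offset is None else e1 + offset


# Toy read: the middle copy has lost one base, so copy 3 starts at 19, not 20.
read = "ACGTTGCAGT" + "ACGTTGCGT" + "ACGTTGCAGT"
print(next_boundary(read, e=10, anchor1=(3, 13), anchor2=(11, 20)))
```

The alignment absorbs the deleted base between the two anchors, so the projected boundary lands on the true start of the third copy instead of simply adding the anchor distance to e.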
Fig. 3. Dynamic programming matrix of sequence-to-graph alignment and the three types of operations. SIMD parallelization is applicable to the match and deletion operations because they rely only on previous rows. The insertion operation must be processed sequentially because it depends on the left cell, which is in the same row.
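The row dependency described in this caption can be shown with a two-pass row update. The sketch below makes two simplifying assumptions: the graph is a single linear path (an ordinary target sequence), and the scoring is unit-cost edit distance. It contains no actual SIMD code. The first pass only marks the part that could be vectorized, and the second pass is the left-to-right sweep that insertions require. The function name `edit_distance_two_pass` is hypothetical.

```python
def edit_distance_two_pass(target: str, query: str) -> int:
    """Edit distance computed row by row in two passes per row."""
    n = len(query)
    prev = list(range(n + 1))  # row 0: the query aligned against nothing
    for i, t in enumerate(target, start=1):
        # Pass 1: match/mismatch and deletion read only the previous row,
        # so every cell is independent of its neighbours in this row.
        row = [i] + [
            min(prev[j - 1] + (t != query[j - 1]),  # match or mismatch
                prev[j] + 1)                        # deletion
            for j in range(1, n + 1)
        ]
        # Pass 2: insertion reads the left cell of the same row,
        # so the cells must be visited in order.
        for j in range(1, n + 1):
            row[j] = min(row[j], row[j - 1] + 1)
        prev = row
    return prev[n]


print(edit_distance_two_pass("kitten", "sitting"))
```

Splitting the update this way gives the same result as the usual three-way recurrence, because the sequential sweep applies the insertion case to cells that already hold the better of the other two cases.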
Performance on datasets with five error rates and distributions (1000 reads for each error rate and distribution, repeat pattern size is 1000 bp, copy number is 10)
| Error rate (sub.:ins.:del.) | Tool | Accuracy (%) | Ave. copy number | Ave. identical base | Run time (CPU min) |
|---|---|---|---|---|---|
| 13% (41:23:36) | TideHunter | | | | |
| 13% (41:23:36) | INC-seq | 99.7 | 8.7 | 969.6 | 95.7 |
| 13% (41:23:36) | TRF | 71.4 | 9.6 | 960.6 | 3.7 |
| 15%-a (37:42:21) | TideHunter | | | | |
| 15%-a (37:42:21) | INC-seq | 99.9 | 7.9 | 975.2 | 96.3 |
| 15%-a (37:42:21) | TRF | 58.1 | 7.4 | 925.9 | 3.5 |
| 15%-b (11:60:29) | TideHunter | | | | |
| 15%-b (11:60:29) | INC-seq | 96.4 | 5.1 | 958.0 | 85.0 |
| 15%-b (11:60:29) | TRF | 88.8 | 9.1 | 970.1 | 2.7 |
| 16% (28:24:48) | TideHunter | | | | |
| 16% (28:24:48) | INC-seq | 83.5 | 4.0 | 887.0 | 72.0 |
| 16% (28:24:48) | TRF | 39.2 | 6.2 | 886.0 | 3.7 |
| 20% (48:15:37) | TideHunter | | | | 0.9 |
| 20% (48:15:37) | INC-seq | 99.1 | 5.8 | 939.0 | 92.2 |
| 20% (48:15:37) | TRF | 0.0 | 0.0 | 0.0 | |
Note: The best performance regarding each specific feature on each dataset.
Performance on datasets with five repeat copy numbers (1000 reads for each copy number, error rate is 15%, error distribution is 37:42:21, repeat pattern size is 1000 bp)
| Copy number | Tool | Accuracy (%) | Ave. copy number | Ave. identical base | Run time (CPU min) |
|---|---|---|---|---|---|
| 2 | TideHunter | | | 814.5 | |
| 2 | INC-seq | 0.0 | 0.0 | 0.0 | 18.0 |
| 2 | TRF | 0.1 | 1.9 | | 0.05 |
| 3 | TideHunter | 89.3 | | 887.9 | |
| 3 | INC-seq | | 2.0 | 876.7 | 26.5 |
| 3 | TRF | 7.6 | 2.3 | | 0.2 |
| 5 | TideHunter | | | | |
| 5 | INC-seq | 99.2 | 3.8 | 934.1 | 47.1 |
| 5 | TRF | 31.1 | 3.8 | 916.2 | 0.8 |
| 10 | TideHunter | 99.9 | | | |
| 10 | INC-seq | | 7.8 | 971.6 | 91.0 |
| 10 | TRF | 60.1 | 6.8 | 921.1 | 3.6 |
| 20 | TideHunter | | | | |
| 20 | INC-seq | 100.0 | 16.9 | 988.3 | 181.5 |
| 20 | TRF | 84.6 | 14.6 | 934.0 | 10.3 |
Note: The best performance regarding each specific feature on each dataset.
Performance on datasets with five repeat pattern sizes (1000 reads for each repeat pattern size, error rate is 15%, error distribution is 37:42:21, copy number is 10)
| Repeat pattern size | Tool | Accuracy (%) | Ave. copy number | Ave. identical base | Run time (CPU min) |
|---|---|---|---|---|---|
| 100 bp | TideHunter | | | | |
| 100 bp | INC-seq | 37.7 | 2.6 | 36.5 | 12.3 |
| 100 bp | TRF | 70.2 | 8.4 | 67.3 | 0.2 |
| 500 bp | TideHunter | | | | |
| 500 bp | INC-seq | 87.4 | 4.2 | 453.8 | 50.8 |
| 500 bp | TRF | 71.9 | 7.6 | 463.0 | 1.8 |
| 1000 bp | TideHunter | | | | |
| 1000 bp | INC-seq | 100.0 | 7.9 | 974.0 | 96.4 |
| 1000 bp | TRF | 61.5 | 7.4 | 928.7 | 3.2 |
| 2000 bp | TideHunter | | | | 4.3 |
| 2000 bp | INC-seq | 99.8 | 9.0 | 1955.5 | 163.4 |
| 2000 bp | TRF | 30.0 | 5.4 | 1819.0 | |
| 3000 bp | TideHunter | | | | 11.6 |
| 3000 bp | INC-seq | 98.8 | 9.0 | 2934.2 | 300.4 |
| 3000 bp | TRF | 0.0 | 0.0 | 0.0 | |
Note: The best performance regarding each specific feature on each dataset.
Performance on the SIRV E2 dataset (603 906 Nanopore 1D reads with at least one splint sequence were used for evaluation)
| Tool | # consensus | # full-length reads | # mappable reads (mappable ratio %) | Error rate (%) | Run time (CPU hour) |
|---|---|---|---|---|---|
| TideHunter | | | | 5.1 | |
| C3POa | 136 243 | 119 503 | 119 267 | | 204.5 |
| INC-seq | 115 963 | 110 645 | 107 159 (96.8) | 6.5 | 630.6 |
| TRF | 118 040 | 110 079 | 105 145 (95.5) | 6.7 | 13.8 |
Performance on synthetic 16S rRNA datasets (only Nanopore 2D reads were used for evaluation)
| Dataset (# reads) | Tool | # consensus | # mappable reads (mappable ratio %) | Error rate (%) | Run time (CPU min) |
|---|---|---|---|---|---|
| Simple | TideHunter | | | | |
| Simple | INC-seq | 2178 | 2174 (99.8) | 4.8 | 119.0 |
| Simple | TRF | 2542 | 2540 | 4.5 | 14.5 |
| Rep.1 | TideHunter | | | | |
| Rep.1 | INC-seq | 1076 | 1074 | 4.7 | 50.4 |
| Rep.1 | TRF | 1360 | 1356 (99.7) | 3.5 | 5.5 |
| Rep.2 | TideHunter | | | | |
| Rep.2 | INC-seq | 1183 | 1178 (99.6) | 4.3 | 43.9 |
| Rep.2 | TRF | 1330 | 1326 | 3.4 | 4.8 |
Simple: simple community dataset with 3 bacteria.
Rep.1: replicate 1 of 10 bacteria community.
Rep.2: replicate 2 of 10 bacteria community.