Robert Logan, Zoe Fleischmann, Sofia Annis, Amy Wangsness Wehe, Jonathan L Tilly, Dori C Woods, Konstantin Khrapko.
Abstract
BACKGROUND: Third-generation sequencing offers some advantages over its next-generation sequencing predecessors, with the caveat of a much higher error rate. Clustering related sequences is an essential task in modern biology. To accurately cluster error-rich sequences, error type and frequency need to be accounted for. Levenshtein distance is a well-established algorithm for measuring the edit distance between words and can assign specific weights to insertions, deletions and substitutions. However, Levenshtein distance has drawbacks in a biological context and has therefore rarely been used for this purpose. We present novel modifications to the Levenshtein distance algorithm that optimize it for clustering error-rich biological sequencing data.
Keywords: Clustering; Edit-distance; Single-molecule sequencing
Year: 2022 PMID: 35307007 PMCID: PMC8934446 DOI: 10.1186/s12859-022-04637-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 3GOLD accommodates upstream and downstream frameshift. (A) SLD is lower than LD when computing a single deletion-induced downstream frameshift. If the single-base deletion instead causes an upstream frameshift, there is no benefit to using SLD over LD. (B) The benefit of SLD accommodating frameshift can be rescued for an upstream frameshift by calculating the SLD of mirrored sequences, which effectively converts the upstream frameshift into a downstream frameshift.
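The mirroring idea in Fig. 1 can be illustrated with a minimal Python sketch. It assumes that SLD denotes the Sequence-Levenshtein distance in its usual formulation, that is, the minimum over the last row and last column of the unweighted Levenshtein matrix, so that trailing bases shifted in or out of the read are free. The function names and the upstream example sequence `ATAGTAGC` are hypothetical and chosen for illustration; this is not the 3GOLD implementation.

```python
def levenshtein_matrix(source, target):
    """Unweighted Levenshtein matrix: d[i][j] is the cost of turning
    source[:i] into target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete every base of source[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert every base of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d


def ld(source, target):
    """Classic Levenshtein distance (bottom-right cell)."""
    return levenshtein_matrix(source, target)[-1][-1]


def sld(source, target):
    """Assumed SLD: cheapest cell in the last row or last column,
    so a mismatch confined to the 3' end of either read is not charged."""
    d = levenshtein_matrix(source, target)
    return min(min(d[-1]), min(row[-1] for row in d))


def mirrored_sld(source, target):
    """SLD of the reversed (mirrored) sequences, which moves an
    upstream frameshift to the downstream end."""
    return sld(source[::-1], target[::-1])


# Downstream frameshift: a deleted "C" pulls an extra "T" in at the end.
print(ld("TAGTAGCT", "TAGCTAGC"), sld("TAGTAGCT", "TAGCTAGC"))  # 2 1
# Upstream frameshift: the extra base enters at the start instead.
print(sld("ATAGTAGC", "TAGCTAGC"), mirrored_sld("ATAGTAGC", "TAGCTAGC"))  # 2 1
```

How the forward and mirrored values are combined in 3GOLD is not stated in the caption; taking the smaller of the two would be one plausible choice.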
Fig. 2 3GOLD is faster than SLD and LD. (A) Computation time increases linearly with increased dataset depth. Sequence length was held constant at 50 bases. (B) Computation time increases quadratically with increased sequence length. Dataset depth was held constant at 10,000 sequences.
Fig. 3 SLD cannot accommodate weighted errors. (A) LD comparison of TAGCTAGC to TAGTAGCT reveals that an insertion of “C” and a deletion of “T” are required to make the second string match the first, for an unweighted edit distance of 2. SLD analysis only considers the insertion of “C”, for an unweighted edit distance of 1. (B) When insertions are weighted 1 and deletions are weighted 5, LD is appropriately 6, reflecting an insertion of “C” and a deletion of “T”. However, SLD is not 1 as expected, but rather 3.
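The LD values quoted in Fig. 3 can be reproduced with a short weighted Levenshtein sketch in Python. The function name and parameters are hypothetical, and the substitution weight is an assumption: the caption gives only the insertion and deletion weights, so the example sets `sub_cost=2`. With a substitution weight of 1, five substitutions (cost 5) would be cheaper than the insertion-plus-deletion path (cost 6).

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Cost of editing `source` into `target` with per-operation weights.

    Uses two rolling rows of the dynamic-programming matrix, so memory
    is O(len(target)) while time stays O(len(source) * len(target)).
    """
    m = len(target)
    # Row 0: build target[:j] from an empty source with j insertions.
    prev = [j * ins_cost for j in range(m + 1)]
    for i in range(1, len(source) + 1):
        # Column 0: erase source[:i] with i deletions.
        curr = [i * del_cost] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost)
            curr[j] = min(prev[j] + del_cost,      # delete source[i-1]
                          curr[j - 1] + ins_cost,  # insert target[j-1]
                          diag)                    # match or substitute
        prev = curr
    return prev[m]


# Edit the second string of Fig. 3 (TAGTAGCT) into the first (TAGCTAGC).
print(weighted_levenshtein("TAGTAGCT", "TAGCTAGC"))  # 2
# Insertions weighted 1, deletions weighted 5; sub_cost=2 is an assumption.
print(weighted_levenshtein("TAGTAGCT", "TAGCTAGC",
                           ins_cost=1, del_cost=5, sub_cost=2))  # 6
```

The nested loop over both sequences is also why the run time of a single comparison grows quadratically with sequence length.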
Sensitivity and specificity of clustering tools on ONT MinION R9.4.1 biological data
| Clustering tool | Specificity | Sensitivity range | Sensitivity average | Sensitivity insignificant P-values |
|---|---|---|---|---|
| 3GOLD | 100% (0.00) | 100–67% | 98.83% (5.74) | |
| SLD | 100% (0.00) | 92–22% | 70.84% (13.04) | |
| LD | 100% (0.00) | 70–25% | 46.07% (10.45) | LD vs. CD-HIT-EST [0.0733] |
| Starcode | 100% (0.00) | 70–21% | 42.44% (10.78) | Starcode vs. LD [0.3244] |
| CD-HIT-EST | 100% (0.00) | 85–24% | 50.83% (17.50) | |
| DNACLUST | 100% (0.00) | 49–20% | 28.43% (8.59) | |
Standard deviations are given in parentheses and P values in brackets. Only statistically insignificant P values (P > 0.05) are shown, to keep the table compact and easier to read; all other P values are < 0.0001.
Characteristics of clusters formed on ONT MinION R9.4.1 biological data
| Clustering tool | Total clustered | Singletons | Qualified clusters | Cluster size range | Time to cluster (s) |
|---|---|---|---|---|---|
| 3GOLD | 9488 | 10 | 96 | 100–67 | 3,830.411 |
| SLD | 6801 | 2199 | 96 | 92–22 | 69,712.716 |
| LD | 4192 | 3259 | 91 | 70–25 | 72,766.322 |
| Starcode | 3735 | 3463 | 88 | 70–21 | 0.174 |
| CD-HIT-EST | 4829 | 1827 | 95 | 85–24 | 2.370 |
| DNACLUST | 995 | 4045 | 35 | 49–20 | 4.997 |
Time was measured in seconds