Viachaslau Tsyvina, David S Campo, Seth Sims, Alex Zelikovsky, Yury Khudyakov, Pavel Skums.
Abstract
BACKGROUND: Many biological analysis tasks require the extraction of families of genetically similar sequences from large datasets produced by next-generation sequencing (NGS). Such tasks include detecting viral transmissions by analyzing all genetically close pairs of sequences from viral datasets sampled from infected individuals, and studying the evolution of viruses or immune repertoires by analyzing networks of intra-host viral variants or antibody clonotypes formed by genetically close sequences. The most obvious naive algorithms for extracting such sequence families are impractical given the massive size of modern NGS datasets.
Keywords: Edit distance; Filtering; Hamming distance; K-mer; Similarity join; Similarity search
Year: 2018 PMID: 30343669 PMCID: PMC6196405 DOI: 10.1186/s12859-018-2333-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Distribution of nucleotide entropy along the E1/E2 region of HCV for a population of 469 unrelated genotype 1a sequences obtained from NCBI
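To make the quantity plotted in Fig. 1 concrete, the following is a minimal sketch of a per-position nucleotide entropy profile. It assumes that the sequences are already aligned to equal length and that "entropy" means Shannon entropy in bits; the function names `column_entropy` and `entropy_profile` and the toy sequences are illustrative and do not come from the paper.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed at one alignment position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_profile(aligned_seqs):
    """Entropy of every position of a set of equal-length (aligned) sequences."""
    # zip(*seqs) iterates over alignment columns.
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Toy data (hypothetical, not HCV sequences).
seqs = ["ACGT", "ACGA", "ACCA", "ATCA"]
profile = entropy_profile(seqs)
```

A fully conserved position has entropy 0, while a position split evenly between two nucleotides has entropy 1 bit.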
Fig. 2 Example of two identical pairs of strings with equal (k=4) (a) and entropy-based (b) segment sizes and t=1. In case (a) the pair passes the filter; in case (b) it does not
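The effect illustrated in Fig. 2 can be sketched with a generic pigeonhole filter. This is an assumption-laden illustration rather than the paper's signature scheme: it uses Hamming distance on equal-length strings and splits each string into t+1 contiguous segments, so that any pair within distance t must agree exactly on at least one segment. The helper names (`split_points`, `passes_filter`, `hamming`) and the example strings and boundaries are hypothetical.

```python
def split_points(length, parts):
    """Boundaries of `parts` contiguous, near-equal segments of a string."""
    return [i * length // parts for i in range(parts + 1)]


def passes_filter(s, q, t, bounds=None):
    """True if s and q agree exactly on at least one of the t+1 segments.

    Pigeonhole argument: t mismatches can touch at most t segments, so a
    pair with Hamming distance <= t always passes (no false negatives).
    `bounds` allows unequal segment boundaries, e.g. entropy-based ones.
    """
    if len(s) != len(q):
        raise ValueError("this sketch assumes equal-length strings")
    if bounds is None:
        bounds = split_points(len(s), t + 1)
    return any(s[a:b] == q[a:b] for a, b in zip(bounds, bounds[1:]))


def hamming(s, q):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(s, q))


s, q, t = "ACGTACGT", "ACGTTGCA", 1
equal_result = passes_filter(s, q, t)               # equal segments of size 4
uneven_result = passes_filter(s, q, t, [0, 6, 8])   # hypothetical unequal split
```

Here the pair has Hamming distance 4, yet it passes the equal-size filter because its first four characters match, making it a false positive. The unequal split rejects it, which is the kind of gain that segment boundaries adapted to sequence variability aim for.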
Results of Filter Composition pipeline and k-mer based signature scheme filtering for Sample pair filtering and Inter-Sample Sequence Retrieval problems
| Method | Filter Composition | Signature Scheme |
|---|---|---|
| Percent of filtered sample pairs | 85.1% | 92% |
| Percent of filtered sequence pairs | 91.5% | 99.996% |
| Total Time | ∼ 5 min | ∼ 15 sec |
Algorithm run time without optimization subroutines
| Feature | Time |
|---|---|
| No sample pair filter | ∼ 21.3 |
| No sorting of | ∼ 38.1 |
Intra-sample Sequence Retrieval Running Time
| Dataset | Pairs in output | Brute force time, s | Signature method time, s |
|---|---|---|---|
| d1 | 60 421 | 6.6 | 0.2 |
| d2 | 370 262 | 25.9 | 0.3 |
| d3 | 1 800 945 | 102 | 1.8 |
| d4 | 5 848 556 | 413 | 2.8 |
| d5 | 18 570 536 | 1 624 | 4 |
| d6 | 38 835 302 | 6 499 | 7.8 |
| d7 | 155 373 208 | 26 400 | 23 |
| d8 | 621 556 832 | 105 555 | 83 |
| m1 | 51 453 578 | 883 | 17 |
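For orientation, the brute-force column corresponds to comparing every pair of sequences directly. The following is a minimal sketch of such a baseline, assuming Hamming distance on equal-length sequences for simplicity; the distance measure and implementation used for the reported timings may differ. The names `hamming_within` and `brute_force_pairs` and the toy input are illustrative.

```python
from itertools import combinations


def hamming_within(s, q, t):
    """True if s and q have equal length and at most t mismatches."""
    if len(s) != len(q):
        return False
    mismatches = 0
    for x, y in zip(s, q):
        if x != y:
            mismatches += 1
            if mismatches > t:  # stop early once the threshold is exceeded
                return False
    return True


def brute_force_pairs(seqs, t):
    """All index pairs (i, j), i < j, whose sequences are within distance t."""
    return [
        (i, j)
        for (i, s), (j, q) in combinations(enumerate(seqs), 2)
        if hamming_within(s, q, t)
    ]


# Toy input (hypothetical).
toy = ["AAAA", "AAAT", "TTTT", "AATT"]
close_pairs = brute_force_pairs(toy, 1)
```

The number of comparisons grows quadratically with the number of sequences. For d8 the table reports 105 555 s for brute force against 83 s for the signature method, roughly a 1 270-fold difference.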
Fig. 3 Running times of method from [5] (blue) and the proposed method (red) on datasets d1-d8
Fig. 4 Comparison of running times of equal segment size and entropy-based approaches for single sample problem
Filtering quality (unaligned sequences)
| Test | Pairs in output | Pairs that passed filtering | Filtering PPV | ||
|---|---|---|---|---|---|
| d1 | 60 421 | 65 937 | 0.9163 | 5 517 | 1.1% |
| d2 | 370 262 | 397 987 | 0.9303 | 18 754 | 0.93% |
| d3 | 1 800 945 | 1 873 268 | 0.9614 | 72 820 | 0.91% |
| d4 | 5 848 556 | 6 256 934 | 0.9347 | 411 660 | 1.28% |
| d5 | 18 570 536 | 21 028 890 | 0.8831 | 2 477 531 | 1.94% |
| d6 | 38 835 302 | 46 744 915 | 0.8308 | 7 952 495 | 1.55% |
| d7 | 155 373 208 | 187 011 650 | 0.8308 | 31 809 970 | 1.55% |
| d8 | 621 556 832 | 748 119 580 | 0.8308 | 127 239 860 | 1.55% |
| m1 | 51 453 578 | 54 640 978 | 0.9417 | 7 303 118 | 14.2% |
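The PPV column appears to be the ratio of pairs in the output to pairs that passed filtering; this definition is inferred from the tabulated values rather than stated in the text above. A short check under that assumption, with the illustrative function name `filtering_ppv`, is:

```python
def filtering_ppv(pairs_in_output, pairs_passed):
    """Fraction of pairs surviving the filter that are true output pairs."""
    return pairs_in_output / pairs_passed


# (pairs in output, pairs that passed filtering) for three rows of the
# unaligned-sequence table.
rows = {
    "d1": (60_421, 65_937),
    "d5": (18_570_536, 21_028_890),
    "d8": (621_556_832, 748_119_580),
}
ppv = {name: round(filtering_ppv(out, passed), 4) for name, (out, passed) in rows.items()}
```

Rounded to four decimals, these ratios match the tabulated PPV values for d1, d5, and d8 (0.9163, 0.8831, and 0.8308).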
Filtering quality (aligned sequences)
| Test | Pairs in output | Pairs that passed filtering | Filtering PPV |
|---|---|---|---|
| d1 | 60 420 | 64 573 | 0.9357 |
| d2 | 379 233 | 385 646 | 0.9834 |
| d3 | 1 800 448 | 1 862 914 | 0.9665 |
| d4 | 5 845 274 | 6 204 049 | 0.9422 |
| d5 | 18 551 359 | 20 706 813 | 0.8959 |
| d6 | 38 792 420 | 44 939 957 | 0.8632 |
| d7 | 155 201 680 | 179 791 828 | 0.8632 |
| d8 | 620 870 720 | 719 231 312 | 0.8632 |
| m1 | 47 101 270 | 48 888 011 | 0.9635 |
Fig. 5 False positive sequence pairs (l(S,Q)>t) at different edit distances l
Fig. 6 Contribution of algorithm subroutines to its total running time, unaligned sequences