Guillaume E Scholz, Benjamin Linard, Nikolai Romashchenko, Eric Rivals, Fabio Pardi.
Abstract
MOTIVATION: Novel recombinant viruses may have important medical and evolutionary significance, as they sometimes display new traits not present in the parental strains. This is particularly concerning when the new viruses combine fragments coming from phylogenetically distinct viral types. Here, we consider the task of screening large collections of sequences for such novel recombinants. A number of methods already exist for this task. However, these methods rely on complex models and heavy computations that are not always practical for a quick scan of a large number of sequences.
Year: 2020 PMID: 33331849 PMCID: PMC8016494 DOI: 10.1093/bioinformatics/btaa1020
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Illustration of the task of inter-strain recombination detection. Top: Example of what strains may look like in a realistic phylogeny (adapted from part of the reference tree for the HIV-pol dataset). Bottom: Illustration of the composition of a query and of the outputs of two programs. The query combines a small segment of a sequence annotated as A1 and a larger segment of a sequence annotated as B (neither of these two sequences was part of the reference alignment used to construct the reference tree). SHERPAS and jpHMM (both run with default parameters) return the partitions represented by the other two bars. Black segments represent unassigned regions.
Accuracies observed on the HIV-pol dataset
| Method | thr | w | Site-wise N/A | Site-wise sens | Site-wise prec | Mosaic m | Mosaic sup | Mosaic sub | Mosaic mm |
|---|---|---|---|---|---|---|---|---|---|
| SCUEAL | — | — | 0.0 | 98.5 | 98.5 | 93.2 | 3.0 | 1.9 | 1.9 |
| jpHMM | — | — | 0.0 | 97.4 | 97.4 | 90.0 | 0.0 | 7.0 | 2.9 |
| jpHMM-Qb | — | — | 0.0 | 97.4 | 97.5 | 90.2 | 0 | 7.0 | 2.8 |
| SHERPAS R | 0.9 | 500 | 7.5 | 89.8 | 97.1 | 83.6 | 8.5 | 6.3 | 1.6 |
| SHERPAS R | 0.9 | 300 | 8.8 | 89.4 | 98.0 | 81.9 | 12.6 | 4.0 | 1.5 |
| SHERPAS R | 0.99 | 500 | 17.0 | 81.2 | 97.9 | 82.6 | 3.0 | 13.1 | 1.3 |
| SHERPAS R | 0.99 | 300 | 21.0 | 78.2 | 98.8 | 82.3 | 3.6 | 12.9 | 1.2 |
| SHERPAS F | 1 | 500 | 4.4 | 93.5 | 97.8 | 81.9 | 12.0 | 5.3 | 0.8 |
| SHERPAS F | 1 | 300 | 3.0 | 95.3 | 98.2 | 78.2 | 19.0 | 2.2 | 0.7 |
| SHERPAS F | 100 | 500 | 7.0 | 91.6 | 98.6 | 89.0 | 3.3 | 7.3 | 0.3 |
| SHERPAS F | 100 | 300 | 5.3 | 93.7 | 98.9 | 88.4 | 7.4 | 3.7 | 0.5 |
Note: jpHMM-Qb stands for jpHMM with the -Q blat (fast) option. ‘R’ and ‘F’ distinguish between SHERPAS-reduced and SHERPAS-full, respectively. Columns ‘thr’ and ‘w’ report the threshold and window size used by SHERPAS. Column ‘N/A’ reports the percentage of sites that are not assigned to any strain. Columns ‘sens’ and ‘prec’ report site-wise sensitivity and precision (in percentage), respectively. Columns ‘m’, ‘sup’, ‘sub’ and ‘mm’ report the percentages of mosaic matches, supersets, subsets and mismatches, respectively (see Section 3.2 for definitions).
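The column definitions in the note above can be illustrated with a short sketch. The exact definitions are given in Section 3.2 of the paper, which is not reproduced here, so the code below rests on assumptions: sensitivity is taken as correctly assigned sites over all sites, precision as correctly assigned sites over assigned sites, and the mosaic categories as a comparison between the set of true strains and the set of predicted strains. Under these assumptions, precision is approximately sens / (100 − N/A) × 100, which agrees with the reported rows (for example 89.8 / 92.5 ≈ 97.1 for SHERPAS R with thr 0.9 and w 500). The function names `site_wise_metrics` and `mosaic_category` are hypothetical and are not part of SHERPAS.

```python
from typing import Optional, Sequence


def site_wise_metrics(truth: Sequence[str],
                      pred: Sequence[Optional[str]]) -> dict:
    """Return N/A, sensitivity and precision (in percent) for one query.

    `truth` holds the true strain of each site; `pred` holds the predicted
    strain, with None marking a site left unassigned. These definitions are
    assumptions, not taken from the paper.
    """
    if len(truth) != len(pred) or not truth:
        raise ValueError("truth and pred must be non-empty and of equal length")
    total = len(truth)
    assigned = sum(p is not None for p in pred)
    correct = sum(p == t for t, p in zip(truth, pred))
    return {
        "N/A": 100.0 * (total - assigned) / total,
        "sens": 100.0 * correct / total,
        "prec": 100.0 * correct / assigned if assigned else 0.0,
    }


def mosaic_category(truth: Sequence[str],
                    pred: Sequence[Optional[str]]) -> str:
    """Classify a query as 'm', 'sup', 'sub' or 'mm' (assumed set semantics)."""
    true_set = set(truth)
    pred_set = {p for p in pred if p is not None}
    if pred_set == true_set:
        return "m"    # mosaic match: same set of strains
    if pred_set > true_set:
        return "sup"  # superset: all true strains plus extra ones
    if pred_set < true_set:
        return "sub"  # subset: some true strains are missing
    return "mm"       # mismatch: neither set contains the other


# Toy query: 3 sites from A1 followed by 7 sites from B.
truth = ["A1"] * 3 + ["B"] * 7
pred = ["A1", "A1", None] + ["B"] * 6 + ["C"]
print(site_wise_metrics(truth, pred))
print(mosaic_category(truth, pred))
```

The table reports these quantities aggregated over all queries; how the paper aggregates them (per site or per query) is not stated in this excerpt.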
Running times of jpHMM and SHERPAS on the four datasets
| Dataset | Mbp | #br | jpHMM (default) | jpHMM (-Q blat) | SHERPAS (full) | SHERPAS (reduced) |
|---|---|---|---|---|---|---|
| HIV-pol | 16.2 | 332 (23) | 12 964 min 46 s | 1533 min 22 s | 2 min 40 s | 32 s |
| HBV-genome | 9.6 | 676 (8) | — | 673 min 24 s | 2 min 35 s | 11 s |
| HIV-genome | 26.7 | 1760 (20) | 4997 min 48 s | 2367 min 36 s | 20 min 44 s | 51 s |
| HIV-LR | 17.7 | 1760 (20) | 7414 min 17 s | — | 12 min 29 s | 33 s |
Note: Column ‘Mbp’ reports the total size of the query dataset in Mbp. Column ‘#br’ reports the number of branches for which the full pkDB (reduced pkDB) stores information. ‘HIV-LR’ refers to the dataset of simulated long reads. All times are measured in minutes (min) and seconds (s).
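To compare the running times above, it helps to convert them to seconds. The following is a minimal sketch of that arithmetic only; the helper name `to_seconds` is hypothetical, and the parsing assumes every time is written as an optional "N min" part followed by an optional "N s" part, with spaces allowed inside large numbers (as in "12 964 min 46 s").

```python
import re


def to_seconds(text: str) -> int:
    """Convert a time such as '1533 min 22 s' or '32 s' to whole seconds."""
    # Drop all whitespace; this also joins digit groups like '12 964'.
    compact = re.sub(r"\s+", "", text)
    match = re.fullmatch(r"(?:(\d+)min)?(?:(\d+)s)?", compact)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised time: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return 60 * minutes + seconds


# HIV-pol row: jpHMM with -Q blat versus the two SHERPAS columns.
jphmm_fast = to_seconds("1533 min 22 s")
for sherpas_time in ("2 min 40 s", "32 s"):
    ratio = jphmm_fast / to_seconds(sherpas_time)
    print(f"{sherpas_time}: about {ratio:.0f} times faster than jpHMM -Q blat")
```

Entries marked as missing in the table are not valid inputs here; the function raises `ValueError` for them rather than guessing a value.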
Accuracies observed on the HBV-genome dataset
| Method | thr | w | Site-wise N/A | Site-wise sens | Site-wise prec | Mosaic m | Mosaic sup | Mosaic sub | Mosaic mm |
|---|---|---|---|---|---|---|---|---|---|
| jpHMM | — | — | 0.0 | 98.5 | 98.5 | 91.4 | 0.4 | 6.8 | 1.4 |
| SHERPAS R | 0.9 | 500 | 1.6 | 93.7 | 95.3 | 80.2 | 5.1 | 14.0 | 0.7 |
| SHERPAS R | 0.9 | 300 | 2.5 | 94.6 | 97.0 | 81.4 | 11 | 6.6 | 1.0 |
| SHERPAS R | 0.99 | 500 | 3.5 | 92.6 | 96.0 | 81.2 | 2.2 | 16.5 | 0.1 |
| SHERPAS R | 0.99 | 300 | 5.0 | 92.9 | 97.8 | 86.6 | 3.7 | 9.4 | 0.3 |
| SHERPAS F | 1 | 500 | 2.1 | 93.5 | 95.5 | 76.0 | 8.7 | 14.4 | 0.8 |
| SHERPAS F | 1 | 300 | 1.5 | 95.3 | 96.8 | 74.7 | 19.3 | 5.0 | 1.0 |
| SHERPAS F | 100 | 500 | 4.8 | 92.0 | 96.6 | 80.2 | 1.3 | 18.3 | 0.2 |
| SHERPAS F | 100 | 300 | 3.8 | 94.1 | 97.8 | 84.4 | 7.5 | 7.4 | 0.8 |
Note: jpHMM stands for jpHMM launched with the -C option for circular queries. Note that this option automatically activates the -Q blat (fast) option. All other abbreviations are as in Table 2.
Accuracies observed on the HIV-genome dataset
| Method | thr | w | Site-wise N/A | Site-wise sens | Site-wise prec | Mosaic m | Mosaic sup | Mosaic sub | Mosaic mm |
|---|---|---|---|---|---|---|---|---|---|
| jpHMM | — | — | 3.6 | 95.6 | 99.2 | 77.8 | 4.2 | 16.6 | 1.4 |
| jpHMM-Qb | — | — | 4.0 | 95.4 | 99.3 | 78.2 | 4.0 | 17.0 | 0.8 |
| SHERPAS R | 0.9 | 500 | 3.4 | 94.5 | 97.9 | 48.2 | 37.3 | 6.1 | 8.4 |
| SHERPAS R | 0.9 | 300 | 5.5 | 92.5 | 97.9 | 27.4 | 63.0 | 1.9 | 7.7 |
| SHERPAS R | 0.99 | 500 | 6.0 | 92.7 | 98.6 | 65.8 | 15.3 | 13.4 | 5.5 |
| SHERPAS R | 0.99 | 300 | 8.6 | 90.5 | 99.0 | 56.7 | 27.0 | 9.8 | 6.5 |
| SHERPAS F | 1 | 500 | 2.1 | 96.1 | 98.2 | 46.3 | 43.6 | 4.3 | 5.8 |
| SHERPAS F | 1 | 300 | 1.6 | 96.6 | 98.2 | 24.2 | 71.5 | 0.7 | 3.6 |
| SHERPAS F | 100 | 500 | 3.5 | 95.3 | 98.8 | 67.8 | 19.2 | 9.5 | 3.5 |
| SHERPAS F | 100 | 300 | 2.7 | 96.4 | 99.1 | 54.3 | 39.6 | 2.9 | 3.2 |
Note: All abbreviations are as in Table 2.
Accuracies observed on the dataset of simulated Nanopore HIV reads
| Method | thr | w | Site-wise N/A | Site-wise sens | Site-wise prec | Mosaic m | Mosaic sup | Mosaic sub | Mosaic mm |
|---|---|---|---|---|---|---|---|---|---|
| jpHMM | — | — | 15.0 | 38.7 | 45.5 | 0.7 | 49.8 | 0.4 | 49.1 |
| SHERPAS R | 0.9 | 500 | 16.7 | 65.2 | 78.4 | 2.7 | 71.2 | 0.6 | 25.5 |
| SHERPAS R | 0.9 | 300 | 25.2 | 56.3 | 75.2 | 1.1 | 75.1 | 0.2 | 23.6 |
| SHERPAS R | 0.99 | 500 | 28.7 | 59.8 | 83.9 | 9.4 | 55.0 | 2.7 | 32.9 |
| SHERPAS R | 0.99 | 300 | 38.3 | 49.3 | 79.9 | 5.0 | 56.0 | 1.6 | 37.4 |
| SHERPAS F | 1 | 500 | 21.8 | 71.8 | 91.9 | 12.3 | 58.4 | 3.9 | 25.4 |
| SHERPAS F | 1 | 300 | 21.8 | 69.3 | 88.7 | 3.2 | 75.9 | 0.4 | 20.5 |
| SHERPAS F | 100 | 500 | 25.7 | 70.3 | 94.6 | 34.1 | 30.2 | 16.8 | 18.9 |
| SHERPAS F | 100 | 300 | 21.8 | 73.8 | 94.4 | 22.3 | 51.3 | 6.3 | 20.0 |
Note: For jpHMM, only the results of launching it with its default options for HIV are reported, as the use of the -Q blat (fast) option resulted in the program failing to execute. All abbreviations are as in Table 2.
Fig. 2. Trade-off between recall and specificity for the binary classification of HIV-pol queries. Recall and specificity are plotted for SCUEAL (circle), jpHMM (diamond) and SHERPAS (colored lines). The four colored lines correspond to the different combinations of a pkDB version (full/reduced) and window size (500, 300) for SHERPAS. Each point in a colored line corresponds to a different value of the threshold, with the lowest values of the threshold (1 for SHERPAS-full and 0 for SHERPAS-reduced) resulting in the leftmost points. See the Supplementary Section S5.6 for full details. Note that all rates fall in the interval , which is why the curves are not depicted in the full [0,1] range.