Francisco M Ortuño1,2, Carlos Loucera1,2, Carlos S Casimiro-Soriguer1,2, Jose A Lepe3, Pedro Camacho Martinez3, Laura Merino Diaz3, Adolfo de Salazar4, Natalia Chueca4, Federico García4, Javier Perez-Florido1,2, Joaquin Dopazo1,2,5,6.
Abstract
BACKGROUND: The current SARS-CoV-2 pandemic has emphasized the utility of viral whole-genome sequencing in the surveillance and control of the pathogen. An unprecedented ongoing global initiative is producing hundreds of thousands of sequences worldwide. However, the complex circumstances in which viruses are sequenced, along with the demand for urgent results, cause a high rate of incomplete and, therefore, useless sequences. Viral sequences evolve in the context of a complex phylogeny, and different positions along the genome are in linkage disequilibrium. Therefore, an imputation method would be able to predict missing positions from the available sequencing data.
Year: 2021 PMID: 34865008 PMCID: PMC8643610 DOI: 10.1093/gigascience/giab078
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Imputation performance metrics (precision, recall, F1-score, MCC, and BACC) depending on missing genome percentage. (A) One random continuous block of the genome; (B) random selection of missing variants; (C) random selection of missing amplicons. In each boxplot, the box spans the two quartiles around the median (the horizontal line inside the box), and the whiskers represent the maximum and minimum values. Dots outside these limits are outliers.
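The metrics named in the caption can all be derived from a binary confusion matrix over variant calls (imputed vs. real). The following is a minimal sketch, not the authors' evaluation code; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, MCC and balanced accuracy (BACC)
    from confusion-matrix counts. Undefined ratios fall back to 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Balanced accuracy is the mean of sensitivity (recall) and specificity
    bacc = (recall + specificity) / 2
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "bacc": bacc}


# Hypothetical counts: 18 variants imputed correctly, 2 missed,
# no false calls, 980 positions correctly left as reference.
metrics = classification_metrics(tp=18, fp=0, fn=2, tn=980)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because true negatives (non-variant positions) vastly outnumber variants in a viral genome, MCC and BACC are more informative here than plain accuracy, which would be close to 1 regardless of imputation quality.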
Figure 2: Imputation performance metrics (precision, recall, and MCC) based on the position of a missing 3-kb window along the SARS-CoV-2 genome. Left y-axis values represent variant frequencies (dashed green line). SARS-CoV-2 protein regions are represented by colored background and names specified at the top.
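The sliding-window experiment described in this caption can be pictured as hiding all variants inside one 3-kb window at a time and then imputing them from the rest. The sketch below only generates the windows and splits the variants; the step size is not stated in the source, so `STEP` is an assumption, and the helper names and example positions are illustrative.

```python
GENOME_LENGTH = 29_903  # length of the SARS-CoV-2 reference genome (NC_045512.2)
WINDOW = 3_000          # 3-kb window, as in the caption
STEP = 1_000            # assumption: the step size is not given in the source


def sliding_windows(length, window, step):
    """Yield 1-based inclusive (start, end) windows that fit entirely
    within a genome of the given length."""
    for start in range(1, length - window + 2, step):
        yield start, start + window - 1


def mask_variants(variant_positions, start, end):
    """Split variant positions into those kept as input and those hidden
    by the window (the hidden ones are what an imputer would predict)."""
    kept = [p for p in variant_positions if not start <= p <= end]
    hidden = [p for p in variant_positions if start <= p <= end]
    return kept, hidden


windows = list(sliding_windows(GENOME_LENGTH, WINDOW, STEP))

# Illustrative variant positions for one sequence
sample_variants = [241, 3037, 14408, 23403]
kept, hidden = mask_variants(sample_variants, 3001, 6000)
print(len(windows), windows[0], windows[-1])
print(kept, hidden)
```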
Performance metrics (recall, precision, and MCC)
| Subset | Recall (genotyping assay kit) | Precision (genotyping assay kit) | MCC (genotyping assay kit) | Recall (Spike region) | Precision (Spike region) | MCC (Spike region) |
|---|---|---|---|---|---|---|
| 1 | 0.8595 | 0.9612 | 0.9088 | 0.8129 | 0.9618 | 0.8841 |
| 2 | 0.8578 | 0.9597 | 0.9072 | 0.8121 | 0.9620 | 0.8838 |
| 3 | 0.8562 | 0.9614 | 0.9072 | 0.8100 | 0.9625 | 0.8829 |
| 4 | 0.8609 | 0.9622 | 0.9101 | 0.8106 | 0.9616 | 0.8828 |
| 5 | 0.8589 | 0.9603 | 0.9081 | 0.8109 | 0.9619 | 0.8831 |
| 6 | 0.8593 | 0.9602 | 0.9083 | 0.8106 | 0.9608 | 0.8824 |
| 7 | 0.8586 | 0.9600 | 0.9078 | 0.8126 | 0.9613 | 0.8837 |
| 8 | 0.8597 | 0.9614 | 0.9091 | 0.8106 | 0.9624 | 0.8831 |
| 9 | 0.8579 | 0.9605 | 0.9077 | 0.8115 | 0.9622 | 0.8835 |
| 10 | 0.8574 | 0.9609 | 0.9076 | 0.8121 | 0.9629 | 0.8842 |
| Mean ± SD | 0.8586 ± 0.0013 | 0.9608 ± 0.0008 | 0.9082 ± 0.0009 | 0.8114 ± 0.0010 | 0.9619 ± 0.0006 | 0.8834 ± 0.0006 |
Metrics obtained for 10-fold cross-validation subsets imputing from the genotyping assay and Spike protein regions. Values are calculated for the entire test subset imputation.
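The Mean ± SD row can be reproduced from the per-subset values. The short check below uses the recall columns from the table and assumes the sample standard deviation (n − 1 denominator); the source does not say which estimator was used, but the sample SD matches the reported figures after rounding. The helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# Recall per cross-validation subset, copied from the table above
recall_genotyping = [0.8595, 0.8578, 0.8562, 0.8609, 0.8589,
                     0.8593, 0.8586, 0.8597, 0.8579, 0.8574]
recall_spike = [0.8129, 0.8121, 0.8100, 0.8106, 0.8109,
                0.8106, 0.8126, 0.8106, 0.8115, 0.8121]


def summarize(values):
    """Format values as 'mean ± SD' using the sample standard deviation."""
    return f"{mean(values):.4f} ± {stdev(values):.4f}"


summary_genotyping = summarize(recall_genotyping)
summary_spike = summarize(recall_spike)
print(summary_genotyping)
print(summary_spike)
```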
Figure 3: Principal imputation performance metrics (precision, recall, and MCC) calculated depending on imputed variant frequencies. (A) Imputation quality when imputing from the genotyping array positions; (B) imputation quality when imputing from Spike protein positions. Left y-axis (green) represents the number of variants for those frequency thresholds (log scale).
Figure 4: Lineage classification accuracy compared against 2 baseline models. (A) Lineage accuracy when imputing from the genotyping array positions; (B) lineage accuracy when imputing from the Spike protein region. Levels represent lineage specification.
Figure 5: Accuracy obtained for each pair of lineages (real vs imputed) for the most frequent lineages (>500 sequences). The left heat map represents the values obtained for genotyping array imputation, whereas the right heat map represents accuracies for imputation from the Spike protein region. Color represents the percentage of sequences in each real lineage classified by each imputed lineage (the darker, the higher).
Figure 6: Lineage classification accuracy. Accuracy is estimated for a missing region in sliding windows of 3 kb for the recent α- and β-lineages (B.1.1.7 and B.1.351, respectively).
Variant imputation metrics (precision, recall, and MCC) and lineage classification
| Sample | Recall | Precision | MCC | Real lineage | Imputed lineage |
|---|---|---|---|---|---|
| AND00023 | 0.9000 | 1 | 0.9486 | B.1.1.7 | B.1.1.7 |
| AND00040 | 0.8571 | 1 | 0.9258 | B.1.1.7 | B.1.1.7 |
| AND00065 | 0.8636 | 1 | 0.9293 | B.1.1.7 | B.1.1.7 |
| AND00073 | 0.8571 | 1 | 0.9258 | B.1.1.7 | B.1.1.7 |
| AND00123 | 0.9231 | 1 | 0.9607 | B.1.1.7 | B.1.1.7 |
| AND00128 | 0.6000 | 1 | 0.7745 | B.1.1.7 | B.1.1.7 |
| AND00132 | 0.8696 | 1 | 0.9324 | B.1.1.7 | B.1.1.7 |
| AND00139 | 0.9091 | 1 | 0.9534 | B.1.1.7 | B.1.1.7 |
| Mean ± SD | 0.8475 ± 0.103 | 1.0000 ± 0 | 0.9188 ± 0.06 | | 100% |
Values for 8 independent samples internally sequenced with both the genotyping array and whole-genome sequencing.
Study of AND00344 variants
| Mutation | Found in variant | Present | Coverage | α | β | γ |
|---|---|---|---|---|---|---|
| L18F | β/γ | No | None | | ? | ? |
| T20N | γ | No | None | | | ? |
| P26S | γ | No | None | | | ? |
| del_21765 | α | No | None | ? | | |
| D80A | β | No | None | | ? | |
| D138Y | γ | No | None | | | ? |
| del_21991 | α | No | None | ? | | |
| R190S | γ | No | None | | | ? |
| D215G | β | No | None | | ? | |
| del_22281 | β | No | Covered | | No | |
| R246I | β | No | Covered | | No | |
| K417N | β/γ | No | None | | ? | ? |
| E484K | β/γ | No | Low | | No? | No? |
| N501Y | α/β/γ | Yes | Covered | Yes? | Yes? | Yes? |
| A570D | α | No | None | ? | | |
| D614G | α/β/γ | No | None | ? | ? | ? |
| H655Y | γ | No | Covered | | | No |
| P681H | α | Yes | Covered | Yes | No | No |
| A701V | β | No | Covered | | No | |
| T716I | α | Yes | Covered | Yes | No | No |
| S982A | α | No | None | ? | | |
| T1027I | γ | No | None | | | ? |
| D1118H | α | No | None | ? | | |
| Q57H | β | No | Covered | | No | |
| P71L | β | No | Covered | | No | |
| Q27stop | α | Yes | Covered | Yes | No | No |
| T205I | β | No | Low | | No? | |
| Total | | | | Yes (most likely) | No | No (most likely) |
Comparison of the available variation in the low-coverage sequence of viral sample AND00344 with respect to the α (B.1.1.7), β (B.1.351), and γ (P.1) VOCs.
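One way to read the table is to count, for each VOC, how many of its signature mutations are present or absent at positions with adequate coverage. The sketch below does only that simple tally; it is an illustrative reading of the table, not the authors' decision procedure, and the names `ROWS` and `tally_covered` are hypothetical. Positions with "None" or "Low" coverage are ignored as uninformative.

```python
from collections import Counter

# (mutation, VOCs carrying it, present in sample, coverage), from the table above
ROWS = [
    ("L18F", "β/γ", False, "None"), ("T20N", "γ", False, "None"),
    ("P26S", "γ", False, "None"), ("del_21765", "α", False, "None"),
    ("D80A", "β", False, "None"), ("D138Y", "γ", False, "None"),
    ("del_21991", "α", False, "None"), ("R190S", "γ", False, "None"),
    ("D215G", "β", False, "None"), ("del_22281", "β", False, "Covered"),
    ("R246I", "β", False, "Covered"), ("K417N", "β/γ", False, "None"),
    ("E484K", "β/γ", False, "Low"), ("N501Y", "α/β/γ", True, "Covered"),
    ("A570D", "α", False, "None"), ("D614G", "α/β/γ", False, "None"),
    ("H655Y", "γ", False, "Covered"), ("P681H", "α", True, "Covered"),
    ("A701V", "β", False, "Covered"), ("T716I", "α", True, "Covered"),
    ("S982A", "α", False, "None"), ("T1027I", "γ", False, "None"),
    ("D1118H", "α", False, "None"), ("Q57H", "β", False, "Covered"),
    ("P71L", "β", False, "Covered"), ("Q27stop", "α", True, "Covered"),
    ("T205I", "β", False, "Low"),
]


def tally_covered(rows):
    """Return {VOC: (present, absent)} counting only well-covered positions."""
    present, absent = Counter(), Counter()
    for _mutation, vocs, is_present, coverage in rows:
        if coverage != "Covered":
            continue  # no or low coverage: the position is uninformative
        for voc in vocs.split("/"):
            (present if is_present else absent)[voc] += 1
    return {voc: (present[voc], absent[voc]) for voc in ("α", "β", "γ")}


evidence = tally_covered(ROWS)
for voc, (n_present, n_absent) in evidence.items():
    print(f"{voc}: {n_present} present, {n_absent} absent")
```

Under this tally, all covered α signature mutations are present, most covered β mutations are absent, and γ has only two informative positions, which is consistent with the hedged wording in the Total row.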