Inyoung Sung1, Sangseon Lee2, Minwoo Pak3, Yunyol Shin3, Sun Kim4,5,6,7.
Abstract
BACKGROUND: The widely spreading coronavirus disease (COVID-19) has three major spreading properties: pathogenic mutations, spatial propagation patterns, and temporal propagation patterns. We know how the virus has spread geographically and temporally in terms of statistics, i.e., the number of patients. However, we have yet to understand the spread at the level of individual patients. As of March 2021, COVID-19 is widespread all over the world, with new genetic variants. One important question is how to track the early spreading patterns of COVID-19 up to the point at which the virus had spread all over the world.
Keywords: COVID-19; Deep learning; Early spreading pattern; SARS-CoV-2; Sequence embedding
Year: 2022 PMID: 35468739 PMCID: PMC9036508 DOI: 10.1186/s12859-022-04679-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 The overall framework of the AutoCoV model. (a) Preprocessing of SARS-CoV-2 sequences: We transform the virus sequences into a k-mer vector. After frequency normalization and information-theory-based k-mer filtering, we obtain an informative k-mer frequency matrix as input for AutoCoV. (b) The architecture of AutoCoV: It consists of three modules for learning the spatial and temporal patterns of SARS-CoV-2. The Auto-Encoder Network generates latent representations that reconstruct the input matrix. The Classifier Network guides the latent representations to identify the spatial and temporal patterns. The Center Loss module complements the Classifier Network to create a denser and better-separated embedding space. (c) The output of AutoCoV: The embedding space generated by AutoCoV aims to capture the spatial and temporal patterns of SARS-CoV-2
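The preprocessing step in panel (a) can be illustrated with a minimal Python sketch. It only covers the k-mer counting and frequency normalization; the information-theory-based filtering is not reproduced here because its exact criterion is not given in this record. The function name `kmer_frequencies`, the choice of `k`, and normalizing by the number of valid windows are assumptions made for illustration, not details confirmed by the source.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return a normalized k-mer frequency vector for one sequence.

    Hypothetical helper: the value of k and the normalization
    (division by the number of valid windows) are assumptions.
    """
    # Fixed, lexicographically ordered k-mer vocabulary (4**k entries for DNA).
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(vocab, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Windows containing ambiguous bases (e.g. N) are skipped.
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


# Toy sequence, not real SARS-CoV-2 data.
vec = kmer_frequencies("ACGTACGTNACG", k=2)
```

Stacking one such vector per sequence gives a sequences-by-k-mers frequency matrix of the kind the caption describes as the model input.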
The summary of datasets (as of 2020-07-17)
| | NCBI (5,210) | GISAID w/o NCBI (61,210) |
|---|---|---|
| Subclass | S (1,246), L (266), V (130), G (463), GR (418), GH (2,687) | S (4,328), L (3,856), V (4,418), G (14,982), GR (19,316), GH (14,310) |
| Spatial | Asia (454), Oceania (403), Europe (280), North America (4,073) | Asia (3,805), Oceania (2,151), Europe (41,365), North America (13,889) |
| Temporal | Early (178), Middle (2,632), Late (2,400) | Early (1,175), Middle (25,058), Late (34,977) |
Each dataset has three categories of SARS-CoV-2 characteristics: Pathogenic mutations (Subclass), Spatial, and Temporal. The value in parentheses denotes the number of sequences. Detailed information about the Subclass labels is given in Additional file 1: Table S1
Fig. 2 The structure of AutoCoV. It consists of three modules: auto-encoder, classifier, and center loss. For each layer, the number of neurons is shown beside the corresponding layer
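To make the role of the center loss module more concrete, the following is a minimal sketch of a center loss term computed on toy 2D embeddings. It assumes the commonly used form (half the squared distance between each embedding and the center of its class, averaged over the batch) and uses class means as centers; how AutoCoV actually updates its centers and weights this term against the reconstruction and classification losses is not stated in this record. The names `class_centers` and `center_loss` are hypothetical.

```python
def class_centers(embeddings, labels):
    """Compute one center per class as the mean embedding (an assumption;
    in training, centers are typically learned or updated per batch)."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def center_loss(embeddings, labels, centers):
    """Mean over samples of 0.5 * squared distance to the class center."""
    total = 0.0
    for vec, lab in zip(embeddings, labels):
        total += 0.5 * sum((v - c) ** 2 for v, c in zip(vec, centers[lab]))
    return total / len(embeddings)


# Toy embeddings with made-up region labels.
embeddings = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["Asia", "Asia", "Europe", "Europe"]
centers = class_centers(embeddings, labels)
loss = center_loss(embeddings, labels, centers)
```

Minimizing a term of this kind pulls samples of the same class toward a shared center, which is consistent with the stated goal of a denser, better-separated embedding space.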
Fig. 3 Comparison of spreading pattern visualizations. 2D embedding spaces of baselines and AutoCoV are illustrated for (a) the spatial patterns and (b) the temporal patterns. The data fold with the median performance out of 10 folds was used as input to AutoCoV for the figures. The axes of each figure are set by the axes of the training data
Performance comparison results (mean ± std)
| | Method | LHS | MI | F1 |
|---|---|---|---|---|
| Spatial | | | | |
| Dimension Reduction | PCA | 0.280 ± 0.040 | 0.357 ± 0.050 | 0.739 ± 0.012 |
| | t-SNE | 0.215 ± 0.043 | 0.226 ± 0.034 | 0.690 ± 0.017 |
| | UMAP | 0.237 ± 0.030 | 0.448 ± 0.031 | 0.837 ± 0.011 |
| Unsupervised | dna2vec | 0.204 ± 0.035 | 0.108 ± 0.046 | 0.736 ± 0.019 |
| | seq2vec | 0.149 ± 0.041 | 0.238 ± 0.030 | 0.689 ± 0.018 |
| Supervised | Seq2Seq+CF | 0.131 ± 0.030 | 0.099 ± 0.013 | 0.689 ± 0.017 |
| | BERT+CF | 0.155 ± 0.054 | 0.327 ± 0.034 | 0.734 ± 0.013 |
Three dimension reduction methods (PCA, t-SNE, UMAP), two unsupervised methods (dna2vec, seq2vec), and three supervised methods (Seq2Seq+CF, BERT+CF, AutoCoV) are compared, and the bold values represent the best performance among them. For both patterns, AutoCoV outperforms the baselines in all metrics
Fig. 4 Spreading patterns on each embedding space by AutoCoV: (a) spatial and (b) temporal. Solid dots represent training data and black-edged dots represent test data
Fig. 5 Spreading patterns on each embedding space by AutoCoV in the GISAID dataset: (a) Spatial () and (b) Temporal (). The F1 score of KNN is given in parentheses (mean ± std)
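As an illustration of how a KNN-based F1 score on an embedding space could be computed, the sketch below classifies toy 2D test points by majority vote among their nearest training points and then scores the predictions. The value of `k`, the Euclidean distance, and macro averaging of the per-class F1 are all assumptions; the record does not state which settings were used. The function names are hypothetical and the data are made up.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (averaging scheme assumed)."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 2D embeddings with made-up temporal labels.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["Early", "Early", "Early", "Late", "Late", "Late"]
test_x = [[0.5, 0.5], [5.5, 5.5], [1, 1]]
test_y = ["Early", "Late", "Late"]

y_pred = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
score = macro_f1(test_y, y_pred)
```

A high score under this kind of evaluation indicates that points with the same label sit close together in the embedding space, which is the property the visualizations are meant to show.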