M Saqib Nawaz1, Philippe Fournier-Viger1, Abbas Shojaee2, Hamido Fujita3.
Abstract
The genome of the novel coronavirus that causes COVID-19 was first sequenced in January 2020, approximately a month after the disease emerged in Wuhan, the capital of Hubei province, China. COVID-19 genome sequencing is critical for understanding the virus's behavior, its origin, and how fast it mutates, and for developing drugs, vaccines, and effective preventive strategies. This paper investigates the use of artificial intelligence techniques to learn interesting information from COVID-19 genome sequences. First, sequential pattern mining (SPM) is applied to a computer-understandable corpus of COVID-19 genome sequences to see whether interesting hidden patterns can be found that reveal frequent patterns of nucleotide bases and their relationships with each other. Second, sequence prediction models are applied to the corpus to evaluate whether nucleotide base(s) can be predicted from previous ones. Third, for mutation analysis in genome sequences, an algorithm is designed to find the locations in the genome sequences where the nucleotide bases have changed and to calculate the mutation rate. The results suggest that SPM and mutation analysis techniques can reveal interesting information and patterns in COVID-19 genome sequences, which can be used to examine the evolution of, and variations in, COVID-19 strains, respectively.
Keywords: COVID-19; Genome sequence; Mutation; Nucleotide bases; Sequential pattern mining
Year: 2021 PMID: 34764587 PMCID: PMC7888282 DOI: 10.1007/s10489-021-02193-w
Source DB: PubMed Journal: Appl Intell (Dordr) ISSN: 0924-669X Impact factor: 5.019
Fig. 1 SARS-CoV-2 Structure [21]
Fig. 2 Structure of the SARS-CoV-2 genome [24]
Fig. 3 Proposed SPM and sequence prediction approach for analyzing COVID-19 genome sequences
A sample of a CGSC
| ID | Sequence |
|---|---|
| 1 | 〈....AATAACTCTATTGCCATACCCACAAATT.....〉 |
| 2 | 〈....TGCAGCAATCTTTTGTTGCAATATGGC.....〉 |
| 3 | 〈....CAGGTGCTGCATTACAAATACCATTTG.....〉 |
| 4 | 〈....CCCTAATGTGTAAAATTAATTTTAGTA.....〉 |
Characteristics of COVID-19 genome taken from NCBI
| ID | Release Date | Length | Location | Collection Date |
|---|---|---|---|---|
| MT745584 | 2020-07-13 | 29860 | Bahrain | 2020-06-22 |
| MT750057 | 2020-07-13 | 29782 | USA: Illinois | 2020-06-17 |
| MT750058 | 2020-07-13 | 29782 | USA: Wisconsin | 2020-06-09 |
| MT291827 | 2020-04-06 | 29858 | China: Wuhan | 2019-12-30 |
| MT291828 | 2020-04-06 | 29858 | China: Wuhan | 2019-12-30 |
Frequent nucleotide base sets discovered by Apriori
| Pattern(s) | Support | Min. Support | Pattern(s) | Support | Min. Support |
|---|---|---|---|---|---|
| A | 8915 | 100% | AGT | 52 | 10% |
| C | 5487 | 100% | ACT | 48 | 5% |
| G | 5859 | 100% | CGT | 32 | 5% |
| T | 9599 | 100% | ACG | 12 | 1% |
Nucleotides percentage in COVID-19 genomes
| ID | A count (%) | C count (%) | G count (%) | T count (%) |
|---|---|---|---|---|
| MT750057 | 8891 (29.853) | 5470 (18.311) | 5849 (19.639) | 9572 (32.140) |
| MT750058 | 8891 (29.853) | 5470 (18.311) | 5849 (19.639) | 9572 (32.140) |
| MT291827 | 8932 (29.914) | 5482 (18.360) | 5859 (19.622) | 9585 (32.101) |
| MT291828 | 8931 (29.911) | 5482 (18.360) | 5860 (19.626) | 9585 (32.101) |
Frequent nucleotides extracted by CM-SPAM
| Pattern | Support | Min. Sup | Pattern | Support | Min. Sup |
|---|---|---|---|---|---|
| AATAAC | 511 | 33% | AATAAC | 511 | 25% |
| AAAAAG | 530 | 33% | ATTATCATA | 416 | 25% |
| ACTATG | 510 | 33% | AGTAGCTAC | 369 | 25% |
| CAAAAG | 510 | 33% | CAAAAG | 510 | 25% |
| CTTTGT | 523 | 33% | CAATGTCTA | 392 | 25% |
| GTATTA | 508 | 33% | GACTATGTT | 392 | 25% |
| GTATGA | 503 | 33% | GTTGTGGTAGTG | 227 | 15% |
| CAACAA | 499 | 33% | TACTAGAAT | 403 | 25% |
| TTAACG | 499 | 33% | ACCTTAAACTAA | 243 | 15% |
| TCAGTG | 502 | 33% | TCAGTG | 502 | 25% |
Performance of CM-SPAM with varying minsup
| Min. Sup. (%) | Time (s) | Patterns | Memory (MB) | Min. Sup. (count) |
|---|---|---|---|---|
| 33% | 2 | 3549 | 45.839 | 493 |
| 25% | 42 | 194361 | 49.256 | 374 |
| 20% | 231 | 1372868 | 48 | 299 |
Frequent nucleotides sequential patterns extracted by TKS
| Pattern | Length | Support | Pattern | Length | Support |
|---|---|---|---|---|---|
| ACG | 3 | 594 | GTATTA | 6 | 508 |
| AGT | 3 | 611 | AAATTT | 6 | 537 |
| CGT | 3 | 594 | ATTATCATA | 9 | 416 |
| CTA | 3 | 597 | CTAGGTAAA | 9 | 396 |
| TAC | 3 | 604 | GATAAAGCT | 9 | 396 |
| GTC | 3 | 589 | TACTAGAAT | 9 | 403 |
| CAAAAG | 6 | 510 | CAATGTACG | 9 | 372 |
| TCAGTG | 6 | 502 | AATAAC | 6 | 510 |
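The support values in the pattern tables count how many sequences of the corpus contain a given pattern. The sketch below illustrates that idea under the standard SPM definition, where a pattern occurs in a sequence if its bases appear in the same order, possibly with gaps. Whether the reported runs used additional constraints (for example, a maximum gap) is not stated here, so this is an assumption, and the helper names `contains_pattern` and `support` are hypothetical. The toy corpus is the four fragments from the sample corpus table.

```python
def contains_pattern(sequence: str, pattern: str) -> bool:
    """True if `pattern` is a subsequence of `sequence` (order kept, gaps allowed)."""
    remaining = iter(sequence)
    # `base in remaining` advances the iterator up to and including the first
    # match, so each base of the pattern must be found after the previous one.
    return all(base in remaining for base in pattern)


def support(corpus: list[str], pattern: str) -> int:
    """Number of sequences in the corpus that contain the pattern."""
    return sum(contains_pattern(seq, pattern) for seq in corpus)


# Toy corpus: the fragments listed in the sample corpus table.
corpus = [
    "AATAACTCTATTGCCATACCCACAAATT",
    "TGCAGCAATCTTTTGTTGCAATATGGC",
    "CAGGTGCTGCATTACAAATACCATTTG",
    "CCCTAATGTGTAAAATTAATTTTAGTA",
]
acg_support = support(corpus, "ACG")
print(acg_support, acg_support / len(corpus))
```

A miner such as CM-SPAM or TKS enumerates candidate patterns and prunes them efficiently; the sketch only shows how the support of a single, already chosen pattern is obtained.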
Fig. 4Sequential rules discovered in a genome sequence by ERMiner
Corpus statistics for sequence prediction
| Parameter | Value |
|---|---|
| Number of Sequences | 1492 |
| Number of distinct items | 5 |
| Itemsets item ID | 4 |
| Distinct item per sequence | 20.42 |
| Occurrence for each item | 4.66 |
| Corpus size in MB | 5968 |
Accuracy of prediction models
| Models | DG | TDAG | CPT+ | CPT | Mark1 | AKOM | LZ78 | Random |
|---|---|---|---|---|---|---|---|---|
| Success | 20.643 | 18.901 | 18.035 | 18.298 | 20.174 | 19.722 | 16.1 | |
| Failure | 79.357 | 81.099 | 81.965 | 81.702 | 79.826 | 79.29 | 80.228 | |
| No Match | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Train Time | 0.011 | 0.039 | 0.002 | 0.005 | 0.057 | 0.021 | – | |
| Test Time | 0.001 | 0.000 | 0.305 | 0.001 | 0.000 | 0.002 | – | |
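To make the Success, Failure, and No Match rows concrete, the sketch below trains a very small next-base predictor and scores it in the same three categories. It is an illustration only: it uses a first-order frequency model (predict the base that most often followed the previous base in training), which is an assumption and not the implementation of any model listed in the table. The function names `train`, `predict`, and `evaluate` are hypothetical.

```python
from collections import Counter, defaultdict


def train(sequences):
    """Count, for each base, how often each other base follows it."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions


def predict(transitions, prev):
    """Most frequent follower of `prev`, or None if `prev` was never seen."""
    followers = transitions.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def evaluate(transitions, sequences):
    """Percentage of success, failure, and no-match predictions."""
    outcome = Counter()
    for seq in sequences:
        for prev, actual in zip(seq, seq[1:]):
            guess = predict(transitions, prev)
            if guess is None:
                outcome["no_match"] += 1
            elif guess == actual:
                outcome["success"] += 1
            else:
                outcome["failure"] += 1
    total = sum(outcome.values())
    if total == 0:
        raise ValueError("no predictions could be evaluated")
    return {key: 100 * outcome[key] / total
            for key in ("success", "failure", "no_match")}


model = train(["ACACAC"])
scores = evaluate(model, ["ACAC"])
print(scores)
```

With this scoring, the three percentages for a model always sum to 100.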
Point mutation analysis results
| Line | Location | Position | Change | Line | Location | Position | Change |
|---|---|---|---|---|---|---|---|
| 3 | 29 | 149 | G → T | 130 | 42 | 7,782 | C → A |
| 70 | 31 | 4,171 | T → C | 139 | 55 | 8,335 | A → G |
| 94 | 37 | 5,617 | T → C | 328 | 2 | 19,662 | T → G |
| 115 | 11 | 6,851 | C → T | 415 | 31 | 24,871 | G → T |
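A point mutation scan of this kind can be sketched as a position-by-position comparison of two sequences. The code below is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes the two sequences are already aligned and of equal length (no insertions or deletions), it defines the mutation rate as the fraction of differing positions, and it assumes 60 bases per line, which is inferred from the table (most rows satisfy position = (line − 1) × 60 + location). The names `point_mutations` and `mutation_rate` are hypothetical.

```python
def point_mutations(reference: str, sample: str, line_width: int = 60):
    """List the substitutions between two aligned, equal-length sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    mutations = []
    for index, (ref_base, new_base) in enumerate(zip(reference, sample)):
        if ref_base != new_base:
            # Convert the 0-based index into 1-based line, location, position.
            line, location = divmod(index, line_width)
            mutations.append({
                "line": line + 1,
                "location": location + 1,
                "position": index + 1,
                "change": f"{ref_base} → {new_base}",
            })
    return mutations


def mutation_rate(reference: str, sample: str) -> float:
    """Fraction of positions at which the two sequences differ (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return len(point_mutations(reference, sample)) / len(reference)


# Toy data: a 200-base reference with one substitution at position 149.
reference = "G" * 200
sample = reference[:148] + "T" + reference[149:]
found = point_mutations(reference, sample)
rate = mutation_rate(reference, sample)
print(found, rate)
```

In the toy run, the single substitution is reported at line 3, location 29, position 149, mirroring the first row of the table.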
Fig. 5 COVID-19 genome mutation in whole sequences
Fig. 6 COVID-19 genome mutation