
Exploring the Lethality of Human-Adapted Coronavirus Through Alignment-Free Machine Learning Approaches Using Genomic Sequences.

Rui Yin^1,2, Zihan Luo^3, Chee Keong Kwoh^1.

Abstract

Background: A novel coronavirus emerged and spread rapidly worldwide, and the World Health Organization declared a pandemic on March 11, 2020. The roles and characteristics of coronaviruses have attracted much attention because they can cause a wide variety of infectious diseases in humans, from mild to severe. Detecting the lethality of human coronaviruses is key to estimating viral toxicity and informing treatment.
Methods: We developed an alignment-free framework that uses machine learning approaches for ultra-fast and highly accurate prediction of the lethality of human-adapted coronaviruses from genomic sequences. We performed extensive experiments with six different feature transformations and several machine learning algorithms, combined with digital signal processing, to identify the lethality of possible future novel coronaviruses using existing strains.
Results: Tested on SARS-CoV, MERS-CoV and SARS-CoV-2 datasets, the models achieve an average prediction accuracy of 96.7%. We also provide a preliminary analysis validating the effectiveness of our models on other human coronaviruses. Our framework achieves high prediction performance while being alignment-free and based on RNA sequences alone, without genome annotations or specialized biological knowledge.
Conclusion: The results demonstrate that, for any novel human coronavirus strain, this approach can offer a reliable real-time estimate of viral lethality.


Keywords: Coronavirus; SARS-CoV; alignment-free; genomic nucleotide; lethality inference; machine learning

Year: 2021 PMID: 35386190 PMCID: PMC8922323 DOI: 10.2174/1389202923666211221110857

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.689

INTRODUCTION

Coronaviruses are positive-sense, single-stranded RNA viruses that have been identified in humans and animals. They are categorized into four genera: α, β, γ and δ [1]. Previous phylogenetic analysis revealed a complex evolutionary history of coronaviruses, suggesting ancient origins and crossover events that can lead to cross-species infections [2, 3]. Bats and birds are natural reservoirs for the coronavirus gene pool [4]. Mutation and recombination play critical roles that may enable cross-species transmission into other mammals and humans [5]. Human coronavirus (HCoV) was first identified in the mid-1960s [6], and to date seven types of coronavirus are known to infect people. Four of them, i.e., HCoV-229E, HCoV-NL63, HCoV-OC43 and HCoV-HKU1, usually cause mild to moderate upper respiratory tract illnesses such as the common cold [7]. The other three are severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV) and, most recently, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). They all belong to the genus Betacoronavirus and have led to epidemics or pandemics [8-10]. Emerging in November 2002 in Guangdong province, China, SARS caused 8,096 human infections with 774 deaths by July 2003 [11]. MERS was first reported in Saudi Arabia in September 2012 and had resulted in 2,494 human infections by November 2019 [12]. Recently, a novel coronavirus named SARS-CoV-2 emerged and had spread to 215 countries or territories by June 12, 2020, leading to 7,390,702 confirmed cases with 417,731 deaths according to the World Health Organization [13]. Although precautions such as city lockdowns and social distancing have been taken to curb the transmission of COVID-19, it spreads far more quickly than the diseases caused by SARS-CoV and MERS-CoV [4, 14]. To make matters worse, the number of infected cases is still increasing rapidly and the global inflection point of COVID-19 is unknown. 
Like other RNA viruses, e.g., influenza virus [15], coronaviruses possess high mutation and gene recombination rates [16], which drive the constant evolution of these viruses and the emergence of new variants. From SARS in 2002 to COVID-19 in 2019, coronaviruses have caused high morbidity and mortality, and unfortunately, fast and untraceable virus mutations take the lives of people before the immune system can produce inhibitory antibodies [17]. Currently, no miracle drugs or vaccines are available to treat or prevent coronavirus infections in humans [18, 19]. Therefore, there is a desperate need for approaches that detect the lethality of coronaviruses, not only for SARS-CoV-2 but also for potential new variants and species. This would facilitate the diagnosis of coronavirus clinical severity and provide decision-making support. The detection of viral lethality has already been explored in influenza viruses [20, 21]. Through a meta-analysis predicting the virulence and antigenicity of influenza viruses, the lethality of the virus can be inferred in a timely manner to improve the current influenza surveillance system. Given the risk of novel emerging coronavirus strains, much attention has been devoted to investigating the lethality or clinical severity of newly emerging coronaviruses. Typically, epidemiological models are built to estimate the lethality and the extent of undetected infections associated with the new coronaviruses. Bastolla suggested an orthogonal approach based on a minimum number of parameters robustly fitted from the cumulative data easily accessible for all countries in the Johns Hopkins University database to extrapolate the death rate [22]. Bello-Chavolla et al. proposed a clinical score to evaluate the risk of complications and lethality attributable to COVID-19 with regard to the effect of obesity and diabetes in Mexico [23]. The results provided a tool for quick identification of susceptible patients in a first-contact scenario. 
Wang et al. leveraged patient data in real time and devised a patient information based algorithm to estimate and predict the death rate caused by COVID-19 for the near future [24]. Aiewsakun et al. performed a genome-wide association study on the genomes of SARS-CoV-2 to identify genetic variations that might be associated with COVID-19 severity [25]. Moreover, Jiang et al. established an artificial intelligence framework for data-driven prediction of coronavirus clinical severity [26]. Saha et al. proposed a deep learning framework to identify an unknown viral sequence using Long Short Term Memory (LSTM) [27]. Lopez-Rincon et al. designed a classification approach and specific primers for accurate detection of SARS-CoV-2 with convolutional neural networks [28]. The development of computational and physics-based approaches has reduced experimental labor by using epidemiological and biological data to construct models. However, direct evaluation of potential novel coronavirus strains for their lethality is crucial when clinicians are forced to make difficult decisions without specific past experience to guide clinical acumen. Inferring the lethality of a novel coronavirus is possible by identifying patterns in a large number of coronavirus sequences. In this paper, we propose an alignment-free framework that leverages machine learning approaches to infer the lethality of human-adapted coronaviruses. The main contribution is that we formulate the problem of estimating the lethality of human-adapted coronaviruses through machine learning approaches based on genomic nucleotide sequences, which could give biologists and virologists new insights for the investigation of coronaviruses. With an appropriate feature transformation, we encode genomic nucleotides into numbers, which allows us to cast lethality inference as a prediction task. The experimental results suggest our models deliver accurate prediction of lethality without prior biological knowledge. 
We also provide preliminary analysis validating the effectiveness of our models through other human coronaviruses.

METHODS

Problem Formulation

The pandemic of novel coronaviruses has caused thousands of fatalities, posing tremendous threats to public health worldwide. Society is deeply concerned about its spread and evolution, and about the emergence of any potential new variants that would increase the lethality. Typically, lethality refers to the capability of causing death. It is usually estimated as the cumulative number of deaths divided by the total number of confirmed cases. Among all the human-adapted coronaviruses, MERS-CoV caused the highest fatality rate of 34.5% [29], followed by SARS-CoV with a 9.2% fatality rate [30]. In comparison, COVID-19 shows a lower mortality rate of 5.5% [13]. The lethality rate of COVID-19 is likely to decrease with better treatment and precautions. In this paper, we mainly focus on these three types of human-adapted coronaviruses and define the degree of viral lethality in terms of historical fatality rates. As a result, MERS-CoV strains are labeled as highly lethal, while SARS-CoV and SARS-CoV-2 strains are labeled as having middle and low lethality, respectively.

Data Collection and Preprocessing

Genomic nucleotide sequences of three different coronaviruses with a human host were downloaded from the National Center for Biotechnology Information on April 30, 2020 [31]. Duplicate sequences and incomplete genomes shorter than 20,000 nucleotides are removed from the collection to address possible issues arising from sequence length bias. Some laboratory SARS-CoV strains cultivated in Vero cell cultures are included to enrich the training samples. We end up with 321, 351 and 1638 samples for MERS-CoV, SARS-CoV and SARS-CoV-2, respectively. In addition, we collect the genomic data of the other four human coronaviruses, with 27, 64, 32 and 142 distinct strains for HCoV-HKU1, HCoV-NL63, HCoV-229E and HCoV-OC43, respectively. Apart from the four symbolic bases (A, C, T, G), the sequences contain degenerate base symbols, an IUPAC representation [32] for a position on a genomic sequence that can denote multiple possible alternatives. These degenerate base symbols are W (A, T), S (C, G), M (A, C), K (G, T), R (A, G), Y (C, T), B (C, G, T), D (A, G, T), H (A, C, T), V (A, C, G) and N (A, C, T, G), where the letters in brackets are the alternative nucleotides that each degenerate base may represent. We randomly substitute each degenerate base with one of its alternatives so that genomic sequences can be mapped into a discrete numerical representation through feature transformation.
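
The random substitution of degenerate bases described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `resolve_degenerate_bases`, the uniform random choice among alternatives, and the conversion of U to T are assumptions made for the example.

```python
import random

# IUPAC degenerate symbols and the bases each one may stand for,
# as listed in the text.
IUPAC_ALTERNATIVES = {
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACTG",
}

def resolve_degenerate_bases(sequence, rng=None):
    """Replace every degenerate symbol with one randomly chosen alternative.

    Hypothetical helper: a uniform choice among the alternatives is assumed.
    """
    rng = rng or random.Random()
    resolved = []
    for base in sequence.upper():
        if base == "U":  # assumption: uracil is written as T, as in the text
            base = "T"
        choices = IUPAC_ALTERNATIVES.get(base)
        resolved.append(rng.choice(choices) if choices else base)
    return "".join(resolved)

# A seeded generator makes the substitution reproducible.
print(resolve_degenerate_bases("ACGTNNRYACGT", random.Random(0)))
```

Passing a seeded `random.Random` keeps the preprocessing repeatable, which matters when the same strain has to map to the same numerical signal across runs.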

Feature Transformation

Numerical representations have been successfully employed in the field of bioinformatics [33, 34], mapping biological sequences into a real-valued vector space in which the order of the sequence's information and patterns is preserved. This is important because existing machine learning approaches operate on vectors rather than on raw sequences. Several methods have been proposed to convert genomic sequences into numerical vectors, e.g., fixed mappings between nucleotides and real numbers without biological significance [35], mappings based on physicochemical properties [36], mappings deduced from doublets or codons [37], and chaos game representation [38]. To allow comprehensive analysis and comparison, we adopt different types of numerical representations for biological RNA sequences. Randhawa et al. [39] showed that the “Real”, “Just-A” and “Purine/Pyrimidine (PP)” numerical representations yield better performance than other methods for DNA classification, so they are included for analyzing our genomic data. The electron-ion interaction potential (EIIP) and the nearest-neighbor based doublet representation are also incorporated; they are based on physicochemical properties and nearest-neighbor values, respectively. Apart from these one-dimensional representations, we introduce the two-dimensional Chaos Game Representation (CGR) as a feature transformation of the original sequences. The real number representation is a fixed transformation in which the four bases take the values adenine (A) = -1.5, thymine (T) = 1.5, cytosine (C) = 0.5, and guanine (G) = -0.5 [40]. It makes finding the complementary strand of a DNA/RNA sequence efficient and preserves the complementary property. The “Just-A” method maps the four bases into a binary signal in which the presence of adenine is labeled 1, while the other bases are labeled 0 [41]. 
PP representation is a DNA-walk model of a nucleotide sequence in which a step is taken upward if the nucleotide is a pyrimidine, with T/C = 1, or downward if it is a purine, with A/G = -1 [42]. EIIP describes the distribution of the energy of free electrons along a nucleotide sequence; a single EIIP indicator sequence is formed by replacing each nucleotide with its EIIP value, where A = 0.1260, C = 0.1340, G = 0.0806, and T = 0.1335 [43]. The sequence-to-signal mapping for the nearest-neighbor based doublet representation is illustrated in another study [37], where the last position is followed by the first in the sequence. Lastly, CGR is a method proposed by Jeffrey [44] that has been successfully used for visual representation of genome sequence patterns and for taxonomic classification [45, 46]. The CGR images of RNA/DNA sequences are drawn in a unit square. The four vertices of the square are labeled by the four nucleotides. The first nucleotide of the sequence is plotted halfway between the center of the square and the vertex representing this nucleotide. Each subsequent base is plotted halfway between the previous point and the vertex corresponding to the current nucleotide. The coordinates of the successive points in the CGR of a sequence are calculated as $X_i = \frac{1}{2}(X_{i-1} + C_{ix})$ and $Y_i = \frac{1}{2}(Y_{i-1} + C_{iy})$, where $C_{ix}$ and $C_{iy}$ denote the X and Y coordinates of the vertex matching the nucleotide at position i of the sequence, respectively.
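
As a small illustration of the fixed mappings and the CGR construction described in this section, the following sketch encodes a sequence with the quoted values and computes its CGR points. The helper names `encode` and `cgr_points` and the assignment of nucleotides to the corners of the unit square are assumptions for the example; the text does not state which vertex carries which nucleotide.

```python
# Fixed nucleotide-to-number mappings with the values quoted in the text.
REAL = {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5}
JUST_A = {"A": 1, "C": 0, "G": 0, "T": 0}
PP = {"A": -1, "G": -1, "C": 1, "T": 1}
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# Assumed vertex labels for the unit square (not specified in the text).
CGR_VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0),
                "G": (1.0, 1.0), "T": (1.0, 0.0)}

def encode(sequence, mapping):
    """Map each nucleotide to its numerical value, preserving order."""
    return [mapping[base] for base in sequence]

def cgr_points(sequence, vertices=CGR_VERTICES):
    """Return the CGR coordinates of a sequence.

    Each point lies halfway between the previous point and the vertex
    of the current nucleotide, starting from the center of the square.
    """
    x, y = 0.5, 0.5
    points = []
    for base in sequence:
        vx, vy = vertices[base]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points

print(encode("ACGT", REAL))  # [-1.5, 0.5, -0.5, 1.5]
print(cgr_points("AG"))      # [(0.25, 0.25), (0.625, 0.625)]
```

The sequences passed in are expected to contain only A, C, G and T, i.e., after degenerate bases have been substituted.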

Model Construction

Machine learning has been utilized in many aspects of viral genomic analysis, e.g., antigenicity prediction of viruses [20], genome classification of novel pathogens [46], reassortment detection [47], receptor binding analysis [48] and vaccine recommendation [49]. With increasingly available genomic sequences, it will play a more critical role in helping biologists analyze large, complex biological data for prediction and discovery. In this work, we provide a comprehensive analysis of lethality prediction for potential new human-adapted coronaviruses via alignment-free machine learning approaches. We follow a protocol similar to those of studies [39] and [50], adapting it to our setting as follows. We train and test the models retrospectively, since the isolation time of the viral strains is available. For each type of coronavirus, the samples isolated at earlier times are used to train the model, while those isolated more recently are used for testing. The time threshold is chosen so that the training and testing sets are divided in a roughly 0.8:0.2 ratio. The retrospective test enables our models to infer the lethality of new coronavirus strains that could emerge in the near future. Six different types of numerical representations are implemented and compared in terms of the predictive performance of the machine learning models. The proposed methods include not only traditional machine learning models but also deep learning techniques, in combination with the Discrete Fourier Transform for genome analysis. The traditional machine learning models are logistic regression (LR), random forest (RF), K-nearest neighbor (KNN) and neural network (NN) [51], while three variants of convolutional neural networks (CNN) are leveraged: AlexNet [52], VGG [53] and ResNet [54]. Logistic regression is a supervised machine learning algorithm that adapts linear regression to classification problems. 
Random forest is an ensemble learning method that operates by constructing a multitude of decision trees for classification and regression tasks. The k-nearest neighbors algorithm is a non-parametric classification method in which the function is only approximated locally and an object is assigned to the class most common among its k nearest neighbors. Neural networks are computing systems with interconnected nodes that work much like neurons in the human brain; they can recognize hidden patterns and correlations in raw data, cluster and classify it, and continuously learn and improve over time. Convolutional neural networks are very similar to ordinary neural networks in that they are made up of neurons with learnable weights and biases. The major difference is that CNN architectures allow us to encode certain properties into the architecture, which makes the forward function more efficient to implement and significantly reduces the number of parameters in the network. For the five one-dimensional numerical representations of viral sequences, digital signal processing is introduced through the Discrete Fourier Transform (DFT). We assume that the number of input sequences is $n$ and that all sequences have the same length $l$. For a sequence $S_i = (S_i(0), S_i(1), \ldots, S_i(l-1))$, where $1 \le i \le n$, $S_i(k) \in \{A, C, T, G\}$ and $0 \le k \le l-1$, the corresponding discrete numerical representation is formulated as $N_i = (f(S_i(0)), f(S_i(1)), \ldots, f(S_i(l-1)))$, where $f(S_i(k))$ denotes the numerical value obtained by applying the mapping function $f(\cdot)$ to position $k$ of the nucleotide sequence $S_i$. The DFT of the signal $N_i$ is represented as the vector $F_i$, formulated as $F_i(k) = \sum_{j=0}^{l-1} f(S_i(j))\, e^{-2\pi\sqrt{-1}\, jk/l}$ for $0 \le k \le l-1$. We define the magnitude vector corresponding to the signal $N_i$ as $M_i$, where $M_i(k) = |F_i(k)|$ is the absolute value of $F_i(k)$. The length of the numerical digital signal $N_i$ is therefore equal to the length of the magnitude spectrum $M_i$, which is determined by the length of the genomic sequence. 
However, the input genome sequences are in different lengths; thus they need to be length-normalized after DFT. Median length normalization with zero-padding is applied to the input digital signals. Signals shorter than the median length are extended to the median length, with anti-symmetric padding beginning from the last position, while longer signals are truncated after the median length. As for the two-dimensional numerical representation, i.e., CGR, a point that corresponds to a sequence of length $l$ will be contained within a square with a side of length $2^{-l}$. We assume a square CGR image is generated as a $2^k \times 2^k$ matrix, where $k$ is the parameter that determines the size of the image. The frequency of occurrence of any oligomer in a sequence can be obtained by partitioning the CGR space into small squares. Therefore, the number of CGR points in each sub-square of the $2^k \times 2^k$ grid is equal to the number of occurrences of the corresponding k-mer in the sequence. By counting the frequency of CGR points, it is possible to calculate oligonucleotide frequencies at various grid resolutions. We define the element $a_j$ as the number of points located in the corresponding sub-square $j$, where $1 \le j \le 2^{2k}$. Each sequence is thus mapped into a $2^k \times 2^k$ dimensional vector space based on CGR.
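
The two fixed-length feature constructions described above (magnitude spectra of length-normalized signals, and k-mer counts on a CGR grid) might look roughly like the sketch below. Several choices are assumptions, not details confirmed by the text: the signals are zero-padded or truncated to the median length before the transform, the DFT is a naive pure-Python loop (a library routine such as `numpy.fft.fft` would normally be used on genome-length signals), and the nucleotide-to-vertex assignment in `VERTEX_BITS` is hypothetical. All function names are illustrative.

```python
import cmath
from statistics import median

def normalize_length(signal, target):
    """Zero-pad short signals and truncate long ones to `target` samples."""
    return list(signal[:target]) + [0.0] * max(0, target - len(signal))

def magnitude_spectrum(signal):
    """Naive O(l^2) DFT; returns M(k) = |F(k)| for k = 0..l-1."""
    l = len(signal)
    spectrum = []
    for k in range(l):
        f_k = sum(value * cmath.exp(-2j * cmath.pi * j * k / l)
                  for j, value in enumerate(signal))
        spectrum.append(abs(f_k))
    return spectrum

def spectra_for(signals):
    """Normalize all signals to the median length, then transform each."""
    target = int(median(len(s) for s in signals))
    return [magnitude_spectrum(normalize_length(s, target)) for s in signals]

# Assumed (x, y) vertex bits per nucleotide; the text does not fix them.
VERTEX_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr_counts(sequence, k):
    """Count k-mers on a 2^k x 2^k grid.

    Cell (row, col) is the sub-square that the CGR point of the k-mer's
    last nucleotide falls into, computed with exact integer arithmetic.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for end in range(k, len(sequence) + 1):
        col = row = 0
        # The most recent nucleotide contributes the most significant bit.
        for base in reversed(sequence[end - k:end]):
            bx, by = VERTEX_BITS[base]
            col = col * 2 + bx
            row = row * 2 + by
        grid[row][col] += 1
    return grid
```

With k = 6, `fcgr_counts` yields a 64 x 64 grid, i.e., one count for each of the 4096 possible 6-mers. Counting with integers instead of binning floating-point CGR coordinates avoids rounding issues on long homopolymer runs.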

Implementation and Evaluation

We implement all the models with Scikit-learn [55] and PyTorch [56]. We train and test the models retrospectively, since the isolation time of the viral strains is available. For each type of coronavirus, the samples generated from strains isolated before the year N are used to build the model, while those generated after the year N are used for testing. The year N is chosen so that the training and testing sets are divided in a roughly 0.8:0.2 ratio. Five-fold cross-validation is performed in the training process, and the independent testing set is used to validate our models. This test reflects the ability of the models to predict viral lethality for future strains in practical applications. The traditional machine learning models are run with their default parameters (Supplementary Materials). For all deep learning-based models, we apply stochastic gradient descent with a mini-batch size of 64 for optimization. Dropout (rate = 0.5) is applied with a learning rate of 0.001, and all the models are fit for 50 training epochs. The predictive performance of all models in the coronavirus lethality prediction tasks is evaluated by accuracy, precision, sensitivity, and F1 score.
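
The retrospective split can be sketched as below. This is an illustrative approximation: instead of searching for a calendar year N, it sorts the records by isolation date and cuts at the 80% mark. The function name `retrospective_split`, the record layout and the example data are all hypothetical.

```python
from datetime import date

def retrospective_split(samples, train_fraction=0.8):
    """Split (isolation_date, features, label) records chronologically.

    The earliest `train_fraction` of the records form the training set and
    the most recent ones the test set, approximating a 0.8:0.2 split.
    """
    ordered = sorted(samples, key=lambda record: record[0])
    cut = int(round(len(ordered) * train_fraction))
    return ordered[:cut], ordered[cut:]

# Hypothetical records: (isolation date, feature vector, lethality label).
records = [
    (date(2014, 5, 1), [0.1, 0.3], "high"),
    (date(2012, 9, 1), [0.2, 0.1], "high"),
    (date(2015, 6, 1), [0.4, 0.2], "high"),
    (date(2013, 4, 1), [0.3, 0.3], "high"),
    (date(2019, 1, 1), [0.2, 0.2], "high"),
]
train, test = retrospective_split(records)
print(len(train), len(test))  # 4 1
```

Following the text, such a split would be applied to each coronavirus type separately before pooling the training parts; cross-validation on the pooled training set could then be run with, for example, scikit-learn's `cross_val_score` with `cv=5`.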

RESULTS

Genome Composition of SARS-CoV, MERS-CoV and SARS-CoV-2

We first analyzed the composition of the RNA genome of the three human-adapted coronaviruses. Fig. () portrays the average distribution and variance of the nucleotides. We can observe that the proportion of A and T (in place of U) is high, while the proportions of C and G are relatively low for all human coronaviruses. Interestingly, the high T and low C proportions of human coronaviruses are quite variable and act like communicating vessels: T goes up when C decreases and vice versa. The proportion of T ranges from 0.139 to 0.552, while C moves in the opposite direction from 0.374 to 0.107, among all human coronaviruses. Looking at individual types, SARS-CoV-2 as a novel human pathogen follows the typical nucleotide composition in some respects but is also characterized by some differences. We found that SARS-CoV-2 presents a higher variance compared with MERS-CoV and SARS-CoV. This is probably because the rapid and widespread transmission of SARS-CoV-2 accelerates its evolution when infecting humans, so that more strains diverge from their ancestral clade. However, the nucleotide bias is more pronounced in the unpaired regions of the structured RNA genome, which may indicate a certain biological function of these special sequence signatures. Some studies have revealed that a clear difference in the magnitude of the nucleotide bias of coronavirus genomes is likely related to the mechanism of subgenomic mRNA synthesis and the exposure of single-stranded RNA domains [57, 58]. The evidence shows that cytosine discrimination and deamination against CpG dinucleotides are the driving forces that have shaped coronaviruses over evolutionary time [59]. It is indicated that the atypical nucleotide bias could reflect distinct biological functions that are the direct cause of the characteristic codon usage in these viruses [60]. 
Therefore, the analysis of nucleotide and codon usage in coronaviruses can not only provide clues about potential viral evolution but also improve the understanding of viral regulation and support vaccine design.

K-mer-based Classifier Study

Studies have investigated the role of k-mer frequency in the fast and accurate classification of viral genomes [61]. We experiment with k-mers of length k = 1, 2, ..., 7 on different classifiers to measure the prediction accuracy. To explore the k-mer frequency patterns in distinct coronaviruses, we curate an independent testing set to assess the performance. Fig. () shows the predictive accuracy across seven machine learning algorithms at different values of k. Fig. () portrays the performance curves of the deep learning models (on the left) and the results of the traditional machine learning models (on the right). Overall, our proposed methods obtain an average accuracy of 0.956. However, the traditional machine learning methods exceed 0.98 in accuracy at all k values, whereas the deep learning models behave differently. VGG-19 achieves the best results among them, while the accuracy can be as low as 0.8 with ResNet-34 when k = 2. We can conclude that, for these data, the traditional machine learning methods outperform the deep learning models at almost all values of k, with less fluctuation. As a result, the k-mer value 6 is used for the experiments with the CGR representation.

Comparative Performance

We analyzed the effect of different numerical representations of RNA sequences on viral lethality prediction using machine learning approaches. The dataset used is the same as that in Fig. (). The results, along with the average scores for all numerical representations and classifiers, are summarized in Table 1. As can be observed from Table 1, the average scores are high over all measures for all numerical representations. The best performance is achieved with the CGR representation, which yields an average accuracy of 0.985 on the testing set. Surprisingly, we can obtain an average accuracy of 0.967 even with the single-nucleotide numerical representation “Just-A”. At the individual classifier level, traditional machine learning methods display an apparent advantage over deep learning models. Logistic regression and neural network can achieve 100% accuracy for all numerical representations, whereas the prediction accuracy of ResNet-34, VGG-19 and AlexNet ranges from 0.679 to 0.993. This is probably because deep learning algorithms need a large amount of data to learn the patterns. In addition to achieving higher accuracy, traditional machine learning models are computationally cheaper in this task; e.g., with the CGR representation, deep learning models take much longer on average than classical machine learning methods (Supplementary Materials). Overall, our results suggest that all these numerical representations are effective for models that differentiate the degree of lethality of human coronaviruses.

Validation on other Human Coronaviruses

We test the ability of our models to identify the lethality of the other human coronaviruses, i.e., HCoV-229E, HCoV-NL63, HCoV-OC43, and HCoV-HKU1. The training process is carried out on the data of the former three types of coronavirus. For every test dataset, we use CGR as the numerical representation with all classifiers to predict the lethality. Interestingly, the results show that, on average, 28 out of 32 HCoV-229E, 59 out of 64 HCoV-NL63, 134 out of 142 HCoV-OC43 and 25 out of 27 HCoV-HKU1 strains are identified as closest in lethality to SARS-CoV-2, while the remaining strains are labeled middle or high. This suggests that, overall, the other tested human coronaviruses have lower severity than MERS-CoV and SARS-CoV. Evidence has revealed that HCoV-OC43 and HCoV-HKU1 are associated with mild to moderate upper respiratory tract illness with about 0.1% fatality [62]. These infections may be asymptomatic and are considered the second most common cause of the common cold [63]. Similarly, it has been well documented that the majority of HCoV-NL63 infections are mild in humans, though occasionally this coronavirus causes pneumonia or central nervous system diseases in susceptible individuals [64]. Between 2009 and 2016, it accounted for about 0.5% of all acute respiratory tract infections in hospitalized patients from Guangzhou, China, but few deaths were reported [65]. HCoV-229E is a close relative of HCoV-NL63 and leads to similar symptoms [66]. Fig. () displays the CGR plots of different sequences of human coronaviruses at a value of 6 for the k-mer frequency. The CGR plots visually indicate that the genomic signature of the SARS-CoV-2 isolate Wuhan-Hu-1 (Fig. ) is closer to the genomic signature of the SARS-CoV coronavirus isolate Canada (Fig. ), followed by the strain of MERS-CoV Betacoronavirus England 1 isolate (Fig. ). Moreover, the other four human coronaviruses (Fig. and g) present similar visual patterns, which are different from those of the former three types. 
Given the CGR plots of human coronaviruses, we further explore their origin and relationships through phylogenetic analysis. We randomly select five complete genomes from each type, including the reference strain. The phylogenetic tree is constructed from all pairwise distances with maximum likelihood techniques. The results in Fig. () present a clear separation into seven clusters as well as the relationships within the clusters. The average inter-cluster distances confirm that SARS-CoV-2 sequences are closest to the species of SARS-CoV (average distance 0.486), followed by MERS-CoV (4.782), and that all three are far away from the other four human coronaviruses. We also find that HCoV-OC43 and HCoV-HKU1, as well as HCoV-229E and HCoV-NL63, may originate from the same ancestor, with genetic distances of 1.842 and 2.779, respectively. However, there is no evidence that two different species of human coronavirus will present similar lethality if they are genetically close.

DISCUSSION

We provided a comprehensive quantitative analysis to predict the lethality of human-adapted coronaviruses with six different numerical representations of RNA sequences applied in machine learning models. The models are computationally efficient because they are alignment-free. In contrast to alignment-based methods, multiple sequence alignment is not needed when DFT techniques are used. The experimental results show that most of the models achieve rapid and accurate predictions of the lethality of new human-adapted coronaviruses. We validated our results by a quantitative analysis based on the construction of a phylogenetic tree, which reveals the evolutionary relationships among all human coronaviruses based upon genetic information. Coronaviruses were usually thought to cause mild and non-lethal symptoms in humans before the outbreak of SARS-CoV in 2003. The high pathogenicity of SARS-CoV, MERS-CoV and, recently, SARS-CoV-2 has raised urgent interest in and concern about the coronavirus family. Timely analysis of genomic sequences of novel strains requires quick sequence similarity comparison with thousands of known species, which is generally performed by alignment-based methods. However, these methods are time-consuming and sometimes challenging in cases where homologous sequence continuity cannot be ensured. Alignment-free approaches address this issue and can handle a large number of sequences effectively. Previous studies have indicated that SARS-CoV-2 originated from bats [10, 67]. Early sequencing of SARS-CoV-2 strains revealed over 99% similarity with some bat-like coronavirus, indicating that these infections result from a recent cross-species event [68]. Bats are regarded as a natural reservoir of viruses and a source of cross-species transmission to mammals [4, 69]. Before the emergence of SARS-CoV-2, it was found that SARS-CoV and MERS-CoV also originated from bats [70, 71]. 
The phylogenetic analyses assist in identifying the relationships between SARS-CoV-2 and other coronaviruses through nucleotide and amino acid sequence similarities. Continuous human-to-human transmission has been confirmed and asymptomatic cases have continued to increase [72, 73]. There is a desperate need for strict precautions to prevent the spread of the virus and protect public health. Vaccines and miracle drugs are the most efficient ways of fighting this crisis. Currently, vaccine development has entered Phase 3 trials in some countries, while the human ACE2 receptor has been identified as the potential receptor for COVID-19 and serves as a potential target for treatment [74, 75]. Nevertheless, given the circulation and geographic coverage of bat-related coronaviruses, it is critical to monitor the evolution of coronaviruses. Currently, seven known types of coronavirus can infect humans. Novel strains of these coronaviruses are likely to arise and infect humans again through reassortment and mutation when two or more different strains co-infect the same host. Preparation is necessary to prevent potential epidemics and pandemics caused by a novel coronavirus. Our work therefore lays a basis for surveillance by inferring the lethality of potential human coronaviruses that may emerge in the future. This study is subject to a variety of limitations. The classification of the degree of coronavirus lethality is mainly based on the mortality rate. We assume that the higher the mortality, the more lethal the virus, and thus define three categories of lethality level for all viruses with different thresholds. However, our estimates of these values lie within the range of fatality rates reported in the literature, because we do not have sufficient data to obtain and parameterize a case-structured model, especially for viruses with few samples. 
In addition, other factors such as the innate immune system and comorbidities could have a significant impact on the lethality of a virus when it infects humans. Moreover, the limited number of data points for human coronaviruses tempers the high predictive accuracy, as most machine learning algorithms have a strong ability to discover inherent patterns in training samples, particularly in small datasets. Like typical machine learning approaches, our models cannot provide a direct and accessible explanation of why a certain coronavirus strain is more lethal to humans. Rule-based methods or clinical studies might provide a better rationale for such results.
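
To make the alignment-free idea discussed above concrete, the following is a minimal sketch, not the authors' implementation. It maps each nucleotide to a number (the standard EIIP values, or the Just-A indicator), takes the DFT magnitude spectrum, and compares spectra without aligning the sequences. The toy sequences, the zero-padding to a common length, the Pearson-correlation distance and all function names are assumptions made for illustration. A naive DFT keeps the sketch dependency-free; an FFT routine would be the practical choice for full genomes.

```python
import cmath
import math

# Standard EIIP values per nucleotide; Just-A marks adenine positions only.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
JUST_A = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}


def encode(seq, mapping, length):
    """Map a nucleotide string to numbers, truncated or zero-padded to `length`.

    Zero-padding is an assumption of this sketch, not the paper's stated choice.
    """
    values = [mapping.get(base, 0.0) for base in seq.upper().replace("U", "T")]
    values = values[:length]
    return values + [0.0] * (length - len(values))


def magnitude_spectrum(signal):
    """Magnitude of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson_distance(a, b):
    """Distance in [0, 1] derived from the Pearson correlation (assumed metric)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return (1.0 - cov / math.sqrt(var_a * var_b)) / 2.0


# Hypothetical toy sequences of different lengths; real input would be genomes.
toy = {
    "seq1": "ATGGCGTACGTTAGC",
    "seq2": "ATGGCGTACGATAGC",
    "seq3": "CCCCGGGGTTTTAAAA",
}
length = max(len(s) for s in toy.values())
spectra = {name: magnitude_spectrum(encode(s, EIIP, length))
           for name, s in toy.items()}

for name in ("seq2", "seq3"):
    print(name, round(pearson_distance(spectra["seq1"], spectra[name]), 4))
```

The resulting spectra (or a matrix of pairwise distances between them) are fixed-length feature vectors, which is what lets a classifier be trained on sequences of different lengths without multiple sequence alignment.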

CONCLUSION

We provide a comprehensive analysis through alignment-free machine learning-based methods for the prediction of the lethality of existing human-adapted coronavirus. The results show that, on average, the CGR, EIIP and Just-A representations perform better than the others, with average testing accuracies of 0.985, 0.968 and 0.967, respectively (Table 1). Interestingly, traditional machine learning methods show clear advantages over deep learning models in both computational efficiency and performance on this task. Validation on other types of human coronavirus, in combination with phylogenetic analysis, further supports our predictive results. We hope this work will facilitate COVID-19 research for biologists and clinicians on the frontline who need to detect the lethality of newly emerging variants of SARS-CoV-2. Future work includes the construction of a surveillance system for novel coronaviruses and in vitro evaluation of the computational models.

Table 1

The performance for the lethality prediction of human-adapted coronaviruses via seven different classifiers. Average results for each numerical representation are in bold.

Numerical Representation	Model	Training Accuracy	Training Precision	Training Recall	Training F-score	Testing Accuracy	Testing Precision	Testing Recall	Testing F-score
Real	LR	0.999	0.999	1.000	0.999	1.000	1.000	1.000	1.000
	KNN	0.999	1.000	0.999	0.999	0.984	0.994	0.983	0.988
	NN	0.999	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	RF	0.998	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	ResNet34	0.964	0.990	0.990	0.990	0.979	0.992	0.986	0.989
	VGG19	0.961	0.989	0.989	0.989	0.981	0.988	0.988	0.988
	AlexNet	0.671	0.841	0.841	0.841	0.679	0.893	0.670	0.765
	Average	0.941	0.973	0.974	0.973	0.946	0.981	0.946	0.961
Nearest neighbor based doublet	LR	0.999	0.999	1.000	0.999	1.000	1.000	1.000	1.000
	KNN	0.998	0.999	0.996	0.998	0.981	0.993	0.981	0.987
	NN	0.999	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	RF	0.998	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	ResNet34	0.966	0.991	0.991	0.991	0.977	0.991	0.988	0.989
	VGG19	0.946	0.981	0.981	0.981	0.967	0.987	0.984	0.986
	AlexNet	0.857	0.936	0.936	0.936	0.714	0.902	0.712	0.796
	Average	0.966	0.986	0.986	0.986	0.948	0.981	0.952	0.964
EIIP	LR	0.999	0.999	1.000	0.999	1.000	1.000	1.000	1.000
	KNN	0.998	0.999	0.995	0.997	0.981	0.993	0.981	0.987
	NN	0.998	0.997	0.999	0.998	1.000	1.000	1.000	1.000
	RF	0.999	0.997	0.998	0.997	1.000	1.000	1.000	1.000
	ResNet34	0.962	0.989	0.989	0.989	0.972	0.989	0.980	0.984
	VGG19	0.940	0.978	0.978	0.978	0.979	0.992	0.989	0.990
	AlexNet	0.839	0.927	0.927	0.927	0.848	0.949	0.936	0.942
	Average	0.962	0.983	0.983	0.983	0.968	0.989	0.983	0.986
PP	LR	0.999	0.999	1.000	0.999	0.995	0.998	0.995	0.997
	KNN	0.999	1.000	0.998	0.999	0.981	0.993	0.981	0.987
	NN	0.999	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	RF	0.999	0.999	0.998	0.999	0.998	0.999	0.998	0.998
	ResNet34	0.963	0.989	0.989	0.989	0.977	0.991	0.985	0.988
	VGG19	0.943	0.980	0.980	0.980	0.993	0.997	0.994	0.996
	AlexNet	0.662	0.837	0.837	0.837	0.681	0.894	0.669	0.765
	Average	0.937	0.971	0.971	0.971	0.946	0.981	0.946	0.961
Just-A	LR	0.999	0.999	1.000	0.999	1.000	1.000	1.000	1.000
	KNN	0.998	0.999	0.996	0.998	0.986	0.994	0.985	0.990
	NN	0.999	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	RF	0.999	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	ResNet34	0.969	0.992	0.992	0.992	0.960	0.984	0.969	0.977
	VGG19	0.969	0.992	0.992	0.992	0.984	0.989	0.992	0.991
	AlexNet	0.842	0.928	0.928	0.928	0.841	0.942	0.933	0.938
	Average	0.967	0.986	0.986	0.986	0.967	0.987	0.982	0.985
CGR	LR	1.000	1.000	1.000	1.000	0.995	0.998	0.995	0.997
	KNN	0.999	1.000	0.999	0.999	0.993	0.997	0.993	0.995
	NN	0.999	0.998	1.000	0.999	1.000	1.000	1.000	1.000
	RF	0.999	0.997	0.998	0.999	0.995	0.998	0.995	0.997
	ResNet34	0.975	0.996	0.996	0.996	0.934	0.975	0.933	0.954
	VGG19	0.948	0.982	0.982	0.982	0.993	0.997	0.994	0.996
	AlexNet	0.955	0.986	0.986	0.986	0.988	0.995	0.992	0.994
	Average	0.982	0.994	0.994	0.994	0.985	0.994	0.986	0.990
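
The averages in Table 1 can be rechecked with a few lines of arithmetic. The sketch below copies the testing accuracy column from the table, recomputes the per-representation mean, and compares the four conventional classifiers with the three convolutional networks. Treating LR, KNN, NN and RF as the "traditional" group and the variable names are assumptions of this example. Recomputed means can differ from the tabulated ones in the third decimal because the table entries are already rounded.

```python
from statistics import mean

# Testing accuracy from Table 1, in classifier order:
# LR, KNN, NN, RF, ResNet34, VGG19, AlexNet
test_accuracy = {
    "Real": [1.000, 0.984, 1.000, 1.000, 0.979, 0.981, 0.679],
    "Nearest neighbor based doublet": [1.000, 0.981, 1.000, 1.000, 0.977, 0.967, 0.714],
    "EIIP": [1.000, 0.981, 1.000, 1.000, 0.972, 0.979, 0.848],
    "PP": [0.995, 0.981, 1.000, 0.998, 0.977, 0.993, 0.681],
    "Just-A": [1.000, 0.986, 1.000, 1.000, 0.960, 0.984, 0.841],
    "CGR": [0.995, 0.993, 1.000, 0.995, 0.934, 0.993, 0.988],
}

# Mean testing accuracy per numerical representation.
averages = {name: mean(scores) for name, scores in test_accuracy.items()}

# Representations ranked from best to worst mean testing accuracy.
ranking = sorted(averages, key=averages.get, reverse=True)

# Assumed grouping: first four classifiers are "traditional",
# the last three are deep convolutional networks.
traditional = {name: mean(s[:4]) for name, s in test_accuracy.items()}
deep = {name: mean(s[4:]) for name, s in test_accuracy.items()}

for name in ranking:
    print(f"{name}: mean={averages[name]:.3f} "
          f"traditional={traditional[name]:.3f} deep={deep[name]:.3f}")
```

Under these assumptions the three best-ranked representations are CGR, EIIP and Just-A, and the traditional group has the higher mean testing accuracy for every representation.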

