
Scoring amino acid mutation to predict pandemic risk of avian influenza virus.

Abstract

BACKGROUND: Avian influenza virus can directly cross species barriers and infect humans with high fatality. Because its antigens are novel to the human host, it poses a serious challenge to public health. The pandemic risk of avian influenza viruses should be analyzed, and a prediction model should be constructed for virology applications.
RESULTS: The 178 signature positions in 11 viral proteins were first screened as features using the scores of five amino acid factors and their random forest rankings. The support vector machine algorithm achieved good performance. The most important amino acid factor (Factor 5) and the minimal set of signature positions (63 amino acid residues) were also explored. Moreover, human-origin avian influenza viruses carrying three or four genome segments from a human virus were predicted to have pandemic risk with high probability.
CONCLUSION: Using machine learning methods, the present paper scores amino acid mutations and predicts pandemic risk with good performance. Although the long evolutionary distances between avian and human viruses suggest that avian influenza viruses in nature still need time to become fixed in the human host, it is notable that H7N9 and H9N2 avian viruses carry high pandemic risk.


Keywords: Amino acid mutation; Avian influenza virus; Machine learning; Pandemic risk

Substances:
Amino Acids

Year: 2019 PMID: 31182019 PMCID: PMC6557742 DOI: 10.1186/s12859-019-2770-0

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Influenza A virus contains eight segments of single-stranded negative-sense RNA. Segment 4 encodes the hemagglutinin (HA) gene and segment 6 encodes the neuraminidase (NA) gene. According to the antigenic characteristics of HA and NA, avian influenza A virus has 16 HA subtypes and nine NA subtypes [1]. Because the viral genome mutates rapidly, phenotypes such as antigenicity, drug resistance, and virulence can change in a relatively short time. Moreover, the segmented genome facilitates reassortment and promotes fast changes of phenotype [1]. Avian influenza virus (AIV) can cross the species barrier and cause fatal infections in humans, which has caused huge economic losses and attracted extensive public attention. The highly pathogenic AIV of the H5N1 subtype was first reported in Asia in 1996 [2]. That H5N1 virus can cross the species barrier directly and fatally infect the respiratory system was confirmed by the isolation of human-origin H5N1 virus from clinical samples in 1997 [3, 4]. Human infections with the H5N1 subtype have been reported widely and continuously since 2003, and a large amount of data has been deposited in public databases [5-8]. Besides H5N1, other subtypes can also infect humans by direct interspecies transmission. Two H9N2 infection cases were reported, in 1999 and 2003 [9, 10]. H7N7 virus infected farmers in the Netherlands in 2003 [11]. Moreover, H7N9 emerged in 2013, and human infections were still being reported at the time of writing [12, 13]. In terms of transmission efficiency, interspecies transmission of AIV has two phenotypes: (1) remaining prevalent among poultry or causing human infection with low probability; (2) adaptation to the human host and human-to-human transmission with high efficiency. Thus far, AIVs in nature have not shown the second phenotype; they represent initial adaptation to the new host, with low efficiency of transmission among humans. Seasonal and pandemic influenza viruses, in contrast, transmit among humans with high efficiency.
Unfortunately, a growing number of reports on transmission efficiency have shown that AIVs with adequate amino acid (AA) mutations can transmit with high efficiency among mammals, which strongly suggests that the pandemic risk of AIVs among humans is rising [14-20]. Because of their high fatality and antigenic novelty for the human host, AIVs pose a serious challenge to public health. Computational tools from bioinformatics are therefore needed to screen mutations in viral proteins, not only for studying highly efficient transmission among humans but also for predicting the transmission phenotype and the corresponding pandemic risk of AIVs. In a previous study, five amino acid factors summarized from 491 highly redundant amino acid attributes were associated with specific physicochemical amino acid properties, namely polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [21]. In this paper, we used the five AA factors to transform viral proteins and used the random forest (RF) method to select features from the high-dimensional protein data and score them by their contributions to transmission efficiency and pandemic risk. After ranking the positions containing important mutation information, a classifier could predict the high-efficiency transmission phenotype to evaluate pandemic risk. We first identified 178 signature mutation positions by RF scoring, then predicted the pandemic risk of AIVs with four popular machine learning methods. Using the most effective classifier, we explored the important amino acid factors and the minimal set of signature positions. The results could benefit pandemic surveillance and future studies on the efficiency of AIV transmission.

Results

Dataset

The final dataset contained 869 high-quality AIV strains (440 avian-origin AIVs of the H1–H14 and H16 subtypes; 429 human-origin AIVs of the H5N1, H5N6, H7N3, H7N7, H7N9 and H9N2 subtypes) and 914 seasonal, pandemic human, and artificial viruses (H1N1, H1N2 and H3N2 subtypes; H5N1 artificial virus). Because the 869 AIVs have low transmission efficiency and low pandemic risk among humans, they were regarded as negative samples. The 914 human or artificial viruses were regarded as positive samples because they have been verified to transmit with high efficiency among humans or mammals. The information related to these strains is summarized in Additional file 1.

Signature amino acid residues

The importance score at each position in the 11 viral proteins was computed by the RF model to screen for signature positions. The slope of the ranked-score curve changed markedly at an importance score of 10 (Fig. 1a). Therefore, 10 was preliminarily selected as the cutoff score. This yielded 178 signature positions, which formed the initial amino acid mutation set for further machine learning.

Fig. 1

Importance score curve and the performances of k-nearest neighbor (KNN), naïve Bayes (NB), support vector machine (SVM), and random forest (RF) classifiers. a The ranked scores were calculated from five AA factors using the random forest method. The x and y coordinates denote the total length of the 11 protein alignments and the importance scores, respectively. The cutoff value 10 is indicated by the thin horizontal line. b Performances of the four classifiers were evaluated from 100 repeats of 10-fold cross-validation. The area under the curve (AUC) ranges from 0 to 1.

As shown in Table 1, the hemagglutinin protein (HA) contained the largest number of signature positions (41 amino acid residues; 41/178 ≈ 23%), suggesting that HA is very important for highly efficient transmission of AIVs among humans. HA is mainly involved in receptor-binding and fusion activities. Positions HA102–HA290 are located in or close to the host receptor-binding region [22, 23], and HA158, HA163, HA189, HA190, HA224, HA226, and HA228 are reportedly related to the specificity of receptor binding [14-19]. HA94, HA101, HA327, HA367, and HA393 are located at or near the fusion peptide [24], which triggers fusion activity in acidic environments and favors transmission to humans. The HA327 position in the cleavage site is an important virulence site [25]. Position 627 in the polymerase basic protein 2 (PB2) has been implicated in increased replication or virulence of AIVs in mammals and in transmission among humans [19, 26]. Positions 93 and 95 in the matrix protein 2 (M2), which are associated with viral particle assembly [27], were also screened. Positions 372 and 375 in the nucleoprotein (NP) are reportedly involved in intracellular transport of viral proteins [28, 29].

Table 1

Scores for the 178 signature amino acids of avian influenza viruses

Num	Pro^a	Pos^b	Score	Num	Pro	Pos	Score	Num	Pro	Pos	Score
1	PB2	44	12.13	61	HA	124	12.51	121	NP	430	10.74
2	PB2	61	14.04	62	HA	137	10.16	122	NP	442	18.71
3	PB2	81	11.32	63	HA	141	10.61	123	NP	444	12.48
4	PB2	105	13.20	64	HA	144	10.61	124	NP	455	10.90
5	PB2	199	12.53	65	HA	155	10.56	125	NP	456	10.54
6	PB2	225	10.72	66	HA	158	10.36	126	NP	473	10.14
7	PB2	271	21.13	67	HA	160	16.89	127	NA	105	10.92
8	PB2	323	20.77	68	HA	163	11.46	128	NA	200	10.32
9	PB2	368	10.29	69	HA	164	10.57	129	NA	247	17.48
10	PB2	391	13.25	70	HA	169	10.33	130	NA	347	10.16
11	PB2	475	16.03	71	HA	171	10.35	131	NA	372	10.85
12	PB2	526	11.05	72	HA	172	10.15	132	NA	399	12.30
13	PB2	559	10.52	73	HA	189	13.15	133	M1	15	10.57
14	PB2	567	12.55	74	HA	190	19.80	134	M1	30	16.61
15	PB2	588	14.51	75	HA	193	13.31	135	M1	37	10.92
16	PB2	591	10.79	76	HA	203	12.47	136	M1	115	16.49
17	PB2	627	11.12	77	HA	224	16.94	137	M1	116	19.25
18	PB2	645	11.76	78	HA	225	14.94	138	M1	137	11.12
19	PB2	674	11.02	79	HA	226	15.14	139	M1	142	11.79
20	PB1	99	16.40	80	HA	228	15.09	140	M1	207	12.04
21	PB1	287	11.47	81	HA	246	10.95	141	M1	209	14.74
22	PB1	336	14.27	82	HA	272	12.19	142	M1	214	16.99
23	PB1	339	11.25	83	HA	276	12.12	143	M2	13	11.26
24	PB1	361	13.80	84	HA	285	10.01	144	M2	14	12.88
25	PB1	368	12.23	85	HA	299	10.21	145	M2	18	12.14
26	PB1	375	13.66	86	HA	327	12.71	146	M2	20	10.14
27	PB1	486	11.37	87	HA	367	11.62	147	M2	27	14.13
28	PB1	581	19.70	88	HA	393	13.32	148	M2	28	10.01
29	PB1	584	11.53	89	HA	406	11.53	149	M2	31	11.00
30	PB1	741	14.47	90	HA	413	10.90	150	M2	43	13.76
31	PB1_f2	11	11.53	91	HA	462	11.17	151	M2	50	10.52
32	PB1_f2	27	13.74	92	HA	490	10.82	152	M2	54	12.38
33	PB1_f2	59	10.05	93	HA	493	11.39	153	M2	57	11.57
34	PB1_f2	60	14.89	94	HA	530	10.65	154	M2	65	11.31
35	PB1_f2	73	11.33	95	HA	531	12.94	155	M2	66	13.29
36	PB1_f2	78	13.81	96	NP	16	11.77	156	M2	77	10.00
37	PB1_f2	83	14.22	97	NP	21	10.30	157	M2	78	17.00
38	PA	28	16.42	98	NP	33	15.12	158	M2	79	12.74
39	PA	55	13.03	99	NP	61	12.44	159	M2	86	15.40
40	PA	57	11.25	100	NP	99	22.49	160	M2	93	18.45
41	PA	65	10.78	101	NP	100	18.84	161	M2	95	11.66
42	PA	66	10.42	102	NP	119	10.82	162	NS1	7	10.23
43	PA	94	13.02	103	NP	136	13.19	163	NS1	22	11.57
44	PA	163	11.54	104	NP	189	11.58	164	NS1	53	10.59
45	PA	225	12.94	105	NP	190	13.23	165	NS1	60	13.19
46	PA	268	12.08	106	NP	283	17.10	166	NS1	74	10.03
47	PA	277	14.17	107	NP	289	11.37	167	NS1	81	14.88
48	PA	337	13.50	108	NP	293	12.50	168	NS1	114	13.45
49	PA	391	10.69	109	NP	305	20.80	169	NS1	125	11.55
50	PA	400	11.18	110	NP	313	16.72	170	NS1	171	10.83
51	PA	421	12.85	111	NP	345	13.40	171	NS1	189	13.00
52	PA	520	11.20	112	NP	351	10.14	172	NS1	205	11.00
53	PA	552	16.33	113	NP	353	10.09	173	NS1	215	11.37
54	PA	669	11.64	114	NP	357	21.26	174	NS1	227	12.45
55	HA	12	14.87	115	NP	372	12.50	175	NEP	32	12.43
56	HA	94	10.48	116	NP	375	12.09	176	NEP	70	18.26
57	HA	101	11.00	117	NP	400	10.68	177	NEP	89	11.59
58	HA	110	17.94	118	NP	422	14.84	178	NEP	107	14.74
59	HA	111	11.00	119	NP	425	12.20
60	HA	117	11.34	120	NP	426	10.69

^a Viral protein; ^b Position of amino acid residue in H3 subtype numbering

The viral proteins were transformed by the five amino acid factors, and 178 signature positions were screened by the RF method. Some of the signature positions have already been verified as related to the mechanism of interspecies transmission or to highly efficient transmission among humans, which supports the rationale of the model and benefits prediction accuracy. Moreover, the remaining amino acid mutations, which lack experimental verification, could facilitate exploration of the molecular mechanisms of highly efficient transmission among humans.

Performance of the prediction model

Ten-fold cross-validation and the receiver operating characteristic (ROC) curve were used to evaluate the performance of the classifiers. The area under the ROC curve (AUC) indicates the optimal parameters for the four classifiers. As shown in Fig. 1b, the performances differed markedly. The median AUCs of the support vector machine (SVM) and RF models were almost 1, whereas that of the k-nearest neighbor (KNN) model was almost 0.5. The KNN model performed poorly, possibly because of nonlinear prediction rules in the feature space. The performance of the naïve Bayes (NB) classifier was slightly poorer and less stable than those of the SVM and RF classifiers. Considering its suitability for small sample sizes and its computational complexity, the SVM classifier was selected as the optimal machine learning model for predicting the pandemic risk of AIVs.

Contributions of the AA factors

AIVs were characterized by the scores of 178 amino acid mutations. The five AA factors are associated with specific physicochemical amino acid properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. To understand the importance of the five AA factors, the SVM classifier was used to evaluate all combination patterns. As shown in Fig. 2a, most of the stable performances of the SVM classifier were obtained with AA Factor 5 alone or with combinations that include AA Factor 5. Notably, the median AUC values were almost 1 and remained stable under AA Factor 5 alone. The SVM classifiers using AA Factor 1 alone or AA Factor 2 alone did not perform as well as the classifier using AA Factor 5. These results indicate an important role for AA Factor 5 in the mechanism of AIV transmission. Therefore, AA Factor 5 was employed in further analysis.
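The 31 combination patterns evaluated here are simply the non-empty subsets of the five AA factors (2^5 − 1 = 31). The following minimal Python sketch enumerates them using the labelling convention of Fig. 2a, where '15' means AA Factor 1 and AA Factor 5 together. The function name `factor_patterns` is hypothetical, and Python is used only for illustration; the authors worked in R.

```python
from itertools import combinations

FACTORS = (1, 2, 3, 4, 5)  # AA Factors 1-5


def factor_patterns(factors=FACTORS):
    """Return every non-empty subset of the factors as a label such as '15'."""
    labels = []
    for size in range(1, len(factors) + 1):
        for subset in combinations(factors, size):
            labels.append("".join(str(f) for f in subset))
    return labels


patterns = factor_patterns()
# Patterns that include AA Factor 5, alone or in combination.
with_factor5 = [p for p in patterns if "5" in p]
print(len(patterns), len(with_factor5))
```

Each label would correspond to one SVM evaluation on the 178 positions encoded with only the listed factors.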

Fig. 2

Contributions of AA factors and different mutation sets. a Performance of SVM classifier for different combinations of the five AA factors. The x and y coordinates denote the 31 combination patterns and the AUC values (from 0 to 1), respectively. Along the x axis, ‘15’ denotes that the set of 178 amino acid residues was transformed using AA Factor 1 and AA Factor 5 together, for example. b Contributions of mutation positions for different cutoff values (range 10–20). The y coordinate shows the AUC values

Contributions of the mutation sets

One hundred seventy-eight mutation sites were obtained with a cutoff value of 10, as mentioned above. To further explore the minimal mutation set associated with transmission efficiency, the cutoff value was increased in steps of 1. The SVM classifier was first evaluated with the five AA factors together. As shown in Fig. 2b, the SVM classifier destabilized at higher cutoffs and achieved its most stable and best performance at a cutoff of 13. The performance of the SVM classifier with AA Factor 5 alone was also calculated for different cutoffs. As shown in Fig. 3a, the SVM classifier performed stably and well up to a cutoff of 17, and the best performance was achieved at a cutoff of 13, which gives 63 signature positions (Table 2). These 63 signature residues were regarded as the minimal mutation set of amino acid residues and were transformed by AA Factor 5 alone to show the pattern of avian and human influenza viruses by the multidimensional scaling method [see Additional file 2].
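The cutoff sweep can be illustrated with a short Python sketch that filters positions by importance score. The rows are a small excerpt copied from Table 1, not the full 178-position set. The function names are hypothetical, and whether the comparison at the boundary is strict is an assumption; the excerpt deliberately avoids scores that sit exactly on a cutoff.

```python
# A few (protein, position, score) rows copied from Table 1; not the full set.
TABLE1_EXCERPT = [
    ("PB2", 44, 12.13),
    ("PB2", 271, 21.13),
    ("HA", 190, 19.80),
    ("HA", 246, 10.95),
    ("NP", 99, 22.49),
    ("NA", 247, 17.48),
    ("M2", 93, 18.45),
]


def select_positions(rows, cutoff):
    """Keep positions whose score exceeds the cutoff (strictness is assumed)."""
    return [(protein, pos) for protein, pos, score in rows if score > cutoff]


def sweep(rows, cutoffs=range(10, 21)):
    """Count the surviving positions for each cutoff from 10 to 20."""
    return {c: len(select_positions(rows, c)) for c in cutoffs}


counts = sweep(TABLE1_EXCERPT)
print(counts[10], counts[13], counts[20])
```

In the paper, each reduced set would then be passed to the SVM classifier and its AUC recorded, which is what Fig. 2b and Fig. 3a plot.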

Fig. 3

Table 2

Minimal amino acid set for predicting AIVs

Num	Pro^a	Pos^b	Score	Num	Pro	Pos	Score	Num	Pro	Pos	Score
1	PB2	61	14.04	22	PA	337	13.50	43	NP	345	13.40
2	PB2	105	13.20	23	PA	552	16.33	44	NP	357	21.26
3	PB2	271	21.13	24	HA	12	14.87	45	NP	422	14.84
4	PB2	323	20.77	25	HA	110	17.94	46	NP	442	18.71
5	PB2	391	13.25	26	HA	160	16.89	47	NA	247	17.48
6	PB2	475	16.03	27	HA	189	13.15	48	M1	30	16.61
7	PB2	588	14.51	28	HA	190	19.80	49	M1	115	16.49
8	PB1	99	16.40	29	HA	193	13.31	50	M1	116	19.25
9	PB1	336	14.27	30	HA	224	16.94	51	M1	209	14.74
10	PB1	361	13.80	31	HA	225	14.94	52	M1	214	16.99
11	PB1	375	13.66	32	HA	226	15.14	53	M2	27	14.13
12	PB1	581	19.70	33	HA	228	15.09	54	M2	43	13.76
13	PB1	741	14.47	34	HA	393	13.32	55	M2	66	13.29
14	PB1_f2	27	13.74	35	NP	33	15.12	56	M2	78	17.00
15	PB1_f2	60	14.89	36	NP	99	22.49	57	M2	86	15.40
16	PB1_f2	78	13.81	37	NP	100	18.84	58	M2	93	18.45
17	PB1_f2	83	14.22	38	NP	136	13.19	59	NS1	60	13.19
18	PA	28	16.42	39	NP	190	13.23	60	NS1	81	14.88
19	PA	55	13.03	40	NP	283	17.10	61	NS1	114	13.45
20	PA	94	13.02	41	NP	305	20.80	62	NEP	70	18.26
21	PA	277	14.17	42	NP	313	16.72	63	NEP	107	14.74

^a Viral protein; ^b Position of amino acid residue in H3 subtype numbering

The distribution of influenza viruses by minimal amino acid set. a Contributions of reduced mutation position sets. The x and y coordinates denote the cutoff (range 10–20) and the AUC values (range 0–1), respectively. b Patterns of human and avian influenza viruses clustered by the multidimensional scaling (MDS) method. Avian influenza viruses are marked by hollow triangles (red, human-origin; green, avian-origin). Seasonal and pandemic human influenza viruses are marked by red hollow rectangles. The six artificial H5N1 viruses are marked by red solid circles. c Human and avian influenza H3N2 viruses in group 4

The distribution of human and avian influenza viruses in two dimensions is shown in Fig. 3b. In terms of pandemic risk, most of the avian viruses clustered at the lower left, while the human viruses formed three separate clusters on the right. Avian influenza viruses 1 (EPI_ISL_64953, A/turkey/NC/353568/2005, H3N2), 2 (EPI_ISL_3141, A/Duck/Nanchang/4–184/2000, H2N9) and 3 (EPI_ISL_3362, A/duck/NC/91347/2001, H1N2) lay close to the human viruses and should be closely monitored in the future. Group 4 was composed of seasonal human and avian viruses of the H3N2 subtype isolated from 2005 to 2013 in North America (Fig. 3c), which suggests that direct interspecies transmission once occurred. As shown in Table 2, the 63 signature positions were screened with a cutoff value of 13. The nucleoprotein (NP) contained the largest number of signature positions (12 amino acid residues; 12/63 ≈ 19%), suggesting that NP is very important for the host range of influenza virus [1]. The HA protein contained a similar number of signature positions (11 amino acid residues; 11/63 ≈ 17%), which further supports the view that HA is very important for highly efficient transmission of AIVs among humans.
Although amino acid mutations in the HA protein are essential for AIV transmission in mammals [14-19], mutations in other proteins are also necessary and should be further verified experimentally [14, 15, 20]. The distribution of mutations across different viral proteins suggests that future studies of AIVs should focus on synergy and nonlinearity among viral proteins.

Pandemic risk of human-origin AIVs

A potential pandemic is thought to be triggered by the reassortment of viral genomes [1], in which genome segments of human viruses (excluding the HA segment) are inserted into the genome of AIVs. To evaluate the pandemic risk of human-origin AIVs, an artificial simulation of genome reassortment between human-origin AIVs and human influenza viruses (seasonal human virus and 2009 pandemic virus) was performed. As shown in Table 3, at least three or four genome segments were needed to change the transmission phenotype with high probability (≥ 0.90). The computational results were consistent with the report of Zhang Y. et al. (2013) [20]. Notably, there was high pandemic risk for H7N9 virus (only three segments needed) and H9N2 virus (flexible patterns of genome reassortment), which is very important for future surveillance of avian influenza virus.

Table 3

Artificial simulation of genome reassortment

Human-origin AIV	Human influenza virus	Probability	Genome segment
A/Egypt/682/2015_H5N1	A/Ohio/09/2015_H1N1	0.90	seg1seg3seg5seg7
A/Zhejiang/9/2015_H7N9	A/Ohio/09/2015_H1N1	0.91	seg1seg5seg7
A/Hunan/44558/2015_H9N2	A/Ohio/09/2015_H1N1	0.94	seg1seg2seg5seg7
A/Hunan/44558/2015_H9N2	A/Ohio/09/2015_H1N1	0.93	seg1seg3seg5seg7
A/Hunan/44558/2015_H9N2	A/Ohio/09/2015_H1N1	0.91	seg1seg5seg7seg8
A/Hunan/44558/2015_H9N2	A/Sichuan/1/2009_H1N1	0.91	seg1seg2seg5seg7
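The reassortment patterns in Table 3 can be pictured with a minimal Python sketch: enumerate sets of three or four donor segments (never the HA segment), then swap in the human proteins carried on those segments. The segment-to-protein layout is the standard one for influenza A. The function names, the placeholder protein values, and the Platt coefficients `a` and `b` are hypothetical; in practice the coefficients are fitted on held-out SVM outputs. Python is used only for illustration.

```python
from itertools import combinations
from math import exp

# Standard influenza A layout: genome segment -> proteins it encodes.
SEGMENT_PROTEINS = {
    1: ("PB2",), 2: ("PB1", "PB1_f2"), 3: ("PA",), 4: ("HA",),
    5: ("NP",), 6: ("NA",), 7: ("M1", "M2"), 8: ("NS1", "NEP"),
}


def reassortment_patterns(sizes=(3, 4), excluded=(4,)):
    """All donor-segment sets of the given sizes, never using excluded segments."""
    candidates = [s for s in SEGMENT_PROTEINS if s not in excluded]
    return [combo for size in sizes for combo in combinations(candidates, size)]


def reassort(avian, human, donor_segments):
    """Proteins on donor segments come from the human strain, the rest from the AIV."""
    donated = {p for s in donor_segments for p in SEGMENT_PROTEINS[s]}
    return {p: (human[p] if p in donated else avian[p]) for p in avian}


def platt_probability(decision_value, a, b):
    """Platt scaling: P(positive) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + exp(a * decision_value + b))


proteins = [p for group in SEGMENT_PROTEINS.values() for p in group]
avian = {p: "avian" for p in proteins}   # placeholder sequences
human = {p: "human" for p in proteins}   # placeholder sequences
patterns = reassortment_patterns()
chimera = reassort(avian, human, (1, 5, 7))  # the seg1seg5seg7 pattern
print(len(patterns), chimera["PB2"], chimera["HA"])
```

Each simulated reassortant would then be encoded and scored by the trained SVM, and the scaled probability compared with the 0.90 threshold used in Table 3.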

Discussion

Avian influenza viruses can cross the species barrier, potentially causing a human pandemic. In this paper, AIV pandemic risk was predicted by the SVM model with excellent performance. We first screened 178 mutation positions in the 11 viral proteins by the RF method. Some of the residues at these positions have been related to interspecies transmission in earlier reports, such as HA158, HA163, HA189, HA190, HA224, HA226, and HA228 [14–16, 18], HA163 [17], HA94, HA101, HA327, HA367, and HA393 [24], M2 93 and M2 95 [27], NP372 and NP375 [28, 29], and PB2 627 [26], which supports the accuracy and biological relevance of the prediction model. The proposed model provides important clues for future surveillance in the field of virology and is a useful pre-screening tool for phenotype screening in high-level biosafety laboratories. Amino acid mutations in the HA protein are essential for highly efficient transmission in mammals [16], but mutations in other viral proteins are also necessary [14, 15]. Mutations in different proteins introduce synergy and nonlinearity among these viral proteins, which is supported by the results in this paper. The KNN classifier, which relies on simple distances in the feature space, showed poor predictive performance on the initial set of 178 signature positions. Moreover, the minimal signature position set was composed of 63 amino acid residues distributed among different viral proteins, as shown in Table 2. This synergistic effect deserves attention in further studies. In addition, the NP protein contained the largest number of signature positions (12 amino acid residues; 12/63 ≈ 19%), suggesting that NP is very important for the host range of influenza virus [1]. Future work should focus on the role of the NP protein in transmission. The molecular characteristics of AA Factor 5 are related to electrostatic charge, with high coefficients on isoelectric point and net charge [21].
Electrostatic charge is strongly related to the binding of biological molecules, such as the binding between viral surface proteins and host receptors and the binding between viral enzymes and host molecules. The poorer performance of the other four factors may suggest that host receptor binding and viral polymerase activity play key roles in adaptation to the human host and in highly efficient transmission of avian influenza virus. Four popular classifiers were used to predict the phenotype of AIVs. With the empirical parameters, the SVM model achieved good performance while the KNN model did not. The KNN parameter was adjusted from k = 1 to 20, and the performance remained poor. The reason may be that the amount of data was not adequate for the dimension of the feature vector. In this paper, each of the 1783 influenza viruses in the final dataset was represented by a vector of 178 × 5 = 890 dimensions. The KNN algorithm therefore performed weakly on our data. As shown in Table 3, three or four genome segments were needed for H7N9 and H9N2 viruses to change the transmission phenotype with high probability (≥ 0.90), which is very important for future surveillance of AIVs. Moreover, when avian and human viruses with the predicted genome pattern are found in the same region or in the same case, the pandemic risk deserves attention.

Conclusions

The 178 signature mutations in 11 viral proteins were first screened by the random forest model. AIV pandemic risk was predicted by the SVM model with excellent performance. Although the long evolutionary distance between avian and human influenza viruses suggests that avian influenza viruses in nature still need a long time to become fixed in humans, it is notable that H7N9 and H9N2 AIVs carry high pandemic risk. The findings in this paper provide important clues for pandemic surveillance.

Methods

The genome data of 16,551 influenza viruses isolated from nature were collected from the EpiFlu public database [30, 31], and those of six artificial H5N1 viruses with pandemic risk were collected from ref. [14]. The data were processed and modeled using multiple public bioinformatics tools and algorithms, as shown in Fig. 4. The strains were isolated between January 1996 and February 2016. The details of data cleaning are the same as those in refs. [32-34].

Fig. 4

Flowchart of the machine learning algorithm used in the paper

The final dataset for predicting pandemic risk contained two categories of virus in terms of pandemic risk: 1) 869 high-quality AIV strains with low transmission efficiency among humans: 440 avian-origin AIVs (H1–H14, H16 subtypes) and 429 human-origin AIVs (H5N1, H5N6, H7N3, H7N7, H7N9 and H9N2 subtypes); 2) 914 influenza strains with high transmission efficiency among humans: 908 seasonal or pandemic human influenza viruses (H1N1, H1N2 and H3N2 subtypes) and six artificial H5N1 viruses [14]. To balance the data sizes and limit the high similarity among viral protein sequences, the seasonal and pandemic human viruses from nature were required to differ in isolation location, isolation time, or antigenic subtype. The information related to these strains is summarized in Additional file 1.

Scoring amino acid mutation

A random forest is a collection of a large number of decision trees. The contribution of each feature to each tree in the random forest was calculated. All of the features were ranked according to their average contribution across all of the trees in the model. The random forest method is widely used for feature selection in prediction problems and can rank the importance of a large number of features for discriminating between categories. In this paper, the high-efficiency transmission phenotype was predicted to evaluate pandemic risk. Before the classifier models were constructed, molecular features associated with transmission efficiency were first screened. The positive samples (high transmission efficiency) and negative samples (low transmission efficiency) were then classified by their importance scores at each amino acid position. The RF method was used to screen the signature mutations in the 11 viral proteins [35]. To facilitate the computation of importance scores, the 11 proteins in each strain were artificially concatenated in the following order: polymerase basic protein 2 (PB2), polymerase basic protein 1 (PB1), the second protein expressed from the PB1 gene (PB1-F2), polymerase acidic protein (PA), hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), matrix protein 1 (M1), matrix protein 2 (M2), non-structural protein 1 (NS1), and nuclear export protein (NEP). Numerical sequences for each amino acid factor were obtained by transforming the artificial protein, which had a length of 4620 amino acids. Any deletions or insertions in the protein were replaced by zeros. All of the viruses were processed sequentially and input to the RF model for the ranking of signature positions. Breiman's random forest algorithm was used with default settings. Because five factors were used to select the features and construct the classifiers, the final importance score at each position was the sum of the five calculations.
In brief, high-scoring positions were important for distinguishing positive and negative samples. Signature positions with high scores were regarded as important amino acid mutations associated with the phenotype of highly efficient transmission.
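The encoding and score-summing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the factor values in `TOY_FACTORS` are hypothetical placeholders for two residues (the real five-factor table is given in ref. [21] and is not reproduced here), the function names are invented, and the per-factor importance lists stand in for the output of five separate random forest runs.

```python
# Hypothetical placeholder values for two residues only; the real
# five-factor table comes from ref. [21].
TOY_FACTORS = {
    "A": (-0.5, -1.0, 0.5, 1.0, 0.1),
    "K": (1.5, 0.5, -0.2, -0.8, 1.2),
}


def encode(sequence, factor_index, table=TOY_FACTORS):
    """Turn an aligned sequence into one factor's numeric sequence.

    Gaps and residues missing from the table are encoded as 0.0,
    mirroring the rule that deletions or insertions become zeros.
    """
    return [table[aa][factor_index] if aa in table else 0.0 for aa in sequence]


def total_importance(per_factor_scores):
    """Sum position-wise importance over the per-factor scoring runs."""
    return [sum(column) for column in zip(*per_factor_scores)]


factor5 = encode("AK-A", factor_index=4)           # AA Factor 5 is index 4
summed = total_importance([[1.0, 2.0, 0.0]] * 5)   # five identical toy runs
print(factor5, summed)
```

Positions whose summed score passes the chosen cutoff would then be kept as signature positions.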

Constructing the predicting model

A two-class model was constructed to predict and evaluate the pandemic risk of AIVs. After the amino acid mutations in all 11 viral proteins were ranked, each strain was represented as a numeric vector of length 5N, where N is the number of residues in the screened amino acid set. The pandemic risk was then predicted by four popular machine learning models: 1) Support vector machine [36]. The optimal hyperplane was determined with the regularization parameter C (C = 1) and the radial basis function (RBF) kernel as defaults. 2) Random forest [35]. The RF model was implemented with the default parameters in the package. 3) Naïve Bayes [36]. The NB model was also implemented with the default parameters in the package. 4) K-nearest neighbor [37]. The KNN classifier is a nonparametric method that determines a sample's category by a majority vote of its neighbors; the number of neighbors in this paper was set to 3 (k = 3). All four classifiers were implemented in the R environment with related packages.
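As an illustration of the simplest of the four models, the majority-vote rule of the KNN classifier with k = 3 can be written in plain Python. This is a sketch, not the authors' R implementation: the function name, the use of Euclidean distance, and the two-dimensional toy vectors are assumptions (real strains would be vectors of length 5N).

```python
from collections import Counter
from math import dist


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query by majority vote among its k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance (assumed)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional data standing in for encoded strains.
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(train, labels, (0.95, 1.0)))
```

With two classes and an odd k, the vote cannot tie, which is one practical reason for choosing k = 3.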

Evaluating the performance of different classifiers

All four models were trained on 823 positive samples (high transmission efficiency) and 782 negative samples (low transmission efficiency) randomly selected from the cleaned dataset of influenza viruses. The remaining 10% of samples (91 positive and 87 negative samples) were reserved as an independent test dataset for assessing the performances of the classifiers. Ten-fold cross-validation and the receiver operating characteristic curve were used to evaluate the performance of the SVM, NB, RF and KNN classifiers. The area under the ROC curve indicates the optimal parameters for the four classifiers. To compare the classifier performances, we repeated the evaluation process 100 times and plotted the distributions of the resulting AUC values. The AUC was calculated in R [38]. The AUC value ranges from 0 to 1. The performance and robustness of the four classifiers were evaluated by the AUC values and their distributions. The 1783 influenza viruses in the final dataset were visualized by the multidimensional scaling method in R [37].
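Two quantities in this paragraph are easy to check with a short Python sketch: the 90/10 split sizes and the AUC itself. The AUC is computed here as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The function names and the rounding rule in `holdout_split` are assumptions; the authors computed the AUC in R.

```python
def holdout_split(n_samples, fraction=0.10):
    """Return (train, test) sizes when a rounded fraction is held out."""
    test = round(n_samples * fraction)
    return n_samples - test, test


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(holdout_split(914), holdout_split(869))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The split sizes reproduce the counts stated above (823/91 positive, 782/87 negative), and an AUC near 0.5 corresponds to the chance-level behaviour reported for the KNN model.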

Artificial simulation of genome reassortment

Because human influenza viruses and human-origin avian influenza viruses exist simultaneously in nature, mixed infection in a single case could give rise to a pandemic virus through genome reassortment [20]. The best-performing SVM classifier was used to analyze the artificial simulation of genome reassortment between three human-origin AIVs and three human viruses. The artificial data were processed and predicted as described above. Platt scaling was used to transform the output of the SVM model into a probability over the two classes, which was used to evaluate the pandemic risk of the reassortant viruses. Three human viruses with high transmission efficiency from the positive samples, A/Ohio/09/2015 (EPI_ISL_179403; H1N1), A/Wisconsin/13/2015 (EPI_ISL_176723; H3N2), and A/Sichuan/1/2009 (EPI_ISL_30411; H1N1; 2009 pandemic swine virus), and three human-origin avian viruses with low transmission efficiency from the negative samples, A/Egypt/682/2015 (EPI_ISL_195659; H5N1), A/Zhejiang/9/2015 (EPI_ISL_192505; H7N9), and A/Hunan/44558/2015 (EPI_ISL_203644; H9N2), were used.

Additional file 1: The nomenclature for influenza viruses in the final dataset. (XLSX 98 kb)
Additional file 2: The clustering details for the MDS method. (XLSX 135 kb)

34 in total

1. N- and C-terminal residues combine in the fusion-pH influenza hemagglutinin HA(2) subunit to form an N cap that terminates the triple-stranded coiled coil.

Authors: J Chen; J J Skehel; D C Wiley
Journal: Proc Natl Acad Sci U S A Date: 1999-08-03 Impact factor: 11.205

2. Several protein regions contribute to determine the nuclear and cytoplasmic localization of the influenza A virus nucleoprotein.

Authors: R Bullido; P Gómez-Puertas; C Albo; A Portela
Journal: J Gen Virol Date: 2000-01 Impact factor: 3.891

3. Molecular determinants within the surface proteins involved in the pathogenicity of H5N1 influenza viruses in chickens.

Authors: Diane J Hulse; Robert G Webster; Rupert J Russell; Daniel R Perez
Journal: J Virol Date: 2004-09 Impact factor: 5.103

4. Generation of influenza A virus NS2 (NEP) mutants with an altered nuclear export signal sequence.

Authors: Kiyoko Iwatsuki-Horimoto; Taisuke Horimoto; Yutaka Fujii; Yoshihiro Kawaoka
Journal: J Virol Date: 2004-09 Impact factor: 5.103

5. Molecular basis for high virulence of Hong Kong H5N1 influenza A viruses.

Authors: M Hatta; P Gao; P Halfmann; Y Kawaoka
Journal: Science Date: 2001-09-07 Impact factor: 47.728

6. Human infection with influenza H9N2.

Authors: M Peiris; K Y Yuen; C W Leung; K H Chan; P L Ip; R W Lai; W K Orr; K F Shortridge
Journal: Lancet Date: 1999-09-11 Impact factor: 79.321

7. Genetic characterization of the pathogenic influenza A/Goose/Guangdong/1/96 (H5N1) virus: similarity of its hemagglutinin gene to those of H5N1 viruses from the 1997 outbreaks in Hong Kong.

Authors: X Xu; N J Cox; Y Guo
Journal: Virology Date: 1999-08-15 Impact factor: 3.616

8. Genesis of a highly pathogenic and potentially pandemic H5N1 influenza virus in eastern Asia.

Authors: K S Li; Y Guan; J Wang; G J D Smith; K M Xu; L Duan; A P Rahardjo; P Puthavathana; C Buranathai; T D Nguyen; A T S Estoepangestie; A Chaisingh; P Auewarakul; H T Long; N T H Hanh; R J Webby; L L M Poon; H Chen; K F Shortridge; K Y Yuen; R G Webster; J S M Peiris
Journal: Nature Date: 2004-07-08 Impact factor: 49.962

9. Structure of the uncleaved human H1 hemagglutinin from the extinct 1918 influenza virus.

Authors: James Stevens; Adam L Corper; Christopher F Basler; Jeffery K Taubenberger; Peter Palese; Ian A Wilson
Journal: Science Date: 2004-02-05 Impact factor: 47.728

10. Avian influenza A virus (H7N7) associated with human conjunctivitis and a fatal case of acute respiratory distress syndrome.

Authors: Ron A M Fouchier; Peter M Schneeberger; Frans W Rozendaal; Jan M Broekman; Stiena A G Kemink; Vincent Munster; Thijs Kuiken; Guus F Rimmelzwaan; Martin Schutten; Gerard J J Van Doornum; Guus Koch; Arnold Bosman; Marion Koopmans; Albert D M E Osterhaus
Journal: Proc Natl Acad Sci U S A Date: 2004-01-26 Impact factor: 11.205



