Myeongji Cho, Hyeon Seok Son.
Abstract
Studies of host factors that affect susceptibility to viral infections have led to the possibility of determining the risk of emerging infections in potential host organisms. In this study, we constructed a computational framework to estimate the probability of virus transmission between potential hosts based on the hypothesis that the major barrier to virus infection is differences in cell-receptor sequences among species. Information regarding host susceptibility to virus infection was collected to classify the cross-species infection propensity between hosts. Evolutionary divergence matrices and a sequence similarity scoring program were used to determine the distance and similarity of receptor sequences. The discriminant analysis was validated with cross-validation methods. The results showed that the primary structure of the receptor protein influences host susceptibility to cross-species viral infections. Pair-wise distance, relative distance, and sequence similarity showed the best accuracy in identifying the susceptible group. Based on the results of the discriminant analysis, we constructed ViCIPR (http://lcbb3.snu.ac.kr/ViCIPR/home.jsp), a server-based tool to enable users to easily extract the cross-species infection propensities of specific viruses using a simple two-step procedure. Our sequence-based approach suggests that it may be possible to identify virus transmission between hosts without requiring complex structural analysis. Due to a lack of available data, this method is limited to viruses whose receptor use has been determined. However, the significant accuracy of predictive variables that positively and negatively influence virus transmission suggests that this approach could be improved with further analysis of receptor sequences.
Keywords: Amino acid substitution; Cross-species infection; Discriminant analysis; Evolutionary distance; Sequence similarity
Year: 2019 PMID: 31026604 PMCID: PMC7106226 DOI: 10.1016/j.meegid.2019.04.016
Source DB: PubMed Journal: Infect Genet Evol ISSN: 1567-1348 Impact factor: 3.342
List of the 18 virus receptors used in this study.
| Virus | Receptor |
|---|---|
| HIV | CD4 |
| Hantaviruses, foot-and-mouth disease virus | Integrin αvβ3 |
| SARS coronavirus | ACE2 |
| Rabies virus | nAchR |
| Echovirus (E-6, E-7, E-11, E-12, E-20, E-21 and E-70), coxsackievirus A and B (CV-A21, CV-B1, CV-B3 and CV-B5) | CD55 |
| HCoV-229E (severe acute respiratory syndrome-associated coronavirus) | APN |
| Vesicular stomatitis virus | PS receptor |
| Encephalomyocarditis virus | VCAM1 |
| Hepatitis A virus | HAVCR1 |
| Measles virus vaccine strains | CD46 |
| Measles virus wild-type strains | SLAM |
| MERS coronavirus | DPP4 |
| Nipah virus | Ephrin B2, Ephrin B3 |
| Lassa virus, lymphocytic choriomeningitis virus | DAG1 |
| Junin arenavirus, Machupo virus | TRFC |
| Sendai virus | ASGR2 |
The amino-acid sequences of 18 receptors for 20 viruses were used to construct an original data set consisting of receptor sequence pairs. CD4: cluster of differentiation 4; CD61: cluster of differentiation 61; ACE2: angiotensin-converting enzyme 2; nAchR: nicotinic acetylcholine receptor; CD55: complement decay-accelerating factor; APN: aminopeptidase N; PS: phosphatidylserine; VCAM1: vascular cell adhesion molecule 1; HAVCR1: hepatitis A virus cellular receptor 1; CD46: cluster of differentiation 46; SLAM: signaling lymphocytic activation molecule; DPP4: dipeptidyl peptidase-4; DAG1: dystroglycan 1; TRFC: transferrin receptor; ASGR2: asialoglycoprotein receptor 2.
List of reference sequences of 24 virus genomes collected and stored for system construction.
| Virus name | Organism | Strain | NCBI accession | Length (bp) |
|---|---|---|---|---|
| EMCV | Encephalomyocarditis virus | Ruckert | | 7835 |
| Human echovirus 7 | echovirus E7 | Wallace | | 7427 |
| FMDV | Foot-and-mouth disease virus - type O | Unknown | | 8134 |
| Hantavirus | Hantaan orthohantavirus | Unknown | | 3616 |
| Hepatitis A virus | Hepatovirus A | Unknown | | 7478 |
| HCoV-229E | Human coronavirus 229E | 229E | | 27,317 |
| HIV (SIV; Lentivirus) | Human immunodeficiency virus 1 | Unknown | | 9181 |
| Junin virus | Junin mammarenavirus | Unknown | | 7114 |
| Lassa virus | Lassa mammarenavirus | Josiah | | 7279 |
| LCMV | Lymphocytic choriomeningitis mammarenavirus | Unknown | | 6680 |
| Machupo virus | Machupo mammarenavirus | Carvallo | | 7196 |
| Measles virus | Measles morbillivirus | Ichinose-B95a | | 15,894 |
| MERS-CoV | Middle East respiratory syndrome-related coronavirus | HCoV-EMC | | 30,119 |
| Nipah virus | Nipah henipavirus | Unknown | | 18,246 |
| Rabies virus | Rabies lyssavirus | Unknown | | 11,932 |
| SARS-associated CoV | Porcine epidemic diarrhea virus | CV777 | | 28,033 |
| SARS-CoV | SARS coronavirus | Unknown | | 29,751 |
| Sendai virus | Murine respirovirus | Ohita | | 15,384 |
| TGEV | Feline infectious peritonitis virus | Unknown | | 29,355 |
| VSV | Vesicular stomatitis Indiana virus | Unknown | | 11,161 |
Reference sequences for 24 virus genomes were collected and stored as the target sequence resource for the similarity search engine (Sequence Similarity Scoring System) in ViCIPR, which is run against the virus sequence query input. To construct the gene and protein databases, 118 gene and protein sequences were collected and processed into target sequence resources by parsing the CDS regions of the 24 reference genome sequences.
List of data in MySQL database.
| Field name | Type | Description |
|---|---|---|
| Class | varchar ( | Data set class of each case: 50 class for training data sets and 6 class for test data sets are designated for each case |
| Casenum | varchar( | Index number used as the database primary key: a1-a56 and b1-b56, used to build the training and test data sets, which contain all data items after duplicate values are eliminated |
| Receptor | varchar(50) | Receptor proteins: ACE2, APN, ASGR2, CD4, CD46, CD55, CD61, DAG1, DPP4, Ephrin B2, Ephrin B3, HAVCR1, Integrin alpha 5, NAchR, PS receptor, SLAM, TRFC, VCAM1 |
| Virus | varchar(50) | Virus name: EMCV, human echovirus 7, FMDV, hantavirus, Hepatitis A virus, HCoV-229E, HIV (SIV; Lentivirus), junin virus, lassa virus, LCMV, machupo virus, measles virus, MERS-CoV, nipah virus, rabies virus, SARS-associated CoV, SARS-CoV, sendai virus, TGEV, VSV |
| Reservoir | varchar(50) | Information of reservoir hosts of viruses including |
| Host1 | varchar(50) | 29 species of donor host organisms |
| Host2 | varchar(50) | 29 species of recipient host species |
| pairwise_distance | Double | Pairwise distance score: 0.008–1.712 for the original data set, 0.375–1.712 for non-infectious group, and 0.008–0.36 for infectious group |
| relative_distance | Double | Relative distance score: 0.004–0.994 for total dataset, 0.433–0.994 for non-infectious group, and 0.004–0.297 for infectious group |
| total_similarity | Double | Overall sequence similarity: 0.156–0.981 for total dataset, 0.156–0.643 for non-infectious group, and 0.608–0.981 for infectious group |
| g_verified | int( | Predetermined group 1 for cases verified as non-infectious group through literature review |
| disg_predicted | int( | Score-based predicted group 1 for cases predicted as infectious group by discriminant model |
| disds_calculated | Double | Calculated discriminant z-scores which range from −5.804 to 3.526 |
| disds_trimmed | Double | z′-scores converted from z-score based on the group centroids, which range from −4.316 to 2.326 |
| Infectindex | Double | Propensity scores for total data set which range from 0.001 to 99.994% |
This table shows the data items, types and the values with description of each data item stored in MySQL DBMS for interworking with the web server ViCIPR.
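The way the web server reads a stored propensity for a virus and host pair can be sketched as a keyed lookup. The following is a minimal illustration, not the authors' implementation: it uses Python's built-in `sqlite3` as a stand-in for MySQL, keeps only a subset of the fields listed above, and omits column lengths. The table name `propensity`, the function `lookup_infectindex`, the case number `a1`, and the host labels `host_a` and `host_b` are all hypothetical placeholders; the numeric values are taken from the Sendai virus row of the score table.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE propensity (
        casenum           TEXT PRIMARY KEY,
        receptor          TEXT,
        virus             TEXT,
        host1             TEXT,   -- donor host
        host2             TEXT,   -- recipient host
        pairwise_distance REAL,
        relative_distance REAL,
        total_similarity  REAL,
        infectindex       REAL
    )
    """
)

# One illustrative record; casenum and host labels are hypothetical placeholders.
conn.execute(
    "INSERT INTO propensity VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("a1", "ASGR2", "sendai virus", "host_a", "host_b", 0.130, 0.068, 0.837, 99.993),
)


def lookup_infectindex(conn, virus, host1, host2):
    """Return the stored propensity for a virus and host pair, or None if absent."""
    row = conn.execute(
        "SELECT infectindex FROM propensity "
        "WHERE virus = ? AND host1 = ? AND host2 = ?",
        (virus, host1, host2),
    ).fetchone()
    return row[0] if row else None
```

Parameterized queries are used so that user-supplied virus and host selections are never interpolated into the SQL string.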
Fig. 1. A computational framework to estimate propensities for virus infection between host species. The framework consists of four main stages: 1) amino acid sequence analysis and predictive variable selection, 2) construction of the classification model, 3) model-based prediction, and 4) calculation of propensity scores. Based on the results of the discriminant analysis, three measures were selected as predictive variables and used to construct the model. We derived covariance matrices and pooled within-class covariance matrices for the discriminant analysis. SPSS 24.0 software was used to calculate the inverse matrix and discriminant coefficients, to derive the discriminant model, and to evaluate the contributions of the predictive variables. In this figure, C1 and C2 indicate the group centroids used for z′-value computation, and α and β are the coefficients used to transform the z′-score for propensity estimation.
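The matrix steps named in the caption (pooled within-class covariance, inverse matrix, discriminant coefficients) can be illustrated with a two-group Fisher linear discriminant. This is a sketch under stated assumptions, not the authors' SPSS model: it uses only twelve rows of the score table as input, assumes equal priors, and its coefficients and scores will not reproduce the published DS values. The names `fit_lda` and `score` are hypothetical.

```python
import numpy as np

# Columns: pair-wise distance, relative distance, total similarity.
# Rows are a subset of the score table (six cases per group).
infectious = np.array([
    [0.130, 0.068, 0.837],
    [0.084, 0.072, 0.888],
    [0.025, 0.066, 0.893],
    [0.131, 0.106, 0.852],
    [0.099, 0.244, 0.901],
    [0.161, 0.198, 0.836],
])
non_infectious = np.array([
    [0.797, 0.433, 0.420],
    [1.335, 0.994, 0.234],
    [1.075, 0.560, 0.300],
    [1.050, 0.822, 0.160],
    [0.405, 0.512, 0.360],
    [0.375, 0.989, 0.630],
])


def fit_lda(g1, g2):
    """Fit a two-group linear discriminant; return (coefficients, constant)."""
    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
    n1, n2 = len(g1), len(g2)
    # Pooled within-class covariance matrix.
    pooled = ((n1 - 1) * np.cov(g1, rowvar=False)
              + (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)
    # Discriminant coefficients: pooled^{-1} (m1 - m2), solved without
    # forming the inverse explicitly.
    w = np.linalg.solve(pooled, m1 - m2)
    # Constant chosen so that the score is zero midway between the group means.
    c = -0.5 * w @ (m1 + m2)
    return w, c


def score(x, w, c):
    """Discriminant score: positive values lean towards the first group."""
    return x @ w + c


w, c = fit_lda(infectious, non_infectious)
mean_score_infectious = score(infectious, w, c).mean()
mean_score_non_infectious = score(non_infectious, w, c).mean()
```

With this centering, the two group mean scores are equal in magnitude and opposite in sign, which is the unscaled analogue of the group centroids C1 and C2 in the figure.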
Scores for distance, similarity, discrimination and cross-species infection propensity of receptor sequence pairs.
| Virus | Host1 | Host2 | gSi,1 | gSi,2 | gSi,3 | Group | DS | Infectindex |
|---|---|---|---|---|---|---|---|---|
| Sendai virus | | | 0.130 | 0.068 | 0.837 | 1 | 2.326 | 99.993 |
| MERS-CoV | | | 0.084 | 0.072 | 0.888 | 1 | 2.612 | 98.759 |
| VSV | | | 0.025 | 0.066 | 0.893 | 1 | 2.616 | 98.700 |
| SARS-CoV | | | 0.131 | 0.106 | 0.852 | 1 | 2.226 | 98.495 |
| SARS-CoV | | | 0.085 | 0.069 | 0.897 | 1 | 2.693 | 97.546 |
| HIV (SIV; Lentivirus) | | | 0.091 | 0.045 | 0.904 | 1 | 2.879 | 94.749 |
| HIV (SIV; Lentivirus) | | | 0.088 | 0.044 | 0.906 | 1 | 2.895 | 94.505 |
| Hantavirus | | | 0.038 | 0.094 | 0.962 | 1 | 2.963 | 93.474 |
| Lassa virus, LCMV | | | 0.071 | 0.040 | 0.927 | 1 | 3.046 | 92.233 |
| Hantavirus | | | 0.099 | 0.244 | 0.901 | 1 | 1.788 | 91.898 |
| Lassa virus, LCMV | | | 0.064 | 0.036 | 0.933 | 1 | 3.102 | 91.386 |
| Measles virus wild-type strains | | | 0.045 | 0.057 | 0.967 | 1 | 3.207 | 89.812 |
| Junin virus, machupo virus | | | 0.161 | 0.198 | 0.836 | 1 | 1.647 | 89.784 |
| Nipah virus | | | 0.025 | 0.043 | 0.971 | 1 | 3.289 | 88.572 |
| Nipah virus | | | 0.022 | 0.038 | 0.974 | 1 | 3.334 | 87.895 |
| Rabies virus | | | 0.027 | 0.024 | 0.963 | 1 | 3.338 | 87.835 |
| Human echovirus 7 | | | 0.797 | 0.433 | 0.420 | 2 | −1.862 | 36.946 |
| EMCV | | | 1.335 | 0.994 | 0.234 | 2 | −5.628 | 19.752 |
| Sendai virus | | | 1.075 | 0.560 | 0.300 | 2 | −3.093 | 18.411 |
| Hepatitis A virus | | | 1.050 | 0.822 | 0.160 | 2 | −5.524 | 18.180 |
| Measles virus wild-type strains | | | 0.405 | 0.512 | 0.360 | 2 | −3.135 | 17.778 |
| Rabies virus | | | 1.126 | 0.986 | 0.294 | 2 | −5.389 | 16.151 |
| Lassa virus, LCMV | | | 1.671 | 0.941 | 0.200 | 2 | −5.217 | 13.559 |
| Measles virus vaccine strains | | | 0.776 | 0.791 | 0.431 | 2 | −3.748 | 8.557 |
| HIV (SIV; Lentivirus) | | | 1.359 | 0.676 | 0.250 | 2 | −3.766 | 8.275 |
| VSV | | | 0.375 | 0.989 | 0.630 | 2 | −3.856 | 6.931 |
| Nipah virus | | | 0.625 | 0.819 | 0.44 | 2 | −3.999 | 4.769 |
| HCoV-229E | | | 1.159 | 0.840 | 0.289 | 2 | −4.597 | 4.230 |
| Hepatitis A virus | | | 0.783 | 0.613 | 0.186 | 2 | −4.497 | 2.718 |
| Nipah virus | | | 0.570 | 0.991 | 0.536 | 2 | −4.316 | 0.001 |
Host1, original/donor host species; host2, alternative/recipient host species. gSi,1, gSi,2, and gSi,3, pair-wise distance, relative distance, and total similarity, respectively. Groups were classified based on the discrimination scores (DSs) (1, infectious group; 2, non-infectious group). The DS was calibrated for correct classification and propensity calculation in the dataset. The group centroid of the discriminant function was 2.428 for the infectious group and −4.316 for the non-infectious group, and was used to calculate the Infectindex.
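One simple way to turn a discrimination score into a group label is to assign the case to the nearer of the two group centroids reported above. The sketch below assumes that rule (equivalent to a cutoff at the midpoint of the centroids under equal priors); the source does not state the exact cutoff used, and `predict_group` is a hypothetical name.

```python
# Group centroids of the discriminant function, as reported in the table note.
C_INFECTIOUS = 2.428
C_NON_INFECTIOUS = -4.316


def predict_group(ds):
    """Return 1 (infectious) or 2 (non-infectious) by nearest group centroid."""
    if abs(ds - C_INFECTIOUS) <= abs(ds - C_NON_INFECTIOUS):
        return 1
    return 2
```

Under this rule, the lowest infectious-group score in the table (1.647) and the highest non-infectious-group score (−1.862) fall on the expected sides of the midpoint.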
Fig. 2. Main components and data flow of ViCIPR, a web-based prediction system. The key components for operating the analytical system in ViCIPR (http://lcbb3.snu.ac.kr/ViCIPR/home.jsp) are the process of establishing a discriminant model and predictive protocol, a database capable of dynamic interaction, and a user-friendly web interface. ViCIPR was built on our own statistical protocol, and the results of the protocol and prediction studies were stored in a database that ViCIPR can query. The analysis is initiated by the input of a query sequence (nucleotide or protein) and the selection of the primary (donor) host species. Next, a similarity search is performed using the query sequence and the host information supplied by the user. Based on the similarity search result, the virus species with maximum similarity is given as the output, and the selectable secondary (recipient) host species are presented from the MySQL database. When the secondary host species is selected, the calculated Infectindex of the selected host pair for the corresponding virus species is displayed.
Fig. 3. A simple two-step procedure in the ViCIPR web interface. The process and results of extracting the Infectindex for SARS-CoV are shown. In the first step, a similarity search is performed against the viral genome sequence library among the target sequence resources in the ViCIPR genomic database, and the results are output. The built-in search function of ViCIPR retrieves the maximum matching score, the virus species with the best and most relevant hits, and the virus species with hits (%) and selectable hosts corresponding to the significant sequence alignments. In the second step, a list of selectable primary host species is presented once the user selects the virus; this list is based on the virus species with the maximum percent identity among the viruses corresponding to the target sequences. Finally, the cross-species infection propensity of the host pair chosen by the user is calculated and output to the result window. In the example shown, the ViCIPR database similarity search indicated that SARS-CoV was the virus species most similar to the query sequence. The secondary hosts H. sapiens and M. putorius furo are presented in the select box for the primary host F. catus. As soon as the secondary host species is selected, the calculated Infectindex is output to the box.