Computational Comparison and Visualization of Viruses in the Perspective of Clinical Information.

António M Lopes¹, J A Tenreiro Machado², Alexandra M Galhano².

Abstract

This paper addresses the visualization of complex information using multidimensional scaling (MDS). MDS is a technique adopted for processing data with multiple features scattered in high-dimensional spaces. For illustrating the proposed techniques, the case of viral diseases is considered. The study evaluates the characteristics of 21 viruses in the perspective of clinical information. Several new schemes are proposed for improving the visualization of the MDS charts. The results follow standard clinical practice, proving that the method represents a valuable tool to study a large number of viruses.

Keywords: Complex data; Computational visualization; Multidimensional scaling; Viruses diseases

Year: 2017 PMID: 28391493 PMCID: PMC7090701 DOI: 10.1007/s12539-017-0229-4

Source DB: PubMed Journal: Interdiscip Sci ISSN: 1867-1462 Impact factor: 2.233

Introduction

Reliable data about many real-world phenomena are now available for computer processing. One example is clinical information about viral diseases. Viral infections are an important cause of mortality and morbidity. More than 2000 viruses have been identified, and many can infect humans or animals [1]. In general, viral diseases have very diverse characteristics and complexity, so computational methods for data mining and feature extraction are relevant strategies to adopt. As usually occurs with real-world data, the information is scattered and exhibits multiple characteristics with distinct levels of relevance. Therefore, it is important to explore reliable algorithms for highlighting the main details, and to take advantage of modern computational resources to visualize the relations embedded within the data.

Herein, we adopt the multidimensional scaling (MDS) technique to compare the relationships among several viruses responsible for human diseases. New schemes for improving the visualization of the MDS charts are proposed. Regarding the selection of the “objects” under study, most were chosen for their impact on people and visibility in the media (e.g., subtype H5N1 of Influenza A virus, Ebola, Chikungunya, and Zika), others for historical reasons (e.g., Rabies, Poliomyelitis, and Smallpox), and some because of their incidence and prevalence in humans (e.g., Influenza, Rhinovirus, and Norovirus). The viruses are compared by means of their characteristics and the symptoms of the diseases that they may cause in humans. MDS can lead to a new perspective in the study of human pathologies.

MDS is a statistical technique for analyzing similarities in information that generates geometric representations for complex objects [2]. MDS appeared in the context of behavioral sciences, for understanding judgments of individuals about features in a set of objects [3, 4]. Presently, MDS is applied to real-world data in areas such as biological taxonomy [5], finance [6], marketing [7], sociology [8], physics [9], geophysics [10-12], communication networks [13], biology and biomedicine [14], among others [15].

The paper is organized as follows. Section 2 introduces the MDS technique. Section 3 studies and compares data characterizing the clinical effects of 21 viruses. Finally, Section 4 draws the conclusions.

Multidimensional Scaling

We consider $s$ objects defined in an $m$-dim space, $\mathbb{R}^m$, and a proximity measure, $\delta_{ij}$, between objects $i$ and $j$. The first step consists of calculating the matrix $\Delta = [\delta_{ij}]$ ($s \times s$) of item-to-item dissimilarities. The MDS produces a configuration $X$ ($s \times q$), where the dimension $q$ is chosen by the user. Thus, $X$ attempts to replicate in a low-dimensional space, $\mathbb{R}^q$, the proximities between the $s$ elements in $\mathbb{R}^m$. In general, the MDS unveils the embedded data patterns, and differs from other techniques [16, 17] not only because it requires no a priori assumptions for each dimension, but also because of its good convergence [18, 19].

To arrive at the configuration $X$, MDS evaluates alternative configurations to minimize some fitness function, such as [20] the raw stress, $\sigma$:

$$\sigma = \sum_{i<j} w_{ij} \left[ d_{ij}(X) - \delta_{ij} \right]^2, \qquad (1)$$

where $w_{ij}$ is a weight and $d_{ij}(X)$ measures the dissimilarity between items $i$ and $j$ in the embedding space $\mathbb{R}^q$. Therefore, a distance measure is often adopted for implementing $d_{ij}$ [21]. Besides (1), there are several other stress measures [22], namely, the normalized raw stress, Kruskal’s stress-1 and stress-2, and the S-stress.

To assess the quality of the MDS solutions, the Shepard diagram is used, which plots the pairs $(\delta_{ij}, d_{ij})$. The Shepard diagram displays the outliers and residuals resulting from the MDS. A narrow scatter following the 45° line corresponds to a good fit between $\delta_{ij}$ and $d_{ij}$. Another test of the MDS quality is the stress plot, which represents $\sigma$ versus $q$. The curve is monotonically decreasing, and the user chooses $q$ as a compromise between reducing $\sigma$ and keeping $q$ small.

The MDS interpretation focuses on the emerging clusters and considers the distances between points in the produced chart. The user can therefore rotate, shift, or zoom the chart, since the relative distances remain invariant. Usually, $q = 2$ or $q = 3$ is adopted, since these values allow a direct graphical representation.
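As a rough illustration of the quantities discussed above, the following sketch embeds a toy dissimilarity matrix, evaluates the raw stress for several values of q (the data behind a stress plot), and collects the pairs that a Shepard diagram would display. It is only a sketch under stated assumptions: it uses classical (Torgerson) MDS from an eigendecomposition instead of iterative stress minimization, Euclidean distances in the embedding space, unit weights, and random toy data. The function names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(delta, q):
    """Classical (Torgerson) MDS: embed an s x s dissimilarity matrix in q dims."""
    s = delta.shape[0]
    centering = np.eye(s) - np.ones((s, s)) / s
    gram = -0.5 * centering @ (delta ** 2) @ centering   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]                 # indices of the q largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))   # negative eigenvalues set to 0
    return eigvecs[:, order] * scale

def raw_stress(delta, config, weights=None):
    """Weighted sum over pairs i < j of (d_ij - delta_ij)^2."""
    d = pairwise_distances(config)
    if weights is None:
        weights = np.ones_like(delta)                     # assumption: unit weights
    upper = np.triu_indices_from(delta, k=1)
    return float((weights[upper] * (d[upper] - delta[upper]) ** 2).sum())

def shepard_pairs(delta, config):
    """Pairs (dissimilarity, embedded distance) that a Shepard diagram plots."""
    upper = np.triu_indices_from(delta, k=1)
    return delta[upper], pairwise_distances(config)[upper]

# Toy data: 6 objects in a 4-dim space (illustrative values only).
rng = np.random.default_rng(0)
objects = rng.random((6, 4))
delta = pairwise_distances(objects)

# Stress versus q: the values a stress plot would show.
stress_by_q = {q: raw_stress(delta, classical_mds(delta, q)) for q in (1, 2, 3, 4)}
for q, value in stress_by_q.items():
    print(q, value)

# Points of the Shepard diagram for q = 2.
dissim, dist = shepard_pairs(delta, classical_mds(delta, 2))
```

With Euclidean toy data, the stress falls as q grows and vanishes once q reaches the dimension of the data, which is the behaviour a user inspects when choosing q.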

Data Analysis and Visualization

We analyze data for viruses responsible for infectious diseases. These are Bird Flu, Chicken Pox, Chikungunya, Dengue Fever, Ebola, Hepatitis B, HIV, Marburg disease, Measles, MERS, Mumps, Norovirus, Polio, Rabies, Rhinovirus, Rotavirus, Rubella, SARS, Seasonal Flu, Smallpox, and Zika virus infection, with acronyms BFlu, CPox, Chi, Den, Ebo, HepB, HIV, Mar, Mea, MERS, Mum, Nor, Pol, Rab, Rhi, Rot, Rub, SARS, SFlu, Sma, and ZIKV, respectively. For the ith virus, i = 1, …, 21, we associate five quantitative attributes, namely, (i) the fatality rate, (ii) the average basic reproductive number, (iii) the average serial interval, (iv) the incubation period, and (v) the virus survival time outside a host. Table 1 lists the data, where the numerical values form a 21 × 5 matrix whose rows are indexed by the virus, i, and whose columns are indexed by the attribute, k = 1, …, 5.

Table 1

Attributes of the viruses

i	Virus	Acronym	Fatality rate (%) (k = 1)	Average basic reproductive number (k = 2)	Average serial interval (days) (k = 3)	Incubation period (days) (k = 4)	Survival outside host (days) (k = 5)
1	Bird Flu	BFlu	59.00	2.00	3.00	3.0	30.0
2	Chicken Pox	CPox	0.00	7.50	14.00	14.0	2.0
3	Chikungunya	Chi	0.40	4.00	23.00	2.5	–
4	Dengue Fever	Den	5.00	3.00	16.00	7.0	63.0
5	Ebola	Ebo	75.00	2.50	15.30	11.4	50.0
6	Hepatitis B	HepB	0.75	6.00	25.00	75.0	28.0
7	HIV	HIV	2.10	3.50	–	60.0	42.0
8	Marburg virus disease	Mar	25.00	1.60	9.00	6.0	21.0
9	Measles	Mea	0.20	15.00	11.70	11.0	0.1
10	MERS	MERS	27.00	0.50	7.60	5.0	3.0
11	Mumps	Mum	0.01	5.50	18.00	17.0	0.3
12	Norovirus	Nor	0.08	3.70	1.86	1.5	24.0
13	Polio	Pol	22.00	6.00	–	13.0	160.0
14	Rabies	Rab	0.00	1.60	–	40.0	6.0
15	Rhinovirus	Rhi	0.00	3.70	7.50	3.0	1.0
16	Rotavirus	Rot	0.00	3.50	7.00	1.5	60.0
17	Rubella	Rub	0.00	6.50	18.30	17.7	0.9
18	SARS	SARS	11.00	3.50	10.00	8.0	9.0
19	Seasonal Flu	SFlu	0.01	1.30	3.30	2.0	2.0
20	Smallpox	Sma	15.00	6.00	17.70	14.0	1.5
21	Zika virus disease	ZIKV	0.30	0.50	13.00	6.0	180.0

The data in Table 1 were obtained from several distinct sources: Influenza A virus, subtype H5N1 (or “Bird Flu”) [23-25]; Chicken Pox (varicella-zoster infection) [26-28]; Chikungunya [29-31]; Dengue Fever [32, 33]; Ebola [34-36]; Hepatitis B [37-39]; Human Immunodeficiency Virus (HIV) [40-42]; Marburg hemorrhagic fever [36, 43]; Measles [44-47]; Middle East Respiratory Syndrome (MERS) [48-50]; Mumps [51, 52]; Norovirus [53, 54]; Poliomyelitis [55-57]; Rabies [58-60]; Rhinovirus [61-63]; Rotavirus [64-67]; Rubella [46, 68]; Severe Acute Respiratory Syndrome (SARS) [49, 69]; Seasonal flu [25, 70, 71]; Smallpox [72, 73]; Zika virus disease [74, 75].

MDS Analysis using the Arc-cosine Distance

Before applying MDS, the data are “normalized” to avoid saturation effects of the numerical values. Therefore, the elements of each column of the attribute matrix are converted to the interval [0, 1], producing the normalized data matrix. The feature vectors for item-to-item comparison correspond to the rows of this matrix. Various distance measures were tested for constructing the dissimilarity matrix. Here, we present results for the arc-cosine distance, since it leads to charts that are easy to interpret. Other distances are possible and have also been used in distinct applications [6, 12], but several numerical tests confirmed that the arc-cosine leads to reliable results. For items i and j, the distance is given by expression (2), which includes one weight per feature, specified by the user. Given expression (2), the dissimilarity matrix can be computed for feeding the MDS. Figure 1 represents the 2D and 3D charts (q = 2 and q = 3) resulting from the MDS using the adopted weights, where the points represent viruses. The relationships between the items are inferred from the coordinates of the points. Objects that are similar (dissimilar) appear closer to (farther from) each other in space.
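The two preprocessing steps above, column normalization and the weighted arc-cosine distance, can be sketched as follows. Several points are assumptions and not taken from the paper: min-max scaling is used to map each column to [0, 1], the weights multiply each vector component before the angle is computed, and the weight values are purely illustrative. Only three rows of Table 1 are used, so the scaled values differ from those obtained with the full table.

```python
import math

# Three rows of Table 1 (BFlu, CPox, Ebo); columns are the five attributes.
raw = {
    "BFlu": [59.00, 2.00, 3.00, 3.0, 30.0],
    "CPox": [0.00, 7.50, 14.00, 14.0, 2.0],
    "Ebo": [75.00, 2.50, 15.30, 11.4, 50.0],
}

def normalize_columns(table):
    """Rescale every column to [0, 1] (min-max scaling is an assumption)."""
    names = list(table)
    scaled_columns = []
    for column in zip(*table.values()):
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0               # guard against constant columns
        scaled_columns.append([(v - lo) / span for v in column])
    return {name: list(row) for name, row in zip(names, zip(*scaled_columns))}

def arc_cosine_distance(x, y, weights):
    """Angle between two feature vectors after weighting each component.

    The placement of the weights is an assumption made for this sketch.
    """
    wx = [w * a for w, a in zip(weights, x)]
    wy = [w * b for w, b in zip(weights, y)]
    dot = sum(a * b for a, b in zip(wx, wy))
    norm = math.sqrt(sum(a * a for a in wx)) * math.sqrt(sum(b * b for b in wy))
    if norm == 0.0:
        raise ValueError("the angle is undefined for an all-zero feature vector")
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp rounding errors
    return math.acos(cosine)

# Hypothetical weights emphasising fatality rate, then reproductive number.
weights = [3.0, 2.0, 1.0, 1.0, 1.0]

normalized = normalize_columns(raw)
names = list(normalized)
distance_matrix = [
    [arc_cosine_distance(normalized[a], normalized[b], weights) for b in names]
    for a in names
]
for name, row in zip(names, distance_matrix):
    print(name, [round(v, 3) for v in row])
```

Because the normalized features are non-negative, every distance computed this way lies between 0 and π/2.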

Fig. 1

MDS charts resulting from the arc-cosine distance, for q = 2 and q = 3

With alternative distances, we capture different characteristics of the phenomena that yield distinct plots but, in general, lead to identical conclusions. A “good” distance is one that produces an MDS chart reflecting the real-world phenomenon in a direct and clear visualization. Figure 2 depicts the Shepard diagram and the stress plot. The Shepard diagram shows a good scatter of points around the 45° line, demonstrating a good fit between the distances and the dissimilarities. The curvature of the stress plot is often adopted for deciding the value of q. In this case, we observe that too small a value of q is insufficient, an intermediate value seems to be a good choice, and larger values lead to limited improvements. However, if we adopt q = 3, the question remains of how to visualize the MDS information efficiently, since 3D representations often require zooming, shifting, and rotating the MDS graph to perceive the real location of the objects in space. This question is further discussed in Sect. 3.3.2 (Improved Visualization in 2D Space).

Fig. 2

Quality of the MDS solution for the arc-cosine distance, assessed by the Shepard diagram and the stress plot

Before continuing, two numerical aspects need to be clarified: the weights used and the missing data in Table 1. The weights were tuned to highlight the features recognized as more harmful from the medical point of view: first, the fatality rate and, second, the average basic reproductive number. However, the question remains of how to choose the weights. In this perspective, we performed several experiments varying the weights. Figure 3 depicts the results obtained with four distinct sets of weight values. For each set, we generate one MDS chart, and afterwards, the charts are combined using Procrustes analysis [76]. Procrustes analysis applies translation, reflection, orthogonal rotation, and scaling to the points of a given matrix so that they best conform to the points of a reference matrix.
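The Procrustes operations listed above can be sketched with a least-squares alignment based on the singular value decomposition. This is a minimal illustration, not the authors' implementation: the function name and the toy charts are hypothetical, and the sketch aligns a single chart to a single reference.

```python
import numpy as np

def procrustes_align(reference, chart):
    """Translate, scale, and rotate/reflect `chart` to best match `reference`.

    Both arguments are s x q arrays whose rows describe the same items.
    """
    ref_mean = reference.mean(axis=0)
    ref_centred = reference - ref_mean
    chart_centred = chart - chart.mean(axis=0)           # translation
    # Best orthogonal map (rotation or reflection) from the SVD of the
    # cross-covariance between the two centred charts.
    u, singular_values, vt = np.linalg.svd(chart_centred.T @ ref_centred)
    rotation = u @ vt
    scale = singular_values.sum() / (chart_centred ** 2).sum()
    return scale * chart_centred @ rotation + ref_mean

# Toy reference chart, and a second chart that is a rotated, scaled, and
# shifted copy of it (illustrative values only).
reference = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.0, 1.0], [2.0, 2.5]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
chart = 2.5 * reference @ rot + np.array([3.0, -1.0])

aligned = procrustes_align(reference, chart)
print(np.abs(aligned - reference).max())   # close to zero for this toy case
```

Superimposing a sequence of charts then amounts to calling the same function for each new chart against the current reference.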

Fig. 3

MDS global chart for the arc-cosine distance, obtained by Procrustes with four different sets of weights

In our case, we (i) choose the first chart as the initial reference, (ii) use Procrustes to superimpose the next MDS chart onto the current reference, (iii) make the current set of superimposed charts the new reference, and (iv) return to step (ii) until all charts have been conformed. The results obtained reveal identical patterns, meaning that the method is robust to distinct values of the weights.

In Fig. 1, the unknown data, denoted by ‘–’ in Table 1, are set to zero. Therefore, these values do not contribute to the distance used for comparing items. Moreover, as the missing data occur in only four values of the less weighted features, their influence is not as significant as that of the rest of the information. In addition, as will be shown in Sect. 3.2 (Sensitivity Analysis), the results reveal small sensitivity to possible noise in the data, which includes the uncertainty in the unknown values that were set to zero. Nonetheless, a different criterion for dealing with that problem could be adopted. Experiments with the missing data replaced not only by zero, but also by the minimum, average, and maximum values in the third and fifth columns of Table 1 led to qualitatively similar results, as depicted in Fig. 4, supporting the criterion adopted.

Fig. 4

MDS global chart for the arc-cosine distance, obtained by Procrustes with missing data replaced by zero, minimum, average, and maximum values of the third and fifth features
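The four replacement criteria for missing data can be illustrated with a short sketch. It assumes that the minimum, average, and maximum are computed over the known values of the same column, which the text does not state explicitly. The helper name is hypothetical, and only a few serial-interval entries of Table 1 are used.

```python
from statistics import mean

# Serial interval (third feature) for a few rows of Table 1;
# None marks an entry listed as missing in the table.
serial_interval = {"BFlu": 3.00, "CPox": 14.00, "HIV": None, "Pol": None, "SARS": 10.00}

def impute(column, strategy):
    """Replace missing entries by zero, or by the min, mean, or max of the known ones."""
    known = [v for v in column.values() if v is not None]
    fill = {
        "zero": 0.0,
        "min": min(known),
        "mean": mean(known),
        "max": max(known),
    }[strategy]
    return {name: fill if v is None else v for name, v in column.items()}

# One completed column per criterion; each would feed a separate MDS run.
variants = {s: impute(serial_interval, s) for s in ("zero", "min", "mean", "max")}
for strategy, column in variants.items():
    print(strategy, column)
```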

Sensitivity Analysis

The 21 viruses were compared from the perspective of quantitative features. However, the data vary slightly, depending on factors such as the time of the study or the operational conditions, namely, environmental conditions, geographic region, development level, or medical assistance. Therefore, we analyze here the sensitivity of the results with respect to the input data. We start by adding random noise to the quantitative features, with an amplitude defined as a fraction of the values in Table 1. Moreover, negative values are avoided by clipping numbers at zero. A sample of 50 experiments, each yielding one MDS chart, is performed, and the charts are combined using the Procrustes scheme. Figure 5 illustrates the resulting MDS chart produced by the Procrustes algorithm. We verify that the method has low sensitivity to variations in the quantitative features, since the locations of the points show only minor variations.

Fig. 5

MDS global chart for the arc-cosine distance, generated by Procrustes with random variations added to the values of the five features
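The perturbation step of the sensitivity analysis can be sketched as below. The noise model is an assumption: uniform noise proportional to each value, with a hypothetical amplitude of 10%, since the amplitude actually used is not recoverable from the text. The clipping at zero is kept as a safeguard, although with a relative amplitude below 1 it never triggers.

```python
import random

def perturb(values, amplitude, rng):
    """Add uniform noise proportional to each value, then clip negatives to zero."""
    noisy = []
    for v in values:
        noise = amplitude * v * rng.uniform(-1.0, 1.0)
        noisy.append(max(0.0, v + noise))
    return noisy

# Ebola row of Table 1 (five attributes).
ebola = [75.00, 2.50, 15.30, 11.4, 50.0]

AMPLITUDE = 0.10            # hypothetical value, chosen only for illustration
rng = random.Random(42)     # fixed seed so the sketch is reproducible

# 50 perturbed copies, one per experiment; each would yield one MDS chart.
experiments = [perturb(ebola, AMPLITUDE, rng) for _ in range(50)]
print(experiments[0])
```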

Data Clustering and Visualization

The MDS interpretation focuses on the distances between points in the produced charts. For identifying clusters, we can adopt an ad hoc strategy based on direct visualization of the MDS plots, or we can implement an algorithm for automatic clustering. In addition, the configuration produced by the MDS tries to replicate, in the low-dimensional space, the original proximities between pairwise elements. Choosing a small q leads to a direct visualization, but neglects the information described in the higher dimensional components of the configuration. In this line of thought, in the next subsections, we introduce the non-hierarchical clustering algorithm K-means for automatic cluster identification, and we propose a technique for improved visualization of MDS information in 2D space by embedding information from the extra dimensions.

The K-Means Clustering

Clustering is a technique that groups objects that are similar to each other in some sense. K-means is a non-hierarchical clustering algorithm [77] that starts with a set of s objects, each represented by a point in a q-dim space, and a number of clusters, K, specified in advance. K-means groups the s objects into K clusters so as to minimize the sum of distances between the points and the centers of their clusters. It produces a solution where objects in a cluster are close to each other and far from objects in other clusters. An important issue in K-means is specifying K, since the notion of “good clustering” is subjective. Nevertheless, we can adopt different measures for assessing the quality of the solution, such as the Calinski-Harabasz and Davies-Bouldin indices and the silhouette [78]. Here, we consider the silhouette, S, to assess whether an object lies “adequately” within its cluster. The silhouette varies in the interval [-1, 1], so that values close to 1 correspond to objects that are well matched to their own cluster. Knowing the coordinates of the objects produced by the MDS in the q-dim space, we assess the quality of the clusters over a range of values of K. Figure 6 depicts the corresponding silhouettes and the mean value for each cluster (blue marks). The optimum value, K = 4, corresponds to the maximum mean silhouette value.

Fig. 6

Silhouettes assessing the quality of the clustering for the arc-cosine distance. The blue marks depict the mean silhouette value for each cluster

For K = 4, the clusters are {CPox, Mea, Mum, Nor, Rhi, Rot, Rub, SFlu}, {HepB, HIV, Rab}, {BFlu, Ebo, Mar, MERS, Pol, SARS, Sma}, and {Chi, Den, ZIKV}. These clusters are further discussed in the next subsection.
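Choosing K by the mean silhouette can be illustrated with the sketch below. It is not the authors' code: it implements Lloyd's K-means with a deterministic farthest-point initialization (an assumption made so that the example is reproducible), uses Euclidean distances, and runs on three artificial, well-separated groups of points instead of the MDS coordinates of the viruses.

```python
import numpy as np

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers, dtype=float)
    for _ in range(iterations):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def silhouette_values(points, labels):
    """Silhouette of every point; requires at least two clusters."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    values = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                     # singleton cluster: silhouette 0 by convention
        a = dist[i, same].sum() / (same.sum() - 1)     # mean distance within own cluster
        b = min(dist[i, labels == c].mean()            # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Toy chart: three tight groups of four points each (illustrative values only).
offsets = np.array([[-0.5, -0.5], [0.5, -0.5], [-0.5, 0.5], [0.5, 0.5]])
points = np.vstack([np.array(c) + offsets for c in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]])

mean_silhouette = {}
for k in range(2, 6):
    labels = kmeans(points, k)
    mean_silhouette[k] = float(silhouette_values(points, labels).mean())

best_k = max(mean_silhouette, key=mean_silhouette.get)
print(mean_silhouette, best_k)
```

For this toy chart, the mean silhouette peaks at the number of groups that were built into the data.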

Improved Visualization in 2D Space

The geometrical shape of the chart produced by MDS varies significantly with the measure adopted for quantifying the distances between items. However, this characteristic does not preclude using the MDS chart and taking full advantage of all its properties. Consequently, we may interpret the collection of points as “samples” of an abstract locus corresponding to the projection of the m initial dimensions into a lower dimensional (abstract) space. We adopt a scheme that allows for a direct 2D visualization of the MDS, while including information from an additional dimension. To that end, we approximate the additional dimension of the MDS configuration with a contour generated by means of a linear radial basis function interpolation [79].

Moreover, we improve the identification of patterns by superimposing a tree on the MDS chart. The nodes of the tree are the points representing items (viruses). In a first phase, we connect the groups of points that are closest in the MDS chart, producing sets of interconnected points (nodes). In a second phase, the sets are compared through the distances between their constituent nodes. The distance can be calculated taking into account any number of components. A connection is established in the q-dim chart only between the two closest nodes of two distinct sets. This calculation generates a second level of interconnection, and the scheme is repeated iteratively until there is a continuous route between all points. Therefore, the interpretation of the MDS chart is based not only on the relative position of the points, but also on the structure interconnecting them.

Figure 7 depicts a projection of the MDS chart, the contour that approximates the additional dimension, and the superimposed interconnections generated by calculating the distances between objects. We easily observe the four clusters identified in the previous subsection. Moreover, we verify that the proposed methodology leads to a clear visualization and produces a richer chart of the objects.
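One possible reading of the two-phase tree construction is sketched below. It is an interpretation, not the authors' algorithm: in the first phase every point is joined to its nearest neighbour, and in the second phase the two closest nodes belonging to different sets are joined repeatedly until all points are connected. The function name and the toy coordinates are hypothetical. Distances are computed with `math.dist`, so the points may have any number of components.

```python
import math
from itertools import combinations

def build_tree(points):
    """Return the edges (index pairs) of a tree connecting all points."""
    n = len(points)
    parent = list(range(n))              # union-find forest tracking the sets

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def link(i, j):
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            return False                 # already in the same set
        parent[root_i] = root_j
        return True

    def dist(i, j):
        return math.dist(points[i], points[j])

    edges = []
    # Phase 1: join every point to its nearest neighbour.
    for i in range(n):
        j = min((k for k in range(n) if k != i), key=lambda k: dist(i, k))
        if link(i, j):
            edges.append((i, j))
    # Phase 2: repeatedly join the two closest nodes lying in different sets.
    while len(edges) < n - 1:
        i, j = min(
            ((a, b) for a, b in combinations(range(n), 2) if find(a) != find(b)),
            key=lambda pair: dist(*pair),
        )
        link(i, j)
        edges.append((i, j))
    return edges

# Toy chart with two visible groups (illustrative values only).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.2)]
edges = build_tree(points)
print(edges)
```

In this toy case, the first phase links the points inside each group and the second phase adds a single bridge between the two groups.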

Fig. 7

MDS chart for the arc-cosine distance. The contour represents an additional dimension, and the superimposed tree allows for an easier identification of patterns

In synthesis, besides the observation based on the relative distances in 2D space, we now verify that ZIKV has a relevant position along the dimension represented by the contour, somewhat strengthening the characteristics revealed by Chikungunya and Dengue.
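The contour used to represent an extra dimension on a 2D chart can be obtained from a linear radial basis function interpolant, sketched below under simple assumptions: the basis function is the Euclidean distance itself, no polynomial term is added, and the configuration is a set of hypothetical 3D coordinates whose third column plays the role of the extra dimension. The function names are illustrative.

```python
import numpy as np

def fit_linear_rbf(nodes, values):
    """Solve for coefficients c so that sum_j c_j * |p - p_j| matches the values."""
    phi = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=2)
    return np.linalg.solve(phi, values)

def evaluate_linear_rbf(nodes, coeffs, query):
    """Evaluate the interpolant at the rows of `query`."""
    phi = np.linalg.norm(query[:, None, :] - nodes[None, :, :], axis=2)
    return phi @ coeffs

# Hypothetical 3D MDS configuration: columns are the three coordinates.
config = np.array([
    [0.0, 0.0, 0.2],
    [1.0, 0.1, -0.3],
    [0.4, 0.9, 0.5],
    [-0.6, 0.5, 0.0],
    [0.2, -0.8, -0.1],
])
xy, z = config[:, :2], config[:, 2]
coeffs = fit_linear_rbf(xy, z)

# Regular grid covering the 2D chart; `surface` holds the values to contour.
gx, gy = np.meshgrid(np.linspace(-1.0, 1.2, 25), np.linspace(-1.0, 1.0, 20))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = evaluate_linear_rbf(xy, coeffs, grid).reshape(gx.shape)
print(surface.shape)
```

The array `surface` can then be passed to any contour-plotting routine and drawn underneath the 2D scatter of the items.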

Discussion of the Results

The clusters do not follow an epidemiological line of thought, but may be of medical value, since they reflect characteristics measured in health care practice. The first cluster (CPox, Mea, Mum, Nor, Rhi, Rot, Rub, and SFlu) includes viruses of Risk Group 2 that, in general, cause neither serious nor life-threatening illness. In the second cluster, we find the Lentivirus responsible for HIV infection and acquired immunodeficiency syndrome (AIDS), a Risk Group 3 agent. We also find Hepatitis B and the Rabies virus, a virus of the Lyssavirus genus and Rhabdoviridae family, of Risk Group 2. In the third cluster, we can consider two subclusters. The first subcluster includes the Ebola and Marburg viruses, which belong to Risk Group 4. In addition, this subcluster contains the agents responsible for MERS and Bird Flu, which are classified as Risk Group 3. The second subcluster includes viruses of different Risk Groups, namely, the Polio virus and the SARS-associated coronavirus, belonging to Risk Groups 2 and 3, respectively. Smallpox is also present [80]. The fourth cluster includes Chikungunya, considered a Risk Group 3 pathogen. It also includes the Dengue fever virus, a Risk Group 2 arbovirus pathogen, and ZIKV, recognized as being similar to the Chikungunya and Dengue viruses. In conclusion, we verified that MDS provides a powerful computational technique for visualizing virus data, and the obtained charts may be of medical interest in the scope of present and future viral outbreaks.

Conclusions

This paper discussed the computational analysis of real-world data describing the main quantitative characteristics of viruses. When dealing with complex scattered data, researchers have to choose between comparing all aspects and detecting the main properties. This problem is challenging, since some information (or its absence) may lead to incomplete or even incorrect conclusions. Therefore, complex information calls for computational and visualization tools capable of revealing the most relevant issues. The MDS technique was adopted, leading to substantive results that agree with present-day scientific knowledge.

