António M Lopes (1), José P Andrade (2), J A Tenreiro Machado (3)
(1) UISPA-LAETA/INEGI, Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal. Electronic address: aml@fe.up.pt
(2) Department of Anatomy, Faculty of Medicine, University of Porto, Alameda Prof. Hernâni Monteiro, 4200-319 Porto, Portugal.
(3) Institute of Engineering, Department of Electrical Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida, 431, 4249-015 Porto, Portugal.
Abstract
BACKGROUND AND OBJECTIVE: Viruses are infectious agents that replicate inside organisms and reveal a plethora of distinct characteristics. Viral infections spread in many ways, but often have devastating consequences and represent a huge danger for public health. It is important to design statistical and computational techniques capable of handling the available data and highlighting the most important features. METHODS: This paper reviews the quantitative and qualitative behaviour of 22 infectious diseases caused by viruses. The information is compared and visualized by means of the multidimensional scaling technique. RESULTS: The results are robust to uncertainties in the data and are consistent with clinical practice. CONCLUSIONS: The paper shows that the proposed methodology may represent a solid mathematical tool to tackle a larger number of viruses and additional information about these infectious agents.
Viruses exert enormous damage on humans worldwide and are the single most important cause of infectious morbidity and mortality. Viral diseases have shaped history since ancient times and continue to do so. These diseases began to be characterized in the 19th century, leading to the identification and differentiation of many viral illnesses [1]. The first viruses were identified at the end of the 19th century, and since then the process of discovery has continued steadily, gaining momentum in recent years. It is now possible to visualize viral structure at an atomic level of resolution, the nucleotide sequences of viral genomes are known, and the functional domains of numerous viruses and enzymes have been established [1], [2]. This information is now being applied to the development of diagnostic tools and effective antiviral therapies.
The classification of viruses has also evolved. Initially, sub-classifications were based on pathologic features such as the preference for a specific organ (for example, the liver in viral hepatitis). Later, some epidemiologic characteristics were used, such as transmission by arthropods (arboviruses, for example) [1]. The current classifications are based on the type and structure of the viral nucleic acid and its replication strategy, the symmetry type of the capsid of the virus, and the presence or absence of a lipid envelope [1], [2].
More than 2000 species of viruses have been identified and approximately 650 are capable of infecting humans and animals [2]. Diseases can range from the common cold to fatal events such as Ebola, Smallpox or Rabies [2]. Globally, viral diseases are very diverse and present several degrees of complexity.
In this study we adopt multidimensional scaling (MDS) to visualize the relationships between 22 selected human viral infectious diseases. Some viruses were selected based on recent viral outbreaks and presence in the media (for example, Influenza A virus subtype H5N1, Ebola and Chikungunya), others were chosen for historical reasons (for example, Rabies, Poliomyelitis and Smallpox), and still others due to their prevalence and incidence in human populations (for example, Influenza, Rhinovirus and Norovirus). For two viral diseases (Human Immunodeficiency Virus and Rabies) we consider both the treated and untreated paradigms of the disease due to the huge discrepancy in mortality.
MDS offers a new perspective for visualizing global data associated with human pathologies. MDS is a set of techniques used to analyse similarities in data that produce spatial or geometric representations of complex objects [3], [4], [5]. MDS had its origin in the behavioural sciences, where it helped in understanding the judgements of individuals (such as preference or relatedness) concerning elements in a set of objects [6], [7], [8]. Nowadays, MDS is used with a large variety of real data, such as biological taxonomy [9], [10], [11], [12], finance [13], [14], marketing [15], sociology [16], physics [17], geophysics [18], [19], [20], communication networks [21], [22], biology and biomedicine [23], [24], among others [25], [26].
Bearing these ideas in mind, the paper is organized as follows. In Section 2 we present the MDS technique. In Section 3 we study and compare data regarding 22 virus diseases. Finally, in Section 4 we draw the main conclusions.
Multidimensional scaling
Given s objects in an m-dimensional space and a measure of proximity, $\delta_{ij}$, between objects i and j, a symmetric s × s matrix, $\Delta = [\delta_{ij}]$, of item-to-item (dis)similarities is calculated in a first step. The MDS algorithm produces an s × q (q < m) configuration, X, representing point coordinates (items), where q is specified by the user. Thus, row i of matrix X gives the coordinates of object i in the q-dimensional embedding space. Configuration X preserves, as well as possible, the proximities between pairwise elements in the higher m-dimensional space and unveils the underlying data structure. MDS is, consequently, different from other similar techniques, such as factor and cluster analysis, because there are no assumptions concerning which factors might drive each dimension. Additionally, MDS is able to treat distinct types of data, has better convergence rates, and is less complex than other methods [3], [27].
In order to arrive at the best configuration X, MDS evaluates different alternative configurations while minimizing a goodness-of-fit function. This problem, equivalent to minimizing the raw stress function, σ², can be formulated as [28]:

$$\sigma^2(X) = \sum_{i<j\le s} w_{ij}\,\bigl[d_{ij}(X) - \delta_{ij}\bigr]^2 \qquad (1)$$

where $w_{ij}$ is a user-chosen non-negative weight and $d_{ij}(X)$ is a measure of the (dis)similarities among the items in the embedding space. Therefore, $d_{ij}(X)$ is usually a distance measure. Smaller (larger) distances between two objects translate into more (less) similarity between them. For example, the Minkowski distance provides a general way to specify distance for quantitative data in a multidimensional space:

$$d_{ij} = \Bigl(\sum_{k=1}^{m} \alpha_k\,\lvert x_{ik} - x_{jk}\rvert^{r}\Bigr)^{1/r} \qquad (2)$$

where $x_{ik}$ is the value of dimension k for object i and $\alpha_k$ is a weight factor. When $\alpha_k = 1$, the Euclidean and the city-block distances are obtained for r = 2 and r = 1, respectively. Nevertheless, the MDS technique allows users to choose other metrics for the comparison of objects that may be better suited to their data. In the sequel we will adopt the Canberra distance and the cosine correlation.
There are different stress measures, such as the normalized raw stress, which is σ² divided by the sum of squared dissimilarities. Possible alternatives are Kruskal's stress-1 and Kruskal's stress-2, which normalize σ² by the sum of squared distances, or by a function of the variances of the distances, respectively. Another example is the S-stress measure, given by the sum of squared errors between squared distances and squared dissimilarities [29], [30].
The Shepard diagram is used to assess the quality of the MDS solution. It plots, for each pair of objects i and j, the distance $d_{ij}$ obtained in the embedding space against the original dissimilarity $\delta_{ij}$. The deviation of each point from the reference line gives the approximation error for that pair. The Shepard diagram is thus useful for visualizing the residuals and outliers resulting from the MDS application to the data. A narrow scatter around the 45 degree line indicates a good fit between $d_{ij}$ and $\delta_{ij}$.
The stress plot represents σ² versus the number of dimensions q of the MDS maps. Usually, we get a monotonically decreasing chart and we choose q as a compromise between reducing σ² and having a low dimension for the MDS charts.
MDS can be divided according to the classification of data similarities, the number of similarity matrices and the nature of the MDS model. We thus have non-metric or metric MDS, depending on whether the similarity data are qualitative or quantitative. Concerning the number of similarity matrices and the nature of the model, we have classical MDS (i.e., with one matrix and unweighted models), replicated MDS (i.e., with several matrices and unweighted models) and weighted MDS (i.e., with several matrices and weighted models).
The MDS interpretation is based on the emerging clusters and distances between points in the map, rather than on their absolute coordinates or the geometrical form of the locus. Thus, we can rotate or translate the MDS chart, since the distances between points remain identical. Usually, two- or three-dimensional charts are selected, because they allow a direct graphical representation.
MDS has advantages over other methods, such as principal component analysis (PCA), since MDS can handle similarity/dissimilarity matrices based on several distinct metrics. MDS uses the inter-object distances rather than the coordinates of the objects and, therefore, it turns out that MDS is a more general method than PCA [18], [31].
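The quantities defined in this section can be illustrated with a minimal Python sketch. It assumes unit pair weights unless a weight matrix is given, and the function names (`minkowski`, `raw_stress`, `kruskal_stress1`) as well as the three-point example are hypothetical, introduced only for illustration; it evaluates the stress of a given configuration and does not perform the minimization itself.

```python
from itertools import combinations


def minkowski(x, y, r=2.0, alpha=None):
    """Weighted Minkowski distance; r=2 is Euclidean, r=1 is city-block."""
    if alpha is None:
        alpha = [1.0] * len(x)  # unit weights recover the unweighted distance
    return sum(a * abs(xi - yi) ** r for a, xi, yi in zip(alpha, x, y)) ** (1.0 / r)


def raw_stress(delta, coords, weights=None, r=2.0):
    """Raw stress: sum over pairs i<j of w_ij * (d_ij(X) - delta_ij)^2."""
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        w = 1.0 if weights is None else weights[i][j]
        d = minkowski(coords[i], coords[j], r)
        total += w * (d - delta[i][j]) ** 2
    return total


def kruskal_stress1(delta, coords):
    """Kruskal's stress-1: sqrt(raw stress / sum of squared distances)."""
    pairs = combinations(range(len(coords)), 2)
    squared = sum(minkowski(coords[i], coords[j]) ** 2 for i, j in pairs)
    return (raw_stress(delta, coords) / squared) ** 0.5


# Hypothetical example: dissimilarities of a 3-4-5 right triangle.
delta = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
exact = raw_stress(delta, [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)])
distorted = raw_stress(delta, [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)])
print(exact, distorted, kruskal_stress1(delta, [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]))
```

A configuration that reproduces the dissimilarities exactly gives zero stress, while the distorted one gives a positive value, which is the quantity an MDS solver would try to reduce.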
Multidimensional scaling analysis
In this section we use MDS tools to visualize the relationships between s = 22 infectious diseases caused by viruses, namely Bird Flu (BFlu), Chicken Pox (CPox), Chikungunya (Chi), Dengue Fever (Den), Ebola (Ebo), Hepatitis B (HepB), HIV (HIV), untreated HIV (HIV Un), Marburg virus disease (Mar), Measles (Mea), MERS (MERS), Mumps (Mum), Norovirus (Nor), Polio (Pol), Rabies (Rab), untreated Rabies (Rab Un), Rhinovirus (Rhi), Rotavirus (Rot), Rubella (Rub), SARS (SARS), Seasonal Flu (SFlu) and Smallpox (Sma).
To each disease i, i = 1, …, s, we associate $m_1 = 5$ quantitative and $m_2 = 2$ qualitative features. Therefore, we have an $m = m_1 + m_2 = 7$ dimensional space of attributes. We start by pre-processing the quantitative and the qualitative data, yielding a new equivalent µ-dimensional space (to be defined in the sequel) for disease comparison.
Table 1 lists the quantitative data for disease i = 1, …, 22 and attribute g = 1, …, 5. We consider the disease fatality rate, the average basic reproductive number, the average serial interval, the incubation period and the virus survival time outside the host.
Table 1
Quantitative attributes of the diseases considered in the study.
| i | Disease | Acronym | Fatality rate (%) (g = 1) | Average basic reproductive number (g = 2) | Average serial interval (days) (g = 3) | Incubation period (days) (g = 4) | Survival outside host (days) (g = 5) |
|---|---|---|---|---|---|---|---|
| 1 | Bird flu | BFlu | 59.00 | 2.00 | 3.00 | 3.0 | 30.0 |
| 2 | Chicken pox | CPox | 0.00 | 7.50 | 14.00 | 14.0 | 2.0 |
| 3 | Chikungunya | Chi | 0.40 | 4.00 | 23.00 | 2.5 | – |
| 4 | Dengue fever | Den | 5.00 | 3.00 | 16.00 | 7.0 | 63.0 |
| 5 | Ebola | Ebo | 75.00 | 2.50 | 15.30 | 11.4 | 50.0 |
| 6 | Hepatitis B | HepB | 0.75 | 6.00 | 25.00 | 75.0 | 28.0 |
| 7 | HIV | HIV | 2.10 | 3.50 | – | 60.0 | 42.0 |
| 8 | HIV (untreated) | HIV Un | 80.00 | 3.50 | – | 60.0 | 42.0 |
| 9 | Marburg virus disease | Mar | 25.00 | 1.60 | 9.00 | 6.0 | 21.0 |
| 10 | Measles | Mea | 0.20 | 15.00 | 11.70 | 11.0 | 0.1 |
| 11 | MERS | MERS | 27.00 | 0.50 | 7.60 | 5.0 | 3.0 |
| 12 | Mumps | Mum | 0.01 | 5.50 | 18.00 | 17.0 | 0.3 |
| 13 | Norovirus | Nor | 0.08 | 3.70 | 1.86 | 1.5 | 24.0 |
| 14 | Polio | Pol | 22.00 | 6.00 | – | 13.0 | 160.0 |
| 15 | Rabies | Rab | 0.00 | 1.60 | – | 40.0 | 6.0 |
| 16 | Rabies (untreated) | Rab Un | 100.00 | 1.60 | – | 40.0 | 6.0 |
| 17 | Rhinovirus | Rhi | 0.00 | 3.70 | 7.50 | 3.0 | 1.0 |
| 18 | Rotavirus | Rot | 0.00 | 3.50 | 7.00 | 1.5 | 60.0 |
| 19 | Rubella | Rub | 0.00 | 6.50 | 18.30 | 17.7 | 0.9 |
| 20 | SARS | SARS | 11.00 | 3.50 | 10.00 | 8.0 | 9.0 |
| 21 | Seasonal flu | SFlu | 0.01 | 1.30 | 3.30 | 2.0 | 2.0 |
| 22 | Smallpox | Sma | 15.00 | 6.00 | 17.70 | 14.0 | 1.5 |
Tables 2 and 3 summarize the qualitative features, namely the transmission mode and the main symptoms of the disease.
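Table 2
Main transmission mode of the diseases considered in the study.
| i | Disease | Acronym | Animal–human (h = 1) | Airborne droplet (h = 2) | Bites (h = 3) | Body fluids (h = 4) | Sexual contact (h = 5) | Surfaces (h = 6) | Faecal–oral (h = 7) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Bird flu | BFlu | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Chicken pox | CPox | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Chikungunya | Chi | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | Dengue fever | Den | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | Ebola | Ebo | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | Hepatitis B | HepB | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 7 | HIV | HIV | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 8 | HIV (untreated) | HIV Un | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 9 | Marburg virus disease | Mar | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10 | Measles | Mea | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 11 | MERS | MERS | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 12 | Mumps | Mum | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 13 | Norovirus | Nor | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 14 | Polio | Pol | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | Rabies | Rab | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 16 | Rabies (untreated) | Rab Un | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 17 | Rhinovirus | Rhi | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 18 | Rotavirus | Rot | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 19 | Rubella | Rub | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 20 | SARS | SARS | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 21 | Seasonal flu | SFlu | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 22 | Smallpox | Sma | 0 | 1 | 0 | 0 | 0 | 0 | 0 |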
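Table 3
Main symptoms of the diseases considered in the study.
Symptom index l: 1 Fever; 2 Cough; 3 Sore throat; 4 Muscle and body aches; 5 Nausea; 6 Abdominal pain; 7 Diarrhoea; 8 Vomiting; 9 Rash; 10 Itchy; 11 Fluid blisters; 12 Tiredness, fatigue, weakness; 13 Loss of appetite; 14 Headache; 15 Joint pain; 16 Joint swelling; 17 Bleeding; 18 Jaundice; 19 Swollen glands; 20 Chills; 21 Chest pain; 22 Runny nose; 23 Conjunctivitis; 24 Red spots; 25 Shortness of breath; 26 Confusion; 27 Agitation; 28 Sneezing.
| i | Disease | Acronym | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bird flu | BFlu | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Chicken pox | CPox | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Chikungunya | Chi | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Dengue fever | Den | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Ebola | Ebo | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | Hepatitis B | HepB | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | HIV | HIV | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | HIV (untreated) | HIV Un | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | Marburg virus disease | Mar | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | Measles | Mea | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 11 | MERS | MERS | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 12 | Mumps | Mum | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | Norovirus | Nor | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | Polio | Pol | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | Rabies | Rab | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 16 | Rabies (untreated) | Rab Un | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 17 | Rhinovirus | Rhi | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 18 | Rotavirus | Rot | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | Rubella | Rub | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | SARS | SARS | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | Seasonal flu | SFlu | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 22 | Smallpox | Sma | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |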
The transmission mode dimension is represented as binary data (Table 2), indexed by disease i = 1, …, 22 and transmission mode h = 1, …, 7. This means that we consider t = 7 conditions, specifically animal–human, airborne droplet, bites, body fluids, sexual contact, surfaces and faecal–oral. A value of 1 means that disease i has the transmission mode h, and a value of 0 means that it does not.
In a similar way, for the symptoms dimension we consider y = 28 indicators, represented as binary data indexed by disease i = 1, …, 22 and symptom l = 1, …, 28 (Table 3). A value of 1 (or 0) means that disease i has (or does not have) symptom l.
Before applying the MDS algorithm we start by “normalizing” the quantitative data to the interval [0, 1]. The qualitative data are also rescaled. In this way we avoid having some features saturating the numerical values.
We proceed by constructing the vectors of features, yielding a µ-dimensional space that embeds all quantitative and qualitative data. This is equivalent to the m-dimensional space defined previously for disease comparison.
In the next subsections we use two indices to compare the preprocessed data, namely the Canberra distance and the cosine correlation. Other indices can be adopted, but these two are sufficient to explain the working concepts. We then apply the MDS technique and interpret the generated maps. Fig. 1 depicts a synoptic diagram of the disease characteristics and quantification method.
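As an illustration of the pre-processing step, the sketch below scales each quantitative column to [0, 1] by dividing by the column maximum. This particular scaling is an assumption made for illustration and may differ from the one used in the study, as are the function name `normalize_columns` and the choice to carry missing entries (the dashes in Table 1) through as `None`. The three rows are the Bird flu, Chicken pox and Chikungunya entries of Table 1.

```python
def normalize_columns(rows):
    """Scale every column to [0, 1] by its maximum; None (missing) stays None."""
    n_cols = len(rows[0])
    maxima = []
    for k in range(n_cols):
        present = [row[k] for row in rows if row[k] is not None]
        maxima.append(max(present) if present else 0.0)
    return [
        [
            None if value is None
            else (value / maxima[k] if maxima[k] > 0 else 0.0)
            for k, value in enumerate(row)
        ]
        for row in rows
    ]


# Rows 1-3 of Table 1 (fatality rate, R0, serial interval, incubation, survival).
sample = [
    [59.00, 2.00, 3.00, 3.0, 30.0],   # Bird flu
    [0.00, 7.50, 14.00, 14.0, 2.0],   # Chicken pox
    [0.40, 4.00, 23.00, 2.5, None],   # Chikungunya (survival value missing)
]
normalized = normalize_columns(sample)
for row in normalized:
    print(row)
```

With all 22 rows the column maxima would of course differ from those of this three-row excerpt, so the printed values are only indicative.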
Fig. 1
Synoptic diagram of the disease characteristics and quantification method. The indices denote: i, disease; g, quantitative data; h, qualitative data (transmission mode); l, qualitative data (symptoms).
In constructing Tables 1, 2 and 3, data were collected from the following sources: Influenza A virus subtype H5N1, commonly known as “Bird Flu” [32], [33], [34], [35]; Chicken Pox (varicella-zoster virus infection) [36], [37], [38], [39]; Chikungunya [40], [41], [42]; Dengue Fever [43], [44]; Ebola [45], [46], [47]; Hepatitis B [48], [49], [50]; Human Immunodeficiency Virus (HIV) and untreated HIV [51], [52], [53], [54]; Marburg haemorrhagic fever [47], [55]; Measles [56], [57], [58], [59], [60]; Middle East Respiratory Syndrome (MERS) [61], [62], [63]; Mumps [64], [65]; Norovirus [66], [67]; Poliomyelitis [68], [69], [70]; Rabies and untreated Rabies [71], [72], [73], [74]; Rhinovirus [75], [76], [77]; Rotavirus [78], [79], [80], [81]; Rubella [58], [82]; Severe Acute Respiratory Syndrome (SARS) [61], [83]; Influenza causing seasonal flu [34], [84], [85]; and Smallpox [86], [87].
MDS analysis based on the Canberra distance
In this subsection we consider the construction of matrix X using a measure based on the Canberra distance between diseases i and j. Given this index, the s × s symmetric dissimilarity matrix is computed and the MDS tool applied. While several MDS criteria were tested, the Sammon criterion yielded good results and was adopted in all calculations. It should be noted that this criterion tries to optimize a cost function that describes how well the pairwise distances in a data set are preserved [88], [89].
Fig. 2 depicts the 2- and 3-dimensional (2D and 3D) maps produced by MDS. Each point represents a disease, denoted by the corresponding label as shown in Tables 1, 2 and 3. We can observe that the Canberra index leads to poor clustering. Nonetheless, we should note that MDS is merely a mathematical clustering and visualization tool and that a physical interpretation of the reported results must be sought in the light of the comparison index [90]. Therefore, any further explanation of the physical mechanisms associated with the results must come from standard complementary procedures.
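For orientation, the sketch below computes the textbook (unweighted) Canberra distance, builds the symmetric dissimilarity matrix, and evaluates Sammon's stress for a candidate configuration. Whether the study applies extra weights inside the Canberra index is not recoverable here, so the unweighted form is an assumption; skipping missing values and zero denominators is likewise a choice of this sketch, and all names and data are hypothetical.

```python
import math
from itertools import combinations


def canberra(x, y):
    """Canberra distance: sum_k |x_k - y_k| / (|x_k| + |y_k|)."""
    total = 0.0
    for a, b in zip(x, y):
        if a is None or b is None:
            continue  # assumption: ignore attributes missing in either vector
        denom = abs(a) + abs(b)
        if denom > 0:  # a 0/0 term is conventionally taken as zero
            total += abs(a - b) / denom
    return total


def dissimilarity_matrix(vectors, dist):
    """Symmetric s x s matrix of pairwise dissimilarities with zero diagonal."""
    s = len(vectors)
    matrix = [[0.0] * s for _ in range(s)]
    for i, j in combinations(range(s), 2):
        matrix[i][j] = matrix[j][i] = dist(vectors[i], vectors[j])
    return matrix


def sammon_stress(delta, coords):
    """Sammon's stress: (1 / sum delta) * sum (delta - d)^2 / delta over i<j."""
    pairs = list(combinations(range(len(coords)), 2))
    scale = sum(delta[i][j] for i, j in pairs)
    error = sum(
        (delta[i][j] - math.dist(coords[i], coords[j])) ** 2 / delta[i][j]
        for i, j in pairs
        if delta[i][j] > 0
    )
    return error / scale


# Hypothetical feature vectors, for illustration only.
features = [[1.0, 0.0, 3.0], [3.0, 0.0, 1.0], [0.5, 1.0, 0.0]]
matrix = dissimilarity_matrix(features, canberra)
triangle = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
stress = sammon_stress(triangle, [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)])
print(matrix[0][1], stress)
```

Note that for binary attributes every mismatched pair contributes exactly 1 to the Canberra sum and every matched pair contributes 0, so on such data the index reduces to a count of differing attributes.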
Fig. 2
MDS maps for the Canberra index with representations: (A) 2D; (B) 3D.
Figs. 3 and 4 depict the Shepard and stress plots, respectively, which represent standard tools for the assessment of the MDS results. The Shepard diagram shows a good distribution of points around the 45 degree line, particularly for the 3D representation, which means a good fit of the distances to the dissimilarities. The stress plot reveals that a three-dimensional space describes well the locus of the s = 22 diseases. In fact, the stress diminishes strongly until q = 2, moderately towards q = 3 and weakly thereafter. The maximum curvature point of the stress plot is often adopted as the criterion for deciding the dimensionality of the MDS maps. This means that, although four or more dimensions would represent the data slightly more accurately, 3D maps represent a good compromise between accuracy and ease of visualization.
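The maximum-curvature criterion can be approximated on a discrete stress curve by the largest second difference. The sketch below does this; the stress values are invented solely to exercise the function and are not taken from Fig. 4, and the function name `elbow_dimension` is hypothetical.

```python
def elbow_dimension(stress):
    """Return the dimension q with the largest discrete second difference.

    stress[k] holds the stress obtained for q = k + 1, so interior points
    correspond to q = 2, ..., len(stress) - 1.
    """
    curvature = {
        q: stress[q - 2] - 2.0 * stress[q - 1] + stress[q]
        for q in range(2, len(stress))
    }
    return max(curvature, key=curvature.get)


# Invented stress values for q = 1, ..., 6 (illustration only).
stress_by_q = [0.40, 0.22, 0.06, 0.05, 0.045, 0.04]
print(elbow_dimension(stress_by_q))
```

A second difference is only a rough proxy for curvature, so in practice the choice of q is usually confirmed by inspecting the plot, as done in the text.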
Fig. 3
Shepard plots for the Canberra index for representations: (A) 2D; (B) 3D.
Fig. 4
Stress versus q plot for the Canberra index.
MDS analysis based on the cosine correlation
In this subsection we adopt the cosine correlation to construct the matrix X. For each disease pair i and j the index includes weights, α, specified by the user, which are usually chosen to favour adequate clustering. Given this index, the s × s symmetric matrix is computed and the MDS is applied.
Fig. 5 represents the 2D and 3D maps resulting from the MDS analysis for the weights α = 0.5 and α = 2. The Shepard and stress plots are analogous to those shown in Figs. 3 and 4, revealing, as before, that the MDS results are trustworthy. We now observe the emergence of a different pattern, but the main idea of clustering remains. This observation is usual in MDS charts, where alternative indices capture different characteristics of the phenomena and lead to distinct plots, while allowing the same conclusions. The “best” index is simply the one that produces an MDS map whose clusters reflect the real world most directly.
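A weighted cosine correlation can be sketched as follows. The way the weights enter the index (one weight per feature group, applied inside the dot product and both norms), the weight of 1.0 for the quantitative group, the group sizes 5, 7 and 28, and the conversion to a dissimilarity as one minus the correlation are all assumptions of this sketch rather than the study's exact expression.

```python
import math


def weighted_cosine(x, y, alpha=None):
    """Cosine correlation with per-component weights alpha_k."""
    if alpha is None:
        alpha = [1.0] * len(x)
    dot = sum(a * xi * yi for a, xi, yi in zip(alpha, x, y))
    norm_x = math.sqrt(sum(a * xi * xi for a, xi in zip(alpha, x)))
    norm_y = math.sqrt(sum(a * yi * yi for a, yi in zip(alpha, y)))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention of this sketch for all-zero vectors
    return dot / (norm_x * norm_y)


def group_weights(sizes, values):
    """Expand one weight per feature group into one weight per component."""
    return [value for size, value in zip(sizes, values) for _ in range(size)]


# Assumed layout: 5 quantitative, 7 transmission and 28 symptom components.
alpha = group_weights([5, 7, 28], [1.0, 0.5, 2.0])
parallel = weighted_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = weighted_cosine([1.0, 0.0], [0.0, 1.0])
dissimilarity = 1.0 - parallel  # assumed conversion for use as MDS input
print(len(alpha), parallel, orthogonal, dissimilarity)
```

Raising the weight of a group makes agreement or disagreement within that group count more in the index, which is why the weights can be tuned to favour clearer clustering.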
Fig. 5
MDS maps for the cosine correlation with representations: (A) 2D; (B) 3D.
Clustering analysis
The standard MDS analysis is based on the object groups in the final map. We can rely either on the direct visualization of the plot, or on the implementation of some extra algorithm to extract the clusters.
Bearing this idea in mind, in subsection 3.3.1 we adopt the non-hierarchical clustering algorithm K-means to identify clusters in the MDS map. In subsection 3.3.2 we use hierarchical clustering to confirm the results obtained. In subsection 3.3.3 we analyse the sensitivity of the MDS maps. In subsection 3.3.4 we discuss the results.
We restrict the analysis to the cosine correlation metric since it yielded better results.
K-means clustering
Clustering consists of grouping objects that are, in some sense, similar to each other. K-means is a non-hierarchical clustering method commonly used in machine learning and data mining [91]. The algorithm starts with a collection of s objects, where each object is a point in a q-dimensional space, and a given number of clusters, K, specified in advance by the user. K-means groups the s objects into K ≤ s clusters, so as to minimize the objective function given by the sum of distances between the points and the centres of their clusters. K-means arrives at a solution in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible.
A key issue in K-means is how to determine the correct number of clusters, K. It should be noted that the very notion of “good clustering” is subjective and depends on the point of view. However, we can rely on different indices to measure the quality of the clustering, namely the Davies–Bouldin, the Caliński–Harabasz and the silhouette indices [92], [93], [94].
In this work we adopt the silhouette to compare different solutions. The silhouette value, S, of each object is a measure of how well that object lies within its cluster. Silhouette values vary in the interval S ∈ [−1, 1]. Values close to S = 1 correspond to objects that are very distant from neighbouring clusters and are, therefore, assigned to the right cluster. If S = 0, then the objects could equally be assigned to another cluster. Values close to S = −1 indicate that the objects are probably assigned to the wrong cluster.
Given the coordinates of the s = 22 objects in the q = 3 dimensional space generated by the MDS, we evaluate the clusters identified by the K-means algorithm when varying the number of clusters in the interval K ∈ [2, 7].
Fig. 6 depicts the silhouette average values versus the number of clusters, K, leading to the optimum value K = 4. Fig. 7 illustrates the shape of the silhouettes obtained for different values of K, where we can see that the best shape is obtained for K = 4.
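The silhouette value described above can be computed directly from its standard definition, S = (b − a) / max(a, b), where a is the mean distance from an object to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below assumes Euclidean distances and at least two clusters; the four points and the two labelings are hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Silhouette S = (b - a) / max(a, b) for every object (needs >= 2 clusters)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical points: two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])   # natural grouping
bad = silhouette_values(points, [0, 1, 0, 1])    # deliberately mixed grouping
print(sum(good) / len(good), sum(bad) / len(bad))
```

The natural grouping gives an average close to 1, whereas the mixed grouping gives a negative average, which is the behaviour used to select K.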
Fig. 6
Silhouette average values versus number of clusters K, for K ∈ [2, 7].
Fig. 7
Silhouette shape for different values of K.
For K = 4, the K-means generates four clusters, to be discussed in subsection 3.3.4.
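A minimal K-means sketch in the standard Lloyd form is given below. It alternates between assigning each point to its nearest centre and recomputing the centres as cluster means, which reduces the sum of squared Euclidean distances to the centres. The farthest-first initialization is an assumption chosen to keep the example deterministic, and the six 3D points are hypothetical rather than coordinates from the MDS map.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    centres = [points[0]]
    while len(centres) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centres


# Hypothetical 3D points forming two obvious groups.
cloud = [
    (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0),
    (10.0, 10.0, 10.0), (10.0, 11.0, 10.0), (11.0, 10.0, 10.0),
]
labels, centres = kmeans(cloud, k=2)
print(labels)
```

Running this for each K in the range of interest and scoring each result with the silhouette gives the kind of comparison summarized in Fig. 6.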
Hierarchical clustering
As an alternative approach, not involving the MDS, we use a hierarchical clustering algorithm that is fed directly with the matrix of the cosine correlation index.
Fig. 8 depicts the dendrogram generated by successive (agglomerative) clustering with the average-linkage method [95], [96]. We cut the tree at the level 0.27, since below this value the clusters become too close to each other. We see that four clusters emerge with this method, confirming the results obtained in the previous subsection. Nevertheless, MDS uses the space more efficiently and produces a more informative map of the objects.
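Agglomerative clustering with average linkage and a cut level can be sketched as below. The matrix is assumed to hold dissimilarities, so that small values mean similar objects; the 4 × 4 matrix is hypothetical and only the cut level 0.27 is taken from the text. The function name is illustrative.

```python
def average_linkage_clusters(dist, cut):
    """Merge the closest pair of clusters until the smallest average
    inter-cluster dissimilarity exceeds the cut level."""
    clusters = [[i] for i in range(len(dist))]

    def average(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        value, p, q = min(
            (average(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if value > cut:
            break  # every remaining merge would cross the cut level
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical dissimilarities: objects 0-1 and 2-3 are close, the rest far.
example = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
groups = average_linkage_clusters(example, cut=0.27)
print(groups)
```

Cutting a dendrogram at a given height is equivalent to stopping the merging once the linkage value exceeds that height, which is what the loop does.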
Fig. 8
Dendrogram generated by the hierarchical clustering algorithm for the cosine correlation index.
Sensitivity analysis
The s = 22 virus diseases were compared based on quantitative and qualitative features. As qualitative characteristics are subjective, their influence upon the final results needs to be analysed, so as to prevent biased conclusions.
In this line of thought, we vary the weights, α, of the two qualitative features within a given interval and check their influence on the generated MDS map. Each weight is discretized into r distinct values evenly spaced in its interval, and then r × r instances of the s × s similarity matrix are calculated. We use these matrices as the input for the MDS algorithm, which generates r × r intermediate maps of “points” (i.e., one map per pair of weights). Finally, the charts are processed by means of Procrustes analysis in order to obtain a single global plot of “shapes”, where the “points” of the original maps are optimally superimposed [97].
Procrustes analysis performs linear transformations, namely translation, reflection, orthogonal rotation and scaling, with the objective of minimizing a measure of the difference between the “points” in the original maps. The algorithm (i) chooses one MDS map for reference (by selecting one of the available instances); (ii) superimposes all other MDS instances onto the current reference; (iii) computes the mean form of the current set of superimposed maps; (iv) compares the distance between the mean and the reference instances to a given threshold value and, if above, sets the reference to the mean form and returns to step (ii).
Fig. 9 shows the 3D MDS global map obtained by the Procrustes algorithm, as well as the clusters identified previously. As can be seen, the results are quite robust to large variations in the qualitative features, since the r × r points corresponding to each original object (disease) deviate somewhat, but the clusters remain.
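Step (ii) of the procedure, the superposition of one map onto a reference, can be sketched in two dimensions, where the optimal rotation has a closed form. This is a simplification: the study works with 3D maps, for which the rotation is normally obtained from a singular value decomposition, and the iterative mean-form loop of steps (iii) and (iv) is not shown. All names and data are hypothetical.

```python
import math


def _centre_and_normalise(points):
    """Translate to the centroid and scale to unit Frobenius norm
    (assumes the points are not all identical)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    centred = [(x - cx, y - cy) for x, y in points]
    norm = math.sqrt(sum(x * x + y * y for x, y in centred))
    return [(x / norm, y / norm) for x, y in centred]


def _rotate_onto(a, b):
    """Optimal rotation and scale of a onto b (both centred, unit norm)."""
    cos_term = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    sin_term = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(sin_term, cos_term)
    c, s = math.cos(theta), math.sin(theta)
    scale = math.hypot(cos_term, sin_term)  # optimal scale for unit-norm input
    aligned = [(scale * (c * x - s * y), scale * (s * x + c * y)) for x, y in a]
    residual = sum(
        (px - bx) ** 2 + (py - by) ** 2 for (px, py), (bx, by) in zip(aligned, b)
    )
    return aligned, residual


def procrustes_2d(reference, other):
    """Superimpose `other` onto `reference`, allowing a reflection."""
    b = _centre_and_normalise(reference)
    a = _centre_and_normalise(other)
    mirrored = [(x, -y) for x, y in a]
    return min(_rotate_onto(a, b), _rotate_onto(mirrored, b), key=lambda r: r[1])


# Hypothetical map and a copy rotated by 90 degrees, scaled by 3 and shifted.
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
moved = [(5.0 - 3.0 * y, -2.0 + 3.0 * x) for x, y in reference]
aligned, residual = procrustes_2d(reference, moved)
flipped = [(-x, y) for x, y in reference]
_, residual_flipped = procrustes_2d(reference, flipped)
print(residual, residual_flipped)
```

Both residuals are numerically zero because the second map differs from the reference only by the transformations that Procrustes analysis removes; maps that differ in shape leave a positive residual.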
Fig. 9
Three-dimensional MDS global map for the cosine correlation, obtained by Procrustes analysis with varying weights of the qualitative features.
From a complementary perspective, we address in the sequel the sensitivity of the MDS results to the quantitative features. In fact, the quantitative values found in the literature diverge slightly, since they depend on the time of the study and on the conditions observed in each particular case, namely environmental conditions (e.g., temperature, humidity, pressure), geographic region, human development and medical assistance, among others.
To assess the sensitivity we add random noise to the values of the quantitative features, with amplitude in the interval ±10% of the values shown in Table 1 (values are limited to zero to avoid negative numbers). Under these conditions, we perform ten experiments, generating one MDS map per trial, and then the individual MDS maps are combined using Procrustes analysis.
Fig. 10 illustrates the 3D MDS global map obtained and the corresponding superimposed clusters. We conclude that the method is robust to variations in the values of the quantitative features used in the study.
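The perturbation experiment can be sketched as follows. The text specifies the ±10% amplitude and the clipping at zero; drawing the noise from a uniform distribution, the fixed seed, and the function name `perturb` are assumptions of this sketch. The row used is the Bird flu entry of Table 1.

```python
import random


def perturb(values, amplitude=0.10, rng=None):
    """Apply relative noise in [-amplitude, +amplitude] and clip at zero."""
    if rng is None:
        rng = random.Random()
    noisy = []
    for value in values:
        if value is None:
            noisy.append(None)  # missing entries stay missing
        else:
            factor = 1.0 + rng.uniform(-amplitude, amplitude)
            noisy.append(max(0.0, value * factor))
    return noisy


# Bird flu row of Table 1; ten perturbed trials with a fixed seed.
bird_flu = [59.00, 2.00, 3.00, 3.0, 30.0]
rng = random.Random(0)
trials = [perturb(bird_flu, rng=rng) for _ in range(10)]
print(trials[0])
```

Each perturbed data set would then go through the same normalization, index computation and MDS steps, and the ten resulting maps would be superimposed with Procrustes analysis.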
Fig. 10
Three-dimensional MDS global map for the cosine correlation, obtained by Procrustes analysis with random variations in the values of the quantitative features.
Discussion
The clusters have interesting medical and epidemiological value. In the first cluster, we find the Ebola and the Marburg viruses, which can cause serious and, most of the time, lethal human disease, even with therapeutic interventions. According to the National Institutes of Health (NIH) Guidelines for Research Involving Recombinant or Synthetic Nucleic Acid Molecules (http://osp.od.nih.gov/sites/default/files/NIH_Guidelines.html#_Toc351276292) of the USA, from November 2013, those viruses are classified as Risk Group 4 pathogens. There are no vaccines or effective treatments for their infections. Accordingly, these viruses are manipulated only in Biosafety Level 4 conditions due to their high individual and community risk. They are also considered biological agents with material threat determinations in the scope of bioterrorism, in the USA [98]. Also, the highly infectious agents responsible for MERS (from the Coronaviridae family) and Bird flu (Influenza A virus subtype H5N1), classified as Risk Group 3, are mapped in this cluster. These viruses are recommended to be manipulated with Biosafety Level 3 precautions, indicated for agents that may cause serious or potentially lethal disease. In this cluster, untreated HIV and untreated rabies infections are also present.
In contrast, and distant in the MDS map, we have a second cluster. Most of its viruses are in Risk Group 2, because they generally do not cause serious or life-threatening illness and most of them are readily treated or easily prevented with vaccines. They are manipulated, as most viruses are, in Biosafety Level 2 environments. The exception is Chikungunya, an important cause of febrile illness in the world, now re-emerging as a cause of large outbreaks of human disease [99]. The arbovirus (arthropod-borne) alphavirus responsible for this pathology is considered a Risk Group 3 pathogen and requires Biosafety Level 3 precautions [100].
A third cluster presents several virus species of different Risk Groups. Polio virus is a Risk Group 2 pathogen, as is the dengue fever virus, an arbovirus. On the other hand, the SARS-associated coronavirus is a Risk Group 3 pathogen. Furthermore, smallpox, the disease caused by the variola virus, is also present in this cluster. Variola is considered a life-threatening disease posing the highest risk to national security, owing to its potential use as a biological weapon, its high mortality rates and its major public health impact [98]. The reason is that smallpox was declared eradicated by the World Health Organization (WHO) in 1980, and vaccination, once widely practised, stopped in the same year [98]. Therefore, the third cluster can be considered as a transition cluster between the first and the second. In other words, it is located in the MDS map “equidistant” from both.
Identical reasoning can be applied to the fourth cluster. In this cluster are the lentivirus (a subgroup of retrovirus) that causes HIV infection and acquired immunodeficiency syndrome (AIDS), a Risk Group 3 pathogen but generally manipulated with Biosafety Level 2 precautions, and the hepatitis B virus, considered as belonging to Risk Group 2, equally transmitted by body fluids, and requiring a Biosafety Level 2 environment. Also considered as Risk Group 2 is the third pathogen present in this cluster, i.e., the virus of the Lyssavirus genus of the Rhabdoviridae family that causes rabies. Note that if the diseases caused by HIV and rabies are not treated, then lethality in humans is high. In that case they emerge in the first cluster of the MDS map.
In conclusion, the MDS map resulted in a new visualization of the complex quantitative and qualitative data of several diseases caused by viruses, and several clusters of medical and epidemiological interest emerged. In particular, a cluster emerged with viruses like Ebola and MERS, which are responsible for some recent viral outbreaks. In contrast, in the same MDS map, and distant from the previous group, there is a cluster of viruses associated with human diseases for which preventive and therapeutic interventions are generally available. The development of this methodology may help in understanding the dynamics of viral diseases.
Conclusion
This paper addressed the clinical characteristics of 22 viruses. A significant number of quantitative and qualitative characteristics were considered. When handling a large volume of information, we are confronted with the problem of comparing all details while highlighting the most important properties. Discarding information a priori may lead to incomplete or even biased results. Therefore, embedding all details requires adequate statistical, computational and visualization techniques capable of revealing the main aspects while “filtering” the information with low relevance. The MDS technique adopted in this study proved to produce solid results in accordance with present-day knowledge about those infectious agents.