
Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies.

Hamid Reza Marateb¹, Marjan Mansourian², Peyman Adibi³, Dario Farina⁴.

Abstract

BACKGROUND: Selecting the correct statistical test and data mining method depends strongly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are reviewed in detail, and statistical comparison, modeling, and data mining methods are discussed using several medical examples. We present two clustering examples for ordinal variables, which are the more challenging type of variable to analyze, using the Wisconsin Breast Cancer Data (WBCD). ORDINAL-TO-INTERVAL SCALE CONVERSION EXAMPLE: A breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests.
RESULTS: The sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable.
CONCLUSION: Using a clustering algorithm appropriate to the measurement scale of the variables in the study yields high performance. Moreover, descriptive and inferential statistics, as well as the modeling approach, must be selected based on the scale of the variables.


Keywords: Biostatistics; breast cancer; cluster analysis; data mining; research design

Year: 2014 PMID: 24672565 PMCID: PMC3963323

Source DB: PubMed Journal: J Res Med Sci ISSN: 1735-1995 Impact factor: 1.852

INTRODUCTION

In medical research, the design of a study is the most important part because it directs the other steps of the research, especially all types of data analysis. A badly designed study can never be retrieved, whereas a poorly analyzed one can usually be re-analyzed.[1] Other important issues, such as sample size calculation, also depend on the kind of experimental design and the kind of measurements in the study. Above all, the main question is: what types of data are being measured? The other steps of the analysis are determined by the type of variable used.[2,3,4,5,6] In this regard, analysts assume that the variables have specific levels of measurement. Stevens proposed his typology in 1946.[7] In his article, Stevens claimed that all measurements in science are conducted using four types of scales that he called ‘nominal’, ‘ordinal’, ‘interval’, and ‘ratio’, unifying both qualitative measurements (described by his ‘nominal’ type) and quantitative measurements (to differing degrees, all the rest of his scales). The concept of scale types later received the mathematical rigor that it lacked at its inception through the work of the mathematical psychologists Theodore Alper,[8,9] Louis Narens,[10,11] and R. Duncan Luce.[12,13,14] Nowadays, the ordinal scale is considered a qualitative variable.[15] However, this scale typology has received a lot of criticism.[6,16,17,18] Alternative scale taxonomies have therefore been suggested,[19] one of which consists of grades, ranks, counted fractions, counts, amounts, and balances.[6] Most of the conflict between the pro-Stevens (‘conservative’) and the anti-Stevens (‘liberal’) camps begins after both sides agree that a certain variable is ordinal; they part company when analyzing the data generated by that variable. The exchange in Nursing Research between Armstrong and Knapp is illustrative of the competing positions.[20]

Measurement scales

Nominal scales are used only for qualitative classification. One can determine only whether individual items belong to certain distinct categories; it is not possible to quantify or rank-order the categories. Nominal data have no order, and the assignment of categories is arbitrary. It is also not possible to perform arithmetic or logical operations on nominal data.[18] Briefly, nominal data have three distinct features: 1) no ordering of the different categories, 2) no measure of distance between values, and 3) categories can be listed in any order without affecting the relationship between them. Nominal variables are also called (nonranked) categorical variables in the literature. The number of occurrences in each category is referred to as the frequency count for that category.[6] Dichotomous (binary) variables are nominal variables that have only two categories or levels. Examples of nominal variables are gender, marital status, eye color, nationality, affiliation, religious preference, surgical outcome (dead/alive), blood type, epidemiological status (healthy, patient), and the presence of a symptom in a questionnaire (yes/no).

A discrete-ordinal scale resembles a nominal scale, except that the different states are ordered in a meaningful sequence. Ordinal data have order, but the intervals between scale points may be uneven. Because of the lack of equal distances, arithmetic operations are not possible, but logical operations can be performed.[21] Under an ordinal scale, the subjects or objects are ranked in terms of the degree to which they possess a characteristic of interest.[6] An ordinal scale indicates direction, in addition to providing nominal information. In medicine, ordinal variables often describe the patient's characteristics, attitude, behavior, or status. Examples of ordinal variables include stages of cancer (stage I, II, III, IV), education level (elementary, secondary, college), pain level (1-10 scale), satisfaction level (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied), social status (upper, middle, lower), type of degree (BS, MS, PhD), Likert variables[22] such as an attitudinal response (agreement level) with four levels (strongly disapprove, disapprove, approve, strongly approve) or a four-item rating scale (always, often, sometimes, never), graduation rank, the visual analog scale (VAS), and BMI (body mass index)-based nutritional status (severely thin, thin, normal, overweight, and obese).

Continuous-ordinal scales occur when the measurements are continuous, but one is not certain whether they are on a linear scale, the only trustworthy information being the rank order of the observations. For example, if a scale is transformed by an exponential, logarithmic, or any other nonlinear monotonic transformation, it loses its interval scale property. Here, it would be expedient to replace the observations by their ranks.[21]

Interval scales are metric scales that have constant, equal distances between values but an arbitrary zero point. They are measured on a linear scale and can take on positive or negative values. It is assumed that the intervals keep the same importance throughout the scale.[21] In an interval scale, such as body temperature (°C, °F) or calendar dates, a difference between two measurements has meaning, but their ratio does not.[23] Counts, such as counts of publications or citations, are interval scale measurements; other examples are years of education, intelligence (IQ test score), BMI, and age (years).

Ratio scales are metric scales and the most informative scales. A ratio scale is an interval scale with the additional property that its zero position indicates the absence of the quantity being measured. Briefly, ratio scales have equal intervals between values, a meaningful zero point, and meaningful numerical relationships (e.g. division) between numbers. Examples of ratio scales include weight, pulse rate, respiratory rate, body temperature (K), and body length in infants or height in adults. Since the statistical tests on ratio scales are the same as those on interval scales, the inferential statistics are discussed for nominal, ordinal, and interval scales.

Statistics are part of our everyday life. Anyone who lacks fundamental statistical literacy, reasoning, and thinking skills might not be able to perform acceptable research. Kuzma provided a formal definition of the term ‘statistics’:[24] ‘A body of techniques and procedures dealing with the collection, organization, analysis, interpretation, and presentation of information that can be stated numerically’. Statistical analysis is divided into two important branches: descriptive and inferential analysis.

Descriptive and inferential statistics for different types of variables

Descriptive statistics quantitatively describe the main features of a collection of data and are presented as measures of central tendency and dispersion. The central tendency of nominal variables is the mode, the most common item. For ordinal variables, the median (middle-ranked item) or the mode can be used as the estimate of central tendency. For interval variables, the mode, median, and arithmetic mean can be used; for ratio variables, the geometric mean (the nth root of the product of the n data samples) and the harmonic mean (the reciprocal of the arithmetic mean of the reciprocals of the data samples) are allowed in addition. Statistical dispersion is not defined for nominal and ordinal scales. For interval variables, the range and standard deviation can be used as dispersion measures; for ratio variables, the studentized range (the difference between the largest and smallest data, divided by the standard deviation) and the coefficient of variation (the ratio of the standard deviation to the mean) are allowed in addition. Inferential statistics comprise procedures for drawing conclusions from datasets that arise from systems affected by random variation. Any statistical inference requires some assumptions. Testing, and possibly rejecting, a hypothesis with a suitable parametric or nonparametric statistical test is an important part of inferential statistics. In parametric tests, the data-generating process is assumed to be fully described by a family of probability distributions involving only a finite number of unknown parameters, whereas in nonparametric tests far fewer assumptions are made about the process generating the data, and its distribution may be left completely unspecified.
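The scale-dependent choice of descriptive statistics described above can be sketched in a few lines of Python. This is only an illustration: the function name `describe`, the string labels for the scales, and the assumption that ordinal values are coded as integer ranks are ours, not part of the reviewed methodology.

```python
import statistics

def describe(values, scale):
    """Return only the descriptive statistics the text allows for a scale.

    Assumption: ordinal values are coded as sortable ranks (e.g. 1..10).
    """
    if scale not in ("nominal", "ordinal", "interval", "ratio"):
        raise ValueError(f"unknown scale: {scale}")
    out = {"mode": statistics.mode(values)}  # allowed on every scale
    if scale == "nominal":
        return out
    # median_low always returns an observed value, which suits ordinal labels
    out["median"] = statistics.median_low(values)
    if scale == "ordinal":
        return out
    out["median"] = statistics.median(values)
    out["mean"] = statistics.mean(values)
    out["range"] = max(values) - min(values)
    out["sd"] = statistics.stdev(values)
    if scale == "interval":
        return out
    # Ratio scale: a meaningful zero makes these additional statistics valid
    out["geometric_mean"] = statistics.geometric_mean(values)
    out["harmonic_mean"] = statistics.harmonic_mean(values)
    out["studentized_range"] = out["range"] / out["sd"]
    out["cv"] = out["sd"] / out["mean"]
    return out

print(describe(["A", "O", "A", "B"], "nominal"))  # blood types: mode only
print(describe([2.0, 8.0], "ratio"))
```

Returning fewer keys for weaker scales makes it hard to report, say, a mean of blood types by accident.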
The purpose of the analysis and the measurement scale of the data determine the suitable statistical test.[4] Parametric tests usually rely on the normality of the distribution of the interval-scale data. Thus, normality tests such as Kolmogorov–Smirnov or Shapiro–Wilk are used to check the normality assumption.[25] When their assumptions hold, parametric tests are more powerful than the corresponding nonparametric tests. Thus, a transformation of the interval variables is sometimes used to bring the data closer to normality.[26] The appropriate tests for different variable scales for comparisons between two or more groups containing independent or paired samples are listed in [Table 1]. The following clinical examples illustrate which statistical test to use.

Table 1

Selecting the appropriate test for comparisons between two or more than two groups based on different scales

When comparing the HDL (high-density lipoprotein) value in healthy and diabetic patients, the two independent samples t-test is used if the HDL values are normally distributed in both classes; otherwise the Wilcoxon–Mann–Whitney test is used. To identify whether gender is equally distributed among abdominally obese people, the Chi-square test can be used. To test whether the distribution of the BMI-based nutritional status (severely thin, thin, normal, overweight, and obese) is the same in two independent groups of patients with liver cancer, the Wilcoxon–Mann–Whitney test is used. To find whether the prevalence of high diastolic pressure is similar in the normoalbuminuria, microalbuminuria, and macroalbuminuria groups, the Chi-square test can be used. The effectiveness of an educational program on the correct diagnosis of a disorder is assessed using the McNemar test. Differences in blood vitamin D concentration among normal, pre-diabetic, and diabetic patients are identified using one-way ANOVA. The comparison of blood HbA1C concentration among pregnant women in the first, second, and third trimesters of pregnancy is performed using one-way repeated-measures ANOVA. Additionally, appropriate modeling methods for different variable scales are listed in [Table 2]. Modeling is usually used when we want to reduce the effect of confounders, and the type of modeling is determined by the scale of the dependent variable(s). Here are some clinical modeling examples.

Table 2

Selecting the appropriate test or modeling for different categories of dependent and independent variables

The gender-specific difference in blood vitamin D concentration among normal, pre-diabetic, and diabetic patients is identified using factorial ANOVA. The effect of air pollutant concentration on birth weight, considering the mother's nutritional status and supplement intake, is determined using multiple linear regression. In the latter example, if birth weight is categorized into underweight and normal groups, binary logistic regression is used. The effectiveness of a treatment method on tumor stage (I–IV), controlling for confounders such as the gender, age, and immunologic factors of the patients, is determined using ordered logistic regression. For a detailed description of the aforementioned methodologies, the reader is referred to selected textbooks and guidelines.[4,6,27,28,29,30]
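The group-comparison examples given after Table 1 can be collected into a small lookup. The sketch below covers only the cases named in those examples, not the full table; the function name `choose_test`, its parameters, and the rule of treating non-normal interval data like ordinal data are illustrative assumptions.

```python
# (scale, number of groups, paired?) -> test, limited to the cases in the text
TESTS = {
    ("nominal", "two", False): "Chi-square test",
    ("nominal", "more", False): "Chi-square test",
    ("nominal", "two", True): "McNemar test",
    ("ordinal", "two", False): "Wilcoxon-Mann-Whitney test",
    ("interval", "two", False): "two independent samples t-test",
    ("interval", "more", False): "one-way ANOVA",
    ("interval", "more", True): "one-way repeated-measures ANOVA",
}

def choose_test(scale, n_groups, paired=False, normal=True):
    """Pick a comparison test from scale, group count, pairing, and normality."""
    if n_groups < 2:
        raise ValueError("a comparison needs at least two groups")
    groups = "two" if n_groups == 2 else "more"
    if scale == "interval" and not normal:
        # Assumption: when normality fails, fall back to the rank-based test
        scale = "ordinal"
    key = (scale, groups, paired)
    if key not in TESTS:
        raise LookupError(f"case {key} is not covered by the examples in the text")
    return TESTS[key]

# HDL in healthy vs. diabetic patients, with and without normality
print(choose_test("interval", 2))
print(choose_test("interval", 2, normal=False))
# HbA1C across three trimesters in the same women
print(choose_test("interval", 3, paired=True))
```

Cases the examples do not mention raise an error rather than guessing, so the sketch never suggests a test that the text does not support.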

Data mining for different types of variables

Data mining (DM) is the process of discovering new patterns embedded in large data sets and using this information to build predictive models. Healthcare systems generate large amounts of complex data for which manual analysis has become impractical. DM can generate information that is useful to health care, including to patients, by identifying effective treatments. DM of medical data requires both medical and DM knowledge. Medical DM activities include clustering, classification and estimation, and treatment effectiveness analysis.[31,32,33] In this section, we focus on clustering; however, the issues considered can be extended to other DM methods. Clustering is the task of grouping a set of objects in such a way that objects belonging to the same cluster are similar to each other (homogeneity) and objects belonging to different clusters are dissimilar to each other (separation). A clinical example clarifies the clustering procedure: in 2000, Alizadeh et al. published a paper in Nature[34] in which the gene expression (microarray) profiles of 72 patients diagnosed with either acute myeloid leukemia (AML) or acute lymphatic leukemia (ALL) were analyzed. By clustering, the authors could distinguish two groups corresponding to AML and ALL and match the groups with the routine leukemia diagnosis. Based upon this, Roland Eils designed an expert system for the prediction of genetic disease.[35] In other words, when a new microarray gene profile is tested, it is possible to diagnose the type of leukemia. The similarity between objects plays an important role in any clustering algorithm, since similar objects belong to the same cluster. An object could be a patient with a variety of recorded clinical data (features). Similar objects have similar features. Features can be interval, ordinal, or nominal variables. The question is how similarity is measured for the various data scales.
The dissimilarity measure (distance) can be easily defined for interval variables. The Euclidean, Manhattan, maximum, Minkowski, Mahalanobis, average, chord, Canberra, and Czekanowski distances can be used in this case.[36] For nominal variables, the simple matching, Russell-Rao, Jaccard, Dice, Rogers-Tanimoto, and Kulczynski distances might be used, while more than 76 distance measures, such as the Yule, Sokal-Sneath-c, and Hamann measures, can be used for binary data.[36,37,38] An example is shown in [Figure 1]. However, there are many problems in defining dissimilarity measures for ordinal variables. A distance measure for ordinal data cannot be defined unless an ordinal-to-interval conversion is used. Moreover, the choice of similarity measure also affects statistical feature reduction and visualization techniques such as multidimensional scaling (MDS), in which the distance measure is defined for different measurement scales (e.g. using the weighted Euclidean model).[39,40,41,42]

Figure 1

An example of calculating the distance between two objects of ordinal variables, using the simple dissimilarity measure
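To make the distinction between interval-scale and nominal or binary dissimilarities concrete, the sketch below implements a few of the measures named above in plain Python. The function names and the example vectors are illustrative assumptions; the Jaccard variant shown is the common form that ignores joint absences.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p for interval-scale vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def simple_matching_distance(x, y):
    """Fraction of nominal attributes on which two objects disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jaccard_distance(x, y):
    """Binary (0/1) vectors: 1 - a/(a+b+c); joint absences (0, 0) are ignored."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatch = sum(1 for a, b in zip(x, y) if a != b)
    if both + mismatch == 0:
        return 0.0  # two all-zero vectors are treated as identical
    return 1.0 - both / (both + mismatch)

# Interval features (e.g. two lab values) for two hypothetical patients
print(euclidean((0.0, 0.0), (3.0, 4.0)), manhattan((0.0, 0.0), (3.0, 4.0)))
# Nominal features: blood type, symptom present, eye color
print(simple_matching_distance(("A", "yes", "blue"), ("B", "yes", "blue")))
# Binary symptom indicators
print(jaccard_distance((1, 1, 0, 0), (1, 0, 1, 0)))
```

None of these functions applies directly to ordinal labels: the interval distances need numeric spacing, and the matching distances discard the order.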

Ordinal to interval variable conversion

Consider the four-item rating scale (always, often, sometimes, never) that is widely seen in the questionnaires of psychological,[43] gastrointestinal,[44] nutritional,[45] and public health[46] research. One approach to handling ordinal variables is to introduce a dummy binary variable by merging [always, often] and [sometimes, never] into ‘yes’ and ‘no’. Thus, the ordering information is discarded and a suitable binary distance measure can be used. However, some information that could have improved the discrimination between groups is lost.[47] The other strategy is to assign numbers to the rank order monotonically, either nonrandomly or randomly, and to treat them as if they conform to an interval scale.[48,49] The first approach is called equal distance scoring (EDS), while the second is called monotonic random scoring (MRS) in the literature. Using EDS, interval values such as [0, 1, 2, 3] are used for the four-item rating scale. Accordingly, the distance between ‘sometimes’ and ‘never’ is the same as that between ‘sometimes’ and ‘often’, which is not necessarily correct. Additionally, EDS has been criticized in the literature and has proved inefficient even in correlation analysis in some cases where the ranks are not uniformly distributed.[50] Although MRS has been extensively used in the literature, it has also received criticism.[51] In MRS, uniform or normal monotonic random numbers are generated and used instead of the ordinal scale. Using MRS, the aforementioned four-item rating scale might be represented by the uniform monotonic random numbers [0.1270, 0.8147, 0.9058, 0.9134]. Running the random number generator again, the new mapping would be [0.0975, 0.2785, 0.5469, 0.6324]. This raises two questions: is the transformation unique across MRS runs, and is the problem mentioned for EDS resolved?
The optimal ordinal-to-interval conversion is still debatable, and many complicated approaches have been introduced in the literature.[51,52] In none of them, however, was the mapping defined so as to maximize the separation of the groups in the clustering procedure. In the next section, clustering methods defined for different variable scales are discussed and the relationship between this mapping and clustering is considered.
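The two scoring strategies can be sketched as follows. This is a minimal illustration of EDS and of MRS with uniform random numbers only; the function names and the use of Python's `random.Random` generator are our assumptions, so the scores will not reproduce the example values quoted above.

```python
import random

LEVELS = ["always", "often", "sometimes", "never"]  # rank order from the text

def equal_distance_scoring(levels):
    """EDS: consecutive ranks are one unit apart."""
    return {level: float(i) for i, level in enumerate(levels)}

def monotonic_random_scoring(levels, rng):
    """MRS (uniform variant): sorted random numbers preserve only the order."""
    scores = sorted(rng.random() for _ in levels)
    return dict(zip(levels, scores))

eds = equal_distance_scoring(LEVELS)
print(eds)

# Two runs with different generator states give two different mappings,
# which illustrates why the MRS transformation is not unique.
print(monotonic_random_scoring(LEVELS, random.Random(1)))
print(monotonic_random_scoring(LEVELS, random.Random(2)))
```

Both mappings keep the rank order; they differ only in the spacing they impose, and neither derives that spacing from the data.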

Clustering methods for different variable scales

Most previous clustering methods focus on interval data, for which the dissimilarity can be calculated easily. They include density-based (DBSCAN,[53] OPTICS[54]), partitioning (k-means,[55] k-medoids,[56] fuzzy c-means,[57] ISODATA[58]), hierarchical (different linkage algorithms,[59,60] MONA,[61] DIANA[62]), and grid-based (WaveCluster,[63] Fractal Clustering[64]) methods. Many nonranked categorical clustering algorithms have been proposed in the literature, such as LIMBO,[65] COOLCAT,[66] CACTUS,[67] ROCK,[68] MMR,[69] CLICKS,[70] HD vector,[71] AUTOCLASS,[72] K-modes,[73] fuzzy K-modes,[74] fuzzy centroids,[75] and genetic fuzzy k-modes.[76] However, the dissimilarity measures and cluster representatives have a great impact on the clustering performance and convergence.[77,78,79] It is possible to use dummy binary variables for ordinal data and then use any of the above clustering methods, at the expense of losing detail. Few algorithms have been proposed for clustering ordinal data, such as median fuzzy c-means[80] and a modified fuzzy c-means clustering method in which the ordinal-to-interval mapping is simultaneously determined by particle swarm optimization.[81] In the latter method, the mapping is calculated so as to maximize the inter-cluster distance and minimize the intra-cluster distance. This algorithm is one of the few clustering methods in which the mentioned transformation is adaptively estimated for each ordinal variable. It is used in the next section of this manuscript for clustering a cancer dataset with ordinal variables.

Latent variable models

Latent variable models, specifically item response theory, have also been used for modeling and clustering ordinal data.[82,83,84] A mixture of item response models can be used to cluster such data. It is assumed that the observed ordinal data are discretized versions of an underlying latent Gaussian variable. The clustering is then achieved by fitting a mixture model to the latent Gaussian data.[85] However, this method relies on the posterior mean of the latent Gaussian data, and the Gaussian assumption is valid only for a sufficiently large data set (in the number of variables and in the number of levels of each ordinal variable), which cannot always be taken for granted.[85]

Latent class analysis

Latent class analysis (LCA) is a subset of structural equation modeling used to find groups or subtypes of cases in multivariate categorical data. These subtypes are called ‘latent classes’.[86] One of the common statistical applications of LC analysis is clustering, for which LC cluster models have been introduced. These models have advantages over traditional clustering methods, such as probability-based classification (similar to fuzzy memberships), the handling of continuous, categorical, count,[87] or mixed-mode data,[88,89,90] and the use of demographics and other covariates in the clustering analysis.[91,92,93,94] LC models are model-based clustering methods in which explicit assumptions are made about the form of the probability density function describing the population of the observed data.[95,96] Clustering analysis and further inferences about the number of clusters and cluster membership are based on estimation of the unknown parameters of the probability model used.[97] The two main methods for estimating the parameters of the various types of LC cluster models are the maximum-likelihood (ML) method and the maximum a posteriori (MAP) method. A well-known problem in LC analysis is the occurrence of local solutions; accordingly, the analyst must interpret estimates cautiously. Moreover, the weak identifiability of LC clustering[98] and the complexity of the likelihood function and likelihood surface make the procedure sensitive to initial estimates.[99] Model selection, that is, estimation of the number of clusters and of the form of the model given the number of clusters, is also one of the main research topics in LC clustering. The Akaike (AIC), Bayesian (BIC), and consistent Akaike (CAIC) information criteria have been used for model selection.[100] Software packages such as MCLUST,[101] Mplus,[102] poLCA,[103] Latent GOLD,[104] and SAS[99] can be used for LC cluster analysis.[105]
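The three information criteria mentioned above can be computed from a fitted model's maximized log-likelihood with their standard textbook definitions. The sketch below assumes those definitions; the function names and the log-likelihood values and parameter counts in the example are hypothetical and do not come from any fitted LC model.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Standard definitions: smaller values indicate a preferred model."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "CAIC": deviance + n_params * (math.log(n_obs) + 1),
    }

def select_model(candidates, n_obs, criterion="BIC"):
    """candidates maps a model name to (log-likelihood, number of parameters)."""
    return min(
        candidates,
        key=lambda name: information_criteria(*candidates[name], n_obs)[criterion],
    )

# Hypothetical fits of a 2-class and a 3-class model to 683 cases
fits = {"2 classes": (-1200.0, 19), "3 classes": (-1195.0, 29)}
print(select_model(fits, n_obs=683, criterion="BIC"))
```

Because local solutions are common in LC analysis, the log-likelihood passed in would normally be the best value over several random starts.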

Mixed data

In many applications, each instance in a data set is described by more than one type of attribute. For example, we may wish to group people based on their recorded anthropometric or clinical data; such a grouping can identify different diseases. The recorded data for each person contain gender (binary variable), weight class (underweight, normal, overweight, or obese; ordinal variable), HDL and LDL cholesterol values (interval variables), etc. This is an example of mixed-type data, for which the similarity and dissimilarity between two instances (e.g. people) cannot be calculated using the methods discussed so far. A general distance coefficient and a generalized Minkowski distance have been introduced for mixed-type data in the literature,[36] as have other methods.[106,107,108,109,110,111,112]
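One way to picture a distance for mixed-type records is a Gower-style average of per-attribute dissimilarities. Whether this matches the general distance coefficient cited above is an assumption on our part. The sketch also scores ordinal attributes by their rank difference divided by the rank range, which implicitly applies equal distance scoring. All names and values are hypothetical.

```python
def mixed_distance(a, b, kinds, ranges):
    """Average per-attribute dissimilarity, each scaled to [0, 1].

    kinds[i]  : "nominal", "ordinal", or "interval"
    ranges[i] : max - min of attribute i over the data set (None for nominal);
                ordinal attributes are assumed to be coded as integer ranks
    """
    total = 0.0
    for x, y, kind, span in zip(a, b, kinds, ranges):
        if kind == "nominal":
            total += 0.0 if x == y else 1.0   # simple match / mismatch
        else:
            total += abs(x - y) / span        # range-normalized difference
    return total / len(kinds)

# Gender, weight class rank (0 = underweight .. 3 = obese), HDL, LDL
kinds = ["nominal", "ordinal", "interval", "interval"]
ranges = [None, 3, 60.0, 120.0]               # hypothetical data-set ranges
person_a = ("F", 2, 55.0, 130.0)
person_b = ("M", 3, 40.0, 160.0)
print(mixed_distance(person_a, person_b, kinds, ranges))
```

Range normalization keeps an attribute with large numeric values, such as LDL, from dominating the binary and ordinal attributes.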

ORDINAL-TO-INTERVAL SCALE CONVERSION EXAMPLE

Since there are few studies on ordinal data clustering, an example is given based on the breast cancer database obtained from the Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)). This database, known as the Wisconsin Breast Cancer Data (WBCD), with 98032 recorded web hits, was obtained from the University of Wisconsin Hospitals, Madison, by Dr. William H. Wolberg[113,114,115,116] and has been extensively used as a clustering benchmark in the literature.[81,117,118] There are 699 patient records in the database. Each attribute has 10 ordinal values. Sixteen patient records had missing values and were excluded; thus, the sample size was 683. Each record represents nine measurements made on a fine needle aspirate (FNA) taken from the patient's breast. The nine cytological measurements are clump thickness, size uniformity, shape uniformity, marginal adhesion, cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. Each of these measurements is described by an ordinal integer label between 1 and 10; the larger the number, the greater the likelihood of malignancy.[115] These ratings were made by clinical experts. All malignant aspirates were histologically confirmed, whereas FNAs diagnosed as benign masses were biopsied only at the patient's request. The remainder of the benign cytologies were confirmed by clinical re-examination 3 and 12 months after the aspiration. Masses that produced unsatisfactory or suspicious FNAs were surgically biopsied.[114] Accordingly, 239 cases were diagnosed as malignant and 444 as benign. The class labels were saved as the gold standard and kept for comparison. The class labels were excluded from the data set; thus, a nine-dimensional ordinal dataset of 683 cases was used for clustering. The number of clusters (groups) was estimated, and the accuracy of the malignant and benign classification was assessed by comparison with the gold standard.
Since ordinal data clustering is more challenging than clustering other types of data, we consider two different ordinal clustering methods for analyzing WBCD. The first approach was taken from the literature while the second one is proposed by the authors of this manuscript.
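The preprocessing described above (dropping records with missing values and setting the class labels aside) might look like the following sketch. It assumes a comma-separated layout of sample ID, nine ordinal attributes, and a class code (2 = benign, 4 = malignant), with `?` marking a missing value. That layout, the function name, and the three sample rows are stated here as assumptions for illustration; they should be checked against the repository's own documentation.

```python
# Illustrative rows in the assumed layout: id, nine attributes, class code
SAMPLE = """\
1000025,5,1,1,1,2,1,3,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
"""

def load_wbcd(text):
    """Return (features, labels), skipping records with missing values."""
    features, labels = [], []
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) != 11 or "?" in fields:
            continue                      # malformed line or missing value
        values = [int(v) for v in fields]
        features.append(values[1:10])     # nine ordinal attributes, ID dropped
        labels.append("malignant" if values[10] == 4 else "benign")
    return features, labels

features, labels = load_wbcd(SAMPLE)
print(len(features), labels)
```

The labels are returned separately so that a clustering step sees only the nine ordinal attributes and the gold standard is used only for evaluation.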

Ordinal data clustering based on modified FCM analysis (clustering #1)

Using the ordinal dataset, a modified fuzzy c-means (FCM) algorithm whose ordinal-to-interval conversion was estimated by particle swarm optimization was used.[81] The algorithm was run with 2 to 10 clusters, and the clustering structure with the optimum Xie-Beni clustering validity index[119] was selected. In other words, the number of clusters with the best relative compactness (minimum intra-cluster distance) and separation (maximum inter-cluster distance) was chosen.[120] In the selected clustering structure, the malignant and benign clusters were identified by comparison with the gold standard, and the errors were reported. Errors included the number of malignant cases in the benign cluster and vice versa.
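The validity index used for choosing the number of clusters can be sketched as below. It uses the usual form of the Xie-Beni index: membership-weighted compactness divided by the number of points times the smallest squared distance between cluster centers, with smaller values being better. The function name, the default fuzzifier m = 2, and the toy data are assumptions; the sketch does not reproduce the modified FCM or the particle swarm step.

```python
def xie_beni(points, centers, memberships, m=2.0):
    """Xie-Beni index; memberships[i][k] is the degree of point k in cluster i."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    n, c = len(points), len(centers)
    compactness = sum(
        memberships[i][k] ** m * sqdist(points[k], centers[i])
        for i in range(c)
        for k in range(n)
    )
    separation = min(
        sqdist(centers[i], centers[j]) for i in range(c) for j in range(i + 1, c)
    )
    return compactness / (n * separation)

# Toy one-dimensional data with two obvious groups and crisp memberships
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
u = [[1, 1, 0, 0], [0, 0, 1, 1]]
good = xie_beni(points, [(0.5,), (10.5,)], u)
poor = xie_beni(points, [(0.5,), (5.0,)], u)
print(good, poor)  # the well-placed centers give the smaller index
```

In a model-selection loop, the index would be evaluated for each candidate number of clusters and the partition with the smallest value kept.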

Ordinal data clustering based on modified OPTICS analysis (clustering #2)

The ordinal data were converted to interval data using the EDS algorithm, because the ordinal levels were assigned with equal spacing, without prior expert-based knowledge. Then, the density-based clustering method OPTICS was used to identify the clustering structure. OPTICS addresses the problem of detecting meaningful clusters in data of varying density by ordering the points such that points that are spatially closest in the multidimensional space become neighbors in the ordering. OPTICS can identify the clustering structure and, unlike FCM, does not need major input parameters or postprocessing such as clustering validity analysis.[121] As with the previous clustering method, the malignant and benign clusters were identified by comparison with the gold standard, and the errors were reported.
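The ordering idea behind OPTICS can be illustrated with a compact, quadratic-time sketch that uses an unbounded neighborhood radius. It is a simplified teaching version rather than the modified algorithm used in this study. The function name, the parameter `min_pts`, and the convention that a point counts as its own first neighbor are assumptions.

```python
import math

def optics_reachability(points, min_pts):
    """Return the OPTICS ordering and the reachability distance of each
    point in that order (unbounded radius, O(n^2) time and memory)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Core distance: distance to the min_pts-th nearest neighbor, counting
    # the point itself (distance 0) as the first neighbor.
    core = [sorted(row)[min_pts - 1] for row in dist]
    reach = [math.inf] * n
    processed = [False] * n
    order = []
    for start in range(n):
        current = None if processed[start] else start
        while current is not None:
            processed[current] = True
            order.append(current)
            # Update the reachability of every unprocessed point from `current`
            for j in range(n):
                if not processed[j]:
                    reach[j] = min(reach[j], max(core[current], dist[current][j]))
            pending = [j for j in range(n) if not processed[j]]
            current = min(pending, key=lambda j: reach[j]) if pending else None
    return order, [reach[i] for i in order]

# Two dense groups on a line: the jump in reachability separates the valleys
data = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (5.3,)]
order, rd = optics_reachability(data, min_pts=2)
print(order)
print([round(r, 2) for r in rd])
```

Plotting the returned reachability values in order gives an RD-plot in which each valley corresponds to a dense group and each high bar marks the transition to the next group.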

Clustering performance analysis

The values of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) were calculated for each of the aforementioned clustering methods by comparing the clustering results with the gold standard. Then, the following performance measures were calculated: Sensitivity (Se) = Recall (Re) = TP/(TP+FN); Specificity (Sp) = TN/(FP+TN); Precision (Pr) = TP/(TP+FP); Type I error: FP rate (α) = 1-Sp; Type II error: FN rate (β) = 1-Se; Power = 1-β = Se; F-score = 2*(Pr*Re)/(Pr+Re) = harmonic mean (Pr, Re); Accuracy (Acc) = (TP+TN)/(TP+TN+FN+FP). The code for the two clustering algorithms above and for the validation program was written in Matlab (Matlab and Statistics Toolbox Release 2012b, The MathWorks, Inc., Natick, Massachusetts, United States), and is available upon request from the authors.
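The measures defined above translate directly into code. The sketch below is a Python illustration and not the authors' Matlab validation program. The function name and the confusion-matrix counts in the example are hypothetical and are not the WBCD results.

```python
def performance(tp, tn, fp, fn):
    """Performance measures computed from a 2x2 confusion matrix."""
    se = tp / (tp + fn)            # sensitivity = recall
    sp = tn / (fp + tn)            # specificity
    pr = tp / (tp + fp)            # precision
    return {
        "sensitivity": se,
        "specificity": sp,
        "precision": pr,
        "alpha": 1 - sp,           # type I error (FP rate)
        "beta": 1 - se,            # type II error (FN rate)
        "power": se,               # 1 - beta
        "f_score": 2 * pr * se / (pr + se),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
    }

# Hypothetical counts: 100 malignant and 100 benign cases
for name, value in performance(tp=90, tn=95, fp=5, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Malignant is taken as the positive class here, so an FN is a malignant case placed in the benign cluster.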

RESULTS

In the first clustering method, the Xie-Beni index showed its optimum value at two clusters, indicating that there were two clusters in the data, which is quite reasonable. The FCM clustering algorithm was run 10 times, and the clustering results with the best compactness and separation were used.[122] The ordinal-to-interval conversion matrix for the nine ordinal variables with 10 ranks is listed in [Table 3]. The ranks of different ordinal variables were transformed differently; in other words, the transformation was chosen so as to optimize the clustering structure. The performance of the first clustering method, compared with the gold standard, is listed in [Table 4].

Table 3

The ordinal-to-interval conversion matrix for nine ordinal variables (columns) with 10 ranks (rows) studied on the WBCD using the clustering method #1

Table 4

The performance of the clustering methods studied on the WBCD

Using clustering method #2 with 40 nearest neighbors (40-NN), the reachability distance plot (RD-plot) is shown in [Figure 2]. This 1D plot shows the clustering structure of the multidimensional data, in which each major local minimum corresponds to a cluster. In this plot, two major clusters were detected, related to the malignant and benign groups, respectively. Although the major local minima can be detected manually, there are methods for detecting clusters automatically.[121] The performance of this clustering method is shown in [Table 4].

Figure 2

The clustering structures of the WBCD found by the second ordinal-variable clustering method. Each major valley (local minimum) of the reachability distance plot (RD-plot) corresponds to a possible cluster. In this example, the first cluster is the malignant group while the second one is the benign group.

The power of both clustering methods was 98%, while the type I error (α) was 0.03 and 0.09 for clustering methods #1 and #2, respectively. In both clustering methods, the FN rate (β) was 0.02. An FN is much more serious than an FP, since it means that the subject will not be treated.[81] Both methods showed ‘almost perfect agreement’ with the gold standard.

DISCUSSION

One of the important elements of good medical research is identifying the key variables of the study, their method of measurement (measurement scale), and their unit of measurement.[123] In addition to the different types of variables,[124] such as independent (risk factors), dependent (outcome), confounding (intervening), and background variables, the scale of the variables (qualitative versus metric) plays an important role in selecting appropriate statistical tests. Because of the importance of selecting appropriate statistical comparison and modeling tests, they are listed in detail in [Tables 1 and 2]. Clinical examples taken from different medical studies were also given in this paper for better elaboration. Although the selection of appropriate tests has been studied in the literature,[4,6] this manuscript is one of the first of its kind to discuss different variable scales and their suitable statistical and data mining methods with several examples. Much of what has been written in the literature concerns clustering analysis and validity analysis of interval data,[62,120,125] but little has been said about the analysis of categorical variables. In this paper, we discussed different clustering methods for categorical data and, for the first time in a review, used two different clustering methods to analyze the ordinal WBCD. The first approach had already been proposed and tested,[81] while the second approach was proposed by the authors. We hope that this review will be of use to researchers in the biomedical sciences. One of the main limitations of this manuscript is that most of the nominal-data clustering methods were only mentioned and cited; no criterion for selecting among them was given. We have been contacting the authors of the corresponding papers, and most of the clustering programs have been received. Some of them were re-compiled on different operating systems, for example Linux, with the help of other data mining researchers from different countries. We will try to run several categorical clustering algorithms on standard benchmark datasets to obtain a fair comparison. This will be the focus of our future work.

