All metrics are equal, but some metrics are more equal than others: A systematic search and review on the use of the term 'metric'.

Núria Duran Adroher^1,2, Birgit Prodinger^1,2,3, Carolina Saskia Fellinghauer^1,2, Alan Tennant^1,2.

Abstract

OBJECTIVE: To examine the use of the term 'metric' in health and social sciences' literature, focusing on the interval scale implication of the term in Modern Test Theory (MTT).
MATERIALS AND METHODS: A systematic search and review on MTT studies including 'metric' or 'interval scale' was performed in the health and social sciences literature. The search was restricted to 2001-2005 and 2011-2015. A Text Mining algorithm was employed to operationalize the eligibility criteria and to explore the uses of 'metric'. The paradigm of each included article (Rasch Measurement Theory (RMT), Item Response Theory (IRT) or both), as well as its type (Theoretical, Methodological, Teaching, Application, Miscellaneous) were determined. An inductive thematic analysis on the first three types was performed.
RESULTS: 70.6% of the 1337 included articles were allocated to RMT, and 68.4% were application papers. Among its numerous uses, 'metric' was predominantly a synonym of 'scale'; as an adjective, it referred to measurement or quantification. Three incompatible themes ('only RMT/all MTT/no MTT models can provide interval measures') were identified, but 'interval scale' was mentioned considerably more often in RMT than in IRT.
CONCLUSION: 'Metric' is used in many different ways, and there is no consensus on which MTT metric has interval scale properties. Nevertheless, when using the term 'metric', authors should specify the level of the metric being used (ordinal, ordered, interval, ratio), and justify why they consider the metric to be at that level.

Year: 2018 PMID: 29509813 PMCID: PMC5839589 DOI: 10.1371/journal.pone.0193861

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The term ‘metric’ is widely used in various fields of science, and it has acquired different meanings over the years. These meanings have given rise to different perceptions of what is implied by its subsequent use. Consequently, it is important to understand how the term ‘metric’ is being used, and what characteristics are associated with these different meanings.

In the 19th century, when it was first used, it referred mainly to the metric system [1]. In the first half of the 20th century one can also find articles using it in the mathematical sense of a metric (or distance) function (see S1 Appendix and [2]). The term ‘scale’ also appears in the 19th-century literature; for example, Stiff [1] used it as a way to calculate weights and measures. In the mid-20th century, Coombs [3] brought together ‘metric’ and ‘scale’, introducing the term ‘ordered metric’ as a new type of scale in addition to the four distinguished by Stevens [4]: “It is interesting to note here that this is a new type of scale not discussed by Stevens. This is a type of scale that falls between what he calls ordinal scales and interval scales. In ordinal scales nothing is known about the intervals. In interval scales the intervals are equal. In this scale, which we call an ordered metric, the intervals are not equal but they may be ordered in magnitude.” In other words, in an ordered metric scale one is able to order the distances between some pairs of points. For instance, one may say that the distance between two points A and B is bigger than the distance between two other points C and D. Therefore, one has more information than in an ordinal scale but less than in an interval scale.

In the second half of the 20th century, a distinction between ‘nonmetric’ and ‘metric’ was made. The former refers to ordinal data and the latter to interval or ratio data, as Takane et al. [5] explained: “Previous authors of multidimensional scaling papers have emphasized a dichotomy of measurement levels which they termed metric and nonmetric. When placed in the context of Stevens [6] measurement theory, it is clear that these terms correspond to three of the four measurement levels delineated by Stevens, namely ordinal (nonmetric) and interval or ratio (metric).”

Also in the second half of the last century, ‘metric’ was used in the sense of statistic or index. Schuman and Brace [7] assessed the metric variations of chimpanzee dentition, where ‘metric’ consists of recording of dimensions: their means, ranges, standard deviations, coefficients of variation, modules, indices (talonid-trigonid, length-breadth).

More recently, these different uses of the term ‘metric’ are found in the health and social sciences. For example, Crane et al. [8] use ‘metric’ practically as a synonym of ‘scale’. Moreover, Kemmler et al. [9] point out the distinction between metric scaling and ordinal scaling, and Arons and Krabbe [10] investigated both metric and nonmetric multidimensional scaling. In the sense of statistic or index, Bai et al. [11] present the Activity Index as a new metric for summarizing raw tri-axial accelerometry data.

Modern Test Theory (MTT) models, which comprise the Rasch Measurement Theory (RMT) [12] and Item Response Theory (IRT) [13] paradigms [14], are used in the health and social sciences literature, inter alia, to evaluate Patient Reported Outcome Measures (PROMs) [15]. The term ‘metric’ is widely used in MTT; however, it is rarely defined. Based on this historical overview of the use of the term ‘metric’, it seems that its use predominantly implies interval scaling. For researchers, this would imply that they can use parametric methods for analyzing the respective data, given appropriate distributions.
However, if the term is vaguely defined and does not have interval scale properties but researchers treat it as if it did, there is a risk of obtaining erroneous results [16] or of under- or overestimating effects, for example in clinical research [17]. Clinicians dealing with indices such as the Minimal Clinical Important Difference (MCID) in PROMs should be aware that ordinal-based derivations of the MCID can lead to wrong conclusions [17]. Kahler et al. [16] provided an example showing the inappropriateness of applying parametric methods to Health-Related Quality of Life scores, i.e., to ordinal data. While it is clear that the term ‘interval scale’ permits calculations such as the mean and standard deviation, it is not so clear whether the same holds when the term used is ‘metric’. Therefore, a researcher or clinician who encounters the term ‘metric’ does not know whether arithmetic operations are permissible. It is also unclear whether MTT models can transform PROM scores into interval data. The current study aims to shed some light on these issues.
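To make the concern about treating ordinal scores as interval data concrete, the following minimal sketch may help. It uses invented responses on a 1 to 5 scale (they are not taken from any cited study) and an arbitrary order-preserving recoding chosen for illustration. Because an ordinal scale fixes only the order of the categories, any such recoding is equally legitimate, yet the comparison of means changes direction while the comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical ordinal responses on a 1-5 scale (illustrative data only).
group_a = [1, 1, 5]
group_b = [3, 3, 3]

# An order-preserving recoding (an assumption for illustration): the ranking
# of the categories is unchanged, only the distance between 4 and 5 grows.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def apply_recode(scores):
    """Map each ordinal category to its recoded value."""
    return [recode[s] for s in scores]


def compare(stat, a, b):
    """Return '<', '>' or '=' depending on how stat(a) relates to stat(b)."""
    sa, sb = stat(a), stat(b)
    if sa < sb:
        return "<"
    if sa > sb:
        return ">"
    return "="


mean_before = compare(mean, group_a, group_b)
mean_after = compare(mean, apply_recode(group_a), apply_recode(group_b))
median_before = compare(median, group_a, group_b)
median_after = compare(median, apply_recode(group_a), apply_recode(group_b))

print("mean   A vs B:", mean_before, "then", mean_after)
print("median A vs B:", median_before, "then", median_after)
```

On an interval scale only linear transformations are admissible, and those cannot reverse a comparison of means; this is the practical difference at stake when a ‘metric’ is left undefined.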

Objective

The objective of this paper is to examine the use of the term ‘metric’ in health and social sciences’ literature, focusing on the interval scale implication of the term in MTT. Specifically, we aim (1) to explore the different uses and meanings of ‘metric’ in MTT literature, (2) to examine the relationships between ‘metric’ and ‘interval scale’ in RMT and IRT, and (3) to compare the current understanding on whether either or both paradigms can produce interval scaling.

Materials and methods

A Systematic Literature Search and Review [18] was carried out; the search, data extraction, and data synthesis processes are described below. Where appropriate, this review follows the PRISMA guidelines (see [19] and S2 Appendix). The detailed review protocol is available in S3 Appendix.

Systematic Literature Search

Fig 1 illustrates the iterative process of the Systematic Literature Search. The databases SCOPUS, PsycINFO, PubMed and ERIC were systematically searched. The search terms used in combination in all the databases are shown at the top of Fig 1. Combination 1 (C1) restricted the articles to the MTT literature. Combination 2 (C2) included ‘metric’ and ‘interval scale’, whose relationship we aimed to examine, as well as ‘conjoint measurement’ and ‘fundamental measurement’. The theory of Conjoint Measurement (CM) provides a means to quantify attributes, i.e., to obtain interval measures [20]. Therefore, CM can be used to justify interval measurement. In fact, the Rasch Measurement Model has been argued to be a probabilistic form of Additive CM [21]. As Simultaneous CM is a form of Fundamental Measurement [20], the latter was also included.

Fig 1

Iterative process of the Systematic Literature Search.

Abbreviations: IRT, Item Response Theory; C, Combination; I, Inclusion; E, Exclusion.

A year restriction was established because of the large number of records identified in the database search. As there was a clear increase in the number of records after 2001, as shown in Fig 2, we decided to consider the articles published from 2001 to 2005 and from 2011 to 2015. We considered these two periods on the hypothesis that relevant findings from papers published before 2001 would be cited in the 2001–2005 period, and that relevant findings from papers published in 2006–2010 would be cited in 2011–2015.

Fig 2

Barplot of the 7243 records identified through database search per year.

The records were subsequently screened according to the a priori inclusion and exclusion criteria shown in Fig 1. Inclusion criterion 1 (I1) and Exclusion criterion 1 (E1) were checked manually, and I2 by means of a Text Mining (TM) strategy using R software v.3.3.0 (see S4 Appendix). The TM strategy was applied and refined over a few iterations before it was applied to all the eligible articles. In this iterative process, four additional specific Exclusion criteria (E2–E5) were identified to operationalize I2 (see Fig 1). As regards E2, if the terms in C1 (e.g., Rasch) or in C2 (e.g., fundamental measurement) appeared only in the reference section, they were not used in the main text of the manuscript, leading us to conclude that the paper did not deal substantially with the topic under review. If E3–E5 applied, it followed that I2 could no longer be satisfied, and the article was therefore not included either. The TM algorithm described in S4 Appendix includes a quality check of the fulfillment of the eligibility criteria.
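The logic of criterion E2 can be pictured with a short sketch. This is not the algorithm from S4 Appendix (which was written in R and is not reproduced here); it is a hypothetical Python illustration that assumes the reference list starts at a line reading "References". The C1 term list is an assumption, since the exact search strings are given only in Fig 1; the function names are likewise invented.

```python
import re

# Hypothetical term lists. 'Rasch' is named in the text as a C1 example;
# the other C1 entry is an assumption. The C2 terms are those named in the text.
C1_TERMS = ["rasch", "item response theory"]
C2_TERMS = ["metric", "interval scale", "conjoint measurement",
            "fundamental measurement"]


def main_text(full_text):
    """Return the text before the last 'References' heading (assumed layout)."""
    headings = list(re.finditer(r"^\s*references\s*$", full_text,
                                flags=re.IGNORECASE | re.MULTILINE))
    return full_text[:headings[-1].start()] if headings else full_text


def contains_any(text, terms):
    """True if any term occurs as a whole word or phrase (optional plural 's')."""
    return any(
        re.search(r"\b" + re.escape(term) + r"s?\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def passes_e2(full_text):
    """E2-style check: C1 and C2 terms must both occur outside the references."""
    body = main_text(full_text)
    return contains_any(body, C1_TERMS) and contains_any(body, C2_TERMS)


kept = passes_e2(
    "We fitted a Rasch model to obtain an interval scale.\n"
    "References\n1. Some reference."
)
dropped = passes_e2(
    "We validated a questionnaire.\n"
    "References\n1. Rasch G. Probabilistic models. 2. On the common metric."
)
print(kept, dropped)
```

Whole-word matching keeps ‘psychometric’ from counting as a use of ‘metric’; whether the original strategy made the same choice is not stated in the text.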

Data extraction and data synthesis

To respond to Aim 1, there was no need to classify the included articles, because the preceding and following context of each use of ‘metric’ was examined in all of them. This information was obtained by applying the TM strategy. Once this information was processed, the uses of ‘metric’ were distinguished by part of speech: noun, adjective or adverb. Synonyms and definitions of ‘metric’ were also collected.

To respond to Aim 2, the RMT and IRT paradigms needed to be distinguished in order to examine the uses of ‘metric’ and ‘interval scale’ in each paradigm. If only RMT was mentioned in the article, the paradigm was clearly RMT, and if only IRT was mentioned, the paradigm was IRT. If both terms were mentioned, the paradigm could be RMT, IRT, or MTT, depending on what the article dealt with. If the article dealt mainly with RMT, then RMT was the paradigm, and the same holds for IRT. If both had a similar presence, the article was allocated to a global group, MTT. Each article was allocated to one paradigm, and this information was added to a table indicating whether each article contained the terms ‘metric’ and ‘interval scale’. Contingency tables of the individual and combined usages of ‘metric’ and ‘interval scale’ by paradigm could then be calculated.

To respond to Aim 3, i.e., to identify the reasons why either or both of RMT and IRT models can produce interval scaling, a more in-depth review of the included articles was needed. Because of the large number of included articles, we selected a sample and examined how the articles could be categorized into meaningful groups. Most of them reported on studies in which RMT or IRT models were applied to validate, co-calibrate or develop an instrument; we referred to these as Application papers. There were also Theoretical papers, which dealt with abstract measurement discussions and generally did not involve any data analysis.
Moreover, some articles were neither pure application nor pure theory, but examined different RMT or IRT methodologies via simulation, or presented a novel method, containing some theory but also an application to test the methodology (Methodological). Another type comprised articles that did not present an application or a methodology, nor discussed theoretical measurement issues in depth, but explained aspects of RMT or IRT with an educational (Teaching) aim. A last group consisted of articles that mentioned RMT or IRT, predominantly only in the introduction or discussion sections, and whose main focus was not on MTT; these belonged to a Miscellaneous category. Having identified these five article types (Theoretical, Methodological, Teaching, Application, and Miscellaneous) based on a sample of the included articles, we hypothesized that every included article could be assigned to one of them, and the article type allocation was performed for the whole set.

Two of the authors, NDA and CSF, manually performed the paradigm and type allocations simultaneously, based on the abstract. In case of ambiguity, the full texts were consulted. The two authors first performed both allocations on a random sample of 100 articles, and the disagreements were then discussed. The remaining articles were then split between NDA and CSF, who performed the allocations separately. The detailed guidelines for paradigm and type allocation are listed in S5 Appendix. While allocating each article to one paradigm and one type, the authors added a comment if the article could potentially help to justify the ability of RMT and IRT to obtain interval measures. Almost all of the highlighted articles belonged to the Theoretical, Methodological, and Teaching types.
Therefore, the research team agreed that Application and Miscellaneous articles would not provide in-depth information in response to Aim 3, and decided that only Theoretical, Methodological, and Teaching articles would be included in the further full-text review. NDA extracted text passages that referred to the ability of RMT and IRT to obtain interval measures into a Word document. NDA then conducted an inductive thematic analysis [22] of the extracted texts, identifying different themes, which were discussed with the research team and subsequently refined.
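The context extraction described for Aim 1 resembles a keyword-in-context listing. The sketch below is a hypothetical Python illustration of that idea, not the R code in S4 Appendix; the function name, the window of three words, and the tokenization rule are all assumptions. The sample sentence is one of the usages quoted in Table 2.

```python
import re


def metric_contexts(text, window=3):
    """Return (preceding words, matched term, following words) per use of 'metric'."""
    # Split on anything that is not a letter or apostrophe, so that
    # 'ordered-metric' yields the separate tokens 'ordered' and 'metric'.
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        # Match the noun/adjective 'metric', its plural, and the adverb.
        if re.fullmatch(r"metric(s|ally)?", token, flags=re.IGNORECASE):
            pre = " ".join(tokens[max(0, i - window):i])
            post = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((pre, token, post))
    return hits


sample = ("Equating the two metrics placed the item parameter estimates "
          "from both samples on the same scale.")
contexts = metric_contexts(sample)
print(contexts)
```

The preceding and following words are what allow a use to be assigned to a part of speech and, later, to one of the usage groups.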

Results

Initially, 7243 records were identified (4594 in Scopus, 2135 in PsycINFO, 294 in ERIC, and 220 in PubMed). Of these, 1949 were duplicates. After applying the year restriction, 3361 records remained to be screened. Of those, 690 were excluded because they were not written in English, were not peer-reviewed articles, or belonged to journals outside the health and social sciences. A further 36 articles were unavailable to us. The exact figures are shown in Fig 3.
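The counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the figures reported in the text; the variable names are invented, and the number removed by the year restriction is a derived value that assumes duplicates were removed before the year restriction was applied.

```python
# Counts reported in the Results text.
per_database = {"Scopus": 4594, "PsycINFO": 2135, "ERIC": 294, "PubMed": 220}
identified = sum(per_database.values())      # records identified in the search
duplicates = 1949
to_screen = 3361                             # after the year restriction
excluded_on_screening = 690
unavailable = 36

# Derived values (simple differences, not stated explicitly in the text).
after_deduplication = identified - duplicates
removed_by_year = after_deduplication - to_screen   # assumes dedup came first
eligible = to_screen - excluded_on_screening - unavailable

print(identified, after_deduplication, removed_by_year, eligible)
```

The resulting number of eligible articles matches the 2635 articles that the text reports as passed on to the Text Mining step.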

Fig 3

Flow diagram of the search and eligibility processes.

Abbreviations: C, Combination; IRT, Item Response Theory.

The 2635 eligible articles were assessed by means of the Text Mining strategy detailed in S4 Appendix. This step in the literature search resulted in 1337 included articles. The dataset containing the included articles is available in S6 Appendix. The agreement between the two authors in the pilot sample of 100 articles was 92% for paradigm allocation and 79% for type allocation. Conflicts in type allocation mainly concerned the Miscellaneous type. The criteria for type allocation were refined accordingly. Only then were the remaining articles split between the two authors, who performed paradigm and type allocations separately. The majority (70.6%) of the included articles were allocated to the RMT paradigm and 21.9% to the IRT paradigm (see Fig 4). The type allocation resulted in a clear predominance of Application studies (68.4%), followed by Methodological (13.8%), Miscellaneous (9.1%), Teaching (4.9%), and finally Theoretical studies (3.8%).
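The text does not state how the agreement figures were computed. Assuming they are simple percent agreement (the share of pilot articles to which both authors assigned the same label), the calculation could look like the sketch below. The labels are invented for illustration and are not the pilot data.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items (in %) to which two raters assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * same / len(labels_a)


# Hypothetical paradigm labels for five articles (illustrative only).
rater_1 = ["RMT", "RMT", "IRT", "MTT", "RMT"]
rater_2 = ["RMT", "IRT", "IRT", "MTT", "RMT"]
agreement = percent_agreement(rater_1, rater_2)
print(agreement)
```

Under this reading, 92% agreement on the 100 pilot articles would correspond to 92 identical paradigm labels.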

Fig 4

Flow chart of the data extraction process.

Abbreviations: RMT, Rasch Measurement Theory; IRT, Item Response Theory; MTT, Modern Test Theory.

As can be seen in Table 1, most of the RMT articles were Application studies (84.2%), whereas about one third of the IRT articles were Application studies (35.2%) and about one third Methodological studies (32.4%). The proportion of Miscellaneous articles was higher for IRT (23.9%) than for RMT (4.4%).

Table 1

Contingency table of paradigm by Type.

Paradigm	Theoretical	Methodological	Teaching	Application	Miscellaneous	Total
RMT	22 (2.3%)	59 (6.2%)	26 (2.8%)	795 (84.2%)	42 (4.4%)	944
IRT	7 (2.4%)	95 (32.4%)	18 (6.1%)	103 (35.2%)	70 (23.9%)	293
MTT	22 (22%)	30 (30%)	22 (22%)	16 (16%)	10 (10%)	100
Total	51 (3.8%)	184 (13.8%)	66 (4.9%)	914 (68.4%)	122 (9.1%)	1337

Abbreviations: RMT, Rasch Measurement Theory; IRT, Item Response Theory; MTT, Modern Test Theory.
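The percentages in Table 1 are row percentages, which can be recomputed from the counts. The sketch below uses the counts from the table; the variable and function names are invented for illustration.

```python
TYPES = ["Theoretical", "Methodological", "Teaching", "Application",
         "Miscellaneous"]

# Counts from Table 1 (paradigm by article type).
counts = {
    "RMT": [22, 59, 26, 795, 42],
    "IRT": [7, 95, 18, 103, 70],
    "MTT": [22, 30, 22, 16, 10],
}

row_totals = {paradigm: sum(row) for paradigm, row in counts.items()}
column_totals = [sum(column) for column in zip(*counts.values())]
grand_total = sum(row_totals.values())


def row_percentages(paradigm):
    """Percentage of each article type within one paradigm (row-wise)."""
    total = row_totals[paradigm]
    return {t: 100 * n / total for t, n in zip(TYPES, counts[paradigm])}


rmt_share = 100 * row_totals["RMT"] / grand_total
print(row_totals, column_totals, grand_total)
print(round(rmt_share, 1), round(row_percentages("RMT")["Application"], 1))
```

The same counts reproduce the shares quoted in the text, such as the proportion of included articles allocated to RMT.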

Uses and meanings of ‘metric’

Table 2 shows the preceding and following context of some of the uses of ‘metric’ as a noun and as an adjective or adverb, each with an example of a reference in which it was used. Table 3 contains meanings (synonyms and definitions) of ‘metric’, also with the corresponding reference. The complete list of uses, synonyms and definitions can be found in S7 Appendix.

Table 2

Examples of uses of ‘metric’.

First as a noun, then as adjective/adverb.

Pre	Term	Post	Reference
ability	metric		[29]
latent	metric		[30]
observed score	metric		[31]
raw/raw score/raw-score interval	metric		[32]/ [33]/ [34]
difficulty	metric		[35]
	metric	of the items	[36]
Euclidian	metric		[37]
IRT	metric		[38]
logistic/logit/log-odds	metric		[39]/ [40]/ [41]
PROMIS	metric		[42]
Rasch	metric		[43]
T/T-score	metric		[44]/ [45]
z/z-score	metric		[46]/ [47]
Equating the two	metrics	placed the item parameter estimates from both samples on the same scale.	[48]
common	metric		[49]
better/good/more appropriate/strongest possible	metric		[50]/ [51]/ [52]/ [53]
ordinal	metric		[54]
quasi-interval	metric		[55]
interval	metric		[56]
linear	metric		[57]
quantitative	metric		[58]
cardinal	metric		[59]
continuous	metric		[60]
absolute	metric		[60]
transform raw ordinal scores into an interval	metric		[61]
[…] can depend on the	metric	used for the analysis (e.g., scale scores, standardized scores).	[23]
arbitrary	metric		[24]
unit of a	metric		[62]
ordered	metric	scale	[63]
[…] are required to obtain ordered-	metric	scales for respondents and items that place in-between ordinal and interval metrics.	[25]
quasi-	metric	scale	[64]
	metric	scale	[65]
	metric	continuum	[66]
	metric	score	[67]
	metric	variables	[26]
Types of variables used in scales: ordinal, nominal,	metric		[68]
	metric	level	[69]
or somehow forcing them [the measurements] to be expressed	metrically	by employing, for example, additive conjoint measurement.	[26]
	metric	properties	[70]
well-defined labor quality attributes with	metric	properties (i.e., with well-defined unit of measure and zero).	[71]
non	metric	properties of ordered categorical data	[27]
[…] the order of all item difficulties and individual abilities remains the same across all locations of the scale θ. This preservation of	metric	properties across the scale is of utmost importance for the scale anchoring procedures used in this study.	[28]
better/good/ideal/optimal/poor/superior	metric	properties	[72]/ [73]/ [74]/ [75]/ [76]/ [77]
	metric	quality	[78]
	metric	ruler	[79]
	metric	distortion	[80]

Table 3

Synonyms and definitions of ‘metric’.

Pre	Term	Post	Reference
continuum or	metric		[84]
continuous (	metric	)	[85]
	metric	or interval scale	[17]¹
measure or	metric		[86]
	metric	or scale	[87]
ratio scaled (	metric	)	[26]
	Metric	: The underlying value that is used to understand the score and how it is scaled so that meaning can be derived from a score. For the SCI-QOL, the reported metric is a ‘T Metric’ with an average of 50 and standard deviation unit of 10. For PROMIS, the metric reflects the general population average. For SCI-QOL, the metric reflects either general population (when anchored to a PROMIS or Neuro-QOL scale) or to the SCI-population (when it is a new bank that does not have a comparable PROMIS or Neuro-QOL bank). The metric is relative the population that was used to calibrate the items.	[81]
	metrics	(i.e., scales for continuous quantities in well-defined units).	[82]
a scale can be: […] (3) a system of units and numbers that define a mathematical	metric	(e.g., feet and inches are units of the imperial scale; centimeters and meters are units of the metric scale).	[83]
place on a	metric	having a mean of zero and a standard deviation of one.	[88]

¹ Its publication date is 2016, but it was published electronically (Epub) in 2015. This article and three others were identified as 2015 articles when the search was performed.

Eight groups of usages were identified: four relating to ‘metric’ as a noun, three to ‘metric’ as an adjective/adverb, and one overlapping both.

‘Metric’ as a noun

A first group of usages referred to the values that persons or items can take on the measurement continuum. Concerning persons, we found ‘ability metric’, ‘latent metric’, ‘observed score metric’, and ‘raw metric’/‘raw score metric’/‘raw-score interval metric’; concerning items, ‘difficulty metric’ and ‘metric of the items’. A second group indicated specific metrics, such as ‘Euclidian metric’, ‘IRT metric’, ‘logistic metric’, ‘logit metric’, ‘log-odds metric’, ‘PROMIS metric’, ‘Rasch metric’, ‘T metric’/‘T-score metric’, and ‘z metric’/‘z-score metric’. A third group was related to linking or equating metrics (‘Equating the two metrics placed the item parameter estimates from both samples on the same scale’), with ‘common metric’ being the most prevalent term. A fourth group contained adjectives qualifying ‘metric’ in a way that suggests a gradient: ‘good/better/more appropriate/strongest possible metric’. A fifth group consisted of another gradient, in this case describing different levels of a metric: ‘ordinal metric’, ‘quasi-interval metric’, ‘interval metric’, ‘linear metric’, ‘quantitative metric’, ‘cardinal metric’, ‘continuous metric’, ‘absolute metric’. There was a distinction between ordinal and interval levels in ‘Transform raw ordinal scores into an interval metric’: ‘ordinal’ qualified ‘score’ and ‘interval’ qualified ‘metric’. In contrast, Robinson and Lubienski [23] considered ‘scale scores’ and ‘standardized scores’ as two metrics; in addition, ‘raw metric’ also appeared in the first group of usages. Also, the Rasch model was sometimes presented as an application that helps to overcome the ‘arbitrary metrics’ problem of the observed score metric used in Classical Test Theory [24].

‘Metric’ as an adjective/adverb

The terms ‘ordered metric scale’, ‘quasi-metric scale’ and ‘metric scale’ could be part of the gradient in the fifth group. Noventa et al. [25] placed ‘ordered metric scale’ between ordinal and interval metrics. The last two terms would be equivalent to ‘quasi-interval metric’ and ‘interval metric’, respectively. A sixth group of usages considered ‘metric’ as an indicator that the noun it qualified was quantitative: ‘metric continuum’, ‘metric score’, ‘metric variables’. Along these lines, in ‘Types of variables used in scales: ordinal, nominal, metric’, ‘metric’ refers to interval and ratio variables. These variables are therefore at the ‘metric level’. Similarly, Krause [26] employs ‘metrically’ as a synonym of ‘quantitatively’: additive conjoint measurement is a method that makes it possible to express measurements metrically. A seventh group consisted of uses of ‘metric properties’. Sometimes the term related to having quantitative properties (‘with a well-defined unit of measure and zero’, or the fact that ordered categorical data do not have metric properties [27]). In contrast, Hartig et al. [28] used ‘preservation of metric properties’ in the sense of preserving the order of item difficulties and individual abilities across the scale. A gradient for ‘metric properties’ was observed, too: ‘poor/good/better/superior/optimal/ideal metric properties’. A final group of usages related to measurement itself: ‘metric quality’ seemed to refer to the quality of measurement; ‘metric ruler’ was used as a synonym of the item location map; and ‘metric distortion’ referred to the stretching or compression of the ‘units of a metric’.

Synonyms and definitions of ‘metric’

The following synonyms of ‘metric’ were observed: ‘continuum’, ‘continuous’ (adjective), ‘interval scale’, ‘measure’, ‘ratio scaled’ (adjective), ‘scale’. Tulsky et al. [81] defined ‘metric’ explicitly as the underlying value that is used to understand the score and how it is scaled so that meaning can be derived from a score. They also stated that the metric is relative to the population that was used to calibrate the items. For his part, Kyngdon [82] defined ‘metrics’ as scales for continuous quantities in well-defined units. Moreover, one of the meanings of ‘scale’ listed by Cook et al. [83] referred to a system of units and numbers that define a mathematical metric. Finally, a metric was sometimes specified as having a particular mean (e.g., 0) and standard deviation (e.g., 1).

Relationship between ‘metric’ and ‘interval scale’

How the terms ‘metric’, ‘interval scale’, ‘conjoint measurement’, and ‘fundamental measurement’ were used within RMT and IRT is presented in Table 4. The descriptive statistics show that ‘interval scale’ was more often used in RMT than ‘metric’, 716 (76%) articles containing ‘interval scale’ vs 413 (44%) articles containing ‘metric’, in comparison to 37 (13%) vs 275 (94%) in IRT. ‘Conjoint measurement’ and ‘fundamental measurement’ were hardly used in the IRT paradigm, more commonly in the context of RMT or both RMT and IRT (MTT).

Table 4

Contingency table of search terms by paradigm.

Paradigm	Metric	Interval Scale	Conjoint Measurement	Fundamental Measurement	Total
RMT	413 (43.8%)	716 (75.8%)	78 (8.3%)	106 (11.2%)	944
IRT	275 (93.9%)	37 (12.6%)	4 (1.4%)	4 (1.4%)	293
MTT	63 (63%)	54 (54%)	27 (27%)	15 (15%)	100
Total	751 (56.2%)	807 (60.4%)	109 (8.2%)	125 (9.3%)	1337

Abbreviations: RMT, Rasch Measurement Theory; IRT, Item Response Theory; MTT, Modern Test Theory. The percentages are by row. Note that the sum by rows exceeds the respective Total, because an article can contain more than one term.

Out of the 751 articles that used ‘metric’, the use of ‘metric’ alone and in combination with ‘interval scale’ was balanced within the RMT paradigm and in those papers with a similar presence of both RMT and IRT (MTT). In contrast, in IRT there was a clear predominance of using ‘metric’ alone rather than in combination with ‘interval scale’ (Table 5).

Table 5

Contingency table of the combined use of ‘metric’ and ‘interval scale’ by paradigm.

Paradigm	Metric	Metric with interval scale	Metric without interval scale
RMT	413	216 (52.3%)	197 (47.7%)
IRT	275	21 (7.6%)	254 (92.4%)
MTT	63	26 (41.3%)	37 (58.7%)
Total	751	263 (35%)	488 (65%)

Abbreviations: RMT, Rasch Measurement Theory; IRT, Item Response Theory; MTT, Modern Test Theory.
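The split in Table 5 follows directly from two counts per paradigm: the number of articles using ‘metric’ and the number of those that also use ‘interval scale’. The sketch below recomputes the split from the table's counts; the names are invented for illustration.

```python
# Counts from Table 5.
with_metric = {"RMT": 413, "IRT": 275, "MTT": 63}           # 'Metric' column
with_interval_scale = {"RMT": 216, "IRT": 21, "MTT": 26}    # 'Metric with interval scale'


def split(paradigm):
    """Return ((count, %) with 'interval scale', (count, %) without it)."""
    total = with_metric[paradigm]
    both = with_interval_scale[paradigm]
    only = total - both
    return ((both, round(100 * both / total, 1)),
            (only, round(100 * only / total, 1)))


for paradigm in with_metric:
    print(paradigm, split(paradigm))

total_both = sum(with_interval_scale.values())
total_metric = sum(with_metric.values())
print(total_both, total_metric - total_both)
```

The contrast between paradigms is visible in the ‘without’ share, which is close to one half for RMT but above nine in ten for IRT.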

Ability to produce interval scaling

In response to the third specific aim, whether interval measures can be obtained with RMT or IRT models, three incompatible answers were derived from the thematic analysis: (1) only RMT models can provide interval measures, (2) both RMT and IRT models can provide interval measures, (3) no MTT model can provide interval measures. An outline of these three themes is presented in Table 6.

Table 6

Themes exposing the incompatible views concerning the ability to obtain interval measures using RMT and IRT metrics.

Only RMT	Both RMT and IRT	No MTT
The theory of conjoint measurement provides the only analytical framework that could be invoked to evaluate whether the resulting scale could be said to have interval as opposed to ordinal or nominal properties. In practice, such a rationale has seldom been applied empirically, and generally hinges upon making an analogy between the Rasch model and a specific version of the theory of conjoint measurement known as additive conjoint measurement. [40, p. 564]	In fact, the relation between conjoint measurement and IRT is not limited to the Rasch model; it can be set up for quite a broad class of probabilistic latent variable models. The most general and precise treatment we know of in this context is given by Scheiblechner [116]. [96, p. 110]	The theory of conjoint measurement describes a hierarchy of conditions—the so-called cancellation conditions directly diagnostic of the ordinal-quantitative distinction. However, in general, those applying IRTs show no inclination to employ this conceptual resource. Instead, the prevailing tendency is to evaluate an IRT via indices of goodness of fit that are not necessarily diagnostic of quantitative structure. Even in this potentially promising area of psychometrics, practitioners generally follow the time-honoured tradition of Stevens, begging the question that the attributes of interest are quantitative, instead of investigating it. [104, p. 103]
For items and persons to be able to conjointly define this common interval (logit) scale, mathematical models need to specify that ICCs should never cross and be parallel […]. Only Rasch models satisfy the requirements of non-crossing and parallel ICCs. [89, p. 290]	An alternative approach would be to consider polynomial conjoint measurement. This would be an important advance since it could be used to justify interval scale interpretations when items have varying discriminations. [97, p. 16]	That is, it is possible that attributes that psychometricians aspire to measure are heterogeneous orders, that is, non-measurable attributes, and this fact is not incompatible with observing statistical fit to IRT models. For example, Black et al. [111], Commons et al. [112], Kyngdon [113], and Kyngdon and Richards [114] all present tests in which the only discernable structure manifest in the item sets is ordinal and yet IRT models fit the relevant data. [105, p. 6]
The main strength of this model [Rasch] is that it allows for testing if the simple summed raw score is a sufficient statistic (which cannot be done with other models) and also tests whether or not the data are consistent with the axioms of conjoint measurement, so providing a transformation to interval scaling, which also cannot be done with other models. [90, p. 142]	The IRT statistical methods generate empirical estimates of item difficulty and person ability that are log-transformed and converted to a true interval measure, thus eliminating the limitations associated with rank order, ordinal scales. [98, p. S143]	At present, the claim that psychometricians are able to measure psychological attributes on interval scales is a myth, and it will remain so until it is recognized that the possibility of attributes, such as abilities, being merely ordinal is one that must be seriously considered. [106, p. 267]

Abbreviations: RMT, Rasch Measurement Theory; IRT, Item Response Theory; MTT, Modern Test Theory; ICC, Item Characteristic Curve.

The proponents of the first theme [40, 65, 89–95] argue that RMT models (as opposed to IRT models) test whether or not data are consistent with the axioms of Additive Conjoint Measurement (ACM). As ACM is the only analytical framework to prove if a metric is at the interval level, only RMT models can provide interval scaling. 
They also argue that if Item Characteristic Curves (ICCs) cross each other (so that the ordering of the items depends on the ability location), interval scales cannot be obtained. As regards the second theme [96-103], in many cases no explanation was given of how an interval scale can be obtained via an IRT model; it was mostly taken for granted, with statements to the effect that IRT generates estimates of item difficulty and person ability that are log-transformed and converted to a true interval measure [98]. One case justified it via simulation [103], and very few mentioned Polynomial Conjoint Measurement theory as a means to justify interval scale interpretations when ICCs cross each other. The rationale for the last theme [104-110] referred to the untested assumption made in MTT models that the attributes assessed are quantitative. The proponents of this view maintain that this assumption must be tested before attributes can be measured on interval scales. One way of testing it is to apply the axioms of ACM directly, rather than to assess fit to an MTT model. In fact, there is evidence that IRT and RMT models can fit ordinal data [111-114]. In any case, Briggs [115] expressed a pragmatic perspective concerning the quantity assumption, stating that “until one can demonstrate empirically that a violation of the quantity assumption (i.e., the ‘pathology’ of psychometricians, to use Michell’s language) leads to significant practical consequences […] there will be little incentive to invest the time and effort into a research agenda focused on the discovery of psychological attributes that are measurable in a classical sense.”
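The argument about crossing ICCs can be made concrete with a small numerical sketch. It assumes logistic ICCs of the usual form P(θ) = 1 / (1 + exp(−a(θ − b))), with all discriminations a equal for the Rasch case and unequal for a two-parameter IRT case. The two items and their parameter values are hypothetical and chosen only for illustration; they are not taken from any article in the review.

```python
import math

def icc(theta, difficulty, discrimination=1.0):
    """Probability of a positive response under a logistic item model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical items as (difficulty, discrimination); illustrative values only.
rasch_items = {"A": (0.0, 1.0), "B": (0.5, 1.0)}  # equal discriminations: parallel ICCs
twopl_items = {"A": (0.0, 0.5), "B": (0.5, 2.0)}  # unequal discriminations: ICCs cross

def easier_item(items, theta):
    """Label of the item with the higher response probability at ability theta."""
    return max(items, key=lambda name: icc(theta, *items[name]))

def ordering_by_ability(items, thetas=(-2.0, 2.0)):
    """Easier item at a low and at a high ability location."""
    return [easier_item(items, t) for t in thetas]

print(ordering_by_ability(rasch_items))  # same item is easier at both locations
print(ordering_by_ability(twopl_items))  # the easier item changes with ability
```

With equal discriminations the same item is the easier one at every ability location, whereas with unequal discriminations the item ordering reverses between low and high ability, which is the situation the first theme regards as incompatible with a common interval scale.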

Discussion

In this study, we aimed to examine the use of the term ‘metric’ in health and social sciences’ literature, focusing on the interval scale implication of the term in MTT. By employing a Text Mining strategy, we were able to identify a number of different uses of ‘metric’ in MTT. As a noun, in many cases it can be considered a synonym of ‘scale’. The qualifiers ‘good’, ‘more appropriate’, and ‘strongest possible’ applied to ‘metric’ in some articles suggest that not all metrics are equal, but that some metrics are better than others. The more a metric approaches the interval level (i.e., the more equal the differences between values are), the better it is. As an adjective, ‘metric’ mainly referred to measurement or indicated quantification (interval or ratio level), as is done in multidimensional scaling [5]. Based on the findings of the thematic analysis, three incompatible views were identified concerning whether interval measures can be obtained via MTT models: only RMT, both RMT and IRT models, or no MTT model can provide interval measures. Be that as it may, the term ‘interval scale’ was used considerably more in the RMT than in the IRT paradigm.

Customized definitions of ‘metric’

Based on the findings of this review, there is no single formal definition of the term ‘metric’ (as a noun) in the psychometric sense. This has given rise to customized uses of the term, which can cause confusion. One example is the definition of ‘common metric’ as an IRT model, such as the GRM (Graded Response Model) or the GPCM (Generalized Partial Credit Model), that comprises parameters of items from various measures measuring a common variable, as it appears on the http://www.common-metrics.org/ web page. A link to the web page was identified in one of the articles from our review [117]. Another use which can cause confusion is ‘mathematical metric’ with a meaning other than the formal mathematical definition of ‘metric’ (see S1 Appendix). In an article identified in this review, Cook et al. [83] stated that a mathematical metric can be defined as a system of units and numbers (see Table 3). This statement deviates from the formal mathematical definition of ‘metric’, where the distances among values are known, but they do not need to be equal between consecutive values (i.e., there is no need for a unit). Indeed, if for real numbers x and y we define the metric d as d(x, y) = |x − y| / (|x| + |y|) if x and y are not both 0, and d(x, y) = 0 if x = y = 0, then, taking for instance the real numbers 1, 2, 3, the distance from 1 to 2 is not the same as from 2 to 3: 1/3 = d(1, 2) ≠ d(2, 3) = 1/5. Hence, there is no constant unit.
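The distances quoted in this example can be checked numerically. The sketch below assumes that the metric is d(x, y) = |x − y| / (|x| + |y|) with d(0, 0) = 0, a form that is consistent with the stated values 1/3 and 1/5 and with the special case at x = y = 0; the formal definition is given in S1 Appendix. The loop spot-checks the metric axioms on a small grid of integers, which is not a proof.

```python
from fractions import Fraction
from itertools import product

def d(x, y):
    """Assumed metric: |x - y| / (|x| + |y|), with d(0, 0) = 0."""
    if x == 0 and y == 0:
        return Fraction(0)
    return Fraction(abs(x - y), abs(x) + abs(y))

# Consecutive integers are not equally far apart under d.
print(d(1, 2), d(2, 3))

# Spot-check symmetry, identity, and the triangle inequality on a small grid.
points = range(-5, 6)
for x, y, z in product(points, repeat=3):
    assert d(x, y) == d(y, x)
    assert (d(x, y) == 0) == (x == y)
    assert d(x, z) <= d(x, y) + d(y, z)
```

Exact fractions are used instead of floating-point numbers so that the comparisons in the triangle inequality are not affected by rounding.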

Measurement

The question that remains is whether ‘units’ are needed to measure. Not all the proponents of the three themes identified in this review understood ‘measurement’ in the same way. The mainstream of the first and second themes (ability to obtain interval scales via (1) only Rasch or (2) all MTT models) seems to have a ‘realist’ conception of psychological attributes, as described by Borsboom and Mellenbergh [96]. The realists consider that measurement consists in finding out people’s position on an attribute that exists quite independently of the measurement process. After having transformed raw scores (which, according to them, are ordinal and therefore have no constant unit) to interval scales via Rasch or IRT models, both perspectives sometimes refer to the ‘logit’ as the unit of their scale [100]. Nevertheless, Humphry [118] argues that in both cases this logit unit comes from a probabilistic model, and therefore a measurement unit can only exist by virtue of uncertainty and error. As opposed to physical units, the units derived from MTT models do not have an ontological status. Therefore, he suggests calling them ‘quasi-units’. Hence, there is no consensus as regards the existence of a unit in MTT scales, and no clear answer on whether it is needed to measure. In any case, according to the findings of our review, only one psychometric scale, the Lexile scale [63], has an explicitly defined unit of measurement. For their part, the proponents of the third theme (no MTT model can provide interval scaling) rely on the classical definition of measurement (a measurement is the product of a real number and a unit [63]). Therefore, a unit is needed. In contrast, according to the VIM (International Vocabulary of Metrology) [119] a unit is not needed. Indeed, measurement is defined as a “process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity”. 
A quantity is a “property of a phenomenon, body, or substance, where the property has a magnitude that can be expressed as a number and a reference”. The reference can be not only a measurement unit, but also a measurement procedure, a reference material, or a combination of such. Finally, the VIM distinguishes an ordinal quantity as a “quantity, defined by a conventional measurement procedure, for which a total ordering relation can be established, according to magnitude, with other quantities of the same kind, but for which no algebraic operations among those quantities exist”. An ordinal quantity could not be measured according to the classical definition, because it has no measurement units, and differences and ratios among ordinal quantities have no physical meaning. Whether or not ordinal quantities can be measured, they are in any case more limited than interval scales, because algebraic operations cannot be performed on them. In contrast, a mean or a standard deviation can be computed on interval scales; therefore, when a metric is defined with a mean and standard deviation (such as a T-metric), formally this metric needs to be at the interval level. The proponents of the first and second themes (for whom RMT and IRT metrics are at the interval level) would argue that a Rasch or an IRT metric offers more possibilities than the raw scores (which according to them are ordinal, although that was also a topic of discussion in the articles from our review; see, e.g., [99]) because more operations are allowed on them. Whether correct or not, RMT articles justified obtaining interval scales via an RMT model in more detail than IRT articles did for IRT models. Nevertheless, the third theme criticizes the common practice in MTT of justifying interval scaling via indices of goodness of fit to RMT or IRT models. 
It is argued that fit to an MTT model should not be considered as evidence for interval scaling, and that what should be investigated is whether psychological attributes actually possess a quantitative structure. To avoid customized uses of ‘metric’, a formal definition of the term in the psychometric sense should be established. Probably the definition that describes ‘metric’ most concretely in this sense is the numbers that the observed measures take on when describing individuals’ standings on the construct of interest [120]. Another option would be to parallel what is done in the VIM with the term ‘quantity’, namely to specify ‘ordinal metric’ when the metric is ordinal and not interval or ratio. We believe that when ‘metric’ is used alone it is implicitly assumed that this metric is at least at the interval level. In this way, if ‘ordinal metric’ is specified, it is clear that algebraic operations are not allowed. Although there is no consensus on whether Rasch and IRT metrics are interval, authors who use ‘metric’ with no specification should justify why they think that this is the case. A refinement to the four levels of scale established by Stevens [4] could be considered. As the term ‘scale’ is used to measure quantities, we believe that ‘nominal scale’ is not appropriate. Instead, ‘nominal property’, as suggested in the VIM, should be used. In addition, the term ‘ordered metric scale’ should be considered. This term was introduced in the mid-20th century [3] and it is still used [25, 63]. It refers to a metric between the ordinal and interval levels where the magnitude of differences can be ordered. Therefore, we suggest the levels ordinal, ordered, interval, and ratio. In relation to the previous point, when ‘metric’ is used alone, it applies to the latter two categories. The first two should be compulsory adjectives to describe the nature of the metric.
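The remark above that a T-metric formally presupposes the interval level can be illustrated with a short sketch. It assumes the common convention that a T-metric is a linear rescaling to mean 50 and standard deviation 10; the two groups of scores are hypothetical. The second part shows why means are problematic on a merely ordinal metric: an order-preserving transformation (here the square root, an arbitrary choice) leaves every individual ranking intact but reverses the comparison of group means.

```python
import math
import statistics

def to_t_metric(scores):
    """Linearly rescale scores to mean 50 and SD 10 (assumes scores are not all equal)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [50 + 10 * (s - mean) / sd for s in scores]

def mean_difference(a, b, transform=lambda v: v):
    """Difference of group means after applying a transformation to each score."""
    mean_a = statistics.fmean([transform(v) for v in a])
    mean_b = statistics.fmean([transform(v) for v in b])
    return mean_a - mean_b

# Hypothetical scores for two groups (illustrative values only).
group_a = [1, 1, 16]
group_b = [5, 5, 5]

raw_gap = mean_difference(group_a, group_b)              # group A has the higher mean
sqrt_gap = mean_difference(group_a, group_b, math.sqrt)  # group B has the higher mean

print(raw_gap, sqrt_gap)
print(to_t_metric([1, 2, 3, 4, 5]))
```

If only the order of the scores is meaningful, both versions of the data carry the same information, so a conclusion that depends on which version is averaged is not supported by the data; on an interval metric only linear rescalings such as the T-metric are admissible, and these preserve the sign of the mean difference.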

Strengths and limitations

The use of Text Mining strategies in conducting systematic literature searches and reviews is promising. It allowed us to automatically select the articles according to the refined criteria. The table giving the preceding and following context of each occurrence of ‘metric’ was extremely valuable for collecting the different uses of the term. At the same time, it is important to stress that in the conversion from PDF to TXT, some words were not correctly identified, such as ‘metric’ being read as ‘menic’ in one case. If a word such as ‘psychometrics’ was hyphenated as ‘psycho-metrics’, the corresponding article was flagged as containing ‘metric’. Therefore, a few false positives and false negatives were present, but we believe that they occurred in a very low proportion and had no major impact on our results. We did not compare the efficacy of TM against manual extraction, and this could be considered in the future as a methodological study. Nevertheless, it would have been cumbersome to manually fill in an Excel file with 2635 articles placed in the rows and, for each article, to check whether it contained each of the six search terms, being careful that they did not appear only in the reference section, that ‘Rasch’ referred to G. Rasch [12], that ‘IRT’ stood for Item Response Theory, and that ‘metric’ was not part of a proper name. Moreover, for each use of ‘metric’, the preceding and following context would have had to be collected to discuss the different uses of the term. The corresponding file generated automatically via TM contains 4312 rows. Another limitation of the present study is the year restriction in the systematic search. Although almost 2000 records were excluded from review, we believe that we have gathered the main uses and meanings of the term ‘metric’, as well as the main views concerning the ability to obtain interval scales via RMT and IRT. 
In effect, our TM-based strategy opens up possibilities to review a very large number of manuscripts. The approach is innovative, and the way we have undertaken it is an attempt to manage the very large number of manuscripts forthcoming. We made the assumption that relevant findings from papers published prior to 2001 would be cited in the 2001–2005 period, and that relevant findings from 2006–2010 would be cited in papers from 2011–2015. Although this assumption is untested, Table 3 (third column, second point) shows an example that supports it: findings from articles from 2006 [113], 2008 [112], and 2011 [111] were cited in a 2012 paper [105] to support the hypothesis that fit to an MTT model should not be considered as evidence for interval scaling. Also, the justification based on the increase in the number of records after 2001 is potentially biased, because the number of papers published in recent years has increased. A further limitation is having restricted the review to MTT. Although it would have made sense to include other potential areas such as factor analysis, we wanted to focus on the existing dispute between RMT and IRT. Finally, this study differs from mainstream systematic searches and reviews in that its aim was to shed light on a terminological issue, namely the implications of the use of the term ‘metric’. We are not aware of any other study that has done something similar in the field of psychometrics.
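The context table and the hyphenation problem described in this section can be sketched as follows. This is a minimal illustration and not the algorithm used in the review: the function names, the five-word window, the inclusion of the plural ‘metrics’, and the sample sentence are all assumptions made for the example.

```python
import re

def repair_hyphenation(text):
    """Rejoin words that were hyphenated across a line break in the converted text."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

# Standalone 'metric' or 'metrics'; 'psychometrics' as one word does not match.
METRIC = re.compile(r"\bmetrics?\b", re.IGNORECASE)

def metric_contexts(text, window=5):
    """Return (preceding words, match, following words) for each standalone match."""
    rows = []
    for match in METRIC.finditer(text):
        before = text[:match.start()].split()[-window:]
        after = text[match.end():].split()[:window]
        rows.append((" ".join(before), match.group(0), " ".join(after)))
    return rows

# Hypothetical converted text with a word hyphenated at a line break.
sample = ("Scores were placed on a common metric across forms. "
          "Classical psycho-\nmetrics offers no such link.")

print(len(metric_contexts(sample)))                      # hyphenated word is also flagged
print(metric_contexts(repair_hyphenation(sample)))       # only the genuine use remains
```

Rejoining words at line breaks removes this kind of false positive, but it would also merge genuine hyphenated compounds that happen to fall at a line end, so it is a trade-off rather than a complete fix.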

Mathematical metric definition.