Rose L Andrew, Arianne Y K Albert, Sebastien Renaut, Diana J Rennison, Dan G Bock, Tim Vines.
Abstract
Data are the foundation of empirical research, yet all too often the datasets underlying published papers are unavailable, incorrect, or poorly curated. This is a serious issue, because future researchers are then unable to validate published results or reuse data to explore new ideas and hypotheses. Even if data files are securely stored and accessible, they must also be accompanied by accurate labels and identifiers. To assess how often problems with metadata or data curation affect the reproducibility of published results, we attempted to reproduce Discriminant Function Analyses (DFAs) from the field of organismal biology. DFA is a commonly used statistical analysis that has changed little since its inception almost eight decades ago, and therefore provides an opportunity to test reproducibility among datasets of varying ages. Out of 100 papers we initially surveyed, 14 were excluded because they did not present the common types of quantitative result from their DFA or gave insufficient details of their DFA. Of the remaining 86 datasets, there were 15 cases for which we were unable to confidently relate the dataset we received to the one used in the published analysis. The reasons ranged from incomprehensible or absent variable labels, to the DFA being performed on an unspecified subset of the data, to the dataset we received being incomplete. We focused on reproducing three common summary statistics from DFAs: the percent variance explained, the percentage correctly assigned and the largest discriminant function coefficient. The reproducibility of the first two was fairly high (20 of 26, and 44 of 60 datasets, respectively), whereas our success rate with the discriminant function coefficients was lower (15 of 26 datasets). When considering all three summary statistics, we were able to completely reproduce 46 (65%) of 71 datasets.
While our results show that a majority of studies are reproducible, they highlight the fact that many studies still are not the carefully curated research that the scientific community and public expect.
Keywords: Data archiving; Data curation; Repeatability; Statistics
Year: 2015 PMID: 26290793 PMCID: PMC4540019 DOI: 10.7717/peerj.1137
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Summary of papers excluded from or included in the study, in total and listed by the statistical software originally used to analyse the data.
Those included in the study are further broken down by the reasons that reanalysis was not attempted or by the results of the reanalysis. The reanalysis outcome was classified as a complete match when all reanalyzed summary statistics were within 1% of the published values, a partial match when at least one (but not all) met this criterion, and no match when none met this criterion. The metrics considered were the percent variance explained (PVE), a discriminant function coefficient, and the percent assigned correctly (PAC).
| Software | Excluded | Included | Incorrect data file | Insufficient metadata | Data discrepancy | Reanalysed: no match | Reanalysed: partial match | Reanalysed: complete match |
|---|---|---|---|---|---|---|---|---|
| TOTAL | 14 | 86 | 2 (2.3%) | 7 (8.1%) | 7 (8.1%) | 12 (14%) | 12 (14%) | 46 (53.5%) |
| JMP | 2 | 2 | 0 (0%) | 1 (50%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (50%) |
| MATLAB | 1 | 2 | 0 (0%) | 0 (0%) | 1 (50%) | 0 (0%) | 0 (0%) | 1 (50%) |
| R | 0 | 5 | 2 (40%) | 0 (0%) | 1 (20%) | 1 (20%) | 0 (0%) | 1 (20%) |
| SAS | 1 | 15 | 0 (0%) | 3 (20%) | 2 (13%) | 3 (20%) | 2 (13%) | 5 (33%) |
| SPSS | 6 | 30 | 0 (0%) | 0 (0%) | 2 (7%) | 5 (17%) | 6 (20%) | 17 (57%) |
| STATISTICA | 0 | 9 | 0 (0%) | 1 (11%) | 1 (11%) | 2 (22%) | 0 (0%) | 5 (56%) |
| SYSTAT | 0 | 8 | 0 (0%) | 0 (0%) | 0 (0%) | 1 (12%) | 2 (25%) | 5 (62%) |
| Other | 1 | 2 | 0 (0%) | 1 (50%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (50%) |
| Unknown | 3 | 13 | 0 (0%) | 1 (8%) | 0 (0%) | 0 (0%) | 2 (15%) | 10 (77%) |
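The classification rule described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "within 1%" is read here as a relative difference measured against the published value, and the function names `within_tolerance` and `classify` are hypothetical, not taken from the study.

```python
def within_tolerance(published, reanalyzed, tol=0.01):
    """Return True when the reanalyzed value is within `tol` of the published one.

    Assumption: the tolerance is a relative difference measured against
    the published value.
    """
    if published == 0:
        return reanalyzed == 0
    return abs(reanalyzed - published) <= tol * abs(published)


def classify(pairs, tol=0.01):
    """Classify a reanalysis as complete (C), partial (P) or no match (N).

    `pairs` is a list of (published, reanalyzed) values, one per metric.
    """
    if not pairs:
        raise ValueError("at least one metric is needed to classify a study")
    hits = [within_tolerance(pub, rean, tol) for pub, rean in pairs]
    if all(hits):
        return "C"
    if any(hits):
        return "P"
    return "N"


# Illustrative study reporting two metrics as (published, reanalyzed).
example = [(47.3, 45.8), (93.2, 93.2)]
print(classify(example))  # only the second metric is within 1%
```

Passing a larger `tol` applies a looser criterion to the same pairs without changing the logic.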
Published results and reanalyzed values of DFAs based on data files received from authors.
DFAs included in the current study were categorized according to the adequacy of data files and metadata, and the reproducibility of three metrics (percent variance explained, the largest coefficient and percent assigned correctly) among those that were able to be reanalyzed. Category indicates whether the data set was excluded from the study (E), was incorrect (I), had inadequate metadata (M), displayed data discrepancies (D) or was reanalysed (R). The reasons for excluding data sets from the study or preventing us from reanalyzing the data are summarized. The reanalysis outcome was classified as a complete match (C) when all reanalyzed summary statistics were within 1% of the published values, a partial match (P) when at least one (but not all) met this criterion, and no match (N) when none met this criterion. The same classification was applied to studies using the ‘close’ criterion (within 5%).
| Year | Software | Published / reanalyzed values (PVE, COEF, PAC order; unreported metrics omitted) | Categ. | Reason | Match (within 1%) | Close (within 5%) |
|---|---|---|---|---|---|---|
| 1991 | SAS | 47.3 / 45.8; 93.2 / 93.2 | R | | P | C |
| 1993 | SAS | 83.2 / 84.2; 18.94 / 20.609 | R | | N | P |
| 1995 | Other (STATGRAPHICS) | 79.1 / 79.1; 2.87 / −2.868; 72 / 71.9 | R | | C | C |
| 1995 | SPSS | 0.892 / 0.7; 100 / 100 | R | | P | P |
| 1995 | SPSS | 57.3 / 57.3; 91.4 / 91.4 | R | | C | C |
| 1995 | SPSS | 4.02 / −3.805; 100 / 100 | R | | P | P |
| 1995 | SYSTAT | −1.09 / 1.091; 92 / 86.9 | R | | P | P |
| 1995 | SYSTAT | 2.115 / −2.115; 100 / 100 | R | | C | C |
| 1997 | Not stated | | E | Not all variables are morphological | | |
| 1997 | SPSS | 67 / 66.9 | R | | C | C |
| 1997 | SPSS | 96.7 / 92.6; 1.5 / −2.488; 100 / 98.6 | R | | N | P |
| 1997 | SYSTAT | 99.5 / 99; −0.57 / 0.611; 89 / 88.7 | R | | P | P |
| 1999 | Not stated | | M | Row groupings don’t match paper | | |
| 1999 | Not stated | | E | No PVE, coef. or PAC | | |
| 1999 | SAS | 65 / 64.2; 61 / 61.4 | R | | P | C |
| 1999 | SPSS | | E | No PVE, coef. or PAC | | |
| 1999 | SPSS | 73.4 / 73.4 | R | | C | C |
| 1999 | SYSTAT | 90 / 91.7 | R | | N | C |
| 2001 | Not stated | 100 / 100 | R | | C | C |
| 2001 | SAS | 96.7 / 96.4 | R | | C | C |
| 2001 | SAS | 71.3 / 93.8 | R | | N | N |
| 2001 | SPSS | −1.072 / −1.072; 96 / 100 | R | | P | C |
| 2001 | SPSS | 100 / 100 | R | | C | C |
| 2001 | SPSS | 96 / 96 | R | | C | C |
| 2001 | SPSS | 5.228 / −5.228; 86 / 82.6 | R | | P | C |
| 2001 | STATISTICA | 94.4 / 94.4 | R | | C | C |
| 2003 | Not stated | 90.3 / 90.3 | R | | C | C |
| 2003 | Not stated | −2.176 / −2.176; 90.6 / 90.6 | R | | C | C |
| 2003 | SAS | | M | Column labels in Spanish | | |
| 2003 | SAS | | D | Extra rows | | |
| 2003 | SPSS | 1.011 / 1.011; 100 / 100 | R | | C | C |
| 2003 | SPSS | 3.5; 81 (published only) | D | Extra rows | | |
| 2003 | SPSS | | D | Missing rows and row assignments unclear | | |
| 2003 | SPSS | 88.9 / 87.5 | R | | N | C |
| 2003 | SPSS | 0.772 / 0.766; 84.3 / 84.3 | R | | C | C |
| 2003 | STATISTICA | | M | Column labels unclear | | |
| 2003 | SYSTAT | 1.28 / −1.275; 81 / 80.6 | R | | C | C |
| 2005 | JMP | | E | No PVE, coef or PAC | | |
| 2005 | Not stated | 79.9 / 79.7 | R | | C | C |
| 2005 | Not stated | 83 / 83.1; 73 / 74.3 | R | | P | C |
| 2005 | Not stated | 100 / 100 | R | | C | C |
| 2005 | Other (S-Plus) | | E | No PVE, coef or PAC | | |
| 2005 | Other (LINDA) | | M | Unclear groups | | |
| 2005 | SAS | | M | Column labels missing | | |
| 2005 | SAS | 94.3 / 94.9 | R | | C | C |
| 2005 | SPSS | 46 / 38.2 | R | | N | N |
| 2005 | SPSS | 55.1 / 55.6; 0.352 / 0.779; 71.8 / 70.3 | R | | P | P |
| 2005 | STATISTICA | 67.5 / 67 | R | | C | C |
| 2005 | STATISTICA | 97 / 98.8 | R | | N | C |
| 2005 | SYSTAT | 100 / 100 | R | | C | C |
| 2007 | MATLAB | | D | Missing columns and insufficient metadata | | |
| 2007 | Not stated | 1.1 / 1.097; 97 / 96.6 | R | | C | C |
| 2007 | Not stated | 87.9 / 87.9 | R | | C | C |
| 2007 | SAS | 8.623 / 3.495; 97.3 / 98.6 | R | | N | P |
| 2007 | SAS | 76 / 76.6 | R | | C | C |
| 2007 | SAS | | D | Missing columns | | |
| 2007 | SPSS | | E | No PVE, coef or PAC | | |
| 2007 | SPSS | 76.9 / 76.9 | R | | C | C |
| 2007 | SPSS | 0.689 / 0.647; 100 / 85.4 | R | | N | N |
| 2007 | SPSS | 61.8 / 61.6 | R | | C | C |
| 2007 | SPSS | | E | Final model not given | | |
| 2007 | SPSS | 84 / 83.3 | R | | C | C |
| 2007 | STATISTICA | 96.1 / 96.2 | R | | C | C |
| 2007 | STATISTICA | 93.3 / 93.3; −0.951 / −0.951; 89.2 / 89.2 | R | | C | C |
| 2007 | STATISTICA | 1.68 / 1.678; 83.7 / 83.7 | R | | C | C |
| 2007 | SYSTAT | 90.4 / 90.4; 90 / 90 | R | | C | C |
| 2009 | Not stated | 91.2 / 91.2 | R | | C | C |
| 2009 | Not stated | | E | Not DFA | | |
| 2009 | Not stated | 40.8 / 41.1; 79 / 78.3 | R | | C | C |
| 2009 | Not stated | 0.242 / 0.084; 100 / 100 | R | | P | P |
| 2009 | SAS | 69 / 69.2; 1.05 / −1.053 | R | | C | C |
| 2009 | SAS | 0.95 / 0.604; 80 / 80 | R | | P | P |
| 2009 | SPSS | 100 / 100 | R | | C | C |
| 2009 | SPSS | | E | Data not morphological | | |
| 2009 | SPSS | 76.4 / 77 | R | | C | C |
| 2009 | STATISTICA | | D | Missing rows | | |
| 2009 | STATISTICA | 100 / 98.1 | R | | N | C |
| 2009 | SYSTAT | 2.8 / 2.795; 91 / 91.5 | R | | C | C |
| 2011 | JMP | | E | No PVE, coef or PAC | | |
| 2011 | JMP | | M | Column labels unclear | | |
| 2011 | JMP | −7.06 / 7.063; 100 / 100 | R | | C | C |
| 2011 | MATLAB | 65.5 / 65 | R | | C | C |
| 2011 | MATLAB | | E | Not classical DFA | | |
| 2011 | Not stated | 90 / 90.5 | R | | C | C |
| 2011 | R | | D | Missing rows | | |
| 2011 | R | | I | Wrong file | | |
| 2011 | R | | I | Wrong file | | |
| 2011 | R | 58 / 88.3; 56 / 57.1 | R | | N | P |
| 2011 | R | 80.4 / 80.4 | R | | C | C |
| 2011 | SAS | | E | Spanish | | |
| 2011 | SAS | 100 / 100 | R | | C | C |
| 2011 | SPSS | 81.8 / 81.7 | R | | C | C |
| 2011 | SPSS | 97.7 / 97.7; 87.5 / 87.5 | R | | C | C |
| 2011 | SPSS | 58.3 / 58.3; 62.9 / 62.9 | R | | C | C |
| 2011 | SPSS | 87.7 / 87.5 | R | | C | C |
| 2011 | SPSS | | E | Final model not given | | |
| 2011 | SPSS | | E | Final model not given | | |
| 2011 | SPSS | 100 / 100 | R | | C | C |
| 2011 | SPSS | 95.7 / 93.9 | R | | N | C |
| 2011 | SPSS | 96 / 89.7; 1.202 / 0.068; 100 / 100 | R | | P | P |
Notes.
Authors were contacted individually once reanalyses were performed. Only authors wishing to be identified are cited above. In addition, several authors agreed to be cited, but not identified directly (Amini, Zamini & Ahmadi, 2007; Audisio et al., 2001; Bulgarella et al., 2007; Ekrt et al., 2009; Foggi, Rossi & Signorini, 1999; Ginoris et al., 2007; Gouws, Stewart & Reavell, 2001; López-González et al., 2001; Magud, Stanisavljević & Petanović, 2007; Malenke, Johnson & Clayton, 2009; Schagerl & Kerschbaumer, 2009; Wasowicz & Rostanski, 2009).
Figure 1. Summary of the reproducibility of the 71 reanalyzed data sets and of the problems preventing reanalysis of 15 papers (see Table 1).
Figure 2. PVE values from reanalysis versus published DFA. Points on the 1:1 line represent analyses differing by 1% or less.
Figure 3. PAC values from reanalysis versus published DFA. Points on the 1:1 line represent analyses differing by 1% or less.
Figure 4. Discriminant function coefficients from the reanalysis versus the published results. Absolute values are used because the signs of coefficients depend on the order of variables. Points on the 1:1 line represent analyses differing by 1% or less.
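Comparing coefficients by absolute value can be illustrated with a short Python sketch. It assumes that "differing by 1% or less" means a relative difference measured against the published magnitude; the function name `coefficient_matches` is hypothetical and the input values are purely illustrative.

```python
def coefficient_matches(published, reanalyzed, tol=0.01):
    """Compare two discriminant function coefficients by magnitude only.

    Assumption: the tolerance is relative to the published magnitude,
    and the sign of either coefficient is ignored.
    """
    pub, rean = abs(published), abs(reanalyzed)
    if pub == 0:
        return rean == 0
    return abs(rean - pub) <= tol * pub


# A pure sign flip still counts as a match.
print(coefficient_matches(2.115, -2.115))  # True
# A genuine change in magnitude does not.
print(coefficient_matches(1.5, -2.488))    # False
```

Ignoring the sign keeps a reanalysis that recovers the same discriminant axis from being scored as a failure just because the axis points the other way.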