Pinocchio testing in the forensic analysis of waiting lists: using public waiting list data from Finland and Spain for testing Newcomb-Benford's Law.

Jaime Pinilla¹, Beatriz G López-Valcárcel¹, Christian González-Martel¹, Salvador Peiro².

Abstract

OBJECTIVE: Newcomb-Benford's Law (NBL) proposes a regular distribution for first digits, second digits and digit combinations applicable to many different naturally occurring sources of data. Testing deviations from NBL is used in many datasets as a screening tool for identifying data trustworthiness problems. This study aims to compare publicly available waiting list (WL) data from Finland and Spain for testing NBL as an instrument to flag up potential manipulation in WLs.
DESIGN: Analysis of the frequency of Finnish and Spanish WLs first digits to determine if their distribution is similar to the pattern documented by NBL. Deviations from the expected first digit frequency were analysed using Pearson's χ2, mean absolute deviation and Kuiper tests. SETTING/PARTICIPANTS: Publicly available WL data from Finland and Spain, two countries with universal health insurance and National Health Systems but characterised by different levels of transparency and good governance standards. MAIN OUTCOME MEASURES: Adjustment of the observed distribution of the numbers reported in Finnish and Spanish WL data to the expected distribution according to NBL.
RESULTS: WL data reported by the Finnish health system fits first digit NBL according to all statistical tests used (p=0.6519 in χ2 test). For Spanish data, this hypothesis was rejected in all tests (p<0.0001 in χ2 test).
CONCLUSIONS: Testing deviations from NBL distribution can be a useful tool to identify problems with WL data trustworthiness and to signal the need for further testing. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

Keywords: benford-newcomb distribution; fabricated data; waiting list data

Year: 2018 PMID: 29743333 PMCID: PMC5942457 DOI: 10.1136/bmjopen-2018-022079

Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692

Frequent contradictions and conflicts of interests occur between different actors actively involved in waiting list management. Statistical tools can be used to verify the compliance of waiting list data. The Newcomb-Benford’s Law seems to be a valid tool for screening the trustworthiness of public waiting list data and for signalling the need of further analysis. The method proposed in this article can be applied to other healthcare data.

Introduction

Waiting lists (WLs) have been recognised as an inescapable side effect of public National Health Services1 and recently the performance of health systems has been evaluated by comparing WL indicators.2 Healthcare systems face extreme difficulties when managing WLs due to inherent issues related to their origin and maintenance. Reaching conclusions on the importance of WLs from their volume is not an easy task because reliable data are often lacking or hidden, operational definitions are not standard and change frequently, and information on the severity of a patient’s condition is barely included.2 Administrative updates and the standardisation of WLs, although necessary, are often associated with a ‘successful’ decrease in WLs size, which is not necessarily associated with productivity improvements or better management, especially when this decrease is achieved without ensuring that the patient no longer requires the intervention he/she was waiting for.3 In countries with publicly funded National Health Systems, large WLs erode the confidence of citizens in the health system, its leaders and its professionals, and WLs are commonly used as an element of political confrontation. Because WLs have become a relevant issue, both from a social and political point of view, and the numbers describing a diminishing WL are taken as a noticeable signal of ‘success’ in healthcare policy, WL management may be more focused on offering appealing short-term ‘numbers’ than developing coherent long-term solutions consistent with the complex nature of the WL problem. So there is a potential temptation to play with these numbers, as documented in the past.4 5 Unlike Pinocchio, most liars do not provide telltale signs that they are being dishonest, so there is a need for methods to detect manipulation in WL data and to distinguish between accurate data and data that are false or which are omitting information. 
In 1938, a physicist called Benford rediscovered a remarkable empirical phenomenon7 originally described by Newcomb6: for an extensive collection of heterogeneous numerical data expressed in decimal form, the frequency of numbers which have d as the first significant digit, with d=1, 2,…,9, is not 1/9 (11.1%), as one might naively expect, but is approximately equal to $\log_{10}(1 + 1/d)$. According to what is today referred to as Newcomb-Benford’s Law (NBL; or Benford’s Law or First digit law), the probabilities that the leading digits will be 1, 2 or 3 account for >60% of the total probability distribution. Although this law does not apply to datasets of truly random numbers (eg, lottery), sequential numbers, assigned numbers (eg, zip codes) and numerical series with some restrictions, many real-world datasets do conform to an NBL first digit distribution. Knowing that the frequencies of the first significant digits should fall off in a particular way, suspicious data may be statistically tested against this empirical NBL distribution to evaluate their reliability,7 providing a solid basis for screening the trustworthiness of large amounts of data that might have been manipulated.
Over many years, the analysis of the digital frequency in datasets has emerged as a powerful tool for detecting data irregularities in tax audit—so-called forensic auditing—to detect financial accounting manipulation,8 9 and large auditing and consulting firms increasingly use this analysis in their fight against financial fraud.10 11 NBL even had a moment of fame with the finding that the Greek Government had been presenting ‘invented’ economic data to the European Union (EU).12 In the medical sector, NBL has been applied to clinical questionnaire data,13 to test falsification of interview data,14 to evaluate the performance of public health surveillance systems during epidemics,15 the accuracy of cancer incidence in cancer registries16 and to recognise fraud in scientific medical publications17 18 but, to the best of our knowledge and although administrative data on WLs are similar in nature to financial accounting, so far no one has used NBL as a ‘stress’ test for the reliability of WL data in publicly funded National Health Systems. The aim of this study is to compare public WL data corresponding to Finland and Spain, both countries with National Health Systems but characterised by different levels of transparency and good governance standards, for testing NBL as an instrument to screen for irregularities in WL data.

Methods

Study design and conceptual framework

According to NBL, for an extensive collection of heterogeneous numerical data expressed in decimal form, the frequency of numerical data with the first significant digit equal to 1 appeared to be about 30%, and equal to 1, 2 or 3, about 60%. For the second and further digits, NBL predicts a more uniform distribution.19 Our analysis examines Spanish and Finnish WL data, specifically the frequency of numbers appearing in first position, and compares it with the expected pattern following NBL distribution.
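The expected frequencies quoted above can be checked with a few lines of Python. This is a minimal illustrative sketch, not part of the original analysis (which used R); the function names are hypothetical. The first-digit formula is $\log_{10}(1 + 1/d)$, and the second-digit probabilities are obtained by summing the two-digit Benford probabilities over all possible first digits.

```python
import math

def first_digit_probs():
    """Expected Benford probability for each first significant digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def second_digit_probs():
    """Expected Benford probability for each second digit 0..9.

    Obtained by summing P(first two digits = 10*d1 + d2) over d1 = 1..9.
    """
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }

first = first_digit_probs()
second = second_digit_probs()

for d, p in first.items():
    print(f"first digit {d}: {p:.5f}")

# Share of values expected to start with 1, 2 or 3 (equals log10(4))
print(f"P(first digit in 1..3) = {first[1] + first[2] + first[3]:.5f}")

# The second-digit distribution is much flatter than the first-digit one
print(f"second digit range: {min(second.values()):.5f} to {max(second.values()):.5f}")
```

The first-digit probabilities fall from about 0.301 for digit 1 to about 0.046 for digit 9, whereas the second-digit probabilities vary only between roughly 0.085 and 0.120, which is the sense in which the distribution becomes "more uniform" for later digits.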

Setting

WL data for publicly funded elective treatments is not equally accessible and transparent in all countries. Finland first published data on WLs in 1993. In 1996, the Finnish government recognised by law patients’ rights to know WL times and the possibility of complaining in the event of not being satisfied with them, and establishing specific maximum times for accessing healthcare interventions in primary and secondary care. In 2005, the government introduced a new National Healthcare Guarantee system into Finnish law. Currently in Finland, the national monitoring of queues, waiting times and hospital productivity has been intensified and quality has improved during the past decade. This has given the supervisory and other bodies information to act on in order to live up to the Healthcare Guarantee.20 In December 2003, after a formal requirement by the Ombudsman,21 Spain started publishing homogeneous data collected nationwide focusing on the number of patients and the duration of WLs for surgical procedures. These data included information from only 14 different health services belonging to the 17 Spanish Autonomous Communities (ACs) with Regional Healthcare Services. Data published by the Ministry of Health did not include information on all 17 ACs until June 2012. Each AC has responsibilities in health planning, information systems and service delivery, but they do not always provide information on WLs to the Spanish Ministry of Health central authority. As a consequence, the available information can be considered as a proxy to the challenge posed by WLs, but not state of the art or definitive.22 Although most ACs do provide the necessary information to the Ministry, there is no agreement to make that disaggregated information publicly available. Some regional data are available on the websites of some ACs, but these data are not generally homogeneous.23

Data sources

We used publicly available data on WLs collected from the Ministry of Health, Social Policy and Equality in Spain24 and the National Institute for Health and Welfare in Finland,25 respectively. Both databases provide official information on the periodic evolution of WLs for surgery and for outpatient visits to medical and surgical specialties. ‘Patients waiting’ refers to the number of patients waiting for their first visit to an outpatient specialist or for an elective surgical intervention on the date of reference. In Spain, data are reported twice a year (June and December), while in Finland they are reported three times a year (April, August and December). The dataset consists of 764 valid observations for Spain (35 lists, 19–23 time periods) and 594 valid observations for Finland (38 lists, 16 time periods; 14 registers with 0 patients waiting were eliminated), covering December 2003 to December 2015 and December 2007 to December 2012, respectively. Data are accessible as an online supplementary appendix file.
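As a minimal sketch of how leading digits might be extracted from such patient counts, the Python snippet below tallies first digits and drops registers with 0 patients waiting, as described above. The function name and the sample counts are hypothetical and serve only as an illustration; they are not taken from the Finnish or Spanish data.

```python
from collections import Counter

def first_digit(count):
    """Return the first digit (1..9) of a positive integer count, or None for 0."""
    if count <= 0:
        return None  # registers with 0 patients waiting carry no leading digit
    return int(str(count)[0])

# Hypothetical 'patients waiting' values, for illustration only
sample_counts = [0, 12, 175, 9, 1043, 27, 0, 310]

digits = [first_digit(c) for c in sample_counts]
valid_digits = [d for d in digits if d is not None]

tally = Counter(valid_digits)
n_valid = len(valid_digits)

for d in range(1, 10):
    print(f"digit {d}: count={tally[d]}, observed frequency={tally[d] / n_valid:.3f}")
```

Working on the string form of an integer avoids the floating-point rounding issues that can arise when the leading digit is derived from logarithms.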

Statistical analysis

The data were automatically screened for leading digits and numbers. Extracted numbers were transferred to a comma-separated values spreadsheet file and the occurrence of each number was determined using the R package benford.analysis.26 We used three different statistical tests to determine whether the distribution of the first and higher order digits conformed to NBL: the Pearson’s χ2 test, the mean absolute deviation (MAD) test and the Kuiper test, with the null hypothesis that data would follow the Benford distribution. Pearson’s χ2 test is a natural candidate for testing whether an observed sample satisfies NBL. The test statistic for the first digit is defined as $\chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected absolute frequencies for digit i, respectively. Under H0, the statistic follows a χ2 distribution with 8 df. For specific digits, the standard normal statistic can be used to check whether the observed frequency deviates significantly from its theoretical value. Because the χ2 test is very sensitive to sample size, having enormous power with large N and low power for moderately small sample sizes, we also used the MAD test, which ignores sample size. The MAD statistic is calculated as $\mathrm{MAD} = \frac{1}{9} \sum_{i=1}^{9} \left| \frac{O_i}{N} - \frac{E_i}{N} \right|$, where N is the number of observations. A third alternative is the Kuiper test, a modification of the Kolmogorov-Smirnov test. The Kuiper test statistic is calculated as $V_N = (D_N^{+} + D_N^{-}) \left( \sqrt{N} + 0.155 + \frac{0.24}{\sqrt{N}} \right)$, where $D_N^{+} = \max_i \left[ F_O(i) - F_E(i) \right]$ and $D_N^{-} = \max_i \left[ F_E(i) - F_O(i) \right]$, and F(.) stands for cumulated relative frequencies.27 Finally, we tested whether NBL applies to our datasets using criteria suggested by Miller28 and Wallace,29 and we further repeated the NBL goodness-of-fit analysis using only data for the same point in time (December) for both Finnish and Spanish datasets (see online supplementary annex 1).
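The three statistics can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the benford.analysis code used by the authors: the function name is hypothetical, the MAD is computed on relative frequencies, and the Kuiper statistic uses the sample-size scaling factor $\sqrt{N} + 0.155 + 0.24/\sqrt{N}$, which is assumed here because it is consistent with the critical values of 1.75 and 2.00 quoted in the tables. The digit counts are those listed in Tables 1 and 2.

```python
import math

# Expected Benford first-digit probabilities for digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def benford_tests(counts):
    """Return (chi2, mad, kuiper) for a list of nine first-digit counts."""
    n = sum(counts)
    observed = [c / n for c in counts]

    # Pearson chi-square on absolute frequencies (8 df under H0)
    chi2 = sum((c - n * e) ** 2 / (n * e) for c, e in zip(counts, BENFORD))

    # Mean absolute deviation between observed and expected proportions
    mad = sum(abs(o - e) for o, e in zip(observed, BENFORD)) / 9

    # Kuiper: largest positive and negative gaps between cumulative frequencies
    cum_obs = cum_exp = 0.0
    d_plus = d_minus = 0.0
    for o, e in zip(observed, BENFORD):
        cum_obs += o
        cum_exp += e
        d_plus = max(d_plus, cum_obs - cum_exp)
        d_minus = max(d_minus, cum_exp - cum_obs)
    # Assumed scaling factor (see lead-in); not confirmed by the source text
    kuiper = (d_plus + d_minus) * (math.sqrt(n) + 0.155 + 0.24 / math.sqrt(n))

    return chi2, mad, kuiper

# First-digit counts from Table 1 (Finland) and Table 2 (Spain)
finland = [175, 106, 67, 64, 51, 41, 43, 25, 22]
spain = [312, 117, 47, 45, 50, 31, 55, 41, 66]

for name, counts in (("Finland", finland), ("Spain", spain)):
    chi2, mad, kuiper = benford_tests(counts)
    print(f"{name}: chi2={chi2:.4f}, MAD x100={100 * mad:.4f}, Kuiper={kuiper:.4f}")
```

With these counts the sketch should give values close to those reported in the tables (χ2 of about 5.96 for Finland and about 107.5 for Spain), provided the MAD is read as being expressed in percentage points. Comparing χ2 with the tabulated critical values (15.51 and 20.09 for 8 df) avoids the need for a statistics library to obtain p values.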

Patient involvement

This study uses publicly available data sources and did not include patients as study participants. No patients were involved in setting the research question, the study design or the overall conduct of the study. There are no plans to involve patients in the dissemination of study findings.

Results

Figure 1 compares the overall frequency distributions of the first significant digit for WL data in Finland and Spain against the expected NBL frequency distribution. On visual inspection, Finnish data seem to be satisfactorily adjusted to an NBL distribution, while Spanish data do not seem to satisfy the NBL distribution at all.

Figure 1

Theoretical (line) and observed distributions (columns) of first digit for Finnish and Spanish waiting list data.

Tables 1 and 2 show the results of the statistical fitness of the Finnish and Spanish datasets to Benford’s Law. These results are consistent with the previous graph, showing that for Finnish data it is not possible to reject the null hypothesis that they follow the Benford distribution. Nevertheless, for Spanish data, the null hypothesis was rejected in all tests.

Discussion

Our results show that the WL data from Finland follow the NBL, while the data from Spain do not, which raises suspicion about the trustworthiness of the Spanish data. This example illustrates how statistical tools can be used to verify the compliance of WL data with a regularity law applicable to many different naturally occurring sources of data. Frequent contradictions and conflicts of interest occur between different actors actively involved in WL management. The public and the media consider regularly published data on WLs as the quintessence of healthcare policy success or failure. Therefore, to respond to these expectations, data manipulation is a temptation for both policymakers and managers. The two countries chosen for the present illustration are perceived as remarkably different in their behaviour. Finland has consistently been ranked at the top of international rankings of transparency, good governance and social control of the political class, and is considered more likely than other countries to enforce penalties in the case of irregularities. Finland was ranked fourth in the world in the World Justice Project (WJP) Rule of Law Index 2015,30 and was evaluated as the most efficient country in producing public services of high quality at moderate cost.31 Moreover, it is well known internationally that the publicly organised Finnish healthcare service system has been a success story.32 Spain came in 17th position (out of 24 neighbouring countries) in the latest WJP Rule of Law Index. In 2015, it stood out for its high level of corruption,33 which seems to be worsening, surpassed only by Italy and Greece among the 15 EU countries before the Eastern enlargement. In Spain, the publication of WL data was introduced in 2002 by an Ombudsman mandate, which reported that data were sparse, broken and sometimes not very truthful.22 Fifteen years have gone by since then, but even now the Ministry of Health recognises the limitations and lack of rigour of such data.
Neither in the case of Spain nor in any other does failure to obey NBL necessarily provide evidence that WL data are inaccurate or have been manipulated. The NBL universal empirical distribution provides a tool to check data quality in the sense of data accuracy, which denotes the closeness of computations or estimates to the exact or true values. If real values following the NBL are replaced with fabricated numbers, the result is typically a deviation from NBL. The fabrication of numbers may not necessarily be an act of deliberate manipulation; even rounding up can cause a deviation from Benford’s Law. Thus, a deviation from the Benford distribution does not provide conclusive proof of manipulation, just as conformity does not prove the cleanliness of the data. Rather, non-conformity should be seen as a signal flagging up data that need closer inspection and further testing. Benford’s Law could thus be used in addition to existing control mechanisms as a first step in checking the possible manipulation of data. Among the limitations of this study, it should first be noted that we only analyse the first WL digit. Adding second-order digit and first-two digit combination tests is an essential part of a thorough forensic examination. This should be done as a separate test apart from the first-digit test. An exception is made when the dataset under consideration contains too few values, in which case only a first-digit test is performed. As suggested, an empirical threshold in this context is to avoid second-order digit and first-two digit combination tests for any dataset having fewer than 1000 records,34 as is our case. There is no formal statistical theory capable of giving significant threshold points for applicable sizes; rather, the above suggestions are subjective judgements derived from experience in dealing with datasets and forensic digital analysis.
Second, the published WLs might, as with some restricted series, not conform to the NBL distribution, but in the preliminary analyses carried out (see online supplementary annex 1) the two lists analysed seem to meet the requirements needed28 29 to follow this distribution. Third, datasets from Finland and Spain have some differences (size, time of collection) that could influence their distribution, although the analysis carried out using only the lists collected in December was consistent with the overall results (see online supplementary annex 1). Finally, it should be noted that both countries use (legitimate) administrative mechanisms to ‘clean’ WLs of deceased persons, people already operated on or who no longer want to have surgery, people who could not be reached for appointments or people out of coverage. The entries into, and the exits from, WLs could also have some seasonal variability. These factors clearly influence the number of people waiting, but should not influence the distribution of the first digits, nor the adjustment to the NBL distribution. In contrast, fabricated data will hardly conform to the NBL distribution.35 The method proposed in this article can be applied to other healthcare data, as long as control mechanisms or alarm signals for intensifying efforts to monitor and control the clinical and economic information of health centres are in place. Other areas where NBL testing would be interesting are reporting systems for adverse events, files on professional activity, operating theatre times, length of stay statistics and research datafiles (clinical trials, observational studies and analogous data).

Table 1

Test statistics for the first digits of Finnish data

Value	Count	Frequency observed	Frequency expected (Benford’s Law)	Diff. (MAD)	P values of Z-test for each digit
1	175	0.29461	0.30103	−0.00642	0.7544
2	106	0.17845	0.17609	0.00236	0.8717
3	67	0.11279	0.12494	−0.01214	0.4196
4	64	0.10774	0.09691	0.01083	0.3671
5	51	0.08586	0.07918	0.00668	0.5429
6	41	0.06902	0.06695	0.00208	0.8055
7	43	0.07239	0.05799	0.01440	0.1352
8	25	0.04209	0.05115	−0.00906	0.3521
9	22	0.03704	0.04576	−0.00872	0.3757
Total	594

Pearson’s χ2 test: 5.9584 (p=0.6519); mean test (absolute value): 0.8077; Kuiper test: 0.8338. All p values are non-significant at the 1% level.

The respective critical test values for the 5% and 1% significance levels are: Pearson’s χ2 test (8 df): 15.51 and 20.09; mean test: 1.96 and 2.58; Kuiper test: 1.75 and 2.00.

MAD, mean absolute deviation.

Table 2

Test statistics for the first digits of Spanish data

Value	Count	Frequency observed	Frequency expected (Benford’s Law)	Diff. (MAD)	P values of Z-test for each digit
1	312	0.40838	0.30103	0.10735	0.0000**
2	117	0.15314	0.17609	−0.02295	0.0966
3	47	0.06152	0.12494	−0.06342	0.0000**
4	45	0.05890	0.09691	−0.03801	0.0002**
5	50	0.06545	0.07918	−0.01374	0.1798
6	31	0.04058	0.06695	−0.02637	0.0023**
7	55	0.07199	0.05799	0.01400	0.1035
8	41	0.05366	0.05115	0.00251	0.7422
9	66	0.08639	0.04576	0.04063	0.0000**
Total	764

Pearson’s χ2 test: 107.511** (p<0.0001); mean test (absolute value): 3.6553**; Kuiper test (absolute value): 4.5732**. **Significant test value at the 1% level.

The respective critical test values for the 5% and 1% significance levels are: Pearson’s χ2 test (8 df): 15.51 and 20.09; mean test: 1.96 and 2.58; Kuiper test: 1.75 and 2.00.

MAD, mean absolute deviation.
