PURPOSE: In the UK, primary care databases include repeated measurements of health indicators at the individual level. As these databases encompass a large population, some individuals have extreme values, but some values may also be recorded incorrectly. The challenge for researchers is to distinguish between records that are due to incorrect recording and those which represent true but extreme values. This study evaluated different methods to identify outliers.

METHODS: Ten percent of practices were selected at random to evaluate the recording of 513,367 height measurements. Population-level outliers were identified using boundaries derived from Health Survey for England data. Individual-level outliers were identified by fitting a random-effects model with subject-specific slopes for height measurements, adjusted for age and sex. Any height measurement with a patient-level standardised residual more extreme than ±10 was identified as an outlier and excluded. The model was subsequently refitted twice, with outliers removed at each stage. This method was compared with existing methods of removing outliers.

RESULTS: Most outliers were identified at the population level using the boundaries derived from Health Survey for England data (1550 of 1643). Once these were removed from the database, fitting the random-effects model to the remaining data identified only 75 further outliers. This method was more efficient at identifying true outliers than existing methods.

CONCLUSIONS: We propose a new, two-stage approach to identifying outliers in longitudinal data and show that it can successfully identify outliers at both the population and individual level.
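The two-stage procedure described in the METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the record layout (`(patient_id, height_cm)` tuples) and the example boundaries are hypothetical, since the abstract does not state the Health Survey for England limits. More importantly, the second stage swaps in a much simpler stand-in for the random-effects model: each patient's median height serves as the fitted value and a pooled robust scale (1.4826 × MAD) as the residual standard deviation, with no adjustment for age or sex. Only the overall structure (population boundaries first, then repeated fitting with a ±10 standardised-residual cut-off) follows the abstract.

```python
from collections import defaultdict
from statistics import median


def population_filter(records, lower_cm, upper_cm):
    """Stage 1: flag heights outside population-level boundaries.

    The boundaries are supplied by the caller; the values used in the
    study are not given in the abstract.
    """
    kept = [r for r in records if lower_cm <= r[1] <= upper_cm]
    flagged = [r for r in records if not (lower_cm <= r[1] <= upper_cm)]
    return kept, flagged


def individual_filter(records, threshold=10.0, passes=3):
    """Stage 2 (simplified stand-in, NOT a random-effects model).

    Residual = height minus the patient's own median height.
    Residuals are standardised by a pooled robust scale (1.4826 * MAD).
    passes=3 mirrors an initial fit followed by two refits.
    """
    kept, flagged = list(records), []
    for _ in range(passes):
        by_patient = defaultdict(list)
        for pid, height in kept:
            by_patient[pid].append(height)
        centre = {pid: median(hs) for pid, hs in by_patient.items()}

        residuals = [height - centre[pid] for pid, height in kept]
        if not residuals:
            break
        mid = median(residuals)
        scale = 1.4826 * median([abs(r - mid) for r in residuals])
        if scale == 0:
            break  # no spread left to standardise against

        new_kept, new_flagged = [], []
        for record, resid in zip(kept, residuals):
            if abs(resid / scale) > threshold:
                new_flagged.append(record)
            else:
                new_kept.append(record)
        if not new_flagged:
            break  # nothing removed, so a refit would change nothing
        kept = new_kept
        flagged.extend(new_flagged)
    return kept, flagged


if __name__ == "__main__":
    # Illustrative data only: one implausible value and one value that is
    # plausible for the population but inconsistent for that patient.
    data = [
        ("A", 170.0), ("A", 170.5), ("A", 169.5), ("A", 171.0), ("A", 17.0),
        ("B", 160.0), ("B", 160.5), ("B", 159.5), ("B", 160.0), ("B", 185.0),
        ("C", 180.0), ("C", 180.5), ("C", 179.5),
    ]
    # 100 and 220 cm are placeholder limits, not the study's boundaries.
    stage1_kept, population_outliers = population_filter(data, 100.0, 220.0)
    stage2_kept, individual_outliers = individual_filter(stage1_kept)
    print("population-level outliers:", population_outliers)
    print("individual-level outliers:", individual_outliers)
```

Running the population filter first matters in this sketch for the same reason the abstract implies: grossly implausible values would otherwise inflate the pooled scale and mask outliers that are only extreme relative to a patient's own trajectory.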