Viet-Thi Tran (1), Raphael Porcher (2), Viet-Chi Tran (3), Philippe Ravaud (4)
1. Department of General Medicine, Paris Diderot University, 16 Rue Henri Huchard, 75018 Paris, France; Centre de recherche en Epidémiologie et Statistiques (CRESS), INSERM U1153, Place du Parvis Notre Dame, 75004 Paris, France; Centre d'Épidémiologie Clinique, Hôpital Hôtel-Dieu, Assistance Publique-Hôpitaux de Paris, 1 Place du Parvis Notre Dame, 75004 Paris, France. Electronic address: thi.tran-viet@htd.aphp.fr.
2. Centre de recherche en Epidémiologie et Statistiques (CRESS), INSERM U1153, Place du Parvis Notre Dame, 75004 Paris, France; Centre d'Épidémiologie Clinique, Hôpital Hôtel-Dieu, Assistance Publique-Hôpitaux de Paris, 1 Place du Parvis Notre Dame, 75004 Paris, France; Paris Descartes University, 12 Rue de l'Ecole de Medecine, 75006 Paris, France.
3. Laboratoire Paul Painlevé-UMR CNRS 8524, Bâtiment M2, Cité Scientifique, 59655 Villeneuve-d'Ascq, France; Université des Sciences et Technologies de Lille, Cité Scientifique, 59650 Villeneuve-d'Ascq, France.
4. Centre de recherche en Epidémiologie et Statistiques (CRESS), INSERM U1153, Place du Parvis Notre Dame, 75004 Paris, France; Centre d'Épidémiologie Clinique, Hôpital Hôtel-Dieu, Assistance Publique-Hôpitaux de Paris, 1 Place du Parvis Notre Dame, 75004 Paris, France; Paris Descartes University, 12 Rue de l'Ecole de Medecine, 75006 Paris, France; Department of Epidemiology, Columbia University Mailman School of Public Health, 116th St & Broadway, New York, NY, USA.
Abstract
OBJECTIVE: Sample size in surveys with open-ended questions relies on the principle of data saturation. Determining the point of data saturation is complex because researchers have information on only what they have found. The decision to stop data collection is solely dictated by the judgment and experience of researchers. In this article, we present how mathematical modeling may be used to describe and extrapolate the accumulation of themes during a study to help researchers determine the point of data saturation.
STUDY DESIGN AND SETTING: The model considers a latent distribution of the probability of elicitation of all themes and infers the accumulation of themes as arising from a mixture of zero-truncated binomial distributions. We illustrate how the model could be used with data from a survey with open-ended questions on the burden of treatment involving 1,053 participants from 34 different countries and with various conditions. The performance of the model in predicting the number of themes to be found with the inclusion of new participants was investigated by Monte Carlo simulations. Then, we tested how the slope of the expected theme accumulation curve could be used as a stopping criterion for data collection in surveys with open-ended questions.
RESULTS: By doubling the sample size after the inclusion of initial samples of 25 to 200 participants, the model reliably predicted the number of themes to be found. Mean estimation error ranged from 3% to 1% with simulated data and was <2% with data from the study of the burden of treatment. Sequentially calculating the slope of the expected theme accumulation curve for every five new participants included was a feasible approach to balance the benefits of including these new participants in the study. In our simulations, a stopping criterion based on a value of 0.05 for this slope allowed for identifying 97.5% of the themes while limiting the inclusion of participants eliciting nothing new in the study.
CONCLUSION: Mathematical models adapted from ecological research can accurately predict the point of data saturation in surveys with open-ended questions.
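The slope-based stopping rule described in the abstract can be illustrated with a minimal sketch. It does not reproduce the authors' mixture of zero-truncated binomial distributions. As a stand-in, it uses the classical sample-based rarefaction formula from ecology, which gives the expected number of distinct themes in a random subsample of the participants already included, so it interpolates within the observed data rather than extrapolating beyond it. The function names are hypothetical, and reading the slope as "new themes per participant over the last five participants" is an assumption; only the values 0.05 and five come from the abstract.

```python
from math import comb


def expected_themes(theme_counts, n, m):
    """Expected number of distinct themes in a random subsample of m
    of the n participants (sample-based rarefaction).

    theme_counts[i] is the number of participants, out of n, who
    mentioned theme i at least once.
    """
    total = comb(n, m)
    # A theme mentioned by k participants is missed only if all m sampled
    # participants come from the n - k who did not mention it.
    return sum(1 - comb(n - k, m) / total for k in theme_counts)


def slope_per_participant(theme_counts, n, step=5):
    """Average number of new themes per participant over the last
    `step` participants of the rarefaction curve (assumed definition)."""
    gained = expected_themes(theme_counts, n, n) - expected_themes(theme_counts, n, n - step)
    return gained / step


def should_stop(theme_counts, n, threshold=0.05, step=5):
    """True when the slope falls below the threshold (0.05 in the abstract)."""
    return slope_per_participant(theme_counts, n, step) < threshold


# Hypothetical data: 10 participants, 5 themes, two of them mentioned once.
counts = [10, 10, 5, 1, 1]
print(slope_per_participant(counts, n=10))
print(should_stop(counts, n=10))
```

With many themes mentioned by a single participant, the curve is still rising and the rule keeps data collection going; once every theme has been mentioned by several participants, the slope approaches zero and the rule signals saturation.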