Willie H Chang1,2, Pouria Mashouri1, Alexander X Lozano1,3,4, Brittney Johnstone1,5, Mia Husić1, Annie Olry6, Sylvie Maiella6, Tugce B Balci7, Sarah L Sawyer8, Peter N Robinson9,10, Ana Rath6, Michael Brudno11,12,13,14. 1. Centre for Computational Medicine, The Hospital For Sick Children, Toronto, ON, Canada. 2. Department of Computer Science, Princeton University, Princeton, NJ, USA. 3. Faculty of Medicine, University of Toronto, Toronto, ON, Canada. 4. Department of Materials Science & Engineering, Stanford University, Stanford, CA, USA. 5. Sunnybrook Health Sciences Centre, Toronto, ON, Canada. 6. Orphanet, Institut national de la santé et de la recherche médicale, Paris, France. 7. Medical Genetics Program of Southwestern Ontario, London Health Sciences Centre, London, ON, Canada. 8. Department of Genetics, Children's Hospital of Eastern Ontario and Children's Hospital of Eastern Ontario Research Institute, University of Ottawa, Ottawa, ON, Canada. 9. The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA. 10. Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA. 11. Centre for Computational Medicine, The Hospital For Sick Children, Toronto, ON, Canada. brudno@cs.toronto.edu. 12. Department of Computer Science, University of Toronto, Toronto, ON, Canada. brudno@cs.toronto.edu. 13. Genetics and Genome Biology Program, The Hospital for Sick Children, Toronto, ON, Canada. brudno@cs.toronto.edu. 14. University Health Network, Toronto, ON, Canada. brudno@cs.toronto.edu.
Abstract
PURPOSE: Computational documentation of genetic disorders is highly reliant on structured data for differential diagnosis, pathogenic variant identification, and patient matchmaking. However, most information on rare diseases (RDs) exists in freeform text, such as academic literature. To increase the availability of structured RD data, we developed a crowdsourcing approach for collecting phenotype information using student assignments. METHODS: We developed Phenotate, a web application for crowdsourcing disease phenotype annotations through assignments for undergraduate genetics students. Using student-collected data, we generated composite annotations for each disease through a machine learning approach. These annotations were compared with those from clinical practitioners and gold standard curated data. RESULTS: Deploying Phenotate in five undergraduate genetics courses, we collected annotations for 22 diseases. Student-sourced annotations showed strong similarity to gold standards, with F-measures ranging from 0.584 to 0.868. Furthermore, clinicians used Phenotate annotations to identify diseases with comparable accuracy to other annotation sources and gold standards. For six disorders, no gold standards were available, allowing us to create some of the first structured annotations for them, while students demonstrated the ability to research RDs. CONCLUSION: Phenotate enables crowdsourcing of RD phenotypic annotations through educational assignments. Presented as an intuitive web-based tool, it offers pedagogical benefits and augments the computable RD knowledgebase.
Keywords:
crowdsourcing; machine learning; medical education; phenotype; rare diseases