Yichi Zhang, Tianrun Cai, Sheng Yu, Kelly Cho, Chuan Hong, Jiehuan Sun, Jie Huang, Yuk-Lam Ho, Ashwin N Ananthakrishnan, Zongqi Xia, Stanley Y Shaw, Vivian Gainer, Victor Castro, Nicholas Link, Jacqueline Honerlaw, Sicong Huang, David Gagnon, Elizabeth W Karlson, Robert M Plenge, Peter Szolovits, Guergana Savova, Susanne Churchill, Christopher O'Donnell, Shawn N Murphy, J Michael Gaziano, Isaac Kohane, Tianxi Cai, Katherine P Liao.
Abstract
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 days if all data are available; however, the overall timing depends largely on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
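The three final products named in the abstract (an algorithm, a per-patient probability, and a yes/no classification) can be illustrated with a minimal sketch. This is not the PheCAP implementation: a plain L2-penalized logistic regression stands in for the machine learning step, and the two features (a diagnosis-code count and an NLP mention count), the toy chart-review labels, the 0.5 threshold, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def features(code_count, nlp_count):
    # Log-transform raw counts (an assumed, commonly used transformation).
    return [math.log1p(code_count), math.log1p(nlp_count)]

def fit_logistic(X, y, l2=0.01, lr=0.5, epochs=2000):
    # Batch gradient descent on the L2-penalized logistic loss.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
    return w, b

def predict_proba(X, w, b):
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def classify(probs, threshold=0.5):
    return ["yes" if p >= threshold else "no" for p in probs]

# Hypothetical chart-reviewed patients: (code count, NLP mentions, gold label).
labeled = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (2, 1, 0),
           (5, 4, 1), (8, 6, 1), (12, 9, 1), (3, 5, 1)]
X_train = [features(c, m) for c, m, _ in labeled]
y_train = [label for _, _, label in labeled]

# 1) The phenotype algorithm: fitted weights and intercept.
weights, intercept = fit_logistic(X_train, y_train)

# 2) Probability of the phenotype for every patient, labeled or not.
all_patients = [(10, 8), (0, 0), (4, 3)]
probs = predict_proba([features(c, m) for c, m in all_patients], weights, intercept)

# 3) Yes/no classification from a probability threshold.
calls = classify(probs)
```

In practice the threshold would be chosen against the chart-reviewed labels (for example, to reach a target specificity) rather than fixed at 0.5.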
Year: 2019 PMID: 31748751 PMCID: PMC7323894 DOI: 10.1038/s41596-019-0227-6
Source DB: PubMed Journal: Nat Protoc ISSN: 1750-2799 Impact factor: 13.491