Jignesh R Parikh, Casie A Genetti, Asli Aykanat, Catherine A Brownstein, Klaus Schmitz-Abe, Morgan Danowski, Andrew Quitadomo, Jill A Madden, Calum Yacoubian, Richard Gain, Tessa Williams, Mary Meskell, Andrew Brown, Alison Frith, Shira Rockowitz, Piotr Sliz, Pankaj B Agrawal, Thomas Defay, Paul McDonagh, John Reynders, Sebastien Lefebvre, Alan H Beggs.
Abstract
Effective genetic diagnosis requires the correlation of genetic variant data with detailed phenotypic information. However, manual encoding of clinical data into machine-readable forms is laborious and subject to observer bias. Natural language processing (NLP) of electronic health records has great potential to enhance reproducibility at scale but suffers from idiosyncrasies in physician notes and other medical records. We developed methods to optimize NLP outputs for automated diagnosis. We filtered NLP-extracted Human Phenotype Ontology (HPO) terms to more closely resemble manually extracted terms and identified filter parameters across a three-dimensional space for optimal gene prioritization. We then developed a tiered pipeline that reduces manual effort by prioritizing smaller subsets of genes to consider for genetic diagnosis. Our filtering pipeline enabled NLP-based extraction of HPO terms to serve as a sufficient replacement for manual extraction in 92% of prospectively evaluated cases. In 75% of cases, the correct causal gene was ranked higher with our applied filters than without any filters. We describe a framework that can maximize the utility of NLP-based phenotype extraction for gene prioritization and diagnosis. The framework is implemented within a cloud-based modular architecture that can be deployed across health and research institutions.
Year: 2021 PMID: 34514437 PMCID: PMC8432593 DOI: 10.1016/j.xhgg.2021.100035
Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477
Figure 1. Study flow diagram
Schematic of the overall study design and analysis plan. For each patient in a training set of 52 patients, we employed uniform processes to collect Human Phenotype Ontology (HPO) terms extracted by natural language processing (NLP), manually extracted HPO terms, and exome sequencing (ES) data in the form of Variant Call Format (VCF) files. Manually extracted HPO terms were compared to NLP-extracted HPO terms per patient in the training set with respect to (1) frequency of use, (2) HPO term depth within the ontology, and (3) diversity of phenotypic abnormality classes captured, confirming significant differences across all three dimensions. Next, we established thresholds per dimension that were used to create filtered lists of NLP terms per patient. Exomiser was run on each of the filtered NLP term lists (in addition to the manual and unfiltered NLP lists for comparison) per patient, and performance per filter was evaluated using metrics such as area under the receiver operating characteristic curve (AUC) and sensitivity. Top-performing NLP filters were combined into a tiered pipeline, which was then applied to and evaluated on a subsequently ascertained test set of 12 patients, whose data were collected using the same uniform processes described above.
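The three filter dimensions described in this caption can be illustrated with a short sketch. This is not the authors' implementation: the `HpoTerm` record, the `filter_terms` function, and especially the direction of each threshold (dropping terms above a frequency percentile, dropping terms shallower than a minimum depth, and keeping only the most-represented phenotypic abnormality classes) are assumptions made for illustration, and the example terms are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HpoTerm:
    """Hypothetical record for one NLP-extracted term (field names are assumptions)."""
    hpo_id: str
    frequency_percentile: float  # how commonly the term is used, 0-100
    depth: int                   # distance from the ontology root
    abnormality_class: str       # top-level phenotypic abnormality class


def filter_terms(terms, max_frequency, min_depth, max_classes):
    """Apply the three thresholds in sequence (threshold directions are assumed)."""
    # Dimensions 1 and 2: drop very common terms and very shallow terms.
    kept = [
        t for t in terms
        if t.frequency_percentile <= max_frequency and t.depth >= min_depth
    ]
    # Dimension 3: keep only terms from the most-represented abnormality classes.
    class_counts = Counter(t.abnormality_class for t in kept)
    top_classes = {name for name, _ in class_counts.most_common(max_classes)}
    return [t for t in kept if t.abnormality_class in top_classes]


# Invented example terms; the IDs are placeholders, not real HPO identifiers.
example_terms = [
    HpoTerm("term-1", 95.0, 7, "nervous system"),
    HpoTerm("term-2", 50.0, 3, "nervous system"),
    HpoTerm("term-3", 40.0, 8, "musculature"),
    HpoTerm("term-4", 30.0, 6, "eye"),
    HpoTerm("term-5", 20.0, 7, "musculature"),
]

# A filter written as frequency/depth/diversity, e.g. 80/6/6.
filtered = filter_terms(example_terms, max_frequency=80, min_depth=6, max_classes=6)
```

Under these assumed semantics, a depth threshold of 0 disables the depth filter, and larger frequency or diversity values filter less aggressively.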
Figure 2. Performance of Exomiser using phenotypes extracted by manual curation versus natural language processing (NLP) among training set cases
(A) Receiver operating characteristic curves with sensitivities noted for specificities corresponding to the top 5, 10, and 20 ranked genes, respectively.
(B) Box-and-whisker plots of the distribution of the ranks of the correct genes. Each data point in a distribution corresponds to a specific patient, with lines connecting the ranks of each patient across the two phenotype extraction methods to indicate increase versus decrease in rank. The median and maximum (worst) ranks are also noted adjacent to the corresponding values in the distributions.
(C) Box-and-whisker plots of the distribution of the combined Exomiser scores for the correct gene per patient. Each data point in a distribution corresponds to a specific patient, with lines connecting the scores of each patient across the two phenotype extraction methods to indicate increase versus decrease in score. The median scores are noted adjacent to the median values in the distributions.
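The rank-based summaries in panels (A) and (B) can be computed from the rank of the correct gene in each case. The sketch below shows one way to do so; the function names and the example ranks are hypothetical and are not taken from the study data.

```python
from statistics import median


def sensitivity_at_k(correct_gene_ranks, k):
    """Fraction of cases whose correct gene is ranked within the top k genes."""
    if not correct_gene_ranks:
        raise ValueError("at least one case is required")
    hits = sum(1 for rank in correct_gene_ranks if rank <= k)
    return hits / len(correct_gene_ranks)


def summarize_ranks(correct_gene_ranks, cutoffs=(5, 10, 20)):
    """Median rank, worst (maximum) rank, and sensitivity at each cutoff."""
    return {
        "median_rank": median(correct_gene_ranks),
        "worst_rank": max(correct_gene_ranks),
        "sensitivity": {k: sensitivity_at_k(correct_gene_ranks, k) for k in cutoffs},
    }


# Invented ranks of the correct gene for five illustrative cases.
summary = summarize_ranks([1, 3, 8, 15, 40])
```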
Figure 3. Comparing features of HPO terms identified by NLP alone versus terms identified by both manual- and NLP-based extraction
Box-and-whisker plots of (A) distribution of mean frequency percentiles of HPO terms, (B) distribution of mean depth of HPO terms, and (C) distribution of diversity of HPO terms. Each data point in a distribution corresponds to a specific patient in the training set, with lines connecting values of the respective summary feature per patient across the two NLP term subsets to indicate increase versus decrease in value. Mean values per distribution, the difference in means, and associated p-values, calculated using a Wilcoxon signed-rank test, are noted above each plot.
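For readers unfamiliar with the paired test named in this caption, the following is a minimal pure-Python sketch of a two-sided Wilcoxon signed-rank test. The source does not say which software or which p-value method was used, so the choices here (dropping zero differences, average ranks for ties, a normal approximation with no continuity or tie correction) are assumptions, and `wilcoxon_signed_rank` is a hypothetical helper name.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples x and y.

    W is the smaller of the positive and negative rank sums; p is a two-sided
    p-value from the normal approximation (an assumed choice, see lead-in).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation to the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p
```

Each patient would contribute one pair, for example the mean term depth in the NLP-only subset and in the subset found by both methods. The normal approximation is less reliable for very small samples, where an exact test is preferable.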
Parameter combinations for the top-performing natural language processing (NLP) filters
| Performance metric | Frequency (%) | Depth | Diversity |
|---|---|---|---|
| AUC | 90 | 6 | 6 |
| Median rank | 80 | 6 | 6 |
| Median score | 90 | 0 | 12 |
| Genes needed | 60 | 4 | 10 |
| Sensitivity top 5 | 80 | 6 | 6 |
| Sensitivity top 10 | 90 | 6 | 4 |
| Sensitivity top 20 | 90 | 6 | 6 |
Sensitivity in prospectively analyzed test set cases comparing NLP filters from the pipeline versus using unfiltered NLP
| Pipeline step | Using pipeline NLP filters (n, cumulative %) | Using unfiltered NLP (n, cumulative %) |
|---|---|---|
| Step 1: top 5 genes (pipeline uses 80/6/6 filter) | 1 (8.33) | 1 (8.33) |
| Step 2: top 20 genes (pipeline uses 90/6/6 filter) | 8 (66.67) | 7 (58.33) |
| Step 3: top 50 genes (pipeline uses filter ensemble) | 11 (91.67) | 9 (75.00) |
| Step 4: all genes (pipeline uses manual phenotyping) | 12 (100) | 12 (100) |
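The tiered logic summarized in this table can be sketched as follows. The tier cutoffs (top 5, 20, and 50 genes, then manual phenotyping) come from the table; everything else is an assumption for illustration, including the function names, the idea that each case supplies one rank of the correct gene per tier, and how the filter ensemble produces its ranking, which is not described here.

```python
# Tier definitions from the table: (filter used, number of top genes reviewed).
TIERS = [("80/6/6 filter", 5), ("90/6/6 filter", 20), ("filter ensemble", 50)]


def first_resolving_step(ranks_by_tier):
    """Return the 1-based pipeline step at which the correct gene is found.

    ranks_by_tier holds the rank of the correct gene under each tier's filter,
    in tier order (None if the gene was not ranked). Cases not resolved by any
    NLP tier fall through to the final step, manual phenotyping over all genes.
    """
    for step, ((_label, top_k), rank) in enumerate(zip(TIERS, ranks_by_tier), start=1):
        if rank is not None and rank <= top_k:
            return step
    return len(TIERS) + 1


def cumulative_sensitivity(resolving_steps, n_steps=len(TIERS) + 1):
    """Rows of (step, cumulative cases resolved, cumulative percent)."""
    total = len(resolving_steps)
    rows = []
    for step in range(1, n_steps + 1):
        solved = sum(1 for s in resolving_steps if s <= step)
        rows.append((step, solved, round(100 * solved / total, 2)))
    return rows


# Resolving steps consistent with the pipeline column's cumulative counts
# (1, 8, 11, and 12 of 12 test cases).
steps = [1] + [2] * 7 + [3] * 3 + [4]
table_rows = cumulative_sensitivity(steps)
```

Reviewing the smallest gene list first means manual effort grows only for cases that the earlier, stricter tiers fail to resolve.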