| Literature DB >> 31797587 |
Allen Nie, Arturo L Pineda, Matt W Wright, Hannah Wand, Bryan Wulf, Helio A Costa, Ronak Y Patel, Carlos D Bustamante, James Zou.
Abstract
As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetics data. A major rate-limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. What makes curation particularly time-consuming is that the curator needs to identify papers that study variant pathogenicity using different types of approaches and evidence, e.g. biochemical assays or case-control analysis. In collaboration with the Clinical Genome Resource (ClinGen), the flagship NIH program for clinical curation, we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by the specific evidence types curators use to assess pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evidence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain a 7.9%-12.6% relative performance improvement over models learned only on the annotated papers. It is a useful framework to improve clinical variant curation.
Entities:
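The abstract frames LitGen's core task as multi-label classification: for each paper, one binary decision per evidence type. The paper's model is a BiLSTM over text; the sketch below substitutes a toy bag-of-words perceptron (a hypothetical stand-in, with illustrative class names only) purely to show the shape of the prediction task.

```python
# Toy multi-label evidence-type classifier. NOT the paper's BiLSTM:
# a bag-of-words perceptron per evidence type, one binary decision each,
# just to illustrate the task structure.
from collections import defaultdict

EVIDENCE_TYPES = [
    "Experimental Studies", "Allele Data", "Segregation Data",
    "Specificity of Phenotype", "Case Control",
]

def tokenize(text):
    return text.lower().split()

class PerLabelPerceptron:
    def __init__(self, labels):
        # One weight vector (as a sparse dict) and bias per evidence type.
        self.weights = {lab: defaultdict(float) for lab in labels}
        self.bias = {lab: 0.0 for lab in labels}

    def score(self, label, tokens):
        w = self.weights[label]
        return self.bias[label] + sum(w[t] for t in tokens)

    def fit(self, data, epochs=10):
        # data: list of (paper_text, set_of_relevant_evidence_types)
        for _ in range(epochs):
            for text, gold in data:
                toks = tokenize(text)
                for lab in self.weights:
                    pred = self.score(lab, toks) > 0
                    truth = lab in gold
                    if pred != truth:  # standard perceptron update
                        delta = 1.0 if truth else -1.0
                        self.bias[lab] += delta
                        for t in toks:
                            self.weights[lab][t] += delta

    def predict(self, text):
        toks = tokenize(text)
        return {lab for lab in self.weights if self.score(lab, toks) > 0}
```

On a tiny separable toy corpus this converges in a few epochs; the real system must cope with papers that support several evidence types at once, which is why each type gets its own independent decision.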
Mesh:
Year: 2020 PMID: 31797587 PMCID: PMC7478937
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Fig. 1. Paper annotation workflow. From a paper on PubMed (left), the curator selects the subset of the five Variant Curation Interface (VCI) evidence types that the paper is relevant for (middle) and provides explanations for the selection (right). We highlight some keywords for emphasis. LitGen’s goal is to predict which evidence types are relevant given a paper.
Labeled data summary: number of papers and explanations by VCI evidence type.
| Evidence types in the VCI | ACMG criteria | # unique papers (train) | # unique papers (holdout) | # explanations (train) | # explanations (holdout) |
|---|---|---|---|---|---|
| Experimental Studies | BS3, PS3 | 385 | 74 | 732 | 80 |
| Allele Data | BP2, PM3 | 441 | 86 | 971 | 103 |
| Segregation Data | BS4, PP1 | 232 | 40 | 271 | 40 |
| Specificity of Phenotype | PP4 | 482 | 26 | 993 | 28 |
| Case Control | PS4 | 656 | 264 | 952 | 331 |
Training data were collected from Oct 2016 to Mar 2019; holdout evaluation data were collected from April 2019 to May 2019. Note that we do not allow the algorithm to use explanations at test time. We have 1543 labeled data points for training.
Fig. 5. We display the keywords most positively associated with each VCI evidence type in the human explanations, found by training a lasso model on unigram features. Coefficients refer to lasso coefficients.
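Explanation-derived keywords like those in Fig. 5 can be turned into proxy labels for unlabeled papers via keyword labeling functions combined by a vote. The sketch below is a loose illustration of that idea under simplifying assumptions: the keywords are invented (not taken from Fig. 5), and a plain majority vote stands in for Snorkel's learned label model.

```python
# Hedged sketch of explanation-guided proxy labeling. The keyword lists
# are illustrative, and majority voting is a simplification of the
# Snorkel label model used in the paper.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def keyword_lf(keywords):
    """Labeling function: vote POSITIVE if any keyword appears, else abstain."""
    def lf(text):
        t = text.lower()
        return POSITIVE if any(k in t for k in keywords) else ABSTAIN
    return lf

# One list of labeling functions per evidence type (illustrative keywords).
LFS = {
    "Segregation Data": [
        keyword_lf(["segregat", "pedigree"]),
        keyword_lf(["family", "affected relatives"]),
    ],
    "Case Control": [
        keyword_lf(["case-control", "odds ratio"]),
        keyword_lf(["cohort", "allele frequency"]),
    ],
}

def proxy_label(text, lfs):
    """Majority vote over non-abstaining labeling functions.

    Labeling functions may also vote NEGATIVE; with only positive-voting
    functions (as above) the result is POSITIVE or ABSTAIN.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if 2 * sum(votes) >= len(votes) else NEGATIVE
```

Papers that receive a non-abstain proxy label can then augment the labeled training set, which is the role the "Exp-guided" rows play in the tables below.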
Performance of different training strategies for the LitGen model, evaluated on the Apr 2019 to May 2019 holdout set.

| Strategy | Avg Accu | EM | Wgt |
|---|---|---|---|
| Baseline (Majority) | 62.9 | 8.7 | 36.0 |
| BiLSTM | 82.6 | 45.2 | 62.7 |
| BiLSTM + Naive Exp | 83.8 | 48.7 | 66.5 |
| BiLSTM + Naive Unlabeled | 83.9 | 50.1 | 65.7 |
| BiLSTM + Naive Exp + Naive Unlabeled | | | |
| BiLSTM + Exp-guided Snorkel | 84.0 | 50.1 | 66.8 |
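Our reading of the metric columns (the record itself does not define them): Avg Accu is per-evidence-type accuracy averaged over the five types, and EM is exact match, the fraction of papers for which all five types are predicted correctly. Wgt we take to be a support-weighted aggregate and do not reproduce. A minimal sketch under that assumption:

```python
# Assumed definitions of Avg Accu and EM for multi-label predictions.
# gold/pred: lists of sets of evidence types, one set per paper.

def avg_accuracy(gold, pred, labels):
    """Per-label binary accuracy, averaged over labels."""
    per_label = []
    for lab in labels:
        correct = sum((lab in g) == (lab in p) for g, p in zip(gold, pred))
        per_label.append(correct / len(gold))
    return sum(per_label) / len(labels)

def exact_match(gold, pred):
    """Fraction of papers whose full label set is predicted exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Under these definitions EM is the strictest metric, which is consistent with the much lower EM numbers in the tables.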
Accuracy of the baseline (always guessing the majority class), the BiLSTM, and the LitGen model for each evidence type.

| Evidence type | Baseline (Majority) | BiLSTM | LitGen |
|---|---|---|---|
| Experimental Studies | 63.1 | 85.6 | |
| Allele Data | 65.7 | 80.4 | |
| Segregation Data | 73.8 | 88.8 | 88.8 |
| Specificity of Phenotype | 66.0 | 87.0 | |
| Case Control | 45.8 | 71.2 | |
Evaluation of the quality of generated proxy labels on the holdout test set (Apr 2019 to May 2019).

| Labeling model | Avg Accu | EM | Wgt |
|---|---|---|---|
| Naive Unlabeled | 81.2 | 40.3 | 53.2 |
| Exp-guided Unlabeled | 82.8 | 46.1 | 60.0 |
| Exp-guided Snorkel | 11.5 | 2.6 | 42.3 |