Juan Miguel Cejuela1,2, Aleksandar Bojchevski1,2, Carsten Uhlig1, Rustem Bekmukhametov1,3, Sanjeev Kumar Karn1,4, Shpend Mahmuti1, Ashish Baghudana1,5, Ankit Dubey1,6, Venkata P Satagopam7, Burkhard Rost1,8.
Abstract
MOTIVATION: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. 'E6V'), leaving relevant mentions in natural language (NL) largely untapped (e.g. 'glutamic acid was substituted by valine at residue 6').
Year: 2017 PMID: 28200120 PMCID: PMC5870606 DOI: 10.1093/bioinformatics/btx083
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Classification of mutation mentions
| Class | Examples | MF | SETH | tmVar |
|---|---|---|---|---|
| ST | Q115P; Asp8Asn; 76A>T | yes | yes | yes |
| ST | c.925delA; g.3912G>C; rs206437 | no | yes | yes |
| ST | c.388 + 3insT | no | no | yes |
| ST | delPhe1388; F33fsins; IVS3(+1); D17S250; TP73Δex2/3 | no | no | no |
| SST | 3992-9g–>a mutation; codon 92, TAC–>TAT | no | no | yes |
| SST | Gly 18 to Lys; leucine for arginine 90 | yes | yes | no |
| SST | G643 to A; abrogated loss of Chr19 | no | no | no |
| NL | glycine to arginine substitution at codon 20 | yes | yes | no |
| NL | glycine was substituted by lysine at residue 18 | no | no | no |
| NL | deletion of 10 and 8 residues from the N- and C-terminals | no | no | no |
Note: Examples of mutation mentions of increasing level of complexity as found in the literature (ST: standard; SST: semi-standard; NL: natural language). The columns MF, SETH and tmVar indicate if the methods MutationFinder, SETH and tmVar, respectively, recognize the examples listed.
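To make the ST category concrete, the following is a minimal sketch of a pattern matcher for the simplest standard protein substitutions in the table (forms such as 'Q115P' or 'Asp8Asn'). The regular expression and the function name `is_simple_standard` are illustrative assumptions; they are not the rules used by MutationFinder, SETH, tmVar or nala, and they deliberately ignore DNA-level, rs-number and indel forms.

```python
import re

# One-letter and three-letter codes of the 20 standard amino acids.
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
         "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Hypothetical pattern: <wild type><position><mutant>, e.g. Q115P or Asp8Asn.
SIMPLE_ST = re.compile(
    rf"^(?:[{ONE}]\d+[{ONE}]|(?:{THREE})\d+(?:{THREE}))$"
)

def is_simple_standard(mention: str) -> bool:
    """Return True if the whole string looks like a simple ST substitution."""
    return SIMPLE_ST.match(mention) is not None
```

A pattern of this kind accepts 'E6V' but rejects 'glycine was substituted by lysine at residue 18', which is the gap that NL-aware methods aim to close.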
Fig. 1. nala method active learning process. Each blue box represents an iteration state of the nala method. The method and the iteration training sets are implemented in parallel. The previous iteration method (nala_t-1) is used to automatically annotate unseen documents. Selected documents with outstanding errors are reviewed manually and added to the iteration training set t. New features are evaluated in 5-fold cross validation and the method is retrained with all previous sets (nala_t). At the end, the sum of iteration training sets without IDP4 forms the nala_training corpus. The final nala method is trained on nala_training (only) and evaluated against the nala_known and nala_discoveries corpora.
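The iteration described in the Fig. 1 caption can be sketched as a simple loop. This is a toy illustration under stated assumptions: the "model" is just a vocabulary of previously seen mentions, manual review is simulated by a caller-supplied `review` function, and the 5-fold cross validation of new features and the exclusion of IDP4 are omitted. All names (`train`, `annotate`, `active_learning`, `review`) are hypothetical and do not come from the nala code base.

```python
Doc = str
Annotations = set[str]

def train(training_sets: list[dict[Doc, Annotations]]) -> Annotations:
    """Toy 'model': the vocabulary of all mentions seen in training."""
    vocab: Annotations = set()
    for iteration_set in training_sets:
        for mentions in iteration_set.values():
            vocab |= mentions
    return vocab

def annotate(model: Annotations, doc: Doc) -> Annotations:
    """Toy tagger: report every known mention that occurs in the document."""
    return {m for m in model if m in doc}

def active_learning(batches, review):
    training_sets = []
    model = train(training_sets)              # empty model before iteration 1
    for batch in batches:
        iteration_set = {}
        for doc in batch:
            predicted = annotate(model, doc)  # previous model pre-annotates
            gold = review(doc, predicted)     # manual review (simulated)
            if gold != predicted:             # keep only documents with errors
                iteration_set[doc] = gold
        training_sets.append(iteration_set)
        model = train(training_sets)          # retrain on all sets so far
    return model, training_sets
```

Only documents on which the previous model erred enter the next training set, which concentrates manual annotation effort on the cases the method does not yet handle.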
Fig. 2. Natural language (NL) mutation mentions important. What type of mutation mentions dominates annotated corpora that somehow sample the literature: standard (ST, e.g. E6V), semi-standard (SST), or natural language (NL)? Grayed out bars indicate counts with repetitions, full bars unique mentions (e.g. E6V occurring twice in the same paper is counted twice for the grayed out values and only once per paper for the others). The Variome, Variome120, IDP4 and nala_discoveries corpora assembled different representations of NL mentions. The dashed line separates corpora with papers describing well-known, well-indexed genes and proteins (left of dashed line: SETH, tmVar, Variome, Variome120, IDP4 and nala_known) and articles describing more recent discoveries that still have to be indexed in databases (right of dashed line: nala_discoveries). (Color version of this figure is available at Bioinformatics online.)
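The two counting schemes in the Fig. 2 caption (with repetitions versus unique per paper) can be illustrated with a short sketch. The function name `count_mentions` and the input layout (paper identifier mapped to a list of mention strings) are assumptions made for illustration.

```python
def count_mentions(papers: dict[str, list[str]]) -> tuple[int, int]:
    """Return (count with repetitions, count of unique mentions per paper)."""
    with_repetitions = sum(len(mentions) for mentions in papers.values())
    unique_per_paper = sum(len(set(mentions)) for mentions in papers.values())
    return with_repetitions, unique_per_paper
```

For example, a paper that mentions E6V twice and Q115P once contributes 3 to the first count and 2 to the second; the same mention in a different paper is counted again in both.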
Significance of NL mentions
| Corpus | IDP4 | IDP4 | Variome | Var.120 | nala_discoveries | nala_discoveries | nala_discoveries |
|---|---|---|---|---|---|---|---|
| Annotator* | (1) | (2) | | | (1) | (2) | (3) |
| Documents | 30% | 42% | 22% | 33% | 78% | 62% | 77% |
| Mentions | 14% | 19% | 6% | 40% | 52% | 39% | 49% |
Note: Percentages of documents (3rd row) or mentions (4th row) that contain at least one NL (natural language) or SST (semi-standard) mention for which no ST (standard) mention exists in the same text. *Two different annotators were compared for the corpus IDP4; three different annotators were compared for the corpus nala_discoveries.
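The document-level percentage described in the note can be sketched as follows. How an NL or SST mention is judged to have an ST counterpart is not stated here, so the sketch assumes each mention carries a hypothetical normalized variant key; the function name and the `(class, variant_key)` tuple layout are likewise illustrative.

```python
def share_of_documents_with_unmatched_nl(docs: list[list[tuple[str, str]]]) -> float:
    """Percentage of documents with an NL/SST mention lacking an ST counterpart.

    Each document is a list of (class, variant_key) pairs, where class is
    "ST", "SST" or "NL" and variant_key is an assumed normalized identity.
    """
    def has_unmatched(mentions: list[tuple[str, str]]) -> bool:
        st_variants = {key for cls, key in mentions if cls == "ST"}
        return any(cls in ("NL", "SST") and key not in st_variants
                   for cls, key in mentions)

    if not docs:
        return 0.0
    flagged = sum(has_unmatched(mentions) for mentions in docs)
    return 100.0 * flagged / len(docs)
```

A document that describes a variant only in natural language is flagged, whereas one that repeats the same variant in standard notation is not.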
Fig. 3. nala performed well for all corpora. The bars give two different results: values above the horizontal lines in bars reflect the F-measures for all mentions, while values below the horizontal lines in bars reflect the F-measures for the subset of NL-mentions in the corpus (high error bars indicate corpora with few NL mentions). The exception was the result for the method tmVar on the corpus tmVar_test, which was taken from the original publication of the method in which no result was reported for NL-only (Wei ). That publication reports only exact matching performance, i.e. its overlapping performance might be higher than shown here. nala consistently matched or outperformed other top-of-the-line methods in well-indexed corpora (SetsKnown; left of dashed line) and substantially improved over the status quo in recent non-indexed discoveries (nala_discoveries; right of dashed line). The F-measures of tmVar and SETH for NL-only on nala_discoveries were essentially zero (two rightmost bars). (Color version of this figure is available at Bioinformatics online.)
Previously indexed versus new discoveries
| Method | SetsKnown P | SetsKnown R | SetsKnown F ± StdErr | nala_discoveries P | nala_discoveries R | nala_discoveries F ± StdErr |
|---|---|---|---|---|---|---|
| | 87 | 92 | 89 ± 3 | 90 | 40 | 55 ± 7 |
| | 95 | 79 | 87 ± 3 | 93 | 26 | 41 ± 10 |
| | 97 | 74 | 83 ± 5 | 93 | 25 | 40 ± 10 |
Note: Precision (P), Recall (R) and F-Measure (F) for methods on corpora with previously indexed articles (SetsKnown: SETH, tmVar_test, Variome120, nala_known) and a corpus directly sampled from PubMed without index (nala_discoveries).
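As a quick check of how the three columns relate, the sketch below computes the F-measure from precision and recall, assuming the balanced F1 definition (the harmonic mean of P and R); the text here does not state which variant was used. The function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the first row of the table, P = 87 and R = 92 give roughly 89.4 and P = 90 and R = 40 give roughly 55.4, in line with the reported 89 and 55. Recomputing from rounded values can differ by a point from the tabulated F (for instance, 95 and 79 give about 86.3 against a reported 87), presumably because the published figures were derived from unrounded precision and recall.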
Fig. 4. nala could fully replace other methods. For each publication we considered all mentions correctly identified by one of the top three methods and kept only the findings unique in each publication. The y-axis plots the percentage of those mentions identified uniquely by one of the methods (All: all mentions, NL: NL-only mentions). For all corpora containing publications of genes and proteins indexed in the databases (SetsKnown), 1% of the mentions were detected only by tmVar and 12% only by nala, while SETH found no mention in this dataset that nala had not detected. Only nala correctly detected NL-only mentions in abstracts with new discoveries (100% bar on right triplet).
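The "detected only by one method" percentages in the Fig. 4 caption amount to a set difference over the mentions each method found correctly. The sketch below assumes each correct mention has a hypothetical identifier and that the denominator is the union of mentions found by any method; `unique_share` is an illustrative name.

```python
def unique_share(found: dict[str, set[str]]) -> dict[str, float]:
    """Percentage of all correctly found mentions detected by one method only."""
    all_found = set().union(*found.values())
    shares: dict[str, float] = {}
    for method, mentions in found.items():
        # Mentions found by any of the other methods.
        others = set().union(
            *(m for name, m in found.items() if name != method)
        )
        only_here = mentions - others
        shares[method] = 100.0 * len(only_here) / len(all_found) if all_found else 0.0
    return shares
```

A method whose share is zero contributes nothing that the remaining methods miss, which is the sense in which one method could replace another.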
Fig. 5. Word embedding (WE) features crucial for success. The inclusion of WE features (WE = on versus WE = off) substantially improved performance for both nala_known (texts previously indexed) and nala_discoveries (no previous indices). The increase in performance was highest for NL mentions, but for ST mentions it was also significant.