Paola Turina, Piero Fariselli, Emidio Capriotti.
Abstract
In recent years, the growing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. The collection of such data has been essential for the development and assessment of tools predicting the impact of protein variants at the functional and structural levels. Nevertheless, the manual curation of data from the literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on thermodynamic data extracted from the literature. In the past, such data were deposited in the ProTherm database, which, however, has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from the literature, we developed the semi-automatic tool ThermoScan, a text mining approach for identifying relevant thermodynamic data on protein stability in full-text articles. The method relies on a regular expression searching for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts. Availability: The ThermoScan server is freely accessible online at https://folding.biofold.org/thermoscan.
The ThermoScan Python code and the Google Chrome extension for submitting visualized PMC web pages to the ThermoScan server are available at https://github.com/biofold/ThermoScan.
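The regular-expression scoring described in the abstract can be sketched as follows. The patterns and the simple additive score below are illustrative assumptions only, not ThermoScan's actual expression or weights (those are in the GitHub repository):

```python
import re

# Illustrative patterns only: thermodynamic variables, units of measure,
# and conceptual words typical of protein stability studies.
THERMO_VARS = re.compile(r"ΔΔG|ΔG|ΔH|ΔCp|\bTm\b")
UNITS = re.compile(r"kcal/mol|kJ/mol|°C|\bK\b")
CONCEPTS = re.compile(r"\b(?:unfolding|denaturation|stability|folding)\b", re.I)

def paragraph_score(text: str) -> int:
    """Count co-occurring signals of thermodynamic stability data
    in a single paragraph (or table) of a full-text article."""
    return (len(THERMO_VARS.findall(text))
            + len(UNITS.findall(text))
            + len(CONCEPTS.findall(text)))
```

Each paragraph and table of an article would receive such a score, and the per-paragraph scores are then aggregated into a document-level score.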
Keywords: automated literature mining; document classification; protein stability; text mining; thermodynamic data
Year: 2021 PMID: 33842537 PMCID: PMC8027235 DOI: 10.3389/fmolb.2021.620475
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
Performance of ThermoScan in the optimization step. TH is the classification threshold obtained with the maximum (Max) and mean (Mean) paragraph/table scoring methods; the other performance measures are defined in Supplementary Materials.
| Score | TH | Q2 | TNR | NPV | TPR | PPV | MCC | F1 | AUC | AUPR |
|---|---|---|---|---|---|---|---|---|---|---|
| Max | 3.00 | 0.97 | 1.00 | 0.95 | 0.94 | 1.00 | 0.94 | 0.97 | 0.99 | 0.99 |
| Mean | 1.36 | 0.94 | 0.94 | 0.95 | 0.95 | 0.94 | 0.89 | 0.94 | 0.98 | 0.99 |
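Under the two scoring systems in the table, a full text is classified by aggregating its per-paragraph/table scores and comparing the result against the optimized threshold (TH). A minimal sketch, assuming a plain max/mean aggregation with the thresholds reported above:

```python
def document_score(scores, method="max"):
    """Aggregate per-paragraph/table scores into one document score."""
    return max(scores) if method == "max" else sum(scores) / len(scores)

# Classification thresholds from the optimization step:
# 3.00 for the Max score, 1.36 for the Mean score.
THRESHOLDS = {"max": 3.00, "mean": 1.36}

def has_thermo_data(scores, method="max"):
    """Classify a document as reporting protein stability data."""
    return document_score(scores, method) >= THRESHOLDS[method]
```

A single high-scoring paragraph is enough to flag a document under the Max score, while the Mean score requires the signal to be spread across the text.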
FIGURE 1 | Precision and Recall of ThermoScan at different classification thresholds. The plots show the performance based on the Max (A) and Mean (B) scores. The performance measures TPR (black) and PPV (red) are defined in Supplementary Materials. The shaded area represents the range between the minimum and maximum scoring values.
FIGURE 2 | Performance measures of ThermoScan based on the Max (red) and Mean (blue) scores. The plots show the AUC (Area Under the receiver operating characteristic Curve) (A) and the AUPR (Area Under the Precision-Recall curve) (B) for the two scoring systems. The shaded area represents the range between the minimum and maximum scoring values. The TPR, FPR, and PPV performance measures are defined in Supplementary Materials.
Performance of ThermoScan on the New-PSU and Snew-PSU datasets. The ThermoScan thresholds obtained in the optimization step with the maximum and mean paragraph/table scoring methods are 3.00 and 1.36, respectively. The performance measures are defined in Supplementary Materials.
| Score | Dataset | Q2 | TNR | NPV | TPR | PPV | MCC | F1 | AUC | AUPR |
|---|---|---|---|---|---|---|---|---|---|---|
| Max | New-PSU | 0.80 | 0.49 | 0.88 | 0.96 | 0.78 | 0.55 | 0.86 | 0.86 | 0.86 |
| Max | Snew-PSU | 0.91 | 0.75 | 0.88 | 0.96 | 0.92 | 0.76 | 0.94 | 0.96 | 0.94 |
| Mean | New-PSU | 0.80 | 0.59 | 0.77 | 0.91 | 0.81 | 0.53 | 0.85 | 0.83 | 0.82 |
| Mean | Snew-PSU | 0.89 | 0.83 | 0.75 | 0.91 | 0.94 | 0.71 | 0.92 | 0.92 | 0.91 |
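The measures in these tables (Q2, TNR, NPV, TPR, PPV, MCC, F1) are defined in the paper's Supplementary Materials; assuming they follow the standard binary-classification definitions, they can be computed from the confusion matrix counts as:

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion counts
    (assumed definitions; the paper's own definitions are in its
    Supplementary Materials)."""
    q2  = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    tpr = tp / (tp + fn)                    # sensitivity / recall
    tnr = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                    # precision
    npv = tn / (tn + fn)
    f1  = 2 * ppv * tpr / (ppv + tpr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Q2": q2, "TPR": tpr, "TNR": tnr,
            "PPV": ppv, "NPV": npv, "F1": f1, "MCC": mcc}
```

MCC is the most demanding of these measures on the imbalanced New-PSU and Snew-PSU sets, since it penalizes both false positives and false negatives.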
Comparison of the performance of ThermoScan (based on the maximum paragraph/table score) with BioReader and MedlineRanker on the New-PSU and Snew-PSU datasets. The classification thresholds for BioReader, MedlineRanker, and ThermoScan are 0.022, 0.027, and 3.00, respectively. The performance measures are defined in Supplementary Materials.
| Method | Dataset | Q2 | TNR | NPV | TPR | PPV | MCC | F1 | AUC | AUPR |
|---|---|---|---|---|---|---|---|---|---|---|
| BioReader | New-PSU | 0.66 | 0.59 | 0.50 | 0.70 | 0.76 | 0.28 | 0.73 | 0.64 | 0.72 |
| BioReader | Snew-PSU | 0.70 | 0.69 | 0.43 | 0.70 | 0.87 | 0.34 | 0.77 | 0.69 | 0.75 |
| MedlineRanker | New-PSU | 0.63 | 0.63 | 0.47 | 0.63 | 0.76 | 0.25 | 0.69 | 0.70 | 0.67 |
| MedlineRanker | Snew-PSU | 0.70 | 0.68 | 0.43 | 0.70 | 0.87 | 0.34 | 0.78 | 0.78 | 0.72 |
| ThermoScan | New-PSU | 0.80 | 0.49 | 0.88 | 0.96 | 0.78 | 0.55 | 0.86 | 0.86 | 0.86 |
| ThermoScan | Snew-PSU | 0.91 | 0.75 | 0.88 | 0.96 | 0.92 | 0.76 | 0.94 | 0.96 | 0.94 |