N Palopoli, J A Iserte, L B Chemes, C Marino-Buslje, G Parisi, T J Gibson, N E Davey.
Abstract
Modern biology produces data at a staggering rate. Yet, much of this biological data is still isolated in the text, figures, tables and supplementary materials of articles. As a result, biological information created at great expense is significantly underutilised. The protein motif biology field does not have sufficient resources to curate the corpus of motif-related literature and, to date, only a fraction of the available articles have been curated. In this study, we develop a set of tools and a web resource, 'articles.ELM', to rapidly identify the motif literature articles pertinent to a researcher's interest. At the core of the resource is a manually curated set of about 8000 motif-related articles. These articles are automatically annotated with a range of relevant biological data allowing in-depth search functionality. Machine-learning article classification is used to group articles based on their similarity to manually curated motif classes in the Eukaryotic Linear Motif resource. Articles can also be manually classified within the resource. The 'articles.ELM' resource permits the rapid and accurate discovery of relevant motif articles, thereby improving the visibility of motif literature and simplifying the recovery of valuable biological insights sequestered within scientific articles. Consequently, this web resource removes a critical bottleneck in scientific productivity for the motif biology field. Database URL: http://slim.icr.ac.uk/articles/. © The authors 2020. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
Year: 2020 PMID: 32507889 PMCID: PMC7276420 DOI: 10.1093/database/baaa040
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Layout of the articles.ELM resource. (A) Scheme of the articles.ELM framework. (B) Segment of the articles.ELM search page for the search term ‘Tumor susceptibility gene 101’ showing the source of the match as the protein names in the article metadata rather than the article title or abstract. The article also indicates motif classes classified for the article and, if available, any motifs curated in ELM or classified in articles.ELM for the article. (C) Segment of the articles.ELM classify page from the classification of the article titled ‘Structure and functional interactions of the Tsg101 UEV domain’ (11). The output shows the LIG_PTAP_UEV_1 motif class assigned to the article with high confidence (as marked by the full yellow star) and the article abstract is annotated with the key terms for the LIG_PTAP_UEV_1 classification. The terms are coloured by their weight and correspond to the colouring of the logo in the classified page (see panel D). The bottom half of the classify page allows the article to be manually annotated by the user using a list of motif classes and motif groups. (D) Segment of the articles.ELM classified page showing the word cloud representation of the classifier built on the ELM UEV domain-binding PTAP motif (ELM: LIG_PTAP_UEV_1) related articles from the ELM resource. The binding protein (TSG101), binding domain (the UEV domain) and motif sequence/name (PTAP, PSAP or Late domain motif) are highlighted here to demonstrate the relevance of the key terms.
Figure 2. Benchmarking results for the sources of article protein annotation and classification. (A) The ability of each source of article protein annotation to correctly reannotate the proteins manually curated for motif articles by the ELM resource. Recall is the fraction of UniProt accessions annotated for the articles in the ELM resource that were returned. Precision is the fraction of UniProt accessions returned for the ELM resource by each source of article protein annotation that are correct. (B) The overlap of the correctly identified UniProt accessions between each article annotation resource. The denominator of the proportion relates to the row. (C) ROC curve for the 5-fold cross-validation of the ELM class annotation of the ELM training set. Scores are calculated for a single fold scoring set against classifiers trained with the four remaining folds as a training set. Data describe a binary classification pooling the negative classes for each class in the classifier. (D) The ability of the articles.ELM classifier to identify the correct ELM class in 10 manually curated, real-world motif article datasets. Recall and precision are shown, along with the ability of the classifier to recognize the curated motif class as the top-scoring class (Recall Top Ranked) and its ability to recognize a related, alternative motif (Recall Alternative). A detailed and up-to-date version of the data shown here is available at http://slim.icr.ac.uk/articles/benchmarking/.
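
The recall and precision definitions given for Figure 2A can be illustrated with a minimal sketch. This is not the authors' benchmarking code; the function name `precision_recall` and the accession strings are hypothetical placeholders, and the only assumption is that both the curated and the returned annotations are treated as sets of UniProt accessions.

```python
def precision_recall(curated, returned):
    """Return (precision, recall) for two collections of accessions.

    curated  -- accessions manually annotated for the articles (reference set)
    returned -- accessions produced by one annotation source
    """
    curated, returned = set(curated), set(returned)
    correct = curated & returned  # accessions that the source got right
    # Precision: fraction of returned accessions that are correct.
    precision = len(correct) / len(returned) if returned else 0.0
    # Recall: fraction of curated accessions that were returned.
    recall = len(correct) / len(curated) if curated else 0.0
    return precision, recall


# Placeholder accessions, not real benchmark data.
curated = {"ACC1", "ACC2", "ACC3"}
returned = {"ACC1", "ACC2", "ACC8", "ACC9"}
precision, recall = precision_recall(curated, returned)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the four returned accessions are correct and two of the three curated accessions are recovered. Returning 0.0 for an empty set is a convention chosen here to avoid division by zero.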