Morgan N Price, Adam P Arkin.
Abstract
Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST's database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.
IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins' functions.
Keywords: annotation; text mining
Year: 2017; PMID: 28845458; PMCID: PMC5557654; DOI: 10.1128/mSystems.00039-17
Source DB: PubMed; Journal: mSystems; ISSN: 2379-5077; Impact factor: 6.496
FIG 1 Example of PaperBLAST results. For each protein that is linked to the literature and is similar to the query protein, PaperBLAST shows a list of articles. For each article, PaperBLAST shows up to two snippets that mention the protein. a.a., amino acids.
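The idea of showing up to two snippets per article can be illustrated with a minimal Python sketch. This is not PaperBLAST's actual code: the function name `find_snippets`, the whole-token matching rule, the 60-character context window, and the identifier `ABC_0123` are all hypothetical choices made for illustration.

```python
import re


def find_snippets(text, identifier, max_snippets=2, context=60):
    """Return up to max_snippets excerpts of text around each mention of identifier.

    Hypothetical helper: the token-boundary rule and context width are assumptions.
    """
    # Require non-identifier characters on both sides, so that searching for
    # "ABC_0123" does not match inside the longer identifier "ABC_01234".
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        snippets.append(text[start:end].strip())
        if len(snippets) == max_snippets:
            break  # cap the number of snippets shown per article
    return snippets


# Made-up article text with a made-up locus tag.
article = (
    "Deletion of ABC_0123 reduced growth. ABC_0123 was induced by stress. "
    "ABC_01234 is a different gene. ABC_0123 appears again."
)
for snippet in find_snippets(article, "ABC_0123"):
    print(snippet)
```

Matching on token boundaries matters because gene and locus identifiers are often prefixes of one another; a plain substring search would attribute an article to the wrong protein.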
FIG 2 Coverage of PaperBLAST. (A) How often hypothetical proteins or other vaguely annotated proteins from different types of organisms have homologs in the PaperBLAST database with a BLAST score ratio above the given threshold. (B) How often vaguely annotated bacterial proteins have homologs in PaperBLAST, in the characterized subset of Swiss-Prot, or in any of the three curated databases that are included in PaperBLAST (the characterized subset of Swiss-Prot, GeneRIF, or EcoCyc). In both panels, only homologs with high-coverage alignments (at least 80%) were included.
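The two filters named in the FIG 2 caption can be sketched in Python. The caption does not define them, so this sketch makes three assumptions: the BLAST score ratio is the hit's bit score divided by the bit score of the query aligned to itself, coverage must reach 80% on both the query and the subject, and the default ratio threshold of 0.3 is an arbitrary illustrative value. The `Hit` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment; coordinates are 1-based and inclusive (hypothetical layout)."""
    subject_id: str
    bit_score: float
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    subject_length: int


def score_ratio(hit, self_score):
    """Assumed definition: hit bit score relative to the query's self-alignment score."""
    return hit.bit_score / self_score


def coverage(start, end, length):
    """Fraction of a sequence spanned by the aligned region."""
    return (end - start + 1) / length


def filter_hits(hits, query_length, self_score, min_ratio=0.3, min_coverage=0.8):
    """Keep hits that pass both the score-ratio and the coverage thresholds."""
    kept = []
    for hit in hits:
        q_cov = coverage(hit.query_start, hit.query_end, query_length)
        s_cov = coverage(hit.subject_start, hit.subject_end, hit.subject_length)
        # Assumption: the 80% threshold applies to the query and the subject alike.
        if score_ratio(hit, self_score) >= min_ratio and min(q_cov, s_cov) >= min_coverage:
            kept.append(hit)
    return kept
```

Normalizing by the self-score makes the threshold comparable across proteins of different lengths, which a raw bit-score cutoff would not be.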
Numbers of proteins and scientific articles and links between them in PaperBLAST’s database
| Source | No. of proteins | No. of articles | No. of links | No. of links |
|---|---|---|---|---|
| EuropePMC | 315,579 | 73,542 | 639,550 | 613,726 |
| Swiss-Prot | 79,388 | 27,453 | 38,342 | |
| GeneRIF | 77,836 | 662,069 | 1,038,801 | |
| EcoCyc | 3,923 | 11,143 | 22,769 | |
| Total | 400,961 | 748,450 | 1,721,795 | |
Proteins with different identifiers but with the same sequence are counted only once.
The count of proteins for Swiss-Prot includes some proteins that were linked to experimental evidence but that were not linked to articles about the protein’s function (see Materials and Methods).
The count of proteins for EcoCyc does not include proteins that are not linked to any scientific articles (even though these are included in PaperBLAST’s database).
The total is less than the sum of the parts due to overlap of data sources.
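The size of that overlap can be checked with a short Python calculation. It assumes the first three numeric columns are proteins, articles, and links, in the order the table title lists them; the variable names are illustrative.

```python
# Per-source counts copied from the table: (proteins, articles, links).
# The column meanings are an assumption based on the table title.
counts = {
    "EuropePMC": (315_579, 73_542, 639_550),
    "Swiss-Prot": (79_388, 27_453, 38_342),
    "GeneRIF": (77_836, 662_069, 1_038_801),
    "EcoCyc": (3_923, 11_143, 22_769),
}
reported_totals = (400_961, 748_450, 1_721_795)

# Sum each column across the four sources.
column_sums = tuple(sum(column) for column in zip(*counts.values()))

# Items counted by more than one source inflate the naive sum; the difference
# from the reported total is the number of such repeated counts.
overlap = tuple(s - t for s, t in zip(column_sums, reported_totals))

for name, s, t, o in zip(("proteins", "articles", "links"),
                         column_sums, reported_totals, overlap):
    print(f"{name}: sum of sources = {s:,}, reported total = {t:,}, overlap = {o:,}")
```

Under that column assignment, the per-source sums exceed the reported totals by 75,765 proteins, 25,757 articles, and 17,667 links.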