Douglas Teodoro, Julien Knafou, Nona Naderi, Emilie Pasche, Julien Gobeill, Cecilia N Arighi, Patrick Ruch.
Abstract
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession.
Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/
Year: 2020 PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. UniProt Knowledgebase annotation process. Manually protein-annotated documents (Swiss-Prot) are associated with UniProt entry categories (Function, Name & Taxonomy, etc.) according to the type of information available in the publication, improving the organization of the annotations within the knowledge base. A much larger set of publications (TrEMBL) is then automatically annotated according to their source characteristics.
Figure 2. Synthetic annotation example illustrating how a single publication can be associated with different sets of UniProtKB entry categories.
Figure 3. Positive passages extracted from annotations in Figure 2. k-nearest (k = 0) sentences containing candidate evidence for the protein annotation are concatenated to create a ‘positive’ document. Similarly, sentences that do not contain evidence for the category are concatenated to create a ‘negative’ document (not shown).
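The positive/negative split described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_document`, the case-insensitive substring matching and the toy sentences are all assumptions made for the example.

```python
def split_document(sentences, mentions, k=0):
    """Split sentences into a 'positive' and a 'negative' document.

    A sentence is positive when it, or a sentence at most k positions
    away, contains one of the mention strings (assumed matching rule:
    case-insensitive substring search).
    """
    lowered = [s.lower() for s in sentences]
    terms = [m.lower() for m in mentions]
    hits = [i for i, s in enumerate(lowered) if any(t in s for t in terms)]

    positive_idx = set()
    for i in hits:
        lo = max(0, i - k)
        hi = min(len(sentences) - 1, i + k)
        positive_idx.update(range(lo, hi + 1))

    positive = " ".join(s for i, s in enumerate(sentences) if i in positive_idx)
    negative = " ".join(s for i, s in enumerate(sentences) if i not in positive_idx)
    return positive, negative


# Hypothetical sentences; "protein A" and "protein B" stand in for two accessions.
sentences = [
    "Protein A binds protein B in vitro.",
    "The complex localizes to the nucleus.",
    "Protein B is phosphorylated at Ser10.",
]
pos_a, neg_a = split_document(sentences, ["protein A"], k=0)
pos_b, neg_b = split_document(sentences, ["protein B"], k=0)
print(pos_a)  # Protein A binds protein B in vitro.
print(pos_b)  # Protein A binds protein B in vitro. Protein B is phosphorylated at Ser10.
```

The same publication yields a different positive and negative document for each accession, which is what lets one paper map to different category sets.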
Figure 4. Outline of the UPCLASS CNN-based classification architecture with an embedding layer, three CNN layers followed by two dense layers. The ‘positive’ sentences (accession in) are concatenated and fed to one branch of the model (‘in’ branch). The leftover sentences (accession out) are used to create the ‘negative’ document and fed to the other branch of the model (‘out’ branch).
Distribution of categories in the manually annotated training collection from UniProtKB
| Category | Examples | Unique accessions | Unique documents |
|---|---|---|---|
| Expression | 53 274 | 35 128 | 34 446 |
| Family & domains | 4910 | 3807 | 3310 |
| Function | 105 417 | 49 896 | 72 674 |
| Interaction | 60 252 | 28 318 | 30 646 |
| Names | 11 334 | 9130 | 1100 |
| Pathology & biotech | 39 870 | 23 573 | 32 410 |
| PTM/Processing | 69 080 | 31 142 | 17 335 |
| Sequences | 217 879 | 130 288 | 109 333 |
| Structure | 25 569 | 14 257 | 19 553 |
| Subcellular Location | 48 876 | 31 866 | 28 793 |
| Miscellaneous | 111 454 | 52 724 | 16 812 |
Examples: number of document–accession examples annotated with a category in the training collection. Unique accessions: number of unique accessions annotated with a category in the training set. Unique documents: number of unique publications annotated with a category in the training set.
Resulting sentences after passing through the pre-processing pipeline
| Original sentence | Pre-processed sentence |
|---|---|
| YddV from | yddv escherichia coli ec novel globin coupl heme base oxygen sensor protein display diguanyl cyclas activ respons oxygen avail |
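The example row suggests that the pipeline lowercases the text, removes stopwords and stems the remaining tokens; the exact steps are not listed here. The sketch below illustrates that kind of pipeline under those assumptions. The stopword list is a tiny hypothetical one and `crude_stem` is a naive suffix stripper standing in for a real stemmer, so its output will not match the table exactly.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "from", "in", "is", "of", "that", "the", "to"}
# Suffixes tried in order; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence):
    """Lowercase, tokenize on alphanumerics, drop stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return " ".join(crude_stem(t) for t in tokens if t not in STOPWORDS)


print(preprocess("The sensor proteins displayed binding activities in cells"))
# sensor protein display bind activiti cell
```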
Micro and macro average results for the not tagged and tagged models obtained from the test set of 58k records
| Model | Micro precision | Micro recall | Micro F1 | Macro precision | Macro recall | Macro F1 |
|---|---|---|---|---|---|---|
|  | 0.63 | 0.42 | 0.50 | 0.55 | 0.42 | 0.50 |
|  | 0.55 | 0.66 | 0.60 | 0.48 | 0.60 | 0.53 |
|  | 0.74 | 0.43 | 0.54 | 0.56 | 0.28 | 0.37 |
|  |  | 0.38 | 0.50 | 0.64 | 0.25 | 0.36 |
|  | 0.67 |  | 0.71 |  | 0.46 | 0.55 |
|  | 0.69 | 0.74 |  | 0.60 |  |  |
Highest results are shown in bold. Asterisk denotes statistically significant improvement
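For readers unfamiliar with the two averages, the sketch below shows one common way to compute them from per-category counts. It assumes macro scores are the unweighted mean of per-category scores, which may differ from the definition used in the paper; the counts are hypothetical and are not taken from the results above.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_macro(counts):
    """counts maps category -> (tp, fp, fn)."""
    # Micro: pool the counts over all categories, then score once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)
    # Macro: score each category, then take the unweighted mean.
    per_category = [prf(*c) for c in counts.values()]
    macro = tuple(sum(m[i] for m in per_category) / len(per_category) for i in range(3))
    return micro, macro


# Hypothetical counts: one frequent, well-predicted category and one rare, poorly predicted one.
counts = {"Sequences": (90, 10, 10), "Names": (1, 1, 9)}
micro, macro = micro_macro(counts)
print(round(micro[2], 2), round(macro[2], 2))  # 0.86 0.53
```

Because micro averaging pools counts, frequent categories dominate it, while macro averaging weights rare categories equally. This is why macro scores tend to be lower when rare categories are harder to predict.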
Micro and macro average results for the not tagged and tagged models obtained from the test set of unique document→categories pairs (around 26 k samples)
| Model | Micro precision | Micro recall | Micro F1 | Macro precision | Macro recall | Macro F1 |
|---|---|---|---|---|---|---|
|  | 0.56 | 0.68 | 0.61 | 0.45 |  | 0.50 |
|  | 0.59 | 0.66 | 0.62 | 0.46 | 0.53 | 0.49 |
|  | 0.73 | 0.43 | 0.54 |  | 0.25 | 0.36 |
|  |  | 0.45 | 0.56 | 0.61 | 0.27 | 0.38 |
|  | 0.64 |  |  | 0.54 | 0.54 |  |
|  | 0.65 | 0.66 | 0.66 | 0.55 | 0.49 | 0.52 |
Highest results are shown in bold. Asterisk denotes statistically significant improvement
Examples of prediction output for the CNN not tagged and tagged models
| PMID | Accession | Gold standard | CNN not tagged | CNN tagged |
|---|---|---|---|---|
| 11847227 | Q9BTW9 | Function | Function, pathology & biotech, sequences | Function |
| 11847227 | O75695 | Function, interaction, miscellaneous | Function, pathology & biotech, sequences | Function, interaction |
| 11847227 | Q15814 | Function, pathology & biotech | Function, pathology & biotech, sequences | Function |
| 11847227 | Q9Y2Y0 | Interaction | Function, pathology & biotech, sequences | Function, interaction |
| 11847227 | P36405 | Interaction, pathology & biotech | Function, pathology & biotech, sequences | Function, interaction |
| 15326186 | A7E3N7 | Expression | Expression, function | Expression, function |
| 15326186 | Q8NFA2 | Expression, function | Expression, function | Expression, function |
| 15326186 | Q672J9 | Expression, function, sequences | Expression, function | Expression, function |
| 15326186 | Q672K1 | Expression, sequences | Expression, function | Expression, function |
| 15326186 | Q8CJ00 | Function | Expression, function | Expression, function |
| 10427773 | Q9SAA2 | Expression | Expression, sequences | Expression, sequences |
| 10427773 | Q9SXJ6 | Expression, sequences | Expression, sequences | Expression, sequences |
| 10427773 | Q9S834 | Expression, sequences | Expression, sequences | Expression, sequences, subcellular location |
| 10427773 | Q9XJ36 | Expression, sequences | Expression, sequences | Expression, sequences |
| 10427773 | Q9SXJ7 | Expression, sequences, subcellular location | Expression, sequences | Expression, sequences |
| 10427773 | Q9XJ35 | Expression, sequences, subcellular location | Expression, sequences | Expression, subcellular location |
| 10427773 | P42762 | Expression, subcellular location | Expression, sequences | Expression, sequences |
| 10427773 | P62126 | Sequences | Expression, sequences | Expression, sequences |
Figure 5. Precision-recall curves for the UniProtKB categories obtained from the CNN tagged classification. Mean average precision is shown in parentheses. Black horizontal dashed line: performance of a random classifier.
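As a reminder of what the parenthesized values summarize, the sketch below computes average precision for one category from a ranked list of predictions. It uses a standard definition (mean of the precision values at each relevant rank), which is assumed rather than confirmed to be the one behind the figure; the scores and labels are hypothetical. The baseline shown assumes a random ranker, whose expected precision equals the share of positive examples.

```python
def average_precision(scores, labels):
    """Average of precision@rank over the ranks where a positive label occurs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical classifier scores and gold labels for a single category.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]

ap = average_precision(scores, labels)
random_baseline = sum(labels) / len(labels)  # expected precision of a random ranker
print(round(ap, 3), random_baseline)  # 0.833 0.5
```

Averaging this value over the categories gives a mean average precision.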
Figure 6. Classifier precision as a function of the publication size. There is no correlation between input size and precision. Circle size: number of publications within a size bin. Yellow points: high precision and lower ratio between publication size and the max publication size in the test set. Purple points: low precision and higher ratio between the publication size and the max publication size in the test set.
Figure 7. Micro average precision performance per organism. Precision is higher for organisms whose annotations are concentrated in few categories. Black dashed horizontal line: mean organism precision.