Christian-Alexander Dudek, Henning Dannheim, Dietmar Schomburg.
Abstract
The prediction of gene functions is crucial for many areas of the life sciences. Faster high-throughput sequencing techniques generate more and larger datasets, and manual annotation by classical wet-lab experiments cannot keep up with these amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used to predict the enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but also pose a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation so that it can process larger datasets on an acceptable timescale. The primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as a reliable source for function prediction of enzymes observed at the protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and Swiss-Prot. This allows us to restrict the selection of Swiss-Prot entries without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have increased complexity, which raises the chance of matching more sequences without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as a download. The database can be downloaded and used with the BrEPScmd command line tool for large-scale sequence analysis.
The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de.
Year: 2017 PMID: 28750104 PMCID: PMC5531587 DOI: 10.1371/journal.pone.0182216
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The BrEPS 2.0 workflow.
The protocol consists of six steps to generate the BrEPS database. A: Selection and preparation of sequences. B: All-vs-all BLAST of sequences. C: Complete linkage clustering based on the E-value from BLAST. D: Multiple sequence alignment and pattern creation on selected nodes. E: Pattern verification. F: Preparation of the final database.
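Step C can be illustrated with a minimal sketch of complete linkage clustering. The source does not say how E-values are turned into distances or how nodes are selected, so the `dist` mapping, the `threshold` cut-off and the function name `complete_linkage` are assumptions made for illustration only; pairs without a BLAST hit are treated here as infinitely distant.

```python
from itertools import combinations


def complete_linkage(items, dist, threshold):
    """Merge clusters while the largest member-to-member distance stays <= threshold.

    dist maps frozenset({a, b}) to a distance (assumed here: the BLAST E-value).
    """
    clusters = [frozenset([item]) for item in items]

    def linkage(a, b):
        # Complete linkage: the distance of two clusters is the maximum
        # distance between any pair of their members.
        return max(dist.get(frozenset((x, y)), float("inf")) for x in a for y in b)

    while len(clusters) > 1:
        # Pick the closest pair of clusters under the complete-linkage distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(a, b) > threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters
```

With three hypothetical sequences where only A and B are close, `complete_linkage("ABC", {frozenset("AB"): 1e-50, frozenset("AC"): 1e-5, frozenset("BC"): 1.0}, 1e-3)` keeps C separate, because joining it would pull in the weak B-C pair.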
Filters applied to UniProt protein entries to parse enzyme data from UniProt flatfiles.
| Filter | Field | Value | BrEPS | BrEPS 2.0 |
|---|---|---|---|---|
| Length | SQ | 100-7000 aa | yes | yes |
| Keywords | DE | not putative, hypothetical, fragment, probable, possible and potential | yes | yes |
| EC number | DE | present | yes | yes |
| Evidence | PE | 1 | no | yes |
| Publication | RX | present | no | yes |
*: Only for additional sequences from TrEMBL.
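The filters in the table above could be applied to a single UniProt flat-file entry roughly as follows. This is a sketch, not the BrEPS implementation: the function name `passes_filters` and the `breps2` switch are hypothetical, the keyword test is a plain substring match on the DE lines, and the TrEMBL-only footnote is not modelled because the table does not show which filter it belongs to.

```python
import re

# Keywords from the filter table that mark uncertain or incomplete annotations.
EXCLUDED_KEYWORDS = ("putative", "hypothetical", "fragment",
                     "probable", "possible", "potential")


def passes_filters(entry: str, breps2: bool = True) -> bool:
    """Return True if one UniProt flat-file entry passes the filters in the table."""
    lines = entry.splitlines()
    # Join all description (DE) lines and compare case-insensitively.
    de = " ".join(ln[5:] for ln in lines if ln.startswith("DE   ")).lower()

    # Length: the SQ header line reads "SQ   SEQUENCE   <n> AA; ...".
    sq = next((ln for ln in lines if ln.startswith("SQ   ")), "")
    match = re.search(r"SEQUENCE\s+(\d+)\s+AA", sq)
    if not match or not 100 <= int(match.group(1)) <= 7000:
        return False

    # Keywords: reject uncertain or fragmentary descriptions.
    if any(keyword in de for keyword in EXCLUDED_KEYWORDS):
        return False

    # EC number: must be present in the description.
    if not re.search(r"ec=\d", de):
        return False

    if breps2:
        # Evidence: PE level 1 (evidence at protein level).
        pe = next((ln for ln in lines if ln.startswith("PE   ")), "")
        if not pe.startswith("PE   1:"):
            return False
        # Publication: at least one RX cross-reference line.
        if not any(ln.startswith("RX   ") for ln in lines):
            return False
    return True
```

Calling the function with `breps2=False` applies only the three filters marked "yes" in the BrEPS column.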
Fig 2. Detailed overview of the new data selection.
Only Swiss-Prot sequences with evidence on protein level (A) are used as seed sequences to retrieve additional, non-redundant sequences from TrEMBL and Swiss-Prot using UniRef references with >= 50% sequence identity (B and C). These additional sequences get the corresponding Swiss-Prot annotation and are merged with the seed sequences into one database (D).
Similarity sets created for every amino acid pair to complement semi-conserved amino acid positions with other highly similar amino acids.
| Pair | Score | Similarity set | Pair | Score | Similarity set |
|---|---|---|---|---|---|
| EN | 0.9 | D | KQ | 1.5 | R |
| EK | 1.2 | Q | QR | 1.5 | K |
| MV | 1.6 | IL | IM | 2.5 | L |
| HR | 0.6 | KQ | AT | 0.6 | S |
| FW | 3.6 | Y | FI | 1.0 | LM |
| KN | 0.8 | E | NQ | 0.7 | DEHK |
| LV | 1.8 | I | FM | 1.6 | L |
| HK | 0.6 | NQR | DQ | 0.9 | E |
The scores are derived from Gonnet's PAM250 matrix [22].
Fig 3. Extended patterns example.
Semi-conserved pattern positions are extended with amino acids from PAM250-based similarity sets.
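As a rough illustration of the extension step, the sketch below widens every two-residue position of a pattern with its similarity set from the table above. The pattern representation (one string of allowed residues per position, `x` as a wildcard) and the function names are assumptions for this example, and only the pairs listed in the table are included.

```python
# Pairs and similarity sets copied from the table above (pair -> residues to add).
SIMILARITY_SETS = {
    "EN": "D", "KQ": "R", "EK": "Q", "QR": "K",
    "MV": "IL", "IM": "L", "HR": "KQ", "AT": "S",
    "FW": "Y", "FI": "LM", "KN": "E", "NQ": "DEHK",
    "LV": "I", "FM": "L", "HK": "NQR", "DQ": "E",
}


def extend_position(residues: str) -> str:
    """Add the similarity set of a residue pair; unknown pairs stay unchanged."""
    key = "".join(sorted(set(residues)))
    extra = SIMILARITY_SETS.get(key, "")
    return "".join(sorted(set(key) | set(extra)))


def extend_pattern(pattern: list) -> list:
    """Extend semi-conserved (two-residue) positions, keep all others as they are."""
    return [extend_position(pos) if len(set(pos)) == 2 else pos for pos in pattern]


print(extend_pattern(["G", "EN", "x", "MV"]))  # ['G', 'DEN', 'x', 'ILMV']
```

Conserved positions and wildcards pass through untouched, so the extended pattern accepts every sequence the original pattern accepts.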
Proposed EC number examples for patterns created from sequences with different EC numbers.
| # | EC numbers | Proposed EC | Reason |
|---|---|---|---|
| 1 | | 1.1.-.- | Consensus of first two digits |
| 2 | | | |
| 1 | 2.1.3.1 | 2.1.3.1 OR 6.4.1.1 | No consensus, both functions are possible |
| 2 | 6.4.1.1 | | |
| 1 | 6.4.1.2 & 6.3.4.14 | 6.4.1.2 AND 6.3.4.14 | Two functions for one enzyme |
| 1 | | 6.4.1.- | Consensus in only one EC number |
| 2 | | | |
| | 6.3.4.14 | | |
Multiple EC numbers for one enzyme are separated with “&”. Consensus parts of EC numbers are underlined.
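The simpler cases in the table can be reproduced with a short sketch. It is written from the examples alone, so the rules are an interpretation rather than the published algorithm, and the name `propose_ec` is hypothetical. The mixed case in the last example (a multi-function enzyme combined with other EC numbers) is deliberately left out.

```python
def propose_ec(ec_numbers):
    """Propose an EC label for a pattern from the EC numbers of its sequences."""
    unique = sorted({ec.replace(" ", "") for ec in ec_numbers})

    if len(unique) == 1:
        # One annotation only; "&" separates several functions of one enzyme.
        return " AND ".join(unique[0].split("&"))

    if any("&" in ec for ec in unique):
        raise NotImplementedError("mixed multi-function case is not covered by this sketch")

    # Collect the leading EC fields shared by all sequences.
    consensus = []
    for column in zip(*(ec.split(".") for ec in unique)):
        if len(set(column)) != 1:
            break
        consensus.append(column[0])

    if not consensus:
        # No shared first digit: every function remains possible.
        return " OR ".join(unique)
    return ".".join(consensus + ["-"] * (4 - len(consensus)))
```

For the hypothetical inputs `["1.1.1.1", "1.1.2.3"]` the function returns `1.1.-.-`, and for `["2.1.3.1", "6.4.1.1"]` it returns `2.1.3.1 OR 6.4.1.1`, matching the first two rows of the table.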
Evaluation results for BrEPS 1.0 and BrEPS 2.0 using UniProt release 2014_10.
| BrEPS 1.0 | TP | FP | N | PPV [%] | DR [%] |
|---|---|---|---|---|---|
| strict | 8840 | 1711 | 13426 | 83.78 | 44.00 |
| fuzzy | 9514 | 1037 | 13426 | 90.17 | 44.00 |
| in BrEPS | 5674 | 1107 | 0 | 83.67 | 100.00 |
| not in BrEPS | 3166 | 604 | 13426 | 83.98 | 21.92 |

| BrEPS 2.0 | TP | FP | N | PPV [%] | DR [%] |
|---|---|---|---|---|---|
| strict | 11126 | 1798 | 11053 | 86.09 | 53.90 |
| fuzzy | 11673 | 1251 | 11053 | 90.32 | 53.90 |
| extended | 101 | 29 | 0 | 77.69 | 100.00 |
| in BrEPS | 6766 | 1109 | 0 | 85.92 | 100.00 |
| not in BrEPS | 4360 | 689 | 11053 | 86.35 | 31.36 |
Evaluation results for BrEPS 2.0 using UniProt release 2017_01.
| | TP | FP | N | PPV [%] | DR [%] |
|---|---|---|---|---|---|
| strict | 11951 | 1740 | 10286 | 87.29 | 57.10 |
| fuzzy | 12444 | 1247 | 10286 | 90.89 | 57.10 |
| extended | 83 | 19 | 0 | 81.37 | 100.00 |
| in BrEPS | 7398 | 1097 | 0 | 87.09 | 100.00 |
| not in BrEPS | 4553 | 643 | 10286 | 87.63 | 33.56 |
Comparison of InterPro 63.0 with BrEPS 2017.1.
| | TP | FP | N | PPV [%] | DR [%] |
|---|---|---|---|---|---|
| InterPro (strict) | 9459 | 2121 | 12397 | 81.68 | 48.30 |
| InterPro (loose) | 9965 | 1615 | 12397 | 86.05 | 48.30 |
| BrEPS 2.0 (strict) | 11951 | 1740 | 10286 | 87.29 | 57.10 |
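The PPV and DR columns of the evaluation tables can be recomputed from TP, FP and N. The formulas below are inferred from the tabulated values rather than quoted from the source: N is read here as the number of sequences without a prediction, and DR as the share of sequences that received any prediction. The function names are illustrative.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value in percent: correct predictions among all predictions."""
    return 100.0 * tp / (tp + fp)


def dr(tp: int, fp: int, n: int) -> float:
    """Share of sequences with a prediction, in percent (assumed meaning of DR)."""
    return 100.0 * (tp + fp) / (tp + fp + n)


# BrEPS 2.0 (strict), UniProt release 2017_01
print(round(ppv(11951, 1740), 2))        # 87.29
print(round(dr(11951, 1740, 10286), 2))  # 57.1
```

Under this reading, strict and fuzzy rows share one DR value because TP + FP is the same in both; only the split between true and false positives changes.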