Min Zhao, Yanming Chen, Dacheng Qu, Hong Qu.
Abstract
The substrates of a transporter are useful not only for inferring the transporter's function but also for discovering compound-compound interactions and reconstructing metabolic pathways. Although a large amount of data has accumulated with the development of new technologies such as in vitro transporter assays, the search for transporter substrates is far from complete. In this article, we introduce METSP, a maximum-entropy classifier for retrieving transporter-substrate pairs (TSPs) from semistructured text. Based on high-quality annotation from UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences containing transporter and compound information. Finally, 1547 confident human TSPs are identified for further manual curation, of which 58.37% involve novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.
Year: 2015 PMID: 26495291 PMCID: PMC4606149 DOI: 10.1155/2015/254838
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Workflow of the design and function of METSP. Step I (highlighted in pink): explicit TSPs were manually collected from the UniProt, TCDB, and TransportDB databases. Step II (in blue): the UniProt annotation text of proteins in explicit TSPs and in a randomly selected protein set was processed to obtain positive and unlabeled sentence training sets. The maximum-entropy model was used to train and retrain the classifier. Step III (in green): the classifier was used to recognize TSPs from query protein annotation text. The new TSPs were obtained after further expert checking.
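Steps II and III can be illustrated with a minimal sketch, which is not the METSP implementation. It relies on the fact that a binary maximum-entropy model is equivalent to logistic regression, and it assumes bag-of-words features, plain stochastic gradient updates, and unlabeled sentences treated as negatives. The toy sentences, function names, and hyperparameters are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def featurize(sentence):
    """Bag-of-words features: the set of lowercase whitespace-separated tokens."""
    return set(sentence.lower().split())

def predict_proba(weights, sentence):
    """Estimated probability that a sentence describes a transporter-substrate pair."""
    score = weights.get("__bias__", 0.0)
    for feature in featurize(sentence):
        score += weights.get(feature, 0.0)
    return sigmoid(score)

def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit binary maximum-entropy weights with stochastic gradient updates.

    examples: list of (sentence, label); label 1 = positive, 0 = unlabeled
    (treated as negative here, which is an assumption of this sketch).
    """
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            error = label - predict_proba(weights, sentence)
            for feature in featurize(sentence) | {"__bias__"}:
                weights[feature] = weights.get(feature, 0.0) + learning_rate * error
    return weights

# Hypothetical annotation-like sentences, invented for illustration only.
TRAINING_SENTENCES = [
    ("transports glucose across the plasma membrane", 1),
    ("mediates the uptake of zinc", 1),
    ("sodium-dependent transporter of phosphate", 1),
    ("belongs to the major facilitator superfamily", 0),
    ("expressed in liver and kidney", 0),
    ("interacts with the cytoskeleton", 0),
]

model = train_maxent(TRAINING_SENTENCES)
query_score = predict_proba(model, "mediates the uptake of glucose")
background_score = predict_proba(model, "expressed in kidney")
```

In this sketch, sentences scoring above 0.5 would be passed on as candidate TSP sentences for expert checking, mirroring Step III.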
Summary of reliable TSPs from UniProt, TCDB, and TransportDB.
| | UniProt | TCDB | TransportDB | Sum from formula (∗) |
|---|---|---|---|---|
| TSPs | 35586 | 2641 | 86726 | 6955 |
| Transporters | 25056 | 1501 | 57070 | 5042 |
| Substrates | 528 | 229 | 351 | 275 |
Note: (∗): $R = \{\, r \mid r \in R_{\text{UniProt}} \cap R_{\text{TransportDB}} \cup R_{\text{TCDB}} \,\}$, where $R_{\text{UniProt}}$, $R_{\text{TCDB}}$, and $R_{\text{TransportDB}}$ refer to all TSPs collected from UniProt, TCDB, and TransportDB, respectively.
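The formula in the note does not state how the intersection and union are grouped. The sketch below, using hypothetical toy pairs rather than real database content, shows that the two possible readings can give different sets. It makes no claim about which reading was used to produce the table.

```python
# Hypothetical toy TSPs, each a (transporter, substrate) pair.
r_uniprot = {("T1", "glucose"), ("T2", "zinc"), ("T3", "phosphate")}
r_transportdb = {("T1", "glucose"), ("T4", "sodium")}
r_tcdb = {("T2", "zinc"), ("T5", "iron")}

def reading_a(uniprot, transportdb, tcdb):
    """(UniProt ∩ TransportDB) ∪ TCDB: intersection applied first."""
    return (uniprot & transportdb) | tcdb

def reading_b(uniprot, transportdb, tcdb):
    """UniProt ∩ (TransportDB ∪ TCDB): union applied first."""
    return uniprot & (transportdb | tcdb)

set_a = reading_a(r_uniprot, r_transportdb, r_tcdb)
set_b = reading_b(r_uniprot, r_transportdb, r_tcdb)
```

With these toy inputs, the first reading keeps every TCDB pair, while the second keeps only pairs that also appear in UniProt.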
Precision and recall of the ME classifier, and the number of “false” negative instances that were actually positive instances, over four iterations.
| | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
|---|---|---|---|---|
| Precision | 94.93% | 98.17% | 98.50% | 98.54% |
| Recall | 97.52% | 97.95% | 98.00% | 98.02% |
| FP ratio | 546/688 | 70/250 | 24/205 | 16/201 |
Note: FP ratio represents the number of “false” negative instances that were actually positive instances.
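As a small worked example, the FP ratio entries can be converted to fractions. The sketch assumes each entry reads as "instances found to be actually positive / instances flagged"; that interpretation is an assumption. The `precision` and `recall` helpers use the standard definitions, and the counts passed to them in the last two lines are hypothetical, not taken from the study.

```python
# FP ratio entries copied from the table above: iteration -> (numerator, denominator).
FP_RATIO = {1: (546, 688), 2: (70, 250), 3: (24, 205), 4: (16, 201)}

def relabel_fraction(numerator, denominator):
    """Fraction of flagged instances that turned out to be positive."""
    return numerator / denominator

def precision(true_pos, false_pos):
    """Standard precision: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Standard recall: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

fraction_by_iteration = {
    iteration: round(relabel_fraction(num, den), 4)
    for iteration, (num, den) in FP_RATIO.items()
}

# Hypothetical counts, only to show how the helpers are called.
example_precision = precision(true_pos=98, false_pos=2)
example_recall = recall(true_pos=98, false_neg=2)
```

Under that reading, the fraction falls from about 0.79 in iteration 1 to about 0.08 in iteration 4.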
Figure 2. Performance comparison of the ME and NB classifiers. ROC curves of the maximum-entropy classifier and the Naïve Bayes classifier on the original (a) and relabelled (b) datasets.
Figure 3. Comparison of TSP data. Human TSPs extracted by METSP are compared with those in three existing transporter-substrate databases (TCDB, TransportDB, and KEGG). Blue bars represent the number of TSPs extracted by METSP and in the three databases; red bars represent the number of TSPs present in the three databases but not extracted by METSP; green bars represent the number of TSPs extracted by METSP but not present in the three databases.
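The three bar types in the caption correspond to simple set operations between the extracted pairs and one database. The sketch below uses hypothetical toy pairs and assumes the blue bars are read as the overlap between the two sources. The function name and dictionary keys are illustrative only.

```python
def compare_tsps(extracted, database):
    """Count shared, database-only, and extracted-only TSPs for one database."""
    return {
        "shared": len(extracted & database),          # blue, under the assumed reading
        "database_only": len(database - extracted),   # red
        "extracted_only": len(extracted - database),  # green
    }

# Hypothetical toy pairs, not real METSP output or database content.
metsp_pairs = {("T1", "glucose"), ("T2", "zinc"), ("T6", "urea")}
database_pairs = {("T1", "glucose"), ("T7", "iron")}

counts = compare_tsps(metsp_pairs, database_pairs)
```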