Bjørn P Pedersen1, Georgiana Ifrim2, Poul Liboriussen3, Kristian B Axelsen4, Michael G Palmgren5, Poul Nissen1, Carsten Wiuf6, Christian N S Pedersen3.
Abstract
BACKGROUND: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. The current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences, and SLR seems well suited to this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test case that would generate important biological information as well as provide a proof of concept for the application of SLR to a large-scale bioinformatics problem.
Year: 2014 PMID: 24465495 PMCID: PMC3896382 DOI: 10.1371/journal.pone.0085139
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Overview of the canonical P-type ATPase classes.
| Class | Substrate out | Substrate in | Expected TMs (a,b) | Taxonomic coverage (b) |
| IA | unknown (c) | K+ | 7 | Prokaryotic |
| IB | HM (d) | unknown | 6–8 | All kingdoms |
| IIA | Ca2+ or Mn2+ | H+ | 10 | All kingdoms |
| IIB | Ca2+ | H+ | 10 | All kingdoms |
| IIC | Na+ or H+ | K+ | 10 | Metazoa |
| IID | Na+ or K+ | unknown | 10 | Fungi |
| IIIA | H+ | none (e) | 10 | All excl. Metazoa |
| IIIB | unknown | Mg2+ (f) | 10 | Prokaryotic |
| IV | unknown | PL (g) | 10 | Eukaryotic |
| VA | unknown | unknown | 12 | Eukaryotic |
| VB | unknown | unknown | 12 | Eukaryotic |
a) TM: transmembrane helices.
b) Expected from the literature. See main text for references.
c) Substrate has not been identified.
d) HM: heavy metals. Primarily Cu and Zn, but also Co, Cd, Ag and Pb.
e) Possibly no countertransport.
f) Transported with the electrochemical gradient.
g) PL: phospholipids.
Figure 1. Flowchart of the SLR classification approach.
Classification is based on 12 SLR classifiers (orange). The numbers in parentheses are the classification results obtained with the UniProtKB dataset, shown as an example. HM: heavy metals. PL: phospholipids.
Comparison of running time of SLR to HMM for task 1.
| Classifier | AUC | % TP | % FP | CPU Running time (a) |
| SLR | 99.61% | 99.1901% | 0.0000% | 19 min |
| HMM | 99.99% | 100.0000% | 0.3396% | 20 min + 150 min (b) |
a) CPU running time for complete training and classification of UniProtKB.
b) HMM running time is split into time for constructing a MSA and time for actual training/classification.
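The figures in this table can be related with a little arithmetic. The sketch below assumes that % TP is the true-positive rate TP/(TP+FN) and % FP is the false-positive rate FP/(FP+TN); the source does not spell out these definitions here. The function names and any counts passed to `rates` are hypothetical and are not data from the study. Only the running times are taken from the table.

```python
from typing import Tuple


def rates(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """Return (% TP, % FP) under the ASSUMED standard definitions:
    % TP = 100 * TP / (TP + FN), % FP = 100 * FP / (FP + TN)."""
    tp_pct = 100.0 * tp / (tp + fn)
    fp_pct = 100.0 * fp / (fp + tn)
    return tp_pct, fp_pct


# CPU running times in minutes, as reported in the table.
SLR_MINUTES = 19
HMM_PART_MINUTES = (20, 150)  # the two HMM components listed in the table


def hmm_total_minutes() -> int:
    """Total HMM running time: the sum of its two reported components."""
    return sum(HMM_PART_MINUTES)


def hmm_to_slr_ratio() -> float:
    """How many times longer the HMM pipeline ran than SLR."""
    return hmm_total_minutes() / SLR_MINUTES
```

With the reported times, the HMM pipeline totals 170 minutes, roughly nine times the 19 minutes reported for SLR.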
Breakdown of P-type ATPase classes found in the UniProtKB dataset.
| Domain/Kingdom | 0 | IA | IB | IIA | IIB | IIC | IID | IIIA | IIIB | IV | VA | VB | Total |
| Animal | 49 | 1 | 68 | 165 | 102 | 248 | 2 | 0 | 0 | 251 | 35 | 63 | 984 |
| Plant | 9 | 2 | 156 | 66 | 134 | 4 | 0 | 179 | 1 | 108 | 14 | 2 | 675 |
| Fungi | 22 | 3 | 166 | 204 | 94 | 18 | 62 | 124 | 5 | 255 | 58 | 63 | 1074 |
| One-celled Eukaryotes | 48 | 1 | 57 | 77 | 81 | 22 | 12 | 32 | 1 | 199 | 33 | 79 | 642 |
| Eukaryota | 128 | 7 | 447 | 512 | 411 | 292 | 76 | 335 | 7 | 813 | 140 | 207 | 3375 |
| Bacteria | 79 | 489 | 3914 | 1042 | 38 | 72 | 29 | 38 | 362 | 27 | 0 | 1 | 6091 |
| Archaea | 8 | 7 | 133 | 35 | 0 | 14 | 0 | 29 | 0 | 0 | 0 | 0 | 226 |
| Virus | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
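The rows of this table can be cross-checked mechanically. The sketch below copies the counts from the table and tests two things: that each row's class counts add up to its reported total, and that the Eukaryota row equals the column-wise sum of the four eukaryotic groups listed above it. The second relationship is inferred from the numbers rather than stated in the source, and all names in the code are illustrative.

```python
CLASSES = ["0", "IA", "IB", "IIA", "IIB", "IIC", "IID",
           "IIIA", "IIIB", "IV", "VA", "VB"]

# Each entry: (per-class counts in CLASSES order, reported total),
# copied from the table.
ROWS = {
    "Animal": ([49, 1, 68, 165, 102, 248, 2, 0, 0, 251, 35, 63], 984),
    "Plant": ([9, 2, 156, 66, 134, 4, 0, 179, 1, 108, 14, 2], 675),
    "Fungi": ([22, 3, 166, 204, 94, 18, 62, 124, 5, 255, 58, 63], 1074),
    "One-celled Eukaryotes": ([48, 1, 57, 77, 81, 22, 12, 32, 1, 199, 33, 79], 642),
    "Eukaryota": ([128, 7, 447, 512, 411, 292, 76, 335, 7, 813, 140, 207], 3375),
    "Bacteria": ([79, 489, 3914, 1042, 38, 72, 29, 38, 362, 27, 0, 1], 6091),
    "Archaea": ([8, 7, 133, 35, 0, 14, 0, 29, 0, 0, 0, 0], 226),
    "Virus": ([2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2),
}

# Assumption: these four rows are the subgroups summed in the Eukaryota row.
EUKARYOTE_GROUPS = ["Animal", "Plant", "Fungi", "One-celled Eukaryotes"]


def row_total_ok(name: str) -> bool:
    """True if the class counts of a row sum to its reported total."""
    counts, total = ROWS[name]
    return sum(counts) == total


def eukaryota_from_groups() -> list:
    """Column-wise sum of the four eukaryotic subgroup rows."""
    return [sum(ROWS[g][0][i] for g in EUKARYOTE_GROUPS)
            for i in range(len(CLASSES))]


if __name__ == "__main__":
    for name in ROWS:
        print(name, "row total consistent:", row_total_ok(name))
    print("Eukaryota equals sum of subgroups:",
          eukaryota_from_groups() == ROWS["Eukaryota"][0])
```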
Figure 2. Distribution of the total number of P-type ATPases (including isoforms) found in individual genomes in the Genome dataset.
Eukaryota (n = 70), Bacteria (n = 975) and Archaea (n = 78).
Figure 3. Membrane topology in P-type ATPases.
Top: Overview of the membrane topology found in P-type ATPases. Gray helices denote the 6 TM core-element found in all pumps (here numbered 1–6). P shows the cytosolic phosphorylation site containing the DKTGT motif.
Membrane Topology in P-type ATPases.
| Class | core only | N-core | core-C | N-core-C | core-1TM | Broken core | Total |
| 0 | 3 | 1 | 40 | 7 | - | 16 | 67 |
| IA | 63 | 3 | 4 | 22 | 312 | 10 | 414 |
| IB | 128 | 2121 | 105 | 12 | - | 476 | 2842 |
| IIA | 3 | 0 | 879 | 8 | - | 46 | 936 |
| IIB | 5 | 0 | 302 | 10 | - | 14 | 331 |
| IIC | 3 | 0 | 157 | 1 | - | 10 | 171 |
| IID | 0 | 0 | 49 | 1 | - | 4 | 54 |
| IIIA | 0 | 0 | 123 | 3 | 0 | 29 | 155 |
| IIIB | 3 | 3 | 230 | 7 | - | 22 | 265 |
| IV | 1 | 0 | 357 | 21 | - | 57 | 436 |
| VA | 0 | 0 | 10 | 35 | - | 6 | 51 |
| VB | 0 | 0 | 18 | 69 | - | 12 | 99 |
‘N’ and ‘C’ denote the N- and C-terminal elements, respectively. ‘Core-1TM’ denotes proteins with exactly one TM after the core (class IA only). ‘Broken core’ counts sequences with fewer than 6 TMs in the core, regardless of the total number of TMs.
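The column definitions above can be read as a small decision rule. The following is a minimal sketch, not the authors' procedure: it assumes that each sequence has already been split into TM counts before, within, and after the core, which is a hypothetical input format. It also assumes that ‘core-1TM’ applies only to class IA sequences with no N-terminal TMs. The function and parameter names are illustrative.

```python
def topology_category(n_before: int, n_core: int, n_after: int,
                      atpase_class: str = "") -> str:
    """Assign a topology column from TM counts (hypothetical inputs).

    n_before: TMs N-terminal to the core
    n_core:   TMs detected within the core (6 in an intact core)
    n_after:  TMs C-terminal to the core
    """
    # Fewer than 6 core TMs is 'Broken core', regardless of other TMs.
    if n_core < 6:
        return "Broken core"
    # Exactly one TM after the core, class IA only.
    # Assumption: no N-terminal TMs are present in this category.
    if atpase_class == "IA" and n_before == 0 and n_after == 1:
        return "core-1TM"
    if n_before > 0 and n_after > 0:
        return "N-core-C"
    if n_before > 0:
        return "N-core"
    if n_after > 0:
        return "core-C"
    return "core only"


if __name__ == "__main__":
    # Illustrative splits only; a 10-TM pump with 4 TMs after the core.
    print(topology_category(0, 6, 4, "IIA"))
    print(topology_category(0, 6, 1, "IA"))
```

Checking the broken-core condition first mirrors the footnote, which applies that label regardless of the total number of TMs.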