Paolo Sonego, Mircea Pacurar, Somdutta Dhir, Attila Kertész-Farkas, András Kocsor, Zoltán Gáspári, Jack A M Leunissen, Sándor Pongor.
Abstract
Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.
Year: 2006 PMID: 17142240 PMCID: PMC1669728 DOI: 10.1093/nar/gkl812
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Examples of records (benchmark tests) included in the collection
| Benchmark tests | Data | Classification tasks | Comparison methods |
|---|---|---|---|
| Classification of protein domains in SCOP [PCB0001, PCB00003, PDB0005] | 11 944 protein sequences or protein structures from SCOP95 | Superfamilies subdivided into families: 246 | BLAST, Smith–Waterman, Needleman–Wunsch, LA-kernel, PRIDE2 |
| | | Folds subdivided into superfamilies: 191 | |
| | | Classes subdivided into folds: 377 | |
| Classification of protein domains in CATH [PCB00007, PCB00009, PCB00011, PCB00013] | 11 373 protein sequences or protein structures from CATH | H groups subdivided into S groups: 165 | BLAST, Smith–Waterman, Needleman–Wunsch, LA-kernel, PRIDE2 |
| | | T groups subdivided into H groups: 199 | |
| | | A groups subdivided into T groups: 297 | |
| | | Classes subdivided into A groups: 33 | |
| Classification of phyla based on 3-phosphoglycerate kinase (3PGK) sequences [PCB00031, PCB00032] | 131 3PGK protein and DNA sequences | Groups of kingdoms (Archaea, Bacteria, Eucarya) subdivided into phyla: 10 | BLAST, Smith–Waterman, Needleman–Wunsch, LA-kernel, LZW, PPMZ |
| Functional annotation of unicellular eukaryotic sequences based on prokaryotic orthologs [PCB00031] | 17 973 sequences of prokaryotes and unicellular eukaryotes from the COG databases | Orthologous groups subdivided into prokaryotes and eukaryotes: 119 | BLAST, Smith–Waterman, Needleman–Wunsch, LA-kernel, LZW, PPMZ |
a. The collection contains a total of 6405 benchmark tests including a total of 3297 protein sequence classification tests, 3095 3D classification tests and 10 DNA (coding region) classification tests. The accession numbers of the records are given in square brackets.
b. See text for the references.
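A task such as "superfamilies subdivided into families" can be pictured as a four-way split of the dataset into positive/negative and training/test groups. The sketch below is only an illustration under stated assumptions: the `Domain` record, the `make_task` helper, the toy labels, and in particular the rule used to split the negatives (alternating whole superfamilies) are hypothetical and are not taken from the collection itself.

```python
from collections import namedtuple

# Hypothetical record type: one domain with its labels at two hierarchy levels.
Domain = namedtuple("Domain", ["id", "superfamily", "family"])


def make_task(domains, superfamily, held_out_family):
    """Split domains into +train, +test, -train and -test for one task.

    Positives are members of `superfamily`; the members of `held_out_family`
    form the positive test set, the rest the positive training set.
    """
    task = {"pos_train": [], "pos_test": [], "neg_train": [], "neg_test": []}

    # Assumption (illustrative only): negatives are split by whole
    # superfamily, every second one (in sorted order) going to the test set.
    others = sorted({d.superfamily for d in domains if d.superfamily != superfamily})
    neg_test_groups = set(others[1::2])

    for d in domains:
        if d.superfamily == superfamily:
            key = "pos_test" if d.family == held_out_family else "pos_train"
        else:
            key = "neg_test" if d.superfamily in neg_test_groups else "neg_train"
        task[key].append(d.id)
    return task


# Toy data with made-up identifiers.
domains = [
    Domain("d1", "sfA", "fam1"),
    Domain("d2", "sfA", "fam1"),
    Domain("d3", "sfA", "fam2"),
    Domain("d4", "sfB", "fam3"),
    Domain("d5", "sfC", "fam4"),
]
task = make_task(domains, "sfA", "fam2")
print(task)
```

Holding out a whole family means the test positives belong to a subgroup the classifier never saw during training, which is one way to read "subdivided into" in the table above.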
Figure 1. Details of a record in the database.
Figure 2. Cumulative results of benchmark test PCB00033. The underlying dataset is a small subset of SCOP comprising 55 classification tasks (8 all-α, 15 all-β, 30 α/β and 2 from other classes). The numbers are average AUC values, in the range [0, 1], obtained by receiver operating characteristic (ROC) analysis (18). The AUC approaches 1 for good classifiers and is close to 0.5 for random classification. The classification methods are 1NN, nearest neighbor (30); RF, random forest (16); SVM, support vector machines (14); ANN, artificial neural networks (15); and LogReg, logistic regression (17). The comparison methods are BLAST (8); SW, Smith–Waterman (9); NW, Needleman–Wunsch (10); LZW, Lempel–Ziv compression distance; and PPMZ, partial match compression distance (11). The Smith–Waterman algorithm performs better than the other comparison algorithms, especially when used in conjunction with SVM.
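To make the AUC figure concrete, the sketch below ranks test items by their distance to the nearest positive training item and computes the AUC as the fraction of correctly ordered positive/negative pairs. It is a minimal illustration, not the benchmark's own evaluation code: the function names, the toy distance matrix, and the choice of scoring rule are all assumptions made for this example.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair, so random scoring gives about 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def nearest_positive_scores(dist, train_pos, test):
    """Score each test item by minus its distance to the nearest positive
    training item (an assumed, simple nearest-neighbor style scoring)."""
    return [-min(dist[t][p] for p in train_pos) for t in test]


# Toy symmetric all-vs-all distance matrix for six items (made-up values).
dist = [
    [0.0, 0.2, 0.3, 0.7, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.6, 0.8, 1.0],
    [0.3, 0.4, 0.0, 0.5, 0.6, 0.9],
    [0.7, 0.6, 0.5, 0.0, 0.7, 0.8],
    [0.5, 0.8, 0.6, 0.7, 0.0, 0.4],
    [0.9, 1.0, 0.9, 0.8, 0.4, 0.0],
]
train_pos = [0, 1]        # positive training items
test = [2, 3, 4, 5]       # test items
test_labels = [1, 1, 0, 0]  # items 2 and 3 are positives, 4 and 5 negatives

scores = nearest_positive_scores(dist, train_pos, test)
auc_value = auc(scores, test_labels)
print(auc_value)
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. Averaging such values over all tasks of a benchmark test gives one number per combination of classifier and comparison method, which is the kind of summary the figure reports.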