
The SBASE domain sequence resource, release 12: prediction of protein domain-architecture using support vector machines.

Kristian Vlahovicek, László Kaján, Vilmos Agoston, Sándor Pongor.

Abstract

SBASE (http://www.icgeb.trieste.it/sbase) is an online resource designed to facilitate the detection of domain homologies based on sequence database search. The present release of the SBASE A library of protein domain sequences contains 972,397 protein sequence segments annotated by structure, function, ligand-binding or cellular topology, clustered into 8547 domain groups. SBASE B contains 169,916 domain sequences clustered into 2526 less well-characterized groups. Domain prediction is based on an evaluation of database search results in comparison with a 'similarity network' of inter-sequence similarity scores, using support vector machines trained on similarity search results of known domains.

Year: 2005 PMID: 15608182 PMCID: PMC540066 DOI: 10.1093/nar/gki112

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The SBASE project was initiated to develop a prediction scheme that automatically recognizes instances of known protein domains in newly determined sequences, using similarity search on a reference domain sequence database (1–3). One of the project's main motivations has been to solve the prediction problem without a man-made model or consensus description of domain sequence groups, in order to decrease maintenance costs while maintaining the generalization power of the prediction. The resulting system consists of a reference domain sequence database on the one hand and a related predictor program on the other, so the system's predictive power can be optimized by tuning both components in concert. SBASE 12.0 is a collection of 972 397 protein domain sequences. Each SBASE domain record contains a sequence assigned to one of the 8 547 functionally or structurally well-characterized groups (SBASE A), or to one of the less well-characterized 2 018 groups described in terms of amino acid composition or cellular localization (SBASE B). All domains are cross-referenced back to their parent protein databases [Swiss-Prot + TrEMBL (4), PIR (5)] and to entries in other domain repositories, such as INTERPRO (6), or its member databases such as Pfam (7), SMART (8) and PRINTS (9). Finding known domain types in new sequences includes two subtasks: (i) locating the potential domains, which in the SBASE system has been approached by analyzing the distribution of cumulative FASTA or BLAST similarity scores along the query sequence (10,11); and (ii) selecting/accepting the best candidate domains. The second task is a classification problem that was initially solved using significance values (11). In a subsequently developed analysis scheme, a database versus database comparison was used to create a similarity network in which the nodes are domain sequences and the (weighted) edges are similarity scores (12,13).
In the resulting predictor algorithm each domain group was characterized by two variables: the average number of similarities above a selected threshold (NSD) and the average similarity score (AVS), which in graph theory correspond to the terms 'degree' and 'average weight', respectively (12,13). Group-specific threshold values were calculated for both variables, and the classification was based on a probabilistic score calculated from the threshold values as well as from measures derived from the distribution of the two characteristic parameters (NSD and AVS) for each domain group (14). Even though the system gave reassuring results in most groups (3,13), a number of persistent mispredictions could not be eliminated by optimizing the threshold values. In the present release of the system we introduce, in addition to the BLAST score and the degree (NSD), four new variables: (i) HSP length (alignment length) determined for the subject; (ii) score coverage, i.e. the similarity score divided by the self-similarity score of the subject (database entry); (iii) length coverage, i.e. the length of the aligned region (HSP length determined for the subject) divided by the subject length; and (iv) length-to-score ratio, i.e. the length of the subject divided by the similarity score. The similarity of a query to a group is characterized by the average of these variables calculated from the BLAST alignments between the query and the members of the group. The averaged variables are quite efficient in clustering domain sequences using BLAST searches. For instance, over 92% of the groups in the PFAM-SEED database (7) were completely separated from their non-member neighbors by at least one of the variables (Figure 1). In order to get a robust separation of the sequence clusters we trained support vector machines (SVMs) with a linear kernel on the variables mentioned above, using the SVM utilities of the R package (www.r-project.org).
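The group-averaged variables described above can be illustrated with a minimal Python sketch. It is not the SBASE implementation: the `Hit` record, the function name `group_features`, the `n_hits` count used as a stand-in for the degree, and the default score threshold are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Hit:
    """One BLAST HSP between the query and a database domain (the subject).

    Hypothetical record; field names are illustrative only.
    """
    group: str         # domain group of the subject
    score: float       # BLAST similarity score of the HSP
    hsp_len: int       # HSP length measured on the subject
    subject_len: int   # full length of the subject sequence
    self_score: float  # score of the subject aligned to itself


def group_features(hits, min_score=40.0):
    """Average the per-hit variables over each domain group.

    min_score is an assumed threshold; hits scoring below it are ignored.
    """
    by_group = {}
    for h in hits:
        if h.score >= min_score:
            by_group.setdefault(h.group, []).append(h)

    features = {}
    for group, hs in by_group.items():
        features[group] = {
            # number of group members hit above the threshold (degree-like)
            "n_hits": len(hs),
            "score": mean(h.score for h in hs),
            "hsp_len": mean(h.hsp_len for h in hs),
            # similarity score / self-similarity score of the subject
            "score_coverage": mean(h.score / h.self_score for h in hs),
            # aligned length on the subject / subject length
            "length_coverage": mean(h.hsp_len / h.subject_len for h in hs),
            # subject length / similarity score
            "length_to_score": mean(h.subject_len / h.score for h in hs),
        }
    return features
```

Each resulting feature dictionary corresponds to one point in the variable space in which a linear classifier would then separate group members from non-members.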
Benchmark SVM training was based on comparing the non-redundant members of PFAM-SEED 8.0 (128 780 sequences) to a set of their parent proteins (94 102 sequences) using a BLAST score cutoff of 40, and recording a hit as 'good' or 'bad' if 20% of an HSP overlapped with a domain of the corresponding domain type or of a different domain type, respectively. The training took 10 h on a dual-processor (1400 MHz) AMD Opteron 240 machine. The predictive performance of the resulting system is shown by the fact that, if PFAM-SEED 8.0 is used as the reference database, over 60% of the groups had at most one mistaken prediction (Table 1), and the average difference in the domain boundaries is <5 amino acids in 90% of the cases (Figure 2).
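The 'good'/'bad' labelling of training hits can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the inclusive residue coordinates, the reading of "20%" as "at least 20%", and the rule that a 'good' overlap takes precedence over a 'bad' one are all choices made here.

```python
def overlap_fraction(hsp_start, hsp_end, dom_start, dom_end):
    """Fraction of the HSP (inclusive residue coordinates) covered by a domain."""
    shared = min(hsp_end, dom_end) - max(hsp_start, dom_start) + 1
    return max(shared, 0) / (hsp_end - hsp_start + 1)


def label_hit(hsp_start, hsp_end, query_group, annotations, min_overlap=0.20):
    """Return 'good', 'bad', or None for one HSP on a parent protein.

    annotations: list of (group, start, end) domains annotated on the parent.
    Assumption: a sufficient overlap with a domain of the query's own group
    wins over overlaps with domains of other groups.
    """
    label = None
    for group, start, end in annotations:
        if overlap_fraction(hsp_start, hsp_end, start, end) >= min_overlap:
            if group == query_group:
                return "good"
            label = "bad"
    return label
```

Hits that overlap no annotated domain by the required fraction return `None` and would simply be left out of the training set in this sketch.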

Figure 1

Separation of domain group members from neighbors in three dimensions. The kringle group is one of the perfectly separated groups, WD repeat is one of the critical cases (Lhsp = length of HSP, Lsbj = length of subject (database entry), S/Sself = score coverage; see text for explanations).

Table 1.

SVM benchmark figures (a)

Domain group	Learning set (no. of sequences)	Test set (no. of sequences)	Match	Mismatch	Unpredicted
Kringle domain	24	9	9	0	0
Fibronectin type III domain	108	352	328	2	22
WD repeat	1924	673	542	12	125
EGF-like domain	87	290	262	13	15
Protein kinase domain	67	545	505	5	35
Annexin repeat	181	34	32	1	2
Sushi domain	80	119	103	0	3
Trypsin family	83	128	110	0	1
Globin family	79	59	57	1	0
ABC transporter domain	63	564	563	1	0
Ank repeat	1195	736	535	5	196
Total	128 780	60 457	56 891	238	3328

(a) The learning set consisted of the parent protein sequences of domains in PFAM-SEED 8.0. The test set included parent proteins with annotated domains not included in PFAM-SEED.

Figure 2

Domain boundary prediction statistics available at the SBASE homepage.

IMPROVEMENTS WITH RESPECT TO THE PREVIOUS RELEASE

The examples included in the consolidated domain sequence collection SBASE A were filtered to discard conspicuously short or long domain examples. The consolidated sets were used to train SVMs. SBASE A groups have been renamed to match INTERPRO names. Domains, families and repeats are now classified into separate categories. The SVMs were incorporated into the predictor algorithm. Evaluation of a protein sequence query typically takes 10 s, including 5 s for the BLAST run.

DISTRIBUTION AND ACCESS

The SBASE domain library browser and domain architecture prediction system are accessible through the web-interface at http://www.icgeb.org/sbase.

13 in total

1. A simple probabilistic scoring method for protein domain identification.

Authors: J Murvai; K Vlahovicek; S Pongor
Journal: Bioinformatics Date: 2000-12 Impact factor: 6.937

2. Prediction of protein functional domains from sequences using artificial neural networks.

Authors: J Murvai; K Vlahovicek; C Szepesvári; S Pongor
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

3. The domain-server: direct prediction of protein domain-homologies from BLAST search.

Authors: J Murvai; K Vlahovicek; E Barta; S Parthasarathy; H Hegyi; F Pfeiffer; S Pongor
Journal: Bioinformatics Date: 1999-04 Impact factor: 6.937

4. SMART 4.0: towards genomic data integration.

Authors: Ivica Letunic; Richard R Copley; Steffen Schmidt; Francesca D Ciccarelli; Tobias Doerks; Jörg Schultz; Chris P Ponting; Peer Bork
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. PIRSF: family classification system at the Protein Information Resource.

Authors: Cathy H Wu; Anastasia Nikolskaya; Hongzhan Huang; Lai-Su L Yeh; Darren A Natale; C R Vinayaka; Zhang-Zhi Hu; Raja Mazumder; Sandeep Kumar; Panagiotis Kourtesis; Robert S Ledley; Baris E Suzek; Leslie Arminski; Yongxing Chen; Jian Zhang; Jorge Louie Cardenas; Sehee Chung; Jorge Castro-Alvear; Georgi Dinkov; Winona C Barker
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. PRINTS and its automatic supplement, prePRINTS.

Authors: T K Attwood; P Bradley; D R Flower; A Gaulton; N Maudling; A L Mitchell; G Moulton; A Nordle; K Paine; P Taylor; A Uddin; C Zygouri
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. The SBASE domain sequence library, release 10: domain architecture prediction.

Authors: Kristian Vlahovicek; Laszló Kaján; János Murvai; Zoltán Hegedus; Sándor Pongor
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

8. The InterPro Database, 2003 brings increased coverage and new features.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Daniel Barrell; Alex Bateman; David Binns; Margaret Biswas; Paul Bradley; Peer Bork; Phillip Bucher; Richard R Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Laurent Falquet; Wolfgang Fleischmann; Sam Griffiths-Jones; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; Rodrigo Lopez; Ivica Letunic; David Lonsdale; Ville Silventoinen; Sandra E Orchard; Marco Pagni; David Peyruc; Chris P Ponting; Jeremy D Selengut; Florence Servant; Christian J A Sigrist; Robert Vaughan; Evgueni M Zdobnov
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

9. Improved detection of homology in distantly related proteins: similarity of adducin with actin-binding proteins.

Authors: G Simon; R Paladini; S Tisminetzky; M Cserzö; Z Hátsági; A Tossi; S Pongor
Journal: Protein Seq Data Anal Date: 1992

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
