
The SBASE domain sequence resource, release 12: prediction of protein domain-architecture using support vector machines.

Kristian Vlahovicek, László Kaján, Vilmos Agoston, Sándor Pongor.

Abstract

SBASE (http://www.icgeb.trieste.it/sbase) is an online resource designed to facilitate the detection of domain homologies based on sequence database search. The present release of the SBASE A library of protein domain sequences contains 972,397 protein sequence segments annotated by structure, function, ligand-binding or cellular topology, clustered into 8547 domain groups. SBASE B contains 169,916 domain sequences clustered into 2526 less well-characterized groups. Domain prediction is based on an evaluation of database search results in comparison with a 'similarity network' of inter-sequence similarity scores, using support vector machines trained on similarity search results of known domains.

Year:  2005        PMID: 15608182      PMCID: PMC540066          DOI: 10.1093/nar/gki112

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The SBASE project was initiated in order to develop a prediction scheme that can automatically recognize instances of known protein domains in newly determined sequences, using similarity search on a reference domain sequence database (1–3). One of the project's main motivations has been to solve the prediction problem without the use of a man-made model or consensus description of domain sequence groups, in order to decrease maintenance costs while maintaining the generalization power of the prediction. The resulting system consists of a reference domain sequence database on the one hand and a related predictor program on the other, so the system's predictive power can be optimized by tuning both components in concert.
SBASE 12.0 is a collection of 972 397 protein domain sequences. Each SBASE domain record contains a sequence assigned to one of the 8 547 functionally or structurally well-characterized groups (SBASE A), or to one of the 2 018 less well-characterized groups described in terms of amino acid composition or cellular localization (SBASE B). All domains are cross-referenced back to their parent protein databases [Swiss-Prot + TrEMBL (4), PIR (5)] and to entries in other domain repositories, such as INTERPRO (6) or its member databases Pfam (7), SMART (8) and PRINTS (9).
Finding known domain types in new sequences includes two subtasks: (i) locating the potential domains, a problem that the SBASE system has approached by analyzing the distribution of cumulative FASTA or BLAST similarity scores along the query sequence (10,11); and (ii) selecting/accepting the best candidate domains. The latter task is a classification problem that was initially solved using significance values (11). In a subsequently developed analysis scheme, a database versus database comparison was used to create a similarity network in which the nodes are domain sequences and the (weighted) edges are similarity scores (12,13).
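The similarity network described above can be illustrated with a minimal Python sketch. It is not the SBASE implementation: the function names, the sequence identifiers and the scores are hypothetical, and the score threshold is an assumed parameter. The sketch stores pairwise similarity scores as weighted edges and reads off, for one node, the number of neighbours and the mean edge weight.

```python
from collections import defaultdict


def build_network(pairwise_scores, threshold):
    """Build an undirected similarity network as {node: {neighbour: score}}.

    Only pairs scoring at or above `threshold` become edges (assumed rule).
    """
    network = defaultdict(dict)
    for (a, b), score in pairwise_scores.items():
        if a != b and score >= threshold:
            network[a][b] = score
            network[b][a] = score
    return network


def node_stats(network, node):
    """Return (degree, mean edge weight) of a node; (0, 0.0) if it has no edges."""
    edges = network.get(node, {})
    degree = len(edges)
    mean_weight = sum(edges.values()) / degree if degree else 0.0
    return degree, mean_weight


# Hypothetical database-versus-database scores (illustrative values only).
scores = {
    ("kr1", "kr2"): 210.0,
    ("kr1", "kr3"): 180.0,
    ("kr2", "kr3"): 195.0,
    ("kr1", "wd1"): 35.0,  # falls below the threshold, so no edge is created
}
net = build_network(scores, threshold=40.0)
print(node_stats(net, "kr1"))  # (2, 195.0)
```

Averaging these two per-node quantities over the members of a domain group would give group-level values of the kind the predictor works with.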
In the resulting predictor algorithm each domain group was characterized by two variables: the average number of similarities above a selected threshold (NSD) and the average similarity score (AVS), which in graph theory correspond to the terms ‘degree’ and ‘average weight’, respectively (12,13). Group-specific threshold values were calculated for both variables, and the classification was based on a probabilistic score calculated from the threshold values as well as from measures derived from the distribution of the two characteristic parameters (NSD and AVS) for each domain group (14). Even though the system gave reassuring results in most groups (3,13), there were a number of persistent mispredictions that could not be eliminated by optimizing the threshold values.
In the present release of the system we introduce, in addition to the BLAST score and the degree (NSD), four new variables: (i) HSP length (alignment length) determined for the subject; (ii) score coverage, i.e. the similarity score divided by the self-similarity score of the subject (database entry); (iii) length coverage, i.e. the length of the aligned region (HSP length determined for the subject) divided by the subject length; and (iv) length-to-score ratio, i.e. the length of the subject divided by the similarity score. The similarity of a query to a group is characterized by the average of these variables, calculated from the BLAST alignments made between the query and the members of the group. The averaged variables are quite efficient in clustering domain sequences using BLAST searches. For instance, over 92% of the groups in the PFAM-SEED database (7) were completely separated from the non-group member neighbors by at least one of the variables (Figure 1). In order to get a robust separation of the sequence clusters we trained support vector machines (SVMs) with the linear kernel and the variables mentioned above, using the SVM utilities of the R package (www.r-project.org).
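The per-hit variables and their group averages can be written out as a short Python sketch. This is an illustration under stated assumptions, not the SBASE code: the `Hit` record, its field names and the example numbers are hypothetical, and `n_hits` (the number of group members hit by the query) is used here only as a stand-in for the degree.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Hit:
    """One BLAST alignment between the query and a database domain (hypothetical record)."""
    group: str           # domain group of the subject
    score: float         # similarity score of the alignment
    self_score: float    # score of the subject aligned against itself
    hsp_length: int      # aligned length measured on the subject
    subject_length: int  # full length of the subject


def hit_features(hit):
    """Compute the variables defined in the text for a single hit."""
    return {
        "score": hit.score,
        "hsp_length": float(hit.hsp_length),
        "score_coverage": hit.score / hit.self_score,
        "length_coverage": hit.hsp_length / hit.subject_length,
        "length_to_score": hit.subject_length / hit.score,
    }


def group_features(hits):
    """Average each variable over all hits that fall into the same group."""
    by_group = {}
    for hit in hits:
        by_group.setdefault(hit.group, []).append(hit_features(hit))
    averaged = {}
    for group, rows in by_group.items():
        averaged[group] = {name: mean(r[name] for r in rows) for name in rows[0]}
        averaged[group]["n_hits"] = len(rows)  # assumed stand-in for the degree
    return averaged


# Illustrative values only.
hits = [
    Hit("kringle", score=200.0, self_score=400.0, hsp_length=75, subject_length=100),
    Hit("kringle", score=100.0, self_score=400.0, hsp_length=25, subject_length=100),
]
averaged = group_features(hits)
print(averaged["kringle"])
```

Each averaged vector would then be one input row for a linear-kernel classifier; the training step itself is not reproduced here.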
Benchmark SVM training was based on comparing the non-redundant members of PFAM-SEED 8.0 (128 780 sequences) to a set of their parent proteins (94 102 sequences) using a BLAST score cutoff of 40, and recording a hit as ‘good’ or ‘bad’ if 20% of its HSP overlapped with a domain of the corresponding domain type or of a different domain type, respectively. The training took 10 h on a dual-processor (1400 MHz) AMD Opteron 240 machine. The predictive performance of the resulting system is shown by the fact that, if PFAM-SEED 8.0 is used as the reference database, over 60% of the groups had at most one mistaken prediction (Table 1), and the average difference in the domain boundaries is <5 amino acids in 90% of the cases (Figure 2).
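The labeling rule used for the benchmark can be sketched in Python as follows. Several details are assumptions made for illustration: coordinates are taken to be 1-based and inclusive, the 20% figure is read as a minimum fraction of the HSP, a matching domain type takes precedence when an HSP overlaps several domains, and all function names and example annotations are hypothetical.

```python
def overlap_fraction(hsp, domain):
    """Fraction of the HSP covered by the domain (inclusive 1-based coordinates)."""
    hsp_start, hsp_end = hsp
    dom_start, dom_end = domain
    shared = min(hsp_end, dom_end) - max(hsp_start, dom_start) + 1
    return max(shared, 0) / (hsp_end - hsp_start + 1)


def label_hit(hsp, hit_type, annotated_domains, min_fraction=0.2):
    """Label a hit 'good', 'bad' or None (no sufficient overlap with any domain).

    annotated_domains is a list of (start, end, domain_type) on the parent protein.
    """
    label = None
    for start, end, domain_type in annotated_domains:
        if overlap_fraction(hsp, (start, end)) >= min_fraction:
            if domain_type == hit_type:
                return "good"  # assumed precedence: a same-type overlap wins
            label = "bad"
    return label


# Hypothetical parent protein carrying two annotated domains.
annotated = [(10, 90, "Kringle"), (120, 200, "Trypsin")]
print(label_hit((20, 80), "Kringle", annotated))    # good
print(label_hit((100, 180), "Kringle", annotated))  # bad
print(label_hit((91, 119), "Kringle", annotated))   # None
```

Hits labeled this way would provide the two classes on which a classifier is trained.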
Figure 1

Separation of domain group members from neighbors in three dimensions. The kringle group is one of the perfectly separated groups; WD repeat is one of the critical cases. Lhsp = length of HSP; Lsbj = length of subject (database entry); S/Sself = score coverage; see text for explanations.

Table 1.

SVM benchmark figures (a)

Domain group                   No. of sequences           Match    Mismatch   Unpredicted
                               Learning set   Test set
Kringle domain                 24             9          9        0          0
Fibronectin type III domain    108            352        328      2          22
WD repeat                      192467354212125
EGF-like domain                87             290        262      13         15
Protein kinase domain          67             545        505      5          35
Annexin repeat                 181343212
Sushi domain                   8011910303
Trypsin family                 8312811001
Globin family                  79595710
ABC transporter domain         63             564        563      1          0
Ank repeat                     11957365355196
Total                          128 780        60 457     56 891   238        3328

(a) The learning set consisted of the parent protein sequences of domains in PFAM-SEED 8.0. The test set included parent proteins with annotated domains not included in PFAM-SEED.

Figure 2

Domain boundary prediction statistics available at the SBASE homepage.

IMPROVEMENTS WITH RESPECT TO THE PREVIOUS RELEASE

The examples included in the consolidated domain sequence collection SBASE A were filtered so as to discard conspicuously short or long domain examples. The consolidated sets were used to train SVMs. SBASE A groups have been renamed so as to match INTERPRO names. Domains, families and repeats are now classified into separate categories. The SVMs were incorporated into the predictor algorithm. Evaluation of a protein sequence query typically takes 10 s, including 5 s for the BLAST run.

DISTRIBUTION AND ACCESS

The SBASE domain library browser and domain architecture prediction system are accessible through the web-interface at http://www.icgeb.org/sbase.
  13 in total

1.  A simple probabilistic scoring method for protein domain identification.

Authors:  J Murvai; K Vlahovicek; S Pongor
Journal:  Bioinformatics       Date:  2000-12       Impact factor: 6.937

2.  Prediction of protein functional domains from sequences using artificial neural networks.

Authors:  J Murvai; K Vlahovicek; C Szepesvári; S Pongor
Journal:  Genome Res       Date:  2001-08       Impact factor: 9.043

3.  The domain-server: direct prediction of protein domain-homologies from BLAST search.

Authors:  J Murvai; K Vlahovicek; E Barta; S Parthasarathy; H Hegyi; F Pfeiffer; S Pongor
Journal:  Bioinformatics       Date:  1999-04       Impact factor: 6.937

4.  SMART 4.0: towards genomic data integration.

Authors:  Ivica Letunic; Richard R Copley; Steffen Schmidt; Francesca D Ciccarelli; Tobias Doerks; Jörg Schultz; Chris P Ponting; Peer Bork
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

5.  PIRSF: family classification system at the Protein Information Resource.

Authors:  Cathy H Wu; Anastasia Nikolskaya; Hongzhan Huang; Lai-Su L Yeh; Darren A Natale; C R Vinayaka; Zhang-Zhi Hu; Raja Mazumder; Sandeep Kumar; Panagiotis Kourtesis; Robert S Ledley; Baris E Suzek; Leslie Arminski; Yongxing Chen; Jian Zhang; Jorge Louie Cardenas; Sehee Chung; Jorge Castro-Alvear; Georgi Dinkov; Winona C Barker
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

6.  PRINTS and its automatic supplement, prePRINTS.

Authors:  T K Attwood; P Bradley; D R Flower; A Gaulton; N Maudling; A L Mitchell; G Moulton; A Nordle; K Paine; P Taylor; A Uddin; C Zygouri
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

7.  The SBASE domain sequence library, release 10: domain architecture prediction.

Authors:  Kristian Vlahovicek; Laszló Kaján; János Murvai; Zoltán Hegedus; Sándor Pongor
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

8.  The InterPro Database, 2003 brings increased coverage and new features.

Authors:  Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Daniel Barrell; Alex Bateman; David Binns; Margaret Biswas; Paul Bradley; Peer Bork; Phillip Bucher; Richard R Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Laurent Falquet; Wolfgang Fleischmann; Sam Griffiths-Jones; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; Rodrigo Lopez; Ivica Letunic; David Lonsdale; Ville Silventoinen; Sandra E Orchard; Marco Pagni; David Peyruc; Chris P Ponting; Jeremy D Selengut; Florence Servant; Christian J A Sigrist; Robert Vaughan; Evgueni M Zdobnov
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

9.  Improved detection of homology in distantly related proteins: similarity of adducin with actin-binding proteins.

Authors:  G Simon; R Paladini; S Tisminetzky; M Cserzö; Z Hátsági; A Tossi; S Pongor
Journal:  Protein Seq Data Anal       Date:  1992

10.  The Pfam protein families database.

Authors:  Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971
