eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity.

Qiaojuan Jane Su¹, Lin Lu, Serge Saxonov, Douglas L Brutlag.

Abstract

Classifying proteins into families and superfamilies allows identification of functionally important conserved domains. The motifs and scoring matrices derived from such conserved regions provide computational tools that recognize similar patterns in novel sequences, and thus enable the prediction of protein function for genomes. The eBLOCKs database enumerates a cascade of protein blocks with varied conservation levels for each functional domain. A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found in a larger group of more remotely related proteins with a reduced stringency. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members as in a superfamily. By applying PSI-BLAST and a modified K-means clustering algorithm, eBLOCKs automatically groups protein sequences according to different levels of similarity. Multiple sequence alignments are made and trimmed into a series of ungapped blocks. Motifs and position-specific scoring matrices were derived from eBLOCKs and made available for sequence search and annotation. The eBLOCKs database provides a tool for high-throughput genome annotation with maximal specificity and sensitivity. The eBLOCKs database is freely available on the World Wide Web at http://motif.stanford.edu/eblocks/ to all users for online usage. Academic and not-for-profit institutions wishing copies of the program may contact Douglas L. Brutlag (brutlag@stanford.edu). Commercial firms wishing copies of the program for internal installation may contact Jacqueline Tay at the Stanford Office of Technology Licensing (jacqueline.tay@stanford.edu; http://otl.stanford.edu/).

Year: 2005 PMID: 15608172 PMCID: PMC540014 DOI: 10.1093/nar/gki060

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Over the last two decades, the successful scale-up of automated high-throughput DNA sequencing has dramatically changed discovery in biology and the biomedical sciences. An increasing number of complete genome sequences from various organisms have been determined, and the first draft of the full human genome sequence is now available. A major new goal of the Human Genome Project is functional genomics, which uses experimental and computational techniques to elucidate the function and structure of the encoded gene products. Computer-aided sequence analysis has become an increasingly important method for identifying the function of uncharacterized proteins translated from genomic or cDNA sequences. The primary method for sequence analysis is similarity search, using programs such as BLAST (1,2), FASTA (3) and Smith–Waterman (4). However, in many cases, the sequence of an unknown protein is too distantly related to any protein of known function for the resemblance to be detected by overall, or even local, sequence alignment. The biological function of such a sequence can often be revealed by detecting within it patterns that are conserved among a family of proteins. Sequence patterns emphasize the specific residues that are essential for a biological function and ignore other positions that are less important. The conserved patterns of a protein family usually correspond to important functions such as ligand binding, catalysis or protein interaction. A conserved pattern or motif is derived from the multiple sequence alignment of a group of related proteins. Therefore, compiling alignments of conserved protein regions is the basis of pattern recognition for protein identification.
A number of protein family and superfamily databases have been built to archive conserved alignments, and search tools have been developed to link a query sequence to the related family or superfamily through different pattern-matching algorithms. Widely used protein family and superfamily databases include Pfam (5), PROSITE (6), SMART (7), PRINTS (8), ProDom (9), Domo (10,11), BLOCKS+ (12), InterPro (13), SYSTERS (14), ProtoMap (15), CluSTr (16), SBASE (17), ProClass (18) and ProtoNet (19). Structural classification databases cluster proteins at the three-dimensional structure level; structure classes are defined in databases such as SCOP (20), CATH (21) and FSSP (22).
A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found, with reduced stringency, in a larger group of more remotely related proteins. The eBLOCKs database has been designed to enumerate a cascade of protein blocks with varied conservation levels for each functional domain. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members, as in a superfamily. We have generated the eBLOCKs database to compile ungapped conserved regions, or blocks, directly from an unclassified sequence database in a generic and fully automated way. eBLOCKs builds protein groups from sequences based on PSI-BLAST searches (23) and clusters the PSI-BLAST hit sequences into groups at many different conservation levels: subfamilies, families and superfamilies. Each group represents one distinct level of conservation, which can then be used to build patterns of a particular specificity.
The enumeration of blocks with an array of different specificities provides the basis for generating motifs or position-specific scoring matrices (PSSMs) over a wide range of sensitivity and specificity. This feature of eBLOCKs allows a given query sequence to be recognized by matching against blocks of all family levels, providing a solution to the dilemma of sensitivity versus specificity in pattern recognition.

METHODS

eBLOCKs first uses PSI-BLAST to find all the sequences that share varying degrees of similarity with a seed sequence. Sequences that are reported in a PSI-BLAST result usually fall into groups that share distinct regions or domains. A typical PSI-BLAST result is illustrated in Figure 1. Group 1 sequences share global similarity to the query sequence (seed sequence); Group 2 sequences share a domain close to the N-terminus of the seed sequence; Group 3 sequences share the C-terminal region of the seed sequence. Group 1 could form a protein family, while Groups 2 and 3 could be superfamilies sharing more distant similarities. Although the region shared by Group 2 is also included in Group 1, the same region is less conserved in Group 2 than in Group 1. Therefore, eBLOCKs not only builds high-specificity blocks from Group 1, but also generates sets of high-sensitivity blocks from Groups 2 and 3. The eBLOCKs database was built in three major steps: (i) cluster a PSI-BLAST result into individual groups representing distinct similarity modules; (ii) make a gapped multiple sequence alignment for the sequences contained in each group; and (iii) trim each gapped multiple alignment into ungapped subregions, or blocks.

Figure 1

A typical PSI-BLAST result has multiple similarity modules. Group 1 contains sequences in Cluster 1; Group 2 contains sequences in Clusters 1 and 2; and Group 3 contains sequences in Clusters 1 and 3.

Using a modified K-means clustering method, the sequences returned from the PSI-BLAST search are classified into K clusters, where K is automatically determined by a heuristic method. Each cluster thus obtained represents a subgroup that aligns to a distinct region of the query sequence. The grouping of clusters is illustrated in Figure 2. In this example, Cluster 8 represents a group of closely related sequences that are globally similar. Cluster 2 represents a group of sequences that are almost globally similar but differ in the N-terminus. Cluster 9 represents a group of sequences that can only align at a region closer to the C-terminal end. Each cluster is then combined with all of its ‘covering’ clusters to form a group, where the covering clusters are the other clusters whose shared region is longer and fully covers the region shared by that cluster. As shown in Figure 2, Cluster 8 forms Group 8, and Clusters 2 and 8 form Group 2, while all three clusters form Group 9. Representing the same region by multiple groups with different levels of conservation allows eBLOCKs to annotate a novel sequence with maximal specificity and sensitivity. Models built from these different groups are able to extract patterns that are conserved within a subfamily, a family or a superfamily. The group assembly is done for each cluster, and thus the total number of groups is equal to the number of clusters found by K-means.

Figure 2

Clusters defined by K-means clustering are organized into groups. A typical conservation region is represented by multiple groups with different similarity levels, so as to maximize specificity and sensitivity. Group 8 contains sequences in Cluster 8; Group 2 contains sequences in Clusters 8 and 2; Group 9 contains sequences in Clusters 8, 2 and 9.
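The grouping rule illustrated in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each cluster is reduced to a single hypothetical (start, end) interval on the seed sequence, the coordinates below are invented to mimic Figure 2, and the function names `covers` and `build_groups` are ours.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end) on the seed sequence, inclusive


def covers(outer: Region, inner: Region) -> bool:
    """True if `outer` fully contains `inner` (a region covers itself)."""
    return outer[0] <= inner[0] and outer[1] >= inner[1]


def build_groups(clusters: Dict[int, Region]) -> Dict[int, List[int]]:
    """For every cluster, collect the cluster itself plus all covering clusters.

    One group is produced per cluster, so the number of groups equals
    the number of clusters, as stated in the text.
    """
    groups: Dict[int, List[int]] = {}
    for cluster_id, region in clusters.items():
        members = [
            other_id
            for other_id, other_region in clusters.items()
            if covers(other_region, region)
        ]
        groups[cluster_id] = sorted(members)
    return groups


# Hypothetical coordinates chosen to resemble Figure 2:
# Cluster 8 is global, Cluster 2 lacks the N-terminus,
# Cluster 9 aligns only near the C-terminus.
example = {8: (1, 300), 2: (40, 300), 9: (200, 300)}
print(build_groups(example))  # {8: [8], 2: [2, 8], 9: [2, 8, 9]}
```

With this simplification, two clusters that share exactly the same interval would cover each other and end up in identical groups; how the published method treats that case is not described in the text.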

After the PSI-BLAST result has been divided into groups that represent distinct conservation modules, the sequences in each group are aligned together. One multiple sequence alignment is generated for each group, derived from the PSI-BLAST alignments. Such alignments contain gaps, whereas we define eBLOCKs as ungapped conserved regions. Block width directly affects the specificity of the derived patterns, whether these are regular expressions or probability matrices. To ensure that the blocks provide sufficient specificity, we set a minimum width of 10 positions for eBLOCKs. From the multiple alignment generated for each group, all subregions with at least 10 consecutive non-gapped positions are trimmed out as raw blocks. Both the front and back edges of each raw block are then examined, so that the edges can be refined and extended when the conservation level is high. Figure 3 summarizes the generation of eBLOCKs from one PSI-BLAST result. Similarity groups representing shared modules at different conservation levels, including subfamilies, families and superfamilies, are formed by clustering and grouping all the subject sequences returned by a PSI-BLAST search. Sequences in each group are aligned and the ungapped regions are excised to form several blocks. An eBLOCKs accession number is composed of three parts: the SWISS-PROT accession number of the seed sequence, the group number as assigned by K-means clustering, and the block number, which is the sequential number of the trimmed block within the multiple sequence alignment for the group.

Figure 3

A flowchart for the eBLOCKs algorithm. Similarity groups that represent shared modules at different conservation levels are formed by the clustering and grouping of all the subject sequences returned by a PSI-BLAST search. Sequences in each group are aligned and the ungapped regions are excised to form several blocks. An eBLOCK accession number is composed of three parts: the SWISS-PROT accession number of the seed sequence, the group number as assigned by K-means clustering and the block number as the sequential number of trimmed blocks from the multiple sequence alignment for the group.
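The trimming step described in METHODS (excising every run of at least 10 consecutive ungapped columns as a raw block) might look roughly like the sketch below. It assumes the alignment is given as equal-length strings with `-` as the gap character; the function name `trim_ungapped_blocks` and the toy sequences are hypothetical, and the subsequent edge refinement and extension step is not reproduced because the text does not specify it.

```python
from typing import List, Tuple


def trim_ungapped_blocks(
    alignment: List[str], min_width: int = 10, gap: str = "-"
) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of raw blocks.

    A raw block is a maximal run of columns in which no sequence has a
    gap, kept only if the run is at least `min_width` columns wide.
    """
    if not alignment:
        return []
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    blocks: List[Tuple[int, int]] = []
    run_start = None
    # The extra iteration (col == n_cols) acts as a sentinel that
    # closes a run reaching the end of the alignment.
    for col in range(n_cols + 1):
        gapped = col == n_cols or any(row[col] == gap for row in alignment)
        if not gapped and run_start is None:
            run_start = col
        elif gapped and run_start is not None:
            if col - run_start >= min_width:
                blocks.append((run_start, col))
            run_start = None
    return blocks


# Toy two-sequence alignment (invented for illustration).
rows = [
    "MKVLAAGIVGLS--TTPEQ",
    "MKVLSAGIVALSKDTT-EQ",
]
print(trim_ungapped_blocks(rows))  # [(0, 12)]
```

Only the first 12 columns survive here: the two shorter ungapped runs after the gaps are each 2 columns wide and fall below the minimum width of 10.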

RESULTS

The current eBLOCKs database was built with Swiss-Prot Release 40, resulting in a total of 159 974 blocks. The distribution of the average information content is shown in Figure 4a. The distribution of the block width in the eBLOCKs database is shown in Figure 4b. The distribution of the number of member sequences is shown in Figure 4c.

Figure 4

Statistics of the current eBLOCKs database. (a) The distribution of the average information content for the blocks. (b) The distribution of block width. (c) The distribution of the number of sequences contained in the blocks.
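To make the "average information content" statistic in Figure 4a concrete, here is one common way such a value can be computed for an ungapped block. The text does not say which definition was used, so this is an assumption: per-column information is taken as log2(20) minus the Shannon entropy of the observed residue frequencies (uniform background, no pseudocounts or small-sample correction), averaged over the columns. The function names are ours.

```python
import math
from collections import Counter
from typing import List


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information (bits) of one column: log2(alphabet) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(alphabet_size) - entropy


def average_information_content(block: List[str]) -> float:
    """Mean per-column information of an ungapped block (equal-length rows)."""
    width = len(block[0])
    columns = ["".join(row[i] for row in block) for i in range(width)]
    return sum(column_information(col) for col in columns) / width


# Toy block: two perfectly conserved columns and one split 50/50.
toy_block = ["ACD", "ACE", "ACD", "ACE"]
print(round(average_information_content(toy_block), 3))
```

Under this definition a fully conserved column scores log2(20), about 4.32 bits, and the score drops as more residue types share a column, which matches the intuition that superfamily-level blocks carry less information per column than subfamily-level ones.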

Blocks can be used to detect related sequences through pattern matching. Two kinds of patterns can be computed from blocks: discrete motifs [regular expressions (24)] and PSSMs (25,26). Discrete motifs were generated for the blocks in the eBLOCKs database using the eMOTIF package (27,28). PSSMs were computed for the blocks in eBLOCKs using the eMATRIX package (26,29).

Expectation values (E-values) were calculated as follows. For each motif generated by eMOTIF, an E-value is calculated from its specificity (S) by summing all other equal or better specificities in the database:

$$E = \sum_{i \,:\, S_i \text{ equal to or better than } S} S_i$$

For each eMATRIX hit, the specificity can be converted into an E-value by multiplying by N, the number of tests performed, which is equal to B × (L − W + 1), where B is the total number of blocks (29):

$$E = S \times N = S \times B \times (L - W + 1)$$

The eBLOCKs database is available on the Web at http://motif.stanford.edu/eblocks/. Users can submit query sequences on the ‘Search A Sequence’ page and select either eMotif or eMatrix as the tool to search against the eMOTIF or eMATRIX databases derived from the current eBLOCKs database. On the result page, eBLOCKs hits are ranked by E-value, and each hit has a pointer to the eBLOCK record page. The eBLOCK record page displays the accession number of the block, the block alignment and the sequence Logo, and provides commands for using the corresponding PSSM to scan a number of databases and retrieve matching sequences. The eBLOCKs database is also searchable by accession number and by keyword through the ‘Search By Accession’ and ‘Search By Keyword’ pages.
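The two E-value conversions described earlier in this section can be written out as a short numerical sketch. Several points are assumptions rather than statements from the text: L is taken to be the query sequence length and W the block width, a "better" specificity is taken to mean a numerically smaller value (a lower chance-match probability), the motif's own specificity is included in the eMOTIF sum, and the function names and input numbers are hypothetical apart from the block count of 159 974 reported above.

```python
from typing import Iterable


def ematrix_e_value(
    specificity: float, n_blocks: int, seq_len: int, block_width: int
) -> float:
    """E = S * N with N = B * (L - W + 1).

    Assumes L is the query length and W the block width, so that
    (L - W + 1) counts the positions where the block can be placed.
    """
    positions = max(seq_len - block_width + 1, 0)
    n_tests = n_blocks * positions
    return specificity * n_tests


def emotif_e_value(specificity: float, all_specificities: Iterable[float]) -> float:
    """Sum every specificity in the database that is equal or better.

    Assumes 'better' means numerically smaller and that the list
    includes the motif's own specificity.
    """
    return sum(s for s in all_specificities if s <= specificity)


# Hypothetical hit: S = 1e-9 for a 20-column block on a 300-residue query,
# against the 159,974 blocks reported for the current database.
print(ematrix_e_value(1e-9, 159_974, 300, 20))

# Hypothetical three-motif database.
print(emotif_e_value(1e-8, [1e-7, 1e-8, 1e-9]))
```

The eMATRIX example performs 159 974 × 281 = 44 952 694 tests, so a specificity of 1e-9 corresponds to an E-value of roughly 0.045.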

DISCUSSION

We have built the eBLOCKs database automatically from protein sequences. eBLOCKs represents a similarity region multiple times by enumerating groups with different levels of conservation (Figure 2). This important feature maximizes sensitivity and specificity when eBLOCKs is used to annotate a query sequence. When a region is represented at the superfamily level, more remotely related sequences are included in the block, which is consequently narrower and allows more substitutions at the conserved residues (Group 9 in Figure 2). Conversely, a family- or subfamily-level block contains more closely related sequences, and is therefore wider and more restricted in residue substitutions (Group 8 in Figure 2). Thus, eBLOCKs effectively archives a family tree for each conserved region. Specificity increases when going up the tree to the subfamily level, and sensitivity increases when going down the tree to the superfamily level. By enumerating all family levels, eBLOCKs forms the basis for highly sensitive and highly specific pattern matching and enables pattern discovery with optimal sensitivity and specificity for a given query sequence.
The eBLOCKs building process is generic and can be applied to any set of protein sequences. The characterized proteins in SWISS-PROT were selected as the building set for this work because they are a source of extensively validated annotation and are therefore useful for identifying an unknown sequence. Nonetheless, TrEMBL or other protein databases can be used as the building set to incorporate the most recent information from large-scale sequencing projects. Alternatively, a more restricted collection of sequences could be used as the source in order to focus on a particular subset of sequences. In fact, eBLOCKs has been successfully applied to a set of signal transduction proteins and has generated a database called eSIGNAL (J. Alexander, unpublished data).
In this work, the eMATRIX and eMOTIF algorithms were used to derive PSSMs and motifs from the eBLOCKs database and the corresponding eMATRIX search and eMOTIF search tools were used in sequence searches. Alternatively, one can use the IMPALA package (30) to derive PSSMs from the eBLOCKs database and to search sequences.

REFERENCES (30 in total)

1. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons.

Authors: F Corpet; F Servant; J Gouzy; D Kahn
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. ProClass protein family database.

Authors: H Huang; C Xiao; C H Wu
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. ProtoMap: automatic classification of protein sequences and hierarchy of protein families.

Authors: G Yona; N Linial; M Linial
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

4. The SYSTERS protein sequence cluster set.

Authors: A Krause; J Stoye; M Vingron
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

5. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices.

Authors: A A Schäffer; Y I Wolf; C P Ponting; E V Koonin; L Aravind; S F Altschul
Journal: Bioinformatics Date: 1999-12 Impact factor: 6.937

6. Fast probabilistic analysis of sequence function using scoring matrices.

Authors: T D Wu; C G Nevill-Manning; D L Brutlag
Journal: Bioinformatics Date: 2000-03 Impact factor: 6.937

7. The PROSITE database, its status in 2002.

Authors: Laurent Falquet; Marco Pagni; Philipp Bucher; Nicolas Hulo; Christian J A Sigrist; Kay Hofmann; Amos Bairoch
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

8. The InterPro Database, 2003 brings increased coverage and new features.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Daniel Barrell; Alex Bateman; David Binns; Margaret Biswas; Paul Bradley; Peer Bork; Phillip Bucher; Richard R Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Laurent Falquet; Wolfgang Fleischmann; Sam Griffiths-Jones; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; Rodrigo Lopez; Ivica Letunic; David Lonsdale; Ville Silventoinen; Sandra E Orchard; Marco Pagni; David Peyruc; Chris P Ponting; Jeremy D Selengut; Florence Servant; Christian J A Sigrist; Robert Vaughan; Evgueni M Zdobnov
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

9. Recent improvements to the SMART domain-based sequence annotation resource.

Authors: Ivica Letunic; Leo Goodstadt; Nicholas J Dickens; Tobias Doerks; Joerg Schultz; Richard Mott; Francesca Ciccarelli; Richard R Copley; Chris P Ponting; Peer Bork
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

10. Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment.

Authors: J Gracy; P Argos
Journal: Bioinformatics Date: 1998 Impact factor: 6.937
