Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment.

J Gracy, P Argos.

Abstract

MOTIVATION: Genome sequencing projects require the periodic application of analysis tools that can classify and multiply align related protein sequence domains. Full automation of this task requires an efficient integration of similarity and alignment techniques.
RESULTS: We have developed a fully automated process that classifies entire protein sequence databases, resulting in alignment of the homologous sequences. The successive steps of the procedure are based on compositional and local sequence similarity searches followed by multiple sequence alignments. Global similarities are detected from the pairwise comparison of amino acid and dipeptide compositions of each protein. After the elimination of all but one sequence from each detected cluster of closely related proteins, the remaining sequences are compiled in a suffix tree which is self-compared to detect local sequence similarities. Sets of proteins which share similar sequence segments are then weighted according to their closeness and multiply aligned using a fast hierarchical dynamic programming algorithm. Computational strategies were devised to minimize computer processing time and memory space requirements. The accuracy of the sequence classifications has been evaluated for 12 462 primary structures distributed over 341 known families. The percentage of sequences with missed or incorrect family assignments was 6.8% on the test set. This low error level is only twice that of the manually constructed PROSITE database (3.4%) and is substantially better than that found for the automatically built PRODOM database (34.9%).
AVAILABILITY: The resulting database, called DOMO, is available through database search routine SRS at Infobiogen (http://www.infobiogen.fr/srs5/), EBI (http://srs.ebi.ac.uk:5000/) and EMBL (http://www.embl-heidelberg.de/srs5/) World Wide Web sites.
CONTACT: gracy@infobiogen.fr
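
The compositional similarity step named in the RESULTS (pairwise comparison of amino acid and dipeptide compositions) can be illustrated with a minimal sketch. The abstract does not state which distance measure or thresholds the authors used, so the Euclidean distance below, and the function names `composition_vector` and `composition_distance`, are assumptions made purely for illustration and are not the DOMO implementation.

```python
from collections import Counter
from itertools import product
import math

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies.

    Hypothetical helper: the paper's exact composition encoding is not
    given in the abstract.
    """
    seq = seq.upper()
    mono = Counter(seq)
    # Overlapping dipeptides: positions (0,1), (1,2), ...
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = max(len(seq), 1)
    n_di = max(len(seq) - 1, 1)
    vec = [mono[a] / n_mono for a in AMINO_ACIDS]
    vec += [di[a + b] / n_di for a, b in product(AMINO_ACIDS, repeat=2)]
    return vec


def composition_distance(seq_a, seq_b):
    """Euclidean distance between two composition vectors.

    Assumption: Euclidean distance stands in for whatever measure the
    authors actually applied.
    """
    va, vb = composition_vector(seq_a), composition_vector(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


if __name__ == "__main__":
    # Two made-up sequences; a small distance suggests similar global composition.
    print(composition_distance("MKTAYIAKQRQISFVK", "MKTAYIAKQRQLSFVK"))
```

Because each sequence is reduced to a fixed-length vector of 420 values, comparing two proteins costs the same regardless of their lengths, which is one plausible reason a composition-based pass is attractive as a first, cheap filter before local similarity search and alignment.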

Year:  1998        PMID: 9545449     DOI: 10.1093/bioinformatics/14.2.164

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  15 in total

1.  Increased coverage of protein families with the blocks database servers.

Authors:  J G Henikoff; E A Greene; S Pietrokovski; S Henikoff
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  ProtoMap: automatic classification of protein sequences and hierarchy of protein families.

Authors:  G Yona; N Linial; M Linial
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

3.  The MetaFam Server: a comprehensive protein family resource.

Authors:  K A Silverstein; E Shoop; J E Johnson; A Kilian; J L Freeman; T M Kunau; I A Awad; M Mayer; E F Retzel
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

4.  Mendel-GFDb and Mendel-ESTS: databases of plant gene families and ESTs annotated with gene family numbers and gene family names.

Authors:  D Lonsdale; M Crowe; B Arnold; B C Arnold
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

5.  Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics.

Authors:  Y Kuroda; K Tani; Y Matsuo; S Yokoyama
Journal:  Protein Sci       Date:  2000-12       Impact factor: 6.725

6.  DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches.

Authors:  J D Thompson; F Plewniak; J Thierry; O Poch
Journal:  Nucleic Acids Res       Date:  2000-08-01       Impact factor: 16.971

7.  Sequence similarities of protein kinase peptide substrates and inhibitors: comparison of their primary structures with immunoglobulin repeats.

Authors:  J Kubrycht; J Borecký; K Sigler
Journal:  Folia Microbiol (Praha)       Date:  2002       Impact factor: 2.099

8.  Drosophila genomic sequence annotation using the BLOCKS+ database.

Authors:  J G Henikoff; S Henikoff
Journal:  Genome Res       Date:  2000-04       Impact factor: 9.043

9.  d2_cluster: a validated method for clustering EST and full-length cDNA sequences.

Authors:  J Burke; D Davison; W Hide
Journal:  Genome Res       Date:  1999-11       Impact factor: 9.043

10.  Fold homology detection using sequence fragment composition profiles of proteins.

Authors:  Armando D Solis; Shalom R Rackovsky
Journal:  Proteins       Date:  2010-10