Effective protein sequence comparison.

W R Pearson.

Abstract

Although several comparison programs are available (e.g., BLASTP, FASTA, SSEARCH, and BLITZ) that can be used with different scoring systems (e.g., PAM120, PAM250, BLOSUM50, BLOSUM62) and different databases (e.g., PIR, SWISS-PROT, GenPept), the following search protocol should identify homologous sequences whenever they can be found.

1. Always compare protein sequences if the genes encode proteins. Protein sequence comparison will typically double the evolutionary lookback time over DNA sequence comparison.

2. Search several sequence databases using a rapid sequence comparison program (e.g., BLASTP or FASTA, ktup = 2). Well-curated databases like PIR or SWISS-PROT tend to have fewer redundant sequences, which improves the statistical significance of a match, but they are less comprehensive and up-to-date than GenPept.

3. If there is good agreement between the distribution of scores and the theoretical distribution, and the alignments do not include "simple sequence" domains, accept sequences with FASTA E() values or BLASTP P() values below 0.02 as homologous.

4. If no library sequences are found with E values below 0.02, perform additional searches with FASTA, ktup = 1, or SSEARCH. If library sequences with E values less than 0.02 are found, the sequences are probably homologous, unless a low-complexity domain is aligned. However, sequences with E values from 0.02 to 10.0 may be homologous as well. To characterize these more distantly related sequences, select "marginal" library sequences and use them to search the databases. Additional family members should have E values less than 0.05.

5. Homologous sequences share a common ancestor, and thus a common protein fold. Depending on the evolutionary distance and divergence path, two or more homologous sequences may have very few absolutely conserved residues. However, if homology has been inferred between A and B, between B and C, and between C and D, then A and D must be homologous, even if they share no significant similarity.

6. Sequences with marginal E values should also be tested using the PRSS program. Compare the query and library sequences using at least 200 (and preferably 1000) shuffles. Shuffles using a window (-w) of 10-20 are more stringent than a uniform shuffle. Use the E value after 1000 shuffles to confirm an inference of homology.

7. Homologous sequences are usually similar over an entire sequence or domain, typically sharing 20-25% or greater identity for more than 200 residues. Matches that are more than 50% identical in a 20- to 40-amino acid region occur frequently by chance and do not indicate homology.

By following these steps, one will very rarely assert that two sequences are homologous when in fact they are not. However, these criteria are stringent; distantly related homologous sequences may fail to be detected because their similarity is not statistically significant. These tests are biased toward missing some distantly related sequences to avoid the possibility of misidentifying unrelated ones. In most database searches, the ratio of unrelated to related sequences is more than 4000:1 (e.g., 10 related and 40,000 unrelated sequences). Thus, one is more likely to mistakenly identify two sequences as related than to overlook a genuine relationship, and our conservative evaluation criteria reflect that bias.
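The E-value cutoffs in steps 3 and 4 and the transitivity argument in step 5 can be illustrated with a minimal Python sketch. It is not part of the original protocol or of any FASTA/BLAST tooling: the function names `classify_hit` and `homology_groups`, the label strings, and the input format (a plain E value and a list of name pairs) are assumptions made for illustration only.

```python
from collections import defaultdict

E_ACCEPT = 0.02    # acceptance threshold quoted in steps 3 and 4
E_MARGINAL = 10.0  # upper end of the "may be homologous" range in step 4


def classify_hit(e_value, low_complexity=False):
    """Label one database hit using the E-value cutoffs from the abstract.

    The label strings are hypothetical; only the numeric cutoffs come
    from the text.
    """
    if e_value < E_ACCEPT:
        # An aligned low-complexity ("simple sequence") domain makes an
        # otherwise significant score untrustworthy.
        return "suspect" if low_complexity else "homologous"
    if e_value <= E_MARGINAL:
        # Candidate for re-searching and shuffle-based testing (steps 4, 6).
        return "marginal"
    return "unsupported"


def homology_groups(pairs):
    """Group sequences by transitive homology (step 5) using union-find.

    `pairs` is an iterable of (name_a, name_b) tuples for which homology
    has already been inferred.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return sorted(sorted(group) for group in groups.values())
```

Under these assumptions, `homology_groups([("A", "B"), ("B", "C"), ("C", "D")])` places A and D in the same group even though no A-D pair was supplied, which mirrors the chain of inference described in step 5.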

Year: 1996   PMID: 8743688   DOI: 10.1016/s0076-6879(96)66017-0

Source DB: PubMed   Journal: Methods Enzymol   ISSN: 0076-6879   Impact factor: 1.600

