
Flexible sequence similarity searching with the FASTA3 program package.

W R Pearson.

Abstract

The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments. Traditionally, sequence similarity searches have sought to ask one question: "Is my query sequence homologous to anything in the database?" Both FASTA and BLAST can provide reliable answers to this question with their statistical estimates; if the expectation value E is < 0.001-0.01 and you are not doing hundreds of searches a day, the answer is probably yes. In general, the most effective search strategies follow these rules:

1. Whenever possible, compare at the amino acid level, rather than the nucleotide level. Search first with protein sequences (blastp, fasta3, and ssearch3), then with translated DNA sequences (fastx, blastx), and only at the DNA level as a last resort (Table 5).

2. Search the smallest database that is likely to contain the sequence of interest (but it must contain many unrelated sequences for accurate statistical estimates).

3. Use sequence statistics, rather than percent identity or percent similarity, as your primary criterion for sequence homology.

4. Check that the statistics are likely to be accurate by looking for the highest-scoring unrelated sequence, using prss3 to confirm the expectation, and searching with shuffled copies of the query sequence [randseq; searches with shuffled sequences should have E approximately 1.0].

5. Consider searches with different gap penalties and other scoring matrices. Searches with long query sequences against full-length sequence libraries will not change dramatically when BLOSUM62 is used instead of BLOSUM50 (20), or a gap penalty of -14/-2 is used in place of -12/-2. However, shallower or more stringent scoring matrices are more effective at uncovering relationships in partial sequences (3,18), and they can be used to sharpen dramatically the scope of the similarity search.
However, as illustrated in the last section, the E value is only the first step in characterizing a sequence relationship. Once one has confidence that the sequences are homologous, one should look at the sequence alignments and percent identities, particularly when searching with lower quality sequences. When sequence alignments are very short, the alignment should become more significant when a shallower scoring matrix is used, e.g., BLOSUM62 rather than BLOSUM50 (remember to change the gap penalties). Homology can be reliably inferred from statistically significant similarity. Whereas homology implies common three-dimensional structure, homology need not imply common function. Orthologous sequences usually have similar functions, but paralogous sequences often acquire very different functional roles. Motif databases, such as PROSITE (21), can provide evidence for the conservation of critical functional residues. However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology.
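The shuffled-query control mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the randseq program itself: it assumes a simple uniform shuffle that preserves residue composition, and the function names (`shuffled_copies`, `to_fasta`) and the example protein sequence are hypothetical, chosen only for illustration. The shuffled copies would then be searched against the same database as the real query; according to the abstract, such searches should return E values of approximately 1.0 if the statistics are behaving well.

```python
import random
from collections import Counter


def shuffled_copies(seq, n=5, seed=0):
    """Return n shuffled copies of seq.

    Assumption: a uniform permutation of residues, which keeps the
    length and residue composition of the query but destroys its order.
    """
    rng = random.Random(seed)  # fixed seed so the controls are reproducible
    copies = []
    for _ in range(n):
        residues = list(seq)
        rng.shuffle(residues)
        copies.append("".join(residues))
    return copies


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)


# Hypothetical query sequence, used only to exercise the functions.
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
controls = shuffled_copies(query, n=3)

for i, control in enumerate(controls, 1):
    print(to_fasta(f"shuffled_{i}", control))
```

Each printed record can be saved to a file and used as a query in place of the original sequence. A high-scoring hit with a very small E value from a shuffled query would suggest that the statistical estimates for that search should not be trusted.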


Year:  2000        PMID: 10547837     DOI: 10.1385/1-59259-192-2:185

Source DB:  PubMed          Journal:  Methods Mol Biol        ISSN: 1064-3745


