Significance of Z-value statistics of Smith-Waterman scores for protein alignments.

J P Comet, J C Aude, E Glémet, J L Risler, A Hénaut, P P Slonimski, J J Codani.

Abstract

The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) through the use of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study on the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a law of probability that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimation of its variance and the number of randomizations in the Monte-Carlo process. Then, we illustrate that Z-values are less correlated to sequence lengths than SW-scores. Then we show that pairwise alignments, performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones) lead to Z-value distributions that statistically fit the extreme value distribution, more precisely the Gumbel distribution (global EVD, Extreme Value Distribution). However, for real protein sequences, we observe an over-representation of high Z-values. We determine first a cutoff value which separates these overestimated Z-values from those which follow the global EVD. We then show that the interesting part of the tail of distribution of Z-values can be approximated by another EVD (i.e., an EVD which differs from the global EVD) or by a Pareto law. This has been confirmed for all proteins analysed so far, whether extracted from individual genomes, or from the ensemble of five complete microbial genomes comprising altogether 16956 protein sequences.
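
The Monte-Carlo estimate described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the usual definition Z = (S - mean) / standard deviation, where S is the SW-score of the real pair and the mean and standard deviation are taken over SW-scores obtained after shuffling one of the sequences. The toy scoring scheme (match, mismatch, linear gap penalty), the choice to shuffle only the second sequence, and the names `sw_score` and `z_value` are all assumptions made for this example; a realistic run would use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring parameters are illustrative placeholders, not the ones
    used in the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-value of the SW-score of a against b.

    Sequence b is shuffled n_shuffles times, which preserves its length
    and amino acid composition; the observed score is then standardized
    against the shuffled scores.
    """
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd


if __name__ == "__main__":
    # Hypothetical sequence, used only to exercise the functions.
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(z_value(seq, seq, n_shuffles=200))
```

Because the estimate depends on a finite number of shuffles, repeated runs with different seeds give different Z-values; this is the dependence on the number of randomizations that the abstract refers to.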

Year:  1999        PMID: 10627144     DOI: 10.1016/s0097-8485(99)00008-x

Source DB:  PubMed          Journal:  Comput Chem        ISSN: 0097-8485


  17 in total

1.  Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes.

Authors:  R Apweiler; M Biswas; W Fleischmann; A Kanapin; Y Karavidopoulou; P Kersey; E V Kriventseva; V Mittard; N Mulder; I Phan; E Zdobnov
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins.

Authors:  E V Kriventseva; W Fleischmann; E M Zdobnov; R Apweiler
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

3.  Massive sequence comparisons as a help in annotating genomic sequences.

Authors:  A Louis; E Ollivier; J C Aude; J L Risler
Journal:  Genome Res       Date:  2001-07       Impact factor: 9.043

4.  PHYTOPROT: a database of clusters of plant proteins.

Authors:  S Mohseni-Zadeh; A Louis; P Brézellec; J-L Risler
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

5.  Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters.

Authors:  E V Kriventseva; F Servant; R Apweiler
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

6.  Testing statistical significance scores of sequence comparison methods with structure similarity.

Authors:  Tim Hulsen; Jacob de Vlieg; Jack A M Leunissen; Peter M A Groenen
Journal:  BMC Bioinformatics       Date:  2006-10-12       Impact factor: 3.169

7.  Where does the alignment score distribution shape come from?

Authors:  Philippe Ortet; Olivier Bastien
Journal:  Evol Bioinform Online       Date:  2010-12-12       Impact factor: 1.625

8.  CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions.

Authors:  Yongchao Liu; Bertil Schmidt; Douglas L Maskell
Journal:  BMC Res Notes       Date:  2010-04-06

9.  A simple derivation of the distribution of pairwise local protein sequence alignment scores.

Authors:  Olivier Bastien
Journal:  Evol Bioinform Online       Date:  2008-02-14       Impact factor: 1.625

10.  A computational method to predict genetically encoded rare amino acids in proteins.

Authors:  Barnali N Chaudhuri; Todd O Yeates
Journal:  Genome Biol       Date:  2005-08-31       Impact factor: 13.583
