A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins.

H Wan, J C Wootton.

Abstract

Different local regions of natural amino acid or nucleotide sequences show remarkable heterogeneity in residue composition, reflecting diversity in evolutionary history and physicochemical constraints. Compositional complexity measures are helpful for describing and understanding this variegation. Motivated by some open problems in comparative genomics and protein folding, we have developed a new 'global' compositional complexity measure, G1, which overcomes a crucial limitation of earlier methods. The 'local' measures used in previous research resemble entropy functions and are inherently dependent on an underlying probability distribution. Local measures cannot rigorously compare complexity across sequences of substantially different size, because real sequences show very irregular heterogeneity and do not have the necessary ergodicity in scaling and asymptotic properties. G1 is a member of a new class of scale-independent, distribution-independent complexity functions. For a sequence S of length L on an N-letter alphabet, G1 is derived from ratios in the integer partition lattice, P(L,N), of L with N parts, where the elements of P(L,N) are the state vectors of S, (n1, n2, ..., nN), ranked by an order principle. We present theorems and proofs relating to the metric properties of G1 and its relationship to other state-vector-dependent compositional complexity functions, together with a fully-efficient O(L) algorithm to compute G1. The distributions of G1 were calculated for the entire sets of translated proteins encoded by extensively sequenced genomes. The results establish the existence of a clear evolutionary principle, common to bacteria, archaea and eukaryotes, that the proteins encoded by more extreme AT-rich and GC-rich genomes have generally lower compositional complexity than those of more typical organisms.
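As an illustration of the objects named in the abstract, the following minimal Python sketch computes the state vector (n1, n2, ..., nN) of a sequence in one O(L) pass, together with a Shannon entropy of the kind the abstract groups under 'local' measures. It does not compute G1: the abstract does not give the G1 formula or the order principle on the partition lattice. The function names `state_vector` and `shannon_entropy` are hypothetical, and writing the counts in non-increasing order (the usual convention for integer partitions) is an assumption made here.

```python
from collections import Counter
from math import log2


def state_vector(seq, alphabet):
    """Return residue counts of seq as a partition of len(seq) into N parts.

    Assumption: counts are listed in non-increasing order and zero counts
    are kept, so the vector always has N = len(alphabet) entries.
    """
    counts = Counter(seq)  # single pass over the sequence, O(L)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"residues not in alphabet: {sorted(unknown)}")
    return tuple(sorted((counts.get(a, 0) for a in alphabet), reverse=True))


def shannon_entropy(vector):
    """Entropy in bits of the composition given by a state vector.

    This is an entropy-like measure for comparison only; it is NOT G1.
    """
    total = sum(vector)
    return -sum((n / total) * log2(n / total) for n in vector if n > 0)


if __name__ == "__main__":
    for s in ("ACGT", "AACGT", "AAAA"):
        v = state_vector(s, "ACGT")
        print(s, v, shannon_entropy(v))
```

A uniform composition such as "ACGT" gives the maximum entropy for a four-letter alphabet, while a homopolymer such as "AAAA" gives zero. The entropy depends only on the residue proportions, not on the sequence length itself.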

Year:  2000        PMID: 10642881     DOI: 10.1016/s0097-8485(99)00048-0

Source DB:  PubMed          Journal:  Comput Chem        ISSN: 0097-8485

