
A phylogenetic approach for weighting genetic sequences.

Nicola De Maio, Alexander V Alekseyenko, William J Coleman-Smith, Fabio Pardi, Marc A Suchard, Asif U Tamuri, Jakub Truszkowski, Nick Goldman.

Abstract

BACKGROUND: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented.
RESULTS: We formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column, which is important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes.
CONCLUSIONS: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.
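
To make the example application in RESULTS concrete, the following is a minimal sketch of how any set of sequence weights can enter a character frequency estimate at one alignment column. It does not compute the phylogenetic novelty scores themselves; the function name `weighted_column_frequencies` and the weight values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def weighted_column_frequencies(column, weights):
    """Estimate character frequencies at one alignment column.

    column  -- one character per sequence in the alignment
    weights -- one non-negative weight per sequence (illustrative values,
               not scores produced by the method in the abstract)
    """
    if len(column) != len(weights):
        raise ValueError("column and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Accumulate the weight carried by each character, then normalise.
    mass = defaultdict(float)
    for char, weight in zip(column, weights):
        mass[char] += weight
    return {char: m / total for char, m in mass.items()}


# Three near-identical sequences carry 'A'; one distinct sequence carries 'G'.
column = ["A", "A", "A", "G"]

# Equal weights reproduce the raw counts.
uniform = weighted_column_frequencies(column, [1.0, 1.0, 1.0, 1.0])

# Assumed weights that down-weight the over-represented sequences.
downweighted = weighted_column_frequencies(column, [0.5, 0.5, 0.5, 1.5])
```

With equal weights the estimate simply mirrors the over-represented sequences, whereas a weighting scheme that assigns less weight to redundant sequences shifts the estimate towards the under-represented character.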

Keywords:  Alignment; Conservation scores; Phylogenetics; Protein profile; Sequence weights

Year:  2021        PMID: 34049487     DOI: 10.1186/s12859-021-04183-8

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


