FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.

Pablo Mier, Miguel A Andrade-Navarro.

Abstract

The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automatically derived annotations that help to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, grouping very similar protein sequences based on near-full-length similarity and/or a high threshold of sequence identity. We compressed 50 reference proteomes, along with the SwissProt database, which we could compress by 74.7%. The clustering algorithm was benchmarked using OrthoBench and compared with FASTA HERDER, a previous version of the algorithm, showing that FastaHerder2 can cluster a set of proteins with high compression and a lower error rate than its predecessor. We illustrate the use of FastaHerder2 to detect biologically relevant functional features in protein families. With our approach we seek to promote a modern view and usage of protein sequence databases that is more appropriate to the postgenomic era.
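
To make the two clustering criteria in the abstract concrete, the following is a minimal sketch of threshold-based greedy clustering and of how a compression figure could be computed. It is not the FastaHerder2 algorithm, whose details are not given here. The function names, the 0.9 thresholds, the use of Python's `difflib` as a stand-in for a real sequence aligner, and the definition of compression as one minus clusters over sequences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b, min_identity, min_coverage):
    """Return True if two sequences pass both (hypothetical) thresholds."""
    if not a or not b:
        return False
    # Coverage proxy for near-full-length similarity:
    # the shorter sequence must span most of the longer one.
    coverage = min(len(a), len(b)) / max(len(a), len(b))
    if coverage < min_coverage:
        return False
    # Identity proxy: residues in matching blocks over the shorter length.
    # A real pipeline would use a pairwise alignment instead of difflib.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b)) >= min_identity


def greedy_cluster(seqs, min_identity=0.9, min_coverage=0.9):
    """Assign each sequence to the first cluster whose representative it matches."""
    clusters = []  # each cluster is a list of names; the first is the representative
    # Longest sequences first, so representatives are the most complete entries.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for cluster in clusters:
            if similar(seqs[cluster[0]], seqs[name], min_identity, min_coverage):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def compression(n_sequences, n_clusters):
    """Fraction of entries removed when each cluster is kept as one entry (assumed definition)."""
    return 1 - n_clusters / n_sequences


if __name__ == "__main__":
    # Made-up toy sequences, not taken from any database.
    toy = {
        "p1": "MKTAYIAKQRQISFVKSHFSRQ",
        "p2": "MKTAYIAKQRQISFVKSHFSRL",  # one substitution relative to p1
        "p3": "GGGGSSSSGGGGSSSSGGGGSS",  # unrelated
        "p4": "MKTAYIAK",                # fragment of p1, fails the coverage check
    }
    result = greedy_cluster(toy)
    print(result)
    print(compression(len(toy), len(result)))
```

In this sketch the fragment `p4` stays in its own cluster even though it is identical to the start of `p1`, which illustrates why a near-full-length requirement is a separate criterion from sequence identity.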

Keywords:  cluster analysis; clustering; computational biology; data mining; databases

Year:  2016        PMID: 26828375     DOI: 10.1089/cmb.2015.0191

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  8 in total

1.  Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length.

Authors:  Pablo Mier; Miguel A Andrade-Navarro
Journal:  Genome Biol Evol       Date:  2018-03-01       Impact factor: 3.416

2.  dAPE: a web server to detect homorepeats and follow their evolution.

Authors:  Pablo Mier; Miguel A Andrade-Navarro
Journal:  Bioinformatics       Date:  2017-04-15       Impact factor: 6.937

3.  The Protein Structure Context of PolyQ Regions.

Authors:  Franziska Totzeck; Miguel A Andrade-Navarro; Pablo Mier
Journal:  PLoS One       Date:  2017-01-26       Impact factor: 3.240

4.  Geometric characterisation of disease modules.

Authors:  Franziska Härtner; Miguel A Andrade-Navarro; Gregorio Alanis-Lobato
Journal:  Appl Netw Sci       Date:  2018-06-18

5.  Manifold learning and maximum likelihood estimation for hyperbolic network embedding.

Authors:  Gregorio Alanis-Lobato; Pablo Mier; Miguel A Andrade-Navarro
Journal:  Appl Netw Sci       Date:  2016-11-15

6.  Efficient embedding of complex networks to hyperbolic space via their Laplacian.

Authors:  Gregorio Alanis-Lobato; Pablo Mier; Miguel A Andrade-Navarro
Journal:  Sci Rep       Date:  2016-07-22       Impact factor: 4.379

7.  CABRA: Cluster and Annotate Blast Results Algorithm.

Authors:  Pablo Mier; Miguel A Andrade-Navarro
Journal:  BMC Res Notes       Date:  2016-04-30

8.  The latent geometry of the human protein interaction network.

Authors:  Gregorio Alanis-Lobato; Pablo Mier; Miguel Andrade-Navarro
Journal:  Bioinformatics       Date:  2018-08-15       Impact factor: 6.937
