Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation.

Alessandro Laio, Marco Punta, Elena Tea Russo.

Abstract

BACKGROUND: The identification of protein families is of outstanding practical importance for in silico protein annotation and underpins several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is time-consuming and may suffer from a bias introduced by the hand-curation itself, which is often guided by the available experimental evidence.
RESULTS: We introduce a procedure that aims to automatically identify putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis, we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.
CONCLUSIONS: The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
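
ILLUSTRATION: The abstract does not specify how distances are derived from the local pairwise alignments, nor which parameters are used, so the following is only a minimal sketch of the generic Density Peak clustering scheme applied to a precomputed distance matrix. It is not the authors' pipeline. The function name `density_peak_clustering`, the cutoff `d_c`, the fixed `n_clusters`, and the toy one-dimensional data are all assumptions made for illustration.

```python
def density_peak_clustering(dist, d_c, n_clusters):
    """Generic Density Peak clustering on a symmetric distance matrix.

    dist       : list of lists, dist[i][j] = distance between items i and j
    d_c        : density cutoff (hypothetical parameter name)
    n_clusters : number of peaks to keep (an assumption; peaks are often
                 chosen by inspecting a density/distance decision graph)
    """
    n = len(dist)
    # Local density: how many other items lie closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit items from highest to lowest density (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest item ranked as denser
    parent = [-1] * n   # index of that nearest denser item
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest item has no denser neighbour: use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j
    # Peaks combine high density with a large distance to any denser item.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = float("inf")  # the densest item is always a peak
    centers = sorted(range(n), key=lambda i: (-gamma[i], i))[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Every other item inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta


# Toy data: two well-separated groups on a line (not protein data).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in points] for a in points]
labels, rho, delta = density_peak_clustering(dist, d_c=0.15, n_clusters=2)
print(labels)
```

On the toy data the first three points share one label and the last three share the other. For sequence data, `dist` would have to be built from alignment scores by some transformation that the abstract does not describe.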

Keywords:  Pfam; Protein families; Sequence analysis; Unsupervised clustering

Year:  2021        PMID: 33711918      PMCID: PMC7955657          DOI: 10.1186/s12859-021-04013-x

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


