MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

Maria Hauser, Martin Steinegger, Johannes Söding.

Abstract

MOTIVATION: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through comprehensive databases such as UniProt is becoming too costly.
RESULTS: MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4-30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ∼30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks.
AVAILABILITY AND IMPLEMENTATION: MMseqs is open-source software available under GPL at https://github.com/soedinglab/MMseqs
CONTACT: martin.steinegger@mpibpc.mpg.de, soeding@mpibpc.mpg.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
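
ILLUSTRATION (not part of the published abstract): The prefiltering idea described under RESULTS, summing the scores of similar k-mers shared by a query and each target, can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the MMseqs implementation: the substitution scores, the similarity threshold `min_kmer_score`, the k-mer length, and all function names are hypothetical, and the brute-force comparison against every indexed k-mer is chosen for readability only.

```python
from collections import defaultdict

# Toy substitution scores (hypothetical, NOT a real matrix such as BLOSUM62):
# identical residues score 4, a few similar pairs score 2, all others -1.
SIMILAR_PAIRS = {
    frozenset("IL"), frozenset("IV"), frozenset("LV"),
    frozenset("KR"), frozenset("DE"), frozenset("ST"), frozenset("FY"),
}


def residue_score(a, b):
    """Score one aligned residue pair with the toy scheme above."""
    if a == b:
        return 4
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 2
    return -1


def kmer_score(x, y):
    """Ungapped score of two equal-length k-mers."""
    return sum(residue_score(a, b) for a, b in zip(x, y))


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(targets, k):
    """Map each k-mer to the set of target ids that contain it."""
    index = defaultdict(set)
    for target_id, seq in targets.items():
        for km in kmers(seq, k):
            index[km].add(target_id)
    return index


def prefilter_scores(query, targets, k=3, min_kmer_score=8):
    """Sum, per target, the scores of all k-mer pairs judged 'similar'.

    A pair counts as similar when its toy score reaches min_kmer_score
    (an assumed threshold). Targets with no similar k-mer are omitted.
    """
    index = build_index(targets, k)
    totals = defaultdict(int)
    for q_kmer in kmers(query, k):
        # Brute force over indexed k-mers; a real tool would enumerate
        # the similar k-mers directly instead of scanning the index.
        for t_kmer, target_ids in index.items():
            score = kmer_score(q_kmer, t_kmer)
            if score >= min_kmer_score:
                for target_id in target_ids:
                    totals[target_id] += score
    return dict(totals)
```

For example, with the hypothetical targets `{"t1": "MKVLA", "t2": "GGGGG"}`, the query `"MKVLA"` yields `{"t1": 36}`: three identical 3-mers at 12 points each, while `t2` shares no similar k-mer and is filtered out. Only targets that pass such a filter would go on to the more expensive alignment step.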

Year:  2016        PMID: 26743509     DOI: 10.1093/bioinformatics/btw006

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  26 in total

1.  Uniclust databases of clustered and deeply annotated protein sequences and alignments.

Authors:  Milot Mirdita; Lars von den Driesch; Clovis Galiez; Maria J Martin; Johannes Söding; Martin Steinegger
Journal:  Nucleic Acids Res       Date:  2016-11-28       Impact factor: 16.971

2.  Highly regulated, diversifying NTP-dependent biological conflict systems with implications for the emergence of multicellularity.

Authors:  Gurmeet Kaur; A Maxwell Burroughs; Lakshminarayan M Iyer; L Aravind
Journal:  Elife       Date:  2020-02-26       Impact factor: 8.140

3.  MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.

Authors:  Martin Steinegger; Johannes Söding
Journal:  Nat Biotechnol       Date:  2017-10-16       Impact factor: 54.908

4.  Chlorine redox chemistry is widespread in microbiology.

Authors:  Tyler P Barnum; John D Coates
Journal:  ISME J       Date:  2022-10-06       Impact factor: 11.217

5.  Systematic discovery of recombinases for efficient integration of large DNA sequences into the human genome.

Authors:  Matthew G Durrant; Alison Fanton; Josh Tycko; Michaela Hinks; Sita S Chandrasekaran; Nicholas T Perry; Julia Schaepe; Peter P Du; Peter Lotfy; Michael C Bassik; Lacramioara Bintu; Ami S Bhatt; Patrick D Hsu
Journal:  Nat Biotechnol       Date:  2022-10-10       Impact factor: 68.164

6.  Identification of Uncharacterized Components of Prokaryotic Immune Systems and Their Diverse Eukaryotic Reformulations.

Authors:  A Maxwell Burroughs; L Aravind
Journal:  J Bacteriol       Date:  2020-11-19       Impact factor: 3.490

7.  IDseq-An open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring.

Authors:  Katrina L Kalantar; Tiago Carvalho; Charles F A de Bourcy; Boris Dimitrov; Greg Dingle; Rebecca Egger; Julie Han; Olivia B Holmes; Yun-Fang Juan; Ryan King; Andrey Kislyuk; Michael F Lin; Maria Mariano; Todd Morse; Lucia V Reynoso; David Rissato Cruz; Jonathan Sheu; Jennifer Tang; James Wang; Mark A Zhang; Emily Zhong; Vida Ahyong; Sreyngim Lay; Sophana Chea; Jennifer A Bohl; Jessica E Manning; Cristina M Tato; Joseph L DeRisi
Journal:  Gigascience       Date:  2020-10-15       Impact factor: 6.524

8.  Using collections of structural models to predict changes of binding affinity caused by mutations in protein-protein interactions.

Authors:  Alberto Meseguer; Lluis Dominguez; Patricia M Bota; Joaquim Aguirre-Plans; Jaume Bonet; Narcis Fernandez-Fuentes; Baldo Oliva
Journal:  Protein Sci       Date:  2020-09-05       Impact factor: 6.725

9.  Bacterial death and TRADD-N domains help define novel apoptosis and immunity mechanisms shared by prokaryotes and metazoans.

Authors:  Gurmeet Kaur; Lakshminarayan M Iyer; A Maxwell Burroughs; L Aravind
Journal:  Elife       Date:  2021-06-01       Impact factor: 8.140

10.  Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome.

Authors:  Stephen Nayfach; David Páez-Espino; Lee Call; Soo Jen Low; Hila Sberro; Natalia N Ivanova; Amy D Proal; Michael A Fischbach; Ami S Bhatt; Philip Hugenholtz; Nikos C Kyrpides
Journal:  Nat Microbiol       Date:  2021-06-24       Impact factor: 17.745
