
MMseqs2 desktop and local web server app for fast, interactive sequence searches.

Milot Mirdita, Martin Steinegger, Johannes Söding.

Abstract

SUMMARY: The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST.
AVAILABILITY AND IMPLEMENTATION: The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press.

Year:  2019        PMID: 30615063      PMCID: PMC6691333          DOI: 10.1093/bioinformatics/bty1057

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The most popular sequence similarity search tool, BLAST (Altschul et al., 1997), has garnered ∼7000 citations per year during the last 5 years, attesting to the unremitting importance of sequence searches for biology. This popularity may be largely owed to the excellent web services provided by the NCBI/NIH, which offer short response times despite fast-growing databases but require a huge compute infrastructure. The distributed approach of running searches locally on personal computers or on the IT platforms of companies and research groups allows for custom databases and high availability, and it protects sensitive data. However, web server applications for local homology searches are slow, as they mostly rely on BLAST (e.g. Deng et al.; Priyam et al.). Here, we present an application to search with protein and nucleotide sequences through custom protein sequence and profile databases using MMseqs2 (Steinegger and Söding, 2017), achieving response times of seconds instead of minutes at a sensitivity similar to that of BLAST.

2 Materials and methods

2.1 Reduced runtime overhead

MMseqs2 owes its sensitivity and speed mainly to its pre-filtering stage, which rejects ∼99.99% of sequences. The pre-filter uses a reverse k-mer index table for the target database and also requires matrices with similarity scores between 2-mers and between 3-mers to generate the lists of similar 7-mers (Steinegger and Söding, 2017). Reading in the index table and computing these matrices on-the-fly takes ∼0.5 min of runtime overhead for each search. We reduced this to 0.05 s by (1) writing the index table, the matrices and other pre-computable data into a file if it does not yet exist, (2) memory mapping the file to take advantage of the system page cache (for detailed memory requirements see Supplementary Materials) and (3) optimizing I/O operations.
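Steps (1) and (2) can be illustrated with a minimal Python sketch. It is not the MMseqs2 implementation: the file layout (a 4-byte count followed by 32-bit integers) and the function names `build_index_file`, `open_index` and `lookup` are hypothetical and chosen only to show the write-once, memory-map-afterwards pattern.

```python
import mmap
import os
import struct
import tempfile


def build_index_file(path, table):
    """Write a toy pre-computed table (unsigned 32-bit ints) only if missing."""
    if os.path.exists(path):
        return  # reuse the existing file instead of recomputing it
    with open(path, "wb") as fh:
        fh.write(struct.pack("<I", len(table)))            # header: entry count
        fh.write(struct.pack(f"<{len(table)}I", *table))   # payload


def open_index(path):
    """Memory-map the file read-only; reads are served via the OS page cache."""
    fh = open(path, "rb")
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    return fh, mm


def lookup(mm, i):
    """Read entry i straight from the mapping without parsing the whole file."""
    (count,) = struct.unpack_from("<I", mm, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    (value,) = struct.unpack_from("<I", mm, 4 + 4 * i)
    return value


# Toy usage: the "expensive" table is written once, later runs only map it.
path = os.path.join(tempfile.mkdtemp(), "toy.idx")
build_index_file(path, [x * x for x in range(1000)])
fh, mm = open_index(path)
print(lookup(mm, 12))  # 144
mm.close()
fh.close()
```

Because the mapping is read-only and backed by the file, repeated searches on the same machine can share pages that the operating system already holds in memory, which is the effect the text attributes to the page cache.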

2.2 Optimized sequence-to-profile search mode

The index table for profile databases stores, for each position in a profile, all k-mers with a profile similarity score above a threshold set by -s. The number of similar k-mers grows exponentially with k. To save memory, we chose a short k = 5 as default for this mode. We also added utilities to MMseqs2 for creating profiles from multiple sequence alignments (MSAs) and for converting between profile formats.
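To make the memory argument concrete, the following Python sketch enumerates all k-mers whose summed per-position score reaches a threshold for one profile window. It is a toy illustration under stated assumptions: the scores, the helper `toy_column` and the branch-and-bound enumeration in `similar_kmers` are hypothetical and do not reproduce MMseqs2's actual scoring or data structures.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_column(preferred, match=3, mismatch=-1):
    """Hypothetical profile column: one preferred residue, all others penalized."""
    return {aa: (match if aa == preferred else mismatch) for aa in ALPHABET}


def similar_kmers(profile_window, threshold):
    """Return all k-mers whose summed column scores are >= threshold.

    profile_window is a list of k dicts mapping residue -> score.
    A prefix is abandoned as soon as even the best possible suffix
    could not lift it to the threshold (branch and bound).
    """
    k = len(profile_window)
    # best_suffix[i] = maximum score obtainable from positions i..k-1
    best_suffix = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] + max(profile_window[i].values())

    results = []

    def extend(pos, prefix, score):
        if pos == k:
            results.append(prefix)
            return
        for residue, s in profile_window[pos].items():
            if score + s + best_suffix[pos + 1] >= threshold:
                extend(pos + 1, prefix + residue, score + s)

    extend(0, "", 0)
    return results


# Allowing exactly one mismatch per window, the list grows with k,
# while the full search space 20**k grows much faster.
for k in (2, 3, 4, 5):
    window = [toy_column(ALPHABET[i]) for i in range(k)]
    hits = similar_kmers(window, threshold=3 * k - 4)
    print(k, len(hits), 20 ** k)
```

The space of possible k-mers is 20^k, i.e. 3 200 000 for k = 5 versus 1 280 000 000 for k = 7, which is why a shorter k keeps the per-position lists and the index table small.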

2.3 Desktop and web server app

Based on the same code base, the application can be either deployed through Docker containers to be accessed through web browsers or packaged as a desktop GUI application with the Electron framework (electronjs.org). In either case, the backend part of the application provides a RESTful API and worker scheduling. The server supports protein, translated nucleotide and nucleotide sequence searches and iterative and reverse profile searches. The application takes a list of either protein or nucleotide sequences in FASTA/FASTQ format as query input. To generate a target search database, the application takes a FASTA/FASTQ file for protein sequence searches or a STOCKHOLM MSA file for protein profile searches. Search results are shown with a customized feature-viewer (github.com/calipho-sib/feature-viewer) (Fig. 1A) and can be downloaded in tabular BLAST format.
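Results downloaded in tabular BLAST format can be post-processed with a few lines of code. The Python sketch below assumes the conventional 12-column, tab-separated layout (query, target, identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, bit score); the `Hit` type and `parse_tabular` function are illustrative names, not part of the app.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    identity: float
    aln_len: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    evalue: float
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per non-empty, non-comment line of 12-column tabular output."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(f)}")
        yield Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )


# Toy usage with made-up identifiers and values.
example = [
    "q1\tt7\t45.2\t120\t60\t2\t1\t120\t5\t124\t1e-30\t115.0",
    "q1\tt9\t30.1\t80\t50\t3\t10\t89\t1\t80\t0.002\t40.5",
]
significant = [h for h in parse_tabular(example) if h.evalue <= 1e-3]
print([h.target for h in significant])  # ['t7']
```

Whether the identity column holds a percentage or a fraction depends on the tool and version that wrote the file, so the sketch only parses it as a number.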
Fig. 1.

(A) Screenshots of the search interface and result visualization. (B) Runtime of searches with the baseline MMseqs2 (square) and the new server mode (circle) at four sensitivity settings (-s). (C) Domain annotation: Speedup versus sensitivity at 95% precision for MMseqs2 (triangle: sequence-profile search, upside-down triangle: sequence–sequence search; sensitivity settings: -s 1, 3, 5, 7), DIAMOND (square; default, --sensitive, --more-sensitive) and BLAST (circle). HMMER3 matches to Pfam domains are used as ground truth. The speed-ups exclude the times to format the databases.

3 Results

Figure 1B demonstrates the reduction in runtime overhead by comparing the runtimes of the MMseqs2 version without pre-computations and memory mapping (‘baseline’) to those of the new version with them (‘server mode’). Runtimes refer to searches with amino acid query sets of 1, 10, 100, 1000 and 10 000 sequences of average length 350 (sampled from the Uniclust30 database) through the Uniclust30 2017_10 database (Mirdita et al.) with 13.5 million sequences, measured on a server with two Intel Xeon E5-2680 v4 CPUs with 14 cores each. The index table and matrix pre-computation (∼3 min 40 s) is not included in the runtimes.
To test the quality and speed of annotating Pfam domains on genes assembled from metagenomics data, we built a test set by sampling 100 000 full-length sequences longer than 150 residues from our Marine Eukaryotic Reference Catalogue (Steinegger et al.), clustering this set to 30% maximum pairwise sequence identity with MMseqs2 and sampling 10 000 sequences from the redundancy-reduced set. We annotated these sequences with PfamA 31.0 domains (Finn et al.) using HMMER3 (Finn et al.). We then compared how well the sequence-sequence searches of MMseqs2, BLAST and DIAMOND (Buchfink et al.) and the sequence-to-profile searches of MMseqs2 could find the correct domain annotations. For the sequence-sequence search methods, we built a database from all sequences in the PfamA.full MSAs and reported, as the E-value of a Pfam domain, the E-value of the best-matching sequence from its MSA. We defined a search as true positive (TP) if the top match was annotated by HMMER3 with an E-value better than 10⁻³ and as false positive (FP) if the top match was not annotated with an HMMER3 E-value below 1. All other searches were considered ambiguous and ignored. For each method, we determined the E-value at which the precision TP/(TP+FP) is 95% and measured the sensitivity at that E-value.
As Figure 1C shows, MMseqs2 sequence-to-profile searches are ∼30 times faster than sequence-sequence searches with DIAMOND, MMseqs2 and BLAST and ∼300 times faster than HMMER3. MMseqs2 sequence-to-profile searches reach 87% relative sensitivity at 95% precision, making them an attractive alternative to HMMER3 when speed is critical.
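The evaluation criterion, sensitivity at 95% precision, can be sketched in Python as follows. This is an illustration under assumptions, not the authors' evaluation script: it picks the loosest E-value cutoff at which precision TP/(TP+FP) is still at least 95%, and it defines sensitivity as the number of TPs at that cutoff divided by a caller-supplied count of annotated queries (`n_annotated`). Ambiguous searches are assumed to have been removed beforehand.

```python
def sensitivity_at_precision(hits, n_annotated, min_precision=0.95):
    """Return (evalue_cutoff, sensitivity) or None if the precision is never reached.

    hits: iterable of (evalue, is_true_positive) pairs, one per query's top match.
    n_annotated: number of queries that could have been found (assumed denominator).
    """
    ordered = sorted(hits, key=lambda h: h[0])
    tp = fp = 0
    best = None
    i = 0
    while i < len(ordered):
        cutoff = ordered[i][0]
        # Consume every hit tied at this E-value before evaluating the cutoff.
        while i < len(ordered) and ordered[i][0] == cutoff:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= min_precision:
            best = (cutoff, tp / n_annotated)
    return best


# Toy data: 20 confident TPs, then one FP, then further noisy hits.
toy_hits = [(1e-50, True)] * 20 + [(1e-10, False), (1e-3, False), (1e-2, True)]
print(sensitivity_at_precision(toy_hits, n_annotated=25))
```

In the toy data the cutoff settles at the first false positive: including it still gives 20/21 precision, whereas the second false positive would drop precision below 95%.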

4 Conclusion

The desktop and web server app for MMseqs2 performs fast sequence searches at an unprecedented speed-to-sensitivity trade-off on local computers. A thousand queries take only a minute to search through the fifteen million sequences of the Uniclust30 database, much faster than NCBI’s BLAST website. We hope the MMseqs2 app will also empower users unfamiliar with command line interfaces to perform fast and sensitive searches with their own sequence and profile databases.

References

1.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

2.  ViroBLAST: a stand-alone BLAST web server for flexible queries of multiple databases and user's datasets.

Authors:  Wenjie Deng; David C Nickle; Gerald H Learn; Brandon Maust; James I Mullins
Journal:  Bioinformatics       Date:  2007-06-22       Impact factor: 6.937

3.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

4.  Fast and sensitive protein alignment using DIAMOND.

Authors:  Benjamin Buchfink; Chao Xie; Daniel H Huson
Journal:  Nat Methods       Date:  2014-11-17       Impact factor: 28.547

5.  Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold.

Authors:  Martin Steinegger; Milot Mirdita; Johannes Söding
Journal:  Nat Methods       Date:  2019-06-24       Impact factor: 28.547

6.  Uniclust databases of clustered and deeply annotated protein sequences and alignments.

Authors:  Milot Mirdita; Lars von den Driesch; Clovis Galiez; Maria J Martin; Johannes Söding; Martin Steinegger
Journal:  Nucleic Acids Res       Date:  2016-11-28       Impact factor: 16.971

7.  MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.

Authors:  Martin Steinegger; Johannes Söding
Journal:  Nat Biotechnol       Date:  2017-10-16       Impact factor: 54.908

8.  HMMER web server: interactive sequence similarity searching.

Authors:  Robert D Finn; Jody Clements; Sean R Eddy
Journal:  Nucleic Acids Res       Date:  2011-05-18       Impact factor: 16.971

9.  Pfam: the protein families database.

Authors:  Robert D Finn; Alex Bateman; Jody Clements; Penelope Coggill; Ruth Y Eberhardt; Sean R Eddy; Andreas Heger; Kirstie Hetherington; Liisa Holm; Jaina Mistry; Erik L L Sonnhammer; John Tate; Marco Punta
Journal:  Nucleic Acids Res       Date:  2013-11-27       Impact factor: 16.971
