Motivation: Modern molecular methods have increased our ability to describe microbial communities. Along with the advances brought by new sequencing technologies, we now require intensive computational resources to make sense of the large numbers of sequences continuously produced. The software developed by the scientific community to address this demand, although very useful, requires experience with the command-line environment and extensive training, and has a steep learning curve, which limits its use. We created SEED 2, a graphical user interface for handling high-throughput amplicon-sequencing data under Windows operating systems.
Results: SEED 2 is the only sequence visualizer that empowers users with tools to handle amplicon-sequencing data of microbial community markers. It is suitable for any marker gene sequences obtained through Illumina, IonTorrent or Sanger sequencing. SEED 2 allows the user to process raw sequencing data, identify specific taxa, produce OTU tables, create sequence alignments and construct phylogenetic trees. Standard dual-core laptops with 8 GB of RAM can handle ca. 8 million Illumina PE 300 bp sequences, ca. 4 GB of data.
Availability and implementation: SEED 2 was implemented in Object Pascal and uses internal functions and external software for amplicon data processing. SEED 2 is freeware, available at http://www.biomed.cas.cz/mbu/lbwrf/seed/ as a self-contained file that includes all dependencies and does not require installation. Supplementary data contain a comprehensive list of supported functions.
Supplementary information: Supplementary data are available at Bioinformatics online.
High-throughput sequencing technologies have significantly increased our capability to describe microbial communities. These modern molecular techniques can detect the fine composition of communities or subtle changes in the proportions of specific taxa in any environment (Halfvarson ; Patin ). However, to convert raw amplicon sequencing data into interpretable tables and figures, it is necessary to remove errors introduced by sequencing and library preparation, isolate informative sequences, cluster and count similar sequences and finally identify sequences and determine their taxonomic affiliation.
There are many different pipelines, tools and approaches to execute the steps required to profile microbial communities based on PCR amplicons (Amir ; Callahan ; Caporaso ; Edgar, 2010), although most of them are exclusive to Unix-based operating systems (Linux and macOS; Altschul ). Windows-based operating systems are used by ca. 85% of computer users worldwide (Statcounter. Global stat, http://gs.statcounter.com/os-market-share/desktop/worldwide, 2017) due to their more intuitive and user-friendly interface. Unix systems might be more versatile and more resourceful than Windows-based systems (Mangul ), but a platform like SEED 2 gives Windows users who are not familiar with Unix systems access to sequence-processing technology. Moreover, offering scientists easy-to-use tools not only saves them time but also allows them to spend more energy interpreting their data and planning experiments.
There have been several efforts to facilitate access to bioinformatic tools for users without experience with command-line approaches by offering a GUI (graphical user interface), for instance mcaGUI (Copeland ), Clovr-ITS (White ), BMP desktop (Pylro ) and PipeCraft (Anslan ). Nevertheless, none of those tools supplies users with sequence visualization functions, and none is available for Windows users.
Sequence viewing and editing tools such as Seqotron (Fourment and Holmes, 2016), UGENE (Okonechnikov ) and Jalview (Waterhouse ) offer a GUI approach, but are focused on sequence alignment, secondary structure visualization and phylogenetic reconstruction. Furthermore, none of these GUI-based tools supports sequence clustering, taxonomy assignment and OTU table construction, functions which are fundamental in microbial community analyses.
Here we present SEED 2, an intuitive graphical user interface for batch processing of fasta and fastq files, designed specifically for amplicon sequencing studies. It further facilitates clustering, quality filtering/trimming, taxonomic identification, the creation and description of molecular taxa and their phylogenetic placement, and quick assessment of basic microbial community statistics.
2 Materials and methods
SEED 2 works through a graphical interface (Supplementary Fig. S1) to process data from Illumina, Ion Torrent and Sanger sequencing. It accepts fasta, fastq and text formats as input files. For Illumina-generated data, users can join paired-end reads through the graphical interface and use them as input. Upon selecting an input file, SEED 2 loads this file into the computer’s memory, at which point sequences can be visualized and edited. Through the sequence editor/visualizer, it is possible to remove or trim low-quality sequences, search for specific sequence domains such as sequencing barcodes or primer sequences, including degenerate oligonucleotides, and to group and label sequences containing specified domains in a process known as demultiplexing. This process is typically used in metabarcoding studies when multiple samples, individually labelled with artificial barcodes, are sequenced together (Caporaso ). To reduce computational time, it is possible to dereplicate the sequences after filtering and labelling and work with a concise set of unique sequences, saving a mapping file of the dereplication step. To perform clustering of similar sequences into molecular taxa or OTUs (Operational Taxonomic Units), SEED 2 offers the use of two external algorithms implemented in Vsearch (Rognes ) and Usearch (Edgar, 2010). These two software tools perform open-reference clustering, ranking sequences by their abundances, using the most abundant ones as the clustering starting points, and subsequently grouping sequences by an arbitrary (user-defined) level of identity. Chimera removal is also possible through these two software tools. After clustering, it is possible to create OTU tables with counts across samples, to filter out singletons or OTUs at any minimum/maximum abundance threshold and to assign taxonomy to the OTUs using the alignment software BLAST (Altschul ), which allows the retrieval of any number of best hits for each query.
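The demultiplexing and dereplication steps described above can be sketched as follows. This is a minimal Python illustration, not SEED 2's Object Pascal implementation. The function names, the assumed read layout (barcode, then primer, then amplicon) and the exact-match barcode rule are all assumptions made for the example. Degenerate primer positions are expanded using the standard IUPAC nucleotide codes.

```python
import re
from collections import defaultdict

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def demultiplex(reads, barcodes, primer):
    """Group reads by sample using a leading barcode followed by the primer.

    reads:    dict of read id -> sequence
    barcodes: dict of barcode -> sample name (hypothetical layout)
    Returns a dict of sample name -> list of (read id, trimmed sequence).
    Reads without a known barcode or without the primer are discarded.
    """
    pattern = primer_regex(primer)
    samples = defaultdict(list)
    for read_id, seq in reads.items():
        for barcode, sample in barcodes.items():
            if not seq.startswith(barcode):
                continue
            # The primer must start right after the barcode.
            match = pattern.match(seq, len(barcode))
            if match:
                # Keep only the amplicon after barcode and primer.
                samples[sample].append((read_id, seq[match.end():]))
            break
    return dict(samples)


def dereplicate(records):
    """Collapse identical sequences and keep a mapping to the original reads.

    records: list of (read id, sequence)
    Returns a list of (unique sequence, [read ids]), most abundant first,
    which is the order abundance-ranked clustering starts from.
    """
    mapping = defaultdict(list)
    for read_id, seq in records:
        mapping[seq].append(read_id)
    return sorted(mapping.items(), key=lambda item: len(item[1]), reverse=True)
```

Calling `demultiplex` on a dictionary of reads and then `dereplicate` on each sample's records yields the unique sequences together with the read identifiers they stand for, which plays the role of the mapping file mentioned above.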
BLAST searches can be performed against a database of the user's choice on their own computer or remotely through the NCBI API (internet connection required). SEED 2 even makes it possible to create a user-tailored searchable database from a loaded fasta file. To generate a custom BLAST database, SEED 2 uses the makeblastdb command, where the input is a fasta file and the output is the BLAST database itself. With this taxonomic information, nonspecific sequences can be removed, diversity indices and rarefaction curves can be created and experimental data can be further explored. SEED 2 is one of the very few amplicon-processing tools that can create and visualize phylogenetic trees and that allow them to be edited, manipulated and exported as Newick trees. Finally, SEED 2 exports all tables and summaries of the sequence data as ‘txt’ or tab-delimited files, or they can be copied directly to the clipboard. All command-line software used within SEED 2 is accessed through the graphical interface to ensure maximal usability.
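For readers who want to reproduce the custom-database step outside the graphical interface, a minimal Python sketch of a makeblastdb call is given below. The options SEED 2 actually passes are listed in the supplementary material and are not reproduced here. The sketch uses only the basic BLAST+ options `-in`, `-dbtype` and `-out`, and the helper names and file names are hypothetical.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Build the argument list for a basic makeblastdb call.

    fasta_path: input fasta file (hypothetical name supplied by the caller)
    db_name:    base name of the BLAST database files to be written
    dbtype:     "nucl" for nucleotide markers, "prot" for protein sequences
    """
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def build_database(fasta_path, db_name):
    """Run makeblastdb, assuming BLAST+ is installed and on the PATH."""
    if shutil.which("makeblastdb") is None:
        raise FileNotFoundError("makeblastdb was not found on the PATH")
    # check=True raises CalledProcessError if makeblastdb exits with an error.
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact call before running it.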
3 Implementation
SEED 2 was written in Object Pascal and is available for all 64-bit Windows platforms from Windows 7 onward. The software implements functions to find sequence domains, group sequences and edit sequence labels. Data are loaded into the computer’s RAM, which allows users to apply quality filters and to trim and manipulate sequences in batch. SEED 2 makes use of ‘hash table’ structures (called Dictionary in the Pascal language), which are shared by all compiled binaries used while SEED 2 is running. Data are cached in RAM, where they remain for all processing steps, whereas in software built on scripting-language pipelines, data are removed from RAM and reloaded at every processing step. As a result, SEED 2 is not limited to a recommended pipeline but is rather a whole platform for data processing that is faster and more memory efficient than scripting-based pipelines. This comes at a cost: the maximum number of sequences that can be processed is limited by the RAM of the user’s computer. However, a standard laptop computer with 8 GB of RAM can handle at least 8 million Illumina sequencing reads comprising tens of thousands of OTUs, which amounts to ca. 4 GB of data. All steps taken during data processing are stored in a workflow manager to automate functions and improve reproducibility.
As a benchmark, on a Windows 8.1 computer with a 4-core i7-6700 3.4 GHz processor and 16 GB of RAM, we processed 100 000 16S amplicon Illumina PE 300 bp sequences in 157 min. For the ITS amplicon marker, we processed 100 000 Illumina PE 300 bp sequences in 79 min. All the steps performed, including the time consumed for these analyses, and a list of external software and default commands used for each function are reported in the Supplementary Doc1-benchmark and Doc2-list_of_commands, respectively. Moreover, SEED 2 requires 185 MB of HD space.
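The figures quoted above can be put in per-read terms with some simple arithmetic. The short Python sketch below is only a back-of-the-envelope calculation on the numbers stated in the text. It assumes that "4 GB" means 4 × 10^9 bytes and that the benchmark times cover the whole processing workflow. The variable names are illustrative.

```python
# Capacity figure from the text: ca. 8 million reads amount to ca. 4 GB.
reads_in_memory = 8_000_000
data_bytes = 4e9  # assumption: decimal gigabytes
bytes_per_read = data_bytes / reads_in_memory

# Benchmark figures from the text: 100 000 sequences per marker.
benchmark_reads = 100_000
minutes_16s = 157
minutes_its = 79
reads_per_min_16s = benchmark_reads / minutes_16s
reads_per_min_its = benchmark_reads / minutes_its

print(f"Approximate data per read: {bytes_per_read:.0f} bytes")
print(f"16S throughput: {reads_per_min_16s:.0f} reads/min")
print(f"ITS throughput: {reads_per_min_its:.0f} reads/min")
```

Under these assumptions, each read accounts for roughly 500 bytes of data, and the benchmark corresponds to roughly 640 reads per minute for 16S and roughly 1270 reads per minute for ITS.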
4 Conclusions
SEED 2 is a fast, intuitive and memory efficient sequence-processing tool. It is applicable to any study using fasta or fastq data from all current high-throughput sequencing platforms. The graphical interface supplies users with tools necessary to quickly analyze meta-taxonomic data.