SEED 2: a user-friendly platform for amplicon high-throughput sequencing data analyses.

Tomáš Vetrovský¹, Petr Baldrian¹, Daniel Morais¹.

Abstract

Motivation: Modern molecular methods have increased our ability to describe microbial communities. Along with the advances brought by new sequencing technologies, we now require intensive computational resources to make sense of the large numbers of sequences continuously produced. The software developed by the scientific community to address this demand, although very useful, requires experience with the command-line environment and extensive training, and has a steep learning curve, all of which limit its use. We created SEED 2, a graphical user interface for handling high-throughput amplicon-sequencing data under Windows operating systems.
Results: SEED 2 is the only sequence visualizer that empowers users with tools to handle amplicon-sequencing data of microbial community markers. It is suitable for any marker gene sequences obtained through Illumina, IonTorrent or Sanger sequencing. SEED 2 allows the user to process raw sequencing data, identify specific taxa, produce OTU tables, create sequence alignments and construct phylogenetic trees. Standard dual-core laptops with 8 GB of RAM can handle ca. 8 million Illumina PE 300 bp sequences, ca. 4 GB of data.
Availability and implementation: SEED 2 was implemented in Object Pascal and uses internal functions and external software for amplicon data processing. SEED 2 is freeware, available at http://www.biomed.cas.cz/mbu/lbwrf/seed/ as a self-contained file that includes all the dependencies and does not require installation. Supplementary data contain a comprehensive list of supported functions.
Supplementary information: Supplementary data are available at Bioinformatics online.


Year: 2018 PMID: 29452334 PMCID: PMC6022770 DOI: 10.1093/bioinformatics/bty071

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Background

High-throughput sequencing technologies have significantly increased our capability to describe microbial communities. These modern molecular techniques can detect the fine composition of communities or subtle changes in the proportions of specific taxa in any environment (Halfvarson et al.; Patin et al.). However, to convert raw amplicon sequencing data into interpretable tables and figures, it is necessary to remove errors introduced by sequencing and library preparation, isolate informative sequences, cluster and count similar sequences and finally identify sequences and determine their taxonomic affiliation. There are many different pipelines, tools and approaches to execute the steps required to profile microbial communities based on PCR amplicons (Amir et al.; Callahan et al.; Caporaso et al.; Edgar, 2010), although most of them are exclusive to Unix-based operating systems (Linux and macOS; Altschul et al.). Windows-based operating systems are used by ca. 85% of computer users worldwide (Statcounter. Global stat, http://gs.statcounter.com/os-market-share/desktop/worldwide, 2017) due to their more intuitive and user-friendly interface. Unix systems might be more versatile and more resourceful than Windows-based systems (Mangul et al.), but a platform like SEED 2 gives Windows users who are not familiar with Unix systems access to sequence-processing technology. Moreover, offering scientists easy-to-use tools not only saves them time but also allows them to spend more energy interpreting their data and planning experiments. There have been several efforts to facilitate access to bioinformatic tools for users without command-line experience by offering a GUI (graphical user interface), for instance mcaGUI (Copeland et al.), Clovr-ITS (White et al.), BMP desktop (Pylro et al.) and PipeCraft (Anslan et al.). Nevertheless, none of those tools supplies users with sequence visualization functions, and none is available for Windows.
Sequence viewing and editing tools such as Seqotron (Fourment and Holmes, 2016), UGENE (Okonechnikov et al.) and Jalview (Waterhouse et al.) offer a GUI approach, but are focused on sequence alignment, secondary structure visualization and phylogenetic reconstruction. Furthermore, none of these GUI-based tools supports sequence clustering, taxonomy assignment and OTU table construction, functions which are fundamental in microbial community analyses. Here we present SEED 2, an intuitive graphical user interface for batch processing of fasta and fastq files, designed specifically for amplicon sequencing studies. It further facilitates clustering, quality filtering/trimming, taxonomic identification, the creation and description of molecular taxa and their phylogenetic placement, and quick assessment of basic microbial community statistics.

2 Materials and methods

SEED 2 works through a graphical interface (Supplementary Fig. S1) to process data from Illumina, Ion Torrent and Sanger sequencing. It accepts fasta, fastq and text formats as input files. For Illumina-generated data, users can join paired-end reads through the graphical interface and use them as input. Upon selecting an input file, SEED 2 loads this file into the computer’s memory, at which point sequences can be visualized and edited. Through the sequence editor/visualizer, it is possible to remove or trim low-quality sequences, to search for specific sequence domains such as sequencing barcodes or primer sequences, including degenerate oligonucleotides, and to group and label sequences containing specified domains in a process known as demultiplexing. This process is typically used in metabarcoding studies when multiple samples, individually labelled with artificial barcodes, are sequenced together (Caporaso et al.). To reduce computational time, it is possible to dereplicate the sequences after filtering and labelling and to work with a concise set of unique sequences, saving a mapping file of the dereplication step. To cluster similar sequences into molecular taxa or OTUs (Operational Taxonomic Units), SEED 2 offers two external algorithms, implemented in Vsearch (Rognes et al.) and Usearch (Edgar, 2010). These two software tools perform open-reference clustering: they rank sequences by abundance, use the most abundant ones as clustering starting points, and subsequently group sequences at an arbitrary (user-defined) level of identity. Chimera removal is also possible through these two software tools. After clustering, it is possible to create OTU tables with counts across samples, to filter out singletons or OTUs at any minimum/maximum abundance threshold and to assign taxonomy to the OTUs using the alignment software BLAST (Altschul et al.), which allows the retrieval of any number of best hits for each query.
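The dereplication, abundance-ranked clustering and OTU-table steps described above can be illustrated with a minimal Python sketch. This is not the code of SEED 2, Vsearch or Usearch: the function names, the read-label format (`sample|read`) and the `identity` function are assumptions made for illustration, and the identity measure is a toy position-by-position comparison rather than the alignment-based identity those tools compute.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical sequences; return (unique sequences ranked by abundance, mapping)."""
    mapping = {}  # sequence -> labels of all reads carrying that sequence
    for label, seq in reads:
        mapping.setdefault(seq, []).append(label)
    # Most abundant first; ties are broken alphabetically for reproducibility.
    ranked = sorted(mapping, key=lambda s: (-len(mapping[s]), s))
    return ranked, mapping


def identity(a, b):
    """Toy identity (assumption): matching positions divided by the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(ranked, threshold=0.97):
    """Join each sequence to the first centroid it matches, else make it a new centroid."""
    clusters = {}  # centroid sequence -> member sequences
    for seq in ranked:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def otu_table(clusters, mapping, sample_of):
    """Count reads per OTU and per sample; sample_of maps a read label to a sample name."""
    table = {}
    for i, members in enumerate(clusters.values(), start=1):
        counts = Counter(sample_of(label) for seq in members for label in mapping[seq])
        table[f"OTU{i}"] = dict(counts)
    return table


# Hypothetical reads labelled "sample|read".
reads = [
    ("s1|r1", "ACGTACGTAC"),
    ("s1|r2", "ACGTACGTAC"),
    ("s2|r3", "ACGTACGTAC"),
    ("s2|r4", "ACGTACGTAT"),  # one mismatch to the most abundant sequence
    ("s2|r5", "TTTTGGGGCC"),
]
ranked, mapping = dereplicate(reads)
clusters = greedy_cluster(ranked, threshold=0.9)
table = otu_table(clusters, mapping, sample_of=lambda label: label.split("|")[0])
print(table)
```

The `mapping` dictionary plays the role of the dereplication mapping file: it lets the counts of the original reads be restored per sample after clustering the unique sequences only.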
BLAST searches can be performed against the user's preferred database on their own computer or remotely through the NCBI API (internet connection required). SEED 2 also makes it possible to create a user-tailored searchable database from a loaded fasta file. To generate a custom BLAST database, SEED 2 uses the makeblastdb command, where the input is a fasta file and the output is the BLAST database itself. With this taxonomic information, nonspecific sequences can be removed, diversity indexes and rarefaction curves can be created and experimental data can be further explored. SEED 2 is one of the very few amplicon-processing tools that can create and visualize phylogenetic trees and allows them to be edited, manipulated and exported as Newick trees. Finally, SEED 2 exports all tables and summaries of the sequence data as ‘txt’ or ‘tab’-delimited files, or they can be copied directly to the clipboard. All command-line software used within SEED 2 is accessed through the graphical interface to ensure maximal usability.
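For readers who want to see what the custom-database step looks like outside the GUI, the following Python sketch builds a makeblastdb call. The options `-in`, `-dbtype` and `-out` belong to NCBI BLAST+ makeblastdb, but the exact arguments SEED 2 passes are listed in its supplementary material and are not reproduced here; the file names and helper functions are hypothetical.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Build the argument list for NCBI BLAST+ makeblastdb without running it."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", str(fasta_path), "-dbtype", dbtype, "-out", str(db_name)]


def build_database(fasta_path, db_name, dbtype="nucl"):
    """Run makeblastdb if it is on the PATH; return False when it is not installed."""
    if shutil.which("makeblastdb") is None:
        return False
    subprocess.run(makeblastdb_command(fasta_path, db_name, dbtype), check=True)
    return True


# Hypothetical input file and database name.
cmd = makeblastdb_command("my_sequences.fasta", "my_db")
print(" ".join(cmd))
```

Keeping the command construction separate from its execution makes the call easy to inspect or log before any files are written.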

3 Implementation

SEED 2 was written in Object Pascal and is available for all 64-bit Windows platforms from Windows 7 onward. The software implements functions to find sequence domains, group sequences and edit sequence labels. Data is loaded into the computer’s RAM, which allows users to apply quality filters and to trim and manipulate sequences in batch. SEED 2 makes use of ‘hash table’ structures (called Dictionary in the Pascal language), which are shared by all compiled binaries used while SEED 2 is running. Data is cached in RAM, where it remains for all the processing steps, whereas in software built on scripting-language pipelines, data is erased from and reloaded into RAM at every processing step. As a result, SEED 2 is not limited to a recommended pipeline; it is a whole platform for data processing that is faster and more memory efficient than scripting-based pipelines. The cost is that the maximum number of sequences that can be processed is limited by the RAM of the user’s computer. However, a standard laptop computer with 8 GB of RAM can handle at least 8 million Illumina sequencing reads comprising tens of thousands of OTUs, which amounts to ca. 4 GB of data. All steps taken during data processing are stored in a workflow manager, which allows functions to be automated and improves reproducibility. As a benchmark, on a Windows 8.1 computer with a 4-core i7-6700 3.4 GHz processor and 16 GB of RAM, we processed 100 000 16S amplicon Illumina PE 300 bp sequences in 157 min. For the ITS amplicon marker, we processed 100 000 Illumina PE 300 bp sequences in 79 min. All the steps performed, including the time consumed by these analyses, and a list of external software and default commands used for each function are reported in the Supplementary Doc1-benchmark and Doc2-list_of_commands, respectively. SEED 2 requires 185 MB of hard-disk space.
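As an illustration of finding sequence domains and grouping sequences in an in-memory hash table, the Python sketch below demultiplexes reads by barcode and a degenerate primer. It is a sketch under stated assumptions, not SEED 2's Object Pascal implementation: a Python dict stands in for the Pascal Dictionary, the barcodes, primer and reads are invented, and only exact barcode matches at the start of a read are considered.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def demultiplex(reads, barcodes, primer):
    """Group reads by leading barcode, keeping only reads where the primer follows it."""
    pattern = primer_regex(primer)
    groups = {sample: [] for sample in barcodes.values()}  # hash table: sample -> reads
    for label, seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode) and pattern.match(seq, len(barcode)):
                # Relabel the read with its sample name and strip the barcode.
                groups[sample].append((f"{sample}|{label}", seq[len(barcode):]))
                break
    return groups


# Hypothetical barcodes, primer and reads.
barcodes = {"ACGT": "sampleA", "TTAG": "sampleB"}
reads = [
    ("r1", "ACGTGTGCCAGAAAA"),  # barcode ACGT, primer matches with Y = C
    ("r2", "TTAGGTGTCAGCCCC"),  # barcode TTAG, primer matches with Y = T
    ("r3", "ACGTGTGACAGAAAA"),  # primer mismatch at the degenerate position, dropped
]
groups = demultiplex(reads, barcodes, primer="GTGYCAG")
print(groups)
```

Because all groups stay in one dictionary in memory, later steps such as trimming or relabelling can operate on them directly without writing and re-reading intermediate files, which is the trade-off between speed and RAM described above.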

4 Conclusions

SEED 2 is a fast, intuitive and memory efficient sequence-processing tool. It is applicable to any study using fasta or fastq data from all current high-throughput sequencing platforms. The graphical interface supplies users with tools necessary to quickly analyze meta-taxonomic data.

References

1. mcaGUI: microbial community analysis R-Graphical User Interface (GUI).

Authors: Wade K Copeland; Vandhana Krishnan; Daniel Beck; Matt Settles; James A Foster; Kyu-Chul Cho; Mitch Day; Roxana Hickey; Ursel M E Schütte; Xia Zhou; Christopher J Williams; Larry J Forney; Zaid Abdo
Journal: Bioinformatics Date: 2012-06-12 Impact factor: 6.937

2. PipeCraft: Flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data.

Authors: Sten Anslan; Mohammad Bahram; Indrek Hiiesalu; Leho Tedersoo
Journal: Mol Ecol Resour Date: 2017-06-21 Impact factor: 7.090

3. Unipro UGENE: a unified bioinformatics toolkit.

Authors: Konstantin Okonechnikov; Olga Golosova; Mikhail Fursov
Journal: Bioinformatics Date: 2012-02-24 Impact factor: 6.937

4. Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Authors: Andrew M Waterhouse; James B Procter; David M A Martin; Michèle Clamp; Geoffrey J Barton
Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937

5. The anatomy of successful computational biology software.

Authors: Stephen Altschul; Barry Demchak; Richard Durbin; Robert Gentleman; Martin Krzywinski; Heng Li; Anton Nekrutenko; James Robinson; Wayne Rasband; James Taylor; Cole Trapnell
Journal: Nat Biotechnol Date: 2013-10 Impact factor: 54.908

6. DADA2: High-resolution sample inference from Illumina amplicon data.

Authors: Benjamin J Callahan; Paul J McMurdie; Michael J Rosen; Andrew W Han; Amy Jo A Johnson; Susan P Holmes
Journal: Nat Methods Date: 2016-05-23 Impact factor: 28.547

7. Addressing the Digital Divide in Contemporary Biology: Lessons from Teaching UNIX.

Authors: Serghei Mangul; Lana S Martin; Alexander Hoffmann; Matteo Pellegrini; Eleazar Eskin
Journal: Trends Biotechnol Date: 2017-07-15 Impact factor: 21.942

8. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns.

Authors: Amnon Amir; Daniel McDonald; Jose A Navas-Molina; Evguenia Kopylova; James T Morton; Zhenjiang Zech Xu; Eric P Kightley; Luke R Thompson; Embriette R Hyde; Antonio Gonzalez; Rob Knight
Journal: mSystems Date: 2017-03-07 Impact factor: 6.496

9. Dynamics of the human gut microbiome in inflammatory bowel disease.

Authors: Jonas Halfvarson; Colin J Brislawn; Regina Lamendella; Yoshiki Vázquez-Baeza; William A Walters; Lisa M Bramer; Mauro D'Amato; Ferdinando Bonfiglio; Daniel McDonald; Antonio Gonzalez; Erin E McClure; Mitchell F Dunklebarger; Rob Knight; Janet K Jansson
Journal: Nat Microbiol Date: 2017-02-13 Impact factor: 17.745

10. Seqotron: a user-friendly sequence editor for Mac OS X.

Authors: Mathieu Fourment; Edward C Holmes
Journal: BMC Res Notes Date: 2016-02-17
