
Computational methods for 16S metabarcoding studies using Nanopore sequencing data.

Andres Santos^1,2,3, Ronny van Aerle^3, Leticia Barrientos^1,2,3, Jaime Martinez-Urtaza^3.

Abstract

Assessment of bacterial diversity through sequencing of 16S ribosomal RNA (16S rRNA) genes has been widely used in environmental microbiology, particularly since the advent of high-throughput sequencing technologies. These technologies also created the need for new strategies to manage and investigate the massive amount of sequencing data generated. This situation stimulated the rapid expansion of the field of bioinformatics, with the release of new tools for the downstream analysis and interpretation of sequencing data, mainly generated using Illumina technology. In recent years, a third generation of sequencing technologies has been developed and applied in parallel with, and as a complement to, the former sequencing strategies. In particular, Oxford Nanopore Technologies (ONT) introduced nanopore sequencing, which has become very popular among molecular ecologists. Nanopore technology offers a low price, portability and fast sequencing throughput. This powerful technology has recently been tested for 16S rRNA analyses, showing promising results. However, compared with previous technologies, there is a scarcity of bioinformatic tools and protocols designed specifically for the analysis of Nanopore 16S sequences. Due to its notable characteristics, researchers have recently started assessing the suitability of MinION for 16S rRNA sequencing studies, and have obtained remarkable results. Here we present a review of the state of the art of MinION technology applied to microbiome studies, its current possible applications, and the main challenges for its use in 16S rRNA metabarcoding.


Keywords: Microbial diversity; MinION; Third generation sequencing

Year: 2020 PMID: 32071706 PMCID: PMC7013242 DOI: 10.1016/j.csbj.2020.01.005

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Functionality, interaction, and dynamics of microbial communities are considered critical to the existence of ecological balance and life [1], [2]. The fact that less than 1% of microorganisms are cultivable under laboratory conditions [3] has historically constrained attempts to gauge the true dimension of the microbial world and to study microbial diversity within a taxonomic context. Since the foundations of molecular phylogeny were established in the 1960s and 70s, the 16S rRNA gene has been universally used for taxonomic studies of prokaryotic species [4], [5]. 16S rRNA is part of the small ribosomal subunit (SSU) present in all prokaryotic cells, and the gene encoding this molecule possesses some distinctive characteristics that make it suitable for taxonomic profiling: 1) it is ubiquitous, being found in all bacteria and archaea [6]; 2) it has a relatively small size (~1500 bp) and a high degree of functional conservation [5]; 3) it contains variable regions that result from diverse rates of evolution among species and can be used to distinguish different bacterial groups [7], [8]; and 4) it contains highly conserved regions, which can be used to design universal primers flanking the different hypervariable regions (nine in total, V1-V9) identified in the gene [9]. On the other hand, the use of 16S rRNA for bacterial identification has some limitations, including the variable number of copies of this gene in bacterial genomes, the low taxonomic resolution at the species level for some bacterial groups, and the bias in the taxonomic assignment of sequences depending on the variable region chosen for the analysis [10]. Until the late 1990s, the 16S rRNA gene was applied in a taxonomic context only to define species based on individual bacteria obtained from pure (mostly clinical) cultures [6], [11]. However, in 1997, Pace et al. [12] described for the first time the composition of microbial communities without the need for cultivation in the laboratory, using Sanger sequencing of the 16S rRNA gene. This work led to the establishment of a universal approach to the study of microbial communities. Nowadays, sequence analysis of 16S rRNA continues to be the gold standard for studying microbial diversity, enabling accurate taxonomic profiling of the prokaryotic groups present in both clinical and environmental samples [11], [12]. The introduction of Sanger sequencing technology in the investigation of microbial communities signified a revolution in the world of microbial ecology and entirely changed how microbial diversity was assessed. However, this approach required the analysis of individual sequences, implying that a cloning step was needed as a crucial prerequisite for the investigation of samples (Fig. 1a). As a result, sequences of up to ~1000 bp can be generated. However, the number of sequences that can be analyzed is limited by the output of Sanger platforms (Table 1). Therefore, a complete evaluation of bacterial diversity using Sanger sequencing became a serious challenge in terms of time and costs.

Fig. 1

Table 1

Comparison of the available sequencing platforms for 16S metagenomic analysis using metabarcoding approach.

Sequencing Platform	Read Length (bp)	Accuracy	Output	Sequencing Chemistry	Run Time	Advantages in Metabarcoding approaches
Sanger	400–900	99.999%	1.9–84 Kb	Dideoxy chain termination	20 min–3 h	Long read length, high quality
Illumina MiSeq	75–300	99.9%	13.2–20 Gb	Sequencing by synthesis	21–56 h	High throughput, read quality
MinION	>200,000	~95%	~50 Gb	Single-molecule real-time sequencing, long reads	1–48 h	High throughput, long read length, portability
PacBio	10,000–15,000	99.999%	5–10 Gb	Single-molecule real-time sequencing, long reads	4 h	Long read length and quality

Most common metabarcoding sequencing strategies for each sequencing technology generation. (a) First generation sequencing (Sanger). Under this approach, metabarcoding is classically performed by amplifying full-length 16S rRNA genes from an environmental DNA sample; the 16S amplicons are then cloned, that is, inserted into a vector and transformed into a host; finally, plasmids are extracted and purified, and the 16S rRNA inserts are sequenced by the Sanger method. (b) Second generation sequencing (Illumina). Specific regions of the 16S rRNA gene are PCR-amplified from environmental DNA samples; depending on the scope of the study, one or two regions of the 16S gene can be amplified, with V1-V2 and V3-V4 being the most frequently used; a paired-end library (the mix of DNA fragments with adapters attached to their ends and ready to be sequenced) is usually prepared for these regions; adapters (exogenous nucleic acids ligated to the nucleic acid molecule to be sequenced) and indexes (unique DNA sequences ligated to fragments within a sequencing library, which allow the different samples sequenced in the same run to be sorted and identified afterwards) are added to the ends of the 16S amplicons, and libraries of ~300 bp in length are finally sequenced on the Illumina MiSeq platform. (c) Third generation sequencing (Nanopore). 
This recently developed approach starts with the amplification of the full-length 16S rRNA gene from environmental DNA using universal primers; indexes for multiplexing are added to the amplicons in the same PCR reaction; once the amplicons have been purified, the library is prepared by attaching a protein to a specific tagged region of the 16S amplicons (10 min for library preparation); finally, the samples are sequenced directly on the MinION sequencer.

Globally, the advent of high-throughput sequencing, or Second Generation Sequencing (SGS) technologies, and its rapid and widespread application across laboratories in the early 2000s represented a paradigm shift in microbial ecology. The characteristic high output and data accuracy provided by these new technologies, along with the removal of tedious and time-consuming steps required for Sanger sequencing, such as the cloning of DNA fragments and the electrophoretic separation of sequencing products, made it possible to generate massive amounts of sequencing data in short runs. Among the different companies pioneering high-throughput sequencing, Illumina has achieved a leading position in the market, becoming the standard sequencing technology and the most frequently applied in microbial ecology studies [13], [14]. The common features of the sequences generated by this technology are the reduced length (from 50 bp to 300 bp), high throughput (from 2 Gb to 750 Gb), high accuracy, and reduced cost (starting from ~$40 USD per Gb) [15] (Table 1). Nevertheless, because Illumina reads are much shorter than Sanger reads, full-length sequences of the 16S rRNA gene are not achievable using Illumina sequencing alone. To overcome this limitation, 16S gene analysis with Illumina has typically been restricted to specific variable regions of the 16S rRNA gene, instead of the complete gene (Fig. 1b). However, the remarkable characteristics of Illumina sequencing in terms of output, accuracy and speed have made this technology central to almost all of the most prominent studies based on 16S analysis carried out to date, including the Human Microbiome Project [16], the Earth Microbiome Project [17] and the Extreme Microbiome Project [18].

Current analytical approaches applied in 16S metagenomic studies

High-throughput sequencing technologies also created the need for new strategies to manage and investigate the massive amount of sequencing data generated. From the user perspective, this change involved a transition from basic computer programs accessible to general users on standard computers to sophisticated computational analyses requiring advanced bioinformatic skills. This situation stimulated the rapid expansion of the field of bioinformatics applied to microbial ecology studies, mainly through the release of new tools for the downstream analysis and interpretation of sequencing data. Nowadays, a large number of powerful tools are available that enable efficient integration of different types of data [15], [16], [17]. Within this context, several bioinformatics programs and tools for processing amplicon sequencing data are presently available, most of them designed to work with the V3 and V4 variable regions of the 16S rRNA gene. The most popular packages for 16S amplicon analysis are QIIME [20], MOTHUR [21] and Phyloseq [22]. For 16S metagenomic studies in particular, standard analysis packages and pipelines typically comprise demultiplexing and quality control steps, followed by the generation of Operational Taxonomic Units (OTU picking) and/or Amplicon Sequence Variant (ASV) analysis, which allows the taxonomic assignment of representative sequences and the diversity analysis of the sample (Fig. 2). Consequently, taxonomic assignment of sequences is a critical step and the most informative element for microbial diversity analyses.

Fig. 2

Classic pipelines MOTHUR [21] and QIIME2 [20] and their complete workflow for 16S rRNA amplicons analyses, the “common processes” flow contains all common steps in both pipelines.

A detailed pipeline of the most conventional workflows for 16S rRNA Illumina sequences is presented in Fig. 2. Despite the differences between the packages, the principal components of the workflows are analogous and share a common process, which includes quality control of sequences, clustering or ASV analysis, taxonomic assignment and diversity analysis (Fig. 3).

Fig. 3

Recommended MinION 16S rRNA amplicons pipeline for bacterial diversity analysis. [90], [91], [92]

Third generation of sequencing technologies

In recent years, a third generation of sequencing (TSG) technologies has been developed and used in parallel with, and as a complement to, the former sequencing strategies. These new technologies interrogate a single molecule of DNA in real time and produce very long reads (from 1 to 100 kb). In 2011, Pacific Biosciences introduced the first TSG technology, which was termed single-molecule real-time sequencing [19], [20]. Recent releases of new sequencers, in particular the Sequel, have improved the output by increasing read length and throughput per run by 10- and 100-fold, respectively. However, although this new platform is two-fold cheaper than the previous versions, it is still less cost-effective than Illumina, and therefore the applications of this platform to 16S metagenomic studies remain scarce. In addition, the error rate falls in the same range as that of the first PacBio version (~13%) [25] and the output is still lower than that of Illumina. Therefore, price and limited output have restricted the application of the PacBio system in microbial community studies [22], [23], [24] (Table 1). In 2014, Oxford Nanopore Technologies (ONT) introduced nanopore sequencing [28]. Nanopore sequencing was developed at the end of the 1980s [29], although the first successful use of this sequencing technology was reported in 2012 [30]. This technology detects the nucleotides directly, without active DNA synthesis, as a long stretch of single-stranded DNA passes through a protein nanopore that is stabilized in an electrically resistant polymer membrane [25], [26], [27]. Specifically, a voltage is set across this membrane, and sensors detect in real time the changes in ionic current caused by the nucleotides occupying the pore while the DNA molecule passes through. 
Applying this technology, ONT released the MinION platform in 2014, with some remarkable advantages such as low price, portability, and fast sequencing chemistry [33]. MinION is essentially a dock that holds a flowcell, which is responsible for the direct sequencing of individual DNA strands as they translocate through the nanoscale pores in the semiconductor membrane [34]. The most remarkable characteristics of the MinION Nanopore sequencer are the length of the sequences generated by the flowcell and the amount of data that can be produced per run. Moreover, MinION is a miniaturized sequencing device and the smallest available on the market today, with dimensions of 10 × 3 × 2 cm and a weight of 87 g. One particular feature is that the sequencing process does not rely on a secondary signal such as light or pH, in contrast to platforms such as Illumina and PacBio [35]. According to the manufacturer, the most recent chemistry used in the R9.4.5 version of the flowcell provides an accuracy of ~95% with an output of ~20 Gb. However, the quality of the reads generated by the R9.4.5 flowcell is still lower than that of Illumina reads, which have an accuracy of 99.9% (Table 1). A typical problem with Nanopore reads is the frequent presence of artificial insertions and deletions, which can hinder the correct analysis and interpretation of MinION data [32]. Another remarkable characteristic of ONT platforms is that data analysis can be performed from the beginning of the sequencing run, which could considerably reduce the time of analysis compared to Illumina platforms. In addition, the costs associated with MinION analyses are much lower than those of other sequencing platforms currently applied to 16S metagenomic studies (Table 1). All these characteristics make the MinION an accessible technology for many laboratories, which has led to a rapid expansion of its use across the scientific community. 
Within this context, a remarkable and original feature that ONT has developed is the “nanopore community,” which is part of the ONT website. This community provides a common space where users can get help and feedback on device performance, methodologies, and bioinformatic analysis. It is important to note that other ONT platforms with the same characteristics, such as GridION (100 Gb) and PromethION (6 Tb), can produce larger quantities of sequencing data than the MinION [30].
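To put the accuracy figures quoted in this section into perspective, the following minimal sketch converts a per-base accuracy into a Phred quality score, Q = -10 log10(1 - accuracy), and estimates the expected number of erroneous bases in a full-length 16S read. It assumes, purely for illustration, that errors are independent and uniformly distributed along the read, which real Nanopore reads do not strictly satisfy; the function names are hypothetical.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (between 0 and 1) to a Phred quality score."""
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(accuracy: float, read_length: int) -> float:
    """Expected number of erroneous bases, assuming independent, uniform errors."""
    return (1.0 - accuracy) * read_length


# Approximate length of the full 16S rRNA gene, as given in the Introduction.
GENE_LENGTH = 1500

for platform, accuracy in [("MinION", 0.95), ("Illumina MiSeq", 0.999)]:
    q = accuracy_to_phred(accuracy)
    errors = expected_errors(accuracy, GENE_LENGTH)
    print(f"{platform}: Q{q:.1f}, ~{errors:.1f} expected errors per {GENE_LENGTH} bp")
```

Under these assumptions, ~95% accuracy corresponds to roughly Q13 and about 75 erroneous bases per full-length read, whereas 99.9% corresponds to Q30 and about 1.5 erroneous bases over the same length.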

The potential of the Nanopore sequencing for 16S rRNA studies

Nanopore sequencing brings to 16S rRNA metabarcoding studies the benefits of both first- and second-generation sequencing. ONT platforms generate long reads that cover the full-length sequence of the 16S rRNA gene (V1-V9 regions) through a fast, cheap, and high-throughput process. One of the most relevant advantages of full-length 16S rRNA sequences is that they offer a higher level of taxonomic and phylogenetic resolution for bacterial identification, since all the informative sites of the 16S rRNA gene are considered in the analysis [36]. With Illumina sequencing, the conventional strategy for sequencing the 16S rRNA gene uses the hypervariable regions V1-V2 and/or V3-V4 [37], and taxonomy is assigned based only on these short variable regions of ~300 bp. The analysis of these short regions provides limited taxonomic resolution in most cases, failing to reliably discriminate sequences beyond the genus level [31], [32]. Moreover, the choice of region directly affects the specificity of the taxonomic assignment. For example, the V4 region better represents the whole bacterial diversity in host-associated studies, while V1-V2 is more specific for skin microbiota studies. In addition, taxonomic resolution varies for different groups of bacteria when using different portions of the 16S rRNA gene [40]. By contrast, the resolution obtained with Nanopore sequencing is comparable only to that provided by Sanger 16S rRNA sequencing, with the potential for better discrimination among taxa, a deeper phylogenetic signal, and a more accurate taxonomic placement of 16S rRNA nanopore sequences [34], [31], [30]. Another advantage of ONT is that data can be generated in a short run time (1–48 h) and at an affordable price (~$50 USD per sample) (Table 1). 
As previously mentioned, MinION is one of the most popular ONT platforms today and has been used extensively in genomics and transcriptomics studies [35], [36], [37], [38], [39], [40]; over the last two years, its use in studies of microbial diversity has grown rapidly. However, despite the evident benefits of ONT technology for microbial ecology studies, several factors still limit the implementation of these new approaches in the routine analysis of microbial diversity. The scarcity of tools specifically designed to work with full sequences of the 16S gene has made it extremely challenging to carry out a specialized taxonomic analysis of Nanopore sequences. Moreover, the limited quality of Nanopore 16S sequences has seriously constrained the application of existing tools designed for other technologies (mostly Illumina) to these sequences.

Nanopore 16S metagenomic studies

Studies applying Nanopore sequencing to describe microbial diversity have conventionally followed an approach similar to that of previous, mostly Illumina-based, studies, regardless of the fact that Nanopore generates full-length 16S sequences. With Nanopore, the full-length 16S rRNA gene is amplified by PCR using universal primers (27F and 1493R). The library is prepared by adding adapters to the amplicon sequences, and samples are sequenced directly with a flowcell mounted on the MinION device (Fig. 1c). Some authors have tried to standardize a different 16S amplicon barcoding protocol based on two PCR steps, the first to amplify the 16S rRNA gene and the second to add the adapters for sequencing the 16S amplicons [48], [49]. Another strategy has been based on ONT 1D² chemistry library preparation, in which both strands of the target DNA are sequenced (similar to the paired-end sequencing of Illumina), improving the quality of the reads [50]. Although different strategies have been applied in published studies using Nanopore sequencing for 16S rRNA metabarcoding, the 16S Barcoding Kit of Oxford Nanopore Technologies has been predominantly used, with satisfactory results [41], [42], [43], [44]. As with sample preparation, the methodologies introduced to analyze Nanopore 16S amplicons have included a broad range of bioinformatic tools. Nevertheless, despite the different tools, the central process in all the published studies is a strategy based on taxonomic assignment [44], [43], [45], [47].

Taxonomic assignment using Nanopore 16S sequences

Compared with Illumina, there is a scarcity of bioinformatic tools and protocols designed specifically for the analysis of Nanopore 16S sequences. The most extensively used tool is the cloud-based data analysis service EPI2ME (ONT), which provides a number of workflows for end-to-end analysis of nanopore 16S data: 16S taxonomic classification, a barcoding protocol, and quality filtering of reads. For taxonomic assignment, FASTQ files are uploaded to the FASTQ 16S protocol of the EPI2ME platform, reads are filtered by quality, and taxonomy is then assigned using BLAST against the NCBI database, with a minimum horizontal coverage of 30% and a minimum accuracy of 77% as default parameters (ONT). However, this tool is not publicly available, and only ONT customers can access it through a web platform. Moreover, quality filters, adapter trimming, and alignment parameters such as identity and coverage of sequences are configured by default, and the user cannot modify anything beyond the initial read quality parameters. Furthermore, the format of the final output with the taxonomic assignment results is not compatible with other tools for downstream analyses such as diversity and differential taxonomic abundance. To overcome these limitations of the EPI2ME software, it is necessary to define a different analytical pipeline that considers other available bioinformatic tools. Cuscó et al. [48] applied a mapping approach for taxonomic assignment using the tool Minimap, and were able to determine the taxonomic composition at the genus and species level for bacterial isolates, mock communities, and complex skin samples. However, the study suggested the need for a more accurate bioinformatic protocol to achieve more reliable results. Another important result of this research is that taxonomic accuracy can be improved by analyzing sequences longer than the 16S rRNA gene, such as the rrn operon (16S rRNA-ITS-23S rRNA; 4500 bp). 
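Because the EPI2ME thresholds cannot be changed by the user, it may help to see how a comparable filter could be applied to locally generated BLAST results. The sketch below is not the EPI2ME implementation; it assumes BLAST+ standard tabular output (`-outfmt 6`, 12 columns), interprets "accuracy" as percent identity and "horizontal coverage" as the percentage of the read spanned by the alignment, and takes read lengths from a separately supplied dictionary. All names and the example records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    read_id: str
    subject_id: str
    identity: float  # percent identity (column 3 of outfmt 6)
    q_start: int     # alignment start on the read (column 7)
    q_end: int       # alignment end on the read (column 8)
    bitscore: float  # column 12


def parse_outfmt6(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse BLAST+ standard tabular output, skipping blank and comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]), float(f[11]))


def best_hits(hits: Iterable[Hit], read_lengths: Dict[str, int],
              min_identity: float = 77.0, min_coverage: float = 30.0) -> Dict[str, Hit]:
    """Keep, per read, the highest-bitscore hit that passes both thresholds."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        aligned = abs(hit.q_end - hit.q_start) + 1
        coverage = 100.0 * aligned / read_lengths[hit.read_id]
        if hit.identity < min_identity or coverage < min_coverage:
            continue
        if hit.read_id not in best or hit.bitscore > best[hit.read_id].bitscore:
            best[hit.read_id] = hit
    return best


# Hypothetical example records (tab-separated, 12 columns each).
example = [
    "read1\ttaxA\t92.5\t1400\t80\t25\t10\t1409\t1\t1420\t0.0\t1800",
    "read1\ttaxB\t85.0\t1350\t150\t50\t30\t1379\t5\t1370\t0.0\t1200",
    "read2\ttaxC\t70.0\t1300\t300\t90\t1\t1300\t1\t1350\t1e-50\t600",
    "read3\ttaxD\t95.0\t300\t10\t5\t1\t300\t1\t300\t1e-90\t450",
]
read_lengths = {"read1": 1500, "read2": 1480, "read3": 1500}

assigned = best_hits(parse_outfmt6(example), read_lengths)
for read_id, hit in assigned.items():
    print(read_id, hit.subject_id, hit.identity)
```

In this toy input, only `read1` receives an assignment: `read2` fails the identity threshold and `read3` fails the coverage threshold. Exposing both thresholds as parameters is precisely the flexibility that the text notes is missing from the hosted workflow.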
Using Minimap2 [54], Kai et al. [52] reported species-level bacterial identification with more than 90% of reads correctly assigned to each species. A subsequent study carried out by Hardegen et al. [49] used a BLAST-based classification and concluded that their pipeline is suitable for the taxonomic assignment of 16S rRNA reads from Nanopore sequencing. Edwards et al. [51] used VSEARCH [55] for taxonomic assignment and reached a confidence level of ~75% at the phylum and family levels. A different approach was taken by Ma et al. [50], who carried out taxonomic classification using the RDP classifier [56] and reported, for pure cultures, an average annotation accuracy of 93.8% and 82.0% at the phylum and genus levels, respectively. Mitsuhashi et al. [57] analyzed a mock community and pleural effusion from a patient with empyema using Centrifuge [58] and BLAST for taxonomic analysis, successfully identifying all the species present in the mock community with Centrifuge [58]. Turner et al. [53] described the microbiome of a new invasive nemertean species using Centrifuge [58] for taxonomic assignment, identifying 2054 species associated with the microbiome. Considering all of the aforementioned studies, Centrifuge [58] and Minimap2 [54] have been the most frequently used taxonomic classifiers for Nanopore datasets [50], [41], [44], [43], [45]. Regarding the characteristics of the two tools, Centrifuge [58] is capable of accurately identifying reads when using databases containing multiple highly similar reference genomes, such as different strains of a bacterial species. Centrifuge works by building a database of genomes in which unique segments of these genomes are identified to build an FM-index (a compressed data structure for full-text pattern searching). This FM-index can then be used for efficient searches of sequenced reads against the genome segments in the database. 
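The idea behind an FM-index can be illustrated with a deliberately naive sketch. It builds a suffix array and Burrows-Wheeler transform of a short reference and locates exact pattern matches by backward search. This is a teaching example under simplifying assumptions (uncompressed tables, quadratic-time suffix sorting, exact matching only) and not Centrifuge's actual implementation; the function names are hypothetical.

```python
from collections import Counter


def build_fm_index(text: str):
    """Build a naive FM-index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; assumed absent from the text and smaller than A, C, G, T
    # Suffix array by direct sorting of suffixes (fine for a toy example only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    # first[c]: number of characters in the text that sort strictly before c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find(pattern: str, index) -> list:
    """Return sorted start positions of exact matches, using backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("ACGTACGTGACG")
print(find("ACG", index))
```

For this reference, the pattern `ACG` is found at positions 0, 4 and 9. Each pattern character narrows the interval with two table lookups, so search time depends on the pattern length rather than on the size of the reference, which is what makes this family of indexes attractive for large genome collections. Production tools additionally compress and sample these tables.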
On the other hand, Minimap2 [54] is a general-purpose alignment program that maps long DNA sequences against reference genomes such as human, fungal, bacterial, or viral genomes. Minimap2 is >30 times faster than other long-read or cDNA mapping tools and also achieves higher accuracy, surpassing most aligners specialized in a single type of alignment. Although both tools have been applied with success to the analysis of Nanopore data, Minimap was specifically developed for mapping long reads, while Centrifuge was conceived for a more general purpose (mapping against full-genome databases) in metagenomic analyses. However, in terms of parameter setting and configuration, Centrifuge offers a greater variety of modules and more versatility, which could result in a more reliable taxonomic assignment. Other tools such as BLASTN, MEGABLAST and LASTZ [52], [50] have also been applied for taxonomic assignment in metabarcoding studies using Illumina sequencing. Nevertheless, it is important to highlight that, because Nanopore reads are longer than Illumina reads and of poorer quality owing to the presence of insertions and deletions, many of these standardized bioinformatics tools and pipelines are not suitable for Nanopore data. In this context, Magi et al. [60], [61] assessed alignment and mapping tools and concluded that mapping or aligning Nanopore reads against a database is particularly challenging due to the size, high number and non-uniform error profiles of these long sequences. This study also found that mapping and alignment tools such as LAST, BWA, BLASR, and MarginAlign were inefficient at processing Nanopore data, and that the outcomes of these analyses were strongly influenced by sequence length, since longer sequences contained more errors [53], [54], [14], [46]. Moreover, Centrifuge has been included as part of the pipeline for the analysis of nanopore sequences in the new tool MINDS [62]. 
Based on these studies, Centrifuge and Minimap2 have proven to be the most suitable tools for working with Nanopore data, and they could be considered the best choices at present. A second critical aspect to consider in taxonomic assignment is the composition of the database, which generally has a strong influence on the percentage of sequences correctly assigned to different taxonomic levels [63], [64]. To date, there are few curated databases available for microbial identification; the most frequently used for 16S studies are SILVA [65], Greengenes [66], RDP [56], and NCBI [67]. The SILVA database contains taxonomic information for the domains Bacteria, Archaea, and Eukarya. It is based primarily on phylogenies for small subunit rRNAs (16S for prokaryotes and 18S for Eukarya) [64]. Its taxonomic hierarchy and ranks are constructed according to Bergey’s Taxonomic Outlines, the List of Prokaryotic Names with Standing in Nomenclature (LPSN), and manual curation [68]. Greengenes is the most popular and widely used database, since it is the default database in the QIIME pipeline (http://qiime.org/index.html). It provides bacterial and archaeal taxonomy based on phylogenetic trees inferred from chimera-free, consistent multiple sequence alignments, but it has not been updated since May 2013. The NCBI taxonomy contains the names of all organisms associated with submissions to the NCBI sequence databases. It is manually curated based on the current systematic literature and uses over 150 sources. It contains some duplicate names that represent different organisms. Each NCBI database node has a scientific name and may have some synonyms assigned to it. It is important to note that this has been the most used database in articles on the classification of MinION 16S sequences [57], [51], [59], [53], [52]. The RDP database is based on 16S rRNA sequences from Bacteria, Archaea, and Fungi (Eukarya). 
It contains 16S rRNA sequences available from the International Nucleotide Sequence Database Collaboration (INSDC) database. Another, newer database is EzBioCloud, a species-level resolution database made up of 61 700 species/phylotypes, including 13 132 species/phylotypes with validly published names, and 62 362 whole-genome assemblies that were identified taxonomically at the genus, species, and subspecies levels [69]. Some authors have evaluated the differences in taxonomic assignment using these databases [64] and showed that NCBI is the largest in terms of number of sequences, followed by SILVA, RDP and Greengenes, respectively. In addition, they found that SILVA shares the most taxonomic units with NCBI, and that Greengenes is the least diverse database. Moreover, only Greengenes and NCBI allow taxonomic assignment down to the species rank, while SILVA allows only genus as the lowest rank. Importantly, the NCBI database is not curated for all groups of microorganisms and may contain duplicated copies of 16S sequences, which can bias taxonomic assignment towards bacterial groups that are overrepresented in the database. An example of this is the high number of sequences belonging to pathogenic bacterial groups available in the NCBI repository. In contrast to clinical strains, sequences from extreme environments remain scarce in the NCBI database and may be underrepresented when a taxonomic assignment is carried out. More detailed guidelines for the selection of the database are provided by Park & Won 2018 [68]. A final consideration for the selection of tools is the format of the output data, since it may not be compatible with other bioinformatics tools applied in downstream analysis. This particularly concerns tools that perform statistical tests and generate plots and comparative analyses of the taxonomic profiles identified in samples. 
A detailed description of the different options and applications of the available tools for 16S metagenomic studies using Nanopore data is given in Table 2.
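The output-format issue raised above can often be addressed with a small conversion step. As a minimal sketch, assuming that the classifier output has already been reduced to one (sample, taxon) pair per read, the following function aggregates those pairs into a plain tab-separated taxon-by-sample count table, a generic layout that downstream statistics and plotting tools can usually import. The sample and taxon names are hypothetical, and no specific tool's format is implied.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple


def build_count_table(assignments: Iterable[Tuple[str, str]]) -> str:
    """Aggregate (sample_id, taxon) pairs into a tab-separated count table."""
    counts = defaultdict(Counter)
    for sample, taxon in assignments:
        counts[sample][taxon] += 1
    samples = sorted(counts)
    taxa = sorted({taxon for per_sample in counts.values() for taxon in per_sample})
    header = "\t".join(["taxon"] + samples)
    # A Counter returns 0 for taxa that were never seen in a given sample.
    rows = ["\t".join([t] + [str(counts[s][t]) for s in samples]) for t in taxa]
    return "\n".join([header] + rows)


# Hypothetical per-read assignments from two samples.
reads = [("S1", "Bacillus"), ("S1", "Bacillus"), ("S2", "Vibrio"), ("S1", "Vibrio")]
print(build_count_table(reads))
```

Keeping this step separate from classification means that the same table can be regenerated whenever the classifier or reference database changes.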

Table 2

Different tools used to analyze Nanopore 16S data in metabarcoding studies.

Analysis approach	Data processes included	Tools used for analysis	Taxonomic database	Reference
Profiling of bacterial communities	Basecalling, demultiplexing, adapter and barcode trimming, chimera removal, taxonomic assignment	Albacore v2.3.1, Porechop, Yacrd 0.3, Minimap, EPI2ME	NCBI and rrn database	[48]
In-field metagenome bacterial community analysis	Basecalling, demultiplexing, taxonomic assignment, diversity analysis	Albacore v1.10, SINTAX, usearch v10.0.240	Ribosomal Database Project	[51]
Rapid bacterial pathogen identification	Basecalling, human read removal, taxonomic assignment of bacterial reads	Albacore 2.2.4, TanTan v13, Minimap2, R	GenomeSync database, NCBI database	[52]
Monitoring the microbial community of an anaerobic digestion system	Basecalling, demultiplexing, adapter trimming, taxonomic assignment	Metrichor, EPI2ME, poRe, Porechop, QIIME, BLAST	GreenGenes database	[49]
Microbiome characterization	Basecalling, OTU picking, taxonomic assignment	Metrichor v2.42.2, Poretools, QIIME 1.9, RDP classifier, BLASTn	GreenGenes database	[50]
Microbiome amplicon sequencing workflow	Basecalling, alignment, re-orientation of reads, de novo clustering, chimera removal	Fast5-to-fastq, seqtk, INC-Seq, blastn, Graphmap, POA, chopSeq, nanoClust, R	No taxonomic assignment	[81]

Constraints to move beyond taxonomic assignment with Nanopore sequencing data

Since most of the analytical tools for taxonomic assignment were developed for Illumina data and cannot be used for Nanopore sequences, the potential benefits of using full-length 16S rRNA sequences have not been systematically explored. The deeper taxonomic resolution provided by the full 16S gene sequence can reach the genus and species level with higher specificity than other approaches [68], [69], [70]. This methodology has been applied successfully in clinical, forensic and industrial quality-control settings, where many of the microorganisms to be identified are well represented in databases because of their medical or human relevance [29], [61]. However, taxonomic assignment is not always the best approach in other ecological contexts where the microbial community has not been previously studied. In these circumstances, the most representative microorganisms living in these habitats may remain unexplored and their genomic data are consequently absent from databases, which makes taxonomic identification impossible for many of the reads. This situation is probably even more critical when working with Nanopore data, since databases are predominantly composed of fragments of the 16S rRNA gene and full-length sequences are the exception rather than the rule, which limits reliable taxonomic identification based on the full sequence of the gene. In addition, a large number of reads without taxonomic assignment directly affects how realistically the biological diversity of the sample can be measured, leading to an underestimation of the real number of species.
In this context, and as described in section 2, approaches such as operational taxonomic unit (OTU) picking and/or denoising pipelines are commonly used in 16S Illumina data analysis to overcome these limitations and the bias induced by a direct taxonomic assignment of reads [71], [72], [73]. Both OTU picking and amplicon sequence variant (ASV) analyses reduce the duplication and error of representative sequences and allow bacterial groups to be analyzed without the limitation of a database, which makes the subsequent taxonomic assignment more reliable and results in a more robust definition of microbial communities (Table 3).

Table 3

Bioinformatic tools for 16S rRNA metabarcoding Nanopore data.

Process	Tool	Input file	Programming languages	Available from	Reference
Basecalling	Albacore	Fast5	Python	https://nanoporetech.com/	ONT
	Guppy	Fast5	Python	https://nanoporetech.com/	ONT
	DeepNano	Fast5	Python	https://bitbucket.org/vboza/deepnano	[77]
	Chiron	Fast5	Python	https://github.com/haotianteng/Chiron	[78]

Sequencing report	NanoPlot	fastq, fasta, sequencing_summary (Albacore or guppy basecaller)	Python	https://github.com/wdecoster/NanoPlot	[82]
	poRe	fastq, fasta	R	https://sourceforge.net/projects/rpore/files/	[83]
	pauvre	fastq		https://github.com/conchoecia/pauvre	Github
	poretools	fastq, fast5	Python	https://github.com/arq5x/poretools	[84]

Demultiplexing	Albacore	Fast5	Python	https://nanoporetech.com/	ONT
	qcat	fastq	Python	https://github.com/nanoporetech/qcat	Github
	porechop	fastq, fasta	C++, Python	https://github.com/rrwick/Porechop	Github

Filtering and trimming	NanoFilt	fastq	Python	https://github.com/wdecoster/nanofilt	[82]
	Filtlong	fastq	C++, Python	https://github.com/rrwick/Filtlong	Github
	Porechop	fastq	C++, Python	https://github.com/rrwick/Porechop	Github

Taxonomic assignment	Minimap2	fastq, fasta	C++, Python	https://github.com/lh3/minimap2	[54]
	WIMP	fastq	Cloud-based	https://nanoporetech.com/	ONT
	Centrifuge	fastq, fasta	C++	https://ccb.jhu.edu/software/centrifuge	[58]
	LASTZ	fastq, fasta	g++, python	https://github.com/lastz/lastz	Github

Clustering	NanoClust	USEARCH/VSEARCH format	Python	https://github.com/umerijaz/nanopore/blob/master/nanoCLUST.py	[81]
	CARNAC-LR	paf	C++, Python	https://github.com/kamimrcht/CARNAC-LR	[80]

Data exploration	Pavian	Kraken and MetaPhlan formats	R	https://github.com/fbreitwieser/pavian	[85]
	PHINCH	biom	Cloud-based	https://github.com/PitchInteractiveInc/Phinch	[86]
	Krona	Krona format	–	https://github.com/marbl/Krona/wiki	[87]
	MEGAN6	OTU table	–	http://ab.inf.uni-tuebingen.de/software/megan6/	[88]
	MicrobiomeAnalyst	OTU table, taxonomy table	Cloud-based	https://www.microbiomeanalyst.ca/	[89]

These analyses need to be performed in order to carry out taxonomic assignment and diversity analysis (Fig. 3). As described previously, tools such as DADA2 and Deblur are the most commonly applied in Illumina sequencing pipelines. However, because of the particular characteristics of Nanopore 16S reads (length and quality), the use of DADA2, Deblur or any other algorithm based on ASV detection has not yet been viable for Nanopore data. The number of errors (mainly insertions and deletions) typically introduced by Nanopore sequencing severely limits the ability to find similarity between reads. Furthermore, the artificial divergence caused by the poor quality of reads, even when they come from a single organism, can result in each read being identified as a distinct sequence variant, leading to an overestimation of bacterial diversity [71]. As a consequence, analyzing Nanopore reads with inappropriate OTU clustering tools or with an ASV approach could give a completely incorrect picture of the microbial diversity of the sample, showing a dataset with very divergent sequences. Therefore, although the ASV approach is the most complete way to assess bacterial diversity, it is impracticable for Nanopore data analysis, and the only option available is an OTU-based clustering approach. However, limitations similar to those identified for ASVs arise when the most popular clustering algorithms are applied [74], such as UCLUST [75], VSEARCH [55] or CD-HIT [76]. The use of the popular pipeline QIIME to analyze Nanopore 16S sequences was assessed in a recent study [50], which reported that the tool failed at the OTU picking step; this corroborates the aforementioned issue of applying tools designed for Illumina to Nanopore data.
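A small calculation helps show why exact-sequence (ASV) methods struggle here. If errors are assumed to occur independently at a per-base rate e, the probability that a read of length L contains no error is (1 - e)^L, and ASV inference depends on seeing the same exact sequence many times. The Python sketch below evaluates this for two scenarios; the read lengths and error rates are illustrative assumptions chosen for the example, not measurements reported in this review, and real errors are not strictly independent.

```python
def prob_error_free(read_length, per_base_error):
    """Probability that a read has zero errors, assuming independent errors."""
    return (1.0 - per_base_error) ** read_length


def expected_error_free_reads(n_reads, read_length, per_base_error):
    """Expected number of error-free reads among n_reads copies of one template."""
    return n_reads * prob_error_free(read_length, per_base_error)


# Hypothetical scenarios: a short, accurate read versus a full-length,
# error-prone 16S read. Both parameter sets are assumptions for illustration.
scenarios = {
    "short read (250 bp, 0.1% error)": (250, 0.001),
    "full-length 16S (1500 bp, 5% error)": (1500, 0.05),
}

for label, (length, error) in scenarios.items():
    p = prob_error_free(length, error)
    n = expected_error_free_reads(100_000, length, error)
    print(f"{label}: P(error-free) = {p:.3g}, expected exact copies in 100k reads = {n:.3g}")
```

Under these assumptions most short reads are exact copies of their template, whereas essentially no full-length read is, so almost every long read would look like its own variant unless errors are tolerated by the clustering method or corrected beforehand.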
With closed- or open-reference OTU clustering, only a small fraction of the data would be clustered and most of the dataset would consist of singletons, which causes an erroneous overestimation of the bacterial diversity in the samples. As previously mentioned, read quality is one of the most important constraints for nanopore data analysis, and basecalling is the step that most strongly determines sequence quality. Nanopore sequencing is based on the detection of changes in electric current produced as DNA strands pass through a nanopore. Ideally, each base should produce a specific current variation, called an event. Each event is summarized by the mean and variance of the current and by the event duration [77], [51]. The translation of these events into a DNA sequence is known as basecalling. The original ONT basecallers used Hidden Markov Models (HMM); nowadays, however, strategies based on machine learning are applied in all modern nanopore basecallers, such as Guppy, DeepNano, and Chiron [77], [78]. These machine learning-based basecallers use neural networks that can be trained with real sequencing data. Machine learning approaches have been shown to be effective for improving the quality of nanopore sequencing data and limiting the impact of the base modifications, insertions, and deletions commonly present in raw data [79]. The use of machine learning on nanopore data has therefore been crucial for improving sequence quality and, in the short term, will probably deliver the improvement needed to go beyond the taxonomic assignment of 16S sequences. A final and important point to be considered is the difference in the orientation of reads produced by Illumina and Nanopore sequencing technologies.
With Illumina, read orientation is defined from the beginning of sequencing and therefore all sequences are in the same orientation, which greatly facilitates bioinformatic data analysis. This homogeneity in the sequencing data is essential for alignment and clustering because reads can be compared more easily. With the 1D sequencing chemistry of Nanopore, on the other hand, adapters can be ligated to one or both ends of the DNA template [71] and DNA strands are sequenced in random orientations. Consequently, after basecalling the dataset is composed of a mix of forward and reverse sequences that are not complementary to each other. Hence, it may be critical to incorporate an additional step that evaluates the orientation of reads prior to the analysis of Nanopore data in order to reach consistent results. Based on the points discussed in previous sections regarding the availability of tools and their applications for working with Nanopore sequences, a workflow for 16S rRNA data analysis is proposed in Fig. 3.
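The re-orientation step mentioned above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the method of any tool listed in Table 3: it counts the k-mers a read shares with a reference 16S sequence in both orientations and keeps whichever orientation shares more. The function names, the toy reference, and the choice of k = 8 are assumptions made for the example. In practice, an aligner usually provides this information directly (for instance, the strand column of PAF output).

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_read(read, ref_kmers, k=8):
    """Return the read in the orientation sharing more k-mers with the reference.

    Ties (including reads sharing no k-mers at all) keep the original orientation.
    """
    forward = read.upper()
    reverse = reverse_complement(forward)
    forward_hits = len(kmer_set(forward, k) & ref_kmers)
    reverse_hits = len(kmer_set(reverse, k) & ref_kmers)
    return reverse if reverse_hits > forward_hits else forward


# Toy reference and reads (hypothetical sequences, far shorter than a real 16S gene).
reference = "ATGCGTACGTTAGCCGATAGGCTTAACGGATCCGTA"
ref_kmers = kmer_set(reference, 8)

read_forward = reference[5:30]
read_reverse = reverse_complement(read_forward)

oriented = [orient_read(r, ref_kmers) for r in (read_forward, read_reverse)]
print(oriented[0] == oriented[1])
```

Exact k-mer matching is fragile at Nanopore error rates, so a short k and a majority vote across several reference sequences would likely be needed on real reads; the sketch only shows where such a step sits before alignment or clustering.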

Summary and outlook

With the advent of modern sequencing technologies, analysis of the 16S rRNA gene has become one of the most popular techniques in metabarcoding-based microbial ecology studies. Most of the studies conducted to date using Nanopore sequences report pipelines with a narrow scope, typically using a specific bioinformatic protocol to detect a particular pathogen or a target bacterial group or taxon, without considering the whole microbial community present in the sample. Moreover, most of the current aligners, clustering algorithms, and tools cannot process Nanopore data [74], and this remains an obstacle to a more comprehensive analysis of Nanopore 16S rRNA data. Because of the potential bias introduced by taxonomic assignment, OTU clustering may represent a more convenient alternative. In this regard, the new tools developed for transcriptomic de novo clustering could be an alternative to explore in the future [66], [67]. As several transcriptomic studies have been carried out with Nanopore, one possibility would be to take the tools that cluster de novo all the transcripts originating from a single gene and apply the same strategy to group all the variants of the 16S gene in a sample. Moreover, some of these tools were developed to deal with the particular features of Nanopore sequences and can therefore serve as a first step towards a clustering tool specific to Nanopore 16S sequences. Finally, many data analysis challenges have surfaced since the development of the new sequencing technologies. The correct use of available tools has helped to extend the use of Nanopore 16S data for a first evaluation of microbial composition.
For Nanopore, efforts have been focused primarily on designing tools for basecalling, demultiplexing, and taxonomic assignment, in response to the demands of consumers and end-users of this technology. Certainly, we are still in the first stages of the genomic revolution, and the future will bring new possibilities for the expansion of these technologies and the development of a new generation of powerful bioinformatic tools. The best parameters for identity, alignment, and database choice must also be evaluated for each particular dataset if identification at the species level is required. The 2019 release by ONT of the new version (R10) of the flowcell, with a new chemistry, will offer a substantial improvement in the quality and quantity of data, with a consensus accuracy reaching 99% and an output of 50 Gb. All these developments in Nanopore outputs will generate new challenges for bioinformatic analysis, but will also bring new opportunities to revolutionize microbial ecology studies.

CRediT authorship contribution statement

Andres Santos: Writing - original draft. Ronny van Aerle: Writing - review & editing. Leticia Barrientos: Writing - review & editing. Jaime Martinez-Urtaza: Writing - review & editing.
