Literature DB >> 35621046

Intra-genomic rRNA gene variability of Nassellaria and Spumellaria (Rhizaria, Radiolaria) assessed by Sanger, MinION and Illumina sequencing.

Miguel M Sandin^1,2, Sarah Romac¹, Fabrice Not¹.

Abstract

Ribosomal RNA (rRNA) genes are known to be valuable markers for the barcoding of eukaryotic life and its phylogenetic classification at various taxonomic levels. The large-scale exploration of environmental microbial diversity through metabarcoding approaches has been focused mainly on the V4 and V9 regions of the 18S rRNA gene. The accurate interpretation of such environmental surveys is hampered by technical (e.g. PCR and sequencing errors) and biological biases (e.g. intra-genomic variability). Here we explored the intra-genomic diversity of Nassellaria and Spumellaria specimens (Radiolaria) by comparing Sanger sequencing with Illumina and Oxford Nanopore Technologies (MinION). Our analysis determined that intra-genomic variability of Nassellaria and Spumellaria is generally low, yet some Spumellaria specimens showed two different copies of the V4 with <97% similarity. Of the different sequencing methods, Illumina showed the highest number of contaminations (i.e. environmental DNA, cross-contamination, tag-jumping), revealed by its high sequencing depth; and MinION showed the highest sequencing rate error (~14%). Yet the long reads produced by MinION (~2900 bp) allowed accurate phylogenetic reconstruction studies. These results highlight the requirement for a careful interpretation of Illumina-based metabarcoding studies, in particular regarding low abundant amplicons, and open future perspectives towards full-length rDNA environmental metabarcoding surveys.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35621046 PMCID： PMC9545545 DOI： 10.1111/1462-2920.16081

Source DB: PubMed Journal: Environ Microbiol ISSN： 1462-2912 Impact factor: 5.476

Introduction

Ribosomal RNA (rRNA) genes are known to be valuable taxonomic markers for the barcoding of eukaryotic life and its phylogenetic classification at different levels; mainly due to its intra‐genomic tandem repeated structure, the presence of conserved and variable regions and its occurrence in all eukaryotes (Pawlowski et al., 2012; del Campo et al., 2018). The 18S rRNA gene has been widely used in molecular environmental surveys, in particular the short hypervariable regions V4 and V9, thanks to the extensive occurrence in public databases and the availability of generalist primers flanking their sides (Amaral‐zettler et al., 2009; Stoeck et al., 2010). The advent of high‐throughput sequencing (HTS) techniques has allowed the massive sequencing of molecular environmental diversity supporting its exploration through a metabarcoding approach (de Vargas et al., 2015; Massana et al., 2015; Forster et al., 2016; Pernice et al., 2016). The large number of reads generated by HTS is normally classified into operational taxonomic units (OTUs) based on arbitrary similarity thresholds. OTUs are not only used to identify taxonomic entities but also to describe community structure (Blaxter et al., 2005). The increasing use of the HTS has led to the development of different clustering methods resulting in finer‐scale OTUs that focus on single nucleotide differences (Mahé et al., 2015) or on the correction of sequencing errors based on the error rate entropy (so‐called amplicon sequence variants or ASVs; Callahan et al., 2016). HTS produce a vast amount of reads carrying errors that are difficult to distinguish from real biological variation, which is considered as a main factor inflating diversity (Kunin et al., 2010). Intra‐genomic rDNA polymorphism and its different copy numbers among taxa can also affect diversity assessments (Gong et al., 2013; Gong and Marchetti, 2019). Other less common causes, yet important, have also been reported as factors inflating diversity estimates, such as lateral gene transfer (Yabuki et al., 2014) or presence of pseudogenes (Thornhill et al., 2007). Considering the very high sequencing depth allowed by the current HTS technologies it is likely that most of this intra‐genomic diversity could be sequenced, potentially leading to an overestimation of the environmental diversity. Several studies have argued that, quantitatively at the community level, the number of molecular clusters largely exceeds that of morphological counts (Medinger et al., 2010; Bachy et al., 2013; Santoferrara et al., 2016), leading to hypothesize that using HTS approaches scientists are actually measuring species intra‐genomic variability (Caron and Hu, 2018). Current HTS technologies used for environmental surveys can sequence short fragments of DNA of about 400 base pairs (bp) only, such as the hypervariable regions (V4 and V9 most commonly used in protist) of the 18S rRNA gene. Comparing such short hypervariable regions to, far from exhaustive reference sequences databases may also contribute to inflating environmental diversity by misidentification of environmental clusters or lack of intra‐genomic rDNA variability representation (Pitsch et al., 2019), among other causes. New sequencing technologies have been developed with the ability of high‐throughput sequencing longer nucleotide fragments in real‐time, such as Oxford Nanopore Technologies (ONT) or Pacific Bioscience (PacBio). These sequencing methods have already showed their useful capabilities, for example PacBio has developed a circular consensus sequencing resulting in near‐zero error long reads, improving aspects such as genome assembly (Wenger et al., 2019) or even phylogenetic analysis of environmental diversity (Jamy et al., 2019). However, its limited high‐throughput sequencing ability and its relatively high cost (Goodwin et al., 2016) may affect the sequencing depth at a metabarcoding community level. On the other side, ONT provides a large quantity of reads, inexpensively and highly portable with the MinION device (Levy and Myers, 2016). Despite that the error rate of the ONT reads is improving since it was first released (for MinION, from ~60% in 2014 to <15% error rate; Rang et al., 2018) it is still a major concern, reaching up to 3%–6% of errors in the best‐case scenarios (Tyler et al., 2018). In this study, we aim at assessing both (i) the intra‐genomic variability within protists and (ii) whether long‐read sequencing provides a reliable source of diversity estimate suitable for environmental molecular surveys. To test these hypotheses we compare three different sequencing methods, Sanger, ONT (MinION) and Illumina. We focused our efforts on two groups of Radiolaria, the Nassellaria and Spumellaria (Polycystines). Some species of Radiolaria harbour endosymbiotic micro‐algae, constituting a community system which can be considered as a so‐called holobiont. The Radiolaria host is an important group of protists in eukaryotic plankton communities, contributing for a major fraction of the total reads in environmental molecular surveys (de Vargas et al., 2015; Pernice et al., 2016) but the number of morphologically described species (Suzuki and Not, 2015) does not match with the molecular barcodes. Nassellaria and Spumellaria environmental diversity lags far behind other radiolarian, despite possessing the largest morphological diversity described. Recent studies have dwelled on exploring their extant morpho‐molecular diversity (Sandin et al., 2019; Sandin et al., 2021) showing their uncharted diversity among Radiolaria. The ecological importance of these groups and their observed low molecular diversity in environmental surveys stresses the need for understanding such differences between molecular and morphological diversity.

Experimental procedures

Single‐cell sampling, isolation and DNA extraction

Plankton samples were collected in the Bay of Villefranche‐sur‐Mer (France) and in the West Mediterranean Sea (MOOSE‐GE 2017 expedition) by plankton nets tows (from 20 to 64 μm mesh size). More information on sampling methodology for specific samples can be found in the RENKAN database (http://abims.sb-roscoff.fr/renkan/). Specimens were individually handpicked with Pasteur pipettes from the plankton community and maintained in 0.2 μm filtered seawater for several hours. They were transferred three to four times into new 0.2 μm filtered seawater to allow self‐cleaning from debris, particles attached to the cell or prey(s) digestion. By doing so, it is expected to keep only essential entities from the holobiont (the radiolarian host + associated symbionts and bacteria).

Amplification and sequencing

A schematic representation of the study design and the different amplification and sequencing steps can be found in Supplementary material Fig. S1 and a fully detailed explanation of the experimental procedures can be found in Supplementary Information. Four holobionts were selected to amplify the full length of the rRNA gene (Supplementary material Fig. S2), two of them belonging to Spumellaria: Mge17‐81 (Rhizosphaera trigonacantha, Rhizosphaeroidea) and Mge17‐82 (Spongosphaera streptacantha, Spongosphaeroidea); and two to Nassellaria: Vil325 (Eucyrtidium cienkowskii, Eucyrtidioidea) and Vil496 (Eucyrtidium acuminatum, Eucyrtidioidea). Each holobiont was PCR‐amplified in three different technical replicates covering from the beginning of the 18S rRNA gene until the end of the 2D region of the 28S rRNA gene. Final PCR product was divided into two for sequencing by Sanger (after cloning) and ONT using the MinION device (Jain et al., 2015; Laver et al., 2015) during its last 28 h (20 h → 48 h), after a prior independent experiment (data not analysed herein). Eight holobionts were selected to amplify the V4 region (~380 bp) of the 18S rRNA gene (Supplementary material Fig. S2), four belonging to Spumellaria: Mge17‐81 (common to Sanger and MinION sequencing), Mge17‐82 (common to Sanger and MinION sequencing), Vil480 (Tetrapyle octacantha, Pylonioidea), Vil497 (Arachnospongus varians, Liosphaeroidea); and four to Nassellaria: Mge17‐9 (Extotoxon undulatum, Artostrobioidea), Mge17‐124 (Carpocanium obliqua, Carpocaniidae), Vil490 (Pterocorys cf. zanclea, Pterocorythoidea), Vil496 (common to Sanger and MinION sequencing). Holobiont Vil325 (from Sanger and MinION sequencing) had no DNA left for further experiments and therefore was not possible to include in this section. After PCR amplification of three technical replicates, about 500 ng of pooled amplicons were sent to Fasteris (https://www.fasteris.com, Switzerland) for Illumina sequencing on a MiSeq nano V2 2 × 250. Sequencing results from this study along with associated metadata for its analysis and native formats have been deposited in figshare with the following DOI: https://doi.org/10.6084/m9.figshare.16922764.v1. Raw sequencing results have also been deposited in SRA under the accession number PRJNA816840. All scripts used in this study along with the tools for the replication of the analysis are available on github (github.com/MiguelMSandin/IntraGenomic-variability).

Sequences/amplicons analyses

Raw sequences/amplicons were measured in bp length and compared against reference sequences by local alignment (BLAST; using NCBI as reference sequences) for Sanger and MinION sequencing results and global alignment (vsearch; Rognes et al., 2016; using PR2 v4.14.0 database as reference sequence, Guillou et al., 2013) for Illumina results. Thereafter depending on the sequencing nature different pipelines were followed: Sanger reads belonging to the same replicate were concatenated, MinION reads were de‐multiplexed with cutadapt (Martin, 2011) and Illumina reads were clustered into ASVs using DADA2 and the pipeline described in Callahan et al. (2016). Resulting ASVs were post‐clustered based on co‐occurrence, similarity and abundance with the ‘LULU’ algorithm (Frøslev et al., 2017). The V4 region of sequences coming from Sanger and MinION sequencing was extracted with cutadapt (Martin, 2011), using the same primer set used to amplify and sequence with Illumina (see Supplementary Information for further details), and clustered using swarm (Mahé et al., 2015). In order to compare different sequencing methods and discriminate errors from intra‐genomic variability, we perform an entropy analysis. Sequences taxonomically assigned to Nassellaria and Spumellaria were aligned independently using MAFFT v7.395 (Katoh and Standley, 2013) and for every position of each alignment, Shannon entropy was calculated using a custom script (see ‘alignmentEntropy.py’ in github.com/MiguelMSandin/IntraGenomic-variability). The entropy of the full V4 region was measured in an independent analysis to compare the three different sequencing technologies against reference sequences from PR2 v4.14.0. Sequences assigned to Polycystines from both Sanger and Minion sequencing have been clustered into consensus sequences for phylogenetic analysis. Consensus sequences coming from Sanger were produced with a custom script (see ‘alignmentConsensus.py’ in github.com/MiguelMSandin/IntraGenomic-variability) and MinION consensus sequences according to Wurzbacher et al. (2019) with the script consension (https://microbiology.se/software/consension/). Resulting consensus sequences from MinION were mapped against the raw fastq file using minimap2 (Li, 2018) and lastly polished with racon v1.4.22 (Vaser et al., 2017). Final consensus sequences were aligned against reference sequences (extracted from Decelle et al., 2012; Biard et al., 2015; Sandin et al., 2019 and Sandin et al., 2021; Supplementary material Table S2) using MAFFT v7.395 (Katoh and Standley, 2013) and automatically trimmed using trimal (Capella‐Gutiérrez et al., 2009) with a 30% gap threshold. The final data set contains 525 taxa and 2434 positions and phylogenetic analysis was done by RAxML (Stamatakis, 2014) and GTR + Gamma and GTR + CAT evolutionary models with 1000 rapid bootstraps. Final trees were visualized and edited with FigTree version 1.4.3 (Rambaut, 2016).

Results

Quality of sequencing results

In total four Radiolaria holobionts were sequenced by both Sanger and MinION after three independent PCR amplifications or replicates for each (see Supplementary material Fig. S1 for a schematic representation of the experimental procedure and Supplementary material Fig. S2 for a table with detailed information on each holobiont). Prior to Sanger sequencing, 24 cloning reactions per replicate were performed to further amplify the targeted rRNA gene. In total 834 Sanger sequences (representing 737 817 bases) were successfully retrieved for 864 cloned amplicons (Supplementary material Table S1). These sequences had an average length of 884.67 (±198.03) bp, with a median of 1005 bp and an average BLAST similarity identity of 98.13% (±2.59%) with a reference sequence (Fig. 1, Sanger).

Fig. 1

Boxplot summarizing sequencing results of Sanger, Oxford Nanopore Technologies (MinION) and Illumina sequencing; top row shows the sequence length (in bp) and bottom row the percentage identity of the first match against a reference sequence per cell. Note that identity for Sanger and MinION was performed by BLAST and for Illumina by global alignment (vsearch) due to the shorter length of the reads. The MinION flow cell was used during its last 28 h (20 h → 48 h; after a previous independent experiment) resulting in 864 total reads (representing a total of 1 645 774 bases). The low throughput obtained in our study contrast with the high number of sequences obtained during the first 20 h (0 h → 20 h) of the same flow cell in the previous independent study, where a total of 225 573 sequences were obtained (representing 296 072 423 total bases; data not analysed herein). Based on these observations, it turns that, the last 28 h of the flow cell sequenced 0.38% of the total sequences obtained by the flow cell, representing 0.56% of the total bases. These 864 sequences were compared against NCBI database by BLAST tool and 81 had no match at all. Among the remaining 783 sequences, 185 were assigned to bacteria, 593 to eukaryotes and five matching simultaneously different domains (i.e. the same sequence matching virus, uncultured prokaryote and uncultured eukaryote). These 783 sequences had an average length of 2043.48 bp (±1143.42) bp, with a median of 2746 bp and an average identity of 86.01% (±2.91%) with a reference sequence. De‐multiplexing MinION sequences resulted in 55 sequences among the different replicates and specimens (Supplementary material Table S1). De‐multiplexed sequences had an average length of 2641.82 bp (±681.80) bp, a median length of 2921 bp and an average identity of 86.48% (±2.33%) similarity with a reference sequence (Fig. 1, MinION). Regarding Illumina sequencing, a total of 1 019 196 sequences (representing 254 799 000 bases) were sequenced in both the R1 and the R2 files from eight Radiolaria holobionts, of which only three were also sequenced by Sanger and MinION (due to original DNA limitations; Supplementary material Fig. S2). On average 34 671 (±19,609) reads were obtained for each replicate. These reads were merged resulting in 30 524 (±17,359) amplicons on average per replicate and 66 913 total unique amplicons (Supplementary material Table S1). They had an average length of 375.3 (±15.89) bp, a median of 382 bp and an average identity of 99.28% (±2.26%) with a reference sequence (Fig. 1, Illumina).

Taxonomic assignation and diversity of reads

Sanger sequenced amplicons of the different parts of the ribosomal genes belonging to the same cloned replicate were concatenated resulting in 287 sequences with an average length of 2570.7 (±296.97) bp. This covers the first ~1000 bp of the 18S rRNA gene (primer SA), the last ~600–800 bp of the 18S and the Internal Transcribed Spacer 1 (ITS1) (primer S69f) and the regions D1 and D2 from the 28S rRNA gene (primer D1R). These sequences were assigned predominantly to Nassellaria and Spumellaria, representing 85.4% of the sequences (Fig. 2, Sanger). The rest of the sequences belong to Chrysophyceae, Ciliates, Diatoms, Fungi, Dinoflagellates and Alveolates. The 55 de‐multiplexed sequences extracted from MinION sequencing showed a similar taxonomic assignation as for Sanger sequencing, yet less diverse: Nassellaria and Spumellaria represented 72.7% of the total assignations and Diatoms, Chrysophyceae, Dinoflagellates, Fungi and Alveolates the rest of the groups (Fig. 2, MinION).

Fig. 2

Taxonomic affiliation of sequences/amplicons obtained by Sanger (after concatenation of the amplified fragments), Oxford Nanopore Technologies (MinION) and Illumina (reads processed by dada2) sequencing for each cell. The area represents the proportion of total number of unique sequences/amplicons affiliated to the specific taxonomic entity (tree map). Numbers below the taxonomic group represent the number of unique sequences/amplicons. Amplicons of Illumina were clustered using an error‐based method (dada2) resulting in 153 ASVs, comprising 570 041 reads. These ASVs had an average length of 379.2 (±8.99) bp, a median length of 378 bp and an average similarity identity of 99.70% (±5.97%). The most diverse groups are dinoflagellates (28 ASVs), Spumellaria (23 ASVs), Nassellaria (19 ASVs) and Fungi (14 ASVs mostly within Agaricomycotina and Ustilaginomycotina) (Fig. 2, MinION). Despite the big diversity within each holobiont, the number of reads is highly dominated by 1, 2 or 3 ASVs belonging to the host of the holobiont or the symbiotic algae (Supplementary material Fig. S3). Due to the great taxonomic diversity found within holobionts by Illumina sequencing, ASVs were only considered if they were present in at least three replicates (triplicates in the PCR), with a total abundance equal to or higher than the median value of the abundance (that is 102 reads) and assigned until at least the taxonomic rank ‘Order’. We have chosen stringent thresholds in an attempt to remove all artefacts and/or contaminations, while being aware that part of the diversity within the holobionts might also be removed. After processing, the number of ASVs decreased to 36 ASVs but the total reads did not change drastically with 544 127 reads, representing up to 95.45% of the total reads. Main taxonomic affiliations were similar to the one described before filtering the ASVs. Furthermore, the relative proportion of unexpected ASVs took more importance; such as an ASV affiliated to Collodaria present in four holobionts, an ASV affiliated to Acantharea present in three holobionts and one to two ASVs affiliated to Craniata (assigned to the genuses Homo and Capra with 99 and 69 minimum bootstrap support respectively) present in every holobiont (Supplementary material Fig. S4). When exploring in detail the ASVs affiliated to Polycystines and their abundance and distribution among replicates after applying the stringent thresholds, it is possible to find up to three different and highly abundant ASVs within the same specimen (e.g. Spongosphaera streptacantha or Arachnospngus varians; Fig. 3). These ASVs appearing in the same three replicates within a unique specimen had a high similarity among one another, and they were grouped together into a single ASV after post‐clustering based on co‐occurrence and similarity with ‘LULU’ (Supplementary material Fig. S5). Some of the most abundant ASVs were found in the three replicates of their corresponding specimens and in a fourth and fifth replicate; as for ‘ASV‐2’, ‘ASV‐5’ and ‘ASV‐7’, ASVs from the Nassellaria specimens Pterocorys zanclea, Extotoxon undulatum and Eucyrtidium acuminatum respectively. ‘ASV‐2’ was present in the three replicates of Eu. acuminatum and in one replicate of Carpocanium obliqua and Rhizosphaera trigonacantha, a Spumellaria specimen, although at very low relative abundances. In E. undulatum there were two ASVs, of which one (‘ASV‐5’) was present in the three replicates of the holobiont and in a fourth and fifth replicate from Rhizosphaera trigonacantha and Spongosphaera streptacantha, the second ASV (‘ASV‐47’) was only present in one replicate. While ASV‐47 is also found in C. obliqua, T. octacantha and A. varians, it never appears within the three replicates of the same holobiont. Its taxonomic affiliation is to Collodaria, with a 100% similarity to a reference sequence and 100 BS support for the taxonomic assignation.

Fig. 3

Circular plot representing the log‐transformed abundance of polycystines amplicons (lower half of the circles) clustered with dada2 and replicate affiliation (upper half of the circles, indicated by roman numbers under the name of the specimens). Only amplicons affiliated to polycystines present in three or more samples and with a total abundance equal to or higher than 102 reads (median) were considered.

Intra‐genomic variability

The hyper‐variable region V4 of the 18S rRNA gene was extracted from the sequences affiliated to Radiolaria from both Sanger and MinION sequencing, in order to compare with Illumina sequencing and assess if the same genetic diversity is detectable within the holobiont. In total 52, 56, 59 and 68 sequences were obtained from Sanger sequencing of Eucyrtidium acuminatum, Eu. cienkowski, Spongosphaera streptacantha and Rhizosphaera trigonacantha respectively. The same unique sequence was found to be the most abundant in both Eu. acuminatum and Eu. cienkowski (54 and 47 reads respectively) being exactly identical despite being obtained from specimens morphologically identified as different species. In addition, up to five other unique sequences were found with a very low abundance (2, 2, 1, 1 and 1 reads, green dots on Fig. 4 under unique reads). All sequences of the V4 hypervariable region extracted from the two Nassellaria specimens showed similarities within them close to 100% with a maximum of one base of difference (boxplots on Fig. 4 under unique reads). Despite these two different morpho‐species showing identical V4 sequences, the ITS1 and the partial 28S rRNA gene (D1 + D2 regions) show inter‐specific differences (data not shown). Similar patterns of intra‐genomic variability in the V4 hypervariable region are seen in the Spumellaria R. trigonacantha, with one sequence found 67 times and another one found only one time. In contrast, S. streptacantha showed up to three different sequences relatively abundant (16, 13 and 13 reads) and eight other sequences with a lower abundance, having a similarity among them of 96.1%. It is important to note that the three most abundant sequences of S. streptacantha were present in the three different replicates (PCR reactions). When these sequences were clustered with the software swarm using 1 difference threshold, Eu. acuminatum and Eu. cienkowski shared the same amplicon. In R. trigonacantha the sequence with a single read was grouped in the unique amplicon found for this holobiont, and up to three different amplicons were found in S. streptacantha (still with a similarity among these amplicons of ~96%). Increasing the threshold to two differences did not change the outcome for S. streptacantha. When using a threshold of three differences, the number of amplicons in S. streptacantha decreases to 2, with low changes in the similarities among them (96.4%). Yet one amplicon largely dominates the other (51 reads against eight reads). The detailed exploration of such polymorphism revealed that nucleotide differences are mostly located in variable helixes of the secondary structure of the 18S rRNA gene (Supplementary material Fig. S6). Same protocol was implemented on the V4 hyper‐variable region of the 18S rRNA extracted from the MinION sequences. In total four sequences were extracted from R. trigonacantha, two from S. streptacantha, six from Eu. cienkowski and 20 from Eu. acuminatum. It was not possible to cluster the sequences with swarm due to their large dissimilarities. Up to a difference threshold of 10, there were still no clusters, keeping an average intra‐genomic similarity of 86.92% (±3.56%).

Fig. 4

Abundance and intra‐genomic similarities of the V4 region extracted from Sanger sequencing results of the total unique reads and clustered with swarm with 1 (d1), 2 (d2) and 3 (d3) differences.

Discerning intra‐genomic variability from sequencing errors

In order to compare the intra‐genomic variability between different sequencing methods and discriminate between sequencing errors, we calculated the entropy of the alignment from sequences obtained in this study at every position. Since it is expected that errors are random, a single error in the alignment would represent low evenness resulting in low entropy values (yet not 0). On the other side, intra‐genomic variability is expected to be sequenced in several replicates increasing the evenness and with it the entropy. In general, concatenated sequences from Sanger sequencing belonging to Nassellaria showed a trend of Shannon entropy values of 0 among them (Supplementary material Fig. S7). Towards the end of the amplified fragments the entropy slightly increased, meaning that most probably the sequencing error rate increases. Especially at the end of the region D2 of the 28S rRNA gene the entropy reaches its highest values along a region of ~100 bp length, probably meaning variability among the different copies of the rDNA. Regarding Spumellaria, Rhizosphaera trigonacantha has a similar trend as Nassellaria (Supplementary material Fig. S7). In contrast, Spongosphaera streptacantha shows regions of high entropic values at around the position 750 (V4 region from the 18S rRNA gene), along with the beginning of the 28S rRNA gene and especially over a region of ~250 bp on the ITS1, showing a most probable big intra‐genomic variability of the rDNA. In general MinION sequences aligned against the reference sequence with many gaps (Supplementary material Fig. S7). Alignments of sequences coming from MinION have in general a greater number of positions than alignments of sequences from Sanger sequencing (~922.25 bp more on average); greater than the length of the 5.8S and ITS2 regions of the rDNA that MinION sequences are covering in comparison to Sanger sequences (18S + ITS1 and D1 and D2 regions of the 28S rRNA gene). These sequences had high entropy values all along the rDNA alignment. With the exception of the 18S rRNA gene of Nassellaria, where there is a constant entropy trend, there are variations of the entropy depending on the region, but these variations do not match exactly those of the Sanger sequences. The V4 hypervariable region of the 18S rRNA gene used in Fig. 4 along with that extracted from MinION sequencing and Illumina were pooled together in one alignment for the V4 region comparison. In total, 108 sequences for Nassellaria and 127 sequences for Spumellaria were extracted from Sanger sequencing, 19 Nassellaria sequences and five Spumellaria sequences were extracted from MinION sequencing, including those amplicons extracted from Illumina taxonomically assigned to Nassellaria (19) and Spumellaria (23) and completed with reference sequences from pr2 v4.14.0 (83 for Nassellaria and 671 for Spumellaria) trying to gather most of their known diversity. Final aligned dataset had a length of 478 bp for Nassellaria and 461 bp for Spumellaria (Fig. 5). Reference sequences showed a hotspot of higher Shannon diversity within the hypervariable region in both Nassellaria and Spumellaria spanning from the position ~80 until ~210 with maximum values around the position ~110–150. Nassellaria reference sequences showed two other hotspots regions peaking at around the positions ~330 and ~420, yet average entropy values are half of the first hotspot region. Regarding Spumellaria, these two last hotspot regions are less marked than for Nassellaria due to overall higher entropy values. Sanger sequences maintain near‐0 entropy values, with the exception of a region between ~100 and 200 bp in Spumellaria showing higher entropy values (corresponding to Spongosphaera streptacantha). Illumina ASVs follow the same patterns as the reference alignment, with the highest entropy values at around the position ~150. Despite the lack of trend found for the full length of the rDNA for MinION sequences (Supplementary material Fig. S7), when focusing on the V4 region, shows similar patterns that those found in the reference alignment and Illumina, although these peaks are smoother due to the relative higher entropy values all along the alignment.

Fig. 5

Shannon entropy analysis for every position (on x‐axis) of the V4 hypervariable region of the 18S rRNA gene for reference sequences of Nassellaria and Spumellaria (extracted from PR2 v4.14.0) and for Sanger, Illumina and Oxford Nanopore Technologies (MinION) sequencing results aligned all together within the different taxonomic groups. For further results on the near‐full rDNA alignment entropy obtained by Sanger and Oxford Nanopore Technologies (MinION) sequencing see supplementary Fig. S7.

Phylogenetic signal of long‐read sequencing

We performed a last attempt to test whether the high‐error rate from MinION is affecting further downstream analysis or could be overcome. Sequences obtained from Sanger and MinION were clustered into consensus sequences in order to resolve potential random errors and were aligned in a phylogenetic tree along with reference sequences (Supplementary material Table S2). Despite the high error rate of MinION sequences, both the random distribution of the errors and the long reads provide accurate phylogenetic information when processed into consensus sequences (Fig. 6). Consensus sequences from MinION are phylogenetically sister to the consensus sequences from Sanger showing high phylogenetic support and short branches. Consensus sequences from Sanger sequencing have a similarity >99% against reference sequences except for Eucyrtidium cienkowski (97.32%). Consensus sequences from MinION have a maximum similarity identity against reference sequences of 99.5%, 98.9% and 99.2% for Eucyrtidium sp., Rhizosphaera trigonacantha and S. streptacantha respectively (data not shown).

Fig. 6

Molecular phylogeny of consensus Oxford Nanopore Technologies (MinION) sequences, consensus of Sanger results and reference sequences inferred from the concatenated complete 18S and partial 28S (D1–D2 regions) rRNA genes. In bold are shown specimens used in this study (consensus) and reference sequences previously obtained for the same specimens. Numbers after the OTU name of sequences obtained by MinION represent how many raw sequences the consensus was built. The tree was obtained by using a phylogenetic maximum likelihood method implemented in RAxML using the GTR + CAT model of sequence evolution and 1000 rapid bootstraps (BS, shown at the nodes). Bootstrap values below 60 are not shown. Black circles indicate BS ≥99%. Hollow circles indicate BS ≥90%.

Discussion

Our results showed that intra‐genomic rRNA gene variability of Nassellaria and Spumellaria is generally limited, as previously found in tintinnids ciliates (Bachy et al., 2013). Such low intra‐genomic variability extends in some cases to inter‐specific similarities, where the two closely related species Eucyrtidium cienkowskii and Eu. acuminatum show the same V4 hypervariable region of the 18S rRNA gene. In other species intra‐genomic rDNA variability can be very important, as it is the case of Spongosphaera streptacantha in which we found two distinct intra‐genomic V4 18S rRNA gene regions (<97% similarity, representing 14 different nucleotide positions). These polymorphisms are located in variable helixes of the 18S rRNA gene, not affecting the secondary structure, as already shown in Foraminifera (Weber and Pawlowski, 2014). Taxonomic differences among closely related groups have also been found in Oligotrich and Peritrich Ciliates (Gong et al., 2013). Most of this intra‐genomic variability is, however, overlooked due to the presence of a highly repeated copy that predominates over the low abundant copies, as previously found in Nassellaria or among other orders of Radiolaria such as Acantharia (Decelle et al., 2014). Yet, in Acantharia the intra‐genomic variability could also become important, finding up to three different OTUs (V9: clustered at 97% and present in two replicates; Decelle et al., 2014). Similar studies have shown a relationship between the intra‐genomic variability and the number of macronuclei (Zhao et al., 2019) and the rDNA copy number in ciliates (Gong and Marchetti, 2019) and in alveolates (Medinger et al., 2010). That could explain the taxonomic differences in the intra‐genomic variability of Radiolaria, since both Nassellaria and Spumellaria have only one nucleus that tends to be small, whereas Acantharia and Collodaria have several nuclei (Suzuki et al., 2009). In the former case, the species tend to show low intra‐genomic variability, whereas in the latter case the intra‐genomic variability can be relatively high (Decelle et al., 2014). Illumina sequencing has identified a priori a very diverse holobiont host gene variability; finding in some Radiolaria specimens up to 10–11 different ASVs (Fig. 2). And when comparing the holobiont community obtained by the three different sequencing methods, Illumina shows a notably more diverse holobiont community. The unexpected presence of taxonomic groups such as Acantharia, Collodaria or Craniata within holobionts constituted by Nassellaria and Spumellaria as hosts, questions the reliability of the so‐called ‘rare’ ASVs in environmental studies. Part of this rare biosphere has been suggested as artefacts inflating diversity estimates (Kunin et al., 2010; Bachy et al., 2013). Technical issues such as cross‐contamination during sequencing or tag‐jumping during library preparation also question the rare biosphere over‐estimating it (Kircher et al., 2012; Schnell et al., 2015); as seen in ASVs present in an additional fourth and fifth replicate (i.e. ‘ASV‐2’, ‘ASV‐5’ and ‘ASV‐7’ in Fig. 3). Other ASVs have had full similarities against a reference sequence and have passed the stringent abundance filters we have applied (e.g. Craniata in fig. 2 or ‘ASV‐47’ affiliated to Collodaria in Fig. 3), suggesting the presence of environmental DNA (eDNA). All holobionts were collected in the north‐western Mediterranean Sea, yet at varying depths, localities and dates. Numbers of reads have been found to be correlated with the number of nuclei and the cell size (Biard et al., 2017; Pitsch et al., 2019). And our results indicate that eDNA contamination is showing a bias towards organisms with a higher copy number, such as Collodaria (Biard et al., 2017) or the metazoans Craniata. A problem that might be accentuated when the targeted DNA is relatively scarce (i.e. that of Nassellaria) and has to ‘compete’ for the available space in the flow‐cell during sequencing, resulting in large differences of relative abundances, as seen in Carpocanium obliqua and Tetrapyle octacantha (Fig. 2 and supplementary Fig. S4). Probably, DNA from Acantharia and Collodaria are more prone to be successfully amplified compared to that of Nassellaria and Spumellaria, as seen in differences of DNA amplification success among distinct Foraminifera taxa (Weiner et al., 2016). In this study we have used general eukaryotic primers that are equally binding these radiolarian groups. Therefore, further analysis would be needed to assess the possible effect of shell or cell architecture regarding potential impact on DNA extraction and amplification. These results highlight the need for a careful interpretation of diversity through environmental metabarcoding surveys due to the unpredictable bias found among various taxa (e.g. Weiner et al., 2016; Gong and Marchetti, 2019). In some cases, Illumina sequencing from eDNA tends to lead to an under‐representation of the environmental diversity (e.g. Carpocanium obliqua and Tetrapyle octacantha in Fig. 2) and to an over‐estimate for other taxa (e.g. Collodaria and Craniata). For example, the presence of ‘ASV‐2’, ‘ASV‐5’ or ‘ASV‐7’ in fourth and fifth additional replicates would have been considered as low abundant ASVs in those specific samples in global environmental metabarcoding surveys lacking replicates. Such biases emphasize the need to consider technical replication in metabarcoding surveys to ensure an accurate estimation of the environmental microbial diversity (Prosser, 2010). Besides, understanding other aspects is especially important when working with specific taxa, such as the estimation of the rDNA copy number (Biard et al., 2017; Gong and Marchetti, 2019), the analysis of differences in relative abundance (Morton et al., 2019) or even exploring previous steps such as DNA extraction bias, amplification primers or intracellular architecture, may contribute to a more accurate interpretation of metabarcoding surveys. However, a big part of the overestimated diversity might also be due to sequencing errors, as previously proposed (Bachy et al., 2013; Decelle et al., 2014), since the 50% less abundant ASVs accounted for less than 5% of the total reads meaning a high presence of singletons and low abundant clusters. Recently developed tools, such as the ‘LULU’ algorithm (Frøslev et al., 2017), proved to be efficient identifying intra‐genomic variability or early PCR errors by post‐clustering similar ASVs/OTUs based on co‐occurrence patterns. Illumina sequencing is the method with the highest sequencing depth. After clustering into ASVs, the error rate is the lowest from the three methods used and is randomly distributed. Besides, most of the suspected artefacts produced during Illumina sequencing were removed by applying stringent filters and/or post‐clustering methods. Sanger sequencing also showed a limited error rate, yet singletons were found in the V4 18S rRNA gene region of every replicate sequenced, and were later corrected when clustering (Fig. 4). However, towards the end of the amplified fragments the error rate increased substantially. Due to the potential presence of sequencing errors in reference sequences, we suggest that filtering of Illumina ASVs should be done principally based on abundance rather than on identity thresholds. MinION is known to be a sequencing technology with the highest error rate, providing alignment similarities of raw reads around 94%–97% in bacteria (Tyler et al., 2018) better than our results with an average of 86% for the best scoring sequences. Yet, recent studies have generated consensus sequences decreasing considerably the error rate and reaching results comparable or better than those obtained by Sanger sequencing (Pomerantz et al., 2018; Wurzbacher et al., 2019), as seen in our study (>98.9% similarity identity to reference sequences). Despite the high error rate of raw reads from MinION found in our analysis, consensus sequences showed accurate phylogenetic signal with high bootstrap supports. It turns out that the length of the reads over‐competes the randomly generated sequencing errors and brings accurate phylogenetic information. In our study we have used the last 28 h of the MinION sequencing device, largely affecting the life and quality of the nanopores from the MinION flow cell in detecting the ion signal. It is therefore recommended to stop the flow cell according to desired acquisition of reads and not time‐sequenced when using different projects in the same flow cell; as seen in Srivathsan et al. (2021) where they have successfully re‐used a flow cell up to four times. ONT have the advantage of directly sequencing the DNA strand with no need of PCR amplification, and therefore removing biases associated in these steps that most likely produced part of the artificial environmental diversity found in Illumina generated data. This also brings the possibility to work with absolute reads, and not relative values providing a more quantitative picture of the genetic community. Yet, rDNA copy number will still be an issue when comparing genetic and morphological diversity. ONT's accuracy is improving since it was first released (Rang et al., 2018), and new methods for correcting and analysing ONT results are coming along (e.g. demultiplexing: Wick et al., 2018; base‐calling: Wick et al., 2019, Xu et al., 2021; consensus sequence building: Pomerantz et al., 2018; Wurzbacher et al., 2019; and polishing: Vaser et al., 2017). Therefore, MinION could generate fruitful results in the near future for metabarcoding surveys by taking advantage of the extensive 18S reference database available and the high variability and taxonomic resolution of the ITS and the 28S (as seen in environmental studies performed with PacBio, Jamy et al., 2019). The high cost‐effectiveness of ONT (Cui et al., 2020) and the portability and sequencing depth and length of MinION device show fruitful perspectives for its implementation in environmental phylogenetic studies. Along with the high accuracy of circular consensus sequencing of PacBio (Wenger et al., 2019) for the high‐throughput establishment of reference sequences, environmental phylogenetic studies could move towards the full rDNA surveys. Appendix S1. Supporting Information. Click here for additional data file. Fig. S1. Schematic representation of the study design. Click here for additional data file. Fig. S2. Light microscopy images of specimens used in this study. On the right of each specimen it is indicated whether it was used (tick mark) or not (cross mark) for Sanger + Oxford Nanopore Technologies (MinION) sequencing and/or Illumina, and in grey, a brief description on location and time of isolation. Scale bar (when available; black/white bar) represents 50 μm. Click here for additional data file. Fig. S3. Tree map of the total number of reads (abundance) of every ASV affiliated to a taxonomic group for each cell obtained by Illumina sequencing. ASVs were processed by dada2. Click here for additional data file. Fig. S4. Tree map of every ASV affiliated to a taxonomic group (left column) and their total number of reads (abundance; right column) for each cell obtained by Illumina sequencing. ASVs were processed by dada2. Only ASVs present in 3 or more samples and with a total abundance equal or higher than 102 reads (median) were considered. Click here for additional data file. Fig. S5. Circular plot representing the post‐clustering (by ‘LULU’) of polycystines amplicons clustered with dada2 shown in Figure 4. Click here for additional data file. Fig. S6. Schematic representation of the 18S rRNA gene 2‐dimensional structure (obtained with rnacentral.org/r2dt) from Spongosphaera streptacantha (Mge17‐82), highlighting the most important regions and positions of intra‐genomic variability. Nucleotide size of the highlighted regions represents the relative proportion of the given nucleotide at the given position in the alignment (‘x’ represents a gap) among the sequences obtained by Sanger sequencing. Nucleotides with an orange background refer to the 14 different positions found in the V4 hypervariable region. Click here for additional data file. Fig. S7. Shannon entropy analysis for every position (on x axis) of Sanger and Oxford Nanopore Technologies (MinION) sequencing results aligned independently for Nassellaria (green) and Spumellaria (blue). Lines represent the tendency of the entropy for each alignment. Black arrows in Sanger boxes represent direction and approximate position of the primers used for Sanger sequencing. Insertions (‘‐’) were not considered due to the differences in length of the sequences and the high amount of insertions produced by the raw sequences from MinION. Click here for additional data file. Table S1. Number of sequences obtained by the three different sequencing methods used in this study Sanger, MinION and Illumina for every holobiont, specimen and replicate. Click here for additional data file. Table S2. List of publicly available sequences used in Fig. 6. Click here for additional data file.

53 in total

Review 1. Replicate or lie.

Authors: James I Prosser
Journal: Environ Microbiol Date: 2010-04-28 Impact factor: 5.491

2. Wide occurrence of SSU rDNA intragenomic polymorphism in foraminifera and its implications for molecular species identification.

Authors: Alexandra Anh-Thu Weber; Jan Pawlowski
Journal: Protist Date: 2014-07-24

3. Biogeography and diversity of Collodaria (Radiolaria) in the global ocean.

Authors: Tristan Biard; Estelle Bigeard; Stéphane Audic; Julie Poulain; Andres Gutierrez-Rodriguez; Stéphane Pesant; Lars Stemmann; Fabrice Not
Journal: ISME J Date: 2017-03-24 Impact factor: 10.302

Review 4. Coming of age: ten years of next-generation sequencing technologies.

Authors: Sara Goodwin; John D McPherson; W Richard McCombie
Journal: Nat Rev Genet Date: 2016-05-17 Impact factor: 53.242

Review 5. Advancements in Next-Generation Sequencing.

Authors: Shawn E Levy; Richard M Myers
Journal: Annu Rev Genomics Hum Genet Date: 2016-06-09 Impact factor: 8.929

6. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform.

Authors: Martin Kircher; Susanna Sawyer; Matthias Meyer
Journal: Nucleic Acids Res Date: 2011-10-21 Impact factor: 16.971

7. Improved data analysis for the MinION nanopore sequencer.

Authors: Miten Jain; Ian T Fiddes; Karen H Miga; Hugh E Olsen; Benedict Paten; Mark Akeson
Journal: Nat Methods Date: 2015-02-16 Impact factor: 28.547

8. Fast and accurate de novo genome assembly from long uncorrected reads.

Authors: Robert Vaser; Ivan Sović; Niranjan Nagarajan; Mile Šikić
Journal: Genome Res Date: 2017-01-18 Impact factor: 9.043

9. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome.

Authors: Aaron M Wenger; Paul Peluso; William J Rowell; Pi-Chuan Chang; Richard J Hall; Gregory T Concepcion; Jana Ebler; Arkarachai Fungtammasan; Alexey Kolesnikov; Nathan D Olson; Armin Töpfer; Michael Alonge; Medhat Mahmoud; Yufeng Qian; Chen-Shan Chin; Adam M Phillippy; Michael C Schatz; Gene Myers; Mark A DePristo; Jue Ruan; Tobias Marschall; Fritz J Sedlazeck; Justin M Zook; Heng Li; Sergey Koren; Andrew Carroll; David R Rank; Michael W Hunkapiller
Journal: Nat Biotechnol Date: 2019-08-12 Impact factor: 54.908

Review 10. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy.

Authors: Franka J Rang; Wigard P Kloosterman; Jeroen de Ridder
Journal: Genome Biol Date: 2018-07-13 Impact factor: 13.583

1 in total

1. Does filter pore size introduce bias in DNA sequence-based plankton community studies?

Authors: Guolin Ma; Ramiro Logares; Yuanyuan Xue; Jun Yang
Journal: Front Microbiol Date: 2022-09-26 Impact factor: 6.064

1 in total