Markus Joppich1, Margaryta Olenchuk1, Julia M Mayer1, Quirin Emslander2, Luisa F Jimenez-Soto3, Ralf Zimmer1. 1. LFE Bioinformatics, Department of Informatics, Ludwig-Maximilians-Universität München, 80333 München, Germany. 2. Physics of Synthetic Biological Systems, Physics Department, Technische Universität München, 85748 Garching, Germany. 3. Walther Straub Institute for Pharmacology and Toxicology, Ludwig-Maximilians-Universität München, Goethestrasse 33, 80336 München, Germany.
Abstract
The MinION sequencer by Oxford Nanopore Technologies turns DNA and RNA sequencing into a routine task in biology laboratories and in field research. Downstream analysis requires a sufficient number of target reads. Prokaryotic or bacteriophage sequencing samples in particular can contain a significant amount of off-target sequences, stemming from human DNA/RNA contamination, insufficient rRNA depletion, or remaining DNA/RNA from other organisms (e.g. the host organism from bacteriophage cultivation). Such impurities, contaminations and off-targets (ICOs) block read capacity and require deeper sequencing. In contrast to second-generation sequencing, MinION sequencing allows the chip to be reused after a (partial) run. The same chip can thus be used with more sample, even after the library preparation has been adjusted to reduce ICOs. The earlier a sample's ICOs are detected, the better the sequencing chip can be conserved for future use. Here we present sequ-into, a low-resource and user-friendly cross-platform tool to detect ICO sequences from a predefined ICO database in samples early during a MinION sequencing run. The data provided by sequ-into empowers the user to quickly take action to preserve sample material and chip capacity. sequ-into is available from https://github.com/mjoppich/sequ-into.
Long-read sequencing is rapidly becoming common practice in molecular biology. In 2018, more than 130 articles mentioning MinION and 280 articles mentioning PacBio were published. Great advances have been made in terms of feasibility, cost, throughput, and read length, now delivering single bacterial reads of more than one million base pairs in length [1]. Oxford Nanopore (MinION) sequencing is becoming increasingly popular, with diverse applications such as plant pathogen identification [2], virology [3], or botany [4]. One of its major advantages is portability, allowing in-the-field sequencing, e.g. screening for pathogens [5] or new species under arctic conditions [6] - and even on the International Space Station [7].
One of the most important requirements for successful sequencing is sample purity, whether in the lab or out in the field. However, samples containing off-target reads are still common [8], [9]. A reduced number of target sequences complicates correct, high-quality downstream analysis of sequencing data. A low number of target reads may, for instance, affect transcriptomic analyses (e.g. differential expression) or reduce the evidence for specific splice isoforms. On a genomic scale, it has been reported that (public) genome assemblies contain sequences that most likely originate from contamination [9].
Particularly with MinION sequencing, the sequencing time is not fixed: a run can be aborted at any time, or new material can be added for sequencing. Thus, the general success criterion of a sequencing experiment might not be the total yield of (on-target) sequences, but instead the detection (or absence) of certain target sequences. An interactive analysis of the sequenced reads can help to decide whether a sequencing run can be successfully concluded, or should be aborted because it will not yield the necessary data of the intended target in the required quality.
Several tools have been developed since the public introduction of the MinION sequencer in 2012. Among these are NanoOK [10], RUBRIC [11], What’s in my Pot (WIMP) [12] and npAnalysis [13]. Each of these addresses a particular problem. NanoOK is a toolkit to assess descriptive statistics from MinION sequencing runs. With RUBRIC, selective sequencing can be performed by ejecting unwanted sequences from the pore, which requires a complex dual-computer setup. WIMP is built into Metrichor, which requires a paid subscription. Finally, npAnalysis provides a streaming server for Nanopore sequencing reads, which is capable of detecting sequenced organisms, similar to WIMP. While this allows an online analysis, the setup and usage are rather sophisticated.
During the preparation of DNA or RNA for sequencing, several steps, including enzymatic reactions, can hamper the quality of the samples, e.g. inefficient rRNA depletion. In metagenomics, success is determined by the choice of correct and efficient primers [14]. In both cases, the detection of off-target sequences or specific organisms could be done directly while sequencing, right after the first actual reads of the sample are available. The sooner ICO sequences in the sample are detected, the more chip capacity can be rescued for further use.
Here we describe the applicability of sequ-into for online detection of sample ICOs during the sequencing run, or after the sequencing has been concluded. sequ-into provides an online, descriptive overview of the sequenced reads, cross-platform compatibility (Windows, MacOS, Linux) and an easy installation combined with a graphical user interface on a typical laptop computer. Using state-of-the-art long-read alignments, sequ-into can be of great help for on-/off-target analysis when performing laboratory protocol optimization, enabling a rapid assessment of sequenced reads. It allows genomes of interest to be added, which can then be specifically targeted during analysis. By providing a descriptive overview of the sequencing run and its alignment to the selected on-/off-target references, sequ-into allows an easily comprehensible and sharable analysis of a sequencing run. Such a setup reflects many real-world scenarios, including the more widespread in-the-field usage of the MinION device.
Material & methods
Sequencing data
The biological samples were prepared as described in the supplementary information. Sequencing time and yield differed per sample and are summarized in Table 2. Additional external phage DNA reads were downloaded from EMBL-EBI under accession id PRJEB8318 (Jain et al. [15]).
Table 2
Summary of the sequencing runs analysed by sequ-into. The number of reads refers to the number of basecalled reads.
| Run ID | Sequence Type | Duration of Sequencing Run | Number of reads | Off-target rate (%) |
|---|---|---|---|---|
| 1 | CP & EP Phage DNA | 3:42h | 65,964 | 42.11 |
| 2 | Kit Phage DNA | 2:00h | 14,750 | 7.66 |
| 3 | DNAseI Phage DNA | 2:00h | 20,756 | 2.15 |
| 11 | H. pylori RNA | 6:00h | 26,145 | 63.52 |
| 12 | H. pylori RNA | 5:50h | 15,332 | 57.21 |
| 13 | H. pylori RNA | 3:00h | 22,940 | 55.35 |
| 21 | H. pylori RNA | 5:00h | 24,540 | 65.86 |
| 1117 | E. coli phage genome | 59:46h | 108,026 | 9.37 |
| 1118 | E. coli phage genome | 59:40h | 103,384 | 10.01 |
| 1121 | E. coli phage genome | 59:40h | 49,720 | 9.47 |
Read extraction
Only basecalled FAST5 reads can be used for read extraction. Thus, the read extraction script (extract_fast5.py) of sequ-into relies on the live-basecalling functionality of MinKNOW. It can extract basecalled sequences from one or more locations (e.g. folders of de-multiplexed reads) containing FAST5 or FASTQ files. If sequ-into performs the read extraction from a folder containing FAST5 files, it produces, besides the reads (in FASTQ format), a file containing each read name and its creation time. This allows further analysis of the off-target rate over sequencing time.
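The following is a minimal sketch of this extraction step for the FASTQ case only; it is not the code of extract_fast5.py. It assumes four-line FASTQ records and that the header carries a `start_time=` field, as is common for reads basecalled by MinKNOW; the function name `first_reads` and the default limit are hypothetical.

```python
import re

# Assumption: the basecaller writes "start_time=<timestamp>" into the FASTQ header.
START_TIME = re.compile(r"start_time=(\S+)")


def first_reads(fastq_handle, limit=1000):
    """Yield (name, sequence, start_time) for the first `limit` FASTQ records.

    start_time is None when the header has no start_time field.
    """
    taken = 0
    while taken < limit:
        header = fastq_handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError("unexpected FASTQ header: " + header.strip())
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()   # '+' separator line
        fastq_handle.readline()   # quality line (not needed here)
        description = header[1:].strip()
        name = description.split()[0]
        match = START_TIME.search(description)
        start_time = match.group(1) if match else None
        taken += 1
        yield name, sequence, start_time
```

Writing the yielded read names and start times to a tab-separated file would give a table comparable to the read-name/creation-time file described above.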
Software
sequ-into is implemented using Electron, a framework for creating native applications with web technologies like JavaScript, HTML, and CSS [16]. The user interface is developed in TypeScript [17]/React [18] based on MaterialUI [19]. The read extraction from FAST5 folders is performed using the extract_fast5.py script described above. Reads are aligned to the given references in python using mappy, the python wrapper for Minimap2, an aligner specialized for long-read alignments [20]. The alignment and the generation of all statistics, figures and the HTML report are coordinated by our python server (startAlignmentServer.py, included in sequ-into). This server can increment existing results and thereby enables online processing of the input reads. sequ-into uses the Windows Subsystem for Linux to ensure compatibility on Microsoft Windows. All required python dependencies can be installed using pip [21].
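As an illustration of the alignment step, the sketch below shows how reads could be classified as aligned or unaligned with mappy. It is not taken from startAlignmentServer.py: the function names, the `map-ont` preset choice and the file names in the usage note are assumptions. The mapper is passed in as a function so that the counting logic can be used without mappy installed.

```python
def make_mappy_mapper(reference_fasta, preset="map-ont"):
    """Return a function that maps a read sequence to a list of mappy hits.

    Assumption: the minimap2 preset 'map-ont' is appropriate for the reads.
    """
    import mappy  # third-party package, installable via pip

    aligner = mappy.Aligner(reference_fasta, preset=preset)
    if not aligner:
        raise RuntimeError("failed to load or build the index for " + reference_fasta)
    return lambda sequence: list(aligner.map(sequence))


def aligned_fraction(reads, map_read):
    """Fraction of reads with at least one alignment.

    reads: iterable of (name, sequence); map_read: sequence -> list of hits.
    """
    total = 0
    aligned = 0
    for _name, sequence in reads:
        total += 1
        if map_read(sequence):
            aligned += 1
    return aligned / total if total else 0.0
```

With mappy installed, reads could be streamed with `mappy.fastx_read("reads.fastq")`, which yields (name, sequence, quality) tuples, and mapped with `make_mappy_mapper("reference.fa")`; both file names are placeholders.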
Benchmark
sequ-into internally uses mappy, the python wrapper for minimap2 [20]. Hence, the mapping accuracy of sequ-into is mainly determined by the minimap2 performance on the MinKNOW basecalled reads. We perform two benchmarks: one on a mixture of simulated E. coli and bacteriophage (Escherichia phage ADB-2) reads, and another on a metagenomics dataset. The reads were simulated using NanoSim [22] version 2.1.0 with the Nanopore R9 1D profile provided by the NanoSim authors. For the metagenomics benchmark, the sequencing data from Edwards et al. [23] under accession PRJEB30868 (ERX3139117-ERX3139119) were used. These are sequences generated from MinION sequencing of the ZymoBIOMICS Microbial Community Standard from Zymo Research [24].
Riboseq library
Ribosomal RNA contamination is detected by aligning the reads against a library of microbial ribosomal RNA sequences: the riboseq library. Using the list of available bacterial genomes from EMBL-EBI [25], one representative strain (the first in the list) is downloaded for each species. This keeps the additional disk space requirement small. From those genomes, all sequences that are either directly annotated as rRNA or whose product (or any other description) is annotated as ribosomal RNA are extracted. Currently included eukarya are Homo sapiens, Rattus norvegicus, Mus musculus, Arabidopsis thaliana, Caenorhabditis elegans and Danio rerio. For these species, ribosomal RNA sequences provided by Rfam [26] in the RNA families Bacterial small subunit ribosomal RNA (RF00177) and Eukaryotic small subunit ribosomal RNA (RF01960) are available. Some eukaryotic rRNA sequences are also contained in RF00177, which is therefore included as well. In total, ribosomal RNA sequences for 1485 species are accessible from within sequ-into, searchable via a text input with auto-completion.
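The selection rule for bacterial genomes (annotated as rRNA, or any description mentioning ribosomal RNA) can be sketched as a simple filter. The feature representation used here (a dict with `type` and `qualifiers` keys) is a hypothetical stand-in for parsed genome annotations, not the data structure used by sequ-into.

```python
def is_rrna_feature(feature_type, qualifiers):
    """True if a feature is annotated as rRNA or described as ribosomal RNA."""
    if feature_type == "rRNA":
        return True
    # Check the product and every other description field.
    return any("ribosomal rna" in str(value).lower() for value in qualifiers.values())


def select_rrna(features):
    """Keep only features that pass the rRNA test.

    features: iterable of dicts with the (assumed) keys 'type' and 'qualifiers'.
    """
    return [f for f in features if is_rrna_feature(f["type"], f["qualifiers"])]
```

A plain substring test like this would also keep entries such as an rRNA-modifying enzyme whose product name contains "ribosomal RNA"; whether and how sequ-into excludes such cases is not stated in the text.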
Results & discussion
sequ-into is available as cross-platform software providing means of interaction (e.g. file dialogs) that users know from everyday computer usage (Fig. A.2). It has been designed such that the user can perform an off-target analysis using an easy-to-follow workflow. Each step of this analysis is supported by a brief graphical and written description. Providing a GUI makes the application accessible to most scientists [27], [28].
Fig. A.2
sequ-into provides explanations in its graphical user-interface leading through each step of the off-target analysis (here: selecting correct input).
While we anticipate the online use of sequ-into (while sequencing), it can also be used to analyse reads after the sequencing run has been concluded (post-sequencing). To start the off-target read detection, the user can choose the current sequencing folder (online or post-sequencing detection) or regular FASTQ files (post-sequencing detection). sequ-into is designed to work with both FASTQ and FAST5 files. In the latter case (FAST5, real-time basecalling), it extracts the first thousand reads in FASTQ format for further analysis (or all, if requested), taking advantage of the live-basecalling functionality of MinKNOW. In our data, using the MinKNOW live-basecalling introduces an average delay of 48 s in sequence availability, with a maximal delay of 100 s.
The input reads are aligned against given reference sequences, e.g. genomes or rRNA sequences. By default, an Escherichia coli K-12 MG1655 genome is included in the distribution. In addition, sequences from the riboseq library (ribosomal RNAs from over 1400 organisms, see Material & methods) can be selected here. The user may also provide custom genomes in FASTA format, which allows custom ICO libraries to be prepared. Any given reference may be defined as either an on- or off-target sequence, depending on whether it is the intended target sequence or an ICO.
Mappy, the python wrapper for Minimap2 [20], is used to align reads. This has several advantages. First, it eases installation because sequ-into only depends on python tools, which are all installable via the standard python package installer pip [21]. Second, no intermediate SAM/BAM files are written to disk, so no I/O bandwidth is taken from the actual sequencing and basecalling process. In addition, it would be risky to use BAM files to store alignments of ultra-long (genomic) reads due to the CIGAR size limit in the BAM format.
Aligning the selected reads against the references yields the off-target rate (in terms of target versus off-target sequences). For example, if bacterial RNA is to be sequenced and the user defines the bacterial rRNA as off-target reference, all aligned reads originate from the off-target sequence and hence stem from ICOs. The off-target rate is then “% aligned reads”. Conversely, if the user specifies the transcriptome or genome of the intended species, all aligned reads are considered on-target reads, and the on-target rate is “% aligned reads”.
Analysing transcriptomic reads may incur extra complexity, particularly in eukaryotes, due to their intron/exon structure. Counting the aligned bases of eukaryotic transcriptomes requires the handling of intronic gaps. The user can choose to ignore aligned fragments with CIGAR code N in the calculations, e.g. because the reference contains intronic regions not present in the sequenced sample.
sequ-into uses an online algorithm for calculating the required statistics and alignments. Once started, the alignment server is ready to start or update an analysis by first loading any existing results, updating these results with the statistics of the newly processed reads, and, finally, saving the combined result for the next iteration.
As final output, sequ-into provides an overview of the performed alignment via the on- and off-target rate, the fraction of aligned bases and an analysis of the on-target rate over time (Fig. 1). In addition to the descriptive overview, sequ-into also shows a notification if the off-target content of the sample exceeds a threshold (Fig. 1).
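The load-update-save cycle and the optional exclusion of CIGAR code N can be sketched as follows. This is an illustration under stated assumptions, not the implementation in sequ-into: the JSON layout, the function names and the use of only the first hit per read are hypothetical. The CIGAR representation (a list of [length, operation] pairs with operation code 3 for N) follows the BAM convention that mappy also uses.

```python
import json
import os

CIGAR_N = 3  # BAM operation code for a skipped reference region (e.g. an intron)


def alignment_bases(r_st, r_en, cigar, ignore_n=False):
    """Length of the alignment on the reference, optionally without N gaps."""
    span = r_en - r_st
    if ignore_n:
        span -= sum(length for length, op in cigar if op == CIGAR_N)
    return span


def update_results(result_path, new_reads, ignore_n=False):
    """Load existing results, add statistics of unseen reads, save and return them.

    new_reads: iterable of (name, read_length, hits), where hits is a list of
    (r_st, r_en, cigar) tuples; only the first hit is counted in this sketch.
    """
    if os.path.exists(result_path):
        with open(result_path) as handle:
            state = json.load(handle)
    else:
        state = {"seen": [], "reads": 0, "aligned_reads": 0,
                 "bases": 0, "alignment_bases": 0}
    seen = set(state["seen"])
    for name, read_length, hits in new_reads:
        if name in seen:          # already processed in an earlier round
            continue
        seen.add(name)
        state["reads"] += 1
        state["bases"] += read_length
        if hits:
            state["aligned_reads"] += 1
            r_st, r_en, cigar = hits[0]
            state["alignment_bases"] += alignment_bases(r_st, r_en, cigar, ignore_n)
    state["seen"] = sorted(seen)
    with open(result_path, "w") as handle:
        json.dump(state, handle)
    return state


def aligned_read_percent(state):
    """'% aligned reads': the off-target rate for an off-target reference,
    the on-target rate for an on-target reference."""
    return 100.0 * state["aligned_reads"] / state["reads"] if state["reads"] else 0.0
```

Calling `update_results` repeatedly with the same result file processes each read name only once, so the statistics grow with the sequencing run without re-aligning earlier reads.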
Fig. 1
A-C: The steps from (raw) sequencing data to output from sequ-into. D: Example run of sequ-into: result for a Helicobacter pylori RNA-seq example. Here, the aligned reads originate from H. pylori ribosomal RNA (off-target reference).
Due to varying read lengths, the number of (un-)aligned bases is considered as well. On the base level, two measures are useful: the length of the alignment on the reference (alignment bases) and the number of matching bases in the alignment (aligned bases). While the first measure is important to determine how well the reference is covered, the latter also gives an estimate of the alignment quality (regarding substitutions). Explanations of how to interpret the reported descriptive values help users to understand them and reduce the chance of misunderstandings.
If reads are extracted from FAST5 files and enough reads are available, sequ-into provides a plot showing a binned histogram of the alignment ratio of the reads (e.g. for the first reads of a run). This analysis is of particular interest for mixed samples, e.g. phage DNA sequencing with leftover phage host DNA, because changes in the off-target rate have been observed.
The read length distribution and the results of the alignment analysis are shown in a result summary, supported by pie charts of aligned reads and bases, and histograms of read length distributions. Besides the output in the sequ-into app (Fig. 1), where this overview is displayed, sequ-into saves the created plots together with an HTML report, which can easily be shared among colleagues.
In order to save computational resources, sequ-into uses an online and incremental algorithm. Before the alignment and read extraction, existing results are loaded. Only new reads are extracted and further processed. Alignment counts are updated incrementally, and the descriptive statistics are updated and stored for the next analysis round. Thus, sequ-into runs on laptop computers, matching the portability of the MinION sequencer. The analysis of reads with suspected E. coli contamination took 12 s including read extraction from FAST5 files on a Microsoft Windows 10 laptop with an Intel i7-7820HQ CPU and 32 GB RAM. Even on a more mainstream and (computationally) less powerful Microsoft Surface Book with 16 GB RAM and a 128 GB SSD, the sample was analysed in less than 10 s. For detecting ribosomal off-target sequences in Helicobacter pylori transcriptomic reads (Run 11), less than 10 s are needed (without read extraction) on the Windows 10 laptop. Neither sequ-into nor the live basecalling caused a bottleneck in this analysis. sequ-into can thus be used directly side-by-side with the MinION sequencer, either in the field or in the lab.
Two benchmarks have been performed to assess the accuracy and correctness of sequ-into. The results are shown in Table 1.
Table 1
Summary of the benchmarking results for simulated and metagenomics data. In the metagenomics dataset, the number of total aligned reads (149,742) is determined by the number of reads in the dataset and the number of unaligned reads.
| Organism | Reads | Reads (after change) | Fraction Measured | Theoretic Fraction | Expected Fraction | Difference Theory | Difference Expected |
|---|---|---|---|---|---|---|---|
| NanoSim E. coli | 49,996 | | 66.66% | 66.6% | 66.6% | - | - |
| NanoSim E. phage ADB-2 | 24,998 | | 33.33% | 33.33% | 33.3% | - | - |
| Meta: Pseudomonas aeruginosa | 15,096 | | 10.08% | 12.00% | 12.50% | −1.92% | −2.42% |
| Meta: Escherichia coli | 27,698 | 20,407 | 13.63% | 12.00% | 12.50% | 1.63% | 1.13% |
| Meta: Salmonella enterica | 26,106 | 18,815 | 12.56% | 12.00% | 12.50% | 0.56% | 0.06% |
| Meta: Lactobacillus fermentum | 15,904 | | 10.62% | 12.00% | 12.30% | −1.38% | −1.68% |
| Meta: Enterococcus faecalis | 18,624 | | 12.44% | 12.00% | 9.50% | 0.44% | 2.94% |
| Meta: Staphylococcus aureus | 24,036 | | 16.05% | 12.00% | 12.00% | 4.05% | 4.05% |
| Meta: Listeria monocytogenes | 21,653 | | 14.46% | 12.00% | 12.50% | 2.46% | 1.96% |
| Meta: Bacillus subtilis | 19,901 | | 13.29% | 12.00% | 14.70% | 1.29% | −1.41% |
| Meta: Saccharomyces cerevisiae | 3,560 | | 2.38% | 2.00% | 2.08% | 0.38% | 0.30% |
| Meta: Cryptococcus neoformans | 2,960 | | 1.98% | 2.00% | 1.56% | −0.02% | 0.42% |
For the simulated dataset, E. coli and E. phage ADB-2 reads were simulated in a ratio of 2:1 (Table 1). Of all simulated reads, only 6 fail to align. Otherwise sequ-into, or rather the underlying minimap2 mapper, performs perfectly.
More interesting is the metagenomics benchmark. Here, the number of aligned reads deviates slightly from the number of expected reads and the expected fraction, respectively. For both yeast species, the fraction of identified reads matches the expected ratio well. Staphylococcus aureus is more prevalent than expected, but only by about 4 percentage points, which is the maximal deviation observed in the benchmark. The high read counts for Escherichia coli and Salmonella enterica stand out (Table 1). They arise because reads align to both genomes. sequ-into does not try to untangle this, but reports these multi-mapping reads instead (Fig. 2). When half of these reads are assigned to each organism, the aligned fraction resembles the expected fraction well enough. That the expected fraction already differs from theory is known: DNA extraction and library preparation may induce a bias. This is also reported by the manufacturer.
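The arithmetic behind the metagenomics rows of Table 1 can be retraced with a short sketch. The read counts, the total of 149,742 aligned reads and the theoretic fraction of 12% are taken from Table 1. The number of shared reads (14,582) is not given in the text; it is an assumption derived from the difference of 7,291 reads between the "Reads" and "Reads (after change)" columns for both organisms. The function names are hypothetical.

```python
TOTAL_ALIGNED = 149_742  # total aligned reads in the metagenomics dataset (Table 1)


def split_shared(counts, pair, shared):
    """Assign half of the multi-mapping reads to each organism of a pair.

    Each organism initially counts all shared reads, so half of them
    is subtracted from each count.
    """
    adjusted = dict(counts)
    for organism in pair:
        adjusted[organism] -= shared // 2
    return adjusted


def percent(count, total=TOTAL_ALIGNED):
    """Measured fraction in percent, rounded to two decimals as in Table 1."""
    return round(100.0 * count / total, 2)


def deviation(count, reference_percent, total=TOTAL_ALIGNED):
    """Difference between the measured and a reference fraction (percentage points)."""
    return round(percent(count, total) - reference_percent, 2)


counts = {
    "Escherichia coli": 27_698,
    "Salmonella enterica": 26_106,
    "Staphylococcus aureus": 24_036,
}
# Assumption: 14,582 reads align to both E. coli and S. enterica.
adjusted = split_shared(counts, ("Escherichia coli", "Salmonella enterica"), shared=14_582)
```

With these inputs, `adjusted` reproduces the "Reads (after change)" column for E. coli and S. enterica, and `percent` and `deviation` reproduce the corresponding "Fraction Measured" and "Difference Theory" entries.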
Fig. 2
Upset plot showing the number of aligned reads per expected genome of the metagenomics sample (Zymo Research Mock Community). It can be seen that most reads only map to a single organism. Noticeable are the reads which are shared by S. enterica and E. coli.
Given the low deviations from the expected fractions in the metagenomics sample, sequ-into/minimap2 performs well. This is supported by the simulated reads, where all but 6 reads are assigned correctly. This is no surprise, since the accuracy of sequ-into is largely determined by its underlying mapper, minimap2, which achieves an alignment rate of 98% and more for long reads [20].
Use-cases
To demonstrate that sequ-into supports lab experiments, three use-cases are presented. The first one demonstrates how sequ-into helps in the sequencing of genomic samples with high off-target susceptibility. Here, the protocol for extracting phage DNA and depleting E. coli host DNA was improved in only a few rounds of the rapid prototyping cycle (Fig. A.1).
Fig. A.1
The rapid prototyping cycle performed to establish the new protocol. The quality-control step was performed using sequ-into.
The second use-case analyses a transcriptome sequencing project (post-sequencing analysis) and shows how sequ-into could have helped to improve sample quality by detecting a high ribosomal RNA content in the first experiment. This post analysis led to a better ribosomal RNA depletion before the sequencing of further samples.
In the third use-case an external, publicly available dataset is re-analysed regarding its off-target rate.
The sequencing details are given in Table 2.
Case I: DNA purification
In this use-case a DNA sequencing analysis targeting phage DNA was performed. The practical problem is to determine levels of host (E. coli) DNA contamination after phage isolation for faster evaluation of extraction protocols. For later applicability, host DNA levels must be as low as possible.
MinION sequencing was used to assess the purity of the extracted DNA. In three rounds the purification protocol was improved (Table 2). The initial run (run 1) used only chloroform-phenol extraction and ethanol precipitation for DNA extraction and shows a peak E. coli off-target rate of 80%. For run 2, a standard phage isolation kit was used for DNA extraction, leading to off-target rates below 20%. Final adjustments with DNAseI incubation led to an off-target rate below 5% (run 3, details in A.1).
Using sequ-into, the experimenters are able to analyse their data conveniently. This enables rapid prototyping of the laboratory protocol, as the results are available directly after sequencing (Fig. A.1). In case of unwanted or bad results, a new strategy can be tested without first having to wait for the bioinformaticians to finish the analysis. The report function of sequ-into allows experimenters to easily share the report and discuss the results within a team.
Case II: RNA sequencing
The second use-case is from a Helicobacter pylori transcriptomic sequencing project (Table 2, methods in Supplement A.2). Common RNA purification techniques like poly-A-tail selection do not work in bacteria. Only ribosomal RNA depletion kits can remove rRNA, using enzymatic reactions. This makes rRNA depletion particularly important for transcriptomic sequencing, considering that rRNA can make up more than 85% of a cell’s RNA [29] while not giving any information about transcriptional regulation. After applying an enzymatic rRNA depletion to the input library, the initial rRNA content in the experiments performed was between 58–65% per sample, considering either the initial part of the total sequencing time (data not shown) or the first sequenced reads (Fig. A.3). Using sequ-into after data collection, it was possible to determine how well the library preparation and the rRNA depletion had worked. In the presented cases, the sequencing yielded enough evidence of a high rRNA content after only 10 min. With this result in mind, further measures could then be taken to deplete the rRNA to the desired level more efficiently (99% purity has been reported [30]). Knowing the rRNA fraction of a sequencing sample as soon as possible saves valuable sequencing time (and costs).
Fig. A.3
Off-target rate per bucket of reads in Helicobacter pylori transcriptome sequencing (with rRNA as off-target). It can be seen that the ribosomal RNA content is conserved over read buckets. The first reads already give an estimate for the overall off-target rate.
Case III: Analysing the off-target ratio over time
sequ-into has been developed in the context of phage genome sequencing with a focus on assessing sample (im-)purity. Besides the descriptive final overview, we also wanted to check whether the content of the host organism (E. coli) remains constant during sequencing. For this reason, sequ-into analyses the off-target rate for consecutive sets of reads.
In the off-target rate plot of the phage DNA sequencing (Fig. A.4, E. coli genome as reference) we observed a high fraction of reads originating from E. coli at the beginning, which decreased towards the end. Such an effect was not observed in the transcriptomic data (Fig. A.3).
Fig. A.4
Off-target rate per bucket of reads in phage genome sequencing (with E. coli as off-target). It can be seen that the off-target reads are decreasing for later read buckets. The first ten buckets seem to be an estimate for the upper bound of the off-target rate.
Not knowing whether this observation is specific to our data (e.g. due to library preparation), we also analysed the FAST5 raw data from a public dataset (accession id PRJEB8318, runs 1117, 1118 and 1121) [15]. In that experiment, E. coli str. K12 substr. ER2738 is used as host organism, which is closely related to the E. coli str. K-12 substr. MG1655 genome available from sequ-into and used here. In this data we made similar observations regarding the contamination over the number of sequenced reads (Fig. 3). For all three runs the off-target rate decreases: the fraction of reads originating from E. coli rises at the beginning of the run before getting lower. Again, similar to the phage DNA sequencing in use-case 1 (Fig. A.4), the read buckets at the start of the sequencing run align better to E. coli than those afterwards. Nonetheless, even such data can be successfully analysed by sequ-into. In concordance with the results from use-case 1, we show that even with this surprising behaviour of the sequenced reads, the first few thousand reads are a useful estimate of the overall off-target rate.
Fig. 3
Fraction of off-target reads for the external phage DNA dataset (accession id PRJEB8318, runs 1117, 1118 and 1121). The reads were binned into buckets of 1000 reads with respect to their sequencing order over time. For the buckets of the 3 runs the respective percentage of ICOs is shown.
In the real-time mode, sequ-into extracts only the first thousand reads of an experiment by default - the user can change this. In our analysis we have considered two frequent scenarios: the detection of ICOs in either transcriptomic reads (RNAseq) or genomic reads (DNAseq). With the genomic/phage samples we observed that occasionally off-target (E. coli) reads are slightly more frequent in the first few thousand reads (Fig. A.4 and Fig. 3) but then remain constant throughout the sequencing runs (Fig. 1). Thus, the first (few) reads already provide a useful estimate of the overall off-target rate or its upper bound. Analysing more reads is not necessary for fast decision making, yet possible with sequ-into.
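The per-bucket off-target rate shown in Fig. 3, Fig. A.3 and Fig. A.4 can be computed along the following lines. This is a minimal sketch, assuming each read is available as a pair of creation time and a flag stating whether it aligned to the off-target reference; the function name and the input layout are hypothetical.

```python
def bucket_off_target_rates(records, bucket_size=1000):
    """Off-target fraction per bucket of reads in sequencing order.

    records: iterable of (creation_time, is_off_target); creation times must be
    mutually comparable (e.g. ISO 8601 strings or numbers).
    The last bucket may contain fewer than bucket_size reads.
    """
    ordered = sorted(records, key=lambda record: record[0])
    rates = []
    for start in range(0, len(ordered), bucket_size):
        bucket = ordered[start:start + bucket_size]
        off_target = sum(1 for _time, flag in bucket if flag)
        rates.append(off_target / len(bucket))
    return rates
```

A decreasing sequence of rates would correspond to the behaviour observed for the phage DNA runs, while a flat sequence would correspond to the transcriptomic data.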
Conclusion
sequ-into offers a cross-platform graphical user interface and uses state-of-the-art long-read alignment software such that everyone can perform an on-/off-target analysis, even during the sequencing run (use-case 1).
It can detect large fractions of ribosomal RNA early in a sequencing experiment. If applied early in the sequencing project, sequ-into can reveal a high rRNA content and thereby help to avoid a significant loss of reads to ICOs (use-case 2). Additionally, users have easy access to our riboseq library, with ribosomal RNA sequences for more than 1400 species, from within sequ-into.
Using sequ-into we investigated the (im-)purity of several phage DNA sequencing runs (use-case 3). We observed that the sequenced reads stem more frequently from the off-target (E. coli) at the beginning of a sequencing run than towards the end. Still, these results show that the first few thousand reads provide a useful estimate of the overall off-target rate in all evaluated cases.
From within sequ-into, mappy, the python wrapper for minimap2, aligns the reads to the references. For easy sharing of the results, and for later reference, sequ-into creates an HTML report for each analysis. In our use-cases, we observe that a few reads are already sufficient to obtain a useful estimate of the off-target rate, so that results are available in less than a minute, even on a typical laptop computer. sequ-into supports the idea of fast protocol optimization at very low cost, by everyone and at any place.
Program and Data Availability
sequ-into is available from GitHub (https://github.com/mjoppich/sequ-into) with demo data. Documentation is available online (https://sequ-into.readthedocs.io/en/latest/). sequ-into has been tested on Windows 10 Build 18363 with the Ubuntu 18.04 LTS Windows Subsystem for Linux app. sequ-into has also been tested on Mac OS X 10.15 and Xubuntu 18.04.2 LTS.
Funding
This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) via SFB 1123/2/Z2 (MJ/RZ) and DFG JI 221/1-1 (LFJS).
CRediT authorship contribution statement
Markus Joppich: Software, Conceptualization, Methodology, Visualization, Validation, Data curation, Supervision, Writing - original draft, Writing - review & editing. Margaryta Olenchuk: Software, Visualization, Data curation, Methodology, Writing - original draft, Writing - review & editing. Julia Mayer: Software, Visualization, Data curation, Methodology, Writing - original draft, Writing - review & editing. Quirin Emslander: Investigation, Resources, Conceptualization, Validation, Writing - review & editing. Luisa F. Jimenez-Soto: Investigation, Resources, Writing - review & editing. Ralf Zimmer: Conceptualization, Resources, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.