Martin A Smith, Tansel Ersavas, James M Ferguson, Huanle Liu, Morghan C Lucas, Oguzhan Begik, Lilly Bojarski, Kirston Barton, Eva Maria Novoa.
Abstract
Nanopore sequencing enables direct measurement of RNA molecules without conversion to cDNA, thus opening the gates to a new era for RNA biology. However, the lack of molecular barcoding of direct RNA nanopore sequencing data sets severely affects the applicability of this technology to biological samples, where RNA availability is often limited. Here, we provide the first experimental protocol and associated algorithm to barcode and demultiplex direct RNA nanopore sequencing data sets. Specifically, we present a novel and robust approach to accurately classify raw nanopore signal data by transforming current intensities into images or arrays of pixels, followed by classification using a deep learning algorithm. We demonstrate the power of this strategy by developing the first experimental protocol for barcoding and demultiplexing direct RNA sequencing libraries. Our method, DeePlexiCon, can classify 93% of reads with 95.1% accuracy or 60% of reads with 99.9% accuracy. The availability of an efficient and simple multiplexing strategy for native RNA sequencing will improve the cost-effectiveness of this technology, as well as facilitate the analysis of lower-input biological samples. Overall, our work exemplifies the power, simplicity, and robustness of signal-to-image conversion for nanopore data analysis using deep learning.
Year: 2020 PMID: 32907883 PMCID: PMC7545146 DOI: 10.1101/gr.260836.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Schematic overview of the direct RNA barcoding and demultiplexing strategy. (A) Overview of the Oxford Nanopore library preparation protocol for native RNA sequencing. (B) Adaptation of A to include custom DNA barcodes. (C) Barcode segmentation and transformation, where the electric current associated with a barcode adapter (highlighted in red) is extracted and converted into an image using GASF transformation. (D) Deep learning is used to classify the segmented and GASF-transformed squiggle signals into their corresponding bins, without the need to base-call the underlying sequence. The convolutional architecture of the final residual neural network classifier (ResNet-20) described in this work is shown. (FC) Fully connected layer.
Mapping statistics from direct RNA sequencing runs
Accuracy and average speed of signal to image conversions from 1000 runs
Figure 2. Barcode segmentation and signal transformation. A randomly selected example of barcode signal segmentation (red outline) for each of the four barcodes is shown with its corresponding GASF image below. An additional five randomly selected segmented barcode signals and their corresponding GASF images are shown for each of the four barcodes. Sequencing reads were drawn from replicate 2. (GASF) Gramian Angular Summation Field.
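The signal-to-image step named in this caption can be illustrated with the standard definition of a Gramian Angular Summation Field: the series is rescaled to [-1, 1], each value is mapped to an angle with arccos, and pixel (i, j) is the cosine of the sum of angles i and j. The sketch below is a minimal illustration of that textbook definition, not the DeePlexiCon implementation; the function name `gasf` is hypothetical, and any resizing or downsampling the authors may apply to the segmented signal is not shown.

```python
import numpy as np

def gasf(signal):
    """Return the Gramian Angular Summation Field of a 1D signal.

    Hypothetical helper for illustration only; it follows the
    standard GASF definition, not the authors' code.
    """
    x = np.asarray(signal, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("a constant signal cannot be rescaled to [-1, 1]")
    # Min-max rescale to [-1, 1] so that arccos is defined everywhere.
    scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    # Guard against tiny floating-point overshoot outside [-1, 1].
    scaled = np.clip(scaled, -1.0, 1.0)
    # Polar encoding: each sample becomes an angle in [0, pi].
    phi = np.arccos(scaled)
    # Pixel (i, j) is cos(phi_i + phi_j); the result is a symmetric
    # n x n image for a signal of length n.
    return np.cos(phi[:, None] + phi[None, :])

image = gasf([0.0, 1.0, 2.0])
```

For a segment of n current samples this yields an n by n symmetric array, which is the kind of image a 2D convolutional classifier can consume.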
Accuracy and training time of two residual neural networks on 4x Tesla V-100 GPUs
Accuracy and recovery of ResNet-20 on the testing set, validation set, and two independent replicates
Figure 3. Performance of the 2D convolutional neural network barcode classifier. (A) Receiver operating characteristic (ROC) analysis and area under the curve (AUC) metrics of the final model on three evaluation sets: (1) Replicates 2–4 validation set (left column), which was generated from the same sequencing runs used to train the model but was withheld from training; (2) Replicate 1 set (middle column), composed of reads generated using the RNA001 library kit; and (3) Replicate 5 set (right column), derived from an independent sequencing run using the RNA002 kit. The optimal Youden index (J statistic) is marked as a black cross on the ROC curve. (B) The associated precision-recall curves on the three evaluation sets. (C) Accuracy (black) and percentage of reads recovered (blue) as a function of the scoring threshold (cut-off) emitted by the trained model, for the three data sets presented in A.
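The trade-off in panel C, and the Youden index marked in panel A, can both be computed from per-read classifier scores. The sketch below assumes each read has a confidence score, a predicted barcode, and a known true barcode; the function names and the toy values are hypothetical and are not taken from the paper's data. Youden's J is computed here for a single barcode treated one-vs-rest, as sensitivity minus the false positive rate at a given cut-off.

```python
import numpy as np

def recovery_and_accuracy(scores, predicted, truth, cutoff):
    """Fraction of reads kept at a score cut-off, and accuracy among them."""
    scores = np.asarray(scores, dtype=float)
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    kept = scores >= cutoff            # reads passing the threshold
    recovered = kept.mean()            # fraction of all reads retained
    if not kept.any():
        return float(recovered), float("nan")  # accuracy undefined
    accuracy = (predicted[kept] == truth[kept]).mean()
    return float(recovered), float(accuracy)

def youden_j(scores, is_positive, cutoff):
    """Youden's J = TPR - FPR for one barcode (one-vs-rest) at a cut-off."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    called = scores >= cutoff
    tpr = (called & is_positive).sum() / is_positive.sum()
    fpr = (called & ~is_positive).sum() / (~is_positive).sum()
    return float(tpr - fpr)

# Toy values for illustration only.
scores = [0.99, 0.95, 0.70, 0.55]
predicted = [1, 2, 3, 4]
truth = [1, 2, 3, 1]
lenient = recovery_and_accuracy(scores, predicted, truth, cutoff=0.5)
strict = recovery_and_accuracy(scores, predicted, truth, cutoff=0.9)
j = youden_j([0.9, 0.8, 0.3, 0.1], [True, True, False, False], cutoff=0.5)
```

Raising the cut-off discards low-confidence reads, so recovery falls while accuracy among the retained reads tends to rise, which is the pattern behind the two operating points quoted in the abstract.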