
QuipuNet: Convolutional Neural Network for Single-Molecule Nanopore Sensing.

Karolis Misiunas¹, Niklas Ermann¹, Ulrich F Keyser¹.

Abstract

Nanopore sensing is a versatile technique for the analysis of molecules on the single-molecule level. However, extracting information from data with established algorithms usually requires time-consuming checks by an experienced researcher due to inherent variability of solid-state nanopores. Here, we develop a convolutional neural network (CNN) for the fully automated extraction of information from the time-series signals obtained by nanopore sensors. In our demonstration, we use a previously published data set on multiplexed single-molecule protein sensing. The neural network learns to classify translocation events with greater accuracy than previously possible, while also increasing the number of analyzable events by a factor of 5. Our results demonstrate that deep learning can achieve significant improvements in single molecule nanopore detection with potential applications in rapid diagnostics.


Year: 2018 PMID: 29845855 PMCID: PMC6025884 DOI: 10.1021/acs.nanolett.8b01709

Source DB: PubMed Journal: Nano Lett ISSN: 1530-6984 Impact factor: 11.189

Nanopores have emerged as powerful sensing devices for single molecules,[2,3] with applications in DNA sequencing,[4] protein detection,[1,5−9] the study of protein folding,[10] SNP genotyping,[11] data storage,[8] and DNA computing.[12] A typical setup consists of two liquid-filled reservoirs connected by a nanopore with a diameter down to a few nanometres. An external electric field drives charged molecules through the nanopore, as shown in Figure 1a. The passage of molecules modulates the current, producing a characteristic signal that contains information about the shape of the molecule.

Figure 1

Convolutional neural networks for the analysis of nanopore data. (a) The shape of molecules is contained in the time-dependent ionic current signal from passing the molecules through a nanopore. For example, a molecule with three protrusions passing through the nanopore leads to a current event with three secondary current drops as indicated by the red arrow. (b) Current traces associated with the modified DNA molecules.[1] The first half of the molecule encodes a unique barcode: the first peak marks the start, three bits uniquely identify the molecule design, and the last peak signifies the end. The two events have barcodes “111” and “001”. The second half has a binding site for a specific molecular target. (c) Data analysis using deep learning methods: convolution layers extract local features such as current drops of different width (shown in orange). The features are interpreted by a fully connected neural network, which outputs a prediction for the barcode and the target binding state.

The readout is a time-series current trace corresponding to the shape of the molecule, usually called an event. Detection of such events can be achieved using simple current thresholds, but the subsequent analysis of features within each identified event is often made difficult by a poor signal-to-noise ratio, varying conformation of the molecule, and nonspecific interactions with the nanopore surface. For example, Figure 1b shows two events from a multiplexed protein sensing technique published in ref (1). The authors used a DNA molecule as a carrier for a protein target. Modifications along the DNA molecule and bound targets produce secondary current drops during the translocation event, as shown in the two traces. In the first half of the structure, DNA hairpin loops at defined positions and their corresponding secondary drops were used to encode a digital barcode. This barcode uniquely identified a binding site in the other half of the DNA molecule.
The presence of a target at the binding site could be inferred from a single secondary drop in the second half. This approach allows the simultaneous detection of a large number of targets, only limited by the number of distinct barcodes. The information is encoded in additional current drops during the event, much like the knots on a string used in the Inca Quipu system.[13] Analysis of the event data requires accurate detection and subsequent interpretation of secondary current drops.[1] However, simple peak finding algorithms often fail at reliably classifying large parts of the data. Common causes of errors are a varying peak magnitude, noise,[6] fluctuating velocities,[14] overlapping peaks, DNA knots,[15] and folded molecules. To mitigate these effects the nanopore community has developed sophisticated algorithms.[16−19] However, they frequently require manual parameter tuning for each data set and supervision of algorithms.[1,9] In the worst case scenario, researchers have to manually interpret the data, leading to small sample sizes, possible confirmation bias, or data analysis duration exceeding measurement time. In this paper, we show that deep learning is ideally suited for automating the analysis of nanopore sensing data. For our study, we use the previously mentioned multiplexed protein sensing data set.[1] The data set contains separate control measurements for each specific barcode, without other bit permutations present in the solution. This automatically provides labeled data to train the supervised learning model. At the same time, the data is sufficiently complex to require an elaborate algorithm for the classification of events. In ref (1), a 12 step approach was used to identify the bit sequence and presence or absence of a target on each DNA construct. 
That method relied on more than a dozen manually adjusted parameters that were carefully optimized, but still it could only use a small fraction (∼20%) of events, discarding up to 80% of the difficult-to-interpret events that failed some predefined set of criteria. Here, we show that machine learning models are able to interpret and classify data without the need for manual tuning and the development of complex algorithms while increasing the number of usable events by a factor of 5. Our implementation is open-source and available online to enable the adaptation of deep learning to other nanopore sensing problems.[20]

Methods

We chose convolutional neural networks (CNN) as the machine learning approach because of their suitability for detecting local patterns.[21,22] A recent study showed that CNNs perform well on simulated current traces from an STM tunnel junction.[23] For comparison, DNA bases can be accurately determined from current levels using recurrent neural networks.[24] However, our goal is fundamentally different, as we are trying to identify the pattern encoded on the DNA secondary structure from a variable nanopore system. Therefore, we chose to use the CNN architecture. A typical CNN consists of two parts, as shown in Figure 1c. First, a series of convolutions is applied to the raw input data. Then a dense neural network learns to interpret the processed signal. The output is a prediction about which class a particular input belongs to. In our case, the prediction is a barcode on the DNA constructs and whether a target has bound to it. Before feeding the data into the neural network, we perform two preparation steps. First, the raw data set contains erroneous detections, caused by contamination, incomplete DNA fragments, and nonspecific interactions with the pore walls. We use standard filtration methods to remove these detections:[25] we exclude events whose area under the current trace (electronic charge deficit) lies outside two standard deviations of the mean, as well as those with current drops larger than 3.2× the unfolded event current level. Details are available in the Supporting Information.[20] This filtration removes up to 30% of the detections recorded with the measurement setup. After filtration we still observe some events with errors, such as a missing bit in the barcode structure. Therefore, perfect accuracy is unattainable using realistic data sets. Second, we want the model to identify a molecule, but not the experiment.
The problem arises because nanopores vary in shape and conductivity, leading to a correlation between events measured with the same nanopore. It is possible for the neural network to overfit to these variations, thereby learning to identify a nanopore instead of the barcode on a molecule. To reduce such overfitting, we normalize the events from each nanopore to have the same unfolded current level (arbitrarily set to −1). In addition, we test the model using independent experiments to reduce the chances of spurious correlations. Table 1 shows the number of events in the training and test sets.
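The two preparation steps can be sketched in a few lines. This is a minimal illustration rather than the published implementation: the function names `filter_events` and `normalize_events` are hypothetical, the area under the trace is approximated by a plain sum of samples, and each event is assumed to be a baseline-subtracted list of current samples in which drops are negative.

```python
import statistics

def filter_events(events, unfolded_level, n_sigma=2.0, max_drop_factor=3.2):
    """Drop events with an outlying area or an overly deep current drop.

    Assumption: unfolded_level is the (negative) current level of an
    unfolded DNA molecule for the nanopore that produced these events.
    """
    # Sum of samples as a stand-in for the electronic charge deficit.
    areas = [sum(trace) for trace in events]
    mean_area = statistics.mean(areas)
    std_area = statistics.pstdev(areas)
    kept = []
    for trace, area in zip(events, areas):
        area_ok = abs(area - mean_area) <= n_sigma * std_area
        drop_ok = -min(trace) <= max_drop_factor * abs(unfolded_level)
        if area_ok and drop_ok:
            kept.append(trace)
    return kept

def normalize_events(events, unfolded_level):
    """Rescale traces so that the unfolded current level becomes -1."""
    scale = abs(unfolded_level)
    return [[sample / scale for sample in trace] for trace in events]

# Toy data: five ordinary events plus one with an overly deep drop.
typical = [0.0, -0.25, -0.25, 0.0]
spike = [0.0, -1.0, 0.5, 0.0]
events = [list(typical) for _ in range(5)] + [spike]
clean = filter_events(events, unfolded_level=-0.25)
normalized = normalize_events(clean, unfolded_level=-0.25)
```

Normalizing per nanopore, rather than over the whole data set, is what removes the pore-specific current scale that the network could otherwise latch onto.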

Table 1

Number of Events in the Training and Testing Sets

	event no.		experiment no.
label	train	test	without protein	with protein
000	5593	253	5	0
001	8155	502	3	4
010	2319	101	4	0
011	15178	827	4	7
100	876	83	3	0
101	7251	427	2	4
110	6473	606	5	0
111	6680	665	5	2
unbound	36551	2191	31	0
bound	15874	1273	0	17
total	52525	3464	31	17

The last two columns show the number of independent experiments without protein (unbound state) and with protein (bound state).

As mentioned above, our predictor model is based on a machine learning technique called neural networks. The architecture of such a network specifies how the network nodes are connected and what operations are applied. In order to find a suitable architecture for nanopore data, we investigated different alternatives by educated trial and error. The model presented here is inspired by the image classification network in ref (26), which we modified to perform 1D convolutions. Figure 2 shows the architecture.

Figure 2

Architecture of the neural network, where each element is briefly described in the Methods. Acronyms: BN is a batch normalization layer; ReLU is a rectified linear unit and is shown on top. Numbers in the brackets correspond to the matrix sizes encoding a single event at that point in the network. The model has 3 995 920 trainable parameters.

We optimized the (hyper-)parameters to work well for nanopore data by trial and error. A typical procedure is to pick one hyper-parameter, such as the number of convolution layers, then increase the number and measure the resulting accuracy. If the accuracy increases we stick with the new number, but if it decreases or does not change we stick with the old number. We then pick a different hyper-parameter and repeat the procedure. To avoid overfitting to the test set, we measured the accuracy gains using a development set, which is independent from the test set and 20 times smaller than the training set. The reported numbers in Figure 2 are the result of our optimization. The input for the neural network is a current trace from a measurement event. The data from ref (1) produced events with an average length of 402 data points. This includes short stretches of current recording before and after the event. As the maximum length of the event never exceeded 700 points, we use a 700-element vector as the input. The shorter events are padded at the end with Gaussian noise (μ = 0, σ = 0.072, corresponding to average noise levels). Each box in Figure 2 corresponds to a so-called “hidden layer” that performs a specific task and passes on the information. Here, we give a brief description of each component; we refer interested readers to the machine learning literature for more details.[21,22] Convolution layers extract features with local structure, such as peaks or steps.
These layers perform a discrete convolution on a segment of the input by multiplying it with a small window, called a kernel, and moving along to the next segment (stepping by a single vector element). The output is large if the input features match the kernel, whose weights are learned from the training data. For example, Figure 1c shows the output after the first convolution layer, where the orange line corresponds to a kernel that detects peaks. Other kernels detect other features in the input data, which are often difficult to interpret, as seen by the two gray lines that correspond to different kernels. After each convolution, we apply a batch normalization (BN) layer that normalizes the data to have zero mean and unit variance.[27] These layers improve our network training convergence. Finally, an activation function is applied: a piecewise function called the rectified linear unit (ReLU), f(x) = max(0,x). This nonlinear function is necessary for learning nonlinear relationships between features.[22] The activation function completes one row in the diagram; its output goes into the next convolution layer. Roughly speaking, the deeper layers capture more abstract and complex features. We follow the common practice of increasing the number of kernels for deeper layers:[21] from 64 to 128, then to 256. Each step doubles the amount of information passed to the next layer. For every two convolutions, we have a “max pool” layer to reduce the amount of information by down-sampling spatial dimensions. A max pool layer splits an input vector into segments of three numbers and returns only the maximum value within each segment. This arrangement is believed to improve spatial invariance for feature extraction.[22] The dropout layer reduces overfitting by randomly switching off a fraction of nodes in the layer above.
This encourages the network to learn more robust features that do not depend on a single node.[28] Note that the dropout is only applied during training, because we want maximum accuracy while using the algorithm. The second half of the network is a densely connected neural network with two hidden layers and a ReLU activation function. In a dense network, the nodes between adjacent layers are fully connected, as illustrated in Figure 1c. The weights for these connections are learned from the training data. The output layer is adjusted depending on the task. In our case, we have two outputs: the barcode and sensing region. The barcode output is a vector with 8 elements and a softmax activation function. The softmax normalizes the output vector to have a sum of one such that each element is a proxy for the probability for a different barcode. We take the maximal value to be the predicted barcode. For the sensing region, the output is a single number with a sigmoid activation function. This number is a proxy for the probability of having a target bound to the sensing region. Note that these are two networks that are trained separately and give independent outputs. The model is trained for 200 epochs on a GPU (Nvidia GeForce GTX 1080 Ti). The aim of the training is to find the weights that maximize accuracy, which corresponds to minimizing a loss function. For barcodes, the loss function is categorical cross-entropy, while for the sensing region it is binary cross-entropy. To minimize the loss function we use the Adam optimization algorithm[29] (LR = 0.001; decay = 0.97; batch size of 32). Typical training takes 200 min, while evaluation is much faster at 1600 events/s, making QuipuNet suitable for real-time classification.
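The individual operations described in this section are simple enough to write out by hand. The following is a minimal sketch of the building blocks only, not the published network: the function names are hypothetical, the convolution is assumed to use no edge padding, the kernel weights are hand-picked rather than learned, and batch normalization, dropout, and the dense layers are omitted.

```python
import math
import random

def pad_event(trace, length=700, sigma=0.072, rng=None):
    """Pad a trace at the end with Gaussian noise (mean 0) up to `length`."""
    rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
    noise = [rng.gauss(0.0, sigma) for _ in range(length - len(trace))]
    return list(trace) + noise

def conv1d(signal, kernel):
    """Slide the kernel along the signal in steps of one element."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Rectified linear unit, f(x) = max(0, x), applied element-wise."""
    return [max(0.0, v) for v in values]

def max_pool(values, size=3):
    """Split into segments of `size` and keep the maximum of each segment."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def softmax(logits):
    """Normalize a vector so that its elements are positive and sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A hand-picked kernel that responds to a single downward spike.
signal = [0.0, 0.0, -1.0, 0.0, 0.0, 0.0]
features = relu(conv1d(signal, [0.5, -1.0, 0.5]))
pooled = max_pool(features)
padded = pad_event([-1.0] * 402)
uniform = softmax([0.0] * 8)
```

With these toy values the kernel output peaks where the window is centered on the spike, the ReLU removes the negative side lobes, and pooling keeps only that peak.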

Results

QuipuNet correctly identifies almost all events even with highly complex shapes, as shown in Figure 3. For example, the first event in column one enters the nanopore with the barcode first, while the second and third examples enter with the sensing region first. QuipuNet can interpret both directions. Columns two and three show that it learns to identify folded DNA events, which occur when a nanopore captures the DNA molecule somewhere along its length. These events are particularly difficult to interpret because there are many possible outcomes and peaks tend to be less pronounced. For comparison, the method from ref (1) discarded folded events so that only the events shown in blue could be identified.

Figure 3

Example events identified by the semiautomated algorithm[1] and QuipuNet. (a) Sketches of some of the possible DNA configurations during the passage through a nanopore. The shape of the molecule complicates the semiautomated analysis.[1] (b) These 9 example events present typical results from the data set in ref (1). The original algorithm only identified the two blue events, 111 unbound and 001 bound, while QuipuNet correctly identified all these events.

It is important to note here that QuipuNet increased the number of usable events by a factor of ∼5. Table 2 presents a quantitative comparison of accuracy. The first metric for accuracy is precision, which gives the fraction of correctly identified events out of attempted guesses. Precision can be boosted by refusing to label difficult events. On the other hand, the recall metric gives the fraction of correctly identified events out of all the events (after filtration). For example, the Bell and Keyser method[1] and human experts achieve high precision but have a low recall because events with ambiguous barcode patterns are discarded.
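The two metrics, together with the fraction of data utilized, can be computed as in the following sketch. The function name `score` and the use of `None` for an event that a method declined to label are assumptions made for illustration; the labels are toy values, not measured data.

```python
def score(true_labels, predictions):
    """Return (precision, recall, data_utilized) for one method.

    predictions holds None wherever the method refused to label an event.
    """
    attempted = [(t, p) for t, p in zip(true_labels, predictions)
                 if p is not None]
    correct = sum(1 for t, p in attempted if t == p)
    total = len(true_labels)
    # Precision: correct out of attempted guesses.
    precision = correct / len(attempted) if attempted else 0.0
    # Recall: correct out of all events, attempted or not.
    recall = correct / total
    utilized = len(attempted) / total
    return precision, recall, utilized

truth = ["111", "001", "010", "011", "100"]
guesses = ["111", None, "010", "001", None]
precision, recall, utilized = score(truth, guesses)
```

With these definitions recall equals precision multiplied by the fraction of data utilized, which is why a method that labels only easy events can combine high precision with low recall.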

Table 2

Performance Comparison between QuipuNet and Other Methods

	precision	recall	data utilized
barcode readout
Bell and Keyser[1]	0.937	0.182	0.194
human	0.978	0.440	0.450
QuipuNet (all data)	0.946	0.946	1.000
QuipuNet (best 80%)	0.987	0.789	0.800
sensing region
Bell and Keyser[1]	0.940	0.192	0.204
human	0.931	0.405	0.435
QuipuNet (all data)	0.971	0.971	1.000
QuipuNet (best 80%)	0.997	0.798	0.800

Precision is the fraction of correctly identified samples out of attempted guesses, while recall gives the fraction of correctly identified samples out of all the events. Data utilized is the fraction of events that the algorithm attempted to identify.

QuipuNet achieves a precision of 0.946 for barcodes and 0.971 for the sensing regions. This is 1.0% and 3.4% higher than the Bell and Keyser method. A much bigger difference can be seen in the recall metric because QuipuNet classifies all the data. The recall is five times larger than the original method[1] for both the barcode and sensing region. These results suggest that QuipuNet accurately classifies the nanopore event data, including folded events. As a result, QuipuNet outputs five times more data than the previous method for the same experiments. To measure human expert performance, one of the authors labeled 500 randomly chosen events and compared them with the true labels (it took around 1 h). Only 45% of events could be labeled reliably because of the ambiguity introduced by folds or overlapping peaks. Compared with human performance, QuipuNet is 3.3% less precise at reading the barcode and 4.4% better at reading the sensing region. In both cases, the recall metric is more than twice that of a human expert. To optimize for accuracy, we can discard low confidence predictions to increase the precision. Practically, it makes sense to discard events where a barcode is simply missing or otherwise impossible to identify. To achieve this, we estimate the confidence using the maximal value of the softmax output vector and then discard events with the lowest confidence. We use a “data utilized” fraction to show how much data remains after discarding low confidence predictions. Figure 4a shows the accuracy as a function of data utilized for the barcode predictions (evaluated on the test set).
The accuracy increases with the amount of discarded data, suggesting that the confidence estimator correctly identifies poor predictions. The accuracy curve is significantly above manual labeling and the Bell and Keyser method, suggesting that QuipuNet outperforms both. For illustrative purposes, at 80% utilized data QuipuNet precision is 0.987, which is higher than the human performance. Figure 4b shows an equivalent plot for the sensing region predictions. Here, QuipuNet achieves a nearly perfect precision of 0.997 for 80% utilized data. In both cases, discarding low confidence predictions increases the accuracy of the QuipuNet algorithm.
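Discarding by confidence amounts to ranking events by the largest element of their softmax output and keeping the top fraction. The sketch below assumes hypothetical helper names and uses a made-up three-class example rather than the eight barcode classes or any measured probabilities.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

def keep_most_confident(probabilities, fraction):
    """Indices of the `fraction` of events with the highest confidence.

    Confidence is taken as the maximal value of each softmax output vector.
    """
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(probabilities)),
                   key=lambda i: confidence[i], reverse=True)
    n_keep = round(fraction * len(probabilities))
    return sorted(order[:n_keep])

def precision_at(probabilities, true_classes, fraction):
    """Precision over the retained events only."""
    kept = keep_most_confident(probabilities, fraction)
    hits = sum(1 for i in kept if argmax(probabilities[i]) == true_classes[i])
    return hits / len(kept)

# Toy softmax outputs; event 2 is both wrong and the least confident.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.40, 0.35, 0.25],
    [0.20, 0.10, 0.70],
    [0.60, 0.30, 0.10],
]
truth = [0, 1, 1, 2, 0]
all_data = precision_at(probs, truth, 1.0)
best_80 = precision_at(probs, truth, 0.8)
```

In this toy case dropping the least confident 20% removes the only misclassified event, which mirrors the trend described above: precision rises as low-confidence predictions are discarded.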

Figure 4

Evaluating the performance of QuipuNet. (a) Barcode prediction accuracy (precision) as a function of data utilized. The accuracy increases when the least confident predictions are removed. (b) Sensing region prediction accuracy as a function of data utilized. (c) Error matrix: rows represent true barcodes from the test set, while columns are the barcodes that QuipuNet assigned them to. In an ideal case, it would be a diagonal matrix. The matrix was evaluated using the entire test set. On the right, bars show the number of events in the training set for each barcode. Accuracy correlates with the size of the training set.

The predictions for the sensing region have a higher accuracy than those for the barcodes. We attribute this to two effects. First, the sensing region typically has a higher signal-to-noise ratio, i.e., larger current drops. Second, the barcode prediction is an intrinsically harder problem, because the algorithm must distinguish between 8 different classes, instead of two. Figure 4c shows where the errors are made for the barcode predictions. The matrix suggests that QuipuNet makes more mistakes for certain barcodes. For example, the prediction for barcode “100” has a precision of only 0.86, which can be attributed to the small training set. It only has 876 events measured by two experiments while the third experiment was used for the test set. A larger training set is expected to improve the accuracy. The error matrix also provides insights for designing more robust barcodes. The barcodes “000”, “001”, “101”, “110”, and “111” all have a similar amount of training data, but the symmetric barcodes have a higher accuracy. Here, symmetric barcodes are “000”, “101”, and “111” (“010” has a smaller amount of training data). This observation suggests that using only symmetric patterns for barcodes might improve the overall accuracy.
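An error matrix of this kind, and the notion of a symmetric barcode, can be expressed compactly. The sketch below uses hypothetical function names and toy labels; the per-barcode precision is computed column-wise (correct assignments out of all events assigned to that barcode), which is an assumption about how the per-barcode figure is defined.

```python
def error_matrix(true_labels, predicted_labels, classes):
    """Rows are true barcodes, columns are the assigned barcodes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def column_precision(matrix, column):
    """Correct assignments divided by all assignments to this barcode."""
    assigned = sum(row[column] for row in matrix)
    return matrix[column][column] / assigned if assigned else 0.0

def is_symmetric(barcode):
    """A barcode is symmetric if it reads the same in both directions."""
    return barcode == barcode[::-1]

# The eight 3-bit barcodes, "000" through "111".
classes = [format(i, "03b") for i in range(8)]
symmetric = [c for c in classes if is_symmetric(c)]

# Toy labels: one "100" event is misread as "110".
truth = ["000", "000", "100", "100", "111"]
guess = ["000", "000", "100", "110", "111"]
matrix = error_matrix(truth, guess, classes)
```

A symmetric barcode produces the same peak pattern whichever end of the molecule enters the nanopore first, which is one plausible reason it would be easier to read.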
Finally, we trained QuipuNet on a reduced training set to assess the relationship between accuracy and training set size, as shown in Figure 5. For the sensing region, we randomly picked the same number of events for bound and unbound states. For the barcode, we randomly reduced the training set size of the “011” barcode to a number specified on the x axis, while the other barcodes had the same number of events as specified in Table 1. The resulting recall metric reaches 80% at 2000 training events, 90% at 8000 and then slowly increases to >90% for more than 8000 training events. The increase in accuracy beyond 90% appears to be asymptotic and would require even larger training sets. The classification of the sensing region (blue data in Figure 5) reaches higher accuracies for smaller training sets as it only has two classes and signal-to-noise for protein signals is higher than for barcodes.

Figure 5

Recall accuracy as a function of training set size. The number of events are shown for bound/unbound states and the “011” barcode.

Discussion

We have shown that convolutional neural networks can accurately classify events from nanopore data. Our network achieves better accuracy than the previous algorithm[1] or manual classification, while at the same time classifying events that were impossible to interpret before. As a result, five times more data can be analyzed from the same experiments. Furthermore, the machine learning approach simplifies the analysis by eliminating manual parameter tuning and algorithm development. Instead, we rely on experiments to generate the labels that are used to train the neural network. In the Supporting Information, we use QuipuNet to analyze raw data from other nanopore experiments.[20] In ref (11), the authors detected single-nucleotide polymorphisms from the presence of a single binding target. Their designed DNA molecules contain only the sensing region with no barcode. We successfully reproduce results from their analysis using QuipuNet. Despite a significantly lower signal-to-noise level for this data set, we obtain an accuracy of up to 72% when including folded events. If only the unfolded events are analyzed, the accuracy is 0.91. This shows that QuipuNet can be readily applied to other nanopore sensing data sets. When designing a nanopore experiment, others should consider the relationship between the desired accuracy and the number of training events. Our work suggests that deep learning is particularly suitable for nanopore sensing because the experiments can generate large amounts of training data, often with predefined labels. A similar conclusion was reached for nanopore-based DNA sequencing, where a recurrent neural network improves the precision of DNA sequencing.[24] Future work may address other difficult problems in the nanopore field. Specifically, peak localization in noisy data sets[6] can be trained using DNA with known modification positions.
Also, running QuipuNet against simulated data sets (generated classically or with generative adversarial networks) could guide the design of the DNA structures in order to maximize the information density or readout accuracy. Both are critically important for information storage on DNA and hold the promise of highly multiplexed protein sensing for medical applications.

17 in total

1. Stochastic sensing of proteins with receptor-modified solid-state nanopores.

Authors: Ruoshan Wei; Volker Gatterdam; Ralph Wieneke; Robert Tampé; Ulrich Rant
Journal: Nat Nanotechnol Date: 2012-03-11 Impact factor: 39.213

2. Fast and automatic processing of multi-level events in nanopore translocation experiments.

Authors: C Raillon; P Granjon; M Graf; L J Steinbock; A Radenovic
Journal: Nanoscale Date: 2012-07-11 Impact factor: 7.790

3. Nanopore Logic Operation with DNA to RNA Transcription in a Droplet System.

Authors: Masayuki Ohara; Masahiro Takinoue; Ryuji Kawano
Journal: ACS Synth Biol Date: 2017-04-25 Impact factor: 5.110

4. Direct observation of DNA knots using a solid-state nanopore.

Authors: Calin Plesa; Daniel Verschueren; Sergii Pud; Jaco van der Torre; Justus W Ruitenberg; Menno J Witteveen; Magnus P Jonsson; Alexander Y Grosberg; Yitzhak Rabin; Cees Dekker
Journal: Nat Nanotechnol Date: 2016-08-15 Impact factor: 39.213

5. MOSAIC: A Modular Single-Molecule Analysis Interface for Decoding Multistate Nanopore Data.

Authors: Jacob H Forstater; Kyle Briggs; Joseph W F Robertson; Jessica Ettedgui; Olivier Marie-Rose; Canute Vaz; John J Kasianowicz; Vincent Tabard-Cossa; Arvind Balijepalli
Journal: Anal Chem Date: 2016-11-15 Impact factor: 6.986

6. Nanopore Sensing of Protein Folding.

Authors: Wei Si; Aleksei Aksimentiev
Journal: ACS Nano Date: 2017-07-13 Impact factor: 15.881

7. Electrophoretic Deformation of Individual Transfer RNA Molecules Reveals Their Identity.

Authors: Robert Y Henley; Brian Alan Ashcroft; Ian Farrell; Barry S Cooperman; Stuart M Lindsay; Meni Wanunu
Journal: Nano Lett Date: 2015-12-02 Impact factor: 11.189

8. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads.

Authors: Vladimír Boža; Broňa Brejová; Tomáš Vinař
Journal: PLoS One Date: 2017-06-05 Impact factor: 3.240

9. Single molecule multiplexed nanopore protein screening in human serum using aptamer modified DNA carriers.

Authors: Jasmine Y Y Sze; Aleksandar P Ivanov; Anthony E G Cass; Joshua B Edel
Journal: Nat Commun Date: 2017-11-16 Impact factor: 14.919

10. Quantifying Nanomolar Protein Concentrations Using Designed DNA Carriers and Solid-State Nanopores.

Authors: Jinglin Kong; Nicholas A W Bell; Ulrich F Keyser
Journal: Nano Lett Date: 2016-05-03 Impact factor: 11.189
