
Expanding the Molecular Alphabet of DNA-Based Data Storage Systems with Neural Network Nanopore Readout Processing.

S Kasra Tabatabaei¹,², Bach Pham³, Chao Pan⁴, Jingqian Liu¹, Shubham Chandak⁵, Spencer A Shorkey³, Alvaro G Hernandez⁶, Aleksei Aksimentiev¹,⁷, Min Chen³, Charles M Schroeder¹,²,⁸,⁹, Olgica Milenkovic⁴.

Abstract

DNA is a promising next-generation data storage medium, but challenges remain with synthesis costs and recording latency. Here, we describe a prototype of a DNA data storage system that uses an extended molecular alphabet combining natural and chemically modified nucleotides. Our results show that MspA nanopores can discriminate different combinations and ordered sequences of natural and chemically modified nucleotides in custom-designed oligomers. We further demonstrate single-molecule sequencing of the extended alphabet using a neural network architecture that classifies raw current signals generated by Oxford Nanopore sequencers with an average accuracy exceeding 60% (39× larger than random guessing). Molecular dynamics simulations show that the majority of modified nucleotides lead to only minor perturbations of the DNA double helix. Overall, the extended molecular alphabet may offer a nearly 2-fold increase in storage density and potentially the same order of reduction in recording latency, thereby enabling new implementations of molecular recorders.

Entities: Chemical

Keywords: DNA Data Storage; Nanopores; Neural Networks; Single-Molecule; Unnatural Nucleotides

Substances:
Nucleotides
DNA

Year: 2022; PMID: 35212544; PMCID: PMC8915253; DOI: 10.1021/acs.nanolett.1c04203

Source DB: PubMed; Journal: Nano Lett; ISSN: 1530-6984; Impact factor: 11.189

Introduction

DNA is emerging as a robust data storage medium that offers ultrahigh storage densities greatly exceeding those of conventional magnetic and optical recorders. Information stored in DNA can be copied in a massively parallel manner and selectively retrieved via polymerase chain reaction (PCR).[1−10] However, existing DNA storage systems suffer from high latency caused by the inherently sequential writing process. Despite recent progress, a typical cycle time of solid-phase DNA synthesis is on the order of minutes, which limits the practical applications of this molecular storage platform.[11] Using current technologies, writing 100 bits of information requires nearly 2 h [11] and costs more than U.S. $1,[12] assuming that each nucleotide stores its theoretical maximum of two bits. To overcome these challenges, new synthesis methods and information encoding approaches are required to accelerate the writing of large-volume data sets.[13] Expanding the alphabet of a DNA storage medium by including chemically modified DNA nucleotides can increase both the storage density and the writing speed because more than two bits are recorded during each synthesis cycle. However, designing chemically modified nucleotides as new letters for the DNA storage alphabet must be tightly coupled to the process of reading the encoded information via DNA sequencing, because current DNA sequencing methods, including single-molecule nanopore sequencing, have been developed and optimized to read natural nucleotides. Prior work reported an expanded nucleic acid alphabet of synthetic DNA and RNA nucleotides that can be replicated and transcribed using biological enzymes,[14] but this alphabet was not designed for molecular storage applications and was not accurately read using a nucleic acid sequencing method. 
Aerolysin nanopores were used to detect synthetic polymers flanked by adenosines, where each monomer of the polymer carries one bit of information.[15] Prior work has reported successful detection of base pairs containing single chemically modified nucleotides[16,17] or discrimination of single nucleotides in natural versus modified states.[18] Despite recent advances, single-molecule detection and sequencing of an expanded molecular alphabet based on a library of chemically diverse modified nucleotides has not yet been demonstrated. Here, we report an expanded molecular alphabet for DNA data storage comprising four natural and seven chemically modified nucleotides that are readily detected and distinguished using nanopore sequencers (Figure 1 and Table 1). Our results show that Mycobacterium smegmatis porin A (MspA) nanopores, which are widely used for ssDNA sensing and single-molecule chemistry studies,[19−21] can accurately discriminate 77 combinations and orderings of chemically diverse monomers within homo- and heterotetrameric sequences (Figures 1, 2, S1, and S2 and Tables S1–S3). We further demonstrate that accurate classification (exceeding 60% on average) of combinatorial patterns of natural and chemically modified nucleotides is possible using deep learning architectures that operate on raw current signals generated by the GridION sequencer from Oxford Nanopore Technologies (ONT)[22] (Figures 3, S3, and S4). We further study the stability of DNA duplexes containing modified nucleotides using all-atom molecular dynamics (MD) simulations[23−26] (Figures 4, S5, and S6 and Table S5). Overall, the extended molecular alphabet has the potential to offer a nearly 2-fold increase in storage density and potentially the same order of reduction in recording latency, thereby providing a promising path forward for the development of new molecular recorders.

Figure 1

Table 1

Chemically Modified Nucleotides Used in the DNA Data Storage System, Along with Their Chemical Propertiesᵃ

Symbol	B1	B2	B3	B4	B5	B6	B7
Name	2,6-Diaminopurine 2′-deoxyriboside	5-Hydroxymethyldeoxycytidine	5-Hydroxybutynyl-2′-deoxyuridine	5-Nitroindole-2′-deoxyriboside	Deoxyuridine	5-Octadiynyldeoxyuridine	1,2-Dideoxyribose
Structurally most similar nucleotide	dA	dC	dT	dA	dT	dT	-
Pairing mate/interaction type (experiment*)	dT H bonds[28−30]	dG H bonds[31]	dA-	All natural nucleotides stacking[28,32]	dA H bonds[33]	dA H bonds[34,35]	-
Pairing mate/interaction type (simulation**)	dT H bonds	dG H bonds	dA H bonds	dG stacking	dA H bonds	dA, dC H bonds	-

ᵃ The symbols and the names of the chemically modified nucleotides are shown in the first and second rows, and the molecular structures are depicted in Figure 1. Structurally similar natural nucleotides are shown in the third row. In general, distinct chemical functional groups and molecular charges play an important role in discriminating nucleobases using MspA and ONT sequencers. The last two rows show pairing properties of the modified bases. * denotes data from Integrated DNA Technologies[28] or experimental data from previous work,[29−35] while ** denotes results from molecular dynamics simulations reported in Figure 4 and the Supporting Information (Figures S5 and S6, Table S5). Short dashes indicate that pairing is inherently impossible (e.g., B7) or that no specific information is published (e.g., interaction type of B3-dA pairing).

Figure 2

Identification of chemically modified DNA using MspA nanopores. (A) Schematic diagram of ssDNA immobilized in a MspA nanopore, where ssDNA containing a biotin–streptavidin interaction at the 5′ terminus prevents translocation through the pore. Residual ion current generated by four nucleotides at positions 13–16 from the 5′ terminus is recorded for ssDNA immobilized in the pore. (B) Histograms of average residual ionic currents Ires shown in gray for different homopolymers (A, T, C, G, and B1–B7). The fitted Gaussian curves are depicted in red for natural nucleotides (A, T, C, G) and in blue for chemically modified nucleotides (B1–B7). (C) Histograms of the average residual ionic currents and the fitted Gaussian curves at various applied voltages for tetramers involving different combinations and orderings of B2 and B3. (D) Peak values (points) and confidence intervals (bars) of the fitted Gaussians with mean residual ionic currents corresponding to tetramers obtained by inserting one of the monomers B2 and B3 into the sequence ACT, at applied biases of 150 mV and 180 mV. (E) Schematic of the shift reconciliation method for resolving ambiguities in the readouts of different tetramers.
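Resolution between two readouts is reported in the Results as the overlap area of their fitted Gaussians. The sketch below shows one way such an overlap could be computed, using the overlapping coefficient from Python's standard library. Whether this matches the authors' exact definition of overlap is an assumption. The means are the Ires values quoted in the text for the B2 and B3 homotetramers at 180 mV, while the standard deviations are hypothetical placeholders.

```python
from statistics import NormalDist


def gaussian_overlap(mean_a, sd_a, mean_b, sd_b):
    """Overlapping coefficient (0 to 1) of two Gaussian readout distributions."""
    return NormalDist(mean_a, sd_a).overlap(NormalDist(mean_b, sd_b))


# Means: I_res (%) quoted for homotetramers B2 and B3 at 180 mV.
# Standard deviations: hypothetical, chosen only for illustration.
b2 = (10.2, 0.4)
b3 = (12.6, 0.4)

print(f"B2 vs B3 overlap: {gaussian_overlap(*b2, *b3):.4f}")
print(f"B2 vs itself:     {gaussian_overlap(*b2, *b2):.4f}")
```

A value near 0 means the two current levels are well resolved; a value near 1 means they are practically indistinguishable at that bias.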

Figure 3

Sequencing oligos containing chemically modified nucleotides using ONT GridION. (A) Schematic of oligo design and a picture of the GridION sequencer used in our experiments. (B) (Left) Illustration of current levels of polyA and polyT regions, used in our custom level calibration scheme. Dashed orange circle indicates the region harboring the signals from chemically modified nucleotides. (Right) Region-of-interest in raw current signal obtained by identifying polyA-polyT patterns. (C) Neural network model used for classification. The 1D residual neural network architecture comprises nine 1D convolution blocks. For example, a 1D convolution block (1 × 8 conv, 64) indicates that the kernel size for the convolution is 1 × 8 and that the number of output channels is 64. Half-downsampling for each channel is denoted by (/2); averaging over all channels to arrive at a single vector is referred to as “average pooling”; the (fc 128 × 30) notation indicates a fully connected layer with the shape 128 × 30. (Right) Magnified view of the operation of 1D convolutional neural networks on time-series data. (D) (Top) Confusion matrix for 66 classes, all of which have roughly the same number of samples (subsampled to ∼3500 sample oligos in each class). Random guessing would lead to a classification accuracy of 1.52%, whereas the smallest accuracy from our model is 41% (tetramer 2252). For our model-based prediction, the mean classification accuracy is 60.28% ± 0.28% (39× larger than random guessing), and the highest observed accuracy is 79% (tetramer 1111). The exact number of samples in each class is listed in Table S4. (Bottom left) Confusion matrix for six selected classes using B2 and B4 (named as listed, subsampled to roughly 5000 samples per class). Random guessing leads to an accuracy of 16.67%, whereas our model-based prediction ensures an average classification accuracy of 72.25% ± 1.46%. 
(Bottom right) Confusion matrix for six selected classes using B4 and B5 (named as listed, subsampled to roughly 5000 samples per class). Random guessing leads to an accuracy of 16.67%, while our model-based prediction ensures an average accuracy of 77.84% ± 0.96%.
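The level calibration in panel B relies on kernel density estimation of the signal levels followed by selection of the two most probable levels, as described in the Results. A minimal pure-Python sketch of that idea follows. The bandwidth, grid size, function names, and the synthetic trace are all hypothetical choices for illustration and are not taken from the authors' pipeline.

```python
import math
import random


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def two_dominant_levels(samples, bandwidth=2.0, n_grid=200):
    """Return the two current levels with the highest density peaks, ascending."""
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    dens = gaussian_kde(samples, grid, bandwidth)
    # Local maxima of the estimated density.
    peaks = [
        (dens[i], grid[i])
        for i in range(1, n_grid - 1)
        if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("fewer than two density peaks found")
    top_two = sorted(peaks, reverse=True)[:2]
    return sorted(level for _, level in top_two)


# Synthetic trace (arbitrary units): a long plateau standing in for polyA,
# a shorter, lower plateau standing in for polyT, and some scattered samples.
random.seed(0)
trace = (
    [random.gauss(100.0, 1.5) for _ in range(600)]
    + [random.gauss(80.0, 1.5) for _ in range(250)]
    + [random.gauss(90.0, 6.0) for _ in range(60)]
)
poly_t_level, poly_a_level = two_dominant_levels(trace)
print(f"polyT level ~ {poly_t_level:.1f}, polyA level ~ {poly_a_level:.1f}")
```

Sorting the two peaks in ascending order reflects the expectation stated in the text that polyT levels lie below polyA levels.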

Figure 4

Stability of DNA duplexes containing chemically modified nucleotides. The backbone of the dodecamer is shown using silver spheres, whereas the bases are drawn as molecular bonds. Chemically modified bases and the natural bases that pair with them are colored according to the atom type (cyan for carbon, blue for nitrogen, and red for oxygen). Base pairs immediately adjacent to the modified base pair are colored in red or blue. (A) Microscopic configurations of modified base pairs (from top to bottom: B1–T, B2–G, A–B3, A–B5, A–B6, and C–B6). (B) Donor (N1)–acceptor (N3) distance in the modified base pair (black) and in the adjacent base pairs (red and blue) during the last 100 ns of the 350 ns MD simulation. The arrows indicate the correspondence between the base pairs and the curves. The curves show a running average of the 10 ps sampled data with a 2 ns averaging window. (C) Microscopic configuration of modified base pairs. The black lines represent hydrogen bonds. The donor and the acceptor are labeled beside the atoms. (D) Probability of observing the specified number of hydrogen bonds within a modified base pair. The H-bonding probabilities were computed using the final 100 ns of a 350 ns all-atom MD simulation of a DNA dodecamer.
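The smoothing in panel B (10 ps samples, 2 ns window) corresponds to averaging over 200 consecutive samples. The sketch below shows one way to compute such a running average. The window alignment is an assumption, since the caption does not say whether the window is centered or trailing, and the function name is hypothetical.

```python
def running_average(values, window):
    """Running mean; output[i] is the average of values[i : i + window]."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    acc = sum(values[:window])
    out = [acc / window]
    for i in range(window, len(values)):
        # Slide the window by one sample: add the new value, drop the oldest.
        acc += values[i] - values[i - window]
        out.append(acc / window)
    return out


SAMPLE_INTERVAL_PS = 10   # sampling interval stated in the caption
WINDOW_NS = 2             # averaging window stated in the caption
window_samples = WINDOW_NS * 1000 // SAMPLE_INTERVAL_PS

print(window_samples)
print(running_average([1, 2, 3, 4], 2))
```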

Figure 1. DNA data storage using natural and chemically modified nucleotides. (A) Chemical structures of natural DNA nucleotides (A, C, G, T) and the selected chemically modified nucleotides employed in our study (B1–B7). (B) Schematic of the ssDNA oligo used in MspA nanopore experiments. The length of the oligos is 40 nucleotides (nts), with biotin attached at the 5′ terminus. Homo- or heterotetrameric sequences are located at positions 13–16, flanked by two polyT regions of length 12 nt and 24 nt on the 5′ and 3′ ends, respectively. (C) Sequence space for DNA homotetramers or heterotetramers used in MspA nanopore experiments. The notation aX + bY, where a and b take values in {2, 3, 4} so that a + b = 4, indicates that “a” symbols of the same kind are combined with “b” symbols of another kind and arranged in an arbitrary linear order. In total, 77 distinct tetrameric sequences were synthesized and tested experimentally. (Left) Circular diagram showing all 11 homotetramers and 12 tetrameric sequences of the form ACT + X, where X is a chemically modified nucleotide from the set {B2, B3, B5}. (Middle) Circular diagram showing all 30 tested combinations of tetrameric sequences with total composition 2X + 2Y using chemically modified monomers from the set {B1, B2, B3, B4, B5}, including sequence patterns XXYY, XYYX, and XYXY. (Right) Circular diagram showing the remaining 24 combinations of tetrameric sequences with total composition 3X + Y using the set {B2, B3, B5}. Five chemically modified nucleotides form stable base pairs with natural nucleotides via hydrogen bonds (B2–G, B3–A, B5–A, B6–A, B6–C), based on the results from molecular dynamics (MD) simulations.

Results and Discussion

To determine whether natural and chemically modified DNA nucleotides can be distinguished using the biological nanopore MspA, we designed a series of single-stranded DNA (ssDNA) molecules with the general sequence 5′-biotin-(dT)12-XXXX-(dT)24-3′, where X = {A, T, C, G, B1–B7} (Figure 1, Figures S1 and S2, Tables S1–S3). We hypothesized that specific chemical modifications to nucleobases such as amines, alkynes, or indole moieties can alter polymer–amino acid interactions in biological nanopores, thereby generating distinct signals in nanopore readouts. In the process, we also considered the stability of base pairing and base stacking interactions between natural and chemically modified nucleotides using a combination of MD simulations and experiments (Tables 1 and S5, Figures 4 and S5–S7).

Following molecular design and synthesis of ssDNA oligos (the chemical characterization and mass spectrometry analysis of oligos containing chemically modified nucleotides are provided in the Supporting Information (Figures S8–S84)), we performed MspA nanopore experiments in which ssDNA oligos containing biotin at the 5′ terminus were electrophoretically driven into MspA nanopores. The bulky streptavidin protein prevents the oligos from fully translocating through the pore without appreciably affecting the measured ionic currents.[27] Consequently, ssDNA molecules are effectively immobilized within MspA nanopores, exposing the four nucleotides at positions 13–16 from the tethering point to the constriction of the MspA pore (Figure 2A).[36] In this assay, streptavidin holds ssDNA in the MspA constriction in a similar fashion to a helicase enzyme that steps through double-stranded DNA (dsDNA) in an ONT sequencer, thereby enabling long-duration current readings for each sequence tetramer (Figure S1).

We next used MspA nanopores to determine residual currents for homotetrameric sequences of all natural and chemically modified monomers (Figure 2B). 
Our results show that MspA accurately discriminates all four natural (A, G, C, T) and nearly all chemically modified nucleotides (B1–B7) at an applied bias of 150 mV. The abasic nucleotide B7 shows the largest residual current, which likely arises from its small molecular size and reduced ability to interact with the reading head of MspA. The residual current levels are sensitive to the chemical identity of the nucleotides but do not directly correlate with their molecular size (Figure 2B). For example, current signals from B6 and B2 overlap at 150 mV, but B6 is well separated from B3 despite being structurally similar.

We further studied the effect of the applied bias on the resolution of nucleotide bases. At 150 mV, four chemically modified nucleotides (B2, B3, B4, B5) showed signals that were well resolved from each other and from the natural nucleotides, but the current levels from B6 exhibited around 68% overlap with B2. Upon increasing the applied bias to 180 mV, the resolution between B2 and B6 improved significantly, with an overlap area of the fitted Gaussian curves of 18%. At the same time, at 180 mV, resolution decreased in the region where Ires exceeds 20%, as may be seen from the residual currents of B4, A, and G, whose Gaussian readout distributions overlap in area by more than 90% (Figure 2B).

We further used MspA to detect and identify heterotetrameric sequences with compositions 2X + 2Y, where X, Y = {B2, B3, B4, B5} (Figure 2C, Figures S1 and S2, Tables S1–S3). Our results show that MspA can distinguish all heterotetrameric sequences with the same nucleotide composition when measurements at all three applied biases (150 mV, 180 mV, 200 mV) are performed. Due to the large sequence space explored, here we focus our discussion on representative tetrameric combinations of B2 and B3 (Figure 2C). In most cases, the residual currents of heterotetramers fall between those of the two corresponding homotetramers. 
For example, the tetramer 3223 has an Ires of 12.3%, whereas those of B2 and B3 are 10.2% and 12.6%, respectively (at 180 mV). However, some combinations of B2 and B3, including 2232, 2322, 2333, 3233, 2323, 2332, and 2233, showed significant decreases in residual currents compared to homotetramers B2 and B3 (Figure 2C), whereas the residual current of tetramer 3322 is larger than those of the B2 and B3 homotetramers at either 150 mV or 180 mV. Importantly, all tetrameric sequences were resolved by adjusting the applied bias.[37] At a higher applied bias of 200 mV, tetramers that were unresolved at lower bias were readily resolved, including 2322, 2332, and 2323 (Figure 2C). Overall, these results are consistent with the observation that the residual current levels of DNA tetramers are not directly correlated with molecular size, similar to the case of natural nucleotides[38] where the blockade current was found to be determined by the competition of steric and base stacking interactions.[39]

We next investigated the ability of MspA pores to resolve different tetramers containing both natural and chemically modified nucleotides (Figure 2D). Here, we specifically focused on heterotetramers containing a single chemically modified nucleotide (B2, B3, or B5) added in different positions of the directional sequence ACT.[38] Our results clearly show that different positions of the chemically modified nucleotide in the tetramer generate distinct residual currents. For example, the residual currents of heterotetrameric sequences of ACT containing B2 at four different positions (2ACT, A2CT, AC2T, and ACT2) are readily resolved at both 150 mV and 180 mV (Figure 2D). Although the Gaussians for the residual currents of homotetramer B2 and heterotetramer 2ACT overlap by ∼29% at 150 mV, they are distinguishable at 180 mV. 
In addition, nearly all heterotetrameric sequences of ACT containing B3 at four different positions were resolved from the homotetramer B3 at 150 and 180 mV, whereas the residual currents of 3ACT and ACT3 were only distinguishable at 180 mV (Figure 2D). These results are consistent with prior work reporting that tuning the applied bias is a useful approach to enhance the accuracy of nanopore-based sequencing methods.[40]

In summary, these results show the ability of MspA nanopores to accurately identify sequences containing chemically modified nucleotides. In theory, sequence context allows for high-resolution readout of arbitrary combinations and arrangements of natural and modified nucleotides (A, C, G, T, B1–B7). Although specific sets of tetramers might be confused during MspA reading, the method of shift reconciliation[41] allows for such sequences to be fully resolved using the information provided by different shifts of the tetramers within the constriction of the nanopore (Figure 2E). The concept of shift reconciliation is illustrated with the following example, where we consider the heterogeneous sequence 23223. In terms of the corresponding residual current levels, the prefix tetramer 2322 is confusable with 2332 or 2323 at 150 mV. However, by shifting the sliding window one position to the right, we obtain the tetramer 3223, which is not confusable with any other block. Because the trimer prefix of 3223, namely 322, matches the trimer suffix of only one of the tetramers 2322, 2332, and 2323 (i.e., the first one), we unambiguously deduce that 2322 is the correct prefix tetramer.

Moving beyond tetramer detection via MspA, we demonstrate that commercially available nanopore-based sequencing technology (ONT GridION) can be used to classify/sequence oligos containing the proposed molecular alphabet. 
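The shift reconciliation example for the sequence 23223 can be sketched in a few lines. The code below is an illustrative reading of the idea described in the text and is not the algorithm of ref 41: each sliding-window position contributes a set of confusable candidate tetramers, and only choices whose consecutive windows agree on their three shared symbols are kept. The function names are hypothetical.

```python
from itertools import product


def consistent_paths(windows):
    """Pick one tetramer per window so that consecutive picks overlap in 3 symbols."""
    return [
        path
        for path in product(*windows)
        if all(left[1:] == right[:3] for left, right in zip(path, path[1:]))
    ]


def assemble(path):
    """Merge overlapping tetramers (each shifted by one position) into one sequence."""
    return path[0] + "".join(tetramer[-1] for tetramer in path[1:])


# Window 1 is ambiguous at 150 mV; window 2 (shifted right by one) is not.
windows = [["2322", "2332", "2323"], ["3223"]]

paths = consistent_paths(windows)
print(paths)               # [('2322', '3223')]
print(assemble(paths[0]))  # 23223
```

The exhaustive product is adequate for a handful of windows; a longer read would call for a dynamic-programming pass over the same overlap constraint.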
For GridION experiments, the same ssDNA oligos used in MspA experiments were extended at the 3′ terminus with a polyA tail of random length of >100 nts, which is used to increase the length of the oligos and guide them inside the pore (Figure 3A). We retrieved raw current signals from the GridION platform following a custom RNA sequencing protocol (methods section). We processed the raw current signals using deep learning techniques to discriminate and identify different combinations and orderings of the chemically modified nucleotides. As a first step, we isolated regions in the raw current signals corresponding to chemically modified nucleotides. For this purpose, we could not use the specialized software suite Tombo,[42] designed by ONT for identifying potentially modified nucleotides from nanopore sequencing data, as it requires basecalling, alignment, and further downstream processing. Accurate basecalling of chemically modified nucleotides is difficult to accomplish, which greatly complicates alignment and classification tasks for arbitrary subregions of the signal. Moreover, the most recent ONT basecaller, Bonito, based on convolutional neural networks, is trained and specialized to work for natural DNA only.[43] For these reasons, we developed an analysis framework that directly operates on raw current signals of the chemically modified nucleotides.

Analysis of raw current signals is challenging because nanopore current signals exhibit extreme variations known as level drifts (Figure S3). Level drifts arise because each membrane patch (recording channel) inside the device has its own electric circuit, and each pore has unique features. To address this challenge, we developed a two-step identification scheme depicted in Figure 3B. In the first step, we estimate the current level for the polyA region and subsequently use it for signal calibration. 
Similar calibration steps are routinely performed for nanopore sequencing of natural DNA, but they rely on adaptor-based calibrations since all analytes use identical adaptors with a well-defined sequence content. For actual level calibration, we used kernel density estimation of the signal level distribution,[44] followed by identification of the levels that have the two largest probabilities in the estimated distribution. This approach is justified because polyA regions constitute the longest signal component in our oligo sequences. Moreover, on average, polyT levels are expected to be lower than polyA levels, so readout regions that are trailed by nearly flat regions with a mean level value lower than that for the polyA tails are filtered using a finite state machine.[45] These regions are expected to bear signals from the chemically modified nucleotides.

After extracting modification-bearing signals, raw current readouts are subsequently classified. For this task, we designed a 1D residual neural network model[46,47] (Figure 3C) containing 1D convolution layers (conv) that serve as feature extractors and one fully connected layer (fc) that serves as a classifier. The model is trained on oligo data corresponding to different combinations and orderings of chemically modified nucleotides, with each option supported by thousands of training samples (Table S4). Elements from each class are uniformly sampled at random in a balanced manner and split into training/validation/test sets with splitting percentages 60%/20%/20%, respectively.

Results from neural-network-guided identification tasks pertaining to five independent experimental runs are shown in Figure 3D. Confusion matrices are used to summarize the prediction accuracies, ranging between 0 and 1 (with 1 corresponding to perfectly accurate identification). 
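To make the confusion-matrix summary concrete, the sketch below row-normalizes a matrix of counts and reads per-class accuracies off the diagonal, then compares them with the random-guessing baseline of 1/(number of classes) quoted in the Figure 3 caption. The counts are hypothetical, and the convention that rows index the true class is an assumption.

```python
def per_class_accuracy(confusion):
    """Diagonal of the row-normalized confusion matrix (rows = true class, assumed)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def random_baseline(n_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / n_classes


# Hypothetical counts for three classes, 100 test samples each.
counts = [
    [85, 15, 0],
    [10, 80, 10],
    [5, 5, 90],
]
accuracies = per_class_accuracy(counts)
mean_accuracy = sum(accuracies) / len(accuracies)

print(accuracies)
print(f"mean accuracy: {mean_accuracy:.3f}")
print(f"baseline for 66 classes: {100 * random_baseline(66):.2f}%")
print(f"baseline for 6 classes:  {100 * random_baseline(6):.2f}%")
```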
Importantly, these results show that most tetramers are identified with high accuracy (i.e., the diagonal elements are significantly larger than the off-diagonal elements). The average classification accuracy for each model is provided in the caption of Figure 3D, along with the accuracy one would expect from random guessing. For example, we observed an accuracy of 0.85 for heterotetramers (2244, 2244), which is to be interpreted as an 85% success rate in correctly identifying the sequence 2244, or a 15% chance of misinterpreting 2244 as another combination or sequence order (Figure 3D). Overall, we performed a total of 13 different classification tasks, including one task for all classes (77 in total, from which only 66 were depicted due to small amounts of training data for the remaining 11 classes). We further included 12 tasks involving subsets of classes containing chemically modified nucleotides shown in Figure 1. For brevity, two results for 2X + 2Y classes and a summary of all results are shown in Figure 3D; the full set of results is shown in Figure S4.

Stable bonding of chemically modified nucleotides within a DNA double helix is important for DNA-based storage because it enables durable preservation of recorded information, as well as random access to the stored data by means of PCR.[4] To better understand the interactions between chemically modified and natural nucleotides, we investigated the stability of modified DNA duplexes by carrying out all-atom molecular dynamics (MD) simulations of the Dickerson dodecamers[48] containing a pair of chemically modified nucleotides. Out of many possible variants, we chose to investigate the stability of B1–T, B2–G, B3–A, and B5–A base pairs, as suggested by prior publications[29−35] and Integrated DNA Technologies (IDT),[28] as well as the pairing of B4 and B6 with all four types of natural nucleotides. Each modified dodecamer was solvated in electrolyte solution and simulated for approximately 350 ns. 
Six modified–natural base pairs (B1–T, B2–G, B3–A, B5–A, B6–A, and B6–C) were found to form stable hydrogen bond patterns within the duplex, forming either two or three hydrogen bonds per base pair (Figure ). The average number of hydrogen bonds was found to be 0.71 for B1–T, 1.37 for B2–G, 1.01 for B3–A, 1.00 for B5–A, 1.00 for B6–A, and 0.70 for B6–C, values that are compatible with those computed for the canonical base pairs (0.83 for A–T and 1.23 for C–G) using the same hydrogen bond criteria. In all other modified–natural combinations, we observed local disruptions of the base pairing structure (Figures S5 and S6). In the B4–A and B4–T pairs, the bases were observed to protrude from the duplex without disrupting the hydrogen bonding of the surrounding base pairs. The B6–G pair formed a base stacking pattern, forcing the breakage of hydrogen bonds in the adjacent base pairs. Local unraveling of the duplex structure was observed in the systems containing B4–G, B4–C, and B6–T base pairs. On the basis of these results, we conclude that most of our chemically modified nucleotides introduce only minor perturbations to the structure of the duplex. The exception is B4, which does not fit well within the geometry of the classical DNA duplex, although its presence is not sufficient to produce a complete unraveling of the duplex. However, we observed that an isolated B4–G base pair is able to maintain a stable stacking interaction when simulated under conditions that mimic the presence of a longer DNA strand (Figure S6).
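As a rough illustration of how an average hydrogen bond count per base pair can be obtained from a trajectory, the sketch below applies a geometric criterion to each frame and averages the resulting counts. The distance and angle cutoffs, the function names, and the four-frame trajectory are hypothetical placeholders; the criteria actually used in this study are given in the Supporting Information and are not reproduced here.

```python
def count_hbonds(frame, max_dist=3.0, max_angle=20.0):
    """Count candidate hydrogen bonds in one frame that meet the criterion.

    `frame` is a list of (donor-acceptor distance in angstroms,
    deviation from linearity in degrees) tuples, one per candidate bond.
    Both cutoffs are illustrative assumptions, not the paper's values.
    """
    return sum(1 for dist, angle in frame
               if dist <= max_dist and angle <= max_angle)

def average_hbonds(trajectory, **criteria):
    """Average the per-frame hydrogen bond count over all frames."""
    counts = [count_hbonds(frame, **criteria) for frame in trajectory]
    return sum(counts) / len(counts)

# Hypothetical trajectory for a base pair with two candidate hydrogen bonds.
trajectory = [
    [(2.9, 12.0), (2.8, 15.0)],  # both candidates satisfy the criterion
    [(2.9, 25.0), (3.1, 10.0)],  # neither does (angle, then distance)
    [(3.0, 18.0), (3.4, 30.0)],  # only the first does
    [(2.7, 8.0), (3.2, 14.0)],   # only the first does
]

strict = average_hbonds(trajectory)                               # 1.0
loose = average_hbonds(trajectory, max_dist=3.5, max_angle=30.0)  # 2.0
print(strict, loose)
```

The toy example shows that the time-averaged count depends on how strict the geometric criterion is, which is why averages for modified and canonical pairs are only comparable when computed with the same criteria.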

Conclusion

In closing, we report an expanded alphabet for DNA data storage that is compatible with nanopore sequencing technology. The unique feature of our approach is the coupled, iterative selection and testing of candidate nucleotides, which involves determining their suitability both for forming stable duplex structures and for nanopore sequencing. Overall, the described system enables the recording of digital data with increased storage density and more bits per synthesis cycle. In particular, our storage system enables a maximum recording density of $\log_2 11 \approx 3.46$ bits in each cycle, compared to $\log_2 4 = 2$ bits for natural DNA; this strategy also theoretically increases the rate (speed) of the recorder by $\log_2 11 / \log_2 4 \approx 1.73$-fold. Our extensive nanopore experiments provide strong evidence that many more chemically modified nucleotides can be used for molecular storage because many ionic current levels remain available; i.e., the ionic current spectrum is sparsely populated. In addition, our system allows for high-fidelity readouts and potentially enables PCR-based random access for encodings restricted to duplex-formation-competent monomers. An illustrative, yet limited, example of PCR-based random access is provided in Figure S7. Although not all pairings of chemical modifications may be suitable for amplification using natural enzymes, and some duplex formations may be unstable, the proposed system provides the first example of a coupled coding alphabet and channel selection and optimization paradigm. In conclusion, this work demonstrates fundamentally new directions in molecular storage that hold the potential to advance the field of DNA-based data storage.
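The density and rate figures quoted above follow from simple arithmetic, sketched below. The helper names `bits_per_cycle` and `cycles_needed` are illustrative, and the calculation assumes an ideal encoding that reaches the theoretical maximum of log2(alphabet size) bits per synthesis cycle, which a practical code may not achieve.

```python
import math

def bits_per_cycle(alphabet_size):
    """Theoretical maximum number of bits recorded per synthesis cycle."""
    return math.log2(alphabet_size)

def cycles_needed(n_bits, alphabet_size):
    """Minimum number of synthesis cycles needed to record n_bits (ideal code)."""
    return math.ceil(n_bits / bits_per_cycle(alphabet_size))

natural = bits_per_cycle(4)    # four natural nucleotides
extended = bits_per_cycle(11)  # 11-letter extended alphabet
speedup = extended / natural

print(f"{natural:.2f} vs {extended:.2f} bits per cycle")   # 2.00 vs 3.46
print(f"rate increase: {speedup:.2f}-fold")                # 1.73-fold
print(cycles_needed(100, 4), cycles_needed(100, 11))       # 50 29
```

Under these assumptions, recording 100 bits takes 50 cycles with natural DNA and 29 cycles with the 11-letter alphabet.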

Materials and Methods

Complete details of methods and materials used in this study are provided in the Supporting Information.

36 in total

1. DNA base-calling from a nanopore using a Viterbi algorithm.

Authors: Winston Timp; Jeffrey Comer; Aleksei Aksimentiev
Journal: Biophys J Date: 2012-05-15 Impact factor: 4.033

2. VMD: visual molecular dynamics.

Authors: W Humphrey; A Dalke; K Schulten
Journal: J Mol Graph Date: 1996-02

3. Next-generation digital information storage in DNA.

Authors: George M Church; Yuan Gao; Sriram Kosuri
Journal: Science Date: 2012-08-16 Impact factor: 47.728

4. DNA containing side chains with terminal triple bonds: Base-pair stability and functionalization of alkynylated pyrimidines and 7-deazapurines.

Authors: Frank Seela; Venkata Ramana Sirivolu
Journal: Chem Biodivers Date: 2006-05 Impact factor: 2.408

5. Random access in large-scale DNA data storage.

Authors: Lee Organick; Siena Dumas Ang; Yuan-Jyue Chen; Randolph Lopez; Sergey Yekhanin; Konstantin Makarychev; Miklos Z Racz; Govinda Kamath; Parikshit Gopalan; Bichlien Nguyen; Christopher N Takahashi; Sharon Newman; Hsing-Yeh Parker; Cyrus Rashtchian; Kendall Stewart; Gagan Gupta; Robert Carlson; John Mulligan; Douglas Carmean; Georg Seelig; Luis Ceze; Karin Strauss
Journal: Nat Biotechnol Date: 2018-02-19 Impact factor: 54.908

Review 6. Nanopore sequencing technology, bioinformatics and applications.

Authors: Yunhao Wang; Yue Zhao; Audrey Bollas; Yuru Wang; Kin Fai Au
Journal: Nat Biotechnol Date: 2021-11-08 Impact factor: 54.908

7. CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields.

Authors: K Vanommeslaeghe; E Hatcher; C Acharya; S Kundu; S Zhong; J Shim; E Darian; O Guvench; P Lopes; I Vorobyov; A D Mackerell
Journal: J Comput Chem Date: 2010-03 Impact factor: 3.376

8. Structure of a B-DNA dodecamer: conformation and dynamics.

Authors: H R Drew; R M Wing; T Takano; C Broka; S Tanaka; K Itakura; R E Dickerson
Journal: Proc Natl Acad Sci U S A Date: 1981-04 Impact factor: 11.205

9. Differential stabilities and sequence-dependent base pair opening dynamics of Watson-Crick base pairs with 5-hydroxymethylcytosine, 5-formylcytosine, or 5-carboxylcytosine.

Authors: Marta W Szulik; Pradeep S Pallan; Boguslaw Nocek; Markus Voehler; Surajit Banerjee; Sonja Brooks; Andrzej Joachimiak; Martin Egli; Brandt F Eichman; Michael P Stone
Journal: Biochemistry Date: 2015-01-29 Impact factor: 3.162

10. Giant single molecule chemistry events observed from a tetrachloroaurate(III) embedded Mycobacterium smegmatis porin A nanopore.

Authors: Jiao Cao; Wendong Jia; Jinyue Zhang; Xiumei Xu; Shuanghong Yan; Yuqin Wang; Panke Zhang; Hong-Yuan Chen; Shuo Huang
Journal: Nat Commun Date: 2019-12-11 Impact factor: 14.919
