S Kasra Tabatabaei (1,2), Bach Pham (3), Chao Pan (4), Jingqian Liu (1), Shubham Chandak (5), Spencer A Shorkey (3), Alvaro G Hernandez (6), Aleksei Aksimentiev (1,7), Min Chen (3), Charles M Schroeder (1,2,8,9), Olgica Milenkovic (4).
1. Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
2. Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
3. Department of Chemistry, University of Massachusetts at Amherst, Amherst, Massachusetts 01003, United States.
4. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
5. Department of Electrical Engineering, Stanford University, Stanford, California 94305, United States.
6. Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
7. Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
8. Department of Materials Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
9. Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
Abstract
DNA is a promising next-generation data storage medium, but challenges remain with synthesis costs and recording latency. Here, we describe a prototype of a DNA data storage system that uses an extended molecular alphabet combining natural and chemically modified nucleotides. Our results show that MspA nanopores can discriminate different combinations and ordered sequences of natural and chemically modified nucleotides in custom-designed oligomers. We further demonstrate single-molecule sequencing of the extended alphabet using a neural network architecture that classifies raw current signals generated by Oxford Nanopore sequencers with an average accuracy exceeding 60% (39× larger than random guessing). Molecular dynamics simulations show that the majority of modified nucleotides lead to only minor perturbations of the DNA double helix. Overall, the extended molecular alphabet may potentially offer a nearly 2-fold increase in storage density and potentially the same order of reduction in the recording latency, thereby enabling new implementations of molecular recorders.
Keywords:
DNA Data Storage; Nanopores; Neural Networks; Single-Molecule; Unnatural Nucleotides
DNA is emerging as
a robust data storage medium that offers ultrahigh
storage densities greatly exceeding conventional magnetic and optical
recorders. Information stored in DNA can be copied in a massively
parallel manner and selectively retrieved via polymerase chain reaction
(PCR).[1−10] However, existing DNA storage systems suffer from high latency caused
by the inherently sequential writing process. Despite recent progress,
a typical cycle time of solid-phase DNA synthesis is on the order
of minutes, which limits the practical applications of this molecular
storage platform.[11] Using current technologies,
writing 100 bits of information requires nearly 2 h [11] and costs more than U.S. $1,[12] assuming that each nucleotide stores its theoretical maximum
of two bits. To overcome these challenges, new synthesis methods and
information encoding approaches are required to accelerate the speed
of writing large-volume data sets.[13]

Expanding the alphabet of a DNA storage medium by including chemically
modified DNA nucleotides can increase both the storage density and
the writing speed because more than two bits are recorded during each
synthesis cycle. However, designing chemically modified nucleotides
as new letters for the DNA storage alphabet must be tightly coupled
to the process of reading the encoded information via DNA sequencing,
because current DNA sequencing methods, including single-molecule
nanopore sequencing, have been developed and optimized to read natural
nucleotides. Prior work reported an expanded nucleic acid alphabet
of synthetic DNA and RNA nucleotides that can be replicated and transcribed
using biological enzymes,[14] but this alphabet
was not designed for molecular storage applications and was not accurately
read using a nucleic acid sequencing method. Aerolysin nanopores were
used to detect synthetic polymers flanked by adenosines, where each
monomer of the polymer carries one bit of information.[15] Prior work has reported successful detection
of base pairs containing single chemically modified nucleotides[16,17] or discrimination of single nucleotides in natural versus modified
states.[18] Despite recent advances, single-molecule
detection and sequencing of an expanded molecular alphabet based on
a library of chemically diverse modified nucleotides has not yet been
demonstrated.

Here, we report an expanded molecular alphabet
for DNA data storage
comprising four natural and seven chemically modified nucleotides
that are readily detected and distinguished using nanopore sequencers
(Figure 1 and Table 1). Our results show
that Mycobacterium smegmatis porin A (MspA) nanopores,
which are widely used for ssDNA sensing and single molecule chemistry
studies,[19−21] can accurately discriminate 77 combinations and orderings
of chemically diverse monomers within homo- and heterotetrameric sequences
(Figures 1, 2, S1, and S2 and Tables S1–S3). We further demonstrate that
combinatorial patterns of natural and chemically
modified nucleotides can be classified with an average accuracy exceeding 60% using deep learning architectures
that operate on raw current signals generated by the GridION sequencer from Oxford
Nanopore Technologies (ONT)[22] (Figures 3, S3, and S4). We also study the
stability of DNA duplexes containing modified nucleotides using all-atom
molecular dynamics (MD) simulations[23−26] (Figures 4, S5, and S6 and Table S5).
Overall, the extended molecular alphabet has the potential to offer
a nearly 2-fold increase in storage density and potentially the same
order of reduction in recording latency, thereby providing a promising
path forward for the development of new molecular recorders.
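As a rough check on the density and latency figures quoted above, the sketch below computes the information carried per synthesis cycle for a 4-letter and an 11-letter alphabet. It is an idealized, error-free upper bound that assumes every letter is equally usable in every position; the function names are illustrative and do not come from the original work.

```python
import math

NATURAL = ["A", "C", "G", "T"]
MODIFIED = ["B1", "B2", "B3", "B4", "B5", "B6", "B7"]


def bits_per_symbol(alphabet_size: int) -> float:
    """Maximum information per nucleotide, assuming equiprobable, error-free letters."""
    return math.log2(alphabet_size)


def cycles_needed(n_bits: int, alphabet_size: int) -> int:
    """Smallest number of synthesis cycles n with alphabet_size**n >= 2**n_bits."""
    cycles = 0
    capacity = 1
    while capacity < 2 ** n_bits:
        capacity *= alphabet_size
        cycles += 1
    return cycles


natural_size = len(NATURAL)                    # 4 letters
extended_size = len(NATURAL) + len(MODIFIED)   # 11 letters

density_gain = bits_per_symbol(extended_size) / bits_per_symbol(natural_size)
natural_cycles = cycles_needed(100, natural_size)
extended_cycles = cycles_needed(100, extended_size)

print(f"bits per symbol: {bits_per_symbol(natural_size):.2f} vs {bits_per_symbol(extended_size):.2f}")
print(f"density gain: {density_gain:.2f}x")
print(f"cycles for 100 bits: {natural_cycles} vs {extended_cycles}")
```

Under these assumptions, an 11-letter alphabet carries about 3.46 bits per cycle, roughly 1.73 times the 2 bits of natural DNA, so writing 100 bits takes 29 cycles instead of 50.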
Figure 1
DNA data storage
using natural and chemically modified nucleotides.
(A) Chemical structures of natural DNA nucleotides (A, C, G, T) and
the selected chemically modified nucleotides employed in our study
(B1–B7). (B) Schematic of the ssDNA oligo used in MspA nanopore
experiments. The length of the oligos is 40 nucleotides (nts), with
biotin attached at the 5′ terminus. Homo- or heterotetrameric
sequences are located at positions 13–16, flanked by two polyT
regions of length 12 nt and 24 nt on the 5′ and 3′ ends,
respectively. (C) Sequence space for DNA homotetramers or heterotetramers
used in MspA nanopore experiments. The notation aX + bY, where a and b take values in {2, 3, 4} so that a + b = 4, indicates that “a” symbols of
the same kind are combined with “b”
symbols of another kind and arranged in an arbitrary linear order.
In total, 77 distinct tetrameric sequences were synthesized and tested
experimentally. (Left) Circular diagram showing all 11 homotetramers
and 12 tetrameric sequences of the form ACT + X, where X is a chemically
modified nucleotide from the set {B2, B3, B5}. (Middle) Circular diagram
showing all 30 tested combinations of tetrameric sequences with total
composition 2X + 2Y using chemically modified monomers from the set
{B1, B2, B3, B4, B5}, including sequence patterns XXYY, XYYX, and
XYXY. (Right) Circular diagram showing the remaining 24 combinations
of tetrameric sequences with total composition 3X + Y using the set
{B2, B3, B5}. Five chemically modified nucleotides form stable base
pairs with natural nucleotides via hydrogen bonds (B2–G, B3–A,
B5–A, B6–A, B6–C), based on the results from
molecular dynamics (MD) simulations.
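The oligo layout in panel (B) and the sequence counts in panel (C) can be reproduced with a short sketch. The helper names are hypothetical, the biotin tag is omitted, and the 30 tested 2X + 2Y sequences are taken as a stated count because the caption does not list them individually.

```python
from itertools import permutations

NATURAL = ["A", "C", "G", "T"]
MODIFIED = ["B1", "B2", "B3", "B4", "B5", "B6", "B7"]


def build_oligo(tetramer):
    """Return 5'-(dT)12-XXXX-(dT)24-3' as a list of symbols (biotin omitted)."""
    if len(tetramer) != 4:
        raise ValueError("a tetramer must contain exactly four symbols")
    return ["T"] * 12 + list(tetramer) + ["T"] * 24


def insert_everywhere(base, symbol):
    """All sequences obtained by inserting `symbol` at each position of `base`."""
    base = tuple(base)
    return [base[:i] + (symbol,) + base[i:] for i in range(len(base) + 1)]


# 11 homotetramers: one per natural or modified symbol.
homotetramers = [(s,) * 4 for s in NATURAL + MODIFIED]

# ACT + X: one modified nucleotide inserted at any of four positions of ACT.
act_plus_x = [
    t for x in ("B2", "B3", "B5") for t in insert_everywhere(("A", "C", "T"), x)
]

# 3X + Y: one symbol Y inserted into XXX, for ordered pairs from {B2, B3, B5}.
three_plus_one = [
    t
    for x, y in permutations(("B2", "B3", "B5"), 2)
    for t in insert_everywhere((x, x, x), y)
]

# Stated in the caption; the individual 2X + 2Y sequences are not enumerated here.
N_2X_2Y_TESTED = 30

total = len(homotetramers) + len(act_plus_x) + len(three_plus_one) + N_2X_2Y_TESTED

oligo = build_oligo(("B2", "B3", "B2", "B3"))
print(len(oligo), oligo[12:16], total)
```

The tetramer occupies list indices 12 to 15, which correspond to positions 13–16 counted from the 5′ end, and the four families add up to the 77 sequences reported.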
Table 1
Chemically Modified Nucleotides Used in the DNA Data Storage System, Along with Their Chemical Properties (a)

| Symbol | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Name | 2,6-Diaminopurine 2′-deoxyriboside | 5-Hydroxymethyldeoxycytidine | 5-Hydroxybutynl-2′-deoxyuridine | 5-Nitroindole-2′-deoxyriboside | Deoxyuridine | 5-Octadiynyldeoxyuridine | 1,2-Dideoxyribose |
| Structurally most similar nucleotide | dA | dC | dT | dA | dT | dT | - |
| Pairing mate / interaction type (experiment*) | dT / H bonds[28−30] | dG / H bonds[31] | dA / - | All natural nucleotides / stacking[28,32] | dA / H bonds[33] | dA / H bonds[34,35] | - |
| Pairing mate / interaction type (simulation**) | dT / H bonds | dG / H bonds | dA / H bonds | dG / stacking | dA / H bonds | dA, dC / H bonds | - |

(a) The symbols and the names of
the chemically modified nucleotides are shown in the first and second
rows, and the molecular structures are depicted in Figure 1. Structurally similar natural
nucleotides are shown in the third row. In general, distinct chemical
functional groups and molecular charges play an important role in
discriminating nucleobases using MspA and ONT sequencers. The last
two rows show pairing properties of the modified bases. * denotes
data from Integrated DNA Technologies[28] or experimental data from previous work,[29−35] while ** denotes results from molecular dynamics simulations reported
in Figure 4 and the Supporting Information (Figures S5 and S6, Table S5). Short dashes indicate that pairing is inherently impossible (e.g.,
B7) or that no specific information is published (e.g., interaction
type of B3-dA pairing).
Figure 2
Identification
of chemically modified DNA using MspA nanopores.
(A) Schematic diagram of ssDNA immobilized in a MspA nanopore, where
ssDNA containing a biotin–streptavidin interaction at the 5′
terminus prevents translocation through the pore. Residual ion current
generated by four nucleotides at positions 13–16 from the 5′
terminus is recorded for ssDNA immobilized in the pore. (B) Histograms
of average residual ionic currents Ires shown in gray for different homopolymers (A, T, C, G, and B1–B7).
The fitted Gaussian curves are depicted in red for natural nucleotides
(A, T, C, G) and in blue for chemically modified nucleotides (B1–B7).
(C) Histograms of the average residual ionic currents and the fitted
Gaussian curves at various applied voltages for tetramers involving
different combinations and orderings of B2 and B3. (D) Peak values
(points) and confidence intervals (bars) of the fitted Gaussians with
mean residual ionic currents corresponding to tetramers obtained by
inserting one of the monomers B2 and B3 into the sequence ACT, at
applied biases of 150 mV and 180 mV. (E) Schematic of the shift reconciliation
method for resolving ambiguities in the readouts of different tetramers.
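The shift reconciliation idea in panel (E) can be sketched with the five-symbol sequence 23223 discussed in the Results, where the first tetramer reads ambiguously as 2322, 2332, or 2323 and the next window, 3223, is read without ambiguity. The function below is a minimal illustration with a hypothetical name, using single characters 2 and 3 for B2 and B3; it is not the authors' implementation.

```python
def reconcile(candidates, next_tetramer):
    """Keep the candidate tetramers consistent with the next sliding window.

    Two consecutive windows share three symbols: the last three symbols of the
    earlier tetramer must equal the first three symbols of the later one.
    """
    return [c for c in candidates if c[1:] == next_tetramer[:3]]


# Readouts that are confusable with one another at 150 mV (from the text).
ambiguous_prefix = ["2322", "2332", "2323"]

# The window shifted one position to the right is read unambiguously.
resolved = reconcile(ambiguous_prefix, "3223")
print(resolved)
```

Only 2322 ends in the trimer 322 that begins 3223, so the ambiguity collapses to a single candidate and the full sequence 23223 is recovered.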
Figure 3
Sequencing oligos containing chemically modified nucleotides
using
ONT GridION. (A) Schematic of oligo design and a picture of the GridION
sequencer used in our experiments. (B) (Left) Illustration of current
levels of polyA and polyT regions, used in our custom level calibration
scheme. Dashed orange circle indicates the region harboring the signals
from chemically modified nucleotides. (Right) Region-of-interest in
raw current signal obtained by identifying polyA-polyT patterns. (C)
Neural network model used for classification. The 1D residual neural
network architecture comprises nine 1D convolution blocks. For example,
a 1D convolution block (1 × 8 conv, 64) indicates that the kernel
size for the convolution is 1 × 8 and that the number of output
channels is 64. Half-downsampling for each channel is denoted by (/2);
averaging over all channels to arrive at a single vector is referred
to as “average pooling”; the (fc 128 × 30) notation
indicates a fully connected layer with the shape 128 × 30. (Right)
Magnified view of the operation of 1D convolutional neural networks
on time-series data. (D) (Top) Confusion matrix for 66 classes, all
of which have roughly the same number of samples (subsampled to ∼3500
sample oligos in each class). Random guessing would lead to a classification
accuracy of 1.52%, whereas the smallest accuracy from our model is
41% (tetramer 2252). For our model-based prediction, the mean classification
accuracy is 60.28% ± 0.28% (39× larger than random guessing),
and the highest observed accuracy is 79% (tetramer 1111). The exact
number of samples in each class is listed in Table S4. (Bottom left) Confusion matrix for six selected classes
using B2 and B4 (named as listed, subsampled to roughly 5000 samples
per class). Random guessing leads to an accuracy of 16.67%, whereas
our model-based prediction ensures an average classification accuracy
of 72.25% ± 1.46%. (Bottom right) Confusion matrix for six selected
classes using B4 and B5 (named as listed, subsampled to roughly 5000
samples per class). Random guessing leads to an accuracy of 16.67%,
while our model-based prediction ensures an average accuracy of 77.84%
± 0.96%.
Figure 4
Stability of DNA duplexes containing chemically
modified nucleotides.
The backbone of the dodecamer is shown using silver spheres, whereas
the bases are drawn as molecular bonds. Chemically modified bases
and the natural bases that pair with them are colored according to
the atom type (cyan for carbon, blue for nitrogen, and red for oxygen).
Base pairs immediately adjacent to the modified base pair are colored
in red or blue. (A) Microscopic configurations of modified base pairs
(from top to bottom: B1–T, B2–G, A–B3, A–B5,
A–B6, and C–B6). (B) Donor (N1)–acceptor (N3)
distance in the modified base pair (black) and in the adjacent
base pairs (red and blue) during the last 100 ns of the 350 ns MD
simulation. The arrows indicate the correspondence between the base
pairs and the curves. The curves show a running average of the 10
ps sampled data with a 2 ns averaging window. (C) Microscopic configuration
of modified base pairs. The black lines represent hydrogen bonds.
The donor and the acceptor are labeled beside the atoms. (D) Probability
of observing the specified number of hydrogen bonds within a modified
base pair. The H-bonding probabilities were computed using the final
100 ns of a 350 ns all-atom MD simulation of a DNA dodecamer.
Results and Discussion
To determine
whether natural and chemically modified DNA nucleotides
can be distinguished using the biological nanopore MspA, we designed
a series of single-stranded DNA (ssDNA) molecules with the general
sequence 5′-biotin-(dT)12-XXXX-(dT)24-3′, where X ∈ {A, T, C, G, B1–B7} (Figure 1, Figures S1 and S2, Tables S1–S3).
We hypothesized that specific chemical modifications to nucleobases
such as amines, alkynes, or indole moieties can alter polymer–amino
acid interactions in biological nanopores, thereby generating distinct
signals in nanopore readouts. In the process, we also considered the
stability of base pairing and base stacking interactions between natural
and chemically modified nucleotides using a combination of MD simulations
and experiments (Tables 1 and S5, Figures 4 and S5–S7).

Following molecular design and synthesis of ssDNA oligos
(the chemical
characterization and mass spectrometry analysis of oligos containing
chemically modified nucleotides are provided in the Supporting Information (Figures S8–S84)), we performed
MspA nanopore experiments where ssDNA oligos containing biotin at
the 5′ terminus were electrophoretically driven into MspA
nanopores. The bulky streptavidin protein bound to the biotin tag prevents the oligos from
fully translocating through the pore without appreciably affecting
the measured ionic currents.[27] Consequently,
ssDNA molecules are effectively immobilized within MspA nanopores,
exposing the four nucleotides at positions 13–16 from the tethering
point to the constriction of the MspA pore (Figure 2A).[36] In this
assay, streptavidin holds ssDNA in the MspA constriction in a similar
fashion to a helicase enzyme that steps through double-stranded DNA (dsDNA)
in an ONT sequencer, thereby enabling long-duration current readings
for each sequence tetramer (Figure S1).

We next used MspA nanopores to determine residual currents for
homotetrameric sequences of all natural and chemically modified monomers
(Figure 2B). Our results
show that MspA accurately discriminates all four natural (A, G, C,
T) and nearly all chemically modified nucleotides (B1–B7) at
an applied bias of 150 mV. The abasic nucleotide B7 shows the largest
residual current, which likely arises due to its small molecular size
and reduced ability to interact with the reading head of MspA. The
residual current levels are sensitive to the chemical identity of
the nucleotides but do not directly correlate with their molecular
size (Figure 2B). For
example, current signals from B6 and B2 overlap at 150 mV, but B6
is well separated from B3 despite being structurally similar. We further
studied the effect of the applied bias on the resolution of nucleotide
bases. At 150 mV, four chemically modified nucleotides (B2, B3, B4,
B5) showed well-resolved signals from each other and the natural nucleotides,
but the current levels from B6 exhibited around 68% overlap with B2.
Upon increasing the applied bias to 180 mV, the resolution between
B2 and B6 was significantly improved, with an overlap area of the
fitted Gaussian curves of 18%. At the same time, resolution at 180 mV
decreased in the Ires region exceeding 20%,
as may be seen from the residual currents of B4, A, and G, whose
Gaussian readout distributions overlap in area by more than 90%
(Figure 2B).

We further used MspA to detect and identify heterotetrameric sequences
with compositions 2X + 2Y, where X, Y ∈ {B2, B3, B4, B5} (Figure 2C, Figures S1 and S2, Tables S1–S3). Our results show that MspA can distinguish all heterotetrameric
sequences with the same nucleotide composition when measurements at
all three applied biases (150 mV, 180 mV, 200 mV) are performed. Due
to the large sequence space explored, here we focus our discussion
on representative tetrameric combinations of B2 and B3 (Figure 2C). In most cases, the residual
currents of heterotetramers fall between those of two corresponding
homotetramers. For example, the tetramer 3223 has an Ires of 12.3%, whereas those of B2 and B3 are 10.2% and
12.6%, respectively (at 180 mV). However, some combinations of B2
and B3, including 2232, 2322, 2333, 3233, 2323, 2332, and 2233, showed
significant decreases in residual currents compared to homotetramers
B2 and B3 (Figure 2C), whereas the residual current of tetramer 3322 is larger than
those of the B2 and B3 homotetramers at either 150 mV or 180 mV. Importantly,
all tetrameric sequences were resolved by adjusting the applied bias.[37] At a higher applied bias of 200 mV, tetramers
that were unresolved at lower bias were readily resolved, including
2322, 2332, and 2323 (Figure 2C). Overall, these results are consistent with the observation
that the residual current levels of DNA tetramers are not directly
correlated with molecular size, similar to the case of natural nucleotides[38] where the blockade current was found to be determined
by the competition of steric and base stacking interactions.[39]

We next investigated the ability of MspA
pores to resolve different
tetramers containing both natural and chemically modified nucleotides
(Figure 2D). Here,
we specifically focused on heterotetramers containing a single chemically
modified nucleotide (B2, B3, or B5) added in different positions of
the directional sequence ACT.[38] Our results
clearly show that different positions of the chemically modified nucleotide
in the tetramer generate distinct residual currents. For example,
the residual currents of heterotetrameric sequences with B2 inserted at
each of the four positions of ACT (2ACT, A2CT, AC2T, and ACT2) are readily
resolved at both 150 mV and 180 mV (Figure 2D). Although the fitted Gaussians of homotetramer
B2 and heterotetramer 2ACT overlap by ∼29%
at 150 mV, they are distinguishable at 180 mV. In addition, nearly
all heterotetrameric sequences with B3 inserted at each of the four positions of ACT
were resolved from the homotetramer B3 at 150 mV and 180 mV, whereas
the residual currents of 3ACT and ACT3 were only distinguishable at
180 mV (Figure 2D).
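The overlap percentages quoted for pairs of fitted Gaussians can be illustrated as the area under the pointwise minimum of two normal densities. The text does not state how the overlap was computed, so this definition is an assumption, and the means and standard deviations used below are hypothetical placeholders rather than measured values.

```python
import math


def normal_pdf(x, mean, sd):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def overlap_area(mean1, sd1, mean2, sd2, steps=20000):
    """Area shared by two normal densities (0 = disjoint, 1 = identical).

    Integrates min(pdf1, pdf2) with the midpoint rule over a range that
    covers eight standard deviations around both means.
    """
    lo = min(mean1 - 8.0 * sd1, mean2 - 8.0 * sd2)
    hi = max(mean1 + 8.0 * sd1, mean2 + 8.0 * sd2)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += min(normal_pdf(x, mean1, sd1), normal_pdf(x, mean2, sd2))
    return total * dx


# Hypothetical residual-current distributions (percent of open-pore current).
overlap = overlap_area(10.0, 0.5, 11.0, 0.5)
print(f"overlap: {overlap:.1%}")
```

For two Gaussians with equal width, the overlap depends only on the separation of the means in units of the standard deviation, which is why shifting the means with the applied bias can change how well two tetramers are resolved.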
These results are consistent with prior work reporting that tuning
the applied bias is a useful approach to enhance the accuracy of nanopore-based
sequencing methods.[40] In summary, these
results show the ability of MspA nanopores to accurately identify
sequences containing chemically modified nucleotides.

In theory,
sequence context allows for high-resolution readout
of arbitrary combinations and arrangements of natural and modified
nucleotides (A, C, G, T, B1–B7). Although specific sets of
tetramers might be confused during MspA reading, the method of shift
reconciliation[41] allows for such sequences
to be fully resolved using the information provided by different shifts
of the tetramers within the constriction of the nanopore (Figure 2E). The concept of
shift reconciliation is illustrated with the following example, where
we consider a heterogeneous sequence of 23223. In terms of the corresponding
residual current levels, the prefix tetramer 2322 is confusable with
2332 or 2323 at 150 mV. However, by shifting the sliding window one
position to the right, we obtain the tetramer 3223, which is not confusable
with any other block. Because the trimer prefix of 3223, namely 322,
matches the trimer suffix of only one of the tetramers 2322, 2332, and
2323 (i.e., the first one), we unambiguously deduce that 2322 is the
correct prefix tetramer.

Moving beyond tetramer detection via
MspA, we demonstrate that
commercially available nanopore-based sequencing technology (ONT GridION)
can be used to classify/sequence oligos containing the proposed molecular
alphabet. For GridION experiments, the same ssDNA oligos used in MspA
experiments were extended at the 3′ terminus with a polyA tail
of random length of >100 nts, which is used to increase the length
of the oligos and guide them inside the pore (Figure 3A). We retrieved raw current signals from
the GridION platform following a custom RNA sequencing protocol (methods
section). We processed the raw current signals using deep learning
techniques to discriminate and identify different combinations and
orderings of the chemically modified nucleotides. As a first step,
we isolated regions in the raw current signals corresponding to chemically
modified nucleotides. For this purpose, we could not use the specialized
software suite Tombo,[42] designed by ONT
for identifying potentially modified nucleotides from nanopore sequencing
data, as it requires basecalling, alignment, and further downstream
processing. Accurate basecalling of chemically modified nucleotides
is difficult to accomplish, which greatly complicates alignment and
classification tasks for arbitrary subregions of the signal. Moreover,
the most recent ONT basecaller, Bonito, based on convolutional neural
networks, is trained and specialized to work for natural DNA only.[43] For these reasons, we developed an analysis
framework that directly operates on raw current signals of the chemically
modified nucleotides.

Analysis of raw current signals is challenging
because nanopore
current signals exhibit extreme variations known as level drifts (Figure S3). Level drifts arise because each membrane
patch (recording channel) inside the device has its own electric circuit,
and each pore has unique features. To address this challenge, we developed
a two-step identification scheme depicted in Figure 3B. In the first step, we estimate the current
level for the polyA region and subsequently use it for signal calibration.
Similar calibration steps are standardly performed for nanopore sequencing
of natural DNA, but they rely on adaptor-based calibrations since
all analytes use identical adaptors with a well-defined sequence content.
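Level calibration of this kind can be illustrated with a kernel density estimate of the raw samples, whose two most probable levels are then taken as the polyA and polyT references. The sketch below uses only the standard library; the synthetic levels (90, 70, and 80 in arbitrary units), the bandwidth, the minimum peak separation, and all function names are assumptions for illustration and not the authors' implementation.

```python
import math
import random


def kde(samples, grid, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]


def two_dominant_levels(samples, bandwidth=1.5, n_grid=200, min_separation=5.0):
    """Return the two most probable, well-separated signal levels."""
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    density = kde(samples, grid, bandwidth)
    # Local maxima of the estimated density, highest first.
    peaks = [
        (density[i], grid[i])
        for i in range(1, n_grid - 1)
        if density[i] > density[i - 1] and density[i] >= density[i + 1]
    ]
    peaks.sort(reverse=True)
    chosen = []
    for _, level in peaks:
        # Skip small bumps that sit on the shoulder of an already chosen peak.
        if all(abs(level - c) >= min_separation for c in chosen):
            chosen.append(level)
        if len(chosen) == 2:
            break
    return chosen


# Synthetic read (sample order is irrelevant to the density estimate).
rng = random.Random(0)
signal = (
    [rng.gauss(90.0, 2.0) for _ in range(600)]    # hypothetical polyA level
    + [rng.gauss(70.0, 2.0) for _ in range(300)]  # hypothetical polyT level
    + [rng.gauss(80.0, 2.0) for _ in range(40)]   # hypothetical modified region
)

levels = two_dominant_levels(signal)
poly_a_level, poly_t_level = max(levels), min(levels)
print(f"polyA ~ {poly_a_level:.1f}, polyT ~ {poly_t_level:.1f}")
```

Taking the higher of the two dominant levels as polyA reflects the expectation stated in the text that polyT levels are lower than polyA levels on average.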
For actual level calibration, we used kernel density estimation of
the signal level distribution,[44] followed
by identification of the levels that have the two largest probabilities
in the estimated distribution. This approach is justified because
polyA regions constitute the longest signal component in our oligo
sequences. Moreover, on average, polyT levels are expected to be lower
than polyA levels, so readout regions that are trailed by nearly flat
regions with a mean level value lower than that for the polyA tails
are filtered using a finite state machine.[45] These regions are expected to bear signals from the chemically modified
nucleotides. After extracting modification-bearing signals, raw current
readouts are subsequently classified. For this task, we designed a
1D residual neural network model[46,47] (Figure C) containing 1D
convolution layers (conv) that serve as feature extractors and one
fully connected layer (fc) that serves as a classifier. The model
is trained on oligo data corresponding to different combinations and
orderings of chemically modified nucleotides, with each option supported
by thousands of training samples (Table S4). Elements from each class are uniformly sampled at random in a
balanced manner and split into training/validation/test sets with
splitting percentages 60%/20%/20%, respectively.

Results from neural-network-guided identification tasks pertaining
to five independent experimental runs are shown in Figure D. Confusion matrices are used
to summarize the prediction accuracies, ranging between 0 and 1 (with
1 corresponding to perfectly accurate identification). Importantly,
these results show that most tetramers are identified with high accuracy
(i.e., the diagonal elements are significantly larger than the off-diagonal
elements). The average classification accuracy for each model is provided
in the caption of Figure D, along with the accuracy one would expect from random guessing.
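
As a minimal sketch of how such a summary can be computed, the snippet below builds a row-normalized confusion matrix, averages its diagonal, and reports the accuracy expected from uniform random guessing. The function names are hypothetical, and averaging the diagonal is an assumption that matches overall accuracy only when the classes are balanced; it is not taken from the authors' code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix.

    Entry [i, j] is the fraction of class-i samples predicted as class j,
    so each diagonal entry is the per-class identification accuracy.
    """
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no samples are left as zeros instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

def mean_class_accuracy(cm):
    """Average of the diagonal (assumes balanced classes; illustrative)."""
    return float(np.mean(np.diag(cm)))

def chance_accuracy(n_classes):
    """Accuracy expected from uniform random guessing over n_classes."""
    return 1.0 / n_classes

# Toy labels for three classes (made-up data, for illustration only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
print(mean_class_accuracy(cm), chance_accuracy(3))
```

For the 66-class task described in this section, uniform guessing corresponds to 1/66, or roughly 0.015.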
For example, we observed an accuracy of 0.85 for heterotetramers (2244,
2244), which is to be interpreted as an 85% success rate in correctly
identifying the sequence 2244, or a 15% chance of misinterpreting
2244 as another combination or sequence order (Figure D). Overall, we performed a total of 13 different
classification tasks, including one task for all classes (77 in total,
of which only 66 were depicted due to small amounts of training
data for the remaining 11 classes). We further included 12 tasks involving
subsets of classes containing chemically modified nucleotides shown
in Figure . For brevity,
two results for 2X + 2Y classes and a summary of all results are shown
in Figure D; the full
set of results is shown in Figure S4.

Stable bonding of chemically modified nucleotides within a DNA
double helix is important for DNA-based storage because it enables
durable preservation of recorded information, as well as random access
to the stored data by means of PCR reactions.[4] To better understand the interactions between chemically modified
and natural nucleotides, we investigated the stability of modified
DNA duplexes by carrying out all-atom molecular dynamics (MD) simulations
of the Dickerson dodecamers[48] containing
a pair of chemically modified nucleotides. Out of many possible variants,
we chose to investigate the stability of B1–T, B2–G,
B3–A, and B5–A base pairs, as suggested by prior publications[29−35] and Integrated DNA Technologies (IDT),[28] as well as the pairing of B4 and B6 with all four types of natural
nucleotides. Each modified dodecamer was solvated in electrolyte solution
and simulated for approximately 350 ns. Six modified–natural base
pairs (B1–T, B2–G, B3–A, B5–A, B6–A,
and B6–C) were found to form stable hydrogen bond patterns
within the duplex, with either two or three hydrogen bonds per base
pair (Figure ). The
average number of hydrogen bonds was found to be 0.71 for B1–T,
1.37 for B2–G, 1.01 for B3–A, 1.00 for B5–A,
1.00 for B6–A, and 0.70 for B6–C, values that are
consistent with those computed for the canonical base pairs
(0.83 for A–T and 1.23 for C–G) using the same hydrogen
bond criteria. In all other modified–natural combinations, we observed
local disruptions of the base pairing structure (Figures S5 and S6). In B4–A and B4–T pairs,
the bases were observed to protrude out from the duplex without disrupting
the hydrogen bonding of the surrounding base pairs. The B6–G
pair formed a base stacking pattern, forcing the breakage of hydrogen
bonds in the adjacent base pairs. Local unraveling of the duplex structure
was observed in the systems containing B4–G, B4–C, and
B6–T base pairs. On the basis of these results, we conclude
that most of our chemically modified nucleotides introduce only minor perturbations
to the structure of the duplex. The exception is B4, which does not fit well
within the geometry of the classical DNA duplex, although its presence is not sufficient
to produce a complete unraveling of the duplex. However, we observed
that an isolated B4–G base pair is able to maintain stable
stacking interaction when simulated under conditions that mimic the
presence of a longer DNA strand (Figure S6).
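
To illustrate how an average hydrogen bond count per base pair can be obtained from a trajectory, the following sketch applies a geometric criterion to per-frame donor-acceptor distances and donor-hydrogen-acceptor angles. The text above does not state the criteria that were used, so the cutoffs here (3.5 Å and 150°) are assumptions chosen only for illustration, and the function name and input layout are hypothetical.

```python
import numpy as np

def mean_hbond_count(da_dist, dha_angle, max_dist=3.5, min_angle=150.0):
    """Average number of hydrogen bonds per frame for one base pair.

    da_dist:   (n_frames, n_candidates) donor-acceptor distances, in angstroms
    dha_angle: (n_frames, n_candidates) donor-H-acceptor angles, in degrees
    max_dist, min_angle: ASSUMED geometric cutoffs, not the authors' values
    """
    da_dist = np.asarray(da_dist, dtype=float)
    dha_angle = np.asarray(dha_angle, dtype=float)
    # A candidate counts as a hydrogen bond in a frame only if both the
    # distance and the angle criteria are satisfied in that frame.
    formed = (da_dist <= max_dist) & (dha_angle >= min_angle)
    # Count bonds in each frame, then average over the trajectory.
    return float(formed.sum(axis=1).mean())

# Toy trajectory: 4 frames, 2 candidate hydrogen bonds (made-up numbers).
dist = [[2.9, 3.0], [2.9, 3.8], [3.6, 3.0], [2.8, 2.9]]
angle = [[170, 160], [165, 170], [170, 140], [155, 175]]
print(mean_hbond_count(dist, angle))
```

Under a criterion of this kind, a bond that is transiently broken in some frames contributes less than one to the average, which is one way a time-averaged count can fall below the nominal number of hydrogen bonds in a base pair.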
Conclusion
In closing, we report an expanded alphabet for
DNA data storage
compatible with nanopore sequencing technology. The unique feature
of our approach is the coupled, iterative selection and testing of the alphabet, which involves
determining suitability for both stable duplex formation and nanopore
sequencing. Overall, the described system enables the recording of
digital data with increased storage density and more bits per synthesis
cycle. In particular, our storage system enables a maximum recording
density of $\log_2 11$ bits in each cycle, compared
to $\log_2 4 = 2$ bits for natural DNA; this strategy
also theoretically increases the rate (speed) of the recorder by
$\log_2 11 / \log_2 4 \approx 1.73$-fold. Our extensive nanopore experiments
provide strong evidence that many more chemically modified nucleotides
can be used for molecular storage because many ionic current levels
remain available; i.e., the ionic current spectrum is sparsely populated.
In addition, our system allows for high-fidelity readouts and potentially
enables PCR-based random access for encodings restricted to
duplex-formation-competent monomers. An illustrative, yet limited example
of PCR-based random access is provided in Figure S7. Although not all pairings of chemical modifications may
be suitable for amplification using natural enzymes, and some duplex
formations may be unstable, the proposed system provides the first
example of a coupled coding alphabet and channel selection and optimization
paradigm. In conclusion, this work demonstrates fundamentally new
directions in molecular storage that hold the potential to advance
the field of DNA-based data storage.
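
The density figures quoted above follow from the alphabet size alone. The short sketch below reproduces the arithmetic for an 11-symbol alphabet against the 4-symbol natural alphabet; the function names are hypothetical, and the calculation assumes one symbol is written per synthesis cycle and ignores any coding overhead.

```python
import math

def bits_per_cycle(alphabet_size):
    """Maximum information per synthesis cycle, in bits (one symbol per cycle)."""
    return math.log2(alphabet_size)

def rate_gain(extended_size, natural_size=4):
    """Ratio of bits per cycle for the extended versus the natural alphabet."""
    return bits_per_cycle(extended_size) / bits_per_cycle(natural_size)

print(bits_per_cycle(4))             # 2 bits for natural DNA
print(round(bits_per_cycle(11), 2))  # about 3.46 bits for 11 symbols
print(round(rate_gain(11), 2))       # about 1.73-fold
```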
Materials and Methods
Complete details of methods and materials used in this study are
provided in the Supporting Information.