Lee Organick, Yuan-Jyue Chen, Siena Dumas Ang, Randolph Lopez, Xiaomeng Liu, Karin Strauss, Luis Ceze.
Abstract
Synthetic DNA is gaining momentum as a potential storage medium for archival data storage. In this process, digital information is translated into sequences of nucleotides and the resulting synthetic DNA strands are then stored for later retrieval. Here, we demonstrate reliable file recovery with PCR-based random access when as few as ten copies per sequence are stored, on average. This results in a density of about 17 exabytes/gram, nearly two orders of magnitude greater than prior work has shown. We successfully retrieve the same data in a complex pool of over 10^10 unique sequences per microliter with no evidence that we have begun to approach complexity limits. Finally, we also investigate the effects of file size and sequencing coverage on successful file retrieval and look for systematic DNA strand dropout. These findings substantiate the robustness and high data density of the process examined here.
Year: 2020 PMID: 32001691 PMCID: PMC6992699 DOI: 10.1038/s41467-020-14319-8
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Experiment overview.
a A high-level representation of the DNA data storage pipeline. b (Left) The bar chart depicts contents of the initial, undiluted pool. (Right) The illustration shows the serial nature of subsequent dilutions. Mean copy number refers to the mean number of copies of each file's unique sequences as determined by qPCR (Supplementary Note 1). One serial dilution used water as the diluent in each step; the other used a solution of 150 Nmers to dilute the pool to much greater complexity. c Details of how the samples were diluted. Note that the dilution steps were identical regardless of diluent. The smallest percent of pool accessed is calculated by dividing the size of the smallest file by the number of unique sequences in the 1 μL of solution used for PCR random access. This percentage refers to the 150 Nmer diluent pool since the small file in the water diluent pool is a constant 0.13%.
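The percent-of-pool calculation described for panel c can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the pool complexity of 1e10 unique sequences in the 1 μL sample is an assumed round figure chosen only to show the arithmetic. The small-file size of 2042 sequences is taken from the Fig. 3 caption.

```python
def percent_of_pool_accessed(file_sequences: int, unique_sequences_in_sample: float) -> float:
    """Return the file size as a percentage of the unique sequences
    present in the sample used for PCR random access."""
    if unique_sequences_in_sample <= 0:
        raise ValueError("unique_sequences_in_sample must be positive")
    return 100.0 * file_sequences / unique_sequences_in_sample


# Small file size from the Fig. 3 caption.
small_file_sequences = 2042
# ASSUMPTION: illustrative pool complexity, not a measured value.
assumed_unique_sequences = 1e10

pct = percent_of_pool_accessed(small_file_sequences, assumed_unique_sequences)
print(f"Smallest percent of pool accessed: {pct:.3e} %")
```

The same function applies to either diluent; only the denominator changes, which is why the water-diluted pool keeps a constant percentage while the 150 Nmer pool's percentage shrinks with each dilution.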
Fig. 2 Examining sequence loss behavior.
a Each plot illustrates a file's loss of sequences recovered at 20× coverage, directly comparing the samples diluted in water to those diluted in 150 Nmers (150-nt sequences composed of random nucleotides). The threshold of the maximum number of sequences that can be lost while still permitting file recovery is plotted for reference, as determined by previous work[10]. Error bars represent 95% confidence intervals; x-axis errors are taken from triplicate qPCR data (see Methods), and y-axis errors are the result of 100 simulations of the original sequencing data sub-sampled to 20× sequencing coverage (see Methods). b Each plot illustrates behavioral similarities for each file in each diluent condition, with a power regression overlaid (see Supplementary Note 4). The data used here are also sub-sampled to a sequencing coverage of 20×.
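The sub-sampling simulation behind the y-axis error bars can be sketched as below. This is an illustrative sketch only; the actual procedure is in the paper's Methods. The assumptions are: each read has already been assigned to a sequence ID in the file, coverage means reads per unique sequence, reads are drawn without replacement, and the 95% interval is taken from the empirical 2.5th and 97.5th percentiles of the simulations. All function names are hypothetical.

```python
import random


def missing_fraction(reads, n_sequences, coverage, rng):
    """Sub-sample reads to the target coverage and return the fraction
    of the file's sequences that have no read in the sub-sample."""
    n_reads = coverage * n_sequences
    if n_reads > len(reads):
        raise ValueError("not enough reads for the requested coverage")
    recovered = set(rng.sample(reads, n_reads))  # draw without replacement
    return 1.0 - len(recovered) / n_sequences


def simulate_missing(reads, n_sequences, coverage=20, n_sim=100, seed=0):
    """Repeat the sub-sampling n_sim times; return (mean, lower, upper),
    where lower/upper are empirical 2.5th/97.5th percentiles (ASSUMPTION)."""
    rng = random.Random(seed)
    fractions = sorted(
        missing_fraction(reads, n_sequences, coverage, rng) for _ in range(n_sim)
    )
    mean = sum(fractions) / n_sim
    lower = fractions[int(0.025 * n_sim)]
    upper = fractions[min(int(0.975 * n_sim), n_sim - 1)]
    return mean, lower, upper


# Toy data (not from the paper): 200 sequences, 10 of which were never sequenced.
toy_reads = [seq_id for seq_id in range(190) for _ in range(50)]
mean, lower, upper = simulate_missing(toy_reads, n_sequences=200)
print(f"missing fraction: {mean:.4f} (95% interval {lower:.4f} to {upper:.4f})")
```

Fixing the seed makes the sketch reproducible; with real data, the spread across the 100 simulations reflects how much the count of missing sequences depends on which reads happen to be drawn.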
Fig. 3 Examining file recovery, sequencing coverage, and sequence loss behavior.
a The limit of successful decoding with no bit errors is shown with the gray bar in each graph. Data points below the gray bar are samples where the file was successfully decoded and recovered with no bit errors. Decoding is done after sequencing and involves clustering sequences, finding consensus, then correcting errors[10]. A more detailed view of these data, including exact copy number and sequencing coverage, is in Supplementary Note 7. b For the small file, when each sample diluted in water was sequenced at greater depth, the proportion of missing sequences improved only minimally. The 100× coverage data were found by sub-sampling the data used to create the 200× coverage data. c Missing sequences are compared between the initial pool prior to any dilutions, where the mean copy number was 194 (in red), and the last dilution, where the mean copy number was <1 (in blue). Note the different total number of sequences in the small (2042), medium (26,404), and large (271,447) files. The fact that some sequences are missing only from the initial pool but “reappear” in the final dilution suggests that the lost sequences are a result of stochastic variation that occurs during sub-sampling for file recovery, rather than being irretrievably lost due to some property of the sequence. This pattern of sequences reappearing in subsequent dilutions is shown for every dilution step in Supplementary Note 6. Note that for a and b, x-axis and y-axis error bars represent 95% confidence intervals; x-axis errors are taken from triplicate qPCR data (see Methods), and y-axis errors are the result of 100 simulations of the original sequencing data sub-sampled to 20× sequencing coverage (see Methods).
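The comparison in panel c amounts to a set operation on sequence IDs. The sketch below is a hypothetical illustration, not the authors' analysis code: it assumes each sample has been reduced to the set of sequence IDs recovered from it, and the toy IDs are invented for the example.

```python
def compare_missing(all_ids, recovered_initial, recovered_final):
    """Split a file's sequences by where they are missing:
    only in the initial pool, only in the final dilution, or in both."""
    all_ids = set(all_ids)
    missing_initial = all_ids - set(recovered_initial)
    missing_final = all_ids - set(recovered_final)
    return {
        # Missing at first but recovered later: these "reappear".
        "missing_only_initial": missing_initial - missing_final,
        "missing_only_final": missing_final - missing_initial,
        # Missing in both samples: candidates for systematic dropout.
        "missing_in_both": missing_initial & missing_final,
    }


# Toy example (invented IDs): a file with 10 sequences.
result = compare_missing(
    all_ids=range(10),
    recovered_initial={0, 1, 2, 3, 4, 5, 6, 7},   # 8 and 9 missing
    recovered_final={0, 1, 2, 3, 4, 5, 8},        # 6, 7 and 9 missing
)
print(result)
```

Sequences in `missing_only_initial` are the ones that reappear in the final dilution, which argues against those losses being tied to a property of the sequence itself; a large `missing_in_both` set would point the other way.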