Lifu Song, Feng Geng, Zi-Yi Gong, Xin Chen, Jijun Tang, Chunye Gong, Libang Zhou, Rui Xia, Ming-Zhe Han, Jing-Yi Xu, Bing-Zhi Li, Ying-Jin Yuan.
Abstract
DNA data storage is a rapidly developing technology with great potential due to its high density, long-term durability, and low maintenance cost. The major technical challenges include errors such as strand breaks, rearrangements, and indels, which frequently arise during DNA synthesis, amplification, sequencing, and preservation. In this study, a de novo strand assembly algorithm (DBGPS) is developed using a de Bruijn graph and greedy path search to meet these challenges. DBGPS shows substantial advantages in handling DNA breaks, rearrangements, and indels. The robustness of DBGPS is demonstrated by accelerated aging, multiple independent data retrievals, deep error-prone PCR, and large-scale simulations. Remarkably, 6.8 MB of data is accurately recovered from a severely corrupted sample that has been treated at 70 °C for 70 days. With DBGPS, we are able to achieve a logical density of 1.30 bits/cycle and a physical density of 295 PB/g.
Year: 2022 PMID: 36097016 PMCID: PMC9468002 DOI: 10.1038/s41467-022-33046-w
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1 De novo assembly-based strand reconstruction for DNA data storage.
(a) The issues of DNA breaks and rearrangements in DNA data storage and the proposed de novo assembly-based strategy for dealing with them. (b) The two-stage de novo assembly process of the proposed de Bruijn graph-based greedy path search algorithm (DBGPS). The representative de Bruijn graph in stage 1 was constructed from the nine error-rich sequence copies shown in Supplementary Fig. 1a with a k-mer size of four. The circles stand for the k-mer nodes. The numbers inside the circles are the occurrences, i.e., coverages, of the corresponding k-mers. The correct sequence is represented by the path of green nodes. (c) The designed strand structure for the DBGPS algorithm. (d) The workflow of the greedy path search and selection process.
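The idea in panel (b) can be illustrated with a minimal Python sketch: count k-mer occurrences across noisy copies, then extend a path greedily through the best-covered k-mers. This is not the authors' implementation. The function names, the `start` k-mer, the target `length`, and the `min_cov` cutoff are all hypothetical, and the strand structure of panel (c) and the path selection step of panel (d) are not reproduced.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads; the counts act as node coverages."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_path(counts, start, length, min_cov=2):
    """Extend `start` one base at a time via the best-covered successor k-mer.

    Assumes k >= 2. Stops early if no successor reaches `min_cov`
    (a hypothetical cutoff used here to ignore low-coverage, likely
    erroneous k-mers).
    """
    k = len(start)
    seq = start
    while len(seq) < length:
        suffix = seq[-(k - 1):]
        # Coverage of each of the four possible successor k-mers.
        cov, base = max((counts.get(suffix + b, 0), b) for b in "ACGT")
        if cov < min_cov:
            break
        seq += base
    return seq


true_strand = "ATGCCGTAAGCTTGAC"
reads = [true_strand] * 4 + [
    "ATGCCGTCAGCTTGAC",   # one substitution
    "ATGCCGTAAGTTGAC",    # one deletion
    "ATGCCGTAA",          # fragment from a strand break
    "GCTTGAC",            # the other fragment
]
counts = count_kmers(reads, k=4)
recovered = greedy_path(counts, start="ATGC", length=len(true_strand))
print(recovered == true_strand)
```

Erroneous k-mers appear only in the few corrupted copies, so their coverage stays low and the greedy walk follows the well-covered path. Fragments from broken strands still contribute correct k-mers, which is why a graph-based approach tolerates breaks.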
Fig. 2 Error-handling capabilities of DBGPS in comparison with MA and large-scale performance simulations.
With twenty sequence copies, the performance of DBGPS and MA in handling various rates of (a) DNA breaks, (b) DNA rearrangements, (c) indels, (d) substitutions, and (e) mixed errors. The mixed errors comprise DNA breaks, DNA rearrangements, substitutions, insertions, and deletions in a ratio of 1:1:2:1:1. (f) Strand reconstruction rates with various strand copies containing 3% error mixtures of substitutions (1.5%), insertions (0.75%), and deletions (0.75%). (g) Strand reconstruction time by DBGPS with data scales ranging from 1 MB to 1 GB. The small bar chart at the top shows the fold changes in reconstruction time per strand compared to that of the 1 MB scale. (h) Illustration of the differences between the DBG constructed with numerous short sequences and that constructed with long sequence(s). The nodes with different colors stand for various k-mers. Each node stands for a unique k-mer. Data are presented as mean values of three independent simulations in panels a–g. The standard deviation (SD) values, which are too small to be clearly visualized, are listed in the source data. Source data are provided as a Source Data file.
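A minimal sketch of how the 3% error mixture in panel (f) could be simulated is given below. It assumes an independent per-base error model, which the caption does not specify. The function name `add_errors` and its parameters are hypothetical and are not taken from the authors' simulation code.

```python
import random


def add_errors(strand, p_sub=0.015, p_ins=0.0075, p_del=0.0075, rng=None):
    """Return a noisy copy of `strand` under an assumed per-base error model.

    Each base independently suffers at most one event: deletion,
    substitution, or insertion of a random base after it.
    """
    rng = rng or random.Random()
    out = []
    for base in strand:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this base
        if r < p_del + p_sub:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < p_del + p_sub + p_ins:
            # insertion: keep the base and add a random base after it
            out.append(base)
            out.append(rng.choice("ACGT"))
        else:
            out.append(base)
    return "".join(out)


# Twenty noisy copies of one strand, with a fixed seed for repeatability.
rng = random.Random(0)
copies = [add_errors("ATGCCGTAAGCTTGAC" * 10, rng=rng) for _ in range(20)]
print(len(copies))
```

DNA breaks and rearrangements, which appear in the mixed errors of panel (e), would need separate steps that cut strands and rejoin fragments. They are left out to keep the sketch short.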
Fig. 3 Experimental verification of the robustness of DBGPS.
(a) Illustration of the three harsh experiments (1, 2, 3) performed to verify the robustness of DBGPS. A 6.8 MB zipped file of Dunhuang mural pictures was recorded by oligo synthesis, generating an ssDNA “Master Pool” with 210,000 unique types of ssDNA strands. Experiment 1: accelerated aging to verify the robustness against DNA degradation (breaks). Experiment 2: multiple data retrievals with intended unspecific amplifications to verify the robustness against strand rearrangements. Experiment 3: deep error-prone PCR to introduce errors. (b) Data retrieval details of the three experiments. S_max stands for maximal strand recovery rate. S stands for strand recovery rate. The green curve shows the average fragment lengths of the accelerated aging samples. The purple curve shows the error rates of the deep error-prone PCR samples. The strand reconstruction details of the three experiments are provided in Supplementary Tables 3, 4, and 5, respectively. The Dunhuang mural pictures were obtained from Dunhuang Academy (http://www.dha.ac.cn/) with permission for this study. All rights reserved for other uses.
Key achievements of this work in comparison with prior DNA storage studies
| | Church et al. | Goldman et al. | Grass et al. | Erlich et al. | Organick et al. | Leon et al. | Antkowiak et al. | This work |
|---|---|---|---|---|---|---|---|---|
| Data size (MB) | 0.53 | 0.74 | 0.08 | 2.14 | 200.2 | 6.42 | 0.1 | 6.8 |
| Total oligos | 54,898 | 153,335 | 4991 | 72,000 | 13,400,000 | 217,000 | 16,383 | 210,000 |
| Logical density (bits/cycle) | 0.6 | 0.19 | 0.86 | 1.19 | 0.81 | 1.52 | 0.94 | 1.30 |
| Physical density (PB/g) | NA | NA | NA | 215 | NA | 5.9 | NA | 295 |
| Long-term stability (9.4 °C) | – | – | ~2000 years | – | – | – | – | ~20,000 years |
The long-term stability was estimated based on the results shown in Fig. 3 and the study by Grass et al.[21]. Primers were considered in the calculation of logical density (bits per synthesis cycle). Strand reconstruction details of the high-density storage study in this work are provided in Supplementary Table 6.
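As a rough illustration of the logical density metric (bits per synthesis cycle, primers included), the sketch below divides the payload bits by the total number of nucleotides synthesized. The strand length is not stated in this section, so the 200 nt used here is an assumption chosen for illustration only, as is the reading of 1 MB as 10^6 bytes. The function name is hypothetical.

```python
def logical_density(data_bytes, n_oligos, oligo_len_nt):
    """Bits stored per synthesis cycle.

    Total payload bits divided by total nucleotides synthesized, where
    `oligo_len_nt` is the full strand length including primers.
    """
    return data_bytes * 8 / (n_oligos * oligo_len_nt)


# 6.8 MB and 210,000 oligos come from the table; 200 nt is an assumed length.
density = logical_density(6.8e6, 210_000, 200)
print(f"{density:.2f} bits/cycle")
```

Under these assumptions the result is about 1.30 bits/cycle, which matches the tabulated value. The figure depends directly on the assumed strand length and byte convention, so it should be read as a plausibility check only.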