Yeongjae Choi1, Taehoon Ryu1,2, Amos C Lee3, Hansol Choi1, Hansaem Lee1, Jaejun Park1,2, Suk-Heung Song4, Seojoo Kim4, Hyeli Kim4, Wook Park5, Sunghoon Kwon6,7,8,9.
Abstract
DNA-based data storage has emerged as a promising method to satisfy the exponentially increasing demand for information storage. However, practical implementation of DNA-based data storage remains a challenge because of the high cost of data writing through DNA synthesis. Here, we propose the use of degenerate bases as encoding characters in addition to A, C, G, and T, which increases the amount of data that can be stored per length of DNA sequence designed (information capacity) and lowers the amount of DNA synthesis required per unit of data stored. Using the proposed method, we experimentally achieved an information capacity of 3.37 bits/character. The demonstrated information capacity is more than twice the highest information capacity previously achieved. The proposed method can be integrated with synthetic technologies in the future to reduce the cost of DNA-based data storage by 50%.
Year: 2019 PMID: 31036920 PMCID: PMC6488701 DOI: 10.1038/s41598-019-43105-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. DNA-based data storage with the addition of degenerate bases enables increased information capacity. (A) Binary data is encoded into DNA sequences comprising not only the 4 traditional encoding characters A, C, G, and T but also 11 additional degenerate bases. The encoded DNA is shorter than with the four-character encoding method. (B) The theoretical information capacity limit is therefore increased from 2 bits/character to 3.9 bits/character. The dots in the graph show the information capacity values reported in previous research, and the numbers indicate the corresponding references. (C) A degenerate base, represented by a single encoding character, describes a mixed pool of two or more types of nucleotides. (D) Degenerate bases can be generated by mixing the DNA phosphoramidites during synthesis.
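The capacity limits quoted in panel (B) follow from the size of the encoding alphabet: a position that can hold one of n equally likely characters carries at most log2(n) bits. The short sketch below reproduces that arithmetic. The function name is hypothetical and chosen for illustration only.

```python
import math

def capacity_limit(alphabet_size: int) -> float:
    """Upper bound on information capacity in bits per encoding character."""
    return math.log2(alphabet_size)

# 4 standard bases versus 4 standard bases plus 11 degenerate bases
for label, size in [("A, C, G, T", 4), ("with 11 degenerate bases", 15)]:
    print(f"{label}: {capacity_limit(size):.2f} bits/character")
```

With 15 characters the bound is log2(15), which rounds to the 3.9 bits/character stated in the caption.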
Figure 2. Structure and decoding result of the DNA-based data storage platform. We achieved the highest information capacity and physical density of DNA-based data storage. (A) Design structure of DNA fragments. (B) DNA fragments can be analyzed using NGS. After classification by address, degenerate bases can be decoded by examining the distribution of characters at the same position (yellow bar). (C) Degenerate bases can be determined from the scatter plot of the ratio of bases at the same position. (D) The error rate of determined DNA bases at a given average coverage over all fragments. The standard deviations (s.d.) were obtained by repeating the random sampling 10 times. The error bars represent the s.d. (E) Summary of the experimental results. The information capacity is calculated as the input information in bits divided by the number of encoding characters (excluding those of the adapter sites). We compared the results of our work with those of Erlich and Zielinski[10], who previously reported the highest information capacity and physical density using pooled oligo synthesis and high-throughput sequencing data. The physical density is the ratio of the number of bytes encoded to the weight of the DNA library used to decode the information.
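The decoding idea in panels (B) and (C), reading a degenerate base from the distribution of bases observed at one position, can be sketched as follows. This is a minimal illustration, not the authors' classifier: it assumes reads are already grouped by address and aligned to equal length, that each degenerate base is an equal-ratio mix, and that the character whose ideal ratio is nearest (in squared distance) to the observed ratio is the right call. The names `IUPAC`, `call_character`, and `call_fragment` are hypothetical.

```python
from collections import Counter

# Standard IUPAC nucleotide codes: 4 bases plus 11 degenerate bases
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def call_character(column: str) -> str:
    """Call one encoding character from the bases read at a single position."""
    counts = Counter(column)
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("no A/C/G/T reads at this position")
    observed = {b: counts[b] / total for b in "ACGT"}

    def distance(code: str) -> float:
        # Assumed ideal: every member of the mix appears in equal proportion
        mix = IUPAC[code]
        return sum((observed[b] - (1 / len(mix) if b in mix else 0.0)) ** 2
                   for b in "ACGT")

    return min(IUPAC, key=distance)

def call_fragment(reads: list[str]) -> str:
    """Call every position of a fragment from aligned reads sharing one address."""
    return "".join(call_character("".join(col)) for col in zip(*reads))

print(call_fragment(["AC", "TC", "AC", "TC"]))  # first position is an A/T mix
```

A real pipeline would also have to handle sequencing errors, indels, and unequal mixing ratios, none of which this sketch models.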
Figure 3. Error rate and cost analysis for DNA-based data storage. (A) The error rate per base pair as a function of the average NGS coverage over all fragments. The black line shows the experimental results, and the other three lines represent the Monte Carlo simulation results. For the experiment and the simulation shown in green, we used A, C, G, T, W, and S for encoding. For the simulation shown in blue, we used A, C, G, T and all other degenerate bases. For the simulation shown in red, we used A, C, G, T, [R, Y, M, K, S, W, with bases mixed at ratios of 3:7 and 7:3], H, V, D and N. The standard deviations of the experimental results were obtained by repeating the random sampling 5 times. The error bars represent the s.d. (B) The proposed platform is estimated to reduce the cost of DNA-based data storage by 50%. For the calculation, we assumed the cost of inkjet-based oligonucleotide pool synthesis reported by Erlich and Zielinski[10]. The cost of DNA sequencing was reported by K. Wetterstrand[22]. We used A, C, G, T and all eleven degenerate bases as encoding characters. Additionally, we used A, C, G, T, [R, Y, M, K, S, W, with bases mixed at ratios of 3:7 and 7:3], H, V, D and N as 21 encoding characters. The numbers indicate the corresponding references. Details on the estimation method are described in the Supplementary Note.
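To give a feel for why the error rate in panel (A) falls as coverage rises, the following toy Monte Carlo simulation uses the six-character alphabet A, C, G, T, W, and S. It is a sketch under strong assumptions and not the authors' simulation: reads are drawn from an ideal 1:1 mix, sequencing and synthesis errors are ignored, and the character is called by a nearest-ideal-ratio rule. All names (`MIXES`, `call`, `simulated_error_rate`) are hypothetical.

```python
import random

# Six encoding characters: four bases plus W (A/T mix) and S (C/G mix)
MIXES = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "S": "CG"}

def call(reads: list[str]) -> str:
    """Pick the character whose ideal base ratio is nearest to the observed one."""
    freq = {b: reads.count(b) / len(reads) for b in "ACGT"}

    def distance(code: str) -> float:
        mix = MIXES[code]
        return sum((freq[b] - (1 / len(mix) if b in mix else 0.0)) ** 2
                   for b in "ACGT")

    return min(MIXES, key=distance)

def simulated_error_rate(coverage: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of positions called incorrectly at a fixed read coverage."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    errors = 0
    for _ in range(trials):
        truth = rng.choice(list(MIXES))
        # Each read samples one nucleotide from the true mix (no read errors modeled)
        reads = [rng.choice(MIXES[truth]) for _ in range(coverage)]
        errors += (call(reads) != truth)
    return errors / trials

for coverage in (1, 5, 20, 50):
    print(coverage, simulated_error_rate(coverage))
```

At a coverage of 1 a degenerate position can never be distinguished from a pure base, so the simulated error rate is high. It drops as more reads sample the mix.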