Aldrin Kay-Yuen Yim1, Allen Chi-Shing Yu2, Jing-Woei Li2, Ada In-Chun Wong3, Jacky F C Loo3, King Ming Chan3, S K Kong3, Kevin Y Yip4, Ting-Fung Chan1.
Abstract
The volume of digital data is ever increasing and is expected to grow to 40,000 EB by 2020, yet the estimated global information storage capacity in 2011 was <300 EB, indicating that most of the data are transient. DNA, as a very stable nano-molecule, is an ideal massive storage device for long-term data archiving. The two most notable illustrations are from Church et al. and Goldman et al., whose approaches are well optimized for most sequencing platforms: short synthesized DNA fragments without homopolymers. Here, we suggest improvements to the error-handling methodology that could enable the integration of DNA-based computational processes, e.g., algorithms based on self-assembly of DNA. As a proof of concept, a picture of size 438 bytes was encoded into DNA with a low-density parity-check (LDPC) error-correction code. We salvaged a significant portion of the sequencing reads carrying mutations generated during DNA synthesis and sequencing, and successfully reconstructed the entire picture. A modular programming framework, DNAcodec, with an eXtensible Markup Language (XML)-based data format was also introduced. Our experiments demonstrated the practicability of long DNA message recovery with high error tolerance, which opens the field to biocomputing and synthetic biology.
Keywords: DNA-based computational process; DNA-based information storage; biocomputing; error-tolerating module; synthetic biology
Year: 2014 PMID: 25414846 PMCID: PMC4222239 DOI: 10.3389/fbioe.2014.00049
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Overall experimental design. As a proof of concept, the logo of our university in BMP format was first compressed with the Lempel-Ziv-Markov chain algorithm (LZMA). The compressed binary file was then split into six non-overlapping fragments, each containing a different section of the file. Each fragment was embedded with a low-density parity-check (LDPC) error-correction code, and a header containing the address information was added in front of the fragment. Each fragment, together with its header, was first converted into the quaternary numeral system and then encoded into DNA bases (termed a DNA information block) by simple base conversion (A: 0, T: 1, C: 2, G: 3). The six DNA information blocks were then synthesized, each with a total size of 720 bp. To decode the information, the pool of synthesized DNA was subjected to high-throughput sequencing on the Illumina GA IIx platform. After quality trimming of the raw reads, de novo assembly was performed. The six DNA information blocks could be recovered using an overlay-consensus approach, and the DNA bases were then converted back to the quaternary numeral system and further into binary format. The LDPC code was then used to correct the random mutations generated during DNA synthesis and sequencing. The six fragments were concatenated according to the address information and decompressed using LZMA. The image could be recovered completely.
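The conversion steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the DNAcodec implementation. Only the base mapping (A: 0, T: 1, C: 2, G: 3), the LZMA compression, and the split into six addressed fragments come from the caption. The digit order within a byte, the one-byte address header, and all function names are assumptions made for illustration, and the LDPC step is omitted entirely.

```python
import lzma

BASES = "ATCG"  # index = quaternary digit: A=0, T=1, C=2, G=3 (mapping from the caption)

def bytes_to_dna(data):
    """Map each byte to four bases, most significant 2-bit digit first (assumed order)."""
    return "".join(BASES[(b >> shift) & 0b11] for b in data for shift in (6, 4, 2, 0))

def dna_to_bytes(seq):
    """Inverse of bytes_to_dna: read four bases per byte."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)

def split_with_address(payload, n):
    """Split into n non-overlapping fragments, each prefixed by a
    hypothetical one-byte address header (the fragment index)."""
    size = -(-len(payload) // n)  # ceiling division
    return [bytes([i]) + payload[i * size:(i + 1) * size] for i in range(n)]

def join_by_address(fragments):
    """Order fragments by their address byte and strip the headers."""
    return b"".join(f[1:] for f in sorted(fragments, key=lambda f: f[0]))

def encode(data, n=6):
    """Compress, split, and convert each fragment to a DNA string."""
    return [bytes_to_dna(f) for f in split_with_address(lzma.compress(data), n)]

def decode(blocks):
    """Convert DNA strings back to bytes, reassemble, and decompress."""
    return lzma.decompress(join_by_address([dna_to_bytes(b) for b in blocks]))
```

Because each fragment carries its own address, `decode` accepts the blocks in any order, which mirrors how a sequenced pool has no inherent ordering.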
Figure 2. Using a binary symmetric channel as the error model, error rates from 0 to 20% were tested in increments of 0.1%. At each step, random errors simulating substitutions were introduced at the corresponding error rate, and each step was repeated 20 times. The DNA-based LDPC model could withstand an error rate of up to 4%.
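The simulation protocol in the Figure 2 caption can be sketched as a small Python harness. The sweep range, step size, and trial count follow the caption. The repetition code is only a stand-in so that the sketch runs end to end: it is not the LDPC code used in the study, and it will not reproduce the 4% threshold. All function names and the fixed seed are hypothetical.

```python
import random

def bsc(bits, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def repeat3_encode(bits):
    """Stand-in code (NOT LDPC): repeat every bit three times."""
    return [b for b in bits for _ in range(3)]

def repeat3_decode(bits):
    """Majority vote over each group of three received bits."""
    return [int(sum(bits[i:i + 3]) >= 2) for i in range(0, len(bits), 3)]

def sweep(message, encode, decode, max_rate=0.20, step=0.001, trials=20, seed=0):
    """Return (error_rate, fraction of trials decoded correctly) for each step."""
    rng = random.Random(seed)
    codeword = encode(message)
    results = []
    for k in range(round(max_rate / step) + 1):
        p = k * step
        recovered = sum(decode(bsc(codeword, p, rng)) == message for _ in range(trials))
        results.append((p, recovered / trials))
    return results

if __name__ == "__main__":
    msg_rng = random.Random(1)
    message = [msg_rng.randint(0, 1) for _ in range(64)]
    for p, rate in sweep(message, repeat3_encode, repeat3_decode)[::20]:
        print(f"error rate {p:.3f}: {rate:.0%} of trials recovered")
```

Passing `encode` and `decode` as arguments keeps the channel model and sweep separate from the code under test, so an actual LDPC encoder and decoder could be substituted without changing the harness.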