
Synthetic DNA: the next generation of big data storage.

Aisling O'Driscoll, Roy D Sleator.

Abstract

With world wide data predicted to exceed 40 trillion gigabytes by 2020, big data storage is a very real and escalating problem. Herein, we discuss the utility of synthetic DNA as a robust and eco-friendly archival data storage solution of the future.

Year:  2013        PMID: 23514938      PMCID: PMC3669150          DOI: 10.4161/bioe.24296

Source DB:  PubMed          Journal:  Bioengineered        ISSN: 2165-5979            Impact factor:   3.269


“Our remedies oft in ourselves do lie” (Shakespeare, All's Well That Ends Well, I, i, 231–232)
We are now in the Century of Biology and, in this new era, the petabyte (PB) is the currency. According to the International Data Corporation (IDC), worldwide data, approximated at 0.8 zettabytes (ZB; 1 ZB is a trillion GB) in 2009, is estimated to increase to 40 ZB by 2020. In light of this, solutions such as cloud computing have been proposed as the savior of storage, with the cloud storage market alone projected to pass $46 billion by 2018. However, to quote Einstein, “We can’t solve problems by using the same kind of thinking we used when we created them.” Therefore, the key to our data storage problems may lie not in thinking bigger but rather in thinking smaller.
According to papers published recently in Science and Nature by researchers at Harvard and EMBL-EBI respectively, DNA, the original information storage molecule comprising the biological script of life, may hold the solution to our future data storage problems. DNA is a high-capacity storage medium, with a theoretical storage potential of 455 exabytes (EB) per gram of single-stranded DNA (ssDNA). As a consequence, all of the world’s projected 40 ZB of data could be stored in ~90 g of DNA. In addition to this, molecular biology now provides us with the tools to cut (restriction endonucleases), paste (DNA ligase) and copy (PCR) DNA as we might the text of a word document. Furthermore, DNA is an extremely stable molecule, with a remarkably long life-span even in suboptimal environments, making it an ideal archival material. Indeed, more than 80% of the woolly mammoth (Mammuthus primigenius) genome, comprising 3.3 billion nt, remains readable despite the fact that this species disappeared from the planet at the end of the Pleistocene (10,000 y ago).
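As a rough check of the capacity figures quoted above, the arithmetic can be sketched in a few lines of Python. The decimal unit definitions (1 EB = 10^18 bytes, 1 ZB = 10^21 bytes) are an assumption made here, and the variable names are purely illustrative.

```python
# Back-of-envelope check of the storage-density figures quoted in the text.
# Assumption: decimal (SI) units, i.e. 1 EB = 10**18 bytes and 1 ZB = 10**21 bytes.
EB = 10**18
ZB = 10**21

density_bytes_per_gram = 455 * EB   # theoretical ssDNA density (from the text)
projected_data_bytes = 40 * ZB      # IDC projection for 2020 (from the text)

# Mass of DNA needed to hold the projected data at the theoretical density.
grams_needed = projected_data_bytes / density_bytes_per_gram
print(f"DNA needed for 40 ZB: {grams_needed:.1f} g")
```

Under these assumptions the result lands just below the ~90 g quoted in the text, which suggests the published figure is a rounded value.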
Nuclear genome sequencing of extinct species such as the mammoth reveals population differences not evident from the fossil record and has even led to the discovery of genetic factors that may have affected species extinction.
Some of the first attempts to use DNA as a workable canvas for archival purposes include Joe Davis’ Microvenus, a 35-bit coded visual icon representing the external female genitalia and, by coincidence, an ancient Germanic rune representing the female Earth. More recently, construction of JCVI-syn1.0, the first bacterial cell to contain a completely synthetic genome, employed “watermarks” to distinguish the synthetic genome from native DNA. These 7,920-bit watermarks contain strings of bases that, in code, spell out a web address, the names of the paper’s authors and quotations ascribed to Joyce, Oppenheimer and Richard Feynman.
Although successful on a small scale, a significant limitation to the large-scale practical application of DNA-based information storage is the difficulty of synthesizing long stretches of DNA de novo. Church and colleagues at Harvard were the first to attempt to overcome these difficulties using next-generation DNA synthesis and sequencing technologies. Rather than a single long stretch of DNA (representing the complete data string), the team opted to work with shorter, overlapping fragments which together contain all the necessary information, yet individually are easier to manipulate in vitro. Furthermore, in order to move beyond the limited encoding of uppercase text which was the basis of previous approaches, the Harvard team chose to encode an entire book (Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, ISBN-13: 978-0465021758), including 53,426 words, 11 JPG images (at 10:1 data compression) and one JavaScript program.
The team began by converting the text to HTML format using the Universal Character Set Transformation Format, 8-bit (UTF-8), backward compatible with ASCII and Unicode for special fonts and character sets. The HTML-coded draft was then converted into a 5.27-megabit bitstream, and the resulting bit sequence was converted to DNA code using a 1-bit-per-base encoding (A,C = 0; T,G = 1), disallowing homopolymer runs greater than three while balancing GC content. The 5.27-megabit bitstream was encoded in 54,898 oligos, each 159 nt in length and consisting of a 96-bit data block (96 nt), a 19-bit address (19 nt) specifying the data block location and flanking 22 nt common sequences to facilitate amplification and sequencing. Following limited-cycle PCR to amplify the library, the sequences were read using an Illumina HiSeq next-generation sequencer. With ~3,000-fold nucleotide coverage, all data blocks were recovered with a total of 10 bit errors out of 5.27 million (most of the errors being located within homopolymer runs and at the sequence ends with only single sequence coverage).
In an effort to improve upon Church’s work, Goldman et al. recently described a modified strategy, which seeks to significantly reduce error and as a result facilitate up-scaling of DNA-based data storage. Achieving a storage density of ~2.2 PB/g DNA (equivalent to ~468,000 DVDs), the Goldman et al. approach first converts the original file type to binary code (0, 1) which is then converted to a ternary code (0, 1, 2), which is in turn converted to the triplet DNA code. Replacing each trit with one of the three nucleotides different from the preceding one (i.e., A, T or C, if the preceding one is G) ensures that no homopolymers are generated, significantly reducing high-throughput sequencing errors.
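The two encodings described above can be compared with a minimal Python sketch. This is an illustration under stated assumptions, not the published implementations: the random choice between the two allowed bases (which only balances GC content on average), the alphabetical trit-to-base ordering, and the notional starting base `start="A"` are all hypothetical choices made here, as are the function names.

```python
import random

def church_encode(bits, max_run=3, seed=0):
    """1 bit per base (0 -> A or C, 1 -> T or G), never exceeding max_run identical bases."""
    rng = random.Random(seed)
    choices = {"0": "AC", "1": "TG"}
    out = []
    for b in bits:
        options = list(choices[b])
        # If the last max_run bases are identical, exclude that base for this position.
        if len(out) >= max_run and len(set(out[-max_run:])) == 1:
            options = [n for n in options if n != out[-1]]
        out.append(rng.choice(options))
    return "".join(out)

def church_decode(dna):
    """Invert the 1-bit-per-base mapping."""
    return "".join("0" if n in "AC" else "1" for n in dna)

def goldman_encode(trits, start="A"):
    """Each trit (0, 1, 2) selects one of the three bases that differ from the previous base."""
    prev, out = start, []
    for t in trits:
        options = [n for n in "ACGT" if n != prev]  # assumed ordering: alphabetical
        prev = options[t]
        out.append(prev)
    return "".join(out)

def goldman_decode(dna, start="A"):
    """Recover the trits by replaying the same previous-base rule."""
    prev, trits = start, []
    for n in dna:
        options = [m for m in "ACGT" if m != prev]
        trits.append(options.index(n))
        prev = n
    return trits

print(church_encode("0000000011111111"))
print(goldman_encode([0, 1, 2, 2, 0, 1]))
```

The design difference is visible in the code: the 1-bit-per-base scheme has to check for runs explicitly, whereas the trit scheme excludes the previous base by construction, so no two adjacent bases can ever match.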
A further error-limiting strategy involved the generation of overlapping segments (100 nt long data blocks with 75 nt overlap; alternate segments being converted to their reverse complement), creating 4-fold redundancy. Given that a majority of the errors associated with the Church method can be ascribed to lack of coverage and/or homopolymers (runs of ≥2 identical nt), the increased redundancy and lack of homopolymers of the Goldman et al. strategy mean that it is significantly less error-prone than its predecessor.
As proof of concept, the authors targeted four different file types (five files in total, amounting to 739 kilobytes of hard-disk storage):
ASCII: a text file of a compression algorithm (Huffman code) and all 154 of Shakespeare’s sonnets.
PDF: the classic 1953 Watson and Crick DNA structure paper.
JPEG: a color photograph of the authors’ host institution, the European Bioinformatics Institute.
MP3: a 26 sec excerpt from Dr Martin Luther King’s “I have a dream” speech.
In line with the approach taken by Church and colleagues, all five files were represented by short stretches of DNA, specifically 153,335 strings, each comprising 117 nt (incorporating both data and address blocks to facilitate file determination and localization within the overall data stream). The oligos were synthesized using Agilent’s oligo library synthesis process (creating ~1.2 × 10^7 copies of each DNA string), before being read using an Illumina HiSeq sequencer. Four of the five files were fully decoded without intervention (the fifth contained two 25 nt gaps which were easily closed following manual inspection), resulting in overall file reconstruction at 100% accuracy. Based on a fixed string length (data and indexing) of 117 nt, Goldman et al. suggest that DNA-based storage currently remains feasible even at scales several orders of magnitude greater than current global data volumes (measured on the ZB scale, 10^21 bytes).
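The overlapping-segment scheme described above (100 nt windows shifted by 25 nt, with alternate segments reverse-complemented) can be sketched as follows. This is a simplified illustration: the address and index information carried by the real 117 nt strings is omitted, the function names are hypothetical, and in this sketch the first and last 75 nt of the input receive less than 4-fold coverage.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def overlapping_segments(dna, length=100, step=25):
    """Cut dna into windows of `length` nt starting every `step` nt.

    Every second window is stored as its reverse complement.
    """
    segments = []
    for i, start in enumerate(range(0, len(dna) - length + 1, step)):
        seg = dna[start:start + length]
        segments.append(reverse_complement(seg) if i % 2 else seg)
    return segments

def coverage(dna_len, length=100, step=25):
    """Count how many windows cover each position of the original sequence."""
    cov = [0] * dna_len
    for start in range(0, dna_len - length + 1, step):
        for pos in range(start, start + length):
            cov[pos] += 1
    return cov

dna = "ACGT" * 100                      # 400 nt of placeholder data
segments = overlapping_segments(dna)
cov = coverage(len(dna))
print(len(segments), "segments; interior coverage:", set(cov[75:325]))
```

Because each window overlaps its neighbor by 75 nt, any interior position is present in four different strings, so losing one string to a synthesis or sequencing failure still leaves three copies of that region.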
This scalability, combined with the likely expectation of significantly longer string synthesis as the technology progresses, virtually future-proofs DNA as a viable storage medium. Despite this, cost still remains an important limiting factor. Current costs, estimated to be on the order of €12,400/MB of storage, are impractical for all but century-scale archives with limited access requirements. However, if a similar exponential correlation between storage space and cost is experienced, as was the case over the past 40 y [a 1 GB (1,000 MB) hard drive costing ca. $1,000,000 in 1980 is now available for less than 10 cents] and given the decline in DNA synthesis and sequencing costs (dropping at a rate of 5- and 12-fold per annum respectively compared with a 1.6-fold reduction in electronic media storage per year), it is likely that in less than a decade, DNA-based storage will be the medium of choice for archives with a horizon of ≥50 y.
The cost of maintenance and storage must also be considered; DNA-based data storage, which requires negligible maintenance, presents a significant advantage in this context compared with the current gold standard of archival magnetic tape, which requires maintenance and regular data transfers. Indeed, assuming that tape archives have to be read and rewritten every 5–10 y, current DNA-based storage is cost-effective over a ~600–5,000-y horizon.
In a serendipitous coincidence, the Goldman et al. study follows a controversial year-long analysis and exposé on the unbridled energy consumption of data centers operated by companies such as Google, eBay and Facebook, published recently by the New York Times in an article entitled, “The Cloud Factories: Power, Pollution and the Internet.” In contrast, DNA-mediated storage provides an eco-friendly archival data storage solution that raises the question of whether future data storage solutions lie in cloud-accessible bio-banks rather than energy-hungry data centers.
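The hard-drive example above can be used to sanity-check the quoted 1.6-fold annual reduction in electronic storage cost. The sketch below assumes the comparison runs from 1980 to 2013 (the publication year) and takes the "less than 10 cents" figure as exactly $0.10; both are simplifications made here.

```python
# Hard-drive example from the text: 1 GB cost ca. $1,000,000 in 1980.
# Assumptions: "now" is 2013 and "less than 10 cents" is taken as $0.10.
cost_1980 = 1_000_000.0
cost_2013 = 0.10
years = 2013 - 1980

fold_total = cost_1980 / cost_2013          # overall cost reduction factor
fold_per_year = fold_total ** (1 / years)   # equivalent constant annual factor
print(f"overall reduction: {fold_total:.0e}-fold over {years} y")
print(f"average annual reduction: {fold_per_year:.2f}-fold")
```

Under these assumptions the average annual factor comes out close to the 1.6-fold figure given for electronic media, well below the 5- and 12-fold annual declines quoted for DNA synthesis and sequencing.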
However, DNA-based storage is itself not without limitations, including the lack of random access reads, as DNA sequencers read information sequentially; the “write-once” nature of DNA; and its latency, making it practical only for archival solutions. Indeed, a significant challenge facing long-term DNA-based storage is the ability to decode the data in the distant future. Egyptian hieroglyphics for example, widely believed to be the most ancient form of writing, dating back to ~3300 BC, were decoded only as a result of the Rosetta stone, inscribed with the equivalent Greek text. Without this ancient translation tool we would not be able to interpret the characters and symbols which constitute this ancient language. Therefore, without an equivalent molecular Rosetta stone, long-term archival data are likely to be completely unintelligible 5,000 years from now (the time-frame for which current DNA-based data storage is cost-effective).
However, aside from this, which is after all a limitation inherent to all long-term archival strategies, many of the other more pressing concerns are, even now, beginning to be addressed. Random access, for example, might be facilitated if sequence fragments between barcodes are PCR amplified with a file allocation tube used as a file to barcode index mechanism. The challenge of rewritable DNA storage could be circumvented by utilizing PCR amplification to create multiple redundant backups. Furthermore, researchers at Stanford recently detailed a method for rewritable DNA which uses bacteriophage enzymes called recombinases to flip a particular DNA segment back and forth to represent a binary 0 or 1. Although the work is still in the early stages of development, the Stanford team is currently scaling up to a byte and reducing the latency involved (currently one hour for 1 bit of memory).
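The barcode-based random access idea mentioned above can be pictured with a small software analogy. This sketch is only a stand-in for selective PCR amplification: the file names, barcode sequences, payloads and function names are all invented for illustration and do not come from any published scheme.

```python
# Hypothetical file-to-barcode index, playing the role of the lookup described in the text.
# All file names and barcode sequences below are invented for illustration.
file_index = {
    "sonnets.txt": "ACGTACGT",
    "speech.mp3": "TGCATGCA",
}

def tag(barcode, payload):
    """Prefix a data payload with the barcode of the file it belongs to."""
    return barcode + payload

def select_file(pool, index, name):
    """Return the payloads of all oligos carrying the barcode assigned to `name`.

    This mimics selectively amplifying only the fragments that match one barcode.
    """
    barcode = index[name]
    return [oligo[len(barcode):] for oligo in pool if oligo.startswith(barcode)]

# A mixed pool of oligos belonging to two files.
pool = [
    tag(file_index["sonnets.txt"], "GATTACA"),
    tag(file_index["speech.mp3"], "CCATGGA"),
    tag(file_index["sonnets.txt"], "TTGACCA"),
]

print(select_file(pool, file_index, "sonnets.txt"))
```

The point of the analogy is that only the index and the barcodes need to be known in advance; the rest of the pool never has to be read to retrieve one file.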
Therefore, despite the economic impracticality of DNA storage in 2013, this surprisingly simple idea has the potential to reshape the global face of data storage in the not too distant future (Fig. 1). Move over, Moore’s law—make way for life’s law…

Figure 1. DNA-based data storage—the big data storage solution of the future?

References (10 of 17 listed)

1.  Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid.

Authors:  J D WATSON; F H CRICK
Journal:  Nature       Date:  1953-04-25       Impact factor: 49.962

Review 2.  Genome engineering.

Authors:  Peter A Carr; George M Church
Journal:  Nat Biotechnol       Date:  2009-12       Impact factor: 54.908

3.  Creation of a bacterial cell controlled by a chemically synthesized genome.

Authors:  Daniel G Gibson; John I Glass; Carole Lartigue; Vladimir N Noskov; Ray-Yuan Chuang; Mikkel A Algire; Gwynedd A Benders; Michael G Montague; Li Ma; Monzia M Moodie; Chuck Merryman; Sanjay Vashee; Radha Krishnakumar; Nacyra Assad-Garcia; Cynthia Andrews-Pfannkoch; Evgeniya A Denisova; Lei Young; Zhi-Qing Qi; Thomas H Segall-Shapiro; Christopher H Calvey; Prashanth P Parmar; Clyde A Hutchison; Hamilton O Smith; J Craig Venter
Journal:  Science       Date:  2010-05-20       Impact factor: 47.728

4.  Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction.

Authors:  K B Mullis; F A Faloona
Journal:  Methods Enzymol       Date:  1987       Impact factor: 1.600

5.  Enzymatic joining of DNA strands: a novel reaction of diphosphopyridine nucleotide.

Authors:  S B Zimmerman; J W Little; C K Oshinsky; M Gellert
Journal:  Proc Natl Acad Sci U S A       Date:  1967-06       Impact factor: 11.205

6.  A restriction enzyme from Hemophilus influenzae. II.

Authors:  T J Kelly; H O Smith
Journal:  J Mol Biol       Date:  1970-07-28       Impact factor: 5.469

7.  A restriction enzyme from Hemophilus influenzae. I. Purification and general properties.

Authors:  H O Smith; K W Wilcox
Journal:  J Mol Biol       Date:  1970-07-28       Impact factor: 5.469

8.  Towards practical, high-capacity, low-maintenance information storage in synthesized DNA.

Authors:  Nick Goldman; Paul Bertone; Siyuan Chen; Christophe Dessimoz; Emily M LeProust; Botond Sipos; Ewan Birney
Journal:  Nature       Date:  2013-01-23       Impact factor: 49.962

9.  Ancient biomolecules from deep ice cores reveal a forested southern Greenland.

Authors:  Eske Willerslev; Enrico Cappellini; Wouter Boomsma; Rasmus Nielsen; Martin B Hebsgaard; Tina B Brand; Michael Hofreiter; Michael Bunce; Hendrik N Poinar; Dorthe Dahl-Jensen; Sigfus Johnsen; Jørgen Peder Steffensen; Ole Bennike; Jean-Luc Schwenninger; Roger Nathan; Simon Armitage; Cees-Jan de Hoog; Vasily Alfimov; Marcus Christl; Juerg Beer; Raimund Muscheler; Joel Barker; Martin Sharp; Kirsty E H Penkman; James Haile; Pierre Taberlet; M Thomas P Gilbert; Antonella Casoli; Elisa Campani; Matthew J Collins
Journal:  Science       Date:  2007-07-06       Impact factor: 47.728

10.  Sequencing the nuclear genome of the extinct woolly mammoth.

Authors:  Webb Miller; Daniela I Drautz; Aakrosh Ratan; Barbara Pusey; Ji Qi; Arthur M Lesk; Lynn P Tomsho; Michael D Packard; Fangqing Zhao; Andrei Sher; Alexei Tikhonov; Brian Raney; Nick Patterson; Kerstin Lindblad-Toh; Eric S Lander; James R Knight; Gerard P Irzyk; Karin M Fredrikson; Timothy T Harkins; Sharon Sheridan; Tom Pringle; Stephan C Schuster
Journal:  Nature       Date:  2008-11-20       Impact factor: 49.962
