Penghao Wang, Ziniu Mu, Lijun Sun, Shuqing Si, Bin Wang.
Abstract
DNA is a natural storage medium that offers higher storage density and a longer service life than traditional media, and DNA storage can meet current requirements for storing massive amounts of data. Owing to the limitations of DNA storage technology, data must be converted into short DNA sequences for storage. However, indexing these short sequences generates a large amount of physical redundancy. To reduce this redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using an improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and a 10.1 MB file is then encoded with hidden addressing. First, the Dottup dot plot generator and the Jaccard similarity coefficient are used to analyze the overall self-similarity of the encoding sequence indices; then, the GC content of sequence fragments is used to verify the performance of the scheme. The results show that the indices produced by the encoding scheme have lower overall self-similarity and that the local thermodynamic properties of the sequences are better. The proposed hidden addressing encoding scheme can not only improve the utilization of bases but also ensure the accuracy of DNA storage during sequencing and decoding.
Keywords: DNA encoding; DNA storage; encoding sequence local performance; hidden addressing; index overall self-similarity; random access
Year: 2022 PMID: 35928958 PMCID: PMC9344065 DOI: 10.3389/fbioe.2022.916615
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
FIGURE 1. Overall schematic of DNA storage.
FIGURE 2. Comparison of the required sequence lengths with and without the hidden addressing scheme.
FIGURE 3. Schematic illustration of an example DNA storage encoding scheme for hidden addressing.
FIGURE 4. Self-similarity comparison between encoding indices. (A) Overall self-similarity between indices that replace part of the data. (B) Overall self-similarity between seeds used for splicing sequences in fountain coding experiments.
Jaccard similarity coefficient of two encoding schemes for different K-mer lengths.
| K-mers | 6-mer | 7-mer | 8-mer | 9-mer | 10-mer | 11-mer | 12-mer |
|---|---|---|---|---|---|---|---|
| Our work | 1,335.0 | 319.0 | 77.1 | 19.2 | 4.7 | 1.2 | 0.3 |
| | 1,435.5 | 480.7 | 223.1 | 133.3 | 85.3 | 53.5 | 22.2 |
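
For orientation, the Jaccard similarity coefficient between two sequences can be computed over their sets of k-mers. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `kmers` and `jaccard` are hypothetical, and the sketch returns the standard coefficient in the range 0 to 1, whereas the scaling or aggregation behind the values in the table is not stated here.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k):
    """Standard Jaccard coefficient |A & B| / |A | B| over k-mer sets.

    Returns 0.0 when neither sequence is long enough to contain a k-mer.
    """
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Two made-up index sequences, for illustration only.
    index_a = "ACGTACGTGA"
    index_b = "ACGTTCGTGA"
    for k in (3, 4, 6):
        print(k, round(jaccard(index_a, index_b, k), 3))
```

Longer k-mers are less likely to be shared by chance, which is consistent with the values in the table falling as the K-mer length increases.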
FIGURE 5. Dot plot of the overall self-similarity of the encoded payload sequence.
FIGURE 6. Statistical comparison of partial local GC base content in the two DNA storage systems. (A) This scheme, with local GC content control. (B) Fountain coding scheme without local GC content control.
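
Local GC content of the kind compared in Figure 6 can be measured with a sliding window. The following is a minimal sketch rather than the procedure used in the study: the function names, the window size, and the 0.4 to 0.6 bounds are all assumptions chosen for illustration.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def local_gc(seq, window, step=1):
    """GC fraction of every full window of length `window` along seq."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def within_bounds(seq, window, low=0.4, high=0.6):
    """True if every window's GC fraction lies in [low, high].

    The bounds are assumed values, not taken from the study. A sequence
    shorter than the window has no full windows and passes trivially.
    """
    return all(low <= gc <= high for gc in local_gc(seq, window))


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(local_gc("ATGCGCAT", window=4))
    print(within_bounds("ACGTACGT", window=4))
    print(within_bounds("GGGGAAAA", window=4))
```

Checking every window rather than the whole sequence matters because a sequence can have balanced overall GC content while still containing GC-rich or GC-poor stretches.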
Comparison of existing encoding schemes.
| Method | NID | RA | DS | EC |
|---|---|---|---|---|
| | 0.83 | No | 650 KB | Yes |
| | 0.33 | No | 739 KB | Yes |
| | 1.14 | No | 83 KB | Yes |
| | 0.88 | Yes | 15 KB | Yes |
| | 0.92 | No | 22 MB | No |
| | 1.57 | No | 2.14 MB | Yes |
| | 1.31 | Yes | 260 MB | No |
| Our work | 1.48 | Yes | 10.1 MB/40 KB | Yes |