Literature DB >> 23984392

HyDEn: a hybrid steganocryptographic approach for data encryption using randomized error-correcting DNA codes.

Dan Tulpan¹, Chaouki Regoui, Guillaume Durand, Luc Belliveau, Serge Léger.

Abstract

This paper presents a novel hybrid DNA encryption (HyDEn) approach that uses randomized assignments of unique error-correcting DNA Hamming code words for single characters in the extended ASCII set. HyDEn relies on custom-built quaternary codes and a private key used in the randomized assignment of code words and the cyclic permutations applied on the encoded message. Along with its ability to detect and correct errors, HyDEn equals or outperforms existing cryptographic methods and represents a promising in silico DNA steganographic approach.

Entities: Chemical Disease Species

Mesh：

Substances：
Codon
DNA

Year: 2013 PMID： 23984392 PMCID： PMC3745945 DOI： 10.1155/2013/634832

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

The deluge of counterfeited goods flooding the world markets today generates a high demand for novel cryptographic and steganographic approaches that will better protect information and branded products and ensure their authenticity. Positioned at the confluence of mathematics, biology, informatics, chemistry, and physics, cryptography and steganography represent the ultimate means for information protection.

1.1. Cryptography

Cryptography is generally defined as the practice and study of techniques for secure communication performed over unsecured channels. There are two major operations involved in secure communication, namely, the encryption and decryption of a message. The purpose of encryption is to modify the information, such that only an authorized party is capable of decoding it. Both, encryption and decryption, require a key, which is needed by the authorized parties, and it is assumed to be kept secret. To date, only one encryption approach was mathematically proven to be secure and virtually unbreakable: the one-time pad [1]. Nevertheless, its practicality is hampered by the necessity of a random key, which must be at least as long as the message itself. For all other cryptographic approaches, there is a theoretical possibility of breaking them, although the time required to do so might be very long, thus making the approaches fairly secure. Examples of such cryptographic approaches include the data encryption standard (DES) [2], the advanced encryption standard (AES) [3], the Rivest-Shamir-Adleman (RSA) method [4], and the Pretty Good Privacy (PGP) [5] method.

1.2. Steganography

Steganography is the science of concealing information within different types of media, such that only the sender and the receiver are aware of its exact location. Unlike cryptography, where only the message is protected, steganography protects both the message and the communicating parties. With origins deeply rooted in ancient Greece, where messages were recorded as texts or tattoos and then hidden on wax tablets and skins, steganography was used relentlessly over the centuries under various ingenious forms such as invisible inks [6], postal stamps [7], knitted clothes [8], microdots [9], modified images [10], executable files [11], and DNA sequences embedded in various materials [12, 13].

1.3. Error-Correcting DNA Codes

Error-correcting codes consist of sets of symbols defined over a finite alphabet, such that if any code word is altered in t positions we can detect and correct the error based on knowledge of the remaining code words. For example, assume a given binary code W consisting of two code words w 1 = 000 and w 2 = 111 each of length 3. A 1-bit error occurring in any of the two code words (e.g., w 2) will produce a modified code word; let us say w 2′ = 101. By comparing the modified code word w 2′ with both code words from W, we notice that it differs in only one bit from w 2 (middle bit), while it differs in two bits compared with w 1 (flanking bits). Thus, we can quickly identify the exact location of the error and correct it based on w 2′s closest proximity to code word w 2.

1.3.1. Hamming Codes

One instance of simple and efficient error-correcting codes are Hamming codes [14], where each pair of code words differs in at least d bits. We denote by A 4(n, d) the size of a quaternary code where all pairs of code words of length n differ in at least d positions. The number of bits/positions in which two code words differ is also known as the Hamming distance. For certain combinations of n and d, the exact size of quaternary codes are unknown and thus lower and upper bounds were derived to provide approximations. The text by MacWilliams and Sloane [15] provides a succinct introduction to the topic. While Hamming codes were originally designed using a {0,1} alphabet with the purpose of sending binary information over noisy channels, the increased need for storing and retrieving information with synthetic DNA strands used as chemical bar codes, or as biological tags for DNA computing applications, facilitated the advent of Hamming codes defined over quaternary alphabets, such as the DNA alphabet {A, C, T, G}.

1.3.2. DNA Codes

A single-stranded DNA molecule is a long, unbranched polymer composed of only four types of subunits linked together by chemical bonds and attached to a sugar-phosphate chain like four kinds of beads strung on a necklace. These subunits are the deoxyribonucleotides containing the bases: adenine (A), cytosine (C), guanine (G), and thymine (T). Conceptually equivalent to a digital signal, DNA sequences are naturally and synthetically used for information encoding in living organisms and biotechnological and steganographic applications. Given the data encoding capacity of DNA and the fact that traditional data encoding techniques using binary sequences are fortified against communication errors, quaternary codes using the DNA alphabet {A, C, T, G} were proposed and continuously developed over the past decades. The design of error-correcting DNA codes of fixed length n that satisfy various combinations of constraints such as having a minimum pairwise Hamming distance (d min⁡) is a hard computational problem, whose complexity is still unknown today. Over the past two decades, a large number of publications have proposed intricate code design techniques [16-18] based on their state-of-the-art algorithms such as stochastic local search, genetic algorithms, and pure mathematical constructions. Most of these approaches lead also to the continuous improvement of upper and lower bounds for DNA codes [19-21]. Assuming that a DNA code C with k code words of length n is given and that each pair of distinct code words w and w obeys the condition that, for all pairs (w , w ) with i, j ∈ N, i ≠ j, then C can detect ⌊d/2⌋ errors and can correct ⌊(d − 1)/2⌋ errors.

1.4. Related Work

Over the past decade, complex algorithms have been devised to encode information using DNA sequences. Examples of such algorithms include the DNA triplet-based approach described by Clelland et al. [9], which extends the principle of using microdots to hide information developed during the Second World War. An extension of Clelland et al.'s work was presented by Leier et al. [22], and it consisted of encoding zeros and ones using short DNA sequences with sticky ends, which can bind together forming longer sequences. The encrypted messages include a mixture of coding and noncoding DNA sequences, and the decryption can be performed only by someone who has access to the correct primer sequences. A primer is a short DNA sequence that serves as a starting point for DNA synthesis. A similar approach based on DNA tiling was proposed by Hirabayashi et al. [23] who designed true random one-time pads using a DNA cryptosystem. The true randomness is conferred by molecular computations using hybridization of DNA sequences encoding 4 types of cipher tiles. Gehani et al. [24] extended the one-time pad approach to perform operations on DNA sequence pairs, representing plain and cipher texts. Originally, the one-time pad approach was designed to perform XOR operations on binary codes. The message encoded with DNA pairs can be retrieved and decoded using specific DNA polymerases. Arita and Ohashi [25] developed a steganographic algorithm based on the redundant codon table (see Table 1). A codon consists of 3 consecutive nucleotides, and while it is possible to have 64 (43) different codons, only 20 of them encode distinct amino acids, with the rest being redundant. Their algorithm encoded each letter in the English alphabet using binary codes of length 5, with each bit being encoded by a codon. They added an additional parity bit to each letter encoding to keep the number of bits in each bit-pattern odd and thus used for error-detection purposes. The decoding could be achieved only by someone who knows the original codon sequence.

Table 1

The redundant DNA codon table.

Amino acid	DNA codons
Alanine	GCT	GCC	GCA	GCG
Arginine	CGT	CGC	CGA	CGG	AGA	AGG
Asparagine	AAT	AAC
Aspartic acid	GAT	GAC
Cysteine	TGT	TGC
Glutamic acid	GAA	GAG
Glutamine	CAA	CAG
Glycine	GGT	GGC	GGA	GGG
Histidine	CAT	CAC
Isoleucine	ATT	ATC	ATA
Leucine	CTT	CTC	CTA	CTG	TTA	TTG
Lysine	AAA	AAG
Methionine	ATG
Phenylalanine	TTT	TTC
Proline	CCT	CCC	CCA	CCG
Serine	TCT	TCC	TCA	TCG	AGC	AGT
Threonine	ACT	ACC	ACA	ACG
Tryptophan	TGG
Tyrosine	TAT	TAC
Valine	GTT	GTC	GTA	GTG
Start (CI)	ATG
Stop (CT)	TAA	TAG	TGA

Following a different approach, Wong et al. [27] developed a DNA steganography method that encodes information in living organisms. The information is encoded with the aid of unique DNA sequences that do not exist in the particular genomes where they will be embedded, thus assuring the success of the identification stage. For this approach to succeed, the embedded foreign DNA must be replicated by the host organism together with their genomic DNA. The extraction of the information is achieved using a standard laboratory technique called the polymerase chain reaction (PCR) [28]. The DNA-Crypt approach proposed by Heider and Barnekow [29] combines and extends the steganographic and cryptographic methodologies proposed by Wong et al. [27] and Arita and Ohashi [25]. DNA-Crypt encodes information using a substitution cipher and two types of error-correcting codes, namely, Hamming [14] and WDH [30]. DNA-Crypt incorporates a fuzzy controller and powerful cryptographic algorithms such as one-time pad, AES, Blowfish [31], and RSA. Shiu et al. [32] introduced 3 data hiding methods based on properties of DNA sequences, namely, the insertion method, the complementary pair method, and the substitution method. All three methods provide distinct means to incorporate secret messages within existing DNA sequences pulled from public databases. The known DNA sequence acts as a private key, and it can be identified only by the sender and the receiver. A hybrid approach built on the substitution method described in Shiu et al. [32] that combines cryptography and DNA steganography was proposed by Torkaman et al. [33]. Their approach uses reference DNA sequences from the European Bioinformatics Institute (EBI) Database, which contains roughly 163 million entries. The encoding of information is achieved using 6 association rules. Here, we present the hybrid DNA encryption (HyDEn) approach, which combines the advantages conferred by cryptography and steganography into a unique symmetric cryptosystem. The system uses a unique private numeric key to scramble the assignment of DNA code words from a predesigned set to the extended ASCII characters and then apply a cyclic permutation on the encrypted message. The combination of key uniqueness, the randomization of code word assignments, the undisclosed code word length, and the final cyclic permutation of the encrypted message confer additional strength to the proposed approach. The information encrypted with HyDEn can be safely communicated between senders and receivers via dedicated and inconspicuous publicly accessible channels, such as bioinformatics discussion groups and DNA sequence databases.

2. HyDEn: The Hybrid DNA Encryption Approach

Deeply rooted in the ways nature encodes information using nucleic acids, DNA stegano-cryptography uses short DNA sequences to encrypt and hide messages, thus protecting their content. The hybrid DNA encryption (HyDEn) approach presented here includes a novel in silico cryptosystem that uses DNA error-correcting Hamming codes and disguises encrypted messages as long DNA sequences conveniently placed on host bioinformatics resources. Following next is a stepwise description of the HyDEn cryptosystem. Input. The message is defined over an alphabet Ω, private key pk. Encryption Algorithm

Step

Select an error-correcting DNA code with |Ω | n-ary code words obtained with one of the state-of-the-art code design techniques described in Aboluion et al. [16], Gaborit and King [19], Tulpan and Hoos [26], and Tulpan et al. [18]. Here, n represents the number of characters in a DNA code word. An example of a DNA code with n = 8 and d = 3 is given in Table 2.

Table 2

A sample DNA A 4(8,3) Hamming code consisting of 256 code words. Each code word can be associated with an extended ASCII character and used for encoding text messages. The code was obtained with the DNA word design algorithm described in Tulpan and Hoos [26].

A set with 256 code words
AAAAAAGA	ACTACACT	ATGGAGTT	CCCTTCGA	CTGGTAGT	GGAAAGGT	GTTGTATT	TCGTGTTA
AAAAGAAG	ACTACCTA	ATGGGAAG	CCGATTTC	CTGGTTCG	GGATGACA	TAACATAC	TCTCCGAG
AAAATGTT	ACTCTCAG	ATGTAAGT	CCGCGCAT	CTTCGGTG	GGCCAAGT	TAACCATA	TCTCCTTA
AAACCTGC	ACTGGAGT	ATTCATAC	CCGGCGCG	CTTGACAT	GGCCGACG	TAACGAGG	TCTGCGCA
AAACTCAC	ACTTCCGC	ATTCTGCG	CCGTAGCC	CTTGCATG	GGCCTGGA	TAAGAGCA	TCTGGCTC
AAAGATCG	ACTTGCAT	ATTTAATC	CCGTTCAG	CTTTCCAC	GGCGTGCC	TAAGTTGA	TCTGTTAC
AAATGTGG	ACTTTGGG	ATTTCAGA	CCTACCGG	GAATCATC	GGCTGCAT	TAATAGGC	TGAAAATA
AAATTGAG	AGACCCTA	CAAATACG	CCTTCTGT	GACAGCGT	GGGCATAC	TAATGGAA	TGACTCAT
AACAGCTG	AGACTTAA	CAAATCTA	CCTTGTCG	GACCAGCT	GGGCTTGG	TAATTACT	TGAGCATC
AACCTAGC	AGAGCGGT	CAATATGA	CCTTTGAC	GACCGTTA	GGGGCCCA	TACGCAAA	TGAGGGTT
AACGCGTT	AGAGTAAT	CAATTCGC	CGAACGCT	GACGGTAT	GGGGGTTC	TACTTGGG	TGATATAT
AACGGTGA	AGATCTTG	CACCTAAT	CGACCTTT	GAGAATTA	GGTAATGG	TAGACTGA	TGATTCGG
AACTACGT	AGATGGCT	CACTCGAA	CGAGAAAC	GAGAGAGC	GGTACGTA	TAGAGTAC	TGCATAAG
AACTCATA	AGCCAGCA	CAGACAGG	CGAGCGTA	GAGAGTCG	GGTATGCG	TAGGAGTG	TGGGGCGC
AAGAAACT	AGCTCGGG	CAGCAACG	CGAGCTCG	GAGTTGTT	GGTTTAGT	TAGTAACC	TGGTTTTT
AAGATAAC	AGGACTGT	CAGCCGGC	CGAGTCTT	GATACCCC	GGTTTCCC	TAGTCCGG	TGTCAGAT
AAGCACGC	AGGATGAG	CAGGTCGA	CGATGTAC	GATATTGC	GTAACGCG	TATAAATG	TGTGCAAT
AAGGTTGT	AGGCCCAT	CAGTGATC	CGCCACGA	GATCATAT	GTACTACG	TATATGGT	TGTGTTGG
AATAGTCT	AGGTACTT	CATCGAGC	CGCCTCCC	GATCCCAG	GTAGATCA	TATGTGAA	TTAAGCCG
AATCGTTC	AGGTAGGC	CATCTTTG	CGGAAGTA	GATGACTA	GTAGTCGT	TCAAACGC	TTAATTTA
AATGCGGG	AGGTGTCC	CATGCTTA	CGGTAACA	GATTGTTG	GTCATATG	TCAAAGTG	TTAGCTGT
AATGTGCT	AGTCGAAG	CATGGGGA	CGGTGTTG	GATTTACG	GTCCGAAT	TCAAGAAC	TTAGTCCA
AATTGGTT	AGTCGGGA	CCACCGCC	CGTCACAC	GCAGGTCG	GTCCTTAA	TCACAAGA	TTCAAGAC
ACACTAGT	AGTGCCGA	CCAGATGC	CGTTAGCT	GCATTCTT	GTCGCAAG	TCAGTGCC	TTCCGCAC
ACACTTCC	ATATGCCC	CCAGTATC	CTAACTCC	GCATTTCA	GTCTCCAA	TCATCTTC	TTCGAATA
ACAGCTTA	ATCACAAA	CCAGTGGA	CTAGACGG	GCCGAATT	GTGGAGAA	TCCGAGGC	TTGCGTTC
ACATCGAA	ATCACCGG	CCATGACC	CTAGAGCC	GCCGCGGT	GTGGCCAT	TCCGCCGA	TTGGGGTA
ACCGGATC	ATCCCTGA	CCATGCAA	CTATTACA	GCGAATGT	GTGTCGGT	TCCTGAAG	TTGTCTTG
ACCTCAAC	ATCGTAGG	CCCCTACG	CTATTGTT	GCGACATT	GTTATCAC	TCGATGCG	TTTACAGC
ACGCATTT	ATCTCTTC	CCCGGAGA	CTCCCAGT	GCGGGTAA	GTTCACTG	TCGGAACA	TTTCCACG
ACGCTATG	ATCTTCAC	CCCGGGAG	CTCCGGCC	GCTGAGTG	GTTCCAAC	TCGTAGAG	TTTCGTAG
ACGTCGTC	ATGACGTG	CCCTAGTT	CTCGCGGC	GCTGTCCG	GTTGCTCT	TCGTCCAT	TTTGTGTG

Using the key pk provided as input, perform a random shuffling of the n-ary DNA code words that will be associated to each character from Ω. Encrypt the message using the random assignment of DNA code words obtained in Step 2. Perform a circular rotation (mod⁡|Ω|) to the right of the characters in the message with exactly pk positions. Output. The encrypted message m. Step 1 provides the means of encoding a message using a code defined over a quaternary alphabet. The code will be able to identify and correct errors that can occur during the message transmission stage. Step 2 will generate a unique code word assignment based on the key pk. If all pk keys are unique, then the assignment will be equivalent to a one-time pad system. In the eventuality that code word length (n) is found, Step 4 is used to lower the chances of a successful frequency analysis based on well-established tests such as the Friedman test [34] and the Kasiski test [35]. The message decryption step will use the same unique key to perform the reverse circular permutation on the encrypted message and find the correct code words assignment, which will reveal the original message. The flowcharts for message encryption and decryption with HyDEn are summarized in Figure 1.

Figure 1

Flowcharts for message encryption and decryption with HyDEn.

3. Example of Message Encryption and Decryption Using HyDEn

To better understand how the HyDEn approach works, let us assume that Alice would like to transmit the message “ATTACK AT DAWN” to Bob. They have established before hand to use the secret key “5”. The message uses only 8 distinct ASCII characters, namely, “space,” “A,” “C”, “D,” “K,” “N,” “T” and “W.” Based on the unique key used by Alice and Bob, and applying Steps 1 and 2 of our approach, a unique assignment of DNA code words of length 8 is associated to each of the 8 characters, as shown in Table 3.

Table 3

A sample assignment of code words to ASCII characters.

DNA code word	ASCII character
AAAAAAGA	→space
ACTACACT	→A
ATGGAGTT	→C
CCCTTCGA	→D
CTGGTAGT	→K
GGAAAGGT	→N
GTTGTATT	→T
TCGTGTTA	→W

Using this assignment, the encrypted message resulting after Step 3 is the following: ACTACACTGTTGTATTGTTGTATTACTACACT ATGGAGTTCTGGTAGTAAAAAAGAACTACACT GTTGTATTAAAAAAGACCCTTCGAACTACACT TCGTGTTAGGAAAGGT To better visualize the encryption process, every second code word was bold faced. The encrypted message is then permuted cyclically five positions to the right, thus obtaining the following sequence of DNA bases: AAGGTACTACACTGTTGTATTGTTGTATT ACTACACTATGGAGTTCTGGTAGTAAAAAAGA ACTACACTGTTGTATTAAAAAAGACCCTTCGA ACTACACTTCGTGTTAGGA Ideally, the key (mod 256) must be different from a multiple of the code word length (n); otherwise, the permutation will shift the encrypted message exactly n letters to the right (or to the left) and will not have the desired effect.

4. Comparison Parameters

To facilitate the comparison between our approach and related encryption methodologies, we use a combination of performance parameters including the ones introduced by Shiu et al. [32], namely, capacity, payload, bpn, and the cracking probability or the probability of a successful brute-force attack P . The capacity (C) is defined as the total length of a reference sequence that encodes or includes the encrypted message. The payload (P) is the remaining length of the new sequence after subtracting the reference DNA sequence. The bpn represents the number of hidden bits per character. The previous parameters utilize the following notations: n is the length of a DNA sequence, m is the message that will be encrypted, and |m| is its length.

5. Results and Discussion

We analyze the robustness of HyDEn by estimating the probability of success for a brute-force attack, and we provide a comparative assessment between our cryptosystem and other cryptographic techniques with performance characteristics described in the literature. The comparison relies on a set of parameters introduced in Section 4. We further investigate HyDEn's strengths and weaknesses, and we provide insights into potential improvements that will augment its performance.

5.1. Robustness

Calculations of the strength of encryption against brute-force attacks are typically the worst case scenarios thus, the probability of success for a brute-force attack against the proposed cryptosystem (HyDEn) is captured where n is the length of a DNA code word and |Ω| is the number of characters in alphabet Ω. Assuming that Ω is the extended ASCII character set, then |Ω| = 256 and (2) becomes Using the Stirling approximation [36] for factorials, ln⁡(k!) ≈ k · ln⁡(k) − k, for all k ∈ ℝ, and DNA code word length n = 8, we obtain The first term in (2) comes from the fact that n is unknown to the attacker; thus, a successful attacker must first guess the length of the used code words, which would be 8 in the sample A 4(8,4) DNA code from Table 2. The second term of the equation describes the probability of finding the correct code assignment for the extended ASCII character set. We also assume that the attacker already knows what character set is encoded by the DNA code. The last term of the equation is given by the probability of finding the correct cyclic permutation applied to the encrypted message. Without knowing the correct permutation, the attempt of identifying the correct code word assignment is prone to failure.

5.2. Comparison with Other DNA Cryptographic Strategies

Using the parameter estimations described in Section 4, we compare HyDEn with other encryption approaches described in Shiu et al. [32]. Table 4 presents comparative results between HyDEn and other cryptographic methods. The methods are compared based on their capacity (C), payload (P), the number of hidden bits per character (bpn), and the probability of success for a brute-force attack (P ).

Table 4

Comparison between HyDEn and other encryption methods. n is the length of a DNA sequence, |m| is the length of the original message, |Ω| is the size of the DNA code, and k is a method-specific parameter that represents the length of the longest complementary pairs in the reference DNA sequence.

Method	C	P
Hy DE n	n	0
Insertion [32]	n+\|m\|n	n2
Complementary pair [32]	n + \|m \| ·(k + 3.5)	\|m \| ·(k + 3.5)
Substitution [32, 33]	n	0

Method	bp n	P _bf

Hy DE n	\|m\|n	1n·1\|Ω\|!·1\|Ω\| (e.g., 1211·e1163.6)
Insertion [32]	\|m\|n+\|m\|/2	11.63·108·1n-1·12\|m\|-1·12n-1·124
Complementary pair [32]	\|m\|n+\|m\|·(k+3.5)	11.63·108·1242
Substitution [32, 33]	\|m\|n	11.63·108·16 or 3ⁿ

Based on the probability of success for a brute-force attack (P ), HyDEn and the insertion method are the most secure, while the substitution method seems to be the least secure. Nevertheless, the best capacity (C), payload (P), and bpn correspond to HyDEn and the Substitution method, while the insertion method ranks second and the complementary pair third. The result expressed in (4) can be also directly compared with the result reported by Torkaman et al. [33] on page 233 in their paper. Their result states that the probability of recovering via a brute-force technique an original message hidden within a sequence database with other 163 million sequences is equal to (1/(1.63 × 108)) × (1/6). Using simple numerical inequality manipulations, we show that our technique confers higher protection against brute-force attacks compared with the method proposed by Torkaman et al.: Thus, P (HyDEn) ≪ P (substitution: Torkaman et al. [33]).

5.3. HyDEn's Strengths, Weaknesses, and Potential Extensions

Compared with the existing DNA-based cryptographic and steganographic methods, HyDEn has one of the lowest probabilities of success for brute-force attacks. HyDEn includes mechanisms such as cyclic permutations and randomized assignments of code words to protect against various types of frequency analysis such as the Kasiski and Friedman tests along with error detection and correction capabilities conferred by DNA Hamming codes. One of the drawbacks of using many-to-one character encoding schemes is the increase in size of the encrypted message, which could become a burden for the communication media and which also poses also a challenge for hiding strategies of large messages. The steganographic approach including message distribution and the selection of inconspicuous dissemination venues must be carefully analyzed. For example, large encrypted messages encoded as long in silico DNA sequences can be better hidden in databases for DNA coding sequences, DNA contigs or mRNA sequences, while relatively short messages would be better hidden as DNA and RNA primer sequences or as microarray probes. One potential weakness of the current approach could stem from peculiarities of the language in which the original message was written, assuming that the attacker has already guessed it. For example, if English is the language, then an analysis based on occurrences of double letters such as double Ls in a fairly limited number of words could be used to find partial (code word, character) associations. A potential extension inspired from the Belasso Ciphers [37], which were later wrongfully attributed to Vigenère [38], that will add confusion and increased security to HyDEn is to encode each character with multiple code words selected uniformly at random, without breaking the error detection and correction capabilities of the DNA code. Table 5 presents an A 4(8,3) code with 1024 DNA sequences of length 8 and minimum pairwise Hamming distance 3, which could be used as a replacement of the code from Table 2. Each extended ASCII character could be encoded using one out of 4 different code words, each selected with equal probability. Lower (2048) and upper (2340) bounds published by Bogdanova et al. [39] and hosted on Dr. Andries Brower's website [40] suggest that even larger A 4(8,3) DNA codes can be generated.

Table 5

A sample DNA A 4(8,3) Hamming code consisting of 1024 code words. Four distinct code words can be associated with one extended ASCII character and used for encoding text messages. The code was obtained with the DNA word design algorithm described in Tulpan and Hoos [26].

A set with 1024 code words
AAAAAAAG	AAAAAGGA	AAAACTCC	AAAAGCAC	AAACAATA	AAACAGCT	AAACCGTC	AAACGAGG
AAACGCCA	AAACGTAT	AAAGATGG	AAAGTCTT	AAAGTGGC	AAATATAA	AAATCTTT	AAATGGCG
AAATTGAT	AACACAAA	AACAGACC	AACAGCTA	AACCAGGG	AACCATCA	AACCCTAC	AACCTCGA
AACCTGTT	AACGACCG	AACGCATC	AACGGGGA	AACGTTAA	AACTCGTG	AACTGTGC	AAGAAGAC
AAGACTAG	AAGAGGTG	AAGATACA	AAGATGGT	AAGATTTC	AAGCGTGA	AAGCTCAT	AAGCTGCC
AAGGACGC	AAGGCCTG	AAGGCGCA	AAGGGATA	AAGGTAAG	AAGTAATT	AAGTGCAG	AATAATTA
AATACACG	AATACCTT	AATCAACC	AATCCGGA	AATCGCTC	AATGGAGC	AATGGCAA	AATGGTTG
AATGTGCT	AATTACTG	AATTAGCA	AATTCAAC	AATTGGGT	AATTTAGG	AATTTTCC	ACAAAGTC
ACAACCCG	ACAAGAGT	ACAATTTT	ACACATTG	ACACCAAT	ACACCTCA	ACACTACG	ACACTGTA
ACAGCTGC	ACAGGGCC	ACAGGTAG	ACATAACT	ACATACGA	ACATCATG	ACATTCAC	ACCAACCA
ACCACTGA	ACCAGTTG	ACCATGAC	ACCCAATT	ACCCCCGG	ACCCGAAG	ACCCGGCT	ACCGACAC
ACCGAGTA	ACCGCGAT	ACCGTAGC	ACCTATCG	ACCTGACA	ACCTGCTC	ACCTTTAT	ACGAATGG
ACGACGCC	ACGAGGGA	ACGATCAG	ACGCAACA	ACGCCCAA	ACGCCGTT	ACGCGTAC	ACGCTGGG
ACGGCCGT	ACGGTTCT	ACGTAGAA	ACGTTAGT	ACTAAATG	ACTACGAA	ACTAGCCC	ACTATCTA
ACTATTGC	ACTCATGT	ACTGCACC	ACTGCTTT	ACTGGTCA	ACTGTCAT	ACTTATTC	ACTTCCAG
ACTTCGCT	ACTTGTGG	ACTTTAAA	ACTTTGTG	AGAACCAT	AGAATCTC	AGAATGCG	AGACAAAC
AGACGCGT	AGACGTTA	AGACTCAA	AGAGAGCA	AGAGCACT	AGAGCCGA	AGAGTATA	AGAGTTCC
AGATCGAC	AGATGAGC	AGCAACGT	AGCAATAA	AGCACTTC	AGCAGGCA	AGCATAAT	AGCCAGAT
AGCCCACA	AGCCGTGG	AGCGCCCC	AGCGGGAC	AGCGTCAG	AGCTAAGA	AGCTATTT	AGCTCAAG
AGCTGCAA	AGCTTGGT	AGGAACTG	AGGACATT	AGGAGTGC	AGGATGTA	AGGCAAGG	AGGCATCC
AGGCCTTG	AGGGACAT	AGGGCAAA	AGGGCGGC	AGGGGACC	AGGGGGTT	AGGTAGTC	AGGTCGCG
AGGTCTGT	AGGTGTCA	AGGTTCCC	AGGTTTAG	AGTAACAC	AGTAGGTC	AGTATCCT	AGTCAGTG
AGTCCAGT	AGTCCGCC	AGTGATCG	AGTGCTAC	AGTGGAAT	AGTGGCGG	AGTGTGGA	AGTTCCTA
AGTTGACG	AGTTTATC	ATAAACTT	ATAACATA	ATAATTCA	ATACCCAG	ATACGGTT	ATACTTGC
ATAGAAGT	ATAGCGTG	ATAGGTTC	ATAGTGAA	ATATAGGC	ATATCCTC	ATATCTGG	ATATGCCT
ATCAAGTG	ATCACGGC	ATCAGCCG	ATCAGTAC	ATCATTGT	ATCCAGCC	ATCCATAG	ATCCCCCT
ATCCCTTA	ATCCTAAA	ATCGAACA	ATCGATGC	ATCGCAGG	ATCGTGTC	ATCGTTCG	ATCTAATC
ATCTACAT	ATCTCTCC	ATCTGGAG	ATCTTCGC	ATCTTGCA	ATGAAACT	ATGAGATC	ATGAGCGT
ATGCATTT	ATGCCAAC	ATGCGGAA	ATGCGTCG	ATGGACTA	ATGGAGGG	ATGGGCAC	ATGGTAGA
ATGTACCG	ATGTCGGA	ATGTGAGG	ATTAAAAA	ATTAAGGT	ATTACCCA	ATTAGTCT	ATTATTAG
ATTCCATG	ATTCCCGC	ATTCGACA	ATTGACCT	ATTGAGAC	ATTGCTGA	ATTGGGCG	ATTGTATT
ATTTCTAT	ATTTGCGA	CAAAATCA	CAAACTGG	CAAAGGAA	CAAATCCC	CAACACTC	CAACCCAT
CAACGGCC	CAAGAATT	CAAGCCTA	CAAGCTCT	CAAGGCCG	CAAGGGGT	CAATACAG	CAATATGT
CAATGTTG	CAATTGCA	CACAAAAC	CACAAGTA	CACACCCG	CACATTCT	CACCAACG	CACCGCGG
CACCTTTC	CACGCAGT	CACGCTTG	CACTACGC	CACTAGAT	CACTCTCA	CACTGAGA	CACTTTAG
CAGAAATG	CAGACCAC	CAGACGCT	CAGAGCCA	CAGAGTGT	CAGCACGA	CAGCATAT	CAGCCTCG
CAGCGAAC	CAGCTAGT	CAGGAGTC	CAGGCTGC	CAGGGTAA	CAGTATTA	CAGTCACC	CAGTCCGT
CATAGCAT	CATAGTTC	CATATAAG	CATATGTT	CATCACCT	CATCCTTT	CATCTCAC	CATCTGCG
CATCTTGA	CATGAGGG	CATGATAC	CATGCGAT	CATGTACC	CATGTCTG	CATTCATA	CATTCGGC
CATTGGAG	CATTGTCT	CCAAACTA	CCAAAGGG	CCAACAAC	CCAACGTT	CCACAAAG	CCACCCGA
CCACGTGG	CCACTGAC	CCAGAAGA	CCAGATCG	CCAGGGTG	CCAGTTTC	CCATAATC	CCATCAGT
CCATCGCC	CCATCTAG	CCATGCAT	CCATGGGA	CCATTCTG	CCCAATAT	CCCACATG	CCCAGCCT
CCCAGTGC	CCCATTTA	CCCCACCC	CCCCATGA	CCCCCGAA	CCCCGGTC	CCCCTAGG	CCCGACGT
CCCGCGGC	CCCGGCTA	CCCGTATT	CCCGTCCG	CCCTCTTC	CCCTTGCT	CCGAGTAG	CCGATAAT
CCGATGCG	CCGCAGGC	CCGCCCTC	CCGCGAGA	CCGCGGAT	CCGCTTCA	CCGGAAAC	CCGGCGAG
CCGGCTTA	CCGGTCGA	CCGTACCA	CCGTCAAA	CCGTCTCT	CCGTGATG	CCGTGCGC	CCGTTGTA
CCTAAAGC	CCTACTCA	CCTAGACG	CCTAGGAC	CCTATGGA	CCTCAATA	CCTCCCCG	CCTCCTAC
CCTCGGCA	CCTCTCGT	CCTGAGCT	CCTGCTGG	CCTGGATC	CCTTCCTT	CGAAAAGT	CGAAACCG
CGAACGCA	CGAAGTAC	CGACATCT	CGACCGGG	CGACTAGA	CGACTGTT	CGACTTAG	CGAGAGAT
CGAGATGC	CGAGCATG	CGAGCTAA	CGATAGTG	CGATGAAG	CGATGCTA	CGATTCGT	CGCACGAG
CGCAGAAA	CGCAGCTG	CGCATACC	CGCCAAGC	CGCCACTT	CGCCCGCT	CGCCGTCA	CGCCTTGT
CGCGGACT	CGCGGTTC	CGCGTCGC	CGCTCGTA	CGCTCTGG	CGCTGTAT	CGCTTCCA	CGCTTGAC
CGGACAGA	CGGACGTC	CGGATCAA	CGGCACAC	CGGCGGTG	CGGCTCCG	CGGGAACA	CGGGCCTT
CGGGGTGG	CGGGTATC	CGGGTTAT	CGGTAGGA	CGGTCCAG	CGGTGGCC	CGGTTACT	CGTAAGCC
CGTACCGT	CGTATTCG	CGTCAGAA	CGTCCATC	CGTCGCGA	CGTCTGGC	CGTGATTA	CGTGCGCG
CGTGGCAC	CGTGTAGT	CGTTAAAC	CGTTCTCC	CGTTGGTT	CTAACTTC	CTAAGACT	CTAATGGC
CTACACGT	CTACATAA	CTACCGTA	CTACGAGC	CTACGGAG	CTACTCCA	CTAGCCAC	CTAGGATA
CTATAGCT	CTATGGTC	CTATTAAA	CTATTTTT	CTCAACGG	CTCAATCC	CTCACTAA	CTCATCAT
CTCCCCTG	CTCCGGGT	CTCGACTC	CTCGCCCA	CTCGGTAG	CTCGTAAC	CTCTCATT	CTCTGCAC
CTCTTGTG	CTGAAGAA	CTGAGGTT	CTGATCTG	CTGCAGCG	CTGCGCCC	CTGCGTTA	CTGCTAAG
CTGGATGT	CTGGCAAT	CTGGCCGG	CTGGGGGA	CTGGTCCT	CTGTACTT	CTGTATAC	CTGTTAGC
CTGTTTCG	CTTAATTT	CTTACGTG	CTTAGCGC	CTTATATC	CTTCACAG	CTTCGTAT	CTTCTACT
CTTCTTTG	CTTGAATG	CTTGACGA	CTTGCAGC	CTTGGCTT	CTTGGTCC	CTTGTGTA	CTTTATCA
CTTTCGAA	CTTTGAGT	CTTTGCCG	CTTTTGCC	GAAAACGT	GAAACCTG	GAAACGGC	GAAAGTGA
GAAATATA	GAAATTAT	GAACAGTG	GAACGAAA	GAAGAACG	GAAGCTAG	GAATACCC	GAATCGAA
GAATGGTT	GAATTACT	GAATTCGG	GACAAGCG	GACATCAA	GACATTTG	GACCAAAT	GACCCCGT
GACCGGTA	GACGACGA	GACGATCC	GACGCGAC	GACGGACA	GACGGCAG	GACGTCCT	GACTAATA
GACTCCTC	GACTCTAT	GACTTGGC	GAGAACAG	GAGACCGA	GAGAGAGG	GAGAGTCC	GAGCCTTA
GAGCGGAG	GAGCTCCA	GAGCTTGC	GAGGCCCC	GAGGGGCT	GAGGTATT	GAGGTGGA	GAGTGCTA
GAGTTAAC	GAGTTGCG	GATAAACT	GATAGGCA	GATATCTC	GATCATTC	GATCCACA	GATCGCCG
GATCGGGC	GATCTATG	GATCTGAT	GATGAGTA	GATGGCGT	GATGTAAA	GATTAAAG	GATTCCCT
GATTCTTG	GATTGATC	GCAAACAC	GCAAGATG	GCAATCCT	GCAATGAA	GCACATAT	GCACCAGC
GCACCGCG	GCACGCAG	GCACGGGT	GCACTATT	GCAGCAAA	GCAGGCGC	GCAGGTCT	GCATAGCA
GCATATGG	GCATTGTC	GCCAAAAA	GCCACTCC	GCCAGGAT	GCCATCGC	GCCCAGAC	GCCCGTCG
GCCCTCAT	GCCGCTGT	GCCGGAGG	GCCGGTAC	GCCGTGAG	GCCTACTT	GCCTCACG	GCCTCGGA
GCCTGGCC	GCCTTTCA	GCGAAACG	GCGAGAAC	GCGAGTTA	GCGATAGA	GCGCAATC	GCGCACGT
GCGCCAAG	GCGGGCAT	GCGGGGTC	GCGGTTGG	GCGTATCC	GCGTCCTG	GCGTGACT	GCGTGGGG
GCGTTGAT	GCTACAGT	GCTACGTC	GCTACTAG	GCTAGCGG	GCTCACTG	GCTCCTGA	GCTCGTTT
GCTGAATT	GCTGACAA	GCTGAGGC	GCTGCGCA	GCTTCCGC	GCTTTCCG	GCTTTTAC	GGAACAAG
GGAACTGT	GGAAGGGG	GGAATAGC	GGACACGA	GGACCATA	GGACCGAT	GGACTCTG	GGAGATTT
GGAGCCCG	GGAGGAGT	GGAGGGTA	GGAGTTGA	GGATAAAA	GGATCTTC	GGATGCAC	GGATTGAG
GGCAATCT	GGCACCGG	GGCAGTAG	GGCATGGA	GGCCCCAA	GGCCCTTT	GGCCTAAC	GGCGAATC
GGCGATGG	GGCGCAAT	GGCGGGCG	GGCGTGTT	GGCTATAC	GGCTGATT	GGCTTAGG	GGGAAATA
GGGAAGAT	GGGACCCT	GGGACTAC	GGGCAGCA	GGGCATAG	GGGCGCGG	GGGCGTTC	GGGCTGGT
GGGGCGTG	GGGGCTCA	GGGGTCAC	GGGTAAGT	GGGTGGAA	GGGTTTTA	GGTAATTG	GGTAGACC
GGTAGCTA	GGTATGAC	GGTCAACG	GGTCACAT	GGTCGGCT	GGTCTCCC	GGTCTGTA	GGTCTTGG
GGTGCCTC	GGTGCGGT	GGTGGATG	GGTGGTGC	GGTGTTCT	GGTTACCA	GGTTAGGG	GGTTCTAA
GGTTTAAT	GTAAAATC	GTAATGTG	GTACCCCC	GTACGGCA	GTACTTTA	GTAGACAT	GTAGAGCC
GTAGCGGA	GTAGGTAA	GTAGTACA	GTAGTCTC	GTATCCGT	GTATCTCA	GTATGACG	GTATGTGC
GTCACAAC	GTCACGTT	GTCAGCTC	GTCAGTCA	GTCCAAGA	GTCCCGAG	GTCCGATG	GTCCTCCG
GTCGAGAA	GTCGCATA	GTCGGTTT	GTCGTAGT	GTCTAACT	GTCTATTG	GTCTGAAA	GTCTGCGG
GTGAAGGC	GTGAGGCG	GTGATACC	GTGATTAA	GTGCACAA	GTGCCAGT	GTGCCGTC	GTGCGCTT
GTGCTTCT	GTGGAAAG	GTGGATTC	GTGGGAGC	GTGTAGTA	GTGTCCAC	GTGTCTTT	GTGTGTAG
GTGTTATG	GTGTTCGA	GTTAACCC	GTTAAGAG	GTTACCAT	GTTACTGC	GTTAGATT	GTTATGCT
GTTCAGTT	GTTCCCTA	GTTCCTCG	GTTCGCAC	GTTCTAGC	GTTGCACT	GTTGGCCA	GTTGGGAT
GTTGTCGG	GTTTATGT	GTTTCAGA	GTTTGGTG	GTTTTCTT	TAAAATTG	TAAACAAT	TAAAGACG
TAAAGGTC	TAACCGAG	TAACGTGC	TAACTACC	TAACTGGT	TAAGACCT	TAAGAGAC	TAAGCTTC
TAAGGAGA	TAAGTCAA	TAAGTGTG	TAATACTA	TAATCCAC	TAATCTCG	TAATTAAG	TACAAACA
TACAACTC	TACAGGAG	TACATAGT	TACCATGT	TACCCAGA	TACCTGAA	TACGAAGC	TACGCAAG
TACGGTCT	TACGTCGG	TACGTGCC	TACTCGCT	TACTGATG	TACTGCCA	TACTTCTT	TAGACATC
TAGATCCG	TAGCACTG	TAGCGATT	TAGCTTAG	TAGGAAAT	TAGGAGCG	TAGGTCTC	TAGGTTGT
TAGTAAGA	TAGTCGGG	TAGTGGAT	TAGTGTTC	TAGTTTCA	TATAAGAT	TATAATGC	TATACGTA
TATACTCT	TATATTAA	TATCAAAA	TATCCCCC	TATCCTGG	TATCGGTG	TATCGTCA	TATGACAG
TATGCATT	TATGCCGA	TATTATTT	TATTGCGG	TATTTGTC	TCAAATGA	TCAACCGC	TCAAGAAA
TCAAGGCT	TCAATTCG	TCACAAGT	TCACACAA	TCACGATC	TCACTCGG	TCAGAACC	TCAGAGTT
TCAGCCTG	TCAGTAAT	TCAGTGCA	TCATCCCT	TCATCTTA	TCATGGAG	TCATTTGT	TCCAAGGC
TCCACACT	TCCATCTG	TCCCCCTT	TCCCCTAG	TCCCGCGC	TCCGATTC	TCCGGGAA	TCCGTGGT
TCCTACAG	TCCTCGAC	TCCTGTTT	TCCTTAGA	TCGAAATT	TCGACCCA	TCGAGCTC	TCGATCGT
TCGATTAC	TCGCCATA	TCGCCTGT	TCGCGGCC	TCGGACGG	TCGGATCA	TCGGCACG	TCGGCCAC
TCGGCGGA	TCGGGTGC	TCGTAGTG	TCGTATAT	TCGTGCCG	TCGTTATC	TCTACGCG	TCTAGTAT
TCTATAGG	TCTCAGAG	TCTCATCC	TCTCCGGC	TCTCTAAC	TCTCTGCT	TCTGCTAA	TCTGGACT
TCTGTCGC	TCTGTTTG	TCTTAACA	TCTTACGT	TCTTCAAT	TCTTGCAC	TCTTGGTA	TGAAAGTA
TGAAGTTT	TGAATCGA	TGAATGAT	TGACCACG	TGACGCCC	TGACGGAA	TGACTTCA	TGAGCAAC
TGAGGGGC	TGAGGTCG	TGATACGG	TGATAGCC	TGATCGTT	TGATGACT	TGATTTAC	TGCAAATG
TGCACTAT	TGCAGCAC	TGCAGTGA	TGCCAGCG	TGCCCTGC	TGCCGAAT	TGCCTCTC	TGCGACAA
TGCGCGGG	TGCGCTTA	TGCGGCTT	TGCGTACG	TGCTCACC	TGCTGGTC	TGCTTTCT	TGGACGAA
TGGACTGG	TGGATAAG	TGGCATTA	TGGCCCAT	TGGCGAGC	TGGCGTCT	TGGCTGAC	TGGGAGGT
TGGGGCGA	TGGGGGAG	TGGTACCT	TGGTCCGC	TGGTGATA	TGGTTCTG	TGTAACTT	TGTACACA
TGTAGAGT	TGTAGCCG	TGTCACGC	TGTCCCTG	TGTCGTAC	TGTCTATT	TGTGAAGA	TGTGCCCT
TGTGGGCA	TGTGTCTA	TGTTATAG	TGTTCAGG	TGTTTGCG	TGTTTTGA	TTAACCAA	TTAAGCGG
TTAAGTCC	TTACAACA	TTACAGTC	TTACCATT	TTACCTGA	TTACGTTG	TTACTTAT	TTAGACGC
TTAGATTA	TTAGCGCT	TTAGTAGG	TTATAAAT	TTATCAGC	TTATTCCG	TTATTGTA	TTCACTTG
TTCAGAGC	TTCATATA	TTCATGCG	TTCCAAAC	TTCCCGCA	TTCCGCAA	TTCCTTGG	TTCGAATT
TTCGCCAT	TTCGGCCC	TTCGGGTG	TTCTAGGG	TTCTATAA	TTCTCCTA	TTCTCTGT	TTCTGTCG
TTCTTGAT	TTCTTTTC	TTGAACGA	TTGACGGT	TTGAGGAC	TTGATTTT	TTGCAGAT	TTGCATGC
TTGCCCCG	TTGCGGGG	TTGCTCTA	TTGGCTCC	TTGGGTAT	TTGGTCAG	TTGGTGGC	TTGTAACC
TTGTCAAG	TTGTGTGA	TTTAATCG	TTTAGTTA	TTTATAAT	TTTCAAGG	TTTCCTTC	TTTCGCGT
TTTCTGGA	TTTGCGAG	TTTGGAAA	TTTGGTGG	TTTGTTAC	TTTTACTC	TTTTGGCT	TTTTTCAA

6. Conclusion

Here, we have presented a novel stegano-cryptographic approach called HyDEn (hybrid DNA encryption), which uses custom-built error-correcting DNA Hamming codes, a randomized code assignment procedure and cyclic permutations based on a private key. HyDEn represents a symmetric cipher that is capable of encrypting and disguising information as long DNA sequences in public bioinformatics discussion groups and DNA sequence databases. Our cryptosystem has significant error tolerance and adds another dimension to the information security field. We are currently working on experimentally evaluating and further improving HyDEn's capabilities following the ideas described in Section 5.3.

6 in total

1. Hiding messages in DNA microdots.

Authors: C T Clelland; V Risca; C Bancroft
Journal: Nature Date: 1999-06-10 Impact factor: 49.962

2. On combinatorial DNA word design.

Authors: A Marathe; A E Condon; R M Corn
Journal: J Comput Biol Date: 2001 Impact factor: 1.479

3. Cryptography with DNA binary strands.

Authors: A Leier; C Richter; W Banzhaf; H Rauhe
Journal: Biosystems Date: 2000-06 Impact factor: 1.973

4. Secret signatures inside genomic DNA.

Authors: Masanori Arita; Yoshiaki Ohashi
Journal: Biotechnol Prog Date: 2004 Sep-Oct

5. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase.

Authors: R K Saiki; D H Gelfand; S Stoffel; S J Scharf; R Higuchi; G T Horn; K B Mullis; H A Erlich
Journal: Science Date: 1988-01-29 Impact factor: 47.728

6. DNA-based watermarks using the DNA-Crypt algorithm.

Authors: Dominik Heider; Angelika Barnekow
Journal: BMC Bioinformatics Date: 2007-05-29 Impact factor: 3.169

6 in total

2 in total

Review 1. DNA nanotechnology: new adventures for an old warhorse.

Authors: Bijan Zakeri; Timothy K Lu
Journal: Curr Opin Chem Biol Date: 2015-06-05 Impact factor: 8.822

2. Multiplexed Sequence Encoding: A Framework for DNA Communication.

Authors: Bijan Zakeri; Peter A Carr; Timothy K Lu
Journal: PLoS One Date: 2016-04-06 Impact factor: 3.240

2 in total