
Is a genome a codeword of an error-correcting code?

Luzinete C B Faria, Andréa S L Rocha, João H Kleinschmidt, Márcio C Silva-Filho, Edson Bim, Roberto H Herai, Michel E B Yamagishi, Reginaldo Palazzo.

Abstract

Since a genome is a discrete sequence, the elements of which belong to a set of four letters, the question as to whether or not there is an error-correcting code underlying DNA sequences is unavoidable. The most common approach to answering this question is to propose a methodology to verify the existence of such a code. However, none of the methodologies proposed so far, although quite clever, has achieved that goal. In a recent work, we showed that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as Hamming codes. In this paper, we show that a complete intron-exon gene, and even a plasmid genome, can be identified as a Hamming code codeword as well. Although this does not constitute a definitive proof that there is an error-correcting code underlying DNA sequences, it is the first evidence in this direction.

Year:  2012        PMID: 22649495      PMCID: PMC3359345          DOI: 10.1371/journal.pone.0036644

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Frequently in science, two seemingly unrelated fields find common ground in a research problem of interest. For example, biology and coding theory share the challenge of answering whether or not there is an error-control mechanism in DNA sequences similar to the one employed in digital transmission systems. Several facts about DNA sequences motivate this line of questioning. One is that DNA sequences may be viewed as “words” written using four letters, the nucleotide bases. Another is that some DNA patches code for protein sequences. Furthermore, several DNA sites have been well annotated in terms of pattern and information content [1]. These biologically significant sequences are usually evolutionarily conserved, and it is important to avoid sequence errors in order to maintain their function. Another interesting point is that the number of genes an organism has does not correlate with its complexity. In fact, the number of non-coding DNA (ncDNA) regions, including repetitive sequences, seems to have been increasing since the beginning of the evolution of the higher eukaryotes, which suggests that organism complexity is related to gene regulation through ncDNA [2]. It is well established that non-coding sequences are biologically important; examples are regulatory regions (promoters, transcription factor binding sites (TFBS), enhancer elements, ncRNA, introns, splicing sites, etc.). Finally, and most importantly, the DNA replication process is far from being the only source of sequence errors. DNA integrity is frequently jeopardized by physical and chemical agents, which means that DNA damage repair mechanisms are indispensable in preventing collateral effects [3]. Interestingly, more than one of these mechanisms is described in the literature [4]. Is it reasonable to infer that some DNA repair mechanisms are a biological implementation of error-correcting codes?
The coding theory community has proposed several methodologies to verify whether or not a particular DNA sequence, usually a protein coding sequence, has an underlying error-correcting code (ECC) [5] and [6]. In spite of their relevance, the results of earlier works do not provide the definitive answer. For instance, based on the procedure for determining whether or not the lac operon and cytochrome c gene can be identified as codewords of linear block codes, the answer is no [7]. Actually, we cannot even conclude that there is no linear block code in other DNA sequences. Of course, as is often the case, there is at least one alternative approach to solving this problem, which is to demonstrate that an ECC underlies DNA sequences. This task is far from easy to accomplish, because a complex error-correcting scheme might consist of many distinct concatenated codes, rather than a single global one, although, to the best of our knowledge, there is no evidence that such an ECC exists. In [8], we attempted to answer a recurring question: Are there DNA sequences that can be identified as codewords for ECCs? If so, we will have taken the first step in a long research journey. The majority of candidate DNA sequences have been positively identified as codewords for a class of cyclic block codes. Such codewords are consistently different from actual DNA sequences by one single nucleotide. Is this difference biologically significant? Are these codewords actually ancient DNA sequences? Up to now, researchers in the fields of biology and coding theory have been working almost independently of one another, and the two groups need to work together to address the new challenges. In this paper, we ask whether or not a whole intron-exon gene structure can be identified as a codeword, and, furthermore, can a whole genome be identified as a codeword? In the following sections, we describe our experiments and results.

Methods

BCH Code

ECCs are always used when transmitting or storing information. The main objective of an ECC is, as the name suggests, to correct errors that might occur during information transmission through noisy channels. BCH codes form a subset of parameterized ECCs, which were first proposed in 1959 by Hocquenghem [9] and independently rediscovered by Bose and Chaudhuri [10] in 1960. The acronym BCH is made up of the initials of Bose, Chaudhuri, and Hocquenghem, in that order. Usually BCH codes are employed in the transmission of information in computer networks and in sequence generation. Due to the simplicity of their encoding and decoding processes, these codes are good candidates for use in the identification and reproduction of DNA sequences [8], [11]–[19]. By “identification”, we mean that the DNA sequence may be either a codeword for an ECC or one of the code sequences. These code sequences may differ from the codeword up to the error correction capability of the code. In the latter case, we say that such code sequences belong to a codeword set. The BCH codes constitute an important generalization of the Hamming codes by allowing multiple error corrections. The parameters associated with a BCH code are denoted by $(n, k, d)$, where $n$ is the codeword length (number of base pairs in DNA sequences); $k$ is the code dimension (length of the input information sequence responsible for generating the DNA sequence); and $d$ is the minimum code distance (the smallest number of positions by which any two codewords may differ).

Converting Nucleotides into Numbers

It is desirable that the alphabet of an ECC have an associated algebraic structure. Although the genetic code has an associated alphabet, the identification of a related algebraic structure remains an open problem. We have considered the ring of integers modulo 4, denoted by $\mathbb{Z}_4$, owing to the ease of code construction using this algebraic structure. Since the alphabet of the genetic code must be converted into the alphabet of the ECC, and vice versa, this conversion has to take into consideration all the possibilities of associating the elements of the set $N = \{A, C, G, T\}$, where $A$ is adenine, $C$ is cytosine, $G$ is guanine, and $T$ is thymine, with the elements of the set $\mathbb{Z}_4 = \{0, 1, 2, 3\}$. We call this association a labeling. The labeling between the set $N$ of nucleotides and the set $\mathbb{Z}_4$ consists of the twenty-four permutations involved, as shown in Figure 1. The aim of these labelings is to determine which permutation matches the codeword with the given DNA sequence.
Figure 1

Permutations associated with labelings A, B, and C.
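The labeling step can be illustrated with a minimal Python sketch. It enumerates the twenty-four bijections between the nucleotides and $\mathbb{Z}_4$ and applies each one to a short sequence. The function names (`all_labelings`, `label`, `unlabel`) are hypothetical, and the grouping of the permutations into the labelings A, B, and C of Figure 1 is not reproduced here because the figure is not available in this text.

```python
from itertools import permutations

NUCLEOTIDES = "acgt"

def all_labelings():
    """Return the 24 bijections from {a, c, g, t} onto Z4 = {0, 1, 2, 3}."""
    return [dict(zip(NUCLEOTIDES, digits)) for digits in permutations(range(4))]

def label(sequence, labeling):
    """Convert a nucleotide string into a list of Z4 symbols."""
    return [labeling[base] for base in sequence.lower()]

def unlabel(word, labeling):
    """Convert a list of Z4 symbols back into a nucleotide string."""
    inverse = {digit: base for base, digit in labeling.items()}
    return "".join(inverse[digit] for digit in word)

labelings = all_labelings()
# The first ten nucleotides of the TRAV7 sequence in Table 1.
candidates = [label("atggagaaga", lab) for lab in labelings]
```

One of the twenty-four candidate words is `[0, 3, 1, 1, 0, 1, 0, 0, 1, 0]`, which matches the first ten symbols of the Olb row in Table 1.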

Next, in order to match the length of the DNA sequence to the codeword length, we must find the degree of the Galois ring extension, denoted by $r$, using the equality $n = 2^r - 1$, where $n$ is the DNA sequence length in base pairs. For instance, if $n = 63$, then the degree of the Galois ring extension is 6. The primitive polynomial $p(x)$ is obtained once we know the value of $r$, and for every value of $r$ there are many primitive polynomials to consider. In looking for a new code, we have observed that there is a generator polynomial $g(x)$ of the BCH code that corresponds to each primitive polynomial $p(x)$.
In the code construction process, the DNA sequence generation algorithm takes into consideration three important facts. The first is to consider every possible value taken by the minimum distance of the code, that is, $d = 2t + 1$, where $t$ denotes the number of errors the code is able to correct. The second is to consider all $p(x)$ with degree $r$ to be used in the Galois ring extension $\mathbb{Z}_4[x]/\langle p(x) \rangle$ (Step 2 and Step 3) and all labelings A, B and C (Step 4), owing to the as yet unknown interdependence of the geometric and algebraic structures in the code construction, where $\mathbb{Z}_4[x]$ denotes the ring of all the polynomials with coefficients in $\mathbb{Z}_4$, and $\langle p(x) \rangle$ denotes the ideal generated by $p(x)$. The third is to determine the group of units in this Galois ring extension, that is, the set of its invertible elements. The additional computational complexity in the solution of this problem comes from the fact that the greater the degree of the Galois ring extension, the larger the number of $p(x)$ to be considered in the code construction. Knowing that the number of codewords generated by these codes grows exponentially with the code dimension, instead of generating all the codewords and comparing them with the given DNA sequence, the twenty-four permutations are applied to that DNA sequence, and these sequences are considered as “possible codewords”.
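As a rough illustration of how the degree of the extension and the number of primitive polynomials grow with the sequence length, the following Python sketch computes $r$ from $n = 2^r - 1$ and counts primitive polynomials of degree $r$. The count uses the standard formula $\varphi(2^r - 1)/r$ for primitive polynomials over GF(2); applying it here is an assumption, although it agrees with the counts 48 and 176 quoted in the Results for degrees 9 and 11. The function names are hypothetical.

```python
def extension_degree(n):
    """Return r such that n == 2**r - 1; raise ValueError if no such r exists."""
    r = (n + 1).bit_length() - 1
    if n < 1 or 2**r - 1 != n:
        raise ValueError(f"{n} is not of the form 2**r - 1")
    return r

def euler_phi(m):
    """Euler's totient function, computed by trial division."""
    result, p = m, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result

def primitive_polynomial_count(r):
    """Number of primitive polynomials of degree r over GF(2): phi(2^r - 1) / r."""
    return euler_phi(2**r - 1) // r

for n in (63, 511, 2047):
    r = extension_degree(n)
    print(n, r, primitive_polynomial_count(r))
```

A sequence whose length is not of the form $2^r - 1$ raises a `ValueError` in this sketch, which reflects the length constraint stated above.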
Then, to determine which of the twenty-four sequences are, in fact, codewords, the relation $v H^T = 0$ is employed, where $v$ is each of the possible codewords and $H^T$ denotes the transpose of the parity-check matrix. Because the DNA sequence may differ from the codeword by one nucleotide, the analysis also considers the other three possible nucleotides at each position in the sequence for each permutation, and again uses the relation $v H^T = 0$ in order to verify whether or not $v$ is a possible codeword. Single-stranded DNA sequences, such as single-stranded chromosomes, genes, introns, exons, repetitive DNA, and mRNA sequences, may be either a codeword for an ECC or belong to the codeword set of an ECC. In order to verify whether or not a DNA sequence may actually be identified as a codeword, we could use a brute-force strategy, i.e. generate all the codewords and compare the DNA sequence with each codeword. However, this is not a practical strategy, because the computational effort to do this would be prohibitive, as explained below. In order to address this identification problem, we have developed an algorithm called the DNA Sequence Generation Algorithm, which verifies whether or not a given DNA sequence can be identified as a codeword of an ECC. This algorithm is the same as the one in [8]; however, it differs from the algorithm in [20] in that it considers the Galois ring extension as the algebraic structure, instead of the Galois field extension. There are also some conceptual differences, which are discussed in [15] and [17].

DNA Sequence Generation Algorithm

Input data: 1) the original DNA sequence in nucleotides (NCBI); 2) the codeword length $n$; and 3) the minimum distance $d$.
Step 1 - Generate all primitive polynomials $p(x)$ with degree $r$ to be used in the Galois ring extensions.
Step 2 - Select one $p(x)$ from Step 1, and find the set of elements that have an inverse, that is, the group of units of $\mathbb{Z}_4[x]/\langle p(x) \rangle$.
Step 3 - Find the generator and parity-check polynomials of the BCH code from the minimum distance and the primitive polynomial selected in Step 2. In this way, the generator and parity-check matrices, and their transposes, are determined.
Step 4 - From the mapping $N \to \mathbb{Z}_4$, convert the DNA sequence with elements in $N$ into the corresponding sequence with elements in $\mathbb{Z}_4$.
Step 5 - Verify, by use of the syndrome $s = v H^T$, whether or not each of the converted DNA sequences is a codeword. If $s = 0$, then store the sequence. If $s \neq 0$, up to $t$ nucleotide differences may exist; in that case, all combinations of 1 to $t$ positions must be considered, taking into account the other three nucleotide possibilities at each position of the DNA sequence. Verify whether each combination is a codeword: if so, store it; otherwise disregard it.
Step 6 - From the mapping $\mathbb{Z}_4 \to N$, convert each sequence stored in Step 5, with elements in $\mathbb{Z}_4$, into the corresponding sequence with elements in $N$. Compare each of these sequences with the original DNA sequence and show the position at which the nucleotides differ.
Step 7 - Go to Step 1. Select another $p(x)$ and verify whether or not all the $p(x)$ have already been used: if not, repeat Steps 2 to 6 for each $p(x)$ from Step 1; otherwise, go to Step 8.
Step 8 - End.
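Step 5 can be sketched in Python on a toy example. This is not the authors' implementation: it assumes a length-7 cyclic code over $\mathbb{Z}_4$ whose generator polynomial $g(x) = x^3 + 2x^2 + x + 3$ is chosen here purely for illustration (it is a lift to $\mathbb{Z}_4$ of the binary primitive polynomial $x^3 + x + 1$ and divides $x^7 - 1$). In place of the parity-check matrix, the sketch tests membership by polynomial division, which for a cyclic code is an equivalent check: a word is a codeword exactly when $g(x)$ divides it. All names are hypothetical.

```python
N = 7                 # toy codeword length, 2**3 - 1 (assumption for illustration)
G = [3, 1, 2, 1]      # g(x) = x^3 + 2x^2 + x + 3 over Z4, lowest degree first

def remainder_mod_g(word):
    """Remainder of word(x) divided by the monic g(x), with coefficients mod 4."""
    rem = [c % 4 for c in word]
    deg_g = len(G) - 1
    for i in range(len(rem) - 1, deg_g - 1, -1):
        lead = rem[i]
        if lead:
            # Subtract lead * x^(i - deg_g) * g(x) to cancel the leading term.
            for j, g in enumerate(G):
                rem[i - deg_g + j] = (rem[i - deg_g + j] - lead * g) % 4
    return rem[:deg_g]

def is_codeword(word):
    """A zero remainder plays the role of the zero syndrome in Step 5."""
    return not any(remainder_mod_g(word))

def encode(message):
    """Multiply message(x), of degree < N - 3, by g(x) over Z4."""
    out = [0] * N
    for i, m in enumerate(message):
        for j, g in enumerate(G):
            out[i + j] = (out[i + j] + m * g) % 4
    return out

def single_substitution_codewords(word):
    """Try the other three symbols at every position; keep the codewords found."""
    found = []
    for pos in range(len(word)):
        for symbol in range(4):
            if symbol != word[pos]:
                trial = word[:pos] + [symbol] + word[pos + 1:]
                if is_codeword(trial):
                    found.append((pos, trial))
    return found

codeword = encode([1, 2, 3, 0])
received = list(codeword)
received[2] = (received[2] + 1) % 4   # introduce one symbol difference
print(is_codeword(received), single_substitution_codewords(received))
```

The search visits $3n$ substitutions per candidate word, which is why the number of tests grows quickly once every primitive polynomial, labeling, and minimum distance is tried.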

Results and Discussion

We have successfully applied this algorithm to the TRAV7 gene sequence and the plasmid Lactococcus lactis genome sequence. These sequences are represented in Table 1 and Table 2 using the following abbreviations: Ont = original nucleotide; Olb = original labeling; Glb = generated labeling; and Gnt = generated nucleotide. Although we have used all the $p(x)$, all the corresponding $g(x)$, and all the possible minimum code distances in the construction of the BCH code over $\mathbb{Z}_4$, the results show that only codes with the minimum distance $d = 3$ associated with a specific $p(x)$, which in turn is associated with its $g(x)$ and labeling, are able to identify the TRAV7 gene and the plasmid genome sequences. Consequently, the algebraic structure, alphabet, labeling, $p(x)$, and $g(x)$ have to be considered in the construction of BCH codes over rings.
Table 1

TRAV7 gene sequence chromosome 14.

1Ont: atggagaaga tgcggagacc tgtcctaatt atattttgtc tatgtcttgg ctgtaagttg
Olb:031101001031211010223132230033030333313230313233112313001331
Glb:031101001031211010223132230033030333313230313233112313001331
Gnt: atggagaaga tgcggagacc tgtcctaatt atattttgtc tatgtcttgg ctgtaagttg
61Ont: agggttctaa gaactgggga ccccaggaga catttattca agtccttttg gggagatggg
Olb:011133230010023111102222011010203330332001322333311110103111
Glb:011133230010023111102222011010203330332001322333311110103111
Gnt: agggttctaa gaactgggga ccccaggaga catttattca agtccttttg gggagatggg
121Ont: gatgtagtct ggacttactt gtcattgctt gtttgagatt aagaaataaa attatgaaag
Olb:103130132311023302331320331233133310103300100030000330310001
Glb:113130132311023302331320331233133310103300100030000330310001
Gnt: ggtgtagtct ggacttactt gtcattgctt gtttgagatt aagaaataaa attatgaaag
181Ont: gtctaaatta aaatgtacat attgtacctg atgtctttct gaataggggc aaatggagaa
Olb:132300033000031302030331302231031323332310030111120003110100
Glb:132300033000031302030331302231031323332310030111120003110100
Gnt: gtctaaatta aaatgtacat attgtacctg atgtctttct gaataggggc aaatggagaa
241Ont: aaccaggtgg agcacagccc tcattttctg ggaccccagc agggagacgt tgcctccatg
Olb:002201131101202012223203333231110222201201110102133122322031
Glb:002201131101202012223203333231110222201201110102133122322031
Gnt: aaccaggtgg agcacagccc tcattttctg ggaccccagc agggagacgt tgcctccatg
301Ont: agctgcacgt actctgtcag tcgttttaac aatttgcagt ggtacaggca aaatacaggg
Olb:012312021302323132013213333002003331201311302011200003020111
Glb:012312021302323132013213333002003331201311302011200003020111
Gnt: agctgcacgt actctgtcag tcgttttaac aatttgcagt ggtacaggca aaatacaggg
361Ont: atgggtccca aacacctatt atccatgtat tcagctggat atgagaagca gaaaggaaga
Olb:031113222000202230330322031303320123110303101001201000110010
Glb:031113222000202230330322031303320123110303101001201000110010
Gnt: atgggtccca aacacctatt atccatgtat tcagctggat atgagaagca gaaaggaaga
421Ont: ctaaatgcta cattactgaa gaatggaagc agcttgtaca ttacagccgt gcagcctgaa
Olb:230003123020330231001003110012012331302033020122131201223100
Glb:230003123020330231001003110012012331302033020122131201223100
Gnt: ctaaatgcta cattactgaa gaatggaagc agcttgtaca ttacagccgt gcagcctgaa
481Ont: gattcagcca cctatttctg tgctgtagat g
Olb:1033201220223033323131231301031
Glb:1033201220223033323131231301031
Gnt: gattcagcca cctatttctg tgctgtagat g
Table 2

Lactococcus lactis plasmid genomic sequence.

1Ont: cctacatttt tttattgctc tgctatgatt gtttatcgat agttttttat acagataagc
Olb:113010333333303321313213032033233303120302333333030102030021
Glb:113010333333303321313213032033233303120302333333030102030021
Gnt: cctacatttt tttattgctc tgctatgatt gtttatcgat agttttttat acagataagc
61Ont: gtgcgacgct tgctctttcc gaggaggaag tcatgctgac aagcacggca gagcctccgc
Olb:232120121332131333112022022002310321320100210122102021131121
Glb:232120121332131333112022022002310321320100210122102021131121
Gnt: gtgcgacgct tgctctttcc gaggaggaag tcatgctgac aagcacggca gagcctccgc
121Ont: atgaaatgct ctcaatgaaa ttgccggcgg agcttttttg agcttgtgcc acttgcgaaa
Olb:032000321313100320003321122122021333333202133232110133212000
Glb:032000321313100320003321122122021333333202133232110133212000
Gnt: atgaaatgct ctcaatgaaa ttgccggcgg agcttttttg agcttgtgcc acttgcgaaa
181Ont: aaaacaagaa caaaagagac aggaaactgt ctttttttgc ttgcttgggg attggggcaa
Olb:000010020010000202010220001323133333332133213322220332222100
Glb:000010020010000202010220001323133333332133213322220332222100
Gnt: aaaacaagaa caaaagagac aggaaactgt ctttttttgc ttgcttgggg attggggcaa
241Ont: cgccccaaaa ataaaaagaa tcgtctgaaa cgaggaacaa actaaaatgt aaattttagt
Olb:121111000003000002003123132000120220010001300003230003333023
Glb:121111000003000002003123132000120220010001300003230003333023
Gnt: cgccccaaaa ataaaaagaa tcgtctgaaa cgaggaacaa actaaaatgt aaattttagt
301Ont: tgttaccgag tggaagatga atacttttta acctatgtgt atacacacat agtaagctcg
Olb:323301120232200203200301333330011303232303010101030230021312
Glb:323301120232200203200301333330011303232303010101030230021312
Gnt: tgttaccgag tggaagatga atacttttta acctatgtgt atacacacat agtaagctcg
361Ont: ctataatact ttataacgtt tttatttaca tgagcaaagc gagtttttcc aacacgttta
Olb:130300301333030012333330333010320210002120233333110010123330
Glb:130300301333030012333330333010320210002120233333110010123330
Gnt: ctataatact ttataacgtt tttatttaca tgagcaaagc gagtttttcc aacacgttta
421Ont: atctaaaata ttggcaattt ataccatgat tttcatggta tgtaagtgcg cccttaggaa
Olb:031300003033221003330301103203333103223032300232121113302200
Glb:031300003033221003330301103203333103223032300232121113302200
Gnt: atctaaaata ttggcaattt ataccatgat tttcatggta tgtaagtgcg cccttaggaa
481Ont: aataatttga atatatttca gattttcaat ctgactgctc ctgtcatcga gcagaccgat
Olb:003003332003030333102033331003132013213113231031202102011203
Glb:003003332003030333102033331003132013213113231031202102011203
Gnt: aataatttga atatatttca gattttcaat ctgactgctc ctgtcatcga gcagaccgat
541Ont: gaggaaaaca aaaagaggac taaacaaaaa agtttagtcc tctttttgtt ttgaatagtt
Olb:202200001000002022013000100000023330231131333332333320030233
Glb:202200001000002022013000100000023330231131333332333320030233
Gnt: gaggaaaaca aaaagaggac taaacaaaaa agtttagtcc tctttttgtt ttgaatagtt
601Ont: ctagaacgtc atattttgcg ttttaagcaa ttttgactaa ctaggcgggg atttttactt
Olb:130200123103033332123333002100333320130013022122220333330133
Glb:130200123103033332123333002100333320130013022122220333330133
Gnt: ctagaacgtc atattttgcg ttttaagcaa ttttgactaa ctaggcgggg atttttactt
661Ont: agaaattatt caaaacgtct gtaaagtgct taaaatcgtt tctaagagct tttagcgttt
Olb:020003303310000123132300023213300003123331300202133330212333
Glb:020003303310000123132300023213300003123331300202133330212333
Gnt: agaaattatt caaaacgtct gtaaagtgct taaaatcgtt tctaagagct tttagcgttt
721Ont: atttcgttta gttatcggca taatcgttaa aacaggcgtt atcgtagcgg aaaagccctt
Olb:033312333023303122103003123300001022123303123021220000211133
Glb:033312333023303122103003123300001022123303123021220000211133
Gnt: atttcgttta gttatcggca taatcgttaa aacaggcgtt atcgtagcgg aaaagccctt
781Ont: gagcgtagcg tggctttgca gtgaagatgt tgtctgttag attatgaaag ccgataactg
Olb:202123021232213332102320020323323132330203303200021120300132
Glb:202123021232213332102320020323323132330203303200021120300132
Gnt: gagcgtagcg tggctttgca gtgaagatgt tgtctgttag attatgaaag ccgataactg
841Ont: aatgaaataa taagcgtagc gccccttatt tcggtcggag gaggctcaag ggagtttgag
Olb:003200030030021230212111133033312231220220221310022202333202
Glb:003200030030021230212111133033312231220220221310022202333202
Gnt: aatgaaataa taagcgtagc gccccttatt tcggtcggag gaggctcaag ggagtttgag
901Ont: ggaatgaaat tccctcatgg ttttaaaatt gcttgcaatt ttgccgagcg gtagcgctgg
Olb:220032000331113103223333000033213321003333211202122302121322
Glb:220032000331113103223333000033213321003333211202122302121322
Gnt: ggaatgaaat tccctcatgg ttttaaaatt gcttgcaatt ttgccgagcg gtagcgctgg
961Ont: aaaatttttg aaaaaaattt ggaatttgga aaaatggggg ggtactacga ccccccccta
Olb:000033333200000003332200333220000032222222301301201111111130
Glb:000033333200000003332200333220000032222222301301201111111130
Gnt: aaaatttttg aaaaaaattt ggaatttgga aaaatggggg ggtactacga ccccccccta
1021Ont: tgtggtaatt tggtaacttg gtcaaaattg atactaatat atattaaaac agcacaaaac
Olb:323223003332230013322310000332030130030303033000010210100001
Glb:323223003332230013322310000332030130030303033000010210100001
Gnt: tgtggtaatt tggtaacttg gtcaaaattg atactaatat atattaaaac agcacaaaac
1081Ont: agaatcttat gatataataa gatatactga aatttgaagg agtaaaaaat ggcagaagag
Olb:020031330320303003002030301320003332002202300000032210200202
Glb:020031330320303003002030301320003332002202300000032210200202
Gnt: agaatcttat gatataataa gatatactga aatttgaagg agtaaaaaat ggcagaagag
1141Ont: aaaaaaagag ttttgctaac tttgtcgttg gacaaagcag aagaattaga aactatatca
Olb:000000020233332130013332312332201000210200200330200013030310
Glb:000000020233332130013332312332201000210200200330200013030310
Gnt: aaaaaaagag ttttgctaac tttgtcgttg gacaaagcag aagaattaga aactatatca
1201Ont: aaagaaatgg gaattagtaa atctgctctt gttagtttat ggattgcgga aaattctaga
Olb:000200032220033023000313213133233023330322033212200003313020
Glb:000200032220033023000313213133233023330322033212200003313020
Gnt: aaagaaatgg gaattagtaa atctgctctt gttagtttat ggattgcgga aaattctaga
1261Ont: aaataaaaaa agagccacgg cgaatggctc tagtatattt acggttagga atattatagc
Olb:000300000002021101221200322131302303033301223302200303303021
Glb:000300000002021101221200322131302303033301223302200303303021
Gnt: aaataaaaaa agagccacgg cgaatggctc tagtatattt acggttagga atattatagc
1321Ont: atatgacaga aaaaaaacta gaaaaaaatg acccagttag aaactggagt tgggttgttt
Olb:030320102000000001302000000032011102330200013220233222332333
Glb:030320102000000001302000000032011102330200013220233222332333
Gnt: atatgacaga aaaaaaacta gaaaaaaatg acccagttag aaactggagt tgggttgttt
1381Ont: atccagagtc tgctcctgaa aattggagaa cattgttaga cgaaactgga gaaaaatgga
Olb:031102023132131132000033220200103323302012000132202000003220
Glb:031102023132131132000033220200103323302012000132202000003220
Gnt: atccagagtc tgctcctgaa aattggagaa cattgttaga cgaaactgga gaaaaatgga
1441Ont: ttgagagtcc gttgcatgat aaagatatta acgaaacaac aaacgaaccg aaaaaggcac
Olb:332020231123321032030002030330012000100100012001120000022101
Glb:332020231123321032030002030330012000100100012001120000022101
Gnt: ttgagagtcc gttgcatgat aaagatatta acgaaacaac aaacgaaccg aaaaaggcac
1501Ont: attggcatat aataatttct ttttcaaata aaaaaagtta taagcaagta ttaaaaattt
Olb:033221030300300333133333100030000000233030021002303300000333
Glb:033221030300300333133333100030000000233030021012303300000333
Gnt: attggcatat aataatttct ttttcaaata aaaaaagtta taagcacgta ttaaaaattt
1561Ont: ctgaaatgtt aaatgcacca gagcctgtaa aaacaaaaaa tttacaaggg tcagttcaat
Olb:132000323300032101102021132300000100000033301002223102331003
Glb:132000323300032101102021132300000100000033301002223102331003
Gnt: ctgaaatgtt aaatgcacca gagcctgtaa aaacaaaaaa tttacaaggg tcagttcaat
1621Ont: atttgtggca cagaaacaat cctgaaaaat atcagtataa taaaagcgat gttgttgctc
Olb:033323221010200010031132000003031023030030000212032332332131
Glb:033323221010200010031132000003031023030030000212032332332131
Gnt: atttgtggca cagaaacaat cctgaaaaat atcagtataa taaaagcgat gttgttgctc
1681Ont: ataatgggtt taaatataga caatatttaa cagatattgg agttgatact gattctattt
Olb:030032223330003030201003033300102030332202332030132033130333
Glb:030032223330003030201003033300102030332202332030132033130333
Gnt: ataatgggtt taaatataga caatatttaa cagatattgg agttgatact gattctattt
1741Ont: tacaagaagt tatagaatgg ataaaagaaa ctggatgttc tgaatataga gatttagtcg
Olb:301002002330302003220300002000132203233132003030202033302312
Glb:301002002330302003220300002000132203233132003030202033302312
Gnt: tacaagaagt tatagaatgg ataaaagaaa ctggatgttc tgaatataga gatttagtcg
1801Ont: attatgcagt atcagaacgt ttcgatgatt ggtttcctac agtcagaagt caaaccatat
Olb:033032102303102001233312032033223331130102310200231000110303
Glb:033032102303102001233312032033223331130102310200231000110303
Gnt: attatgcagt atcagaacgt ttcgatgatt ggtttcctac agtcagaagt caaaccatat
1861Ont: ttttaaattc ttatttacgc tcaaatcgtc atagtcagaa aaaatataat ccagaaacag
Olb:333300033133033301213100031231030231020000003030031102000102
Glb:333300033133033301213100031231030231020000003030031102000102
Gnt: ttttaaattc ttatttacgc tcaaatcgtc atagtcagaa aaaatataat ccagaaacag
1921Ont: gagaggtgtt atgaaagttg aaattatagc tagtgttttt agtgaaaaat cagttcagaa
Olb:202022323303200023320003303021302323333302320000031023310200
Glb:202022323303200023320003303021302323333302320000031023310200
Gnt: gagaggtgtt atgaaagttg aaattatagc tagtgttttt agtgaaaaat cagttcagaa
1981Ont: aaaagtaaat aattttattg attatttaaa tgacaataat tttgaagtat tggaagttca
Olb:000023000300333303320330333000320100300333320023033220023310
Glb:000023000300333303320330333000320100300333320023033220023310
Gnt: aaaagtaaat aattttattg attatttaaa tgacaataat tttgaagtat tggaagttca
2041Ont: atatagg
Olb:0303022
Glb:0303022
Gnt: atatagg
The fact that a DNA sequence is identified as a sequence belonging to a codeword set of a BCH code with the minimum distance $d = 3$ (and no other minimum distance) implies that this BCH code is equivalent to the Hamming code with parameters $(n, k, 3)$, independently of the algebraic structure associated with the alphabet of the code. Therefore, the Hamming codes constructed by considering the group of units in the Galois ring extension are able to identify and reproduce the DNA sequences that differ by one nucleotide from the posted NCBI sequences. We have also noted that the labeling, which is the set consisting of the twenty-four permutations, is split into three subsets, each of which contains eight permutations and defines a labeling denoted by A, B, and C (Figure 1).
The TRAV7 predicted gene has 511 nucleotides, and therefore the codeword length is $n = 511$ (Table 1). Using the equality $n = 2^r - 1$, it is easy to calculate the degree of the Galois ring extension, which is 9. The number of $p(x)$ for this extension is 48 [11], [12]. Among these, just one $p(x)$ is associated with a $g(x)$ of the Hamming code (511, 502, 3). Furthermore, this identification was made using a single labeling, the one shown in Table 1.
A statistical analysis related to the TRAV7 gene sequence on chromosome 14 of the human genome is as follows. With each primitive polynomial there is a corresponding generator polynomial of a code. For the given DNA sequence we use the 24 labelings, and the resulting 24 sequences are multiplied by the generator matrix. This operation results in 24 codewords. Each one of these codewords is multiplied by the parity-check matrix. If the result is zero, then the given DNA sequence is a codeword. Otherwise, we have to verify what happens if each position holds a different nucleotide. To do that, we have to perform three substitutions at each position of the original DNA sequence and verify again whether or not the modified sequence is a codeword. Since the TRAV7 gene genomic sequence has $n = 511$, it follows that $r = 9$.
From this, the degree of the primitive polynomial is 9 and as a result we have 48 different primitive polynomials. Since for each one of them we have to use the 24 labelings, this leads to 1152 codewords to verify for a given error-correcting capability. Since in this case we have 256 possibilities, an upper bound is 294,912 codewords to be tested. Now, since there is always one nucleotide difference, we have to perform three times 63 tests for each one of the 294,912 codewords, yielding a total of $3 \times 63 \times 294{,}912 = 55{,}738{,}368$ tests. Thus, the probability of finding a given sequence is approximately $1.8 \times 10^{-8}$, that is, approximately 1 sequence out of $5.6 \times 10^{7}$.
The Lactococcus lactis plasmid genomic sequence has 2047 nucleotides. So, the codeword length is $n = 2047$ and the degree of the Galois ring extension is 11. The number of $p(x)$ is 176 [11], [12]. Again, among these, only one $p(x)$ is associated with a $g(x)$ of the Hamming code (2047, 2036, 3), and this identification was made using a single labeling, as shown in Table 2. A statistical analysis related to the Lactococcus lactis plasmid genomic sequence is as follows. With each primitive polynomial there is a corresponding generator polynomial of a code. For the given DNA sequence we use the 24 labelings, and the resulting 24 sequences are multiplied by the generator matrix. This operation results in 24 codewords. Each one of these codewords is multiplied by the parity-check matrix. If the result is zero, then the given DNA sequence is a codeword. Otherwise, we have to verify what happens if each position holds a different nucleotide. To do that, we have to perform three substitutions at each position of the original DNA sequence and verify again whether or not the modified sequence is a codeword. Since the Lactococcus lactis plasmid genomic sequence has $n = 2047$, it follows that $r = 11$. From this, the degree of the primitive polynomial is 11 and as a result we have 176 different primitive polynomials.
Since for each one of them we have to use the 24 labelings, this leads to 4224 codewords to verify for a given error-correcting capability. Since in this case we have 1018 possibilities, an upper bound is 4,300,032 codewords to be tested. Now, since there is always one nucleotide difference, we have to perform three times 63 tests for each one of the 4,300,032 codewords, yielding a total of $3 \times 63 \times 4{,}300{,}032 = 812{,}706{,}048$ tests. Thus, the probability of finding a given sequence is approximately $1.2 \times 10^{-9}$, that is, approximately 1 sequence out of $8.1 \times 10^{8}$.
Note that $g(x)$ is also a primitive polynomial, since reducing its coefficients modulo 2 leads to $p(x)$. Therefore, both polynomials are associated with the same algebraic and geometric properties. Contrary to our expectations, there is just one $p(x)$, its corresponding $g(x)$, and a labeling capable of identifying each sequence under consideration. This suggests the existence of an intrinsic geometric property that may be associated with each DNA sequence. What has been observed is that, in all the DNA sequences previously identified, there is always a difference of a single nucleotide between the NCBI sequence and the codeword generated by a Hamming code. Although the code (owing to its error correction capability) allows a difference in any position in the sequence, this difference occurs at one specific position. In the biological context, this mismatch is known as a single nucleotide polymorphism (SNP). We can observe that the SNP occurred at position 122 in the TRAV7 predicted gene, where the original a is replaced by g in the generated sequence, giving a transition mutation (a purine exchanged for a purine, or a pyrimidine for a pyrimidine), as shown in Table 1. In contrast, in the Lactococcus lactis plasmid genomic sequence, the SNP occurred at position 1547, where the original a is replaced by c, giving a transversion mutation (a purine exchanged for a pyrimidine, or vice versa), as shown in Table 2.
Note that in the TRAV7 predicted gene the SNP occurred in the intronic region, whereas in the Lactococcus lactis plasmid genomic sequence the SNP occurred in the region where the repB gene is located (Figure 2). One possible interpretation is that either the codeword generated by a Hamming code is an ancestor of the corresponding NCBI sequence, or it is an SNP with respect to the corresponding NCBI sequence, or the other way around. However, since this mismatch is within the error correction capability of the code, it follows that the modified Berlekamp-Massey decoding algorithm [15] is capable of detecting and correcting such a mismatch.
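The single-nucleotide differences reported in Tables 1 and 2 can be located and classified with a short Python sketch. It compares an Ont row with the corresponding Gnt row and labels each mismatch as a transition or a transversion. The function names and the `start` parameter (the position number printed at the beginning of each table row) are hypothetical conveniences, not part of the original method.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}

def mutation_type(original, generated):
    """Classify a substitution as a transition or a transversion."""
    pair = {original.lower(), generated.lower()}
    if len(pair) != 2:
        raise ValueError("the two nucleotides must differ")
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def differences(ont, gnt, start=1):
    """Return (position, original, generated) for every mismatch.

    Spaces used to group nucleotides in the tables are ignored, and
    `start` is the 1-based position of the first nucleotide in the row.
    """
    ont, gnt = ont.replace(" ", ""), gnt.replace(" ", "")
    if len(ont) != len(gnt):
        raise ValueError("rows must have the same length")
    return [(start + i, a, b) for i, (a, b) in enumerate(zip(ont, gnt)) if a != b]

# Rows copied from Table 1 (TRAV7, row 121) and Table 2 (plasmid, row 1501).
trav7 = differences("gatgtagtct ggacttactt", "ggtgtagtct ggacttactt", start=121)
plasmid = differences(
    "attggcatat aataatttct ttttcaaata aaaaaagtta taagcaagta ttaaaaattt",
    "attggcatat aataatttct ttttcaaata aaaaaagtta taagcacgta ttaaaaattt",
    start=1501,
)
for position, original, generated in trav7 + plasmid:
    print(position, original, generated, mutation_type(original, generated))
```

Applied to these rows, the sketch reports position 122 (a to g, a transition) and position 1547 (a to c, a transversion), in agreement with the positions stated in the text.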
Figure 2

Plasmidial DNA and TRAV7 gene generation by Hamming codes.

Conclusion

In this paper, we have shown that not only are some protein coding sequences identified with the codewords of Hamming codes, but a gene, and even a whole genome, is identified with codewords as well. Although this is not a definitive answer to the question of whether or not there is an error-correcting code underlying actual DNA sequences, it is an encouraging result. The majority of the DNA sequences were reproduced by the Hamming codes over rings. One possible explanation is provided by the arithmetic and computational flexibilities of this algebraic structure. As a consequence, sequences reproduced by the Hamming codes over fields exhibit less adaptability than those reproduced by the Hamming codes over rings. This observation suggests that it is possible to classify the proteins according to their stability in the mutation index. As usually occurs when a new result appears, many new questions emerge. Do these results, in fact, reveal the existence of a mathematical structure underlying DNA sequences? Why does the code point to a specific position for each reproduced sequence? Biologically, how important is the SNP in the position pointed out by the code?
  6 in total

1.  Examining coding structure and redundancy in DNA. How does DNA protect itself from life's uncertainty?

Authors:  Gail L Rosen
Journal:  IEEE Eng Med Biol Mag       Date:  2006 Jan-Feb

2.  Repeat performance: how do genome packaging and regulation depend on simple sequence repeats?

Authors:  Ram Parikshan Kumar; Ramamoorthy Senthilkumar; Vipin Singh; Rakesh K Mishra
Journal:  Bioessays       Date:  2010-02       Impact factor: 4.345

3.  Is there an error correcting code in the base sequence in DNA?

Authors:  L S Liebovitch; Y Tao; A T Todorov; L Levine
Journal:  Biophys J       Date:  1996-09       Impact factor: 4.033

4.  Information content of binding sites on nucleotide sequences.

Authors:  T D Schneider; G D Stormo; L Gold; A Ehrenfeucht
Journal:  J Mol Biol       Date:  1986-04-05       Impact factor: 5.469

5.  Are introns in-series error-detecting sequences?

Authors:  D R Forsdyke
Journal:  J Theor Biol       Date:  1981-12-21       Impact factor: 2.691

Review 6.  DNA repair mechanisms in mammalian germ cells.

Authors:  Saffet Ozturk; Necdet Demir
Journal:  Histol Histopathol       Date:  2011-04       Impact factor: 2.303

