Abstract
A new approach for encoding DNA sequences as input for DNA sequence analysis is proposed, using the error-correction coding theory of communication engineering. The encoder was designed as a convolutional code model whose generator matrix is based on the degeneracy of codons, with a codon treated in the model as an informational unit. The utility of the proposed model was demonstrated through the analysis of twelve prokaryote and nine eukaryote DNA sequences having different GC contents. Distinct differences in code distances were observed near the initiation and termination sites in the open reading frame, which provided a well-regulated characterization of the DNA sequences. Clearly distinguished period-3 features appeared in the coding regions, and the characteristic average code distances of the analyzed sequences were approximately proportional to their GC contents, particularly in the selected prokaryotic organisms. This suggests potential utility as an additional taxonomic characteristic for studying the relationships of living organisms.
Year: 2013 PMID: 23591850 PMCID: PMC3645750 DOI: 10.3390/ijms14048393
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Curves of average code distance of the 12 prokaryotes near the initiation site.
Figure 2. Curves of average code distance of the 12 prokaryotes near the termination site.
Figure 3. Curves of average code distance of the nine eukaryotes near the initiation site.
Figure 4. Curves of average code distance of the nine eukaryotes near the termination site.
Selected prokaryotes and their features.
| NCBI Ref. Seq. Access Number | Selected Prokaryotes | GC Content (%) | CACD, (6,3,2) convolutional code | CACD, (5,2) block code |
|---|---|---|---|---|
| NC_006349 | | 68 | | 1.9076 |
| NC_009434 | | 63 | 2.3062 | |
| NC_003197 | | 52 | 2.2336 | 1.8721 |
| NC_000913 | | 50 | 2.2330 | 1.8712 |
| NC_004088 | | 47 | 2.2195 | 1.8624 |
| NC_003098 | | 39 | 2.1547 | 1.7943 |
| NC_004070 | | 38 | 2.1529 | 1.8028 |
| NC_004350 | | 36 | 2.1524 | 1.7987 |
| NC_002662 | | 35 | 2.1184 | 1.7827 |
| NC_004461 | | 32 | 2.1149 | 1.7699 |
| NC_002758 | | 32 | 2.1139 | 1.7658 |
| NC_010163 | | 31 | | |
| | | | max − min = 0.2369 | max − min = 0.1823 |
All sequences are complete genomes;
CACD of the (6,3,2) convolutional code near initiation site;
CACD of May et al.’s (5, 2) block code near initiation site.
CACD, characteristic average code distances.
Selected eukaryotes and their features.
| NCBI Ref. Seq. Access Number | Selected Eukaryotes | GC Content (%) | CACD, (6,3,2) convolutional code | CACD, (5,2) block code |
|---|---|---|---|---|
| NC_006070 | | 49 | 2.2267 | |
| NW_045720 | | 45 | 2.2350 | 1.8954 |
| NC_008403 | | 44 | | 1.8945 |
| NT_011512 | | 39 | 2.1900 | 1.8594 |
| NC_001147 | | 38 | 2.1537 | |
| NC_001148 | | 38 | | 1.8347 |
| NC_003075 | | 36 | 2.2015 | 1.8409 |
| NC_004353 | | 36 | 2.1966 | 1.8501 |
| NC_003421 | | 36 | 2.1814 | 1.8328 |
| | | | max − min = 0.1273 | max − min = 0.0879 |
These sequences are complete sequences, with the exception of NW_045720, commented as whole genome shotgun sequence, and NT_011512, commented as reference assembly complete sequence;
CACD of the (6,3,2) convolutional code near initiation site;
CACD of May et al.’s (5, 2) block code near initiation site.
Figure 5. Designed (6,3,2) convolutional encoder.
Operations for addition (left, +) and multiplication (right, ×).
| + | 0 | 1 | 2 | 3 | × | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 3 | 2 | 1 | 0 | 1 | 2 | 3 |
| 2 | 2 | 3 | 0 | 1 | 2 | 0 | 2 | 3 | 1 |
| 3 | 3 | 2 | 1 | 0 | 3 | 0 | 3 | 1 | 2 |
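
The addition and multiplication tables above coincide with arithmetic in the four-element finite field GF(4): addition is a bitwise XOR of the two-bit labels, and the nonzero elements 1, 2, 3 form a cyclic group of order 3 under multiplication. The following is a minimal Python sketch that regenerates both tables. The function and table names (`gf4_add`, `gf4_mul`, `ADD_TABLE`, `MUL_TABLE`) are illustrative assumptions and do not come from the paper. The sketch covers only the symbol arithmetic, not the (6,3,2) encoder itself, whose generator matrix is not given here.

```python
# Arithmetic on the symbols 0, 1, 2, 3, matching the tables above.
# All names below are hypothetical helpers, not taken from the paper.

def gf4_add(a: int, b: int) -> int:
    """Addition: bitwise XOR of the two-bit labels."""
    return a ^ b

# Nonzero symbols written as powers of the generator 2:
# 1 = 2^0, 2 = 2^1, 3 = 2^2, and 2^3 wraps back to 1.
_LOG = {1: 0, 2: 1, 3: 2}
_EXP = [1, 2, 3]

def gf4_mul(a: int, b: int) -> int:
    """Multiplication: add exponents modulo 3; anything times 0 is 0."""
    if a == 0 or b == 0:
        return 0
    return _EXP[(_LOG[a] + _LOG[b]) % 3]

# Full 4x4 operation tables, indexed as TABLE[a][b].
ADD_TABLE = [[gf4_add(a, b) for b in range(4)] for a in range(4)]
MUL_TABLE = [[gf4_mul(a, b) for b in range(4)] for a in range(4)]
```

Indexing `ADD_TABLE[a][b]` or `MUL_TABLE[a][b]` reproduces the corresponding cell of the printed tables, which gives a quick way to check any hand calculation over these four symbols.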