Abstract
BACKGROUND: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches -- most of which are irrelevant to the researcher.
Year: 2005 PMID: 15869708 PMCID: PMC1131890 DOI: 10.1186/1471-2105-6-111
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Time study over 3.224 GBases using randomly obtained sequences
| Query Length (Bases) | Average Search Time (Seconds) | Average Number of Hits |
| 4 | 56.40 | 15,428,878 |
| 8 | 1.147 | 98,141 |
| 12 | 2.254 | 2308 |
| 16 | 2.235 | 121 |
| 32 | 2.451 | 1.55 |
| 64 | 2.517 | 1.15 |
| 128 | 2.648 | 1.07 |
| 256 | 2.956 | 1.03 |
| 512 | 3.598 | 1.01 |
| 1024 | 4.738 | 1.01 |
Cross-species search for PHO4P binding sites, CACGTG and CACGTT (with reverse complement)
| Total Hits In Genome | 30,056 | 1077 | 81 | 119 | 67,881 | 4027 |
| Hits With 'Phosphatase' Present In Annotation | 267 | 1 | 0 | 0 | 535 | 51 |
| Percentage Of Hits With 'Phosphatase' Where Hit Is In Promoter (3000 Base Pairs Upstream) | 53.9 | 100 | 0 | 0 | 28.8 | 52.9 |
Conversions between DNA sequences and the radix 10 number system
| SEQ | NUMBER | SEQ | NUMBER | SEQUENCE | NUMBER | SEQUENCE | NUMBER |
| A | 1 | (C)3 | 84 | AATGCT | 3,301 | GCTCACTG | 62,003 |
| C | 4 | (A)4 | 85 | AGTGTCA | 8,941 | GGGAGGTGAA | 388,991 |
| (A)2 | 5 | (C)4 | 340 | ATGGGGT | 12,281 | GGGCGGAATT | 679,999 |
| (C)2 | 20 | (A)5 | 341 | GTAGATAA | 23,003 | GTTTCCTGCG | 1,111,211 |
| (A)3 | 21 | (C)5 | 1,364 | GCGGCTGA | 32,003 | GCTAAAAGGC | 1,299,827 |
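The conversion in this table can be reproduced with a short sketch. The digit assignment (A = 1, T = 2, G = 3, C = 4) and the convention that the first base is the least significant digit are inferred from the table entries themselves; they are assumptions, not a statement of the authors' implementation. The function names `seq_to_number` and `number_to_seq` are hypothetical.

```python
# Digit values inferred from the table rows (an assumption, not stated in this excerpt).
DIGIT = {"A": 1, "T": 2, "G": 3, "C": 4}
BASE = {value: base for base, value in DIGIT.items()}


def seq_to_number(seq: str) -> int:
    """Map a DNA string to its radix-10 number; the first base is least significant."""
    return sum(DIGIT[base] * 4 ** i for i, base in enumerate(seq))


def number_to_seq(n: int) -> str:
    """Invert seq_to_number. Digits run 1..4, so a remainder of 0 means digit 4."""
    bases = []
    while n > 0:
        digit = n % 4 or 4
        bases.append(BASE[digit])
        n = (n - digit) // 4
    return "".join(bases)


print(seq_to_number("AATGCT"))  # 3301, matching the table
print(number_to_seq(62003))     # GCTCACTG, matching the table
```

Because no base maps to zero, sequences of different lengths never collide: (A)2 maps to 5 and (A)3 maps to 21, so a run of A's is not confused with a shorter one.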
Figure 1. Index file : CAATTACGAGCTCTGCCTACAATGAT. The format for and are discussed in the text. To demonstrate how different regions map to different genes, the first 13 bases map to the gene with PID = 1234 and the last 13 bases map to the gene with PID = 5678. We add leading zeroes to each location so that all numbers in are four bytes and we record this as numbersize in each line in . Keys in this example are made from two bases of sequence so there are 4^2 = 16 lines in ranging from m(AA) = 5 through m(CC) = 20. Key number m(GT) = 11 and number m(GG) = 15 are not present in the sequence. For clarity, each offset in is repeated in the correct position above the line in and each PID is underlined. Two arrows map two different lines from into by pointing to two bubbles that show the content of two hash bins.
Figure 2. Database search pseudo code. The length of the query sequence Q determines which block of code will execute. Lines 3 – 18 execute for |Q|
| Location | 2-Word |
| 08 | AG |
| 13 | TG |
| 22 | TG |
| 06 | CG |
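The locations listed above agree with the 0-based offsets of those two-base words in the Figure 1 sequence. A minimal sketch of such a word index follows. It assumes the digit values A = 1, T = 2, G = 3, C = 4 with the first base least significant (inferred from the conversion table, not confirmed by this excerpt). The names `key_number` and `build_index` are hypothetical, and the on-disk file layout described in the Figure 1 caption is not modeled.

```python
from collections import defaultdict

# Inferred digit values (an assumption, not stated in this excerpt).
DIGIT = {"A": 1, "T": 2, "G": 3, "C": 4}


def key_number(word: str) -> int:
    """Radix-10 key of a word; the first base is the least significant digit."""
    return sum(DIGIT[base] * 4 ** i for i, base in enumerate(word))


def build_index(seq: str, k: int = 2) -> dict:
    """Return {key number: [0-based offsets]} for every k-base word in seq."""
    bins = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        bins[key_number(seq[pos:pos + k])].append(pos)
    return dict(bins)


SEQ = "CAATTACGAGCTCTGCCTACAATGAT"  # sequence from the Figure 1 caption
index = build_index(SEQ)

for word in ("AG", "TG", "CG", "GT", "GG"):
    # Words absent from the sequence have an empty hash bin.
    print(word, key_number(word), index.get(key_number(word), []))
```

In this sketch each hash bin is simply a Python list of offsets; looking up a two-base query is a single dictionary access, and the bins for keys 11 (GT) and 15 (GG) are empty, as the Figure 1 caption notes.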
NCBI terms correlation to gene ontology (GO) terms
| | 14,349 Unique NCBI One Term Windows | 18,934 Unique NCBI Two Term Windows | 24,747 Unique NCBI Three Term Windows |
| GO Term Equality | 160 | 164 | 73 |
| GO Term Similarity | 160 | 164 | 73 |
| GO Phrase Similarity | 561,294 | 6840 | 818 |
Gene ontology (GO) term expansion with siblings
| | 14,349 Unique NCBI One Term Windows | 18,934 Unique NCBI Two Term Windows | 24,747 Unique NCBI Three Term Windows |
| GO Term Equality | 16,001 | 9096 | 4313 |
| GO Term Similarity | 16,001 | 9096 | 4313 |
| GO Phrase Similarity | 19,137,006 | 228,624 | 28,556 |