Hao Tong1, Paul Schliekelman1, Jan Mrázek2.
Abstract
BACKGROUND: DNA sequences contain repetitive motifs that serve various functions in the physiology of the organism. A number of methods have been developed for the discovery of such sequence motifs, with a primary focus on detecting regulatory motifs, particularly transcription factor binding sites. Most motif-finding methods apply probabilistic models to detect motifs characterized by an unusually high number of copies in the analyzed sequences.
Keywords: Archaea; Bacteria; DNA sequence repeats; Genome; Motif-finding; Sequence motifs
Year: 2017 PMID: 28056763 PMCID: PMC5217627 DOI: 10.1186/s12864-016-3400-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Comparison between approximate and exact p-value cutoffs. n is the total number of occurrences of a motif pair; fcut is the cutoff value corresponding to the selected p-value threshold. Solid and dashed lines represent approximate values calculated by formula (2) and exact values computed with SymPy, respectively
Summary statistics of 3326 motif pairs identified in 569 genomes
| | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|---|
| d | 5 | 19 | 43 | 39.9 | 53 | 89 |
| Initial_n | 6 | 41 | 62 | 80.6 | 100 | 644 |
| Reduced_n | 6 | 27 | 44 | 57.8 | 73 | 633 |
| Difference | 0 | 2 | 9 | 22.7 | 26 | 335 |
| Cutoff | 6 | 18 | 31 | 47.8 | 64 | 521 |
| Reduced_n/Cutoff | 1 | 1.06 | 1.16 | 1.33 | 1.42 | 4.93 |
| Gene | 0% | 20% | 81% | 61% | 96% | 100% |
| Intergenic | 0% | 2% | 12% | 35% | 73% | 100% |
| Overlap | 0% | 0% | 2% | 4% | 4% | 100% |
The summary statistics are per motif pair. Abbreviations in the table: d, spacer length of a motif pair; Initial_n, number of copies of the motif pair before alignment; Reduced_n, number of copies after alignment and elimination of duplicate spacers; Difference, the difference between Initial_n and Reduced_n; Cutoff, significance cutoff (the lowest number of copies for the motif pair to be considered significant); Reduced_n/Cutoff, the ratio of Reduced_n to Cutoff, indicating the relative significance of the motif pair; Gene, the percentage of occurrences of each motif pair found in genes; Intergenic, the percentage of occurrences of each motif pair found in intergenic regions; Overlap, the percentage of occurrences of each motif pair that overlap a gene start or end. These percentages are calculated as follows: for each significant motif, we run a query with Pattern Locator, which gives the percentage of the motif occurrences that fall in genes, fall in intergenic regions, or overlap gene starts or ends. The quantiles in the table are for these percentages over all significant motifs
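The Gene, Intergenic and Overlap percentages described above can be illustrated with a minimal sketch. This is not the Pattern Locator implementation; the function names, the half-open coordinate convention, and the rule that an occurrence lying entirely inside any gene counts as "gene" are assumptions made for illustration only.

```python
from collections import Counter

def classify_occurrence(start, end, genes):
    """Classify a half-open occurrence interval [start, end).

    Assumed rules (hypothetical, for illustration):
      - "gene": the occurrence lies entirely inside at least one gene
      - "overlap": it intersects a gene but crosses that gene's start or end
      - "intergenic": it intersects no gene
    """
    touches_gene = False
    for g_start, g_end in genes:
        if g_start <= start and end <= g_end:
            return "gene"
        if start < g_end and g_start < end:
            touches_gene = True
    return "overlap" if touches_gene else "intergenic"

def region_percentages(occurrences, genes):
    """Return the percentage of occurrences falling in each region class."""
    regions = ("gene", "intergenic", "overlap")
    if not occurrences:
        return {region: 0.0 for region in regions}
    counts = Counter(classify_occurrence(s, e, genes) for s, e in occurrences)
    return {region: 100.0 * counts[region] / len(occurrences) for region in regions}

# Toy coordinates, not taken from any genome in the study.
genes = [(100, 200), (300, 400)]
occurrences = [(110, 125), (150, 165), (195, 210), (250, 265)]
percentages = region_percentages(occurrences, genes)
```

Under these assumptions the three percentages for a motif pair always sum to 100, which matches how the Gene, Intergenic and Overlap columns partition the occurrences.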
Fig. 2 Distribution of key motif statistics. Frequency is the number of copies of each significant motif pair; Percentage is the percentage of significant motif pairs found in the corresponding region
Fig. 3 Distribution of motif pairs with respect to gene and intergenic regions. In each plot, the horizontal axis indicates the percentage of occurrences of significant pairs found in the corresponding region
Fig. 4 Spacer length distribution of candidate pairs
Significant motif pairs related to CRISPR repeats
| Genome | Pattern |
|---|---|
| Anaeromyxobacter dehalogenans 2CP 1 | GGGGA(N)43TCCCC |
| Candidatus Desulforudis audaxviator MP104C | AAACG(N)35GTTTC |
| Clostridium thermocellum ATCC 27405 | TACGA(N)52CCTCA |
| Corynebacterium aurimucosum ATCC 700975 | GGGGA(N)43TCCCC |
| Corynebacterium jeikeium K411 | GGGGA(N)43TCCCC |
| Corynebacterium urealyticum DSM 7109 | GGGGA(N)43TCCCC |
Pattern X1X2X3X4X5(N)DY1Y2Y3Y4Y5 denotes the 5-mer X1X2X3X4X5 followed by the 5-mer Y1Y2Y3Y4Y5, with a spacer of D nucleotides between them; for non-palindromic patterns, only the sequence on one DNA strand is listed
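The pattern notation above maps directly onto a regular expression. The following is a minimal sketch, assuming an uppercase sequence over the alphabet A, C, G, T, N; the function names are hypothetical and this is not the search procedure used in the study. It also shows one way to check whether a pair is palindromic, that is, whether the second 5-mer is the reverse complement of the first.

```python
import re

def find_motif_pair(sequence, left, right, spacer):
    """Return 0-based start positions of left(N)spacer right in sequence.

    A zero-width lookahead is used so that overlapping occurrences
    are all reported.
    """
    pattern = re.compile(
        f"(?={re.escape(left)}[ACGTN]{{{spacer}}}{re.escape(right)})"
    )
    return [m.start() for m in pattern.finditer(sequence)]

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def is_palindromic_pair(left, right):
    """True if the pair reads the same on both strands."""
    return reverse_complement(left) == right

# Toy sequence containing one copy of GGGGA(N)43TCCCC at position 2.
toy = "TT" + "GGGGA" + "ACGT" * 10 + "ACG" + "TCCCC" + "AA"
hits = find_motif_pair(toy, "GGGGA", "TCCCC", 43)
```

For a non-palindromic pair such as CAAAA(N)13TGACC, occurrences on the opposite strand would have to be searched separately, for example by scanning for the reverse-complemented pattern.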
Motif pairs that are part of transcription terminators
| Genome | Pattern |
|---|---|
| Actinobacillus pleuropneumoniae serovar 5b L20 | CAAAA(N)13TGACC |
| Haemophilus influenzae Rd KW20 | GGTCA(N)10TGTTT |
tRNA-related motifs identified
| Genome | Pattern | tRNA related |
|---|---|---|
| | AGAGC(N)27GTTCG | Inside tRNA |
| Candidatus | TAGAG(N)27GGTTC | Inside tRNA |
| | TAGAG(N)28GTTCG | Inside tRNA |
| | CGTTA(N)9TAACG | tRNA downstream |
Fig. 5 Sequence logo for motif pair downstream of tRNA genes in M. kandleri
Genomes with Shine-Dalgarno sequence identified
| Genome | Pattern |
|---|---|
| Campylobacter concisus 13826 | AAGGA(N)6 |
| Campylobacter curvus 525 92 | AAGGA(N)6 |
| Exiguobacterium sibiricum 255 15 | GGAGG(N)6C |
| Helicobacter hepaticus ATCC 51449 | AAGGA(N)6 |
| Lactobacillus plantarum JDM1 | AGGAG(N)9 |
| Thermosipho africanus TCF52B | GGAGG(N)6T |
| Thermosipho melanesiensis BI429 | GGAGG(N)6T |
Start codon is shown in bold
Genomes with significant protein-related motif pairs
| Genome | Pattern | Protein related |
|---|---|---|
| | | Inside YD repeat-containing protein |
| | AC | Inside cytochrome c |
| | A(G/C) | Inside cytochrome c |
| | T | Inside cell wall binding repeat-containing protein |
| | | Inside ABC transporter ATPase |
| | | Inside ATP-binding ABC transporter protein |
| | | Inside ABC transporter ATP-binding protein |
| | | Inside ABC transporter ATP-binding protein |
| | G | Inside PE-PGRS family protein |
| | | Inside PE-PGRS family protein |
| | | Inside PPE family protein |
| | G | Inside PE-PGRS family protein |
| | | Inside extra-cytoplasmic solute receptor |
Complete codons are highlighted in bold face
Selected potential novel motifs
| Genome | Motif pair (a) | Gene | Intergenic | Overlap |
|---|---|---|---|---|
| | TTAAT(N)5ATTAA | 0% | 100% | 0% |
| | GGGACAG(N)15TGTCCC | 0% | 95% | 5% |
| | CCTAC(N)19TAGGT | 3% | 97% | 0% |
| | TTGAC(N)19ATAAT | 16% | 83% | 1% |
| | GCTTAT(N)5AAGCG | 4% | 94% | 2% |
| | GAATCCAT(N)23ATGGATTC | 0% | 90% | 10% |
| | ATAGCT(N)22CAAAAG | 14% | 81% | 5% |
| | ATTATA(N)18GTCAA | 5% | 86% | 9% |
(a) Significant motif pairs composed of overlapping pentamers were combined and reported as a single motif pair