Jens Keilwagen, Jan Baumbach, Thomas A Kohl, Ivo Grosse.
Abstract
Valuable binding-site annotation data are stored in databases. However, several types of errors can, and do, occur in the process of manually incorporating annotation data from the scientific literature into these databases. Here, we introduce MotifAdjuster (http://dig.ipk-gatersleben.de/MotifAdjuster.html), a tool that helps to detect these errors, and we demonstrate its efficacy on public data sets.
Year: 2009 PMID: 19409082 PMCID: PMC2718512 DOI: 10.1186/gb-2009-10-5-r46
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Annotation results
| Gene ID | Gene name | No. BS | BS length | No. removed BSs | No. shifted BSs | Percentage |
| | | 218 | 22 | 20 | 31 | 23.4% |
| | | 74 | 7 | 2 | 11 | 17.6% |
| | | 68 | 21 | 13 | 17 | 44.1% |
| | | 54 | 14 | 2 | 3 | 9.3% |
| | | 46 | 15 | 1 | 43 | 95.7% |
| | | 43 | 12 | 4 | 23 | 62.8% |
| | | 33 | 15 | 9 | 6 | 45.5% |
| Total | | 536 | | 51 | 134 | 34.5% |
Results of applying MotifAdjuster to all Escherichia coli data sets of CoryneRegNet 4.0 that contain at least 30 BSs of at most 25 bp length. Columns 1 and 2 show the gene ID and gene name of the TF; columns 3 and 4 show the number of BSs stored in the database and their length; columns 5 and 6 show the number of BSs proposed to be removed and to be shifted; and column 7 shows the percentage of BSs proposed to be removed or shifted. The percentage of proposed adjustments varies strongly from TF to TF, ranging from 9.3% for Fnr to 95.7% for Fur. Across the complete data set of 536 BSs, 51 BSs are proposed to be removed and 134 BSs are proposed to be shifted, so adjustments are proposed for 34.5% of the data set.
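The percentages in column 7 follow directly from columns 3, 5, and 6. The short Python sketch below recomputes them from the table values; the variable and function names are illustrative and are not part of MotifAdjuster.

```python
# Each row: (number of BSs, number removed, number shifted), copied from the table above.
rows = [
    (218, 20, 31),
    (74, 2, 11),
    (68, 13, 17),
    (54, 2, 3),
    (46, 1, 43),
    (43, 4, 23),
    (33, 9, 6),
]


def adjusted_percentage(n_sites, n_removed, n_shifted):
    """Percentage of BSs proposed to be removed or shifted."""
    return 100.0 * (n_removed + n_shifted) / n_sites


# Per-TF percentages, rounded to one decimal place as in the table.
percentages = [round(adjusted_percentage(*row), 1) for row in rows]

# Column totals and the overall percentage for the "Total" row.
total_sites = sum(row[0] for row in rows)
total_removed = sum(row[1] for row in rows)
total_shifted = sum(row[2] for row in rows)
total_percentage = round(
    adjusted_percentage(total_sites, total_removed, total_shifted), 1
)

print(percentages)
print(total_sites, total_removed, total_shifted, total_percentage)
```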
Figure 1. Comparison of binding-site conservation, showing the original sequence logos, the consensus sequences for the TFs obtained from the literature [56-61], and the adjusted sequence logos for the data sets of the TFs CpxR, Crp, Fis, Fnr, Fur, Lrp, and NarL. We find in all seven cases that (i) the adjusted sequence logos show a higher conservation than the original sequence logos; (ii) the adjusted sequence logos are more similar to the consensus sequences than to the original sequence logos; and (iii) clear motifs can be recognized in the adjusted sequence logos of the TFs CpxR, Fur, and NarL that could not be recognized in the original sequence logos.
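Conservation in a sequence logo is commonly measured as the information content of each alignment column. The sketch below is a rough illustration only: it uses the eleven shifted NarL sites listed in the NarL table of this record as a small sample (not the full data sets behind Figure 1), assumes a uniform background over A, C, G, and T, and applies no small-sample correction. The function name is hypothetical.

```python
from collections import Counter
from math import log2


def information_content(sites):
    """Sum over columns of (2 bits minus the Shannon entropy of the column)."""
    total = 0.0
    for column in range(len(sites[0])):
        counts = Counter(site[column] for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# Sites as stored in the database (column "BS" of the NarL table, shifted rows only).
original = [
    "AATAAAT", "ATAATGC", "ATATCAA", "CAACTCA", "CATTAAT", "GATCGAT",
    "GTAATTA", "TATCGGT", "TTACTCC", "TAGGAAT", "TGTGGTT",
]
# Sites after the proposed adjustment (column "Adj. BS" of the same rows).
adjusted = [
    "TATTTAT", "TAATGCT", "TATCAAT", "AACTCAT", "TATTAAT", "TATCGAT",
    "TAATTAT", "TACCGAT", "TACTCCG", "AATTCCT", "TAACCAC",
]

print(information_content(original))
print(information_content(adjusted))
```

On this sample the adjusted sites score higher than the original ones, which is the same direction of change that the figure reports for the complete data sets.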
NarL annotation results: Number of binding-site shifts and strand switches
| | No strand switch | Strand switch |
| No position shift | 36 | 25 |
| Position shift | 5 | 6 |
| Removed | 2 | |
Application of MotifAdjuster to the set of 74 NarL BSs results in adjustments proposed for 38 of these BSs. Two BSs are proposed to be removed from the data set. Of the remaining 36 adjusted BSs, 25 are proposed to have a wrong strand annotation but a correct position, and five are proposed to have a correct strand annotation but a wrong position. For six BSs, both strand annotation and position are proposed to be wrong. The other 36 BSs in the data set are left unchanged.
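The counts in this table can be cross-checked against the NarL row of the annotation results (74 BSs, 2 removed, 11 shifted). The following Python sketch does that arithmetic; the dictionary keys are illustrative labels chosen here, not terms used by MotifAdjuster.

```python
# Cell counts copied from the table above: (position status, strand status) -> count.
counts = {
    ("no shift", "no switch"): 36,
    ("no shift", "switch"): 25,
    ("shift", "no switch"): 5,
    ("shift", "switch"): 6,
    ("removed", None): 2,
}

total = sum(counts.values())
unchanged = counts[("no shift", "no switch")]
adjusted = total - unchanged  # every BS with any proposed adjustment
shifted = sum(n for (pos, _), n in counts.items() if pos == "shift")
strand_switched = sum(n for (_, strand), n in counts.items() if strand == "switch")

print(total, adjusted, shifted, strand_switched)
```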
NarL binding sites with questionable annotations
| Gene ID | Gene name | BS | Lit. | Occ. | Shift | Strand | Adj. BS |
| | | AATAAAT | [ | 1 | +1 | Reverse | TATTTAT |
| | | ATAATGC | [ | 1 | +1 | Forward | TAATGCT |
| | | ATATCAA | [ | 1 | +1 | Forward | TATCAAT |
| | | CAACTCA | [ | 1 | +1 | Forward | AACTCAT |
| | | CATTAAT | [ | 1 | +1 | Reverse | TATTAAT |
| | | GATCGAT | [ | 1 | +1 | Reverse | TATCGAT |
| | | GTAATTA | [ | 1 | +1 | Forward | TAATTAT |
| | | TATCGGT | [ | 1 | +1 | Reverse | TACCGAT |
| | | TTACTCC | [ | 1 | +1 | Forward | TACTCCG |
| | | CACTGTA | [ | 0 | - | - | - |
| | | TAGGAAT | [ | 1 | +1 | Reverse | AATTCCT |
| | | TGTGGTT | [ | 1 | +1 | Reverse | TAACCAC |
| | | ATGTTAT | [ | 0 | - | - | - |
Annotated NarL BSs for which MotifAdjuster proposes either to shift the BS or to remove it from the data set. Columns 1 to 3 contain the gene ID, the gene name, and the BS as stored in the database. Column 4 indicates the original literature for this BS. Columns 5 through 7 give the three possible adjustments suggested by MotifAdjuster: removal, shift, and strand orientation (relative to the target gene). In column 5, a value of 0 indicates that the BS is proposed for removal; in column 6, a positive (negative) value denotes a shift of the BS to the right (left). Column 8 provides the adjusted BS. The two BSs proposed for removal are not mentioned in the original literature, and in 10 of the 11 remaining cases the shifted BS is consistent with the BS published in the original literature. MotifAdjuster also proposes to switch the BS strand in six of the 11 cases.
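The relationship between the stored BS, the shift, the strand, and the adjusted BS can be illustrated with a small Python sketch. It is not the MotifAdjuster implementation. It assumes that the shift is applied along the stored sequence first and that a "Reverse" entry then means taking the reverse complement; the function names and the coordinate convention are hypothetical. The single flanking base in each example is not given in the table and is inferred here from the "Adj. BS" column.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def adjust_site(region, start, length, shift, switch_strand):
    """Shift a site within `region` and optionally switch its strand.

    `region` is the stored BS plus any flanking bases (an assumption of this
    sketch); `start` is the 0-based start of the stored BS within `region`.
    """
    new_start = start + shift
    if new_start < 0 or new_start + length > len(region):
        raise ValueError("shifted site falls outside the available region")
    site = region[new_start:new_start + length]
    return reverse_complement(site) if switch_strand else site


# Stored BS ATAATGC, shift +1, forward strand; flanking base T is inferred.
forward_example = adjust_site("ATAATGCT", 0, 7, 1, False)
# Stored BS TATCGGT, shift +1, reverse strand; flanking base A is inferred.
reverse_example = adjust_site("TATCGGTA", 0, 7, 1, True)

print(forward_example)  # TAATGCT
print(reverse_example)  # TACCGAT
```

Under these assumptions, both results match the corresponding "Adj. BS" entries in the table.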
Figure 2. Position of the predicted NarL binding site in the upstream region of torC. The NarL BS TACCCT is located on the forward strand with respect to the target operon torCAD, starting at position -209 bp (red color). All positions are relative to the first nucleotide of the start codon of torC. (a) The fragment of the upstream region of the torCAD operon containing the NarL BS predicted by the PWM model trained on the adjusted data set. (b) Histogram of all positions of NarL BSs in the database. The red line indicates the position of the predicted BS.