Mohamed Elati, Rémy Nicolle, Ivan Junier, David Fernández, Rim Fekih, Julio Font, François Képès.
Abstract
Conventional approaches to predict transcriptional regulatory interactions usually rely on the definition of a shared motif sequence on the target genes of a transcription factor (TF). These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices, which may match large numbers of sites and produce an unreliable list of target genes. To improve the prediction of binding sites, we propose to additionally use the unrelated knowledge of the genome layout. Indeed, it has been shown that co-regulated genes tend to be either neighbors or periodically spaced along the whole chromosome. This study demonstrates that respective gene positioning carries significant information. This novel type of information is combined with traditional sequence information by a machine learning algorithm called PreCisIon. To optimize this combination, PreCisIon builds a strong gene target classifier by adaptively combining weak classifiers based on either local binding sequence or global gene position. This strategy generically paves the way to the optimized incorporation of any future advances in gene target prediction based on local sequence, genome layout or on novel criteria. With the current state of the art, PreCisIon consistently improves on methods based on sequence information only. This is shown by implementing a cross-validation analysis of the 20 major TFs from two phylogenetically remote model organisms. For Bacillus subtilis and Escherichia coli, respectively, PreCisIon achieves on average an area under the receiver operating characteristic curve of 70 and 60%, a sensitivity of 80 and 70% and a specificity of 60 and 56%. The newly predicted gene targets are demonstrated to be functionally consistent with previously known targets, as assessed by analysis of Gene Ontology enrichment or of the relevant literature and databases.
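The idea of adaptively combining weak classifiers drawn from two views can be illustrated with a generic AdaBoost loop over decision stumps. This is a minimal sketch under stated assumptions, not the PreCisIon implementation: the feature layout (one sequence score and one position score per gene), the toy data and all function names are hypothetical.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Weak classifier: +1 (target) or -1 (non-target) from a single feature."""
    return 1 if polarity * (row[feature] - threshold) >= 0 else -1

def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, y, rounds=3):
    """Generic AdaBoost: re-weight genes so later stumps focus on past mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, feature, threshold, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # vote weight of this stump
        model.append((alpha, feature, threshold, polarity))
        w = [wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def decision_score(model, row):
    """Weighted vote of all stumps; positive means 'predicted target'."""
    return sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in model)

# Hypothetical toy genes: (sequence score, position score); +1 = target, -1 = non-target.
X = [(0.9, 0.2), (0.2, 0.9), (0.8, 0.8), (0.1, 0.1), (0.3, 0.2), (0.2, 0.3)]
y = [1, 1, 1, -1, -1, -1]
model = adaboost(X, y, rounds=3)
for alpha, feature, threshold, polarity in model:
    view = "sequence" if feature == 0 else "position"
    print(f"{view:8s} threshold={threshold:.1f} polarity={polarity:+d} alpha={alpha:.3f}")
```

In this toy set neither view separates targets from non-targets on its own, but the weighted vote of stumps from both views does, which is the behavior the abstract attributes to the combined classifier.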
Year: 2012 PMID: 23241390 PMCID: PMC3561985 DOI: 10.1093/nar/gks1286
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Principle of the Solenoidal Coordinate Method (SCM). A set of gene positions (red dots along horizontal line, upper left corner) derives from a perfectly P-periodic pattern (blurred red dots). Some of the initially periodic genes are missing (false negative) or have different positions (noise), and random genes have been added (false positive). Although the resulting pattern looks aperiodic, the position of the genes in a solenoidal coordinate of period P (lower left) reveals some alignment properties. The algorithm provides a score that is built using a distance-based information content for the organization of the genes on the solenoid face view (lower right), rewarding exceptionally dense or void regions (22). This information content is computed for all periods, which leads to a spectrum. The peaks that are abnormally high in this spectrum then reveal the periodic tendencies. Note that high scores at the period equal to full chromosome length reflect chromosomal proximity, as in this case, the solenoid is composed of only one loop, which is the whole chromosome itself. For a given period, SCM allocates a positional score to each gene, equal to the co-logarithm of the likelihood for this gene to be periodically positioned in dense regions with respect to the other genes of the data set.
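The wrapping of gene positions onto a solenoid of period P, and the scan over candidate periods, can be sketched as follows. Note the substitution: the score below is the mean resultant length of the angular coordinates (a standard circular statistic), used here as a simple stand-in for the distance-based information content of SCM, which the caption does not specify in enough detail to reproduce. Positions, periods and function names are hypothetical.

```python
import math

def solenoid_phase(position, period):
    """Angular coordinate (radians) of a gene on the face view of the solenoid."""
    return 2 * math.pi * (position % period) / period

def resultant_length(positions, period):
    """Stand-in alignment score: close to 1 when phases cluster, close to 0 when spread."""
    c = sum(math.cos(solenoid_phase(p, period)) for p in positions)
    s = sum(math.sin(solenoid_phase(p, period)) for p in positions)
    return math.hypot(c, s) / len(positions)

def spectrum(positions, periods):
    """Score every candidate period; peaks suggest periodic tendencies."""
    return {period: resultant_length(positions, period) for period in periods}

# Hypothetical gene positions: roughly 1000-periodic with noise and missing members.
positions = [120, 1090, 2110, 3130, 5080, 7100]
scores = spectrum(positions, [770, 1000, 1300])
for period, score in scores.items():
    print(period, round(score, 3))
```

With these toy positions the score peaks at the period of 1000, even though two members of the periodic series are missing and the rest are displaced.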
Figure 2. ROC curves on a test set for Position classifier, Sequence classifier and PreCisIon. Panels a–c show the individual curves of three TFs: Lrp, SigD and CRP. Panel d shows the average of all the individual ROC curves obtained for each single TF, using the ‘threshold averaging’ method (32); the standard deviation bars indicate the variation around the average curve. The gray diagonal denotes the ROC curve of a random classifier.
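Threshold averaging, as generally described for ROC analysis, takes a common set of score thresholds, reads off each curve's (FPR, TPR) point at every threshold, and averages those points. The sketch below assumes that scores are comparable across TFs and uses hypothetical toy scores; it does not reproduce the paper's data or its standard deviation bars.

```python
def roc_point(scores, labels, threshold):
    """(FPR, TPR) when genes with score >= threshold are called targets."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

def threshold_average(curves, thresholds):
    """Average the per-TF ROC points at each shared threshold."""
    averaged = []
    for threshold in thresholds:
        points = [roc_point(scores, labels, threshold) for scores, labels in curves]
        mean_fpr = sum(p[0] for p in points) / len(points)
        mean_tpr = sum(p[1] for p in points) / len(points)
        averaged.append((mean_fpr, mean_tpr))
    return averaged

# Two hypothetical TFs, each with (scores, labels); label 1 = known target.
curves = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    ([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]),
]
average_curve = threshold_average(curves, [0.0, 0.5, 1.0])
print(average_curve)
```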
Prediction performance
| TF | NG | Sequence AUC | Sequence Sn | Sequence Sp | Position AUC | Position Sn | Position Sp | PreCisIon AUC | PreCisIon Sn | PreCisIon Sp |
|---|---|---|---|---|---|---|---|---|---|---|
| CRP | 293 | 0.54 | 0.83 | 0.17 | 0.53 | 0.47 | 0.59 | 0.70 | 0.74 | 0.64 |
| FNR | 132 | 0.56 | 0.86 | 0.12 | 0.50 | 0.49 | 0.50 | 0.68 | 0.53 | 0.51 |
| IHF | 107 | 0.55 | 0.83 | 0.21 | 0.48 | 0.38 | 0.61 | 0.56 | 0.75 | 0.43 |
| Fis | 86 | 0.60 | 0.85 | 0.08 | 0.53 | 0.21 | 0.76 | 0.59 | 0.73 | 0.52 |
| ArcA | 80 | 0.49 | 1.00 | 0.13 | 0.54 | 0.41 | 0.62 | 0.53 | 0.60 | 0.48 |
| H-NS | 75 | 0.59 | 0.84 | 0.11 | 0.48 | 0.16 | 0.75 | 0.58 | 0.68 | 0.43 |
| Fur | 45 | 0.66 | 0.71 | 0.15 | 0.54 | 0.35 | 0.80 | 0.64 | 0.86 | 0.43 |
| Lrp | 41 | 0.56 | 0.89 | 0.13 | 0.74 | 0.56 | 0.82 | 0.73 | 0.71 | 0.65 |
| CpxR | 36 | 0.56 | 0.69 | 0.23 | 0.46 | 0.05 | 0.85 | 0.56 | 0.55 | 0.59 |
| NarL | 36 | 0.55 | 0.58 | 0.33 | 0.57 | 0.14 | 0.87 | 0.65 | 0.58 | 0.62 |
| SigA | 277 | 0.55 | 1.00 | 0.12 | 0.47 | 0.57 | 0.40 | 0.51 | 0.52 | 0.60 |
| SigE | 65 | 0.60 | 0.98 | 0.07 | 0.54 | 0.57 | 0.65 | 0.50 | 0.78 | 0.38 |
| SigB | 63 | 0.65 | 0.98 | 0.14 | 0.51 | 0.27 | 0.72 | 0.68 | 0.79 | 0.30 |
| SigG | 52 | 0.65 | 0.98 | 0.07 | 0.51 | 0.27 | 0.72 | 0.60 | 0.62 | 0.49 |
| SigK | 46 | 0.69 | 0.81 | 0.19 | 0.45 | 0.26 | 0.77 | 0.71 | 0.71 | 0.51 |
| CcpA | 38 | 0.72 | 0.92 | 0.30 | 0.50 | 0.16 | 0.81 | 0.68 | 0.92 | 0.52 |
| LexA | 33 | 0.79 | 0.97 | 0.44 | 0.54 | 0.18 | 0.88 | 0.82 | 0.81 | 0.76 |
| AbrB | 33 | 0.64 | 0.78 | 0.49 | 0.52 | 0.27 | 0.80 | 0.68 | 0.51 | 0.70 |
| SigW | 32 | 0.64 | 0.80 | 0.32 | 0.51 | 0.20 | 0.79 | 0.75 | 0.76 | 0.66 |
| SigD | 29 | 0.76 | 0.74 | 0.38 | 0.51 | 0.12 | 0.84 | 0.90 | 0.73 | 0.70 |
Area under the ROC curve (AUC), sensitivity (Sn) and specificity (Sp) for all the tested TFs from E. coli and B. subtilis, measured on the test set (3-fold cross-validation). NG: number of target genes.
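For reference, the three metrics reported in the table can be computed from per-gene scores and labels as sketched below. AUC is computed through its rank interpretation (the probability that a random target outscores a random non-target, ties counting one half). The scores, labels and decision threshold are hypothetical; the paper's operating points are not stated in this section.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of targets and non-targets."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(scores, labels, threshold):
    """Sn = TP / (TP + FN), Sp = TN / (TN + FP) at a fixed score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical classifier output for six genes; label 1 = known target.
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc(scores, labels), 3))
print(sensitivity_specificity(scores, labels, threshold=0.5))
```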
Figure 3. Methods combining Sequence and Position classifiers. (a) ROC analysis comparing three classifier fusion algorithms: PreCisIon (classifier fusion using boosting), linear combination based on average, and stacked generalization based on Naive Bayes learning; (b) ROC analysis of the boosted contributions of the individual Sequence or Position view, and of their combination into PreCisIon.
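Of the three fusion schemes compared in panel (a), the averaging baseline is the simplest to illustrate. The sketch below assumes that each view's scores are first rescaled to [0, 1] by min-max normalization before being averaged; that normalization step is an assumption made here so that the two views are on a common scale, and the caption does not say how the paper's baseline scaled its inputs.

```python
def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to 0.5 throughout."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def average_fusion(sequence_scores, position_scores):
    """Equal-weight linear combination of the two normalized views."""
    seq = min_max(sequence_scores)
    pos = min_max(position_scores)
    return [(a + b) / 2 for a, b in zip(seq, pos)]

# Hypothetical raw scores for three genes from each view.
fused = average_fusion([2.0, 4.0, 6.0], [10.0, 30.0, 20.0])
print(fused)
```

Unlike boosting, this baseline gives both views a fixed, equal weight for every gene, which is one plausible reason for comparing it against an adaptive combination.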
Functional validation of predicted gene targets against GO
| TF | Top GO category | p (PTs) | p (KTs) |
|---|---|---|---|
| CRP | carbohydrate catabolic process | 3 × 10⁻⁹ | 6 × 10⁻⁷ |
| FNR | metal ion transport | 4 × 10⁻³ | 9 × 10⁻¹ |
| IHF | nitrogen biosynthetic process | 2 × 10⁻¹¹ | 6 × 10⁻⁹ |
| Fis | carbohydrate catabolic process | 6 × 10⁻¹¹ | 5 × 10⁻² |
| ArcA | cellular biosynthetic process | 2 × 10⁻⁹ | 1 × 10⁻⁸ |
This table shows the most significant GO categories for newly predicted gene targets for E. coli TFs obtained by applying PreCisIon to the most curated known targets (KTs) from RegulonDB. The table compares the enrichment P-value (p) of this category for the newly predicted targets (PTs) and known targets. The reported uncorrected P-values are based on the ‘EASE Score’ (35), a modified Fisher test, for gene-enrichment analysis.
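The EASE Score is commonly described as a conservative variant of the one-sided Fisher exact test in which one gene is removed from the list hits before the P-value is computed. The sketch below follows that description using only the standard library; the function names and the example counts are illustrative and are not taken from the paper.

```python
from math import comb

def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact P-value P(X >= a) for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(col1, x) * comb(total - col1, row1 - x)
                    for x in range(a, upper + 1))
    return numerator / comb(total, row1)

def ease_score(list_hits, list_total, background_hits, background_total):
    """Modified Fisher test: penalize the list by removing one hit."""
    if list_hits < 1:
        return 1.0
    return fisher_upper_tail(
        list_hits - 1,                       # list genes in the category, minus one
        list_total - list_hits,              # list genes outside the category
        background_hits,                     # background genes in the category
        background_total - background_hits,  # background genes outside the category
    )

# Illustrative counts: 3 of 300 list genes vs 40 of 30000 background genes.
plain = fisher_upper_tail(3, 297, 40, 29960)
ease = ease_score(3, 300, 40, 30000)
print(f"Fisher: {plain:.4f}  EASE: {ease:.4f}")
```

Because one hit is removed, the EASE Score is always at least as large as the plain Fisher P-value, which penalizes categories supported by very few genes.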
Validation against ChIP-chip data
| TF | TG | P | Sequence AUC | Sequence PT | Sequence FP | PreCisIon AUC | PreCisIon PT | PreCisIon FP |
|---|---|---|---|---|---|---|---|---|
| CRP | 293 | 19 015 | 0.54 | 1862 | 1698 | 0.70 | 428 | 264 |
| FNR | 132 | 107 899 | 0.56 | 1068 | 937 | 0.68 | 162 | 82 |
| Lrp | 41 | 149 666 | 0.56 | 986 | 838 | 0.73 | 259 | 171 |
This table shows the number of documented target genes (TG) and the periods identified in the data (P) for the TFs CRP, FNR and Lrp. For both the Sequence classifier and PreCisIon, it also shows the area under the ROC curve (AUC), the number of predicted target genes (PT) and the number of false positives (FP), using ChIP-chip data as the reference.
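The PT and FP columns can be turned into a rough precision figure. The arithmetic below assumes that FP counts predicted targets not supported by the ChIP-chip reference, so that PT minus FP is the number of supported predictions; the counts are copied from the table above, and the helper name is hypothetical.

```python
# (predicted targets PT, false positives FP) per method, copied from the table above.
chip_chip = {
    "CRP": {"Sequence classifier": (1862, 1698), "PreCisIon": (428, 264)},
    "FNR": {"Sequence classifier": (1068, 937), "PreCisIon": (162, 82)},
    "Lrp": {"Sequence classifier": (986, 838), "PreCisIon": (259, 171)},
}

def supported_and_precision(pt, fp):
    """Supported predictions (PT - FP) and their share of all predictions."""
    supported = pt - fp
    return supported, supported / pt

for tf, methods in chip_chip.items():
    for method, (pt, fp) in methods.items():
        supported, precision = supported_and_precision(pt, fp)
        print(f"{tf:4s} {method:20s} supported={supported:4d} precision={precision:.3f}")
```

Under that assumption, both methods retain 164 supported predictions for CRP, but PreCisIon does so with 428 predictions instead of 1862, which raises precision from about 0.09 to about 0.38.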