Valerie Storms, Marleen Claeys, Aminael Sanchez, Bart De Moor, Annemieke Verstuyf, Kathleen Marchal.
Abstract
BACKGROUND: Computational de novo discovery of transcription factor binding sites is still a challenging problem. The growing number of sequenced genomes allows integrating orthology evidence with coregulation information when searching for motifs. Moreover, the more advanced motif detection algorithms explicitly model the phylogenetic relatedness between the orthologous input sequences and thus should be well adapted towards using orthologous information. In this study, we evaluated the conditions under which complementing coregulation with orthologous information improves motif detection for the class of probabilistic motif detection algorithms with an explicit evolutionary model.
Year: 2010 PMID: 20140085 PMCID: PMC2815771 DOI: 10.1371/journal.pone.0008938
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of the test setup.
Panel A presents the three different information spaces in which motif detection was assessed: the coregulation, the combined coregulation-orthology and the orthologous space. The coregulation space consists of a set of non-coding sequences from a reference species (Spec1 = REF) that each contain at least one motif site for a common TF (indicated by Gene 1 to Gene N). For the combined space, we extend the coregulation space with orthologous sequences selected from different species (indicated by Spec 2 to Spec M). One reference gene together with its orthologs is referred to as an orthologous set (indicated by a blue frame). The combined space thus consists of multiple orthologous sets while the orthologous space consists of a single orthologous set. We assessed the specific contribution of each space to the success rate of motif detection by performing the tests summarized in panels B and C. First, we tested the effect of adding different types of orthologous information, as shown in Panel B. These tests involve changing the topology by which the orthologs are related (equal star, unequal star and non-star-like topology), changing the mutual distance between the orthologs (represented by elongating the branches of the tree) and using datasets with a different number of orthologs. Second, we tested the effect of altering the signal to noise ratio of the datasets on the accuracy of the results (Panel C) 1) by changing the degree of degeneracy of the motifs and 2) by omitting motif sites. We differentiate between leaving out motif sites in the coregulation direction and omitting them in the orthologous direction, as is illustrated for a dataset in the combined space.
Figure 2. Results for motif detection in the coregulation space.
Each dataset consists of ten coregulated genes from the reference species (proximity 0.80). Panel A displays the results for a synthetic dataset in which all sequences contain a site sampled from a high IC motif (A). Panel B shows the results for a dataset in which all sequences contain a site sampled from a low IC motif (B) and panel C shows the results for a dataset in which the motif site is missing in two out of ten sequences. The remaining sequences contain a motif site sampled from the high IC motif. Results were assessed by the performance measures D1 (the number of datasets, out of 100, that yielded an output) and D1*RR (the number of datasets with a correct output), and by the quality measures PPV (the percentage of true sites among the predicted motif sites, averaged over all correct outputs) and Sens (the percentage of the true sites recovered by the algorithm, averaged over all correct outputs).
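The two quality measures defined in this caption can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and representing a site as a `(sequence_id, start)` pair that must match exactly is an assumption, since the caption does not say how a predicted site is matched to a true site.

```python
def ppv_and_sens(predicted, true):
    """Return (PPV, Sens) in percent for a single output.

    predicted, true: collections of hashable site identifiers,
    e.g. (sequence_id, start) pairs. Exact matching is an assumption.
    """
    predicted, true = set(predicted), set(true)
    hits = len(predicted & true)  # predicted sites that are true sites
    ppv = 100.0 * hits / len(predicted) if predicted else 0.0
    sens = 100.0 * hits / len(true) if true else 0.0
    return ppv, sens


def average_over_correct(outputs):
    """Average PPV and Sens over a non-empty list of (predicted, true)
    pairs, one pair per correct output."""
    scores = [ppv_and_sens(p, t) for p, t in outputs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n
```

For instance, an output that predicts four sites of which two are among three true sites gives a PPV of 50% and a Sens of about 66.7%.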
Figure 3. Effect of adding orthologs with distinct phylogenetic distances on motif detection in the combined space.
Results are displayed on the retrieval of a low IC motif in a synthetic dataset. Panel (A) shows the results for the coregulation space that consists of ten coregulated reference genes. The remaining panels represent the results for the combined space that consists of the ten coregulated reference genes together with their orthologs, also referred to as ten orthologous sets. Each orthologous set consists of five prealigned sequences related through an equal star topology: the reference sequence with proximity 0.80 and four equally distant sequences with proximities of respectively 0.90 (B), 0.50 (C) and 0.20 (D). For the measures D1, D1*RR, PPV and Sens see Figure 2.
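To make the equal star topology concrete, the following Python sketch generates one orthologous set from a common ancestral sequence. It rests on an assumption not stated in the caption: the proximity is read as the per-base probability that a descendant keeps the ancestral base, with a changed base redrawn uniformly from the alphabet. The function names are hypothetical, and the sketch does not model motif sites, which a realistic generator would redraw from the motif model rather than uniformly.

```python
import random


def evolve(ancestor, proximity, rng, alphabet="ACGT"):
    """Keep each ancestral base with probability `proximity`;
    otherwise redraw it uniformly from `alphabet` (assumed model)."""
    return "".join(
        base if rng.random() < proximity else rng.choice(alphabet)
        for base in ancestor
    )


def orthologous_set(ancestor, proximities, seed=0):
    """Return one descendant per proximity, all evolved independently
    from the same ancestor (star topology)."""
    rng = random.Random(seed)
    return [evolve(ancestor, q, rng) for q in proximities]


# Reference sequence at proximity 0.80 plus four equally distant
# orthologs at proximity 0.50, mirroring the layout of panel (C).
sequences = orthologous_set("ACGTACGTACGTACGT", [0.80, 0.50, 0.50, 0.50, 0.50])
```

Under this reading, lowering the proximity of the four orthologs from 0.90 to 0.20 makes them progressively less similar to the ancestor and to each other.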
Figure 4. Effect of the number of added orthologs on motif detection in the combined space.
Results on the retrieval of both a high and a low IC motif are displayed for the real datasets: 1) results from the Gamma-proteobacterial datasets are indicated as black curves and 2) those of the Saccharomyces dataset are indicated as gray curves. Results for the high IC motif are indicated by circles and correspond to those obtained for LexA (bacterial dataset) or URS1H (yeast dataset), results for the low IC motif are indicated by stars and correspond to those obtained for TyrR (bacterial dataset) or RAP1 (yeast dataset). The panels represent the results of a dataset containing for each coregulated reference gene two (A), four (B) and six (for the bacterial datasets) or five (for the yeast datasets) prealigned orthologs (the reference gene included) (C). Panel (D) represents the results of a dataset containing for each coregulated reference gene six or five unaligned orthologs (the reference gene included). Results were assessed by the F-value defined as the harmonic mean of the spPPV (the percentage of true sites amongst the predicted motif sites for the reference species, averaged over all correct outputs) and the spSens (the percentage of the true sites found by the algorithm for the reference species, averaged over all correct outputs). The reference species are respectively E. coli (bacterial data) or S. cerevisiae (yeast data). The Y-axis represents the difference between the F-value obtained from searching motifs in the combined coregulation-orthology space and the F-value obtained from searching in the coregulation space only.
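The quantity plotted on the Y-axis can be written out as a short Python sketch. The function names are hypothetical, and the numbers in the usage note are made up for illustration rather than taken from the figure.

```python
def f_value(sp_ppv, sp_sens):
    """Harmonic mean of spPPV and spSens (both in percent)."""
    if sp_ppv + sp_sens == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2.0 * sp_ppv * sp_sens / (sp_ppv + sp_sens)


def f_gain(combined, coregulation):
    """F-value in the combined space minus F-value in the coregulation
    space; each argument is an (spPPV, spSens) pair."""
    return f_value(*combined) - f_value(*coregulation)
```

With an illustrative (spPPV, spSens) of (80, 60) in the combined space and (50, 50) in the coregulation space, `f_gain` returns roughly 18.6. A positive value means that adding orthologs improved detection for the reference species, and a negative value means it hurt.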
Figure 5. Effect of motif loss on motif detection in the combined space.
The results are displayed for a synthetic dataset containing sites sampled from a high IC motif. Each dataset consists of ten coregulated reference genes complemented with their orthologs, also referred to as ten orthologous sets. Each orthologous set consists of five prealigned sequences related through an unequal star topology: four closely related orthologs with proximities of respectively 0.80 (reference ortholog), 0.90, 0.85 and 0.75 and one distantly related ortholog with a proximity of 0.20. Panel (A) represents the results when a motif site is present in all sequences of the orthologous sets. Panels (B) and (C) display the results when motif loss occurs in all sequences derived from respectively a closely (q = 0.75) or a distantly (q = 0.20) related species. Panel (D) shows the results when motif loss occurs in two out of ten coregulated reference genes and in all their corresponding orthologs. For the measures D1, D1*RR, PPV and Sens see Figure 2.
Figure 6. Results for motif detection in the orthologous space.
Results are displayed for a synthetic dataset with motif sites sampled from a high IC (on top) and a low IC motif (below). Each dataset consists of only one reference gene and its orthologs, referred to as one orthologous set. Panels (A) and (B) represent the results when the orthologous set contains respectively five and ten prealigned orthologs related through an equal star topology with a proximity of 0.50. Panel (C) represents the results when the orthologous set contains five prealigned orthologs related through an equal star topology with a proximity of 0.90 and panel (D) represents the results when the orthologous set contains five prealigned orthologs related through an unequal star topology. Note that for most tests the PPV equaled the Sens, resulting in overlapping dots. For the measures D1, D1*RR, PPV and Sens see Figure 2.
Summary of user guidelines.
| PROBLEM | CONSTRUCTING DATASET | PREFERRED TOOLS | REMARKS |
|---|---|---|---|
| 1. COREGULATION SPACE | |||
| Maximizing the signal to noise ratio in the dataset (i.e. the enrichment of motif sites in the dataset) improves the success rate. | Only select sequences that are likely to contain the motif. Keep the input sequences as short as possible. Adding orthologs (see 2: combined space) improves the success rate at a low signal to noise ratio. | PS: the ensemble centroid solution guarantees a high success rate for datasets with low signal to noise ratios. MEME: easy to use with performances comparable to those of PG and PS. | Both PG and PS provide a statistical procedure to filter out non-significant motif sites, so overestimating the 'expected number of motif sites' affects the performance less than underestimating it. For MEME, misestimating the expected number of motif sites affects the motif quality. |
| 2. COMBINED SPACE | |||
| It is crucial to use a | Use a tree based on a neutral evolution rate or a protein tree with corrected distances to prevent underestimating the evolution rate. | Both PG and PS are sensitive to overestimating the evolutionary proximity of the orthologous intergenic regions. | The type of topology (star, tree like structure) does not affect the performance of the phylogenetic tools. |
| The |
|
| The |
|
| Avoid sequences for which one expects that the mode of regulation has changed (mostly the distantly related sequences). | PG performs better if the motif is omitted in the distant ortholog, PS if it is omitted in the close ortholog. MEME: not dependent on the type of ortholog. | |
| 3. ORTHOLOGOUS SPACE | |||
| The same issues as in 2 are valid regarding the phylogenetic tree and the characteristics of the orthologs. | The more orthologs are added, the better the results. | PG performs best when the orthologs are prealigned and slightly outperforms MEME. PS underperforms in the orthologous space. | Observing a PG output that only contains unaligned motif sites indicates that the input tree underestimates the true evolution rate. In that case, lower the proximities. |