Jonathan Göke, Marcel H Schulz, Julia Lasserre, Martin Vingron.
Abstract
MOTIVATION: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets.
Year: 2012 PMID: 22247280 PMCID: PMC3289921 DOI: 10.1093/bioinformatics/bts028
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Running time comparison. All pairwise scores were calculated for random sequences of length 1000 bp.
Running time of the different methods in O notation.
n: number of sequences; l: average sequence length; k: k-mer size; m: Markov model order. The running time for D2* is dominated by the quadratic term. The running time for N2 is dominated by the linear term (pre-processing).
Fig. 2. Influence of single sequences on pairwise scores. All pairwise scores for 500 sequences generated by the same model were calculated. C_i measures the number of sequence pairs involving sequence S_i among the highest 5% of all scores (high scoring pairs). Since all sequences were created using the same model, the distribution of C = {C_1, …, C_500} from alignment-free methods should be similar to the distribution of C obtained from a random scoring method (‘expected’, black line). A different distribution would indicate that the number of high scoring pairs depends strongly on the individual sequence, that is, that pairwise scores reflect single-sequence noise rather than the similarity of the sequence pair. (A) Uniform nucleotide distribution: all methods show the expected behaviour. (B) AT-rich nucleotide distribution: D2 and D2z differ from the expected behaviour, showing that these pairwise scores are strongly influenced by the sequence composition.
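The per-sequence count of high scoring pairs described in the Fig. 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `high_scoring_pair_counts`, the dictionary layout of the scores, and the toy random scoring method are assumptions introduced here.

```python
import random
from itertools import combinations


def high_scoring_pair_counts(scores, n, top_fraction=0.05):
    """Count, for each sequence, how many of its pairs fall in the top fraction.

    scores: dict mapping an index pair (i, j) to a pairwise score
            (hypothetical layout, assumed for this sketch).
    n:      number of sequences.
    """
    # Rank all pairs from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    counts = [0] * n
    # Each high scoring pair contributes once to both of its sequences.
    for (i, j), _ in ranked[:n_top]:
        counts[i] += 1
        counts[j] += 1
    return counts


# Toy baseline: a random scoring method, analogous to the 'expected' curve.
random.seed(0)
n = 50
random_scores = {pair: random.random() for pair in combinations(range(n), 2)}
counts = high_scoring_pair_counts(random_scores, n)
print(sorted(counts))
```

Comparing the distribution of `counts` from a real scoring method with the distribution from the random baseline mirrors the comparison drawn in the figure.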
Comparison of the different methods (k=6) when the genomic orientation of the motif is unknown
| Performance with implanted | | | | | | |
|---|---|---|---|---|---|---|
| | 5% Precision | | AUC ROC | | AUC PR | |
| Motif setting: | m1r8 | m4r2 | m1r8 | m4r2 | m1r8 | m4r2 |
| | 0.88 | 0.59 | 0.72 | 0.54 | 0.72 | 0.54 |
| | 0.91 | 0.64 | 0.74 | 0.56 | 0.73 | 0.56 |
| | 0.87 | 0.66 | 0.71 | 0.58 | 0.70 | 0.57 |
| | 0.86 | 0.65 | 0.71 | 0.58 | 0.70 | 0.57 |
Bold numbers indicate best performance.
Comparison of the different methods (k=6) when motifs are sampled from all k-mers with one mismatch to the word
| Performance with implanted | | | | | | |
|---|---|---|---|---|---|---|
| | 5% Precision | | AUC ROC | | AUC PR | |
| Motif setting: | m1r8 | m4r2 | m1r8 | m4r2 | m1r8 | m4r2 |
| | 0.59 | 0.51 | 0.53 | 0.48 | 0.53 | 0.49 |
| | 0.59 | 0.54 | 0.54 | 0.51 | 0.53 | 0.51 |
| | 0.60 | 0.54 | 0.54 | 0.51 | 0.54 | 0.51 |
| | 0.59 | 0.54 | 0.54 | 0.51 | 0.54 | 0.51 |
| | 0.60 | 0.54 | 0.55 | 0.51 | 0.54 | 0.51 |
Bold numbers indicate best performance.
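To make the "one mismatch" motif setting concrete, the sketch below enumerates every k-mer that differs from a word at exactly one position and samples one of them. It is an illustration under assumptions: the function name `one_mismatch_neighbourhood`, the example word, and the choice to exclude the original word from the set are not taken from the source.

```python
import random


def one_mismatch_neighbourhood(word, alphabet="ACGT"):
    """Return all words that differ from `word` at exactly one position."""
    neighbours = []
    for pos, original in enumerate(word):
        for base in alphabet:
            if base != original:
                # Substitute a single base and keep the rest of the word.
                neighbours.append(word[:pos] + base + word[pos + 1:])
    return neighbours


# Hypothetical example word with k = 6.
random.seed(1)
word = "ACGTAC"
neighbours = one_mismatch_neighbourhood(word)
implanted = random.choice(neighbours)  # one sampled motif instance
print(len(neighbours), implanted)
```

For a DNA word of length k there are 3k such neighbours, so exact k-mer matches between two sequences carrying independently sampled instances become rare, which is consistent with the lower scores in this table.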
Fig. 3. Precision–recall curve for enhancers active during mouse development. The plots show the precision averaged over 25 samples, each time drawing 500 enhancer sequences (positive) and 500 unrelated genomic sequences of the same length as the enhancers (negative). (A) Precision–recall curve for forebrain enhancers. (B) Precision–recall curve for limb enhancers.
Comparison of the different methods on tissue-specific enhancers
| Performance on tissue-specific enhancer sequences | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5% Precision | | | | AUC ROC | | | | AUC PR | | | |
| Tissue | F | M | L | H | F | M | L | H | F | M | L | H |
| | 0.61 | 0.64 | 0.55 | 0.50 | 0.55 | 0.55 | 0.50 | 0.45 | 0.54 | 0.55 | 0.51 | 0.47 |
| | 0.66 | 0.69 | 0.63 | 0.56 | 0.57 | 0.57 | 0.56 | 0.53 | 0.57 | 0.57 | 0.55 | 0.52 |
| | 0.71 | 0.70 | 0.67 | 0.60 | 0.62 | 0.62 | 0.59 | 0.55 | 0.60 | 0.60 | 0.58 | 0.54 |
| | 0.65 | 0.64 | 0.62 | 0.58 | 0.58 | 0.57 | 0.56 | 0.53 | 0.57 | 0.56 | 0.55 | 0.53 |
| | 0.71 | 0.67 | 0.68 | 0.60 | 0.61 | 0.59 | 0.58 | 0.55 | 0.60 | 0.58 | 0.58 | 0.55 |
Bold numbers indicate the best performance. Positive sequences were obtained by ChIP-Seq of p300 in forebrain (F), midbrain (M), limb (L) and heart (H) tissue of the mouse embryo. Negative sequences were randomly sampled from the mouse genome. All pairwise scores were computed with repeats masked, k=6, background Markov model of order 1. Results show average values over 25 samples each time drawing 500 sequences.
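The two simpler measures reported in these tables, precision among the top 5% of scores and the area under the ROC curve, can be sketched from a list of scored items. This is a generic illustration, not the evaluation code behind the tables: the `(score, label)` layout and both function names are assumptions, and how labels are assigned to sequence pairs is not stated in this excerpt.

```python
def precision_at_top(scored, fraction=0.05):
    """Fraction of positives among the highest-scoring `fraction` of items.

    scored: list of (score, label) tuples, label 1 for positive, 0 for negative
            (hypothetical layout, assumed for this sketch).
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return sum(label for _, label in ranked[:n_top]) / n_top


def auc_roc(scored):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count one half)."""
    pos = [s for s, label in scored if label]
    neg = [s for s, label in scored if not label]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the paper.
example = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.2, 0), (0.1, 0)]
print(precision_at_top(example, fraction=0.5), auc_roc(example))
```

The pairwise comparison in `auc_roc` is quadratic in the number of items; a rank-based computation would be preferable for large score sets.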
Fig. 4. Precision–recall curve for forebrain enhancers in the mouse. Enhancers active in different tissues were used as the background set.