Bjørn André Bredesen, Marc Rehmsmeier.
Abstract
BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. CREs include promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied at the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA that other researchers could readily use, and we tested SVM-MOCCA only with IUPAC motifs and PREs.
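
The hierarchical idea described in the abstract (score each motif occurrence from its local sequence composition, then combine the occurrence scores into a score for the whole element) can be illustrated with a minimal Python sketch. This is not the MOCCA implementation: the dinucleotide features, the window size, the forward-strand-only scan, the summation step, and all function names are assumptions made for illustration, and the per-occurrence scorer is a placeholder where a trained SVM would go.

```python
# Illustrative sketch only; names and feature choices are hypothetical.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_occurrences(seq, motif):
    """Return start positions where an IUPAC motif matches seq (forward strand only)."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        if all(seq[i + j] in IUPAC[m] for j, m in enumerate(motif)):
            hits.append(i)
    return hits

def local_composition(seq, center, half_window):
    """Dinucleotide frequencies in the window [center - half_window, center + half_window)."""
    seq = seq.upper()
    start = max(0, center - half_window)
    end = min(len(seq), center + half_window)
    window = seq[start:end]
    counts = {}
    for i in range(len(window) - 1):
        di = window[i:i + 2]
        counts[di] = counts.get(di, 0) + 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def element_score(seq, motif, occurrence_scorer, half_window=10):
    """Sum per-occurrence scores (assumed combination rule) over all motif matches."""
    return sum(
        occurrence_scorer(local_composition(seq, pos, half_window))
        for pos in find_occurrences(seq, motif)
    )
```

For example, `element_score(seq, "RAG", scorer)` would call `scorer` once per match of the hypothetical motif `RAG`, passing it the composition of the surrounding window.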
Keywords: Cis-regulatory element; Machine learning; Motif; Random forest; Support vector machine
Year: 2021 PMID: 33962556 PMCID: PMC8105988 DOI: 10.1186/s12859-021-04143-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Models implemented in MOCCA. * Requires optional integration with Shogun [17]
| Model type | Description |
|---|---|
| Unweighted sum | Sum of specified feature spaces |
| Log-odds | Log-odds model of specified feature spaces |
| General SVM | SVM model of specified feature spaces |
| General RF | RF model of specified feature spaces |
| General LDA * | LDA model of specified feature spaces |
| General Averaged Perceptron * | Averaged Perceptron model of specified feature spaces |
| CPREdictor | Re-implementation of the PREdictor |
| Dummy PREdictor | Unweighted version of the PREdictor |
| SVM-MOCCA | Modelling sequence landscapes around motif occurrences using SVMs |
| RF-MOCCA | Modelling sequence landscapes around motif occurrences using RFs |
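
To make the difference between the "Unweighted sum" and "Log-odds" rows concrete, the following Python sketch contrasts the two on simple motif counts. The table only states that these models operate on "specified feature spaces", so the specific feature (exact-match motif counts), the per-base normalisation, the pseudocount, and the function names are all assumptions rather than MOCCA's actual definitions.

```python
import math

def count_motif(seq, motif):
    """Count exact occurrences of motif in seq, allowing overlaps."""
    n = len(motif)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == motif)

def train_log_odds(positives, negatives, motifs, pseudocount=1.0):
    """Assumed weighting: log ratio of per-base motif rates in positives vs negatives."""
    pos_len = sum(len(s) for s in positives)
    neg_len = sum(len(s) for s in negatives)
    weights = {}
    for m in motifs:
        pos_rate = (sum(count_motif(s, m) for s in positives) + pseudocount) / pos_len
        neg_rate = (sum(count_motif(s, m) for s in negatives) + pseudocount) / neg_len
        weights[m] = math.log(pos_rate / neg_rate)
    return weights

def score_unweighted(seq, motifs):
    """Unweighted sum: every motif occurrence contributes 1."""
    return sum(count_motif(seq, m) for m in motifs)

def score_log_odds(seq, weights):
    """Log-odds score: every motif occurrence contributes its trained weight."""
    return sum(w * count_motif(seq, m) for m, w in weights.items())
```

Under these assumptions, a motif that is equally common in both classes receives a weight near zero and stops contributing to the log-odds score, whereas it still inflates the unweighted sum.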
Fig. 1. RF-MOCCA improves separation of both PREs and TAD boundaries from background. Cross-validation Precision/Recall curves with 110 positives and negatives for training, and 50 positives and 5000 negatives for testing, as in [5]. a Generalization to PREs [21] versus dummy genomic sequences. b Generalization to TAD boundaries [22] versus dummy genomic sequences
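
For readers unfamiliar with Precision/Recall curves, the sketch below shows one common way to compute the curve points from classifier scores and binary labels. It is a generic illustration, not the evaluation code used for the figure, and the function name is hypothetical. With 50 positives and 5000 negatives in the test set, a classifier that ranks sequences at random would have an expected precision of about 50/5050 ≈ 0.01, which is the baseline against which such curves are read.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs, lowering the threshold one distinct score at a time.

    labels are 1 for positives and 0 for negatives; ties in score are grouped.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = []
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example that shares the current score.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points
```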
Fig. 2. RF-MOCCA predicts more PcG-enriched candidate PREs and boundary element factor enriched candidate BEs. The numbers of candidate PREs and BEs predicted genome-wide by different models. Predictions in accessible chromatin are broken down into “strongly evidenced” (PRE predictions that overlap with PREs from [26] and BE predictions that overlap with TAD boundaries from [22]), “evidenced” (PRE predictions that overlap with modENCODE signals [27] of Pc, Psc or dSFMBT and BE predictions that overlap with modENCODE signals [27] of BEAF-32 or CP190) and “accessible” (the remainder)
Running times for genome-wide prediction in D. melanogaster, using the same training data as for the first cross-validation iteration, on an Intel Core i9-9900K CPU (3.6 GHz, 8 cores)
| Model type | Running time (hh:mm:ss) | Cores used |
|---|---|---|
| jPREdictor M2019 | 0:01:21 | 1 |
| CPREdictor M2019 (PREs) | 0:00:06 | 1 |
| SVM-MOCCA M2019 (PREs) | 8:20:05 | 1 |
| RF-MOCCA M2019 (PREs) | 14:01:48 | 8 |
| cdBEST | 9:42:23 | 1 |
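
Because the table mixes single-core and eight-core runs, it can help to put the times on a common scale. The Python sketch below converts the reported hh:mm:ss values to seconds and multiplies by the number of cores used. The product is only an upper bound on CPU time, since it assumes every core was fully busy for the whole run, which the table does not state; the function and variable names are illustrative.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# (running time, cores used), copied from the table above.
RUNS = {
    "jPREdictor M2019": ("0:01:21", 1),
    "CPREdictor M2019 (PREs)": ("0:00:06", 1),
    "SVM-MOCCA M2019 (PREs)": ("8:20:05", 1),
    "RF-MOCCA M2019 (PREs)": ("14:01:48", 8),
    "cdBEST": ("9:42:23", 1),
}

def summarize(runs):
    """Wall-clock seconds and an upper bound on core-seconds for each model."""
    return {
        name: {
            "wall_s": to_seconds(t),
            "core_s_upper_bound": to_seconds(t) * cores,
        }
        for name, (t, cores) in runs.items()
    }

if __name__ == "__main__":
    for name, row in summarize(RUNS).items():
        print(f"{name}: {row['wall_s']} s wall, <= {row['core_s_upper_bound']} core-s")
```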