| Literature DB >> 29764360 |
Wanwen Zeng, Mengmeng Wu, Rui Jiang.
Abstract
BACKGROUND: Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important for deciphering gene regulation, cell differentiation and disease mechanisms. Currently, distinguishing true interactions from nearby non-interacting pairs remains a challenging task, since the power of traditional experimental methods is limited by low resolution or low throughput.
Keywords: Enhancer-promoter interactions; Natural language processing; Three-dimensional interactions; Unsupervised learning
Year: 2018 PMID: 29764360 PMCID: PMC5954283 DOI: 10.1186/s12864-018-4459-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Details of each cell line dataset. The enhancers (or promoters) column indicates the number of all known active enhancers (or promoters) for each cell line, which are used for unsupervised feature learning for enhancer (or promoter) sequences
| Dataset | Enhancers | Promoters | True EPIs | False EPIs |
|---|---|---|---|---|
| K562 | 82806 | 8196 | 1977 | 1975 |
| IMR90 | 108996 | 5253 | 1254 | 1250 |
| GM12878 | 100036 | 8453 | 2113 | 2110 |
| HUVEC | 65358 | 8180 | 1524 | 1520 |
| HeLa-S3 | 103460 | 7794 | 1740 | 1740 |
| NHEK | 144302 | 5254 | 1291 | 1280 |
| FANTOM | 43011 | 49620 | 61542 | 61542 |
Fig. 1 The two-stage workflow of EP2vec. Stage 1 of EP2vec is unsupervised feature extraction, which transforms enhancer sequences and promoter sequences in a cell line into sequence embedding features separately. Given the set of all known enhancers or promoters in a cell line, we first split each sequence into k-mer words with stride s = 1 and assign a unique ID to each sequence. Regarding the preprocessed sequences as sentences, we embed each sentence into a vector using Paragraph Vector. Concretely, we use the vectors of words in a context, together with the sentence vector, to predict the next word in the context using a softmax classifier. After training converges, we obtain embedding vectors for all words and sentences, where the sentence vectors are exactly the sequence embedding features that we need. Note that in the sentence ID, SEQUENCE is a placeholder for ENHANCER or PROMOTER, and is the total number of enhancers or promoters in a cell line. Stage 2 is supervised learning for predicting EPIs. Given a pair of sequences, namely an enhancer sequence and a promoter sequence, we represent the two sequences using the pre-trained vectors and then concatenate them to obtain the feature representation. Lastly, we train a Gradient Boosted Regression Trees classifier to predict whether this pair is a true EPI
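The Stage 1 preprocessing described above, splitting each DNA sequence into overlapping k-mer "words" with stride s = 1, can be sketched as follows (k = 6 is chosen here for illustration; the function name is hypothetical, not from the paper's code):

```python
def split_kmers(sequence, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words' for sentence-style embedding."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

# An 8-bp fragment yields three overlapping 6-mers with stride 1
print(split_kmers("ACGTACGT", k=6))  # ['ACGTAC', 'CGTACG', 'GTACGT']
```

Each resulting k-mer list, tagged with its sequence ID, would then be fed to a Paragraph Vector implementation (e.g. a Doc2Vec-style model) for unsupervised embedding.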
The mean values and standard deviations of F1 scores for EP2vec and three baseline methods in 10-fold cross-validation experiments. For the FANTOM dataset, we do not evaluate TargetFinder due to the lack of experimental features, and we do not evaluate SPEID since running 10-fold cross-validation of SPEID on so many samples is extremely time-consuming
| Dataset | EP2vec | TargetFinder | gkmSVM | SPEID |
|---|---|---|---|---|
| K562 | 0.882 (0.019) | 0.881 (0.014) | 0.821 (0.018) | 0.846 (0.024) |
| IMR90 | 0.872 (0.020) | 0.863 (0.017) | 0.749 (0.026) | 0.825 (0.032) |
| GM12878 | 0.867 (0.014) | 0.844 (0.010) | 0.779 (0.015) | 0.809 (0.018) |
| HUVEC | 0.875 (0.024) | 0.878 (0.022) | 0.731 (0.028) | 0.809 (0.023) |
| HeLa-S3 | 0.920 (0.013) | 0.913 (0.014) | 0.822 (0.021) | 0.888 (0.023) |
| NHEK | 0.933 (0.015) | 0.922 (0.018) | 0.800 (0.024) | 0.900 (0.019) |
| FANTOM | 0.841 (0.004) | / | 0.803 (0.017) | / |
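The F1 score reported in the table above is the harmonic mean of precision and recall; a minimal computation (the confusion counts below are illustrative, not from the paper):

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts: 90 true positives, 10 false positives, 20 false negatives
print(round(f1_score(tp=90, fp=10, fn=20), 3))  # 0.857
```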
Fig. 2 The F1 scores of different embedding dimensions. As the embedding dimension increases, the performance increases, and an embedding dimension of d = 100 is sufficient to obtain near-optimal performance on all these datasets
Fig. 3 The enriched motifs in HUVEC and K562. MYB_f1, IKZF1_f1, GFI1_f1 and SOX15_f1 are enriched in HUVEC. KLF6_si and TFE3_f1 are enriched in K562
Fig. 4 The F1 scores of combined features and each of the two single types of features in 10-fold cross-validation. The combination of both types of features generates even better performance, indicating that sequence embedding features and experimental features can be complementary to each other
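The feature combination evaluated in Fig. 4 amounts to concatenating the learned sequence embedding with the experimental feature vector before training the classifier; a minimal sketch (function and variable names are illustrative assumptions):

```python
def combine_features(embedding, experimental):
    """Concatenate a d-dimensional sequence embedding with experimental feature values,
    producing the joint representation fed to the supervised classifier."""
    return list(embedding) + list(experimental)

# A 3-dim embedding combined with 3 experimental signals -> 6-dim feature vector
pair_vec = combine_features([0.12, -0.53, 0.08], [1.0, 0.0, 3.2])
print(len(pair_vec))  # 6
```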