Semyon K Kolmykov, Yury V Kondrakhin, Ivan S Yevshin, Ruslan N Sharipov, Anna S Ryabova, Fedor A Kolpakov.
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) is a widely used experimental technology for identifying functional protein-DNA interactions. Databases such as ENCODE, GTRD, ChIP-Atlas and ReMap now systematically collect and annotate large numbers of ChIP-Seq datasets. Comprehensive quality control of these datasets is indispensable for selecting the most reliable data for further analysis. In addition to existing quality control metrics, we have developed two novel metrics that make it possible to control false positives and false negatives in ChIP-Seq datasets. For this purpose, we adapted a well-known population size estimate to determine the unknown number of genuine transcription factor binding regions. The proposed metrics are determined by overlapping the distinct sets of binding sites obtained by processing one ChIP-Seq experiment with different peak callers. The metrics can also be useful for assessing the quality of datasets obtained by processing distinct ChIP-Seq experiments with a given peak caller. We have also shown that these metrics appear to be useful not only for dataset selection but also for comparing peak callers and identifying site motifs based on ChIP-Seq datasets. The algorithm for determining the false positive control metric (FPCM) and the false negative control metric (FNCM) for ChIP-Seq datasets was implemented as a plugin for the BioUML platform: https://ict.biouml.org/bioumlweb/chipseq_analysis.html.
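The abstract does not state which population size estimate was adapted, nor how FPCM and FNCM are derived from it. Purely as an illustration of the general idea, the sketch below applies the classical two-sample Lincoln-Petersen estimator, N ≈ n1·n2/m, and Chapman's bias-corrected variant to two peak lists produced by different peak callers. The function names, the overlap rule, and the toy coordinates are assumptions made for this example; this is not the authors' FPCM or FNCM computation.

```python
from typing import List, Tuple

# Hypothetical peak representation: (chromosome, start, end), half-open.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals lie on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_shared(first: List[Interval], second: List[Interval]) -> int:
    """Number of peaks in `first` that overlap at least one peak in `second`."""
    return sum(any(overlaps(a, b) for b in second) for a in first)


def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical estimate of total population size from two samples."""
    if m == 0:
        raise ValueError("no shared peaks: the estimate is undefined")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Toy peak sets from two hypothetical peak callers (invented coordinates).
caller_a = [("chr1", 100, 200), ("chr1", 500, 600),
            ("chr2", 50, 150), ("chr2", 900, 1000)]
caller_b = [("chr1", 150, 250), ("chr2", 100, 180), ("chr3", 10, 90)]

shared = count_shared(caller_a, caller_b)
estimate_lp = lincoln_petersen(len(caller_a), len(caller_b), shared)
estimate_chapman = chapman(len(caller_a), len(caller_b), shared)
print(shared, estimate_lp, round(estimate_chapman, 3))
```

The quadratic overlap scan keeps the example short; for real peak sets a sorted sweep or an interval tree would be the more practical choice.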
Year: 2019 PMID: 31465497 PMCID: PMC6715275 DOI: 10.1371/journal.pone.0221760
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The workflow of the algorithm for determination of FPCM and FNCMs.
FPCMs and FNCMs for several meta-sets of TFBRs.
| Meta-set | TF (TF-class) | FPCM | FNCM (GEM) | FNCM (MACS) | FNCM (PICS) | FNCM (SISSRs) |
|---|---|---|---|---|---|---|
| PEAKS035099 | CTCF (2.3.3.50.1) | 0.998 | 0.776 | 0.874 | 0.689 | 0.702 |
| PEAKS039626 | CTCF (2.3.3.50.1) | 24.901 | 0.881 | 0.929 | 0.767 | 0.899 |
| PEAKS033754 | CTCF (2.3.3.50.1) | 0.782 | 0.871 | 0.861 | 0.483 | 0.141 |
| PEAKS033837 | GATA3 (2.2.1.1.3) | 0.995 | 0.661 | 0.677 | 0.267 | 0.149 |
| PEAKS039665 | ESR1 (2.1.1.2.1) | 1.004 | 0.674 | 0.742 | 0.36 | 0.144 |
| PEAKS033184 | TAL1 (1.2.3.1.1) | 0.991 | 0.653 | 0.793 | 0.446 | 0.536 |
| PEAKS038038 | PR (2.1.1.1.3) | 48.883 | 0.827 | 0.792 | 0.625 | 0.868 |
| PEAKS038673 | SIX-1 (3.1.6.1.1) | 40.463 | 0.356 | 0.909 | 0.885 | 0.296 |
| PEAKS038812 | ZFP-28 (2.3.3.0.192) | 49.214 | 0.727 | 0.929 | 0.77 | 0.733 |
| PEAKS040149 | EHF (3.5.2.4.1) | 49.914 | 0.397 | 0.53 | 0.501 | 0.579 |
Accuracies of the classification models.
| Classification model type | Training subset | Test subset |
|---|---|---|
| Perceptron | 0.817 | 0.814 |
| Fisher’s discriminant model | 0.823 | 0.812 |
| Logistic regression | 0.869 | 0.861 |
| SVM | 0.918 | 0.905 |
Mean values of the quality control metrics FPCM and FNCMs calculated on 5078 human ChIP-Seq datasets available in the GTRD database.
| Quality metrics | All datasets | Datasets with input control | Datasets without input control | Wilcoxon test (Z-score) | p-value |
|---|---|---|---|---|---|
| FPCM (first version) | 18.251 | 19.918 | 11.876 | 16.482 | < 10⁻¹⁴ |
| FPCM (second version) | 5.655 | 3.923 | 8.562 | 17.026 | < 10⁻¹⁴ |
| FNCM (GEM) | 0.509 | 0.516 | 0.484 | 3.997 | 6.4 × 10⁻⁵ |
| FNCM (MACS) | 0.651 | 0.645 | 0.672 | 0.864 | 0.389 |
| FNCM (PICS) | 0.36 | 0.292 | 0.62 | 28.461 | < 10⁻¹⁴ |
| FNCM (SISSRs) | 0.454 | 0.398 | 0.668 | 24.753 | < 10⁻¹⁴ |
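The Z-scores and p-values in this table can be related with a short calculation. The sketch below assumes a two-sided test under the standard normal approximation; the source does not say which convention was used, so that is an assumption, as is the function name. Under that assumption the reported Z-scores reproduce the reported p-values to within rounding.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal Z-score.

    Uses P(|Z| > z) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z-scores taken from the table above.
p_gem = two_sided_p_from_z(3.997)    # FNCM (GEM), reported as 6.4e-5
p_macs = two_sided_p_from_z(0.864)   # FNCM (MACS), reported as 0.389
p_fpcm = two_sided_p_from_z(16.482)  # FPCM (first version), reported as < 1e-14

print(f"{p_gem:.2e} {p_macs:.3f} {p_fpcm:.1e}")
```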
Fig 2. Empirical densities of (a) FPCM and (b) FNCM obtained for peak caller PICS.
The most frequent arrangements of the peak callers.
| Priority | Type of datasets | Observed proportion | Ratio between observed and expected proportions, Ro/e |
|---|---|---|---|
| MACS > GEM > SISSRs > PICS | All datasets | 0.181 | 4.3 |
| | Datasets with input control | 0.195 | 4.6 |
| | Datasets without input control | 0.156 | 3.7 |
| {MACS and GEM} > {SISSRs and PICS} | All datasets | 0.469 | 5.6 |
| | Datasets with input control | 0.544 | 6.6 |
| | Datasets without input control | 0.338 | 4.1 |
| {SISSRs, MACS and PICS} > GEM | Datasets without input control | 0.635 | 2.5 |
Relationships between the proposed quality metrics and features of both types.
| Features type | Quality metric | Regression model | Correlation between observed and predicted quality metrics | Relevant feature | Individual correlation |
|---|---|---|---|---|---|
| Quality metrics introduced by | FNCM (GEM) | OLS | 0.472 | FRiP(GEM) | 0.302 |
| | FNCM (MACS) | OLS | 0.336 | NRF | 0.279 |
| | FNCM (PICS) | OLS | 0.415 | FRiP(PICS) | 0.392 |
| | FNCM (SISSRs) | OLS | 0.259 | FRiP(SISSRs) | 0.157 |
| | FPCM | OLS | 0.044 | - | - |
| Peak caller characteristics | FNCM (GEM) | OLS | 0.233 | Noise | -0.233 |
| | FNCM (MACS) | OLS | 0.371 | Tags number | -0.172 |
| | FNCM (PICS) | OLS | 0.031 | Score | -0.267 |
| | FNCM (SISSRs) | OLS | 0.353 | -lg(p-value) | 0.355 |
| | FPCM | OLS | 0.119 | - | - |
Fig 3. Relationship between FNCM(PICS) observed and predicted by the random forest regression model.
Fig 4. Quality metrics values for some low-quality ChIP-Seq data from GTRD.
Fig 5. ROC curves for (a) whole dataset PEAKS038038 and (b) for PEAKS038038 without orphans.
Values of area under ROC curve for datasets mentioned in Table 1.
| Dataset | Whole dataset (MATCH) | Whole dataset (HOCOMOCO) | Without orphans (MATCH) | Without orphans (HOCOMOCO) |
|---|---|---|---|---|
| PEAKS035099 | 0.880 | 0.888 | 0.887 | 0.896 |
| PEAKS039626 | 0.684 | 0.691 | 0.849 | 0.858 |
| PEAKS033754 | 0.780 | 0.794 | 0.783 | 0.795 |
| PEAKS033837 | 0.620 | 0.655 | 0.628 | 0.663 |
| PEAKS039665 | 0.778 | 0.817 | 0.786 | 0.825 |
| PEAKS033184 | 0.790 | 0.824 | 0.840 | 0.868 |
| PEAKS038038 | 0.565 | 0.633 | 0.808 | 0.843 |
| PEAKS038673 | 0.564 | 0.556 | 0.813 | 0.844 |
| PEAKS038812 | 0.603 | 0.640 | 0.776 | 0.796 |
| PEAKS040149 | 0.595 | 0.578 | 0.722 | 0.623 |
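The record does not describe how the ROC curves behind these values were constructed. As a generic reminder of what an area under the ROC curve measures, the sketch below computes it as the probability that a randomly chosen positive item scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the scores are invented for illustration and are unrelated to the MATCH or HOCOMOCO results in the table.

```python
from typing import Sequence


def roc_auc(positive_scores: Sequence[float],
            negative_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Equivalent to the Mann-Whitney U statistic
    divided by the number of pairs.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented motif scores: e.g. sequences inside peaks vs. background sequences.
in_peaks = [0.9, 0.8, 0.4]
background = [0.7, 0.4, 0.1]

auc = roc_auc(in_peaks, background)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to scores that do not separate the two groups at all, which gives some context for the values near 0.56 to 0.60 in the "whole dataset" columns.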
Values of area under ROC curve when peaks from distinct ChIP-Seq studies were merged.
| TF name | Dataset type | GEM (HOCOMOCO) | GEM (MATCH) | MACS (HOCOMOCO) | MACS (MATCH) | PICS (HOCOMOCO) | PICS (MATCH) | SISSRs (HOCOMOCO) | SISSRs (MATCH) |
|---|---|---|---|---|---|---|---|---|---|
| ATF-1 (1.1.7.1.2) | whole | 0.57 | 0.57 | 0.51 | 0.49 | 0.52 | 0.51 | 0.53 | 0.52 |
| | without orphans | 0.78 | 0.78 | 0.61 | 0.6 | 0.91 | 0.9 | 0.84 | 0.83 |
| SRF | whole | 0.64 | 0.61 | 0.57 | 0.56 | 0.57 | 0.55 | 0.57 | 0.56 |
| | without orphans | 0.77 | 0.74 | 0.65 | 0.63 | 0.81 | 0.8 | 0.82 | 0.81 |
| NF-E2 | whole | 0.7 | 0.69 | 0.67 | 0.66 | 0.59 | 0.58 | 0.61 | 0.6 |
| | without orphans | 0.78 | 0.76 | 0.76 | 0.75 | 0.74 | 0.73 | 0.89 | 0.88 |