Maja Malkowska1, Julian Zubek2, Dariusz Plewczynski2,3, Lucjan S Wyrwicz1.
Abstract
MOTIVATION: The identification of functional sequence variations in regulatory DNA regions is one of the major challenges of modern genetics. Here, we report results of a combined multifactor analysis of properties characterizing functional sequence variants located in promoter regions of genes.
Keywords: DNA sequence variation; DNA shape; Machine learning; Promoter; Single-nucleotide polymorphism; Variant prioritization
Year: 2018 PMID: 30519505 PMCID: PMC6275119 DOI: 10.7717/peerj.5742
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Mean importance of the five best-scoring features in each feature group.
Figure 2. Joint distributions of the two most important features in the two classes.
The WT-SNP difference is the difference in scores between the reference (wild-type) and mutated (SNP) variants.
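The WT-SNP difference described in this caption can be illustrated with a minimal sketch. The function name `wt_snp_difference`, the sign convention (wild type minus SNP), and the example scores are assumptions made for illustration; they are not taken from the paper.

```python
def wt_snp_difference(wt_score: float, snp_score: float) -> float:
    """Score of the reference (wild-type) variant minus the score of the SNP variant.

    The sign convention (WT minus SNP) is an assumption based on the feature's name.
    """
    return wt_score - snp_score


# Hypothetical per-variant scores for a single feature.
wt_scores = [5.0, 4.5, 3.0]
snp_scores = [4.0, 4.5, 3.5]

# One difference per variant; zero means the SNP leaves the score unchanged.
diffs = [wt_snp_difference(w, s) for w, s in zip(wt_scores, snp_scores)]
print(diffs)
```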
Figure 3. The strongest feature interdependencies.
Cross-validation classification results for different feature groups on the TSS-balanced data set.
| Feature group | AUC | AUC_std | Accuracy | Accuracy_std | F1 | F1_std | Precision | Precision_std | Recall | Recall_std | size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| All | 0.9764 | 0.0133 | 0.9258 | 0.0247 | 0.8803 | 0.0456 | 0.8840 | 0.0643 | 0.8792 | 0.0480 | 227.0 |
| Best 25 | 0.9243 | 0.0345 | 0.8449 | 0.0418 | 0.7551 | 0.0785 | 0.7456 | 0.1079 | 0.7713 | 0.0710 | 25.0 |
| Sequence | 0.5555 | 0.0473 | 0.6162 | 0.0584 | 0.3170 | 0.0416 | 0.3766 | 0.0878 | 0.2834 | 0.0453 | 52.0 |
| GC-content | 0.7765 | 0.0525 | 0.7051 | 0.0626 | 0.4934 | 0.0634 | 0.5560 | 0.1054 | 0.4546 | 0.0713 | 8.0 |
| Shape | 0.5571 | 0.0566 | 0.6251 | 0.0690 | 0.2546 | 0.0597 | 0.3574 | 0.0994 | 0.2039 | 0.0551 | 88.0 |
| Conservation | 0.5440 | 0.0416 | 0.6569 | 0.0522 | 0.2693 | 0.0764 | 0.4313 | 0.1547 | 0.2003 | 0.0545 | 10.0 |
| TFBS ChIP-seq | 0.5255 | 0.0482 | 0.6674 | 0.0755 | 0.2416 | 0.0707 | 0.4722 | 0.1589 | 0.1683 | 0.0550 | 12.0 |
| Histone modifications | 0.5664 | 0.0641 | 0.6270 | 0.0690 | 0.3342 | 0.0702 | 0.3987 | 0.1069 | 0.2994 | 0.0844 | 38.0 |
| DNase I | 0.5846 | 0.0622 | 0.6662 | 0.0817 | 0.1474 | 0.0674 | 0.4088 | 0.1921 | 0.0914 | 0.0431 | 1.0 |
| Dinucleotide content | 0.5205 | 0.0615 | 0.6211 | 0.0614 | 0.2354 | 0.0798 | 0.3407 | 0.1323 | 0.1858 | 0.0647 | 16.0 |
| Max TFBS log-odds ratio score + TF disruption pval | 0.5141 | 0.0613 | 0.6773 | 0.0824 | 0.0364 | 0.0381 | 0.3812 | 0.3618 | 0.0193 | 0.0205 | 2.0 |
| Sequence + GC-content | 0.7689 | 0.0404 | 0.6997 | 0.0465 | 0.5029 | 0.0578 | 0.5426 | 0.1159 | 0.4816 | 0.0477 | 60.0 |
| Shape + GC-content | 0.9175 | 0.0313 | 0.8395 | 0.0333 | 0.7399 | 0.0627 | 0.7557 | 0.1052 | 0.7332 | 0.0583 | 96.0 |
| Sequence + GC-content + Shape | 0.9787 | 0.0140 | 0.9446 | 0.0208 | 0.9124 | 0.0381 | 0.8894 | 0.0616 | 0.9400 | 0.0437 | 148.0 |
| Sequence + GC-content + Shape + TF disruption pval | 0.9787 | 0.0132 | 0.9471 | 0.0231 | 0.9161 | 0.0400 | 0.8899 | 0.0624 | 0.9468 | 0.0401 | 149.0 |
| Sequence + GC-content + Shape + TF disruption pval + Max TFBS log-odds ratio score | 0.9782 | 0.0139 | 0.9442 | 0.0189 | 0.9118 | 0.0318 | 0.8933 | 0.0595 | 0.9346 | 0.0374 | 150.0 |
| Sequence + GC-content + TFBS ChIP-seq | 0.7902 | 0.0332 | 0.7206 | 0.0410 | 0.5252 | 0.0614 | 0.5698 | 0.0934 | 0.4933 | 0.0616 | 72.0 |
| Sequence + GC-content + Histone modifications | 0.7981 | 0.0426 | 0.7249 | 0.0464 | 0.5359 | 0.0656 | 0.5882 | 0.1170 | 0.5054 | 0.0664 | 98.0 |
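
Each metric in the table is reported as a mean with a standard deviation across cross-validation folds. The following is a minimal sketch of how such a summary could be assembled from per-fold confusion counts. The helper names `fold_metrics` and `summarize`, the fold counts, and the use of the population standard deviation are assumptions; the source does not state which estimator was used, and AUC is omitted here because it requires per-sample scores rather than counts.

```python
from statistics import mean, pstdev


def fold_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, F1, precision and recall for one fold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"Accuracy": accuracy, "F1": f1, "Precision": precision, "Recall": recall}


def summarize(folds: list) -> dict:
    """Average each metric over folds and attach its standard deviation."""
    per_fold = [fold_metrics(*counts) for counts in folds]
    summary = {}
    for name in per_fold[0]:
        values = [m[name] for m in per_fold]
        summary[name] = mean(values)
        # Population standard deviation; an assumption, not stated in the source.
        summary[name + "_std"] = pstdev(values)
    return summary


# Hypothetical (tp, fp, fn, tn) counts for three folds.
folds = [(8, 2, 2, 18), (9, 1, 1, 19), (7, 3, 3, 17)]
summary = summarize(folds)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

Note that the F1 column computed this way is the mean of per-fold F1 values, which in general differs from the harmonic mean of the averaged precision and recall.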
Figure 4. Precision-recall curves for different classifiers.
Results are given for the hold-out test set (A) and an external validation set based on ClinVar data (B).
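
For readers unfamiliar with precision-recall curves, the sketch below shows one common way to derive the curve's points from classifier scores: rank samples by score and compute precision and recall at each distinct threshold. The function name `precision_recall_points` and the toy labels and scores are illustrative assumptions and do not reproduce the paper's data or tooling.

```python
def precision_recall_points(labels: list, scores: list) -> list:
    """Return (recall, precision) pairs, one per distinct score threshold.

    labels are 1 (functional variant) or 0 (non-functional); higher scores
    mean the classifier is more confident the sample is positive.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Hypothetical labels and classifier scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = precision_recall_points(labels, scores)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```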
Figure 5. Precision-recall curves for variants of ShapeGTB in which feature vectors from specific feature groups were permuted (effectively reducing their usefulness).
-GC corresponds to the classifier with GC-derived features permuted; -Shape corresponds to the classifier with shape-derived features permuted. Results are given for the hold-out test set (A) and an external validation set based on ClinVar data (B).
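
The permutation procedure in this caption can be sketched as follows: the values of one feature group are shuffled across samples while all other columns stay in place, which breaks the link between that group and the class label. This is one plausible reading of the caption, not the authors' code. The function name `permute_feature_group`, the column indices, and the toy data are assumptions.

```python
import random


def permute_feature_group(rows: list, group_columns: list, seed: int = 0) -> list:
    """Return a copy of rows with the given columns shuffled across samples.

    The group's columns are moved together, so each sample receives the
    whole group vector of another randomly chosen sample.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    permuted = [list(row) for row in rows]
    for target, source in enumerate(order):
        for col in group_columns:
            permuted[target][col] = rows[source][col]
    return permuted


# Hypothetical feature matrix: column 0 is kept, columns 1-2 form the permuted group.
rows = [
    [0.1, 10, 100],
    [0.2, 20, 200],
    [0.3, 30, 300],
    [0.4, 40, 400],
]
permuted = permute_feature_group(rows, group_columns=[1, 2])
for row in permuted:
    print(row)
```

Evaluating a trained classifier on such permuted data and comparing the result with the unpermuted baseline gives a rough estimate of how much the model relies on that feature group.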