Shayoni Dutta, Spandan Madan, Harsh Parikh, Durai Sundar.
Abstract
BACKGROUND: The ability to engineer zinc finger proteins that bind a DNA sequence of choice is essential for targeted genome editing. Experimental techniques and molecular docking have been successful in predicting protein-DNA interactions; however, they are highly time- and resource-intensive. Here, we present a novel algorithm designed for high-throughput prediction of optimal zinc finger proteins for 9 bp DNA sequences of choice. In accordance with the principles of information theory, a subset identified by K-means clustering was used as a representative of the space of all possible 9 bp DNA sequences. The modeling and simulation results obtained from this subset, assuming a synergistic mode of binding, were used to train an ensemble micro neural network. The synergistic mode of binding is the closest to the DNA-protein binding seen in nature and gives much higher quality predictions, but the time and resources required increase exponentially as a trade-off. Our algorithm is inspired by an ensemble machine learning approach and incorporates the predictions made by 100 parallel neural networks, each with a different hidden layer architecture designed to pick up different features from the training dataset, to predict optimal zinc finger proteins for any 9 bp target DNA.
Keywords: Domain adaptation; Neural network; Statistical sampling; Targeted genome editing; Zinc finger proteins
Year: 2016 PMID: 28155662 PMCID: PMC5260015 DOI: 10.1186/s12864-016-3323-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 A schematic representation of DNA-zinc finger protein interaction depicting the two possible modes of binding. a) The binding affinity of each finger is affected by the adjacent fingers due to co-operativity - synergistic mode of binding and b) Binding affinity of each finger with its respective 3 bp DNA sub-site is independent of each other - modular mode of binding
Fig. 2 The pipeline for our algorithm to predict optimal ZFPs for any 9 bp target DNA. K-means sampling was used to identify sample points that represent the whole sample space well. These DNA samples are docked with mutants of the Zif-268 protein to generate the training samples for our ensemble micro neural network model. Finally, the model is used for making predictions for user-queried 9 bp DNA targets
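The K-means sampling step of the pipeline can be illustrated with a minimal sketch. The record does not state how DNA sequences were encoded or which K-means implementation was used, so the one-hot encoding, the helper names (`one_hot`, `sq_dist`, `kmeans_representatives`), the cluster count, and the random candidate pool below are all assumptions made for illustration. The sketch runs plain Lloyd's k-means and returns the real sequence closest to each centroid as that cluster's representative.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat 0/1 vector (4 entries per base). Assumed encoding."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_representatives(seqs, k, iters=10, seed=0):
    """Run Lloyd's k-means and return the sequence nearest to each centroid."""
    rng = random.Random(seed)
    points = [one_hot(s) for s in seqs]
    # Initialise centroids from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Map each centroid back to an actual sequence; duplicates are merged.
    reps = set()
    for c in centroids:
        idx = min(range(len(points)), key=lambda i: sq_dist(points[i], c))
        reps.add(seqs[idx])
    return sorted(reps)

# Hypothetical candidate pool: 200 random 9 bp sequences (not the paper's data).
rng = random.Random(42)
candidates = ["".join(rng.choice(BASES) for _ in range(9)) for _ in range(200)]
representatives = kmeans_representatives(candidates, k=8)
print(representatives)
```

Returning the nearest real sequence, rather than the centroid itself, matters here because a centroid is an average of one-hot vectors and usually does not correspond to any valid DNA string that could be docked.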
DNA Sequences used for training and testing of micro neural network Model
| Training Sample Set | Testing Sample Set | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Orientation 5′ → 3′) | ||||||||||||||
| CGA | AAT | CGC | GCT | TAT | ACT | GCA | GCC | TTT | TTT | GCT | TCA | CAT | TTA | GTG |
| CAT | GTA | TGA | AGG | GCA | GCG | TAG | TCC | ATT | TTA | TTA | TGG | GGA | GGA | GGA |
| GTG | GCG | GGC | CCA | TAT | GCG | CTT | ACT | CTG | GGA | GCG | ATC | ACT | CAG | CTC |
| TAA | GCT | CAA | GTG | TAT | ATA | GCC | CAC | GAA | ACG | CAA | CAG | GGG | GGG | GGG |
| TGG | TGG | GGA | ACT | ACG | CTA | GAC | CCA | TAC | CGC | TTA | TTA | TGG | TGT | CCG |
| TCG | GCG | TGA | TAA | TGT | GGT | AGC | TAT | TTC | TCC | TCG | TGT | GTT | GTT | GTT |
| CAA | TCA | GAT | CCA | GAG | TCC | CGG | AGA | AGG | GTT | TCT | CTC | GCC | GCC | GCC |
| TGC | AAT | TGA | GTG | ATA | ATC | GCT | AGT | TAG | ACG | ATT | AGG | GCA | GCA | GCA |
| ACC | GAG | CTA | TTA | AGA | GAG | CGC | AGC | TAG | ATA | TTC | GAG | GAG | GAG | GAG |
| TGC | AGC | TAT | GAA | CGA | AGA | CCC | CAA | CTG | TTC | GGG | CAA | GGC | GGC | GGC |
Accuracy of micro neural network model for both the training and testing datasets (Sequence Identity and BLAST e-value scores)
| | Training Data | Testing Data |
|---|---|---|
| Median BLAST e-value score | 2.00E-21 | 7.00E-12 |
| Geometric Mean of BLAST e-value scores | 3.00E-21 | 1.70E-12 |
| Average Sequence Identity | 100% | 83% |
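The two summary statistics reported for BLAST e-values can be computed as in the sketch below. E-values span many orders of magnitude, so a geometric mean (an average taken in log space) is less dominated by the largest value than an arithmetic mean would be. The e-values in the example are hypothetical and are not taken from the paper; the function name `geometric_mean` is also just an illustrative choice.

```python
import math
import statistics

def geometric_mean(values):
    """Geometric mean computed in log space; all values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical e-values for illustration only (not the paper's data).
evalues = [1e-22, 2e-21, 5e-20]

median_evalue = statistics.median(evalues)
geomean_evalue = geometric_mean(evalues)
print(median_evalue, geomean_evalue)
```

An e-value reported as exactly 0 would need special handling (for example, substituting a small floor value) before taking logarithms; how the authors treated such cases is not stated here.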
Comparison of ZifNN predictions with other tools reported in literature. ZiFNN, ZiFiT [6] and Zinc Finger Tools [4] were compared with experimental data mined from literature (KD and helix prediction)* using Hamming distance as the metric
| DNA Target | References from Literature | Experimentally Found ZFP (F1 F2 F3) | Best prediction made by ZiFNN (F1 F2 F3) | Identity for ZiFNN | ZiFiT Prediction (F1 F2 F3) | Identity for ZiFiT | Zinc Finger Tools (F1 F2 F3) | Identity for Zinc Finger Tools |
|---|---|---|---|---|---|---|---|---|
| GTGGAGGAA | [ | QSGNLTRRSGHLTRRSGELTR | DSGHLTRDSGHLTRDSGHLTR | 0.76 | RNVNLVTRQDNLGRQASNLLR | 0.33 | RSDELVRRSDNLVRQSSNLVR | 0.47 | ||||||
| GCTGCTGCT | [ | RSGELTRTSGELTRRSGELTR | TSGELTETSGELTETSGELTE | 0.76 | LRASLRRQRSDLTRMKNTLTR | 0.38 | TSGELVRTSGELVRTSGELVR | 0.76 | ||||||
| GAGGAGGAT | [ | QSGNLTRRSGNLTRRSGNLTR | QSGHLTRQSGHLTRQSGHLTR | 0.76 | - | - | RSDNLVRRSDNLVRTSGNLVR | 0.66 | ||||||
| CTGGCGGCA | [ | RSGALTERSGDLTRQSGDLTR | RSGDLTTRSGDLTTRSGDLTT | 0.76 | - | - | RNDALTERSDDLVRQSGDLRR | 0.76 | ||||||
| GGGGCGGGG | [ | KSGHLTARSGELTRRSGHLTK | RSGHLTRRSGHLTRRSGHLTR | 0.80 | RKHRLDGRTDTLARRGNHLRR | 0.33 | RSDKLVRRSDDLVRRSDKLVR | 0.42 | ||||||
| GCTGGGGGC | [ | RSGELTRTSGHLTRDSGHLTR | QSGHLTRQSGHLTRQSGHLTR | 0.80 | VSNSLARRREHLVRTNSKLTR | 0.42 | TSGELVRRSDKLVRDPGHLVR | 0.61 | ||||||
| GCGTGGGGA | [ | RSGELTRRSGHLTRQSGHLTR | QSGTLTRRSGTLTRQSGTLTR | 0.80 | - | - | RSDDLVRRSDHLTTQRAHLER | 0.61 | ||||||
| GCGTGGGCA | [ | RSGELTRRSGHLTRRSGELTR | RSGTLTRRSGTLTRRSGTLTT | 0.80 | - | - | RSDDLVRRSDDLVRQSGDLRR | 0.57 | ||||||
| GCGTGGGAA | [ | RSGELTRRSGHLTRQSGNLTR | RSGTLTRRSGTLTRRSGTLTR | 0.80 | - | - | RSDDLVRRSDHLTTQSSNLVR | 0.66 | ||||||
| GCGGGCCGC | [ | RSGELTRDSGALTRRSGELTR | RSGHLTRRSGHLTRRSGHLTR | 0.80 | - | - | RSDDLVRDPGHLVRHTGHLLE | 0.47 | ||||||
| GCAGCGGAC | [ | RSGELTRRSGHLTRQSGSLTR | QSGHLTRQSGHLTRQSGHLTR | 0.80 | QKGTLGRRTDTLARDPSNLIR | 0.38 | QSGDLRRRSDDLVRDPGNLVR | 0.52 | ||||||
| GAGGAAGGG | [ | RSGHLTRQSGNLTRRSGNLTR | QSGHLTRQSGHLTRQSGHLTR | 0.80 | RRDNLNRQQTNLTRKRERLDR | 0.48 | RSDNLVRQSSNLVRRSDKLVR | 0.61 | ||||||
| ACTACTGGA | [ | TSGDLTRTSGDLTRQSGHLTR | TSGELTRTSGELTRTSGELTR | 0.80 | - | - | THLDLIRTHLDLIRQRAHLER | 0.57 | ||||||
| GCTGGGGGC | [ | QSGDLTRRSGHLTRDSGHLTR | QSGHLTRQSGHLTRQSGHLTR | 0.85 | VSNSLARRREHLVRTNSKLTR | 0.48 | TSGELVRRSDKLVRDPGHLVR | 0.61 | ||||||
| GAAGAGGGT | [ | QSGHLTRRSGNLTRQSGNLTR | QSGHLTRQSGHLTRQSGHLTR | 0.85 | QRNNLGRRQDNLGRTRQKLET | 0.38 | QSSNLVRRSDNLVRTSGHLVR | 0.61 | ||||||
| GAGGAAGGT | [ | TSGHLTRTSGHLTRRSGELTR | TSGHLTRTSGHLTRTSGHLTR | 0.90 | RRDNLNRQQTNLTRTKQRLEV | 0.28 | RSDNLVRQSSNLVRTSGHLVR | 0.47 | ||||||
| Average | | | | 0.81 | | 0.38 | | 0.58 |
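The identity scores in this comparison are based on Hamming distance between the experimentally found helix sequence and each tool's prediction. A minimal sketch of that calculation follows, assuming identity is defined as the fraction of matching positions over the 21 residues (three fingers of seven residues each); the function names are illustrative, not taken from the paper. The example uses the experimental and ZiFNN sequences from the GTGGAGGAA row.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def sequence_identity(a, b):
    """Fraction of matching positions: 1 - Hamming distance / length (assumed definition)."""
    return 1 - hamming_distance(a, b) / len(a)

# Sequences from the GTGGAGGAA row of the table above.
experimental = "QSGNLTRRSGHLTRRSGELTR"
zifnn_prediction = "DSGHLTRDSGHLTRDSGHLTR"

mismatches = hamming_distance(experimental, zifnn_prediction)
identity = sequence_identity(experimental, zifnn_prediction)
print(mismatches, identity)
```

For this pair the sequences differ at 5 of 21 positions, giving an identity of 16/21, or about 0.76, which agrees with the value listed for ZiFNN in that row.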