Robert S Piecyk, Luca Schlegel, Frank Johannes.
Abstract
Gene regulation in eukaryotes is profoundly shaped by the 3D organization of chromatin within the cell nucleus. Distal regulatory interactions between enhancers and their target genes are widespread, and many causal loci underlying heritable agricultural or clinical traits have been mapped to distal cis-regulatory elements. Dissecting the sequence features that mediate such distal interactions is key to understanding their underlying biology. Deep Learning (DL) models coupled with genome-wide 3C-based sequencing data have emerged as powerful tools to infer the DNA sequence grammar underlying such distal interactions. In this review we show that most DL models have remarkably high prediction accuracy, which indicates that DNA sequence features are important determinants of chromatin looping. However, DL model training has so far been limited to a small set of human cell lines, raising questions about the generalization of these predictions to other tissue types and species. Furthermore, we find that the model architecture seems less relevant for model performance than the training strategy and the data preparation step. Transfer learning, coupled with functionally curated interactions, appears to be the most promising approach to learn cell-type-specific and possibly species-specific sequence features in future applications.
Keywords: 3D Chromatin Interaction; Chromosome conformation capture (3C); Deep Learning; Epigenetics; Genome folding
Year: 2022 PMID: 35832620 PMCID: PMC9271978 DOI: 10.1016/j.csbj.2022.06.047
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1 (A), (B1)–(B5) Hi-C sequencing. A restriction enzyme buffer in combination with SDS solubilization gives access to open cross-linked chromatin and removes other substances. Next, a type II restriction endonuclease digests the accessible chromatin. The HindIII enzyme recognizes and cleaves all 5’-AAGCTT-3’ sequences, and the resulting overhangs are filled in with biotin-14-dCTP. Dilute conditions favor proximity ligation, and the resulting ligation junctions can be identified by their unique 5’-GCTAGC-3’ NheI site. These chromatin ligation products are treated with Proteinase K to degrade the proteins, followed by several DNA purification steps [44]. (B6), (C6) DNA samples are mapped to a given DNA library with the correct reference genome, including many quality-control steps. (A), (C1)–(C5) ChIA-PET sequencing. Formaldehyde stabilizes cross-linked DNA–protein complexes before sonication is used for fragmentation. ChIP is applied using an antibody against the protein of interest. The precipitate, which is enriched with the fragmented chromatin complexes, is divided into two separate aliquots, each with a different half-linker oligonucleotide. Both samples are mixed, which enables proximity ligation of the half-linkers and self-ligation. The restriction enzyme MmeI is added for digestion, and the DNA fragments forming paired-end tags (PETs) are isolated. The final DNA sequences can be assigned uniquely by their ligation type. Self-ligated sequences are considered chromatin loops spanning a short distance, while mixed linkers refer to long-distance interactions, possibly on different chromosomes [49].
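The appearance of the NheI site at Hi-C ligation junctions can be illustrated with a short sketch. It assumes that HindIII cuts after the first A of its site (A^AGCTT), that both overhangs are completely filled in, and that the blunt ends are then ligated. The sequences and the function name `digest_and_fill` are made up for illustration and do not come from the reviewed protocols.

```python
HINDIII_SITE = "AAGCTT"  # recognition sequence named in the caption
NHEI_SITE = "GCTAGC"     # junction signature named in the caption
CUT_OFFSET = 1           # assumption: HindIII cuts A^AGCTT on the top strand


def digest_and_fill(seq):
    """Return top-strand fragments after HindIII digestion and overhang fill-in.

    The left fragment of each cut is extended through the 4-nt overhang
    (ending in ...AAGCT); the right fragment starts at the cut (AGCTT...).
    """
    fragments = []
    start = 0
    pos = seq.find(HINDIII_SITE)
    while pos != -1:
        fragments.append(seq[start:pos + 5])  # filled-in left end: ...AAGCT
        start = pos + CUT_OFFSET              # right end begins with AGCTT...
        pos = seq.find(HINDIII_SITE, pos + 1)
    fragments.append(seq[start:])
    return fragments


# Two hypothetical loci that are far apart in sequence but close in 3D space.
locus_a = "GGGTACCAAGCTTCCATG"
locus_b = "TTGACAAGCTTGGCAT"

left_end = digest_and_fill(locus_a)[0]    # upstream part of locus A
right_end = digest_and_fill(locus_b)[-1]  # downstream part of locus B

# Blunt-end proximity ligation joins the two filled-in ends.
junction = left_end + right_end
print(junction)
print("NheI site present:", NHEI_SITE in junction)
print("HindIII site present:", HINDIII_SITE in junction)
```

In this toy case the junction reads ...AAGCT + AGCTT..., which contains GCTAGC but no longer contains AAGCTT. This is why the NheI site can serve as a marker for ligation events.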
Fig. 2 Training procedure of a typical CNN + LSTM model. (A) During input data preprocessing, chromatin sequences are translated using a one-hot encoding technique. (B) Schematic representation of a commonly used architecture in chromatin interaction detection. The two input anchors are processed separately by CNN blocks, which are defined by the number of convolutional and pooling layers and several hyperparameters. Once the architecture, the arrangement of the layers and the hyperparameters are selected, training starts from an initial unbiased parameter distribution. The concept of ’training’ refers to minimizing a certain loss function L with respect to λ, where L depends on the input values x, the parameter set λ and all non-linear functions. This representation is drastically simplified and highly abstract. Many additional decisions are necessary to define a full CNN model with LSTM units. (C) Training yields a set of optimal parameters. This set of trained parameters, in combination with the model architecture, must be applied to a final test data set for validation before it is applied to completely new data.
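The one-hot encoding step in panel (A) can be sketched as follows. The channel order A, C, G, T and the all-zero encoding of ambiguous bases such as N are assumptions made for this example; individual models may choose differently. The function name `one_hot_encode` is illustrative.

```python
# Assumed channel order: A, C, G, T (individual models may differ).
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
UNKNOWN = [0, 0, 0, 0]  # assumption: ambiguous bases (e.g. N) map to all zeros


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in seq.upper()]


# Each of the two anchors of a candidate interaction is encoded separately,
# matching the two-branch input described in panel (B).
anchor_1 = one_hot_encode("ACGTN")
anchor_2 = one_hot_encode("ttga")

for row in anchor_1:
    print(row)
```

Each anchor becomes a matrix with one row per base and four columns, which is the kind of input that a 1D convolutional layer scans for sequence motifs.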
Deep Learning algorithms for 3D chromatin interactions, sorted by architecture. All models are based on Convolutional Neural Networks or Recurrent Neural Networks. [18], [41], [43], [9], [19], [27], [28], [13], [22], [29], [33], [38], [40], [42], [31], [30], [11], [12], [14], [15], [16], [17], [20], [21], [23], [25], [26], [32], [34], [35], [36], [37].
Fig. 3 Performance bar plot for models that reported an AUPR value for cell-type-specific training and testing. Light blue bars indicate a pure CNN model; dark blue bars indicate a CNN + RNN model. Transfer learning is represented by dotted bars. The yellow interval is defined by the minimum and maximum AUPR values.
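For readers unfamiliar with the AUPR metric, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. It assumes binary labels (1 for interacting, 0 for non-interacting) and distinct prediction scores. The reviewed models may have computed AUPR differently, for example by trapezoidal interpolation. The labels and scores are invented.

```python
def average_precision(labels, scores):
    """Step-wise AUPR estimate: mean precision at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank candidate interactions from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision when this positive is recovered
    return total / n_pos


# Invented example: four candidate anchor pairs, two of them true interactions.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

Unlike accuracy, this metric is driven by how highly the true interactions are ranked, which matters when non-interacting pairs greatly outnumber interacting ones.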
Fig. 4 Biological insights by Deep Learning. Deep Learning models can be used to predict chromatin interactions for several nucleotide variants through in silico mutagenesis. A change in the loop probabilities between sequence variants indicates the importance of the specific single nucleotide polymorphism in chromatin looping. Transfer learning can be used to extend previously learned knowledge to different species or cell lines.
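The in silico mutagenesis idea can be sketched as a loop over all single-base substitutions in one anchor. The scoring function `toy_loop_probability` is a hypothetical stand-in for a trained model, and `TOY_MOTIF` is an arbitrary placeholder; neither is taken from the reviewed work. In practice, `predict` would wrap the forward pass of a trained network.

```python
import math

TOY_MOTIF = "CCCTC"  # hypothetical placeholder motif, not a claim about any model


def toy_loop_probability(anchor_1, anchor_2):
    """Stand-in for a trained model: logistic function of motif counts."""
    hits = anchor_1.count(TOY_MOTIF) + anchor_2.count(TOY_MOTIF)
    return 1.0 / (1.0 + math.exp(-(2.0 * hits - 2.0)))


def in_silico_mutagenesis(anchor_1, anchor_2, predict):
    """Map (position, alt_base) -> change in loop probability vs. the reference."""
    p_ref = predict(anchor_1, anchor_2)
    effects = {}
    for pos, ref_base in enumerate(anchor_1):
        for alt in "ACGT":
            if alt == ref_base:
                continue
            mutated = anchor_1[:pos] + alt + anchor_1[pos + 1:]
            effects[(pos, alt)] = predict(mutated, anchor_2) - p_ref
    return effects


effects = in_silico_mutagenesis("AACCCTCGA", "TTCCCTCAT", toy_loop_probability)

# Report the substitution with the largest absolute effect on the prediction.
(pos, alt), delta = max(effects.items(), key=lambda item: abs(item[1]))
print(pos, alt, round(delta, 3))
```

Substitutions that leave the prediction unchanged are read as neutral, while large changes flag positions that the model treats as important for looping.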