Bjørn André Bredesen-Aa, Marc Rehmsmeier.
Abstract
Gene expression is regulated through cis-regulatory elements (CREs), among which are promoters, enhancers, Polycomb/Trithorax Response Elements (PREs), silencers and insulators. Computational prediction of CREs can be achieved using a variety of statistical and machine learning methods combined with different feature space formulations. Although Python packages for DNA sequence feature sets and for machine learning are available, no existing package facilitates the combination of DNA sequence feature sets with machine learning methods for the genome-wide prediction of candidate CREs. Here we present Gnocis, a Python package that streamlines the analysis and the modelling of CRE sequences by providing extensible APIs and implementing the glue required for combining feature sets and models for genome-wide prediction. Gnocis implements a variety of base feature sets, including motif pair occurrence frequencies and the k-spectrum mismatch kernel. It integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. Gnocis additionally implements a broad suite of tools for the handling and preparation of sequence, region and curve data, which can be useful for general DNA bioinformatics in Python. We also present Deep-MOCCA, a neural network architecture inspired by SVM-MOCCA that achieves moderate to high generalization without prior motif knowledge. To demonstrate the use of Gnocis, we applied multiple machine learning methods to the modelling of D. melanogaster PREs, including a Convolutional Neural Network (CNN), making this the first study to model PREs with CNNs. The models are readily adapted to new CRE modelling problems and to other organisms. In order to produce a high-performance, compiled package for Python 3, we implemented Gnocis in Cython. Gnocis can be installed from PyPI by running 'pip install gnocis'. The source code is available on GitHub, at https://github.com/bjornbredesen/gnocis.
Year: 2022 PMID: 36084008 PMCID: PMC9462789 DOI: 10.1371/journal.pone.0274338
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Core data preparation features.
| Feature |
|---|
| FASTA (loading/saving) |
| 2bit (loading) |
| Streaming from disk |
| Sliding window extraction |
| |
| Coordinate lists (loading/saving) |
| General Feature Format (GFF) (loading/saving) |
| Browser Extensible Data (BED) (loading/saving) |
| G-zipped GFF (loading) |
| G-zipped BED (loading) |
| |
| Wiggle (loading/saving) |
| G-zipped Wiggle (loading) |
| Thresholding |
| |
| Merge |
| Intersect |
| Exclude |
| Get overlapping |
| Get non-overlapping |
| Resize |
| Randomly recentre |
| Extract underlying sequences |
| |
| Genomic sequences via sequence file operations |
| Loading of annotation (Ensembl General Transfer Format, GTF) |
| |
| Define biomarker set based on sets of experimental signals |
| Extract highly biomarker-enriched (HBME) genomic regions |
| Extract lowly biomarker-enriched (LBME) genomic regions |
| |
| Training of i.i.d. sequence model and generation of sequences |
| Training of nth-order sequence model and generation of sequences |
| |
| Plotting of genomic regions and curves with Matplotlib [ |
| Plotting barplots of region overlap statistics with Matplotlib |
Gnocis supports standard file formats for regions, curves and sequences, and implements a wide selection of operations in order to facilitate data preparation and handling.
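Two of the operations listed above, sliding window extraction and region merging, can be illustrated in plain Python. This is a minimal sketch and not the Gnocis API: the names `sliding_windows` and `merge_regions`, the `(chromosome, start, end)` tuple layout and the half-open, 0-based coordinates are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Hypothetical region layout: (chromosome, start, end), 0-based, half-open.
Region = Tuple[str, int, int]


def sliding_windows(seq: str, size: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) for every full-length window of `size` in `seq`."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def merge_regions(regions: List[Region]) -> List[Region]:
    """Merge overlapping or directly adjacent regions on the same chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of starting a new one.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Sorting first means each region only needs to be compared with the most recently merged one, so merging runs in a single pass after the sort.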
Sequence feature analysis.
| Feature |
|---|
| IUPAC nucleotide code motifs |
| Position Weight Matrices (PWMs) |
| Transformation of motif sets into feature sets |
| |
| Motif occurrence frequencies |
| Motif pair occurrence frequencies |
| k-spectrum kernel |
| k-spectrum mismatch kernel |
| |
| Combination of feature sets |
| Filtering of feature sets |
| Feature pairing by product |
| Scaling |
| |
| Construction of feature value tables |
| Output of summary statistics |
| Output of differential summary statistics |
| Conversion to NumPy [ |
| Conversion to Pandas [ |
Gnocis provides a flexible framework for the specification of sequence feature sets and integrates with NumPy [22] and Pandas [23] for analyses with external packages.
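To make the k-spectrum mismatch feature set concrete, the sketch below counts, for each possible k-mer, how many k-mers in a sequence match it with at most a given number of mismatches, and takes the dot product of two such feature vectors as the kernel value. It is a brute-force illustration only practical for small k, not the Gnocis implementation; the function names and the restriction to the alphabet `ACGT` are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def spectrum_mismatch_features(seq: str, k: int, max_mismatches: int = 1) -> Dict[str, int]:
    """Map each k-mer to the number of k-mers in `seq` within the mismatch limit."""
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    features: Dict[str, int] = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        count = sum(n for obs, n in observed.items()
                    if hamming(kmer, obs) <= max_mismatches)
        if count:  # keep the vector sparse
            features[kmer] = count
    return features


def mismatch_kernel(a: str, b: str, k: int, max_mismatches: int = 1) -> int:
    """Dot product of the mismatch feature vectors of two sequences."""
    fa = spectrum_mismatch_features(a, k, max_mismatches)
    fb = spectrum_mismatch_features(b, k, max_mismatches)
    return sum(value * fb.get(kmer, 0) for kmer, value in fa.items())
```

With `max_mismatches=0` the features reduce to exact k-mer counts, which corresponds to the plain k-spectrum kernel.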
Models.
| Feature |
|---|
| Unweighted sum |
| Log-odds |
| Support Vector Machine via Scikit-learn [ |
| Support Vector Machine with GPU application via Scikit-learn [ |
| Random Forest via Scikit-learn [ |
| |
| Combination of base models and feature sets |
| Keras Neural Networks via TensorFlow [ |
| Convolutional Neural Networks via TensorFlow [ |
| PyPREdictor, a reimplementation of the PREdictor [ |
| Dummy PREdictor as used in [ |
| Wrapper for SVM-MOCCA [ |
| Deep-MOCCA |
| |
| Operations on feature sets for the definition of models |
| Multi-core processing |
| Validation |
| Prediction threshold calibration |
| Genome-wide prediction |
| Multi-class model specification via sequence labels |
| Retraining |
Gnocis provides a flexible and extensible modelling API, with implementations of a variety of models and integrations with Scikit-learn [15] and TensorFlow [16].
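Of the base models listed above, a log-odds model is simple enough to sketch in a few lines. The sketch assumes that each feature is weighted by the log ratio of its mean value in positive versus negative training sequences, and that a sequence is scored by the weighted sum of its feature values. The function names, the pseudocount and the dictionary-based feature vectors are illustrative assumptions and not the Gnocis API.

```python
import math
from typing import Dict, List

# Hypothetical feature vector: feature name -> value (e.g. occurrence frequency).
FeatureVector = Dict[str, float]


def train_log_odds(positives: List[FeatureVector],
                   negatives: List[FeatureVector],
                   pseudocount: float = 1e-3) -> Dict[str, float]:
    """Weight each feature by log(mean in positives / mean in negatives).

    Both training lists must be non-empty. The pseudocount (an assumption of
    this sketch) avoids division by zero and log(0) for unseen features.
    """
    names = set().union(*positives, *negatives)

    def mean(vectors: List[FeatureVector], name: str) -> float:
        return sum(v.get(name, 0.0) for v in vectors) / len(vectors)

    return {
        name: math.log((mean(positives, name) + pseudocount)
                       / (mean(negatives, name) + pseudocount))
        for name in names
    }


def score(weights: Dict[str, float], features: FeatureVector) -> float:
    """Weighted sum of feature values; unknown features contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```

A feature that is equally frequent in both classes receives a weight near zero, so only class-discriminating features move the score.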
Fig 1. Deep-MOCCA schematic.
Deep-MOCCA is a convolutional neural network architecture that mimics the structure of SVM-MOCCA [7].
Fig 2. Cross-validation Precision/Recall curves.
We cross-validated our models trained with PREs and non-PREs, and tested with independent A) PREs versus dummy PREs and B) PREs versus coding sequences.
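For readers unfamiliar with Precision/Recall curves, the following sketch shows how the points of such a curve can be computed from model scores and binary labels by lowering the score threshold one distinct score at a time. It is a generic illustration, not the validation code used for the figure; the function name is hypothetical and the label list is assumed to contain at least one positive.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(scores: Sequence[float],
                            labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per distinct score threshold.

    `labels` holds 1 for positives and 0 for negatives and must contain at
    least one positive.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points: List[Tuple[float, float]] = []
    true_pos = false_pos = 0
    for i, (value, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last sequence sharing this score,
        # so that tied scores are treated as one threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != value:
            recall = true_pos / total_positives
            precision = true_pos / (true_pos + false_pos)
            points.append((recall, precision))
    return points
```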
Multiprocessing and GPU application of SVMs significantly reduce running times.
| Application method | Running time (h:mm:ss) |
|---|---|
| 1 core | 0:10:24 |
| 12 cores/threads | 0:05:18 |
| GPU | 0:01:28 |
We applied a quadratic 5-spectrum mismatch SVM to 19,600 dummy genomic sequences, each 3 kb long, using a single core, twelve cores/threads and a GPU. CPU: Intel Core i9-9900K, 3.6 GHz, 8 cores, 16 threads. GPU: GeForce GTX 980.
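The relative speed-ups implied by the running times in the table can be worked out directly. The short calculation below converts the h:mm:ss values to seconds and divides the single-core time by each of the others; the helper name `to_seconds` is introduced here for illustration only.

```python
def to_seconds(h_mm_ss: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in h_mm_ss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Running times as reported in the table above.
running_times = {
    "1 core": "0:10:24",
    "12 cores/threads": "0:05:18",
    "GPU": "0:01:28",
}

baseline = to_seconds(running_times["1 core"])  # 624 seconds
speedups = {
    method: baseline / to_seconds(time)
    for method, time in running_times.items()
}

for method, factor in speedups.items():
    print(f"{method}: {factor:.2f}x")
```

The single-core run takes 624 s, against 318 s on twelve cores/threads and 88 s on the GPU, which gives speed-ups of about 1.96 and 7.09, respectively.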
Fig 3. Numbers of predictions.
Fig 4. Predictions at the A) invected and B) vestigial loci.
Visualized using Gnocis genomic track plotting, which uses Matplotlib [26]. Opaque predictions were made in the majority of cross-validation repeats, and semi-transparent predictions in only a subset of repeats.
Fig 5. Prediction overlap with experimental data.
A) Overlap sensitivity of predictions to Enderle et al. (2011) [35] PREs. B) Nucleotide precision of predictions to Enderle et al. (2011) [35] PREs. In order to avoid bias, for the calculations in both A) and B), we removed those PREs from [35] and those predictions that overlapped, or lay within 1 kb of, a Kahn et al. (2014) [34] PRE.
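The two statistics in this caption can be sketched as follows. The sketch assumes that overlap sensitivity is the fraction of experimental regions overlapped by at least one prediction, and that nucleotide precision is the fraction of predicted base pairs that lie inside experimental regions; these definitions, the function names and the `(chromosome, start, end)` half-open region layout are assumptions made for illustration, not the paper's code.

```python
from typing import List, Tuple

# Hypothetical region layout: (chromosome, start, end), 0-based, half-open.
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region, margin: int = 0) -> bool:
    """True if a and b overlap, or the gap between them is below `margin` bp."""
    return a[0] == b[0] and a[1] < b[2] + margin and b[1] < a[2] + margin


def exclude_near(regions: List[Region], blockers: List[Region],
                 margin: int) -> List[Region]:
    """Drop regions that overlap, or lie within `margin` bp of, any blocker."""
    return [r for r in regions
            if not any(overlaps(r, b, margin) for b in blockers)]


def overlap_sensitivity(predictions: List[Region],
                        reference: List[Region]) -> float:
    """Fraction of reference regions overlapped by at least one prediction."""
    hits = sum(any(overlaps(r, p) for p in predictions) for r in reference)
    return hits / len(reference)


def nucleotide_precision(predictions: List[Region],
                         reference: List[Region]) -> float:
    """Fraction of predicted base pairs that fall inside reference regions.

    Assumes that regions within each set do not overlap one another.
    """
    predicted_bp = sum(end - start for _, start, end in predictions)
    shared_bp = sum(max(0, min(p[2], r[2]) - max(p[1], r[1]))
                    for p in predictions for r in reference if p[0] == r[0])
    return shared_bp / predicted_bp
```

Under these assumptions, the filtering step described in the caption would correspond to calling `exclude_near` with a margin of 1000 on both the experimental regions and the predictions before computing either statistic.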