Literature DB >> 30923381

Selene: a PyTorch-based deep learning library for sequence data.

Kathleen M Chen¹, Evan M Cofer^2,3, Jian Zhou^1,2, Olga G Troyanskaya^4,5,6.

Abstract

To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequence data. We demonstrate on DNA sequences how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 30923381 PMCID： PMC7148117 DOI： 10.1038/s41592-019-0360-8

Source DB: PubMed Journal: Nat Methods ISSN： 1548-7091 Impact factor: 28.547

Deep learning describes a set of machine learning techniques that use stacked neural networks to extract complicated patterns from high-dimensional data[1]. These techniques are widely used for image classification and natural language processing and have led to very promising advances in the biomedical domain, including genomics and chemical synthesis[1-3]. In regulatory genomics, networks trained on high-throughput sequencing data (e.g. ChIP-seq), or “sequence-based models,” have become the de facto standard for predicting regulatory and disease impact of mutations[4-7]. While deep-learning related publications are often accompanied by the associated pre-trained model[6,8,9], a key challenge in both developing new deep learning architectures and training existing architectures on new data is the lack of a comprehensive, generalizable, and user-friendly deep learning library for biology. Beyond regulatory genomics, sequence-level deep learning models have broad promise in a wide range of research areas, including recent advances on prediction of disease risk of missense mutations in proteins[10] and potential applications to, for example, predicting target site accessibility in genome editing. We must enable the adoption and active development of deep learning-based methods in biomedical sciences. For example, a biomedical scientist excited by a publication of a model capable of predicting the disease-associated effect of mutations should be able to train a similar model on their own ChIP-seq data focused on their disease of interest. A bioinformatician interested in developing new model architectures should be able to experiment with different architectures and evaluate all of them on the same data. Currently, this requires advanced knowledge specific to deep learning[2,11], substantial new code development, and associated time investment far beyond what most biomedical scientists are able to commit. Here we present Selene, a framework for developing sequence-level deep learning networks, that provides biomedical scientists with comprehensive support for model training, evaluation, and application across a broad range of biological questions. Sequence-level data refers to any type of biological sequence such as DNA, RNA, or protein sequences and their measured properties (e.g. TF binding, DNase sensitivity, RBP binding). Selene contains modules for (1) data sampling and training for model development (Fig. 1a), and (2) prediction and visualization for analyses using the trained model (Fig. 1b, c). With Selene, researchers can run model development and analysis workflows out-of-the-box. For more advanced use cases, Selene provides templates for extending modules within each workflow so that users can adapt the library to their particular research questions.

Figure 1.

Overview of Selene.

(a) Selene enables users to train and evaluate new deep learning models with very few lines of code. As input, the library accepts (left) the model architecture, dataset, and (mid) a configuration file that specifies the necessary input data paths and training parameters; Selene automatically splits the data into training and validation/testing, trains the model, evaluates it, and (right) generates figures from the results. (b) Selene also supports the use of trained models to interpret variants. In addition to being able to run variant effect prediction with the same configuration file format, Selene includes a visualization of the variants and their difference scores as a Manhattan plot, where a user can hover over each point to see variant information. (c) Users interested in finding informative bases that contribute to the binding of a certain protein can run Selene to get mutation effect scores and visualize the scores as a heatmap.

There has been recent work to make deep learning in biology more accessible: DragoNN is a toolkit for teaching deep learning in regulatory genomics; pysster[12] is a Python package for training convolutional neural networks on biological sequence data; and Kipoi[13] is a framework to archive, use, and build on published predictive models in genomics. These resources constitute the nascent software ecosystem for sequence-level deep learning. Selene is our contribution to this ecosystem. Selene supports general model development not constrained to a particular architecture (in contrast to pysster) or task (in contrast to DragoNN) and is designed for users with different levels of computational experience. Users are supported in tasks ranging from simply applying an existing model, to retraining it on new data (tasks also supported by Kipoi), to developing new model architectures (a task that is challenging to do with any other tool). The models developed using Selene can be shared and used through the Kipoi framework. To demonstrate Selene’s capabilities for developing and evaluating sequence-level deep learning models, we use it to (1) train a published architecture on new data, (2) develop, train, and evaluate a new model (improving a published model) and (3) apply a trained model to data and visualize the resulting predictions.

Case 1: Training an existing (e.g. published) architecture on a different dataset

In this case study, a researcher reads a manuscript about a deep learning model and wants to train the model on different data. For illustration, we will use the DeepSEA model architecture as a starting point in our case studies; however, Selene is completely general and a user can easily use or specify any model of their choice using modules in PyTorch. Suppose a cancer researcher is interested in modeling the regulatory elements of the transcription factor GATA1, specifically focusing on proerythroblasts in bone marrow. This is a tissue-specific genomic feature that DeepSEA does not predict. The researcher downloads peaks data from Cistrome[14] and a reference genome FASTA file. Once a researcher formats the data to match the documented inputs (https://selene.flatironinstitute.org/overview/cli.html) and fills out the necessary training parameters (e.g. batch size, learning rate), they can use Selene to train the DeepSEA architecture on their data with no new lines of Python code. In this example, they find that the model obtains an AUC of 0.942 on this feature (Fig. 2a).

Figure 2.

Visualizations generated by using Selene to train and apply a model to sequences.

(a) Selene visualization of the performance of the model trained in case study 1. (b) Selene visualization of in silico mutagenesis on the case-study-trained model for 20 randomly selected GATA1 sequences in the test set (two representative plots displayed here, all heatmaps generated are displayed in the example Jupyter notebook: https://github.com/FunctionLab/selene/blob/master/manuscript/case1/3_visualize_ism_outputs.ipynb). Bases in the original sequence are distinguished by the gray stripes in the heatmap cells.

Selene automatically generates training, testing, and validation samples from the provided input data. The samples generated for each partition can be saved and used in subsequent model development so that comparisons can be made across models with different architectures and/or parameters. Further, Selene automatically evaluates the model on the test set after training and, in this case, generates figures to visualize the model’s performance as receiver operating characteristic and average precision curves. Now that the researcher has a trained model, they can use Selene to apply in silico mutagenesis to a set of GATA1 sequences from the test set. In silico mutagenesis involves “mutating” every position in the sequence to every other possible base[4] (DNA and RNA) or amino acid (protein sequences) and examining the consequences of these “mutations.” Selene supports visualizing the outputs of in silico mutagenesis as a heatmap and/or motif plot. By visualizing the log2 fold change for these sequences in a heatmap, the researcher can see that the model detects disruptions in binding at the GATA motif (Fig. 2b). We provide the code and results for this example in Selene’s GitHub repository (https://github.com/FunctionLab/selene/tree/master/manuscript/case1). The case study demonstrates that Selene enables researchers to quickly get started with sequence-based deep learning; researchers can train and apply a model to their area of interest with minimal code and deep learning knowledge. Without Selene, a researcher would need to dedicate substantial time to preprocess their datasets and write code for training and evaluating their model.

Case 2: Developing a new architecture and comparing performance across architectures

In another use case, a researcher may want to develop and train a new model architecture. For example, a bioinformatician might want to modify a published model architecture to see how that affects performance. First, the researcher uses modules in PyTorch to specify the model architecture they are interested in evaluating; in this case study, they try to enhance the DeepSEA architecture with batch normalization and three additional convolutional layers. The researcher specifies parameters for training and the paths to the model architecture and data in a configuration file and passes this as input to the library’s command-line interface. Training is automatically completed by Selene; afterwards, the researcher can easily use Selene to compare the performance of their new model to the original DeepSEA model on the same chromosomal holdout dataset. In this case study, the researcher finds that the deeper architecture achieves an average AUC of 0.938 (Fig. 3a) and an average AUPRC of 0.362, which is an improvement over the average AUC of 0.933 and AUPRC of 0.342 of the original 3-convolutional-layer model. The researcher can share this model with a collaborator (e.g. a human geneticist, see case study 3 below) and upload it to the Kipoi[13] model zoo, a repository of trained models for regulatory genomics, with which Selene-trained models are fully compatible (see an example at https://github.com/FunctionLab/selene/tree/master/manuscript/case2/3_kipoi_export).

Figure 3.

Using Selene to train a model and obtain model predictions for variants in an Alzheimer’s GWAS study.

(a) Selene visualization of the performance of the trained model from case-study 2. (b) The trained model, on average, predicts greater differences for the nominally significant variants (p-value < 0.05, n = 422398) reported in the IGAP early onset Alzheimer’s GWAS study[16] compared to those that are nonsignificant (p-value > 0.50, n = 3842725). Here we visualize the mean and 95% confidence intervals of the quantile-normalized (against the Gaussian distribution) predicted effect scores of the two variant groups for the genomic feature H3K36me3 in K562 cells, the feature in the model with the most significant difference (one-sided Wilcoxon rank-sum test, adjusted p-value using Benjamini—Hochberg of 3.89 ✕ 10−67). After applying the multiple testing correction, 914 of the 919 features that the model predicts showed a significant difference (ɑ < 0.05) between the groups.

Using Selene, researchers can substantially reduce the amount of work needed to develop, train, and compare new models. Researchers are able to focus on experimenting with various model architectures rather than writing all new code for model training and evaluation.

Case 3: Applying a new model to variants

In this case study, a human geneticist studying Alzheimer’s wants to apply the model developed in case study 2 above, so they first assess its ability to prioritize potential disease-associated variants. Specifically, they use Selene to make variant effect predictions for nominally significant variants (p-value < 0.05) and nonsignificant variants (p-value > 0.50) reported in the International Genomics of Alzheimer’s Project[15] Alzheimer’s disease GWAS[16]. The researcher finds that the predicted effect is significantly higher for GWAS nominally significant variants versus non-significant variants, indicating that the new model is indeed able to prioritize potential disease-associated variants (one-sided Wilcoxon rank-sum test; the most significant feature, H3K36me3 in K562 cells, has an adjusted p-value, by Benjamini—Hochberg correction, of 3.89 ✕ 10−67) (Fig. 3b). Selene’s modeling capability extends far beyond case studies shown above. The library can be applied to not only DNA, but also RNA and protein sequences; and not only chromatin data, but any current genome-, transcriptome-, or even proteome-wide measurements. We developed Selene to increase the accessibility of deep learning in biology and facilitate the creation of reproducible workflows and results. Furthermore, Selene is open-source software that will continue to be updated and expanded based on community and user feedback.

Online methods

Overview of Selene

Selene consists of two components: a Python library for developing sequence-level neural networks, and a command line interface for prototypical use cases of the library (i.e. training a new model, evaluating an existing model, and analyzing sequence data and variants with a trained model). We herein refer to these components as the software development kit (SDK) and the command line interface (CLI) respectively. Importantly, all functionality provided by the CLI is also available to the user through the SDK. Rather than supplanting the SDK, the CLI is intended to maximize code reuse and minimize user time spent learning SDK by heavily reducing the configuration tasks left to the user (e.g. when GPU usage is specified, the CLI ensures all appropriate computations are performed on the GPU). When appropriate, the SDK does deliver functionality beyond that of the CLI. For instance, the SDK includes several data visualization methods that would be too unwieldy as executables run from the command line. Thorough documentation for the SDK is available at https://selene.flatironinstitute.org, and tutorials for both the CLI and SDK can be found at https://github.com/FunctionLab/selene/tree/master/tutorials. Notably, one tutorial demonstrates how to use Selene to train a deep neural network regression model (https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb). This tutorial illustrates Selene’s use outside of the models of transcriptional regulation shown in the case studies.

Selene software development kit

The Selene SDK, formally known as selene_sdk, is an extensible Python package intended to ease development of new programs that leverage sequence-level models through code reuse. The Selene CLI is built entirely upon the functionality provided by the SDK, but it is likely that users will use the SDK outside this context. For example, after training a new sequence-level model with the CLI, one could use the SDK in conjunction with a Python-based web application framework (e.g. Flask, Django) to build a web server so that other researchers can submit sequences or variants and get the trained model’s predictions as output. Leveraging the SDK in a user Python project is no different from using any other Python module. That is, one only needs to import the selene_sdk module or any of its members, and supply them with the correct parameters. The runtime behavior of each component of selene_sdk, as well as the required parameters for all members of selene_sdk, is described in detail in the online documentation (https://selene.flatironinstitute.org/overview/overview.html).

Selene command line interface

The Selene CLI is a usable program to be run from the command line by the user. It encapsulates the configuration, execution, and logging of Selene’s most common use cases. Said use cases are embodied by the CLI’s three commands: train, evaluate, and analyze. These commands are used to train new models, evaluate the performance of trained models, and analyze model predictions (perform in silico mutagenesis or variant effect prediction) respectively. Each command configures its specific runtime environment with a combination of command line arguments and parameters drawn from user-provided configuration files. The flexibility of these configuration files allows them to leverage user-developed code as well, and further extends the usability of the CLI. We provide a step-by-step tutorial that describes the CLI configuration file format and shows some example configuration keys and values at https://github.com/FunctionLab/selene/blob/master/tutorials/getting_started_with_selene/getting_started_with_selene.ipynb. Additional examples of CLI configuration files are available at https://github.com/FunctionLab/selene/tree/master/config_examples as well. Finally, comprehensive documentation detailing all possible configurations supported by Selene can be found at https://selene.flatironinstitute.org/overview/cli.html. Users can reference any of these resources when creating their own configuration files.

Model Architectures

DeepSEA architecture used in case 1 (from the supplementary note in the DeepSEA publication[4]): Convolutional layer (320 kernels, window size = 8, step size = 1) Pooling layer (window size = 4, step size = 4) Convolutional layer (480 kernels, window size = 8, step size = 1) Pooling layer (window size = 4, step size = 4) Convolutional layer (960 kernels, window size = 8, step size = 1) Fully connected layer (919 genomic features) Sigmoid output layer Dropout proportion (proportion of outputs randomly set to 0): Layer 2: 20% Layer 4: 20% Layer 5: 50% All other layers: 0% Architecture used in cases 2 and 3: Convolutional layer (320 kernels, window size = 8, step size = 1) Convolutional layer (320 kernels, window size = 8, step size = 1) Pooling layer (window size = 4, step size = 4) Convolutional layer (480 kernels, window size = 8, step size = 1) Convolutional layer (480 kernels, window size = 8, step size = 1) Pooling layer (window size = 4, step size = 4) Convolutional layer (960 kernels, window size = 8, step size = 1) Convolutional layer (960 kernels, window size = 8, step size = 1) Fully connected layer (919 genomic features) Sigmoid output layer Dropout proportion: Layer 5: 20% Layer 8: 50% Batch normalization applied after layers 2, 5, and 8 and before dropout. Both architectures use the binary cross-entropy loss function and stochastic gradient descent optimizer (momentum = 0.9, weight decay = 10−6).

Reproducing the case studies

Below, we have described the steps taken for each of the case studies. The code required to reproduce each case study is included in the GitHub repository (https://github.com/FunctionLab/selene/tree/master/manuscript). We have also created Zenodo records for each case that contain all the input data, data processing scripts, and outputs files generated from Selene: Case 1. doi: 10.5281/zenodo.1442433 Case 2. doi: 10.5281/zenodo.1442437 Case 3. doi: 10.5281/zenodo.1445555 Download the data from Cistrome. In this case, we are only working with 1 dataset for 1 specific genomic feature. Cistrome ID: 33545, measurements from GSM970258 (Xu et al., 2012): http://dc2.cistrome.org/api/downloads/eyJpZCI6IjMzNTQ1In0%3A1fujCu%3ArNvWLCNoET6o9SdkL8fEv13uRu4b/ Format the data. We use tools from Samtools[17] (specifically, tabix[18] and bgzip from HTSlib, https://www.htslib.org/). Create a .bed file of chromosome, start, end, and the genomic feature name (useful when there is more than 1 feature). Sort this file and compress it into a .gz file. Tabix-index this file. Specific commands: Only use the columns [chr, start, end]: cut -f 1–3 > Note: Eventually, we will add support for parsing BED files with strand specific features and/or continuous values that quantify these features. Add the genomic feature name as the 4th column of the file: sed -i “s/$/\t/” Sort the file by [chr, start, end]: sort -k1V -k2n -k3n > Compress the file: bgzip This compresses the file to a .gz file in-place. To separately generate the .gz file, run bgzip -c > .gz Tabix index the file: tabix -p bed .gz Create a file of distinct features that the model will predict, where each feature is a single line in the file. This can easily be created from the .bed file in step 2 by running: cut -f 4 | sort -u > Download the GRCh38/hg38 FASTA file. We downloaded the reference sequences GRCh37/hg19 and GRCh38/hg38 used in our analyses from ENCODE: https://www.encodeproject.org/data-standards/reference-sequences/. Specify the model architecture, loss, and optimizer as a Python file. This is done for you in the case of DeepSEA: https://github.com/FunctionLab/selene/blob/master/models/deepsea.py Fill out the configuration file with the appropriate file paths and training parameters. We recommend starting from one of the example training files in the tutorials (e.g. https://github.com/FunctionLab/selene/blob/master/tutorials/getting_started_with_selene/getting_started_with_selene.ipynb) or in the “config_examples” directory (https://github.com/FunctionLab/selene/tree/master/config_examples). You can also review the documentation for the configuration parameters on Selene’s website (https://selene.flatironinstitute.org/overview/cli.html). Run Selene. Collect sequences you want to visualize as a FASTA file. For this particular case, we provide a script to do so (https://zenodo.org/record/2214130/files/data.tar.gz). Fill out the configuration file with the appropriate file paths (path to the FASTA file, information about the trained model). Run Selene. You will get the raw predictions and the log2 fold change scores as output files. Follow one of the Jupyter notebook tutorials for in silico mutagenesis (https://github.com/FunctionLab/selene/tree/master/tutorials) to generate visualizations for the sequences. We have done this in https://github.com/FunctionLab/selene/blob/master/manuscript/case1/3_visualize_ism_outputs.ipynb. Download the code and data bundle from the DeepSEA website (http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz). You only need the .mat files in this directory. We also include a file listing the 919 genomic features that the model predicts. This is from the resources directory in the standalone version of DeepSEA (http://deepsea.princeton.edu/media/code/deepsea.v0.94b.tar.gz). Zenodo record: https://zenodo.org/record/2214970/files/DeepSEA_data.tar.gz. Fill out the configuration file for Selene’s MultiFileSampler (https://selene.flatironinstitute.org/overview/cli.html#multiple-file-sampler) and specify the path to each .mat file for training, validation, and testing. Run Selene.

Case 2: Developing a new architecture and making model comparisons

Steps to train “deeper DeepSEA” on the same exact data as DeepSEA.

Please see the DeepSEA publication for details about data processing and training: https://www.nature.com/articles/nmeth.3547#methods. In the main text, we report test performance for the model trained using the online sampler. When training on the same exact data (the .mat files) as DeepSEA, we achieve an average AUC of 0.934 and an average AUPRC of 0.361. Download all chromatin feature profiles used for training DeepSEA, specified in Supplementary Table 1 of the DeepSEA manuscript (https://media.nature.com/original/nature-assets/nmeth/journal/v12/n10/extref/nmeth.3547-S2.xlsx). We have done this for you (https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz). For each file, keep the chromosome, start, and end columns. In addition, create a fourth column with the feature’s name. Concatenate all these files and create the distinct features file. We provide a Python script for this step (https://github.com/FunctionLab/selene/blob/master/manuscript/case2/1_train_with_online_sampler/data/process_chromatin_profiles.py). Format the data according to the instructions in the “Getting started” tutorial: Sort the file by [chr, start, end]: sort -k1V -k2n -k3n > Compress the file: bgzip This compresses the file to a .gz file in-place. To separately generate the .gz file, run bgzip -c > .gz Tabix index the file: tabix -p bed .gz https://github.com/FunctionLab/selene/blob/master/manuscript/case2/1_train_with_online_sampler/data/process_data.sh Download the hg19 FASTA file (https://www.encodeproject.org/files/male.hg19/@@download/male.hg19.fasta.gz). Specify the model architecture, loss, and optimizer as a Python file: https://github.com/FunctionLab/selene/blob/master/selene_sdk/utils/example_model.py Fill out the configuration file with the appropriate file paths and training parameters. We set the training parameters (number of steps, batches, etc.) so that they matched how DeepSEA was originally trained. Run Selene. Download the SNPs from the International Genomics of Alzheimer’s Project. (https://www.niagads.org/igap-age-onset-survival-analyses-p-value-only) Group the variants into those with p-values below 0.05 (significant) and those with p-values above 0.50 (nonsignificant). Fill out the configuration file with the paths to the two variants files and the trained model weights file from Case 2. Run Selene. Follow the script provided for this case to analyze the variant predictions (https://github.com/FunctionLab/selene/blob/master/manuscript/case3/2_variant_groups_comparison.sh).

Case 3: Applying a new model to variants

Statistical analysis

Details of the statistical test used for case study 3 are specified in the associated text and figure legend (Fig. 3b).

16 in total

1. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Authors: Babak Alipanahi; Andrew Delong; Matthew T Weirauch; Brendan J Frey
Journal: Nat Biotechnol Date: 2015-07-27 Impact factor: 54.908

Review 2. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

Review 3. Deep learning in bioinformatics.

Authors: Seonwoo Min; Byunghan Lee; Sungroh Yoon
Journal: Brief Bioinform Date: 2017-09-01 Impact factor: 11.622

4. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

5. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences.

Authors: Daniel Quang; Xiaohui Xie
Journal: Nucleic Acids Res Date: 2016-04-15 Impact factor: 16.971

6. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse.

Authors: Shenglin Mei; Qian Qin; Qiu Wu; Hanfei Sun; Rongbin Zheng; Chongzhi Zang; Muyuan Zhu; Jiaxin Wu; Xiaohui Shi; Len Taing; Tao Liu; Myles Brown; Clifford A Meyer; X Shirley Liu
Journal: Nucleic Acids Res Date: 2016-10-26 Impact factor: 16.971

Review 7. Opportunities and obstacles for deep learning in biology and medicine.

Authors: Travers Ching; Daniel S Himmelstein; Brett K Beaulieu-Jones; Alexandr A Kalinin; Brian T Do; Gregory P Way; Enrico Ferrero; Paul-Michael Agapow; Michael Zietz; Michael M Hoffman; Wei Xie; Gail L Rosen; Benjamin J Lengerich; Johnny Israeli; Jack Lanchantin; Stephen Woloszynek; Anne E Carpenter; Avanti Shrikumar; Jinbo Xu; Evan M Cofer; Christopher A Lavender; Srinivas C Turaga; Amr M Alexandari; Zhiyong Lu; David J Harris; Dave DeCaprio; Yanjun Qi; Anshul Kundaje; Yifan Peng; Laura K Wiley; Marwin H S Segler; Simina M Boca; S Joshua Swamidass; Austin Huang; Anthony Gitter; Casey S Greene
Journal: J R Soc Interface Date: 2018-04 Impact factor: 4.293

8. Integrative analysis of 111 reference human epigenomes.

Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis
Journal: Nature Date: 2015-02-19 Impact factor: 69.504

9. Follow-up of loci from the International Genomics of Alzheimer's Disease Project identifies TRIP4 as a novel susceptibility gene.

Authors: A Ruiz; S Heilmann; T Becker; I Hernández; H Wagner; M Thelen; A Mauleón; M Rosende-Roca; C Bellenguez; J C Bis; D Harold; A Gerrish; R Sims; O Sotolongo-Grau; A Espinosa; M Alegret; J L Arrieta; A Lacour; M Leber; J Becker; A Lafuente; S Ruiz; L Vargas; O Rodríguez; G Ortega; M-A Dominguez; R Mayeux; J L Haines; M A Pericak-Vance; L A Farrer; G D Schellenberg; V Chouraki; L J Launer; C van Duijn; S Seshadri; C Antúnez; M M Breteler; M Serrano-Ríos; F Jessen; L Tárraga; M M Nöthen; W Maier; M Boada; A Ramírez
Journal: Transl Psychiatry Date: 2014-02-04 Impact factor: 6.222

10. Predicting the clinical impact of human mutation with deep neural networks.

Authors: Laksshman Sundaram; Hong Gao; Samskruthi Reddy Padigepati; Jeremy F McRae; Yanjun Li; Jack A Kosmicki; Nondas Fritzilas; Jörg Hakenberg; Anindita Dutta; John Shon; Jinbo Xu; Serafim Batzoglou; Xiaolin Li; Kyle Kai-How Farh
Journal: Nat Genet Date: 2018-07-23 Impact factor: 38.330

20 in total

1. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization.

Authors: Zhen Chen; Pei Zhao; Chen Li; Fuyi Li; Dongxu Xiang; Yong-Zi Chen; Tatsuya Akutsu; Roger J Daly; Geoffrey I Webb; Quanzhi Zhao; Lukasz Kurgan; Jiangning Song
Journal: Nucleic Acids Res Date: 2021-06-04 Impact factor: 16.971

Review 2. A guide to machine learning for biologists.

Authors: Joe G Greener; Shaun M Kandathil; Lewis Moffat; David T Jones
Journal: Nat Rev Mol Cell Biol Date: 2021-09-13 Impact factor: 94.444

Review 3. How Machine Learning Will Transform Biomedicine.

Authors: Jeremy Goecks; Vahid Jalili; Laura M Heiser; Joe W Gray
Journal: Cell Date: 2020-04-02 Impact factor: 41.582

Review 4. Decoding disease: from genomes to networks to phenotypes.

Authors: Aaron K Wong; Rachel S G Sealfon; Chandra L Theesfeld; Olga G Troyanskaya
Journal: Nat Rev Genet Date: 2021-08-02 Impact factor: 53.242

5. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements.

Authors: Tiffany Amariuta; Kazuyoshi Ishigaki; Hiroki Sugishita; Tazro Ohta; Masaru Koido; Kushal K Dey; Koichi Matsuda; Yoshinori Murakami; Alkes L Price; Eiryo Kawakami; Chikashi Terao; Soumya Raychaudhuri
Journal: Nat Genet Date: 2020-11-30 Impact factor: 38.330

Selene: a PyTorch-based deep learning library for sequence data.

Case 1: Training an existing (e.g. published) architecture on a different dataset

Case 2: Developing a new architecture and comparing performance across architectures

Case 3: Applying a new model to variants

Online methods

Overview of Selene

Selene software development kit

Selene command line interface

Model Architectures

Reproducing the case studies

Case 2: Developing a new architecture and making model comparisons

Steps to train “deeper DeepSEA” on the same exact data as DeepSEA.

Case 3: Applying a new model to variants

Statistical analysis

1. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Review 2. Deep learning.

Review 3. Deep learning in bioinformatics.

4. The Sequence Alignment/Map format and SAMtools.

5. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences.

6. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse.

Review 7. Opportunities and obstacles for deep learning in biology and medicine.

8. Integrative analysis of 111 reference human epigenomes.

9. Follow-up of loci from the International Genomics of Alzheimer's Disease Project identifies TRIP4 as a novel susceptibility gene.

10. Predicting the clinical impact of human mutation with deep neural networks.

1. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization.

Review 2. A guide to machine learning for biologists.

Review 3. How Machine Learning Will Transform Biomedicine.

Review 4. Decoding disease: from genomes to networks to phenotypes.

5. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements.

6. Modeling transcriptional regulation of model species with deep learning.

7. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models.

8. PlantDeepSEA, a deep learning-based web service to predict the regulatory effects of genomic variants in plants.

Review 9. Gaining insight into metabolic diseases from human genetic discoveries.

10. CROTON: an automated and variant-aware deep learning framework for predicting CRISPR/Cas9 editing outcomes.