Angel Ruiz-Reche, Akanksha Srivastava, Joel A Indi, Ivan de la Rubia, Eduardo Eyras.
Abstract
We describe ReorientExpress, a method to perform reference-free orientation of transcriptomic long sequencing reads. ReorientExpress uses deep learning to correctly predict the orientation of the majority of reads, particularly when trained on a closely related species or when combined with read clustering. ReorientExpress enables long-read transcriptomics in non-model organisms and in samples without a genome reference, without requiring additional technologies, and is available at https://github.com/comprna/reorientexpress.
Year: 2019 PMID: 31783882 PMCID: PMC6883653 DOI: 10.1186/s13059-019-1884-z
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 ReorientExpress deep learning models. ReorientExpress implements two deep neural networks (DNNs) to predict the orientation of cDNA long reads. (a) A multilayer perceptron (MLP) is trained on k-mer frequencies extracted from sequences of known orientation. For each test read, the orientation is predicted using the k-mer frequencies of the read as input. (b) A convolutional neural network (CNN) is trained on 500-nt sliding windows from sequences of known orientation, using one-hot encoding for each window. Prediction is performed by scoring all windows in a test read and calculating the mean score independently for each orientation
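The two input encodings named in Fig. 1 can be illustrated with a minimal Python sketch. This is not the ReorientExpress code: the function names, the value of k, the window step, the zero-padding of short reads, and the "forward"/"reverse" labels are all assumptions made for illustration. Only the use of k-mer frequencies, 500-nt one-hot windows, and per-orientation mean scores comes from the caption.

```python
from collections import Counter
from itertools import product

# One-hot code per base; any other symbol (e.g. N) maps to all zeros (assumption).
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def kmer_frequencies(seq, k=3):
    """Normalized frequencies of all 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def one_hot_windows(seq, window=500, step=250):
    """Split a read into fixed-size windows and one-hot encode each one.

    The step size and the zero-padding of reads shorter than the window
    are illustrative assumptions, not taken from the source.
    """
    seq = seq.upper()
    windows = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        encoded = [ONE_HOT.get(base, (0, 0, 0, 0)) for base in chunk]
        encoded += [(0, 0, 0, 0)] * (window - len(encoded))
        windows.append(encoded)
    return windows


def predict_orientation(window_scores):
    """Average per-window (forward, reverse) scores and pick the larger mean."""
    n = len(window_scores)
    mean_forward = sum(s[0] for s in window_scores) / n
    mean_reverse = sum(s[1] for s in window_scores) / n
    return "forward" if mean_forward >= mean_reverse else "reverse"
```

In this sketch, the k-mer frequency vector would feed the MLP, while the list of encoded windows would be scored one by one by the CNN before `predict_orientation` aggregates the scores.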
Fig. 2 ReorientExpress accuracy analysis. (a) Receiver operating characteristic (ROC) curves, representing the false positive rate (x axis) versus the true positive rate (y axis) for the prediction of the orientation of human ONT cDNA reads with the multilayer perceptron (MLP) and convolutional neural network (CNN) models trained on either the human (Hs) or the mouse (Mm) transcripts. (b) ROC curves for the prediction of the orientation of yeast ONT cDNA reads with the MLP and CNN models trained on either the S. cerevisiae (Sc) or C. glabrata (Cg) transcripts. (c) Number of clusters (y axis) according to the proportion of human ONT cDNA reads in the cluster with orientation correctly predicted by ReorientExpress (x axis) with the MLP model trained on the human transcriptome (left panel) (Hs-MLP) or the S. cerevisiae transcriptome (right panel) (Sc-MLP). Clusters with > 2 reads are shown. Similar plots for all clusters (> 1 read) and for the CNN model are given in Additional file 1: Figure S1. (d) Comparison of the proportion of human (Hs) or S. cerevisiae (Sc) cDNA reads correctly oriented in three cases: taking the default orientation from the FASTQ file (Default) in blue, using the CNN and MLP ReorientExpress models in green, and using a majority vote in clusters to predict the orientation of all reads in each cluster (ReorientExpress and clustering) in yellow. Clustering and predictions in (c) and (d) were performed with all labeled cDNA reads (see the “Methods” section). Models applied to the total set of labeled cDNA reads in this figure were trained on 50,000 randomly selected transcript sequences from the annotation, or on all of them if there were fewer (S. cerevisiae and C. glabrata)
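The majority vote in clusters described for panel (d) could be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name, the dictionary-based inputs, and the tie-breaking behaviour (the label seen first in a cluster wins a tie) are assumptions.

```python
from collections import Counter, defaultdict


def majority_vote_orientation(read_to_cluster, read_to_prediction):
    """Assign every read the most common predicted orientation in its cluster.

    read_to_cluster: dict mapping read ID -> cluster ID (hypothetical input format).
    read_to_prediction: dict mapping read ID -> per-read predicted label.
    Ties go to the label encountered first in the cluster (assumption).
    """
    votes_by_cluster = defaultdict(list)
    for read, cluster in read_to_cluster.items():
        votes_by_cluster[cluster].append(read_to_prediction[read])

    consensus = {
        cluster: Counter(votes).most_common(1)[0][0]
        for cluster, votes in votes_by_cluster.items()
    }
    return {read: consensus[cluster] for read, cluster in read_to_cluster.items()}
```

For example, a cluster of three reads with per-read predictions forward, forward, and reverse would have all three reads labeled forward, which is how clustering can correct individual mispredictions.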
Fig. 3 Read types and sequence motifs. The proportion of cDNA reads that were unambiguously mapped to each transcript type (y axis) and classified as correct (True) or incorrect (False) (x axis) by the CNN model (a) and the MLP model (b). All transcript type annotations from the autosomes and sex chromosomes with more than 10 reads mapped are represented in the plot. (c) The motifs derived from three CNN filters with significant matches to previously described RNA-binding motifs (Additional file 3): filter M24 (RBM42, q-value = 0.0165997), filter M17 (HuR, q-value = 0.0312701), and filter M9 (PCBP1, q-value = 0.0426708). As information content (y axis) is low, the axis scale is shown between 0 and 1