Shang Liu1,2, Hailiang Cheng1,2, Javaria Ashraf1,3, Youping Zhang1,2, Qiaolian Wang1,2, Limin Lv1,2, Man He1, Guoli Song4,5, Dongyun Zuo6,7.
Abstract
BACKGROUND: Upland cotton is the world's largest source of natural fiber. During fiber development, fiber quality and yield are influenced by gene transcription, so revealing sequence features related to transcription can have a profound impact on cotton molecular breeding. We applied convolutional neural networks to predict gene expression status from the sequences of gene transcription start regions. A gradient-based interpretation and an N-adjusted kernel transformation were then implemented to extract sequence features contributing to transcription.
Keywords: Convolutional neural network; Cotton fiber; Model interpretation; Motif detection; Transcription
Year: 2022 PMID: 35291940 PMCID: PMC8922751 DOI: 10.1186/s12859-022-04619-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Selected models for three stages
| Developmental stage | Number of filters | Kernel size | Max-pooling size | Average accuracy |
|---|---|---|---|---|
| Initiation | 24 | 14 | 10 | 0.8 |
| Elongation | 32 | 22 | 10 | 0.8 |
| SCW | 24 | 16 | 10 | 0.78 |
Parameters and mean accuracy of the selected models for the three developmental stages. In each stage, the selected model had the highest mean accuracy among all tested parameter combinations. SCW, secondary cell wall
Fig. 1 Accuracies and ROC curves of the models with a single convolutional layer and the models in maize for the three stages. a Accuracies of the models in the three stages; blue bars represent the accuracies of our models and orange bars represent the accuracies of the models in maize. b ROC curves of our model and the model in maize for fiber initiation; the X-axis is the FPR (false positive rate) and the Y-axis is the TPR (true positive rate). The red line is the model in maize and the blue line is our model. c ROC curves of the models for fiber elongation. d ROC curves of the models for SCW
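The ROC curves in Fig. 1 plot TPR against FPR as the decision threshold is varied. The following is a minimal sketch of that construction in plain Python, not the authors' code; the function names `roc_points` and `auc` and the labels and scores are hypothetical, and the input is assumed to contain at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    Assumes y_true holds 0/1 labels with at least one of each class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = expressed, 0 = low expressed) and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
```

A curve that hugs the upper-left corner gives an area close to 1, while a classifier that ranks examples at random gives an area near 0.5.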
Fig. 2 Average effect of each position in the input sequences. The upper plot shows the effects of 991 input positions, excluding the 10 nt upstream and downstream of the transcription start site. The scatter plot below it shows the effects of the upstream and downstream 10 nt
Fig. 3 Pipeline for interpreting CNN models by adjusted kernel transformation. Kernels from the CNN models are transformed into position weight matrices (PWMs) with the N-regularized method. The transformed kernels are then aligned with known plant motifs to detect both annotated and de novo sequence features
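To make the kernel-to-matrix step more concrete, the sketch below converts convolutional kernel weights into a position probability matrix with a plain softmax over the four nucleotides. This is a generic illustration and not the N-regularized method named in the caption, whose details are not given here. The column order (A, C, G, T, N), the choice to simply drop the N column, and the example weights are all assumptions.

```python
import math

NUCLEOTIDES = "ACGT"  # assumption: the N channel is the fifth (last) column


def kernel_to_ppm(kernel):
    """Convert kernel weights to a position probability matrix.

    kernel: list of rows, one per kernel position, each with five weights
    ordered A, C, G, T, N (assumed layout). A softmax over A, C, G, T is
    applied per position; the N weight is dropped (placeholder handling).
    """
    ppm = []
    for row in kernel:
        acgt = row[:4]
        top = max(acgt)
        exps = [math.exp(w - top) for w in acgt]  # subtract max for stability
        total = sum(exps)
        ppm.append([e / total for e in exps])
    return ppm


def consensus(ppm):
    """Most probable nucleotide at each position."""
    return "".join(NUCLEOTIDES[row.index(max(row))] for row in ppm)


# Hypothetical weights for a kernel of size 3
kernel = [
    [2.0, 0.1, -0.3, 0.0, 0.5],
    [0.0, 0.0, 1.5, -1.0, 0.2],
    [-0.5, 0.3, 0.1, 2.2, 0.0],
]
ppm = kernel_to_ppm(kernel)
motif = consensus(ppm)
```

A matrix in this form can be written out in a standard motif format and compared against known plant motifs with a motif comparison tool, which corresponds to the alignment step in the pipeline.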
Fig. 4 Model interpretation results from the N-adjusted kernel transformation. a Visualization of the effects of N in model0 during initiation. Values annotated in the heatmap give the rank of N when the 5 characters in the sequences are sorted by effect: 4 means N has the largest effect and 0 means N has the smallest effect among the 5 characters. b Distribution of the 4 transcription activators detected by kernel transformation in expressed (red line) and low-expressed (blue line) genes. c Distribution of the 6 DOF genes detected by kernel transformation in expressed (red line) and low-expressed (blue line) genes. d Comparison of features extracted by CNN interpretation and by MEME. Features extracted by the two methods were input into SVM models to predict transcription status. The SVM models were evaluated by accuracy, recall, precision and F1 score. Shaded bars represent SVMs built on features extracted by MEME and unshaded bars represent SVMs built on features detected from the CNN
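The four metrics used in panel d of Fig. 4 can all be derived from the confusion counts of a binary classifier. The sketch below shows the standard definitions in plain Python; the function name `classification_metrics` and the example labels are hypothetical and are not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 score for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = expressed, 0 = low expressed
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Reporting precision and recall alongside accuracy matters when the two classes are unbalanced, because accuracy alone can look high for a classifier that mostly predicts the majority class.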
Fig. 5 Heatmap of the transcriptional abundance of DOF5.1 in upland cotton. Mean FPKM at each timepoint was calculated from 3 replicates, and log2(FPKM + 1) was used for heatmap visualization
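The transformation described in the Fig. 5 caption can be sketched as follows. The order of operations (average the replicates first, then apply log2 with a pseudocount of 1) is read from the caption and is an assumption; the timepoint labels and FPKM values are hypothetical.

```python
import math
from statistics import mean


def log2_mean_fpkm(replicates):
    """Return log2(mean FPKM + 1) for one gene at one timepoint."""
    return math.log2(mean(replicates) + 1)


# Hypothetical FPKM values: three replicates per timepoint
fpkm_by_timepoint = {
    "t1": [3.0, 3.0, 3.0],
    "t2": [14.0, 15.0, 16.0],
    "t3": [0.0, 0.0, 0.0],
}

# One heatmap row: transformed expression value per timepoint
heatmap_row = {tp: log2_mean_fpkm(values)
               for tp, values in fpkm_by_timepoint.items()}
```

The pseudocount keeps the transform defined when a gene has an FPKM of 0, which maps to 0 on the log scale.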