Yuhua Fu1,2, Jingya Xu1, Zhenshuang Tang1, Lu Wang1, Dong Yin1, Yu Fan1, Dongdong Zhang2, Fei Deng2, Yanping Zhang2, Haohao Zhang2, Haiyan Wang1, Wenhui Xing2, Lilin Yin1, Shilin Zhu1, Mengjin Zhu1, Mei Yu1, Xinyun Li1, Xiaolei Liu3, Xiaohui Yuan4, Shuhong Zhao5.
Abstract
The analyses of multi-omics data have revealed candidate genes for objective traits. However, they are integrated poorly, especially in non-model organisms, and they pose a great challenge for prioritizing candidate genes for follow-up experimental verification. Here, we present a general convolutional neural network model that integrates multi-omics information to prioritize the candidate genes of objective traits. By applying this model to Sus scrofa, which is a non-model organism, but one of the most important livestock animals, the model precision was 72.9%, recall 73.5%, and F1-Measure 73.4%, demonstrating a good prediction performance compared with previous studies in Arabidopsis thaliana and Oryza sativa. Additionally, to facilitate the use of the model, we present ISwine ( http://iswine.iomics.pro/ ), which is an online comprehensive knowledgebase in which we incorporated almost all the published swine multi-omics data. Overall, the results suggest that the deep learning strategy will greatly facilitate analyses of multi-omics integration in the future.
Year: 2020 PMID: 32913254 PMCID: PMC7483748 DOI: 10.1038/s42003-020-01233-4
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Overview of the multi-omics data used to construct the integrated swine omics knowledgebase.
| Omics | Samples | Classification | Scale | Source |
|---|---|---|---|---|
| Genome | 864 | 59 breeds | 32.88 TB | 42 projects |
| Transcriptome | 3526 | 95 tissues | 20.0 TB | 263 projects |
| QTXs | – | 89 traits | 26,357 entries | 653 studies |
Fig. 1 Schematic of the gene prioritization framework for the integrated swine omics knowledgebase.
The circles represent a list of candidate genes from GWAS or any other omics analysis. The rectangles represent positive training samples and negative training samples. The dotted box represents a CNN model trained by using variation counts, expression level, QTANs/QTALs number, and WGCNA module features of the training data. The output layer of the model shows the probability that the gene is a credible candidate gene by using the “softmax” function. The candidate genes with a probability >50% were denoted as credible candidate genes and can be ranked according to their probability.
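The final scoring step described in this caption (softmax output, a 50% cutoff, then ranking by probability) can be illustrated with a minimal Python sketch. This is not the authors' code: the gene names, the two-class logit layout (index 0 = non-credible, index 1 = credible), and the function names `softmax` and `prioritize` are assumptions made only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a sequence of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prioritize(gene_logits, threshold=0.5):
    """Keep genes whose 'credible' probability exceeds the threshold,
    ranked from highest to lowest probability.

    gene_logits maps a gene name to a pair of raw model outputs
    (non-credible, credible); this layout is an assumption.
    """
    scored = []
    for gene, logits in gene_logits.items():
        p_credible = softmax(logits)[1]
        if p_credible > threshold:
            scored.append((gene, p_credible))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical raw outputs for three made-up candidate genes.
example = {
    "GENE_A": (0.2, 1.5),
    "GENE_B": (1.0, -0.5),
    "GENE_C": (0.0, 0.3),
}
ranked = prioritize(example)
for gene, prob in ranked:
    print(f"{gene}\t{prob:.3f}")
```

With two output classes, a probability above 50% is equivalent to the "credible" logit being larger than the "non-credible" one, so the cutoff simply selects genes the model assigns to the credible class.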
Comparison of the performance of the four machine learning models for gene prioritization.
| Models | Accuracy | Precision | Recall | F1-Measure |
|---|---|---|---|---|
| LR | 0.657 | 0.686 | 0.571 | 0.623 |
| LinearSVC | 0.639 | 0.658 | 0.571 | 0.612 |
| MLP | 0.692 | 0.678 | 0.726 | 0.701 |
| CNN | 0.734 | 0.729 | 0.738 | 0.734 |
The F1-Measure values of the two linear models (LR and LinearSVC), which have strong explanatory power, were lower than those of the neural-network-based deep learning models (MLP and CNN), suggesting that the deep learning models were superior to the linear models. Between MLP and CNN, the accuracy, precision, and F1-Measure of CNN were higher than those of MLP, so CNN performed slightly better than MLP.
LR Logistic regression, LinearSVC Linear Support Vector Classifier, MLP Multi-Layer Perceptron, CNN Convolutional Neural Networks.
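The F1-Measure column can be related to the precision and recall columns through the standard definition, the harmonic mean of the two. The sketch below, which assumes that standard formula is the one used, recomputes F1 from the rounded table values; because the inputs are already rounded to three decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1-Measure) copied from the table above.
reported = {
    "LR": (0.686, 0.571, 0.623),
    "LinearSVC": (0.658, 0.571, 0.612),
    "MLP": (0.678, 0.726, 0.701),
    "CNN": (0.729, 0.738, 0.734),
}

for model, (p, r, f1_reported) in reported.items():
    f1 = f1_measure(p, r)
    print(f"{model}: recomputed {f1:.4f}, reported {f1_reported:.3f}")
```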
Fig. 2 Evaluation of the CNN model.
a The relative importance of the 14 features used in the CNN model. Except for the top five features, the relative importance of the features was <50%, so the top five features may have played important roles in gene prioritization. b The scores of candidate genes in nine real case studies. Each column represents one real case study, and each grid cell represents the score of a candidate gene. The green, pink, and gray backgrounds indicate that the candidate gene was reported as a credible candidate in the case literature, was reported in other published sources, or was not reported, respectively. CT10 and CL10 denote the top 10 and last 10 genes among the predicted credible candidate genes, and NT10 denotes the top 10 genes among the predicted non-credible candidate genes. c Proportion of credible candidate genes identified in different score ranges; genes with higher scores were more likely to be credible candidate genes. d Proportion of credible candidate genes relative to the distance of the candidate gene from the peak. The proportion of credible candidate genes was higher close to the peak than far from it, but the difference between near and far distance ranges was small, which indicates that distal regulation should be considered in the identification of credible candidate genes.
Fig. 3 Interface and general functions of the integrated swine omics knowledgebase.
a The population design module in the variation database, b the variation visualization module in the variation database, c the Heatmap page in the expression database, d the tissue expression module in the expression database, e the physical map in the QTX database, f the QTX information page in the QTX database, g the search engine of the integration database, h the gene information incorporated into the integration database, i the gene prioritization model embedded in the integration database, j the primer design module, k the JBrowser module for genome visualization, and l the BLAST module for locating target gene sequences.