Huatao Liu1, Kai Xing1, Yifan Jiang1, Yibing Liu1, Chuduan Wang1, Xiangdong Ding1. 1. National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing 100193, China.
Abstract
Fat deposition in pigs is not only closely related to pig production efficiency and pork quality but also an ideal model for human obesity. Transcriptome sequencing is widely used to study fat deposition. However, due to small sample sizes, high false positive rates, and poor consistency of results from different studies, new strategies are urgently needed. Machine learning, a new analysis method, can effectively fit complex data and accurately identify samples and genes. In this study, 36 samples of adipose tissue, muscle tissue, and liver tissue were collected from Songliao black pigs and Landrace pigs, and the mRNA of all the samples was sequenced. In addition, we collected transcriptome data for 64 samples in the GEO database from four different sources. After standardization and imputation of missing values in the data set comprising 100 samples, traditional differential expression analysis was carried out, and different numbers of expressed genes were selected as features for the training model of eight machine learning methods. In the 1000 replications of fourfold cross validation with 100 samples, AdaBoost performed best, with an average prediction accuracy greater than 93% and the highest mean area under the curve in predicting the high- and low-fat content groups among the eight ML methods. According to their performance-based ranks inferred by AdaBoost, 12 genes related to fat deposition were identified; among them, FASN and APOD were specifically expressed in adipose tissue, and APOA1 was specifically expressed in the liver, which could be important candidate biomarkers affecting fat deposition.
Fat deposition in pigs is not only closely related to pig production efficiency and pork quality but also an ideal model for human obesity. Transcriptome sequencing is widely used to study fat deposition. However, due to small sample sizes, high false positive rates, and poor consistency of results from different studies, new strategies are urgently needed. Machine learning, a new analysis method, can effectively fit complex data and accurately identify samples and genes. In this study, 36 samples of adipose tissue, muscle tissue, and liver tissue were collected from Songliao black pigs and Landrace pigs, and the mRNA of all the samples was sequenced. In addition, we collected transcriptome data for 64 samples in the GEO database from four different sources. After standardization and imputation of missing values in the data set comprising 100 samples, traditional differential expression analysis was carried out, and different numbers of expressed genes were selected as features for the training model of eight machine learning methods. In the 1000 replications of fourfold cross validation with 100 samples, AdaBoost performed best, with an average prediction accuracy greater than 93% and the highest mean area under the curve in predicting the high- and low-fat content groups among the eight ML methods. According to their performance-based ranks inferred by AdaBoost, 12 genes related to fat deposition were identified; among them, FASN and APOD were specifically expressed in adipose tissue, and APOA1 was specifically expressed in the liver, which could be important candidate biomarkers affecting fat deposition.
Entities:
Keywords:
biomarkers; data integration; fat deposition; machine learning; pigs
With the improvement of living standards,
people pay more and more
attention to the quality of meat. As the main source of meat, pork
is closely related to human health, and fat deposition in pigs is
closely related to pork quality and yield.[1] Therefore, it has become an important topic among scientists to
improve the meat quality and yield of pigs by exploring the mechanism
of fat deposition. Moreover, due to their physiological similarity
with humans, pigs have gradually become an ideal model animal for
the study of human obesity and metabolic syndrome.[2] Fat deposition is a dynamic equilibrium process involving
the synthesis, breakdown, and transport of fat that takes place mainly
in adipose tissue, liver, and muscle.[3] In
addition, fat deposition is temporally and spatially regulated by
multiple genes. Transcriptome sequencing data from different tissues
at different times have been widely used to explore the mechanisms
of fat deposition. However, most transcriptome studies used few replicates
and can only identify the genes with the largest changes in expression,
thus lacking the ability to detect differences at the level of biological
significance.[4] Some studies have also shown
that different methods for detecting genes with differential expression
lack sufficient statistical power and have a certain false positive
rate and false negative rate.[5] Therefore,
increasing the sample size and seeking new analysis strategies are
crucial for overcoming the limitations of traditional transcriptome
analysis.Machine learning (ML), a new big data analysis method,
can effectively
fit complex data and accurately identify samples and genes. Due to
the high flexibility of ML algorithms, it is possible to use them
in complex omics data analysis.[6] At present,
many ML algorithms are being widely applied in this field. ML algorithms
are used in biological modeling, and their performance is better than
that of traditional mathematical models.[7] Moreover, ML has proven effective in cancer prognosis analysis,
therapeutic target prediction, and drug target prediction, in which
classification functions can be used to discover new biomarkers and
new drug targets.[8] In addition, Belgian
researchers studied hundreds of children’s blood samples and
characteristics of the immune system and found that the ML algorithm
obtained an arthritis diagnosis accuracy as high as 90%.[9] In animal husbandry, ML has also begun to be
used for genomic selection, with significantly better accuracy than
traditional methods.[10] ML has also been
gradually applied in the study of the economic traits of pigs. There
are studies that have used ML to predict daily gain[11] and total number born[12] of pigs,
which showed high accuracy. However, due to the relatively high cost
and complex processing requirements of RNA sequencing, ML also faces
the problem of a small sample size. The collection of multiple samples
with similar experiments can not only preserve biological information
but also improve the effectiveness and practicality of gene expression
analysis.[5]Therefore, in this study,
transcriptome sequencing data were collected
from the major organs of fat deposition in pigs from multiple sources,
and the strategy of imputing missing values was applied to unify data
from different sources. In addition, eight ML methods were also compared
to evaluate the prediction accuracy of ML models, and genes affecting
fat deposition were predicted by the best ML method. Meanwhile, the
efficiency of traditional differential expression analysis was also
compared with that of ML methods.
Materials and Methods
Pig Samples and RNA-Seq
RNA-seq data from five sources
were used in this study (Table ). The experimental population used in this study was from
a pig breeding farm in Tianjin, China. A total of 500 Landrace (n = 341) and Songliao black (a Chinese breed, n = 159) sows were selected. The backfat thickness (5 cm between the
third and fourth ribs) of live pigs (∼100 kg body weight) was
measured by B-ultrasound in vivo as an index of fat deposition, because
the backfat thickness was highly positively correlated with the fat
deposition content.[13] For each breed, six
individuals with the highest and lowest backfat thicknesses were sampled.
Adipose tissue, muscle tissue, and liver tissue samples were collected
from these 24 individuals, and 36 samples were selected according
to sample quality, including 16 Songliao black pig samples and 20
Landrace pig samples (Table S1). The numbers
of samples in the high- and low-fat content groups of each breed were
equal.
Table 1
Sample Information from Different
Sources
source
breed
tissue
fat content index
high
group
low group
ours
Landrace, Songliao black pig
adipose, muscle, liver
backfat thickness
18
18
GSE61271
Duroc × Göttingen Minipigs
adipose
obesity index
12
12
GSE144780
Italian Large White
muscle
intramscular
fat content
6
6
GSE116951
Iberian pig
muscle
intramscular fat content
6
6
GSE122349
Pulawska breed
adipose
backfat thickness
8
8
All animal studies were evaluated and authorized by
the Institutional
Animal Care and Use Committee (IACUC). The whole procedure for samples
collected was carried out in strict accordance with the protocol approved
by the IACUC at the China Agricultural University. The IACUC of the
China Agricultural University specifically approved this study (permit
number DK996).We extracted RNA from 36 samples and sequenced
mRNA using the Illumina
HiSeq 2000 sequencing platform after sample processing. IlluQC.pl
(NGS QC Toolkit)[14] was used for quality
control of the sequenced reads, and HISAT2[15] was used for fast and accurate sequence alignment. Finally, SAMtools[16] and FeatureCounts[17] were used to transform the transcriptome gene expression count file
in order to obtain the gene expression profile in each tissue sample.In addition, we also collected similar transcriptome data from
64 samples in the GEO database from four different sources (Table ), including adipose
tissue samples of 24 Duroc × Gottingen minipigs[18] and 16 Pulawska pigs[19] and muscle
tissue samples of 12 Italian Large White pigs[20] and 12 Iberian pigs.[21] The samples were
screened and grouped according to their phenotypic information, including
the obesity index, intramuscular fat content, and backfat thickness.
Samples from each source were divided into two groups (high- and low-fat
contents or obesity indexes), and the numbers of samples in the groups
were equal.
Data Standardization and Imputation
Data standardization
was first carried out to make the five different sources comparable.
Each data set was transformed into fragments per kilobase per million
mapped fragments (FPKM) values in a unified manner. Then, the data
were combined, and the gene names were transformed according to the
pig reference genome (Sscrofa11.1). The genes with gene symbols were
retained, and the genes with missing rates greater than 20% were excluded.
A variety of strategies were implemented to impute the remaining missing
values.Ten imputation methods (MINIMUM,[22] stochastic minimal value (MINPROB),[23] row median (ROWMEDIAN),[24] singular
value decomposition (SVD),[25] maximum likelihood
estimation (MLE),[26] sequential imputation
(IMSEQ),[27] robust sequential imputation
(IMPSEQROB),[28] K-nearest neighbor (KNN),[25] sequential KNN (SEQ-KNN),[29] and quantile regression (QR)[23]) were compared. MINIMUM, MINPROB, and ROWMEDIAN are simple and fast
because they are the minimum, random, and median values, respectively,
to directly replace missing values. SVD, MLE, IMPSEQ, and IMPSEQROB
consider the global structure of the gene matrix, decompose the data
matrix or minimize the determinant of the covariance and then iteratively
impute the missing values. KNN, SEQ-KNN, and QR consider only values
near the missing value and impute missing data on the basis of local
similarity of the gene expression profile. Four evaluation criteria,
the average correlation coefficient between the original value and
imputed value (ACC_OI), NRMSE, NRMSE-based SOR, and PSS, were used
to evaluate the efficiency of data imputation.[30] After data imputation, the batch effect of five different
sources was corrected using the R package combat.[31] PCA and cluster analysis were carried out on the data before
and after removing the batch effect to show the batch effect.
Differentially Expressed Gene Analysis
Differential
expression analysis was performed on the data after adjusting for
batch effects. Limma,[32] which can fit linear
models of gene modes to gene expression data in order to detect differential
expression, was used to identify the DEGs. In the merged data of five
different sources, the groups were the same as previously set. In
addition, DEG analysis was also conducted for individual sources.
According to the results of differential expression analysis, genes
were ranked according to P value, and genes with
a P value less than 0.05 were regarded as differentially
expressed.
Machine Learning
To further screen for candidate genes
affecting fat deposition, we performed ML analysis based on the results
of differential expression analysis. The whole process of ML analysis
is illustrated in Figure . The 500, 1000, 2000, and 3000 genes with the most significant P values from differential expression analysis were chosen
as selection features to facilitate better ML model training. Meanwhile,
all the genes (6658) detected in the differential expression analysis
were also chosen as selection features. In addition, organization
type was also added to the data set in the form of a numerical value
as the feature vector of the samples. For these five cases, we conducted
ML model training and evaluation. To achieve better and faster convergence
of the models, normalization and standardization of the constructed
data sets were applied; that is, the expression level of each gene
was scaled to 0–1, and the variances of all genes were equalized.
For the 100 samples, 1000 replications of fourfold cross validation
(CV) were carried out to evaluate the ML model. For each instance
of CV, 75 and 25 samples were used to build the classification model
and to evaluate the accuracy of the model, respectively. The prediction
accuracy of the ML model was the rate of correct sample classification
in the validation population. We fine-tuned the hyperparameters of
the ML model manually to improve the accuracy of the model prediction.
Figure 1
Workflow
of machine learning analysis.
Workflow
of machine learning analysis.To find the ML model that best fits the data in
this study, we
tested eight commonly used classification models (Linear Support Vector
Classification (LinearSVC),[33] Radial Basis
Function Kernel Support Vector Machine (RBF SVM),[34] RandomForest,[35] Nearest Neighbors,[36] Gaussian Process,[37] Decision Tree,[38] Neural Network,[39] and AdaBoost[40]).
These models are fully supervised ML classification models, including
linear, nonlinear, and integrated ensemble methods. According to the
accuracy of the ML model, the optimal training features, the optimal
training model, and the optimal parameter combination of the model
were determined. In addition, for the feature numbers with the highest
accuracy, ROC curves were drawn for each of the eight models to further
evaluate model quality.The best model was selected, and all
100 samples were reanalyzed
using the model. Each gene was ranked by RFE[41] to screen out the most important genes for model classification.
For this study, the higher the rank of the genes based on RFE, the
more likely they were to determine whether a sample was classified
into a high- or low-fat content group, indicating that these genes
play important roles in fat deposition.The ML model was developed
and RFE was applied using the package
Scikit-learn V.1.0. All steps were performed using Python V.3.9.6.
NumPy V.1.22 and pandas V.1.3.4 were used for data collation and basic
statistical calculations, respectively.
Gene Function Analysis
Gene ontology (GO) and Kyoto
Encyclopedia of Genes and Genomes (KEGG) analyses were performed on
the top 100 genes screened by the best ML methods, KOBAS[42] was used to conduct enrichment analysis of four
features (cellular components, molecular functions, biological processes,
and pathways) for the top 100 genes, and a false discovery rate (FDR)-corrected P value less than 0.05 was considered significant. In addition,
ClueGO[43] in Cytoscape software[44] was further applied to the top 100 genes to
detect relationships between different enrichment pathways. Candidate
genes were further selected in combination with the annotation of
Sscrofa11.1. In addition, enrichment analysis was also conducted for
the top 100 genes screened by P value ranking in
DEG analysis and all the DEGs. We also analyzed the expression of
candidate genes in 72 different tissue samples (7096 in total) from
pigs[45] to discover their unique expression
patterns.
Results
Standardization of Gene Expression Data from Different Sources
The numbers of expressed genes in the samples from five sources
were 14975, 15455, 9568, 14971, and 14875, among which 8317 genes
overlapped (Figure S1). After merging data
from the five sources and imputing missing values, 6658 overlapping
genes remained, only 5% of all gene expression values were missing,
and the distribution of missing values was relatively concentrated
(Figure S2). Efficient imputation of missing
values can improve the accuracy of subsequent data analysis. Table shows the imputation
performance of 10 data imputation methods. Among them, IMSEQROB performed
the best; it ranked 1st for all four evaluation criteria, yielding
the highest average correlation coefficient (0.99) and the lowest
normalized root mean squared error (NRMSE), NRMSE-based sum of ranks
(SOR), and Procrustes sum of squared errors (PSS). A similar approach
to IMSEQROB, SEQKNN, ranked 2nd in the overall evaluation except in
terms of the average correlation coefficient, for which it ranked
third. This method yielded a higher NRMSE than the IMPSEQ method,
and IMPSEQ was the third best among the 10 data imputation methods.
Single-value replacement methods are simple and quick but performed
poorly in this study, with MINIMUM and ROWMEDIAN ranking at the bottom
of the middle. In all scenarios, MLE performed the worst; it ranked
last for each evaluation criterion, generating the lowest average
correlation coefficient (an extremely low value of 0.01) and highest
NRMSE, NRMSE-based SOR, and PSS. Therefore, we chose the IMPSEQROB
method for the imputation of missing data.
Table 2
Comparison of Different Methods for
Imputing Missing Values
methods
Cor_mean
NRMSE
PSS
SOR
NRMSE_Rank
SOR_Rank
ACC_OI_Rank
PSS_Rank
Rank_Mean
IMPSEQROB
0.9936
0.3640
2.00 × 10–5
758
1
1
1
1
1
SEQKNN
0.9879
0.5124
4.00 × 10–5
1290
3
2
2
2
2.25
IMPSEQ
0.9874
0.4118
6.00 × 10–5
1330
2
3
3
4
3
KNNMETHOD
0.9871
0.5398
5.00 × 10–5
1490
4
4
4
3
3.75
ROWMEDIAN
0.9554
0.8981
0.00018
2279
5
5
5
5
5
MINIMUM
0.8663
1.0496
0.00154
3148
7
6
6
6
6.25
MINPROB
0.7920
1.0505
0.00215
3489
8
7
8
7
7.5
SVDMETHOD
0.6583
1.0002
0.0034
3608
6
8
9
9
8
QRILC
0.8125
2.9811
0.00283
3668
9
9
7
8
8.25
MLE
0.0124
2351.45
0.01001
4680
10
10
10
10
10
In the merged data after imputation, a large batch
effect was indicated
among the five different data sources, as shown in Figure , with different data sources
clustered onto different branches. Moreover, tissue and breed effects
were also detected in the data clustering analysis. Figure A further illustrates that
all samples were obviously divided into three groups based on principal
component analysis, and most of the samples in each group came from
the same source, indicating heterogeneity in the data. After correcting
for the batch effect by combat, the batch effect, tissue effect, and
breed effect were all reduced, as illustrated in Figure B. The range of gene expression
values in the samples decreased from 10,000 to 8000, PC1 decreased
from 57.87 to 45.13% (Figure ), and the samples were more uniform after standardization.
Figure 2
Cluster
diagrams of genes before (A) and after (B) removing the
batch effect.
Figure 3
Distribution and PCA before and after removing the batch
effect
(A,C) is the data distribution and PCA clustering diagram of the original
data. (B,D) is the data distribution and PCA cluster graph after removing
the batch effect.
Cluster
diagrams of genes before (A) and after (B) removing the
batch effect.Distribution and PCA before and after removing the batch
effect
(A,C) is the data distribution and PCA clustering diagram of the original
data. (B,D) is the data distribution and PCA cluster graph after removing
the batch effect.
Analysis of Differentially Expressed Genes in the Merged Data
Set
After standardization and removal of the batch effect,
235 differentially expressed genes (DEGs, P value
<0.05) were identified by limma. Figure illustrates that the DEGs were mostly downregulated
in expression in the high-fat content group and upregulated in expression
in the low-fat content group. The 10 genes with the smallest P values are shown in Figure B, and the gene with the largest fold change was FASN
(logFC = 244.5, P value = 0.016). In addition, 280,
2048, 577, 931, and 2088 DEGs were also identified for individual
sources by limma, while no overlap was found among these DEGs (Figure S3), implying that it is difficult to
find candidate genes by summarizing results from different data sources.
Most of the 235 DEGs in the merged data set overlapped with those
from different sources, and only 33 of them could not be found through
differential expression analysis of a single source.
Figure 4
Distribution (A) and
Volcanogram (B) of differentially expressed
genes (DEGs) (A) Expression distribution of differentially expressed
genes and the clustering of samples according to gene expression.
(B) Situation of the differential genes. The x-axis
represents the multiple of difference, which is denoted by log2FoldChange.
The larger the absolute value is, the larger the multiple of difference
is. The y-axis represents the significance of the
difference, which is denoted by −log 10 (P-value). The larger the value is, the more significant the difference
is. The panel shows the names of the top 10 genes with the most significant
differences.
Distribution (A) and
Volcanogram (B) of differentially expressed
genes (DEGs) (A) Expression distribution of differentially expressed
genes and the clustering of samples according to gene expression.
(B) Situation of the differential genes. The x-axis
represents the multiple of difference, which is denoted by log2FoldChange.
The larger the absolute value is, the larger the multiple of difference
is. The y-axis represents the significance of the
difference, which is denoted by −log 10 (P-value). The larger the value is, the more significant the difference
is. The panel shows the names of the top 10 genes with the most significant
differences.
Comparison of Eight Machine Learning Models
For different
ML models, parameter tuning was carried out, and the optimal parameter
combination was selected according to the results of CV (Table S2). Table shows the prediction accuracies of eight ML models
under different characteristic gene scenarios. The accuracies of the
eight ML models showed almost the same trend for all tested numbers
of selection feature genes. AdaBoost yielded the highest accuracy
(more than 90%) in predicting the high- and low-fat content groups,
and Nearest Neighbors yielded the lowest accuracy. Similar to that
of Nearest Neighbors, the prediction accuracies of Linear SVM, RBF
SVM, Gaussian Process, and Neural Net s were lower than 80% in all
scenarios, Random Forest yielded accuracies greater than 80%, and
Decision Tree performed similarly to AdaBoost. On the other hand,
when the top 2000 genes were selected, AdaBoost performed better than
when other feature numbers were selected, yielding an average prediction
accuracy of 93 and 93.4% in the high- and low-fat content groups,
respectively, and the narrowest 95% confidence intervals of 84–100%
for both the high- and low-fat content groups. Moreover, the receiver
operating characteristic (ROC) curves of the 8 ML methods further
illustrated that AdaBoost performed best (Figure ). The area under the ROC curve (AUC) of
AdaBoost was much higher than those of the other models. The ROC curve
of the AdaBoost model also had minimal variance, indicating that the
model was more stable than the other models for different data sets
under CV. Therefore, we selected the AdaBoost model and the top 2000
genes to train all samples and to rank the genes in terms of importance
based on recursive feature elimination (RFE).
Table 3
Comparison of the Accuracy and Confidence
Interval (CI) of Eight Machine Learning Models with Different Feature
Numbers on Fat Content Classification in Four-Fold Cross Validation
of 1000 Replicatesa
feature
accuracy and 95% CI
accuracy and 95% CI
Top500 genes
linear SVM
H: 75.3 (56.0–88.0)
Gaussian Process
H: 70.3 (52.0–84.0)
L: 75.5 (60.0–88.1)
L:
69.9 (52.0–84.0)
RBF SVM
H: 70.9 (44.0–76.0)
Decision Tree
H: 90.8 (80.0–100.0)
L: 70.8 (44.0–76.0)
L:
90.8 (80.0–100.0)
random forest
H: 87.4 (76.0–100.0)
Neural Net
H: 78.1 (64.0–92.0)
L: 87.7 (76.0–100.0)
L: 77.0 (60.0–92.0)
nearest neighbors
H: 62.6 (44.0–80.0)
AdaBoost
H: 92.9 (80.0–100.0)
L: 62.7 (44.0–80.0)
L:
92.9 (84.0–100.0)
Top1000 genes
linear SVM
H:
74.1 (56.0–92.0)
Gaussian Process
H: 68.4 (52.0–84.0)
L: 74.4 (59.9–88.0)
L:
64.4 (48.0–80.0)
RBF SVM
H: 66.7 (52.0–80.0)
Decision Tree
H: 92.5 (80.0–100.0)
L: 66.6 (51.9–84.0)
L:
92.4 (80.0–100.0)
random forest
H: 85.8 (72.0–96.0)
Neural Net
H: 73.6 (56.0–88.0)
L: 85.8 (68.0–96.0)
L:
74.2 (56.0–84.0)
nearest neighbors
H: 59.3 (40.0–76.0)
AdaBoost
H: 92.9 (84.0–100.0)
L: 59.1 (40.0–76.0)
L:
93.1 (84.0–100.0)
Top2000 genes
linear SVM
H:
70.8 (52.0–88.0)
Gaussian Process
H: 64.0 (48.0–76.0)
L: 70.6 (52.0–88.0)
L:
58.7 (44.0–72.0)
RBF SVM
H: 62.8 (48.0–80.0)
Decision Tree
H: 92.0 (80.0–100.0)
L: 62.7 (44.0–80.0)
L:
91.8 (80.0–100.0)
random forest
H: 86.0 (72.0–100.0)
Neural Net
H: 69.3 (52.0–88.0)
L: 86.4 (72.0–100.0)
L: 69.3 (52.0–84.0)
nearest neighbors
H: 56.2 (40.0–72.0)
AdaBoost
H: 93.0 (84.0–100.0)
L: 56.0 (40.0–72.0)
L:
93.4 (84.0–100.0)
Top3000 genes
linear SVM
H:
66.8 (48.0–84.0)
Gaussian Process
H: 64.8 (48.0–80.0)
L: 66.8 (48.0–84.0)
L:
64.5 (48.0–80.0)
RBF SVM
H: 60.1 (44.0–76.0)
Decision Tree
H: 92.0 (80.0–100.0)
L: 60.1 (44.0–76.0)
L:
92.1 (80.0–100.0)
random forest
H: 84.7 (68.0–96.0)
Neural Net
H: 64.6 (48.0–84.0)
L: 84.6 (68.0–96.0)
L:
64.2 (48.0–84.0)
nearest neighbors
H: 53.7 (36.0–68.0)
AdaBoost
H: 92.4 (80.0–100.0)
L: 53.4 (36.0–68.0)
L:
92.7 (84.0–100.0)
all 6658 genes
linear SVM
H: 54.9 (36.0–76.0)
Gaussian Process
H: 48.1 (40.0–56.0)
L: 54.8 (36.0–72.0)
L:
52.2 (44.0–60.0)
RBF SVM
H: 52.7 (36.0–68.0)
Decision Tree
H: 91.7 (80.0–100.0)
L: 52.8 (36.0–68.0)
L:
91.8 (80.0–100.0)
random forest
H: 86.5 (72.0–100.0)
Neural Net
H: 54.3 (36.0–72.0)
L: 86.6 (72.0–100.0)
L: 53.3 (36.0–72.0)
nearest neighbors
H: 49.0 (32.0–64.0)
AdaBoost
H: 92.1 (80.0–98.0)
L: 48.9 (32.0–64.0)
L:
92.0 (84.0–100.0)
Models are sorted based on the average
accuracy of all the groups. H: high-fat group, L: low-fat group.
Figure 5
Receiver operating characteristic
(ROC) curves of eight machine
learning models with 2000 selection feature genes. The figure shows
the ROC response of different data sets, created from four-fold cross
validation. The blue curve shows the mean value under different conditions,
which can represent the average performance of the model.
Receiver operating characteristic
(ROC) curves of eight machine
learning models with 2000 selection feature genes. The figure shows
the ROC response of different data sets, created from four-fold cross
validation. The blue curve shows the mean value under different conditions,
which can represent the average performance of the model.Models are sorted based on the average
accuracy of all the groups. H: high-fat group, L: low-fat group.
Identification of Genes with Unique Expression Patterns
We ranked 2000 genes and tissue factors involved in the analysis
according to importance by applying RFE. Among the top 100 genes with
the highest importance, only 16 were differentially expressed (Figure ), implying that
ML is quite different from traditional differential expression analysis,
and the top three genes were EFCAB7, ZDHHC18, and LRPPRC (Table S3). In addition, the functional enrichment
of the top 100 genes screened by ML was quite different from that
of all DEGs or the top 100 DEGs. There were no common enrichment items,
and the DEGs and top 100 differentially expressed enrichment items
were not directly related to fat development (Figures and S4).
Figure 6
Venn map of
differential expression analysis and machine learning
The figure shows the distribution of the top 100 genes by machine
learning and differentially expressed genes.
Figure 7
Gene enrichment analysis of top 100 genes in machine learning
(A).
The enrichment of top 100 genes in machine learning. The y-axis represents the gene enriched entries according to the GO and
KEGG analyses, and the x-axis represents −Lg(P value) of the enriched entries or path. The color of the
bar is the same as the color in the circular network. (B) Gene-enriched
items and the relationships between the items according to ClueGO.
The lines between the dots indicate the presence of common genes between
items.
Venn map of
differential expression analysis and machine learning
The figure shows the distribution of the top 100 genes by machine
learning and differentially expressed genes.Gene enrichment analysis of top 100 genes in machine learning
(A).
The enrichment of top 100 genes in machine learning. The y-axis represents the gene enriched entries according to the GO and
KEGG analyses, and the x-axis represents −Lg(P value) of the enriched entries or path. The color of the
bar is the same as the color in the circular network. (B) Gene-enriched
items and the relationships between the items according to ClueGO.
The lines between the dots indicate the presence of common genes between
items.Through functional enrichment analysis of the top
100 genes screened
by ML, the two most significant items directly related to the formation
of fat were identified, namely, ether lipid metabolism and lipid catabolic
process, which were associated with the genes PLA2G6, PLA2G7, PLD4, and PLD1 and PLA2G6, PLCB1, PLD1, and PLA2G7, respectively (Figure ). Additionally, several gene groups were also involved in lipase
activity and energy metabolism. Finally, 12 genes (IFIT1, ZDHHC18, FASN, PLA2G6, PLA2G7, PLCB1, PLD4, PLCG2, PLD1, APOA1, APOD, and APOOL) were found to
be associated with fat growth and development and could be candidate
genes for the regulation of fat deposition (Table S3). Among them, ZDHHC18, IFIT1, and FASN ranked 2nd, 4th, and 10th, respectively,
in the ML model (Table S3). Figure further illustrates the expression
of these 12 candidate genes in 12 main tissues from 72 samples of
pigs. FASN and APOD were specifically
expressed in adipose tissue. APOA1 was specifically
expressed in the liver. In muscle tissue, the expression levels of
these candidate genes were very low.
Figure 8
Expression of candidate genes in different
tissues of pigs. The
expression level of genes is the standardized TPM value, and the expression
level in different tissues is the mean value of different samples,
then scaled to 0–1.
Expression of candidate genes in different
tissues of pigs. The
expression level of genes is the standardized TPM value, and the expression
level in different tissues is the mean value of different samples,
then scaled to 0–1.
Discussion
Comprehensive analysis of data is considered
a key method for extracting
the most effective information from different genomic data sets, which
is conducive to the discovery of important biological phenomena.[46] At present, there are two different comprehensive
analysis strategies: meta-analysis and data combination. When there
is large heterogeneity among original studies and the number of studies
is small, random effects cannot be fully considered in the model,
resulting in invalid conclusions with type I error and leading to
low consistency among studies.[47] This point
was confirmed by our findings, in which the DEGs from five different
sources showed poor consistency (Figure S3). Therefore, such meta-analysis is not applicable to these data
sets. The data combination method involves combining samples from
different sources to enlarge the data set and then analyzing the newly
combined data set.[48] The advantages of
data combination over the meta-analysis method mainly lie in the greater
statistical significance of the results obtained by analyzing the
combined large sample sets and the more rigorous inference results.[49] Our results confirmed that combining data from
five different sources yielded the most DEGs obtained through single-source
analysis, and no common DEGs were found among single-source analyses
even though many DEGs were identified. In addition, in order to expand
the sample size as much as possible to meet the training requirements
of machine learning, we put three kinds of important tissue samples
directly related to fat deposition in the same data set for joint
analysis.When combining data, some key issues must be solved
to unify the
data, for example, batch effects and missing values. In this study,
batch and tissue type effects were found to influence the uniformity
of the data. Many studies have shown that ComBat adjustment of data
results in improved statistical power and control of false positives
in differential expression analysis compared to those of data adjustment
by other available methods.[50] The empirical
Bayes method in ComBat was adopted to eliminate the effect of covariates
for batch effect correction. Our results indicated that the batch
effect was significantly corrected (Figures and 3). Although
a good trial design can reduce batch effects, it is difficult to eliminate
systematic bias completely;[51] therefore,
we incorporated the tissue type of the sample as a feature into the
ML training model to reduce bias. Regarding missing values, we compared
the imputation effects of various methods and showed that Global Structure
Approach IMPSEQROB ranked first in all evaluations because it fully
considers the relationships between genes, which is more in line with
biological characteristics. Studies have shown that when processing
biological data such as protein expression data, IMPSEQROB has a higher
completion effect on missing values, and the distribution of filled
data is more similar to that of real data.[30]There is no perfect ML algorithm that can solve all problems;
instead,
ML algorithms must be tailored to different data.[52] In this study, we evaluated almost all popular ML algorithms
through CV. Compared with SVM and neural network algorithms widely
used in the biomedical field, the AdaBoost algorithm performed best
in this study, which further indicates that different algorithms should
be used for specific problems. The neural network approach has better
model complexity and can fit complex data more accurately than other
approaches. However, for this study, the sample information complexity
was high, but the number of training samples was relatively small,
and the samples were not as heterogeneous as tumor samples, so overfitting
was easily caused by using an overly complex model. By training different
weak classifiers, the AdaBoost algorithm integrates these weak classifiers
to form a strong classifier, which can solve the problem of complex
structures with high classification accuracy. The subclassifier is
a CART decision tree. As a binary classifier, it has a simple structure.
It not only has a very fast training speed but also can adapt well
to the characteristics of small training sets such as that used in
this study. The most important feature is that the whole model is
not easy to overfit, and its iteration process has the effect of increasing
the margin.[53] The results also indicate
that higher model complexity in ML is not necessarily better. For
data with a small sample size and complex information, an integrated
ML method similar to AdaBoost can be adopted. In addition, we found
that the accuracy of the decision tree was also quite high (>90%),
as its algorithm is similar to AdaBoost’s basic classifier,
but both the accuracy and precision of the decision tree were lower
than those of AdaBoost. Therefore, the integration of multiple identical
models can not only effectively improve accuracy but also improve
precision. In addition, the parameters of the model can directly affect
its prediction accuracy.[54] In this study,
only a few parameters were manually adjusted for each model, and almost
all models performed best under their default parameters (Table S4). Other parameters were not tested one
by one in this study, and the selected parameters may not be the best
ones. However, in this study, AdaBoost’s AUC was 0.96, almost
as high as expected (Table and Figure ).Sample feature selection is needed for training ML models,
but
there is currently no unified screening standard for sample features.
In this study, the top 500, 1000, 2000, and 3000 genes based on differential
expression analysis were used; in addition, all genes were also used
for feature selection. Our results showed that the highest accuracy
was obtained for ML with the top 2000 genes (Table ). We noticed that if feature selection is
not carried out, almost all ML models have the lowest prediction accuracy
because the feature number is far larger than the sample number, which
indeed causes serious overfitting. Therefore, implementing differential
expression analysis for feature selection is straightforward and useful.
On the other hand, the determination of candidate genes in ML is different
from that in traditional differential expression analysis. The genes
were ranked in terms of importance by using RFE and repeated model
building, and the most important genes contributed more to ML model
classification. Differential expression analysis can identify candidate
genes only by significance or multiple differences, with a certain
rate of false positives. RFE is not only suitable for multiple models
but also can accurately rank each gene, providing a new method for
screening important genes, and it has been proven to be effective
in relevant studies.[55]The enrichment
items of the top 100 genes screened by the AdaBoost
algorithm showed high relevance to fat deposition, indicating that
this method is effective in screening candidate genes. We further
screened 12 candidate genes, all of which have been shown to be involved
in regulating fat deposition. PLA2G6 and PLA2G7 catalyze the hydrolysis of phospholipids (PLs) to
generate fatty acids, and their abnormal expression can cause dysregulated
lipid metabolism.[56]PLD1 regulates cytosolic lipid droplet formation, and increased expression
of PLD1 increases lipid droplet formation.[57] Phospholipase D4 (PLD4) affects
phospholipase activity in mice, which in turn affects fat deposition.[58] Deficiency of PLCB1 leads to myotonic dystrophy
because pi-PLC β1 is involved in adipogenesis through a double
phase mechanism.[59] Studies have shown that PLD1, PLCB1, and ZDHHC18 are highly relevant to lipid phenotypes.[60] A decrease in lipid droplet content was accompanied by a decrease
in IFIT1 expression, and IFIT1 affects
the metabolism of fatty acids by regulating fat oxidation.[61] Fatty acid synthase (FASN)
is a key enzyme in the synthesis of fatty acids in mammals and predominantly
generates straight-chain fatty acids using acetyl-CoA as the initiating
substrate. It is directly involved in the regulation of fat formation.[62] Apolipoprotein A1 (ApoA1) has
been verified to play a vital role in modulating lipid metabolism
and homeostasis both in plasma and in cells, consequently affecting
fat deposition.[63] Similar to APOA1, an important aspect of APOD’s role in lipid
metabolism appears to involve the transport of arachidonic acid and
the modulation of eicosanoid production and delivery in metabolic
tissues.[64] Overexpression of APOOL led to fragmentation of mitochondria, a reduced basal oxygen consumption
rate, and altered crista morphology. Its expression is closely related
to energy metabolism.[65]The overlap
between the top 100 genes in the ML analysis and DEGs
was less than 20% (Figure ). The Spearman correlation between gene ranks based on ML
and DEG analysis was only 0.046. Therefore, ML is different from differential
gene expression analysis. According to the results of functional enrichment
analysis of the top 100 genes and DEGs, the two most significant items
of the former were directly related to fat deposition, while none
of the items of the latter were significantly related to fat development,
indicating that ML can yield more convincing findings (Figures and S4). In addition, ML can find useful information that differential
expression gene analysis cannot find. Of the 12 candidate genes involved
in fat deposition identified by ML, only three were DEGs, implying
a high false positive rate of DEG analysis, as pointed out in many
other studies. In contrast to the single-gene scope of DEG analysis,
ML can consider a large number of genes simultaneously and analyze
them as a whole. The nine candidate genes (PLA2G6, PLA2G7, PLCB1, PLD4, PLCG2, PLD1, APOA1, APOD, and APOOL) belong to the
same enrichment item or pathway, and most belong to the same family
of protein-coding genes. However, they were adjacent in ML rankings
(Table S4), suggesting that ML was able
to classify them as genes with similar effects. The use of AdaBoost
ML to group genes has shown extraordinary biometric capabilities that
traditional statistical analyses such as differential expression analysis
do not. This is also a unique advantage of ML, but the biological
algorithm principle needs to be further explored. Among the 12 candidate
genes found by ML, FASN, APOD, and APOA1 were highly expressed only in adipose tissue and liver
tissue (Figure ),
showing strong tissue specificity, which further indicated that they
were closely related to the occurrence of fat deposition and played
a more important role than other genes; they could be verified in
future studies.In conclusion, machine learning can efficiently
analyze large data
sets, and can find useful information that differential expression
gene analysis cannot find. This research strategy can provide ideas
for the merged analysis of large data sets. According to the results
of machine learning analysis, 12 genes including FASN, APOD, and APOA1
may be involved in the regulation of fat deposition in pigs, which
lays a foundation for further research on the molecular regulation
mechanism behind fat deposition in pigs. The results can provide a
reference for the genetic improvement of pork quality traits.
Authors: Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth Journal: Nucleic Acids Res Date: 2015-01-20 Impact factor: 16.971