Literature DB >> 36123893

Identification of signature of gene expression in biliary atresia using weighted gene co-expression network analysis.

Yongliang Wang¹, Hongtao Yuan¹, Maojun Zhao², Li Fang³.

Abstract

Biliary atresia (BA) is the most common cause of obstructive jaundice during the neonatal period. This study aimed to identify gene expression signature in BA. The datasets were obtained from the Gene Expression Omnibus database. Weighted gene co-expression network analysis identified a critical module associated with BA, whereas Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis revealed the functions of the essential modules. The high-connectivity genes in the most relevant module constructed protein-protein interaction networks via the string website and Cytoscape software. Hub genes screened by lasso regression consisted of a disease classification model using the randomforest method. Receiver operating characteristic curves were used to assess models' sensitivity and specificity and the model was verified using the internal and external validation sets. Ten gene modules were constructed by WGCNA, of which the brown module had a strong positive correlation with BA, comprising 443 genes. Functional enrichment analysis revealed that module genes were mainly involved in biological processes, such as extracellular matrix organization, cell adhesion, inflammatory response, and the Notch pathway (P < .001), whereas these genes were involved in the metabolic pathways and cell adhesion molecules (P < .001). Thirty-nine high-connectivity genes in the brown module constructed protein-protein interaction networks. keratin 7 (KRT7) and C-X-C motif chemokine ligand 8 (CXCL8) were used to construct a diagnostic model that had an accuracy of 93.6% and the area under the receiver operating curves for the model was 0.93. The study provided insight into the signature of gene expression and possible pathogenesis of BA; furthermore, it identified that the combination of KRT7 and CXCL8 could be a potential diagnostic model for BA.

Entities: Chemical

Mesh：

Substances：

Year: 2022 PMID： 36123893 PMCID： PMC9478247 DOI： 10.1097/MD.0000000000030232

Source DB: PubMed Journal: Medicine (Baltimore) ISSN： 0025-7974 Impact factor: 1.817

1. Introduction

Biliary atresia (BA) is one of the most severe diseases of the hepatobiliary system in infancy and is the most common cause of liver transplantation in children.[ BA also is the most common cause of neonatal cholestasis (25%–55%).[ The pathological changes include bile duct hyperplasia, cell infiltration, portal fibrosis,[ and the absence of sinusoidal fibrosis, culminating in cirrhosis. The early operation, including Kassai and its variants, is key to a better prognosis; thus, early diagnosis of BA is crucial.[ Diagnosis of BA was screened by clinical manifestation, laboratory examination, and imaging examination, confirmed by liver biopsy and intraoperative cholangiography.[ Nevertheless, as an invasive operation, the appliance of intraoperative cholangiography is limited. Despite the high diagnostic accuracy of liver biopsy for BA,[ some hepatobiliary diseases have histological features that overlap with BA,[ including MDR3 (multidrug resistance protein 3) deficiency disease, cystic fibrosis, doublecortin domain containing 2 disease, alpha1-antitrypsin deficiency, and parenteral nutrition-associated cholestasis, with their histology features involving duct proliferation, portal tract fibrosis, inflammation, bile plugs.[ In the early stage of disease, the classic histological changes of BA might be atypical, resulting in false-negative diagnoses.[ So a series of liver biopsies are necessary.[ Clinical features, laboratory parameters, and genetic testing are essential to arrive at a correct diagnosis, as distinguishing early BA from the above disorders by histological features is a challenge.[ As a complementary method, kinds of molecular markers, such as interleukin-33, matrix metallopeptidase 7, interleukin -8, and microRNAs, have been shown to be diagnostically effective in BA.[ Therefore, we hope to exploit varied analytic tactics to mine more potential molecular biomarkers of BA from existing data, supplement current diagnostic tools, and improve the diagnostic accuracy of BA. Weighted gene co-expression network analysis (WGCNA) is a method to analyze the complicated relationship between gene and phenotype, which has been utilized in various research of biological contexts[ and can instruct the association of the module with disease results.[ A unique advantage of the WGCNA is that it retains the continuous nature of the underlying correlation information for construction of a network based on soft thresholding of the correlation coefficient, compared to the unweighted network requiring the choice of a hard threshold.[ The data, divided into multiple groups, requires repeated pairwise comparisons and multiple hypothesis tests when performing the differentially expressed genes (DEG) analysis, whereas, unlike DEG analysis, WGCNA directly modularization relationship between gene expression and phenotype, reducing computational effort. Therefore, we utilized this method to analyze expression profiles of BA in the Gene Expression Omnibus (GEO) database and to discover the genetic signature of BA.

2. Materials and Methods

2.1. Materials

The workflow was shown in (Supplementary Digital Content 1, http://links.lww.com/MD/H91). The mRNA profiles of GSE46960 (85 liver samples from age-matched infants and 10 liver samples from adults) and GSE84954 (11 liver, 13 fat, and 13 muscle samples from children) were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/gds/?term=). All adults and nonliver samples were excluded. According to the known diagnosis, eligible samples were grouped as BA group (biliary atresia group), Non-BA group (control group for hepatobiliary diseases without the BA), and normal control group (NC group). GSE46960 contained 64 BA samples; 14 non-BA samples; 7 NC samples. In GSE84954, 11 liver samples were picked out and divided into BA group (6 samples) and non-BA group (5 samples). Samples’ detail were recorded in (Supplementary Digital Content 2, http://links.lww.com/MD/H92). Data were analyzed and plotted using R 4.12 and R Studio 1.4.1717 software. Additionally, these R-packages,“oligo,”[ “WGCNA,”[ “glmnet,”[ “pROC,”[ “randomForest,”[ “caret,”[ “dplyr,”[ “kknn”[ were used for statistical analysis. The following R-packages were used to plot: “ggplot2,”[ “Pheatmap,”[ “treemap,”[ “simplifyEnrichment,”[ “corplot.”[ String website (https://string-db.org) and Cytoscape software (v3.8.2) to construct the protein–protein interaction network. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were by the database annotation, visualized, and integrated discovery (annotated, visualized, and integrated Discovery Database [DAVID], https://david.ncifcrf.gov/).

2.2. Data processing

Normalization could eliminate the variations in expression (Intensity) caused by experimental techniques, and keep the data at the same level for each sample and parallel experiments. Downloaded microarray data were read and proceeded, including robust multichip average background correction, quantile normalization, base 2 logarithmic conversions, and probe ID transformation. The data without corresponding gene symbol or duplicate gene symbols were removed. The genes with low expression values and missing values were then filtered out. Relative expression plots of datasets are shown in (Supplementary Digital Content 3, http://links.lww.com/MD/H93).

2.3. Construction of co-expression network

The genes with the top 5000 absolute median deviations in GSE46960 have been used to construct a co-expression network via the “WGCNA” R package. First, cluster analysis was performed on the samples using the class average method to remove the outliers. According to sample cluster analysis, there were no outliers in GSE46960. Next, an appropriate soft threshold power was calculated based on the scale-free topology criterion, with the optimal soft threshold power of 12 and the corresponding scale-free R2 of 0.9. The co-expression network was constructed based on the soft threshold power, and then the cluster dendrogram of gene modules was plotted after hierarchical clustering and branching cuts. The minimum size of the module was set to 30 and CutHeight = 0.25 and the above parameters were used to merge the colser modules into new modules.

2.4. Identification of significant module and high-connectivity genes

The WGCNA package reckoned each sample’s module eigengene (ME) matrix, figured the correlation matrix and correlation P value between ME and clinical traits, plotted the “module-trait relationship” heatmap, and selected the module with the strongest correlation to the group BA. The verboseScatterplot of the engaging module was depicted based on the Module Membership (MM) value of genes and genes traits significance (GS). Following this, genes with high MM and GS were used to construct protein–protein interaction networks on the String website (https://string-db.org). The connectivity degree and combination score of genes were entered into Cytoscape software to output the protein-protein interaction (PPI) network plot.

2.5. Detection of module function

Enrichment analyses of GO and KEGG pathways for module genes were performed on the DAVID website, and a P value of < 0.05 for GO terms or KEGG terms was considered statistically significant.

2.6. Compressing variates

The expression heatmap of genes in the PPI network was presented. The number of variates needs to be compressed, as there were still too many variates that can build diagnostic models. Lasso regression was selected to compress variates. Since the NC group could be easily distinguished from the BA group by laboratory examination and clinical manifests, the NC group was no longer included in the subsequent analysis. Sixty percent of the remaining were set as a training set and 40% as the test. The training set was used to build a lasso regression model. The minimum λ value of the training set was determined by tenfold cross-validation and the minimum λ was brought into the test set to work out variates’ coefficients. Those genes retaining coefficients were viewed as nominated genes to construct a diagnostic model by randomforest; meanwhile, the correlation matrix of genes in the PPI network was plotted.

2.7. Construction of the diagnostic model

Various randomforest models were constructed by different collocations of the above-nominated genes, and their diagnostic accuracy was calculated respectively. As for the accuracy of model = 100-OOB (Out of Bag or estimated error rate), the best model was identified by comparing the parameters of models, the receiver operating characteristic (ROC) curve of the models was plotted to assess the diagnostic ability of the model, and the area under the curve of models was ciphered. The principal component analysis (PCA) of original data was compared with PCA of diagnostic model to evaluate the classification effectiveness of the model.

2.8. Validation for the model

In GSE46960, the expression level of model genes in the BA was contrasted with other groups to verify the difference in model genes between the BA and other groups. Tukey honestly significant difference (Honestly Significant Difference) was executed on the outcomes; P < .05 was considered statistically significant. For testing the diagnostic ability of the model in external data, the model predicted the grouping results of data in GSE84954 utilizing randomforest and k-nearest neighbors algorithm (KNN). Ultimately, the accuracies were calculated.

3. Results

3.1. Construction of co-expression network & identification of significant module

A systematic clustering tree of gene modules was illustrated in Figure 1A, including 10 gene modules. The label heatmap “Module-trait relationship” showed the correlation of module with clinical traits (Fig. 1B). The correlations contained positive and negative correlations, with color depth representing the strength of the correlation. The brown module (cor = 0.73, P = 8 × 10-17) had the highest positive correlation with BA among these modules. VerboseScatterplot (Fig. 1C) corresponded to the MM and GS values of genes in the brown module. In addition, the 84 genes (Supplementary Digital Content 4, http://links.lww.com/MD/H94) in the upper right quadrant of the verbosescatterplot (MM > 0.8, GS > 0.6) formed the PPI network, which is shown in Figure 1D based on connectivity degree and combination score.

Figure 1.

(A) Cluster dendrogram. The upper interface was a dendrogram of genes, and the lower was the corresponding cluster of genes. Varied colors typified various modules; a total of 10 modules were identified. The gray module was a collection of genes with no distinctive features. (B) Module-Trait relationship. Each column conformed to a clinical phenotype, and each row fitted to a module. The numbers in each grid represented the correlation coefficients between the module and the clinical trait. The numbers in brackets were parallel P value of correlation coefficients. The depth of color with grid symbolized the strength of the association between a module and a clinical phenotype; red denoted a positive association, and green represented a negative association. (C) VerboseScatterplot of the brown module. The X-axis was the MM, Y-axis was the gene significance of BA. The points in the range of the blue ellipse were genes with high connectivity. (D) Protein–protein interaction network of hub genes (PPI network). The color depth and width of the ellipse indicated the connectivity degree of a gene; the width of edge represented the combination score that reflects degree of link during genes. BA = biliary atresia, GS = genes traits significance, MM = module membership, NC = normal control group, Non-BA = control group for hepatobiliary diseases without the biliary atresia, PPI = protein–protein interaction.

3.2. Function enrichment analysis of significant module

Cellular Component enrichment analysis showed that genes mainly existed in the cytoplasm, extracellular exosome, and cytosol (Fig. 2A). The main molecular functions referred to protein binding, transcription factor activity, DNA binding, and calcium ions binding (Fig. 2B). The biological processes involved include the Notch signaling pathway, signal transduction, positive regulation of transcription from RNA polymerase II promoter, and inflammatory response. (Fig. 2C). KEGG pathway enrichment analysis showed that genes in the module were mainly involved in the metabolic pathways, axon guidance, and cell adhesion molecules (CAMs; Fig. 2D).

Figure 2.

(A) GO for CC. The X-axis denoted GO TERM, and the Y-axis represented gene counts of GO TERM. (B) GO for MF enrichment analysis. Each column or row was a BP term, and the color depth symbolized the similarity between terms, while terms with high similarity formed a cluster, and the font size of the word cloud next to the heatmap was positively proportional to the number of terms. (C) GO for BP enrichment analysis. The bubble size represented the gene counts of GO TERM, and the color deepness coincided with the P value. (D) Treemap of KEGG pathway analysis. Box size conformed to gene counts in a pathway and the color depth matched up P value. BP = biological processes, CAM = cell adhesion molecule, CC = cellular component, GO = gene ontology, MF = molecule function.

3.3. Compressing variates

Figure 3A exhibited the expression heatmap of 39 genes in the PPI networks. Thirty-nine genes were clearly expressed differently between the BA and NC groups, in contrast, a few genes were significantly differentially expressed between the BA and the non-BA groups. According to the result of lasso regression, 5 variates still owned coefficients, when minimal λ = 0.0497 (Table 1). The correlation matrix of genes in the PPI network (Fig. 3B) demonstrated the robust correlation between the 5 genes and other genes.

Figure 3.

(A) Heatmap of the hub genes in the PPI network. (B) Correlation matrix of the hub genes in the PPI network. The deeper color indicated a stronger correlation. BA = biliary atresia, NC = normal control group, Non-BA = control group for hepatobiliary diseases without the biliary atresia, PPI = protein–protein interaction.

Table 1

Lasso regression coefficients of 39 hub genes.

Symbol	Coefficient
VCAN	−0.84
KRT23
LUM
CCDC80
EPCAM
CLIC6
STMN2
ANXA13
CXCL8	−0.10
SLIT2
KRT7	−1.04
KRT19	−0.10
EPHA3
SCTR
ASPN
SULF1
JAG1
ANXA4
FBN1
PLXDC2
UGCG
CEP170
IGFBP7
FSTL1
VIM
MMP14
DPYSL3
ENAH
MAML2
CTTNBP2NL
PDP1
DBN1
ANXA2	−2.50
MED17
ANXA11
CLIC1
LUZP1
ABI1
JAK1

Lasso regression coefficients of 39 hub genes. (A) Heatmap of the hub genes in the PPI network. (B) Correlation matrix of the hub genes in the PPI network. The deeper color indicated a stronger correlation. BA = biliary atresia, NC = normal control group, Non-BA = control group for hepatobiliary diseases without the biliary atresia, PPI = protein–protein interaction.

3.4. Construction of the diagnostic model

Model constituted of keratin 7 (KRT7), KRT19, Versican (VCAN), Annexin A2 (ANXA2), and C-X-C motif chemokine ligand 8 (CXCL8) possessed the highest classification accuracy of 94.5% (Table 2, Supplementary Digital Content 5, http://links.lww.com/MD/H95); moreover, the model only containing CXCL8 and KRT7 owned a similar accuracy of 93.6%, which was as high as the above-stated model. CXCL8 and KRT7 for the BA group had a classification error rate of <2%, while the non-BA group had a classification error rate of 29%. Figure 4 showed the ROC and area under curve (AUC) for the 2 models; The AUC equated to 0.93 and 0.92. Finally, the model consisting of CXCL8 and KRT7 was selected as best model, because it had a high accuracy and minor number of genes. Figure 5A and B exhibited that the PCA of the model could better separate BA and Non-BA, compared to the PCA of primitive data.

Table 2

Parameters of different model.

Model	Accuracy	95% CI	Sensitivity	Specificity
ANXA2+KRT19+VCAN+CXCL8+KRT7	0.95	(0.87–0.99)	1.0	0.71
ANXA2+KRT19+VCAN+CXCL8	0.91	(0.82–0.96)	0.97	0.64
ANXA2+VCAN+CXCL8	0.91	(0.82–0.96)	0.95	0.71
ANXA2+VCAN+CXCL8+KRT7	0.94	(0.86–0.98)	0.98	0.71
ANXA2+VCAN+KRT7	0.90	(0.81–0.95)	0.95	0.64
ANXA2+VCAN	0.85	(0.75–0.92)	0.92	0.50
KRT19+CXCL8+KRT7	0.92	(0.84–0.97)	0.98	0.64
CXCL8+KRT7	0.94	(0.86–0.98)	0.98	0.71
KRT7	0.78	(0.67–0.87)	0.89	0.28
KRT19+KRT7	0.83	(0.73–0.91)	0.92	0.43
CXCL8	0.90	(0.81–0.95)	0.94	0.71

ANXA2 = annexin A2, CI = confidence interval, CXCL8 = C-X-C motif chemokine ligand 8, KRT7 = keratin 7, KRT19 = keratin19, VCAN = versican.

Figure 4.

ROC curves of 2 models. Model 1: KRT7 and CXCL8, Model 2: VCAN, ANXA2, KRT7, KRT19, and CXCL8. ANXA2 = annexin A2, AUC = area under curves, CXCL8 = C-X-C motif chemokine ligand 8, FPR = false positive rate, KRT7 = keratin 7, KRT19 = keratin 19, ROC = receiver operating characteristic, TPR = true positive rate, VCAN = versican.

Figure 5.

(A) PCA of original data. (B) PCA of the model of KRT7 and CXCL8. BA = biliary atresia, CXCL8 = C-X-C motif chemokine ligand 8, KRT7 = keratin 7, NC = normal control group, Non-BA = control group for hepatobiliary diseases without the biliary atresia, PCA = principal component analysis, PC1 = principal component 1, PC2 = principal component 2, VarExp = variance explained.

Parameters of different model. ANXA2 = annexin A2, CI = confidence interval, CXCL8 = C-X-C motif chemokine ligand 8, KRT7 = keratin 7, KRT19 = keratin19, VCAN = versican. ROC curves of 2 models. Model 1: KRT7 and CXCL8, Model 2: VCAN, ANXA2, KRT7, KRT19, and CXCL8. ANXA2 = annexin A2, AUC = area under curves, CXCL8 = C-X-C motif chemokine ligand 8, FPR = false positive rate, KRT7 = keratin 7, KRT19 = keratin 19, ROC = receiver operating characteristic, TPR = true positive rate, VCAN = versican. (A) PCA of original data. (B) PCA of the model of KRT7 and CXCL8. BA = biliary atresia, CXCL8 = C-X-C motif chemokine ligand 8, KRT7 = keratin 7, NC = normal control group, Non-BA = control group for hepatobiliary diseases without the biliary atresia, PCA = principal component analysis, PC1 = principal component 1, PC2 = principal component 2, VarExp = variance explained.

3.5. Validation for the model

In GSE46960, the expression level of KRT7 in the BA group was significantly higher than in other groups (F = 42.65, BA vs NC P < .001; BA vs non-BA P < .001), and the expression level of CXCL8 in the BA group also was significantly higher than other groups (F = 74.74, BA vs NC P < .001; BA vs non-BA P < .001) (Fig. 6A). In the data set GSE84954, the classification accuracy of the model utilizing randomforest was 73% (95% confidence interval: 0.39–0.94). The result of classification is predicted as shown in Table 3. Figure 6B revealed the probability of each sample being classified as BA calculated by the randomforest and the expression level of KRT7 and CXCL8 in the corresponding sample. A sample with consistently high expression of KRT7 and CXCL8 has a high probability of being identified as the BA. GSM2254637 and GSM2254649 were misclassified as the BA, whereas GSM2254643 was incorrectly assigned to non-BA. KNN algorithm (3 folds crossing validation) had better results, with the mean accuracy rising slightly around 0.83 (Table 4). Comparing randomforest, similarly, GSM2254637 and GSM2254649 were also misclassified as BA.

Figure 6.

(A) Expression level of KRT7 and CXCL8 in GSE46960.*BA VS NC P < .001,**BA VS Non-BA P < .001. (B) The probability of each sample being classified as BA, calculated by the randomforest and the expression level of KRT7 and CXCL8 in the corresponding sample. Each point in upper graphic corresponded to the probability of a sample being classified as BA. Red typeface in annotation referred to the non-BA sample being misclassified as BA. Blue typeface in annotation referred to BA sample being misclassified as non-BA. The lower graphic exhibited the expression level of KRT7 and CXCL8 corresponding to every sample. BA = biliary atresia, CXCL8 = C-X-C motif chemokine ligand 8, KRT7 = keratin 7, NC = normal control group, Non-BA =control group for hepatobiliary diseases without the biliary atresia.

Table 3

Confusion matrix of predicting classification in GSE84954 (randomforest).

			Reference
			BA	Non-BA
Liver tissue	Prediction	BA	5	2
Liver tissue	Prediction	Non-BA	1	3

BA = biliary atresia, Non-BA =control group for hepatobiliary diseases without the biliary atresia.

Table 4

Confusion matrix of predicting classification in GSE84954 (KNN).

			Reference
			BA	Non-BA
Liver tissue	Prediction	BA	6	2
Liver tissue	Prediction	Non-BA	0	3

BA = biliary atresia, KNN = k-nearest neighbors algorithm, Non-BA = control group for hepatobiliary diseases without the biliary atresia.

Confusion matrix of predicting classification in GSE84954 (randomforest). BA = biliary atresia, Non-BA =control group for hepatobiliary diseases without the biliary atresia. Confusion matrix of predicting classification in GSE84954 (KNN). BA = biliary atresia, KNN = k-nearest neighbors algorithm, Non-BA = control group for hepatobiliary diseases without the biliary atresia. (A) Expression level of KRT7 and CXCL8 in GSE46960.*BA VS NC P < .001,**BA VS Non-BA P < .001. (B) The probability of each sample being classified as BA, calculated by the randomforest and the expression level of KRT7 and CXCL8 in the corresponding sample. Each point in upper graphic corresponded to the probability of a sample being classified as BA. Red typeface in annotation referred to the non-BA sample being misclassified as BA. Blue typeface in annotation referred to BA sample being misclassified as non-BA. The lower graphic exhibited the expression level of KRT7 and CXCL8 corresponding to every sample. BA = biliary atresia, CXCL8 = C-X-C motif chemokine ligand 8, KRT7 = keratin 7, NC = normal control group, Non-BA =control group for hepatobiliary diseases without the biliary atresia.

4. Discussion

The etiology and mechanism of BA are still obscure, and there are many hypotheses for its pathogenesis, including abnormal bile duct development, inflammation, genetic variants, and immune disorder caused by a viral infection and even the occurrence of BA may be an end-stage phenotype induced by multiple mechanisms.[ The study found that the brown module genes most closely associated with BA enriched in Cell adhesion molecule (CAM), extracellular matrix (ECM) organization, inflammatory response, and notch pathway. One concept is that biliary epithelium cells actively participate in the pathogenesis of cholangiopathies via their transformation into reactive cholangiocytes; they have a critical role in biliary fibrosis via crosstalk with ECM-producing cells, inflammatory cells, and ECM, and also promote fibrosis via the secreting of proinflammatory or chemotactic cytokines and the expression of adhesion molecules.[ Laminin Subunit Gamma 1 (LAMC1) is an ECM glycoprotein that participated in many processes, such as cell adhesion, and its location on the base membrane and the positive association between LAMC1 and liver fibrosis was discovered long ago.[ Integrins (ITG), one of cellular receptors of the LAMC family, bind to LAMC and transmit base membrane signals to cells.[ LAMC binds to ITGA/B and formats the focal adhesion pathway in which Focal adhesion kinase regulates downstream hedgehog pathway in response to injury of liver or biliary, fibrosis.[ Yu et al[ of Nanjing Medical University proved expression of LAMC1 was modulated via rs3768617 of LAMC1, thereby affecting the binding of miRNA-548b-3p to LAMC1, which could be a potential therapeutic target for BA. CAMs like Intercellular Adhesion Molecule 1 (ICAM1), Vascular Cell Adhesion Molecule 1 (VCAM1), and E-selectin are overexpressed in patients’ liver tissue and serum with BA.[ When the injury occurred, CAM mediated cell-cell contact and recruited leukocytes to the site of injury, followed by sustained inflammation.[ A process that also existed in the biliary epithelium was infected by the virus and led to BA.[ The activity of CAM might reflect inflammation of the biliary ducts and the development of cirrhosis.[ During follow-up to the BA patients after operation, Professor Shan Zheng’s team at Fudan University found that patients with their jaundice not eliminated possessed high expression levels of VCAM1, suggesting that VCAM1 may play an important role in the pathogenesis of liver fibrosis in infants with BA.[ The same phenomenon was detected in patients with primary biliary cirrhosis, primary sclerosing cholangitis, and alcoholic liver disease.[ In addition, the excessively accumulated extracellular matrix represents the process of liver fibrogenesis; extracellular affords complex functions, such as cell adhesion, cell migration, and proliferation.[ These pathological changes match the progress of BA. The notch pathway is an evolutionarily conserved intercellular signaling pathway to maintain progenitor cells.[ It is a regulator in the shape of the intrahepatic bile duct.[ The abnormal alterations of the notch pathway and its ligand Jagged Canonical Notch Ligand 1 (JAG1) are the etiology of Alagille syndrome characterized by a paucity of intrahepatic bile ducts.[ The notch pathway plays a key role in BA, promoting the differentiation of hepatic progenitor cells into cholangiocytes for repairing bile duct epithelium.[ KRT7 and KRT19 were reported as markers of biliary epithelium, which appeared on biliary progenitor cells during development and regeneration,[ a process, however, led to the obstruction of the bile ducts. The PPI network exhibited that hub genes of brown module principally converged to 3 centers: Epithelial Cell Adhesion Molecule (EPCAM) and JAG1, ANXA2, Fibrillin 1(FBN1), which were associated with notch pathway, annexins family, and elastin respectively. Notch pathway had been described in the above statement, likewise, annexins family and elastin were also in charge of the BA process. Annexins were involved in tissue regeneration as these genes are involved in cell-to-cell communication and extracellular matrix growth.[ A hallmark of liver cirrhosis was the accumulation of large amounts of elastic fiber.[ FBN1, usually together with elastin, was a component of the extracellular matrix[ and was regularly co-expressed with elastin in children with cholestatic diseases.[ FBN1 in cholestatic disease was characterized as an apparent high expression with fibrosis of the bile duct, such as BA, sclerosing cholangitis, FBN1 cross-linked with tropoelastin to format microfibrils, further formed elastic fiber with elastin.[ TGF-beta is one of the most crucial cytokines regulating elastin expression levels and is perceived as the most potent fibrogenic cytokine in the liver; FBN1 could bind TGF-beta to modulate levels of TGF-beta.[ Next, KRT7 and CXCL8 were found to play a core role among all genes. CXCL8 chiefly responded for specificity, distinguishing non-BA liver diseases from BA. Therefore, CXCL8 plus KRT7 elevated the correct rate of identifying BA. The accuracy of combining KRT7 and CXCL8 was very close to the model consisting of VCAN, ANXA2, CXCL8, KRT19, KRT7, which only misidentified a BA as a non-BA. In contrast, the model’s accuracy descended to 74.74% in external validation set GSE84954. Through in-depth analysis, the classification error of GSE46960 mainly occurred when the Non-BA group was classified as BA group, and the samples misclassified were principally idiopathic cholestasis. In GSE84954, the propensity for error was similar to GSE46960; non-BA diseases miscategorized separately were biliary cirrhosis and alpha-1-antitrypsin deficiency. Furthermore, samples with the higher expression level of KRT7 and CXCL8 tended to be sorted as BA to a great extent, a phenomenon that was striking as shown by Figure 6B. The errors would occur when one of KRT7 and CXCL8 was expressed at an extremely high or low level, and another gene was expressed at a normal level. Given the expression of KRT7 and CXCL8 observed in GSE46960, thereby we took into account that the reason that induced the low accuracy of the model in GSE84954 mostly was the deviation of the results arising from the smaller sample size. The overall accuracy of the model, which was disturbed for errors caused by aberrant values, may improve with increasing sample size. Another reason we speculated on was that not all cholestatic diseases suited the model to differentiate. In our study, the result of WGCNA analysis and function enrichment disclosed the changes of mRNA profiles with BA were in line with previously identified pathogenesis of cholestatic diseases containing BA and implied the core role of biliary progenitor, ECM and Notch pathway in the progress of BA. A novel diagnostic model was developed by lasso regression and randomforest algorithm, which was comparable in accuracy to the model comprised of CXCL8 and LAMC2 originating from the same dataset. This conclusion inferred that the model of KRT7 and CXCL8 might be worthy of further study and made us notice the unique value of CXCL8 to discriminate BA with several cholestatic diseases. As the overlap in pathogenesis characteristic between BA and partial cholestasis, no single preoperative examination that enables the diagnosis of BA to be made certainty and thus a complex model that incorporates clinical manifest, laboratory examinations, image examinations, histological cachet, and biomarkers might be promising. Patients who opt for diagnosis by an exploratory laparotomy will benefit from it. Given the limitations of our study in the cohort size (14 disease control samples) and the small size of the external validation set (11 samples), the outcome should be investigated in a broader context. Another aspect, since these data were from European and American countries, the practical value of the model in the mainland China needs to be inspected with native biospecimen.

5. Conclusion

WGCNA constructed a co-expression net encompassing ten modules and identified that the brown module is mainly related to BA. GO and KEGG enrichment analysis uncovered the genes of the brown module are predominantly enriched in biological processes of CAM, extracellular matrix organization, inflammatory response, and notch pathway. Thirty-nine hub genes from the brown module consisted of a PPI network. KRT7 and CXCL8 were screened out from the 39 genes by lasso regression and a diagnostic model was formed using a randomforest algorithm to distinguish between BA and non-BA with an approximate accuracy of 93.6% and an AUC of 0.93.

Acknowledgments

The authors thank Heya Wang Prof, from the 6th people’s hospital of Guiyang city, for reviewing this paper and her helpful comments.

Author contributions

Yongliang Wang had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data interpretation. Concept and design: Fang Li. Acquisition, analysis, or interpretation of data: Yongliang Wang. Drafting of the manuscript: Yongliang Wang. Critical revision of the manuscript for important intellectual content: All authors. Supervision: Hongtao Yuan.

52 in total

1. Network structures and algorithms in Bioconductor.

Authors: Vincent J Carey; Jeff Gentry; Elizabeth Whalen; Robert Gentleman
Journal: Bioinformatics Date: 2004-08-05 Impact factor: 6.937

2. Molecular basis of elastic fiber formation. Critical interactions and a tropoelastin-fibrillin-1 cross-link.

Authors: Matthew J Rock; Stuart A Cain; Lyle J Freeman; Amanda Morgan; Kieran Mellody; Andrew Marson; C Adrian Shuttleworth; Anthony S Weiss; Cay M Kielty
Journal: J Biol Chem Date: 2004-03-23 Impact factor: 5.157

3. A general framework for weighted gene co-expression network analysis.

Authors: Bin Zhang; Steve Horvath
Journal: Stat Appl Genet Mol Biol Date: 2005-08-12

Review 4. Hepatic fibrogenesis.

Authors: Jinsheng Guo; Scott L Friedman
Journal: Semin Liver Dis Date: 2007-11 Impact factor: 6.115

Review 5. Liver fibrosis: Pathophysiology, pathogenetic targets and clinical issues.

Authors: Maurizio Parola; Massimo Pinzani
Journal: Mol Aspects Med Date: 2018-09-13

6. Inhibition of the Notch Signaling Pathway Reduces the Differentiation of Hepatic Progenitor Cells into Cholangiocytes in Biliary Atresia.

Authors: Yongzhong Mao; Shaotao Tang; Li Yang; Kang Li
Journal: Cell Physiol Biochem Date: 2018-09-07

7. Abnormal hepatic expression of fibrillin-1 in children with cholestasis.

Authors: Thierry Lamireau; Liliane Dubuisson; Sébastien Lepreux; Paulette Bioulac-Sage; Monique Fabre; Jean Rosenbaum; Alexis Desmoulière
Journal: Am J Surg Pathol Date: 2002-05 Impact factor: 6.394

8. Co-existence of ABCB11 and DCDC2 disease: Infantile cholestasis requires both next-generation sequencing and clinical-histopathologic correlation.

Authors: Georg-Friedrich Vogel; Elisabeth Maurer; Andreas Entenmann; Simon Straub; A S Knisely; Andreas R Janecke; Thomas Müller
Journal: Eur J Hum Genet Date: 2020-03-20 Impact factor: 4.246

Review 9. Biliary atresia.

Authors: Jane L Hartley; Mark Davenport; Deirdre A Kelly
Journal: Lancet Date: 2009-11-14 Impact factor: 79.321

Review 10. Epithelial-mesenchymal interactions in biliary diseases.

Authors: Luca Fabris; Mario Strazzabosco
Journal: Semin Liver Dis Date: 2011-02-22 Impact factor: 6.115