Identification of candidate biomarkers of liver hydatid disease via microarray profiling, bioinformatics analysis, and machine learning.

Jinwu Peng¹,²,³, Zhili Duan⁴, Yamin Guo⁵, Xiaona Li⁴, Xiaoqin Luo⁴, Xiumin Han⁵, Junming Luo²,⁴.

Abstract

OBJECTIVES: Liver echinococcosis is a severe zoonotic disease caused by Echinococcus (tapeworm) infection, which is epidemic in the Qinghai region of China. Here, we aimed to explore biomarkers and establish a predictive model for the diagnosis of liver echinococcosis.
METHODS: Microarray profiling followed by Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analysis was performed in liver tissue from patients with liver hydatid disease and from healthy controls from the Qinghai region of China. A protein-protein interaction (PPI) network and random forest model were established to identify potential biomarkers and predict the occurrence of liver echinococcosis, respectively.
RESULTS: Microarray profiling identified 1152 differentially expressed genes (DEGs), including 936 upregulated genes and 216 downregulated genes. Several previously unreported biological processes and signaling pathways were identified. The FCGR2B and CTLA4 proteins were identified by the PPI networks and random forest model. The random forest model based on FCGR2B and CTLA4 reliably predicted the occurrence of liver hydatid disease, with an area under the receiver operating characteristic curve of 0.921.
CONCLUSION: Our findings give new insight into gene expression in patients with liver echinococcosis from the Qinghai region of China, improving our understanding of hepatic hydatid disease.

Keywords: CTLA4; FCGR2B; Liver hydatid disease; Qinghai region of China; echinococcosis; microarray profiling

Substances: Biomarkers

Year: 2021 PMID: 33787392 PMCID: PMC8020228 DOI: 10.1177/0300060521993980

Source DB: PubMed Journal: J Int Med Res ISSN: 0300-0605 Impact factor: 1.671

Introduction

Hydatid disease (or echinococcosis) is a zoonotic disorder caused by Echinococcus (tapeworm) infection; it is an uncommon but emerging public health issue with an expanding endemic range.[1] Hydatid disease is endemic in areas of North America, eastern Europe, central Europe, and Asia, especially in China, where it is present in about 23 provinces, with an estimated 66 million individuals at risk of infection and nearly 380,000 affected individuals.[2] The incidence of hydatid disease is particularly high in the Qinghai region of China because of the environment, living habits, and traditional beliefs and customs.[3,4] Around 70% of hydatid disease lesions occur in the right lobe of the liver, with 40% encroaching on the hepatic hilum.[5] In the liver, growth of the larva compresses the organ, and the liver is impaired by the inflammatory reaction, compression, and direct erosion, which can lead to severe complications, including liver failure, severe cirrhosis, portal hypertension, and ascites.[6] Therefore, a better understanding of and comprehensive investigation into hepatic hydatid disease is needed. 
Diagnostic challenges exist, particularly in endemic regions.[7] Diagnosis of the infection at an early stage, suitable therapy, and follow-up are essential to improve the quality of life of affected patients.[8] Clinically, hydatid disease lacks distinctive markers and imaging characteristics, so diagnosis currently depends on a combination of clinical history, imaging, serology, and histopathology, which makes early diagnosis difficult.[5,9] Microarray profiling has been widely applied in the exploration of prognostic and diagnostic biomarkers for many diseases.[10] Machine learning is a type of data science that trains computers to observe patterns in massive datasets and then uses the patterns to derive rules or algorithms that optimize task performance; as such, machine learning has great potential to be useful in clinical prediction.[11] The random forest algorithm is a powerful multipurpose tool for predicting and understanding data; for example, it can rank genes in terms of their usefulness in separating groups, and it is widely used to identify hub genes and diagnostic biomarkers. However, the integration of microarray profiling analysis and machine learning based on random forests in the investigation of liver hydatid disease is limited.[12] The diagnosis of liver hydatid disease in the Qinghai region of China is difficult and lags behind that of other regions,[13] and more reliable diagnostic biomarkers and practical prediction models are urgently needed. In this study, we aimed to comprehensively analyze differentially expressed genes (DEGs) between patients with hepatic hydatid disease and healthy controls in the Qinghai region of China by integrating microarray profiling with bioinformatic analysis and machine-learning algorithms. 
Furthermore, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis, constructed a protein–protein interaction (PPI) network of DEGs to identify hub genes, and established a random forest model based on the identified hub genes to predict the occurrence of liver hydatid disease. Our investigation, with regional characteristics, will enhance the understanding of the development of liver hydatid disease, especially in the Qinghai region of China, providing potential biomarkers and targets for this epidemic area.

Materials and methods

Ethics statement

All study participants provided written informed consent before inclusion in the study. The study protocol was approved by the Ethics Committee of the Qinghai Provincial People’s Hospital, China (approval no. 2020-160).

Collection of clinical hydatid disease samples

Six patients with hepatic hydatid disease who were diagnosed between June 2017 and July 2018 at the Department of General Surgery of Qinghai Provincial People’s Hospital were included in this study. Diagnosis of all patients was confirmed in accordance with the diagnostic criteria of the Expert Consensus on Diagnosis and Treatment of Liver Echinococcosis.[14] The diagnosis of hepatic hydatid disease relies on clinical findings, medical history, histopathologic verification, laboratory evaluations, and radiologic imaging. At least two of the following four criteria have to be met for the diagnosis of hepatic hydatid disease: (1) identification of parasite nucleic acids in clinical specimens, (2) characteristic lesions shown by imaging, (3) pathologic verification of Echinococcus multilocularis metacestodes, and (4) specific serum antibodies to Echinococcus antigens. Patients with hepatitis B or tuberculosis or other liver diseases were excluded from this analysis. We also excluded participants with other chronic consumptive diseases and organ echinococcosis. We collected the lesion, including the border with normal tissue, as the lesion group, and nonlesional tissue as the control group. Approximately 0.5 to 250 mg of healthy or hydatid disease liver tissue was collected by resection and stored in liquid nitrogen until analysis.

RNA extraction

Liver tissue samples were collected from patients with hepatic hydatid disease; lesional and adjacent nonlesional tissue samples were processed separately. Total RNA was extracted from liver tissue using the RNeasy Lipid Tissue kit (Qiagen, Hilden, Germany) and purified on RNeasy columns (Qiagen). The RNA concentration and quality (RNA integrity number and 28S/18S ratio) were determined using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and the RNA 6000 Nano Assay on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), respectively.

Microarray expression profiling

Microarray expression profiling was conducted in nonlesional control liver tissue (n = 6) and lesional liver tissue (n = 6) from patients with liver hydatid disease. Briefly, total RNA (100 ng) was labeled with the mRNA Complete Labeling and Hyb Kit (Agilent Technologies) and hybridized on the SurePrint G3 Human Gene Expression 8x60K v2 microarray (Agilent Technologies). The microarray contains 50,599 probes for 32,776 human mRNAs originating from authoritative databases, including RefSeq, Ensembl, GenBank, and the Broad Institute. After hybridization and washing, processed slides were scanned with the Agilent G2505C microarray scanner (Agilent Technologies).

Identification of DEGs

Raw data from the microarray profiling were extracted using Feature Extraction (version 10.7.1.1; Agilent Technologies). Next, quantile normalization and subsequent data processing were carried out using GeneSpring GX software (version 12.0; Agilent Technologies). Differential mRNA expression was analyzed using the limma R package (http://www.bioconductor.org/packages/release/bioc/html/limma.html). After quantile normalization, raw signals from the microarray were log2 transformed. Differential expression of mRNA was defined by an absolute fold change (FC) > 2 (|log2FC| > 1) and a P-value < 0.05 (Student’s t-test).
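
The normalization and thresholding steps described above can be illustrated with a minimal Python sketch. It is not the GeneSpring or limma implementation used in the study: the function names (quantile_normalize, log2_fold_change, call_deg) are hypothetical, ties in quantile normalization are broken by position for simplicity, and the P-value is assumed to be supplied by a separate t-test rather than computed here.

```python
import math
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize equal-length sample columns.

    Each value is replaced by the mean, across samples, of the values
    holding the same rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[i] for col in sorted_cols) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def log2_fold_change(lesion, control):
    """Mean log2 intensity in lesional samples minus that in controls."""
    return mean(math.log2(x) for x in lesion) - mean(math.log2(x) for x in control)


def call_deg(log2fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Classify one gene using the |log2FC| > 1 and P < 0.05 thresholds."""
    if p_value >= p_cutoff or abs(log2fc) <= fc_cutoff:
        return "ns"
    return "up" if log2fc > 0 else "down"


if __name__ == "__main__":
    # Hypothetical intensities for one gene in three lesional and three control samples.
    fc = log2_fold_change([800.0, 900.0, 1000.0], [200.0, 210.0, 190.0])
    print(round(fc, 2), call_deg(fc, p_value=0.01))
```

A gene is counted as upregulated or downregulated only when both thresholds are met, which is why a large fold change with P ≥ 0.05 is reported as "ns" in this sketch.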

GO and KEGG analysis

Differentially expressed mRNAs were submitted to the KEGG Orthology Based Annotation System (KOBAS) database (http://kobas.cbi.pku.edu.cn/) to be classified into the GO domains (molecular function, biological process, and cellular component) and into KEGG annotation groups (KEGG pathway and the Reactome bioinformatics analysis database; http://reactome.org). A P-value < 0.05 was regarded as significant.
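
The text does not state which statistical test produced the enrichment P-values. Over-representation analyses of this kind commonly use a one-sided hypergeometric test, so the following Python sketch assumes that test purely for illustration; the function name and all counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term, n_degs, n_overlap):
    """One-sided hypergeometric P-value for over-representation.

    Probability of observing at least n_overlap term-annotated genes when
    n_degs genes are drawn without replacement from n_background genes,
    of which n_term carry the annotation.
    """
    upper = min(n_term, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20,000 background genes, 200 annotated to a term,
    # 1152 DEGs, 40 of which carry the annotation.
    p = hypergeom_enrichment_p(20000, 200, 1152, 40)
    print(p < 0.05)
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding occurs in the final division.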

PPI analysis

The PPI analysis was performed using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (https://string-db.org/cgi/input.pl). The hub genes among the identified DEGs were analyzed and selected based on the score of their maximal clique centrality (MCC) by using the cytoHubba plugin of Cytoscape software (https://cytoscape.org/).
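
As a rough illustration of the MCC score, the sketch below assumes the definition given in the cytoHubba description: the score of a node is the sum of (|C| − 1)! over all maximal cliques C that contain it, which reduces to the node degree when none of its neighbors are connected to each other. This is a plain-Python sketch, not the plugin's code; the function names and the toy edge list are hypothetical.

```python
from math import factorial


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (pivoting variant).

    adj maps each node to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def mcc_scores(edges):
    """MCC(v) = sum of (|C| - 1)! over maximal cliques C containing v."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    scores = {v: 0 for v in adj}
    for clique in maximal_cliques(adj):
        if len(clique) < 2:
            continue
        for v in clique:
            scores[v] += factorial(len(clique) - 1)
    return scores


if __name__ == "__main__":
    # Toy network: a triangle (a, b, c) with one extra edge c-d.
    print(mcc_scores([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```

Because the score grows factorially with clique size, nodes that belong to large, densely connected modules can reach very large values.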

Construction of the random forest model

To identify the most critical genes correlated with the development of liver hydatid disease, we constructed a random forest model of candidate hub genes in the PPI network based on the “bagging” (bootstrap aggregating) method by using the “randomForest” package in R (https://www.stat.berkeley.edu/~breiman/RandomForests/), in which the sample type (hydatid disease or not) served as the outcome variable and the expression levels of the selected candidate hub genes served as the predictor variables. Genes with higher scores of both MeanDecreaseAccuracy and MeanDecreaseGini in the random forest model were identified as hub genes.
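
To make the bagging idea and the Gini-based importance concrete, here is a deliberately simplified Python sketch. It is not the randomForest R package: each "tree" is a single-threshold decision stump, every stump considers all features (a full random forest grows deeper trees and samples features at each split), and only a Gini-decrease importance is accumulated, not the permutation-based MeanDecreaseAccuracy. The gene names and expression values are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_stump(rows, labels, features):
    """Best one-threshold split as (gini_decrease, feature, threshold, left, right)."""
    parent, n, best = gini(labels), len(labels), None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= thr]
            right = [y for row, y in zip(rows, labels) if row[f] > thr]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, f, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best


def fit_bagged_stumps(rows, labels, n_trees=200, seed=0):
    """Fit stumps on bootstrap samples; return them with mean Gini decrease per feature."""
    rng = random.Random(seed)
    features = list(rows[0])
    n = len(rows)
    stumps = []
    importance = {f: 0.0 for f in features}
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        stump = best_stump([rows[i] for i in idx], [labels[i] for i in idx], features)
        if stump is None or stump[0] <= 0.0:
            continue  # this bootstrap sample offered no informative split
        stumps.append(stump)
        importance[stump[1]] += stump[0]
    for f in importance:
        importance[f] /= max(len(stumps), 1)
    return stumps, importance


def predict(stumps, row):
    """Majority vote over all fitted stumps."""
    votes = Counter(left if row[f] <= thr else right for _, f, thr, left, right in stumps)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical log2 expression of two genes in six samples (0 = control, 1 = lesion).
    rows = [{"geneA": 1.0, "geneB": 5.0}, {"geneA": 2.0, "geneB": 1.0},
            {"geneA": 3.0, "geneB": 4.0}, {"geneA": 7.0, "geneB": 2.0},
            {"geneA": 8.0, "geneB": 6.0}, {"geneA": 9.0, "geneB": 3.0}]
    labels = [0, 0, 0, 1, 1, 1]
    stumps, importance = fit_bagged_stumps(rows, labels)
    print(importance, predict(stumps, {"geneA": 8.5, "geneB": 2.0}))
```

In this toy data the first gene separates the two groups completely, so it accumulates all of the Gini decrease; ranking genes by such a score is the intuition behind selecting hub genes from the importance plot.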

Statistical analysis

Statistical analysis was conducted using R software (version 3.5.2; https://cran.r-project.org).

Results

Identification of DEGs between hydatid disease and healthy liver tissues

To comprehensively understand the development of hydatid disease of the liver and explore potential diagnostic biomarkers, we performed gene expression profiling based on microarray analysis in patients with liver hydatid disease (n = 6) from the Qinghai region of China, using lesional tissue and nonlesional (control) tissue from the cut edge in the same patients. First, we normalized the microarray data to obtain biologically significant changes in gene expression, minimizing the experimental data deviation of gene expression intensity (Figure 1A). We identified 1152 DEGs, including 936 upregulated and 216 downregulated genes in lesional liver tissue compared with control tissues (|log2FC| > 1, P < 0.05) (Figure 1B). Of these, the most significant changes are shown in a heatmap (Figure 1C).

Figure 1.

Identification of differentially expressed genes (DEGs) between hydatid disease and healthy human liver tissues. (a) Boxplot showing the intensity of raw data and normalized data in the microarray profiling analysis; NB and PB indicate sample numbers. (b) Volcano plot filtering map displaying differential expression of mRNA in liver tissues from patients with hydatid disease (n = 6) compared with that of adjacent normal tissue (the cut edge) (n = 6). Red and green represent upregulated and downregulated mRNAs, respectively; black represents no difference. The x-axis indicates the Log2 fold change (FC) and the y-axis shows −log10 (P-value). (c) Heatmap indicating DEGs; the x-axis shows samples and the y-axis shows DEGs, in which red and green represent upregulated and downregulated DEGs, respectively.

GO and KEGG analysis of DEGs

To understand the importance of these DEGs, we performed GO and KEGG pathway analysis based on the KOBAS database. The top 30 significant (P < 0.05) GO terms and KEGG pathways based on the 1152 DEGs are presented in Figure 2A and B. GO analysis, including the biological process, molecular function, and cellular component domains, showed that multiple essential biological processes, such as the immune system process, immune response, T-cell activation, and cell defense response, were remarkably altered (Figure 2A). KEGG analysis revealed that several crucial signaling pathways, involving the immune system, T-cell receptor signaling pathway, B-cell receptor signaling pathway, and chemokine signaling pathway, were enriched (Figure 2B).

Figure 2.

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were performed using data of differentially expressed genes in lesional and nonlesional (healthy control) liver tissue from patients with hydatid disease based on the KOBAS database (http://kobas.cbi.pku.edu.cn/). The top 30 significant cellular processes and signaling pathways were demonstrated by (a) GO, and (b) KEGG enrichment scatterplot. The y-axis shows the name of cellular process or signaling pathway and the x-axis indicates the gene ratio; the size of the dot indicates the number of genes.

PPI network construction and candidate hub gene selection

To further identify candidate hub genes among the DEGs between lesional and control tissue from patients with hydatid disease, we constructed a PPI network based on the 1152 DEGs using the STRING online database. Then, the MCC algorithm was used to evaluate the significance of these DEGs in the PPI networks. We selected the top 20 candidate hub genes in the PPI networks based on MCC: ITGAM, PTPRC, CD19, CTLA4, CD69, CD28, CCR7, SELL, ITGAX, FCGR2B, GZMB, CD2, CD27, CD38, IL7R, CD40LG, CXCR3, FCGR2A, IL3RA, and CD5 (Figure 3A); the details are shown in Table 1.

Figure 3.

Protein–protein interaction (PPI) network construction and random forest model construction. (a) The PPI analysis was performed by using the STRING online database (https://string-db.org/cgi/input.pl). The PPI network with the top 20 differentially expressed genes (DEGs) is shown. The color change from yellow to red indicates the gradually increasing degree of expression. (b) A random forest model of the candidate hub genes in the PPI network was constructed using the package “randomForest” in R, in which the sample type (hydatid disease or not) served as the outcome variable, and the expression levels of selected 20 candidate hub genes were the prediction variables. The variable importance, as shown in the figure, was evaluated by the MeanDecreaseAccuracy and MeanDecreaseGini scores. (c) The accuracy of the model was assessed by the area under the ROC curve (AUC) analysis. The x-axis shows the false positive rate, and the y-axis the true positive rate.

Table 1.

The top 20 most important genes by maximal clique centrality (MCC) analysis in the protein–protein interaction network.

Gene symbol	Score
ITGAM	4.0038 × 10²¹
PTPRC	4.0036 × 10²¹
CD19	4.0036 × 10²¹
CTLA4	4.0036 × 10²¹
CD69	4.0036 × 10²¹
CD28	4.0036 × 10²¹
CCR7	4.0035 × 10²¹
SELL	4.0031 × 10²¹
ITGAX	4.0027 × 10²¹
FCGR2B	4.0026 × 10²¹
GZMB	4.0022 × 10²¹
CD2	3.9978 × 10²¹
CD27	3.9916 × 10²¹
CD38	3.9910 × 10²¹
IL7R	3.9890 × 10²¹
CD40LG	3.901 × 10²¹
CXCR3	3.9006 × 10²¹
FCGR2A	3.7932 × 10²¹
IL3RA	3.6281 × 10²¹
CD5	3.5804 × 10²¹

Random forest model construction and hub gene identification

Next, we constructed the random forest model of the candidate hub genes in the PPI network, in which the sample type (hydatid disease or not) served as the outcome variable and the expression levels of the 20 selected candidate hub genes served as the predictor variables. FCGR2B and CTLA4 were identified as hub genes with the highest scores of MeanDecreaseAccuracy and MeanDecreaseGini in the random forest model (Figure 3B), implying that these two genes may be critical to the development of liver hydatid disease. Accordingly, we established a random forest model using the expression of FCGR2B and CTLA4 as the predictor variables and sample type (hydatid disease or not) as the outcome variable. The area under the receiver operating characteristic (ROC) curve of the model was 0.921 (Figure 3C), suggesting that a prediction model based on these two genes could reliably distinguish liver hydatid disease from control samples.
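
For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below computes it that way; the function name and the scores in the usage example are hypothetical and are not the study's data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from the pairwise (Mann-Whitney) formulation.

    scores: predicted probabilities or vote fractions for the positive class.
    labels: 1 for positive (lesion) samples, 0 for negative (control) samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical vote fractions for three lesion and three control samples.
    print(roc_auc([0.9, 0.7, 0.4, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
```

With small sample sizes the AUC can only take a limited set of values, so it is best read alongside the ROC curve itself.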

Discussion

Hydatid disease is a dangerous zoonosis caused by Echinococcus, with a complicated life cycle involving definitive and intermediate hosts, including cattle, sheep, wolves, foxes, and dogs.[15] Individuals can be infected without knowing and develop hydatid disease.[16] Although it garners little attention, hydatid disease affects many people globally.[17] Hydatid disease is predominantly found in the liver (60%–70%), and less often in the brain, osseous tissue, lung, and other abdominal organs.[18] Multiple studies have reported the use of microarray profiling to investigate hepatic hydatid disease. MicroRNA expression profiling has shown that miR-483-3p is a potential diagnostic biomarker for hydatid disease.[19] Gene expression profiling in the liver of Echinococcus-infected sheep showed 87 upregulated genes and 66 downregulated genes, which were correlated with signaling and transport, the immune system, and metabolism.[20] Transcriptional profiling revealed that cytokine/chemokine factors are involved in Echinococcus multilocularis infection in mice.[21] The gene expression profile in the liver of Echinococcus-infected mice indicated that chemokine-activated MAPK signaling and the peroxisome proliferator-induced PPAR (peroxisome proliferator-activated receptor) pathway respond to the infection.[22] However, gene expression profiling analysis in human samples is limited. In this study, microarray profiling identified 1152 DEGs (936 upregulated and 216 downregulated) between lesional and adjacent nonlesional (control) liver tissues from patients with hydatid disease in the Qinghai region of China. This mRNA expression profiling analysis by microarray in patients with hepatic hydatid disease provides valuable information and an integrative landscape in which to increase our understanding of hepatic hydatid disease. 
Previous studies have shown that multiple biological processes and signaling pathways are involved in the development of liver hydatid disease. The transforming growth factor (TGF)-β/Smad pathway enhances the differentiation of interleukin (IL)-9-producing CD4+ T cells in humans following infection with Echinococcus.[23] Infection with Echinococcus multilocularis can cause exhaustion of T cells by the checkpoint receptor TIGIT.[24] Regulatory B cells with CD24hiCD38hi CD19+ are involved in the infection of humans with alveolar hydatid disease of the liver.[25] Plasma IL-5 and IL-23 are markers of metabolic lesion activity in patients with hepatic alveolar echinococcosis.[26] The expression of CTLA4, CD28, CD80, and CD86 is decreased in the liver of patients with cystic echinococcosis.[27] Our GO and KEGG analysis of 1152 DEGs revealed that multiple essential biological processes, including immune system process, immune response, T-cell activation, and cell defense response, and crucial signaling pathways, including immune system, T-cell receptor signaling pathway, B-cell receptor signaling pathway, and chemokine signaling pathway, were enriched in lesional and adjacent nonlesional (control) liver tissues from patients with hydatid disease. Our results suggest that these cellular processes and signaling may play crucial roles in the development of hepatic hydatid disease. The top 20 candidate hub genes (ITGAM, PTPRC, CD19, CTLA4, CD69, CD28, CCR7, SELL, ITGAX, FCGR2B, GZMB, CD2, CD27, CD38, IL7R, CD40LG, CXCR3, FCGR2A, IL3RA, and CD5) were selected from among the DEGs in the PPI networks based on their MCC. These genes may be involved in the progression of hepatic hydatid disease. Further studies are needed to validate the significance of these genes in the progression, diagnosis, prognosis, and therapy of hepatic hydatid disease. 
It has been reported that G-protein-coupled receptor (GPCR) signaling, rhodopsin signaling, and the GPVI-mediated activation cascade are involved in liver diseases.[28-30] In our study, GO and KEGG analysis also showed enrichment of GPCR and rhodopsin signaling and the GPVI-mediated activation cascade, implying that these biological processes may play crucial roles in hepatic hydatid disease. The incidence of liver hydatid disease is high in the Qinghai region of China,[3] and previous epidemiologic studies have revealed that many mammals are involved in the transmission of hydatid disease.[31] Tibetan foxes, red foxes, and dogs can act as definitive hosts, whereas voles, pika, Tibetan hare, yaks, and sheep may serve as intermediate hosts.[32] Because of the harsh environment of the high-altitude plain, traditional beliefs and customs, entrenched poverty, and semi-nomadic pastoralism, the lifestyle of people in the Qinghai region involves close connection with both wildlife and domestic mammals.[33] These social and ecological contributors lead to the burden of hydatid disease in this region.[34] As noted, the diagnosis of hydatid disease is based on serology, imaging results, and clinical findings.[35] Hematoxylin and eosin (H&E) staining of pathological lesions allows physicians to accurately diagnose hydatid disease and evaluate the therapeutic response in patients.[36] Typically, physicians depend on standard segmentation systems to evaluate computed tomography images.[37] However, limited medical resources and the difficult diagnosis increase the burden of this disease in the Qinghai region.[38] Hence, identification of reliable biomarkers and practical and useful prediction models for liver hydatid disease in the Qinghai region of China is urgently needed. In this study, we identified two hub genes, FCGR2B and CTLA4, by integration of microarray profiling, bioinformatic analysis, and machine-learning algorithms. 
These genes are potential biomarkers for the diagnosis of liver hydatid disease. Moreover, we established a random forest model based on these hub genes and effectively predicted the occurrence of liver hydatid disease. This prediction model, with regional characteristics of the Qinghai region of China, provides a new potential diagnostic model and strategy for liver hydatid disease in this area. Further investigations are needed to explore the specific effect of these hub genes on liver hydatid disease. 
In conclusion, we identified 1152 DEGs in lesional and adjacent nonlesional (control) liver tissues from patients with hydatid disease in the Qinghai region of China. The DEGs were involved in multiple biological processes and pathways, including immune response and immune system signaling. Two hub genes, FCGR2B and CTLA4, were identified, and a random forest model based on these genes was able to reliably and accurately predict the occurrence of liver hydatid disease. Our findings give new insights into the comprehensive landscape of gene expression in patients with liver hydatid disease, including the development of hepatic hydatid disease, especially in the Qinghai region of China. The identified hub genes may serve as potential biomarkers or targets for hepatic hydatid disease, although additional research is needed for clinical verification.

38 in total

Review 1. Random forests for microarrays.

Authors: Adele Cutler; John R Stevens
Journal: Methods Enzymol Date: 2006 Impact factor: 1.600

Review 2. GPCR-Mediated Signaling of Metabolites.

Authors: Anna Sofie Husted; Mette Trauelsen; Olga Rudenko; Siv A Hjorth; Thue W Schwartz
Journal: Cell Metab Date: 2017-04-04 Impact factor: 27.287

Review 3. Epidemiology of human alveolar echinococcosis in China.

Authors: Philip S Craig
Journal: Parasitol Int Date: 2005-12-09 Impact factor: 2.230

Review 4. Hepatic alveolar hydatid disease (Echinococcus multilocularis), a mimic of liver malignancy: a review for the radiologist in non-endemic areas.

Authors: M D Chouhan; E Wiley; P L Chiodini; Z Amin
Journal: Clin Radiol Date: 2019-02-10 Impact factor: 2.350

5. Molecular characterization of human echinococcosis in Sichuan, Western China.

Authors: Jingye Shang; Guangjia Zhang; Wenjie Yu; Wei He; Qian Wang; Bo Zhong; Qi Wang; Sha Liao; Ruirui Li; Fan Chen; Yan Huang
Journal: Acta Trop Date: 2018-09-29 Impact factor: 3.112

Review 6. Impact of anthropogenic and natural environmental changes on Echinococcus transmission in Ningxia Hui Autonomous Region, the People's Republic of China.

Authors: Yu Rong Yang; Archie C A Clements; Darren J Gray; Jo-An M Atkinson; Gail M Williams; Tamsin S Barnes; Donald P McManus
Journal: Parasit Vectors Date: 2012-07-24 Impact factor: 3.876