Literature DB >> 26072489

PAGER: constructing PAGs and new PAG-PAG relationships for network biology.

Zongliang Yue¹, Madhura M Kshirsagar¹, Thanh Nguyen¹, Chayaporn Suphavilai¹, Michael T Neylon¹, Liugen Zhu¹, Timothy Ratliff¹, Jake Y Chen².

Abstract

In this article, we described a new database framework to perform integrative "gene-set, network, and pathway analysis" (GNPA). In this framework, we integrated heterogeneous data on pathways, annotated list, and gene-sets (PAGs) into a PAG electronic repository (PAGER). PAGs in the PAGER database are organized into P-type, A-type and G-type PAGs with a three-letter-code standard naming convention. The PAGER database currently compiles 44 313 genes from 5 species including human, 38 663 PAGs, 324 830 gene-gene relationships and two types of 3 174 323 PAG-PAG regulatory relationships-co-membership based and regulatory relationship based. To help users assess each PAG's biological relevance, we developed a cohesion measure called Cohesion Coefficient (CoCo), which is capable of disambiguating between biologically significant PAGs and random PAGs with an area-under-curve performance of 0.98. PAGER database was set up to help users to search and retrieve PAGs from its online web interface. PAGER enable advanced users to build PAG-PAG regulatory networks that provide complementary biological insights not found in gene set analysis or individual gene network analysis. We provide a case study using cancer functional genomics data sets to demonstrate how integrative GNPA help improve network biology data coverage and therefore biological interpretability. The PAGER database can be accessible openly at http://discovery.informatics.iupui.edu/PAGER/.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2015 PMID： 26072489 PMCID： PMC4553834 DOI： 10.1093/bioinformatics/btv265

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

To characterize complex patterns of genetic variations or RNA/protein expressions in surging Omics data sets, bioinformatics researchers have developed new data analysis approaches—gene-set analysis (Dinu ; Nam and Kim, 2008) and network/pathway analysis (Wu ). When compared with individual gene-based analysis approaches, “gene-set, network, and pathway analysis” (GNPA) have the potential advantage of increasing results reproducibility, model robustness, and data interpretability, while reducing biases in noisy Omics experimental data (Khatri ). GNPA is essentially an integrative analysis strategy that takes advantage of a priori data structures acquired from many sources in: e.g. gene ontology (GO) (Kim ), pathways (Luo ), gene/protein mutation or expression signatures (Chuang ; Zhang and Chen, 2013), chemical perturbations (Oprea ), metabolomics signatures (Xia and Wishart, 2010), curated literature (Araki ), computational predictions (Nam, 2010) or public databases for gene-sets, molecular signatures and pathway/network modules (Huang ). Depending on the pathway details, these analyses can take as minimal information as possible, e.g. using gene list alone as in Gene-set Analysis (Dinu ), incorporating gene-set databases with expression rank information as in Gene Set Enrichment Analysis (Subramania ), or take as much pathway/network interaction detail as possible, e.g. using pathway reaction/regulation details as in EnrichNet (Glaab ). For biomedical researchers, GNPA can significantly enhance their ability to annotate genes from Omics results (Ganter and Giroux, 2008), interpret heterogeneous genetic study results (Hale ), identify disease subtypes and progression (Hung, 2013; Zhang and Chen, 2013), select or prioritize drug targets (Sivachenko and Yuryev, 2007) and understand biological mechanisms (Chen ; Murohashi ). Heterogeneous bioinformatics tools have been developed to perform different aspects of GNPA Huang da ). Computationally, GNPA can be categorized into over-representation analysis such as DAVID (Dennis ), gene-set scoring and ranking such as GSEA (Subramania ), multivariate machine learning Wu ) and network topological analysis Martini ). Common databases for GNPA include: manually curated functional or phenotypic databases such as Online Mendelian Inheritance in Man (OMIM) (Baxevanis, 2012), GAD (Becker )] and GO (Ashburner ); curated signaling and regulatory pathway databases such as Reactome (Croft ) and pathway interaction database (PID) (Schaefer ); comprehensive curated gene signature database such as GeneSigDB (Culhane ) and mSigDB (Liberzon ); or comprehensive integrated knowledge repositories such as human pathway database (HPD) (Chowbina ) and PAGED (Huang ). Recently, multi-scale GNPA analysis tools such as Bioinformatics Enrichment Tools (Huang da, ), MetaNet (Parikh ) and EnrichNet (Glaab ) have also been developed to explore network topological structures between pathways and gene-sets using a variety of computational strategies. However, even the most popular tools such as DAVID and GSEA lack major functionality or comprehensive data content found in other tools, forcing users to try out several tools and combining findings manually. In this work, we address the major challenge of constructing a comprehensive database infrastructure in performing integrative GNPA. The theoretical space of gene-sets for a given species with N genes can be extremely large, i.e. 2N gene-sets. Based on our practical experience in building comprehensive pathway and gene-set databases (Chowbina ; Huang ), different data sources collected for integrated GNPA are often partially overlapping yet complementary to one another. For example, GeneSigDB and mSigDB, two popular databases for Gene Set Analysis, have virtually no overlapping gene-sets, in which an overlap is defined as two gene-sets sharing >80% genes. KEGG (Ogata, ) is the earliest effort in curating heterogeneous pathway data sources; however, its coverage in gene-set is missing. GO is widely used for gene set analysis; however, tools performing GO-based gene set analysis do not usually perform pathway/network analysis. HPD (Chowbina ), PAGED (Huang ) and GeneSetDB (Araki ) are among the first attempts to integrate annotated gene lists from heterogeneous sources into a unified database. Although these databases provide significantly higher coverage of gene-sets and other annotated gene list, these databases lack pathway/network interaction or regulatory relationships details. In addition, increased gene-set coverage gives rise to gene-set data quality concerns that must be addressed. There is no report on how a new gene-set submitted from community users, e.g. from WikiPathways (Pico ), should be evaluated for its quality before a database of gene-set should adopt it for data analysis. Moreover, there is a rising need to perform integrative network analysis both within gene-sets and between gene-sets; therefore, understanding how gene-sets relate to one another to build gene-set to gene-set relationships has become critical for ongoing integrative GNPA tools. Our work has made the following contributions for future GNPA. First, we defined a new concept—PAG, which stands for Pathways (P-type), Annotated lists (A-type) and Gene-sets (G-type), to integrate heterogeneous gene-sets, networks, and pathways into a new comprehensive database called PAG Electronic Repository (PAGER). Each P-type PAG refers to a connected set of molecules (genes/proteins/metabolites), among which some detail of curated mechanism of actions, e.g. protein interactions, reactions, or gene regulations, are available (Biological Pathway, http://www.genome.gov/27530687). Each A-type PAG refers to a curated list of genes/proteins identified from a specific biological context, e.g. a shared GO category or a shared protein family. Each G-type PAG refers to a list of genes/proteins derived from any given high throughput Omics experiment, e.g. functional genomics, under a shared biological condition. PAGER collects and organizes 38 663 PAGs—the largest collection known to date—using a three-letter-code PAG classification system. Second, we developed a statistical measure called Cohesion Coefficient (CoCo) to help assess PAG data quality—the degree of biological relevance beyond random chance found among genes curated in each PAG. We demonstrated that the CoCo score can effectively separate biologically curated PAGs in PAGER from randomly generated PAGs with an sensitivity = 0.75 and specificity = 0.94, covering >94% of all PAGs compiled into the PAGER database, Third, we computed a novel type of relationship among PAGs called regulatory PAG–PAG relationships (r-type) in addition to computing the co-membership based PAG–PAG relationships described in earlier work (Chowbina ). These r-type PAG–PAG relationships (n = 65 872 for human) significantly complement the single gene regulatory relationships that are known prior to this work (n = 22 127 for human). The PAGER database is accessible at http://discovery.informatics.iupui.edu/PAGER/ and the database content for PAGs may be downloaded for separate GNPA tools such as GSEA.

2 Methods

2.1 Source data collection and preprocessing of PAGs

In the PAGER database, we compiled PAGs from the following data sources (for download date and description details, refer to Supplementary Data S1): WikiPathway (Pico ), from which we collected 202 public validated pathways; Reactome (Croft ), from which we collected 651 peer-reviewed pathways via the HPD (Chowbina ); BioCarta (Nishimura, 2001), from which we collected 253 pathways via the HPD; KEGG (Ogata ), from which we collected 200 pathways via the HPD; PID (Schaefer ), from which we collected 132 NCI-Nature curated pathways via the HPD; Protein Lounge (http://www.proteinlounge.com/Pathway), from which we collected 393 pathways; OMIM (Baxevanis, 2012), from which we collected 4409 manually-curated gene lists associating with phenotype terms; SPIKE (Elkon ), from which we collected 28 signaling pathways from the Genetic Association Database; GAD (Becker ), from which we collected 1679 unique phenotype/disease-related protein records; PharmGKB (Thorn ), from which we collected 102 chemical-associated pathways; MSigDB (Liberzon ), from which we collected 10 295 gene sets; GeneSigDB (Culhane ), from which we collected 3515 gene signatures; NHGRI GWAS Catalog (Welter ), from which we collected 1754 curated publications of 11 912 SNPs; and NGS Catalog (Xia ), from which we collected 69 annotated gene list curated from next generation sequencing data analysis literature. During the data integration process, gene/protein identified obtained from different sources were all mapped to NCBI official gene symbols (Brown ). In the PAGER database, we also constructed a special type of PAGs, ‘singleton PAGs’ (sPAG), to refer to PAGs consisting of only one gene. sPAGs are essential for constructing PAG–PAG relationships (to be described later). The final PAGER database consists of 19 772 sPAGs, which represent all sPAGs from the underlying sources and an additional 15 161 human genes from the reviewed subset of the UniProt Knowledgebase (UniProt, 2013). In the PAGER web database, we mask out these sPAGs by default to avoid causing confusions to users who are primarily interested in performing analysis with regular PAGs (n = 18 607, among which 16 125 are for human). A total of 324 830 gene–gene relationship data are also imported into the PAGER database from various data sources. In total, 205 185 molecular association data are derived from the STRING version 9.1 (Szklarczyk ) database after removing those with confidence score of 800 or less. A total of 93 713 human protein–protein interaction (PPI) data are derived from the HAPPI database (quality ≥ 3-star ratings) (Chen ). A total of 25 932 gene regulation data are derived from the TRANSFAC (Wingender ) (with quality defined as having five-binding sites or more), TRED (Jiang ) and SPIKE (Elkon ) databases. The PPI data are used to build PAG biological relevance measures. The gene regulation data are used to construct regulatory PAG–PAG relationships later. We follow the PAG ID standard naming convention as shown in Figure 1. Every PAG begins with a three-letter-code followed by six digits. The first letter position is used to signify the PAG data type category. The second letter position is used to signify the PAG derivation category. The third letter position is used to signify PAG relationship details. The human-readable naming convention may make it easy for end users to understand how the PAGs are constructed.

Fig. 1.

PAG ID standard naming convention and three types of PAGs

2.2 Defining CoCo measures to evaluate the biological relevance of each PAG

To evaluate the biological relevance of PAGs, we developed three PAG cohesion measures, i.e. Cohesion by Interaction enrichment (CoI), Cohesion by Triangle enrichment (CoT) and CoCo. Conventional tests such as chi-square, hypergeometric test or Fisher’s exact test have all been used to assess gene-set enrichment results. However, none of these statistical tests may generate true statistical P-values in practice (Huang da ). Therefore, for simplicity of calculation, we calculate all three cohesion measures based on signed probability mass function of hypergeometric distribution. Mathematical notations are defined as the following: Π is the set of genes in the reference PPI database; i and j are PAG indexes; I or I is the set of PPIs in PAG and PAG, respectively; IΠ is the set of all PPIs in the referenced database; T or T is the set of PPI triangles in PAG and PAG, respectively given IΠ; TΠ is the set of all PPI triangles in Π. For an overview of how each cohesion measure is calculated, refer to the illustrated example shown in Figure 2.

Fig. 2.

Example of calculation of each Cohesion measure

2.2.1 CoI definition

We developed CoI to measure the statistical significance of observing a given number of PPIs among all genes within a PAG. We define CoI from a hypergeometric probability mass function as: in which is the number of theoretical PPIs in Π, K = |IΠ| is the number of actual PPIs in the referenced database, is the theoretical number of PPIs inside PAG i, and k = |I| is the number of actual PPIs inside PAG i. The sign function compares the expected PPI count ratio K/N and PAG i’s PPI count ratio k/n. It returns 1 for over-representation if the PPI count ratio inside PAG i is more than the expected ratio and returns −1 for under-representation if the interaction ratio inside PAG i is less than the expected ratio. For PAGs containing no PPIs, we will not calculate a CoI. A high positive CoI score implies that genes inside the PAG are strongly linked while a high negative CoI score implies that genes inside the PAG are impossibly linked. Randomly generated PAGs should have a mean CoI close to 0.

2.2.2 CoT definition

We developed CoT to measure the statistical significance of observing a given number of triangles (PPIs forming a connected loop with three exact nodes/edges) among all genes within a PAG. Similar to CoI, CoT is also calculated using the same hypergeometric distribution probability mass function (hence we will not repeat here) but with different parameters for counting PPI triangles instead: is the number of theoretical triangles in Π, K = |TΠ| is the number of actual triangles in the referenced database, is the theoretical number of triangles inside PAG_i, and k = |T| is the number of actual triangles inside PAG_i. When compared with CoI, it has similar characteristics for highly positive CoT or highly negative CoT cases. For PAGs containing no PPIs, we will not calculate a CoT. A high positive CoT score implies that genes inside the PAG are strongly linked while a high negative CoT score implies that genes inside the PAG are impossibly linked. Randomly generated PAGs should have a mean CoT close to 0.

2.2.3 CoCo definition

We define CoCo = CoI + CoT for PAGs that contain at least one PPI triangle, and CoCo = CoI for PAGs that contain no PPI triangle but at least one PPI. When compared with CoI and CoT, the combined CoCo score has similar characteristics for highly positive CoCo or highly negative CoCo cases. For PAGs without valid calculated CoI, there will not be a CoCo score. In practice, since we used a high-coverage PPI database (the HAPPI database), the portion of PAGs without any PPIs within is relatively low.

2.2.4 Performance evaluations

To compare and evaluate the performance of three cohesion measures, we plot Receiver Operator Characteristic (ROC) curves for each of the measures, CoI, CoT, and CoCo. To create the positive set, we use true PAGs (size>1 for CoI and size>2 for CoT or CoCo) from the PAGER database. To create the negative set, we substitute all genes in each PAGi with genes randomly picked from the PAGER database. After all the cohesion measures are calculated, we calculate true positive rate (true positives over all positives cases) and true negative rate (true negatives over all negatives cases) for each possible cohesion measure threshold that exists in the data before plotting them on the ROC curve.

2.3 Inferring PAG–PAG relationships

We computationally derive two types of PAG–PAG relationships: co-membership based PAG–PAG relationships (m-type) and regulatory PAG–PAG relationships (r-type). These two types of PAG–PAG relationships may be calculated by Fisher’s exact test (Al-Shahrour ; Li ; Parikh ) using a 2 × 2 contingency table. In this study, however, we used hypergeometric distribution probability mass function instead to score each PAG–PAG relationship. Mathematical notations are defined as the following: Ω is the number of genes covered in PAGER, R_inis the set of gene regulations coming into genes in PAG, R_out is the set of gene regulations coming out from genes in PAG, R is the set of gene regulations from genes in PAG to genes in PAG, R is the set of gene regulations from genes in PAG to genes in PAG if and only if the genes belong to PAG∩ PAG.

2.3.1 Inferring m-type PAG–PAG relationships

We define the m-type relationship as the significance of observing the number of shared genes among PAGi and PAGj given the gene regulation data related to the two PAGs, or membership strength score in short. The hypergeometric distribution parameters for m-type relationships are defined as: N = Ω, K = |PAGi|, n = |PAGj| and k = |PAGi ∩ PAGj|.

2.3.2 Inferring r-type PAG–PAG relationships

We define the r-type relationship from PAG to PAG as the significance of observing the number of gene regulations from genes in PAG to genes in PAG, excluding the genes in PAG ∩ PAG, or regulatory strength score in short. The hypergeometric distribution parameters for m-type relationships are defined as: N = |R_in| + |R_out| + |R_in| + |R_out| − |R| is the total number of gene regulations in PAG and PAG, K = |R_out|, n = |R_in| and k = |R| − |R| is the number of gene regulation from genes in PAG to genes in PAG, excluding the genes in PAG ∩ PAG.

2.4 Developing the web application and user interface

We developed a web interface for the PAGER database, which is located at http://discovery.informatics.iupui.edu/PAGER/, for users to retrieve PAGs and PAG–PAG relationships with gene or keyword based queries. We used PHP5 and Codeigniter version 2.1.3 (EllisLab, 2014) as the web presentation framework and Oracle 11g as the database backend. Real-time calculation of hypergeometric probability mass function was implemented with PDL (Meagher ), a PHP library for statistics. Cytoscape.js (http://js.cytoscape.org), an open-source graph library, and jQuery were used to visualize gene and PAG networks. D3.js (http://d3js.org/) was used to perform matrix visualizations. We also implemented advanced features such as batch gene search, matrix or network visualization, gene or PAG transaction management to hold temporary user-selected contents, and data bulk download.

2.5 Case study: application of PAGER in myeloid-derived suppressor cells expression data analysis

Myeloid-derived suppressor cell (MDSC) microarray data analysis between tumor and normal control conditions were performed at Purdue University Center for Cancer Research. The study aims to establish a hypothesis on what factors were essential in promoting MDSC during cancer progression to transition from tumor-suppressor cells to tumor-helper cells. We collected tumor MDSC at the peritoneal cavity (T0) and compared control MDSC at the spleen (N0) of inflamed mice. Using the pulsed electroacoustic method, we extracted two cell sub-populations: CD11b+ Gr-1low (PC Glow) and CD11b+ Gr-1low (Sp Glow). Standard differential expression analysis of individual genes were performed using SAM (Tusher ). After the data analysis using pmf < 0.05 as the filter, we set the minimum log fold-change of 2.5 to select 1105 differentially expressed genes (N0), which includes 576 over-expressed genes (N+) and 529 under-expressed genes (N−). To generate an MDSC tumor versus control (T0 versus N0) gene regulatory network, we queried the N0 gene-set against the available gene regulation data in the PAGER database to obtain the seeded gene regulatory network. In addition, we queried N+ and N− gene-sets against the PAGER database to obtain N+ associated PAGs and N− associated PAGs. Using the r-type PAG–PAG regulatory relationships in PAGER, we also acquired the MDSC-specific network.

3 Results

3.1 An overview of summary statistics of the new PAGER database

From Table 1, we can obtain several data characteristics for the new PAGER database. First, there are 38 379 unique PAGs, which consists of 19 772 sPAGs and 18 607 mPAGs (regular PAGs) with a size variation ranging between 1 and 4939 (refer to Supplementary Fig. S1 for additional details of PAG size distribution categorized into different data sources). When compared with MSigDB—the largest gene-set database prior to this publication, PAGER represents a 181% increase of gene-set data coverage. In addition, we noticed an unusual spike in the size distribution for data derived only from MSigDB (with PAG sizes in the 200–212 range, Supplementary Fig. S1). This suggests potential PAG data curation size bias, i.e. taking only the top 200 genes from an expression signature, in the popular MSigDB. Second, there is a high percentage of PAGs determined to be biologically relevant as determined from the cohesion measure, CoCo score. In total, 14 701 (79%) of the 18 607 mPAGs are computed with a valid CoCo score and 13 856 (94%) of these mPAGs also have a CoCo score > 1. This finding suggests fairly high biological relevance in the majority of PAGs that we integrated into the PAGER database. Third, we also noticed that PAGs are more likely to overlap with each other than to regulate on one another. In the data statistics table, there 3 101 499 m-type PAG–PAG relationships—a number significantly higher than that of the total 72 824 r-type PAG–PAG relationships identified.

Table 1.

Basic statistics of the PAGER database

	In PAGER DB	In PAGER (Human)
Genes in PAGs	44 313	40 476
Gene–gene relationships	324 830	306 066
Molecular association	205 185	190 226
PPI	93 713	93 713
Gene regulation	25 932	25 932
PAGs	38 379	35 897
Singleton (n = 1)	19 772	19 772
Regular (n > 1)	18 607	16 125
with CoCo scores (n > 1)	14 701	12 496
with CoCo score ≥1	13 856	11 784
PAG–PAG pairs
m-type (P-value < 1e-5)	3 101 499	2 230 614
r-type (pmf < 0.05)	72 824	65 983
sPAG to mPAG	7250	7022
mPAG to mPAG	39 253	32 966
sPAG to sPAG	23 842	23 842
mPAG to sPAG	2479	2153

Basic statistics of the PAGER database

3.2 Cohesion scores to classify PAGs based on biological relevance

We evaluated PAG classification performance (biological relevant YES/NO classes) using three different PAG cohesion measures, i.e. CoI, CoT and CoCo. First, using the ROC curve (Fig. 3), we observed that all these measures can classify true PAGs from randomly generated PAGs effectively. The area-under-curves (AUCs) of classification performance for balanced positive class (integrated PAGs from the PAGER database) and negative class (randomly generated PAGs) are 0.96, 0.95 and 0.98 respectively for each of the cohesion measures CoI, CoT and CoCo. Second, we compared the positive class with the negative class using these cohesion measures’ score distributions. Measurement score distributions between samples from the two classes are statistically significant at P-values of 8.5e-138 (for CoI), 3.0e-10 (for CoT) and 2.4e-62 (for CoCo), respectively, when two-sample t-test analysis is used. Third, we observed variable effects between PAG size and cohesion measurements’ classification performance. For example, the AUC performance using CoT is slightly better than that for CoI among small PAGs (n < 100), based on Table 2. These observation justifies the use of the combined score CoCo whenever CoT may be calculated (size>2 and minimal PPI triangle = 1).

Fig. 3.

Table 2.

AUC performance comparison between CoI and CoT for small and large PAGs

	AUC (size <100)	AUC (size >300)
CoI	0.953	0.987
CoT	0.962	0.979

Cohesions (CoI, CoT and CoCo) performance: (a) ROC curves, (b) comparison boxplot. CoI+, CoI in the true PAGs; CoI−, CoI in the random PAGs; CoT+, CoT in the true PAGs; CoT−, CoT in the random PAGs AUC performance comparison between CoI and CoT for small and large PAGs To decide the threshold of high informative PAG, we choose the CoCo score as threshold to minimize the sum of false positive and false negative. We observe the threshold of 1 has the minimum sum of false positive and false negative counts equal to 2490 (Supplementary Table S2) with the false negative rate equal to 0.057 and the false positive rate equal to 0.252. We also show the PAG coverage decreases when the threshold increases, which means we need to balance the precision and PAGs recall (Supplementary Fig S2 and Supplementary Table S3).

3.3 PAG-PAG regulatory relationship network characterization

The PAG regulatory network in PAGER has the following characteristics. First, the network only connects a small portion of all PAGs. The network only covers 3304 regular PAGs, or 20.49% of the human regular PAGs, and contains 6783 directed PAG–PAG regulation; therefore, on average each PAG has the regulatory degree of 2.05. Second, the regulatory network is well-connected. This network has 39 connected components; however, the largest connected component covers 3236 PAGs, or 97.94% of the network size. Third, the PAG regulatory network node degree strictly follows the power law, achieves R2 = 0.884 for in-degree analysis, and R2 = 0.855 for out-degree analysis. Due to the sparsity of high-quality PAG-PAG regulatory relationships identified in PAGER, we therefore suggest using both gene regulatory relationship data from conventional methods and new PAG-PAG regulatory relationship data from PAGER for GNPA.

3.4 Web interface to access PAGER data

The PAGER web interface provides four basic features: basic search, advanced search, matrix and network view, and data download. The basic search allows the users to enter terms, such as a disease name or a gene symbol. The advanced search allows the users to enter a list of gene symbols. The users can decide specific options to filter out poor quality PAGs. When the users enter a list of terms, PAGER returns genes and PAGs associating with this list. When the users search a list of genes, PAGER returns the related PAGs, showing PAGs’ gene membership, cohesions, p-value and FDR. The users can construct, expand both m-PAG and r-PAG networks and compare PAGs using a similarity matrix view. The users can sort the matrix view by name, frequency. The site also allows a user to download the data used to construct the networks and images of the matrix. Details on what features the web application provides and how to use them are provided in Supplementary File S1.

3.5 PAGER analysis of MDSC gene expression data

MDSCs have been identified in most cancer patients and tumor mice models based on their ability to suppress T-cell activation (Ostrand-Rosenberg and Sinha, 2009). MDSC are induced by tumor-secreted factors, many of which are known pro-inflammation markers. In this study, we performed a functional genomics study using microarrays, by comparing MDSC at tumor site versus at control spleens, to identify detailed molecular mechanisms that trigger the MDSC’s immunity inhibition functions. To compare the effectiveness of performing integrative GNPA with the new PAGER database, we set up a control experiment to examine the gene regulatory network that can be constructed following microarray data analysis. Among the 1105 differentially expressed genes, there are only 256 genes (23% coverage) with direct gene regulation relationships between them in the MDSC gene regulatory network (Table 3 and Supplementary Fig S3). On the other hand, there are 972 genes (88% coverage) enriched in 91 connected mPAGs through newly defined r-type PAG–PAG relationship in the MDSC PAG regulatory network. Although one can try to characterize the MDSC-derived tumor versus control (TvN) individual gene biomarkers from the gene regulatory network formed within the 256 differentially expressed genes, the poor network data coverage due to limited availability of gene regulation data can lead to significant bias in the knowledge discovery process (refer to Table 1 for summary statistics). From the new PAGER database, we are able to identify 91 enriched mPAGs (13 up-regulated PAGs and 78 PAGs, all of which have CoCo score > 1), the majority of which come from diverse data sources. Although conventional GNPA relies on gene-set enrichment analysis, we demonstrate how to perform multi-scale integrative GNPA, using regulatory PAG–PAG network analysis (Fig. 4). In the figure, we show each PAG as a node of different sizes and shapes, with size proportional to the CoCo score (in log scales) for each PAG. The MDSC inflammatory regulatory PAG network shows a significantly suppressed PAG—FEX001153 (size n = 4568)—which includes many genes responsible for immunity functions. Upstream of the immunity-suppressed process are many activated cell growth and differentiation signal molecules, such as BCL2L1, a potent inhibitor of cell death; FGFR1, Tyrosine-protein kinase that acts as cell-surface receptor that can regulate embryonic development, cell proliferation, differentiation and migration; ID1, which regulates a variety of cellular processes, including cellular growth, senescence, differentiation, apoptosis, angiogenesis and neoplastic transformation; and AHR that are involved in cell-cycle regulation. Downstream of the immunity-suppressed process are many processes responsible for cell differentiation (e.g. FEX00813) and cancer progression (e.g. FEX002152 and EPHA2), and chemokine signaling (e.g. CCR5). GEX001173, the closest downstream PAG of FEX001153, is a nephrolithiasis-related gene set acquired from GAD, and is associated with the seeded overexpressed sub gene list. This anti-immunity and pro-cell proliferation signaling can be clearly explored in the PAG/gene multi-scale regulatory network consisting of both genes and PAGs implicated in the MDSC functional genomics data in tumor.

Table 3.

The statistic for gene regulatory network and PAG regulatory network for MDSCs gene expression data

	Gene regulatory network	PAG regulatory network
Gene (n = 1105)	256	972
Up-regulated (n = 576)	110	489
Down-regulated (n = 529)	146	483
PAGs (n = 1196)	0	133
m-type	0	91
Up-regulated	0	13
Down-regulated	0	78
Regulations	501	136
Gene regulations (gene-gene)	501	0
PAG regulations (mPAG-mPAG)	0	94
PAG regulations (sPAG-mPAG)	0	42

Fig. 4.

The PAG regulatory network constructed for myeloid-derived suppressor cells gene expression data

The PAG regulatory network constructed for myeloid-derived suppressor cells gene expression data The statistic for gene regulatory network and PAG regulatory network for MDSCs gene expression data

4 Conclusion

In this work, we presented a new conceptual framework that unifies pathways, annotated gene lists, and gene-sets into a standardized representation called PAGs. We constructed a comprehensive data repository called PAGER for PAG data from heterogeneous sources including ontology, pathway, omics, and public databases. The new PAGER database collects 80% more PAG data than that of the popular MSigDB database. We showed that a significant benefit for using PAGER data is the benchmarking of each PAG with cohesion measures—scores that can help evaluate the biological relevance of all genes/proteins in a given PAG in high confidence. To address the practical challenges of performing GNPA, we define new PAG–PAG relationships as m-type for those with shared members between PAGs and t-type for those with strong supporting gene regulatory relationships pointing from one PAG to the next. We demonstrated through an integrative cancer genomics study how integrative multi-scale GNPA could help gain significant biological insights of the Omics data. We expect future researchers in the field to focus on addressing several key questions related to PAG data management and integrative GNPA. First, there should be sufficient focus on curating PAGs with rich meta-data, e.g. information on individual genetic background, life-style, and electronic medical records. Second, there should be comprehensive linking of genomics and functional genomics data from public databases into PAG data structure. Therefore, researchers may further investigate the biological relevance of enriched PAGs and regulatory PAG relationships derived from GNPA. Last, we expect to develop novel literature mining and crowd sourcing within the biomedical research community to continue rapid extension of data for the PAGER database.

50 in total

1. Pharmacogenomics and bioinformatics: PharmGKB.

Authors: Caroline F Thorn; Teri E Klein; Russ B Altman
Journal: Pharmacogenomics Date: 2010-04 Impact factor: 2.533

2. A global pathway crosstalk network.

Authors: Yong Li; Pankaj Agarwal; Dilip Rajagopalan
Journal: Bioinformatics Date: 2008-04-23 Impact factor: 6.937

3. Gene Set/Pathway enrichment analysis.

Authors: Jui-Hung Hung
Journal: Methods Mol Biol Date: 2013

4. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

5. De-correlating expression in gene-set analysis.

Authors: Dougu Nam
Journal: Bioinformatics Date: 2010-09-15 Impact factor: 6.937

6. Reactome: a database of reactions, pathways and biological processes.

Authors: David Croft; Gavin O'Kelly; Guanming Wu; Robin Haw; Marc Gillespie; Lisa Matthews; Michael Caudy; Phani Garapati; Gopal Gopinath; Bijay Jassal; Steven Jupe; Irina Kalatskaya; Shahana Mahajan; Bruce May; Nelson Ndegwa; Esther Schmidt; Veronica Shamovsky; Christina Yung; Ewan Birney; Henning Hermjakob; Peter D'Eustachio; Lincoln Stein
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

Review 7. Pathway and network analysis in proteomics.

Authors: Xiaogang Wu; Mohammad Al Hasan; Jake Yue Chen
Journal: J Theor Biol Date: 2014-06-06 Impact factor: 2.691

8. TRED: a transcriptional regulatory element database, new entries and other development.

Authors: C Jiang; Z Xuan; F Zhao; M Q Zhang
Journal: Nucleic Acids Res Date: 2007-01 Impact factor: 16.971

9. Genome-wide meta-analysis of genetic susceptible genes for Type 2 Diabetes.

Authors: Paul J Hale; Alfredo M López-Yunez; Jake Y Chen
Journal: BMC Syst Biol Date: 2012-12-17

10. PID: the Pathway Interaction Database.

Authors: Carl F Schaefer; Kira Anthony; Shiva Krupa; Jeffrey Buchoff; Matthew Day; Timo Hannay; Kenneth H Buetow
Journal: Nucleic Acids Res Date: 2008-10-02 Impact factor: 16.971

11 in total

1. Cancer Explant Models.

Authors: Christian T Stackhouse; George Yancey Gillespie; Christopher D Willey
Journal: Curr Top Microbiol Immunol Date: 2021 Impact factor: 4.291

2. Biological Network Mining.

Authors: Zongliang Yue; Da Yan; Guimu Guo; Jake Y Chen
Journal: Methods Mol Biol Date: 2021

3. DEMA: a distance-bounded energy-field minimization algorithm to model and layout biomolecular networks with quantitative features.

Authors: Zhenyu Weng; Zongliang Yue; Yuesheng Zhu; Jake Yue Chen
Journal: Bioinformatics Date: 2022-06-24 Impact factor: 6.931

4. A nonredundant role for T cell-derived interleukin 22 in antibacterial defense of colonic crypts.

Authors: Carlene L Zindl; Steven J Witte; Vincent A Laufer; Min Gao; Zongliang Yue; Karen M Janowski; Baiyi Cai; Blake F Frey; Daniel J Silberger; Stacey N Harbour; Jeffrey R Singer; Henrietta Turner; Frances E Lund; Bruce A Vallance; Alexander F Rosenberg; Trenton R Schoeb; Jake Y Chen; Robin D Hatton; Casey T Weaver
Journal: Immunity Date: 2022-03-08 Impact factor: 43.474

5. "Super Gene Set" Causal Relationship Discovery from Functional Genomics Data.

Authors: Zongliang Yue; Michael T Neylon; Thanh Nguyen; Timothy Ratliff; Jake Y Chen
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2018-07-23 Impact factor: 3.710

6. Linking clinotypes to phenotypes and genotypes from laboratory test results in comprehensive physical exams.

Authors: Sisi Zeng; Ni Cao; Thanh Nguyen; Tongbin Zhang; Geoffrey Fox; Chuandi Pan; Jake Y Chen
Journal: BMC Med Inform Decis Mak Date: 2021-02-24 Impact factor: 3.298

7. A method for developing regulatory gene set networks to characterize complex biological systems.

Authors: Chayaporn Suphavilai; Liugen Zhu; Jake Y Chen
Journal: BMC Genomics Date: 2015-11-10 Impact factor: 3.969

8. PAGER 2.0: an update to the pathway, annotated-list and gene-signature electronic repository for Human Network Biology.

Authors: Zongliang Yue; Qi Zheng; Michael T Neylon; Minjae Yoo; Jimin Shin; Zhiying Zhao; Aik Choon Tan; Jake Y Chen
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

9. Repositioning drugs by targeting network modules: a Parkinson's disease case study.

Authors: Zongliang Yue; Itika Arora; Eric Y Zhang; Vincent Laufer; S Louis Bridges; Jake Y Chen
Journal: BMC Bioinformatics Date: 2017-12-28 Impact factor: 3.169

10. PAGER Web APP: An Interactive, Online Gene Set and Network Interpretation Tool for Functional Genomics.

Authors: Zongliang Yue; Radomir Slominski; Samuel Bharti; Jake Y Chen
Journal: Front Genet Date: 2022-04-12 Impact factor: 4.772