Propensity score method for partially matched omics studies.

Pei-Fen Kuan

Abstract

This paper focuses on the problem of partially matched samples in the presence of confounders. We propose using propensity score matching to adjust for confounding factors for the subset of data with incomplete pairs, followed by integrating the P-values computed from the complete and incomplete paired samples, respectively. Several simulations and a case study on DNA methylation are considered to evaluate the operating characteristics of the proposed method.

Keywords:  confounders; full matching; microarray; observational studies; regression

Year:  2014        PMID: 25535453      PMCID: PMC4267441          DOI: 10.4137/CIN.S16352

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

Advances in biotechnology have revolutionized numerous disciplines, including biology and medicine. Several high-throughput platforms, including whole-genome arrays and next-generation sequencing instruments, are available for profiling large-scale omics data. These cutting-edge technologies have spurred rapid biomarker discovery and personalized medicine approaches in multiple diseases, in particular cancer. In recent years, genome-wide profiling with these technologies has been carried out to identify biomarkers associated with cancer development and progression. In this paper, we consider matched-pair samples for identifying differentially expressed biomarkers between two groups. The matched/paired study design is commonly used in omics/biomarker profiling because it automatically accounts for confounding factors. Examples of matched/paired designs include profiling n pairs of (1) tumor and adjacent normal lesions, (2) pre- and post-drug treatment samples, or (3) patients from the two groups of interest matched one-to-one by demographic covariates (eg, age, gender, race, etc.). We will use tumor versus normal samples hereafter for expository purposes. Ideally, one expects a total of 2n samples from such a matched/paired design. However, in practice, circumstances such as RNA degradation, array failure, or insufficient resources could leave a subset of patients without either the tumor or the matched normal biomarker profile. For example, n1 patients have both the tumor and matched normal profiles, whereas n2 and n3 patients have only tumor and only normal samples, respectively. Such incomplete or missing paired samples are also known as partially matched samples. Several methods have been developed to analyze partially matched data generated from the Gaussian distribution.1–4 Recently, Kuan and Huang5 and Yu et al.6 extended the approach to the non-parametric setting, which does not require the Gaussian assumption.
Specifically, in Kuan and Huang,5 we introduced a simple and robust method for analyzing partially matched samples based on the weighted Z-test to combine the P-values computed using (1) paired sample tests (eg, paired t-test or Wilcoxon signed-rank test) on the n1 matched pairs and (2) two-sample tests (eg, two-sample t-test or Mann–Whitney test) on the incomplete n2 and n3 pairs. The P-value pooling approach has been shown to achieve good operating characteristics compared to existing methods. As alluded to earlier, the matched/paired design is an appealing approach to avoid confounding. However, when a subset of samples has incomplete pairs, as in the partially matched setting, the covariates can become unbalanced between the tumor and normal groups. The above-mentioned methods nonetheless assume that confounding among the n2 and n3 incomplete matched pairs is absent or negligible. If this assumption does not hold, the conclusions drawn from these methods are no longer valid. In this paper, we introduce an approach to adjust for potential confounders in partially matched samples based on the propensity score matching method. Our paper is organized into five sections. In Section 2, we describe the proposed method, followed by Sections 3 and 4, which demonstrate the operating characteristics of the proposed approach in simulations and a case study, respectively. We conclude with a discussion in Section 5.

Method

Let $(X_i, Y_i)$ be a matched pair for subject $i$, $i = 1, \ldots, n$, where $X_i$ and $Y_i$ are the tumor and normal measurements, respectively. Without loss of generality, we assume that $(X_i, Y_i)$ are complete matched pairs for $i = 1, \ldots, n_1$; the $Y_i$ are missing for $i = n_1 + 1, \ldots, n_1 + n_2$; and the $X_i$ are missing for $i = n_1 + n_2 + 1, \ldots, n_1 + n_2 + n_3$. That is, $n_1$ patients have both the tumor and matched normal profiles, whereas $n_2$ and $n_3$ patients have only tumor or only normal samples, respectively. Let $\mathbf{Z}_i = (Z_{i1}, \ldots, Z_{ip})$ denote the $p$ covariates for subject $i$, for instance, $Z_{i1}$ = age, $Z_{i2}$ = gender, etc. The first step is to create pseudo-pairs between the $n_2$ and $n_3$ incomplete pairs by matching on the covariate information. We use the propensity score method to accomplish this step. To simplify the notation, we introduce the subscript $j$ to denote sample $j$, $j = 1, \ldots, n_2 + n_3$, among the incomplete pairs, and let $\mathbf{Z}_j$ denote the corresponding covariate information. Let $O_j$ denote the measurement for sample $j$. Note that $O_j = X_j$ for $j = 1, \ldots, n_2$ and $O_j = Y_j$ for $j = n_2 + 1, \ldots, n_2 + n_3$. We also let $G_j$ denote the group indicator for sample $j$, ie, $G_j = 1$ and $2$ for normal and tumor samples, respectively.

Propensity score method

The propensity score method, introduced by Rosenbaum and Rubin,7 is a popular approach in observational studies for balancing multiple confounding covariates between two groups. The propensity score is the conditional probability of group membership given the covariates; taking the tumor group as the reference, it is defined as

$$e_j = P(G_j = 2 \mid \mathbf{Z}_j).$$

There are several approaches for estimating $e_j$, including logistic regression and machine learning techniques such as boosted regression,8 classification trees (CART), and random forests. A comparison of these methods is provided in Lee et al.9 There are four main methods for removing confounding effects based on $e_j$, namely (1) propensity score matching, (2) stratification on the propensity score, (3) covariate adjustment using the propensity score, and (4) inverse probability weighting by the propensity score. We refer the readers to Austin10 for a review of these approaches. In this paper, we consider two approaches based on propensity scores to account for confounding effects.

The first approach is covariate adjustment using the propensity score via a linear model of the form

$$O_j = \beta_0 + \beta_1 G_j + \beta_2 e_j + \varepsilon_j, \qquad (1)$$

where $\varepsilon_j \sim N(0, 1)$ if the biomarker measurements are approximately Gaussian distributed after appropriate normalization and transformation. Otherwise, one can use a generalized linear model11 with an appropriate link function for non-Gaussian data. One can then evaluate whether the expression in tumor is significantly different from that in normal by testing the hypothesis $H_0: \beta_1 = 0$.

The second approach is based on the Mahalanobis distance on covariate ranks with a propensity score caliper12 for matching the covariates between the $n_2$ tumor and $n_3$ normal samples from incomplete pairs. Let $\mathbf{r}_j$ be the vector of covariate ranks for sample $j$. The Mahalanobis distance between sample $j$ in the tumor group and sample $k$ in the normal group is defined as

$$d(j, k) = (\mathbf{r}_j - \mathbf{r}_k)^{T} \hat{\Sigma}^{-1} (\mathbf{r}_j - \mathbf{r}_k),$$

where $\hat{\Sigma}$ is the estimated pooled covariance matrix of the ranks. The propensity score caliper $c$, on the other hand, is defined as the maximum propensity score distance between samples $j$ and $k$ allowed within a match. In other words, with the propensity scores compared on the logit scale,

$$d^{*}(j, k) = \begin{cases} d(j, k) & \text{if } |\operatorname{logit}(e_j) - \operatorname{logit}(e_k)| \le c, \\ \infty & \text{otherwise.} \end{cases}$$

The choice of caliper width involves a bias–variance trade-off: a small caliper width reduces bias at the expense of increased variance, and vice versa.13 A few studies have investigated the optimal caliper width in propensity score matching, including the work of Austin13 and Wang et al.14 Based on these works and our own experience in propensity score matching, we recommend a caliper width equal to 0.2 of the standard deviation of the logit of the propensity score, which tends to perform better, ie,

$$c = 0.2 \sqrt{(\sigma_1^2 + \sigma_2^2) / 2},$$

where $\sigma_G^2$ is the variance of the logit of the propensity score in the $G$th group.

The samples are matched using the optimal full matching algorithm.15–17 The matching algorithm aims to group tumor and normal samples that have similar covariates, ie, small $d$. Optimal full matching subdivides the samples into a collection of matched sets $\{\mathcal{M}_1, \ldots, \mathcal{M}_S\}$, where each set consists of one tumor sample with any number of normal samples or one normal sample with any number of tumor samples, by minimizing the net discrepancy $\sum_{s=1}^{S} \sum_{(j,k) \in \mathcal{M}_s} d(j, k)$.15,17 Olsen’s algorithm is used to create the optimal matching (see Hansen15 and Hansen and Klopfer16 for details).
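As an illustration of the caliper rule described above, the following Python sketch computes a caliper width of 0.2 times the pooled standard deviation of the logit propensity scores and sets every tumor–normal distance outside the caliper to infinity. It assumes that the propensity scores and a tumor-by-normal distance matrix have already been computed elsewhere. The function names, and the choice to compare scores on the logit scale, are assumptions made for this sketch rather than details taken from the paper.

```python
import math
from statistics import variance


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def caliper_width(ps_tumor, ps_normal, factor=0.2):
    """factor x pooled SD of the logit propensity score.

    Each group needs at least two samples so that the sample
    variance is defined.
    """
    var_tumor = variance([logit(p) for p in ps_tumor])
    var_normal = variance([logit(p) for p in ps_normal])
    return factor * math.sqrt((var_tumor + var_normal) / 2.0)


def apply_caliper(dist, ps_tumor, ps_normal, c):
    """Return a copy of dist (rows: tumor, columns: normal) in which
    pairs whose logit propensity scores differ by more than c get an
    infinite distance, so a matching algorithm will never pair them."""
    out = []
    for j, pj in enumerate(ps_tumor):
        row = []
        for k, pk in enumerate(ps_normal):
            within = abs(logit(pj) - logit(pk)) <= c
            row.append(dist[j][k] if within else math.inf)
        out.append(row)
    return out
```

The penalized matrix can then be handed to any matching routine. The optimal full matching step itself is not reproduced here.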

Test statistics for matched set

The tumor and normal samples within each matched set tend to be correlated since they have comparable baseline covariates.7,18 For one-to-one pairing, one usually uses the paired sample t-test or the Wilcoxon signed-rank test to test whether the expression level of tumors is significantly different from that of the normal samples. However, in the full matching scenario, each tumor is paired with several normal samples or vice versa; thus, the paired sample t-test or Wilcoxon signed-rank test needs to be generalized to such one-to-many pairing. In this paper, we consider a generalization of the paired sample t-test for biomarker measurements that are approximately Gaussian distributed. Following Rosner,19 a generalized paired sample t-test can be derived from a one-way random effects ANOVA model given by

$$D_s = \alpha + b_s + \varepsilon_s, \qquad s = 1, \ldots, S,$$

where $D_s$ is the difference between the mean tumor and mean normal measurements in matched set $s$, $\alpha$ is the overall within-pair mean difference between tumor and normal samples, $b_s \sim N(0, \sigma_b^2)$ is the random effect for the $s$th pairing, $\varepsilon_s \sim N(0, \sigma_s^2)$ is the random error, and $S$ is the total number of matched sets. In addition, $\sigma_s^2 = \sigma^2 (1/m_{1s} + 1/m_{2s})$, where $m_{1s}$ and $m_{2s}$ are the numbers of tumors and normals in matched set $s$, respectively. The hypothesis that the expression in tumor differs from that in normal samples translates into testing $\alpha = 0$. The test statistic is the standardized estimate

$$T = \hat{\alpha} / \widehat{\mathrm{se}}(\hat{\alpha}),$$

where $\sigma^2$ is estimated using the usual unbiased estimator, whereas $\alpha$ and $\sigma_b^2$ are estimated using numerical methods (see Rosner19 for details). For large samples, the P-value of $T$ can be obtained from the asymptotic distribution $N(0, 1)$. For small samples, the P-value can be computed from a permutation test, by permuting the labels of tumor and normal samples within each matched set. If there are $m_{1s}$ tumors among a total of $N_s$ samples in matched set $s$, then the total number of possible permutations is $\prod_{s=1}^{S} \binom{N_s}{m_{1s}}$. If the data are non-Gaussian, on the other hand, a generalized non-parametric test for one-to-many pairing can be carried out via the aligned rank test of Hodges and Lehmann.20 We refer the readers to Hodges and Lehmann20 and Heller et al.21 for additional details on implementing the aligned rank test.
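The within-set permutation scheme described above can be sketched in a few lines of Python. The sketch counts the possible label permutations as a product of binomial coefficients and approximates a one-sided permutation P-value by random relabelling within each matched set. The statistic used here (the average of the within-set differences in means) is a simple stand-in chosen for illustration and is not the generalized t-statistic of Rosner; all function names are hypothetical.

```python
import math
import random
from statistics import mean


def n_permutations(matched_sets):
    """Number of tumor/normal relabellings within matched sets.

    matched_sets is a list of (tumor_values, normal_values) tuples.
    """
    total = 1
    for tumor, normal in matched_sets:
        total *= math.comb(len(tumor) + len(normal), len(tumor))
    return total


def set_statistic(matched_sets):
    """Stand-in statistic: mean over sets of (mean tumor - mean normal)."""
    return mean(mean(t) - mean(n) for t, n in matched_sets)


def permutation_pvalue(matched_sets, n_perm=2000, seed=0):
    """One-sided ("greater") Monte Carlo permutation P-value."""
    rng = random.Random(seed)
    observed = set_statistic(matched_sets)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for tumor, normal in matched_sets:
            pooled = list(tumor) + list(normal)
            rng.shuffle(pooled)  # relabel only within this matched set
            shuffled.append((pooled[:len(tumor)], pooled[len(tumor):]))
        if set_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

When the number of matched sets is small, `n_permutations` shows how coarse the permutation distribution is: the smallest attainable P-value is roughly one over that count.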

P-values pooling

We follow the idea of our earlier work in Kuan and Huang5 to test whether the biomarker is significantly up- or down-regulated in tumor compared to normal samples by pooling the P-values from the $n_1$ complete and $(n_2, n_3)$ incomplete pairs. The P-value for the $n_1$ complete matched pairs, denoted $p_1$, is computed using either the paired sample t-test or the Wilcoxon signed-rank test. The P-value for the incomplete pairs, $p_2$, is computed from the linear model using the propensity score as a covariate (equation (1) of Section 2.1), the generalized t-test, or the aligned rank test (Section 2.2). The next step is to pool the two P-values by borrowing the idea of meta-analysis. Several methods are available for pooling P-values, including the inverse normal and Fisher’s methods. In Kuan and Huang,5 we showed that pooling P-values based on the weighted Z-test has good operating characteristics compared to other methods. The weighted Z-test transforms the P-values into Z-scores $Z_k = \Phi^{-1}(1 - p_k)$, $k = 1, 2$. The combined P-value by the weighted Z-test5,22 is given by

$$p_c = 1 - \Phi\left( \frac{w_1 Z_1 + w_2 Z_2}{\sqrt{w_1^2 + w_2^2}} \right), \qquad (2)$$

where the $w_k$ are the corresponding weights. Although different choices of weights have been proposed in the literature, Kuan and Huang5 and Zaykin23 showed that setting the weights to the square roots of the sample sizes works well in practice. Thus, we set $w_1 = \sqrt{2 n_1}$ and $w_2 = \sqrt{n_2 + n_3}$. In addition, pooling P-values is only meaningful if $p_1$ and $p_2$ are computed from one-sided hypothesis tests, to avoid directional conflict. One can obtain a two-sided combined P-value as follows. Let $p_1$ and $p_2$ be the one-sided P-values for the same alternative (eg, “greater”) hypothesis, and let $p_c$ be the combined one-sided P-value from equation (2). The two-sided P-value is given by

$$p^{*} = \begin{cases} 2 p_c & \text{if } p_c < 1/2, \\ 2 (1 - p_c) & \text{otherwise.} \end{cases}$$
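The pooling step described above is short enough to sketch with the Python standard library alone. The sketch takes two one-sided P-values for the same alternative, combines them with the weighted Z-test, and converts the result to a two-sided P-value. The function names are hypothetical, and the specific weights (square roots of 2·n1 and n2 + n3) are an assumption consistent with "square root of the sample sizes".

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def weighted_z_pvalue(p1, p2, n1, n2, n3):
    """Combine two one-sided P-values with the weighted Z-test.

    p1 comes from the n1 complete pairs, p2 from the n2 + n3
    incomplete samples. Both must lie strictly between 0 and 1,
    because inv_cdf is undefined at the endpoints.
    """
    w1 = sqrt(2 * n1)   # assumed weight for the complete pairs
    w2 = sqrt(n2 + n3)  # assumed weight for the incomplete samples
    z1 = _STD_NORMAL.inv_cdf(1 - p1)
    z2 = _STD_NORMAL.inv_cdf(1 - p2)
    z = (w1 * z1 + w2 * z2) / sqrt(w1 ** 2 + w2 ** 2)
    return 1 - _STD_NORMAL.cdf(z)


def two_sided(p_combined):
    """Two-sided P-value from a combined one-sided P-value."""
    if p_combined < 0.5:
        return 2 * p_combined
    return 2 * (1 - p_combined)
```

For example, `two_sided(weighted_z_pvalue(0.04, 0.10, 30, 35, 35))` pools a paired-test P-value of 0.04 with an incomplete-pair P-value of 0.10 under one of the simulated sample-size configurations.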

Simulation

We carry out simulations to evaluate the performance of the propensity score method in adjusting for potential confounders in partially matched samples. For a given biomarker, $n$ paired measurements $(X_i, Y_i)$ for the tumor and matched normal groups are generated from a bivariate Gaussian distribution [the display equation is missing from the source text], where $Z_{i1}$ and $Z_{i2}$ are confounders, and $\mu_X$ and $\mu_Y$ are the true mean expressions for the tumor and normal groups, respectively. We consider $(n_1, n_2, n_3)$ = (70, 15, 15), (50, 25, 25), and (30, 35, 35) and set $\sigma_X = \sigma_Y = 1$, whereas $\rho \sim U(0, 1)$ to capture various degrees of correlation between tumor and normal matched pairs. In addition, we fix one group mean at 0 and set the other to 0, 0.1, 0.2, …, 0.5 for different effect sizes, and we set $\beta$ = 0, 0.5, 1, and 2 for zero, moderate, strong, and very strong confounding effects. To simulate unbalanced confounders arising from incomplete matched pairs, we generate $Z_{i1}, Z_{i2} \sim N(0, 1)$ for $i = 1, \ldots, n_1$; $Z_{i1} \sim N(-0.2, 1)$, $Z_{i2} \sim N(0.2, 1)$ for $i = n_1 + 1, \ldots, n_1 + n_2$; and $Z_{i1} \sim N(0.2, 1)$, $Z_{i2} \sim N(-0.2, 1)$ for $i = n_1 + n_2 + 1, \ldots, n_1 + n_2 + n_3$. We compare the performance of the following methods in our simulation studies:

Gold standard (Gold-std): The P-value is computed from the paired sample t-test on the $n_1 + n_2 + n_3$ original matched pairs, assuming a complete data set. This is the reference test.

Paired only: The P-value is computed from the paired sample t-test on the $n_1$ complete matched pairs only, discarding the $n_2$ and $n_3$ incomplete pairs.

Two sample: The P-value from the paired sample t-test on the $n_1$ complete matched pairs is combined with the P-value from the two-sample t-test on the incomplete $n_2$ and $n_3$ samples using the weighted Z-test approach.5

Propensity score with full matching (FM-PS): The P-value from the paired sample t-test on the $n_1$ complete matched pairs is combined with the P-value from the generalized t-test on the incomplete $n_2$ and $n_3$ samples after full matching by Mahalanobis distance with propensity score caliper $c = 0.2$.

Propensity score with regression adjustment (Reg-PS): The P-value from the paired sample t-test on the $n_1$ complete matched pairs is combined with the P-value from the linear regression model using the propensity score as a covariate on the incomplete $n_2$ and $n_3$ samples.
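To make the covariate imbalance in this design concrete, the following Python sketch draws the two confounders for the complete, tumor-only, and normal-only subjects with the means stated above and reports the difference in covariate means between the two incomplete groups. Only the covariates are simulated. The biomarker measurements are left out because their generating equation is not available in this text. The function names and the seed are hypothetical.

```python
import random
from statistics import mean


def simulate_covariates(n1, n2, n3, shift=0.2, seed=0):
    """Draw (Z1, Z2) for each subject in the three groups.

    complete:    Z1, Z2 ~ N(0, 1)
    tumor_only:  Z1 ~ N(-shift, 1), Z2 ~ N(+shift, 1)
    normal_only: Z1 ~ N(+shift, 1), Z2 ~ N(-shift, 1)
    """
    rng = random.Random(seed)
    complete = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n1)]
    tumor_only = [(rng.gauss(-shift, 1), rng.gauss(shift, 1))
                  for _ in range(n2)]
    normal_only = [(rng.gauss(shift, 1), rng.gauss(-shift, 1))
                   for _ in range(n3)]
    return complete, tumor_only, normal_only


def mean_difference(group_a, group_b, index):
    """Difference in the mean of covariate `index` between two groups."""
    return (mean(z[index] for z in group_a)
            - mean(z[index] for z in group_b))


if __name__ == "__main__":
    _, tumor_only, normal_only = simulate_covariates(30, 35, 35, seed=1)
    print("Z1 imbalance:", mean_difference(tumor_only, normal_only, 0))
    print("Z2 imbalance:", mean_difference(tumor_only, normal_only, 1))
```

With these settings the expected imbalance is −0.4 for Z1 and +0.4 for Z2, which is the imbalance that an unadjusted two-sample comparison of the incomplete pairs would inherit whenever the confounders affect the biomarker.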

Single biomarker

We first evaluate the performance of the propensity score methods in adjusting for unbalanced covariates in the single biomarker setting. Table 1 reports the average empirical Type I error at nominal α = 0.05 over 10,000 replications. When there is no confounding effect, ie, β = 0, all the methods control the Type I error. However, for β ≠ 0, the two-sample method exhibits the largest Type I error inflation. On the other hand, the paired only, FM-PS, and Reg-PS methods control the Type I error under all the scenarios considered in the simulation. Figure 1 shows the average power for different combinations of n1, n2, n3, and β for the methods that control the Type I error (empirical Type I error ≤ 0.055). As expected, missing samples in incomplete matched pairs reduce the power compared to the complete data set (Gold-std). However, incorporating the incomplete matched pairs with proper adjustment for confounders via the FM-PS or Reg-PS method increases statistical power compared to using only the n1 paired samples when n2 and n3 are large relative to n1. When n2 and n3 are small relative to n1, using only the n1 paired samples is comparable to the methods that incorporate n2 and n3. The FM-PS and Reg-PS methods show comparable performance in this simulation study.
Table 1

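Average empirical Type I error at nominal α = 0.05. Values marked with * indicate that the empirical Type I error is greater than 0.055.

| (n1, n2, n3) | Method | β = 0 | β = 0.5 | β = 1 | β = 2 |
|---|---|---|---|---|---|
| (70, 15, 15) | Gold-std | 0.0455 | 0.0473 | 0.0503 | 0.0501 |
| (70, 15, 15) | Paired only | 0.0479 | 0.0494 | 0.0490 | 0.0518 |
| (70, 15, 15) | Two-sample | 0.0475 | 0.0643* | 0.0794* | 0.0922* |
| (70, 15, 15) | FM-PS | 0.0462 | 0.0485 | 0.0469 | 0.0476 |
| (70, 15, 15) | Reg-PS | 0.0480 | 0.0502 | 0.0441 | 0.0442 |
| (50, 25, 25) | Gold-std | 0.0527 | 0.0541 | 0.0495 | 0.0541 |
| (50, 25, 25) | Paired only | 0.0486 | 0.0536 | 0.0534 | 0.0531 |
| (50, 25, 25) | Two-sample | 0.0476 | 0.1044* | 0.1460* | 0.1809* |
| (50, 25, 25) | FM-PS | 0.0483 | 0.0467 | 0.0465 | 0.0485 |
| (50, 25, 25) | Reg-PS | 0.0494 | 0.0476 | 0.0443 | 0.0416 |
| (30, 35, 35) | Gold-std | 0.0522 | 0.0543 | 0.0483 | 0.0489 |
| (30, 35, 35) | Paired only | 0.0507 | 0.0494 | 0.0499 | 0.0465 |
| (30, 35, 35) | Two-sample | 0.0522 | 0.1712* | 0.2805* | 0.3663* |
| (30, 35, 35) | FM-PS | 0.0449 | 0.0454 | 0.0422 | 0.0460 |
| (30, 35, 35) | Reg-PS | 0.0524 | 0.0479 | 0.0452 | 0.0446 |
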
Figure 1

Average power at nominal α = 0.05. Sample size (n1, n2, n3) and effect of confounding β are indicated in the header of each plot. Power curves for methods that did not control Type I error (ie, empirical Type I error > 0.055) are not shown.

Notes: ○: Gold-std, ∆: FM-PS, +: Reg-PS, ×: two-sample, and ⋄: paired only.

Multiple biomarkers

As omics data involve testing multiple biomarkers simultaneously (within a multiple hypothesis testing framework), we also simulate observations in the multiple biomarkers setting. We consider 1000 biomarkers and repeat each simulation setting over 100 replications. We use the false discovery rate (FDR) procedure of Benjamini and Hochberg24 to adjust for multiple hypothesis testing. Figure 2 reports the average empirical FDR for the different methods at nominal FDR = 0.05. As in the single biomarker case, the two-sample method exhibits the largest inflation of the empirical FDR for β ≠ 0. On the other hand, the FM-PS, Reg-PS, and paired only methods control the FDR across the different scenarios. Figure 3 shows the empirical false nondiscovery rate (FNR) for the methods under comparison. The FNR is an analog of the Type II error in multiple hypothesis testing settings and is defined as the proportion of false negatives among the total number of non-rejections. The empirical FNR is large when the number of incomplete matched pairs is large. On the other hand, both the FM-PS and Reg-PS methods result in lower FNR than the paired only method.
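For readers unfamiliar with the Benjamini–Hochberg step-up procedure used above, a minimal Python sketch follows. It takes a list of P-values (one per biomarker) and returns which hypotheses are rejected at a nominal FDR level q. The function name is hypothetical, and the sketch uses only the standard library.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the original order of pvalues,
    that is True where the hypothesis is rejected at FDR level q.
    """
    m = len(pvalues)
    # Indices of the P-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max, even if an
    # individual P-value there exceeds its own threshold (step-up).
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

The step-up property matters in practice: a P-value that misses its own threshold is still rejected when a larger P-value further down the sorted list meets its threshold.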
Figure 2

Average empirical FDR at nominal FDR = 0.05. Sample size (n1, n2, n3) and effect of confounding β are indicated in the header of each plot.

Notes: ○: Gold-std, ∆: FM-PS, +: Reg-PS, ×: two-sample, and ⋄: paired only.

Figure 3

Average empirical FNR at nominal FDR = 0.05. Sample size (n1, n2, n3) and effect of confounding β are indicated in the header of each plot. FNR values for methods that did not control FDR (ie, empirical FDR > 0.055) are not shown.

Notes: ○: Gold-std, ∆: FM-PS, +: Reg-PS, ×: two-sample, and ⋄: paired only.

Case Study

We illustrate the proposed propensity score adjustment for partially matched samples on a publicly available DNA methylation data set from Selamat et al.25 (downloaded from Gene Expression Omnibus (GEO) under accession number GSE32861). The data set consists of 58 matched pairs of lung adenocarcinoma and adjacent non-tumor lung tissue after removing paired sample 3023_T/N.25 Methylation for these samples was profiled using the Illumina HumanMethylation27 BeadChip, which covers 27,578 CpGs. We use a subset of the baseline covariates measured for each sample (ie, age, smoking status, stage, recurrence, KRAS mutation, EGFR mutation, and LKB1 mutation) to illustrate the performance of the different methods. Age and stage are continuous and ordinal variables, respectively, whereas the other covariates are binary variables. We randomly choose n1 out of the 58 matched pairs to be complete matched pairs. Next, we generate 58 − n1 indicator variables $\delta_k$, k = 1, …, 58 − n1, from a function of the covariates after standardizing each covariate [the defining expressions are missing from the source text]. This function generates approximately equal numbers of 0s and 1s on average. Among the remaining 58 − n1 pairs, we set the n2 pairs with $\delta_k = 1$ to be missing the non-tumor lung tissue sample, and the remaining pairs to be missing the lung adenocarcinoma sample. We consider n1 = 10, 20, 30, 40, and for each n1, the process is repeated 50 times. We apply the FM-PS, Reg-PS, two-sample, and paired only methods to the logit-transformed methylation β-values, ie, log(β/(1 − β)),26,27 of each CpG. Since the CpGs that are truly differentially methylated between lung adenocarcinoma and non-tumor lung tissue are unknown in the case study, we use the results from the paired sample t-test on the full 58 matched pairs as Gold-std. We define the true positive CpGs as the subset of CpGs that are significant at the Benjamini and Hochberg24 FDR of 0.05 for the Gold-std method.
We compare the list of significant CpGs identified by FM-PS, Reg-PS, two-sample, and paired only methods at FDR = 0.05 to the true positive CpGs. In Table 2, we report the average empirical FDR, FNR, and average true positive (ATP) CpGs identified by each method. The ATP CpG is also the number of overlapping CpGs identified by each method and the Gold-std method. Two-sample method declares a larger number of false positives as indicated by the inflated empirical FDR. In this case study, the effect of confounding is moderate; thus, FM-PS, Reg-PS, and paired only methods are able to control the FDR. However, both FM-PS and Reg-PS methods have lower FNR and larger ATP compared to paired only method. This shows that the propensity score method for partially matched samples is able to adjust for confounders and improve the power of detecting differentially methylated CpGs.
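The comparison against the Gold-std list can be sketched as follows. Given the set of CpGs called significant by a method, the Gold-std set, and the set of all CpGs tested, the Python function below returns the empirical FDR, the empirical FNR, and the number of true positives for a single replicate. Averaging over replicates would give the quantities reported in Table 2. The function name and the CpG identifiers in the usage example are hypothetical.

```python
def compare_to_gold(called, gold, all_ids):
    """Empirical FDR, FNR and true-positive count for one replicate.

    called:  IDs declared significant by the method under study
    gold:    IDs declared significant by the Gold-std analysis
    all_ids: every ID that was tested
    """
    called, gold, all_ids = set(called), set(gold), set(all_ids)
    true_pos = called & gold
    false_pos = called - gold
    not_called = all_ids - called
    false_neg = not_called & gold
    # Guard against division by zero when nothing (or everything)
    # is called significant.
    fdr = len(false_pos) / len(called) if called else 0.0
    fnr = len(false_neg) / len(not_called) if not_called else 0.0
    return {"FDR": fdr, "FNR": fnr, "TP": len(true_pos)}


if __name__ == "__main__":
    tested = [f"cg{i:02d}" for i in range(1, 11)]  # made-up CpG IDs
    gold = {"cg01", "cg02", "cg03", "cg04"}
    called = {"cg01", "cg02", "cg05"}
    print(compare_to_gold(called, gold, tested))
```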
Table 2

Average empirical FDR, FNR, and ATP at nominal FDR = 0.05.

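| (n1, n2, n3) | Metric | Paired only | Two-sample | FM-PS | Reg-PS |
|---|---|---|---|---|---|
| (10, 24, 24) | FDR | 0.0304 | 0.0822 | 0.0285 | 0.0256 |
| (10, 24, 24) | FNR | 0.5439 | 0.2121 | 0.4649 | 0.4478 |
| (10, 24, 24) | ATP | 3457 | 13722 | 6908 | 7527 |
| (20, 19, 19) | FDR | 0.0322 | 0.0625 | 0.0256 | 0.0247 |
| (20, 19, 19) | FNR | 0.4443 | 0.1897 | 0.4053 | 0.3810 |
| (20, 19, 19) | ATP | 7690 | 14014 | 8941 | 9568 |
| (30, 14, 14) | FDR | 0.0352 | 0.0537 | 0.0393 | 0.0333 |
| (30, 14, 14) | FNR | 0.3512 | 0.1632 | 0.3241 | 0.2948 |
| (30, 14, 14) | ATP | 10422 | 14395 | 11183 | 11715 |
| (40, 9, 9) | FDR | 0.0354 | 0.0648 | 0.0404 | 0.0534 |
| (40, 9, 9) | FNR | 0.2404 | 0.1193 | 0.2269 | 0.1803 |
| (40, 9, 9) | ATP | 12955 | 15028 | 13201 | 13973 |
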
We carry out a gene ontology (GO) analysis to provide biological insight into the list of CpGs significant at FDR = 0.05 identified by the Gold-std method, using the Bioconductor package topGO.28 We consider both the elim Fisher’s exact test (elim.Fisher) and the elim Kolmogorov–Smirnov test (elim.KS) implemented in topGO based on Alexa et al.29 The elim method has been shown to improve interpretation of the GO analysis by integrating the GO graph topology and iteratively removing genes that map to significant GO terms from higher-level GO terms.29 The P-values from each test are adjusted via the Benjamini–Hochberg method.24 Tables 3–5 report the GO terms for biological process (BP), molecular function (MF), and cellular component (CC), respectively, that have adjusted P-values < 0.05 under both the elim.Fisher and elim.KS tests. For example, the BP GO analysis identifies a GO term related to positive regulation of the ERK1 and ERK2 cascade, which has been implicated in lung adenocarcinomas.30,31
Table 3

Significant BP GO terms for the CpGs identified by the Gold-std method in the lung adenocarcinoma case study. The reported P-values are adjusted via the Benjamini–Hochberg FDR control.24

BIOLOGICAL PROCESS

| GO ID | Term | Annotated | Significant | Expected | elim.Fisher | elim.KS |
|---|---|---|---|---|---|---|
| GO:0007268 | Synaptic transmission | 1153 | 803 | 685.54 | 2.18e-07 | 2.54e-17 |
| GO:0048704 | Embryonic skeletal system morphogenesis | 159 | 120 | 94.54 | 0.0122 | 1.94e-11 |
| GO:0031424 | Keratinization | 58 | 53 | 34.49 | 0.00019 | 2.97e-11 |
| GO:0007155 | Cell adhesion | 1577 | 1067 | 937.64 | 0.0122 | 7.27e-08 |
| GO:0048265 | Response to pain | 54 | 46 | 32.11 | 0.0139 | 9.22e-08 |
| GO:0009952 | Anterior/posterior pattern specification | 350 | 246 | 208.1 | 0.0283 | 1.21e-07 |
| GO:0007156 | Homophilic cell adhesion | 177 | 133 | 105.24 | 0.0102 | 1.29e-07 |
| GO:0007186 | G-protein coupled receptor signaling pat. | 1035 | 721 | 615.38 | 0.000349 | 6.24e-06 |
| GO:0007193 | Adenylate cyclase-inhibiting G-protein c. | 91 | 73 | 54.11 | 0.0122 | 6.24e-06 |
| GO:0007267 | Cell-cell signaling | 1923 | 1319 | 1143.36 | 0.0122 | 6.24e-06 |
| GO:0070374 | Positive regulation of ERK1 and ERK2 cas. | 174 | 128 | 103.46 | 0.0201 | 3.26e-05 |
| GO:0050911 | Detection of chemical stimulus involved. | 19 | 19 | 11.3 | 0.0161 | 4.15e-05 |
| GO:0001755 | Neural crest cell migration | 80 | 65 | 47.57 | 0.0122 | 4.15e-05 |
| GO:0023019 | Signal transduction involved in regulati. | 33 | 30 | 19.62 | 0.0201 | 4.15e-05 |
| GO:0007204 | Elevation of cytosolic calcium ion conce. | 326 | 240 | 193.83 | 0.0322 | 9.08e-05 |
| GO:0048484 | Enteric nervous system development | 30 | 28 | 17.84 | 0.0139 | 9.73e-05 |
| GO:0030198 | Extracellular matrix organization | 537 | 375 | 319.28 | 0.0102 | 9.73e-05 |
| GO:0021527 | Spinal cord association neuron different. | 27 | 25 | 16.05 | 0.0322 | 0.000155 |
| GO:0042742 | Defense response to bacterium | 209 | 150 | 124.27 | 0.0313 | 0.000266 |
| GO:0006954 | Inflammatory response | 845 | 564 | 502.41 | 0.0217 | 0.000556 |
| GO:0030855 | Epithelial cell differentiation | 905 | 611 | 538.09 | 0.0139 | 0.00112 |
| GO:0045666 | Positive regulation of neuron differenti. | 112 | 86 | 66.59 | 0.0217 | 0.00185 |
| GO:0030335 | Positive regulation of cell migration | 428 | 294 | 254.48 | 0.0139 | 0.00265 |
| GO:0019233 | Sensory perception of pain | 137 | 105 | 81.46 | 0.0122 | 0.00301 |
| GO:0048485 | Sympathetic nervous system development | 45 | 40 | 26.76 | 0.0122 | 0.00496 |
| GO:0045165 | Cell fate commitment | 417 | 299 | 247.94 | 0.034 | 0.00999 |
| GO:0007215 | Glutamate receptor signaling pathway | 88 | 75 | 52.32 | 0.0122 | 0.0181 |
| GO:0021846 | Cell proliferation in forebrain | 43 | 38 | 25.57 | 0.0139 | 0.0194 |
| GO:0007631 | Feeding behavior | 155 | 117 | 92.16 | 0.0122 | 0.0273 |

Table 5

Significant CC GO terms for the CpGs identified by the Gold-std method in the lung adenocarcinoma case study. The reported P-values are adjusted via the Benjamini–Hochberg FDR control.24

CELLULAR COMPONENT

| GO ID | Term | Annotated | Significant | Expected | elim.Fisher | elim.KS |
|---|---|---|---|---|---|---|
| GO:0005887 | Integral to plasma membrane | 2079 | 1443 | 1242.77 | 3.14e-14 | 3.64e-35 |
| GO:0005576 | Extracellular region | 3134 | 2161 | 1873.41 | 1.35e-09 | 1.82e-27 |
| GO:0005615 | Extracellular space | 1398 | 964 | 835.68 | 5.49e-11 | 4.19e-25 |
| GO:0005886 | Plasma membrane | 6289 | 4079 | 3759.38 | 2.08e-05 | 2.84e-14 |
| GO:0005578 | Proteinaceous extracellular matrix | 547 | 402 | 326.98 | 1.32e-07 | 1.39e-13 |
| GO:0016021 | Integral to membrane | 6898 | 4421 | 4123.42 | 0.000363 | 1.86e-11 |
| GO:0045211 | Postsynaptic membrane | 294 | 209 | 175.74 | 0.00296 | 4.46e-10 |
| GO:0008076 | Voltage-gated potassium channel complex | 117 | 92 | 69.94 | 0.0012 | 7.64e-10 |
| GO:0030054 | Cell junction | 1197 | 776 | 715.53 | 0.0105 | 2.57e-06 |
| GO:0034774 | Secretory granule lumen | 105 | 79 | 62.77 | 0.0419 | 0.00149 |

Discussion

Partially matched samples can give rise to unbalanced covariate distributions among the incomplete matched pairs in large-scale matched-pair omics studies. This paper extends the P-value pooling method of Kuan and Huang5 to a propensity score framework that adjusts for unbalanced covariate distributions among the incomplete matched pairs. We consider two approaches using the propensity score, namely, (1) full matching followed by a generalized t-test (FM-PS) and (2) the propensity score as a covariate in a regression model (Reg-PS). Both methods are able to reduce the number of false positives by accounting for the confounders. Currently, we use the full matching approach based on Mahalanobis distance with propensity score calipers15–17 and the one-way random effects ANOVA model19 for deriving the generalized paired t-test. One can also use other matching algorithms based on the propensity score.10 In this paper, we assume that the biomarker measurements are properly transformed such that they are approximately Gaussian distributed. If the Gaussian assumption is violated, one can replace the generalized paired t-test with the generalized non-parametric aligned rank test of Hodges and Lehmann20 and Heller et al.21 in the FM-PS method, and replace ordinary linear regression with generalized linear models11 in the Reg-PS method. For instance, in our case study on DNA methylation, the analysis is carried out on the logit-transformed beta values (also known as M values). An alternative approach is to analyze the untransformed beta values using beta regression in the Reg-PS method. Whether DNA methylation data should be analyzed as beta values or as M values is an area of active research.32,33 The FM-PS and Reg-PS methods exhibit comparable performance in both our simulations and the case study. In this paper, we assume a linear propensity score–outcome relationship, which enables us to apply direct adjustment with a linear propensity score term in Reg-PS.
In such cases, Reg-PS method is computationally more efficient and easier to implement compared to FM-PS method. However, if the propensity score–outcome relationship is non-linear, one will need to consider more complicated models, for instance, the generalized additive model (GAM) as proposed in Myers and Louis.34 In such cases, the FM-PS method may be a better alternative as this approach does not require specification of the propensity score–outcome relationship. Thus, we recommend that the users compare the results from both Reg-PS and FM-PS methods in practice.
Table 4

Significant MF GO terms for the CpGs identified by the Gold-std method in the lung adenocarcinoma case study. The reported P-values are adjusted via the Benjamini–Hochberg FDR control.24

MOLECULAR FUNCTION

| GO ID | Term | Annotated | Significant | Expected | elim.Fisher | elim.KS |
|---|---|---|---|---|---|---|
| GO:0004930 | G-protein coupled receptor activity | 741 | 566 | 440.86 | 2.51e-10 | 1.67e-22 |
| GO:0004984 | Olfactory receptor activity | 67 | 62 | 39.86 | 7.72e-07 | 2.26e-15 |
| GO:0005509 | Calcium ion binding | 977 | 654 | 581.27 | 0.000243 | 1.29e-13 |
| GO:0043565 | Sequence-specific DNA binding | 1129 | 735 | 671.71 | 0.00888 | 5.34e-12 |
| GO:0005201 | Extracellular matrix structural constitu. | 123 | 95 | 73.18 | 0.00634 | 1.81e-08 |
| GO:0005234 | Extracellular-glutamate-gated ion channe. | 29 | 26 | 17.25 | 0.0283 | 6.5e-07 |
| GO:0005125 | Cytokine activity | 332 | 234 | 197.53 | 0.00604 | 1.33e-05 |
| GO:0005230 | Extracellular ligand-gated ion channel a. | 122 | 99 | 72.58 | 0.0147 | 3.79e-05 |
| GO:0004890 | GABA-A receptor activity | 33 | 29 | 19.63 | 0.0283 | 6e-05 |
| GO:0015269 | Calcium-activated potassium channel acti. | 27 | 25 | 16.06 | 0.0192 | 0.000132 |
| GO:0004888 | Transmembrane signaling receptor activit. | 1333 | 979 | 793.08 | 0.0363 | 0.000591 |
| GO:0005540 | Hyaluronic acid binding | 32 | 29 | 19.04 | 0.0179 | 0.00185 |
| GO:0015293 | Symporter activity | 213 | 152 | 126.73 | 0.0216 | 0.00232 |
| GO:0020037 | Heme binding | 201 | 144 | 119.59 | 0.0216 | 0.00558 |
| GO:0005506 | Iron ion binding | 238 | 169 | 141.6 | 0.0192 | 0.00558 |
| GO:0016918 | Retinal binding | 28 | 25 | 16.66 | 0.0363 | 0.0104 |
| GO:0005242 | Inward rectifier potassium channel activ. | 37 | 32 | 22.01 | 0.0283 | 0.0126 |
| GO:0015279 | Store-operated calcium channel activity | 16 | 16 | 9.52 | 0.0229 | 0.0212 |
| GO:0008227 | G-protein coupled amine receptor activit. | 67 | 58 | 39.86 | 0.0216 | 0.0315 |

  19 in total

1.  A statistical framework for Illumina DNA methylation arrays.

Authors:  Pei Fen Kuan; Sijian Wang; Xin Zhou; Haitao Chu
Journal:  Bioinformatics       Date:  2010-09-29       Impact factor: 6.937

2.  Propensity score estimation with boosted regression for evaluating causal effects in observational studies.

Authors:  Daniel F McCaffrey; Greg Ridgeway; Andrew R Morral
Journal:  Psychol Methods       Date:  2004-12

3.  Matching methods for observational microarray studies.

Authors:  Ruth Heller; Elisabetta Manduchi; Dylan S Small
Journal:  Bioinformatics       Date:  2008-12-19       Impact factor: 6.937

Review 4.  Analysing and interpreting DNA methylation data.

Authors:  Christoph Bock
Journal:  Nat Rev Genet       Date:  2012-10       Impact factor: 53.242

5.  Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis.

Authors:  Pan Du; Xiao Zhang; Chiang-Ching Huang; Nadereh Jafari; Warren A Kibbe; Lifang Hou; Simon M Lin
Journal:  BMC Bioinformatics       Date:  2010-11-30       Impact factor: 3.169

6.  Statistical methods of translating microarray data into clinically relevant diagnostic information in colorectal cancer.

Authors:  Byung Soo Kim; Inyoung Kim; Sunho Lee; Sangcheol Kim; Sun Young Rha; Hyun Cheol Chung
Journal:  Bioinformatics       Date:  2004-09-16       Impact factor: 6.937

7.  ERK1/2 mediates lung adenocarcinoma cell proliferation and autophagy induced by apelin-13.

Authors:  Li Yang; Tao Su; Deguan Lv; Feng Xie; Wei Liu; Jiangang Cao; Irshad Ali Sheikh; Xuping Qin; Lanfang Li; Linxi Chen
Journal:  Acta Biochim Biophys Sin (Shanghai)       Date:  2013-12-29       Impact factor: 3.848

8.  Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies.

Authors:  Peter C Austin
Journal:  Pharm Stat       Date:  2011 Mar-Apr       Impact factor: 1.894

9.  Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples.

Authors:  Peter C Austin
Journal:  Stat Med       Date:  2011-02-21       Impact factor: 2.373

10.  Genome-scale analysis of DNA methylation in lung adenocarcinoma and integration with mRNA expression.

Authors:  Suhaida A Selamat; Brian S Chung; Luc Girard; Wei Zhang; Ying Zhang; Mihaela Campan; Kimberly D Siegmund; Michael N Koss; Jeffrey A Hagen; Wan L Lam; Stephen Lam; Adi F Gazdar; Ite A Laird-Offringa
Journal:  Genome Res       Date:  2012-05-21       Impact factor: 9.043
