Hongying Dai, Guodong Wu, Michael Wu, Degui Zhi.
Abstract
Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single marker-single trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency, [Formula: see text], compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e. ε → 0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals ([Formula: see text]). The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.
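The second-stage combination can be illustrated with a minimal sketch of the classical Lancaster procedure. It assumes the gene-level p-values are independent, which is a simplification: the abstract describes a correlated version that this sketch does not implement. The function name, the example p-values, and the choice of SNP counts as weights are hypothetical.

```python
import numpy as np
from scipy import stats


def lancaster_combine(pvalues, weights):
    """Combine independent p-values with Lancaster's procedure.

    Each p-value p_i is mapped to the upper-tail chi-square quantile with
    w_i degrees of freedom. Under the null, and assuming independence, the
    sum follows a chi-square distribution with sum(w_i) degrees of freedom.
    """
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    if p.shape != w.shape:
        raise ValueError("pvalues and weights must have the same shape")
    # isf(p, df) is the upper-tail quantile, equivalent to ppf(1 - p, df).
    statistic = stats.chi2.isf(p, df=w).sum()
    combined_p = stats.chi2.sf(statistic, df=w.sum())
    return statistic, combined_p


# Hypothetical gene-level p-values (e.g. from SKAT) and hypothetical weights.
gene_p = [0.01, 0.20, 0.03, 0.75]
gene_weights = [12, 3, 8, 5]
stat, pathway_p = lancaster_combine(gene_p, gene_weights)
```

When every weight equals 2, the statistic reduces to Fisher's combined statistic, -2 Σ log p_i, so Fisher's method is a special case of this procedure.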
Year: 2016 PMID: 27380176 PMCID: PMC4933358 DOI: 10.1371/journal.pone.0152667
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Simulation scenarios and parameters*.

| Simulation Scenario 1 |
|---|
| Include 353 pathways and 3304 genes. |
| Phenotype is normally distributed. |
| Assume heritability is 20%. |
| No pathways, genes, or variants are associated with the trait. |
| Significance level is 0.05/353. All significant results are considered type I errors. |

| Simulation Scenario 2 |
|---|
| Include 353 pathways and 3304 genes. |
| Phenotype is normally distributed. |
| Randomly assign one central causal pathway. Within the central causal pathway, randomly assign 50% causal genes. Randomly assign 70% causal variants in associated genes. |
| Randomly assign 80% (20%) of causal genes to be detrimental (protective). For variants within the causal genes, 80% are detrimental and 20% are protective. |
| Associated variants' effect size ~ log10( |
| Significance level is 0.05/353. |

| Simulation Scenario 3 |
|---|
| Include 353 pathways and 3304 genes. |
| Phenotype is normally distributed. |
| Assume heritability is 20%. |
| Randomly assign one central causal pathway. Within the central causal pathway, randomly assign 50% causal genes. Randomly assign 70% causal variants in associated genes. |
| Randomly assign 80% (20%) of causal genes to be detrimental (protective). For variants within the causal genes, 80% are detrimental and 20% are protective. |
| Associated variants' effect size ~ |
| Significance level is 0.05/353. |

* Covariates: top 3 principal components for population stratification are included as covariates in all three simulation scenarios.
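The significance level 0.05/353 used in all three scenarios is a Bonferroni correction over the 353 pathways, roughly 1.42 × 10⁻⁴. The sketch below computes that threshold and an empirical type I error rate, defined here as the fraction of null p-values falling below it. The function name and the uniform null p-values are illustrative assumptions, not the study's simulation pipeline.

```python
import numpy as np

N_PATHWAYS = 353
ALPHA = 0.05
THRESHOLD = ALPHA / N_PATHWAYS  # Bonferroni-corrected significance level


def empirical_type1_error(null_pvalues, threshold=THRESHOLD):
    """Fraction of p-values generated under the null that fall below threshold."""
    p = np.asarray(null_pvalues, dtype=float)
    return float(np.mean(p < threshold))


# Illustrative stand-in for Scenario 1: p-values from a well-calibrated test
# are uniform on (0, 1) under the null.
rng = np.random.default_rng(0)
simulated_null_p = rng.uniform(size=1_000_000)
rate = empirical_type1_error(simulated_null_p)
```

For a well-calibrated test, `rate` should land near `THRESHOLD`.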
Comparison of type I error and power among competing methods.

| Test (Simulation Scenario 1) | Type I error (×10⁻⁴) | Inflation factor |
|---|---|---|
| SKAT-Lancaster | 1.1615 | 0.9921 |
| SKAT-Lancaster | 1.3598 | 0.9852 |
| SKAT-Lancaster | 0.9632 | 0.9477 |
| SKAT-Lancaster | 1.1331 | 0.9770 |
| GSEA | 12.0000 | 1.2390 |

| Test (Simulation Scenario 2) | Stringent power | Lenient power |
|---|---|---|
| SKAT-Lancaster | 0.870 | 0.884 |
| SKAT-Lancaster | 0.810 | 0.836 |
| SKAT-Lancaster | 0.832 | 0.854 |
| SKAT-Lancaster | 0.809 | 0.826 |
| GSEA | 0.279 | 0.373 |

| Test (Simulation Scenario 3) | Stringent power | Lenient power |
|---|---|---|
| SKAT-Lancaster | 0.610 | 0.645 |
| SKAT-Lancaster | 0.509 | 0.543 |
| SKAT-Lancaster | 0.585 | 0.628 |
| SKAT-Lancaster | 0.540 | 0.558 |
| GSEA | 0.468 | 0.505 |

Fig 1. Q-Q plots investigating global null hypothesis type I errors for the SKAT-Lancaster procedure under Simulation Scenario 1 (λ is the inflation factor for the type I error rate).
The type I error inflation factor (λ) is the ratio between the area under the curve and the area under the diagonal reference line.
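The area-ratio definition of λ can be sketched as follows. The source does not state the axis scale or the plotting positions, so this sketch assumes a Q-Q plot of -log10 observed p-values against -log10 expected quantiles, with expected quantiles (i - 0.5)/n. The function name is hypothetical.

```python
import numpy as np


def qq_area_inflation(pvalues):
    """Ratio of the area under the Q-Q curve to the area under the diagonal.

    Assumes a -log10 scale on both axes and plotting positions (i - 0.5) / n;
    all p-values must be strictly positive.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)
    # Ascending p gives descending -log10 values, so reverse both arrays
    # to integrate along an increasing x-axis.
    x = expected[::-1]
    y = observed[::-1]
    dx = np.diff(x)
    area_curve = np.sum(dx * (y[1:] + y[:-1]) / 2.0)  # trapezoid rule
    area_diagonal = np.sum(dx * (x[1:] + x[:-1]) / 2.0)
    return area_curve / area_diagonal


# Perfectly calibrated p-values sit on the diagonal, so the ratio is 1.
calibrated = (np.arange(1, 101) - 0.5) / 100
lam = qq_area_inflation(calibrated)
```

Under this definition, a value below 1 indicates a conservative test and a value above 1 indicates inflation.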
Fig 2. Venn diagrams for significant pathways (FDR < 0.05).
Number of significant pathways.
| FDR < 0.05 | HDL | LDL | TC | TG |
|---|---|---|---|---|
| Lancaster (w1) | 117 | 79 | 150 | 93 |
| Lancaster (w2) | 91 | 55 | 129 | 72 |
| Lancaster (w3) | 0 | 0 | 0 | 0 |
| Lancaster (w4) | 77 | 44 | 115 | 60 |
| Fisher | 77 | 44 | 115 | 60 |
| Weighted Z-test (w1) | 2 | 1 | 3 | 2 |
| Weighted Z-test (w2) | 5 | 0 | 5 | 4 |
| Weighted Z-test (w3) | 0 | 0 | 0 | 0 |
| Weighted Z-test (w4) | 4 | 0 | 4 | 6 |
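Counts like those in the table above come from thresholding FDR-adjusted p-values at 0.05. The source does not name the adjustment procedure; the sketch below assumes the Benjamini-Hochberg step-up adjustment, and the pathway p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


# Hypothetical pathway-level p-values for one trait.
pathway_p = np.array([0.0001, 0.004, 0.04, 0.2, 0.8])
adjusted_p = benjamini_hochberg(pathway_p)
n_significant = int((adjusted_p < 0.05).sum())
```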
Comparison of pathway analysis p-values.
| Pathway Name | enzyme binding | lipid transport | lipoprotein metabolic process |
|---|---|---|---|
| GO Accession | GO:0019899 | GO:0006869 | GO:0042157 |
| Gene Ontology | molecular function | biological process | biological process |
| Description | Interacting selectively with any enzyme | The directed movement of lipids into, out of, within or between cells. Lipids are compounds soluble in an organic solvent but not, or sparingly, in an aqueous solvent. | The chemical reactions and pathways involving any conjugated, water-soluble protein in which the nonprotein moiety consists of a lipid or lipids. |
| number of genes | 178 | 28 | 33 |
| number of SNPs | 12089 | 1058 | 769 |
| (Willer 2013) | 0.038 | 0.0016 | 0.00017 |
| Lancaster (w1) | <10−5 | <10−5 | <10−5 |
| Lancaster (w2) | 0.80 | <10−5 | <10−5 |
| Lancaster (w3) | 0.98 | 0.44 | 0.44 |
| Lancaster (w4) | 0.82 | <10−5 | <10−5 |
| Fisher | 0.82 | <10−5 | <10−5 |
| Weighted z (w1) | 0.90 | 0.36 | 0.58 |
| Weighted z (w2) | 0.94 | 0.21 | 0.35 |
| Weighted z (w3) | 0.99 | 0.65 | 0.55 |
| Weighted z (w4) | 0.94 | 0.23 | 0.36 |
* FDR-adjusted p-values
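For comparison with the Lancaster rows, the weighted Z-test can be sketched in its standard (Stouffer-type) form, which combines one-sided z-scores with weights. This assumes independent p-values; the weights w1 to w4 used in the tables are not defined in this excerpt, so the weights below are hypothetical.

```python
import numpy as np
from scipy import stats


def weighted_z_combine(pvalues, weights):
    """Weighted Z-test: combine one-sided p-values via weighted z-scores."""
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = stats.norm.isf(p)  # upper-tail z-score for each p-value
    z_combined = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    return z_combined, stats.norm.sf(z_combined)


# Hypothetical gene-level p-values and weights.
z_stat, z_p = weighted_z_combine([0.01, 0.20, 0.03, 0.75], [12, 3, 8, 5])
```

Because z-scores from large p-values are negative, a few null genes with heavy weights can cancel strong signals in this statistic, whereas the chi-square terms in the Lancaster statistic are never negative.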