Andre J Faure, Cathal Seoighe, Nicola J Mulder.
Abstract
BACKGROUND: In order to interpret the results obtained from a microarray experiment, researchers often shift focus from analysis of individual differentially expressed genes to analyses of sets of genes. These gene-set analysis (GSA) methods use previously accumulated biological knowledge to group genes into sets and then aim to rank these gene sets in a way that reflects their relative importance in the experimental situation in question. We suspect that the presence of paralogs affects the ability of GSA methods to accurately identify the most important sets of genes for subsequent research.
Year: 2011 PMID: 21261946 PMCID: PMC3037853 DOI: 10.1186/1471-2105-12-29
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Paralog expression correlation. Mean expression correlation (Spearman's ρ) of gene paralogs in Arabidopsis at various protein sequence identity levels where %ID > 20. Error bars indicate the standard error of the estimated mean values. The values in parentheses indicate the number of unique pairwise gene comparisons in each case.
Figure 2. Indygene 'Tool' page. Indygene 'Tool' page showing the form used to submit a gene list for processing.
Figure 3. Comparison of stable set sizes obtained using three different algorithms for the MSSP. Graph order before and after the application of three greedy algorithms for the MSSP to random Arabidopsis gene graphs of differing sizes. We indicate the stable set size range over 10 replicates in each case. The ordinate shows the number of genes by which the stable set size exceeds the lower bound given by the Caro-Wei theorem.
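The Caro-Wei theorem guarantees that any graph contains a stable set of size at least the sum of 1/(d(v) + 1) over all vertices v, where d(v) is the degree of v. The following is a minimal sketch of that bound, assuming the paralogy graph is stored as a dictionary mapping each gene to the set of its paralogs. The function name `caro_wei_bound` and the toy graph are illustrative assumptions, not taken from the paper, and whether the authors rounded the bound before plotting is not stated.

```python
from fractions import Fraction


def caro_wei_bound(graph):
    """Caro-Wei lower bound on the maximum stable set size.

    `graph` maps each vertex to the set of its neighbours (assumed
    representation). The exact rational value is returned unrounded.
    """
    return sum(Fraction(1, len(neighbours) + 1) for neighbours in graph.values())


# Hypothetical toy graph: a hub gene with four paralogs, plus one separate pair.
toy_graph = {
    "hub": {"l1", "l2", "l3", "l4"},
    "l1": {"hub"},
    "l2": {"hub"},
    "l3": {"hub"},
    "l4": {"hub"},
    "x": {"y"},
    "y": {"x"},
}

bound = caro_wei_bound(toy_graph)  # 1/5 + 4 * 1/2 + 2 * 1/2
```

Under these assumptions, the quantity on the ordinate would be the size of the stable set returned by an algorithm minus this bound.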
Figure 4. Comparison of computation times of three different algorithms for the MSSP. Mean computation times of three greedy algorithms for the MSSP applied to random Arabidopsis gene graphs of differing orders. We indicate the mean calculation time over 10 replicates in each case.
Figure 5. Removing paralogs leads to significantly different GSA results. Estimated null distribution for τ used to determine whether the paralog-reduced dataset (red vertical line at τ = 0.65) produces significantly different GSA results. The abscissa gives Kendall's correlation (τ) between the ranked GO term lists before and after randomly reducing the dataset by 6126 genes. The black line indicates the approximate probability density function of the null distribution, estimated using a Gaussian smoothing kernel.
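The resampling idea behind this null distribution can be sketched as follows: repeatedly remove the same number of genes at random, rerun the gene-set analysis, and record Kendall's τ against the original ranking. This is only an illustration under stated assumptions. `toy_gsa_scores` is a hypothetical stand-in for a real GSA tool, `kendall_tau_a` is the simple tau-a variant (the caption does not say which variant was used), and the one-sided lower-tail P-value is likewise an assumption. The Gaussian kernel smoothing used for the plotted density is omitted.

```python
import random
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a; tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def toy_gsa_scores(genes, gene_sets):
    """Hypothetical GSA stand-in: score each set by how many listed genes it holds."""
    present = set(genes)
    return [len(present & members) for members in gene_sets.values()]


def null_taus(genes, gene_sets, n_remove, n_reps, seed=0):
    """Tau between original scores and scores after random removal of n_remove genes."""
    rng = random.Random(seed)
    baseline = toy_gsa_scores(genes, gene_sets)
    taus = []
    for _ in range(n_reps):
        kept = rng.sample(genes, len(genes) - n_remove)
        taus.append(kendall_tau_a(baseline, toy_gsa_scores(kept, gene_sets)))
    return taus


def lower_tail_p(observed, null):
    """Empirical P-value: share of null taus at or below the observed tau."""
    return (1 + sum(t <= observed for t in null)) / (1 + len(null))


# Hypothetical data: 30 genes and four overlapping gene sets.
genes = [f"g{i}" for i in range(30)]
gene_sets = {
    "A": set(genes[:10]),
    "B": set(genes[5:20]),
    "C": set(genes[15:30]),
    "D": set(genes[::3]),
}
null = null_taus(genes, gene_sets, n_remove=8, n_reps=50)
```

A τ for the paralog-reduced list that falls in the lower tail of `null` would suggest that removing paralogs changes the ranking more than removing the same number of random genes does.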
GoMiner GSA results
| Original dataset: unique GSA results | Reduced dataset: unique GSA results |
|---|---|
| GO:0006007 - glucose catabolic process | GO:0009225 - nucleotide-sugar metabolic process |
| GO:0009056 - catabolic process | GO:0042632 - cholesterol homeostasis |
| GO:0002504 - antigen processing and presentation of peptide... | GO:0006890 - retrograde vesicle-mediated transport Golgi to ER |
| GO:0019320 - hexose catabolic process | GO:0000059 - protein import into nucleus docking |
| GO:0046365 - monosaccharide catabolic process | GO:0006692 - prostanoid metabolic process |
| GO:0006096 - glycolysis | GO:0006693 - prostaglandin metabolic process |
| GO:0012501 - programmed cell death | GO:0006183 - GTP biosynthetic process |
| GO:0019882 - antigen processing and presentation | GO:0007368 - determination of left right symmetry |
| GO:0006915 - apoptosis | GO:0008543 - fibroblast growth factor receptor signaling pathway |
| GO:0051258 - protein polymerization | GO:0009799 - determination of symmetry |
| GO:0008219 - cell death | GO:0009855 - determination of bilateral symmetry |
| GO:0016265 - death | GO:0030520 - estrogen receptor signaling pathway |
| GO:0031529 - ruffle organization and biogenesis | GO:0046039 - GTP metabolic process |
| GO:0048259 - regulation of receptor mediated endocytosis | |
| GO:0045045 - secretory pathway | |
| GO:0044275 - cellular carbohydrate catabolic process | |
| GO:0006006 - glucose metabolic process | |
| GO:0048193 - Golgi vesicle transport | |
| GO:0030832 - regulation of actin filament length | |
| GO:0007018 - microtubule-based movement | |
| GO:0016052 - carbohydrate catabolic process | |
| GO:0001508 - regulation of action potential | |
| GO:0043067 - regulation of programmed cell death | |
| GO:0006879 - iron ion homeostasis | |
| GO:0042981 - regulation of apoptosis | |
| GO:0007265 - Ras protein signal transduction | |
| GO:0030032 - lamellipodium biogenesis | |
| GO:0009894 - regulation of catabolic process | |
| GO:0032940 - secretion by cell | |
| GO:0007010 - cytoskeleton organization and biogenesis | |
| GO:0040017 - positive regulation of locomotion | |
| GO:0051272 - positive regulation of cell motility | |
| GO:0006402 - mRNA catabolic process | |
| GO:0006471 - protein amino acid ADP-ribosylation | |
| GO:0008064 - regulation of actin polymerization ... | |
| GO:0030036 - actin cytoskeleton organization ... | |
| GO:0048468 - cell development | |
| GO:0005996 - monosaccharide metabolic process | |
| GO:0006996 - organelle organization and biogenesis | |
GoMiner GSA results indicating GO Biological Process terms significantly overrepresented amongst the genes expressed in airway epithelial cells from never-smokers. Only those terms exclusive to the results obtained from either the original or paralog-reduced list are shown. A P-value cut-off of α = 0.05 was used to determine significance.
GSEA GSA results
| Original dataset: unique GSA results | Reduced dataset: unique GSA results |
|---|---|
| [1] Lymphoblast cell lines: | [1] Lymphoblast cell lines: |
| | |
| TGFBETA_C2_UP | CROONQUIST_IL6_STROMA_UP |
| | |
| CHESLER_HIGHEST_FOLD_RANGE_GENES | |
| BHATTACHARYA_ESC_UP | |
| | |
| P53HYPOXIAPATHWAY | |
| HSP27PATHWAY | |
| MMS_HUMAN_LYMPH_HIGH_24HRS_UP | |
| P53PATHWAY | |
| KANNAN_P53_UP | |
| P53_BRCA_UP | |
| RADIATION_SENSITIVITY | |
| | |
| chr13q14 | |
| | |
| TGFBETA_C1_UP | CANCER_UNDIFFERENTIATED_META_UP |
| HDACI_COLON_TSA_DN | |
| MARSHALL_SPLEEN_BAL | |
| TRNASYNTHETASES | |
| EGF_HDMEC_UP | |
| AMINOACYL-TRNA_BIOSYNTHESIS | |
| ZELLER_MYC_UP | |
| HDACI_COLON_BUT16HRS_DN | |
| ZHAN_MULTIPLE_MYELOMA_SUBCLASSES_DIFF | |
| MYC_TARGETS | |
| MENSE_HYPOXIA_UP | |
| SMITH_HTERT_UP | |
| DOX_RESIST_GASTRIC_UP | |
| BASSO_REGULATORY_HUBS | |
| | |
| TGFBETA_C1_UP | |
| HSA00010_GLYCOLYSIS_AND_GLUCONEOGENESIS | |
| GLYCOLYSIS GLUCONEOGENESIS | |
| MENSE_HYPOXIA_UP | |
| VEGFPATHWAY | |
| ROME_INSULIN_2F_UP | |
| INSULIN-SIGNALING | |
| BHATTACHARYA_ESC_UP | |
| VANTVEER_BREAST_OUTCOME_GOOD_VS_POOR_DN | |
| GLYCOLYSIS_AND_GLUCONEOGENESIS | |
| ZUCCHI_EPITHELIAL_DN | |
| HYPOXIA_REVIEW | |
GSEA results of five diverse gene expression datasets showing gene sets significantly enriched in the phenotype indicated. Functional gene sets (MSigDB:C2) were used in all cases, except for the leukaemia dataset where cytogenetic gene sets (MSigDB:C1) were used. Only those sets exclusive to the results obtained from either the original or paralog-reduced list are shown. A threshold of FDR ≤ 0.25 was used to determine significance.
SAM-GS GSA results
| Original dataset: unique GSA results | Reduced dataset: unique GSA results |
|---|---|
| APOPTOSIS | ADIP_VS_FIBRO_DN |
| APOPTOSIS-GENMAPP | BCNU_GLIOMA_MGMT_48HRS_UP |
| APOPTOSIS_KEGG | BRCA1_SW480_DN |
| CELLCYCLEPATHWAY | BREAST_CANCER_ESTROGEN_SIGNALING |
| CHEMICALPATHWAY | DAC_PANC50_UP |
| FSH_HUMAN_GRANULOSA_UP | DNA_DAMAGE_SIGNALING |
| G1PATHWAY | DRUG_RESISTANCE_AND_METABOLISM |
| HSA05219_BLADDER_CANCER | G2PATHWAY |
| HSP27PATHWAY | HSA05040_HUNTINGTONS_DISEASE |
| IL4PATHWAY | OXSTRES_BREASTCA_UP |
| P53_BRCA1_UP | PARP_KO_UP |
| RACCYCDPATHWAY | PASSERINI_APOPTOSIS |
| SA_FAS_SIGNALING | SA_DIACYLGLYCEROL_SIGNALING |
| | SHEPARD_NEG_REG_OF_CELL_PROLIFERATION |
SAM-GS results indicating functional gene sets (MSigDB:C2) significantly enriched in the expression patterns of NCI-60 cancer cell lines with wild-type p53, compared to those of p53 mutants. Only those terms exclusive to the results obtained from either the original or paralog-reduced list are shown. A threshold of FDR ≤ 0.001 was used to determine significance.
Figure 6. Comparison of three greedy algorithms for the MSSP using a toy example. A simulated graph G representing the paralogy relationships between 14 genes serves as input to each algorithm considered: GRAND, GMAX, GMIN. The final stable set of genes as well as the resulting graphs after two initial iterations are shown. Each iteration of GRAND consists of the removal of a random vertex (gene) whereas GMAX removes a vertex of maximum degree. This is repeated until no edges remain and the resulting set of genes is stable. GMIN selects a vertex of minimum degree to retain during each iteration and all adjacent vertices are removed. The process is repeated until G becomes empty and the retained vertices form a stable set.
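The three greedy procedures described in this caption can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. It assumes the graph is a dictionary mapping each gene to the set of its paralogs, breaks degree ties by alphabetical order, and lets GRAND draw only from vertices that still have a neighbour (the caption says only "a random vertex"). The toy graph is hypothetical and is not the 14-gene graph from the figure.

```python
import random


def _copy(graph):
    return {v: set(neighbours) for v, neighbours in graph.items()}


def _remove(graph, v):
    """Delete vertex v and every edge that touches it."""
    for u in graph.pop(v):
        graph[u].discard(v)


def grand(graph, seed=0):
    """GRAND: remove random vertices until no edges remain."""
    rng = random.Random(seed)
    g = _copy(graph)
    while True:
        candidates = sorted(v for v in g if g[v])  # assumption: skip isolated vertices
        if not candidates:
            return set(g)
        _remove(g, rng.choice(candidates))


def gmax(graph):
    """GMAX: remove a vertex of maximum degree until no edges remain."""
    g = _copy(graph)
    while any(g.values()):
        _remove(g, max(sorted(g), key=lambda v: len(g[v])))
    return set(g)


def gmin(graph):
    """GMIN: retain a vertex of minimum degree, drop its neighbours, repeat until empty."""
    g = _copy(graph)
    retained = set()
    while g:
        v = min(sorted(g), key=lambda u: len(g[u]))
        retained.add(v)
        for u in list(g[v]) + [v]:
            _remove(g, u)
    return retained


def is_stable(graph, vertices):
    """True if no two vertices in the set are adjacent in the original graph."""
    return all(not (graph[v] & vertices) for v in vertices)


# Hypothetical toy graph: a hub gene with four paralogs, plus one separate pair.
toy_graph = {
    "hub": {"l1", "l2", "l3", "l4"},
    "l1": {"hub"},
    "l2": {"hub"},
    "l3": {"hub"},
    "l4": {"hub"},
    "x": {"y"},
    "y": {"x"},
}

results = {name: fn(toy_graph) for name, fn in [("GRAND", grand), ("GMAX", gmax), ("GMIN", gmin)]}
```

On this toy graph both GMAX and GMIN keep the four leaf genes and one gene from the separate pair, while the GRAND result depends on the random seed.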