Nyasha Chambwe, Rosalyn W Sayaman, Donglei Hu, Scott Huntsman, Anab Kemal, Samantha Caesar-Johnson, Jean C Zenklusen, Elad Ziv, Rameen Beroukhim, Andrew D Cherniack.
Abstract
Differential mRNA expression between ancestry groups can be explained by both genetic and environmental factors. We outline a computational workflow to determine the extent to which germline genetic variation explains cancer-specific molecular differences across ancestry groups. Using multi-omics datasets from The Cancer Genome Atlas (TCGA), we enumerate ancestry-informative markers colocalized with cancer-type-specific expression quantitative trait loci (e-QTLs) at ancestry-associated genes. This approach is generalizable to other settings with paired germline genotyping and mRNA expression data for a multi-ethnic cohort. For complete details on the use and execution of this protocol, please refer to Carrot-Zhang et al. (2020), Robertson et al. (2021), and Sayaman et al. (2021).
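The colocalization step described in the abstract can be pictured as a filtered join. The sketch below is a minimal illustration and not the authors' scripts: it assumes ancestry-associated SNPs, eQTL records, and ancestry-associated genes are already available as plain Python collections, and every name in it (`colocalized_markers`, the argument names, the record layout) is hypothetical.

```python
def colocalized_markers(ancestry_snps, eqtls, ancestry_genes):
    """Enumerate ancestry-informative SNPs that are also eQTLs at ancestry-associated genes.

    ancestry_snps  -- set of SNP IDs associated with ancestry (hypothetical input)
    eqtls          -- iterable of (cancer_type, snp_id, gene) eQTL records
    ancestry_genes -- dict mapping cancer_type to the set of ancestry-associated genes
    Returns a dict mapping cancer_type to a sorted list of (snp_id, gene) pairs.
    """
    hits = {}
    for cancer_type, snp_id, gene in eqtls:
        genes = ancestry_genes.get(cancer_type, set())
        # Keep the eQTL only if both the SNP and its target gene are ancestry-associated.
        if snp_id in ancestry_snps and gene in genes:
            hits.setdefault(cancer_type, []).append((snp_id, gene))
    return {ct: sorted(pairs) for ct, pairs in hits.items()}
```

Keeping the join per cancer type mirrors the abstract's emphasis on cancer-type-specific eQTLs; a pooled analysis would simply drop the `cancer_type` key.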
Keywords: Bioinformatics; Cancer; Computer sciences; Gene Expression; Genomics; RNAseq; Sequence analysis
Year: 2022 PMID: 35942349 PMCID: PMC9356164 DOI: 10.1016/j.xpro.2022.101586
Source DB: PubMed Journal: STAR Protoc ISSN: 2666-1667
Figure 1. Schematic overview of genotype quality control workflow
Stepwise description of pre-processing steps taken to generate clean quality-controlled germline genotyping data for stranding and imputation. The protocol requires specific calculations to be performed (yellow), and steps to filter SNPs (green) or individuals (blue).
(A) Histogram of the X chromosome homozygosity estimate (XHE) inbreeding F coefficient. F coefficient thresholds at 0.2 and 0.8 are shown.
(B) Heterozygosity rate vs. log10 of the proportion of missing genotypes per ancestry group. Thresholds for the proportion of missing genotypes at log10(0.05) and for mean heterozygosity ± 3 standard deviations per ancestry group are shown.
(C) Empirical cumulative distribution function of the HWE log10 p-value for the European ancestry group. The HWE threshold at p = 10⁻⁶ is shown.
(D) Histogram of log10 MAF. MAF threshold at log10(0.005) is shown.
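The thresholds named in panels (B) to (D) can be applied as simple filters. The following is a minimal sketch under stated assumptions, not the protocol's own code: the captions give the threshold values but not the direction or inclusiveness of each filter, so the code assumes the conventional choices (drop individuals above the missingness threshold or outside the heterozygosity band, drop SNPs with HWE p below the threshold or MAF below the threshold). Function names and the record layout are hypothetical.

```python
from statistics import mean, stdev

# Threshold values taken from the panel captions; filter directions are assumed.
MAX_MISSING = 0.05   # proportion of missing genotypes per individual
HET_SD = 3           # allowed deviation from the group mean heterozygosity, in SDs
HWE_P = 1e-6         # Hardy-Weinberg equilibrium p-value threshold
MIN_MAF = 0.005      # minor allele frequency threshold


def flag_samples(samples):
    """Return IDs of individuals failing the missingness or heterozygosity check.

    `samples` is a list of dicts with hypothetical keys:
    id, ancestry, missing (proportion), het (heterozygosity rate).
    Heterozygosity bounds are computed separately for each ancestry group.
    """
    by_group = {}
    for s in samples:
        by_group.setdefault(s["ancestry"], []).append(s["het"])

    bounds = {}
    for group, hets in by_group.items():
        m = mean(hets)
        sd = stdev(hets) if len(hets) > 1 else 0.0
        bounds[group] = (m - HET_SD * sd, m + HET_SD * sd)

    failed = []
    for s in samples:
        lo, hi = bounds[s["ancestry"]]
        if s["missing"] > MAX_MISSING or not (lo <= s["het"] <= hi):
            failed.append(s["id"])
    return failed


def keep_snp(hwe_p, maf):
    """Keep a SNP only if it passes both the HWE and the MAF threshold."""
    return hwe_p >= HWE_P and maf >= MIN_MAF
```

Computing the heterozygosity band within each ancestry group, rather than across the whole cohort, follows the caption's "per ancestry group" wording; pooled bounds would tend to flag entire groups whose baseline heterozygosity differs.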
Figure 2. Expected distributions of imputation R² and MAF values
(A) Schematic of the number of SNPs (i) originally downloaded, (ii) after QC, (iii) after imputation, and (iv) after imputation QC.
(B) Hexagonal heatmap of 2D bin counts of the number of SNPs post-imputation, showing the distribution of SNP HRC imputation R² (x-axis) against the log10 minor allele frequency (MAF) values across all autosomal chromosomes (y-axis).
(C) Table showing the number and percent of SNPs below and above the suggested threshold levels of R² ≥ 0.5 and MAF ≥ 0.005.
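A table like the one in panel (C) can be produced by classifying each imputed SNP against the two suggested thresholds. This is an illustrative sketch only: it assumes per-SNP R² and MAF values have already been extracted into `(r2, maf)` tuples, and the function names are hypothetical.

```python
from collections import Counter

MIN_R2 = 0.5     # suggested imputation R² threshold from the caption
MIN_MAF = 0.005  # suggested MAF threshold from the caption


def tabulate(snps):
    """Count SNPs by (passes R² threshold, passes MAF threshold).

    `snps` is an iterable of (r2, maf) tuples.
    """
    return Counter((r2 >= MIN_R2, maf >= MIN_MAF) for r2, maf in snps)


def percent_passing(snps):
    """Percent of SNPs meeting both thresholds (0.0 for an empty input)."""
    snps = list(snps)
    if not snps:
        return 0.0
    return 100.0 * tabulate(snps)[(True, True)] / len(snps)
```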
Figure 3. 1000 Genomes allele frequency distributions for ancestry-associated SNPs in European and African populations
(A) Delta minor allele frequency (dMAF) distributions for imputed SNPs in the 1000 Genomes reference populations, grouped by ancestry association status as determined by logistic regression. Kolmogorov-Smirnov (KS) test p < 0.05.
(B) Scatterplot depicting density of allele frequencies for all imputed SNPs that passed QC and were tested for ancestry association in the African (x-axis) and European (y-axis) populations.
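The comparison in panel (A) can be sketched as follows. The caption does not define dMAF, so this example assumes it is the absolute difference between the African and European allele frequencies. The two-sample KS statistic is computed by hand to keep the sketch dependency-free; a p-value would normally come from a statistics library such as `scipy.stats.ks_2samp`. The function names are hypothetical.

```python
from bisect import bisect_right


def dmaf(af_afr, af_eur):
    """Assumed definition: absolute allele frequency difference between populations."""
    return abs(af_afr - af_eur)


def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The supremum of |F_x - F_y| is attained at one of the observed values.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d
```

In use, one would compute `dmaf` for every tested SNP, split the values by ancestry association status, and pass the two lists to `ks_statistic`.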
Figure 4. Curation of TCGA sex assignments
(A–I) Histograms of the XHE inbreeding F coefficient, faceted by imputed genotyped sex and self-reported sex. The number of individuals within each category is annotated. (Note: y-axes are scaled within each category for readability.)
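The faceting in this figure amounts to a cross-tabulation of genotype-imputed sex against self-reported sex. The sketch below assumes the 0.2 and 0.8 F coefficient thresholds are used the way PLINK's `--check-sex` uses them by default (F below 0.2 called female, F above 0.8 called male, anything in between left ambiguous). The function names and category labels are hypothetical.

```python
from collections import Counter


def impute_sex(f_coeff, female_max=0.2, male_min=0.8):
    """Call sex from the X chromosome inbreeding F coefficient (assumed rule)."""
    if f_coeff < female_max:
        return "female"
    if f_coeff > male_min:
        return "male"
    return "ambiguous"


def cross_tabulate(records):
    """Count individuals by (imputed sex, self-reported sex).

    `records` is an iterable of (f_coeff, reported_sex) tuples.
    """
    return Counter((impute_sex(f), reported) for f, reported in records)
```

Off-diagonal cells, such as an imputed male with a self-reported female label, are the individuals a curation step would need to review.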
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| TCGA Pan-cancer Atlas normalized mRNA expression | NCI Genomic Data Commons (GDC), Pancan Atlas Portal | |
| PancanQTL - Pan-cancer eQTL Database | ( | |
| TCGA Pan-cancer Atlas Ancestry Calls | Table S1 from ( | |
| TCGA mRNA associations with ancestry | Table S4 from ( | |
| TCGA Germline Whitelisted Samples | Table S1 from ( | |
| TCGA Germline Data - Affymetrix Genome-wide SNP 6.0 array | Genomic Data Commons Legacy Archive | ( |
| TCGA QC’ed and HRC-Imputed Data | ( | |
| Haplotype Reference Consortium Reference Dataset | Haplotype Reference Consortium ( | |
| Hail v0.2 | N/A | |
| PLINK v1.9 | ( | |
| bcftools 1.9 | ( | |
| McCarthy Group Tools | N/A | |
| Michigan Imputation Server | ( | |
| Eagle v2.3 | ( | |
| Minimac3 | ( | |
| NCI Genomic Data Commons (GDC) Data Transfer Tool | NCI Genomic Data Commons (GDC) | ( |
| Custom scripts | ( | |