Joseph D Barry1, Maud Fagny2, Joseph N Paulson2, Hugo J W L Aerts3, John Platig2, John Quackenbush2. 1. Center for Cancer Computational Biology and Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02215, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, 667 Huntington Avenue, Boston, MA 02115, USA. Electronic address: josephbarry.mail@gmail.com. 2. Center for Cancer Computational Biology and Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02215, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, 667 Huntington Avenue, Boston, MA 02115, USA. 3. Department of Radiology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02215, USA.
Abstract
Genotype-to-phenotype association studies typically use macroscopic physiological measurements or molecular readouts as quantitative traits. There are comparatively few suitable quantitative traits available between cell and tissue length scales, a limitation that hinders our ability to identify variants affecting phenotype at many clinically informative levels. Here we show that quantitative image features, automatically extracted from histopathological imaging data, can be used for image quantitative trait loci (iQTLs) mapping and variant discovery. Using thyroid pathology images, clinical metadata, and genomics data from the Genotype-Tissue Expression (GTEx) project, we establish and validate a quantitative imaging biomarker for immune cell infiltration. A total of 100,215 variants were selected for iQTL profiling and tested for genotype-phenotype associations with our quantitative imaging biomarker. Significant associations were found in HDAC9 and TXNDC5. We validated the TXNDC5 association using GTEx cis-expression QTL data and an independent hypothyroidism dataset from the Electronic Medical Records and Genomics network.
Genotype-to-phenotype association studies typically use macroscopic physiological measurements or molecular readouts as quantitative traits. There are comparatively few suitable quantitative traits available between cell and tissue length scales, a limitation that hinders our ability to identify variants affecting phenotype at many clinically informative levels. Here we show that quantitative image features, automatically extracted from histopathological imaging data, can be used for image quantitative trait loci (iQTLs) mapping and variant discovery. Using thyroid pathology images, clinical metadata, and genomics data from the Genotype-Tissue Expression (GTEx) project, we establish and validate a quantitative imaging biomarker for immune cell infiltration. A total of 100,215 variants were selected for iQTL profiling and tested for genotype-phenotype associations with our quantitative imaging biomarker. Significant associations were found in HDAC9 and TXNDC5. We validated the TXNDC5 association using GTEx cis-expression QTL data and an independent hypothyroidism dataset from the Electronic Medical Records and Genomics network.
Genomics has had tremendous success in relating genetic variants to a variety of molecular phenotypes. Variants associated with gene expression, so-called expression quantitative trait loci (eQTLs), are enriched for regions of active chromatin, influence gene regulation, and are involved in processes that contribute to disease (Pickrell et al., 2010, Westra et al., 2013). At the macroscopic level, genome-wide association studies (GWAS) that use physiological measurements of organ health have also provided insight into the genetic component of human disease. However, given the immense gap between SNP-level variation and tissue and body function, GWAS results concerning human traits can be challenging to interpret, in part due to the high correlations observed between studies of different traits (Bulik-Sullivan et al., 2015). This is especially true for autoimmune diseases, which have many shared genetic risk factors (Gutierrez-Arcelus et al., 2016). One way to increase GWAS variant interpretability and to discover new disease variants will be to place more focus on quantitative traits that lie at intermediate cellular and sub-tissue scales.Histopathology has for decades remained the standard for the diagnosis and grading of many complex diseases. Advances in digital slide imaging have allowed for unprecedented resolution at tissue, cellular, and sub-cellular scales. As such, histopathology could provide the genomics community with a vast resource of quantitative traits for evaluating disease phenotype across a range of mesoscopic scales. Previous histopathological GWASs have used discrete pathology grading schemes as quantitative traits. Although such schemes are highly effective for guiding clinical decisions on the level of individual patients and can in principle identify disease variants in GWASs (Chalasani et al., 2010), they do not scale well to the hundreds or thousands of samples needed owing to inter- and intra-observation bias (Colquhoun et al., 2003, Layfield et al., 2017, Qureshi et al., 2016). Unbiased, automatically extracted, continuous features have been shown to improve the predictive power of survival analyses with pathology data in comparison with discrete grades (Khan et al., 2017, Yu et al., 2016). GWASs using quantitative image features extracted from radiological imaging data have successfully identified COPD-relevant variants using computed tomography (Cho et al., 2015) and variants related to Alzheimer's disease and mild cognitive impairment using magnetic resonance imaging (Shen et al., 2010). Thus, genome association studies that leverage automated imaging analysis methodologies can be highly effective. With quantitative image analysis now routinely applied to digital pathology datasets (Beck et al., 2011, Khan et al., 2017, Popovici et al., 2016, Yu et al., 2016, Yuan et al., 2012), it is clear that unbiased, continuous image features can readily be extracted for histopathological GWAS variant discovery.Nevertheless, little attention has been paid to the use of cellular imaging to obtain quantitative traits for understanding the connection between genotype and cellular phenotypes. Part of the reason for this may be that there are few available datasets for which both genomic profiling and standardized histopathology data are available for any appreciable number of samples. The Genotype-Tissue Expression (GTEx) resource (GTEx Consortium, 2015, Lonsdale et al., 2013) has collected germline genotype data from over 450 autopsy research subjects, as well as standardized histological imaging and RNA sequencing (RNA-Seq) gene expression from approximately 50 different body sites.One of the tissues for which there is substantial histological imaging data is the thyroid. The thyroid plays a central role in the endocrine system, producing thyroid hormones that influence protein synthesis and metabolism. Hashimoto thyroiditis (HT) is an autoimmune disease affecting approximately 5% of the population, in which the thyroid is slowly destroyed, frequently leading to hypothyroidism, particularly in women (Pyzik et al., 2015). Among the GTEx population there are 341 individuals for whom paired thyroid imaging and gene expression data were available; of these, 31 had morphological evidence of HT according to GTEx pathology notes. Leukocyte infiltration is a morphological hallmark of HT and necessary for disease progression. We saw this dataset as an ideal place to test the hypothesis that cellular imaging features could be used to discover immune cell infiltration variants in an image quantitative trait locus (iQTL) analysis.
Results and Discussion
Quantitative Image Analysis
Image data were downloaded from the GTEx histological image archive and convolved with a Gaussian filter to smooth out pixel-level variation on a length scale smaller than that of observed regions of leukocyte invasion in HT samples. Color deconvolution to estimate the H&E channels directly was not performed to avoid the risk of introducing any biases from the deconvolution process that might negatively affect downstream QTL fits. Individual tissue pieces were segmented using adaptive thresholding (Figure 1A), and 117 Haralick image features were extracted for each tissue piece using the Bioconductor image processing package EBImage (Pau et al., 2010) (Transparent Methods). To capture patterns across multiple length scales, we incorporated three Haralick scales that sampled every 1, 10, or 100 pixels. Haralick features are widely used for the quantification of image texture, and correlations between neighboring pixels in an image are taken into account (Haralick, 1979). Given that lymphocyte invasion has a profound effect on histological image texture, we reasoned that Haralick features could be ideal candidates for capturing HT cellular phenotype. After removing overly small tissue pieces, averaged image feature values were calculated across pieces to give a single value for each sample and feature. In preparation for downstream model fitting, features were log2 transformed and centered and rescaled using a Z score to ensure feature comparability across samples (Figure 1B). The mean and SD values used for the Z score were saved for use in later analyses. A fully worked through image analysis example with the applied segmentation parameters is supplied as part of a supplemental R data package (https://github.com/josephbarry/iQTL).
Figure 1
Thyroid Image Processing and Establishment of a Quantitative Imaging Biomarker for Immune Cell Infiltration
(A) Digital pathology slide for thyroid sample GTEX-11NV4. Raw: before image processing. Blurred: post Gaussian convolution. Segmented: tissue piece masks after adaptive thresholding.
(B) Heatmap of 117 log2 transformed and standardized Haralick image features against 341 thyroid samples.
(C) PC1 versus PC2 from a PCA of the image feature matrix. Blue points indicate patients with Hashimoto thyroiditis, as identified from pathology notes. Circles indicate females and triangles males.
(D) ROC showing biomarker performance of PC2 for predicting HT.
(E) Top row, five thyroid samples with highest values of image PC2. Bottom row, local image PC2 signal. Bright regions correspond to high local image PC2. Image brightness has been rescaled to aid visualization.
Thyroid Image Processing and Establishment of a Quantitative Imaging Biomarker for Immune Cell Infiltration(A) Digital pathology slide for thyroid sample GTEX-11NV4. Raw: before image processing. Blurred: post Gaussian convolution. Segmented: tissue piece masks after adaptive thresholding.(B) Heatmap of 117 log2 transformed and standardized Haralick image features against 341 thyroid samples.(C) PC1 versus PC2 from a PCA of the image feature matrix. Blue points indicate patients with Hashimoto thyroiditis, as identified from pathology notes. Circles indicate females and triangles males.(D) ROC showing biomarker performance of PC2 for predicting HT.(E) Top row, five thyroid samples with highest values of image PC2. Bottom row, local image PC2 signal. Bright regions correspond to high local image PC2. Image brightness has been rescaled to aid visualization.To reduce the number of candidate features and to identify the primary directions of variability in our image feature matrix, we performed principal-component analysis (PCA) and found that the first two imaging principal components (PCs) were sufficient to explain 73% of the variance. By inspection of the images, low values of PC1 were found to be associated with interior holes in the thyroid pieces (Figure S1A), an effect we adjudged to be primarily technical in origin. High values of image PC2 were visually strongly associated with the presence of invading leukocytes (Figure S1B).For validation of image PC2 using clinical metadata, we identified all samples for which an HT phenotype was indicated in the comments of the original GTEx pathologists (Figure 1C blue points). We note that the morphological phenotype of HT is not definitive evidence of clinical HT, which requires assessment of thyroid hormone levels, ultrasonographic scanning, and additional clinical findings (Anila et al., 2016). A receiver operating characteristic (ROC) analysis showed that image PC2 was a highly performant biomarker for HT phenotype, with an area under the curve (AUC) of 0.84 (Figure 1D). Using logistic regression, we found that HT phenotype was significantly associated with image PC2 (odds ratio [OR] = 1.3, P = 1.4 × 10−9) after correcting for sex and age (Transparent Methods). A higher association with HT phenotype for women was also observed after correcting for age (OR = 4.2, P = 4 × 10−4), a known clinical characteristic of the disease (Pyzik et al., 2015). That image PC2 is a performant biomarker for HT phenotype likely does not have direct clinical application since a pathologist can easily identify HT morphology by inspection of pathology images; we considered its utility to be for downstream integration with genomic data.Since the fibrous variant of clinical HT accounts for up to 10% of cases (Iannaci et al., 2013) we investigated co-occurrence of HT and fibrotic phenotypes in GTEx pathology notes. We found that 9/31 samples with HT phenotypes also had fibrotic phenotypes, which was significantly enriched compared with the background set of non-HT samples (Fisher’s test; P = 0.002, OR = 4.2). However, samples with both HT and fibrotic phenotypes did not show any clear classification pattern in the image PC1 vs PC2 plot (Figure S2A) that would suggest segregation by disease subtype.Next, we inspected PC2 feature loadings from the image PCA (Table S1). These represented the weighted contribution of each Haralick feature to image PC2. We found that no single feature had a notably higher magnitude of loading than any other, showing that signals from many different Haralick features were required (Figure S2B). To assess the relative contribution of each RGB color channel we summed the absolute value of PC2 principal component loadings for all features extracted from each channel. Proportionately, we found that the red channel features contributed 47%, the green channel 28%, and the blue channel 26% to the image PC2 signal. This was suggestive of eosin contributing the most to the observed signal.To verify that the image PC2 signal was originating from regions of immune infiltrate, we mapped image PC2 signal to tiles of size 250 × 250 pixels using the following procedure. Haralick features were extracted within each tile, log2 transformed, and centered and rescaled using a Z score. Crucially, the mean and SD values used for the Z score were those calculated previously when estimating feature variability across samples. The resulting feature values were then multiplied by their corresponding image PC2 component loading (Figure S2B) and summed together to give the final estimate for local image PC2. Figure 1E shows the local image PC2 signal for the five samples with the highest image PC2 signal in Figure 1C. Bright regions indicate where local image PC2 signal is the highest (Figure 1E bottom row) and can be seen to correspond directly to dark regions of dense immune infiltrate in the original images (Figure 1E top row).
Integrated RNA-Seq and Image Analysis: Imaging Biomarker Validation
Next, we analyzed GTEx gene expression data associated with the 341 thyroid images. After the removal of lowly expressed genes, counts were log2 transformed and quantile normalized to ensure comparability of samples (Transparent Methods). To test for associations between gene expression and image PCs, linear models were used to fit each image PC against normalized gene readouts while correcting for age, sex, sample collection site, tissue autolysis score, and RNA extraction type. To remove bias caused by unknown confounders (Figure S3A) we fitted 20 PEER (Probabilistic Estimation of Expression Residuals) factors (Stegle et al., 2012) to the gene expression data while including the known confounders in the PEER fit; a similar approach was used by the GTEx consortium (Aguet et al., 2017, GTEx Consortium, 2015). The first PEER factor was not included as a covariate in the linear model fits owing to the strength of its correlation with our image PC2 biomarker (Pearson r = 0.53, P < 2.2 × 10−16), which indicated that it likely contained valuable biological information. Although other PEER factors may capture both biological signal and systematic noise, their inclusion as covariates successfully removed the technical bias observed in Figure S2A. The results from fits correcting for the known confounders as well as the 19 remaining PEER factors revealed that image PC2 was systematically and highly associated with gene expression (Figure 2A), suggesting that leukocyte infiltration is at least a major, if not the primary, source of expression variability in GTEx thyroid tissue. In contrast, image PC1 had comparatively few significant associations with thyroid gene expression (Figure 2A).
Figure 2
Integration of Imaging Data with Gene Expression Analyses Verifies that Image PC2 Is Highly Associated with Immune Cell Infiltration
(A) A QQ plot of p values (PG) from the regression analysis of image PC1 and PC2 against thyroid gene expression for 23,993 genes.
(B) Correlation of image PC2 with −log10(PC) from the CIBERSORT analysis for samples with PC<0.5. Blue points are samples for which Hashimoto thyroiditis was indicated in GTEx pathology notes.
(C) Frequencies of immune cell types reported from CIBERSORT for samples with PC<0.1. Cell types with an average frequency of 5% or more are shown.
Integration of Imaging Data with Gene Expression Analyses Verifies that Image PC2 Is Highly Associated with Immune Cell Infiltration(A) A QQ plot of p values (PG) from the regression analysis of image PC1 and PC2 against thyroid gene expression for 23,993 genes.(B) Correlation of image PC2 with −log10(PC) from the CIBERSORT analysis for samples with PC<0.5. Blue points are samples for which Hashimoto thyroiditis was indicated in GTEx pathology notes.(C) Frequencies of immune cell types reported from CIBERSORT for samples with PC<0.1. Cell types with an average frequency of 5% or more are shown.To identify which biological pathways may drive the significant image PC2 associations with gene expression we performed gene set enrichment analysis, choosing as a test set 2,913 genes with −log10(PG) > 10 in the gene expression analysis and all other 21,080 genes as a background set (Transparent Methods). Significant enrichment was observed for a number of gene ontology terms related to immune function and the activation and signaling activity of invading T and B lymphocytes (Table 1).
Table 1
Gene Set Enrichment Analysis
GO Id
Type
Description
Odds
P
Padj
N
GO:0050776
BP
Regulation of immune response
6.10
1.3 × 10−14
5.8 × 10−11
43
GO:0031295
BP
T cell costimulation
5.83
6.0 × 10−11
2.1 × 10−7
32
GO:0042110
BP
T cell activation
5.57
2.6 × 10−8
3.6 × 10−5
24
GO:0042113
BP
B cell activation
8.67
7.0 × 10−8
8.6 × 10−5
17
GO:0045060
BP
Negative thymic T cell selection
25.51
6.7 × 10−7
5.4 × 10−4
10
GO:0050690
BP
Regulation of defense response to virus by virus
6.96
2.1 × 10−6
1.5 × 10−3
15
GO:0042101
CC
T cell receptor complex
17.00
2.5 × 10−6
1.6 × 10−3
10
Significant gene ontology terms for genes highly associated with image PC2 (−log10(PG) > 10, see Figure 2A) in the gene expression regression analysis. All terms with an odds ratio of greater than 5 and an adjusted p value of less than 0.01 are shown. BP, biological process; CC, cellular component.
Gene Set Enrichment AnalysisSignificant gene ontology terms for genes highly associated with image PC2 (−log10(PG) > 10, see Figure 2A) in the gene expression regression analysis. All terms with an odds ratio of greater than 5 and an adjusted p value of less than 0.01 are shown. BP, biological process; CC, cellular component.To confirm the association between image PC2 and the presence of invading immune cells on the level of individual samples, we ran our gene expression data through CIBERSORT, an algorithm designed to deconvolve complex cell mixtures (Newman et al., 2015). The LM22 leukocyte gene signature matrix supplied by the CIBERSORT team was used to perform the deconvolution as it could in principle detect and distinguish between a number of T and B cell types. As expected, CIBERSORT failed to detect immune cells in the majority of healthy thyroid samples. However, in samples in which immune cells were detected (CIBERSORT PC<0.5), the significance of detection correlated significantly with image PC2 (Figure 2B; Spearman ρ = 0.67, P = 3 × 10−7). Figure 2C shows the cell type signature for samples with PC<0.1 and cell types with an average CIBERSORT frequency of more than 5% across samples. This analysis confirmed the presence of T cell CD4 memory resting, T cell CD8, naive and memory B cells, as well as other immune cell types, on the level of individual samples.The CIBERSORT algorithm also detected infiltrating immune cells in several samples for which image PC2 was high but HT phenotype was not indicated by GTEx pathologists (Figure 2B gray points), suggesting that our imaging biomarker may be able to detect a broader spectrum of thyroid autoimmune processes. By inspection, we confirmed that many of these images did show infiltrative phenotypes but did not exhibit the classical HT visual phenotype (for example, GTEX-OXRO, GTEX-144FL). One possible clinical application of our imaging approach would therefore be to automatically flag such samples for rechecking by pathologists.To demonstrate the advantage of using PCA for the identification of a candidate imaging biomarker, as opposed to directly using the individual Haralick image features, we correlated each of the 117 image features with CIBERSORT significance of detection for immune cells, in the same manner as described earlier. Image PC2 had a higher Spearman correlation coefficient than all image features, with the closest having a coefficient of ρ = 0.60 (Figure S2B). Thus, image PC2 combined information from multiple image features to attain a higher overall performance.
iQTL Identification of Variants Associated with Immune Infiltration
Based on our analysis of image features and gene expression, we decided to use image PC2 as a quantitative trait capturing the cellular features of immune cell infiltration in an iQTL analysis. Since we did not have sufficient sample size to achieve high power in a genome-wide iQTL analysis that would include millions of variants, and other suitably large datasets were not available, we decided to reduce the search space by asking a more targeted question of the data.Since leukocyte invasion was associated with a higher prevalence of differentially upregulated genes (Figure 3A) we hypothesized that the genotype of upregulated genes in GTEx samples with an HT phenotype could in part determine the extent of leukocyte infiltration. Therefore, the analysis was further restricted to 100,215 candidate SNPs residing in 1,380 genes that were significantly (−log10(Padj) > 7, Figure S4A) and positively (log2(Fold-Change) >0.5) differentially expressed in 31 HT cases versus 310 controls (Figure 3A blue points) using DESeq2 (Love et al., 2014). We note that HT phenotype, as indicated by GTEx pathologists, was not included in the feature set used to build our imaging biomarker. Therefore, the SNP selection procedure was independent of the biomarker subsequently used as the quantitative trait in the iQTL analysis.
Figure 3
An Image QTL Analysis Finds Associations between Genomic Variants and Our Image PC2 Biomarker for Thyroid Autoimmune Disease
(A) Selection of 1,380 candidate genes (blue points) based on their positive fold-change (log2(FC) > 0.5) and significant differential expression (−log10(Padj) > 7) in GTEx samples with Hashimoto thyroiditis (HT) phenotypes.
(B) A QQ plot showing expected vs observed p values from image QTL fits of 100,215 candidate SNPs residing in the selected genes highlighted blue in panel A.
(C and D) −log10(P) vs genomic coordinates for GTEx iQTLs (top panels) and eMERGE variant association with HT (bottom panels) for all tested SNPs in HDAC9 (C) and TXNDC5 (D). Blue vertical lines indicate the locations of the most significant SNP for each gene after multiple testing correction using the IHW method described in the main text. Plot ranges are mapped to the start and end positions of the genes, as defined by GTEx-consortium transcript data. Horizontal green bars are of length 50 kb.
An Image QTL Analysis Finds Associations between Genomic Variants and Our Image PC2 Biomarker for Thyroid Autoimmune Disease(A) Selection of 1,380 candidate genes (blue points) based on their positive fold-change (log2(FC) > 0.5) and significant differential expression (−log10(Padj) > 7) in GTEx samples with Hashimoto thyroiditis (HT) phenotypes.(B) A QQ plot showing expected vs observed p values from image QTL fits of 100,215 candidate SNPs residing in the selected genes highlighted blue in panel A.(C and D) −log10(P) vs genomic coordinates for GTEx iQTLs (top panels) and eMERGE variant association with HT (bottom panels) for all tested SNPs in HDAC9 (C) and TXNDC5 (D). Blue vertical lines indicate the locations of the most significant SNP for each gene after multiple testing correction using the IHW method described in the main text. Plot ranges are mapped to the start and end positions of the genes, as defined by GTEx-consortium transcript data. Horizontal green bars are of length 50 kb.We then limited our analysis to the 292 samples that had both thyroid imaging and genotype information. The genotype data had already undergone rigorous quality control and filtering by the GTEx consortium, including but not limited to Hardy-Weinberg equilibrium and imputation quality criteria. Owing to sample size considerations, minor allele frequencies (MAFs) of less than 10% were removed, as well as SNPs with a missingness frequency of more than 10%.iQTL linear model fits were implemented using the R package MatrixEQTL (Shabalin, 2012) by treating image PC2 as a pseudo trans gene. Fits were corrected for age, sex, race, ancestry, processing center, and tissue autolysis score (Transparent Methods). A QQ plot comparing the distribution of observed p values to the expected uniform distribution showed that the data were well behaved under the null hypothesis (Figure 3B), with a departure from the dashed line for a group of SNPs with low p values. Independent hypothesis weighting (IHW) (Ignatiadis et al., 2016) using MAF as the independent covariate was used to correct for multiple testing. In a data-driven way, this method assigned different weights to variants based on their MAF (Figure S4B) to maximize the number of null hypothesis rejections while controlling for type I error. A total of 21 SNPs across 3 haplotype blocks were identified as being significant (PIHW<0.05; Table S2). As expected, IHW attained higher power as compared with standard, unweighted Benjamini and Hochberg false discovery rate (FDR) (Table S2).Strikingly, 20 of the 21 significant SNPs resided in Histone Deacetylase 9 (HDAC9), an enzyme linked to epigenetic control of gene transcription and previously proposed to be an epigenetic switch for T-cell-mediated autoimmunity (Yan et al., 2011). Histone deacetylase inhibitors have been effective in the treatment of hypothyroidism in mice (Kim et al., 2014) and autoimmune diseases in general (Tang et al., 2013). A plot of SNP significance along HDAC9 revealed a sharp peak at 18,301 kb on chromosome 7 (Figure 3C). To our knowledge this is the first reported association between HDAC9 variants and thyroid immune cell infiltration. A search of GWAS Central (www.gwascentral.org [Beck et al., 2014]) identified modest associations between ulcerative colitis (UC) and two of our significant iQTL variants (−log10(P) = 2.3 for rs215122, −log10(P) = 2.7 for rs2529749) (Anderson et al., 2011). Although further work will be required to investigate the biological effects of HDAC9 genotype in both UC and HT, a common genetic basis would be consistent with evidence that these conditions present concurrently at higher rates than would be expected in the general population (Shizuma, 2016).One variant in the Thioredoxin Domain Containing 5 gene (TXNDC5), rs11962800, was found to be significant at the level of PIHW<0.05 (Table S2). TXNDC5 expression is induced by hypoxia (Sullivan et al., 2003), and variants in this gene have been associated with a number of autoimmune diseases, including rheumatoid arthritis (Chang et al., 2011) and vitiligo (Jeong et al., 2010). As for HDAC9, variants in TXNDC5 have not previously been associated with immune cell infiltration, again suggesting that iQTLs can uncover clinically relevant associations. It should be noted that rs11962800 was reported as a cis-expression QTL in thyroid tissue by the GTEx consortium (P = 7.7 × 10−21; www.gtexportal.org [Aguet et al., 2017]). Since our analysis shows that TXNDC5 expression is associated with leukocyte invasion in thyroid tissue, it is possible that any expression-based result could be biased by cell-type mixing, as has been highlighted in GTEx lung tissue (McCall et al., 2016). However, rs11962800 was also a reported eQTL in 16 other GTEx tissues (Aguet et al., 2017), with thyroid ranked 3/17 in terms of highest significance. Next, we relaxed the IHW-adjusted p value cutoff for iQTLs to 0.4 and found that the top five iQTLs in TXNDC5 were also eQTLs (Table S3). In addition, 12/31 of the GTEx TXNDC5 eQTLs overlapping with our set of tested variants were iQTLs or in linkage disequilibrium with iQTLs (Figure S5 red points). Together, these results offer the mechanistic hypothesis that genetic variation in TXNDC5 affects HT imaging phenotype directly through cis-regulation of RNA expression levels.Next, we asked how the performance of our imaging biomarker might compare with expression-based biomarkers for immune infiltration variant discovery. We chose the first expression PEER factor (PEER1) as a comparative candidate biomarker based on the earlier observation that it was significantly correlated with image PC2. After verifying that PEER1 was performant for predicting HT phenotype (AUC 0.88, Figure S6A), we ran another QTL analysis using the same 100,215 variants as for the iQTL analysis but with PEER1 as the response variable. A QQ plot revealed that PEER1 performed poorly (Figure S6B), with no significant QTL hits. Therefore, image PC2 was in this case a superior choice of biomarker for variant discovery.
Validation of iQTL Results Using an Independent Dataset
To attempt validation of the observed associations between HDAC9 and TXNDC5 genotype, and thyroid immune cell infiltration in an independent dataset, we obtained genotype and phenotype data and clinical metadata from the Electronic Medical Records and Genomics (eMERGE) network (Gottesman et al., 2013) (Transparent Methods). This dataset has been used previously for hypothyroidism variant discovery (Denny et al., 2011). Criteria for defining HT cases included the assessment of thyroid-stimulating hormone (TSH) and free T4 (FT4) levels (Transparent Methods). Although levels of biochemical markers are not necessarily correlated with morphological grading of HT, immune cell infiltration is pathognomonic of the disease (Anila et al., 2016). An increased prevalence of immune cell infiltration is therefore likely to be present in a large set of HT cases vs controls.We selected 1,261 individuals with chronic autoimmune hypothyroidism (presumptive HT) and 4,457 non-hypothyroidism control subjects (Transparent Methods). As over 96% of the hypothyroidism cases were from the Caucasian cohort, and 250/292 of the GTEx samples used for the iQTL analysis were of European American descent, to simplify the analysis all non-Caucasian cohorts were removed, leaving 1,213 individuals with chronic autoimmune hypothyroidism and 3,789 control subjects. For consistency with the iQTL analysis, SNPs with a minor allele frequency of less than 10% and a missingness frequency of more than 10% were removed. One SNP in HDAC9 was removed owing to deviation from the Hardy-Weinberg equilibrium (Transparent Methods).We used logistic regression to test for associations between hypothyroidism status and HDAC9 and TXNDC5 genotype while correcting for patient age (decade of birth), sex, collection site, and ancestry (Transparent Methods). For both genes, the location of the most significant GTEx iQTL peaks, as determined by their IHW-adjusted p values, coincided closely with the highest eMERGE peaks (Figures 3C and 3D vertical lines). Although a colocalization analysis did not identify a single causative SNP common to both datasets, we noted a high degree of similarity between the association profiles. To quantitatively assess the concordance of the association signatures between the GTEx and eMERGE data, we first selected the SNPs common to both datasets and grouped neighboring SNPs into bins of size 20 SNPs based on their genomic coordinates. The lowest p value in each bin was then selected for both datasets. A correlation test revealed that the association signatures between eMERGE and GTEx data were significantly concordant across both HDAC9 (Figure S7A1; Spearman ρ = 0.59, P < 2.9 × 10−6) and TXNDC5 (Figure S7A2; Spearman ρ = 0.73, P = 9 × 10−4).Next, we checked if linkage disequilibrium (LD) could be artificially driving the observed concordance. SNPs were pruned for LD using PLINK (Transparent Methods), and the binning and correlation analysis was repeated. To account for the reduction in the number of SNPs (HDAC9, 1,071 to 330; TXNDC5, 347 to 88), smaller bin sizes of six and five SNPs were used for HDAC9 and TXNDC5, respectively. A correlation test revealed a more modest but still significant correlation for HDAC9 (Spearman ρ = 0.30, P = 0.026) and a significant correlation for TXNDC5 (Spearman ρ = 0.72, P = 0.001). Thus, HDAC9 concordance was partially biased by LD, whereas TXNDC5 concordance was unaffected.To verify that it was hypothyroidism status driving the concordance, we performed 1,000 permutations of the hypothyroidism case/control labels in the eMERGE dataset and repeated the logistic regression, binning, and correlation analyses between the LD-pruned GTEx and eMERGE data. The resulting permutation p value did not reach significance for HDAC9 (P = 0.127), likely due to the lower nominal p values remaining after the LD pruning. However, the TXNDC5 permutation value was significant (P = 0.006) and demonstrated that it was unlikely to obtain the observed TXNDC5 concordance between eMERGE and GTEx profiles by chance (Figures S7B1 and S7B2).Although we validated our TXNDC5 association using eMERGE hypothyroidism status, there are limitations to the validation approach. First, the analysis identified common regions of association rather than individual causative SNPs. The colocalization analysis was likely hindered by the modest GTEx sample size, and future iQTL studies may require higher numbers of samples to identify causative SNPs. Second, the severity of the HT morphological phenotype and clinical hypothyroidism status are not necessarily correlated (Anila et al., 2016), although immune cell infiltration is a prerequisite for clinical HT. This may in part explain why higher nominal p values were observed in the GTEx data. Another explanation might be that quantitative histological image analysis captures subclinical, as well as clinical, disease phenotype as a smooth function, whereas binary categorizations of disease status do not take into account informative subclinical disease cases or the morphological severity of the disease phenotype.
Outlook
Our intuition, along with over a decade of progress in GWAS, tells us that genotype and phenotype are associated in ways that can be quantitatively assessed. The growth of integrated data resources such as GTEx, and the advancement of digital slide imaging technologies, provides the opportunity to explore that association at the cellular level. This study demonstrates the potential of histopathological image-based QTL profiling for de novo discovery of disease variants, especially when unbiased, continuous, automatically extracted image features are used as quantitative traits. Our results indicate that iQTLs can identify different variants to cases vs controls GWAS, offering a complementary approach that leverages cell-level phenotypes for target discovery. Although we applied iQTLs to a thyroid dataset, the method is completely generalizable to other tissues and disease contexts. With larger sample sizes and study designs tailored for specific diseases, iQTL scans can be extended to millions of variants. Ultimately, digital pathology image analysis approaches could revolutionize genome-wide association studies by providing a wealth of unbiased quantitative traits at the cellular level that have high levels of biological interpretability.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Authors: Andrew H Beck; Ankur R Sangoi; Samuel Leung; Robert J Marinelli; Torsten O Nielsen; Marc J van de Vijver; Robert B West; Matt van de Rijn; Daphne Koller Journal: Sci Transl Med Date: 2011-11-09 Impact factor: 17.956
Authors: Michael H Cho; Peter J Castaldi; Craig P Hersh; Brian D Hobbs; R Graham Barr; Ruth Tal-Singer; Per Bakke; Amund Gulsvik; Raúl San José Estépar; Edwin J R Van Beek; Harvey O Coxson; David A Lynch; George R Washko; Nan M Laird; James D Crapo; Terri H Beaty; Edwin K Silverman Journal: Am J Respir Crit Care Med Date: 2015-09-01 Impact factor: 21.405
Authors: Lester J Layfield; Magda Esebua; Shellaine R Frazier; Richard D Hammer; William W Bivin; Van Nguyen; Ilker Ersoy; Robert L Schmidt Journal: Diagn Cytopathol Date: 2017-02 Impact factor: 1.582
Authors: Joseph K Pickrell; John C Marioni; Athma A Pai; Jacob F Degner; Barbara E Engelhardt; Everlyne Nkadori; Jean-Baptiste Veyrieras; Matthew Stephens; Yoav Gilad; Jonathan K Pritchard Journal: Nature Date: 2010-03-10 Impact factor: 49.962
Authors: Harm-Jan Westra; Marjolein J Peters; Tõnu Esko; Hanieh Yaghootkar; Claudia Schurmann; Johannes Kettunen; Mark W Christiansen; Bruce M Psaty; Samuli Ripatti; Alexander Teumer; Timothy M Frayling; Andres Metspalu; Joyce B J van Meurs; Lude Franke; Benjamin P Fairfax; Katharina Schramm; Joseph E Powell; Alexandra Zhernakova; Daria V Zhernakova; Jan H Veldink; Leonard H Van den Berg; Juha Karjalainen; Sebo Withoff; André G Uitterlinden; Albert Hofman; Fernando Rivadeneira; Peter A C 't Hoen; Eva Reinmaa; Krista Fischer; Mari Nelis; Lili Milani; David Melzer; Luigi Ferrucci; Andrew B Singleton; Dena G Hernandez; Michael A Nalls; Georg Homuth; Matthias Nauck; Dörte Radke; Uwe Völker; Markus Perola; Veikko Salomaa; Jennifer Brody; Astrid Suchy-Dicey; Sina A Gharib; Daniel A Enquobahrie; Thomas Lumley; Grant W Montgomery; Seiko Makino; Holger Prokisch; Christian Herder; Michael Roden; Harald Grallert; Thomas Meitinger; Konstantin Strauch; Yang Li; Ritsert C Jansen; Peter M Visscher; Julian C Knight Journal: Nat Genet Date: 2013-09-08 Impact factor: 38.330
Authors: Tim Beck; Robert K Hastings; Sirisha Gollapudi; Robert C Free; Anthony J Brookes Journal: Eur J Hum Genet Date: 2013-12-04 Impact factor: 4.246