Literature DB >> 29022581

The impact of rare variation on gene expression across tissues.

Xin Li¹, Yungil Kim², Emily K Tsang^1,3, Joe R Davis^1,4, Farhan N Damani², Colby Chiang⁵, Gaelen T Hess⁴, Zachary Zappala^1,4, Benjamin J Strober⁶, Alexandra J Scott⁵, Amy Li⁴, Andrea Ganna^7,8,9, Michael C Bassik⁴, Jason D Merker¹, Ira M Hall^5,10,11, Alexis Battle², Stephen B Montgomery^1,4.

Abstract

Rare genetic variants are abundant in humans and are expected to contribute to individual disease risk. While genetic association studies have successfully identified common genetic variants associated with susceptibility, these studies are not practical for identifying rare variants. Efforts to distinguish pathogenic variants from benign rare variants have leveraged the genetic code to identify deleterious protein-coding alleles, but no analogous code exists for non-coding variants. Therefore, ascertaining which rare variants have phenotypic effects remains a major challenge. Rare non-coding variants have been associated with extreme gene expression in studies using single tissues, but their effects across tissues are unknown. Here we identify gene expression outliers, or individuals showing extreme expression levels for a particular gene, across 44 human tissues by using combined analyses of whole genomes and multi-tissue RNA-sequencing data from the Genotype-Tissue Expression (GTEx) project v6p release. We find that 58% of underexpression and 28% of overexpression outliers have nearby conserved rare variants compared to 8% of non-outliers. Additionally, we developed RIVER (RNA-informed variant effect on regulation), a Bayesian statistical model that incorporates expression data to predict a regulatory effect for rare variants with higher accuracy than models using genomic annotations alone. Overall, we demonstrate that rare variants contribute to large gene expression changes across tissues and provide an integrative method for interpretation of rare variants in individual genomes.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2017 PMID： 29022581 PMCID： PMC5877409 DOI： 10.1038/nature24267

Source DB: PubMed Journal: Nature ISSN： 0028-0836 Impact factor: 49.962

Rare genetic variants are abundant in humans and are expected to contribute to individual disease risk[1-4]. While genetic association studies have successfully identified common genetic variants associated with susceptibility, they are not practical for rare variants[1,5]. Efforts to distinguish pathogenic from benign rare variants have leveraged the genetic code to identify deleterious protein coding alleles[1,6,7], but no analogous code exists for non-coding variants. Thus, ascertaining which rare variants have phenotypic effects remains a major challenge. Rare non-coding variants have been associated with extreme gene expression in single tissue studies[8-11], but their effects across tissues are unknown. Here, through combined analyses of whole genomes and multi-tissue RNA-sequencing data from the Genotype-Tissue Expression (GTEx) Project V6 release[12], we identify gene expression outliers, or individuals showing extreme expression levels for a particular gene, across 44 human tissues. We find that 58% of underexpression and 28% of overexpression outliers have nearby conserved rare variants compared with 8% of non-outliers. Additionally, we developed RIVER, a statistical method including a Bayesian model that incorporates expression data to predict a regulatory role for rare variants with higher accuracy than models using genomic annotations alone. Overall, we demonstrate that rare variants contribute to large gene expression changes across tissues and provide an integrative method for variant interpretation for rare variants in individual genomes. Our analysis focused on individuals with extremely high or low expression of a particular gene compared with the population, using the GTEx v6 release data, which includes RNA-sequencing data for 449 individuals and 44 tissues. We refer to these individuals as gene expression outliers. The GTEx data afford the ability to identify both single-tissue and multi-tissue expression outliers (Fig. 1a), with the latter defined by consistent extreme expression across many tissues (see Methods). To account for broad environmental and technical confounders, we removed hidden factors estimated by PEER[13] from each tissue prior to outlier discovery (Extended Data Fig. 1 and 2, Supplementary Tables 1 and 2).

Figure 1

Gene expression outliers and sharing between tissues

(a) A multi-tissue outlier. The individual has extreme expression values for the gene AKR1C4 in multiple tissues (red arrows) and the most extreme median expression value across tissues. (b) Outlier expression sharing between tissues, as measured by the proportion of single-tissue outliers that have |Z-score| ≥ 2 with the same effect direction for the corresponding genes in each replication tissue. Tissues are hierarchically clustered by gene expression. (c) Estimated replication rate of multi-tissue outliers in a constant held-out set of tissues for different sets of discovery tissues.

Extended Data Figure 1

PEER correction

(a) Adjusted R2 between top 15 PEER factors and top 20 sample (left) and subject (right) covariates in an example tissue, skeletal muscle. Covariates were ranked by the average adjusted R2 across all PEER factors and hierarchically clustered. The corresponding data for all tissues are provided in Supplementary Tables 1 and 2. (b) Adjusted R2 between the total expression component removed by PEER in each tissue and top 20 sample (left) and subject (right) covariates. The covariates were ranked by the average adjusted R2 across all tissues, and both axes were hierarchically clustered. White denotes missing values, and tissues are colored as in Fig. 1. PEER factors captured slightly different covariates across tissues, with a noticeable difference between the brain and other tissues. (c) Rare variant enrichments as in Fig. 2a for different levels of PEER correction. The fully corrected data show substantially stronger rare variant enrichments than the two partially corrected datasets.

Extended Data Figure 2

Distribution of the number of genes with a multi-tissue outlier

(a) Distribution of the number of genes for which each individual was a multi-tissue outlier. Each individual was an outlier for a median of 10 genes. Individuals with 50 or more outliers are colored in grey and were excluded from downstream analyses. (b–f) Distribution of the number of genes for which individuals, stratified by common covariates, were multi-tissue outliers. For race and sex, we compared the distributions using an unsigned Wilcoxon rank sum test, while we used Spearman’s ρ to test for association with the remaining covariates. Only age (Spearman’s ρ = 0.10, P = 0.033) and ischemic time (Spearman’s ρ = 0.18, P = 0.00022) were nominally associated with the number of outlier genes per individual. The association with age fails to achieve significance after correcting for multiple testing using the Bonferroni method. Note that in (b) we only tested for a significant difference in the distribution of the number of outlier genes between White and Black individuals because there were too few individuals in the other groups. (g) Enrichments as shown in Fig. 2a either including all individuals, or excluding individuals that are outliers for 50 (matches Fig. 2a) or 30 genes.

We identified a single-tissue expression outlier for ≥ 99% of expressed genes in each tissue and a multi-tissue outlier for 4,919 of 18,380 tested genes (27%). Each individual was a single-tissue outlier for a median of 83 genes per tissue compared with a median of 10 genes as a multi-tissue outlier. Single-tissue outliers discovered in one tissue replicated in other tissues at rates up to 33%, with higher rates among related tissues (Fig. 1b, Extended Data Fig. 3). The replication rate for multi-tissue outliers was much higher and increased with the number of tissues used for discovery (Fig. 1c).

Extended Data Figure 3

Single-tissue outlier replication

(a) Correlation between the replication proportions (see Methods) obtained from all samples and from a subset of 70 overlapping individuals per tissue pair (Pearson’s correlation, P < 2.2 × 10−16). When restricting to 70 individuals, the replication rates decreased more for discovery tissues with larger sample sizes in the full data set, indicating that replication rates were underestimated for tissues with small sample sizes. (b) Correlation between replication in the 70 individuals used for discovery and replication assessed in a set of 70 individuals that included the outlier individual and 69 individuals excluded from the discovery set (Pearson’s correlation, P < 2.2 × 10−16). Replication was higher when computed in the discovery individuals rather than in a distinct set of individuals. (c) Single-tissue outlier replication using all individuals, as in Fig. 1b, but data are only shown for pairs with at least 70 overlapping individuals. Tissue pairs with insufficient overlap are in grey. (d) For each pair of tissues with sufficient samples, outlier discovery and replication using 70 individuals sampled in both tissues. The replication values decreased compared with replication performed in all individuals (c), particularly for tissues with large sample sizes in the complete dataset. However, the pattern of replication, with more similar tissues having higher replication rates, is maintained. (e) For each tissue, the proportion of (individual, gene) outlier pairs where the individual was also a multi-tissue outlier for the gene. This proportion was positively correlated with the tissue sample size (P = 1.4 × 10−10). Points are colored by tissue following the convention in Fig. 1.

We investigated the influence of rare genetic variation on extreme expression levels, focusing on the individuals of European ancestry with whole genome sequencing data (1,144 multi-tissue outliers). Multi-tissue outliers were strongly enriched for nearby rare variants. The enrichment was most pronounced for structural variants (SVs) as previously described[14], and greater for short insertions and deletions (indels) than for single nucleotide variants (SNVs) (Fig 2a, Extended Data Fig. 4). As most rare variants are heterozygotes, expression outliers driven by rare variants in cis should exhibit allele-specific expression (ASE). Both single-tissue and multi-tissue outliers were significantly enriched for ASE compared with non-outliers (see Methods; two-sided Wilcoxon rank sum tests, each nominal P < 2.2 × 10−16; Fig. 2c). For underexpression outliers with exonic rare variants, the rare allele was generally underexpressed with respect to the common allele and conversely so for overexpression outliers, consistent with the rare variant causing the effect (two-sided Wilcoxon rank sum tests, each nominal P < 4.0 × 10−8; Extended Data Fig. 5a). The enrichment for rare variants and ASE was stronger for multi-tissue outliers than for single-tissue outliers (Fig. 2b,c, Extended Data Fig. 6a), especially at higher Z-score thresholds.

Figure 2

Enrichment of rare variants and ASE in outliers

(a) Enrichment of SNVs, indels, and SVs within 10 kb of the TSS among outliers. For each frequency stratum, we calculated enrichment as the relative risk of having a nearby rare variant given the outlier status (see Methods). Bars indicate 95% Wald confidence intervals. (b) Rare SNV enrichments at increasing Z-score thresholds. Text labels indicate the number of outliers at each threshold. (c) ASE, measured as the magnitude of the difference between the reference-allele ratio and the null expectation of 0.5. The non-outlier category is defined in the Methods.

Extended Data Figure 4

Number of rare variants per individual and population structure

(a) The distribution of the number of rare variants of each type for individuals of European descent (reported as white). Certain individuals harbored many more rare variants than the population median (vertical black line). (b) Principal component analysis of all individuals. Individuals are plotted according to their first two genotype principal components (PCs) and colored by their reported ancestry. White individuals with whole genome sequencing data, included in (a), are colored in a lighter shade of blue and those with 60,000 or more rare variants are circled in black. The individuals with an excess of rare variants likely had African or Asian admixture. (c) Enrichments as in Fig. 2a and excluding individuals with >60,000 rare variants (circled in (b)), which did not substantially affect the enrichment patterns. (d) European population allele frequency distributions in the 1000 Genomes project of rare SNVs and indels analyzed. The rare variants included in our analysis were constrained to have MAF ≤ 0.01 in the 1000 Genomes European super population, but they were also relatively rare in each of the individual European populations.

Extended Data Figure 5

Comparison of overexpression and underexpression outliers

(a) Allele-specific expression (ASE) at rare exonic variants. ASE is shown as the ratio of the number of reads supporting the minor allele to the total number of reads at the site. If the rare variant is driving the extreme expression, we expect this ratio to be below 0.5 for underexpression outliers and above 0.5 for overexpression outliers. Rare coding variants were enriched for ASE in the direction of the extreme expression effect (two-sided Wilcoxon rank sum tests, each nominal P < 4.0 × 10−8). (b) Expression level distribution of all genes and genes with overexpression or underexpression outliers. Expression is shown as the log2 of the median (RPKM + 2), where the median was first taken across individuals in each tissue then across expressed tissues for each gene. For genes with low expression, even an RPKM of 0 may not yield a Z-score ≤ −2. Indeed, underexpression outliers were depleted among lowly expressed genes whereas the opposite was true of overexpression outliers (two-sided Wilcoxon rank sum test comparing to all genes, P < 2.2 × 10−16 for both overexpression and underexpression). (c) Feature enrichments (as in Fig. 3b) shown separately for over and underexpression outliers.

Extended Data Figure 6

Extended rare variant enrichments

(a) For each tissue, rare SNV enrichment in single-tissue outliers compared with non-outliers at the same genes for increasing Z-score thresholds. Enrichments calculated as in Fig. 2. The rare variant enrichments varied between tissues though the overall pattern mirrored that of multi-tissue outliers when combining all the tissues (Fig. 2b). The high variance in the enrichments underscores the noise in single-tissue outlier discovery. (b) As in Fig. 2a, enrichment for SNVs, indels, and SVs in outliers compared with the same genes in non-outliers either including all rare variants or only those outside protein-coding or lincRNA exons in Gencode v19 annotation. The enrichment of rare variants was weaker, but still significant, for all variant types when excluding exonic regions.

To characterize the properties of rare variants correlated with large changes in gene expression, we assessed the enrichment of different classes of variant types in outliers compared with non-outliers (Supplementary Table 3a). Outliers were enriched, in order of significance, for SVs, variants near splice sites, introducing frameshifts, at start or stop codons, near the transcription start site (TSS), and in conserved regions (Fig. 3a). Variants in coding regions contributed disproportionately to outlier expression; enrichments weakened for all variants types (SNVs, indels, and SVs) when excluding exonic regions (Extended Data Fig. 6b). Additionally, 90% of stop-gain and frameshift variants were predicted to trigger nonsense-mediated decay in outliers, suggesting a biological mechanism for these cases.

Figure 3

Stratification of multi-tissue outliers by rare variant classes

We considered rare variants in the gene body and within 10 kb of the gene (200 kb for SVs and enhancers). (a) Enrichment of disjoint variant classes among outliers calculated as log odds ratio with 95% Wald confidence intervals. (b) Enrichment of functional annotations for rare SNVs. (c) Proportion of genes with an outlier potentially explained by each rare variant class. (d) Distribution of median Z-scores for each variant class. (e) For each variant class, distribution of ASE (see Methods) averaged across tissues. Grey lines mark the median values among non-outliers.

We also tested the relationship between outlier gene expression and functional annotations. Multi-tissue outliers were strongly enriched for variants in promoter or CpG-rich regions and had variants with higher conservation[15-18] and CADD[19] scores than non-outliers. We observed weaker enrichment in enhancers and transcription factor binding sites (Fig. 3b, Extended Data Fig. 7). Combining all classes of variation, other than non-conserved non-coding rare variants (excluded as less likely candidates for causal effects), we observed that 58% of underexpression and 28% of overexpression outliers had rare variants near the relevant gene, compared with 8% for non-outliers (Fig. 3c). Overexpression outliers were more common overall, potentially because detection of underexpression outliers for very low expression genes is inherently limited (Extended Data Fig. 5b). Overexpression outliers were also less enriched for functionally annotated rare variants (Extended Data Fig. 5c). Some variant classes had strong directionality concordant with their expected impact: duplications caused overexpression, while deletions, start and stop codon variants, and frameshifts coincided with underexpression (Fig. 3d). We also observed strong ASE for outliers carrying all classes except non-conserved variants (Fig. 3e).

Extended Data Figure 7

Enrichment of an extended list of functional genomic annotations

Log odds ratios and 95% Wald confidence intervals from logistic regression models of outlier status as a function of each genomic feature. Features were calculated among rare SNVs within 10 kb of the gene. When more than one feature corresponded to the same genomic annotation (e.g., the number or the presence of rare variants in a splice region; Supplementary Table 3b), the feature with the highest enrichment is shown. Lighter shading indicates a non-significant log odds ratio (nominal P > 0.05).

We hypothesized that functional, large-effect rare variants have been under recent selective pressure. As expected, we found that rare promoter variants in outliers were significantly less frequent in the UK10K cohort of 3,781 individuals[3] than those from non-outliers for the same genes (two-sided Wilcoxon rank sum test, P = 0.0060; Fig. 4a). Additionally, genes intolerant to loss-of-function and missense mutations were depleted of both multi-tissue outliers and multi-tissue eQTLs (Fisher’s exact test, all P < 2 × 10−15; Fig. 4b, Extended Data Fig. 8a). We observed a similar depletion in two curated disease gene lists—genes involved in heritable cardiovascular disease (Cardio) and genes in the ACMG guidelines for incidental findings[20]—but not in broader gene lists (Fig. 4c; Extended Data Fig. 8b,c). Genes with a multi-tissue outlier were more likely to have a multi-tissue eQTL (two-sided Wilcoxon rank sum test, P < 2.2 × 10−16; Extended Data Fig. 8d,e), suggesting influence of both rare and common regulatory variation for some genes. However, we found evidence that genes with outliers were more constrained than genes with multi-tissue eQTLs as they harbored less missense and loss-of-function variation (Tukey’s range test, missense Z-score P = 0.0070, probability of loss-of-function intolerance score P = 0.032; Fig. 4b, Extended Data Fig. 8a). This suggests that outlier expression analysis can yield unique insight into constraint on gene regulation.

Figure 4

Evolutionary constraint of genes with multi-tissue outliers

(a) Distributions of UK10K minor allele frequencies for promoter SNVs in outlier and non-outlier individuals at genes with multi-tissue outliers. (b) Odds ratio of being intolerant to loss-of-function variants for genes with multi-tissue outliers, genes with shared eQTLs (eGenes), genes reported in the GWAS catalog, and OMIM genes. (c) Odds ratio of a gene having a multi-tissue outlier for each of eight sets of genes involved in complex traits or diseases. In (b) and (c) bars represent 95% confidence intervals (Fisher’s exact test).

Extended Data Figure 8

Evolutionary constraint and regulatory control of multi-tissue outlier genes

(a) Odds ratio of being intolerant to synonymous and missense variants for genes with multi-tissue eQTLs (eGenes), genes with multi-tissue outliers, OMIM, and GWAS genes (see Methods). As expected, GWAS and OMIM genes showed no enrichment or depletion for synonymous variation intolerant genes. Genes with multi-tissue outliers and eGenes showed slight depletion for these genes. Genes with multi-tissue outliers and eGenes were strongly depleted for missense variation intolerant genes compared with OMIM and GWAS genes. (b) Comparison of the depletion of disease genes among genes with a multi-tissue outlier and eGenes. Similar to Fig. 4c, bars represent 95% confidence intervals from Fisher’s exact test. (c) For each of ten gene lists, the difference in the mean number of variants near genes in the list compared with the mean for all other annotated genes. Results are stratified by minor allele frequency, and bars indicate the 95% confidence interval for the difference from a two-sided t-test. Disease genes harbored more variants than control genes in general, and the difference was particularly striking for rare variants. This suggests that the depletion of outliers and eQTLs for certain groups of disease genes is due to less rare variation near these genes. Instead, we hypothesize that the variation around these genes in our healthy cohort is less likely to have large regulatory effects. (d) Distribution of the number of tissues with an eQTL for genes with and without outliers. Genes with multi-tissue outliers had eQTLs in more tissues than genes without, which suggests that they are more susceptible to shared regulatory control. This result held for both multi-tissue eQTL definitions (see Methods; Meta-Tissue: 23 vs 3 tissues, Wilcoxon rank sum test P < 2.2 × 10−16; tissue-by-tissue: 7 vs 3 tissues, P < 2.2 × 10−16). (e) This eGene enrichment was robust across different mean expression levels across tissues (two-sided Wilcoxon rank sum tests, Bonferroni-adjusted P < 1 × 10−11).

Next, we sought to prioritize rare variants in each individual genome by their predicted impact on regulation of gene expression. We developed RIVER (RNA-Informed Variant Effect on Regulation), a statistical method including a Bayesian model that jointly analyzes genome and transcriptome data from the same individual to estimate the probability that a variant has regulatory impact (https://bioconductor.org/packages/release/bioc/html/RIVER.html, see Methods). RIVER uses a generative model that assumes that genomic annotations (Supplementary Table 3b) determine the prior probability that a variant is a functional regulatory variant, in terms of influence on gene expression, which in turn influences whether nearby genes are likely to display outlier levels of expression (Fig. 5a). RIVER does not require a labeled set of functional/non-functional variants; rather it derives its power from identifying expression patterns that coincide with predictive genomic annotations.

Figure 5

Performance of RIVER for prioritizing functional regulatory variants

(a) RIVER probabilistic graphical model (see Methods). (b) Predictive power of RIVER compared with an L2-regularized logistic regression model using only genomic annotations. Accuracy was assessed using held-out individuals sharing the same rare SNVs as observed individuals (AUCs compared with DeLong’s approach[29]). (c) Distribution of RIVER scores (shades of blue) as a function of expression and genomic annotation scores. The distributions of variant categories across expression and genomic annotation scores are shown as histograms aligned opposite the corresponding axes.

We trained RIVER on the GTEx V6 cohort, and evaluated the model on held out pairs of individuals who shared the same rare variants. We then computed the RIVER score (the posterior probability of having a functional regulatory variant) for one individual, using both expression and genomic data, and assessed the accuracy with respect to the second individual’s held-out expression levels (see Methods). Incorporating expression data significantly improved prediction compared with a model that uses genomic annotations alone (AUCs 0.64 and 0.54, respectively, P = 3.5 × 10−4; Fig. 5b; Extended Data Fig. 9a,b), and RIVER learned, unsupervised, to prioritize variants supported by both genomic annotations and extreme expression levels across tissues (Fig. 5c, Extended Data Fig. 9c). ASE was also enriched among the top RIVER instances compared with the genomic annotation model (Extended Data Fig. 9d). Finally, even after accounting for the most informative genomic annotations or summary scores, personal expression data was highly informative of rare variant effects (average log odds ratio 2.76; Extended Data Fig. 9e,f).

Extended Data Figure 9

River performance

(a) Comparison between the predictive power of RIVER and that of the genomic annotation model, as in Fig. 5a, across different Z-score thresholds for outlier calling. Increasing the Z-score threshold improved AUC values, but reduced the number of outlier examples, which led to noisy ROCs. (b) Stability analysis of estimated parameters with different parameter initializations (see Methods). (c) Correlations, using Kendall’s tau, between the fraction of tissues with |Z-score| ≥ 2 and the test probabilities from the genomic annotation model (left) and RIVER (right). We calculated test posterior probabilities using 10-fold cross validation and only considered individual and gene pairs with a fraction of tissues with |Z-score| ≥ 2 that was significantly different from 0.05 (one-sided binomial exact test, Benjamini-Hochberg adjusted P < 0.05). (d) P-values from a one-sided Fisher’s exact test measuring the association between allelic imbalance (see Methods) and the posterior probability of a functional rare variant according to the genomic annotation model and RIVER. The posterior probabilities from RIVER were more strongly associated with allelic imbalance across all four thresholds tested. (e) Assessment of the advantage of incorporating gene expression with genomic annotations for predicting outlier status using simplified supervised models (see Methods). All models showed consistent improvement of the log odds ratio of outlier status when incorporating expression. (f) Performance of models with 12 individual genomic features compared with the genomic annotation model and RIVER. Some models with single genomic features provided slightly better AUCs compared with the genomic annotation model, but they were not statistically different. On the other hand, RIVER predicted the effects of rare variants significantly better than each of the models with a single feature.

RIVER can be used to predict regulatory effects on gene expression and aid in prioritization amongst disease associated variants. To investigate this potential, we evaluated 27 pathogenic variants from ClinVar[21] present in 21 GTEx donors (Fig. 5c, Extended Data Fig. 10a). Overall, pathogenic variants had RIVER scores higher than background variants (two-sided Wilcoxon rank sum test, P = 3.3 × 10−9; Extended Data Fig. 10b–d), and the six that were likely regulatory (those not annotated as missense or coding indel) scored in the 99.9th percentile. Evaluated in detail, several cases illustrated that rare disease-causing variants can have a regulatory impact evident from RNA-seq data, even from healthy individuals harboring those variants (where they are often heterozygous; Extended Data Fig. 10e,f). Note that RIVER trained on healthy cohorts such as GTEx can then be directly applied to new cohorts including disease samples.

Extended Data Figure 10

Evaluation of known pathogenic variants using RIVER

(a) 27 GTEx rare SNVs reported as disease variants in ClinVar. Relative frequency of (b) the |median Z-score|, (c) posterior probabilities from the genomic annotation model, and (d) posterior probabilities from RIVER for all individual and gene pairs (grey) and 27 pairs with pathogenic variants from ClinVar (orange). P-values were computed using a two-sided Wilcoxon rank sum test. We note that rare indels and SVs were not found nearby the genes in the individuals carrying these pathogenic variants. (e and f) Z-score and RPKM distributions for (e) SBDS and (f) GAMT were compared with the values for four individuals carrying regulatory pathogenic variation (red asterisks and triangles). The median Z-score and RPKM values across tissues are shown at the top of each plot (black circle). Tissues are colored as in Fig. 1 and sorted in decreasing order of the difference between the average Z-score of individuals with a regulatory pathogenic variant and the median Z-score for the tissue. Three individuals carrying a total of two unique rare variants are shown for SBDS. Both variants are associated with the recessive Shwachman-Diamond syndrome, which causes systemic symptoms including pancreatic, neurological, and hematologic abnormalities[46] and can disrupt fibroblast function[47]. The individuals, being heterozygous for these variants, lacked the disease phenotype. Nonetheless, we saw extreme underexpression of SBDS across almost all tissues in these individuals, including brain tissues, fibroblasts, and pancreas. One individual had a rare variant for GAMT associated with cerebral creatine deficiency syndrome 2, shown to cause neurological deficiencies and also lead to low body fat[48]. The individual had the most extreme underexpression in (subcutaneous) adipose.

To experimentally validate a subset of the variants identified through outlier analysis, we used CRISPR/Cas9 mediated genome editing[22,23]. In K562 cells, we tested six SNVs and matched controls in transcribed regions of genes with an outlier (see Methods; Extended Data Fig. 11a,b), and compared the allelic ratios between mRNA and genomic DNA (gDNA), an internal control. All variants tested were in underexpression outliers and were therefore expected to decrease expression. Two variants were excluded due to low cDNA and gDNA total reads counts. The four remaining SNVs in outliers all showed lower proportions of the alternate (installed) allele in the cDNA compared with gDNA, confirming that these variants decreased expression (Extended Data Fig. 11c).

Extended Data Figure 11

Validation of large-effect rare variants via CRISPR/Cas9 genome editing

(a) SNVs in outliers and controls assayed for expression effects using CRISPR/Cas9 genome editing. For common SNVs in controls (MAF >1% in the GTEx cohort), the range of median Z-scores and RIVER scores are given for all individuals harboring the minor allele. Missing values indicate that the variant was absent from our cohort. (b) Single-guide RNAs (sgRNAs) for four SNVs found in outliers and four control SNVs in the same genes. (c) Alternate (installed) gDNA and cDNA allele proportions for four rare, coding SNVs in outliers (left) and four matched control SNVs (right). Each gDNA and cDNA sample was sequenced in triplicate (technical replicates). Asterisks denote the Bonferroni-adjusted significance level from a two-sided t-test of the difference between the gDNA and cDNA alternate allele proportions: P < 0.05 (.), P < 0.01 (*), and P < 0.001 (**). Though one control SNV showed a significant difference in the alternate allele proportion between cDNA and gDNA, it displayed an increase rather than a decrease in expression.

In summary, by combining data across multiple tissues, we curated a set of gene expression outliers that replicated at higher rates and showed stronger rare variant enrichments than those from any single tissue. We found that rare structural variants, frameshift indels, coding variants, and variants near the transcription start site were most likely to have large effects on expression. However, our ability to characterize the genetic basis of multi-tissue outliers remains incomplete. Outliers without an underlying rare variant in our analysis may be due to variants in more distal regions or in annotations we did not consider, or may be attributable to residual technical or environmental effects. Although genetic variant interpretation remains challenging, RIVER demonstrates the value of incorporating personal gene expression data to examine the influence of rare variants on expression that may be uncertain from sequence alone. Our results suggest a general approach that can be applied to studies that supplement genome sequencing with other molecular phenotypes, such as methylation[24-26] and histone modification[27,28]. We anticipate that such integrative approaches will be essential for effective interpretation of genome-wide genetic variation on a personalized level.

Methods

Study population

All human subjects were deceased donors. Informed consent was obtained for all donors via next-of-kin consent to permit the collection and banking of de-identified tissue samples for scientific research. The research protocol was reviewed by Chesapeake Research Review Inc., Roswell Park Cancer Institute’s Office of Research Subject Protection, and the institutional review board of the University of Pennsylvania. We used the RNA-seq, allele-specific expression, and whole genome sequencing (WGS) data from the v6 release of the GTEx project. The generation of these data is described in Supplementary Information sections 3 and 5 of the main GTEx Consortium publication[12].

Correction for technical confounders

We restricted our expression analyses to the 449 individuals and 44 tissues for which sex and the top three genotype principal components (PCs), which capture major population stratification, were available. For each tissue, we log2-transformed all expression values (log2(RPKM + 2)). We then standardized the expression of each gene to prevent shrinkage of outlier expression values caused by quantile normalization. To remove unmeasured batch effects and other confounders, for each tissue separately, we estimated hidden factors using PEER[13] on the transformed expression values. In each tissue, we defined expressed genes and corrected for the same number of PEER factors as in the GTEx eQTL analyses (see Supplementary Information sections 5.5 and 5.6 of the main GTEx Consortium publication[12]). We regressed out the PEER factors, the top three genotype principal components, and sex (where appropriate) from the transformed expression data for each tissue using the following linear model: where is the transformed expression in the given gene g, μ is the mean expression level for the gene, is the nth PEER factor, are the top three genotype PCs, and is the sex covariate. We assumed the residual vector ε follows the multivariate normal distribution ε ~ N(0, σ2). Finally, we standardized the expression residuals ε for each gene, which yielded Z-scores. To better understand the effect of PEER correction on the removal of technical and biological confounders, we compared the PEER factors in each tissue separately to pre-collected sample and subject covariates. We considered the subset of covariates with >50 observations in at least 31 tissues, where we first selected covariates with more than one unique entry in each tissue. For categorical covariates, we only considered categories with more than 20 observations. For each PEER factor and each covariate, we fit a linear model with the PEER factor as the response and the covariate as the predictor. From this model, we computed the proportion of that PEER factor’s variance explained by the covariate as the adjusted R2: where p and n are the number of parameters and samples, respectively, and . SS and SS refer to the total and residual sums of squares, respectively. To quantify the degree to which each covariate was captured by the combination of all PEER factors, genotype PCs, and sex (where appropriate) for each tissue, we considered the expression component regressed out from the uncorrected data: For each covariate, we then fit a linear model with as the response and the covariate as the predictor. We assessed the proportion of the variance of explained by each covariate by computing the adjusted R2 for the covariate across all genes. We used the formula above, but summed across all genes to compute SS and SS. To assess the impact of PEER correction on rare variant enrichment, we also tried removing either the top five PEER factors for each tissue or no PEER factors. We then performed multi-tissue outlier calling and tested the enrichment of rare and common variants in the two partially corrected datasets (see Methods section “Enrichment of rare and common variants near outlier genes”).

Single-tissue and multi-tissue outlier discovery

Single-tissue and multi-tissue outlier calling was restricted to autosomal lincRNA and protein coding genes. For each tissue, an individual was called a single-tissue outlier for a particular gene if that individual had the largest absolute Z-score and the absolute value was at least two. For each gene, the individual with the most extreme median Z-score taken across tissues was identified as a multi-tissue outlier for that gene provided the absolute median Z-score was at least two. Therefore, each gene had at most one single-tissue outlier per tissue and one multi-tissue outlier. Under this definition an individual could be an outlier for multiple genes. In addition, we only tested for multi-tissue outliers among individuals with expression measurements for the gene in at least five tissues. To reduce cases where non-genetic factors may cause widespread extreme expression, we removed eight individuals that were multi-tissue outliers for 50 or more genes from all downstream analyses, including before single-tissue outlier discovery. Removing these individuals with extreme expression across many genes improved our rare variant enrichments, but the precise threshold mattered less (Extended Data Fig. 2g). We chose the threshold of 50 to strike a balance between removing extreme individuals while not excluding a large proportion of our cohort.

Replication of expression outliers

We calculated the proportion of single-tissue outliers discovered in one tissue that had |Z-score| ≥ 2 with the same direction of effect for the same gene in the replication tissue. Since certain groups of tissues were sampled in a specific subset of individuals, we evaluated the extent to which replication was influenced by the size and the overlap of the discovery and replication sets. We repeated the replication analysis with the discovery and replication in exactly 70 overlapping individuals for each pair of tissues with enough samples and compared the replication patterns to those obtained by using all individuals. To estimate the extent to which individual overlap biased replication estimates, for each pair of tissues with sufficient samples, we defined three disjoint groups of individuals: 70 individuals with data for both tissues, 69 distinct individuals with data in the first tissue, and 69 distinct individuals with data in the second tissue. We discovered outliers in the first tissue using the shared set of individuals then tested for replication using the same individuals in the second tissue. Then, for each gene, we added the identified outlier to the distinct set of individuals and tested the replication again in the second tissue. We repeated the process running the discovery in the second tissue and the replication in the first one. We compared the replication rates when using the same or different individuals for the discovery and replication. We assessed the confidence of our multi-tissue outliers using cross-validation. We separated the tissue expression data randomly into two groups: a discovery set of 34 tissues and a replication set of 10 tissues. For t = 10, 15, 20, 25, and 30, we randomly sampled t tissues from the discovery set and performed outlier calling as described above. Due to incomplete tissue sampling, the number of tissues supporting each outlier is at least five but less than t. We computed the replication rate as the proportion of outliers in the discovery set with |median Z-score| ≥ 1 or 2 in the replication set. We set no restriction on the number of tissues required for testing in the replication set. To calculate the expected replication rate, we randomly selected individuals in the discovery set with at least five tissues that expressed the gene and computed the replication rate. We repeated this process 10 times for each discovery set size.

Quality control of genotypes and rare variant definition

We restricted our rare variant analyses to individuals of European descent, as they constituted the largest homogenous population within our dataset. We considered only autosomal variants that passed all filters in the VCF (those marked as PASS in the Filter column). Minor allele frequencies (MAF) within the GTEx data were calculated from the 123 individuals of European ancestry with whole genome sequencing data (average coverage 30×). The MAF was the minimum of the reference and the alternate allele frequency where the allele frequencies of all alternate alleles were summed together. Rare variants were defined as having MAF ≤ 0.01 in GTEx, and for SNVs and indels we also required MAF ≤ 0.01 in the European population of the 1000 Genomes Project Phase 3 data[30]. To ensure that population structure among the individuals of European descent was unlikely to confound our results, we verified that the allele frequency distribution of rare variants included in our analysis (within 10 kb of a protein coding or lincRNA gene, see below) was similar for the five European populations in the 1000 Genomes project (Extended Data Fig. 4d).

Enrichment of rare and common variants near outlier genes

We assessed the enrichment of rare SNVs, indels, and SVs near outlier genes. Proximity was defined as within 10 kb of the TSS for most analyses. For Fig. 3 and Extended Data Figs. 5, 7 and 8, we included all variants within 10 kb of the gene, including the gene body, to also capture coding variants. In Fig. 3 and Extended Data Fig. 5 and 8, we extended the window to 200 kb for enhancers and SVs. For each gene with an outlier, we chose the remaining set of individuals tested for outliers at the same gene as non-outlier controls. We only considered genes that had both an outlier and at least one control. We stratified variants of each class into four minor allele frequency bins (0–1%, 1–5%, 5–10%, 10–25%) to compare the relative enrichments of rare and common variants. We also assessed the enrichment of SNVs at different Z-score cutoffs. Enrichment was defined as the ratio of the proportion of outliers with a variant whose frequency lies within the range to the corresponding proportion for non-outliers. This enrichment measure is equivalent to the relative risk of having a nearby rare variant given outlier status. We used the asymptotic distribution of the log relative risk to obtain 95% Wald confidence intervals. Within our set of European individuals, we observed some individuals with minor admixture that had relatively more rare variants than the rest (Extended Data Fig. 1b). We confirmed that inclusion of these admixed individuals did not substantially affect our results (Extended Data Fig. 1c). We also calculated rare variant enrichments when restricting to variants outside protein-coding and lincRNA exons in Gencode v19 annotation (extending internal exons by 5 bp to capture canonical splice regions). To measure the informativeness of variant annotations, we used logistic regression to model outlier status as a function of the feature of interest, which yielded log odds ratios with 95% Wald confidence intervals. Note that for the feature enrichment analysis in Fig. 3b and Extended Data Fig. 7, we required that outliers and their gene-matched non-outlier controls have at least one rare variant near the gene. We standardized all features, including binary features, to facilitate comparison between features of different scale. We also calculated the proportion of overexpression outliers, underexpression outliers and non-outliers with a rare variant near the gene (within 10 kb for SNVs and indels and 200 kb for SVs). To each outlier instance, we assigned at most one of the 12 rare variant classes we considered (Supplementary Table 3a). If an outlier had rare variants from multiple classes near the relevant genes, we selected the class that was most significantly enriched among outliers.

Annotation of variants

We obtained SV anntoations from Chiang et al.[14] and computed features for rare SNVs and indels using three primary data sources: Epigenomics Roadmap[31], CADD v1.2[19], and VEP v80[32]. Promoter and enhancer annotation tracks were obtained from the Epigenomics Roadmap Project (http://www.broadinstitute.org/~meuleman/reg2map/HoneyBadger2_release/). We mapped 28 unique tissues in the GTEx Project to 19 tissue groups in the Roadmap Project. Using these annotations, for each individual, we assessed whether each SNV or indel overlapped a promoter or enhancer region in at least one of the 19 Roadmap tissue groups. Features including conservation[15-18], transcription factor binding, and deleteriousness were extracted from the full annotation tracks of the CADD v1.2 release (downloaded 15/05/2015; http://cadd.gs.washington.edu/download). Finally, we obtained protein-coding and transcription-related annotations from VEP. This information was provided in the GTEx v6 VCF file. Stop-gain and frameshift variants annotated as high-confidence loss-of-function variants by LOFTEE were assumed to trigger nonsense-mediated decay. We generated gene-level features described in Supplementary Table 3.

Allele-specific expression (ASE)

We only considered sites with at least 30 total reads and at least five reads supporting each of the reference and alternate alleles. To minimize the effect of mapping bias, we filtered out sites that showed mapping bias in simulations[33], that were in low mappability regions (ftp://hgdownload.cse.ucsc.edu/gbdb/hg19/bbi/wgEncodeCrgMapabilityAlign50mer.bw), or that were rare variants or within 1 kb of a rare variant in the given individual (the variants were extracted from the GTEx exome sequencing data described in section 4 of the main GTEx consortium publication[12]). The first two filters were provided in the GTEx ASE data release. The third filter was applied to eliminate potential mapping artifacts that mimic genetic effects from rare variants. We measured ASE at each testable site as the absolute deviation of the reference allele ratio from 0.5. For each gene, all testable sites in all tissues were included. We compared ASE in single-tissue and multi-tissue outliers at different Z-score thresholds to non-outliers using two-sided Wilcoxon rank sum tests. To obtain a matched background, we only included a gene in the comparison when ASE data existed for both the outlier individual and at least one non-outlier. In the case of single-tissue outliers, we also required the tissue to match between the outlier and the non-outlier. All individuals that were neither multi-tissue outliers for the given gene nor single-tissue outliers for the gene in the corresponding tissue were included as non-outliers. In cases where outliers had rare coding variants in the gene, if the rare variants were causing the extreme expression in cis, we expected to see ASE at the rare variant matching the direction of the effect. For underexpression outliers, we expected the (rare) minor allele to be underexpressed compared with the major allele. For overexpression outliers, we expected the minor allele to be overexpressed. To test this, we used the same filters as above, but looked exclusively at rare variants (instead of excluding them). We measured ASE as the minor allele ratio: the number of reads supporting the minor allele over the total number of reads. We also used ASE to evaluated performance of both the genomic annotation model and RIVER (see below) by testing the association between allelic imbalance and model predictions using Fisher's Exact Test. Here, we defined allelic imbalance as the top 10% of the median absolute deviation, across tissues, of the reference allele ratio from 0.5.

Allele frequency measurements in UK10K

UK10K[3] VCF files of whole genome cohorts were downloaded from https://www.ebi.ac.uk. We merged the Avon Longitudinal Study of Parents and Children (ALSPAC) EGAS00001000090 and the Department of Twin Research and Genetic Epidemiology (TWINSUK) EGAS00001000108 datasets for a total of 3,781 individuals. We counted the occurrence of all rare GTEx SNVs in Epigenomics Roadmap-annotated promoter regions among the UK10K samples. GTEx variants absent from the UK10K cohorts were assigned a count of 0.

Definition of multi-tissue eGenes

We defined multi-tissue eGenes using two approaches. For the tissue-by-tissue approach, we obtained lists of significant eGenes (q-value ≤ 0.05) for each of the 44 tissues from the GTEx v6p release. The second approach used cis-eQTLs with shared effects across tissues estimated by the RE2 model of the Meta-Tissue software[34], as described in the main consortium manuscript[12]. We chose for each gene the variant with the lowest nominal P-value from the RE2 model. We then determined the number of tissues in which this variant-gene pair showed a cis-eQTL effect (m-value ≥ 0.9[34]). For each of the 18,380 genes tested for multi-tissue outliers, we calculated the number of tissues in which the gene appeared as a significant eGene (tissue-by-tissue approach) or had a shared eQTL effect (Meta-Tissue approach). To show that the enrichment of outlier genes as multi-tissue eGenes was not confounded by gene expression level, using the Meta-Tissue results, we stratified genes tested for multi-tissue outliers into RPKM deciles and repeated the comparison between genes with and without a multi-tissue outlier. When comparing the enrichment for eGenes among constrained and disease gene lists, we classified the top n Meta-Tissue eGenes (ranked by nominal P-value from the RE2 model) as multi-tissue eGenes and considered the remaining genes as background. We selected n to match the umber of multi-tissue outliers in the comparison.

Evolutionary constraint of genes with multi-tissue outliers

We obtained gene-level estimates of evolutionary constraint from the Exome Aggregation Consortium[35] (http://exac.broadinstitute.org/, ExAC release 0.3). We intersected the 17,351 autosomal lincRNA and protein coding genes with constraint data from ExAC with the 18,380 genes tested for multi-tissue outliers from GTEx, yielding 14,379 genes for further analysis (3,897 and 10,482 genes with and without a multi-tissue outlier, respectively). We examined three functional constraint scores from the ExAC database: synonymous Z-score, missense Z-score, and probability of loss-of-function intolerance (pLI). Synonymous- and missense-intolerant genes were defined as those with corresponding Z-scores above the 90th percentile. We defined loss-of-function intolerant genes as those with a pLI score above 0.9, following the guidelines provided by the ExAC consortium. We calculated odds ratios and 95% confidence intervals for the enrichment of genes with multi-tissue outliers in these lists using Fisher’s exact test. We repeated this analysis for three other gene sets: 19,182 multi-tissue eGenes from GTEx v6p defined using Meta-Tissue, 9,480 reported GWAS genes from the NHGRI-EBI catalog[36] (http://www.ebi.ac.uk/gwas accessed 30/11/2015), and 3,576 OMIM genes (http://omim.org/ accessed 26/5/2016). We tested for a difference in the mean constraint for genes with multi-tissue outliers and genes with multi-tissue eQTLs using ANOVA. For each constraint score in ExAC, we treated the score for each gene as the response and the status of the gene as having a multi-tissue outlier and/or a multi-tissue eQTL as a categorical predictor with four classes. After fitting the model, we performed Tukey’s range test to determine whether there was a significant difference in the mean constraint between genes with a multi-tissue outlier but no multi-tissue eQTL and genes with a multi-tissue eQTL but no multi-tissue outlier.

Overlap of genes with multi-tissue outliers and disease genes

We examined the enrichment of genes with multi-tissue outliers in eight disease gene lists: the GWAS catalog and OMIM (described above) as well as ClinVar (6,279 genes; http://www.ncbi.nlm.nih.gov/clinvar/), OrphaNet (3,451 genes; http://www.orpha.net/), ACMG[20] (58 genes; http://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/), Developmental Disorders Genotype-Phenotype[37] (DDG2P; 1693 genes; http://www.ebi.ac.uk/gene2phenotype/), and two curated gene lists of 86 cardiovascular disease genes and 55 cancer genes (described below). We computed odds ratios and 95% confidence intervals using Fisher’s exact test to compare each disease gene list to the genes with multi-tissue outliers and repeated the comparison for genes with multi-tissue eQTLs. Heritable cancer predisposition and heritable cardiovascular disease gene lists were curated by local experts in clinical and laboratory-based genetics in the two respective areas (Stanford Medicine Clinical Genomics Service, Stanford Cancer Center's Cancer Genetics Clinic, and Stanford Center for Inherited Cardiovascular Disease). Genes were included if both the clinical and laboratory-based teams agreed there was sufficient published evidence to support using variants in these genes in clinical decision making. For each of the eight disease gene lists above and for genes with multi-tissue outliers or multi-tissue eQTLs, we computed the number of variants (SNVs and indels within 10 kb and SVs within 200 kb of the gene, including the gene body) at each gene in the 123 individuals of European ancestry with WGS data. For each gene list and for each MAF bin (0–1%, 1–5%, 5–10%, 10–25%), we compared the mean number of variants near genes in the list to the mean number near all other annotated autosomal protein coding and lincRNA genes using a two-sided t-test.

RIVER integrative model for predicting regulatory effects of rare variants

RIVER (RNA-Informed Variant Effect on Regulation) is a hierarchical Bayesian model that predicts the regulatory effects of rare variants by integrating gene expression with genomic annotations. The RIVER model consists of three layers: a set of nodes = G … G in the topmost layer representing P observed genomic annotations over all rare variants near a particular gene, a latent binary variable FR in the middle layer representing the unobserved functional regulatory status of the rare variants, and one binary node E in the final layer representing expression outlier status of the nearby gene. We model each conditional probability distribution as follows: with parameters and and hyper-parameters λ and C. Because FR is unobserved, the RIVER log-likelihood objective over instances n = 1, …, N is non-convex. We therefore optimize model parameters via Expectation-Maximization[38] (EM) as follows: In the E-step, we compute the posterior probabilities (ω() of the latent variables FR given current parameters and observed data. For example, at the ith iteration, the posterior probability of FR = 1 for the nth instance is In the M-step, at the ith iteration, given the current estimates ω(i), the parameters () are estimated as where λ is an L2 penalty hyper-parameter derived from the Gaussian prior on . The parameters get updated as: where I is an indicator operator, t is the binary value of expression E, s is the possible binary values of FR, and C is a pseudo count derived from the Beta prior on . The E and M steps are applied iteratively until convergence.

RIVER application to the GTEx cohort

As input, RIVER requires a set of genomic features and a set of corresponding expression outlier observations , each over instances of individual and gene pairs. Using the variant annotations described above, we generated site-level genomic features for the 116 European individuals with GTEx WGS data that had fewer than 50 multi-tissue outliers. We then collapsed these features for all rare SNVs within 10 kb of each TSS to generate gene-level features described in Supplementary Table 3b. This produced a matrix of genomic features of size (116 individuals × 1,736 genes) × (112 genomic features), where we standardized features before use. For the values , we defined any individual with |median Z-score| ≥ 1.5 as an outlier if expression was observed in at least five tissues; the remaining individuals were labeled as non-outliers for the gene. We used this more lenient threshold in order to obtain a sufficiently large set of outliers for robust training and testing. In total, we extracted 48,575 instances where an individual had at least one rare variant within 10 kb of the TSS of a gene. To train and evaluate RIVER on the GTEx cohort, we used the 3,766 instances of individual and gene pairs where two individuals had the same rare SNVs near a particular gene. We held out those instances and trained RIVER parameters with the remaining instances. RIVER requires two hyper-parameters λ and C. To select λ, we first applied an L2-regularized multivariate logistic regression with features and response variable , selecting lambda with the minimum squared error via 10-fold cross-validation (we selected λ = 0.01). We selected C = 50, informed simply by the total number of training instances available, as validation data was not available for extensive cross-validation. Initial parameters for EM were set to = (P(E = 0 | FR = 0), P(E = 1 | FR = 0), P(E = 0 | FR = 1), P(E = 1 | FR = 1)) = (0.99, 0.01, 0.3, 0.7) and from the multivariate logistic regression above, although different initializations did not significantly change the final parameters (Extended Data Fig. 9b). The 3,766 held out pairs of instances were used to create a labeled evaluation set. For one of the two individuals from each pair, we estimated the posterior probability of a functional rare variant P(FR | ). The outlier status of the second individual, whose data was not observed either during training or prediction, was then treated as a “label” of the true status of functional effect FR. Using this labeled set, we compared the RIVER score to the posterior P(FR | ) estimated from the plain L2-regualrized multivariate logistic regression model with genomic annotations alone. We produced ROCs and computed AUCs for both models, testing for significant differences using DeLong’s method[29]. This measure relied on outlier status reflecting the consequences of rare variants. Indeed, pairs of individuals who shared rare variants tended to have highly similar outlier status even after regressing out effects of common variants (Kendall’s tau rank correlation, P < 2.2 × 10−16). We repeated this evaluation, varying the median Z-score threshold used to define outliers, and we also compared RIVER to individual features that were strongly enriched among outliers as well as PolyPhen[39] and SIFT[40].

Supervised model integrating expression and genomic annotation

To assess the information gained by incorporating gene expression data in the prediction of functional rare variants, we applied a simplified supervised approach to a limited dataset. We used the instances where two individuals had the same rare SNVs to create a labeled training set where the outlier status of the second individual was used as the response variable. We then trained a logistic regression model with just two features: 1) the outlier status of the first individual and 2) a single genomic feature value such as CADD or DANN. We estimated parameters from the entire set of rare-variant-matched pairs using logistic regression to determine the log odds ratio and corresponding P-value of expression status as a predictor. While this approach was not amenable to training a full predictive model over all genomic annotations jointly given the limited number of instances, it provided a consistent estimate of the log odds ratio of outlier status. We tested five genomic predictors: CADD[19], DANN[41], transcription factor binding site annotations, PhyloP scores[15], and one aggregated feature: the posterior probability from a multivariate logistic regression model learned with all genomic annotations.

RIVER assessment of pathogenic ClinVar variants

We downloaded variants from the ClinVar database[21] (accessed 04/05/2015) and searched any of these disease variants within the set of rare variants segregating in the GTEx cohort. Any disease variant reported as pathogenic, likely pathogenic, or a risk factor for disease was considered pathogenic. We further categorized the pathogenic variants as likely regulatory if they were annotated as splice-site variants, synonymous, or nonsense, whereas missense variants were considered unlikely to have a regulatory effect. To explore RIVER scores for those pathogenic variants, all instances were used for training RIVER. We then computed a posterior probability P(FR | ) for each instance coinciding with a pathogenic ClinVar variant.

Stability of estimated parameters with different parameter initializations

We tried several different initialization parameters for and to explore how this affected the estimated parameters. We initialized a noisy by adding K% Gaussian noise compared to the mean of with fixed (for K = 10, 20, 50 100, 200, 400, 800). For we fixed P(E = 1 | FR = 0) and P(E = 0 | FR = 0) as 0.01 and 0.99, respectively, and initialized (P(E = 1 | FR = 1), P(E = 0 | FR = 1)) as (0.1, 0.9), (0.4, 0.6), and (0.45, 0.55) instead of (0.3, 0.7) with fixed. For each parameter initialization, we computed Spearman rank correlations between parameters from RIVER using the original initialization and the alternative initializations. We also investigated how many instances within top 10% of posterior probabilities from RIVER under the original settings were replicated in the top 10% of posterior probabilities under the alternative initializations (Replication accuracy in Extended Data Fig. 9b).

Validation of large-effect rare variants via CRISPR/Cas9 genome editing

To select rare, coding SNVs for validation by CRISPR/Cas9 editing, we first restricted to the (gene, individual, variant) tuples identified in multi-tissue outliers without a rare SV or a rare indel within 200 kb or 10 kb of the gene, respectively. We considered the 116 rare SNVs with a coding consequence for the corresponding gene as annotated by VEP[32]; coding annotations included stop gained, stop lost, splice acceptor variant, splice donor variant, start lost, missense variant, splice region variant, stop retained variant, synonymous variant, coding sequence variant, and 5’/3’ UTR variant. Using RNA-seq data from ENCODE, we further restricted our variant list to the 59 SNVs occurring in genes with an average FPKM of at least 10 in K562 cells (ENCODE experiment IDs ENCSR000AEL and ENCSR000AEN)[42]. Finally, we filtered for rare, coding SNVs in (gene, individual) pairs with |median Z-score| > 4 and a RIVER score above the 99.5th percentile. These filters yielded a final set of 13 rare SNVs from which we chose the six exonic SNVs for testing. As controls, we selected SNVs present within the same cDNA amplicon region as the corresponding outlier SNV (see details on targeted sequencing below). We first searched for coding SNVs present within these regions in the GTEx cohort that did not occur in the outlier individual. If no SNV could be found satisfying these criteria, we expanded our search for SNVs using the ExAC database (ExAC release 0.3)[35]. If multiple possible control variants existed for an outlier SNV, we ranked the controls by CADD score[19] and prioritized synonymous variants. Sequences of single-guide RNAs (sgRNAs) used in the study are listed in Extended Data Fig. 11b. For each variant, a sgRNA and two donor oligonucleotides (with the reference and alternative alleles) were designed such that the PAM was located as close to the variant as possible. The donors were 99 bp long centered on the variant being installed. The variants were installed into K562 cells as previously described[22,23]. The K562 cells were those generated previously[23] and were regularly tested for mycoplasma infection. sgRNAs were expressed in the pGH020 (Addgene plasmid #85405) expression vector. For each donor oligonucleotide, K562 cells constitutively expressing a Cas9-BFP protein fusion were electroporated with 3 µg of sgRNA plasmid DNA and 1 µL of 100 µM donor oligonucleotide using the T-016 program on a Lonza Nucleofector 2b. After electroporation, cells were allowed to recover for 5 days. Cells electroporated with the reference and alternative allele donor oligonucleotides were mixed in a 1:1 ratio and grown together for 3 more days to control for differences in culturing conditions. We included cells electroporated with the reference allele to ensure that any changes in expression we observed were not due to the editing process itself. Since the editing efficiency is not 100% and varies between loci, we expect fewer than half the cells to carry the alternative allele and for this proportion to vary by locus. One to two million cells were collected for RNA and genomic DNA extraction. Genomic DNA was extracted using the QiaAmp DNA mini kit (Qiagen). Total RNA was extracted using QiaShredder and RNeasy Mini kit (Qiagen). Six µg of RNA was converted into cDNA using AMV reverse transcriptase (Promega). cDNA was purified and concentrated with the PCR Purification Kit (Qiagen). PCR primers were designed to generate 300–400 bp amplicons including the variant in either the genomic DNA or cDNA locus. For both genomic DNA and cDNA samples, 400 ng of DNA was amplified in triplicate (technical replicates) using Phusion High-Fidelty polymerase (Fisher) and the amplicon was purified on a 1% TAE agarose gel. The amplicons were then prepared for sequencing using the Nextera XT kit (Illumina) and sequenced together on a NextSeq 500. Reads were trimmed with cutadapt[43] (version 1.13) and aligned using bwa[44] (version 0.7.12-r1039) allowing no mismatches (bwa aln –n 0), which excluded any reads with indels created during editing. We used custom reference sequences, one each for the reference and alternate alleles of the targeted cDNA and gDNA amplicon regions. Allele counts at the target locus were computed for each sample using samtools pileup as implemented in the R package Rsamtools[45] (version 1.22.0). Only reads with a minimum mapping quality of 20 were considered. Two of the tested loci amplified poorly in preparation for sequencing, and they had extremely low mapping rates and total read counts over the target locus (median read count across replicates < 400 compared to 281,000 and 397,000 for gDNA and cDNA, respectively, for the remaining loci). As such, we removed these two loci from further analysis. Finally, to assess the effect of each variant on expression, we tested for a significant difference between the cDNA and gDNA alternate allele proportions with a two-sided t-test. We corrected for multiple testing using the Bonferroni procedure.

Code availability

RIVER is available at https://bioconductor.org/packages/release/bioc/html/RIVER.html. Additionally, the code for running analyses and producing the figures throughout this manuscript is available separately (https://github.com/joed3/GTExV6PRareVariation).

Data availability

The GTEx v6 release genotype and allele-specific expression data are available from dbGaP (study accession phs000424.v6.p1; http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1). Expression data from the v6p release and eQTL results are available from the GTEx portal (http://gtexportal.org).

PEER correction

Distribution of the number of genes with a multi-tissue outlier

Single-tissue outlier replication

Number of rare variants per individual and population structure

Comparison of overexpression and underexpression outliers

Extended rare variant enrichments

Enrichment of an extended list of functional genomic annotations

Evolutionary constraint and regulatory control of multi-tissue outlier genes

River performance

Evaluation of known pathogenic variants using RIVER

Validation of large-effect rare variants via CRISPR/Cas9 genome editing

44 in total

1. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses.

Authors: Oliver Stegle; Leopold Parts; Matias Piipari; John Winn; Richard Durbin
Journal: Nat Protoc Date: 2012-02-16 Impact factor: 13.491

2. Chemically modified guide RNAs enhance CRISPR-Cas genome editing in human primary cells.

Authors: Ayal Hendel; Rasmus O Bak; Joseph T Clark; Andrew B Kennedy; Daniel E Ryan; Subhadeep Roy; Israel Steinfeld; Benjamin D Lunstad; Robert J Kaiser; Alec B Wilkens; Rosa Bacchetta; Anya Tsalenko; Douglas Dellinger; Laurakay Bruhn; Matthew H Porteus
Journal: Nat Biotechnol Date: 2015-06-29 Impact factor: 54.908

3. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Authors: Adam Siepel; Gill Bejerano; Jakob S Pedersen; Angie S Hinrichs; Minmei Hou; Kate Rosenbloom; Hiram Clawson; John Spieth; Ladeana W Hillier; Stephen Richards; George M Weinstock; Richard K Wilson; Richard A Gibbs; W James Kent; Webb Miller; David Haussler
Journal: Genome Res Date: 2005-07-15 Impact factor: 9.043

4. A method and server for predicting damaging missense mutations.

Authors: Ivan A Adzhubei; Steffen Schmidt; Leonid Peshkin; Vasily E Ramensky; Anna Gerasimova; Peer Bork; Alexey S Kondrashov; Shamil R Sunyaev
Journal: Nat Methods Date: 2010-04 Impact factor: 28.547

5. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements.

Authors: Elin Grundberg; Eshwar Meduri; Johanna K Sandling; Asa K Hedman; Sarah Keildson; Alfonso Buil; Stephan Busche; Wei Yuan; James Nisbet; Magdalena Sekowska; Alicja Wilk; Amy Barrett; Kerrin S Small; Bing Ge; Maxime Caron; So-Youn Shin; Mark Lathrop; Emmanouil T Dermitzakis; Mark I McCarthy; Timothy D Spector; Jordana T Bell; Panos Deloukas
Journal: Am J Hum Genet Date: 2013-10-31 Impact factor: 11.025

6. Enrichment of cis-regulatory gene expression SNPs and methylation quantitative trait loci among bipolar disorder susceptibility variants.

Authors: E R Gamazon; J A Badner; L Cheng; C Zhang; D Zhang; N J Cox; E S Gershon; J R Kelsoe; T A Greenwood; C M Nievergelt; C Chen; R McKinney; P D Shilling; N J Schork; E N Smith; C S Bloss; J I Nurnberger; H J Edenberg; T Foroud; D L Koller; W A Scheftner; W Coryell; J Rice; W B Lawson; E A Nwulia; M Hipolito; W Byerley; F J McMahon; T G Schulze; W H Berrettini; J B Potash; P P Zandi; P B Mahon; M G McInnis; S Zöllner; P Zhang; D W Craig; S Szelinger; T B Barrett; C Liu
Journal: Mol Psychiatry Date: 2012-01-03 Impact factor: 15.992

7. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines.

Authors: Jordana T Bell; Athma A Pai; Joseph K Pickrell; Daniel J Gaffney; Roger Pique-Regi; Jacob F Degner; Yoav Gilad; Jonathan K Pritchard
Journal: Genome Biol Date: 2011-01-20 Impact factor: 13.583

8. Selection and explosive growth alter genetic architecture and hamper the detection of causal rare variants.

Authors: Lawrence H Uricchio; Noah A Zaitlen; Chun Jimmie Ye; John S Witte; Ryan D Hernandez
Journal: Genome Res Date: 2016-05-18 Impact factor: 9.043

9. An integrated encyclopedia of DNA elements in the human genome.

Authors:
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

10. The UK10K project identifies rare variants in health and disease.

Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal: Nature Date: 2015-09-14 Impact factor: 49.962

97 in total

1. A Multiplexed Assay for Exon Recognition Reveals that an Unappreciated Fraction of Rare Genetic Variants Cause Large-Effect Splicing Disruptions.

Authors: Rocky Cheung; Kimberly D Insigne; David Yao; Christina P Burghard; Jeffrey Wang; Yun-Hua E Hsiao; Eric M Jones; Daniel B Goodman; Xinshu Xiao; Sriram Kosuri
Journal: Mol Cell Date: 2018-11-29 Impact factor: 17.970

2. Comprehensive In Vivo Interrogation Reveals Phenotypic Impact of Human Enhancer Variants.

Authors: Evgeny Z Kvon; Yiwen Zhu; Guy Kelman; Catherine S Novak; Ingrid Plajzer-Frick; Momoe Kato; Tyler H Garvin; Quan Pham; Anne N Harrington; Riana D Hunter; Janeth Godoy; Eman M Meky; Jennifer A Akiyama; Veena Afzal; Stella Tran; Fabienne Escande; Brigitte Gilbert-Dussardier; Nolwenn Jean-Marçais; Sanjarbek Hudaiberdiev; Ivan Ovcharenko; Matthew B Dobbs; Christina A Gurnett; Sylvie Manouvrier-Hanu; Florence Petit; Axel Visel; Diane E Dickel; Len A Pennacchio
Journal: Cell Date: 2020-03-12 Impact factor: 41.582

3. ORE identifies extreme expression effects enriched for rare variants.

Authors: F Richter; G E Hoffman; K B Manheimer; N Patel; A J Sharp; D McKean; S U Morton; S DePalma; J Gorham; A Kitaygorodksy; G A Porter; A Giardini; Y Shen; W K Chung; J G Seidman; C E Seidman; E E Schadt; B D Gelb
Journal: Bioinformatics Date: 2019-10-15 Impact factor: 6.937

4. OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data.

Authors: Felix Brechtmann; Christian Mertes; Agnė Matusevičiūtė; Vicente A Yépez; Žiga Avsec; Maximilian Herzog; Daniel M Bader; Holger Prokisch; Julien Gagneur
Journal: Am J Hum Genet Date: 2018-11-29 Impact factor: 11.025

5. De novo pattern discovery enables robust assessment of functional consequences of non-coding variants.

Authors: Hai Yang; Rui Chen; Quan Wang; Qiang Wei; Ying Ji; Guangze Zheng; Xue Zhong; Nancy J Cox; Bingshan Li
Journal: Bioinformatics Date: 2019-05-01 Impact factor: 6.937

Review 6. [Genetic diagnostics of retinal dystrophies : Breakthrough with new methods of DNA sequencing].

Authors: H J Bolz
Journal: Ophthalmologe Date: 2018-12 Impact factor: 1.059

7. Clinical and biological significance of a - 73A > C variation in the CDH1 promoter of patients with sporadic gastric carcinoma.

Authors: Baozhen Zhang; Jing Zhou; Zhaojun Liu; Liankun Gu; Jiafu Ji; Woo Ho Kim; Dajun Deng
Journal: Gastric Cancer Date: 2017-11-22 Impact factor: 7.370