Mehrdad Bakhtiari1, Jonghun Park1, Yuan-Chun Ding2, Sharona Shleizer-Burko3, Susan L Neuhausen2, Bjarni V Halldórsson4, Kári Stefánsson4, Melissa Gymrek1,3, Vineet Bafna5.
Abstract
Variable number tandem repeats (VNTRs) account for significant genetic variation in many organisms. In humans, VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by genomic pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks to genotype a VNTR in 18 seconds on 55X whole genome data, while maintaining high accuracy. We use adVNTR-NN to genotype 10,264 VNTRs in 652 GTEx individuals. Associating VNTR length with gene expression in 46 tissues, we identify 163 "eVNTRs". Of the 22 eVNTRs in blood where independent data is available, 21 (95%) are replicated in terms of significance and direction of association. 49% of the eVNTR loci show a strong and likely causal impact on the expression of genes and 80% have maximum effect size at least 0.3. The impacted genes are involved in diseases including Alzheimer's, obesity and familial cancers, highlighting the importance of VNTRs for understanding the genetic basis of complex diseases.
Year: 2021 PMID: 33824302 PMCID: PMC8024321 DOI: 10.1038/s41467-021-22206-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 VNTR performance.
a Length distribution of all known VNTRs (red) and selected targeted VNTRs (blue) across the GRCh38 human genome in base pairs. b The genotyping pipeline. c Neural network architecture for each VNTR, which uses a mapping of reads to a k-mer composition vector. d Improvement in running time after using the neural network and k-mer matching. e Accuracy and efficiency of read recruitment in simulated data. The scatter plot shows 1 − efficiency ((TP + FP)/R) and recall (TP/(TP + FN)) of classification with different methods. Efficiency is directly related to running time. Each of the 10,264 points represents a VNTR locus (Method) and is shown once for each method. The side and top panels show cumulative distributions of recall and 1 − efficiency. f Base pairs (log-scale) affected by VNTRs per individual in the GTEx cohort. Source data are provided as a Source Data file.
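The read representation in panel c and the two quantities plotted in panel e can be illustrated with a minimal Python sketch. It is not the adVNTR-NN implementation: the function names, the choice of k, the fixed ACGT ordering of k-mers, and the reading of R as the total number of reads considered are all assumptions made for illustration.

```python
from itertools import product


def kmer_composition(read: str, k: int = 3) -> list[int]:
    """Count each possible DNA k-mer in a read, in a fixed ACGT ordering.

    Hypothetical helper: the source only states that reads are mapped
    to a k-mer composition vector, not how the vector is laid out.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vector = [0] * len(index)
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in index:  # skip k-mers containing N or other symbols
            vector[index[kmer]] += 1
    return vector


def recruitment_metrics(tp: int, fp: int, fn: int, total_reads: int) -> tuple[float, float]:
    """Return (recall, selected_fraction) using the caption's formulas.

    recall            = TP / (TP + FN)
    selected_fraction = (TP + FP) / R, the quantity the caption labels
                        "1 - efficiency"; R is assumed to be the total
                        number of reads considered.
    """
    recall = tp / (tp + fn)
    selected_fraction = (tp + fp) / total_reads
    return recall, selected_fraction


if __name__ == "__main__":
    print(kmer_composition("ACGTACGT", k=2))
    print(recruitment_metrics(tp=90, fp=10, fn=10, total_reads=1000))
```

A vector of this kind has 4^k entries regardless of read length, which is what makes it usable as a fixed-size input to a small classifier.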
Fig. 2 Effect of VNTR genotypes on mediating gene expression.
a Location of target VNTRs and eVNTRs relative to the proximal genes. b Pipeline to identify eVNTRs and assign causality scores. Ancestry, sex, and PEER factors are included in C as covariates. We associate VNTR genotype with expression residuals after correcting for the effect of C. c Quantile-quantile plot showing p values of association signals separated by tissue. The green line represents the p values obtained using 100 permutations. d Number of unique and shared eVNTRs in each tissue. e Trend of RU count correlation with gene expression level. f Spearman correlation of eVNTR effect sizes for each pair of tissues. g Scatter plot of effect size versus minor allele frequency (MAF). Source data are provided as a Source Data file.
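The residual-based association in panel b can be sketched as follows, assuming NumPy is available. This is an illustrative reading of the caption, not the authors' pipeline: the function names are hypothetical, the covariate correction is assumed to be an ordinary least-squares fit, the effect size is taken as the standardized slope (the Pearson correlation), and the permutation step here yields a simple empirical p value rather than the FDR procedure used in the paper.

```python
import numpy as np


def residualize(y: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Remove the linear effect of covariates C (n x p) from y (n,)."""
    X = np.column_stack([np.ones(len(y)), C])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def association(genotype, expression, C, n_perm: int = 100, seed: int = 0):
    """Associate VNTR genotype with covariate-corrected expression.

    Returns (effect, p_perm): the standardized slope of residual
    expression on genotype, and an empirical two-sided p value from
    n_perm permutations of the genotype vector.
    """
    rng = np.random.default_rng(seed)
    resid = residualize(np.asarray(expression, float), np.asarray(C, float))
    g = np.asarray(genotype, float)
    g = (g - g.mean()) / g.std()
    r = (resid - resid.mean()) / resid.std()
    n = len(g)
    effect = float(g @ r / n)
    # Null distribution: break the genotype-expression link by shuffling.
    null = np.array([abs(rng.permutation(g) @ r / n) for _ in range(n_perm)])
    p_perm = (1 + np.sum(null >= abs(effect))) / (n_perm + 1)
    return effect, float(p_perm)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 200
    C = rng.normal(size=(n, 3))        # stand-ins for ancestry, sex, PEER factors
    g = rng.integers(2, 10, size=n)    # simulated repeat-unit counts
    y = 0.5 * g + C @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)
    print(association(g, y, C))
```

With 100 permutations the smallest attainable empirical p value is 1/101, so a finer null distribution would need more permutations.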
Replication of whole blood VNTRs in independent cohorts.
| # | Locus | Length | RU Length | Effect Size | Gene | Annotation | Replication: Icelandic | Replication: Geuvadis |
|---|---|---|---|---|---|---|---|---|
| 1 | chr1:21440112–21440147 | 35 | 6 | 0.43 | | UTR | Y | Y |
| 2 | chr2:24084339–24084414 | 75 | 25 | −0.12 | | UTR | Y | Y |
| 3 | chr2:25161573–25161616 | 43 | 9 | 0.22 | | Coding | Y | Y |
| 4 | chr2:112542424–112542500 | 76 | 25 | −0.18 | | Coding | Y | Y |
| 5 | chr3:56557249–56557289 | 40 | 20 | −0.12 | | Coding | Y | Y |
| 6 | chr6:13328502–13328532 | 30 | 6 | 0.12 | | UTR | Y | Y |
| 7 | chr7:64337190–64337240 | 50 | 13 | 0.09 | | UTR | Y | Y |
| 8 | chr8:86508719–86508765 | 46 | 23 | 0.13 | | UTR | Y | Y |
| 9 | chr10:102869497–102869605 | 108 | 36 | 0.22 | | Coding | Y | Y |
| 10 | chr21:46228815–46228863 | 48 | 9 | −0.03 | | UTR | Y | Y |
| 11 | chr17:75589192–75589228 | 36 | 6 | −0.06 | | Coding | Y | - |
| 12 | chr1:46609102–46609134 | 32 | 16 | 0.09 | | UTR | Y | N |
| 13 | chr5:80654880–80654954 | 74 | 9 | 0.04 | | Coding | Y | N |
| 14 | chr9:137063433–137063550 | 117 | 39 | −0.15 | | UTR | Y | N |
| 15 | chr14:61762420–61762454 | 34 | 17 | 0.03 | | UTR | Y | N |
| 16 | chr19:12577507–12577551 | 44 | 22 | −0.09 | | UTR | Y | N |
| 17 | chr21:41316673–41316756 | 83 | 13 | −0.19 | | UTR | Y | N |
| 18 | chr22:37805258–37805313 | 55 | 6 | 0.11 | | UTR | Y | N |
| 19 | chr1:202187007–202187042 | 35 | 7 | 0.06 | | UTR | N | Y |
| 20 | chr17:18208488–18208544 | 56 | 7 | −0.13 | | UTR | N | Y |
| 21 | chr17:76564106–76564152 | 46 | 9 | 0.11 | | UTR | - | N |
| 22 | chr17:56978047–56978107 | 60 | 20 | 0.15 | | UTR | N | N |
| 23 | chr6:30163542–30163579 | 37 | 12 | 0.14 | | UTR | - | - |
Each row describes an eVNTR in whole blood from the GTEx project (n = 652 individuals) identified with false discovery rate (FDR) < 0.05 based on 100 permutations. The replication columns indicate whether the signal was replicated, with the same direction of effect and FDR < 0.05, in whole blood from the Icelandic cohort (903 samples) and in lymphoblastoid cell lines from the Geuvadis cohort (462 samples). For the Icelandic cohort, only the VNTRs that showed significant associations in GTEx were tested, using unmapped reads plus reads mapped to those specific loci. Hence, we used the conservative p value cutoff from the smaller GTEx cohort. Length and RU length refer to the total length and the repeat-unit length of the VNTR, respectively.
Fig. 3 Effect of VNTR genotypes on mediating gene expression.
a Association of AS3MT VNTR genotype with gene expression in brain cortex (n = 148 samples, Fisher’s two-sided P: 2.78 × 10⁻¹²). Box plots display the median, 25th and 75th percentiles. b Association with gene expression (upper panel) and CAVIAR causality probability of proximal SNPs: all SNPs in a 100 kbp window on either side of the AS3MT VNTR (red star). c Location of AS3MT VNTR relative to known regulatory elements. d, e Association with gene expression of the POMC VNTR (n = 378 samples, Fisher’s two-sided P: 1.53 × 10⁻⁹) and its causality probability relative to proximal SNPs. Box plots display the median, 25th and 75th percentiles. f Location of POMC VNTR relative to other regulatory regions and its spatial proximity with the promoter region revealed via Hi-C. g, h Association with gene expression of the ZNF232 VNTR (n = 114 samples, Fisher’s two-sided P: 5.47 × 10⁻⁹) and its causality score relative to proximal SNPs. Box plots display the median, 25th and 75th percentiles. Source data are provided as a Source Data file.