
Large genomic region free of GWAS-based common variants contains fertility-related genes.

Rong Qiu, Chao Chen, Hong Jiang, Libing Shen, Min Wu, Chunyu Liu.

Abstract

DNA variants, such as single nucleotide polymorphisms (SNPs) and copy number variants (CNVs), are unevenly distributed across the human genome. Currently, dbSNP contains more than 6 million human SNPs, and whole-genome genotyping arrays can assay more than 4 million of them simultaneously. In our study, we first asked whether the assays used in published genome-wide association studies (GWASs) cover all regions of the genome well. Using dbSNP build 135 data, we identified 50 genomic regions longer than 100 kb that do not contain any common SNPs, i.e., those with minor allele frequency (MAF) ≥ 1%. Secondly, because conserved regions are generally of functional importance, we examined the genes in those large genomic regions without common SNPs. We found 97 genes, which were enriched for reproductive function. In addition, we filtered out regions with CNVs listed in the Database of Genomic Variants (DGV), segmental duplications from the Human Genome Project, and common variants identified by personal genome sequencing (UCSC). No region survived this filtering. Our analysis suggests that, while there may not be many large genomic regions free of common variants, there are still some "holes" in the current human genomic map for common SNPs. Because GWASs focus only on common SNPs, interpretation of GWAS results should take this limitation into account. In particular, two recent GWASs of fertility may be incomplete because of this deficit in the map. Additional SNP discovery efforts should pay close attention to these regions.

Year:  2013        PMID: 23613972      PMCID: PMC3629113          DOI: 10.1371/journal.pone.0061917

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The human genome contains millions of common SNPs, which are being deposited into public databases. These data have been used to design genome-wide association studies (GWASs) [1], [2], [3]. Common SNPs are better powered in association tests [4]. However, genomic regions not covered by common variants are neglected. Those neglected regions may contain variants with low frequencies and deserve more attention, because rare variants are even more likely to be functional than common ones [5]. In our study, we were interested in two questions: 1) whether the human genome is sufficiently covered by common SNPs and sufficiently captured by the common SNPs on standard GWAS platforms, and 2) whether any genes are located in the uncovered regions and, if so, which biological functions they are enriched for. To answer these two questions, we started by searching for regions without common SNPs, called common SNP-free regions (CSFRs), and for regions free of both common SNPs and CNVs, called common variant-free regions (CVFRs). Next, we explored the functional enrichment of genes identified in CSFRs and CVFRs. Using available personal genome sequencing data, we also examined whether these CSFRs and CVFRs contain common or rare variants.

Methods

Identification of CSFRs and CVFRs

Common SNPs (MAF≥1%) in dbSNP build 135, Genome Assembly Gaps and Genome Database refGene data were downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/) (Table 1). The CNV data were downloaded from the DGV (Table 1). Using the common SNP table, we calculated distances between adjacent common SNPs and subtracted regions containing the genome assembly gaps. If the remaining SNP intervals were longer than 100 kb, those intervals were defined as CSFRs. The CSFRs were further searched for CNVs. If, after subtracting regions containing CNVs, the intervals were still longer than 100 kb, those intervals were defined as CVFRs. We chose 100 kb as the threshold for detecting SNP-free regions because of the extent of linkage disequilibrium between SNPs: several groups have reported blocks of up to 100 kb in length exhibiting very strong linkage disequilibrium [6], [7].
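The interval procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names (`subtract`, `keep_long_pieces`, `snp_free_regions`) are hypothetical, intervals are treated as half-open `[start, end)` on a single chromosome, and "subtracting" a gap or CNV is interpreted as cutting it out of the interval and testing each remaining piece against the 100 kb threshold.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def subtract(interval: Interval, blockers: Iterable[Interval]) -> List[Interval]:
    """Cut every blocker (assembly gap or CNV) out of interval; return what is left."""
    start, end = interval
    pieces = []
    cursor = start
    for b_start, b_end in sorted(blockers):
        if b_end <= cursor or b_start >= end:
            continue  # blocker lies entirely outside the part still to be scanned
        if b_start > cursor:
            pieces.append((cursor, b_start))
        cursor = max(cursor, b_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


def keep_long_pieces(regions: Iterable[Interval], blockers: Iterable[Interval],
                     min_len: int = 100_000) -> List[Interval]:
    """Subtract blockers from each region and keep pieces longer than min_len."""
    blockers = list(blockers)
    kept = []
    for region in regions:
        kept.extend(p for p in subtract(region, blockers) if p[1] - p[0] > min_len)
    return kept


def snp_free_regions(snp_positions: Iterable[int], gaps: Iterable[Interval],
                     min_len: int = 100_000) -> List[Interval]:
    """Candidate CSFRs: stretches between adjacent common SNPs, minus assembly gaps."""
    positions = sorted(set(snp_positions))
    between = [(left + 1, right) for left, right in zip(positions, positions[1:])]
    return keep_long_pieces(between, gaps, min_len)
```

Under these assumptions, CVFRs would follow by calling `keep_long_pieces(csfrs, cnv_intervals)` on the CSFR list, which mirrors the two-stage filtering (SNPs and gaps first, then CNVs) described in the text.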
Table 1

Data Sources Used in This Study.

Data | URL | Version | Modified date | Data description and summary statistics
Common SNP Data in HapMap | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database | Human Genome assembly hg19 | 18-Dec-2011 | snp135Common.txt.gz. Total SNPs: 11,488,259 in chr1-chrY.
Genome Assembly Gaps data | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database | Human Genome assembly hg19 | 27-Apr-2009 | gap.txt.gz. Total gaps: 357 in chr1-chrY.
Genomes Unzipped data | http://www.genomesunzipped.org/download/ | Based on human genome hg18, upgraded to hg19 | 10-Oct-2010 | Total of 1923 SNPs in the chrY.9 sample, 546 common SNPs with maf>1%. With data for 9 personal genome sequences.
Personal genome variation data | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ | Based on Human Genome assembly hg19 | 21-Feb-2010 | Total of 9 personal genomes: pgNA12878.txt.gz, pgNA12891.txt.gz, pgNA12892.txt.gz, pgNA19240.txt.gz, pgSjk.txt.gz, pgVenter.txt.gz, pgWatson.txt.gz, pgYh1.txt.gz, pgYoruban3.txt.gz
DGV data | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ | Human Genome assembly hg19 | 07-Mar-2011 | dgv.txt.gz. Total 101605 in chr1-chrY.
Segmental duplication data | http://eichlerlab.gs.washington.edu/database.html | Human Genome assembly hg19 | 27-Jun-2011 | Inter pairs: 22980; intra pairs: 8763
Genes | http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz | Human Genome assembly hg19 | 21-May-2012 | Total number of genes is 42,742; after eliminating other chromosomes, 30,332 genes in chr1-chrY remain.

To assess the impact of our results on GWAS, we first determined whether the CSFRs are truly missed by Affymetrix Genome-Wide Human SNP Arrays. Next, we asked whether these regions included rare variants or were devoid of genetic variation. We analyzed common SNP data obtained from Genomes Unzipped (genomesunzipped.org) and the Personal Genome Variation tracks from the UCSC Genome Browser. These two datasets are collections of variants that have been identified in the sequencing of personal genomes (Table 1).

Identification of genes in CSFRs and CVFRs

Gene annotation data from the Human Genome assembly hg19 UCSC refGene was used to map coding genes in the CSFRs and CVFRs (Table 2). Genes were included if their transcription regions overlapped with the CSFRs/CVFRs by at least one base pair. When a gene had multiple splicing forms, we chose the longest splicing form to define the gene region.
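The gene-mapping rule above (longest splice form per gene, overlap of at least one base pair) can be illustrated with a small Python sketch. The function names and the tuple layout are hypothetical, and coordinates are assumed to be half-open `[start, end)`, as in UCSC refGene `txStart`/`txEnd`; this is not the authors' code.

```python
from typing import Dict, Iterable, List, Tuple

Transcript = Tuple[str, str, int, int]  # (gene, chrom, tx_start, tx_end), half-open
Region = Tuple[str, int, int]           # (chrom, start, end), half-open


def longest_isoforms(transcripts: Iterable[Transcript]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """For each (gene, chrom) pair keep the transcript with the longest span."""
    best: Dict[Tuple[str, str], Tuple[int, int]] = {}
    for gene, chrom, start, end in transcripts:
        key = (gene, chrom)
        if key not in best or end - start > best[key][1] - best[key][0]:
            best[key] = (start, end)
    return best


def genes_in_regions(transcripts: Iterable[Transcript],
                     regions: Iterable[Region]) -> Dict[Region, List[str]]:
    """Map each region to the sorted names of genes overlapping it by >= 1 bp."""
    genes = longest_isoforms(transcripts)
    hits: Dict[Region, List[str]] = {}
    for chrom, r_start, r_end in regions:
        hits[(chrom, r_start, r_end)] = sorted(
            gene
            for (gene, g_chrom), (g_start, g_end) in genes.items()
            # two half-open intervals share at least one base iff each starts before the other ends
            if g_chrom == chrom and g_start < r_end and r_start < g_end
        )
    return hits
```

Note that choosing the longest splice form can change the result: a gene whose short isoform lies just outside a region may still be counted if its longest isoform extends into it.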
Table 2

List of 50 common SNP-free regions containing 97 genes.

Chr | CSFR_start | CSFR_end | CSFR_size | Gene_name | Isochore_type
chr1 | 145883118 | 145989503 | 106385 | GPR89C, PDZK1P1 | Isochore_border
chr2 | 110524226 | 110704031 | 179805 | RGPD5, RGPD6, LIMS3, LIMS3-LOC440895, LIMS3L | Isochore
chr2 | 111191098 | 111347035 | 155937 | LIMS3-LOC440895, LIMS3, LIMS3L, RGPD6, RGPD5 | Isochore_border
chr7 | 74765724 | 74866460 | 100736 | GATSL2 | Isochore
chr9 | 39379250 | 39551456 | 172206 | LOC653501, ZNF658B | Unknown
chr9 | 39829606 | 39961804 | 132198 | FAM75A2, FAM75A1, FAM74A1 | Unknown
chr9 | 41497718 | 41635419 | 137701 | FAM75A5, FAM75A7, LOC653501, ZNF658B | Unknown
chr9 | 42743905 | 42847394 | 103489 | LOC286297 | Isochore_border
chr10 | 46799214 | 46907775 | 108561 | FAM35B | Isochore
chr10 | 48185336 | 48300420 | 115084 | LOC642826, AGAP9, FAM25B, FAM25G, FAM25C, ANXA8, ANXA8L1 | Isochore_border
chr16 | 33142890 | 33293778 | 150888 | TP53TG3, TP53TG3C, TP53TG3B | Isochore_border
chrX | 52098738 | 52395914 | 297176 | XAGE2, XAGE2B, XAGE1B, XAGE1A, XAGE1D, XAGE1C, XAGE1E | Unknown
chrX | 52445914 | 52568230 | 122316 | XAGE1A, XAGE1C, XAGE1E, XAGE1D, XAGE1B | Isochore_border
chrY | 4834281 | 4935713 | 101432 | PCDH11Y | Isochore_border
chrY | 5012892 | 5205540 | 192648 | PCDH11Y | Unknown
chrY | 5274434 | 5421065 | 146631 | PCDH11Y | Isochore_border
chrY | 6074690 | 6422524 | 347834 | TTTY23, TTTY23B, TSPY2, TTTY1B, TTTY1, TTTY2B, TTTY2, TTTY21, TTTY21B, TTTY7B, TTTY7, TTTY8B, TTTY8 | Isochore
chrY | 9381846 | 9492957 | 111111 | RBMY3AP | Isochore
chrY | 9524503 | 9768115 | 243612 | TTTY8, TTTY8B, TTTY7B, TTTY7, TTTY21, TTTY21B, TTTY2B, TTTY2, TTTY1, TTTY1B, TTTY22, TTTY23, TTTY23B | Isochore
chrY | 14691127 | 14804076 | 112949 | TTTY15 | Isochore_border
chrY | 19563894 | 20143885 | 579991 | FAM41AY1, FAM41AY2, LINC00230B, LINC00230A, XKRY, XKRY2, CDY2B, CDY2A | Unknown
chrY | 20193885 | 20834702 | 640817 | XKRY, XKRY2, LINC00230A, LINC00230B, FAM41AY1, FAM41AY2, HSFY2, HSFY1, TTTY9B, TTTY9A | Unknown
chrY | 20837553 | 21080706 | 243153 | TTTY9B, TTTY9A, HSFY2, HSFY1, NCRNA00185 | Unknown
chrY | 22564778 | 22665261 | 100483 | TTTY10 | Unknown
chrY | 23473201 | 23580342 | 107141 | RBMY2EP | Isochore_border
chrY | 23634362 | 23838234 | 203872 | RBMY1B, RBMY1A1, RBMY1E, RBMY1D, TTTY13 | Isochore_border
chrY | 23993156 | 24359930 | 366774 | RBMY1A1, RBMY1D, RBMY1B, RBMY1E, PRY, PRY2, TTTY6, TTTY6B, RBMY1F, RBMY1J | Isochore_border
chrY | 24500602 | 24620459 | 119857 | RBMY1F, RBMY1J, TTTY6B, TTTY6 | Unknown
chrY | 24620459 | 28160890 | 3540431 | PRY, PRY2, TTTY17B, TTTY17C, TTTY17A, TTTY4C, TTTY4B, TTTY4, BPY2B, BPY2, BPY2C, DAZ1, DAZ4, DAZ3, DAZ2, TTTY3B, TTTY3, CDY1, CDY1B, CSPG4P1Y, GOLGA2P2Y, GOLGA2P3Y | Isochore_border
chr9 | 42027732 | 42145811 | 118079 | | Isochore_border
chr9 | 44466205 | 44651655 | 185450 | | Isochore_border
chr9 | 45128500 | 45250203 | 121703 | | Isochore_border
chr9 | 65632583 | 65745692 | 113109 | | Isochore_border
chrY | 3016123 | 3134221 | 118098 | | Isochore
chrY | 3179117 | 3359419 | 180302 | | Isochore_border
chrY | 3833777 | 3966707 | 132930 | | Unknown
chrY | 3966708 | 4346934 | 380226 | | Unknown
chrY | 4466077 | 4593373 | 127296 | | Unknown
chrY | 4593411 | 4807708 | 214297 | | Unknown
chrY | 6482140 | 6677618 | 195478 | | Isochore_border
chrY | 7401836 | 7548914 | 147078 | | Unknown
chrY | 8214827 | 8334874 | 120047 | | Isochore_border
chrY | 15039955 | 15234829 | 194874 | | Unknown
chrY | 18248698 | 18381734 | 133036 | | Unknown
chrY | 18390543 | 18560004 | 169461 | | Isochore_border
chrY | 19375294 | 19500106 | 124812 | | Unknown
chrY | 22214221 | 22369679 | 155458 | | Isochore_border
chrY | 22419679 | 22564743 | 145064 | | Isochore_border
chrY | 23241568 | 23361665 | 120097 | | Isochore_border
chrY | 28160891 | 28509481 | 348590 | | Isochore_border

Pathway and functional analyses

The genes identified in the CSFRs/CVFRs were used to analyze their enrichment of biological functions through the Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf.gov/tools.jsp).

Isochore characterization

An isochore is a large region of DNA sequence with a relatively uniform GC content [8]. We used 100 kb as the length of each flanking region and a 2% GC difference as the indicator to classify each SNP-free region as an isochore, an isochore border, or an unknown region. All SNP-free regions in this study are longer than 100 kb. A CSFR was classified as an isochore if its GC content was 2% higher or lower than that of both its left and right flanking regions. A CSFR was classified as an isochore border if the difference in GC content between the two flanking regions was greater than 2% and this flank-to-flank difference was greater than the GC-content difference between the CSFR and its flanking regions. An unknown region is a CSFR that is neither an isochore nor an isochore border.
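The classification rule above can be written as a short decision function. The sketch below is one possible reading and not the authors' implementation: the names `gc_percent` and `classify` are hypothetical, differences are taken in percentage points, comparisons are strict, "higher or lower than both flanks" is read as the CSFR differing from both flanks in the same direction, and the flank-to-flank difference is required to exceed both CSFR-to-flank differences.

```python
def gc_percent(seq: str) -> float:
    """GC content in percent, ignoring non-ACGT characters such as N."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def classify(gc_left: float, gc_region: float, gc_right: float,
             threshold: float = 2.0) -> str:
    """Label a region from its GC content and that of its two 100 kb flanks."""
    above_both = gc_region - gc_left > threshold and gc_region - gc_right > threshold
    below_both = gc_left - gc_region > threshold and gc_right - gc_region > threshold
    if above_both or below_both:
        return "Isochore"
    d_flanks = abs(gc_left - gc_right)
    d_region = max(abs(gc_region - gc_left), abs(gc_region - gc_right))
    if d_flanks > threshold and d_flanks > d_region:
        return "Isochore_border"
    return "Unknown"
```

For example, with flanks at 38% and 44% GC and a region at 41%, the region sits on a gradient between two compositionally different neighbors and is labeled `Isochore_border` under this reading.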

Results

CSFRs and CVFRs identification

We identified 50 CSFRs distributed across eight chromosomes: chr1, chr2, chr7, chr9, chr10, chr16, chrX, and chrY. The Y chromosome carried the majority of these regions–33 in total (Table 2). After excluding the CNV regions, we identified 20 CVFRs distributed across two chromosomes: chrX and chrY. The Y chromosome still carried the majority, with 18 regions (Table 3).
Table 3

List of 20 common variant-free regions containing 20 genes.

chr | CVFR_start | CVFR_end | CVFR_size | gene_name
chrX | 52098738 | 52231295 | 132557 | XAGE2, XAGE2B
chrX | 52267361 | 52395914 | 128553 | XAGE2, XAGE2B
chrY | 4834281 | 4935713 | 101432 | PCDH11Y
chrY | 4935714 | 5205540 | 269826 | PCDH11Y
chrY | 5274434 | 5421065 | 146631 | PCDH11Y
chrY | 9524503 | 9640365 | 115862 | TTTY8, TTTY8B, TTTY7B, TTTY7, TTTY21, TTTY21B, TTTY2B, TTTY2, TTTY1, TTTY1B, TTTY22
chrY | 20228333 | 20599266 | 370933 | XKRY, XKRY2, LINC00230A, LINC00230B, FAM41AY1, FAM41AY2
chrY | 3016123 | 3134221 | 118098 |
chrY | 3179117 | 3359419 | 180302 |
chrY | 4114366 | 4346934 | 232568 |
chrY | 4466077 | 4593373 | 127296 |
chrY | 4593411 | 4807708 | 214297 |
chrY | 6577215 | 6677618 | 100403 |
chrY | 8214827 | 8334874 | 120047 |
chrY | 15039955 | 15234829 | 194874 |
chrY | 17559652 | 17661377 | 101725 |
chrY | 18248698 | 18381734 | 133036 |
chrY | 18390543 | 18560004 | 169461 |
chrY | 19375294 | 19500106 | 124812 |
chrY | 23247004 | 23361665 | 114661 |

We checked our results against the annotation data of the Affymetrix SNP Array 6.0. Within the CSFRs, the annotation file listed 25 SNPs, and only four of them had a non-zero minor allele frequency: rs11681529, rs2571764, rs2874557, and rs35516764. The other 20 are monomorphic in the four HapMap populations (Caucasian, African, Chinese and Japanese). Therefore, we concluded that most of these 50 large genomic regions have not been covered properly by the Affymetrix 6.0 Array, at least in the major populations investigated.

Genes in CSFRs and CVFRs and their functional enrichment

Ninety-seven genes overlapped with 28 of the 50 CSFRs (56%) (Table 2). DAVID was used to test whether particular GO terms were over-represented in the annotations of this gene set [9]. The genes were highly enriched for biological processes involved in sexual reproduction, spermatogenesis, male gamete generation, gamete generation, multicellular organism reproduction, and reproductive processes in a multicellular organism (p<0.05 and FDR q<0.05, Table 4). The gene set included a number of genes previously reported to be related to reproduction, including DAZ1 [10], [11], BPY2 [12], TSPY2 [11], CDY1 [13], CDY2A [13] and RBMY1 [11]. A gr/gr deletion polymorphism on the Y chromosome within those CSFRs has also been suggested to be a risk factor for spermatogenic impairment in some populations [14], [15].
Table 4

Top 6 GO terms from the functional annotation analysis of 97 CSFR genes by DAVID.

Category | Term | Count | % | P-Value | FDR
GOTERM_BP_FAT | sexual reproduction (1) | 9 | 14.8 | 0.00000003 | 0.000033
GOTERM_BP_FAT | spermatogenesis (2) | 8 | 13.1 | 0.000000047 | 0.000052
GOTERM_BP_FAT | male gamete generation (2) | 8 | 13.1 | 0.000000047 | 0.000052
GOTERM_BP_FAT | gamete generation (2) | 8 | 13.1 | 0.00000026 | 0.00028
GOTERM_BP_FAT | multicellular organism reproduction (2) | 8 | 13.1 | 0.0000011 | 0.0012
GOTERM_BP_FAT | reproductive process in a multicellular organism (2) | 8 | 13.1 | 0.0000011 | 0.0012

(1) Genes included RBMY1A1, RBMY1B, RBMY1J, RBMY1F, XKRY, XKRY2, BPY2C, BPY2B, BPY2, CDY1, CDY1B, CDY2B, CDY2A, DAZ2, DAZ3, DAZ4, DAZ1, and TSPY2.

(2) Genes included RBMY1A1, RBMY1B, RBMY1J, RBMY1F, BPY2C, BPY2B, BPY2, CDY1, CDY1B, CDY2B, CDY2A, DAZ2, DAZ3, DAZ4, DAZ1, and TSPY2.

Twenty genes overlapped with seven of the 20 CVFRs (35%) (Table 3). DAVID analysis was also performed on these 20 genes. However, these genes were not enriched for any biological function.

SNP-free regions from personal genome sequencing and segmental duplications

We further explored those SNP-free regions in personal genome variant data. Rare variants were detected in most of the CSFRs and CVFRs; only one region on the X chromosome (chrX: 52,267,361-52,395,914) remained. We also examined this region in the updated dbSNP database (dbSNP137, http://www.ncbi.nlm.nih.gov/), in which two more common SNPs were detected (rs201652812 and rs199865557). After subtracting them, the remaining region was 105 kb (chrX: 52,290,698-52,395,914), which was the final region not containing any known variant in all of the genome-wide sequencing data that we were able to collect. XAGE2 and its splicing isoforms are harbored in this region. We next tested this final region against the segmental duplication database from Eichler's lab (http://eichlerlab.gs.washington.edu/database.html) [7] and found that it overlapped with one of the segmental duplication regions. In summary, 49 CSFRs did carry SNPs in the Genomes Unzipped and Personal Genome Variation tracks, and the remaining X chromosome region did not contain any SNPs but overlapped with a segmental duplication region.

Twenty-four CSFRs are isochore borders

To explore the sequence properties of the 50 CSFRs, we characterized those regions by GC content. Differences in GC content can separate DNA sequences into compositionally fairly homogeneous regions [8]. By comparing GC content between CSFRs and their flanking regions, we found that twenty-four CSFRs belong to isochore border regions, seven belong to isochore regions, and eighteen are unknown regions (Table 2, Table S1).

Discussion

We performed a thorough search for large genomic regions that are free of common variants in dbSNP and found 50 CSFRs and 20 CVFRs. Most of these variant-free regions are located on the Y chromosome. Genes in the CSFRs were highly enriched for activities related to reproduction. Further investigation of sequenced personal genomes found that most of the CSFRs (49 out of 50) did contain rare SNPs, suggesting that those regions have not been covered well by existing common-variant discovery projects, such as the 1000 Genomes Project. GWAS is one of the most influential uses of common-variant data, but important findings might be missed because of its poor coverage of rare variants. Recently, two fertility GWASs were conducted but failed to find SNPs on the sex chromosomes [16], [17]. Both studies used the Affymetrix GWAS platforms that we evaluated in this study. However, both sex chromosomes have long been implicated in infertility, specifically in spermatogenic damage in mouse models and in human candidate gene/region studies [18]. Our study found that the genomic regions free of common variants carry many genes important to reproduction. With those important candidate genes missing, we must be cautious in interpreting fertility-related GWASs, which may produce false negatives. The most reliable CVFR call contains XAGE2 and its isoforms, which belong to the XAGE subfamily. XAGE2 is strongly expressed in normal testes and in some tumors [19]. Because genotyping platforms cannot fully cover structural variations such as segmental duplications, we further applied a structural variation filtering analysis and observed that the XAGE region overlapped with a segmental duplication. Based on these observations, we concluded that the observation of variant-free regions is more a coverage problem with the current versions of dbSNP and existing GWAS assay platforms than a lack of assayable variation.
When more genomes are sequenced, we may end up with proper coverage of the complete human genome by common SNPs. We mapped our SNPs on dbSNP build 135 and our regions on the GRCh37.p10 (hg19) reference assembly, which is the most accurate assembly version and incorporates all currently available genome knowledge. Compared with older versions, hg19 changed many genomic coordinates and included alternate haplotype assemblies for chr6 (7 haplotypes), chr4 (1 haplotype), and chr17 (1 haplotype). Coordinates can be converted between versions with the liftOver software (http://genome.ucsc.edu/cgi-bin/hgLiftOver). More details of the differences between versions are provided by NCBI (http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html). Further study can focus on the sequence properties of those regions and their conservation across species. Isochores are spatially heterogeneous in mammalian genomes and vary in replication timing, gene richness, recombination rate, etc. [20], [21], [22]. Natural selection is the most plausible explanation for the formation and maintenance of isochores [20]. We observed that nearly half of the CSFRs are isochores or isochore border regions, which hints that these CSFRs may be under different selection pressure from their neighboring regions. To further test selection pressure, we mapped those regions to chimpanzee and mouse using the synteny analysis from Ensembl (http://useast.ensembl.org/Homo_sapiens/Location/Synteny?r=6:133017695-133161157) and found that only 6 genes (RGPD5, RGPD6, GATSL2, FAM25G, HSFY1, HSFY2) can be mapped to unique regions in the other two species. Next we applied the dN/dS ratio test, which uses the ratio of substitution rates at non-synonymous and synonymous sites, and found that the human genes are under stronger purifying selection than the chimpanzee genes (paired t test, p = 0.01, Table S2). These results suggest that natural selection seems to be the major evolutionary force behind these variant-free regions.
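For readers unfamiliar with the paired t test mentioned above, the following Python sketch shows how the test statistic is formed from per-gene pairs of dN/dS values. The numbers are made-up placeholders chosen for easy arithmetic, not the values in Table S2, and the function name `paired_t` is hypothetical. Only the statistic and degrees of freedom are computed; obtaining a p-value additionally requires the t distribution (for example via `scipy.stats.ttest_rel`, which is not used here to keep the sketch dependency-free).

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical dN/dS values for six orthologous genes (illustration only).
human = [0.20, 0.35, 0.10, 0.50, 0.25, 0.40]
chimp = [0.30, 0.45, 0.15, 0.55, 0.40, 0.45]
t_stat, dof = paired_t(human, chimp)
print(f"t = {t_stat:.2f}, df = {dof}")
```

A negative t statistic in this setup means the first sample (here, the hypothetical human values) has lower dN/dS on average, which is the direction one would expect if those genes were under stronger purifying selection.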
In summary, by searching for large genomic regions free of common variants for the first time, we identified tens of common variant-free regions, most of which were located on the X and Y chromosomes. The genes located in CSFRs are enriched for fertility-related functions. After incorporating personal genome data, only one region was still free of variants, and it harbored the gene XAGE2, indicating that most of the detected regions are due to low coverage of rare variants. Future deep sequencing of more individuals and redesigned GWAS arrays should improve our understanding of the variability of these regions and their functional importance.

Supporting Information

Table S1. Isochore characterization of 50 CSFRs. (DOC)

Table S2. Evolution pressure of conserved genes by dN/dS ratio test. (DOC)
  22 in total

1.  Segmental duplications: organization and impact within the current human genome project assembly.

Authors:  J A Bailey; A M Yavor; H F Massa; B J Trask; E E Eichler
Journal:  Genome Res       Date:  2001-06       Impact factor: 9.043

2.  Polymorphism for a 1.6-Mb deletion of the human Y chromosome persists through balance between recurrent mutation and haploid selection.

Authors:  Sjoerd Repping; Helen Skaletsky; Laura Brown; Saskia K M van Daalen; Cindy M Korver; Tatyana Pyntikova; Tomoko Kuroda-Kawaguchi; Jan W A de Vries; Robert D Oates; Sherman Silber; Fulco van der Veen; David C Page; Steve Rozen
Journal:  Nat Genet       Date:  2003-10-05       Impact factor: 38.330

3.  Genome-wide evaluation of the public SNP databases.

Authors:  Ruhong Jiang; Jicheng Duan; Andreas Windemuth; J Claiborne Stephens; Richard Judson; Chuanbo Xu
Journal:  Pharmacogenomics       Date:  2003-11       Impact factor: 2.533

4.  IsoFinder: computational prediction of isochores in genome sequences.

Authors:  José L Oliver; Pedro Carpena; Michael Hackenberg; Pedro Bernaola-Galván
Journal:  Nucleic Acids Res       Date:  2004-07-01       Impact factor: 16.971

Review 5.  Genome-wide association studies: theoretical and practical concerns.

Authors:  William Y S Wang; Bryan J Barratt; David G Clayton; John A Todd
Journal:  Nat Rev Genet       Date:  2005-02       Impact factor: 53.242

6.  A quantitative trait locus for body fat on chromosome 1q43 in French Canadians: linkage and association studies.

Authors:  Brahim Aissani; Louis Perusse; Gilles Lapointe; Yvon C Chagnon; Luigi Bouchard; Brandon Walts; Claude Bouchard
Journal:  Obesity (Silver Spring)       Date:  2006-09       Impact factor: 5.002

7.  Genome-wide association study identifies candidate genes for male fertility traits in humans.

Authors:  Gülüm Kosova; Nicole M Scott; Craig Niederberger; Gail S Prins; Carole Ober
Journal:  Am J Hum Genet       Date:  2012-05-24       Impact factor: 11.025

8.  An isochore map of human chromosomes.

Authors:  Maria Costantini; Oliver Clay; Fabio Auletta; Giorgio Bernardi
Journal:  Genome Res       Date:  2006-04       Impact factor: 9.043

9.  The fine-scale structure of recombination rate variation in the human genome.

Authors:  Gilean A T McVean; Simon R Myers; Sarah Hunt; Panos Deloukas; David R Bentley; Peter Donnelly
Journal:  Science       Date:  2004-04-23       Impact factor: 47.728

10.  Fertility in mice requires X-Y pairing and a Y-chromosomal "spermiogenesis" gene mapping to the long arm.

Authors:  P S Burgoyne; S K Mahadevaiah; M J Sutcliffe; S J Palmer
Journal:  Cell       Date:  1992-10-30       Impact factor: 41.582
