A definitive haplotype map of structural variations determined by microarray analysis of duplicated haploid genomes.

Tomoko Tahira¹, Koji Yahara², Yoji Kukita³, Koichiro Higasa⁴, Kiyoko Kato⁵, Norio Wake⁵, Kenshi Hayashi¹.

Abstract

Complete hydatidiform moles (CHMs) are tissues carrying duplicated haploid genomes derived from single sperms, and detecting copy number variations (CNVs) in CHMs is assumed to be a sensitive and straightforward method. We genotyped 108 CHM genomes using Affymetrix SNP 6.0 (GEO#: GSE18642) and Illumina 1 M-duo (GEO#: GSE54948). After quality control, we obtained 84 definitive haplotypes consisting of 1.7 million SNPs and 2339 CNV regions. The results are presented in the database on our web site (http://orca.gen.kyushu-u.ac.jp/cgi-bin/gbrowse/humanBuild37D4_1/).

Keywords: Complete hydatidiform moles; Copy Number Variation; Definitive haplotypes; LD-bin; Single nucleotide polymorphism

Year: 2014 PMID: 26484070 PMCID: PMC4535890 DOI: 10.1016/j.gdata.2014.04.006

Source DB: PubMed Journal: Genom Data ISSN: 2213-5960

Direct link to deposited data

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18642 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54948

Experimental design, materials and methods

Samples

Complete hydatidiform mole tissues dissected from patients and the blood sample of one patient served as sources of DNA for array hybridization experiments, as described previously [1]. Informed consent was obtained from each donor. This study was approved by the Institutional Review Board (Ethical Committee of Kyushu University).

SNP genotyping

The raw data files of Affymetrix SNP 6.0 arrays (CEL files) and sample attribute files of 94 CHM samples and one blood sample that had passed quality control in the previous study [1] were reanalyzed by Birdseed v2 of Genotyping Console 4.1.1.834 (GTC 4.1), together with CEL files and sample attribute files of 45 HapMap-JPT samples (obtained from Affymetrix). The locations of markers in GRCh37 genome coordinates were taken from GenomeWideSNP_6.na32, which was obtained from Affymetrix. A total of 905,025 SNP genotypes (excluding chromosome Y and mitochondria) were obtained, at an initial average call rate for the 94 CHMs of 99.2%. Array hybridization experiments using Illumina 1 M-duo were performed for 98 CHM samples that included the 94 samples and one blood sample mentioned above by previously described procedures [1]. The genotypes were called using the GenTrain 2.0 cluster algorithm of Genome Studio 2011.1 (Illumina). Human1M-Duov3_H.egt (based on GRCh37) was used as the manifest file and Human1M-Duov3_H.bpm as the cluster file. The initial average call rate was 99.5%.

Copy number analysis

The CEL files of Affymetrix arrays were subjected to the Copy Number/LOH analysis module of GTC 4.1 without regional GC correction. The 94 CHM samples, the one blood sample mentioned above and four male samples from HapMap JPT (NA18940, NA18943, NA18944 and NA18945) served as references to obtain “Log2Ratio” (abbreviated as log2R in this paper) data. Then, the data for markers on chromosome Y and mitochondria were excluded and the remaining data were exported as CNCHP.txt. The “log R Ratio” (abbreviated as logRR in this paper) data of Illumina arrays were calculated by Genome Studio 2011.1 using the cluster file (Human1M-Duov3_H.bpm) as a reference.

Results and discussion

SNP genotyping of haploid samples

CHM genomes are expected to be homozygous genome-wide. However, the genotypes obtained by the two systems included small fractions (0.27% of Affymetrix calls and 0.01% of Illumina calls) of heterozygous calls. The dramatic increase in heterozygous calls for markers at lower relative signal intensities (log2R of Affymetrix arrays and logRR of Illumina arrays) indicated that these calls were falsely made for markers in (homozygously) deleted regions, where no genotypes should be called, although some of them might be ascribed to markers in divergent paralogous regions (Fig. 1). These findings provided us with an additional quality control measure for SNP genotype calling, namely forcing all calls at log2R < −0.6 (0.88% of Affymetrix calls) or logRR < −1 (0.17% of Illumina calls) to no-calls. We also removed 164 SNPs from the Illumina calls because they were duplicated (i.e., two SNPs at the same position). Subsequently, SNPs with a call rate of less than 90% were removed. After these quality control steps, 84 CHMs, whose SNP genotypes were called at greater than 96% by both platforms, remained.
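
The haploid quality control rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function names and the (genotype, intensity) input layout are assumptions made for the example, while the two thresholds are the ones quoted in the text.

```python
# Illustrative sketch of the haploid quality control (HQC) rule.
# Assumption: each call is a (genotype, intensity) pair, where genotype is a
# two-letter string such as "AA" or "AB", or None for an existing no-call.

AFFY_LOG2R_MIN = -0.6      # Affymetrix log2R threshold quoted in the text
ILLUMINA_LOGRR_MIN = -1.0  # Illumina logRR threshold quoted in the text


def haploid_qc(calls, min_intensity):
    """Return genotypes with heterozygous or weak-signal calls set to None."""
    cleaned = []
    for genotype, intensity in calls:
        if genotype is None:
            cleaned.append(None)      # already a no-call
        elif intensity < min_intensity:
            cleaned.append(None)      # weak signal: possibly a deleted region
        elif genotype[0] != genotype[1]:
            cleaned.append(None)      # heterozygous call in a haploid genome
        else:
            cleaned.append(genotype)
    return cleaned


def call_rate(genotypes):
    """Fraction of markers that still carry a genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


# Toy Affymetrix-style input (made-up values).
example = [("AA", 0.1), ("AB", -0.2), ("BB", -0.9), ("BB", 0.0), (None, 0.3)]
cleaned = haploid_qc(example, AFFY_LOG2R_MIN)
print(cleaned, call_rate(cleaned))
```

The same `call_rate` helper could be applied per SNP or per sample to mimic the 90% and 96% filters, respectively.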

Fig. 1

Increased heterozygosity of calls at a low signal intensity.

Genotype calls at relative signal intensities where heterozygosity was approximately 1% (horizontal red dotted lines) or greater were regarded as containing a significant fraction of unreliable calls. Blue horizontal lines indicate the fraction of cumulative calls at the reliability thresholds.
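
A hedged sketch of how a plot such as Fig. 1 could be tabulated: calls are grouped into bins of relative signal intensity and the heterozygous fraction is computed per bin. The bin width, function names and toy data are assumptions for illustration; only the 1% heterozygosity level comes from the legend.

```python
import math
from collections import defaultdict


def het_fraction_by_bin(calls, bin_width):
    """Map intensity-bin index -> fraction of heterozygous calls in that bin.

    Bin index i covers intensities in [i * bin_width, (i + 1) * bin_width).
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [het calls, all calls]
    for genotype, intensity in calls:
        idx = math.floor(intensity / bin_width)
        counts[idx][1] += 1
        if genotype[0] != genotype[1]:
            counts[idx][0] += 1
    return {idx: het / total for idx, (het, total) in sorted(counts.items())}


def unreliable_bins(fractions, max_het=0.01):
    """Bins whose heterozygous fraction reaches the 1% level of the legend."""
    return [idx for idx, frac in fractions.items() if frac >= max_het]


# Toy calls (made-up values): heterozygous calls cluster at low intensity.
calls = [("AA", 0.1), ("AA", 0.3), ("AB", -0.7),
         ("AA", -0.8), ("BB", -0.2), ("AB", -0.9)]
fractions = het_fraction_by_bin(calls, bin_width=0.5)
flagged = unreliable_bins(fractions)
print(fractions, flagged)
```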

The genotypes of both platforms were compared using the merge function of PLINK version 1.07 [2], which revealed considerable strand inconsistencies between the two platforms. We flipped the strands of the Illumina data for these SNPs to resolve the inconsistency with the Affymetrix annotation. After these corrections, the fraction of discordant calls was 1.05 × 10⁻⁵; these calls were forced to no-calls at the merge (Fig. 2).
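
The strand flip and merge step can be illustrated with a minimal Python sketch. It does not reproduce PLINK's implementation; the dictionaries, the `flipped_snps` set and the function names are hypothetical, and the list of SNPs to flip is assumed to be known in advance.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(genotype):
    """Complement each allele, i.e. report the genotype on the other strand."""
    return "".join(COMPLEMENT[base] for base in genotype)


def merge_calls(affy, illumina, flipped_snps):
    """Merge two {snp_id: genotype or None} maps.

    Illumina genotypes of SNPs in `flipped_snps` are strand-flipped first.
    Discordant calls are forced to None (no-call).
    Returns (merged, n_discordant, n_compared).
    """
    merged = {}
    discordant = compared = 0
    for snp in sorted(set(affy) | set(illumina)):
        a = affy.get(snp)
        b = illumina.get(snp)
        if b is not None and snp in flipped_snps:
            b = flip(b)
        if a is not None and b is not None:
            compared += 1
            if a == b:
                merged[snp] = a
            else:
                discordant += 1
                merged[snp] = None
        else:
            merged[snp] = a if a is not None else b
    return merged, discordant, compared


# Toy data (made-up SNP identifiers and genotypes).
affy = {"rs1": "AA", "rs2": "GG", "rs3": None, "rs4": "CC"}
illumina = {"rs1": "TT", "rs2": "GG", "rs3": "CC", "rs4": "TT", "rs5": "AA"}
merged, n_discordant, n_compared = merge_calls(affy, illumina, {"rs1"})
print(merged, n_discordant / n_compared)
```

Note that A/T and C/G SNPs look identical after complementation of homozygous pairs only when both alleles are swapped consistently, so strand errors at such markers cannot be detected from the genotypes alone and need annotation.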

Fig. 2

Overview of SNP genotyping and its quality control.

*HQC: haploid quality control, that is, heterozygous calls and weak signal calls were forced to no calls. See text for detail.

Linkage disequilibrium, LD bins and tagSNPs

The pair-wise r² values between merged SNP markers whose minor allele frequencies were at least 5% (common SNPs) were calculated at a maximum inter-marker distance of 300 kb. LD bins were determined at a threshold of r² ≥ 0.80 by TagZilla version 1.0 (http://tagzilla.nci.nih.gov/). The program estimates LD bins using a greedy maximal approach similar to that of ldSelect [3]. As a result, 1,115,537 common SNPs were grouped into 366,214 LD bins, of which 189,417 were single-SNP bins. That left 17% of common SNPs without proxies. TagSNPs (representative SNPs for each bin) were selected by the TagZilla criterion “avesnp”, that is, having the maximum average r² with all other SNPs in the bin.
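
Because each CHM contributes one phase-known haplotype, r² can be computed directly from allele counts. The sketch below shows this calculation together with a simple greedy binning in the spirit of ldSelect and an "average r²" tag choice. It is not TagZilla's code: the function names, the 0/1 allele coding and the tie-breaking rules are assumptions made for the example.

```python
def r_squared(x, y):
    """r² between two biallelic sites coded 0/1 over the same haplotypes."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        return 0.0                      # a monomorphic site carries no LD
    d = p_xy - p_x * p_y                # disequilibrium coefficient D
    return d * d / denom


def greedy_ld_bins(haplotypes, positions, threshold=0.8, max_dist=300_000):
    """Greedy binning: repeatedly seed a bin with the SNP that has the most
    unbinned partners at r² >= threshold within max_dist."""
    snps = list(haplotypes)
    partners = {s: set() for s in snps}
    for i, a in enumerate(snps):
        for b in snps[i + 1:]:
            close = abs(positions[a] - positions[b]) <= max_dist
            if close and r_squared(haplotypes[a], haplotypes[b]) >= threshold:
                partners[a].add(b)
                partners[b].add(a)
    unbinned = set(snps)
    bins = []
    while unbinned:
        seed = max(sorted(unbinned), key=lambda s: len(partners[s] & unbinned))
        members = {seed} | (partners[seed] & unbinned)
        bins.append(sorted(members))
        unbinned -= members
    return bins


def best_tag(members, haplotypes):
    """SNP with the highest mean r² against the other members of its bin."""
    if len(members) == 1:
        return members[0]

    def mean_r2(snp):
        others = [m for m in members if m != snp]
        return sum(r_squared(haplotypes[snp], haplotypes[m]) for m in others) / len(others)

    return max(members, key=mean_r2)


# Toy data: four SNPs typed on eight haplotypes (made-up values).
haps = {
    "s1": [0, 0, 0, 0, 1, 1, 1, 1],
    "s2": [0, 0, 0, 0, 1, 1, 1, 1],
    "s3": [0, 0, 0, 1, 1, 1, 1, 1],
    "s4": [0, 1, 0, 1, 0, 1, 0, 1],
}
pos = {"s1": 100, "s2": 200, "s3": 300, "s4": 400}
bins = greedy_ld_bins(haps, pos)
print(bins, [best_tag(b, haps) for b in bins])
```

In this toy case s3 and s4 end up in single-SNP bins, the situation that the text describes for 17% of common SNPs.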

CNV segments and CNV regions

B allele frequency (BAF) at heterozygous sites has been commonly used as an indicator of CNV in Illumina array data obtained from diploid materials. However, it is not an appropriate indicator in this study, because all SNPs in our duplicated haploid samples are expected to be homozygous genome-wide. Relative signal intensity of markers is therefore the only variable available for detecting copy number changes, which we detected using the circular binary segmentation algorithm implemented in the R package DNAcopy 1.26 with default parameters [4]. Since the distributions of log2R and logRR differed widely, a combined interpretation of the two data sets was inappropriate. Therefore, the segmentation analysis of the two data sets was carried out separately. Fig. 3 shows the distribution of mean relative signal intensities of segments defined by the two data sets (Affymetrix and Illumina). As shown in the figure, distinct peaks were observed in the regions below zero, apparently distinguishing deletion segments from normal copy segments. We defined the boundary between the two copy number states at the inflection points of cumulative segment coverage in each data set. Thus, segments having mean log2R < −1 for Affymetrix and mean logRR < −2 for Illumina were defined as losses, which accounted for 0.02–0.03% of the genome. The thresholds for defining gain segments could not be distinguished from the plots, and we arbitrarily placed the boundary at 0.5 for both data sets. Then, any CNV segment that extended across a centromere was split at the centromere. The segments were filtered so that all of them were larger than 50 bp. The numbers of CNV segments defined by the two platforms are summarized in Table 1.
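
The post-segmentation rules (thresholding, centromere splitting and size filtering) can be sketched as follows. The segmentation itself (DNAcopy) is not reproduced here. The thresholds are those stated in the text; treating the centromere as a single coordinate, using a strict "> 0.5" for gains, and the function names are assumptions for illustration.

```python
LOSS_THRESHOLD = {"affymetrix": -1.0, "illumina": -2.0}  # from the text
GAIN_THRESHOLD = 0.5                                     # arbitrary, per the text
MIN_SIZE_BP = 50                                         # segments must exceed this


def classify(mean_intensity, platform):
    """Assign a copy number state to a segment from its mean intensity."""
    if mean_intensity < LOSS_THRESHOLD[platform]:
        return "loss"
    if mean_intensity > GAIN_THRESHOLD:
        return "gain"
    return "normal"


def call_cnv_segments(segments, platform, centromere):
    """segments: (start, end, mean_intensity) tuples on one chromosome.

    Returns (start, end, state) for loss/gain segments, split at the
    centromere (assumed here to be a single coordinate) and size-filtered.
    """
    calls = []
    for start, end, mean in segments:
        state = classify(mean, platform)
        if state == "normal":
            continue
        if start < centromere < end:
            pieces = [(start, centromere), (centromere, end)]
        else:
            pieces = [(start, end)]
        for s, e in pieces:
            if e - s > MIN_SIZE_BP:
                calls.append((s, e, state))
    return calls


# Toy segments (made-up coordinates and intensities).
segments = [(1000, 5000, -1.4), (6000, 6040, -1.8),
            (9000, 12000, 0.7), (20000, 30000, -0.2)]
affy_calls = call_cnv_segments(segments, "affymetrix", centromere=10000)
illumina_calls = call_cnv_segments(segments, "illumina", centromere=10000)
print(affy_calls, illumina_calls)
```

The same intensities give different calls on the two scales, which is why the two data sets have to be thresholded separately.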

Fig. 3

Distribution of copy number segments in bins of mean relative signal intensities.

See text for detail.

Table 1

CNV segments defined by the two platforms.

Platform	Loss (per genome)	Gain (per genome)
Affymetrix SNP 6.0	6517 (78)	1444 (17)
Illumina 1 M-duo	4597 (55)	39 (0.5)

The definition of gain CNV segments is arbitrary. See text for detail.
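
The per-genome values in Table 1 are consistent with dividing the segment totals by the 84 CHMs that passed quality control. That denominator is an inference from the numbers rather than something the table states, and the short check below is only an illustration of it.

```python
# Assumption: per-genome values are totals divided by the 84 retained CHMs.
N_GENOMES = 84

segment_totals = {
    "Affymetrix SNP 6.0": {"loss": 6517, "gain": 1444},
    "Illumina 1 M-duo": {"loss": 4597, "gain": 39},
}

per_genome = {
    platform: {state: total / N_GENOMES for state, total in counts.items()}
    for platform, counts in segment_totals.items()
}

for platform, values in per_genome.items():
    print(platform, {state: round(v, 1) for state, v in values.items()})
```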

The concordance of CNV segment calls between the two platforms was examined using the “intersect” function of BEDTools version 2.11.2 [5], with a minimal overlap of one bp. The results revealed that in some genomic regions, mutually exclusive subsets of samples were judged to carry CNV segments of opposite directions (gain by Affymetrix versus loss by Illumina). We also found that fewer than half of the segments detected by the two arrays overlapped (Fig. 4a). These apparent discrepancies are at least partly attributable to differences between the two systems in the definition of reference intensities used to calculate relative signal intensity and in the distribution of markers, as discussed previously [1]. However, a good size correlation between overlapping segments was observed for segments longer than 10 kb, although some discrepancies caused by splitting/fusion of overlapping regions between the two platforms were observed even in long segments (Fig. 4b).
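
The overlap test behind this comparison can be written out in plain Python. This is a sketch of the interval logic only, not of BEDTools itself; it assumes BED-style half-open (chrom, start, end) coordinates, and the function names and toy intervals are made up.

```python
def overlap_bp(a, b):
    """Overlap in bp between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def intersect(segs_a, segs_b, min_overlap=1):
    """Pairs of (chrom, start, end) segments sharing at least min_overlap bp."""
    return [
        (a, b)
        for a in segs_a
        for b in segs_b
        if a[0] == b[0] and overlap_bp(a[1:], b[1:]) >= min_overlap
    ]


def fraction_overlapped(segs_a, segs_b, min_overlap=1):
    """Fraction of segments in segs_a that overlap any segment in segs_b."""
    hit = {a for a, _ in intersect(segs_a, segs_b, min_overlap)}
    return len(hit) / len(segs_a)


# Toy segments (made-up coordinates).
affy = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
illumina = [("chr1", 150, 250), ("chr1", 400, 500), ("chr2", 500, 600)]
pairs = intersect(affy, illumina)
print(pairs, fraction_overlapped(affy, illumina))
```

With half-open coordinates, the book-ended pair (300, 400) and (400, 500) shares no base and is therefore not counted as overlapping.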

Fig. 4

Overlap and size correlation of CNV segments detected by two platforms.

a. The concordant calls of CNV segments between Affymetrix system and Illumina system were examined without distinguishing gains or losses, as detailed in the text. b. The lengths of overlapped CNV segments detected in the Affymetrix (abscissa) and Illumina (ordinate) systems are plotted.

Next we defined CNV regions as merges of CNV segments across CHM samples without discriminating gains or losses. The results revealed a total of 2339 CNV regions that occupied 1.4% of the genome.
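
Merging CNV segments across samples into CNV regions is an interval-union operation, sketched below. Whether book-ended segments were merged in the original analysis is not stated, so merging them here is an assumption, as are the function names and toy coordinates.

```python
def merge_regions(segments):
    """Union of (chrom, start, end) segments pooled over samples and platforms.

    Overlapping or book-ended segments on the same chromosome are merged
    (an assumption of this sketch).
    """
    merged = []
    for chrom, start, end in sorted(segments):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


def covered_bp(regions):
    """Total number of bases covered by non-overlapping regions."""
    return sum(end - start for _, start, end in regions)


# Toy segments from several hypothetical samples.
pooled = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600),
          ("chr2", 100, 200), ("chr1", 300, 350)]
regions = merge_regions(pooled)
print(regions, covered_bp(regions))
```

Dividing `covered_bp(regions)` by the genome length would give the kind of coverage figure quoted above.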

Definitive Haplotype Database (D-HaploDB)

The results of the SNP genotyping and CNV analyses described above are comprehensively presented in tracks (listed below) of D-HaploDB version 4.1 (http://orca.gen.kyushu-u.ac.jp), which uses Generic Genome Browser version 1.64 [6]. The genome coordinates follow GRCh37. A screenshot of an example page of the database is shown in Fig. 5.

Fig. 5

Screen capture of D-Haplo D4.1 glutathione S-transferase theta 1 region.

CNV segments of gain or loss were detected by the Affymetrix or Illumina systems, respectively, for mutually exclusive subsets of CHM samples. CNV segments of only a portion of the samples are shown for ease of viewing.

CHMSNPs_D4.1: Merged SNPs genotyped using Affymetrix and Illumina platforms, and validated. Individual genotypes and allele counts are viewable by clicking the glyphs.

Affymetrix SNP 6.0: Positions of Affymetrix markers are shown, with distinction of SNP probes (red) and CN probes (black).

Illumina 1 M-duo: Positions of Illumina markers are shown, with distinction of SNP probes (red) and intensity only probes (black).

LD_bin_D4.1 (MAF ≥ 5%): The pair-wise r² tagging at r² ≥ 0.8 using the TagZilla 1.0 program was done for SNPs whose minor allele frequencies were at least 5%. The best-tags (i.e., the tagSNP that showed the highest average r² against the remaining members within the bin) are highlighted in red. Details containing SNP and haplotype information are viewable by clicking the glyphs.

r-square (MAF ≥ 5%): The r² values from high to low between all combinations of markers within the selected regions are graphically shown by deep to shallow red.

CHM_CNVR: CNV regions (CNVRs) in CHMs were defined as merges of CNV segments across all CHM samples. Thus, these are the regions where CNV segments were detected by either Affymetrix or Illumina platforms at least in one CHM.

CHM#: CNV segments in each CHM sample (indicated by #) are shown with distinctions of losses (red) or gains (blue), and Affymetrix (dark) or Illumina (light).

In addition, some external data are incorporated and presented in tracks, to facilitate further interpretation of our data. Those are cytobands, genes, transcripts, segmental duplications and CNV data of Conrad et al. [7], HapMap3 [8], and Park et al. [9].

Specifications
Organism/cell line/tissue	Homo sapiens/complete hydatidiform moles (CHMs)
Sex	Duplicated haploids whose genomes are from single sperms harboring X
Sequencer or array type	Affymetrix SNP 6.0 and Illumina 1 M-duo
Data format	Affymetrix: raw data, CEL files; normalized data, SOFT, MINIML and TXT. Illumina: raw data, GSE54948_signal_intensities.txt.gz; normalized data, SOFT, MINIML, TXT and GSE54948_matrix_processed.txt.gz
Experimental factors	Single nucleotide polymorphism (SNP), copy number variation (CNV), LD-bin, CNV segments, CNV regions, definitive haplotypes
Experimental features	Whole genome SNP/CNV haplotyping of 84 duplicated haploid samples
Consent	All patients (donors) gave their written informed consent before study entry.
Sample source location	Japan

References

1. A definitive haplotype map as determined by genotyping duplicated haploid genomes finds a predominant haplotype preference at copy-number variation events.

Authors: Yoji Kukita; Koji Yahara; Tomoko Tahira; Koichiro Higasa; Miki Sonoda; Ken Yamamoto; Kiyoko Kato; Norio Wake; Kenshi Hayashi
Journal: Am J Hum Genet Date: 2010-05-27 Impact factor: 11.025

2. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

3. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium.

Authors: Christopher S Carlson; Michael A Eberle; Mark J Rieder; Qian Yi; Leonid Kruglyak; Deborah A Nickerson
Journal: Am J Hum Genet Date: 2003-12-15 Impact factor: 11.025

4. A faster circular binary segmentation algorithm for the analysis of array CGH data.

Authors: E S Venkatraman; Adam B Olshen
Journal: Bioinformatics Date: 2007-01-18 Impact factor: 6.937

5. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

6. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

7. Origins and functional impact of copy number variation in the human genome.

Authors: Donald F Conrad; Dalila Pinto; Richard Redon; Lars Feuk; Omer Gokcumen; Yujun Zhang; Jan Aerts; T Daniel Andrews; Chris Barnes; Peter Campbell; Tomas Fitzgerald; Min Hu; Chun Hwa Ihm; Kati Kristiansson; Daniel G Macarthur; Jeffrey R Macdonald; Ifejinelo Onyiah; Andy Wing Chun Pang; Sam Robson; Kathy Stirrups; Armand Valsesia; Klaudia Walter; John Wei; Chris Tyler-Smith; Nigel P Carter; Charles Lee; Stephen W Scherer; Matthew E Hurles
Journal: Nature Date: 2009-10-07 Impact factor: 49.962

8. Integrating common and rare genetic variation in diverse human populations.

Authors: David M Altshuler; Richard A Gibbs; Leena Peltonen; David M Altshuler; Richard A Gibbs; Leena Peltonen; Emmanouil Dermitzakis; Stephen F Schaffner; Fuli Yu; Leena Peltonen; Emmanouil Dermitzakis; Penelope E Bonnen; David M Altshuler; Richard A Gibbs; Paul I W de Bakker; Panos Deloukas; Stacey B Gabriel; Rhian Gwilliam; Sarah Hunt; Michael Inouye; Xiaoming Jia; Aarno Palotie; Melissa Parkin; Pamela Whittaker; Fuli Yu; Kyle Chang; Alicia Hawes; Lora R Lewis; Yanru Ren; David Wheeler; Richard A Gibbs; Donna Marie Muzny; Chris Barnes; Katayoon Darvishi; Matthew Hurles; Joshua M Korn; Kati Kristiansson; Charles Lee; Steven A McCarrol; James Nemesh; Emmanouil Dermitzakis; Alon Keinan; Stephen B Montgomery; Samuela Pollack; Alkes L Price; Nicole Soranzo; Penelope E Bonnen; Richard A Gibbs; Claudia Gonzaga-Jauregui; Alon Keinan; Alkes L Price; Fuli Yu; Verneri Anttila; Wendy Brodeur; Mark J Daly; Stephen Leslie; Gil McVean; Loukas Moutsianas; Huy Nguyen; Stephen F Schaffner; Qingrun Zhang; Mohammed J R Ghori; Ralph McGinnis; William McLaren; Samuela Pollack; Alkes L Price; Stephen F Schaffner; Fumihiko Takeuchi; Sharon R Grossman; Ilya Shlyakhter; Elizabeth B Hostetter; Pardis C Sabeti; Clement A Adebamowo; Morris W Foster; Deborah R Gordon; Julio Licinio; Maria Cristina Manca; Patricia A Marshall; Ichiro Matsuda; Duncan Ngare; Vivian Ota Wang; Deepa Reddy; Charles N Rotimi; Charmaine D Royal; Richard R Sharp; Changqing Zeng; Lisa D Brooks; Jean E McEwen
Journal: Nature Date: 2010-09-02 Impact factor: 49.962

9. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing.

Authors: Hansoo Park; Jong-Il Kim; Young Seok Ju; Omer Gokcumen; Ryan E Mills; Sheehyun Kim; Seungbok Lee; Dongwhan Suh; Dongwan Hong; Hyunseok Peter Kang; Yun Joo Yoo; Jong-Yeon Shin; Hyun-Jin Kim; Maryam Yavartanoo; Young Wha Chang; Jung-Sook Ha; Wilson Chong; Ga-Ram Hwang; Katayoon Darvishi; Hyeran Kim; Song Ju Yang; Kap-Seok Yang; Hyungtae Kim; Matthew E Hurles; Stephen W Scherer; Nigel P Carter; Chris Tyler-Smith; Charles Lee; Jeong-Sun Seo
Journal: Nat Genet Date: 2010-04-04 Impact factor: 38.330
