Masao Nagasaki, Jun Yasuda, Fumiki Katsuoka, Naoki Nariai, Kaname Kojima, Yosuke Kawai, Yumi Yamaguchi-Kabata, Junji Yokozawa, Inaho Danjoh, Sakae Saito, Yukuto Sato, Takahiro Mimori, Kaoru Tsuda, Rumiko Saito, Xiaoqing Pan, Satoshi Nishikawa, Shin Ito, Yoko Kuroki, Osamu Tanabe, Nobuo Fuse, Shinichi Kuriyama, Hideyasu Kiyomoto, Atsushi Hozawa, Naoko Minegishi, James Douglas Engel, Kengo Kinoshita, Shigeo Kure, Nobuo Yaegashi, Masayuki Yamamoto.
Abstract
The Tohoku Medical Megabank Organization reports the whole-genome sequences of 1,070 healthy Japanese individuals and the construction of a Japanese population reference panel (1KJPN). Through this high-coverage sequencing (32.4× on average), we identify 21.2 million single-nucleotide variants (SNVs), including 12 million novel ones, at an estimated false discovery rate of <1.0%. This detailed analysis detected signatures of purifying selection on regulatory elements as well as coding regions. We also catalogue structural variants, including 3.4 million insertions and deletions, and 25,923 genic copy-number variants. The 1KJPN was effective for imputing genotypes of the Japanese population genome wide. These data demonstrate the value of high-coverage sequencing for constructing a population-specific variant panel, which covers 99.0% of SNVs with minor allele frequency ≥0.1%, and for identifying causal rare variants of complex human disease phenotypes in genetic association studies.
Year: 2015 PMID: 26292667 PMCID: PMC4560751 DOI: 10.1038/ncomms9018
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1: SNVs in 1KJPN.
(a) Statistics on read depth in 1KJPN. The vertical bars indicate the minimum and maximum read depth for each individual after filtering. Individuals are sorted by average sequenced read depth (the black line). (b) The power to detect the high-confidence SNVs (blue) and the mean r2 values before (yellow) and after (orange) filtering, evaluated against SNP array data for the same samples at non-reference allele counts ranging from 1 to 50. The r2 between genotypes from the SNVs in 1KJPN and the SNP array data is the squared Pearson correlation. (c) The numbers of novel and known SNVs in each MAF bin. Novel SNVs begin to dominate at lower MAFs. (d) The rate of variant discovery by minimum MAF in the 1KJPN population. The rates of variant discovery under our sequencing strategy are plotted against minimum MAF in the 1KJPN population for different sample sizes. The distribution of population MAF was estimated on the basis of the demographic model shown in Supplementary Fig. 3.
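The r2 measure in panel (b) can be illustrated with a minimal sketch. It assumes genotypes are coded as non-reference allele counts (0, 1 or 2) per individual; the function name `squared_pearson` and the toy genotype vectors are hypothetical and are not taken from the study.

```python
def squared_pearson(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one vector has no variation.
        return float("nan")
    return cov * cov / (var_x * var_y)


# Hypothetical genotypes at one site for six individuals
# (illustrative values only, not data from the paper).
wgs_calls = [0, 1, 2, 0, 1, 0]    # calls from sequencing
array_calls = [0, 1, 2, 0, 0, 0]  # calls from a SNP array
r2 = squared_pearson(wgs_calls, array_calls)
print(f"r2 = {r2:.3f}")
```

A single discordant genotype lowers r2 noticeably at a site with few non-reference alleles, which is one reason concordance is usually reported per allele-count bin rather than as a single genome-wide figure.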
Summary of WGS of Japanese individuals and variant detection in autosomes.
| Total samples | 1,070 | |
| Total raw bases | 100.4 trillion bases | |
| Mean sequenced depth | 32.4 × | |
| | High-sensitive SNVs | High-confidence SNVs |
| Total | 29,588,649 | 21,221,195 |
| Number of known variants | 12,308,520 | 9,219,783 |
| Number of novel variants | 17,280,129 | 12,001,412 |
| Novelty rate | 58.40% | 56.55% |
| Average number per sample | 3,886,081 | 2,716,853 |
| Average individual heterozygosity | 2,252,841 | 1,532,773 |
| | 1 bp≤length<100 bp | 100 bp≤length |
| Number of sites overall | 1,969,302 | 47,343 |
| Number of novel variants | 1,429,636 | — |
| Novelty rate | 72.60% | — |
| Number of inframe/frameshift | 3,112/4,454 | — |
| Average number per sample | 190,857 | 2,654 |
| | 1 bp≤length<100 bp | 100 bp≤length |
| Number of sites overall | 1,384,230 | 9,354 |
| Number of novel variants | 1,037,839 | 9,354 |
| Novelty rate | 74.98% | — |
| Number of inframe/frameshift | 1,577/2,506 | — |
| Average number per sample | 159,359 | 45 |
SNV, single-nucleotide variant; WGS, whole-genome sequencing.
All data listed here are limited to the autosomal genome.
*Comparison based on dbSNP build 138.
†The decision of novel sites is described in Methods.
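The novelty rates in the summary table follow directly from the listed counts, as the short check below shows. It uses only the SNV counts given in the table; the helper name `novelty_rate` is introduced here for illustration.

```python
def novelty_rate(novel, total):
    """Percentage of variant sites that are novel."""
    return 100.0 * novel / total


# (known, novel, total) SNV counts copied from the summary table.
snv_counts = {
    "high-sensitive SNVs": (12_308_520, 17_280_129, 29_588_649),
    "high-confidence SNVs": (9_219_783, 12_001_412, 21_221_195),
}

for label, (known, novel, total) in snv_counts.items():
    # Known and novel sites should partition the total.
    assert known + novel == total
    print(f"{label}: {novelty_rate(novel, total):.2f}% novel")
```

Run as written, this prints 58.40% for the high-sensitive set and 56.55% for the high-confidence set, matching the table.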
Figure 2: The impact of very-rare variants on genomic regions and functional categories.
(a) The site frequency spectra (SFSs) of intergenic regions for SNVs of 1KJPN (blue) and 1KGP (red). (b) The numbers of SNVs observed in 1KJPN and 1KGP, shown for four functional categories. The fractions of very-rare variants observed in 1KJPN are shown with 95% binomial confidence intervals according to (c) genomic region, (d) probable consequences in coding regions, (e) in noncoding regions and (f) scaled C scores. Because the individual depth filter causes the number of genotyped individuals among the high-confidence SNVs to differ between sites, we applied a hypergeometric projection (ref. 65), which subsamples each variant down to a sample size of 963 (90% of 1,070 samples), to obtain the SFSs of the high-confidence SNVs for a, c–f.
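The hypergeometric projection mentioned in this caption can be sketched as follows. This is a minimal illustration of the general technique, not the authors' implementation: each site with `derived` derived alleles among `n_called` called alleles is projected onto a smaller common size `target` by sampling without replacement. The function names, the toy sites and the small target size are assumptions for illustration, and whether sizes are counted in individuals or in chromosomes is left to the caller.

```python
from math import comb


def project_site(derived, n_called, target):
    """Hypergeometric probabilities of seeing j = 0..target derived alleles
    when `target` alleles are drawn without replacement from `n_called`."""
    if not 0 <= derived <= n_called:
        raise ValueError("derived count must lie between 0 and n_called")
    if not 0 <= target <= n_called:
        raise ValueError("target must not exceed the called sample size")
    denom = comb(n_called, target)
    # math.comb returns 0 when k > n, which zeroes out impossible outcomes.
    return [
        comb(derived, j) * comb(n_called - derived, target - j) / denom
        for j in range(target + 1)
    ]


def project_sfs(sites, target):
    """Expected SFS at size `target`, summed over (derived, n_called) sites."""
    sfs = [0.0] * (target + 1)
    for derived, n_called in sites:
        for j, p in enumerate(project_site(derived, n_called, target)):
            sfs[j] += p
    return sfs


# Toy sites with differing numbers of called alleles (hypothetical values).
toy_sites = [(1, 10), (3, 8), (2, 10)]
toy_sfs = project_sfs(toy_sites, target=6)
print([round(v, 3) for v in toy_sfs])
```

Because every site contributes a full probability distribution, the projected spectrum sums to the number of sites, and entry 0 holds the expected number of sites that become monomorphic after subsampling.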
Figure 3: Properties of genomic variation discovered in 1KJPN.
(a) The size-frequency spectrum of SNVs, deletions and insertions discovered by high-coverage sequencing in 1KJPN. Novelty rates are shown by the red line. Peaks corresponding to long interspersed elements (LINE), Alu and microsatellite repeats (MSR) are shown. (b) Size-frequency spectrum of CNVs estimated from high-coverage sequencing data in the genic regions in 1KJPN. (c) Histograms and scatterplot of diploid copy numbers of AMY1 genes (blue) and region X (red) in 1KJPN. A diagram depicting the positions of AMY1A, region X, AMY1B and AMY1C on chromosome 1 of GRCh37 is shown at the top right. (d) Allele frequencies for HLA-A in 1,070 individuals in 1KJPN estimated by high-coverage sequencing (blue), and in 1,018 Japanese individuals typed by PCR-SSOP (red) (ref. 42).
Figure 4: Imputation with the Japanese reference panel.
(a) Comparison of imputation performance (r2) for four reference panels: 1,070 individuals in 1KJPN (1KJPN), 1,092 cosmopolitan samples in 1KGP (1KGP ALL), 1KJPN plus 1KGP ALL (1KJPN+1KGP ALL) and 89 Japanese individuals in 1KGP (1KGP JPT). The x axis represents the MAF of each panel. The y axis represents the averaged r2 at SNV sites that exist in both the cosmopolitan samples of 1KGP and 1KJPN. (b) A Manhattan plot of P values from a GWAS of moyamoya disease (MMD). The SNV sites from the original data set and the imputed markers are plotted as dots in magenta and grey, respectively. Blue and red lines show the significance thresholds for the original and imputed results, respectively. Only one significant signal was identified, on chromosome 17. (c) A plot of P values from the GWAS of MMD with the original (non-imputed; upper panel) and imputed (lower panel) data sets around the SNP exhibiting the significant signal in b. In the imputed result, the SNP with the highest association is a nonsynonymous variant of RNF213, which was reported as one of the MMD-causing variants in the original study. In contrast, in the non-imputed result the SNP with the highest association is located in the coding region of ENDOV.
Individual variant load in coding regions.
| HGMD-DM | 1KJPN (1,070) high-confidence | 0.640 | 0.814 | 1.039 | 1.016 | 4.757 | 2.246 | 3.179 | 1.604 | 9.619 | 3.032 |
| | 1KJPN (1,070) high-sensitive | 0.675 | 0.848 | 1.107 | 1.051 | 4.905 | 2.261 | 4.388 | 1.787 | 11.074 | 3.136 |
| | 1KGP JPT (89) | NA | NA | NA | NA | 6.270 | 2.503 | 4.169 | 1.829 | 10.438 | 2.969 |
| | 1KGP CHB (97) | NA | NA | NA | NA | 5.536 | 2.381 | 4.464 | 1.921 | 10.000 | 3.218 |
| | 1KGP CHS (100) | NA | NA | 1.470 | 1.359 | 4.320 | 2.049 | 3.680 | 1.803 | 9.470 | 2.798 |
| Stop-gained | 1KJPN (1,070) high-confidence | 2.385 | 1.550 | 1.563 | 1.294 | 8.679 | 2.486 | 29.017 | 4.327 | 41.644 | 5.358 |
| | 1KJPN (1,070) high-sensitive | 2.624 | 1.616 | 1.777 | 1.376 | 6.008 | 2.402 | 42.125 | 4.878 | 52.535 | 5.795 |
| | 1KGP JPT (89) | NA | NA | NA | NA | 8.685 | 2.987 | 39.337 | 5.261 | 48.022 | 6.166 |
| | 1KGP CHB (97) | NA | NA | NA | NA | 9.742 | 3.215 | 37.845 | 5.593 | 47.588 | 6.777 |
| | 1KGP CHS (100) | NA | NA | 3.860 | 2.433 | 6.580 | 3.085 | 36.070 | 4.860 | 46.510 | 5.947 |
| HGMD-DM | 1KJPN (1,070) high-confidence | 0.001 | 0.031 | 0.003 | 0.053 | 0.048 | 0.230 | 1.570 | 1.126 | 1.621 | 1.145 |
| | 1KJPN (1,070) high-sensitive | 0.000 | 0.000 | 0.003 | 0.053 | 0.050 | 0.234 | 1.862 | 1.235 | 1.914 | 1.251 |
| | 1KGP JPT (89) | NA | NA | NA | NA | 0.022 | 0.149 | 1.899 | 1.244 | 1.921 | 1.227 |
| | 1KGP CHB (97) | NA | NA | NA | NA | 0.052 | 0.222 | 2.021 | 0.989 | 2.072 | 1.003 |
| | 1KGP CHS (100) | NA | NA | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 2.110 | 1.118 |
| Stop-gained | 1KJPN (1,070) high-confidence | 0.005 | 0.081 | 0.004 | 0.061 | 0.753 | 0.747 | 11.303 | 2.713 | 12.064 | 2.813 |
| | 1KJPN (1,070) high-sensitive | 0.008 | 0.101 | 0.008 | 0.101 | 0.099 | 0.302 | 12.101 | 2.853 | 12.217 | 2.851 |
| | 1KGP JPT (89) | NA | NA | NA | NA | 0.067 | 0.252 | 11.315 | 2.898 | 11.382 | 2.914 |
| | 1KGP CHB (97) | NA | NA | NA | NA | 0.052 | 0.222 | 12.093 | 2.758 | 12.144 | 2.769 |
| | 1KGP CHS (100) | NA | NA | 0.000 | 0.000 | 0.070 | 0.326 | 12.900 | 3.047 | 12.970 | 3.096 |
CHB, Han Chinese in Beijing, China; CHS, Han Chinese South, China; HGMD, Human Gene Mutation Database; JPT, Japanese in Tokyo, Japan; 1KGP, 1000 Genomes Project; 1KJPN, reference panel of 1,070 Japanese individuals; NA, not available; ORF, open reading frame; SNV, single-nucleotide variant.
SNV sites with reliable ancestral states were used.
*HGMD-DM (disease-causing) alleles were analysed if they are derived alleles and alternative (non-reference) alleles.
†We selected stop-gained alleles if they are derived alleles and alternative (non-reference) alleles. We discarded stop-gained SNVs if the proportion of truncated ORF is less than 5%.