Zhenglin Du1, Liang Ma2, Hongzhu Qu3, Wei Chen2, Bing Zhang4, Xi Lu4, Weibo Zhai4, Xin Sheng1, Yongqiao Sun4, Wenjie Li4, Meng Lei4, Qiuhui Qi4, Na Yuan1, Shuo Shi1, Jingyao Zeng1, Jinyue Wang1, Yadong Yang3, Qi Liu2, Yaqiang Hong2, Lili Dong1, Zhewen Zhang1, Dong Zou1, Yanqing Wang1, Shuhui Song1, Fan Liu5, Xiangdong Fang6, Hua Chen5, Xin Liu5, Jingfa Xiao7, Changqing Zeng8.
Abstract
Unraveling the genetic mechanisms of disease and physiological traits requires comprehensive sequencing analysis of large samples from Chinese populations. Here, we report the primary results of the Chinese Academy of Sciences Precision Medicine Initiative (CASPMI) project launched by the Chinese Academy of Sciences, including the de novo assembly of a northern Han reference genome (NH1.0) and whole-genome analyses of 597 healthy people from most areas of China. Given that the two existing reference genomes for Han Chinese (YH and HX1) were both from the south, we constructed NH1.0, a new reference genome from a northern individual, by combining PacBio sequencing, 10× Genomics sequencing, and Bionano mapping. Using this integrated approach, we obtained a scaffold N50 of 46.63 Mb for the NH1.0 genome and performed a comparative genome analysis of NH1.0 with YH and HX1. To generate a genomic variation map of Chinese populations, we performed whole-genome sequencing of 597 participants and identified 24.85 million (M) single nucleotide variants (SNVs), 3.85 M small indels, and 106,382 structural variations. In the association analysis with collected phenotypes, we found that the T allele of rs1549293 in KAT8 correlated significantly with waist circumference in northern Han males. Moreover, significant genetic diversity in MTHFR, TCN2, FADS1, and FADS2, which are associated with circulating folate, vitamin B12, or lipid metabolism, was observed between northerners and southerners. In particular, for the homocysteine-increasing allele of rs1801133 (MTHFR 677T), we hypothesize that there exists a "comfort" zone for a high frequency of 677T between latitudes 35° and 45° North. Taken together, our results provide a high-quality northern Han reference genome and novel population-specific data sets of genetic variants for use in personalized and precision medicine.
Keywords: De novo assembly; Large population; Phenotype association; Reference genome; Variation map
Year: 2019 PMID: 31494266 PMCID: PMC6818495 DOI: 10.1016/j.gpb.2019.07.002
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Statistics for the sequencing and assembly of four reference genomes
| Genome | YH | HX1 | NH1.0 | GRCh38 |
| Population | Southern Chinese | Southern Chinese | Northern Han Chinese | European |
| Sequencing methods | HiSeq (fosmid) | PacBio + Bionano | PacBio + 10× Genomics + Bionano | Sanger (BAC + fosmid) |
| Assembly software | SOAPdenovo | FALCON | CANU + Supernova | NA |
| Scaffold N50 (Mb) | 20.52 | 21.98 | 46.63 | 67.79 |
| Contig N50 (Mb) | 0.02 | 8.33 | 3.6 | 56.41 |
| No. of scaffolds | 125,643 | 5367 | 5574 | 735 |
| No. of gaps | 235,514 | 10,901 | 8484 | 999 |
| PhaseBlock N50 (Mb) | 0.48 | NA | 2.16 | NA |
| Assembly size (bp) | 2,911,235,363 | 2,934,084,193 | 2,892,287,479 | 3,209,286,105 |
Note: NA, not available.
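The scaffold and contig N50 values in the table above summarize assembly contiguity. As a minimal sketch, assuming the usual definition (the length of the sequence at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size), N50 can be computed as follows. The function name `n50` and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of the
    sequence at which the running total first reaches half the total size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly (arbitrary units), not data from the study.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Because N50 depends only on the length distribution, a high scaffold N50 can coexist with a lower contig N50 when long scaffolds are built from shorter contigs joined across gaps.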
Figure 1. A comparison of three Chinese reference genomes
A. A Venn diagram showing the SNVs present in each of the three Chinese reference genomes and the shared SNVs. B. A Venn diagram showing the structural variations shared among the three reference genomes (large deletions on the left and large insertions on the right; SV length >50 bp). C. Top, a map of chromosome 4 showing the position of the ZNF718 gene near the telomere region. Beneath the map are the ZNF718 exons at the 5′ end, followed by the mean inner distance (brown) and the coverage (green) of paired-end reads. Both indicate the presence of a homozygous deletion of 6138 bp in the ZNF718 gene in NH1.0. Below the read coverage are the structural variations recorded in the DGV [53]: the short thick blue line and the connected thick dark red line indicate gains and losses, respectively, and the black bars underneath indicate the distribution of repeat elements. Bottom, the domain structure of the protein encoded by ZNF718, showing that the genomic deletion (red line) results in a truncated ZNF718 protein lacking the KRAB domain (dark green). SNV, single nucleotide variant; DGV, Database of Genomic Variants.
Allele frequency and genomic features of the SNVs and indels identified in the CASPMI cohort
| SNV | ≥50% | 11,445 | 51 | 5749 | 105,509 | 32 | 2969 | 13,805 | 664,216 | 11,693 | 12,061 | 1,088,588 | 1,916,118 |
| | 5%–50% | 23,613 | 65 | 11,660 | 199,759 | 67 | 6021 | 27,056 | 1,343,806 | 23,236 | 24,117 | 2,119,311 | 3,778,711 |
| | 0.5%–5% | 25,422 | 115 | 8763 | 144,622 | 58 | 5977 | 24,531 | 1,025,511 | 18,148 | 18,337 | 1,492,106 | 2,763,590 |
| | <0.5% | 201,609 | 1595 | 51,408 | 872,202 | 280 | 41,586 | 157,899 | 6,221,982 | 113,756 | 109,551 | 8,619,679 | 16,391,547 |
| | All | 262,089 | 1826 | 77,580 | 1,322,092 | 437 | 56,553 | 223,291 | 9,255,515 | 166,833 | 164,066 | 13,319,684 | 24,849,966 |
| Indel | ≥50% | 347 | 65 | 713 | 17,654 | 3 | 467 | 2778 | 114,668 | 2217 | 2400 | 173,161 | 314,473 |
| | 5%–50% | 651 | 40 | 1813 | 52,435 | 9 | 935 | 7771 | 366,652 | 6798 | 7523 | 512,111 | 956,738 |
| | 0.5%–5% | 1080 | 43 | 1594 | 41,482 | 13 | 1058 | 6936 | 289,865 | 5520 | 5684 | 393,381 | 746,656 |
| | <0.5% | 7940 | 458 | 5213 | 102,642 | 45 | 4264 | 21,859 | 702,803 | 14,524 | 14,846 | 958,744 | 1,833,338 |
| | All | 10,018 | 606 | 9333 | 214,213 | 70 | 6724 | 39,344 | 1,473,988 | 29,059 | 30,453 | 2,037,397 | 3,851,205 |
Note: SNV, single-nucleotide variant; AF, allele frequency.
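The allele frequency (AF) classes used in the table can be illustrated with a short sketch. It assumes diploid autosomal genotypes with no missing calls, so that 597 participants contribute 1194 alleles, and it assumes that each class includes its lower boundary; neither detail is stated in this record. The function name `af_bin` and the example allele counts are hypothetical.

```python
from collections import Counter


def af_bin(alt_allele_count, n_individuals=597):
    """Assign a variant to one of the four AF classes used in the table.

    Assumes diploid genotypes without missing calls (2 * n_individuals alleles)
    and lower-inclusive class boundaries; both are assumptions of this sketch.
    """
    allele_frequency = alt_allele_count / (2 * n_individuals)
    if allele_frequency >= 0.5:
        return ">=50%"
    if allele_frequency >= 0.05:
        return "5%-50%"
    if allele_frequency >= 0.005:
        return "0.5%-5%"
    return "<0.5%"


# Hypothetical alternate-allele counts for six variants, not data from the study.
example_counts = [1, 5, 6, 59, 60, 597]
print(Counter(af_bin(count) for count in example_counts))
```

Under these assumptions, a variant needs at least 6 copies among the 1194 alleles to leave the rarest class, which is consistent with most of the variants in the table falling below 0.5%.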
Statistics of SNVs and indels in the coding regions identified in the CASPMI cohort
| SNV | ≥50% | NA | NA | NA | NA | 5146 | 6029 | 19 | 4 | 247 | 11,445 |
| | 5%–50% | NA | NA | NA | NA | 10,806 | 12,320 | 98 | 10 | 379 | 23,613 |
| | 0.5%–5% | NA | NA | NA | NA | 13,593 | 11,345 | 168 | 11 | 305 | 25,422 |
| | <0.5% | NA | NA | NA | NA | 124,339 | 72,302 | 2597 | 103 | 2268 | 201,609 |
| | All | NA | NA | NA | NA | 153,884 | 101,996 | 2882 | 128 | 3199 | 262,089 |
| Indel | ≥50% | 43 | 47 | 84 | 87 | NA | NA | 2 | 0 | 84 | 347 |
| | 5%–50% | 87 | 149 | 150 | 228 | NA | NA | 7 | 2 | 28 | 651 |
| | 0.5%–5% | 173 | 262 | 164 | 448 | NA | NA | 11 | 0 | 22 | 1080 |
| | <0.5% | 1285 | 2945 | 869 | 2534 | NA | NA | 161 | 6 | 140 | 7940 |
| | All | 1588 | 3403 | 1267 | 3297 | NA | NA | 181 | 8 | 274 | 10,018 |
Note: NA, not applicable.
Figure 2. SNV identification among projects and metabolism-related rs1549293 in KAT8
A. A comparison of SNVs found in the CASPMI project (pink) with those present in dbSNP (olive green), 1KGP (gray), 1KGP EAS (green), and the 90 Han Chinese genome study (light blue) [24]. B. The enrichment of KEGG pathways for genes with a high frequency of SNPs in the hfCAS-EAS dataset (a group of SNPs with relatively high frequencies in both the CASPMI cohort and 1KGP EAS). The x-axis represents the ratio of the number of queried genes to the total number of genes involved in each pathway (gene ratio), and the y-axis shows the enriched KEGG pathways. The color scale represents Q values (log10-transformed) for each enriched pathway (hypergeometric test), and the dot size indicates the number of genes involved in a particular process or pathway. C. Genes (shown on the x-axis) that are associated with the metabolism-related traits (colored bars underneath) and contain SNPs present in both the hfCAS-EAS dataset and the GWAS Catalog. Blue squares of different intensities illustrate the frequency of each SNP in the six populations shown on the y-axis. CAS indicates participants of the CASPMI cohort in this study, while EAS, SAS, AFR, EUR, and AMR refer to the respective populations in 1KGP. Genes examined in the current study are indicated with asterisks. D. Frequency distribution of the rs1549293-T allele in the aforementioned populations. E. Association of waist circumference with different rs1549293 genotypes in males of the CASPMI cohort (P = 0.002, t-test). F. The interaction of rs1549293 with HSD3B7 and FUS (red arcs) as revealed in various cell types by correlation assays of DHS (black peaks) and ChIA-PET (brick red lines stopping at squares), forming chromatin interactions of 145 kb and 54 kb, respectively, by recruiting the transcription factor PU.1 [36]. The locus where rs1549293 resides is enriched with both H3K4me1 (purple) and H3K27ac (blue) modifications, suggesting an enhancer function of this region. G. rs1549293 is localized in a PU.1 binding motif. The affinity for PU.1 binding appears to be weaker in the presence of the T allele [38]. CASPMI, Chinese Academy of Sciences Precision Medicine Initiative; 1KGP, 1000 Genomes Project; EAS, East Asian; hfCAS-EAS, relatively high-frequency SNPs of the CASPMI cohort shared with 1KGP EAS; SAS, South Asian; AFR, African; EUR, European; AMR, Admixed American; DHS, DNase I hypersensitive site.
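The gene ratio and hypergeometric test mentioned for panel B can be sketched as below. This is a generic over-representation calculation under stated assumptions, not the authors' pipeline: the function names, the choice of gene universe, and the toy numbers are all hypothetical, and the correction that turns P values into Q values is not shown.

```python
from math import comb


def gene_ratio(overlap, pathway_size):
    """Queried genes in the pathway divided by all genes in the pathway."""
    return overlap / pathway_size


def hypergeom_enrichment_p(overlap, query_size, pathway_size, universe_size):
    """Upper-tail hypergeometric P value, P(X >= overlap).

    X is the number of pathway genes among `query_size` genes drawn without
    replacement from a universe of `universe_size` genes, of which
    `pathway_size` belong to the pathway.
    """
    upper = min(query_size, pathway_size)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, query_size)


# Hypothetical toy numbers, not data from the study:
# 10 genes in the universe, 4 in the pathway, 3 queried, 2 of them in the pathway.
print(gene_ratio(2, 4))
print(hypergeom_enrichment_p(2, 3, 4, 10))
```

The upper tail is used because enrichment asks whether the observed overlap is larger than expected by chance; exact integer arithmetic with `math.comb` avoids rounding until the final division.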
Phenotype correlation of the 17 SNPs associated with the metabolism-related traits in the CASPMI cohort
| Waist circumference | rs13210323 | A | C | 0.64 | 0.69 | 0.31 | 0.34 | 0.31 | 0.28 | Intronic | DHS | 0.1073 | 0.3253 | 0.361 |
| | rs1549293* | T | T | 0.92 | 0.88 | 0.14 | 0.09 | 0.38 | 0.38 | Intronic | DHS + ChIA-PET | 0.791 | 0.0093 | 0.5406 |
| | rs3791679 | A | G | 0.78 | 0.77 | 0.25 | 0.03 | 0.24 | 0.23 | Intronic | DHS | 0.5242 | 0.2793 | 0.446 |
| | rs806794 | A | G | 0.77 | 0.76 | 0.45 | 0.38 | 0.45 | 0.30 | 3_prime_UTR | DHS | 0.5319 | 0.8388 | 0.3953 |
| Type 2 diabetes | rs11257655 | T | T | 0.58 | 0.54 | 0.24 | 0.24 | 0.26 | 0.23 | Intergenic | DHS | 0.5517 | 0.3214 | 0.55 |
| | rs231356 | T | T | 0.79 | 0.79 | 0.48 | 0.25 | 0.43 | 0.30 | ncRNA exonic | DHS | 0.5996 | 0.6583 | 0.8272 |
| | rs62481355 | T | T | 0.65 | 0.69 | 0.32 | 0.01 | 0.25 | 0.31 | Intergenic | DHS | 0.7108 | 0.4669 | 0.9738 |
| | rs806215 | C | C | 0.63 | 0.66 | 0.26 | 0.27 | 0.24 | 0.22 | Intronic | DHS | 0.8 | 0.5613 | 0.9468 |
| Fasting plasma glucose | rs733331 | A | A | 0.52 | 0.52 | 0.18 | 0.00 | 0.15 | 0.04 | Intergenic | DHS | 0.6071 | 0.6828 | 0.8547 |
| Hypertension | rs2398162* | A | G | 0.65 | 0.67 | 0.28 | 0.04 | 0.34 | 0.21 | ncRNA intronic | DHS | 0.0256 | 0.6531 | 0.0071 |
| Systolic blood pressure | rs820430 | A | G | 0.70 | 0.67 | 0.38 | 0.02 | 0.37 | 0.39 | Intergenic | DHS | 0.6159 | 0.6138 | 0.1146 |
| | rs13359291 | A | A | 0.62 | 0.59 | 0.29 | 0.14 | 0.27 | 0.17 | Intronic | DHS | 0.8001 | 0.8163 | 0.763 |
| Systolic blood pressure (cigarette smoking interaction) | rs1792738 | G | G | 0.79 | 0.78 | 0.44 | 0.05 | 0.43 | 0.34 | Intergenic | DHS | 0.1913 | 0.1317 | 0.4146 |
| Diastolic blood pressure | rs820430 | A | G | 0.70 | 0.67 | 0.38 | 0.02 | 0.37 | 0.39 | Intergenic | DHS | 0.8312 | 0.4117 | 0.4473 |
| Triglycerides | rs11649653 | G | G | 0.93 | 0.90 | 0.17 | 0.02 | 0.41 | 0.39 | Intergenic | DHS | 0.6986 | 0.6136 | 0.2415 |
| HDL cholesterol | rs759819 | C | C | 0.81 | 0.72 | 0.29 | 0.07 | 0.36 | 0.31 | Intergenic | DHS | 0.3521 | 0.1906 | 0.9844 |
| | rs2967605 | T | T | 0.64 | 0.60 | 0.30 | 0.22 | 0.24 | 0.19 | Downstream gene | DHS | 0.1471 | 0.3878 | 0.2987 |
| | rs386000 | C | C | 0.84 | 0.65 | 0.15 | 0.17 | 0.43 | 0.19 | Intergenic | DHS | 0.76 | 0.9099 | 0.6911 |
Note: Association analysis was performed using the PLINK toolset. SNPs significantly associated with the phenotypes are shown in red (P < 0.05). CAS, CASPMI cohort participants in the current study; EAS, East Asian; SAS, South Asian; AFR, African; AMR, Admixed American; EUR, European; HDL, high-density lipoprotein; DHS, DNase hypersensitive site; ChIA-PET, chromatin interaction analysis with paired-end tag sequencing.
Figure 3. Genetic differentiation between northern and southern Han populations in the CASPMI cohort
A. Fst values between NH and SH populations in the CASPMI cohort. The red dashed horizontal line indicates the Fst cutoff (≥0.054). Some of the most significant regions, genes, and missense SNPs are marked. B. Allele frequencies and genotype ratios of MTHFR rs1801133 in the NH and SH groups. C. Allele frequencies and genotype ratios of TCN2 rs75680863 in the NH and SH groups. D. A belt of relatively high MTHFR 677T (rs1801133) frequency (colored in red) between latitudes 35° and 45° North. As shown in the map produced with National Geographic Map Maker Interactive (https://mapmaker.nationalgeographic.org/), populations with higher frequencies of 677T are found in the relatively central regions of the temperate zone (0.3–0.4 and above, pink belt). The frequency of 677T decreases toward the north in Europe and toward the south in Africa and Asia (see more details in Table S15), suggesting a selection pressure for higher MTHFR activity in colder as well as more tropical areas. Fst, the fixation index; NH, northern Han; SH, southern Han.
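To make the fixation index concrete, the sketch below computes a simple Wright-style Fst for one biallelic SNP from the allele frequencies of two populations, weighting the populations equally. This record does not state which Fst estimator the authors used, so this is a swapped-in textbook formula, Fst = (HT − HS) / HT, and the allele frequencies are hypothetical; only the 0.054 cutoff comes from the caption.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) for a biallelic locus."""
    return 2 * p * (1 - p)


def wright_fst(p1, p2):
    """Simple two-population Fst = (HT - HS) / HT with equal population weights.

    HT is the heterozygosity expected from the mean allele frequency, and HS is
    the mean within-population heterozygosity. Returns 0.0 for a monomorphic
    site, where Fst is undefined.
    """
    mean_p = (p1 + p2) / 2
    h_total = expected_heterozygosity(mean_p)
    if h_total == 0:
        return 0.0
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2
    return (h_total - h_within) / h_total


FST_CUTOFF = 0.054  # cutoff quoted in the Figure 3 caption

# Hypothetical allele frequencies in two populations, not data from the study.
fst = wright_fst(0.5, 0.3)
print(round(fst, 4), fst >= FST_CUTOFF)
```

Estimators that correct for sample size, such as Weir and Cockerham's, would generally give somewhat different values on real genotype data, so this formula only illustrates how frequency differences translate into Fst.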
Figure 4. The population distribution of mutational signatures
A. Five COSMIC mutational signatures whose patterns matched those of the novel singletons identified in the CASPMI cohort. The 96 types of trinucleotide mutational contexts are presented on the x-axis, and the y-axis shows the probability of a specific mutation occurring in each context. B. Distribution of the five aforementioned mutational signatures in the NH and SH groups. Signature 1 showed the most significant difference between the two groups (P = 0.001, Wilcoxon rank-sum test). Boxplots show the proportion of each mutational signature in NH (green) and SH (orange) individuals. Whiskers extend to the lowest and highest values within 1.5 times the interquartile range below the first quartile and above the third quartile, respectively; dots represent outliers beyond the whiskers. C. SNPs significantly associated with the individual load of COSMIC signature 5. Seventeen SNPs were identified as significantly associated with the individual load of this signature (P < 10⁻⁵). The dashed horizontal line represents the significance threshold (P = 10⁻⁵). Red dots represent the significant SNPs, and black circles indicate the genes in which the significant SNPs reside. COSMIC, the Catalogue of Somatic Mutations in Cancer.
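The 96 trinucleotide mutational contexts on the x-axis of panel A follow from 6 pyrimidine-centered substitution types combined with 4 possible 5′ bases and 4 possible 3′ bases. The sketch below enumerates them and folds a purine-centered mutation onto its pyrimidine-strand equivalent. The bracket notation and the helper names are common conventions assumed here, not details taken from the study.

```python
from itertools import product

BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trinucleotide_contexts():
    """Enumerate the 96 contexts as 5'-base[ref>alt]3'-base strings."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def classify(ref_trinucleotide, alt_base):
    """Map a single-base substitution to its pyrimidine-centered context.

    `ref_trinucleotide` is the reference sequence centered on the mutated base.
    If the reference base is a purine, the reverse complement strand is used.
    """
    if ref_trinucleotide[1] in "AG":
        ref_trinucleotide = ref_trinucleotide.translate(COMPLEMENT)[::-1]
        alt_base = alt_base.translate(COMPLEMENT)
    five, ref, three = ref_trinucleotide
    return f"{five}[{ref}>{alt_base}]{three}"


contexts = trinucleotide_contexts()
print(len(contexts))
print(classify("TGA", "A"))
```

Counting each variant's context in this way yields a 96-element profile per individual, which is the kind of input that signature-fitting methods compare against reference signatures.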
Top 10 enriched traits for the SVs mapped to the GWAS Catalog
| Body mass index | 41 | 340 | 0.0008 | 0.2190 |
| Schizophrenia | 49 | 441 | 0.0013 | 0.2742 |
| Mean platelet volume | 11 | 55 | 0.0029 | 0.3312 |
| QT interval | 12 | 68 | 0.0047 | 0.3787 |
| Parkinson's disease | 15 | 100 | 0.0065 | 0.3787 |
| Nickel levels | 8 | 37 | 0.0071 | 0.3787 |
| Adverse response to chemotherapy (neutropenia/leucopenia) (carboplatin) | 4 | 9 | 0.0074 | 0.3787 |
| Obesity-related traits | 65 | 689 | 0.0088 | 0.3787 |
| Bone mineral density | 13 | 85 | 0.0095 | 0.3787 |
| Platelet count | 12 | 78 | 0.0119 | 0.3908 |