Yunchao Ling, Zhong Jin, Mingming Su, Jun Zhong, Yongbing Zhao, Jun Yu, Jiayan Wu, Jingfa Xiao.
Abstract
BACKGROUND: The data released by the 1000 Genomes Project contain an increasing number of genome sequences from different nations and populations with a large number of genetic variations. As a result, the focus of human genome studies is changing from single and static to complex and dynamic. The currently available human reference genome (GRCh37) is based on sequencing data from 13 anonymous Caucasian volunteers, which might limit the scope of genomics, transcriptomics, epigenetics, and genome-wide association studies.

DESCRIPTION: We used the massive amount of sequencing data published by the 1000 Genomes Project Consortium to construct the Virtual Chinese Genome Database (VCGDB), a dynamic genome database of the Chinese population based on the whole genome sequencing data of 194 individuals. VCGDB provides dynamic genomic information, which contains 35 million single nucleotide variations (SNVs), 0.5 million insertions/deletions (indels), and 29 million rare variations, together with genomic annotation information. VCGDB also provides a highly interactive, user-friendly virtual Chinese genome browser (VCGBrowser) with functions like seamless zooming and real-time searching. In addition, we have established three population-specific consensus Chinese reference genomes that are compatible with mainstream alignment software.
Year: 2014 PMID: 24708222 PMCID: PMC4028056 DOI: 10.1186/1471-2164-15-265
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Data processing workflow used to construct the virtual Chinese genome database (VCGDB).
Dynamic position counts for the different populations in the virtual Chinese genome database (VCGDB)
| | | |
|---|---|---|
| 33,780,152 | 19,591,609 | 24,109,529 |
| 1,747,140 | 937,088 | 1,222,844 |
| 1,006 | 673 | 654 |
| 35,528,298 | 20,529,370 | 25,333,027 |
| | | |
| 392,074 | 454,215 | 345,647 |
| 14,360 | 17,323 | 12,346 |
| 25 | 20 | 59 |
| 406,459 | 471,558 | 358,052 |
| | | |
| 27,258,690 | 12,505,097 | 17,688,051 |
| 1,477,821 | 632,253 | 939,627 |
| 907 | 569 | 536 |
| 28,737,418 | 13,137,919 | 18,628,214 |
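Within each block of the table above, the fourth row appears to be the column-wise total of the three rows before it. The following is a quick check under that reading of the layout; the variable and function names are illustrative and not part of VCGDB.

```python
# Counts copied from the table above: three blocks, each with four rows
# and three population columns.
blocks = [
    [(33_780_152, 19_591_609, 24_109_529),
     (1_747_140, 937_088, 1_222_844),
     (1_006, 673, 654),
     (35_528_298, 20_529_370, 25_333_027)],
    [(392_074, 454_215, 345_647),
     (14_360, 17_323, 12_346),
     (25, 20, 59),
     (406_459, 471_558, 358_052)],
    [(27_258_690, 12_505_097, 17_688_051),
     (1_477_821, 632_253, 939_627),
     (907, 569, 536),
     (28_737_418, 13_137_919, 18_628_214)],
]


def last_row_is_column_sum(block):
    """Return True if the final row equals the column-wise sum of the others."""
    *parts, total = block
    return tuple(sum(column) for column in zip(*parts)) == total


checks = [last_row_is_column_sum(block) for block in blocks]
print(checks)  # → [True, True, True]
```

The first-column totals (about 35.5 million, 0.4 million, and 28.7 million) are of the same order as the SNV, indel, and rare-variation figures quoted in the abstract.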
Figure 2. Statistical analysis of the dynamic genomics information in the virtual Chinese genome database (VCGDB). A. Dynamic positions and indel distribution in the CHN, CHB, and CHS populations. The X-axis shows the major base probability of the dynamic position/probability of indels in the genome sequences. The Y-axis shows the proportion of dynamic positions/indels with the specific probability region. B. Indel length distribution in the CHN, CHB, and CHS populations. The X-axis shows the length of insertions (blue) and deletions (red), and the Y-axis shows the number of indels. Only high-probability insertions and deletions (>50%) in VCGDB were counted. C. Distribution of dynamic positions, indels, and MAIR (major allele and indel positions against the GRCh37 reference genome) based on the annotation information.
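The "major base probability" on the X-axis of panel A can be read as the share of the most frequent base among the alleles observed at a position. The sketch below assumes that simple frequency definition, which may differ from the authors' exact procedure. The allele string is hypothetical and not taken from VCGDB.

```python
from collections import Counter


def major_base_probability(alleles):
    """Return the most frequent base and its share of all observed alleles."""
    counts = Counter(alleles)
    base, n = counts.most_common(1)[0]
    return base, n / len(alleles)


# Hypothetical alleles observed at one position across ten chromosomes.
observed = "AAAAAAGGGA"
base, prob = major_base_probability(observed)
print(base, prob)  # → A 0.7
```

Under this reading, a position where every sampled chromosome carries the same base has a probability of 1.0 and would not count as dynamic.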
Statistical analysis of genetic variations in the exonic regions of the Chinese and GRCh37 genomes
| | | |
|---|---|---|
| 6,481 (52.80%) | 6,657 (52.68%) | 6,446 (52.98%) |
| 5,430 (44.24%) | 5,609 (44.39%) | 5,362 (44.07%) |
| 22 (0.18%) | 21 (0.17%) | 22 (0.18%) |
| 5 (0.04%) | 6 (0.05%) | 5 (0.04%) |
| 336 (2.74%) | 343 (2.71%) | 332 (2.73%) |
| 12,274 | 12,636 | 12,167 |
Genetic variations in the major allele and indel positions against the GRCh37 reference genome (MAIR) were analyzed.
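The percentages in the exonic-variation table can be reproduced from the raw counts, taking the last row as each column's total. The following is a minimal sketch; the variable names are illustrative.

```python
# Counts copied from the exonic-variation table, one list per column.
columns = [
    [6481, 5430, 22, 5, 336],
    [6657, 5609, 21, 6, 343],
    [6446, 5362, 22, 5, 332],
]

# Column totals, expected to match the table's last row.
totals = [sum(counts) for counts in columns]

# Each count as a percentage of its column total, to two decimals.
shares = [
    [f"{100 * count / total:.2f}%" for count in counts]
    for counts, total in zip(columns, totals)
]

print(totals)
for column_shares in shares:
    print(column_shares)
```

The computed totals are 12,274, 12,636, and 12,167, which agree with the final row of the table.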
Enrichment analysis of MAIR in GWAS trait locations in the Chinese and GRCh37 genomes
| | | |
|---|---|---|
| Height (110/324) | Height (108/324) | Height (110/324) |
| Multiple sclerosis (41/187) | Multiple sclerosis (41/187) | Multiple sclerosis (41/187) |
| Crohn's disease (36/181) | Crohn's disease (40/181) | Crohn's disease (38/181) |
| Body mass index (34/109) | Body mass index (34/109) | Coronary heart disease (37/151) |
| Coronary heart disease (34/151) | Coronary heart disease (33/151) | Body mass index (36/109) |
| Type 2 diabetes (33/164) | Type 2 diabetes (32/164) | Bipolar disorder (32/109) |
| Rheumatoid arthritis (31/170) | LDL cholesterol (32/114) | Type 2 diabetes (31/164) |
| LDL cholesterol (31/114) | HDL cholesterol (31/118) | Rheumatoid arthritis (30/170) |
| Bipolar disorder (30/109) | Type 1 diabetes (29/107) | Bone mineral density (30/87) |
| Bone mineral density (30/87) | Bone mineral density (29/87) | LDL cholesterol (30/114) |
MAIR, major allele and indel positions, against the reference genome.
Figure 3. Differences between the two Han Chinese CHS and CHB populations. A. Venn diagram of a comparison of the dynamic positions. B. Venn diagram of a comparison of major alleles against the GRCh37 reference genome. C. Venn diagram of a comparison of high-probability indels against the GRCh37 reference genome. D. Venn diagram of a comparison of rare variations. In B and C, some shared dynamic positions were substituted by the same nucleotides/indels, others were substituted by different nucleotides/indels; these are marked "same" and "diff", respectively.
Figure 4. VCGBrowser interface (A) and VCGDB online search page (B).
Mapping of 15 Asian genomes onto the VCG, YH and GRCh37 reference genomes
| Reference genome | | | | | |
|---|---|---|---|---|---|
| GRCh37 | 46,809,092 (96.14%) | 54,343,732 (95.58%) | 58,459,936 (97.78%) | 41,735,844 (97.80%) | 46,809,092 (96.14%) |
| YH | 46,963,414 (96.46%) | 54,578,732 (95.99%) | 58,711,122 (98.20%) | 41,929,952 (98.25%) | 46,963,414 (96.46%) |
| VCG | 47,025,644 (96.59%) | 54,677,084 (96.16%) | 58,817,318 (98.37%) | 42,017,702 (98.46%) | 47,025,644 (96.59%) |
| | | | | | |
| GRCh37 | 9,570,212 (74.52%) | 9,900,966 (83.42%) | 12,832,078 (70.94%) | 11,818,302 (64.55%) | 15,160,036 (61.13%) |
| YH | 9,554,028 (74.39%) | 9,899,632 (83.41%) | 12,825,240 (70.90%) | 11,826,910 (64.60%) | 15,179,378 (61.20%) |
| VCG | 9,590,720 (74.68%) | 9,931,332 (83.68%) | 12,870,814 (71.16%) | 11,880,666 (64.89%) | 15,228,994 (61.40%) |
| | | | | | |
| GRCh37 | 31,213,781 (89.77%) | 4,235,400 (97.06%) | 34,476,427 (98.70%) | 9,353,246 (96.88%) | 71,376,941 (89.93%) |
| YH | 31,338,365 (90.13%) | 4,236,640 (97.09%) | 34,514,636 (98.81%) | 9,359,975 (96.95%) | 71,711,466 (90.35%) |
| VCG | 31,410,518 (90.34%) | 4,236,680 (97.09%) | 34,500,530 (98.76%) | 9,361,169 (96.96%) | 71,898,286 (90.59%) |
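To make the mapping comparison concrete, the sketch below computes how many more sequences map to VCG than to GRCh37 for the five genomes in the first block of the table. It assumes the counts are mapped reads, which the table does not state. The variable names are illustrative.

```python
# Mapped counts copied from the first block of the mapping table
# (five genomes per row, in table order).
grch37 = [46_809_092, 54_343_732, 58_459_936, 41_735_844, 46_809_092]
vcg = [47_025_644, 54_677_084, 58_817_318, 42_017_702, 47_025_644]

# Absolute gain per genome when mapping to VCG instead of GRCh37.
gains = [v - g for g, v in zip(grch37, vcg)]

# The same gain as a percentage of the GRCh37 mapped count.
relative = [100 * gain / g for gain, g in zip(gains, grch37)]

for i, (gain, rel) in enumerate(zip(gains, relative), start=1):
    print(f"genome {i}: +{gain:,} mapped ({rel:.2f}% more than GRCh37)")
```

For these five genomes the gain is positive in every case, ranging from 216,552 to 357,382 additional mapped sequences.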