Tianming Lan1, Haoxiang Lin2, Wenjuan Zhu2, Tellier Christian Asker Melchior Laurent1,3, Mengcheng Yang1, Xin Liu1, Jun Wang1,3, Jian Wang1,4, Huanming Yang1,4, Xun Xu1, Xiaosen Guo1,3,5.
Abstract
Next-generation sequencing provides high-resolution insight into human genetic information. However, previous studies have focused primarily on low-coverage data because of the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call accurately from low-coverage data. Deep sequencing offers a solution to this problem. Whole-exome sequencing is a viable choice for exonic regions, but it does not cover noncoding regions, so important causal variants are sometimes missed. For Han Chinese populations, the majority of variants have been discovered from low-coverage data of the 1000 Genomes Project. High-coverage whole-genome sequencing data are limited for any population, and a large number of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at high depth (∼80×) of 90 unrelated individuals of Chinese ancestry drawn from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome.
Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects.
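The abstract defines low-frequency variants as those with a minor allele frequency (MAF) below 5%. The following is a minimal sketch of that classification, assuming biallelic sites and diploid samples. The function names and the `ploidy` parameter are hypothetical and are not taken from the authors' pipeline.

```python
def minor_allele_frequency(alt_allele_count: int, n_samples: int, ploidy: int = 2) -> float:
    """Return the MAF of a biallelic site given the alternate-allele count."""
    total_alleles = n_samples * ploidy
    if not 0 <= alt_allele_count <= total_alleles:
        raise ValueError("alt_allele_count must lie between 0 and the total allele count")
    alt_freq = alt_allele_count / total_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def is_low_frequency(maf: float, threshold: float = 0.05) -> bool:
    """Apply the abstract's definition: variable site with MAF < 5%."""
    return 0.0 < maf < threshold


# With 90 diploid samples there are 180 alleles per site,
# so a variant is low-frequency when the minor allele is seen fewer than 9 times.
for count in (1, 8, 9, 172):
    maf = minor_allele_frequency(count, n_samples=90)
    print(count, round(maf, 4), is_low_frequency(maf))
```

Under these assumptions, the 5% cutoff in a 90-sample cohort corresponds to at most 8 copies of the minor allele, which illustrates why deep coverage matters for calling such sites reliably.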
Keywords: Han Chinese genomes; de novo assembly; genetic variations; high-coverage whole-genome sequencing
Year: 2017 PMID: 28938720 PMCID: PMC5603764 DOI: 10.1093/gigascience/gix067
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
The sequencing depth of different library insert sizes
| Library insert size | Sequencing depth (fold) | Standard deviation |
|---|---|---|
| 180 bp | 51.78 | 8.11 |
| 500 bp | 12.74 | 2.54 |
| 2000 bp | 5.01 | 1.08 |
| 5000 bp | 5.02 | 2.08 |
| 10 000 bp | 5.62 | 2.22 |
| 20 000 bp | 6.68 | 2.52 |
| <1000 bp | 64.52 | 8.11 |
| >1000 bp | 22.33 | 3.90 |
| Total | 86.85 | 8.53 |
Sequencing depth is calculated as total sequenced bases / 3e9 (the approximate size of the human genome in base pairs).
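The depth calculation in the footnote can be sketched as follows. The genome size is a parameter; the default of 3e9 bp is an assumption (the approximate haploid human genome size), chosen because it is consistent with the roughly 228 Gb of mapped bases and 77-fold average depth reported for these samples. The function name is hypothetical.

```python
HUMAN_GENOME_SIZE_BP = 3e9  # assumption: approximate haploid human genome size


def sequencing_depth(total_bases: float, genome_size: float = HUMAN_GENOME_SIZE_BP) -> float:
    """Mean fold coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Per-library depths from the table, keyed by insert size in bp.
per_library_depth = {
    180: 51.78,
    500: 12.74,
    2000: 5.01,
    5000: 5.02,
    10000: 5.62,
    20000: 6.68,
}

# The grouped rows of the table are sums of the per-library depths.
short_insert = sum(d for size, d in per_library_depth.items() if size < 1000)
long_insert = sum(d for size, d in per_library_depth.items() if size > 1000)
total = short_insert + long_insert

print(round(short_insert, 2), round(long_insert, 2), round(total, 2))
print(round(sequencing_depth(227.57e9), 1))  # depth implied by the mean mapped bases
```

The per-library depths add up to the "<1000 bp", ">1000 bp", and "Total" rows, so those rows are derived values rather than independent measurements.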
Deep whole-genome sequencing data of 90 Chinese samples
| | CHS | CHB | Total |
|---|---|---|---|
| Number of individuals | 45 | 45 | 90 |
| Raw bases (Gb) | 231.61 ± 72.61 | 264.24 ± 44.92 | 247.69 ± 56.54 |
| Mapped bases (Gb) | 212.35 ± 68.96 | 243.28 ± 41.81 | 227.57 ± 53.74 |
| Average sequencing depth (fold) | 71.87 ± 23.52 | 82.36 ± 14.13 | 77.02 ± 18.37 |
Gene-based annotation of SNPs and InDels
| Regions | SNPs | InDels |
|---|---|---|
| Intron | 5 072 778 | 889 195 |
| CDS | 127 027 | 5916 |
| 5′ UTRs | 15 823 | 1754 |
| 3′ UTRs | 90 167 | 18 062 |
| Upstream | 174 016 | 33 074 |
| Downstream | 171 951 | 34 479 |
| Intergenic | 6 917 042 | 1 091 730 |
| Total variant | 12 568 804 | 2 074 210 |
Validation results of SNPs and InDels
| Type | Reference variant set | Sample size | Total sites | Concordant sites | Concordance rate | FDR |
|---|---|---|---|---|---|---|
| SNP | Illumina Infinium OmniZhongHua-8 | 22 | 407 040 ± 2635 | 406 790 ± 2674 | 99.94% ± 0.02% | 0.06% ± 0.02% |
| SNP | Affymetrix Affy 6.0 | 86 | 406 354 ± 2064 | 406 011 ± 2109 | 99.92% ± 0.02% | 0.08% ± 0.02% |
| SNP | Illumina Omni 2.5 | 86 | 678 718 ± 2783 | 678 253 ± 2805 | 99.93% ± 0.01% | 0.07% ± 0.01% |
| SNP | 1KG phase III | 83 | 10 678 197 ± 892 | 10 648 954 ± 7813 | 99.68% ± 0.07% | 0.32% ± 0.07% |
| INDEL | 1KG phase III | 83 | 774 476 ± 489 | 755 367 ± 1196 | 97.28% ± 0.15% | 2.72% ± 0.15% |
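In the validation table, the FDR column equals 100% minus the concordance rate in every row. The sketch below assumes that definition (concordant sites divided by total compared sites, with FDR as the complement). The function name is hypothetical. The table reports per-sample means ± standard deviations, so applying the formula to the mean counts only approximates the reported rates.

```python
def concordance_and_fdr(total_sites: int, concordant_sites: int) -> tuple[float, float]:
    """Return (concordance rate, FDR) as fractions, with FDR = 1 - concordance."""
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    if not 0 <= concordant_sites <= total_sites:
        raise ValueError("concordant_sites must lie between 0 and total_sites")
    rate = concordant_sites / total_sites
    return rate, 1.0 - rate


# Mean counts from the Illumina Infinium OmniZhongHua-8 row of the table.
rate, fdr = concordance_and_fdr(total_sites=407_040, concordant_sites=406_790)
print(f"concordance = {rate:.2%}, FDR = {fdr:.2%}")
```

For the array-based rows, the mean counts reproduce the reported rates to two decimal places. For the 1KG phase III rows, the ratio of mean counts differs slightly from the reported mean rate, as expected when rates are averaged per sample.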
Figure 1: Novel variants. (A) Novel SNPs identified by comparing the SNP set of our 90 Han Chinese samples with those of 1000GP, dbSNP build 147 (SNP147), or the Han Chinese samples from 1000GP (CHB+CHS). (B) Novel InDels identified by comparing the InDels of our 90 Han Chinese samples with those of 1000GP, dbSNP build 147 (SNP147), or the Han Chinese samples from 1000GP (CHB+CHS).
Figure 2: Proportion of novel SNPs and InDels by minor allele frequency. HAN-SNP: comparison of the SNP set from our 90 Han Chinese samples with the SNP set from the Han Chinese samples of 1000GP. SNP147-SNP: comparison of SNPs between our 90 Han Chinese samples and dbSNP build 147. 1KG-SNP: comparison of SNPs between our 90 Han Chinese samples and the 1000GP phase III release. HAN-INDEL: comparison of InDels between our 90 Han Chinese samples and the Han Chinese samples of 1000GP. SNP147-INDEL: comparison of InDels between our 90 Han Chinese samples and dbSNP build 147. 1KG-INDEL: comparison of InDels between our 90 Han Chinese samples and the 1000GP phase III release.
Figure 3: Annotation of deletion breakpoints. Combine-repeat: low-frequency repeat types combined into one category; L1: L1 repeat elements; L2: L2 repeat elements; MIR: mammalian interspersed repetitive (MIR) elements; hAT-Charlie: a family of DNA transposons.
Figure 4: Concordance rates of SVs between the 1000 Genomes Project and the 90 Han Chinese samples.