Xiaobin Wang1, Weiguo Sui2, Weiqing Wu3, Xianliang Hou4, Minglin Ou2, Yueying Xiang5, Yong Dai6.
Abstract
With the advent of next-generation sequencing technology, the cost of sequencing has decreased significantly. However, sequencing costs remain high for large-scale studies. In the present study, DNA pooling was applied as a cost-effective sequencing strategy, and the results of whole-genome resequencing with DNA pooling for 100 healthy individuals are presented. To minimise the likelihood of systematic sampling bias, paired-end libraries with an insert size of 500 bp were prepared for all samples and subjected to whole-genome sequencing using four lanes per library, resulting in at least 30-fold haploid coverage for each sample. The NCBI human genome build 37 (hg19) was used as the reference genome, and the short reads aligned to it covered 99.84% of the reference, with an average sequencing depth of 32.76. In total, ~3 million single-nucleotide polymorphisms were identified, of which 99.88% were in the NCBI dbSNP database. Furthermore, ~600,000 small insertion/deletions, 500,000 structural variants, 5,000 copy number variations and 13,000 single nucleotide variants were identified. According to the present study, this is the first time that the whole genome has been sequenced for a small sample of subjects from southern China. New variation sites were identified by comparison with the reference sequence, adding new knowledge of human genome variation to the human genomic databases. In addition, the distribution regions of variation were illustrated by analyzing various types of variation sites, such as single-nucleotide polymorphisms.
Keywords: DNA pooling; genetic variation; single nucleotide polymorphism; variation site; whole-genome resequencing
Year: 2016 PMID: 27882129 PMCID: PMC5103757 DOI: 10.3892/etm.2016.3797
Source DB: PubMed Journal: Exp Ther Med ISSN: 1792-0981 Impact factor: 2.447
Quality statistics of clean data.
| Type | Raw data | Clean data |
|---|---|---|
| Number of reads | 1,273,028,056 | 1,210,244,348 |
| Data size (bp) | 114,572,525,040 | 108,921,991,320 |
| N of fq1 | 23,591,083 | 1,889,209 |
| N of fq2 | 61,780,180 | 1,483,604 |
| GC (%) of fq1 | 39.61–40.1 | 39.43–40.01 |
| GC (%) of fq2 | 39.8–40.17 | 39.55–40.05 |
| Q20 (%) of fq1 | 94.58–97.09 | 95.79–97.85 |
| Q20 (%) of fq2 | 88.51–93.66 | 92.07–96.12 |
| Q30 (%) of fq1 | 86.69–92.40 | 88.20–93.46 |
| Q30 (%) of fq2 | 78.99–88.05 | 82.37–90.50 |
| Discard reads related to N | 2,264,798 | |
| Discard reads related to low qual | 59,735,130 | |
| Discard reads related to adapter | 783,780 | |
| Clean data/raw data | 95.07% | |
Before any further analysis, quality control is required to determine whether the data are of acceptable quality. In addition, the raw data must be filtered to reduce noise.
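The figures in the quality statistics table above can be cross-checked with simple arithmetic. The following is a minimal sketch, not the authors' pipeline; the variable names are illustrative, and the assumption that every read has the same length is inferred from the table rather than stated in the source.

```python
# Values copied from the quality statistics table.
raw_reads = 1_273_028_056
clean_reads = 1_210_244_348
raw_bases = 114_572_525_040
clean_bases = 108_921_991_320

# Reads discarded by each filter (names are illustrative).
discarded = {
    "contains_N": 2_264_798,
    "low_quality": 59_735_130,
    "adapter": 783_780,
}

# Read length implied by bases / reads (assumes uniform read length).
read_length_raw = raw_bases // raw_reads
read_length_clean = clean_bases // clean_reads

# The three discard categories should account for all removed reads.
total_discarded = sum(discarded.values())
removed_reads = raw_reads - clean_reads

# Fraction of raw reads retained after filtering.
clean_ratio = clean_reads / raw_reads

print(read_length_raw, read_length_clean)
print(total_discarded == removed_reads)
print(f"{clean_ratio:.2%}")
```

Under these assumptions the discard counts sum exactly to the difference between raw and clean reads, and the retained fraction reproduces the 95.07% reported in the table.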
Figure 1. Analysis of base composition and quality. (A) Unbalanced base composition of raw reads. (B) Balanced base composition of raw reads. (C) Low-quality distribution of bases along reads. Each dot represents the quality value at the corresponding position along the reads. If the percentage of low-quality bases (<20) was very high, the sequencing quality of the lane was considered poor. (D) High-quality distribution of bases along reads. Each dot represents the quality value at the corresponding position along the reads. If the percentage of low-quality bases (<20) was low, the sequencing quality of the lane was considered good.
Alignment results.
| Item | Value | Item | Value |
|---|---|---|---|
| Clean reads | 1,210,244,348 | Duplicate rate | 8.51% |
| Clean bases (bp) | 108,921,991,320 | Mismatch bases | 425,479,678 |
| Mapped reads | 1,173,317,876 | Mismatch rate | 0.41% |
| Mapped bases (bp) | 103,953,154,126 | Average sequencing depth | 32.76 |
| Mapping rate | 96.95% | Coverage | 99.84% |
| Uniq reads | 1,125,241,695 | Coverage at least 4X | 99.21% |
| Uniq bases (bp) | 99,700,359,408 | Coverage at least 10X | 97.48% |
| Unique rate | 95.90% | Coverage at least 20X | 91.30% |
| Duplicate reads | 99,867,211 | | |
Bp, base pairs.
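The rates in the alignment table can be reproduced from the raw counts. This is a sketch under an assumption: the source does not state the denominators, but the reported percentages are matched when the mapping rate is taken over clean reads and the unique, duplicate and mismatch rates are taken over mapped reads or mapped bases. The helper `pct` is hypothetical.

```python
# Counts copied from the alignment results table.
clean_reads = 1_210_244_348
mapped_reads = 1_173_317_876
uniq_reads = 1_125_241_695
duplicate_reads = 99_867_211
mapped_bases = 103_953_154_126
mismatch_bases = 425_479_678


def pct(numerator: int, denominator: int) -> str:
    """Format a ratio as a percentage with two decimals."""
    return f"{numerator / denominator:.2%}"


# Denominators are inferred from the reported values, not stated in the source.
rates = {
    "mapping_rate": pct(mapped_reads, clean_reads),
    "unique_rate": pct(uniq_reads, mapped_reads),
    "duplicate_rate": pct(duplicate_reads, mapped_reads),
    "mismatch_rate": pct(mismatch_bases, mapped_bases),
}

for name, value in rates.items():
    print(name, value)
```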
Figure 2. Depth distribution. (A) The x-axis denotes the sequencing depth and the y-axis indicates the percentage of the non-N region of the whole genome at a given sequencing depth. (B) Cumulative depth distribution in the non-N region of the whole genome. The x-axis denotes the sequencing depth and the y-axis indicates the fraction of bases covered at or above a given sequencing depth.
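The two panels correspond to two simple summaries of per-base depth: the fraction of positions at exactly a given depth, and the fraction at or above a given depth (as in the "Coverage at least 4X/10X/20X" rows). The sketch below illustrates both on a toy list of depths; the function names and data are hypothetical and do not come from the study.

```python
from collections import Counter


def depth_distribution(depths):
    """Return {depth: fraction of positions at exactly that depth}."""
    counts = Counter(depths)
    n = len(depths)
    return {d: counts[d] / n for d in sorted(counts)}


def fraction_at_least(depths, threshold):
    """Return the fraction of positions with depth >= threshold."""
    return sum(1 for d in depths if d >= threshold) / len(depths)


# Toy per-base depths for ten non-N positions (illustrative only).
toy_depths = [0, 3, 5, 12, 12, 25, 30, 30, 33, 41]

dist = depth_distribution(toy_depths)
print(dist)

# Analogue of overall coverage and of the "at least 4X/10X/20X" rows.
for threshold in (1, 4, 10, 20):
    print(threshold, fraction_at_least(toy_depths, threshold))
```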
SNPs summary of annotation.
| Categories | Value | Categories | Value |
|---|---|---|---|
| Total | 3,830,314 | Splicing | 143 |
| 1000 genome and dbsnp135 | 3,768,967 | NcRNA | 93,679 |
| 1000 genome specific | 1,572 | UTR5 | 3,747 |
| dbSNP135 specific | 56,946 | UTR5 and UTR3 | 12 |
| dbSNP rate | 99.89% | UTR3 | 24,880 |
| Novel | 2,829 | Intronic | 1,330,526 |
| Hom | 479,258 | Upstream | 18,144 |
| Het | 3,351,056 | Upstream and downstream | 580 |
| Synonymous | 11,267 | Downstream | 21,376 |
| Missense | 9,534 | Intergenic | 2,316,322 |
| Stopgain | 71 | SIFT | 1,138 |
| Stoploss | 33 | Ti/Tv | 2.1055 |
| Exonic | 20,616 | dbSNP Ti/Tv | 2.1068 |
| Exonic and splicing | 289 | Novel Ti/Tv | 1.1191 |
SNP, single-nucleotide polymorphism; UTR, untranslated region; SIFT, sorting intolerant from tolerant; Ti, transition; Tv, transversion.
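The Ti/Tv values in the table are ratios of transitions (purine to purine or pyrimidine to pyrimidine changes) to transversions (purine to pyrimidine changes or the reverse). A minimal sketch of that calculation follows; the function names and the toy list of (reference, alternate) alleles are hypothetical and are not taken from the study's call set.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Compute transitions / transversions for (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution; ignore
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


# Toy call set: four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(toy_snps)
print(ratio)
```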
Insertion/deletion summary of annotation.
| Categories | Value | Categories | Value |
|---|---|---|---|
| Total | 601,124 | Stopgain | 1 |
| 1000 genome and dbsnp135 | 301,621 | Stoploss | 1 |
| 1000 genome specific | 73,292 | Exonic | 403 |
| dbSNP135 specific | 119,018 | Exonic and splicing | 6 |
| dbSNP rate | 69.98% | Splicing | 77 |
| Novel | 107,193 | NcRNA | 15,081 |
| Hom | 101,236 | UTR5 | 438 |
| Het | 499,888 | UTR5 and UTR3 | 3 |
| Frameshift insertion | 123 | UTR3 | 4,954 |
| Non-frameshift insertion | 85 | Intronic | 211,208 |
| Frameshift deletion | 100 | Upstream | 3,074 |
| Non-frameshift deletion | 99 | Upstream and downstream | 99 |
| Frameshift block substitution | 0 | Downstream | 4,051 |
| Non-frameshift block substitution | 0 | Intergenic | 361,730 |
SNP, single-nucleotide polymorphism; UTR, untranslated region.
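The "Novel" counts and "dbSNP rate" values in the SNP and insertion/deletion tables follow from the database overlap counts. The sketch below assumes that novel variants are those found in neither database and that the dbSNP rate is the share of variants present in dbSNP135; these definitions are inferred from the numbers and are not stated in the source. The function `summarize` is hypothetical.

```python
def summarize(total, both, kg_only, dbsnp_only):
    """Return (novel count, dbSNP rate) from database overlap counts.

    both       -- variants in 1000 Genomes and dbSNP135
    kg_only    -- variants only in 1000 Genomes
    dbsnp_only -- variants only in dbSNP135
    """
    novel = total - both - kg_only - dbsnp_only
    dbsnp_rate = (both + dbsnp_only) / total
    return novel, f"{dbsnp_rate:.2%}"


# Counts copied from the SNP and insertion/deletion annotation tables.
snp_summary = summarize(3_830_314, 3_768_967, 1_572, 56_946)
indel_summary = summarize(601_124, 301_621, 73_292, 119_018)

# Homozygous + heterozygous calls should add up to the totals.
snp_genotypes_ok = 479_258 + 3_351_056 == 3_830_314
indel_genotypes_ok = 101_236 + 499_888 == 601_124

print(snp_summary, indel_summary)
print(snp_genotypes_ok, indel_genotypes_ok)
```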
Figure 3. InDel length distribution. Length distribution of the InDels in (A) the whole genome and (B) the CDS. The length distribution of InDels in the coding region shows peaks at lengths that are multiples of 3 bp. InDels with this periodicity are non-frameshift InDels, which have a relatively small effect on the genome compared with frameshift InDels. InDel, insertion/deletion; CDS, coding sequence.
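The distinction between frameshift and non-frameshift InDels depends only on length: a coding InDel whose length is divisible by 3 preserves the reading frame. A minimal sketch of that classification, using hypothetical function names and toy lengths rather than the study's data:

```python
from collections import Counter


def is_frameshift(length_bp: int) -> bool:
    """A coding InDel shifts the reading frame unless its length is a multiple of 3."""
    return length_bp % 3 != 0


def classify_indels(lengths):
    """Count frameshift and non-frameshift InDels in a list of lengths (bp)."""
    return Counter(
        "frameshift" if is_frameshift(n) else "non-frameshift" for n in lengths
    )


# Toy coding-region InDel lengths in bp (illustrative only).
toy_lengths = [1, 2, 3, 4, 6, 9, 10]
counts = classify_indels(toy_lengths)
print(counts)
```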
Structural variants summary of annotation.
| Categories | Value | Categories | Value |
|---|---|---|---|
| Total | 5,412 | NcRNA | 114 |
| Insertion | 352 | UTR5 | 3 |
| Deletion | 4,834 | UTR5 and UTR3 | 0 |
| Inversion | 14 | UTR3 | 8 |
| ITX | 120 | Intronic | 1,823 |
| CTX | 92 | Upstream | 11 |
| Exonic | 6 | Upstream and downstream | 2 |
| Exonic and splicing | 1 | Downstream | 29 |
| Splicing | 6 | Intergenic | 3,409 |
ITX, intra-chromosomal translocation; CTX, inter-chromosomal translocation; SNP, single-nucleotide polymorphism; UTR, untranslated region.
Copy number variations summary of annotation.
| Categories | Value | Categories | Value |
|---|---|---|---|
| Total | 5,201 | UTR3 | 7 |
| Exonic | 954 | Intronic | 1,174 |
| Exonic and splicing | 0 | Upstream | 59 |
| Splicing | 274 | Upstream and downstream | 3 |
| NcRNA | 196 | Downstream | 35 |
| UTR5 | 0 | Intergenic | 2,499 |
| UTR5 and UTR3 | 0 | Amplification size | 12,106,400 |
| | | Deletion size | 85,672,600 |
UTR, untranslated region.
Figure 4. SNP depth distribution. The x-axis denotes the sequencing depth and the y-axis indicates the percentage of SNPs at that depth. The depth distribution of novel SNPs should follow the same trend as that of known SNPs. SNP, single-nucleotide polymorphism.