Chuanfeng Huang, Libin Shao, Shoufang Qu, Junhua Rao, Tao Cheng, Zhisheng Cao, Sanyang Liu, Jie Hu, Xinming Liang, Ling Shang, Yangyi Chen, Zhikun Liang, Jiezhong Zhang, Peipei Chen, Donghong Luo, Anna Zhu, Ting Yu, Wenxin Zhang, Guangyi Fan, Fang Chen, Jie Huang.
Abstract
Sequencing technologies have developed rapidly in recent years, leading to breakthroughs in sequencing-based clinical diagnosis, but an accurate and complete genome variation benchmark is required for further assessment of precision medicine applications. Although the human cell line NA12878 has been successfully developed into a variation benchmark, population-specific variation benchmarks are still lacking. Here, we established an Asian human variation benchmark by constructing and sequencing a stabilized cell line from a Chinese Han volunteer. Using seven different sequencing strategies, we obtained ~3.88 Tb of clean data from different laboratories, aiming for high sequencing depth and accurate variation detection. By combining the variations identified from different sequencing strategies and different analysis pipelines, we identified 3.35 million SNVs and 348.65 thousand indels, which were well supported by our sequencing data and passed our strict quality control, and thus should constitute a high-confidence variation benchmark. In addition, we detected 5,913 high-quality SNVs, 969 of which were novel, located in highly homologous regions and supported by long-range information in both the co-barcoding single tube Long Fragment Read (stLFR) data and the PacBio HiFi CCS data. Furthermore, using the long-read data (stLFR and HiFi CCS), we were able to phase more than 99% of heterozygous SNVs, which raises the benchmark to the haplotype level. Our study provides comprehensive sequencing data as well as an integrated variation benchmark of an Asian-derived cell line, which should be valuable for future sequencing-based clinical development.
Year: 2020 PMID: 32555294 PMCID: PMC7300012 DOI: 10.1038/s41598-020-66605-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the variation calling pipeline. The major steps included data filtering, alignment, variation calling, and integrated analysis.
Figure 2. Saturation analysis. The relationship between SNVs (A) / indels (B) and depth, with the X axis showing sequencing depth and the Y axis showing the number of SNVs/indels detected.
Figure 3. Blind zones by MPS in each sequencing platform.
Figure 4. Depth and coverage of the NBPF4 gene in blind zones.
Figure 5. Consistency analysis: SNV (A) and indel (B) consistency among BGI regular MPS platforms, Illumina regular MPS platforms, the linked-reads library, and PacBio CCS mode.
Figure 6. Density maps of SNV and indel variations normalized against the Chinese population in the 1000 Genomes Project. From the innermost to the outermost circle: DNBSEQ-MPS, Illumina-MPS, stLFR and PacBio CCS. The last-but-one circle contains several lines marking regions where no variations were detected in the Chinese population but our data set contains variations. Window = 1 Mb; inside and outside are indel and SNV, respectively.
Annotation of HJ, YH and NA12878 SNVs.
| Sample | HJ | YH | NA12878 |
|---|---|---|---|
| Total | 3,345,294 | 3,072,912 | 3,259,653 |
| dbSNP (%) | 99.29 | 87.13 | 99.89 |
| 1000genomes (%) | 98.28 | 95.93 | 98.68 |
| Novel (%) | 0.01 | 12.87 | 0.10 |
| Homozygous | 1,492,029 | 1,352,822 | 1,289,007 |
| Heterozygous | 1,853,265 | 1,720,090 | 1,970,646 |
| Intronic | 1,366,626 | 1,256,586 | 1,344,882 |
| 5′ UTRs | 4,306 | 3,871 | 4,207 |
| 3′ UTRs | 22,396 | 22,182 | 21,248 |
| Upstream | 47,789 | 43,612 | 44,056 |
| Downstream | 47,217 | 43,627 | 43,574 |
| Intergenic | 1,827,269 | 1,674,057 | 1,775,196 |
| Ti/Tv | 2.1 | 2.01 | 2.1 |
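
The Ti/Tv row is the ratio of transitions (A↔G, C↔T) to transversions (all other base substitutions) among the called SNVs. A minimal Python sketch of that ratio follows, assuming SNVs are available as `(ref, alt)` base pairs; the function name `ti_tv_ratio` and the toy input are illustrative and not taken from the study's pipeline.

```python
# Base substitutions that count as transitions (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Hypothetical helper: entries that are not single-base substitutions
    (indels, N bases, ref == alt) are skipped rather than counted.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single-nucleotide substitution
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Toy example: two transitions and one transversion.
example = [("A", "G"), ("C", "T"), ("A", "C")]
print(ti_tv_ratio(example))
```

Whole-genome human SNV call sets typically show a Ti/Tv ratio close to 2, which is why this value is commonly used as a quick sanity check on call quality.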
Haplotype phasing of small variants.
| Chr | CCS Heterozygous | CCS Phased SNV | CCS Phased rate (%) | stLFR Heterozygous | stLFR Phased SNV | stLFR Phased rate (%) |
|---|---|---|---|---|---|---|
| 1 | 169,906 | 169,174 | 99.57 | 172,790 | 172,682 | 99.94 |
| 2 | 167,518 | 166,806 | 99.57 | 103,486 | 103,435 | 99.95 |
| 3 | 143,618 | 142,968 | 99.55 | 101,126 | 101,070 | 99.94 |
| 4 | 151,585 | 151,033 | 99.64 | 102,317 | 102,274 | 99.96 |
| 5 | 128,296 | 127,772 | 99.59 | 75,908 | 75,874 | 99.96 |
| 6 | 131,798 | 131,325 | 99.64 | 70,091 | 70,064 | 99.96 |
| 7 | 123,689 | 123,253 | 99.65 | 68,411 | 68,379 | 99.95 |
| 8 | 120,782 | 120,391 | 99.68 | 72,564 | 72,536 | 99.96 |
| 9 | 94,946 | 94,653 | 99.69 | 53,789 | 53,762 | 99.95 |
| 10 | 99,256 | 98,894 | 99.64 | 60,279 | 60,254 | 99.96 |
| 11 | 100,822 | 100,465 | 99.65 | 49,744 | 49,724 | 99.96 |
| 12 | 101,519 | 101,168 | 99.65 | 172,818 | 172,720 | 99.94 |
| 13 | 75,515 | 75,282 | 99.69 | 43,553 | 43,531 | 99.95 |
| 14 | 68,223 | 67,954 | 99.61 | 37,388 | 37,370 | 99.95 |
| 15 | 67,759 | 67,549 | 99.69 | 34,105 | 34,089 | 99.95 |
| 16 | 69,062 | 68,823 | 99.65 | 145,073 | 145,017 | 99.96 |
| 17 | 54,620 | 54,358 | 99.52 | 151,677 | 151,603 | 99.95 |
| 18 | 59,025 | 58,847 | 99.70 | 130,925 | 130,865 | 99.95 |
| 19 | 48,314 | 48,195 | 99.75 | 135,438 | 135,376 | 99.95 |
| 20 | 42,939 | 42,750 | 99.56 | 126,619 | 126,552 | 99.95 |
| 21 | 39,076 | 39,010 | 99.83 | 122,931 | 122,888 | 99.97 |
| 22 | 31,029 | 30,979 | 99.84 | 111,751 | 111,697 | 99.95 |
| X | — | — | — | 4,418 | 3,636 | 82.30 |
| Y | — | — | — | 4,444 | 4,354 | 97.97 |
| Genome | 2,089,297 | 2,081,649 | 99.63 | 2,151,645 | 2,149,752 | 99.91 |
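
The phased rate in each row appears to be the phased SNV count divided by the heterozygous count, expressed as a percentage. The short Python sketch below recomputes it for the Genome row; the helper name `phased_rate` is an illustrative assumption, and the counts are copied from the table above.

```python
def phased_rate(phased, heterozygous):
    """Percentage of heterozygous SNVs that were phased, rounded to 2 decimals."""
    if heterozygous <= 0:
        raise ValueError("heterozygous count must be positive")
    return round(100.0 * phased / heterozygous, 2)


# (heterozygous, phased) counts from the Genome row of the table.
genome_rows = {
    "CCS": (2_089_297, 2_081_649),
    "stLFR": (2_151_645, 2_149_752),
}

for platform, (het, phased) in genome_rows.items():
    print(f"{platform}: {phased_rate(phased, het):.2f}%")
```

Under this definition the sketch reproduces the tabulated genome-wide rates of 99.63% for CCS and 99.91% for stLFR.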