Qian Liu, Qiang Hu, Song Yao, Marilyn L Kwan, Janise M Roh, Hua Zhao, Christine B Ambrosone, Lawrence H Kushi, Song Liu, Qianqian Zhu.
Abstract
As next-generation sequencing (NGS) technology has become widely used to identify causal genetic variants for various diseases and traits, a number of packages for checking NGS data quality have become publicly available. In addition to the quality of the sequencing data, sample quality issues, such as gender mismatch, abnormal inbreeding coefficient, cryptic relatedness, and population outliers, can also have a fundamental impact on downstream analysis. However, there is a lack of tools specialized in identifying problematic samples from NGS data, often due to limitations in sample size and variant counts. We developed SeqSQC, a Bioconductor package, to automate and accelerate sample cleaning in NGS data of any scale. SeqSQC is designed for efficient data storage and access, and is equipped with interactive plots for intuitive data visualization to expedite the identification of problematic samples. SeqSQC is available at http://bioconductor.org/packages/SeqSQC.
Keywords: 1000 Genomes Project; Bioconductor package; Next-generation sequencing; Quality assessment; Whole-exome sequencing
Year: 2019 PMID: 30959223 PMCID: PMC6620264 DOI: 10.1016/j.gpb.2018.07.006
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409
Dataset from the 1000 Genomes Project
| Dataset | Population | Samples | Related pairs |
| --- | --- | --- | --- |
| Benchmark | AFR | 22 | 3 (2 PO + 1 FS) |
| | EAS | 22 | 2 (1 FS + 1 HF) |
| | EUR | 21 | 1 (1 HF) |
| | SAS | 22 | 2 (2 PO) |
| Test cohorts | AFR | 647 AFR + 2 EAS + 2 EUR + 2 SAS + 1 DU + 1 CTM | 6 (1 PO + 4 FS + 1 HF) |
| | EAS | 493 EAS + 2 AFR + 2 EUR + 2 SAS + 1 DU + 1 CTM | 9 (3 PO + 3 FS + 3 HF) |
| | EUR | 484 EUR + 2 AFR + 2 EAS + 2 SAS + 1 DU + 1 CTM | 1 (1 FS) |
| | SAS | 472 SAS + 2 AFR + 2 EAS + 2 EUR + 1 DU + 1 CTM | 3 (2 PO + 1 HF) |
Note: PO, parent-offspring; FS, full sibling; HF, half sibling/avuncular pair; AFR, African; EAS, East Asian; EUR, European; SAS, South Asian; DU, duplicate; CTM, contamination.
Figure 1 Flowchart of the
In the data preparation module, SeqSQC merges the study cohort with the benchmark data. The merged data, stored in the SeqSQC class, are used for the subsequent sample QC and result summary. The input files accepted by SeqSQC include a VCF file, a BED file for the capture region, and an annotation file with sample population and gender information. Users can run the wrap-up function for an automated sample QC, which generates all QC results, a list of problematic samples with the reason for removal, and a sample QC report with interactive plots for each QC step. Users can also call a specific QC function or customize the settings of each QC step, including the criteria for defining problematic samples and the choice of statistical methods.
Figure 2 The sample quality check for the AFR test cohort from the 1000 Genomes Project
A. Sex check. A total of 655 study samples and 22 benchmark samples of AFR ancestry are shown. Gray lines mark sex inbreeding coefficients of 0.2 and 0.8, the thresholds for assigning sample gender (see Method). Two self-reported female samples were detected to be male by SeqSQC (shown as two red triangles among the group of cyan triangles).
B. The plot of inbreeding coefficients. A total of 655 study samples and 22 benchmark samples of AFR ancestry are shown. Gray lines mark autosomal inbreeding coefficients five standard deviations above and below the mean. Any point beyond the gray lines is defined as problematic. Eight inbreeding outliers were detected (including one simulated sample with contamination, six intended population outliers, and one unintended inbreeding outlier; see Tables S1 and S2).
C. IBD check. After removing the problematic samples detected in the previous QC steps, a total of 732 samples (including 645 study samples and 87 benchmark samples) are shown in pairwise fashion. Sample pairs with known relationships are highlighted, including DU (red), FS (green), HF (orange), and PO (pink), whereas pairs with unknown relationship are marked in black. “+” marks the expected position for each corresponding relationship. Newly detected relationships from this test cohort are highlighted with red circles.
D. The plot of the first two PC axes from the PCA. After removing the problematic samples detected in the previous QC steps (except for the six intended population outliers), as well as the related samples in the benchmark data, a total of 718 independent samples (including 638 study samples and 80 benchmark samples) are shown. The six intended population outliers (two from each of the EAS, EUR, and SAS populations) are highlighted with red circles. The AFR samples separate into different groups along PC2 because they come from different sub-populations, including ACB, ASW, ESN, GWD, LWK, MSL, and YRI.
AFR, African; EAS, East Asian; EUR, European; SAS, South Asian; DU, duplicate; FS, full-sibling; HF, half-sibling/avuncular pair; UN, unknown; PO, parent–offspring pair; PCA, principal component analysis; ACB, African Caribbeans in Barbados; ASW, Americans of African ancestry in Southwestern USA; ESN, Esan in Nigeria; GWD, Gambian in Western Divisions in the Gambia; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; YRI, Yoruba in Ibadan, Nigeria.
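The threshold rules in panels A and B can be illustrated with a short Python sketch. This is not SeqSQC's code (SeqSQC is an R/Bioconductor package); the function names and sample IDs are hypothetical. It assumes the usual convention that an X-chromosome inbreeding coefficient below 0.2 predicts female and above 0.8 predicts male, and that the mean and standard deviation for the five-SD rule are computed over the values supplied.

```python
from statistics import mean, stdev


def predict_sex(x_inbreeding, female_max=0.2, male_min=0.8):
    """Predict sex from the X-chromosome inbreeding coefficient.

    Assumed convention: below 0.2 -> female, above 0.8 -> male,
    anything in between is left undetermined.
    """
    if x_inbreeding < female_max:
        return "female"
    if x_inbreeding > male_min:
        return "male"
    return "undetermined"


def sex_mismatches(samples):
    """Return (id, reported, predicted) for samples whose predicted sex
    contradicts the self-reported sex.

    `samples` is a list of (sample_id, reported_sex, x_inbreeding) tuples.
    """
    flagged = []
    for sample_id, reported, coef in samples:
        predicted = predict_sex(coef)
        if predicted != "undetermined" and predicted != reported:
            flagged.append((sample_id, reported, predicted))
    return flagged


def inbreeding_outliers(coefs, n_sd=5.0):
    """Return IDs whose autosomal inbreeding coefficient lies more than
    `n_sd` standard deviations from the mean of the supplied values.

    `coefs` maps sample_id -> autosomal inbreeding coefficient.
    """
    values = list(coefs.values())
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    lower = center - n_sd * spread
    upper = center + n_sd * spread
    return [sid for sid, f in coefs.items() if f < lower or f > upper]


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    cohort = [
        ("S1", "female", 0.05),
        ("S2", "male", 0.97),
        ("S3", "female", 0.95),  # reported female, predicted male
    ]
    print(sex_mismatches(cohort))
```

Note that a five-SD rule can only flag a sample when the cohort is reasonably large: a single extreme value also inflates the standard deviation, so in very small cohorts no point can lie that far from the mean.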
The problematic samples in WES of 143 breast cancer patients
| Population | Samples | Problematic samples | Reason |
| --- | --- | --- | --- |
| AFR | 69 | 1 | Inbreeding outlier |
| | | 2 | Population outlier |
| EUR | 48 | 1 | Inbreeding outlier |
| ASN | 26 | 2 | Population outlier |