Sihan Liu, Yuanyuan Zeng, Chao Wang, Qian Zhang, Meilin Chen, Xiaolu Wang, Lanchen Wang, Yu Lu, Hui Guo, Fengxiao Bu.
Abstract
In clinical genetic testing, checking the concordance between self-reported gender and genotype-inferred gender from genomic data is an important quality control measure, because mismatched gender due to sex chromosomal abnormalities or misregistration of clinical information can significantly affect molecular diagnosis and treatment decisions. Targeted gene sequencing (TGS) is widely recommended as a first-tier diagnostic step in clinical genetic testing. However, the existing gender-inference tools are optimized for whole genome and whole exome data and are neither adequate nor accurate for analyzing TGS data. In this study, we validated a new gender-inference tool, seGMM, which uses unsupervised clustering (Gaussian mixture model) to determine the gender of a sample. The seGMM tool can also identify sex chromosomal abnormalities in samples by aligning the sequencing reads from the genotype data. The seGMM tool consistently demonstrated >99% gender-inference accuracy in a publicly available 1,000-gene panel dataset from the 1000 Genomes Project, an in-house 785 hearing loss gene panel dataset of 16,387 samples, and a 187 autism risk gene panel dataset from the Autism Clinical and Genetic Resources in China (ACGC) database. The performance and accuracy of seGMM were significantly higher for the TGS, whole exome sequencing (WES), and whole genome sequencing (WGS) datasets than those of other existing gender-inference tools such as PLINK, seXY, and XYalign. The results of seGMM were confirmed by short tandem repeat analysis of the sex chromosome marker gene, amelogenin. Furthermore, our data showed that seGMM accurately identified sex chromosomal abnormalities in the samples. In conclusion, the seGMM tool shows great potential in clinical genetics by determining the sex chromosomal karyotypes of samples from massively parallel sequencing data with high accuracy.
Keywords: Gaussian mixture model; aneuploidy; gender; massively parallel sequencing data; sex chromosomal abnormality
Year: 2022; PMID: 35309142; PMCID: PMC8930203; DOI: 10.3389/fgene.2022.850804
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
Gender prediction accuracy of different methods for samples in dataset 1.
| Tools | Accuracy for all samples (%) | Accuracy for male samples (%) | Accuracy for female samples (%) |
|---|---|---|---|
| PLINK | 81.44 | 48.28 | 100 |
| seXY | 62.5 | 45.45 | 81.63 |
| XYalign | 98.08 | 100 | 95.92 |
| seGMM | 99.52 | 100 | 98.98 |
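The three accuracy columns in the table above can be read as the share of correct calls among all samples, among self-reported males, and among self-reported females. The following is a minimal sketch of that arithmetic, assuming the self-reported label is used as the reference; the function name and the toy labels are hypothetical and are not the study data.

```python
from collections import Counter

def accuracy_by_sex(reported, inferred):
    """Return percent accuracy overall and per self-reported sex."""
    correct = Counter()
    total = Counter()
    for truth, call in zip(reported, inferred):
        # Every sample counts towards "all" and towards its own sex group.
        for group in ("all", truth):
            total[group] += 1
            correct[group] += int(truth == call)
    return {g: round(100.0 * correct[g] / total[g], 2) for g in total}

# Hypothetical toy labels (NOT taken from dataset 1).
reported = ["male", "male", "male", "female", "female"]
inferred = ["male", "female", "male", "female", "female"]

result = accuracy_by_sex(reported, inferred)
print(result)
```

With these toy labels, one of three males is miscalled, so the male accuracy drops while the female accuracy stays at 100%, the same pattern the PLINK row shows.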
FIGURE 1. Schematic diagram of seGMM. The seGMM tool automatically collects features from the input VCF and BAM files and builds the GMM model. The output of seGMM includes gender prediction results and identification of samples with abnormal sex chromosomes.
FIGURE 2. The performance of seGMM in the TGS datasets. (A–D) Distribution of features collected from dataset 1. (E–G) Sample classification results of datasets 1, 2, and 3 based on seGMM. The colors represent different sample clusters. Dir1 and Dir2 represent the eigenvectors that specify the discriminant subspace generated from the features included in the GMM model.
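The idea of clustering samples with a Gaussian mixture model, as outlined in Figures 1 and 2, can be illustrated with a small sketch. This is not the seGMM implementation: it fits a two-component mixture to a single hypothetical feature (the percentage of reads mapped to the Y chromosome) using plain expectation-maximisation, whereas seGMM builds its model from several features. The function name, the feature values, and the rule "the component with the higher mean is male" are all assumptions made for illustration.

```python
import math

def fit_two_gaussians(xs, n_iter=50, var_floor=1e-4):
    """Fit a two-component 1-D Gaussian mixture by expectation-maximisation."""
    mu = [min(xs), max(xs)]                  # deterministic start: one mean per extreme
    spread = (max(xs) - min(xs)) or 1.0
    var = [(spread / 4) ** 2, (spread / 4) ** 2]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [weight[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            total = sum(dens) or 1e-300      # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate means, variances, and mixing weights.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, var_floor)
            weight[k] = nk / len(xs)
    return mu, resp

# Hypothetical Y-mapped read percentages: five low values, five high values.
y_read_pct = [0.010, 0.020, 0.015, 0.030, 0.025, 0.30, 0.28, 0.35, 0.32, 0.31]

means, resp = fit_two_gaussians(y_read_pct)
# Assumption: the cluster with more Y-mapped reads corresponds to male samples.
male_component = 0 if means[0] > means[1] else 1
labels = ["male" if r[male_component] >= 0.5 else "female" for r in resp]
print(labels)
```

Because the model is unsupervised, no labelled training set is needed; the clusters are only named afterwards. Samples that fit neither cluster well would be candidates for closer inspection, although how seGMM flags abnormal karyotypes is not detailed in this section.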
Gender prediction accuracy of different methods for samples in datasets 2 and 3.
| Methods | Accuracy in dataset 2 (%) | Accuracy in dataset 3 (%) |
|---|---|---|
| PLINK | 87.10 | 38.87 |
| seGMM | 99.92 | 92.31 |
FIGURE 3. Experimentally verified gender of HL-001200 (A), CTRL-002692 (B) and CTRL-002753 (C). The green box shows the location of the amelogenin loci.
Gender prediction accuracy of different methods for the WES and WGS datasets.
| Datasets | PLINK (%) | XYalign (%) | seXY (%) | seGMM (%) |
|---|---|---|---|---|
| 1000G phase3 WES data | 100 | 99.65 | 100 | 100 |
| 1000G phase3 high quality WGS data | 100 | 100 | 100 | 100 |
| In-house WES data | 99.79 | 99.91 | 49.23 | 100 |
FIGURE 4. The prediction accuracy of seGMM in inferring the gender of samples from the in-house WES dataset. (A) Sample clustering results of seGMM. The colors represent different sample clusters. Dir1 and Dir2 represent eigenvectors that specify the discriminant subspace generated from the features included in the GMM model. (B) Scatter plot of the reads mapped to the X and Y chromosomes. Three samples (HL-029620, HL-009382 and HL-019110) were identified with XYY sex chromosome karyotypes.
Experimental verification of gender prediction results for samples in the in-house WES data.
| Sample ID | Sizes of PCR products at the amelogenin loci (bp) | Self-reported gender | seGMM inferred gender | Experimentally validated gender |
|---|---|---|---|---|
| HL-005584 | 209.15; - | Male | Female | Female |
| HL-006009 | 209.06; - | Male | Female | Female |
| HL-006904 | 209.04; 214.8 | Female | Male | Male |
| HL-007335 | 209.06; 214.85 | Female | Male | Male |
| HL-007935 | 209.11; - | Male | Female | Female |
| HL-012246 | 209.18; 214.92 | Female | Male | Male |
| HL-033182 | 209.02; - | Female | Female | Female |
| HL-020292 | 209.19; - | Female | Female | Female |
| HL-011500 | 209.25; 215.07 | Male | Male | Male |
| HL-019211 | 209.19; 215.04 | Male | Male | Male |
| HL-009389 | 209.27; - | Female | Female | Female |
| HL-012554 | 209.26; 215.03 | Male | Male | Male |
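The pattern in the table above can be checked mechanically. The sketch below assumes the usual reading of the amelogenin assay, in which a single product of about 209 bp indicates only X-derived copies (female) and an additional product of about 215 bp indicates a Y-derived copy (male). That interpretation rule and the function name are assumptions added here; the rows are copied from the table.

```python
# (sample ID, observed product sizes in bp, self-reported, seGMM inferred, validated)
rows = [
    ("HL-005584", (209.15,), "Male", "Female", "Female"),
    ("HL-006009", (209.06,), "Male", "Female", "Female"),
    ("HL-006904", (209.04, 214.8), "Female", "Male", "Male"),
    ("HL-007335", (209.06, 214.85), "Female", "Male", "Male"),
    ("HL-007935", (209.11,), "Male", "Female", "Female"),
    ("HL-012246", (209.18, 214.92), "Female", "Male", "Male"),
    ("HL-033182", (209.02,), "Female", "Female", "Female"),
    ("HL-020292", (209.19,), "Female", "Female", "Female"),
    ("HL-011500", (209.25, 215.07), "Male", "Male", "Male"),
    ("HL-019211", (209.19, 215.04), "Male", "Male", "Male"),
    ("HL-009389", (209.27,), "Female", "Female", "Female"),
    ("HL-012554", (209.26, 215.03), "Male", "Male", "Male"),
]

def sex_from_amelogenin(product_sizes_bp):
    """Assumed rule: two distinct products -> Male, one product -> Female."""
    return "Male" if len(product_sizes_bp) == 2 else "Female"

# seGMM calls that agree with the assay-based rule and with the validated column.
segmm_confirmed = all(
    inferred == validated == sex_from_amelogenin(sizes)
    for _, sizes, _, inferred, validated in rows
)

# Samples whose self-reported gender disagrees with the seGMM call.
discordant = [sample for sample, _, reported, inferred, _ in rows
              if reported != inferred]

print(segmm_confirmed, discordant)
```

In this table every seGMM call matches the experimental result, and the six discordant samples are all cases where the self-reported gender, not the inferred one, was contradicted by the assay.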
FIGURE 5. Quantitative determination of Y chromosome copy number.