Sejoon Lee1,2, Soohyun Lee3, Scott Ouellette3, Woong-Yang Park1, Eunjung A Lee3,4, Peter J Park3,5.
Abstract
In many next-generation sequencing (NGS) studies, multiple samples or data types are profiled for each individual. An important quality control (QC) step in these studies is to ensure that datasets from the same subject are properly paired. Given the heterogeneity of data types, file types and sequencing depths in a multi-dimensional study, a robust program that provides a standardized metric for genotype comparisons would be useful. Here, we describe NGSCheckMate, a user-friendly software package for verifying sample identities from FASTQ, BAM or VCF files. This tool uses a model-based method to compare allele read fractions at known single-nucleotide polymorphisms, considering depth-dependent behavior of similarity metrics for identical and unrelated samples. Our evaluation shows that NGSCheckMate is effective for a variety of data types, including exome sequencing, whole-genome sequencing, RNA-seq, ChIP-seq, targeted sequencing and single-cell whole-genome sequencing, with a minimal requirement for sequencing depth (>0.5X). An alignment-free module can be run directly on FASTQ files for a quick initial check. We recommend using this software as a QC step in NGS studies. AVAILABILITY: https://github.com/parklab/NGSCheckMate.
Year: 2017 PMID: 28369524 PMCID: PMC5499645 DOI: 10.1093/nar/gkx193
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. A schematic overview. NGSCheckMate can handle various data types in any of the three formats (FASTQ, VCF or BAM). The tool calculates pairwise correlations of VAFs (variant allele fractions) from the input files and classifies each pair of files as either matched (from the same individual) or unmatched (not from the same individual). The output files are a text file listing the VAF correlation for each pair, a dendrogram image or an XGMML file with a graph structure that can be fed into graph visualization tools such as Cytoscape.
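The pairwise step in this overview can be illustrated with a minimal Python sketch. It is not NGSCheckMate's code: the function names, the toy VAF values, and the single fixed `cutoff` are assumptions made for illustration, whereas the tool itself selects a depth-dependent cutoff.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def classify_pairs(vafs, cutoff):
    """Return (file_a, file_b, correlation, label) for every pair of files.

    `vafs` maps a file name to its VAF vector over a shared SNP panel.
    `cutoff` is a single hypothetical threshold (an assumption of this sketch).
    """
    rows = []
    for a, b in combinations(sorted(vafs), 2):
        r = pearson(vafs[a], vafs[b])
        rows.append((a, b, r, "matched" if r >= cutoff else "unmatched"))
    return rows


# Toy VAF vectors (made-up values) over six SNP sites.
vafs = {
    "patient1_tumor": [0.0, 0.5, 1.0, 0.45, 0.0, 1.0],
    "patient1_normal": [0.0, 0.55, 1.0, 0.5, 0.05, 0.95],
    "patient2_normal": [1.0, 0.0, 0.5, 0.0, 0.5, 0.0],
}
rows = classify_pairs(vafs, cutoff=0.7)
for a, b, r, label in rows:
    print(f"{a}\t{b}\t{r:.3f}\t{label}")
```

The printed rows mirror the text-file output described in the caption: one line per pair with its VAF correlation and the predicted status.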
Figure 2. Illustration of key steps. (A) A VAF is computed as the fraction of reads supporting a variant (non-reference) allele for a given SNP site. A VAF ranges between 0 and 1 at each genotype. The VAFs are computed across a panel of SNP sites for each file and the Pearson correlation between two VAF vectors is computed. (B) For the alignment-free module, a pre-built hash table stores 21-mer sequence tags that represent SNP sites and alleles. Each tag overlaps with the SNP site either at the center or at one of the two ends. For each SNP site, a total of 24 tag sequences (4 alleles × 3 overlapping SNP locations to the SNP site × 2 orientations (forward and reverse complementary)) are prepared and 6 and 18 of them represent a reference allele and alternative alleles, respectively. A hash is constructed with the 21-mer tags as keys, each pointing to an element of a 2-dimensional read count array, where the two dimensions are SNP loci and alleles. Given an input FASTQ file (single-end) or a pair of input FASTQ files (paired-end), randomly subsampled reads are examined by a 21-nt sliding window. If a 21-nt substring exists in the hash, we increase the corresponding read count value by one and move to the next read. In the end, VAFs are computed using the count values in the array. (C) A depth-dependent VAF correlation background model is constructed by down-sampling from high-coverage WGS data to 0.01–60X. Given input data files i and j, NGSCheckMate computes a VAF correlation coefficient C between the two files and compares it to the precomputed model at the observed depth D defined as the smaller of the mean depths for the two files. The VAF correlation cutoff for classification is the midpoint between the average correlation for matched pairs (C) minus one standard deviation (sd) and the average correlation for unmatched pairs (C) plus one standard deviation (sd) at a given depth.
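The tag construction and read counting in panel (B) can be sketched in Python as follows. This is an illustrative reconstruction from the caption, not the tool's implementation: the function names, the flanking sequences, the SNP identifier `snp1`, and the use of a dictionary in place of the 2-dimensional count array are all assumptions, and read subsampling and tag collisions between SNPs are left out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
TAG_LEN = 21


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def snp_tags(left_flank, ref_base, right_flank):
    """List (tag, allele, is_reference) for one SNP: 4 alleles x 3 offsets x 2 strands."""
    flank = TAG_LEN - 1
    if len(left_flank) != flank or len(right_flank) != flank:
        raise ValueError("each flank must be 20 nt long")
    tags = []
    for allele in "ACGT":
        context = left_flank + allele + right_flank  # 41 nt, SNP at index 20
        # Window starts that place the SNP at the tag's last, center and first base.
        for start in (0, flank // 2, flank):
            tag = context[start:start + TAG_LEN]
            for oriented in (tag, reverse_complement(tag)):
                tags.append((oriented, allele, allele == ref_base))
    return tags


def count_alleles(reads, tag_index):
    """Slide a 21-nt window over each read; count the first tag hit, then move on."""
    counts = {}
    for read in reads:
        for i in range(len(read) - TAG_LEN + 1):
            hit = tag_index.get(read[i:i + TAG_LEN])
            if hit is not None:
                counts[hit] = counts.get(hit, 0) + 1
                break  # at most one count per read, as in the caption
    return counts


def vaf(counts, snp_id, ref_base):
    """Fraction of counted reads supporting a non-reference allele, or None."""
    ref = counts.get((snp_id, ref_base), 0)
    alt = sum(n for (sid, allele), n in counts.items()
              if sid == snp_id and allele != ref_base)
    return alt / (ref + alt) if ref + alt else None


# Hypothetical SNP: made-up 20-nt flanks around a reference base G.
LEFT, RIGHT = "ACGTTGCAAGCTTAGGCTAA", "TCGGATCCATGCAAGTCTGA"
tags = snp_tags(LEFT, "G", RIGHT)
tag_index = {tag: ("snp1", allele) for tag, allele, _ in tags}

reads = [
    LEFT[5:] + "A" + RIGHT[:15],                      # forward read, alternative allele
    reverse_complement(LEFT[5:] + "G" + RIGHT[:15]),  # reverse-strand read, reference allele
]
counts = count_alleles(reads, tag_index)
print(counts, vaf(counts, "snp1", "G"))
```

Storing reverse-complement tags under the forward-strand allele lets reads from either strand contribute to the same count without any alignment step.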
Figure 3. Classification based on VAF correlations. (A) Depth-dependent VAF correlations derived from WGS and WES datasets for different types of sample pairs: identical, related (parent-child or siblings) and unrelated. The LOESS regression lines are shown. Family pairs were tested only for depths >0.5X. (B) VAF correlations based on simulation, with different shape parameters (r) of a negative binomial model for the read depth distribution, numbers of SNPs used (12K versus 20K), and percentages of alternative homozygous SNPs (see ‘Materials and Methods’ section). A binomial distribution was assumed for the distribution of alternative allele reads (see ‘Materials and Methods’ section). The top and bottom 5% of simulated VAF correlations are plotted as shaded areas around each line. VAF correlations of parent-child pairs (green) are hidden in the figure because they overlap those of sibling pairs (purple). VAF correlations using 12K (solid lines) and 20K SNPs (dotted lines) are also indistinguishable (nearly superimposed in the figure). (C) The distributions of simulated VAF correlations in (B) at the depth of 0.5X are shown in vertical lines representing the top and bottom 5% VAF correlation values. (D) Accuracy at various sequencing depths for WGS data. The original and down-sampled datasets of 66 TCGA colorectal pairs and 36 single-neurons from three post-mortem brains were tested.
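A simulation of the kind described in panel (B) can be sketched as below, together with the midpoint cutoff rule from Figure 2C. This is a simplified illustration, not the authors' simulation: the genotype proportions, the default shape parameter, the number of SNPs and pairs, and all function names are assumptions. Genotypes of unmatched individuals are drawn independently with the same proportions at every site, so the simulated unmatched correlation sits near zero, unlike real data.

```python
import random
from math import exp, sqrt
from statistics import mean, stdev


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def poisson(rng, lam):
    """Knuth's algorithm; adequate for the small means used here."""
    limit, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def draw_genotypes(rng, n_snps, p_het=0.5, p_hom_alt=0.25):
    """True allele fraction (0, 0.5 or 1) per SNP; the proportions are assumptions."""
    out = []
    for _ in range(n_snps):
        u = rng.random()
        out.append(0.5 if u < p_het else 1.0 if u < p_het + p_hom_alt else 0.0)
    return out


def observe_vafs(rng, genotypes, mean_depth, r):
    """Observed VAF per SNP, or None where no read covers the site."""
    vafs = []
    for g in genotypes:
        # Negative binomial depth with shape r, drawn as a gamma-Poisson mixture.
        depth = poisson(rng, rng.gammavariate(r, mean_depth / r))
        if depth == 0:
            vafs.append(None)
            continue
        alt = sum(rng.random() < g for _ in range(depth))  # Binomial(depth, g)
        vafs.append(alt / depth)
    return vafs


def vaf_correlation(a, b):
    """Pearson correlation over SNPs covered in both profiles."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("too few SNPs covered in both profiles")
    return pearson([x for x, _ in pairs], [y for _, y in pairs])


def simulate(mean_depth, r=2.0, n_snps=1000, n_pairs=10, seed=0):
    """Return (mean matched corr, mean unmatched corr, midpoint cutoff)."""
    rng = random.Random(seed)
    matched, unmatched = [], []
    for _ in range(n_pairs):
        g1, g2 = draw_genotypes(rng, n_snps), draw_genotypes(rng, n_snps)
        first = observe_vafs(rng, g1, mean_depth, r)
        matched.append(vaf_correlation(first, observe_vafs(rng, g1, mean_depth, r)))
        unmatched.append(vaf_correlation(first, observe_vafs(rng, g2, mean_depth, r)))
    # Midpoint between (matched mean - 1 sd) and (unmatched mean + 1 sd).
    cutoff = ((mean(matched) - stdev(matched)) + (mean(unmatched) + stdev(unmatched))) / 2
    return mean(matched), mean(unmatched), cutoff


for depth in (0.5, 5.0):
    m, u, c = simulate(depth)
    print(f"depth={depth}X matched={m:.2f} unmatched={u:.2f} cutoff={c:.2f}")
```

Running the sketch at several depths shows the qualitative behavior the figure relies on: the matched correlation drops as depth falls, so a single fixed cutoff would not serve all depths.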
NGSCheckMate performance
| Data type | Dataset (pair type) | Sequencing depth¹ | Individuals | Samples | Test pairs (#matched, #unmatched) | Accuracy (%)² |
|---|---|---|---|---|---|---|
| WGS (BAM) | TCGA colorectal (cancer versus normal) | >30X, down-sampling (0.5–30X) | 66 | 132 | 66, 66 | 100 |
| | | down-sampling (0.01–0.2X) | | | | 55.3–99.2 |
| WGS (BAM, FASTQ) | non-TCGA lymphoma (cancer versus normal) | 30–60X, down-sampling (0.5–30X) | 14 | 28 | 14, 28 | 100, 100 |
| WES (BAM) | TCGA 9 cancer types (cancer versus normal) | ∼100X | 421 | 842 | 421, 421 | 100 |
| | TCGA kidney (cancer versus normal) | ∼100X, down-sampling (0.5–30X) | 50 | 100 | 50, 50 | 100 |
| WES (FASTQ) | non-TCGA breast (cancer versus normal) | ∼60X, down-sampling (0.5–10X) | 68 | 136 | 68, 68 | 100 |
| RNA-seq (BAM) | TCGA colorectal (cancer versus normal) | ∼65X, down-sampling (0.5–10X) | 19 | 38 | 19, 19 | 100 |
| Single-cell WGS (BAM) | single-neuron | ∼42X, down-sampling (0.5–10X) | 3 | 36 | 210, 210 | 100 |
| | glioblastoma (cancer–cancer) | 0.01–0.3X | 2 | 89 | 45, 45 | 87.8 |
| ChIP-seq (BAM, FASTQ) | within marks | 5.4 (2.2–19.0) | 8 | 119 | 72, 72 | 97.6, 97.7 |
| | input versus mark | input DNA 2.3 (2.1–2.9) | 8 | 127 | 133, 133 | 98.5, 99.8 |
| Panel-seq (BAM, FASTQ) | cancer versus normal, multiple regions, primary versus metastasis | 40 (20–119) | 5, 18, 11 | 12, 48, 25 | 92, 87 | 98.3, 99.4 |
| RNA-seq versus WES (BAM) | TCGA stomach (cancer or normal DNA versus cancer RNA) | RNA-seq (∼70X), WES (∼100X) | 65 | 201 | 132, 132 | 100 |
| RNA-seq versus WES (FASTQ) | non-TCGA breast cancer (cancer or normal DNA versus cancer RNA) | RNA-seq (∼25X), WES (∼60X), down-sampling for both datasets (0.5–10X) | 53 | 159 | 106, 106 | 100 |
| RNA- versus ChIP-seq (BAM, FASTQ) | RNA-seq versus ChIP-seq (all marks) | RNA-seq (∼5X), ChIP-seq (described above) | 7 | 119 | 231, 231 | 99, 98.9 |

¹ For WGS, WES and RNA-seq, the average mean depth is shown. For Panel-seq and ChIP-seq, the average of the mean non-zero depths across the SNPs with at least one mapped read is shown with its range in parentheses.

² Accuracy estimates for the alignment-based and the alignment-free method are separated by a comma.
Figure 4. Examples of sample mislabelings. (A) A dendrogram output for panel-seq data of primary colon cancer and liver metastasis samples, based on 1 - VAF correlation as the distance measure. The numbers in the sample names are the patient IDs. The red line indicates the average of the VAF correlation cutoffs used for predicting the matching status of each pair; the VAF correlation cutoff used for every pair depends on the smaller of the depths for the two input profiles. Since this is panel-sequencing data, the mean of non-zero depths was used as the reference depth to retrieve a corresponding VAF correlation cutoff from the pre-built model (see Methods for details). Samples predicted to be mislabeled with high confidence are marked by red boxes. (B) Part of a graph output for the TCGA lung cancer WGS (low-coverage) and WES datasets. A node represents an input file, colored by individual. A solid edge represents a matched pair predicted by NGSCheckMate. The corresponding VAF correlation is written next to each edge. The graph indicates a mislabeling of the 44–3396 T (tumor) WGS file. VAF correlations between the 44–3396 T file and each of the other two 44–3396 files (tumor WES and blood WGS) did not pass the VAF correlation cutoff for a matched pair (red dotted lines and text).