Maximillian Westphal, David Frankhouser, Carmine Sonzone, Peter G Shields, Pearlly Yan, Ralf Bundschuh.
Abstract
BACKGROUND: Inadvertent sample swaps are a real threat to data quality in any medium- to large-scale omics study. While matches between samples from the same individual can in principle be identified from a few well-characterized single nucleotide polymorphisms (SNPs), omics data types often provide only low to moderate coverage, thus requiring integration of evidence from a large number of SNPs to determine whether two samples derive from the same individual.
Keywords: Identity matching; Next generation sequencing data; Sample swap
Year: 2019 PMID: 31888490 PMCID: PMC6936078 DOI: 10.1186/s12864-019-6332-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1. Workflow of SMaSH. The numbers of reads supporting the wild type and the alternate allele at 6059 SNPs in the human genome are counted, and a Bayesian approach is used to calculate a p-value for the null hypothesis that the two samples are derived from the same individual. W = Wild Type; A = Alternate Allele
Fig. 2. Receiver Operating Characteristic curves for the performance of SMaSH and VerifyBamID on a subset of data set 2 consisting of RNA-Seq and MethylCap-Seq libraries. Each curve shows the fraction of true positives as a function of the fraction of false positives. The black solid curve (which follows the axes because SMaSH is a perfect classifier on this data set) represents SMaSH and the red curve represents VerifyBamID. The circles indicate the performance at a p-value/IBD cutoff of 0.95
Fig. 3. Receiver Operating Characteristic curves for the performance of SMaSH on a fairly low quality RNA-Seq data set. Each curve shows the fraction of true positives as a function of the fraction of false positives. The black solid curves correspond to the full data sets, while the colored dashed curves correspond to different degrees of subsampling in order to illustrate how performance depends on read coverage. (a) shows data for all samples, while (b) shows data after removal of all comparisons involving one sample that was later excluded from the study due to very low RNA quality. The circles indicate the performance at a p-value cutoff of 0.95
Fig. 4. p-value distributions for all four data sets. Each symbol corresponds to the comparison of one pair of samples in the respective data set, and its height represents the calculated probability that the two samples are derived from the same individual. Red diamonds indicate sample pairs from the same individual, while blue circles indicate sample pairs from different individuals. For data set 4, the data after exclusion of the failed quality control sample RNA09 are shown. The dashed line corresponds to the chosen threshold of 0.95, which discriminates pairings involving the same individual from pairings not involving the same individual in all four data sets
Calculated probabilities that samples from members of two families come from the same individual
| | Mother 1 | Child 1 | Father 2 | Mother 2 | Child 2 | Sibling 2 |
|---|---|---|---|---|---|---|
| Father 1 | 4 · 10⁻⁴⁶ | 0.80 | 2 · 10⁻¹⁵¹ | 7 · 10⁻¹⁷⁴ | 1 · 10⁻¹⁶¹ | 3 · 10⁻¹⁷⁷ |
| Mother 1 | | 0.16 | 2 · 10⁻¹⁷¹ | 3 · 10⁻²⁴⁰ | 8 · 10⁻²³⁷ | 1 · 10⁻²⁶⁹ |
| Child 1 | | | 3 · 10⁻¹⁷¹ | 6 · 10⁻¹⁷⁵ | 3 · 10⁻¹⁸² | 4 · 10⁻¹⁹⁸ |
| Father 2 | | | | 1 · 10⁻⁷⁸ | 0.96 | 0.32 |
| Mother 2 | | | | | 0.75 | 0.16 |
| Child 2 | | | | | | 0.9999995 |
HumanAll Exon V5 data sheet, Agilent Technologies, Santa Clara, CA) targeted regions. These regions are more likely to have coverage for SNP calling across all discussed data types: whole genome sequencing, RNA-Seq, Exome-Seq, and MethylCap-Seq. To account for linkage disequilibrium, we required each SNP to be at least 100 kb away from any other SNP in the list. Where SNPs were closer than this minimum distance, we chose the SNP with the allele frequency closest to 1/2 in order to maximize the information content contributed by the SNP. This resulted in 6059 SNPs to be tested, which are listed in Additional file 4.
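The spacing rule described above can be illustrated with a minimal sketch. The text does not say how conflicts among more than two neighbouring SNPs were resolved, so the greedy order used here (most informative SNP first) is an assumption, as are the names `Snp`, `thin_snps`, and `MIN_DISTANCE`. Only the 100 kb spacing and the preference for allele frequencies near 1/2 come from the text.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from typing import NamedTuple

MIN_DISTANCE = 100_000  # 100 kb minimum spacing, as stated in the text


class Snp(NamedTuple):
    """Hypothetical record for one candidate SNP."""
    chrom: str
    pos: int
    alt_freq: float  # alternate allele frequency in a reference population


def thin_snps(candidates, min_distance=MIN_DISTANCE):
    """Greedily keep SNPs so that kept SNPs on one chromosome are at least
    min_distance apart, preferring allele frequencies closest to 1/2.

    Assumption: the greedy order is illustrative, not the authors' procedure.
    """
    kept_positions = defaultdict(list)  # chrom -> sorted positions already kept
    kept = []
    # Visit candidates from most to least informative (|freq - 1/2| ascending).
    ordered = sorted(candidates, key=lambda s: (abs(s.alt_freq - 0.5), s.chrom, s.pos))
    for snp in ordered:
        positions = kept_positions[snp.chrom]
        i = bisect_left(positions, snp.pos)
        # Only the nearest kept neighbour on each side needs to be checked.
        left_ok = i == 0 or snp.pos - positions[i - 1] >= min_distance
        right_ok = i == len(positions) or positions[i] - snp.pos >= min_distance
        if left_ok and right_ok:
            insort(positions, snp.pos)
            kept.append(snp)
    return sorted(kept, key=lambda s: (s.chrom, s.pos))
```

Visiting candidates in order of informativeness means that, whenever two SNPs conflict, the one closer to a frequency of 1/2 is the one retained, which matches the tie-breaking rule stated in the text. Chromosomes are handled independently, so SNPs on different chromosomes never exclude each other.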