Reuben J Pengelly, Jane Gibson, Gaia Andreoletti, Andrew Collins, Christopher J Mattocks, Sarah Ennis.
Abstract
Whole-exome sequencing provides a cost-effective means to sequence protein-coding regions within the genome, which are significantly enriched for etiological variants. We describe a panel of single nucleotide polymorphisms (SNPs) to facilitate the validation of data provenance in whole-exome sequencing studies. This is particularly significant where multiple processing steps necessitate transfer of sample custody between clinical, laboratory and bioinformatics facilities. SNPs captured by all commonly used exome enrichment kits were identified and filtered for possible confounding properties. The optimised panel provides a simple yet powerful method for the assignment of intrinsic, highly discriminatory identifiers to genetic samples.
Year: 2013 PMID: 24070238 PMCID: PMC3978886 DOI: 10.1186/gm492
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Venn diagrams showing commonality of targeting between capture kits (A, B) and properties of encompassed SNPs (C). Overlap between exome capture kits is presented in Mbp (A) and as the number of SNPs with an AF ≥0.3 (B). Agilent - SureSelect Human All Exon V4; Illumina - TruSeq Exome Enrichment; Nimblegen - SeqCap EZ Human Exome Library V3.0. For the subset of SNPs present both in the intersection of the three kits shown and in the Illumina TruSight Exome kit, a breakdown of fulfilment of the four classes of candidate filtering criteria is shown (C) (see the main text for details of filtering criteria); 117 SNPs exhibited all desired characteristics; 74 SNPs exhibited none of the desired characteristics.
Optimised panel of identifying SNPs
| Chr | Position (a) | SNP | Alleles | AF | AF | AF | AF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 179520506 | rs1410592 | A/C | 0.59 | 0.62 | 0.54 | 0.53 |
| 1 | 67861520 | rs2229546 | A/G | 0.64 | 0.36 | 0.44 | 0.58 |
| 2 | 169789016 | rs497692 | A/G (b) | 0.55 | 0.65 | 0.51 | 0.22 |
| 2 | 227896976 | rs10203363 | C/T | 0.46 | 0.44 | 0.36 | 0.57 |
| 3 | 4403767 | rs2819561 | A/G (b) | 0.56 | 0.73 | 0.73 | 0.72 |
| 4 | 5749904 | rs4688963 | A/G (b) | 0.33 | 0.65 | 0.67 | 0.52 |
| 5 | 82834630 | rs309557 | A/G (b) | 0.49 | 0.34 | 0.52 | 0.50 |
| 6 | 146755140 | rs2942 | C/T | 0.54 | 0.49 | 0.55 | 0.47 |
| 7 | 48450157 | rs17548783 | C/T | 0.46 | 0.72 | 0.53 | 0.48 |
| 8 | 94935937 | rs4735258 | C/T | 0.40 | 0.64 | 0.66 | 0.46 |
| 9 | 100190780 | rs1381532 | A/G (b) | 0.48 | 0.59 | 0.50 | 0.58 |
| 10 | 100219314 | rs10883099 | A/G | 0.52 | 0.52 | 0.53 | 0.62 |
| 11 | 16133413 | rs4617548 | C/T | 0.52 | 0.65 | 0.61 | 0.51 |
| 12 | 993930 | rs7300444 | A/G | 0.46 | 0.55 | 0.48 | 0.28 |
| 13 | 39433606 | rs9532292 | A/G | 0.29 | 0.41 | 0.44 | 0.54 |
| 14 | 50769717 | rs2297995 | A/G | 0.55 | 0.65 | 0.67 | 0.59 |
| 15 | 34528948 | rs4577050 | C/T | 0.68 | 0.75 | 0.63 | 0.32 |
| 16 | 70303580 | rs2070203 | A/G (b) | 0.53 | 0.28 | 0.51 | 0.49 |
| 17 | 71197748 | rs1037256 | C/T | 0.50 | 0.67 | 0.65 | 0.56 |
| 18 | 21413869 | rs9962023 | A/G | 0.67 | 0.81 (c) | 0.75 | 0.51 |
| 19 | 10267077 | rs2228611 | C/T (b) | 0.47 | 0.73 | 0.56 | 0.48 |
| 20 | 6100088 | rs10373 | G/T (b) | 0.54 | 0.31 | 0.35 | 0.58 |
| 21 | 44323590 | rs4148973 | C/T | 0.65 | 0.33 | 0.38 | 0.73 |
| 22 | 21141300 | rs4675 | A/C | 0.46 | 0.62 | 0.51 | 0.57 |
(a) Position as defined in genome reference assembly GRCh37 (hg19).
(b) SNP is defined on the negative strand.
(c) AF marginally outside target range for candidate selection. Selected due to paucity of candidates on chromosome 18.
Profile collisions per simulated dataset of 10,000 individuals with various population AFs
| Population | Collisions (SD) |
| --- | --- |
| | 0.0039 (0.062) |
| CEU | 0.0064 (0.079) |
| CHB | 0.0239 (0.154) |
| JPT | 0.0082 (0.090) |
| YRI | 0.0076 (0.086) |
| | 0.0031 (0.056) |
(a) All 24 SNPs assigned an AF of 0.5, which gives the most even trifurcation of genotypes per SNP and thus the greatest discriminatory power. SD, standard deviation.
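The footnote's point about an AF of 0.5 can be illustrated with a short analytical calculation. The following is a minimal sketch, not the simulation procedure used in the study: it assumes Hardy-Weinberg proportions, independence between SNPs and unrelated individuals, and it counts colliding pairs of individuals, which may not be exactly how collisions were tallied in the table. The function names are hypothetical.

```python
from math import comb, prod


def genotype_match_probability(af: float) -> float:
    """Chance that two unrelated individuals share a genotype at one
    biallelic SNP, assuming Hardy-Weinberg proportions (an assumption)."""
    p, q = af, 1.0 - af
    hom_a, het, hom_b = p * p, 2 * p * q, q * q
    # Both individuals must fall into the same one of the three genotypes.
    return hom_a**2 + het**2 + hom_b**2


def expected_colliding_pairs(afs, n_individuals: int) -> float:
    """Expected number of pairs with identical profiles across all SNPs,
    assuming the SNPs are independent of one another (an assumption)."""
    profile_match = prod(genotype_match_probability(af) for af in afs)
    return comb(n_individuals, 2) * profile_match


# 24 SNPs at AF 0.5 in a dataset of 10,000 individuals.
optimum = expected_colliding_pairs([0.5] * 24, 10_000)
# The same panel size with a skewed AF of 0.2, for comparison.
skewed = expected_colliding_pairs([0.2] * 24, 10_000)

print(f"AF 0.5: {optimum:.4f} expected colliding pairs")
print(f"AF 0.2: {skewed:.4f} expected colliding pairs")
```

Under these assumptions the AF 0.5 case gives roughly 0.003 expected colliding pairs per 10,000 individuals, and any AF further from 0.5 gives more. Passing a different `n_individuals` shows how the expectation grows with dataset size, since the number of pairs grows quadratically.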
Figure 2. Relationship between the size of simulated datasets and the occurrence of non-unique profiles. Thirteen 1000 Genomes Project populations were simulated [20]. Datasets were simulated as described in Methods. With increasing dataset size, the probability of repeat profiles increases. Only populations with a sample size of >50 individuals in the dataset were simulated. Additional populations are Americans of African ancestry in Southwest USA (ASW), Colombians from Medellin, Colombia (CLM), Finnish in Finland (FIN), British in England and Scotland (GBR), Luhya in Webuye, Kenya (LWK), Mexican ancestry from Los Angeles, USA (MXL), Puerto Ricans from Puerto Rico (PUR) and Toscani in Italia (TSI).
Figure 3. Exome-derived and orthogonal genotypes (Geno) for four samples, showing a sample switch between samples 2 and 3. Informative markers for the resolution of this switch are highlighted in yellow.
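The kind of comparison shown in Figure 3 can be sketched in a few lines. This is an illustrative example only, not the authors' pipeline: the genotypes below are invented, only three of the panel SNPs are used, genotypes are encoded as unphased allele strings in alphabetical order, and the function names are hypothetical.

```python
def concordance(profile_a: dict, profile_b: dict) -> float:
    """Fraction of SNPs typed in both profiles at which the genotypes agree."""
    shared = [snp for snp in profile_a if snp in profile_b]
    if not shared:
        return 0.0
    agree = sum(profile_a[snp] == profile_b[snp] for snp in shared)
    return agree / len(shared)


def best_matches(exome: dict, orthogonal: dict) -> dict:
    """For each exome sample, the orthogonal sample with the highest concordance."""
    result = {}
    for name, profile in exome.items():
        scores = {other: concordance(profile, geno) for other, geno in orthogonal.items()}
        result[name] = max(scores, key=scores.get)
    return result


# Invented orthogonal genotypes for four samples (illustration only).
orthogonal = {
    "S1": {"rs1410592": "AC", "rs2229546": "AG", "rs2942": "CC"},
    "S2": {"rs1410592": "AA", "rs2229546": "GG", "rs2942": "CT"},
    "S3": {"rs1410592": "CC", "rs2229546": "AG", "rs2942": "TT"},
    "S4": {"rs1410592": "AC", "rs2229546": "AA", "rs2942": "CT"},
}

# Exome-derived genotypes in which the labels of S2 and S3 have been swapped.
exome = {
    "S1": orthogonal["S1"],
    "S2": orthogonal["S3"],
    "S3": orthogonal["S2"],
    "S4": orthogonal["S4"],
}

matches = best_matches(exome, orthogonal)
switched = {name: match for name, match in matches.items() if name != match}
print(switched)  # {'S2': 'S3', 'S3': 'S2'}
```

Any sample whose best match carries a different label is flagged for review. With real data, a concordance threshold and handling of missing or low-quality calls would be needed before a mismatch is treated as a switch.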