MOTIVATION: Sequencing of matched tumor and normal samples is the standard study design for reliable detection of somatic alterations. However, even very low levels of cross-sample contamination significantly impact calling of somatic mutations, because contaminant germline variants can be incorrectly interpreted as somatic. There are currently no sequence-only methods that reliably estimate contamination levels in tumor samples, which frequently display copy number changes. As a solution, we developed Conpair, a tool for detection of sample swaps and cross-individual contamination in whole-genome and whole-exome tumor-normal sequencing experiments. RESULTS: On a ladder of in silico contaminated samples, we demonstrated that Conpair reliably measures contamination levels as low as 0.1%, even in the presence of copy number changes. We also estimated contamination levels in glioblastoma WGS and WXS tumor-normal datasets from TCGA and showed that they strongly correlate with tumor-normal concordance, as well as with the number of germline variants called as somatic by several widely used somatic callers. AVAILABILITY AND IMPLEMENTATION: The method is available at: https://github.com/nygenome/conpair CONTACT: egrabowska@gmail.com or mczody@nygenome.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
The decreasing cost of high-throughput sequencing allows analysis of a larger number of samples than before, which as an unfortunate side effect increases the chances of sample mix-ups and contamination. Cancer studies often jointly analyze matched tumor–normal (T–N) samples in order to detect somatic mutations that are present in the tumor. Even a very low level of cross-individual contamination in the tumor sample may introduce many low allele frequency germline variants that will be interpreted as somatic by somatic variant calling algorithms, resulting in greatly reduced specificity (Supplementary Figure S1). Detecting sample swaps and low-level contamination in tumor samples are critical quality control steps that should precede every somatic analysis. However, estimating contamination in tumor samples is confounded by frequent copy number alterations that affect allelic ratio distributions.

VerifyBamID (Jun et al., 2012) and ContEst (Cibulskis et al., 2011) have emerged as standard methods to estimate sample contamination. VerifyBamID maximizes the likelihood of a contamination level in a two-sample mixture model, given the alleles and base qualities, using a grid search over a range of contamination fractions and refining the result using a numerical root-finding method. VerifyBamID provides an accurate measure of contamination in mostly diploid (copy-neutral) samples; however, it may interpret the copy number-driven allelic imbalance frequently seen in cancer as contamination. ContEst calculates the maximum a posteriori estimate of contamination based on the base identities and quality scores from sequencing data, at sites identified on a SNP array to be homozygous. The method can be applied to tumor–normal studies; however, ContEst requires additional data from a genotyping array. Alternatively, genotypes of a normal sample called from high-coverage (>50×) sequencing data can be used.

We developed Conpair (Concordance/Contamination of paired samples) to robustly detect contamination in cancer studies based on sequence data alone. We show that our method accurately detects contamination levels as low as 0.1% (Supplementary Table S3), even in the presence of copy number changes. In contrast to ContEst, our tool also allows verifying concordance between tumor and normal samples and estimating contamination in normal samples. Conpair is ∼50× faster than VerifyBamID and ∼18× faster than ContEst on a 60×/60× WGS pair (Supplementary Figure S11A).
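The confounding effect of copy number changes mentioned above can be illustrated with a small calculation. The sketch below is not part of Conpair or VerifyBamID; the function name `expected_b_fraction` and its parameters are hypothetical, and it assumes that the contamination level α is the fraction of reads originating from a diploid contaminating individual and that normal-cell admixture in the tumor can be ignored. Under those assumptions it computes the expected fraction of reads carrying allele B at a single biallelic marker.

```python
def expected_b_fraction(b_copies, total_copies, alpha=0.0, contaminant_b_dose=0):
    """Expected fraction of reads showing allele B at one biallelic marker.

    b_copies, total_copies: copies of allele B and total copies in the sample
        (1 of 2 for a copy-neutral heterozygous marker, 1 of 3 after a
        single-copy gain of allele A).
    alpha: fraction of reads that come from the contaminating individual.
    contaminant_b_dose: copies of B (0, 1 or 2) in the contaminant,
        which is assumed to be diploid.
    """
    if total_copies <= 0 or not 0 <= b_copies <= total_copies:
        raise ValueError("need 0 <= b_copies <= total_copies and total_copies > 0")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if contaminant_b_dose not in (0, 1, 2):
        raise ValueError("contaminant_b_dose must be 0, 1 or 2")
    own = b_copies / total_copies        # allele fraction in the sample itself
    foreign = contaminant_b_dose / 2     # allele fraction in the contaminant
    return (1 - alpha) * own + alpha * foreign


# Heterozygous marker: a clean sample with a copy gain looks "imbalanced",
# much like a copy-neutral sample that is contaminated.
het_copy_neutral = expected_b_fraction(1, 2)
het_with_gain = expected_b_fraction(1, 3)
het_contaminated = expected_b_fraction(1, 2, alpha=0.2, contaminant_b_dose=0)

# Homozygous marker (no B copies): copy number has no effect, so any
# B reads beyond sequencing error point to contamination.
hom_with_gain = expected_b_fraction(0, 3)
hom_contaminated = expected_b_fraction(0, 3, alpha=0.02, contaminant_b_dose=2)
```

In this toy setting the heterozygous marker moves away from 0.5 both under a copy gain and under contamination, whereas the homozygous marker departs from 0 only when contaminant reads are present.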
2 Methods
Copy number changes, which are frequent in tumor samples, may cause difficulties in estimating contamination levels due to shifting of the expected 50% allelic fraction for heterozygous markers. By using matched normal samples we can robustly detect homozygous markers, which are invariant to copy number changes and are not affected by contamination in the normal sample, and subsequently use them to reliably estimate contamination level in the tumor sample (see Supplementary Methods).

Conpair takes as input a pair of BAM files, the reference genome and a short list of pre-selected highly informative genomic markers that are provided with the tool (see Supplementary Methods), in order to run both concordance verification and contamination estimation. For concordant T–N pairs, Conpair measures contamination first in the normal and then in the tumor sample, using the genotype information from the normal. Conpair employs the statistical model developed by Jun and colleagues (VerifyBamID), but in contrast to VerifyBamID allows for only two alleles and uses a limited set of markers (Supplementary Methods).
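As an illustration of the kind of model described here, the following minimal sketch maximizes a two-allele mixture likelihood over a grid of contamination fractions, using only markers at which the sample is homozygous. It is a simplified reading of the text, not the Conpair or VerifyBamID implementation: the function names, the input layout, the Hardy–Weinberg prior on the contaminant genotype, and the omission of any refinement step after the grid search are all assumptions made for the example.

```python
import math


def marker_log_likelihood(alpha, sample_alt_dose, alt_freq, reads):
    """Log-likelihood of the reads at one marker for contamination alpha.

    sample_alt_dose: 0 or 2, the homozygous genotype of the sample itself.
    alt_freq: population frequency of the alternate allele (assumed known).
    reads: list of (is_alt, base_quality) pairs, Phred-scaled qualities.
    """
    # Hardy-Weinberg prior over the unknown contaminant genotype (assumption).
    priors = (
        (0, (1 - alt_freq) ** 2),
        (1, 2 * alt_freq * (1 - alt_freq)),
        (2, alt_freq ** 2),
    )
    terms = []
    for contaminant_dose, prior in priors:
        if prior <= 0.0:
            continue
        # Probability that a read truly carries the alternate allele.
        p_alt = (1 - alpha) * sample_alt_dose / 2 + alpha * contaminant_dose / 2
        log_l = math.log(prior)
        for is_alt, base_quality in reads:
            err = min(10 ** (-base_quality / 10), 0.5)
            # With only two alleles, an error flips one allele into the other.
            p_obs_alt = p_alt * (1 - err) + (1 - p_alt) * err
            log_l += math.log(p_obs_alt if is_alt else 1 - p_obs_alt)
        terms.append(log_l)
    peak = max(terms)  # log-sum-exp over contaminant genotypes
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def estimate_contamination(markers, max_alpha=0.5, grid_size=500):
    """Grid search for the alpha that maximizes the summed log-likelihood.

    markers: list of (sample_alt_dose, alt_freq, reads) tuples.
    """
    best_alpha, best_ll = 0.0, -math.inf
    for step in range(grid_size + 1):
        alpha = max_alpha * step / grid_size
        ll = sum(marker_log_likelihood(alpha, *marker) for marker in markers)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

Restricting the sum to homozygous markers is what makes the estimate insensitive to copy number in this sketch: the sample's own contribution to the alternate-allele probability is 0 or 1 regardless of how many copies are present.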
3 Results
We constructed two independent sets of in silico contaminated cancer samples by mixing reads from BAM files of copy number aberrant (Magi et al., 2013) TCGA glioblastoma exomes (Brennan et al., 2013) along a ladder of contamination levels from 0.1% to 95%, yielding a total of 245 samples at 49 different contamination levels (α) in each set. For each sample we estimated α using Conpair, VerifyBamID and ContEst (sequence-only mode). Our results indicate better agreement with the ground truth for Conpair in both sets (RMSD = 0.0064; 0.009) than for ContEst (RMSD = 0.0075; 0.0128) or VerifyBamID (RMSD = 0.062; 0.045) (Supplementary Figures S4 and S5).

TCGA glioblastoma dataset. After verifying T–N pairing (Supplementary Figure S6), we applied Conpair to 51 WGS and 396 WXS sample pairs from the TCGA glioblastoma study. Since the WGS dataset appeared clean according to Conpair (α: 0.0–0.612% in the tumor samples and 0–0.905% in the normal samples), we focused on the less clean WXS dataset (α: 0.008–4.75% in the tumor samples and 0.014–6.52% in the normal samples). The WXS dataset consists of 144 T–N pairs that underwent a whole-genome amplification (WGA) library preparation protocol and 252 T–N pairs prepared by exome capture. Conpair, ContEst and VerifyBamID returned similar contamination values for all the normal samples, independently of the library preparation method (Supplementary Figure S7A and C).

For tumor samples, the differences in the values returned by the three programs were substantial. VerifyBamID estimated high α for the majority of the tumor samples. Contamination estimates generated by ContEst were higher than, but comparable to, those of Conpair for all samples prepared by exome capture. Conpair and ContEst did not agree on a subset of tumor samples that underwent WGA, for which ContEst detected much higher fractions of contamination (5–10%) (Supplementary Figure S7B and D).

To assess which method was more accurate, we correlated the contamination estimates with the T–N concordance values (calculated based only on markers that were homozygous in the normal sample). Tumor samples with T–N concordance values close to 100% cannot be significantly contaminated (Supplementary Figure S2). On this basis, we conclude that VerifyBamID substantially overestimated α for the majority of the tumor samples, and that ContEst overestimated α for a subset of the WGA samples. The results returned by Conpair show a monotonic dependency between the T–N concordance and contamination values (Fig. 1).
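The concordance measure used in this comparison can be sketched as follows. This is only an illustration under stated assumptions: genotypes are assumed to be already called per marker and encoded as the strings "AA", "AB" or "BB" (with `None` for no call), and the function name and input layout are hypothetical rather than taken from Conpair.

```python
def concordance_on_normal_homozygous(normal_genotypes, tumor_genotypes):
    """Fraction of markers homozygous in the normal whose tumor genotype matches.

    Both arguments map a marker identifier to "AA", "AB", "BB" or None
    (no call). Markers that are heterozygous in the normal, or uncalled
    in either sample, are ignored.
    """
    usable = [
        marker
        for marker, genotype in normal_genotypes.items()
        if genotype in ("AA", "BB") and tumor_genotypes.get(marker) is not None
    ]
    if not usable:
        raise ValueError("no marker is homozygous in the normal and called in the tumor")
    matches = sum(
        1 for marker in usable if tumor_genotypes[marker] == normal_genotypes[marker]
    )
    return matches / len(usable)


# Toy example: m3 is heterozygous in the normal and m5 is uncalled in the
# tumor, so only m1, m2 and m4 are compared.
normal = {"m1": "AA", "m2": "BB", "m3": "AB", "m4": "AA", "m5": "BB"}
tumor = {"m1": "AA", "m2": "BB", "m3": "AA", "m4": "AB", "m5": None}
toy_concordance = concordance_on_normal_homozygous(normal, tumor)
discordance = 1 - toy_concordance  # the quantity plotted against contamination
```

Because only markers that are homozygous in the normal are compared, a tumor genotype that turns heterozygous at such a marker (m4 in the toy data) lowers the concordance, which is the pattern expected when contaminant reads introduce a second allele.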
Fig. 1.
Relationship between tumor–normal discordance values (1 – concordance) and contamination levels detected by Conpair, ContEst and VerifyBamID in a set of TCGA glioblastoma WXS tumor samples. Data points show whole-genome-amplified samples (red) and exome-capture samples (blue).
As an independent metric, we also looked at the number of known germline variants called as ‘somatic’ by three somatic callers: MuTect (Cibulskis et al., 2013), LoFreq (Wilm et al.) and Strelka (Saunders et al., 2012). These numbers were strongly correlated with the contamination in the tumor samples returned by Conpair (Spearman r: 0.76 [P-value = 7.5e–20], 0.75 [5.5e–19], 0.67 [3.7e–14], for variants where α > 0.5%), but not correlated with the estimates returned by ContEst and VerifyBamID (correlations not significant) (Supplementary Figure S8). These results suggest that Conpair estimates contamination levels more robustly across different library preparation methods.
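For readers who wish to reproduce this kind of comparison, the Spearman coefficient between per-sample contamination estimates and the number of germline variants called as somatic can be computed as sketched below. The code uses only the Python standard library; the function names are hypothetical, tied values receive average ranks, and no P-value is calculated.

```python
import math


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / math.sqrt(var_x * var_y)
```

A rank-based coefficient is a natural choice here because the relationship between contamination and miscalled germline variants is expected to be monotonic but not necessarily linear.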
Authors: Cameron W Brennan; Roel G W Verhaak; Aaron McKenna; Benito Campos; Houtan Noushmehr; Sofie R Salama; Siyuan Zheng; Debyani Chakravarty; J Zachary Sanborn; Samuel H Berman; Rameen Beroukhim; Brady Bernard; Chang-Jiun Wu; Giannicola Genovese; Ilya Shmulevich; Jill Barnholtz-Sloan; Lihua Zou; Rahulsimham Vegesna; Sachet A Shukla; Giovanni Ciriello; W K Yung; Wei Zhang; Carrie Sougnez; Tom Mikkelsen; Kenneth Aldape; Darell D Bigner; Erwin G Van Meir; Michael Prados; Andrew Sloan; Keith L Black; Jennifer Eschbacher; Gaetano Finocchiaro; William Friedman; David W Andrews; Abhijit Guha; Mary Iacocca; Brian P O'Neill; Greg Foltz; Jerome Myers; Daniel J Weisenberger; Robert Penny; Raju Kucherlapati; Charles M Perou; D Neil Hayes; Richard Gibbs; Marco Marra; Gordon B Mills; Eric Lander; Paul Spellman; Richard Wilson; Chris Sander; John Weinstein; Matthew Meyerson; Stacey Gabriel; Peter W Laird; David Haussler; Gad Getz; Lynda Chin Journal: Cell Date: 2013-10-10 Impact factor: 41.582
Authors: Goo Jun; Matthew Flickinger; Kurt N Hetrick; Jane M Romm; Kimberly F Doheny; Gonçalo R Abecasis; Michael Boehnke; Hyun Min Kang Journal: Am J Hum Genet Date: 2012-10-25 Impact factor: 11.025
Authors: Kristian Cibulskis; Aaron McKenna; Tim Fennell; Eric Banks; Mark DePristo; Gad Getz Journal: Bioinformatics Date: 2011-07-29 Impact factor: 6.937
Authors: Christopher T Saunders; Wendy S W Wong; Sajani Swamy; Jennifer Becq; Lisa J Murray; R Keira Cheetham Journal: Bioinformatics Date: 2012-05-10 Impact factor: 6.937
Authors: Alberto Magi; Lorenzo Tattini; Ingrid Cifola; Romina D'Aurizio; Matteo Benelli; Eleonora Mangano; Cristina Battaglia; Elena Bonora; Ants Kurg; Marco Seri; Pamela Magini; Betti Giusti; Giovanni Romeo; Tommaso Pippucci; Gianluca De Bellis; Rosanna Abbate; Gian Franco Gensini Journal: Genome Biol Date: 2013 Impact factor: 13.583
Authors: Kristian Cibulskis; Michael S Lawrence; Scott L Carter; Andrey Sivachenko; David Jaffe; Carrie Sougnez; Stacey Gabriel; Matthew Meyerson; Eric S Lander; Gad Getz Journal: Nat Biotechnol Date: 2013-02-10 Impact factor: 54.908