
Conpair: concordance and contamination estimator for matched tumor-normal pairs.

Ewa A Bergmann¹, Bo-Juen Chen¹, Kanika Arora¹, Vladimir Vacic¹, Michael C Zody¹.

Abstract

MOTIVATION: Sequencing of matched tumor and normal samples is the standard study design for reliable detection of somatic alterations. However, even very low levels of cross-sample contamination significantly impact calling of somatic mutations, because contaminant germline variants can be incorrectly interpreted as somatic. There are currently no sequence-only based methods that reliably estimate contamination levels in tumor samples, which frequently display copy number changes. As a solution, we developed Conpair, a tool for detection of sample swaps and cross-individual contamination in whole-genome and whole-exome tumor-normal sequencing experiments.
RESULTS: On a ladder of in silico contaminated samples, we demonstrated that Conpair reliably measures contamination levels as low as 0.1%, even in the presence of copy number changes. We also estimated contamination levels in glioblastoma WGS and WXS tumor-normal datasets from TCGA and showed that they strongly correlate with tumor-normal concordance, as well as with the number of germline variants called as somatic by several widely used somatic callers.
AVAILABILITY AND IMPLEMENTATION: The method is available at: https://github.com/nygenome/conpair
CONTACT: egrabowska@gmail.com or mczody@nygenome.org
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: DNA, Neoplasm

Year: 2016 PMID: 27354699 PMCID: PMC5048070 DOI: 10.1093/bioinformatics/btw389

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The decreasing cost of high-throughput sequencing allows analysis of a larger number of samples than before, which, as an unfortunate side effect, increases the chances of sample mix-ups and contamination. Cancer studies often jointly analyze matched tumor–normal (T–N) samples in order to detect somatic mutations that are present in the tumor. Even a very low level of cross-individual contamination in the tumor sample may introduce many low allele frequency germline variants that will be interpreted as somatic by somatic variant calling algorithms, resulting in greatly reduced specificity (Supplementary Figure S1). Detecting sample swaps and low-level contamination in tumor samples are critical quality control steps that should precede every somatic analysis. However, estimating contamination in tumor samples is confounded by frequent copy number alterations that affect allelic ratio distributions. VerifyBamID (Jun et al., 2012) and ContEst (Cibulskis et al., 2011) have emerged as standard methods to estimate sample contamination. VerifyBamID maximizes the likelihood of a contamination level in a two-sample mixture model, given the alleles and base qualities, using a grid search over a range of contamination fractions and refining the result using a numerical root-finding method. VerifyBamID provides an accurate measure of contamination in mostly diploid (copy-neutral) samples; however, it may interpret the copy number-driven allelic imbalance frequently seen in cancer as contamination. ContEst calculates the maximum a posteriori estimate of contamination based on the base identities and quality scores from sequencing data, at sites identified on a SNP array to be homozygous. The method can be applied to tumor–normal studies; however, ContEst requires additional data from a genotyping array. Alternatively, genotypes of a normal sample called from high coverage (>50×) sequencing data can be used.
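As a rough illustration of the grid-search step described above, the following Python sketch maximizes a simplified binomial likelihood over a grid of contamination fractions. It is not the VerifyBamID implementation: it assumes markers at which the sample itself is homozygous reference, treats the population alternate-allele frequency as the chance that a contaminant read carries the alternate allele, uses one fixed error rate instead of per-base qualities, and omits the root-finding refinement. The function names, the layout of `sites` and the default parameters are hypothetical.

```python
import math


def log_likelihood(alpha, sites, error_rate=0.01):
    """Log-likelihood of a contamination fraction alpha (simplified model).

    sites: iterable of (ref_reads, alt_reads, alt_freq) tuples for markers
    where the sample itself is ASSUMED to be homozygous reference.
    """
    total = 0.0
    for ref_reads, alt_reads, alt_freq in sites:
        # Chance that a read from the contaminant shows the alternate allele
        # (assumption: alt_freq stands in for the contaminant's genotype).
        p_alt_contaminant = alt_freq * (1 - error_rate) + (1 - alt_freq) * error_rate
        # Mixture: own DNA shows alt only by error, contaminant DNA as above.
        p_alt = (1 - alpha) * error_rate + alpha * p_alt_contaminant
        total += alt_reads * math.log(p_alt) + ref_reads * math.log(1 - p_alt)
    return total


def estimate_contamination(sites, grid_size=1001, max_alpha=0.5, error_rate=0.01):
    """Return the grid point in [0, max_alpha] with the highest log-likelihood."""
    grid = [max_alpha * i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda a: log_likelihood(a, sites, error_rate))


# Synthetic read counts built to match a 5% contamination under this model.
example_sites = [(1931, 69, 0.5), (9753, 247, 0.3)]
example_estimate = estimate_contamination(example_sites)
```

With the synthetic counts above, which were constructed from the model at a contamination fraction of 0.05, the grid search returns a value close to 0.05. A finer estimate would need a refinement step after the grid search, which this sketch leaves out.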
We developed Conpair (Concordance/Contamination of paired samples) to robustly detect contamination in cancer studies based on sequence data alone. We show that our method accurately detects contamination levels as low as 0.1% (Supplementary Table S3), even in the presence of copy number changes. In contrast to ContEst, our tool also allows verifying concordance between tumor and normal samples and estimating contamination in normal samples. Conpair is ∼50× faster than VerifyBamID and ∼18× faster than ContEst on a 60×/60× WGS pair (Supplementary Figure S11A).

2 Methods

Copy number changes, which are frequent in tumor samples, may cause difficulties in estimating contamination levels due to shifting of the expected 50% allelic fraction for heterozygous markers. By using matched normal samples we can robustly detect homozygous markers, which are invariant to copy number changes and are not affected by contamination in the normal sample, and subsequently use them to reliably estimate contamination level in the tumor sample (see Supplementary Methods). Conpair takes as input a pair of BAM files, the reference genome and a short list of pre-selected highly informative genomic markers that are provided with the tool (see Supplementary Methods), in order to run both concordance verification and contamination estimation. For concordant T–N pairs, Conpair measures contamination first in the normal and then in the tumor sample, using the genotype information from the normal. Conpair employs the statistical model developed by Jun and colleagues (VerifyBamID), but in contrast to VerifyBamID allows for only two alleles and uses a limited set of markers (Supplementary Methods).
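The contrast between heterozygous and homozygous markers can be made concrete with a small worked example in Python. It is a sketch under stated assumptions, not part of Conpair: reads are assumed to be sampled in proportion to the DNA copies each cell population contributes, α is taken as the fraction of contaminating cells, the contaminant is assumed diploid, and the function name and arguments are hypothetical.

```python
def expected_alt_fraction(alpha, host_alt, host_total, contam_alt, contam_total=2):
    """Expected alternate-allele read fraction at one marker in a mixture.

    alpha        : fraction of cells that come from the contaminant (assumed)
    host_alt     : copies of the alternate allele in the sample's own cells
    host_total   : total copies of the locus in the sample's own cells (> 0)
    contam_alt   : copies of the alternate allele in the contaminant
    contam_total : total copies in the contaminant (assumed diploid)
    """
    alt_copies = (1 - alpha) * host_alt + alpha * contam_alt
    all_copies = (1 - alpha) * host_total + alpha * contam_total
    return alt_copies / all_copies


# Heterozygous marker, no contamination: a copy number gain alone moves the
# allelic fraction away from 50%.
het_diploid = expected_alt_fraction(0.0, host_alt=1, host_total=2, contam_alt=0)
het_gain = expected_alt_fraction(0.0, host_alt=1, host_total=3, contam_alt=0)

# Homozygous-reference marker, no contamination: the fraction stays at zero
# for every copy number tried here.
hom_clean = [expected_alt_fraction(0.0, 0, cn, 1) for cn in (1, 2, 3, 4)]

# Homozygous-reference marker with 10% contamination by a heterozygous
# individual: a non-zero alternate fraction appears.
hom_contaminated = expected_alt_fraction(0.1, 0, 2, 1)
```

In this toy model the heterozygous marker drops from 1/2 to 1/3 after a single-copy gain even though α is 0, whereas the homozygous-reference marker shows no alternate reads at any copy number unless contaminant DNA is present.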

3 Results

We constructed two independent sets of in silico contaminated cancer samples by mixing reads from BAM files from copy number aberrant (Magi et al., 2013) TCGA glioblastoma exomes (Brennan et al., 2013) at a ladder from 0.1% to 95%, yielding a total of 245 samples at 49 different contamination levels (α) in each set. For each sample we estimated α using Conpair, VerifyBamID and ContEst (sequence-only mode). Our results indicate a better agreement between Conpair and the ground truth in both sets (RMSD = 0.0064; 0.009), compared to ContEst (RMSD = 0.0075; 0.0128) or VerifyBamID (RMSD = 0.062; 0.045) (Supplementary Figures S4 and S5).
TCGA glioblastoma dataset. After verifying T–N pairing (Supplementary Figure S6), we applied Conpair to 51 WGS and 396 WXS sample pairs from the TCGA glioblastoma study. Since the WGS dataset appeared clean according to Conpair (α: 0.0–0.612%/0–0.905% in the tumor and normal samples respectively), we focused on the less clean WXS dataset (α: 0.008–4.75%/0.014–6.52% in the tumor and normal samples respectively). The WXS dataset consists of 144 T–N pairs that underwent a whole-genome amplification (WGA) library preparation protocol and 252 T–N pairs prepared by exome capture. Conpair, ContEst and VerifyBamID returned similar contamination values for all the normal samples, independently of the library preparation method (Supplementary Figure S7A and C).
For tumor samples, the differences in the values returned by the three programs were substantial. VerifyBamID estimated high α for the majority of the tumor samples. Contamination estimates generated by ContEst were higher, but comparable to Conpair for all samples prepared following exome capture. Conpair and ContEst did not agree on a subset of tumor samples that underwent WGA, for which ContEst detected much higher fractions of contamination (5–10%) (Supplementary Figure S7B and D).
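The RMSD figures quoted for the in silico ladder compare estimated and true α across samples. A minimal Python sketch of that metric follows; the function name is hypothetical and the numbers in the example are invented for illustration, not taken from the study.

```python
import math


def rmsd(estimated, truth):
    """Root-mean-square deviation between estimated and true contamination."""
    if len(estimated) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Invented example on the fraction scale (0.001 = 0.1%).
true_alpha = [0.001, 0.01, 0.1]
estimated_alpha = [0.002, 0.012, 0.095]
example_rmsd = rmsd(estimated_alpha, true_alpha)
```

Because the squared errors are averaged over all ladder levels, large absolute errors at high contamination fractions dominate the value; a lower RMSD therefore indicates closer agreement with the ground truth mainly at the upper end of the ladder.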
To assess which method was more accurate, we correlated the contamination estimates with the T–N concordance values (calculated based only on markers that were homozygous in the normal sample). Tumor samples with T–N concordance values close to 100% cannot be significantly contaminated (Supplementary Figure S2). Based on this fact, we were able to show that VerifyBamID highly overestimated α on the majority of the tumor samples, and ContEst overestimated α on the subset of the WGA samples. The results returned by Conpair show a monotonic dependency between the T–N concordance and contamination values (Fig. 1).
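The concordance restricted to markers that are homozygous in the normal sample can be sketched in Python as follows. This is an illustration only: the text does not describe how Conpair calls genotypes or filters markers, so the genotype strings, the dictionary layout and the function name are all assumptions.

```python
def concordance(normal_genotypes, tumor_genotypes):
    """Fraction of markers, homozygous in the normal, that have the same
    genotype in the tumor.

    Both arguments map a marker ID to a two-character genotype such as
    "AA", "AB" or "BB" (a hypothetical encoding).
    """
    compared = 0
    matches = 0
    for marker, normal_gt in normal_genotypes.items():
        tumor_gt = tumor_genotypes.get(marker)
        # Skip markers missing in the tumor or heterozygous in the normal.
        if tumor_gt is None or normal_gt[0] != normal_gt[1]:
            continue
        compared += 1
        if tumor_gt == normal_gt:
            matches += 1
    if compared == 0:
        raise ValueError("no markers that are homozygous in the normal")
    return matches / compared


normal = {"m1": "AA", "m2": "BB", "m3": "AB", "m4": "AA"}
tumor = {"m1": "AA", "m2": "BB", "m3": "AA", "m4": "AB"}
example_concordance = concordance(normal, tumor)   # m3 is ignored
example_discordance = 1 - example_concordance
```

In the toy input, three markers are homozygous in the normal and two of them match in the tumor, giving a concordance of 2/3. Marker m3 is excluded because it is heterozygous in the normal, where loss of heterozygosity in the tumor could change the genotype without any contamination.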

Fig. 1.

Relationship between tumor–normal discordance values (1 – concordance) and contamination levels detected by Conpair, ContEst and VerifyBamID in a set of TCGA glioblastoma WXS tumor samples. Data show whole-genome amplified samples (red) and exome capture samples (blue).

As an independent metric, we also looked at the number of known germline variants called as ‘somatic’ by three somatic callers: MuTect (Cibulskis et al., 2013), LoFreq (Wilm et al., 2012) and Strelka (Saunders et al., 2012). These numbers were strongly correlated with the contamination in the tumor samples returned by Conpair (Spearman r: 0.76 [P-value = 7.5e–20], 0.75 [5.5e–19], 0.67 [3.7e–14], for variants where α > 0.5%), but not correlated with the estimates returned by ContEst and VerifyBamID (correlations not significant) (Supplementary Figure S8). The obtained results suggest that Conpair is more robust in estimating contamination levels in the light of different library preparation methods.
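The Spearman correlation between contamination estimates and counts of germline variants called as somatic is a Pearson correlation computed on ranks. The Python sketch below shows that calculation with average ranks for ties. It does not compute P-values, the function names are hypothetical, and the example numbers are invented rather than taken from the TCGA data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented values: contamination (%) and germline variants called somatic.
alpha_percent = [0.6, 1.2, 2.5, 4.7]
germline_as_somatic = [3, 10, 8, 40]
example_rho = spearman(alpha_percent, germline_as_somatic)
```

A rank-based measure suits this comparison because it only asks whether more contaminated samples tend to yield more misclassified germline variants, without assuming the relationship is linear.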

References

1. The somatic genomic landscape of glioblastoma.

Authors: Cameron W Brennan; Roel G W Verhaak; Aaron McKenna; Benito Campos; Houtan Noushmehr; Sofie R Salama; Siyuan Zheng; Debyani Chakravarty; J Zachary Sanborn; Samuel H Berman; Rameen Beroukhim; Brady Bernard; Chang-Jiun Wu; Giannicola Genovese; Ilya Shmulevich; Jill Barnholtz-Sloan; Lihua Zou; Rahulsimham Vegesna; Sachet A Shukla; Giovanni Ciriello; W K Yung; Wei Zhang; Carrie Sougnez; Tom Mikkelsen; Kenneth Aldape; Darell D Bigner; Erwin G Van Meir; Michael Prados; Andrew Sloan; Keith L Black; Jennifer Eschbacher; Gaetano Finocchiaro; William Friedman; David W Andrews; Abhijit Guha; Mary Iacocca; Brian P O'Neill; Greg Foltz; Jerome Myers; Daniel J Weisenberger; Robert Penny; Raju Kucherlapati; Charles M Perou; D Neil Hayes; Richard Gibbs; Marco Marra; Gordon B Mills; Eric Lander; Paul Spellman; Richard Wilson; Chris Sander; John Weinstein; Matthew Meyerson; Stacey Gabriel; Peter W Laird; David Haussler; Gad Getz; Lynda Chin
Journal: Cell Date: 2013-10-10 Impact factor: 41.582

2. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data.

Authors: Goo Jun; Matthew Flickinger; Kurt N Hetrick; Jane M Romm; Kimberly F Doheny; Gonçalo R Abecasis; Michael Boehnke; Hyun Min Kang
Journal: Am J Hum Genet Date: 2012-10-25 Impact factor: 11.025

3. ContEst: estimating cross-contamination of human samples in next-generation sequencing data.

Authors: Kristian Cibulskis; Aaron McKenna; Tim Fennell; Eric Banks; Mark DePristo; Gad Getz
Journal: Bioinformatics Date: 2011-07-29 Impact factor: 6.937

4. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs.

Authors: Christopher T Saunders; Wendy S W Wong; Sajani Swamy; Jennifer Becq; Lisa J Murray; R Keira Cheetham
Journal: Bioinformatics Date: 2012-05-10 Impact factor: 6.937

5. EXCAVATOR: detecting copy number variants from whole-exome sequencing data.

Authors: Alberto Magi; Lorenzo Tattini; Ingrid Cifola; Romina D'Aurizio; Matteo Benelli; Eleonora Mangano; Cristina Battaglia; Elena Bonora; Ants Kurg; Marco Seri; Pamela Magini; Betti Giusti; Giovanni Romeo; Tommaso Pippucci; Gianluca De Bellis; Rosanna Abbate; Gian Franco Gensini
Journal: Genome Biol Date: 2013 Impact factor: 13.583

6. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets.

Authors: Andreas Wilm; Pauline Poh Kim Aw; Denis Bertrand; Grace Hui Ting Yeo; Swee Hoe Ong; Chang Hua Wong; Chiea Chuen Khor; Rosemary Petric; Martin Lloyd Hibberd; Niranjan Nagarajan
Journal: Nucleic Acids Res Date: 2012-10-12 Impact factor: 16.971

7. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.

Authors: Kristian Cibulskis; Michael S Lawrence; Scott L Carter; Andrey Sivachenko; David Jaffe; Carrie Sougnez; Stacey Gabriel; Matthew Meyerson; Eric S Lander; Gad Getz
Journal: Nat Biotechnol Date: 2013-02-10 Impact factor: 54.908
