Zhenhua Yu1, Yuanning Liu1, Yi Shen1, Minghui Wang2, Ao Li2. 1. School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China. 2. School of Information Science and Technology and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.
Abstract
MOTIVATION: Whole-genome sequencing of tumor samples has been demonstrated as an efficient approach for comprehensive analysis of genomic aberrations in cancer genome. Critical issues such as tumor impurity and aneuploidy, GC-content and mappability bias have been reported to complicate identification of copy number alteration and loss of heterozygosity in complex tumor samples. Therefore, efficient computational methods are required to address these issues. RESULTS: We introduce CLImAT (CNA and LOH Assessment in Impure and Aneuploid Tumors), a bioinformatics tool for identification of genomic aberrations from tumor samples using whole-genome sequencing data. Without requiring a matched normal sample, CLImAT takes integrated analysis of read depth and allelic frequency and provides extensive data processing procedures including GC-content and mappability correction of read depth and quantile normalization of B-allele frequency. CLImAT accurately identifies copy number alteration and loss of heterozygosity even for highly impure tumor samples with aneuploidy. We evaluate CLImAT on both simulated and real DNA sequencing data to demonstrate its ability to infer tumor impurity and ploidy and identify genomic aberrations in complex tumor samples. AVAILABILITY AND IMPLEMENTATION: The CLImAT software package can be freely downloaded at http://bioinformatics.ustc.edu.cn/CLImAT/.
Various aberrations such as amplification, deletion and translocation of segmental regions are common features of cancer genomes and play an important role in tumorigenesis and progression (Albertson ; Stratton ). It is reported that dysfunction of oncogenes and tumor suppressor genes is related to frequent genomic aberrations (Bignell ; Stephens ; Stratton ). Genomic aberrations in specific regions have been used as an indicator of cancer aggressiveness and clinical outcome (Carén ; Suzuki ). Genome-wide copy number alteration (CNA) and loss of heterozygosity (LOH) are two essential features of cancer genomes, and accurate detection of these abnormalities is a crucial step to assess genomic aberrations and cancer-related genes. Experimental technologies are now available for high-throughput profiling of genome-wide aberrations in tumor samples, such as array comparative genomic hybridization (Park, 2008), single nucleotide polymorphism (SNP) genotyping array (Li ; Peiffer ) and more recently, whole-genome sequencing (WGS) technology for massively parallel sequencing of DNA (Mardis, 2008; Metzker, 2009; Morozova and Marra, 2008; Schuster, 2007). By allowing for comprehensive analysis of genomic aberrations in cancer genomes, WGS has been demonstrated as an efficient platform for studies of human cancers (Metzker, 2009).
Although several computational approaches have been proposed for assessing genomic aberrations from tumor sequencing data (Boeva , 2012; Carter ; Gusnanto ; Ha ; Mayrhofer ; Sathirapongsasuti ; Xi ), most of these methods do not effectively address the critical issues encountered in interpreting complex tumor samples. For example, tumor samples are often infiltrated with normal stroma, resulting in inevitable contamination of normal DNA and dilution of somatic aberration signals (Boeva , 2012; Gusnanto ; Ha ; Mayrhofer ).
Tumor sample impurity can significantly alter WGS data and therefore complicates genomic aberration detection, especially when normal cells dominate in tumor samples. Recent methods, such as FREEC (Boeva , 2012) and APOLLOH (Ha ), have been proposed to address this issue. FREEC constructs copy number and B-allele frequency (BAF) profiles to detect CNA and allelic content in cancer genomes, with optional correction for tumor impurity. APOLLOH is designed for LOH detection using tumor-normal paired samples, and the issue of tumor impurity is addressed by a two-component mixture model for allelic read counts.
In addition to tumor impurity, tumor aneuploidy is another critical issue in genomic aberration detection, which is caused by various numerical and structural chromosomal abnormalities frequently observed in cancer genomes (Carter ). Although APOLLOH introduces a delicate statistical model to eliminate the effect of tumor impurity, it does not take account of tumor aneuploidy in modeling and analyzing tumor WGS data. To handle aneuploid tumor samples, FREEC provides an option for users to input tumor ploidy. Currently, automatic correction for tumor aneuploidy using WGS data remains a challenging task. Theoretically, it is often difficult to determine the actual ploidy of cancer cells by sequencing technology (Gusnanto ). In some particular cases, somatic aberration signals could present similar characteristics among genomes of different ploidy (Gusnanto ; Oesper ), which makes it hard to accurately estimate the tumor ploidy. Notably, interpretation of WGS data is even more challenging in tumor samples confounded by both tumor impurity and aneuploidy, as the two issues usually cannot be resolved separately (Oesper ).
So far, only a few algorithms have been proposed for analyzing WGS data of impure tumor samples with aneuploidy (Carter ; Gusnanto ; Mayrhofer ; Oesper ).
For example, CNAnorm (Gusnanto ) uses a mixture normal distribution for ratios of tumor-normal read counts to correct tumor impurity and aneuploidy. However, CNAnorm assumes that the most common component in the normal mixture is diploid, which may not hold for aneuploid tumor samples. Moreover, it cannot detect LOH in cancer genomes. Another approach, ABSOLUTE (Carter ), was originally introduced to detect CNA from SNP array data by inferring tumor impurity and ploidy. Although it can be adapted to analyze DNA sequencing data, a previous study shows that the underlying statistical models used by ABSOLUTE do not comprehensively describe the characteristics of DNA sequencing data and therefore may sometimes gravely misestimate the tumor impurity and ploidy (Oesper ). Recently, Mayrhofer et al. introduced a novel method called Patchwork (Mayrhofer ) for allele-specific copy number analysis of sequenced tumor tissue in consideration of tumor impurity and tumor aneuploidy, which requires intermediate arguments determined by users. In addition, it is noteworthy that another method called THetA was proposed recently to analyze tumor sequencing data (Oesper ). THetA mainly focuses on the inference of cancer subclones in heterogeneous tumor samples and cannot detect LOH in cancer genomes, as it only utilizes read count data. Therefore, it is essential to develop an efficient approach for analysis of tumor sequencing data by comprehensively addressing the challenge of tumor impurity and aneuploidy.
In this study, we present a novel method called CLImAT (CNA and LOH Assessment in Impure and Aneuploid Tumors) to detect genomic aberrations with automatic correction for both tumor impurity and aneuploidy.
Without requiring a matched normal sample, CLImAT fully explores both read depth (RD) and allele frequency derived from tumor WGS data, and provides extensive data processing procedures including elimination of sequencing/mapping bias and quantile normalization (QN) of allele frequency data. By adopting an integrated Hidden Markov Model (HMM) that quantitatively delineates tumor impurity and ploidy, CLImAT provides accurate identification of various kinds of genomic aberrations even for highly impure tumor samples with aneuploidy. We apply CLImAT to both simulated and real tumor data, and the results demonstrate the superior performance of CLImAT in analysis of genomic aberrations using tumor WGS data.
2 METHODS
2.1 Simulated data by sampling reads from tumor-normal mixture
To assess the performance of CLImAT for complex tumor samples, we generate simulated tumor samples with different impurity and ploidy. Similar to the procedure proposed previously (Duan ), a virtual tumor-normal mixture experiment is performed on chromosome 20 of the human reference genome (NCBI build 36, hg18) by sampling reads from a control genome and a test genome with tumor impurity ranging from 0 to 0.9 in increments of 0.1 (Supplementary Methods). The test genome is constructed by dividing the reference genome into 20 non-overlapping and equally sized segments, which are randomly assigned particular kinds of genomic aberrations (Supplementary Figure S1). Sampled reads from both control and test genomes are mapped to the reference using Bowtie (Langmead ) with default parameters. BAM files and pileups are generated by SAMtools (Li ). For each combination of predetermined tumor impurity and ploidy (diploidy, triploidy and tetraploidy), three BAM files are generated at 10×, 30× and 60× sampled coverage, respectively. The average copy number (ACN) is 2.48, 3.19 and 4.00 for diploid, triploid and tetraploid tumor samples, respectively. In this way, we generate a total of 90 simulated tumor samples (10 impurity levels × 3 ploidies × 3 coverages) for comprehensive evaluation of prediction performance. Detailed information about construction of test genomes and the read sampling process is provided in Supplementary Methods.
2.2 Real sequencing data of tumor samples
WGS data from three unpaired primary triple negative breast cancer (TNBC) samples described in a previous study (Shah ) are adopted in this study. Each sample was sequenced at ∼30× coverage on the Life/ABI SOLID sequencing platform. Reads were mapped to the reference genome hg18 using BioScope. The data was downloaded from European Genome-Phenome Archive (EGA) with accession number EGAS00001000132.
2.3 Pipeline of CLImAT
The pipeline of CLImAT is depicted in Supplementary Figure S2. RD used in this study is retrieved from the BAM file using SAMtools (Li ) and is further processed to correct GC and mappability bias. BAF signals of all known SNPs in dbSNP database (Sherry ) are normalized to eliminate allelic bias. Both RD and BAF signals are modeled by an integrated HMM for identifying genomic aberrations, including CNA and LOH, and estimating tumor impurity and ploidy.
2.4 Deriving RD and BAF from tumor WGS data
In this study, RD is obtained by counting the reads whose starting position lies within a 1000-bp window centered at each SNP. For BAF, we count the reads that cover the SNP as the total read count and the reads with a non-reference base at the SNP as the B allelic read count. The BAF of the SNP is then calculated as the proportion of B allelic reads among the total reads. Consistent with the procedure adopted in a previous study (Ha ), data filtering is applied to eliminate positions that have either low depth (<10 reads for 30/60× coverage and <5 reads for 10× coverage) or high depth (>250 reads).
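The BAF calculation and depth filter described above can be sketched as follows. This is a minimal illustration, not the CLImAT implementation: the `SnpCounts` container and function names are hypothetical, and the sketch assumes the depth filter is applied to the number of reads covering the SNP, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnpCounts:
    """Hypothetical container for the per-SNP counts described in the text."""
    window_reads: int  # reads starting inside the 1000-bp window centered at the SNP (RD)
    total_reads: int   # reads covering the SNP position
    b_reads: int       # reads carrying a non-reference base at the SNP


def min_depth(coverage: int) -> int:
    # Thresholds quoted in the text: <5 reads is "low" at 10x, <10 at 30x/60x.
    return 5 if coverage <= 10 else 10


def baf(snp: SnpCounts, coverage: int, max_depth: int = 250) -> Optional[float]:
    """Return the B-allele frequency, or None if the SNP fails the depth filter.

    Assumption: the filter is applied to total_reads (depth at the SNP).
    """
    if snp.total_reads < min_depth(coverage) or snp.total_reads > max_depth:
        return None
    return snp.b_reads / snp.total_reads
```

For instance, a SNP covered by 30 reads, 12 of which carry the non-reference base, has a BAF of 0.4, while a SNP covered by 8 reads is kept at 10× coverage but discarded at 30× coverage.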
2.5 Signal correction and normalization
GC-content and mappability may heavily affect RD signals and bias CNA detection. Therefore, as the first step we perform a correction procedure to remove the bias in RD signals. For each window used in RD calculation, GC-content is measured by calculating the G + C percentage, and the mappability score is defined as the average of mappability values. The mappability file used in this study was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeMapability/. Following the procedure used in (Yoon ), we scale GC-content and mappability score to integer values between 0 and 100, and perform correction of RD signals using the following equation:
$$rd_i^{c} = rd_i \cdot \frac{m}{m_{GC,MAP}}$$
where $rd_i^{c}$ is the corrected RD of the ith window, $rd_i$ is the original RD of the ith window, $m$ is the overall median RD of all the windows and $m_{GC,MAP}$ is the median RD of the windows that have the same GC-content and mappability values as the ith window.
It has been reported that a loss of reads (LOR) issue arises in the alignment step of sequencing data processing (Kim ). Indeed, most aligners, such as BioScope and BWA (Li and Durbin, 2009), preferentially align reads to the reference allele over the alternative allele. Reads originating from the chromosome carrying the alternative allele tend to be discarded because of mismatches between the reference sequence and the read sequence, leading to an asymmetrical distribution of allelic frequencies. Therefore, it is necessary to normalize the BAF data for better estimation of LOH and other related parameters, including tumor impurity and ploidy. We adopt an efficient QN (Bolstad ) procedure to address this issue (Supplementary Methods).
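The RD correction described in this section rescales each window by the ratio of the overall median RD to the median RD of the windows sharing its GC-content and mappability values. The sketch below illustrates that rule under stated assumptions: the function name is hypothetical, GC-content and mappability are assumed to be already scaled to integers in 0 to 100, and the zero-median guard is an addition for robustness rather than something taken from the paper.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Sequence, Tuple


def correct_read_depth(
    rd: Sequence[float],
    gc: Sequence[int],
    mappability: Sequence[int],
) -> List[float]:
    """Rescale each window's RD by overall median / median of its (GC, mappability) stratum."""
    if not (len(rd) == len(gc) == len(mappability)):
        raise ValueError("rd, gc and mappability must have the same length")

    overall_median = median(rd)

    # Group windows that share the same integer GC-content and mappability score.
    strata: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for depth, g, m in zip(rd, gc, mappability):
        strata[(g, m)].append(depth)
    stratum_median = {key: median(values) for key, values in strata.items()}

    corrected = []
    for depth, g, m in zip(rd, gc, mappability):
        group_median = stratum_median[(g, m)]
        # Assumption: windows in a stratum with zero median RD are set to 0.
        corrected.append(depth * overall_median / group_median if group_median > 0 else 0.0)
    return corrected
```

After correction, every stratum has the same median RD as the whole data set, which is the property the procedure is designed to enforce.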
2.6 Integrated HMM
We propose an integrated HMM that takes RD and BAF data as input. Supplementary Table S1 shows the hidden states defined in the HMM with detailed description of each HMM state regarding copy number, tumor genotype mutated from normal cell genotype and zygosity status. Tumor and normal genotype pairs are used to give a detailed view of the intrinsic relationship between genotypes of tumor and normal cells admixed in tumor samples. For example, (AAAB, AB) is the case that tumor genotype ‘AAAB’ is derived from normal cell genotype ‘AB’.
2.6.1 Emission probabilities
Aligning a read to a genomic position can be treated as a Bernoulli trial (Ha ). Thus, given the number of reads that cover a SNP position, the number of reads that have a non-reference base at that position is modeled by a binomial distribution. Suppose the B allelic read count and total read count of the ith SNP are $b_i$ and $N_i$, respectively; the observation probability for hidden state c can then be formulated as:
$$P(b_i \mid N_i, c) = \frac{1}{g_c} \sum_{k=1}^{g_c} \binom{N_i}{b_i} \left(\frac{z_{ck}}{y_c}\right)^{b_i} \left(1 - \frac{z_{ck}}{y_c}\right)^{N_i - b_i}$$
where $g_c$ is the number of tumor genotypes included in state c. The ACN $y_c$ and average B allele copy number $z_{ck}$ for state c are defined as:
$$y_c = w\, n_s + (1 - w)\, n_c, \qquad z_{ck} = w\, n_s u_s + (1 - w)\, n_c u_{ck}$$
where $n_s$ is the normal copy number and is fixed to 2 in this study, $n_c$ is the tumor copy number in state c and $w$ is the level of tumor impurity. $u_s$ denotes the expected BAF value of normal cells and is fixed to 0.5, and $u_{ck}$ represents the expected BAF value of the kth tumor genotype in state c.
Taking into account the over-dispersed distribution of RD values (Anders and Huber, 2010), we use a negative binomial distribution to model RD signals. Suppose that the RD of the ith SNP is $d_i$; the observation probability for hidden state c can be formulated as:
$$P(d_i \mid c) = \frac{\Gamma\!\left(d_i + \frac{\lambda_c p}{1 - p}\right)}{\Gamma\!\left(\frac{\lambda_c p}{1 - p}\right) d_i!}\; p^{\frac{\lambda_c p}{1 - p}} (1 - p)^{d_i}$$
where $\Gamma$ is the gamma function and $p$ is a parameter of the negative binomial distribution defined as the probability of success. The average read count for state c is defined as:
$$\lambda_c = \frac{y_c}{2}\, \lambda + o$$
where $\lambda$ is the mean value of copy neutral read count and varies with respect to tumor ploidy change. $o$ accounts for background RD noise resulting from sequencing errors and wrongly mapped reads.
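To make the emission model concrete, the sketch below evaluates both observation probabilities for a single SNP. It is an illustration under assumptions rather than the CLImAT code: the function names are hypothetical, the tumor genotypes of a state are assumed to be equally weighted, the negative binomial is parameterized so that its mean equals the state's average read count, and that average is assumed to scale as λ times the ACN divided by 2, plus the background term o.

```python
from math import exp, lgamma, log
from typing import Sequence

N_NORMAL = 2    # normal copy number, fixed to 2 in the text
U_NORMAL = 0.5  # expected BAF of normal cells, fixed to 0.5 in the text


def average_copy_number(w: float, n_tumor: int) -> float:
    """ACN of a state: normal cells (fraction w) mixed with tumor cells (fraction 1 - w)."""
    return w * N_NORMAL + (1.0 - w) * n_tumor


def expected_baf(w: float, n_tumor: int, u_tumor: float) -> float:
    """Expected BAF of the mixture: average B-allele copy number over ACN."""
    y = average_copy_number(w, n_tumor)
    if y == 0:
        raise ValueError("no DNA expected (pure tumor with zero copies)")
    z = w * N_NORMAL * U_NORMAL + (1.0 - w) * n_tumor * u_tumor
    return z / y


def baf_emission(b: int, n: int, w: float, n_tumor: int, genotype_bafs: Sequence[float]) -> float:
    """Equal-weight mixture of binomials over the tumor genotypes of one state."""
    total = 0.0
    for u in genotype_bafs:
        # Clamp away from 0 and 1 so the logarithms stay finite (an implementation choice).
        q = min(max(expected_baf(w, n_tumor, u), 1e-9), 1.0 - 1e-9)
        log_pmf = (lgamma(n + 1) - lgamma(b + 1) - lgamma(n - b + 1)
                   + b * log(q) + (n - b) * log(1.0 - q))
        total += exp(log_pmf)
    return total / len(genotype_bafs)


def rd_emission(d: int, w: float, n_tumor: int, lam: float, p: float, o: float = 0.0) -> float:
    """Negative binomial pmf with success probability p and mean lam_c."""
    lam_c = lam * average_copy_number(w, n_tumor) / 2.0 + o  # assumed form of the state mean
    if lam_c <= 0:
        raise ValueError("state mean must be positive")
    r = lam_c * p / (1.0 - p)  # chosen so that the mean r * (1 - p) / p equals lam_c
    return exp(lgamma(d + r) - lgamma(r) - lgamma(d + 1) + r * log(p) + d * log(1.0 - p))
```

As a usage example, a three-copy heterozygous state with tumor genotypes AAB and ABB would be passed as `genotype_bafs=[1/3, 2/3]`; with half of the cells normal (`w=0.5`), a one-copy deletion with the B allele lost gives an expected BAF of 1/3 rather than 0.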
2.6.2 EM algorithm for parameter estimation
We employ the expectation maximization (EM) algorithm for HMM training and parameter estimation. In the expectation step, the expectation of the partial log-likelihood of BAF is formulated as:
$$E_{BAF} = \sum_{i} \sum_{c} \gamma_i(c)\, \log P(b_i \mid N_i, c)$$
where $\gamma_i(c)$ represents the posterior probability that the ith SNP is in state c and is calculated by the forward-backward algorithm (Rabiner, 1989). Similarly, the expectation of the partial log-likelihood function of RD can be formulated as:
$$E_{RD} = \sum_{i} \sum_{c} \gamma_i(c)\, \log P(d_i \mid c)$$
In the maximization step of the EM algorithm, we use the Newton algorithm to update the parameters in the emission probabilities. For example, during iteration n we update a parameter $\theta$ by using the following formula, where $E$ is the expected log-likelihood that depends on $\theta$:
$$\theta^{(n+1)} = \theta^{(n)} - \left(\frac{\partial^2 E}{\partial \theta^2}\right)^{-1} \frac{\partial E}{\partial \theta}\,\Bigg|_{\theta = \theta^{(n)}}$$
All the parameters are iteratively updated until the EM algorithm converges. Copy number and tumor genotype for each SNP are determined by the hidden state with the largest conditional probability. In addition, post-processing is performed for copy number annotation of highly amplified regions (copy number >7) according to the mean RD values of all SNPs within these regions (Supplementary Methods). To evaluate the reliability of CLImAT results, we also calculate a reliability score for each region to measure how well the data fit the model (Supplementary Methods).
3 RESULTS
3.1 Correction and normalization of RD and BAF signals
We assess the performance of GC-content and mappability correction and plot the distribution of RD with respect to GC-content and mappability score for 1–3 copies (Supplementary Figure S3). Before correction, RD signals demonstrate a unimodal distribution with respect to GC-content and are positively correlated with mappability scores. After correction, both GC-content and mappability bias are largely eliminated. Further investigation suggests that the order in which GC-content and mappability correction are applied to tumor WGS data affects the final results, and that simultaneous correction for both biases performs better (Supplementary Figure S4).
Owing to the LOR issue, BAF plots of tumor samples display asymmetrical bands positioned around 0.5 (Supplementary Figure S5A). The altered distribution of BAF signals seriously hampers accurate identification of genomic aberrations in tumor samples. After applying the QN procedure, BAF signals are largely corrected, with symmetrical bands positioned around 0.5 (Supplementary Figure S5B).
3.2 Applying CLImAT to simulated data
We apply CLImAT to simulated tumor data, and the results are shown in Supplementary Figure S6. The RD and BAF signals vary dramatically with increased tumor impurity for both diploid and triploid genomes. Especially, with 90% normal cells admixed in the tumor sample, both RD and BAF signals for aberrant regions are dramatically attenuated. CLImAT correctly detects all aberrant regions and provides CNA and LOH prediction with reasonable performance.
3.2.1 Estimation of tumor impurity and ploidy
We examine tumor impurity estimated by CLImAT and ABSOLUTE (Carter ) on simulated data, and the results of tumor samples at 60× coverage are shown in Figure 1A. CLImAT accurately estimates tumor impurity from 0 to 0.9 with significant correlation with the ground truth (correlation coefficient = 0.999, P = 6.24 × 10^−21 for diploid samples, correlation coefficient = 0.999, P = 2.75 × 10^−12 for triploid samples and correlation coefficient = 0.999, P = 1.42 × 10^−11 for tetraploid samples), indicating that CLImAT can precisely recover the proportion of cancer cells in tumor samples. In contrast, ABSOLUTE is less accurate, and its estimates sometimes deviate noticeably from the ground truth. Similar results are observed for simulated samples at 30× coverage (Supplementary Figure S7). To assess the performance of tumor ploidy estimation, we calculate the ACNs for simulated samples from the results of ABSOLUTE and CLImAT. As shown in Figure 1B, CLImAT exhibits a clear advantage over ABSOLUTE in estimating tumor ploidy. For example, CLImAT correctly identifies all diploid samples at 30× coverage as diploid, whereas ABSOLUTE tends to assign them as hyperploid. Taken together, these results suggest that CLImAT can efficiently estimate both tumor impurity and tumor ploidy from complicated tumor samples.
Fig. 1.
Estimated tumor impurity and ACN of simulated samples. (A) Tumor impurity estimated by ABSOLUTE and CLImAT for samples at 60× coverage. 2p: diploid samples, 3p: triploid samples, 4p: tetraploid samples. (B) ACNs estimated by ABSOLUTE and CLImAT for simulated samples. Each bar shows the mean and standard deviation of estimated ACNs obtained from 10 samples with tumor impurity ranging from 0 to 0.9
3.2.2 LOH and CNA detection
We adopt the performance evaluation procedure proposed in APOLLOH (Ha ), in which all the calls of the informative (heterozygous) positions are used as the gold standard to compare the abilities of different computational methods in detecting genomic aberrations. Accordingly, the CNA/LOH calls of heterozygous positions pre-determined in unpaired simulated data are treated as the ground truth. We evaluate performance in the standard way, separately comparing the results of each computational method investigated in this study with the ground truth in terms of sensitivity and specificity (more details of performance evaluation are provided in Supplementary Methods). The LOH detection results of three computational methods, FREEC (Boeva ), SNVMix (Goya ) and CLImAT, are shown in Figure 2. For diploid tumor samples (Fig. 2A), FREEC shows high specificity in all tests and the sensitivity is generally good at medium tumor impurity levels. Compared with the other methods, CLImAT demonstrates strong robustness to tumor impurity and maintains high sensitivity (>0.99) across all tumor samples with impurity level <0.9. It also keeps consistently high specificity with respect to different tumor impurity levels (<0.8) and sampled coverage. Similar results are observed for triploid and tetraploid tumor samples (Fig. 2B and C).
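Sensitivity and specificity as used in this evaluation can be computed per heterozygous position from paired truth and call labels. The sketch below uses the standard definitions; the function name and the boolean encoding (True meaning the position is called as LOH, or as CNA) are assumptions for illustration, and the exact per-position bookkeeping used in the paper is described in its Supplementary Methods.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(truth: Sequence[bool], calls: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for per-position binary calls.

    truth[i] is True if position i is aberrant in the ground truth;
    calls[i] is True if the method calls position i aberrant.
    """
    if len(truth) != len(calls):
        raise ValueError("truth and calls must have the same length")
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    # Undefined ratios (no positives or no negatives in the truth) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

With three truly aberrant positions of which two are called, and two normal positions of which one is wrongly called, the function returns a sensitivity of 2/3 and a specificity of 0.5.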
Fig. 2.
LOH detection performance of FREEC, SNVMix and CLImAT on unpaired simulated data. (A) Results for diploid samples. (B) Results for triploid samples. (C) Results for tetraploid samples
Next, CNA detection performance is evaluated for FREEC and CLImAT, and the results suggest that FREEC has good performance for diploid tumor samples when tumor impurity is <0.5 (Supplementary Figure S8A). At larger tumor impurity levels, the sensitivity decreases while the specificity remains high. With similar specificity across all impurity levels, CLImAT is able to retain high sensitivity (>0.99) when the tumor impurity is <0.9. For triploid and tetraploid tumor samples (Supplementary Figure S8B and C), CLImAT also performs well in identifying CNA regions. At the same time, we investigate the performance of Patchwork, and results of simulated tumor samples are shown in Supplementary Table S2. We find that in general both Patchwork and CLImAT can provide accurate aberration detection with similar performance, if the intermediate arguments of Patchwork are correctly determined by the user. Furthermore, we test CLImAT on low-coverage sequencing data, and the results for simulated data with 10× coverage suggest that CLImAT may also be applied to low-coverage tumor WGS data when the tumor impurity level is not high (Supplementary Figure S9).
In addition to aberration detection for tumor samples, we examine the reliability score (Supplementary Methods) that is used to measure how well the data fit the model. For simulated tumor data with two cancer subclones (Supplementary Figure S10), the reliability score for the heterogeneous region is significantly lower than those of the other, homogeneous regions, suggesting that it can help the user evaluate the fit of the model and better interpret the results.
3.3 Applying CLImAT to TNBC samples
Three TNBC samples sequenced at ∼30× coverage, which were also assayed by Affymetrix SNP6.0 array for comparison, are adopted to examine the performance of CLImAT. The results generated from the SNP arrays by ASCAT (Van Loo ) are used as the ground truth. We first evaluate ACN and impurity of these tumor samples using different methods, and the results are shown in Table 1. From the results of ASCAT, sample 1 is identified as aneuploid tumor, whereas samples 2 and 3 are identified as hyperploid tumors. Tumor sample 1 demonstrates genome-wide deletions with ACN of 1.67, whereas tumor samples 2 and 3 include dramatic amplifications along the whole cancer genome, with ACN of 3.02 and 4.16, respectively. CLImAT provides consistent estimation of ACN for the three tumor samples. Also, the tumor impurity levels estimated by CLImAT are in good concordance with the ground truth. These results suggest CLImAT has the potential for automatically identifying and correcting for tumor impurity and aneuploidy in complicated tumor samples.
Table 1.
ACN and tumor impurity estimated by FREEC, ASCAT and CLImAT for primary TNBC samples

| Method | ACN, Sample 1 | ACN, Sample 2 | ACN, Sample 3 | Impurity, Sample 1 | Impurity, Sample 2 | Impurity, Sample 3 |
|---|---|---|---|---|---|---|
| ASCAT | 1.67 | 3.02 | 4.16 | 0.26 | 0.44 | 0.38 |
| CLImAT | 1.87 | 3.15 | 4.13 | 0.19 | 0.43 | 0.32 |
| FREEC | 1.92 | 3.77 | 4.92 | 0.20 | 0.22 | 0.29 |
Next, we examine LOH detection performance of FREEC, CLImAT and SNVMix (Fig. 3). The same performance evaluation procedure used for the simulated data is adopted here, and the CNA/LOH calls of heterozygous positions recognized by ASCAT are treated as the ground truth (Supplementary Methods). For all three tumor samples, CLImAT compares favorably to SNVMix and FREEC. It achieves superior sensitivity of 0.98, 0.97 and 0.94 for samples 1, 2 and 3, respectively, with specificity better than or comparable with those of the other methods. We also examine the performance of CNA detection, and the results in Supplementary Table S3 show that CLImAT is highly consistent with ASCAT. Furthermore, Figure 4 illustrates the WGS and SNP array data for chromosomes 8, 13 and 14 of tumor sample 1, in which both BAF and LRR/RD signals generated from different platforms show similar patterns on aberrant regions. Both CLImAT and ASCAT identify consecutive LOH regions spanning chromosomes 8, 13 and 14, broad hemizygous deletions on 8p(11.23–22), 8q(11.21–22.1), 13q(21.2–31.3), 14p(11.1–12) and 14q(21.3–23.1, 23.2–23.3 and 32.13–32.33), and broad amplifications on chromosome 13q(12.11–13.3 and 32.1–34). In addition, benefiting from the high resolution of the WGS platform, CLImAT provides more precise detection of small focal aberrations than ASCAT. For example, on 8p23 ASCAT detects only one homozygous deletion, whereas CLImAT identifies two additional homozygous deletion regions on 8p23.1, which harbors a potential tumor suppressor gene PinX1 related to telomerase activity and chromosome stability (Zhou ).
Fig. 3.
LOH detection performance for primary TNBC samples. LOH detected by ASCAT from Affymetrix SNP6.0 arrays is used as ground truth
Fig. 4.
Result comparison of CLImAT and ASCAT for TNBC sample 1. BAF is presented by five different aberration states: homozygous deletion (HOMD), hemizygous deletion (HEMD), heterozygous (HET), copy neutral LOH (NLOH) and amplified LOH (ALOH). LRR/RD is presented by homozygous deletion (HOMD), hemizygous deletion (HEMD), neutral (NEUT) and amplification (AMP)
4 DISCUSSION AND CONCLUSION
With finer resolution than previous genomic technologies, WGS allows more comprehensive analysis of tumor aberrations. In this study, we introduce an efficient computational approach for this purpose, which presents remarkable advantages over existing methods for interpretation of complicated tumor samples without prior knowledge of tumor impurity and ploidy. One advantage of CLImAT is the correction and normalization procedure for improving data quality of unpaired tumor samples. For example, BAF is normalized in CLImAT to eliminate LOR bias, which is indispensable for further statistical modeling and analysis of WGS data. GC-content and mappability correction of RD is also a crucial step for detecting aberrations in unpaired tumor samples.
Another advantage of CLImAT is that it performs integrated analysis of RD and BAF using a novel HMM to provide accurate detection of genome-wide aberrations in tumor samples. The emission probabilities of the HMM used in CLImAT give a comprehensive description of the statistical behavior of sequencing data generated from tumor samples. Unlike previous approaches using Poisson distributions, the more flexible negative binomial distribution is adopted to model over-dispersed RD signals. Moreover, the relevant parameters, including tumor impurity and ploidy, are automatically estimated by the EM algorithm. These approaches ensure the performance of CLImAT for complex tumor samples.
Despite the advantages mentioned above, CLImAT also has limitations in modeling and analysis of tumor sequencing data. First, CLImAT cannot be applied to exome-sequencing data, as it is originally designed to deal with unpaired WGS data. Second, although >2.6 million SNPs are investigated in CLImAT and only 1.5% of adjacent SNPs are separated by a relatively large distance (>5 kb), the resolution of CLImAT may still be limited by genomic breakpoints that lie between SNPs.
To further improve the resolution of CLImAT, we provide an option to estimate copy number for the regions between distant SNPs (>1 kb) by calculating the corresponding RD signals (Supplementary Methods). Third, CLImAT does not account for the issue of tumor heterogeneity (Mayrhofer ; Oesper ). The basic assumption adopted in CLImAT is that there is a single copy number for all tumor cells, which will not hold if multiple subclones exist in a tumor sample. Recently, Oesper et al. investigated tumor heterogeneity using DNA sequencing data and showed that multiple tumor subclones may often exist in tumor samples (Oesper ), suggesting that tumor heterogeneity is another key factor in interpreting tumor sequencing data. In heterogeneous tumor samples, the somatic aberrant signals derived from tumor sequencing data can be complicated, which makes it hard to deconvolute subclonal aberrations. Therefore, more advanced methods are required to assess tumor heterogeneity in tumor sequencing data.
In conclusion, we present CLImAT, an efficient and powerful bioinformatics tool for detection of genomic aberrations using tumor WGS data. We expect that it will be helpful for comprehensive interpretation of cancer genomes and potentially useful in clinical diagnosis and treatment of cancers.