Simon Cabello-Aguilar1, Julie A Vendrell1, Charles Van Goethem2, Mehdi Brousse3, Catherine Gozé1, Laurent Frantz4,5, Jérôme Solassol1. 1. Laboratoire de Biologie des Tumeurs Solides, Département de Pathologie et Oncobiologie, CHU Montpellier, Université de Montpellier, 34295 Montpellier, France. 2. Laboratoire de Génétique Moléculaire, CHU Montpellier, Université de Montpellier, 34295 Montpellier, France. 3. Laboratoire d'Hématologie Biologique, CHU Montpellier, Université de Montpellier, 34295 Montpellier, France. 4. Palaeogenomics Group, Department of Veterinary Sciences, Ludwig Maximilian University of Munich, Munich, Germany. 5. School of Biological and Chemical Sciences, Queen Mary University of London, London, UK.
Abstract
Copy-number variations (CNVs) are an essential component of genetic variation distributed across large parts of the human genome. CNV detection from next-generation sequencing data and artificial intelligence algorithms have progressed in recent years. However, only a few tools have taken advantage of machine-learning algorithms for CNV detection, and none propose using artificial intelligence to automatically detect probable CNV-positive samples. The most developed approach is to use a reference or normal dataset to compare with the samples of interest, and it is well known that selecting appropriate normal samples represents a challenging task that dramatically influences the precision of results in all CNV-detecting tools. With careful consideration of these issues, we propose here ifCNV, a new software based on isolation forests that creates its own reference, available in R and python with customizable parameters. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples. It was validated using targeted next-generation sequencing (NGS) datasets from diverse origins (capture and amplicon, germline and somatic), and it exhibits high sensitivity, specificity, and accuracy. ifCNV is a publicly available open-source software (https://github.com/SimCab-CHU/ifCNV) that allows the detection of CNVs in many clinical situations.
Copy-number variations (CNVs) are a class of structural variations that result from the deletion or duplication of a DNA fragment. About 1,500 CNV regions have already been discovered in humans, accounting for ∼12%–16% of the entire human genome, making them one of the most common types of genetic variation. Although the biological impact of the majority of these CNVs remains uncertain, nearly 50% of known CNVs overlap with protein-coding regions, and many are involved in genetic diseases. Recent studies have demonstrated that CNVs can be implicated in many rare diseases, such as inherited retinal dystrophies, and in diseases that involve dosage-sensitive developmental genes, such as Charcot-Marie-Tooth disease and DiGeorge syndrome, among others [4–6]. CNVs, resulting from gene amplification (copy-number gain) as well as gene deletion (copy-number loss), are common in cancer cells, and multiple studies have shown that duplication or deletion of specific genes can contribute to tumor growth and to resistance to anti-tumor therapies. In cancer cells, the size of these molecular alterations can vary dramatically, from one or a few exons to an entire chromosomal arm.
Although most CNVs found in cancer cells are likely to have accumulated as a direct consequence of clonal evolution during the disease course, some have been identified as playing a role in the early development of cancer (e.g., CNVs located in BRCA1/2 in familial breast and ovarian cancer). In fact, it has been estimated that CNVs represent more than 10% of the molecular alterations linked to cancer predisposition, making their detection a priority. Detection of acquired (somatic) focal copy-number changes is also required for diagnosis, prognosis, and the therapeutic management of patients with cancer. For example, loss of chromosomal arms 1p and 19q is closely associated with oligodendrogliomas, a subtype of primary brain tumors, and with a favorable prognosis in diffuse gliomas. Focal copy-number increases are biomarkers predictive of responses to particular therapies; for example, patients with oncogenic ERBB2 amplification in breast cancer respond well to trastuzumab, and acquired resistance to tyrosine kinase inhibitors is exhibited in patients with MET-amplified non-small cell lung carcinomas.
Recently, the rapid implementation of high-throughput next-generation sequencing (NGS) methods, especially targeted DNA panels, in clinical laboratories has led to the emergence of a fairly large number of pipelines and algorithms able to detect CNVs from NGS data [15–27]. Most of these studies use the read-depth approach, relying on the hypothesis that the number of reads aligned to a genomic region is proportional to the copy number of the region. In multiple-sample methods, CNVs are detected by comparing the read counts of the sample of interest to the read counts of a reference sample. The proper building of the reference is one of the main difficulties. To that end, two main solutions exist: (1) to gather a database of normal samples or (2) to add normal samples into the NGS run. Nevertheless, both solutions come with issues, mainly the presence of a batch effect and a high cost, respectively. To avoid these problems, the single-sample method was previously proposed, which consists of statistical modelling of the target read counts within the sample of interest to detect CNVs. Recent advances in artificial intelligence and, in particular, the availability of accessible machine-learning packages have made it possible for developers to improve their algorithms in many areas. To date, only a few studies have taken advantage of these recent developments in the field of CNV detection from targeted NGS data.
With careful consideration of these issues, we present ifCNV, a novel machine-learning-based software, provided as a Python and R package (https://github.com/SimCab-CHU/ifCNV and https://github.com/SimCab-CHU/ifCNV-R, respectively). This approach combines several advantages, among which are that it allows detection of CNVs without the need for a reference sample and that it consumes few computing resources.
To validate our model and explore its limitations in clinical practice, we tested ifCNV on different synthetic datasets mimicking relevant clinical situations and on datasets obtained from amplicon- or capture-based DNA library preparation technologies.
Results
ifCNV workflow
ifCNV is a CNV detection tool based on read-depth distribution obtained from NGS data (Figure 1A). It integrates a pre-processing step to create a read-depth matrix using as input the aligned binary alignment map (.bam) files and a corresponding .bed file. This reads matrix is composed of the samples as columns and the targets as rows. Next, it uses an Isolation Forest (IF) machine-learning algorithm to detect the samples with a strong bias between the 99th percentile and the mean (for amplifications, Figure 1B, top plot) and the 1st percentile and the mean (for deletions, Figure 1B, bottom plot). These samples are assumed to be CNVpos. The samples with no bias, which are therefore not detected by the IF as outliers, are considered CNVneg samples. After filtering of the samples with a mean read depth per target less than X (X = 10 by default but can be set by the user to any value), the reads matrix is normalized by dividing each column (i.e., the reads distribution of each sample) by its median. Then, ifCNV creates a mean normalized normal sample by averaging all CNVneg samples, to create the intra-run reference. The log ratio between each CNVpos sample and this reference is computed, and a second IF is used to detect the outlying targets (Figure 1C). The log ratio balances the differences between ratios under 1 (deletions) and ratios over 1 (amplifications), increasing the ability of ifCNV to detect outlying targets with a ratio under 1 (data not shown).
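The normalization, intra-run reference, and log-ratio steps described above can be illustrated with a minimal sketch. It is not the ifCNV implementation: the function names, the toy reads matrix, and the choice of a base-2 logarithm are assumptions made for illustration, and the CNVpos/CNVneg labels are supplied by hand here rather than by the first IF.

```python
import math
from statistics import median

def normalize_columns(reads):
    """Divide each column (sample) of a targets x samples matrix by its median."""
    n_samples = len(reads[0])
    medians = [median(row[j] for row in reads) for j in range(n_samples)]
    return [[row[j] / medians[j] for j in range(n_samples)] for row in reads]

def log_ratios(reads, pos_index, neg_indices):
    """Log ratio of one CNVpos sample against the mean of the CNVneg samples."""
    norm = normalize_columns(reads)
    # Intra-run reference: per-target mean of the normalized CNVneg samples.
    reference = [sum(row[j] for j in neg_indices) / len(neg_indices) for row in norm]
    # Base 2 is an assumption; a target with zero depth would need special handling.
    return [math.log2(row[pos_index] / ref) for row, ref in zip(norm, reference)]

# Hypothetical run: 5 targets (rows) x 3 samples (columns).
# Sample 2 carries a 4-fold gain on target 3 (400 reads instead of 100).
reads = [
    [100, 200, 50],
    [200, 400, 100],
    [100, 200, 50],
    [200, 400, 400],
    [300, 600, 150],
]
ratios = log_ratios(reads, pos_index=2, neg_indices=[0, 1])
print(ratios)  # [0.0, 0.0, 0.0, 2.0, 0.0]
```

In this toy run, only the altered target departs from a log ratio of 0, which is the kind of outlying target the second IF is meant to isolate.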
Figure 1
ifCNV workflow
(A) ifCNV is composed of three steps: the pre-processing (blue), the core algorithm (green), and the output (yellow). CNV, copy-number variation; CNVneg, CNV-negative samples; CNVpos, CNV-positive samples. (B) Top: 99th percentiles of the reads distribution according to the means of the reads distribution of the samples in a NGS sequencing run; bottom: 1st percentiles of the reads distribution according to the means of the reads distribution of the samples in a NGS sequencing run. The red dots correspond to the outlying samples. (C) In one CNVpos sample, the logarithm of the reads per target according to the logarithm of the mean normalized normal sample. The red dots correspond to the outlying targets.
These assumed altered targets are then used to compute the localization score per region of interest (see materials and methods). Finally, a threshold is applied on this score to select the significantly altered regions that are compiled in an html report containing a table and a graph for easy user interpretation.
Performance of ifCNV
Detection of CNVpos samples
To quantify the ability of ifCNV to detect CNVpos samples, we created a synthetic dataset of 1,500 targets and 30 samples in which we inserted one CNVpos sample. It is noteworthy that if the copy-number ratio (CNR) or the modified target ratio (MTR; i.e., the number of altered targets, located on CNV regions, divided by the total number of targets in the panel) are low, the CNVpos samples will resemble the CNVneg samples and therefore will be difficult to detect. Thus, the performance of a CNV detector directly depends on the CNR and the MTR. Taking this fact into consideration, we iterated the CNR and MTR (from 0 to 10 and 0 to 0.1, respectively) and performed 1,000 simulations for each iteration (Figures 2A and 2B). The analysis of the attribution of CNVpos samples for a CNR greater than 1 is shown in Figure 2A. For CNRs greater than 6, ifCNV correctly identified the abnormal sample in 99.58% of simulations when the MTR is greater than 0.01 (Figure 2A). Furthermore, if the CNR was between 4 and 6 and the MTR was over 0.01, ifCNV found the abnormal sample in 99.47% of simulations. Finally, if the CNR was over 2 and the MTR over 0.01, ifCNV detected the abnormal samples in 99.34% of simulations; this reached 99.83% when the MTR was greater than 0.035.
Figure 2
Performance assessment in detecting CNVpos and CNVneg samples
(A) Heatmap of the detection rate of CNVpos samples as a function of the CNR and the MTR, for CNRs between 2 and 10. (B) Heatmap of the detection rate of CNVpos samples as a function of the CNR and the MTR, for CNRs between 0 and 0.9. (C) Heatmap of the detection rate of CNVneg samples as a function of the CNR and the MTR, for CNRs between 2 and 9. (D) Heatmap of the detection rate of CNVneg samples as a function of the CNR and the MTR, for CNRs between 0 and 0.9. (E) Classification indicators of the detection of a single CNVpos sample in a set of several CNVneg samples. (F) Classification indicators of the detection of multiple CNVpos samples in a set of several CNVneg samples. CNR, copy-number ratio; MTR, modified target ratio.
ifCNV was also able to detect deletions (CNRs under 1); as for CNRs greater than 1, the sensitivity was related to both the CNR and the MTR (Figure 2B). For CNRs under 0.5, ifCNV detected the abnormal sample in 92.26% of simulations. For CNRs over 0.5, ifCNV only detected the abnormal sample in 27% of simulations. Although 27% is a higher detection rate than a random choice, for which the probability is 3.33% (1 sample out of 30), it is not satisfactory for diagnostic purposes given the importance of heterozygous deletion (hDel) in several diseases. To overcome this pitfall, we assessed an alternative solution based on the ability of the IF to accurately detect the CNVneg samples. Indeed, samples labeled as CNVneg were considered to be normal and were adopted as an internal reference.
Identification of the CNVneg samples
To quantify the ability of ifCNV to detect CNVneg samples, we used the same synthetic dataset, and we also iterated the CNR and the MTR (from 0 to 10 and 0 to 0.1, respectively) and performed 1,000 simulations for each iteration (Figures 2C and 2D). ifCNV was able to correctly identify the CNVneg samples in 99.87% of simulations, regardless of the CNR and MTR. For CNRs over 2, this reached 99.9%. Interestingly, for CNRs under 1, ifCNV identified the CNVneg samples in 98.57% of the simulations. Finally, a value of 99.69% was obtained for CNRs under 0.5.
Detection of one CNVpos sample in a set of several CNVneg samples
Conventional hospital and research laboratories must determine the CNV status of numerous samples. The number of samples in a sequencing run can vary from a few to several dozen. We therefore assessed the ability of ifCNV to correctly find a unique CNVpos sample in a set of several negative samples. We created synthetic datasets of 2 to 100 samples in which we inserted a CNVpos sample harboring an amplification with a CNR of 5 and an MTR of 0.03. We performed 100 simulations for each and calculated the sensitivities (Se), specificities (Sp), and accuracies (Acc) of the algorithm (Figure 2E). For one CNVpos sample in a set of two, ifCNV failed to label any sample as positive, leading to one true negative (TN) and one false negative (FN) (Se = 0, Sp = 1, and Acc = 0.5). Interestingly, for one CNVpos sample in three, ifCNV correctly labeled the positive sample in every simulation and found one false positive (FP) in less than 50% of simulations. When increasing the number of samples in the set, the Se remained close to 1 (0.992 from 3 to 100 samples), and the Sp and the Acc tended to 0.9.
Detection of multiple CNVpos samples in a set of several CNVneg samples
As several CNVpos can be present in the same sequencing run, we tested the performance of ifCNV in such situations. We randomly chose 2 to 25 samples in a synthetic dataset of 50 samples. We then added random CNRs (from 2 to 6) to 3% of the targets of these samples and performed 100 iterations to determine the Se, Sp, and Acc of our algorithm (Figure 2F). ifCNV exhibited relatively high Se, Sp, and Acc (around 0.85, 0.7, and 0.75, respectively), regardless of the number of CNVpos samples in the dataset. Notably, when half of the tested set (25/50) was CNVpos, ifCNV correctly labeled a mean of 20 samples across all simulations.
Detection of altered targets
The second step of ifCNV consists of labeling the targets that are modified among the CNVpos samples. To assess its performance, we created a synthetic dataset of 30 samples and 300 targets. One sample was CNVpos with one modified target randomly chosen at each iteration, with a CNR from 0 to 5. We then performed 1,000 iterations and calculated the Se, Acc, positive predictive value (PPV), and Matthews correlation coefficient (MCC) (Figure 3). On the one hand, ifCNV exhibited a Se very close to 1 for CNRs from both 0 to 0.3 (deletions) and 3 to 5 (amplification), meaning that the modified target was accurately labeled in almost every simulation. On the other hand, the PPV and the MCC were low (∼0.02 and ∼0.1, respectively), reflecting a high number of FPs. However, the Acc was ∼0.87 and stayed approximately unchanged, meaning that the number of TNs was high and dominated the number of FPs.
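To make the relationship between these indicators concrete, the sketch below computes them from confusion-matrix counts using the standard definitions. The counts are hypothetical: they were chosen only to be compatible with the orders of magnitude reported above (300 targets, one truly altered target, Acc ∼0.87) and are not taken from the simulations.

```python
import math

def classification_indicators(tp, tn, fp, fn):
    """Standard binary classification indicators from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero in degenerate cases

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "se": ratio(tp, tp + fn),                  # sensitivity
        "sp": ratio(tn, tn + fp),                  # specificity
        "ppv": ratio(tp, tp + fp),                 # positive predictive value
        "acc": ratio(tp + tn, tp + tn + fp + fn),  # accuracy
        "mcc": ratio(tp * tn - fp * fn, denom),    # Matthews correlation coefficient
    }

# Hypothetical counts: the single altered target is found (TP = 1, FN = 0),
# but 39 of the 299 unaltered targets are also flagged (FP = 39, TN = 260).
example = classification_indicators(tp=1, tn=260, fp=39, fn=0)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these counts, Se is 1 and Acc is 0.87, while PPV is 0.025 and MCC is about 0.15: a few dozen FPs are enough to collapse the PPV and MCC when there is a single true positive, even though the TNs keep the Acc high.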
Figure 3
Classification indicators for the detection of altered targets
MCC, Matthews correlation coefficient; PPV, positive predictive value.
Thresholding on the localization score
To test the ability of the score to discriminate the FPs from the TPs, we used the same synthetic dataset as before and grouped targets (from 2 to 15) together to mimic regions. We then modified all the targets corresponding to a randomly chosen region. Finally, we computed the localization score and calculated the Se, Acc, PPV, and MCC before (Figure 4A, left panel) and after thresholding (Figure 4A, right panel).
Figure 4
Performance assessment of the localization score thresholding
(A) Classification indicators for the detection of altered targets before (left panel) and after (right panel) thresholding. (B) ROC curve. (C) Associated table. FPR, false positive rate; Se, sensitivity; ROC, receiver operating characteristic.
As in Figure 3, before thresholding, the Se was close to 1, and the PPV and MCC were low (around 0.1 and 0.2, respectively). Acc was lower using the grouped targets (around 0.4) because, by construction, the total number of regions in the dataset is lower than the number of targets, and therefore the number of TNs is smaller. After thresholding, we observed that the PPV, MCC, and Acc increased to reach values very close to 1 for CNRs over 2, while the Se had only slightly decreased, meaning that the score thresholding enables the TPs to be kept while the FPs are discarded. The localization score thresholding approach can therefore be validated and represents an important improvement in the performance of our algorithm.
To characterize the dependence on the score, we iterated the threshold from 0 to 15, calculated the corresponding TP rate (TPR) and FP rate (FPR), and plotted a receiver operating characteristic (ROC) curve (Figure 4B). The corresponding table in Figure 4C shows that, on the simulated data, a threshold of 4 on the score yielded a TPR of 100% with an FPR below 5%, and a threshold of 8 yielded a TPR of 100% with an FPR of ∼1%.
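The threshold sweep behind such a ROC curve can be sketched as follows. The scores and labels are invented for illustration and the function name is hypothetical; only the procedure (iterate the threshold from 0 to 15 and keep the regions whose score is over the threshold) follows the text.

```python
def roc_points(scores, is_true, thresholds):
    """Return (threshold, FPR, TPR) for each threshold; a region is called if score > threshold."""
    positives = sum(is_true)
    negatives = len(is_true) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, is_true) if y and s > t)
        fp = sum(1 for s, y in zip(scores, is_true) if not y and s > t)
        points.append((t, fp / negatives, tp / positives))
    return points

# Invented localization scores: three truly altered regions and five unaltered ones.
scores = [12.0, 9.5, 8.2, 6.1, 3.0, 1.5, 0.7, 0.2]
is_true = [True, True, True, False, False, False, False, False]

points = roc_points(scores, is_true, range(0, 16))
for threshold, fpr, tpr in points:
    print(threshold, fpr, tpr)
```

Raising the threshold can only remove calls, so both rates decrease monotonically; the useful operating range is where the FPR has dropped while the TPR is still at its maximum.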
Evaluation of ifCNV on patient datasets
We then tested ifCNV on three real datasets. First, we used the ICR96 dataset and then used two in-house datasets for which we had NGS and array comparative genomic hybridization (aCGH) results to compare to. Tables 1 and 2 present the results.
Table 1
Description of the datasets used in the study

| Datasets | Samples | Positives | Negatives | Number of alterations | Amplifications | Deletions |
|---|---|---|---|---|---|---|
| ICR96 | 96 (germline) | 68 | 28 | 296 | 80 | 216 |
| TSCA | 81 (somatic) | 25 | 56 | 39 | 21 | 17 |
| Juno | 43 (somatic) | 26 | 17 | 28 | 9 | 19 |
Table 2
Classification indicators for ifCNV and the tools described in Moreno-Cabrera et al. on the ICR96 dataset
Acc, accuracy; FN, false negative; FP, false positive; Sp, specificity; TN, true negative; TP, true positive.
ICR96 dataset
To compare the performance of ifCNV with that of other methods [23, 24, 25], we compared our results with those obtained by Moreno-Cabrera et al., who benchmarked five widely used tools with the ICR96 dataset. This dataset, however, possesses only one target per exon, which makes the ifCNV score thresholding strategy impracticable in this case. As an alternative strategy, we decided to take advantage of the ability to change the contamination parameter of the IF. This main actionable parameter is the proportion of outliers in the dataset. By default, it can be set to a value between 0 and 0.5 or to “auto,” for an automatic detection of the proportion of outliers. In ifCNV, we added the ability to set this parameter to “none”; it is then calculated as 1 divided by the number of samples in the dataset. We thus iterated several values of the contamination parameter and calculated the corresponding binary classification indicators (Figure 5). We also compiled the results obtained with the two pre-set contamination values (auto and none) with the results from the other tools (Table 1). We observed that ifCNV exhibits performances in the same order of magnitude as the other tools, with the clear advantage of having an easily tunable parameter allowing the user to expect either no FNs (contamination = auto) or no FPs (contamination = none).
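The three ways of setting the contamination parameter described above can be summarized in a short sketch. The helper name `resolve_contamination` is hypothetical and does not come from the ifCNV code base; the accepted range mirrors the one stated in the text, and the returned value is what would be handed to an isolation forest implementation such as scikit-learn's `IsolationForest(contamination=...)`.

```python
def resolve_contamination(setting, n_samples):
    """Translate a user setting into a contamination value for an isolation forest.

    "auto" -> passed through unchanged (automatic detection of the outlier proportion)
    "none" -> 1 / n_samples, i.e. one expected outlier in the run
    number -> used as is, provided it lies in (0, 0.5]
    """
    if setting == "auto":
        return "auto"
    if setting == "none":
        return 1 / n_samples
    value = float(setting)
    if not 0 < value <= 0.5:
        raise ValueError("contamination must be in (0, 0.5], 'auto' or 'none'")
    return value

# With the 96 samples of the ICR96 dataset, "none" resolves to about 0.0104.
print(resolve_contamination("none", 96))
print(resolve_contamination("auto", 96))
print(resolve_contamination(0.14, 96))
```

A run of fewer than two samples would make 1 / n_samples exceed 0.5; this sketch does not handle that case.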
Figure 5
ROC curve for the ICR96 dataset
Some points of interest are highlighted: contamination = “auto” (pink), 0.01 < contamination < 0.14 (blue), and contamination = “none” (orange). TPR, true positive rate.
TSCA dataset
Next, we aimed to validate the performance of ifCNV on an in-house dataset (Table 2). Its particularities are (1) it is composed of tumor samples containing variable percentages of altered cells, and (2) it possesses a small number of targets per region (range: 1–40, median: 4). Using this dataset, we found that ifCNV correctly labeled 19 of the 21 amplifications present in the dataset; the 2 FNs were measured as a gain of two copies (CNR = 2) by aCGH. In addition, 14 of 17 deletions were detected with no FPs. The 3 undetected deletions were from samples that have a lower percentage of tumor cells.
Juno dataset
Finally, we also assessed our tool on tumor samples prepared with a distinct library preparation approach and a larger panel than the TSCA dataset, which provides a larger number of targets per region (range: 1–164, median: 17) (Table 2). We detected all amplifications with no FPs and 17 of the 19 deletions, leading to an overall Acc of 0.96 and MCC of 0.93. Interestingly, the 2 missed deletions are on a gene that represents only 0.4% of the panel (MTR = 0.004).
Discussion
In recent years, numerous computational methods for detecting and measuring CNVs from NGS data have been developed. However, most of these are based on the use of internal or external reference samples. To date, only a few have taken advantage of easy-to-use machine-learning packages.
Several artificial intelligence (AI)-based outlier detection methods exist. The main ones are the minimum covariance determinant (MCD), the local outlier factor (LOF), the IF, and the elliptic envelope algorithm (EEA). Each method has its benefits and drawbacks. Briefly, MCD and EEA were created to treat input variables with Gaussian distribution, LOF was designed for data with low dimensionality, and IF is a tree-based algorithm that is effective on high-dimensional data and makes no underlying assumption on the distribution of the data. The read-depth data obtained from targeted NGS do not follow a Gaussian distribution and can be of high dimensionality depending on the number of features of the panel and the sequencing run. Thus, the IF algorithm appears to be the most suitable to process these data. Therefore, we describe here ifCNV, a bioinformatics tool using the IF algorithm, that allows detection of CNVs without the need for a reference sample.
Moreover, in routine clinical practice, the variety of pathologies involving specific molecular alterations leads to a broad diversity in the datasets generated. Thus, in general, most CNV software that is developed for a specific data type has suboptimal reliability for use in routine practice with diversified samples. In addition, most of the genetic workflows are developed in either Python or R, and, to our knowledge, no existing CNV detection tool is available in both languages [3]. ifCNV is available in both languages, making it easier to implement in pre-existing pipelines. Also, by successfully creating its own normal reference inside each analyzed NGS run, ifCNV frees itself from any batch effect inherent to tools using external references. It also avoids the need for reference samples that are copy-number neutral to be sequenced in the same batch, which is an efficient but not cost-effective solution. ifCNV also has a simple framework: it is made up of only three main steps over which the user has full control through tunable parameters. Furthermore, its efficiency makes it possible to run on hardware with limited computing resources.
Using simulated data, we demonstrate that ifCNV is highly reliable and adapts to several relevant clinical situations including (1) when one CNVpos sample is present in a set of several CNVneg samples, (2) when multiple CNVpos samples are present in a set of several CNVneg samples, (3) when only one target is altered, and (4) when the CNRs are close to one, which can correspond to small alterations or to mixtures of normal and altered cells. ifCNV also performed well using datasets generated from amplicon- or capture-based libraries prepared from germline or somatic clinical samples.
Analyses of real data also demonstrated that ifCNV’s performance was comparable to that of other widely used tools but with substantial specific advantages. Our solution has a tunable control of the FPR thanks to localization score thresholding and to the contamination parameter, which can both be optimized according to the dataset by an entry-level user. ifCNV was also able to accurately detect CNVs in difficult samples, such as those composed of a mixture of normal cells and tumoral cells, which dilutes the CNR of samples.
Although we demonstrated that ifCNV can process various targeted datasets, as is, it cannot be applied to whole-genome sequencing (WGS) and third-generation sequencing (TGS) datasets. The concept of the method should be pertinent to treat these data types but would need further development. Indeed, pre-processing and both IF parameters will need to be adapted and benchmarked. This would be the subject of a new study.
In conclusion, ifCNV is a highly flexible tool that can detect CNVs in germline and somatic clinical samples with similar performances. ifCNV now represents an essential component of the cancer diagnosis pipeline that we routinely use to analyze samples from patients in our laboratory. We believe that the flexibility, high accuracy, easy implementation, and low hardware infrastructure afforded by our method will help other laboratories in increasing their throughput and improving disease characterization by accurate CNV detection.
Materials and methods
IF algorithm
The IF algorithm was developed by Liu et al. It “isolates” observations using a binary tree structure called an isolation tree. In this isolation tree, anomalies are more likely to be isolated closer to the root, whereas normal points are more likely to be isolated at the deeper end. The IF algorithm builds its isolation trees for a given dataset by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length is then averaged over a forest of such isolation trees to produce a decision value. The smaller the value, the more likely it is that the sample represents an anomaly.
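The isolation principle can be illustrated with a toy, standard-library sketch. It is a deliberate simplification and not the algorithm as implemented in machine-learning packages: each point's path is sampled independently instead of being read from shared trees, there is no subsampling, and the path length is reported raw rather than converted into a normalized anomaly score. All names and data are hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))      # randomly select a feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                             # identical values: no split possible
        return depth
    split = rng.uniform(lo, hi)              # random split between min and max
    # Keep only the observations falling on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def mean_path_lengths(data, n_trees=200, seed=0):
    """Average path length of every observation over `n_trees` random partitions."""
    rng = random.Random(seed)
    return [sum(path_length(p, data, rng) for _ in range(n_trees)) / n_trees
            for p in data]

# Thirty clustered observations plus one obvious anomaly (index 30).
rng = random.Random(1)
samples = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(30)]
samples.append((10.0, 10.0))

lengths = mean_path_lengths(samples)
most_anomalous = min(range(len(samples)), key=lengths.__getitem__)
print(most_anomalous, round(lengths[most_anomalous], 2))
```

The anomaly is usually separated from the cluster by the very first split, so its mean path length stays close to 1, whereas clustered observations require several more splits.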
Synthetic datasets
To create synthetic datasets that faithfully reproduce those obtained in routine clinical practice, we selected 1,910 samples from in-house targeted NGS data with no CNVs. We extracted the total reads on each target from the aligned .bam files with the bedtools (https://bedtools.readthedocs.io) multicov function and created a reference reads matrix ordered with samples as columns and targets as rows. This reference reads matrix was then normalized by dividing each column by its median. All medians were used to create a median reads distribution that was needed for the rescaling process. Next, we created a normalized target reads distribution from each row of the normalized matrix. Thus, to generate synthetic datasets, we filled each row by taking the corresponding normalized target reads distribution, from which we randomly picked a value for each column. To rescale this matrix, we multiplied each column by a value randomly picked from the median distribution. Finally, to create CNVpos samples within this synthetic dataset and test the algorithm, we modified the desired number of targets by multiplying their values by a factor ranging from 0 to 10.
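The generation procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the study: the function names are hypothetical, the reference matrix is a small random stand-in for the 1,910 CNV-free samples, and sampling is done with replacement.

```python
import random
from statistics import median

def make_synthetic(reference, n_samples, rng):
    """Draw a synthetic reads matrix (targets as rows, samples as columns)."""
    n_ref = len(reference[0])
    # Median read count of each reference sample (column).
    medians = [median(row[j] for row in reference) for j in range(n_ref)]
    # Normalized reads distribution of each target (row).
    normalized = [[row[j] / medians[j] for j in range(n_ref)] for row in reference]
    # One rescaling factor per synthetic sample, picked from the median distribution.
    scales = [rng.choice(medians) for _ in range(n_samples)]
    return [[rng.choice(dist) * scales[j] for j in range(n_samples)]
            for dist in normalized]

def insert_cnv(matrix, sample, targets, cnr):
    """Return a copy of `matrix` in which `targets` of `sample` are multiplied by `cnr`."""
    altered = [row[:] for row in matrix]
    for t in targets:
        altered[t][sample] *= cnr
    return altered

rng = random.Random(0)
# Hypothetical CNV-free reference: 6 targets x 4 samples.
reference = [[rng.randint(80, 120) * (t + 1) for _ in range(4)] for t in range(6)]

synthetic = make_synthetic(reference, n_samples=5, rng=rng)
with_cnv = insert_cnv(synthetic, sample=2, targets=[1, 3], cnr=4)
print(len(with_cnv), len(with_cnv[0]))  # 6 5
```

Here 2 of the 6 targets are altered in one sample, so the MTR of this toy CNVpos sample is 2/6; in the simulations, the CNR and the number of altered targets are the quantities that were iterated.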
ICR96 dataset
The dataset ICR96 exon CNV validation series was downloaded from the European Genome-phenome Archive (EGA) (EGA: EGAD00001003335). This dataset consists of the read counts of 96 germline samples sequenced on a targeted panel for which the copy number, at the exon level, was validated using high-resolution multiplex ligation-dependent probe amplification (MLPA) experiments (Table 3).
Table 3
Classification indicators on the TSCA and Juno datasets

Dataset  CNV type       TP  TN   FP  FN  FNR   FPR   Se    Sp    PPV   Acc   MCC
TSCA     amplification  19  59   1   2   0.10  0.02  0.90  0.98  0.95  0.96  0.90
TSCA     deletion       14  64   0   3   0.21  0     0.82  1     1     0.96  0.89
TSCA     total          33  123  1   5   0.15  0.01  0.87  0.99  0.97  0.96  0.90
Juno     amplification  9   19   0   0   0     0     1     1     1     1     1
Juno     deletion       17  9    0   2   0.12  0     0.89  1     1     0.93  0.86
Juno     total          26  28   0   2   0.08  0     0.93  1     1     0.96  0.93
In-house datasets
DNA extracted from clinical somatic samples was analyzed in parallel by two molecular approaches: aCGH as a reference method, and NGS using two different library preparation protocols.
DNA extraction of formalin-fixed paraffin-embedded samples
DNA was extracted from tissue samples using the Maxwell RSC DNA FFPE Kit (Promega, Madison, WI, USA) according to the manufacturer’s recommendations. DNA was quantified using the Qubit dsDNA BR Assay Kit and a Qubit Fluorometer (Thermo Fisher Scientific, Wilmington, DE, USA).
Comparative genomic hybridization
aCGH profiling was performed with the Human Agilent Sureprint G3 8 × 60 K Microarray Kit (Agilent Technologies, Santa Clara, CA, USA). Tumor DNA was labeled with cyanine 5 (Cy5), while reference DNA from an individual of the same sex as the patient was labeled with Cy3. Sample and reference DNAs were pooled and hybridized for 24 h at 67°C on the arrays. The fluorescence was read by an Agilent SureScan Microarray scanner, and the Cy5/Cy3 fluorescence ratios were converted into log2-transformed values with Cytogenomics software (Agilent).
The threshold of the absolute value of the log2 fluorescence ratio retained to define a chromosomal anomaly was 0.25. A mean log2 ratio was calculated when, for at least three probes located on contiguous positions on the chromosome, a log2 ratio absolute value greater than 0.25 and of the same sign was measured. The minimum size of the anomalies considered in the interpretation of the results was set at 1 Mb.
The different chromosomal anomalies were defined by the Cytogenomics software according to the mean log2 ratio values, as follows: homozygous deletion for a value <−1, loss of one gene copy for a value between −0.25 and −1, gain of one gene copy for a value between 0.25 and 1, and amplification (gain of at least five copies) for a value >2.
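As an illustration only, the calling thresholds listed above can be written as a small function. The function name and the returned labels are hypothetical, the handling of exact boundary values is an assumption, and mean log2 ratios between 1 and 2 are returned as "unspecified" because the text does not assign them to a category.

```python
def classify_log2_ratio(mean_log2_ratio):
    """Map a mean log2 ratio to the anomaly classes listed in the text."""
    x = mean_log2_ratio
    if x < -1:
        return "homozygous deletion"
    if -1 <= x < -0.25:
        return "loss of one copy"
    if abs(x) <= 0.25:
        # Below the 0.25 absolute threshold: no chromosomal anomaly is called.
        return "no anomaly"
    if 0.25 < x <= 1:
        return "gain of one copy"
    if x > 2:
        return "amplification"
    # Values in (1, 2] are not covered by the thresholds given in the text.
    return "unspecified"
```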
Library preparation was performed as previously described. Briefly, extracted DNA was qualified using KAPA Sybr Fast qPCR (Kapa Biosystems, Boston, MA, USA). A custom in-house panel targeting specific exons of 35 clinically relevant cancer genes was used for amplification of regions of interest. For each sample, dual-strand libraries were prepared using a TruSeq Custom Amplicon protocol, as described by the manufacturer (Illumina, San Diego, CA, USA). After amplification, PCR products were purified using AMPure XP beads (Beckman Coulter, Brea, CA, USA), then quantified, normalized, and paired-end sequenced on a MiSeq instrument (2 × 150 cycles, Illumina). This dataset is composed of 81 samples from 59 different sequencing runs, with 25 CNVpos samples (Table 1).
Libraries were prepared using the Advanta Solid Tumor NGS Library Prep Assay with the automated Juno system on integrated fluidic circuits (LP 8.8.6 IFC) (Fluidigm, San Francisco, CA, USA) following the manufacturer’s procedure. The panel was designed to detect somatic mutations in 53 oncology-relevant genes (234 kb, 1,508 assays). Briefly, the LP 8.8.6 IFCs were primed with 20 ng DNA per sample in the PCR mix. After amplification, pooled harvested samples were purified using AMPure XP beads (Beckman Coulter), and a second PCR was performed to integrate the sequencing adapters. Libraries were then quantified, normalized, and paired-end sequenced on a NextSeq instrument (2 × 150 cycles, Illumina). This dataset is composed of 43 samples from 20 different sequencing runs, with 26 CNVpos samples (Table 1).
Binary classification indicators
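TPR (or sensitivity, Se), FPR, TN rate (TNR; or specificity, Sp), FN rate (FNR), PPV, Acc, and the MCC were used to measure the performance of ifCNV. These were computed from the numbers of TPs, TNs, FPs, and FNs using their standard definitions:

\[
\mathrm{TPR} = \frac{TP}{TP + FN}, \quad
\mathrm{FPR} = \frac{FP}{FP + TN}, \quad
\mathrm{TNR} = \frac{TN}{TN + FP}, \quad
\mathrm{FNR} = \frac{FN}{FN + TP},
\]

\[
\mathrm{PPV} = \frac{TP}{TP + FP}, \quad
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},
\]

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\]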
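For readers who want to recompute these indicators from a confusion matrix, a minimal sketch using the standard definitions is given below. The function name and the dictionary keys are illustrative choices, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the text.

```python
from math import sqrt

def classification_indicators(tp, tn, fp, fn):
    """Compute standard binary classification indicators from counts."""
    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Se": ratio(tp, tp + fn),    # TPR, sensitivity
        "Sp": ratio(tn, tn + fp),    # TNR, specificity
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "PPV": ratio(tp, tp + fp),
        "Acc": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Counts from the TSCA amplification row of Table 3.
tsca_amplification = classification_indicators(tp=19, tn=59, fp=1, fn=2)
```

Rounded to two decimals, the values for this row agree with those reported in Table 3.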
Localization scoring
Specific regions of biological significance (gene or exon) can be covered by several targets. In the event that a region is altered, all the targets in the region should be modified. By contrast, if only one target in the region is modified, it is likely to be an FP. We integrated this reasoning to develop a localization score in order to reduce the number of FPs. The localization score depends on the number of modified targets in the region, the number of targets in the region, and the total number of targets in the panel. A semi-open log scale incorporating the ratio of modified targets in the region was chosen. It is calculated as follows:
Pre-processing
ifCNV requires aligned .bam files as input but does not provide a function to create them; the user must therefore generate them from the raw .fastq files. The pre-processing step of ifCNV uses the bedtools multicov function, which takes the aligned .bam files as input and outputs the read-depth matrix used for CNV detection. In this study, the .bam files were created with the Burrows-Wheeler Aligner (BWA): the .fastq files were aligned against the reference genome with bwa mem.
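The read-depth extraction step can be sketched as below, assuming bedtools is installed and the .bam files are sorted and indexed. The helper names and file paths are hypothetical and are not part of the ifCNV API; the parser relies on bedtools multicov appending one count column per .bam file to each line of the .bed file.

```python
import subprocess

def multicov_command(bam_paths, bed_path):
    """Build the bedtools multicov call for a list of .bam files."""
    return ["bedtools", "multicov", "-bams", *bam_paths, "-bed", bed_path]

def parse_multicov(output, n_bams):
    """Parse multicov output into a targets x samples matrix of counts.

    The last `n_bams` tab-separated fields of each line hold the counts.
    """
    matrix = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        matrix.append([int(value) for value in fields[-n_bams:]])
    return matrix

def read_depth_matrix(bam_paths, bed_path):
    """Run bedtools multicov and return the parsed reads matrix."""
    result = subprocess.run(multicov_command(bam_paths, bed_path),
                            capture_output=True, text=True, check=True)
    return parse_multicov(result.stdout, len(bam_paths))
```

A call such as `read_depth_matrix(["sample1.bam", "sample2.bam"], "targets.bed")` would return one row per target and one column per sample, the layout used for the reads matrix in this study.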
Data availability statement
All the datasets used in this study are available at https://github.com/SimCab-CHU/ifCNV.