Literature DB >> 29649218

Global characterization of copy number variants in epilepsy patients from whole genome sequencing.

Jean Monlong^1,2, Simon L Girard^1,3,4, Caroline Meloche⁴, Maxime Cadieux-Dion^4,5, Danielle M Andrade⁶, Ron G Lafreniere⁴, Micheline Gravel⁴, Dan Spiegelman⁷, Alexandre Dionne-Laporte⁷, Cyrus Boelman⁸, Fadi F Hamdan⁹, Jacques L Michaud⁹, Guy Rouleau⁷, Berge A Minassian¹⁰, Guillaume Bourque^1,2,11, Patrick Cossette⁴.

Abstract

Epilepsy will affect nearly 3% of people at some point during their lifetime. Previous copy number variants (CNVs) studies of epilepsy have used array-based technology and were restricted to the detection of large or exonic events. In contrast, whole-genome sequencing (WGS) has the potential to more comprehensively profile CNVs but existing analytic methods suffer from limited accuracy. We show that this is in part due to the non-uniformity of read coverage, even after intra-sample normalization. To improve on this, we developed PopSV, an algorithm that uses multiple samples to control for technical variation and enables the robust detection of CNVs. Using WGS and PopSV, we performed a comprehensive characterization of CNVs in 198 individuals affected with epilepsy and 301 controls. For both large and small variants, we found an enrichment of rare exonic events in epilepsy patients, especially in genes with predicted loss-of-function intolerance. Notably, this genome-wide survey also revealed an enrichment of rare non-coding CNVs near previously known epilepsy genes. This enrichment was strongest for non-coding CNVs located within 100 Kbp of an epilepsy gene and in regions associated with changes in the gene expression, such as expression QTLs or DNase I hypersensitive sites. Finally, we report on 21 potentially damaging events that could be associated with known or new candidate epilepsy genes. Our results suggest that comprehensive sequence-based profiling of CNVs could help explain a larger fraction of epilepsy cases.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2018 PMID： 29649218 PMCID： PMC5978987 DOI： 10.1371/journal.pgen.1007285

Source DB: PubMed Journal: PLoS Genet ISSN： 1553-7390 Impact factor: 5.917

Introduction

Structural variants (SVs) are defined as genetic mutations affecting more than 50 base pairs and encompass several types of rearrangements: deletion, duplication, novel insertion, inversion and translocation. Deletions and duplications, which affect DNA copy number, are collectively known as copy number variants (CNVs). SVs arise from a broad range of mechanisms and show a heterogeneous distribution of location and size across the genome [1-3]. Numerous diseases are caused by SVs with a demonstrated detrimental effect [4, 5]. While cytogenetic approaches and array-based technologies have been used to identify large SVs, whole-genome sequencing (WGS) has the potential to uncover the full range of SVs both in terms of type and size [6, 7]. SV detection methods that use read-pair and split read information [8] can detect deletions and duplications but most CNV-focused approaches look for an increased or decreased read coverage, the expected consequence of a duplication or a deletion. Coverage-based methods exist to analyze single samples [9], pairs of samples [10] or multiple samples [11-13] but the presence of technical bias in WGS remains an important challenge. Indeed, various features of sequencing experiments, such as mappability [14, 15], GC content [16], replication timing [17], DNA quality and library preparation [18], have a negative impact on the uniformity of the read coverage [19]. Epilepsy is a common neurological disorder characterized by recurrent and unprovoked seizures. It is estimated that up to 3% of the population will suffer from a form of epilepsy at some point during their lifetime. Although the disease presents a strong genetic component that can be as high as 95%, typical “monogenic” epilepsy is rare, accounting for only a fraction of cases [20, 21]. Genetic factors have been associated with epilepsy in the past such as rare genetic variations found by linkage studies as well as common genetic variations found by genome-wide association studies [22, 23] For example, a meta-analysis combining multiple epilepsy cohorts found positive associations with the disease [24], the strongest in SCN1A, a gene already associated with the genetic mechanism of the disease via linkage studies and subsequent sequencing [25] or more recently as harboring de novo variants [26]. Thanks to array-based technologies, surveys of large CNVs (>50 Kbp) first associated CNVs in genomic hotspots such as 15q11.2 and 16p13.11 with generalized epilepsy [27, 28]. Other studies have further shown the importance of large and de novo CNVs as well as identified a few associations with specific genes [29-34]. Rare genic CNVs were typically found in around 10% of epilepsy patients [30, 34, 35] and CNVs larger than 1 Mbp were significantly enriched in patients compared to controls [33, 35–37]. Unfortunately, small CNVs and other types of SVs could not be efficiently or consistently detected using these technologies, hence much remains to be done. To more comprehensively characterize the role of CNVs in epilepsy, we performed whole-genome sequencing of epileptic patients from the Canadian Epilepsy Network (CENet), the largest WGS study on epilepsy to date. In the present study, we assessed the frequency of CNVs in epileptic individuals using 198 unrelated patients and 301 healthy individuals. Using this data, we showed that technical variation in WGS remains problematic for CNV detection despite state-of-the-art intra-sample normalization. To correct for this and to maximize the potential of the CENet cohorts, we developed a population-based CNV detection algorithm called PopSV. Our method uses information across samples to avoid systematic biases and to more precisely detect regions with abnormal coverage. Using two public WGS datasets [38, 39], and additional orthogonal validation, we showed that PopSV outperforms other analytical methods both in terms of specificity and sensitivity, especially for small CNVs. Using this tool, we built a comprehensive catalog of CNVs in the CENet epilepsy patients and studied the properties of these potentially damaging structural events across the genome.

Results

Technical bias in read coverage

We sequenced the genomes of 198 unrelated individuals affected with epilepsy and 301 unrelated healthy controls. Because CNV detection relies on read coverage we first investigated the presence of technical bias and the value of standard corrections and filters (e.g. GC correction, mappability filtering). The genome was fragmented in 5 Kb bins and we counted the number of uniquely mapped reads in each bin. In contrast to simulated datasets, we found that the inter-sample mean coverage in each bin varied between genomic regions even after stringent corrections and filters (Fig 1a). Supporting this observation, the bin coverage variance across samples was also lower than expected and varied between regions (S1 Fig). We also observed experiment-specific biases. In particular, some samples consistently had the highest, or the lowest, coverage across large portions of the genome (S1 Fig). These observations were not unique to our data and could also be observed in two public WGS datasets, and persisted even after correcting the GC bias and mappability using the more elaborate model from the QDNAseq pipeline [40] (S2 Fig). Our results across multiple samples suggest that existing GC bias and mappability corrections [40] cannot correct completely the technical variation in read coverage. This fluctuation of coverage has implications for CNV detection approaches that assume a uniform distribution [9, 10, 41] after standard bias correction and will lead to false positives.

Fig 1

PopSV approach.

a) Technical bias across the genome remains after stringent correction and filtering. The distribution of the bin inter-sample mean coverage in the epilepsy cohort (red) is compared to null distributions (blue: bins shuffled, green: simulated normal distribution). b) PopSV approach. First the genome is fragmented and reads mapping in each bin are counted for each sample and GC corrected (1). Next, coverage of the sample is normalized (2) and each bin is tested by computing a Z-score (3), estimating p-values (4) and identifying abnormal regions (5). c) Number and proportion of calls from a twin that was replicated in the other monozygotic twin.

PopSV approach.

CNV detection with PopSV

To better control for technical bias, we developed PopSV, a new SV detection method. PopSV uses read depth across the samples to normalize coverage and detect change in DNA copy number (Fig 1b). The normalization step here is critical since most approaches will fail to give acceptable normalized coverage scores (S1 Fig). Moreover, with global median/variance adjustment or quantile normalization, the remaining subtle experimental variation impairs the abnormal coverage test (S3 Fig). The targeted normalization used by PopSV was found to have better statistical properties (S3 Fig). In order to assess the performance of our tool, we compared it to several algorithms [8-11] using a dataset that included monozygotic twins and also performed experimental validation of different types of predicted CNVs in the epilepsy cohort (see below). We found that PopSV performed as well or better in different aspects. First, for several algorithms, a large proportion of the detected events in a typical sample were also identified in almost all samples (60% of the calls found in >95% of the samples, S4 Fig). PopSV’s calls were better distributed across the frequency spectrum, hence more informative as we expect the relative frequency of disease-related variants to be rare. In addition, the pedigree structure was more accurately recovered when the CNVs were used to cluster the individuals in the Twins dataset (S5 Fig). The agreement with the pedigree was computed by the Rand index after clustering the individuals with three hierarchical clustering approaches (see S1 Text). Looking at the replication between 10 pairs of monozygotic twins, PopSV detected more replicated CNVs compared to other methods, while maintaining similar replication rates (Fig 1c). The CNV calls were further filtered with gradually more stringent significance thresholds and PopSV remained superior in term of number of replicated calls (S6 Fig). When investigating the overlap of calls between different methods, we noticed that PopSV was better recovering calls from CNVnator [9], FREEC [10], cn.MOPS [11] or LUMPY [8], especially if found by two or more methods (S7 Fig). For example, around 92% of the CNVs called by other methods were also found by PopSV when focusing on calls found in at least two methods. Similar results were also obtained in a cancer dataset where we looked for replicated germline CNVs in the paired tumor (S8 Fig). Finally, we repeated the twin analysis using 500 bp bins and observed high consistency with the 5 Kbp calls (S9 Fig). These results suggest that PopSV can accurately detect around 75% of events that are as large as half the bin size used (see S1 Text).

CNVs in the CENet cohorts and experimental validation

Having demonstrated the quality of the PopSV calls, we applied our tool to the epilepsy and control cohorts. The epilepsy cohort comprises 198 individuals diagnosed with either generalized (n = 160), focal (n = 32) or unclassified (n = 6) epilepsy. CNVs ranged from 5 Kbp to 3.2 Mbp with an average size of 9.98 Kbp. We observed an average of 870 CNVs per individual accounting for 8.7 Mb of variant calls (Fig 2a). This is around 9 times more variants and considerably smaller than in typical array-based studies [42, 43], such as the previous epilepsy surveys [30, 31, 34, 35], although a similar size distribution was previously obtained using denser arrays [4] but were never applied to epilepsy (S10 Fig). Next, we annotated each variant using four public SV databases [13, 44–46] as well as an internal database of the germline calls from PopSV in the two public datasets used earlier (see S1 Text). For each CNV, we derived the maximum frequency across these databases and defined as rare any region consistently annotated in less than 1% of the individuals (Fig 2b). In total, we identified 12,480 regions with rare CNVs in the epilepsy cohort including: 8,022 (64.3%) with heterozygous deletions, 21 (0.2%) with homozygous deletions and 4,850 (38.9%) with duplications. Although the overall amount of rare CNVs was not higher in epilepsy patients, the proportion of deletion was significantly higher compared to controls (χ2 test: P-value 10−7). Next, we selected 151 CNVs and further validated them using a Taqman CNV assay and Real-Time PCR. To explore PopSV’s performance across different CNV profiles, we selected variants of different types, sizes and frequencies. We found that the calls were concordant in 90.7% of the cases (Table 1 and S2 Table). As expected, the estimated false positive rate was slightly higher for rare or smaller variants (12.1% for rare CNVs; 15.1% for CNV <20 Kbp). Furthermore, we noted that calls supported by both PopSV and LUMPY (when available) had a similar validation rate as calls found by PopSV only (86.2% and 87.5% respectively).

Fig 2

CNVs in the epilepsy and control cohorts.

a) Regions with a CNV in each epilepsy patient. b) Each CNV in the CNV catalog of the epilepsy and control cohorts was annotated with its maximum frequency in five CNV databases. c) Enrichment in exonic sequence for all CNVs (left) and rare CNVs (right), larger than 50 Kbp (top) or smaller than 50 Kbp (bottom). The fold-enrichment (y-axis) represents how many CNVs overlap coding sequences compared to control regions randomly distributed in the genome.

Table 1

Real-Time PCR validation rates of PopSV calls.

	Region	Validation rate
Total	151	0.907
CNV type Deletion Duplication	10249	0.9020.918
Frequency in databases 0 (0, 0.01] (0.01, 1]	2624101	0.9230.8330.921
Carrier in CENet cohorts 1 2 > 2	2119111	0.8570.9470.910
Size (Kbp) < 20 (20, 100] > 100	733840	0.8490.9740.950

Number and proportion of regions validated for CNVs of different types, sizes and frequencies.

CNVs in the epilepsy and control cohorts.

CNV enrichment in exonic regions

To assess the role of CNVs in the pathogenic mechanism of epilepsy, we evaluated the prevalence of exonic CNVs in our epileptic cohort compared with healthy controls. First, focusing on CNVs larger than 50 Kbp, we found no difference between epileptic patients and controls (Fig 2c). As expected, we observed fewer CNVs overlapping exonic sequence than expected by chance but similar levels for both groups. The number of CNVs overlapping exonic sequences of genes intolerant to loss-of-function mutations [47] was even lower. Interestingly, the coding regions of those genes were significantly more affected by CNVs in epileptic patients compared with controls (permutation P-value<0.001, Fig 2c and S11 Fig). Because they are more likely pathogenic and of greater interest, we performed the same analysis using rare CNVs only. Here, we observed the increased exonic burden described previously for large rare CNVs [35-37]. In contrast to previous studies, we could also detect and compare small CNVs (<50 Kbp) in epileptic patients and healthy controls. We found similar enrichment patterns than for large CNVs (Fig 2c and S11 Fig), suggesting that small rare exonic CNVs are also associated with epilepsy. Indeed, there was no significant difference between epileptic patients and controls when considering all small CNVs and all genes. The exonic enrichment was significant for genes with predicted loss-of-function intolerance and for rare variants (permutation P-value<0.001, Fig 2c and S11 Fig). In both cohorts, most of the rare exonic CNVs were private, i.e. present in only one individual. However, we observed that rare exonic CNVs were less likely private in the epileptic patients (permutation P-value<0.001, S12 Fig). We replicated this result using only individuals with a similar population background (French-Canadians, S12 Fig). Overall we concluded that rare CNVs were not only enriched in exons but also affected exons more recurrently in the epilepsy cohort as compared to controls.

CNV enrichment in and near epilepsy genes

We then sought to evaluate if there was an excess of CNVs disrupting epilepsy-related genes or nearby functional regions. We first retrieved genes whose exons were hit by rare deletions or duplications and evaluated how many were known epilepsy genes based on a list of 154 genes previously associated with epilepsy [48] (Fig 3a). Because epilepsy genes tend to be large, we controlled for the gene size when testing for enrichment (S13 Fig). In the epilepsy cohort only, we noted a clear enrichment for epilepsy genes hit by rare deletions (S13 Fig). Moreover, the enrichment became stronger for rare CNVs. For instance, the exons of 921 genes were disrupted in the epilepsy cohort when considering deletions completely absent from the public and internal databases, 17 of which were epilepsy genes (P-value 0.015, Fig 3b). In addition, we observed significantly more epilepsy patients with a rare non-coding CNV close to an epilepsy gene compared to control individuals (S14 Fig). Interestingly, this enrichment was stronger for non-coding deletions (S14 Fig). We further explored the distribution of rare non-coding deletions by testing each epilepsy gene for a difference in mutation load between patients and controls. The GABRD gene had the strongest and only nominally significant association with four non-coding deletions among the 198 epileptic patients and none in the 301 controls. GABRD encodes the delta subunit of the gamma-aminobutyric acid A receptor and has been associated with juvenile myoclonic epilepsy [49]. In our cohort, two of the four patients with a rare non-coding deletion close to GABRD had been diagnosed with this syndrome, including one patient with a 2.7 Kbp deletion located only 3 Kbp upstream of GABRD’s transcription start site (S15 Fig). Although none survived multiple testing correction, we noted that the strongest associations were all in the direction of a higher mutation load in the epilepsy cohort rather than in the control cohort.

Fig 3

CNVs and epilepsy genes.

CNVs and epilepsy genes.

a) Number of rare CNVs in or close to exons of protein-coding genes (top) or epilepsy genes (bottom), in the epilepsy cohort. b) Number of epilepsy genes hit by exonic deletions in the epilepsy cohort and never seen in the public and internal databases (dotted line), compared to the expected distribution in all genes and size-matched genes (histograms). c) Rare non-coding CNVs in functional regions near epilepsy genes. The graph shows the cumulative number of individuals (y-axis) with a rare non-coding CNV located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. We used CNVs overlapping regions functionally associated with the epilepsy gene (eQTL or promoter-associated DNase site). To get a better idea of the functional regions close to epilepsy genes, we retrieved their associated eQTLs in the GTEx database [50] and the DNase hypersensitivity sites associated with their promoter regions [51]. Notably, focusing on rare non-coding CNVs overlapping these functional regions, the enrichment in epileptic patients was greatly strengthened and clearly present up to 100 Kbp from an epilepsy gene (Kolmogorov-Smirnov test: P-value 9 × 10−5, Fig 3c). Comparing epilepsy patients and controls, the odds ratio of having such a CNV at a distance of 100 Kbp or less from an exon was 1.33 and gradually increased the closer to the exon (2.9 for CNVs at 5 Kbp or less, S16 Fig). These non-coding CNVs were rare even in the epileptic cohort, but collectively represented an important fraction of affected patients. While 20 patients (10.1%) had exonic CNVs in epilepsy genes that were not seen in any control or in the public and internal databases, this number rose to 57 patients (28.8%) when counting non-coding CNVs in functional regions located at less than 100 Kbp of an epilepsy gene. These non-coding CNVs were never seen in the controls nor the CNV databases and overlap with annotated enhancer of epilepsy genes. Although their functional impact remains putative, we believe these CNVs to be of high-interest for the identification of disease causing genes. Among these CNVs of high-interest, a duplication of a regulatory region 5 Kbp downstream of CSNK1E was detected and validated in two different patients but absent from our controls and the public and internal databases (S15 Fig). Another example is a short deletion of an extremely conserved region downstream of FAM63B, detected in one patient and overlapping expression QTLs for this epilepsy gene (S15 Fig).

Putatively pathogenic CNVs

Next, we used an array of criteria to select the rare CNVs (less than 1% in 301 controls) with the highest disruptive potential in the epilepsy cohort. Priority was given to exonic CNVs in genes already known to be associated with epilepsy. For CNVs in other genes, we also prioritize recurrent variants and deletions in genes highly intolerant to loss-of-function mutations. In total, we identified 21 such putative pathogenic CNVs (Tables 2 and 3 and S3 Table). Out of these, 8 directly affected a gene previously associated with epilepsy [48] (Table 2). In particular, we identified a deletion resulting in the loss of more than half of the DEPDC5 gene in a patient affected with partial epilepsy. A number of point mutations have previously been reported in this gene for the same condition [52, 53]. We also identified two deletions and one duplication in CHD2 gene (see Fig 4). The first deletion is large and affects a major portion of the gene while the second is a small 4.6 Kbp deletion of exon 13, the last exon of CHD2’s second isoform (S17 Fig). No exon-disruptive CNVs were reported in any individuals from the control cohort. This gene was previously associated with patients suffering from photosensitive epilepsy [54]. Interestingly, all three patients carrying the CNVs in CHD2 have been diagnosed with eyelid myoclonia epilepsy with absence, the same diagnosis that was largely enriched in the Galizia et al. study. Other known epilepsy genes affected by deletions include LGI1 and the 15q13.3 region.

Table 2

Pathogenic profiles in known epilepsy genes.

Patient	Epilepsy type	Syndrome	Copy number	Chr.	CNV start	CNV end	Epilepsy gene with exon disrupted	Taqman probe	Discovery		Replication
Patient	Epilepsy type	Syndrome	Copy number	Chr.	CNV start	CNV end	Epilepsy gene with exon disrupted	Taqman probe	Patients	Controls	Patients	Controls
CNET0108	Generalized	Eyelid myoclonia epilepsy with absence	1	1	44195001	44460000	ST3GAL3	Hs05759463_cn	1 DEL	0	0	-
CNET0159	Generalized	Eyelid myoclonia epilepsy with absence	1	8	141925001	142010000	PTK2	Hs06202928_cn	1 DEL	0	0	-
CNET0093	Generalized	Juvenile onset; GTCs, Abs, Comp Partial	1	10	95525001	95545000	LGI1	Hs02682696_cn	1 DEL	0	0	-
CNET0140	Generalized	Idiopathic generalized epilepsies	1	13	35750001	35785000	NBEA	Hs05286691_cn	1 DEL	0	0	-
CNET0144	Generalized	Eyelid myoclonia epilepsy with absence	1	15	22745001	23275000	NIPA2	Hs04452887_cn	3 DEL	2 DEL	4 DEL (2DUP)	1 DEL (5 DUP)
CNET0009	Generalized	Idiopathic generalized epilepsies	1	15	30910001	32445000	CHRNA7	Hs03909657_cn	1 DEL	0	3 DEL	(1 DUP)
CNET0119	Generalized	Eyelid myoclonia epilepsy with absence	1	15	93300001	93515000	CHD2	Hs05385106_cn	1 DEL	0	0	-
CNET0143	Generalized	Childhood absence epilepsy	1		93489776	93494317		Hs026436998_cn	1 DEL	0	0	-
CNET0130	Generalized	Eyelid myoclonia epilepsy with absence	3		93445001	93450000		Hs01379802_cn	1 DUP	0	0	-
CNET0074	Focal	Frontal Lobe Epilepsy	1	22	32125001	32255000	DEPDC5	Hs01632214_cn	1 DEL	0	0	-

The 198 epileptic patients and 301 controls represent the discovery set. The replication set contains 325 epileptic patients and 380 controls. Variants that were not tested are marked with “-”.

Table 3

Recurrent CNVs with a pathogenic profile.

Patient	Epilepsy type	Syndrome	Copy number	Chr.	CNV start	CNV end	Gene with exon disrupted	Taqman probe	Discovery		Replication
Patient	Epilepsy type	Syndrome	Copy number	Chr.	CNV start	CNV end	Gene with exon disrupted	Taqman probe	Patients	Controls	Patients	Controls
CNET0184	Generalized	Lennox-Gastaut syndrome	3	2	32625001	33335000	TTC27;LTBP1;BIRC6	Hs03387774_cn	2 DUP	0	2 DUP	0
CNET0097	Generalized	Eyelid myoclonia epilepsy with absence	3	2	32625001	33335000	TTC27;LTBP1;BIRC6	Hs03387774_cn	2 DUP	0	2 DUP	0
CNET0020	Generalized	Juvenile myoclonic epilepsy	1	12	7995001	8125000	SLC2A3;SLC2A14	Hs04406005_cn	2 DEL	2 DEL	2 DEL	2 DEL
CNET0198	Focal	Frontal lobe epilepsy	1	12	7995001	8125000	SLC2A3;SLC2A14	Hs04406005_cn	2 DEL	2 DEL	2 DEL	2 DEL
CNET0012	Generalized	Idiopathic generalized epilepsy	3	15	90845001	90955000	ZNF774;IQGAP1	Hs03895490_cn	2 DUP	0	(1 DEL)	0
CNET0167	Generalized	Childhood absence epilepsy	3	15	90845001	90955000	ZNF774;IQGAP1	Hs03895490_cn	2 DUP	0	(1 DEL)	0
CNET0063	Generalized	Idiopathic generalized epilepsies	3	16	15460001	16290000	KIAA0430;MPV17L; NPIPA5;C16orf45; ABCC6;NDE1; FOPNL;ABCC1;MYH11	Hs05396556_cn	1 DUP + 1 DEL	0	1 DEL	1 DUP
CNET0037	Generalized	Idiopathic generalized epilepsies	1	16	15460001	16290000		Hs05396556_cn	1 DUP + 1 DEL	0	1 DEL	1 DUP

The 198 epileptic patients and 301 controls represent the discovery set. The replication set contains 325 epileptic patients and 380 controls.

Fig 4

Exonic CNVs in CHD2 detected by PopSV.

Exonic CNVs in CHD2 detected by PopSV.

The ‘CNV’ panel shows the exonic deletions (blue) and duplications (red) called by PopSV. The ‘Coverage’ panel shows the read depth signal in the affected individuals (colored points/lines) and the coverage distribution in the reference samples (boxplot and grey point). The 198 epileptic patients and 301 controls represent the discovery set. The replication set contains 325 epileptic patients and 380 controls. Variants that were not tested are marked with “-”. The 198 epileptic patients and 301 controls represent the discovery set. The replication set contains 325 epileptic patients and 380 controls. Four of the 21 putative pathogenic CNVs were found in more than one individual (see Table 3 for precise numbers). To assess their global prevalence we tested them in an additional cohort of 325 epileptic patients and 380 ethnically matched controls (Table 3). Two regions were replicated: the first region in chromosome 2 consists of duplication of the genes TTC27, LTPB1 and BIRC6. In total, 4 patients carried this duplication and it was not reported in any of the two sets of controls. The second region was found on chromosome 16 and encompasses several genes. Two deletions were found in epileptic patients for this region and 1 epileptic individual and 1 control were also carriers of a duplication in the same region. This region corresponds to a genomic hotspot whose deletions were previously associated with epilepsy [30] and other neurological disorders. Finally, the remaining putative pathogenic CNVs were also associated with a number of genes (S3 Table). However, as we lack additional evidence for those specific CNV regions, we propose that these genes should be assessed in independent epilepsy cohorts. Of note, one patient had a rare 170 Kbp deletion encompassing three exons of the PTPRD gene which is predicted to be highly intolerant to loss-of-function mutations (pLI = 1) [47]. Rare deletions in this gene were previously found in four independent individuals with attention-deficit hyperactivity disorder [55] and associated with intellectual disability [56]. In addition, de novo deletions were found in an individual with autism [57] and more recently in a patient with epileptic encephalopathy [32]. A common intronic variant in PTPRD was also associated with remission of seizures after treatment in a clinical cohort of epilepsy patients [58].

Discussion

Although several tools exist for the detection of CNVs using WGS data, we found that none of them could efficiently account for technical biases, thus resulting in limited sensitivity. To improve on this, we developed a new tool, PopSV, which we demonstrated was able to accurately detect CNVs, including rare and small events. A key aspect of our approach is the use of a set of reference samples to identify abnormal read coverage. In this context, the choice and number of reference samples will have an effect on the analysis. Results from running PopSV using different reference cohort sizes suggest that CNV calls are consistent across runs but that a higher number of reference samples increases the sensitivity and robustness of the CNV detection (S18 Fig). Based on these results, we recommend PopSV when 20 samples or more can be used as reference. In a given study, all samples can be used as a reference, or a subset of a few hundreds if the total sample size is extremely large. Although variants with frequency around 50% might not be detected, PopSV excels at detecting less frequent variants, smaller variants or variants in challenging regions such as repeat-rich regions. In a case/control design, the control samples could be used as reference in order to maximize the detection of case-specific variants. In the current study we used both epilepsy patients and controls as reference in order to be able to directly compare the observed CNV distributions. Finally, in a cancer project with paired normal and tumor samples, only normal samples should be used as reference such that PopSV can detect somatic CNVs of any frequency. To maximize performance, the same library preparation, sequencing and data pre-processing should be employed on all the samples. To identify potential batch effects, a principal component analysis of read coverage was implemented as part of the PopSV package and is recommended to assess the homogeneity of the reference samples. The read length and aligner can lead to drastic changes in the read coverage and should be consistent across the cohort when analyzed with PopSV. This is particularly important in repeat-rich regions. Although the different datasets were produced by different sequencing and pre-processing protocols and showed varying degrees of technical bias (Fig 1a, S1 and S2 Figs), the performance of PopSV was comparable when benchmarking the methods in the two public datasets and experimentally validating calls in the CENet cohort. PopSV’s approach does not require a uniform read coverage and integrate the coverage variation separately in each studied region. For these reasons, it would be straightforward to analyze targeted sequencing data, such as exome-sequencing. PopSV could also be extended for the detection of other types of SVs such as balanced SVs. To do this, instead of counting properly mapped reads, the method could be modified to test for an excess of discordant reads. Finally, additional modules could be added to PopSV to help characterize the detected variants. For instance, instead of computing a copy-number estimate from the average coverage in the reference, a HMM approach including all samples could provide a better genotyping strategy. Similar to other approaches [9, 16], an additional step in the pipeline could explore the effect of the bin size on the variation in read coverage across the population and suggest an optimal bin size. As in previous array-based studies [35-37], we observed an enrichment of large rare exonic CNVs in patients compared to controls. However, thanks to the resolution of WGS and PopSV, we found that the global distribution of small CNVs (<50 Kbp) in 198 unrelated epilepsy patients was also skewed towards rare exonic CNVs. In addition, genes disrupted by rare deletions in patients were enriched for previously known epilepsy genes. These observations support the association of small CNVs with epilepsy and could not have been detected in previous array-based studies. We also observed a clear enrichment of non-coding CNVs in the neighborhood of previously implicated genes. When focusing on CNVs seen only in the epilepsy cohort and around epilepsy genes, 10.1% of epilepsy patients have an exonic CNVs and our results shows that up to 28.8% of patients harbor non-coding CNVs of high-interest in the proximity of epilepsy genes. These non-coding variants are present in the epilepsy cohort only and located in annotated regulatory regions associated to known epilepsy genes. Although it is challenging to directly test their functional impact, their frequency and location suggest a putative importance in the genetic mechanism of epilepsy and should be further investigated in the future. Finally, to better understand the impact of these findings on an individual scale, we selected CNVs with the highest pathogenic potential within our patients. These CNVs highlighted known but also potentially new epilepsy genes. Using a second epilepsy cohort, we were also able to identify two chromosomal regions that were recurrently disrupted by CNVs. These findings highlight the benefits of having a comprehensive survey of CNVs when trying to understand the genetic causes of a disease.

Materials and methods

Ethics statement

This study was approved by the Research Ethics Board at the Sick Kids Hospital (REB number 1000033784) and the ethics committee at the Centre Hospitalier Universitaire de Montréal (project number 2003-1394,ND02.058-BSP(CA)). Before their inclusion in this study, patients or parents (when needed) had to give written informed consents.

Epilepsy patients and sequencing

Patients were recruited through two main recruitment sites at the Centre Hospitalier Universitaire de Montréal (CHUM) and the Sick Kids Hospital in Toronto as part of the Canadian Epilepsy Network (CENet). The main cohort of this study was constituted of 198 unrelated patients with various types of epilepsy; 85 males and 113 females. The mean age at onset of the disease for our cohort was 9.2 (±6.7) years. S1 Table presents a detailed description of the clinical features for the various individuals recruited in this study. 301 unrelated healthy parents of other probands from CENet were also included in this study and used as a control cohort. DNA was exclusively extracted from blood DNA. Libraries were generated using the TruSeq DNA PCR-Free Library Preparation Kit (Illumina) and paired-end reads of size 125 bp were sequenced on a HiSeq 2500 to an average coverage of 37.6x ± 5.6x. Reads were aligned to reference Homo_sapiens b37 with BWA [59]. Finally, Picard was used to merge, realign and mark duplicate reads. Raw sequence data has been deposited in the European Genome-phenome Archive, under the accession code EGAS00001002825. For more details, see S1 Text.

Public WGS datasets

Two high-coverage public datasets were used to benchmark PopSV against existing methods. A Twin study provided WGS sequencing data for 45 individuals, including 10 monozygotic twin quartets from the Quebec Study of Newborn Twins [38]. All patients gave informed consent in written form to participate in the Quebec Study of Newborn Twins. Ethic boards from the Centre de Recherche du CHUM, from the Université Laval and from the Montreal Neurological Institute approved this study. DNA was extracted from blood and sequencing was done on an Illumina HiSeq 2500 (paired-end mode, fragment length 300 bp). The reads were aligned using a modified version of the Burrows-Wheeler Aligner [59] (bwa version 0.6.2-r126-tpx with threading enabled). The options were ‘bwa aln -t 12 -q 5’ and ‘bwa sampe -t 12’. Aligned reads are available on the European Nucleotide Archive under ENA PRJEB8308. The 45 samples had an average sequencing depth of 40x (minimum 34x / maximum 57x). A cancer dataset from a study of renal cell carcinoma [39] was also used. 95 pairs of normal/tumor tissues were sequenced using GAIIx and HiSeq2000 instruments. Paired-end reads of size 100 bp totaled an average sequencing depth of 54x (minimum 26x / maximum 164x). Reads were trimmed with FASTX-Toolkit and mapped per lane with BWA [59] backtrack to the GRCh37 reference genome. Picard was used to adjust pairs coordinates, flag duplicates and merge lanes. Finally, realignment was done with GATK. Raw sequence data has been deposited in the European Genome-phenome Archive, under the accession code EGAS00001000083. More details can be found in Scelo et al. [39].

Testing for technical biases in WGS

To investigate the bias in read depth (RD), we fragmented the genome in non-overlapping bins of 5 Kbp and counted the number of properly mapped reads. In each sample, we corrected for GC bias and removed bins with extremely low or high coverage (see S1 Text). Then, read counts across all samples were combined and quantile-normalized. Using simulations and permutations, we constructed two control RD datasets with no region-specific or sample-specific bias. We computed the mean and standard deviation of the coverage in each bin across samples. Next, to investigate experiment-specific bias, we retrieved which sample had the highest coverage in each bin. Then we computed, for each sample, the proportion of the genome where it had the highest coverage. The same analysis was performed monitoring the lowest coverage. This analysis was performed separately on the CENet dataset, the Twin dataset and the normal samples from the cancer dataset. On the Twin dataset, the same analysis was also run after correcting the read coverage following the QDNAseq pipeline [40] (see S1 Text).

PopSV

The main idea behind PopSV is to assess whether the coverage observed in a given location of the genome diverges significantly from the coverage observed in a set of reference samples. PopSV was implemented in an R package (see Data and code availability). The genome is first segmented into bins and the number of reads with proper mapping in each bin is counted for each sample. In a typical design, the genome is segmented in non-overlapping consecutive windows of equal size, but custom designs could also be used. With PopSV, we propose a new normalization procedure which we call targeted normalization that retrieves, for each bin, other genomic regions with similar profile across the reference samples and uses these bins to normalize read coverage (see S1 Text). Our targeted normalization was compared to global approaches that adjust for the median coverage, or quantile-based approaches. After normalization, the value observed in each bin is compared with the profiles observed in the reference samples and a Z-score is calculated (Fig 1b). False Discovery Rate (FDR) is estimated based on these Z-score distributions and a bin is marked as abnormal based on a user-defined FDR threshold. Consecutive abnormal bins are merged and considered as one variant. In PopSV’s R package, circular binary segmentation [60] can also be used to merge bins into variant regions. Copy number was estimated by dividing the coverage in a region by the average coverage across the reference samples, multiplied by 2 (see S1 Text).

Validation and benchmark of PopSV

We compared PopSV to CNVnator [9], FREEC [10] and cn.MOPS [11], three popular RD methods that can be applied to WGS datasets. We also ran LUMPY [8] which uses an orthogonal mapping signal: the insert size, orientation and split mapping of paired reads. For LUMPY, all the CNVs (deletions and duplications) and intra-chromosomal translocations (labeled as ‘BND’ in Lumpy’s output) larger than 300 bp were kept for the upcoming analysis. These methods were run on the two publicly available datasets, using 5 Kbp bins for the RD methods. First, we compared the frequency at which a region is affected by a CNV using the calls from the different methods. To investigate the presence of systematic calls in each method, we compute how many of the calls in a typical sample are called at different frequencies in the dataset. For example, on average, how many calls in one sample are called in more than 90% of the samples. In the Twin dataset, the samples were clustered using the CNV calls from each method. Different linkage criteria were used for the hierarchical clustering (see S1 Text). The Rand index estimated the concordance between the clustering and the known pedigree (family-level). Next, we measured the number of CNVs identified in each twin that were also found in their monozygotic twin. We removed calls present in more than 50% of the samples to ensure that systematic errors were not biasing our replication estimates. Hence, a replicated call is most likely true as it is present in a minority of samples but consistently in the twin pair. For CNVnator, LUMPY and PopSV, the eval1/eval2 columns, number of supporting reads and adjusted P-values (respectively) were used to gradually filter low-quality calls and explore their effect on the replication metrics. In addition to their replication, we annotated the calls when their region overlapped a call found by other methods in the same sample. For calls found by at least two methods, we computed the proportion of calls from a method found by each of the other methods. The approach described previously comparing pairs of twins was also applied in the cancer dataset, on pairs of normal/tumor samples. In this case, a replicated call is found in the normal sample and in the paired tumor sample. Finally, we compared calls using small bins (500 bp) and calls using larger bins (5 Kbp). This comparison explores the quality of the calls, the size of detectable events and the resolution for different bin sizes. First, we counted how many small bin calls supported any large bin call. We then looked at the proportion of small bin calls of different sizes that were also found in the large bin calls.

CNV detection in the CENet cohorts

CNVs were called using PopSV using 5 Kbp bins and all the samples from both the epilepsy and control cohorts as reference. We annotated the frequency of the CNVs using germline CNV calls from the Twin and cancer datasets (internal database) as well as four public CNV databases from the 1000 Genomes Project [13, 45], the Genome of Netherlands [44] and the Simons Genome Diversity Project [46]. CNVs were annotated with the maximum frequency in the databases. Hence, a rare CNV is defined as present in less than 1% of the samples in each of the five CNV databases. To test for a difference in deletion/duplication ratio among rare CNVs, we compared the numbers of rare deletions and duplications in the epilepsy patients and controls using a χ2 test. The same test was performed after downsampling the controls to the sample size of the epilepsy cohort.

Validation by Taqman RT-PCR

We first selected CNV calls in epilepsy patients that spanned at least 2 consecutive bins. We kept exonic CNVs of different sizes and overlapping a Taqman probe. A second batch of CNVs, containing small non-coding CNVs, was also sent for validation. Here, hundreds of non-coding CNVs spanning only one bin were randomly selected. When possible the breakpoints were manually fine-tuned from manual inspection of a base-pair level coverage representation or using IGV [61]; the breakpoints remained unchanged when they could not be refined. Finally, we kept regions overlapping a Taqman probe. Probes were selected using the assay search tool on the Thermofisher website. All probes were tested for patients and controls that were called in PopSV as well as an additional 10 control individuals to ensure the validity of the probe. For each CNV, one assay was chosen in the middle of the genomic region of interest and located in an exon when possible. All reactions with TaqMan Copy Number Assays were performed in duplex using the FAM dye label based assay for the target of interest (Taqman copy number assay, Made to order, #4400291, Applied Biosystems by Life Technologies) and the VIC dye label based TaqMan Copy Number Reference Assay for RNase P (4403326, Life technologies). Amplification reactions (10μL), which were performed in quadruplicate, consisted of: 10 ng gDNA, 1X TaqMan Copy Number Assay, 1X TaqMan Copy Number Reference Assay, RNase P, 1X TaqMan Genotyping Master Mix (4371355, Life Technologies) or 1X SensiFAST Probe Lo-ROX Kit (BIO-84020, Froggabio). PCR was performed with an Applied Biosystems QuantStudio7 flex Real-Time PCR system using the standard curve settings and the default universal cycling conditions: 95 °C 10 minutes followed by 40 cycles: 95 °C 15 seconds, 60 °C 60 seconds. Data was analyzed with QuantStudio Real-Time PCR system software v1.2 (Applied Biosystems by Life Technologies) using autobaseline and manual Ct threshold of 0.2. Results export files were opened in CopyCaller™ Software v2.0 for sample copy number analysis by the relative quantitation method. The median ΔCt was used as the calibrator sample in the analysis settings. For each cohort (epilepsy and control), we retrieved the CNV catalog by merging CNV that are recurrent in multiple samples. Hence, the CNV catalog represents all the different CNVs found in each cohort. Because the epilepsy and control cohorts have different sample sizes, the CNV catalogs for each cohort were built using 150 randomly selected samples. For each sub-sampling and each cohort, control regions were selected to fit the size distribution of the CNV catalog and the overlap with centromeres, telomeres and assembly gaps (see S1 Text). The fold-enrichment represents how much more/less of the CNVs overlap an exon compared to the control regions. To robustly compare the two cohorts, we computed the median difference in fold-enrichment between the CNV catalogs from patients and controls across 100 sub-sampled catalogs. The cohort labels of the CNV catalogs were then permuted 10,000 times and the analysis repeated to derive a null distribution for the median difference in fold enrichment. A permuted P-value was computed from the observed difference and the null distribution. Small (<50Kbp) and large (>50 Kbp) CNVs were analyzed separately. Exons from genes predicted to be loss-of-function intolerant [47] (probability of loss-of-function intolerance > 0.9) were also analyzed separately. The same analysis was repeated using only rare CNVs, i.e. being present in less than 1% of PopSV calls in the Twins and renal cancer datasets, and in four public datasets (see S1 Text). In each cohort, we then retrieved the CNV catalog of rare exonic CNVs. We evaluated the proportion of the CNVs in the catalog that are private (i.e. seen in only one sample). The control cohort was down-sampled a thousand times to the same sample size as the epilepsy cohort to provide a confidence interval and empirical P-value (see S1 Text). We also visualize the proportion of CNVs in the catalog seen in 2 samples or more, 3 samples or more, etc (S12 Fig). We performed the same analysis after removing the top 20 samples with the highest number of non-private rare exonic CNVs. The analysis was also repeated using French-Canadian individuals only. We used the list of genes associated with epilepsy from the EpilepsyGene resource [48] which consists of 154 genes strongly associated with epilepsy. We tested different sets of CNVs: deletion or duplications in the epilepsy cohort, control individuals and samples from the twin study, and using different threshold of maximum frequency. For each set of CNVs, we counted how many of the genes hit were known epilepsy genes. To control for the size of epilepsy genes and CNV-hit genes, we randomly selected genes with sizes similar to the genes hit by CNVs and evaluated how many were epilepsy genes. After sampling 10,000 gene sets, we computed an empirical P-value (see S1 Text). To investigate rare non-coding CNVs close to known epilepsy genes, we counted how many patients have such a CNV at different thresholds of distance to the nearest exon. We compared this cumulative distribution to the control cohort, after down-sampling it to the sample size as the epilepsy cohort. We performed the same analysis using deletions only. Each epilepsy gene was also tested for an excess of rare non-coding deletions in patients versus controls using a Fisher test. Next, we restricted our analysis to rare non-coding CNVs that overlap an eQTL associated with the epilepsy genes [50] or a DNase I hypersensitive site associated with the promoter of epilepsy genes [51]. A Kolmogorov-Smirnov test was used to test the difference in distribution. Finally, using different values for the maximum distance to the nearest epilepsy gene, we computed the odds ratio of having such a CNV between epilepsy patients and controls. Exonic CNVs larger than 10 Kbp and found in less than 1% of the 301 controls were first selected. We further retained either CNVs overlapping the exon of a known epilepsy-associated gene [48] or deletions overlapping the exon of a loss-of-function intolerant gene [47], or CNVs present in two or more of our epilepsy patients. All the putatively pathogenic CNVs were validated by Taqman RT-PCR.

Data and code availability

The PopSV R package and its documentation are available at http://jmonlong.github.io/PopSV/. Scripts are provided to run the pipeline on different high performance computing systems. The code used for the analysis and to produce figures and numbers is documented at http://github.com/jmonlong/epipopsv and archived in https://doi.org/10.5281/zenodo.1172181. Necessary data, including the CNV calls, was deposited at https://figshare.com/s/20dfdedcc4718e465185. Raw sequence data has been deposited in the European Genome-phenome Archive, under the accession code EGAS00001002825.

Supplementary text for the experiments and methods.

(PDF) Click here for additional data file.

Clinical features of epileptic patients.

The Excel file contains the type of epilepsy, age of onset, sex, family history, pharmaco-resistance and potential intellectual disabilities. (XLSX) Click here for additional data file.

PopSV calls validated by RT-PCR.

The Excel file contains the location of each region, the CNV type, the number of carriers in the CENet cohorts, the maximum proportion of carriers in the CNV databases, Taqman probe ID and validation status. (XLSX) Click here for additional data file.

Other pathogenic profiles.

(PDF) Click here for additional data file.

Variation and bias in whole-genome sequencing experiments in the epilepsy cohort.

a) Distribution of the bin inter-sample standard deviation coverage (red) and null distribution (blue: bins shuffled, green: simulated normal distribution). b) Proportion of the genome in which a given sample (x-axis) has the highest (red) or lowest (blue) RD. In the absence of bias all samples should be the most extreme at the same frequency (dotted horizontal line). (PDF) Click here for additional data file.

Variation and bias in whole-genome sequencing experiments in the normals from CageKid (a,d,g), the twin dataset (b,e,h) and the twin dataset after using QDNAseq [40] correction (c,f,i).

a-c) Distribution of the bin inter-sample standard deviation coverage (red) and null distribution (blue: bins shuffled, green: simulated normal distribution). d-f) Same for the bin inter-sample standard deviation coverage. g-i) Proportion of the genome in which a given sample (x-axis) has the highest (red) or lowest (blue) RD. In the absence of bias all samples should be the most extreme at the same frequency (dotted horizontal line). (PDF) Click here for additional data file.

Comparison of different normalization approaches.

a) For each normalization approach, the sample with the least normal Z-score distribution is shown. b) After targeted normalization, a lower proportion of the genome looks problematic for the analysis. Fewer bins have non-normal bin counts (top-left), the sample ranks are more random suggesting less sample-specific bias (top-right), and Z-scores fit better a Normal distribution on average (bottom-left) and in the worst sample (bottom-right). The dotted line is computed from simulated bin counts. (PDF) Click here for additional data file.

Frequency of calls in an average sample from the twin study.

The bars show the proportion of calls in an average samples (y-axis), grouped by the frequency of the call in the dataset (x-axis), for different methods. (PDF) Click here for additional data file.

CNV clustering and twin pedigree.

The hierarchical cluster tree from the CNV calls is cut at different levels (x-axis), cluster groups are compared to the known pedigree using the Rand index (y-axis). Different clustering linkage criteria (point style) are used and the one showing the best Rand index is highlighted by the line. (PDF) Click here for additional data file.

Replication in monozygotic twins for different significance thresholds.

Each point represents the number of replicated calls per sample (average across samples) and the proportion of replicated calls per sample. The vertical error bar shows the variation of the replication rate across the samples. The points and lines were computed by filtering calls at different significance levels (q-value for PopSV, number of supporting reads for LUMPY and eval1/eval2 for CNVnator, see S1 Text). (PDF) Click here for additional data file.

Calls found by several methods.

Focusing on calls found by at least two methods, the heatmap shows the proportion of calls from one method (x-axis) that were also found by another (y-axis) on average per sample. (PDF) Click here for additional data file.

Benchmark across paired normal/tumor in CageKid.

Number (a) and proportion (b) of germline calls replicated in the paired tumor in CageKid. c) Number and proportion of replicated calls when filtering calls at different significance levels. d) Focusing on calls found by at least two methods, the color shows the proportion of calls from one method (x-axis) that were also found by another (y-axis) on average per sample. (PDF) Click here for additional data file.

Comparison of PopSV results using different bin sizes.

a) 5 Kbp calls of different sizes (x-axis) are split according to the proportion of the call supported by 500 bp calls. The Z-score of 500 bp bins in 5 Kbp calls is consistent with the call for deletion b) and duplication c) signal. 5 Kbp calls with lower significance (e.g. single-bin calls) are less supported by 500 bp calls (a) but their Z-scores are in the consistent direction (b,c) although not always significant enough to be called. d) Proportion of 500 bp calls of different sizes (x-axis) overlapping a 5 Kbp call. (PDF) Click here for additional data file.

CNV size in our cohort and four array-based studies.

The bars show the average number of CNVs called in a sample in the different cohorts. Redon 2006 [42] and Itsara 2009 [43] are population studies using technology similar to previous epilepsy studies. Addis 2016 [34] is a recent study of large CNVs in absence epilepsy. Conrad 2010 [4] is a population study that used multiple arrays to increase its resolution. (PDF) Click here for additional data file.

Exonic enrichment significance.

The grey violin plot represents the difference in fold-enrichment between patients and controls across 10,000 permutations where the patient/control labels had been shuffled. The red point represents the observed difference between patients and controls (Fig 2c). (PDF) Click here for additional data file.

Rare exonic CNVs are less private in the epilepsy cohort.

Proportion of rare exonic CNVs (y-axis) seen in X or more individuals (x-axis). The ribbon shows the 5%–95% confidence interval. In b), only French-Canadians individuals were analyzed and we down-sampled the epilepsy cohort to match the sample size of the French-Canadians controls. In c), the top 20 samples with the most non-private rare exonic SVs were removed. (PDF) Click here for additional data file.

Enrichment in epilepsy genes.

a) Epilepsy genes (red) are genes known to be associated with epilepsy. The control genes (dotted blue) are random genes selected so that the size distribution is similar to the sizes of genes hit by CNVs (plain blue). b) In three different datasets (color), genes hit by rare deletion (top) or duplications (bottom) at different frequency thresholds (x-axis) were tested for enrichment in epilepsy genes (y-axis, point-size). (PDF) Click here for additional data file.

Rare non-coding CNVs near epilepsy genes.

The graphs show the cumulative number of individuals (y-axis) with a rare non-coding variants located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. The controls were down-sampled to the sample size of the epilepsy cohort. The ribbon shows the 5%/95% confidence interval. In a), deletions and duplications were considered; in b), only deletions were used. (PDF) Click here for additional data file.

Non-coding CNVs with putative pathogenicity.

a) 2.7 Kbp deletion in an epilepsy patient, never seen in controls or CNV databases. Three other epilepsy patients have a rare non-coding deletions located at less than 200 Kbp from the GABRD gene. b) 8.8 Kbp duplication in two epilepsy patients, never seen in controls or CNV databases and overlapping a regulatory region associated with CSNK1E. c) 6.5 Kbp deletion of an ultra-conserved regions downstream of FAM63B. Two expression QTLs for this gene are highlighted with arrows. (PDF) Click here for additional data file.

The enrichment in rare non-coding CNVs overlapping functional regions increases close to epilepsy genes.

The graph shows the log odds ratio of having a rare non-coding CNV located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. The y-axis shows the log odds ratio between epilepsy patients and controls. The controls were down-sampled to the sample size of the epilepsy cohort. We used CNVs overlapping regions functionally associated with the epilepsy gene (eQTL or promoter-associated DNase site). (PDF) Click here for additional data file.

Small deletion of exon 13 in CHD2.

Abnormal mapping of the read pairs highlighted in red support the deletion detected by PopSV using the read coverage. The deletion region is highlighted in orange. (PDF) Click here for additional data file.

Reference cohort size and CNV detection quality.

PopSV was run on the Twins study using 10, 20, 30 or 45 samples as reference (color). In a), the y-axis shows how many calls from the down-sampled run were found in the original 45-refs run. The x-axis represents the FDR threshold (lower threshold being more stringent). b) Replication in monozygotic twins. For different cohort sizes and FDR thresholds, the number (x-axis) and proportion (y-axis) of calls replicated in the other monozygotic twin is shown. In both graphs, the lines represents the median per sample and the ribbon the minimum/maximum values. (PDF) Click here for additional data file.

Targeted normalization.

The coverage across the reference samples (blue) in the bin to normalize is used to find supporting bins across the genome. These supporting bins only are used to compute the normalization factor. The same supporting bins will be used to normalize the bin count in a test sample (red). (PDF) Click here for additional data file.

60 in total

Review 1. Statistical challenges associated with detecting copy number variations with next-generation sequencing.

Authors: Shu Mei Teo; Yudi Pawitan; Chee Seng Ku; Kee Seng Chia; Agus Salim
Journal: Bioinformatics Date: 2012-08-31 Impact factor: 6.937

Review 2. Clinical significance of rare copy number variations in epilepsy: a case-control survey using microarray-based comparative genomic hybridization.

Authors: Pasquale Striano; Antonietta Coppola; Roberta Paravidino; Michela Malacarne; Stefania Gimelli; Angela Robbiano; Monica Traverso; Marianna Pezzella; Vincenzo Belcastro; Amedeo Bianchi; Maurizio Elia; Antonio Falace; Elisabetta Gazzerro; Edoardo Ferlazzo; Elena Freri; Roberta Galasso; Giuseppe Gobbi; Cristina Molinatto; Simona Cavani; Orsetta Zuffardi; Salvatore Striano; Giovanni Battista Ferrero; Margherita Silengo; Maria Luigia Cavaliere; Matteo Benelli; Alberto Magi; Maria Piccione; Franca Dagna Bricarelli; Domenico A Coviello; Marco Fichera; Carlo Minetti; Federico Zara
Journal: Arch Neurol Date: 2011-11-14

3. Structural genomic variation in childhood epilepsies with complex phenotypes.

Authors: Ingo Helbig; Marielle E M Swinkels; Emmelien Aten; Almuth Caliebe; Ruben van 't Slot; Rainer Boor; Sarah von Spiczak; Hiltrud Muhle; Johanna A Jähn; Ellen van Binsbergen; Onno van Nieuwenhuizen; Floor E Jansen; Kees P J Braun; Gerrit-Jan de Haan; Niels Tommerup; Ulrich Stephani; Helle Hjalgrim; Martin Poot; Dick Lindhout; Eva H Brilstra; Rikke S Møller; Bobby P C Koeleman
Journal: Eur J Hum Genet Date: 2013-11-27 Impact factor: 4.246

4. De novo SCN1A mutations are a major cause of severe myoclonic epilepsy of infancy.

Authors: Lieve Claes; Berten Ceulemans; Dominique Audenaert; Katrien Smets; Ann Löfgren; Jurgen Del-Favero; Sirpa Ala-Mello; Lina Basel-Vanagaite; Barbara Plecko; Salmo Raskin; Paul Thiry; Nicole I Wolf; Christine Van Broeckhoven; Peter De Jonghe
Journal: Hum Mutat Date: 2003-06 Impact factor: 4.878

5. Large multiallelic copy number variations in humans.

Authors: Robert E Handsaker; Vanessa Van Doren; Jennifer R Berman; Giulio Genovese; Seva Kashin; Linda M Boettger; Steven A McCarroll
Journal: Nat Genet Date: 2015-01-26 Impact factor: 38.330

6. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly.

Authors: Ilari Scheinin; Daoud Sie; Henrik Bengtsson; Mark A van de Wiel; Adam B Olshen; Hinke F van Thuijl; Hendrik F van Essen; Paul P Eijk; François Rustenburg; Gerrit A Meijer; Jaap C Reijneveld; Pieter Wesseling; Daniel Pinkel; Donna G Albertson; Bauke Ylstra
Journal: Genome Res Date: 2014-09-18 Impact factor: 9.043

7. Copy number variant analysis from exome data in 349 patients with epileptic encephalopathy.

Authors:
Journal: Ann Neurol Date: 2015-07-01 Impact factor: 10.422

8. A genome-wide association study and biological pathway analysis of epilepsy prognosis in a prospective cohort of newly treated epilepsy.

Authors: Doug Speed; Clive Hoggart; Slave Petrovski; Ioanna Tachmazidou; Alison Coffey; Andrea Jorgensen; Hariklia Eleftherohorinou; Maria De Iorio; Marian Todaro; Tisham De; David Smith; Philip E Smith; Margaret Jackson; Paul Cooper; Mark Kellett; Stephen Howell; Mark Newton; Raju Yerra; Meng Tan; Chris French; Markus Reuber; Graeme E Sills; David Chadwick; Munir Pirmohamed; David Bentley; Ingrid Scheffer; Samuel Berkovic; David Balding; Aarno Palotie; Anthony Marson; Terence J O'Brien; Michael R Johnson
Journal: Hum Mol Genet Date: 2013-08-19 Impact factor: 6.150

9. Analysis of rare copy number variation in absence epilepsies.

Authors: Laura Addis; Richard E Rosch; Antonio Valentin; Andrew Makoff; Robert Robinson; Kate V Everett; Lina Nashef; Deb K Pal
Journal: Neurol Genet Date: 2016-03-22

10. Burden analysis of rare microdeletions suggests a strong impact of neurodevelopmental genes in genetic generalised epilepsies.

Authors: Dennis Lal; Ann-Kathrin Ruppert; Holger Trucks; Herbert Schulz; Carolien G de Kovel; Dorothée Kasteleijn-Nolst Trenité; Anja C M Sonsma; Bobby P Koeleman; Dick Lindhout; Yvonne G Weber; Holger Lerche; Claudia Kapser; Christoph J Schankin; Wolfram S Kunz; Rainer Surges; Christian E Elger; Verena Gaus; Bettina Schmitz; Ingo Helbig; Hiltrud Muhle; Ulrich Stephani; Karl M Klein; Felix Rosenow; Bernd A Neubauer; Eva M Reinthaler; Fritz Zimprich; Martha Feucht; Rikke S Møller; Helle Hjalgrim; Peter De Jonghe; Arvid Suls; Wolfgang Lieb; Andre Franke; Konstantin Strauch; Christian Gieger; Claudia Schurmann; Ulf Schminke; Peter Nürnberg; Thomas Sander
Journal: PLoS Genet Date: 2015-05-07 Impact factor: 5.917

13 in total

Review 1. From Genetic Testing to Precision Medicine in Epilepsy.

Authors: Pasquale Striano; Berge A Minassian
Journal: Neurotherapeutics Date: 2020-04 Impact factor: 7.620

2. Epilepsy subtype-specific copy number burden observed in a genome-wide study of 17 458 subjects.

Authors: Lisa-Marie Niestroj; Eduardo Perez-Palma; Daniel P Howrigan; Yadi Zhou; Feixiong Cheng; Elmo Saarentaus; Peter Nürnberg; Remi Stevelink; Mark J Daly; Aarno Palotie; Dennis Lal
Journal: Brain Date: 2020-07-01 Impact factor: 13.501

Review 3. Non-coding regulatory elements: Potential roles in disease and the case of epilepsy.

Authors: Susanna Pagni; James D Mills; Adam Frankish; Jonathan M Mudge; Sanjay M Sisodiya
Journal: Neuropathol Appl Neurobiol Date: 2021-12-16 Impact factor: 6.250

4. Diagnostic Yields of Trio-WES Accompanied by CNVseq for Rare Neurodevelopmental Disorders.

Authors: Chao Gao; Xiaona Wang; Shiyue Mei; Dongxiao Li; Jiali Duan; Pei Zhang; Baiyun Chen; Liang Han; Yang Gao; Zhenhua Yang; Bing Li; Xiu-An Yang
Journal: Front Genet Date: 2019-05-24 Impact factor: 4.599

5. Human copy number variants are enriched in regions of low mappability.

Authors: Jean Monlong; Patrick Cossette; Caroline Meloche; Guy Rouleau; Simon L Girard; Guillaume Bourque
Journal: Nucleic Acids Res Date: 2018-08-21 Impact factor: 16.971

6. GenPipes: an open-source framework for distributed and scalable genomic analyses.

Authors: Mathieu Bourgey; Rola Dali; Robert Eveleigh; Kuang Chung Chen; Louis Letourneau; Joel Fillon; Marc Michaud; Maxime Caron; Johanna Sandoval; Francois Lefebvre; Gary Leveque; Eloi Mercier; David Bujold; Pascale Marquis; Patrick Tran Van; David Anderson de Lima Morais; Julien Tremblay; Xiaojian Shao; Edouard Henrion; Emmanuel Gonzalez; Pierre-Olivier Quirion; Bryan Caron; Guillaume Bourque
Journal: Gigascience Date: 2019-06-01 Impact factor: 6.524

7. Beyond the Exome: The Non-coding Genome and Enhancers in Neurodevelopmental Disorders and Malformations of Cortical Development.

Authors: Elena Perenthaler; Soheil Yousefi; Eva Niggl; Tahsin Stefan Barakat
Journal: Front Cell Neurosci Date: 2019-07-31 Impact factor: 5.505

Review 8. Exome sequencing in genetic disease: recent advances and considerations.

Authors: Jay P Ross; Patrick A Dion; Guy A Rouleau
Journal: F1000Res Date: 2020-05-06

9. Comprehensive multi-omics integration identifies differentially active enhancers during human brain development with clinical relevance.

Authors: Ruizhi Deng; Kristina Lanko; Eva Medico Salsench; Anita Nikoncuk; Soheil Yousefi; Herma C van der Linde; Elena Perenthaler; Tjakko J van Ham; Eskeatnaf Mulugeta; Tahsin Stefan Barakat
Journal: Genome Med Date: 2021-10-19 Impact factor: 11.117

Review 10. Interpreting the impact of noncoding structural variation in neurodevelopmental disorders.

Authors: Eva D'haene; Sarah Vergult
Journal: Genet Med Date: 2020-09-25 Impact factor: 8.822