Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score.

Ozvan Bocher¹,², Thomas E Ludwig¹,³, Marie-Sophie Oglobinsky¹, Gaëlle Marenne¹, Jean-François Deleuze⁴, Suryakant Suryakant⁵, Jacob Odeberg⁶,⁷, Pierre-Emmanuel Morange⁸, David-Alexandre Trégouët⁵, Hervé Perdry⁹, Emmanuelle Génin¹,³.

Abstract

Rare variant association tests (RVAT) have been developed to study the contribution of rare variants, which are now widely accessible through high-throughput sequencing technologies. RVAT require aggregating rare variants into testing units and filtering variants to retain only the most likely causal ones. In the exome, genes are natural testing units and variants are usually filtered based on their functional consequences. However, when dealing with whole-genome sequence (WGS) data, both steps are challenging. No natural biological unit is available for aggregating rare variants. Sliding-window procedures have been proposed to circumvent this difficulty; however, they are blind to biological information and result in a large number of tests. We propose a new strategy to perform RVAT on WGS data: "RAVA-FIRST" (RAre Variant Association using Functionally-InfoRmed STeps), comprising three steps. (1) New testing units, referred to as "CADD regions", are defined genome-wide based on functionally-adjusted Combined Annotation Dependent Depletion (CADD) scores of variants observed in the gnomAD populations. (2) A region-dependent filtering of rare variants is applied in each CADD region. (3) A functionally-informed burden test is performed with sub-scores computed for each genomic category within each CADD region. Both on simulations and real data, RAVA-FIRST was found to outperform other WGS-based RVAT. Applied to a WGS dataset of venous thromboembolism patients, RAVA-FIRST identified an intergenic region on chromosome 18 enriched for rare variants in early-onset patients. This region, which was missed by standard sliding-window procedures, lies within a TAD that contains a strong candidate gene. RAVA-FIRST enables new investigations of rare non-coding variants in complex diseases, facilitated by its implementation in the R package Ravages.

Substances: DNA, Intergenic

Year: 2022 PMID: 36112662 PMCID: PMC9518893 DOI: 10.1371/journal.pgen.1009923

Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 6.020

Introduction

With advances in sequencing technologies, it is now possible to explore the role of rare genetic variants in complex diseases. Rare variant association tests (RVAT) have been developed that gather rare variants into testing units and compare their rare variant content between cases and controls [1-3]. While the impact of rare variants has already been shown in several complex diseases [4-6], RVAT face two key challenges: (i) the definition of the testing units and (ii) the selection of the qualifying rare variants to include in these units. Because the proportion of causal variants in the testing units is a major driver of power, especially for burden tests, it is important to ensure that qualifying variants are enriched in variants likely to have some functional impact [3, 7]. When exome analyses are undertaken, rare variants are most often grouped by genes and included in the analysis depending on their impact on the corresponding protein [8, 9]. Nevertheless, the gene definition is not always optimal, as differences in rare variant burden between cases and controls can sometimes be found only in a sub-region of a gene. This is for example the case in the RNF213 gene, where an enrichment in rare variants located in the C-terminal region was found in Moyamoya cases [10]. Defining testing units and qualifying variants is much more challenging in the non-coding genome due to the lack of defined genomic elements and the greater difficulty of predicting the functional impact of non-coding variants [11]. It is nevertheless a question of interest, as several studies have shown the importance of rare non-coding variants in the development of complex diseases [12-14]. Functional elements such as enhancers or promoters can be used as testing units [5,15,16]. However, these elements only cover a small portion of the non-coding genome and their size is often too small to gather a sufficient number of rare variants. 
On the other hand, sliding windows procedures such as SCAN-G [17] or WGScan [18] can be used to test for association over the whole genome. Nevertheless, they present several limitations, including a window definition that is arbitrary and blind to biological information, the high number of tests and the associated computation time. With overlapping windows, there is also a strong correlation between the different testing units that requires permutation procedures to account for multiple testing. Finally, to filter rare variants in the testing units, pathogenicity scores are often used but without guidelines on which score to use and which threshold to apply. In this paper, we propose RAVA-FIRST (RAre Variant Association using Functionally InfoRmed STeps), a new strategy for analysing rare variants in the coding and the non-coding genome that addresses the previous issues. First, we provide pre-defined testing units in the whole genome called “CADD regions” based on the Combined Annotation Dependent Depletion (CADD) scores of deleteriousness of variants observed in the gnomAD general population. Second, we propose a filtering approach based on CADD scores with region-dependent thresholds to represent the genetic context of each CADD region and avoid the use of a fixed threshold along the genome. Finally, we integrate functional information into the burden test to detect an accumulation of rare variants in specific genomic categories within CADD regions. Through a statistical description of these CADD regions, we show that they preserve the integrity of the majority of functional elements in the genome. We also show that the RAVA-FIRST filtering strategy enables a better discrimination between functional and non-functional variants. 
We applied RAVA-FIRST to real whole-genome sequencing data from individuals with venous thromboembolism (VTE) and detected an intergenic association signal that would have been missed with sliding windows and a classical filtering of rare variants. RAVA-FIRST is implemented in the R package Ravages, available on CRAN and maintained on GitHub [19,20].

Description of the method

Ethics statement

The MARTHA study was approved by its institutional ethics committee and informed written consent was obtained in accordance with the Declaration of Helsinki. Ethics approval was obtained from the “Departement santé de la direction générale de la recherche et de l’innovation du ministère” (Projects DC: 2008–880 and 09.576).

RAVA-FIRST is developed to test for association with rare variants in the whole genome. It deals with all steps from the definition of testing units and the filtering of rare variants, to the association test accounting for functional information. The main steps are described here and represented in Fig 1, and further details are provided in S1 File and S1 Fig.

Fig 1

Steps performed in RAVA-FIRST: definition of ACS, CADD regions, region-specific thresholds and functionally-informed burden tests.

Testing units in RAVA-FIRST: The CADD regions

To define testing units for association tests, we took inspiration from the work of Havrilla et al. (2019) [21]. They defined “constrained coding regions” (CCR) as exonic regions where no important functional variation (defined as being at least missense) was found in the general population of gnomAD [22]. Those regions could be of interest in RVAT as we can expect that an accumulation of rare variants within them would lead to an increased risk of developing a disease. However, in our experience, two limits prevent the direct use of CCR as testing units in the whole genome: they are too small to gather a sufficient number of rare variants (224 bp being the maximum length of a CCR) and their definition relies on the consequence of the variants on the translated protein, not available in the non-coding genome. To define regions in the non-coding genome using the same underlying hypothesis, we therefore decided to expand the proposed approach by estimating the functionality of variants through CADD scores [23]. CADD scores were chosen because of their availability for every substitution in the genome and because they rank well in the comparison test of functional annotation tools [24]. The goal here is to split the genome into regions according to the distribution of functional variation observed in gnomAD and not to detect the most constrained regions as aimed by Havrilla et al (2019) [21]. Coding variants tend to present higher CADD values than non-coding variants [23]. A selection based on a CADD threshold would therefore result in a majority of coding variants selected. In order to avoid this pattern, we adjusted the RAW CADD scores of all possible SNVs and of a set of 48,000,000 Indels on a PHRED scale within each of three genomic categories: “coding”, “regulatory” and “intergenic” regions to obtain an “adjusted CADD score” also called “ACS”. Coding regions correspond to CCDS [25] and represent 1.2% of the genome. 
Regulatory regions represent 44.3% of the genome and gather introns, 5’ and 3’ UTR, promoters and enhancers, all being involved in gene regulation [26]. Enhancers and promoters have been obtained with the SCREEN tool from ENCODE which enables the definition of a large number of regulatory elements in diverse cell types [27]. Finally, intergenic regions correspond to all other regions and represent 54.5% of the genome. More details are given in the S1 File. ACS were used to select the variants that will bound the “CADD regions” based on criteria defined from a fine tuning to ensure that CADD regions had sizes compatible with RVAT; i.e., not too small to contain enough rare causal variants in cases and not too large to avoid pollution by too many rare neutral variants. First, we selected the variants with an ACS greater than 20, which is the top 1% of variants with the highest predicted functional impact within each of the three genomic categories. Then, among those variants, only the ones observed at least two times in gnomAD r2.0.1 genomes were used as boundaries of CADD regions. The choice of excluding gnomAD singletons was made to avoid splitting CADD regions because of sequencing errors. Contiguous small regions of less than 10 kb were grouped together. All genomic regions where CADD scores are not available (such as centromeres and telomeres among others) were excluded, as well as regions that are not sequenced or are low-covered in gnomAD but contain genomic sites where predicted ACS exceeds 20 for at least one of the possible alleles. This creates gaps within CADD regions that are sometimes of only one base pair but avoids keeping artificially long CADD regions due to a lack of observed variants in gnomAD. More details about the steps and parameters used for the definition of CADD regions are presented in the S1 File.
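The per-category PHRED scaling behind the ACS can be illustrated with a minimal sketch. It assumes the usual PHRED transform, -10·log10(rank/N), applied to the rank of each raw score within its own genomic category, so that an ACS of 20 marks the top 1% of that category. The function name, the input layout and the handling of ties are illustrative assumptions; the exact procedure is the one described in the S1 File, which is not reproduced here.

```python
import math
from collections import defaultdict


def adjusted_cadd_scores(variants):
    """Return {variant_id: ACS} from (variant_id, category, raw_cadd) tuples.

    Hypothetical helper: raw scores are ranked separately inside each
    genomic category and the rank is put on a PHRED scale.
    """
    by_category = defaultdict(list)
    for variant_id, category, raw in variants:
        by_category[category].append((raw, variant_id))

    acs = {}
    for scored in by_category.values():
        scored.sort(reverse=True)  # highest raw score gets rank 1
        total = len(scored)
        for rank, (_, variant_id) in enumerate(scored, start=1):
            # Ties are broken arbitrarily here (an assumption of this sketch).
            acs[variant_id] = -10.0 * math.log10(rank / total)
    return acs


# Toy data: 1,000 intergenic variants with increasing raw scores.
toy = [(f"v{i}", "intergenic", float(i)) for i in range(1, 1001)]
toy_acs = adjusted_cadd_scores(toy)
# v991 is ranked 10th of 1,000, i.e. at the top-1% boundary (ACS of 20).
```

Because the ranking is done within each category, a threshold of 20 selects the same fraction of coding, regulatory and intergenic variants, which is the property the adjustment is meant to provide.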

The RAVA-FIRST filtering strategy

In addition to the definition of new testing units in the whole genome, we propose a new filtering strategy in RAVA-FIRST to select qualifying variants based on thresholds that are specific to each CADD region. The idea is similar to the gene-specific CADD thresholds proposed by Itan et al. [30] to improve variant deleteriousness prediction. To define region-specific thresholds, we computed the median ACS of all the variants (SNVs and InDels) observed at least two times in gnomAD in each CADD region. This value is expected to represent the median score level that is tolerated in the general population within each CADD region. Qualifying variants are then defined as rare variants with an ACS above the threshold specific to their region. We chose to include InDels in this median so that they can be analysed using the RAVA-FIRST strategy, as they represent an important source of genetic variation.
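The region-specific filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the Ravages implementation: the function names and tuple layouts are hypothetical, "observed at least two times" is read as an allele count of at least 2, and the comparison uses "ACS ≥ median" as written in Table 4.

```python
from statistics import median


def region_thresholds(gnomad_variants, min_count=2):
    """Median ACS per CADD region from (region_id, acs, allele_count) tuples.

    Only variants seen at least `min_count` times contribute, so that
    singletons do not influence the threshold.
    """
    per_region = {}
    for region_id, acs, allele_count in gnomad_variants:
        if allele_count >= min_count:
            per_region.setdefault(region_id, []).append(acs)
    return {region: median(values) for region, values in per_region.items()}


def qualifying_variants(study_variants, thresholds, maf_max=0.01):
    """Keep rare variants whose ACS reaches the threshold of their own region.

    study_variants: (variant_id, region_id, acs, maf) tuples.
    """
    kept = []
    for variant_id, region_id, acs, maf in study_variants:
        threshold = thresholds.get(region_id)
        if threshold is not None and maf <= maf_max and acs >= threshold:
            kept.append(variant_id)
    return kept


# Toy example: the singleton (allele count 1) is ignored, so the median is 2.0.
gnomad = [("R1", 1.0, 5), ("R1", 2.0, 3), ("R1", 3.0, 1), ("R1", 5.0, 2)]
study = [("a", "R1", 2.5, 0.005), ("b", "R1", 1.5, 0.001), ("c", "R1", 9.0, 0.05)]
thresholds = region_thresholds(gnomad)
selected = qualifying_variants(study, thresholds)
```

In this toy data only variant "a" qualifies: "b" falls below the regional median and "c" is too frequent.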

Burden test in RAVA-FIRST: Taking into account functional information

Several of the CADD regions overlap different genomic categories (coding, regulatory or intergenic; Figs 1 and S2). As the effects of variants belonging to these different genomic categories may not be the same, we extended the burden test by integrating a sub-score for each genomic category into the regression model, similarly to the analysis of rare and frequent variants proposed by Li and Leal (2008) [7]:

$$\operatorname{logit}\big(P(Y = 1)\big) = \beta_0 + \beta_{cov} X_{cov} + \beta_{coding} X_{coding} + \beta_{regulatory} X_{regulatory} + \beta_{intergenic} X_{intergenic}$$

$Y$ is the vector of phenotypes for the $n$ individuals: 0 for the group of controls and 1 for the group of cases. $\beta_0$ represents the intercept of the model and $X_{cov}$ the matrix of covariates (if any) with their associated effects $\beta_{cov}$; $\beta_{coding}$, $\beta_{regulatory}$ and $\beta_{intergenic}$ correspond to the estimated effects of the burdens $X_{coding}$, $X_{regulatory}$ and $X_{intergenic}$ within each genomic category within the tested CADD region. Each burden can be computed for example using WSS [1], which corresponds to a weighted sum of rare alleles based on their frequency, the rarest alleles having the highest weights. Sub-scores are thus constructed for each genomic category within a CADD region, with at most three sub-scores (coding, regulatory or intergenic). The p-value can be determined using a likelihood ratio test comparing this model to the null model where the sub-scores are not included. This sub-score analysis, referred to as the RAVA-FIRST burden test, is also available for continuous and for categorical phenotypes using the extension of burden tests developed in Bocher et al. (2019) [19]. The RAVA-FIRST burden test coupled with the region-specific filtering on the ACS makes it possible to perform only one test per CADD region while keeping the most important functional variants within each genomic category and accounting for those categories in the association test.
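To make the construction of the sub-scores concrete, the sketch below computes one frequency-weighted burden per genomic category for each individual. It is a simplified illustration: the weight 1/sqrt(q(1-q)), with q estimated in the whole sample using a small pseudo-count, is an assumption in the spirit of WSS [1] and may differ from the weights used in Ravages. The regression fit and the likelihood ratio test are not shown.

```python
import math


def wss_subscores(genotypes, categories):
    """Per-individual burden sub-scores, one per genomic category.

    genotypes: list of individuals, each a list of minor-allele counts
               (0, 1 or 2), one entry per variant.
    categories: genomic category of each variant, in the same order.
    """
    n_individuals = len(genotypes)
    weights = []
    for j in range(len(categories)):
        allele_count = sum(individual[j] for individual in genotypes)
        # Pseudo-count keeps q strictly between 0 and 1 (assumed convention).
        q = (allele_count + 1) / (2 * n_individuals + 2)
        weights.append(1.0 / math.sqrt(q * (1.0 - q)))

    scores = []
    for individual in genotypes:
        sub = {category: 0.0 for category in categories}
        for j, category in enumerate(categories):
            sub[category] += weights[j] * individual[j]
        scores.append(sub)
    return scores


# Toy region with one coding and two regulatory variants, four individuals.
toy_categories = ["coding", "regulatory", "regulatory"]
toy_genotypes = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 0]]
toy_scores = wss_subscores(toy_genotypes, toy_categories)
```

Each dictionary in `toy_scores` would supply one row of the sub-score covariates (here coding and regulatory) entering the regression; rarer variants receive larger weights, so carrying a singleton contributes more than carrying a variant seen twice.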

Verification and comparison

Statistics on CADD regions and comparison with genomic elements

A total of 135,224 CADD regions were defined covering 93.2% of the genome (in build GRCh37), of which 106,251 CADD regions are larger than 1kb (covering 93% of the genome). Overall, 42.1% of CADD regions span only one type of genomic category, 47.5% span two of the three types of genomic categories, and 10.4% overlap the three genomic categories (S2 Fig). Some CADD regions are extremely large, mainly in the regions close to the centromeres (Table 1). Care should be taken when interpreting results obtained in these regions. Indeed, only few high-quality genomes covering these genomic regions are currently available and CADD scores may not be as reliable as in other parts of the genome. About 70% of CADD regions have a size between 1 and 50 kb with a mean of 20 kb, making them completely compatible with the size of genes commonly used as testing units in RVAT.

Table 1

Summary statistics of the lengths of CADD regions.

	Quantile 0%	Quantile 25%	Quantile 50%	Quantile 75%	Quantile 100%	Mean
Length (kb)	0.002	2.576	13.006	24.323	1,731.228	19.852

We then compared the position of genomic elements relative to the defined CADD regions (see S1 Table for the definition of genomic elements). A large majority of genomic elements are entirely included in a single CADD region and thus their integrity is preserved (see S2 Table). This is expected, as all these genomic elements are substantially smaller than the CADD regions. For larger elements such as introns or lncRNA, the percentage decreases but remains high (more than 80% of lncRNA are overlapped by at most two CADD regions). The genomic elements spanning more than one CADD region are on average longer than the ones entirely included in a single CADD region. However, when comparing CCR and CADD regions, it is interesting to note that the CCRs entirely encompassed within a single CADD region are the longest ones, which also represent the most constrained regions.
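The comparison above amounts to an interval-containment check. A minimal sketch, assuming CADD regions on one chromosome are given as sorted, non-overlapping inclusive intervals (the function name and coordinates are illustrative, not taken from the paper):

```python
from bisect import bisect_right


def fraction_within_single_region(regions, elements):
    """Fraction of elements lying entirely inside one region.

    regions: sorted, non-overlapping (start, end) inclusive intervals.
    elements: non-empty list of (start, end) inclusive intervals.
    """
    starts = [start for start, _ in regions]
    contained = 0
    for element_start, element_end in elements:
        # Last region starting at or before the element start, if any.
        i = bisect_right(starts, element_start) - 1
        if i >= 0 and element_end <= regions[i][1]:
            contained += 1
    return contained / len(elements)


toy_regions = [(1, 100), (101, 250), (300, 400)]
toy_elements = [(10, 20), (90, 110), (260, 270), (300, 400)]
toy_fraction = fraction_within_single_region(toy_regions, toy_elements)
```

Here two of the four toy elements are fully contained: one straddles a region boundary and one falls in a gap between regions.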

Performance of RAVA-FIRST filtering based on ACS

To assess the performance of the ACS and the RAVA-FIRST filtering, we evaluated its capacity to discriminate rare pathogenic SNVs defined in the ClinVar database [28] from rare SNV polymorphisms observed in the 1000 Genomes project [29]. We computed the true positive rate (TPR), true negative rate (TNR) and precision for the RAVA-FIRST filtering and compared the results to the ones obtained by applying a fixed CADD threshold of 10, 15 or 20 on variants annotated with CADD scores v1.4. A total of 82,811 variants (44,566 benign and 38,245 pathogenic), both coding and non-coding, were included in this analysis (see S1 and S2 Files for more details on the selection of variants). For coding variants, all filtering strategies based on CADD scores (fixed threshold or ACS) show a very high TPR (Fig 2A), meaning that the majority of pathogenic variants would be selected as qualifying variants for RVAT. The TNR increases with the CADD score threshold, which is expected as fewer variants, and therefore fewer benign variants, are included in the analysis. The RAVA-FIRST filtering shows the highest TNR and the highest precision. While the TPR value is extremely important to select the most probable causal variants in RVAT, it is also important to have a high TNR value, otherwise the signal will be diluted by a high proportion of non-causal variants. The precision value summarises the TPR and TNR parameters and is representative, to a certain extent, of the percentage of causal variants among selected variants. Therefore, we show that the RAVA-FIRST filtering strategy is the most accurate to select qualifying rare variants for RVAT. 
Focusing on the coding genome, we also compared the performance of the RAVA-FIRST filtering approach against two other procedures classically applied with genes as testing units: (1) filtering for variants with a functional impact expected to change the protein ("missense_variant", "missense_variant&splice_region_variant", "splice_acceptor_variant", "splice_donor_variant", "start_lost", "start_lost&splice_region_variant", "stop_gained", "stop_gained&splice_region_variant", "stop_lost", "stop_lost&splice_region_variant" and "stop_retained_variant"), and (2) filtering on the MSC value, a gene-specific CADD threshold [30]. These two filtering approaches resulted in a slightly higher TPR than our proposed strategy but lower TNR and lower precision (Fig 2A). Therefore, even in an exome analysis, the RAVA-FIRST filtering would outperform classical filtering strategies to select qualifying rare variants for RVAT.
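For reference, the three metrics used in this comparison can be computed from a labelled variant set as in the sketch below. The function name and the toy labels are illustrative; "positive" means pathogenic and "selected" means the variant passes the filter under evaluation.

```python
def filter_metrics(is_pathogenic, is_selected):
    """TPR, TNR and precision of a variant filter.

    is_pathogenic, is_selected: equal-length sequences of booleans.
    """
    pairs = list(zip(is_pathogenic, is_selected))
    tp = sum(1 for truth, kept in pairs if truth and kept)
    fn = sum(1 for truth, kept in pairs if truth and not kept)
    fp = sum(1 for truth, kept in pairs if not truth and kept)
    tn = sum(1 for truth, kept in pairs if not truth and not kept)
    return {
        "TPR": tp / (tp + fn),        # pathogenic variants retained
        "TNR": tn / (tn + fp),        # benign variants discarded
        "precision": tp / (tp + fp),  # retained variants that are pathogenic
    }


# Toy set: 4 pathogenic and 6 benign variants.
truth = [True] * 4 + [False] * 6
kept = [True, True, True, False, True, False, False, False, False, False]
toy_metrics = filter_metrics(truth, kept)
```

A filter that keeps almost everything reaches a high TPR but a low TNR, which is why the text stresses both quantities together with precision.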

Fig 2

TPR, TNR and precision of different filtering strategies on ClinVar coding or non-coding pathogenic variants compared to rare 1000 Genomes polymorphisms.

For non-coding variants, performances are lower than for coding variants. This is true when using both fixed CADD thresholds and the ACS median (Fig 2B) but the TPR is much lower when a fixed CADD threshold is used. This is explained by the fact that CADD values are lower in the non-coding genome. The best CADD threshold among hard-threshold filtering is indeed 10 in the non-coding genome while it is 20 in the coding genome. It is thus difficult to use a single fixed CADD value to select rare variants in testing units in the whole genome and the proposed ACS thresholds may therefore be preferred. Note however that, because of a bias towards coding variants in ClinVar pathogenic variants, the number of non-coding variants included in this analysis is rather low (2,980) compared to coding variants (79,831) and results should therefore be interpreted with caution. In summary, compared to classical filtering strategies, the RAVA-FIRST approach based on ACS is expected to improve rare variant selection for RVAT in both the coding and the non-coding parts of the genome.

RAVA-FIRST burden test–Simulations

To validate the RAVA-FIRST burden test, we performed simulations under the null hypothesis and under different scenarios of association using data from the 1000 Genomes European populations [29] in the LCT gene. We simulated 1,000 controls and 1,000 cases using the haplotype-based simulations implemented in the R package Ravages [19]. A total of 201 variants were considered in the LCT gene. These variants were polymorphic in the European populations with a MAF lower than 1%. Two CADD regions overlap the LCT gene, R019233 and R019234, containing 75 and 126 variants respectively; both regions overlap coding and regulatory categories.

Type I error

We first simulated data under the null hypothesis to verify that the RAVA-FIRST burden test maintains appropriate type I errors. We simulated two groups of 1,000 individuals in the R019234 CADD region without any genetic effect and we applied the classical WSS and the RAVA-FIRST WSS. Type I errors were computed using 5∙10⁶ simulations at three significance levels: 5∙10⁻², 10⁻³ and 2.5∙10⁻⁶ (the usual threshold for whole exome rare variant association tests). The RAVA-FIRST WSS maintains good type I error levels at these different significance thresholds, similar to the ones obtained with the classical WSS (S3 Table).

Power analysis

We then performed a power study with causal variants located either in the R019234 CADD region only or in the entire LCT gene in any of the two CADD regions. We simulated 50% of causal variants randomly spread in the whole unit (scenarios S1 and S3), in the coding regions (scenarios S2A and S4A) or in the regulatory regions (scenarios S2B and S4B). All the scenarios are summarised in Table 2. We compared the classical WSS to the RAVA-FIRST WSS using the gene or the two CADD regions as testing units. When CADD regions were used as testing units, analyses were performed for each of the two CADD regions and the minimum p-value was taken and multiplied by two to correct for multiple testing. A total of 1,000 replicates were simulated for each scenario and power was assessed at a genome-wide significance threshold of 2.5∙10⁻⁶.
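The correction applied when the two CADD regions are tested separately is a Bonferroni adjustment of the minimum p-value. A minimal sketch (the function name and the p-values are illustrative; capping at 1 is a standard convention assumed here):

```python
def min_p_corrected(p_values):
    """Smallest p-value multiplied by the number of tests, capped at 1."""
    return min(1.0, min(p_values) * len(p_values))


THRESHOLD = 2.5e-6  # genome-wide significance level used in the power study

# Hypothetical p-values for the two CADD regions of one replicate.
replicate_p = min_p_corrected([3e-7, 0.2])
replicate_significant = replicate_p < THRESHOLD
```

Power is then the proportion of the 1,000 replicates for which the corrected p-value falls below the threshold.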

Table 2

Scenarios of association simulated to assess the performance of the RAVA-FIRST burden test.

	LCT gene
	R019233		R019234
	Coding	Regulatory	Coding	Regulatory
S1	-	-	50% (across coding and regulatory)
S2A	-	-	50%	0%
S2B	-	-	0%	50%
S3	50% (across the whole gene)
S4A	50%	0%	50%	0%
S4B	0%	50%	0%	50%

Table 3 presents the power results obtained from this simulation study for both the classical WSS and the RAVA-FIRST WSS. Similar trends were observed between the two analyses, regardless of whether the simulations were performed at the scale of CADD regions or at the scale of the gene. When the causal variants were randomly sampled across the entire region (scenarios S1 and S3), the classical WSS with only one score for the entire region slightly outperformed the RAVA-FIRST method with sub-scores. Nevertheless, the loss of power for the latter was modest (less than 10 percentage points). By contrast, when causal variants were present only in the coding categories (scenarios S2A and S4A), which represent a small proportion of the entire region (approximately 15%), the RAVA-FIRST strategy was much more powerful than the classical WSS (a gain of approximately 50 percentage points). When causal variants were present in the regulatory categories only (scenarios S2B and S4B), both strategies showed similar power. All these results highlight the gain of power using the RAVA-FIRST WSS when a cluster of causal variants is present within a functional category of the CADD region, while good power levels are maintained when causal variants are spread across the entire region. When comparing the simulations with causal variants sampled at the gene level or at the CADD region level, burden tests gathering variants within the corresponding testing units show, as expected, the highest levels of power. Nevertheless, the loss of power when using CADD regions as testing units instead of the entire gene when causal variants are sampled across the entire gene (scenario S3) is lower than the gain of power they present when causal variants are sampled within a specific CADD region (scenario S1). This is particularly true for the RAVA-FIRST WSS.

Table 3

Power at the genome-wide significance level of 2.5∙10⁻⁶ under the different simulation scenarios using either the classical WSS or the RAVA-FIRST WSS at the scale of either the entire gene or CADD regions.

	By gene		By CADD regions
	Classical WSS	RAVA-FIRST WSS	Classical WSS	RAVA-FIRST WSS
S1	0.409	0.370	0.782	0.701
S2A	0	0.431	0.002	0.602
S2B	0.408	0.404	0.689	0.706
S3	0.751	0.678	0.512	0.433
S4A	0.004	0.564	0.012	0.474
S4B	0.657	0.64	0.39	0.391

Applications

RAVA-FIRST analysis

RAVA-FIRST was used on whole-genome sequence (WGS) data from patients affected by venous thromboembolism (VTE). VTE is a multifactorial disease with a strong genetic component [31]. There is considerable heterogeneity between patients in the age at first VTE event. To study the role of rare variants in VTE age of onset, WGS data were used from 200 individuals from the MARTHA cohort [32]. These individuals were selected among patients with an unprovoked VTE event who were previously genotyped for a genome-wide association study [33] and who present no known genetic predisposing factor. Individuals were dichotomized based on the age at first VTE event, either before 50 years of age (early-onset) or after (late-onset). The threshold of 50 years was chosen based on the results of recent studies [34] that hint toward a genetic heterogeneity between these two groups. A quality control (QC) of the sequencing data was performed using the RAVAQ pipeline [35] (https://gitlab.com/gmarenne/ravaq). After QC, 184 individuals were included for analysis, with 127 presenting an early-onset VTE and 57 a late-onset VTE. Only variants passing all QC steps and with a MAF lower than 1% in the sample were considered in the association tests comparing early and late-onset groups. For these comparisons, rare variants were gathered either by CADD regions or by using the sliding windows procedure implemented in WGScan [18]. Qualifying variants were selected based on CADD scores and using two filtering strategies: (1) a fixed CADD threshold of 15 or (2) the RAVA-FIRST CADD region-specific filtering (applied on ACS). Association was tested using the WSS burden test. When the RAVA-FIRST filtering was used, the corresponding WSS test with sub-scores was applied. Table 4 shows the number of testing units and variants kept under each strategy. For all tests with CADD regions, only testing units containing at least 5 rare variants were kept. WGScan was used with default parameters, i.e. with testing units of 5, 10, 15, 25 or 50 kb.

Table 4

Number of testing units and variants kept under the three strategies.

Strategy	Testing units	Filtering	Number of testing units	Number of variants
WGScan, fixed CADD threshold	Sliding windows	MAF ≤ 1%; CADD v1.4 ≥ 15	377,092	96,347
RAVA-FIRST units (CADD regions), no CADD filtering	CADD regions	MAF ≤ 1%	103,439	9,423,012
RAVA-FIRST units (CADD regions), fixed CADD threshold	CADD regions	MAF ≤ 1%; CADD v1.4 ≥ 15	10,389	96,294
RAVA-FIRST units (CADD regions), RAVA-FIRST filtering	CADD regions	MAF ≤ 1%; ACS ≥ median	95,220	3,494,327

QQ-plots for the WSS tests using those three strategies are shown in Fig 3. As expected, a lower significance threshold is required to reach genome-wide significance with the sliding window procedure due to the higher number of testing units. Accordingly, the computation time was much lower for the two analyses by CADD regions (6 min when filtering based on a fixed CADD score threshold and 25 min when using the region-specific CADD thresholds) than for the sliding windows procedure (47 min). Our dataset contains fewer than 200 individuals, suggesting that the gain in computation time of CADD regions compared to sliding window procedures would be even greater in larger WGS datasets. No significant result was found when no functional filter was applied nor when selecting variants with a CADD score greater than 15. One association reached borderline significance (p = 6.41∙10⁻⁷) when using the RAVA-FIRST strategy, i.e. with CADD regions and the corresponding ACS filtering.
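The link between the number of testing units and the significance threshold can be made explicit with a short calculation. The sketch assumes a Bonferroni correction at a family-wise level of 0.05, which is not stated in the text; the numbers of testing units are those of Table 4.

```python
ALPHA = 0.05  # assumed family-wise error rate, not given in the text

testing_units = {
    "WGScan sliding windows": 377_092,
    "CADD regions, no CADD filtering": 103_439,
    "CADD regions, fixed CADD >= 15": 10_389,
    "CADD regions, RAVA-FIRST filtering": 95_220,
}


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test significance level controlling the family-wise error rate."""
    return alpha / n_tests


thresholds = {name: bonferroni_threshold(n) for name, n in testing_units.items()}

top_p = 6.41e-7  # top RAVA-FIRST signal reported in the text
top_is_significant = top_p < thresholds["CADD regions, RAVA-FIRST filtering"]

for name, value in thresholds.items():
    print(f"{name}: {value:.2e}")
```

Under this assumption the sliding-window analysis requires a threshold roughly four times more stringent than the RAVA-FIRST analysis, and the top signal sits just above the RAVA-FIRST threshold, in line with its description as borderline.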

Fig 3

QQ-plot of WSS analyses on VTE data using the four strategies of analysis.

Early-onset patients (<50 years old) were compared to late-onset patients (≥50 years old).

This association maps to R126442, a CADD region of 21 kb on chromosome 18:66788277–66809402 that contains 30 rare variants after RAVA-FIRST filtering. In this region, none of the variants observed in VTE patients or in gnomAD achieved a CADD score above 15. This explains why the association could not have been detected by the two other strategies based on fixed CADD score ≥ 15. The median of ACS observed for gnomAD variants in this region is 1.73 and the ACS of selected variants range from 1.82 to 8.50. These observations emphasize the need to adapt thresholds depending on the genomic region under analysis. Interestingly, only early-onset VTE patients carry qualifying rare variants and have non-null WSS scores (Fig 4 and S3 File). Among early-onset patients, a trend is also observed for WSS scores to decrease with increasing age of onset. Information about the CADD region R126442 is available in the S3 File. Information about individuals (WSS score, age and sex) and variants (position, adjusted CADD score and weight in WSS) are given.

Fig 4

WSS scores in the CADD region depending on the age at first VTE event.

The dashed line corresponds to the age 50 discriminating early onset from late onset events.

WSS scores in the CADD region depending on the age at first VTE event.

The dashed line corresponds to the age 50 discriminating early onset from late onset events. To make sure that there was indeed an advantage of using CADD regions compared to random windows over each chromosome, we shuffled the CADD regions on chromosome 18, computed new CADD medians in each region and tested for association again. We repeated this procedure 500 times and looked at the region where the top signal (lowest p-value) is located in each permutation (S3 Fig). We found an enrichment of top signals around R126442 and no other region in the chromosome reached the same significance level. Specifically, the top signal overlaps with R126442 in 34.2% of the replicates and this percentage increases to 62.4% if we consider the top 5 signals. The percentage of CADD regions overlapping with R126442 is yet smaller than 0.1% when looking at the whole p-value spectrum. The CADD region R126442 was then tested for association with 20 biological VTE biomarkers available in MARTHA patients: antithrombin, basophil, eosinophil, Factor VIII, Factor XI, fibrinogen, hematocrit, lymphocytes, mean corpuscular volume, mean platelet volume, monocytes, neutrophils, PAI-1, platelets count, protein C, protein S, prothrombin time, red blood cells count, von Willebrand Factor, and white blood cells count. For this, a linear regression model was used where adjustment was made on age at sampling and sex. At the Bonferroni threshold of 0.0025, one significant association (p = 7.1∙10−4) was observed, VTE patients with a non-null WSS score exhibiting decreased haematocrit levels, a surrogate marker of red blood cells (S4 Table). A similar trend (p = 4.6∙10−3) was observed with red blood cell count. We also investigated the association of the identified region with 376 plasma protein antibodies that were selected to be involved in thrombosis-related processes and that have been previously profiled in MARTHA [32,36]. 
Regression analysis were conducted on log transformed values of antibodies and were adjusted for age, sex, and three internal control antibodies. In order to handle the correlation between measured protein antibodies, we used the Li and Ji method [37] to estimate the number of effective independent tests. This number, calculated to be 163, was then used to define a Bonferroni threshold for declaring study-wide statistical significance. While not reaching the study-wise significance level of p = 3.1∙10−4 after correction for multiple testing, it is worth noting that the two proteins that exhibited the strongest significance with marginal association at p < 0.001, procalcitonin tagged by the HPA043700 antibody (p = 7.2∙10−4) and PDPK1 tagged by HPA035199 (p = 7.5∙10−4), were both suggested to be involved in red blood cell biology [38,39]. According to ENCODE data, the R1246442 CADD region overlaps “intergenic” and “regulatory” categories with one distant enhancer-like signature. To further describe this region, we looked at TADs positions in https://dna.cs.miami.edu/TADKB/brows.php in HUVEC and HMEC cell lines, two cell types known to be relevant for VTE pathophysiology. We found that the CADD region is included into the topological associated domain (TAD) 18:66450000–68150000. By studying TADs described by Lieberman-Aiden et al. 2009 [40] in other cell lines such as KBM7, K562 or GM12878, we retrieved a TAD with similar positions, giving additional evidence for the presence of this TAD around the CADD region associated with early-onset patients. We then explored this TAD region for the presence of candidate VTE genes whose regulation could be influenced by the enhancer region that maps our R1246442 region. Using the UCSC genome browser [41] integrating information about interactions between GeneHancer regulatory elements and genes expression (see S4 Fig), we identified CD226 as a strong biological candidate. 
CD226 codes for a glycoprotein expressed at the surface of several types of cells, including blood cells, and several studies have shown that it is associated with vascular endothelial dysfunction [42-44]. Genetic variants in CD226 have also been found to be associated with several blood cell traits, including platelets, white blood cells (e.g. neutrophils, eosinophils) [45] and reticulocyte counts [46], another red blood cell biomarker.
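
The two multiple-testing thresholds reported in the biomarker and antibody analyses follow from simple arithmetic: 0.05/20 = 0.0025 for the 20 biomarkers, and 0.05/163 ≈ 3.1∙10⁻⁴ for the antibodies once the effective number of independent tests is known. The Python sketch below reproduces this arithmetic and illustrates the Li and Ji correction applied to the eigenvalues of a correlation matrix. It is only an illustration and not the authors' code: the function names, the 0.05 family-wise level (assumed here because it is consistent with both reported thresholds) and the toy equicorrelated matrix are assumptions made for the example.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold for n_tests at family-wise level alpha (alpha is assumed)."""
    return alpha / n_tests


def li_ji_effective_tests(eigenvalues):
    """Effective number of independent tests following Li and Ji.

    Each eigenvalue x of the correlation matrix (taken in absolute value)
    contributes I(x >= 1) + (x - floor(x)).
    """
    meff = 0.0
    for value in eigenvalues:
        x = round(abs(value), 10)  # guard against floating-point noise
        meff += (1.0 if x >= 1.0 else 0.0) + (x - math.floor(x))
    return meff


def equicorrelated_eigenvalues(n, rho):
    """Eigenvalues of a toy n x n correlation matrix with constant off-diagonal rho."""
    return [1.0 + (n - 1) * rho] + [1.0 - rho] * (n - 1)


# Thresholds quoted in the text: 20 biomarkers, 163 effective antibody tests.
biomarker_threshold = bonferroni_threshold(20)
antibody_threshold = bonferroni_threshold(163)

# Toy example (hypothetical): 4 measurements with pairwise correlation 0.5.
toy_meff = li_ji_effective_tests(equicorrelated_eigenvalues(4, 0.5))

print(biomarker_threshold, antibody_threshold, toy_meff)
```

In a real analysis the eigenvalues would be computed from the observed correlation matrix of the 376 antibody measurements; the value of 163 used above is simply taken from the text.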

Discussion

Even though whole genome sequencing data are now widely available, rare variant association tests (RVAT) usually remain restricted to the coding parts of the genome. This is explained by the lack of tools to explore rare variant associations outside genes [11]. It is especially difficult to predict the functional consequences of non-coding variants, and it is not currently possible to analyse them in RVAT without using computationally intensive sliding window procedures. In this work, we propose RAVA-FIRST, an entirely new strategy for the analysis of rare variants in the coding and non-coding genome that leverages functional information. RAVA-FIRST is composed of three steps. First, RAVA-FIRST groups variants observed in cases and controls into new testing units, the so-called “CADD regions”. These CADD regions are defined over the entire genome based on CADD scores of variants observed in gnomAD. They are large enough to include a sufficient number of rare variants to allow RVAT. They tend to preserve functional elements, the majority of which are not split into several CADD regions. Second, RAVA-FIRST filters variants based on region-specific adjusted CADD thresholds that allow the selection of the best candidate variants within each region. This filtering approach was found to be more efficient than traditional approaches at discriminating between benign and pathogenic variants within a set of variants. Indeed, our benchmarking study using a set of ClinVar variants compared to 1000Genomes polymorphisms showed that the other filtering strategies were good at identifying true causal variants (true positive rates were high) but poor at excluding the non-causal variants (true negative rates were low), especially in the coding genome. Both true positive and true negative rates are important to achieve a high percentage of causal variants within testing units, a major driver of power in RVAT, especially in burden tests [2,3,7].
Thus, the RAVA-FIRST filtering strategy is expected to result in an appreciable gain of power. Indeed, RAVA-FIRST makes it possible to keep the most important functional variants within the coding, regulatory and intergenic categories of the genome by adapting the CADD score threshold to the genomic context. Third, RAVA-FIRST includes a burden test that integrates information on genomic categories in the regression model and that, coupled with the region-specific filtering, leads to a better detection of causal variants, should they cluster in only one of these genomic categories. We also showed through simulations that good power levels were maintained using the RAVA-FIRST burden test when causal variants were randomly sampled.

RAVA-FIRST was applied to real WGS data from VTE patients, where an accumulation of rare variants in patients with early-onset events was investigated. We did not detect any significant signal using the sliding window procedure or CADD regions when qualifying rare variants were selected based on a minor allele frequency threshold and/or a fixed CADD threshold. However, we detected an association signal using both the grouping and filtering of rare variants proposed in RAVA-FIRST. The associated CADD region is intergenic, contains a predicted enhancer and lies within a TAD containing 5 genes including CD226, a strong candidate for blood cell traits that are now well recognized to be key players in VTE physiopathology [31]. All rare variants in this region have low CADD scores and were not even included in analyses based on a fixed CADD threshold, highlighting the importance of considering the genomic context to detect the most important predicted functional variants within each CADD region. These 30 rare variants are exclusively observed in early-onset cases. Fourteen of these variants are absent from gnomAD, and 10 of the 16 remaining variants have a lower frequency in the gnomAD population than in our sample.
This reinforces the value of the association signal in this CADD region, although it should be further described and validated using functional experiments. Preliminary investigations that need to be further explored, at both experimental and epidemiological levels, strongly suggest that this region is associated with several inflammatory markers impaired in anaemia of inflammation [39,47] and in platelets, both mechanisms being involved in thrombotic processes [48].

The RAVA-FIRST approach could be improved in several respects. First, the definition of CADD regions relies on the gnomAD population and on the adjusted CADD threshold. We chose to use the whole gnomAD dataset, but it could be of interest to select only some of the populations to detect population-specific associations that could, for example, be explained by ancestry-related differential expression patterns [49]. Nevertheless, in classical exome analyses, rare variants are usually filtered based on the maximum frequency observed among multiple populations. Furthermore, CADD regions are not defined for low-covered and non-sequenced genomic regions in gnomAD, and their definition could benefit from the inclusion of data from other large population datasets where these regions are better covered. We also observed that CADD regions close to the centromeres can be very large, possibly due to less accurate annotation scores resulting from the small number of high-quality genomes mapping to these areas. We therefore recommend cautious interpretation of association signals detected in these regions. To define the regulatory regions of the genome as one of the three genomic categories, we decided to include all genomic elements directly implicated in regulatory functions, but we did not include silencers or lncRNA, for example.
However, the choice of elements to include in the regulatory category will only impact the adjusted CADD scores, which are similar between regulatory and intergenic regions, and will therefore not have a large impact on the definition of CADD regions. As an example, the use of DECRES [50] to predict enhancers and promoters instead of SCREEN results in very similar definitions of CADD regions, 80% of them being identical. The choice of focusing on variants seen at least twice in gnomAD and with an ACS larger than 20 could also be discussed. This decision was made based on fine-tuning to obtain testing units with sizes that were the most compatible with rare variant analysis, but this could also be adapted to the genomic context, as we have done by grouping small regions where several variants showed high ACS.

By using CADD scores to define the testing units in RAVA-FIRST, we were able to propose a general framework to cover the entire genome. Indeed, while several other predictive tools have been proposed (for example LINSIGHT [51], JARVIS [52] or ORION [53]), only a few provide a score that is variant-specific and defined in both the coding and non-coding parts of the genome. The use of the same framework to define testing units in the whole genome offers several advantages, including the region-specific filtering, which avoids having to select a hard threshold to filter rare variants in RVAT. In addition, the newly defined CADD regions can be used in existing software that requires regions as input parameters [54,55], making it possible to apply the wide variety of RVAT available in those programs to the whole genome. In particular, Bayesian methods, which have shown great promise in the analysis and filtering of rare variants [56,57], could be applied beyond genes by using CADD regions.

CADD regions represent predefined testing units for RVAT that cover the highest proportion of the genome and have been made publicly available.
They are part of a whole new strategy of rare variant analysis in the whole genome, RAVA-FIRST, that further benefits from the integration of functional information both for the filtering of rare variants and their analysis with burden tests. RAVA-FIRST has been implemented in the R package Ravages, available on CRAN and on GitHub, offering an easy and straightforward tool to perform RVAT in the whole genome. We believe that our developments will help researchers to explore the role of genome-wide rare variants in complex diseases: firstly, through the redefinition of testing units in the coding genome, where clusters of causal variants can be found within genes and retrieved using CADD regions [10]; secondly, through the study of non-coding variants, especially intergenic ones, which are currently often excluded from analyses. Going beyond the gene and the consequences on proteins, RAVA-FIRST will help to better understand the biological mechanisms behind complex diseases.
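
The contrast drawn in the Discussion between a fixed CADD threshold and a region-specific one can be illustrated with a small Python sketch. It assumes, for illustration only, that the region-specific threshold is the median of the adjusted CADD scores of reference variants in the region (the Results mention CADD medians computed in each region) and that a rare variant is kept when its score is at least that threshold. The function names, the toy scores and the fixed cut-off of 20 are hypothetical choices, and this is not the Ravages implementation.

```python
from statistics import median


def region_threshold(reference_scores):
    """Assumed region-specific threshold: median adjusted CADD score in the region."""
    return median(reference_scores)


def filter_fixed(variant_scores, cutoff=20.0):
    """Keep variants whose score reaches a single genome-wide cut-off (assumed 20)."""
    return {name: s for name, s in variant_scores.items() if s >= cutoff}


def filter_region_specific(variant_scores, reference_scores):
    """Keep variants whose score reaches the threshold of their own region."""
    threshold = region_threshold(reference_scores)
    return {name: s for name, s in variant_scores.items() if s >= threshold}


# Toy intergenic region: reference scores are low overall (hypothetical values).
reference = [1.2, 2.5, 3.1, 4.0, 5.6, 6.3, 8.9]
# Rare variants observed in the study sample (hypothetical values).
observed = {"var_a": 2.0, "var_b": 4.0, "var_c": 7.5, "var_d": 12.4}

kept_fixed = filter_fixed(observed)
kept_regional = filter_region_specific(observed, reference)

print(sorted(kept_fixed), sorted(kept_regional))
```

With these toy values no variant passes the fixed cut-off, whereas three pass the regional one. This mirrors the situation described for the associated region, whose rare variants all have low CADD scores and were therefore excluded by analyses based on a fixed threshold.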

Definition of CADD regions and removal of low-covered and non-sequenced regions in gnomAD.

(TIF)

Percentage of CADD regions overlapping each of the three genomic categories.

(TIF)

Association analysis on VTE data on chromosome 18 by shuffling the CADD regions (500 replicates).

(TIF)

Screenshot of the TAD 18:66450000–68150000 in the UCSC genome browser containing the CADD region R126442 and a potential enhancer regulating the CD226 gene, a candidate gene in VTE.

(TIF)

Sources used to get genomic elements for comparisons with CADD regions.

(DOCX)

Percentage of genomic elements entirely encompassed within a CADD region.

(DOCX)

Type I error of the classical WSS and the RAVA-FIRST WSS using 5∙10⁶ simulations under the null hypothesis.

(DOCX)

Characteristics of the studied VTE sample.

Mean (Standard Deviation) for quantitative variables. Count (%) for qualitative variables. (DOCX)

Details about the RAVA-FIRST method and its evaluation.

(DOCX)

Variants used for the evaluation of RAVA-FIRST comparing ClinVar pathogenic variants to 1000Genomes polymorphisms.

(XLSX)

Information on the CADD region R126442 associated with VTE age at onset.

Information about individuals (WSS score, age and sex) and variants (position, adjusted CADD score and weight in WSS) are given. (XLSX)

5 Mar 2022

Dear Dr Bocher,

Thank you very much for submitting your Methods entitled 'Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score' to PLOS Genetics. The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

From an editorial perspective, the reviewers, and in particular reviewer 2 raises several major issues. I do think these issues can be addressed, as this is a request for additional work rather than a fundamental challenge of the concept of the paper, but none of them is trivial. This means (i) benchmark the proposed methods against the most commonly used tools in the field, (ii) understand how variability in mutation rate can complicate data interpretation and (iii) include indels into the model (both reviewers made that point and this is a major source of rare deleterious variation). These combined additions represent a high bar, but methods for rare variant association testing are quite mature, which in turn raises the expectations for alternative approaches.

Should you decide to revise the manuscript for further consideration here, your revisions must address the specific points raised above. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.
If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. 
If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,
Vincent Plagnol, Associate Editor, PLOS Genetics
David Balding, Section Editor: Methods, PLOS Genetics

Reviewer's Comments to the Authors:

Reviewer #1: In their publication 'Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score', Bocher et al. describe a novel method for selecting disease-candidate variants among rare SNVs genome-wide. The described methodology is very interesting and enables new insights especially in previously less regarding genomic regions. I would recommend however to revise the text somewhat in order to improve readability:

major:
- I would recommend to define the adjusted CADD scores once and then stick to that. Maybe use an acronym (e.g. ACS (adjusted CADD score), RAVA-Score or something like that). In my eyes, the switches between adjusted and not-adjusted CADD scores (i.e. l. 154) make the manuscript hard to read, especially as you define the adjustment multiple times. Something similar may help to distinguish 'all' and '>1000bp' CADD regions
- Table 2: Are percentages only from within the chosen (>1000bp) CADD regions or all. Or why else are some coding CCDS not in a CADD region? Why are protein domains so much less covered (likelihood that entire domain is covered instead of % of each bp)?
- Fig 1: Why do you include missense and MSC in the top panel. I understand that those are the same as the bottom panel, but the empty plots should hence be excluded altogether. Maybe you could use that space on the upper panel to put (a single) legend for both panels there (i.e. the TPR, TNR and precision colours). One should note that Figure 1 does not tell much as most variants in ClinVar are coding and thus the adjusted CADD score does not do much. You mostly just have a higher cut-off than CADD1.4=20 (similar S4, just lower) which decreases TPR and increases FPR

suggestion:
- I wonder what happens in the very large 'CADD regions' around the centromeres where, presumably, many variants are not scored at all and general genome conservation is low and hence the median, which may lead to small numbers of variants in slightly higher conserved areas to be considered significant, idk). I have no idea what effect this may have but I, personally, would probably implement a maximum length for CADD regions and split regions larger than, say, 200 kb into the maximum number so that each is at least 100 kb (just a suggestion, maybe this does not work as intended)
- Maybe it's just me, but I find Figures S1 (to a lesser degree S2) definitely more relevant for the main manuscript than any of the Tables. You are generally moving a lot of the method (i.e. definition of CADD regions) in the supplement that seems an important part of the manuscript
- l. 320/321 'as recommended by https://cadd.gs.washington.edu/info, version v1.4': not to be pedantic but I would interpret 'there is not a natural choice here -- it is always arbitrary' rather as a recommendation for a dynamic threshold like your method than a single fixed cut-off. After all, you are pretty much proposing a (automated) solution for a problem that has also been stated there
- l.156: ‘Note that because CADD scores are only available for SNVs’ -> CADD is available for InDel, consider carefully however if you want to include those in the analysis

minor:
- l. 89/90 'These regions prevent the use of sliding windows procedures while enabling the study of rare variants in the whole genome' -> I would use 'avoid' or 'replace' instead of 'prevent'
- l. 92 fix -> fixed
- (multiple) I assume 'package R Ravages' should be 'R package Ravages'
- (multiple) I would write 'gnomAD', not 'GnomAD'

Reviewer #2: In this manuscript, Bocher et al. attempt to define a new approach to performing rare variant burden tests in the non-coding genome. With whole-genome sequencing of large disease cohorts increasing at a rapid rate, identifying such methods is of value to the field. Unfortunately, the authors’ approach at defining functional units of the non-coding genome fails to account for major confounders. Furthermore, they have failed to benchmark against some of the most popular tools in the field as outlined in my comments below. In addition, their rationale for defining these CADD regions is very unclear to me.

1. The authors define the boundaries of their CADD regions as regions between two variants with an adjusted CADD score > 20. However, they do not consider mutation rate. In CCR, which the authors compare their approach to, Havrilla et al. cleverly used CpG density as a proxy for mutation rate and showed that this approach worked well. Unfortunately, because the CADD regions do not account for mutation rate, it is entirely unclear whether the lack of variation in a CADD region is due to selective pressure or lower mutability of that region due to decreased mutation rate.
2. I’m not convinced that the authors are detecting regions of the genome depleted for functional variation. I’d like to see how their regions compare to other methods that attempt to define constrained genomic regions (JARVIS, Orion, CDTS, LINSIGHT, etc.). The authors should compare performance of these approaches in classifying non-coding ClinVar variants. Also, the authors should not use ClinVar benign variants as a negative set in these comparisons, as most ClinVar benign variants are gnomAD polymorphisms. Because gnomAD variation was used to define this score, they run into confounding with this comparison. A better approach would be to use variants seen in other databases (e.g., DiscovEHR) but not gnomAD as a putative benign set.
3. In defining CADD regions, the authors required the variant to be seen in gnomAD > 2x, but they do not provide any rationale how they came to this threshold. Same goes for choosing a CADD threshold for the burden tests.
4. While the authors compare to sliding-window based burden approaches, they should also compare to how their approach compares to defining functional units using the many other intolerance methods mentioned above.
5. They exclude indels/structural variants because these variants do not have CADD scores. Doing so results in a tremendous loss of power: indels / SVs should have much larger effect sizes than SNVs in the non-coding genome. A burden model that includes SNVs with a CADD threshold + any indels / SVs should be more powered.
6. The authors find a suggestive association between a CADD region and VTE. I’d like to see how their approach performs without any CADD filter as a negative control. Furthermore, I would want to see a comparison that their CADD region-based burden analysis does better than randomly splitting the genome into random chunks (matched in size to the CADD regions).

**********

Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No 3 Jun 2022 Submitted filename: Reviews_PlosG_23May22_VF.docx Click here for additional data file. 18 Jul 2022 Dear Dr Bocher, Thank you very much for submitting your Methods entitled 'Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score' to PLOS Genetics. We are happy to say that the substantial reviewers' comments have been addressed and we are therefore close to being able to accept the manuscript. However, reviewer 1 made some minor comments and we ask you to address these, after which we expect to formally accept the manuscript. We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. 
PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] Please let us know if you have any questions while making these revisions. Yours sincerely, Vincent Plagnol Associate Editor PLOS Genetics David Balding Section Editor: Methods PLOS Genetics Reviewer's Comments to the Authors: Reviewer #1: All my previous concerns have been addressed. However, there are a few minor comments to the new sections that should be fixed prior to publication: l. 196/197: "This is expected given the fact that conservation and thus CADD scores are low in these regions." l. 462/463 "We also observed that CADD regions close to the centromeres can be very large, possibly due to a general low conservation in these areas" > My understanding is that there are few high quality genomes that cover these areas (repetitive elements, GC content) and hence alignments are missing to generate proper conservation scores. l. 239 "Fig 2B" -> that is 2A now, right? l. 477 "if" -> "of" (or the sentence does not make sense) l. 479 "use of a" -> "use of the" optional: Consider rereading and revising some sentences here and there. Quite a few feel unnecessarily lengthy. E.g l. 478, you can remove "of them can" without changing context Reviewer #2: The authors have addressed my concerns ********** Have all data underlying the figures and results presented in the manuscript been provided? 
Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No 3 Aug 2022 Submitted filename: Review2_response_reviewers_2.docx Click here for additional data file. 15 Aug 2022 Dear Dr Bocher, We are pleased to inform you that your manuscript entitled "Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score" has been editorially accepted for publication in PLOS Genetics. Congratulations! Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made. Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. 
If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org. In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date. Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics! 
Yours sincerely,

Vincent Plagnol
Academic Editor
PLOS Genetics

David Balding
Section Editor
PLOS Genetics

www.plosgenetics.org
Twitter: @PLOSGenetics

12 Sep 2022

PGENETICS-D-21-01454R2

Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score

Dear Dr Bocher,

We are pleased to inform you that your manuscript entitled "Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes
PLOS Genetics

On behalf of:
The PLOS Genetics Team
Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
plosgenetics@plos.org | +44 (0) 1223-442823
