CCRaVAT and QuTie - enabling analysis of rare variants in large-scale case control and quantitative trait association studies.

Robert Lawrence, Aaron G Day-Williams, Katherine S Elliott, Andrew P Morris, Eleftheria Zeggini.

Abstract

BACKGROUND: Genome-wide association studies have been successful in finding common variants influencing common traits. However, these associations only account for a fraction of trait heritability. There has been a shift in the field towards studying low frequency and rare variants, which are now widely recognised as putative complex trait determinants. Despite this increasing focus on examining the role of low frequency and rare variants in complex disease susceptibility, there is a lack of user-friendly analytical packages implementing powerful association tests for the analysis of rare variants.
RESULTS: We have developed two software tools, CCRaVAT (Case-Control Rare Variant Analysis Tool) and QuTie (Quantitative Trait), which enable efficient large-scale analysis of low frequency and rare variants. Both programs implement a collapsing method examining the accumulation of low frequency and rare variants across a locus of interest that has more power than single variant analysis. CCRaVAT carries out case-control analyses whereas QuTie has been developed for continuous trait analysis.
CONCLUSIONS: CCRaVAT and QuTie are easy to use software tools that allow users to perform genome-wide association analysis on low frequency and rare variants for both binary and quantitative traits. The software is freely available and provides the genetics community with a resource to perform association analysis on rarer genetic variants.

Year: 2010; PMID: 20964851; PMCID: PMC2973964; DOI: 10.1186/1471-2105-11-527

Source DB: PubMed; Journal: BMC Bioinformatics; ISSN: 1471-2105; Impact factor: 3.169


Background

Recent advances in high-throughput genotyping have made large-scale genetic association studies possible. Genome-wide association studies (GWAS) for complex disease have met with unprecedented success in identifying common susceptibility variants. However, the discovered common-frequency single nucleotide polymorphism (SNP) associations do not account for a large proportion of the genetic component of disease. The field is now focusing on the analysis of low frequency and rare variants (i.e. minor allele frequency (MAF) ≤ 0.05) to investigate whether they will help explain the missing heritability in complex trait etiology [1,2]. While the sample sizes currently investigated are large enough for a well-powered GWAS of common variants, they are not large enough to provide sufficient power for the single-point analysis of low frequency/rare variants with small to moderate effect sizes [3]. We have developed association analysis software, CCRaVAT (Case-Control Rare Variant Analysis Tool) and QuTie (Quantitative Trait), which allow the large-scale analysis of low frequency/rare polymorphisms. The software increases power over single marker analysis of these variants by pooling the low frequency/rare variants within defined regions and treating them as a single "super-locus" [3,4]. These software tools are suitable for the analysis of SNP data from commercial GWAS platforms as well as variants discovered in resequencing projects. The programs find loci where the low frequency/rare variant content is significantly different between cases and controls, or where the means of a quantitative trait differ between groups with and without these variants.

Implementation

CCRaVAT and QuTie are Linux command-line utilities written in Perl. The scripts use the Getopt, POSIX, and GD Perl modules. The GD module is required to produce the graphical output, and the POSIX module is used to calculate the base-10 logarithm of the p values. The tools have been tested on a variety of GWAS datasets, and the system requirements depend mainly on the size of the study (i.e. number of SNPs and individuals genotyped). The software requires that the data be separated by chromosome for efficiency. For a genome-wide dataset separated by chromosome consisting of 450,000 SNPs typed in 5,000 individuals, CCRaVAT requires ~200 MB of RAM. The software development and testing of the applications were performed on machines with dual-core Athlon processors. The run time of the scripts varies with the options used. The run time for a typical gene-centric genome-wide analysis, using approximately 450,000 SNPs and 5,000 individuals separated by chromosome, is less than 24 hours. Permutation testing can add considerably to the computing time, depending on the number of regions analyzed and the number of permutations run.

Results and Discussion

The statistical properties of the low frequency/rare variant collapsing (super-locus) association test that we have implemented have been described previously [3,4]. Although methods for analyzing low frequency/rare variants have been developed, to our knowledge there are no published software packages that implement them. This lack of software tools motivated the development of CCRaVAT and QuTie.

Analytical Framework

Figure 1 provides an overview of the analytical approach implemented in CCRaVAT and QuTie. The first step in the collapsing approach is to define the regions in which low frequency/rare variants are collapsed. These chromosomal regions can be defined either by sliding windows of predefined length across the genome or as genic regions bounded by intervals on either side of the transcriptional start and stop sites of genes. CCRaVAT and QuTie differ in the study designs they analyze and in the statistical techniques they use to determine the significance of the comparison. CCRaVAT analyzes binary trait data and, for each region, constructs a 2 × 2 contingency table of the presence or absence of low frequency/rare variant minor alleles in cases and controls. Differences in the proportion of cases and controls carrying low frequency/rare variant minor alleles are tested using a Pearson's chi-squared test or a Fisher's exact test. CCRaVAT also allows users to generate empirical p values by permuting case-control status a predefined number of times and repeating the analysis for each replicate. QuTie implements the analysis of quantitative traits in a sample of unrelated individuals: within the defined region, it compares the quantitative trait means of individuals carrying at least one low frequency/rare variant minor allele with those of individuals carrying none. The quantitative trait values in the two groups are compared using linear regression and a Student's t-test. Both analysis methods assume that all individuals are unrelated.
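The case-control comparison described above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' Perl code; the function name `pearson_chi2_2x2` is hypothetical, and the sketch assumes a Pearson statistic with 1 degree of freedom and no continuity correction. Under that assumption, the counts listed for KLF6 in Table 5 (14 and 1907 cases with and without rare variants, 1 and 2931 controls) give a statistic of about 18.18, in line with the tabulated value.

```python
import math

def pearson_chi2_2x2(case_rv, case_no_rv, cont_rv, cont_no_rv):
    """Pearson chi-squared statistic (1 df, uncorrected) and p value for a 2 x 2 table.

    Rows are cases/controls; columns are carriers/non-carriers of at least
    one low frequency/rare variant minor allele in the region.
    """
    observed = [[case_rv, case_no_rv], [cont_rv, cont_no_rv]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_rv + cont_rv, case_no_rv + cont_no_rv]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                # An empty margin (e.g. no carriers at all): nothing to test.
                return 0.0, 1.0
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

if __name__ == "__main__":
    chi2, p = pearson_chi2_2x2(14, 1907, 1, 2931)
    print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

Returning a statistic of 0 and a p value of 1 for regions without any carriers is a design choice of this sketch; it mirrors how such regions appear in the example chromosome output (Table 8), but the actual handling in CCRaVAT is not described in the text.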
Figure 1

CCRaVAT and QuTie Workflow. Flowchart summarizing the implementation of the low frequency/rare variant analysis methods in CCRaVAT and QuTie.

Input Files

CCRaVAT and QuTie require two input files per chromosome: a map file and a pedigree file. The map file contains information about the markers analyzed and their position along the chromosome. CCRaVAT and QuTie accept both a 3 column and a 4 column map file, as shown in Tables 1 and 2. The 3 column map file illustrated in Table 1 contains the chromosome, marker name, and base pair (bp) position of each analyzed marker. The 4 column map file shown in Table 2 is the map file format used by the program PLINK [5] and contains the chromosome, marker name, genetic position, and bp position of each analyzed marker. The pedigree file holds information about the individuals and their genotypes. It is a white-space delimited (space or tab) file that needs to be in the standard pre-Makeped linkage format described and illustrated in Table 3. If a gene-centric analysis is performed, an additional file defining gene names and coordinates is required. This file is also white-space delimited (space or tab) and is illustrated in Table 4. The software download includes gene files for both build 35 and build 36 of the genome.
Table 1

Three Column Map File

CHR  MARKER  BP POS
1    SNP1    1111
1    SNP2    2111
1    SNP3    3111
1    SNP4    4111

3 column map file that contains the chromosome, marker name, and base pair position of each marker. The header row is for display purposes only and should not appear in the actual file.

Table 2

Four Column Map File

CHR  MARKER  GEN POS  BP POS
1    SNP1    0        1111
1    SNP2    1        2111
1    SNP3    2        3111
1    SNP4    3        4111

4 column map file that contains the chromosome, marker name, genetic position, and base pair position. The four column format is the same used by the genetics software package PLINK. The header row is for display purposes only and should not appear in the actual file.

Table 3

Pedigree File

PED ID  INDIV ID  FATHER ID  MOTHER ID  SEX  AFF STAT  GENOTYPES
1       1         0          0          1    1         A A  A C  T G
2       2         0          0          2    1         A G  A A  G G
3       3         0          0          2    2         G G  C C  T T
4       4         0          0          1    2         A G  A C  T G

Pedigree file that contains genotype data for 3 SNPs and 4 individuals (2 controls and 2 cases). The first column is for pedigree IDs, the second for individual IDs, the third for paternal ID, the fourth for maternal ID, the fifth for sex code, and the sixth for disease designation or quantitative phenotype value. Column 7 starts the genotype data for the markers, with each allele of each genotype in its own column (e.g. for 3 markers there will be 6 allele columns). The header row is for display purposes only and should not appear in the actual file.
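To make the pedigree layout concrete, the following Python sketch reads lines in the format described above and flags the individuals who carry at least one minor allele at a low frequency/rare marker. It is only an illustration under stated assumptions: the function names are hypothetical, missing alleles are assumed to be coded as 0 (the usual linkage-format convention), the five sample lines are made-up toy data, and a cut-off of 0.1 is used only because the toy sample is so small.

```python
from collections import Counter

def parse_ped_line(line):
    """Split one pre-Makeped line into its six leading fields and genotype pairs."""
    fields = line.split()  # handles both space- and tab-delimited files
    ped_id, ind_id, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2:
        raise ValueError("expected two allele columns per marker")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"ped": ped_id, "id": ind_id, "father": father, "mother": mother,
            "sex": sex, "pheno": phenotype, "genotypes": genotypes}

def minor_alleles(individuals, missing="0"):
    """Return one (minor_allele, maf) pair per marker, ignoring missing alleles."""
    result = []
    for m in range(len(individuals[0]["genotypes"])):
        counts = Counter(a for ind in individuals
                         for a in ind["genotypes"][m] if a != missing)
        if len(counts) < 2:
            result.append((None, 0.0))  # monomorphic or untyped marker
            continue
        allele, count = min(counts.items(), key=lambda kv: kv[1])
        result.append((allele, count / sum(counts.values())))
    return result

def carrier_flags(individuals, maf_cutoff=0.05):
    """True for each individual with >= 1 minor allele at any marker with MAF <= cutoff."""
    rare = [(m, allele) for m, (allele, maf) in enumerate(minor_alleles(individuals))
            if allele is not None and maf <= maf_cutoff]
    return [any(allele in ind["genotypes"][m] for m, allele in rare)
            for ind in individuals]

if __name__ == "__main__":
    toy_lines = [            # made-up data: 5 individuals, 2 markers
        "1 1 0 0 1 1 A A C C",
        "2 2 0 0 2 1 A A C C",
        "3 3 0 0 2 2 A G C C",
        "4 4 0 0 1 2 A A C T",
        "5 5 0 0 1 1 A A C C",
    ]
    people = [parse_ped_line(line) for line in toy_lines]
    print(carrier_flags(people, maf_cutoff=0.1))
```

Cross-tabulating these flags against the affection status in column 6 gives the four cell counts of the collapsed 2 × 2 table for a region.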

Program Options

CCRaVAT and QuTie provide users with 25 command-line options, all detailed in the user manual, allowing the analysis to be tailored to specific interests. The options belong to three broad categories: altering the definition of a region and of a low frequency/rare variant; altering significance levels and defining the statistical analysis method; and altering the appearance of the graphical output. Fundamental to the collapsing method is the definition of the region within which the accumulation of low frequency/rare variants will be examined. CCRaVAT and QuTie provide the user with two options for defining the locus of interest: regions based on known gene coordinates, or a sliding window approach. If the analysis is based on sliding windows, the user defines how large the analysis windows should be. If a gene-based analysis is undertaken, the user can also define how far upstream and downstream of the transcription start and stop sites to extend the analysis. The user can adjust the MAF cut-off that determines which markers are considered to be low frequency/rare variants and are therefore included in the analysis. Unlike association tests of common variants, there is no well-defined significance threshold for the analysis of multiple low frequency/rare variants. The programs allow the user to define a significance threshold below which regions are written to separate files, allowing the researcher to focus on top hits without having to trawl through all the data. The researcher can also set a significance threshold to select regions for follow-up by permutation analysis, and the number of permutations can be preset. As chi-squared test results can be unreliable with low cell counts, CCRaVAT provides an option for the user to set a minimum cell count; the Fisher's exact test is then applied to any region with a cell count below this value.
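The two ways of defining a region can be illustrated as follows. This is a rough Python sketch with hypothetical function and parameter names, not the programs' own option names; it assumes non-overlapping windows by default and, because the gene file carries no strand information, simply extends the region before the start and after the stop coordinate. The marker names and positions in the example are made up; the gene coordinates are those listed for TNFRSF4 in the example gene file.

```python
def sliding_windows(first_bp, last_bp, window_size, step=None):
    """Yield (start, end) windows covering first_bp..last_bp, both ends inclusive.

    Windows do not overlap unless a step smaller than window_size is given.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    step = step or window_size
    start = first_bp
    while start <= last_bp:
        yield (start, min(start + window_size - 1, last_bp))
        start += step

def gene_region(start_bp, stop_bp, upstream=0, downstream=0):
    """Extend a gene's coordinates by flanking intervals (in bp) on either side."""
    return (max(0, start_bp - upstream), stop_bp + downstream)

def markers_in_region(markers, region):
    """Return the names of (name, bp_position) markers that fall inside the region."""
    start, end = region
    return [name for name, pos in markers if start <= pos <= end]

if __name__ == "__main__":
    toy_markers = [("snpA", 1071463), ("snpB", 1088878),
                   ("snpC", 1116987), ("snpD", 1216520)]
    region = gene_region(1136569, 1139375, upstream=50000, downstream=50000)
    print(region, markers_in_region(toy_markers, region))
    print(list(sliding_windows(1, 250, 100)))
```

Only the markers selected this way, and among them only those at or below the MAF cut-off, would then be collapsed for the region.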
The standard analysis of QuTie is a linear regression, but QuTie provides an option to additionally carry out a two-sample t-test. To assist researchers in interpreting the results, CCRaVAT and QuTie produce visual output summaries. The programs allow the user to define a significance threshold to highlight loci in the Manhattan plot on the basis of their p value, as well as to manipulate graphical parameters such as the height, width, and size of data points of the figures. The programs also provide an option to (re)produce figures based on previously run analyses.
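The relationship between the regression and the t-test can be made explicit with a short sketch. When the only predictor is a 0/1 carrier indicator, the least-squares slope equals the difference between the two group means, and its t statistic equals the pooled-variance two-sample t statistic. The Python code below is an illustration only: the function name is hypothetical, the sign convention (carriers minus non-carriers) is an assumption and may differ from the one QuTie reports, the confidence interval uses the 1.96 normal quantile, and the p value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
from statistics import mean

def compare_trait_by_carrier(values_with_rv, values_without_rv):
    """Difference in trait means between carriers and non-carriers.

    Returns the slope of a regression on a 0/1 carrier indicator, its
    standard error, an approximate 95% confidence interval, the t statistic,
    and a two-sided p value from a normal approximation.
    """
    n1, n0 = len(values_with_rv), len(values_without_rv)
    if n1 < 1 or n0 < 1 or n1 + n0 < 3:
        raise ValueError("need both groups and at least 3 individuals")
    m1, m0 = mean(values_with_rv), mean(values_without_rv)
    ss = (sum((v - m1) ** 2 for v in values_with_rv)
          + sum((v - m0) ** 2 for v in values_without_rv))
    pooled_var = ss / (n1 + n0 - 2)          # residual variance of the regression
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n0))
    if se == 0:
        raise ValueError("no trait variation within groups")
    beta = m1 - m0                           # assumed sign convention
    t_stat = beta / se
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # normal approximation
    return {"beta": beta, "se": se, "t": t_stat, "p": p_value,
            "ci": (beta - 1.96 * se, beta + 1.96 * se)}

if __name__ == "__main__":
    result = compare_trait_by_carrier([2.0, 4.0], [1.0, 1.0, 2.0, 2.0])
    print(result)
```

For a real analysis, the p value should come from a t distribution with n - 2 degrees of freedom; the normal approximation is used here only to keep the sketch free of external dependencies.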

Output Files

CCRaVAT and QuTie produce text-based summaries and graphical summaries of the analysis results. The format of the CCRaVAT output file that provides summary statistics for all genes/windows that achieved a user-specified level of significance is displayed in Table 5. The same summary file produced by QuTie is illustrated in Table 6. The results of permutation testing for all regions that reached the significance threshold are shown in Table 7. CCRaVAT and QuTie produce comprehensive output including summary statistics for all analysed genes/windows on each chromosome, and this output is summarized in Tables 8 and 9, respectively. The programs also produce a list of SNPs that were analyzed within each significant region, and the format of that file is shown in Table 10. In addition to these output files, CCRaVAT and QuTie produce a Manhattan plot that visually summarizes the significance of all analyzed regions (Figure 2). QuTie produces two additional graphic summaries (Figures 3 and 4). The histogram shown in Figure 3 shows the distribution of quantitative trait values for all individuals in the pedigree file. Figure 4 is an example of the histogram that QuTie produces for every region achieving a user-specified level of significance, and shows the distribution of trait values of individuals with (red) and without (blue) low frequency/rare variant minor alleles. The output of a genome-wide, gene-centric scan for low frequency/rare variants (MAF ≤ 0.05) typically totals less than 2 MB for all files. The output size of a genome-wide sliding windows-based analysis depends on the size of the intervals examined and the MAF threshold imposed, and usually ranges from 3 to 6 MB for all files.
Table 5

CCRaVAT Summary Output File

Gene/Wind  Chr  Start      End_Pos    N_SNPs  CaseRV  CaseNoRV  ContRV  ContNoRV  ChiSq  P-val      FisherExPval
MGC33212   3    197456409  197583455  (10/1)  2       1909      31      2903      15.5   0.000083   1.96E-05
PPIC       5    122336979  122450324  (12/6)  26      1877      7       2906      21.44  0.0000037  5.94E-06
NR3C1      5    142589325  142813087  (24/2)  8       1912      47      2869      14.71  0.00013    7.28E-05
ADAMTS2    5    178423474  178754935  (30/2)  62      1859      44      2885      16.15  0.000059   No < 30
3.8_1.5    6    29790873   29892049   (43/2)  26      1890      10      2920      16.21  0.000057   9.74E-05
KLF6       10   3761233    3867455    (14/2)  14      1907      1       2931      18.18  0.00002    2.13E-05
KLF61037612333867455(14/2)1419071293118.180.000022.13E-05

This file provides summary statistics for all genes that achieved a p value ≤ the p value set by the -pout command line option. The summary file is a tab-delimited file with 12 columns: Gene/Window name, Chromosome, Starting bp position, End bp position, Number of SNPs in the Gene/Window, Number of cases with low frequency/rare variant minor alleles, Number of cases without low frequency/rare variant minor alleles, Number of controls with low frequency/rare variant minor alleles, Number of controls without low frequency/rare variant minor alleles, Chi-Squared Value, Chi-Squared p value, Fisher exact p value.
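The choice between the chi-squared and the Fisher's exact p value can be sketched as below. This is a hedged Python illustration, not the CCRaVAT implementation: the function names are hypothetical, the two-sided p value is computed by summing all hypergeometric probabilities no larger than that of the observed table (one common definition), and the default minimum cell count of 30 is an assumption inferred from the "No < 30" entries in the example output.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k carriers in row 1 with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for k in range(lowest, highest + 1):
        p_k = prob(k)
        if p_k <= p_obs * (1 + 1e-9):  # small tolerance for floating point ties
            total += p_k
    return min(1.0, total)

def needs_fisher(a, b, c, d, min_cell_count=30):
    """True when any cell falls below the minimum count (assumed default: 30)."""
    return min(a, b, c, d) < min_cell_count

if __name__ == "__main__":
    print(needs_fisher(62, 1859, 44, 2885))   # all cells >= 30
    print(needs_fisher(14, 1907, 1, 2931))    # one cell < 30
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

Under this reading, a row such as ADAMTS2 in Table 5, where every cell is at least 30, would be reported with the chi-squared p value only.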

Table 6

QuTie Summary Output File

Gene       Chr  Start     End_pos   SNPs/rvSNPs  QT+RV  QT-RV  QT+RV mean  QT-RV mean  p-val     BetaCoef  St.Er  [lowCI - upCI]     t-test  t-test_p-val
MIB2       1    0         107622    (10/1)       106    1125   -0.367      0.037       6.77E-05  0.404     0.101  [0.206 - 0.602]    -3.975  3.72E-05
GOLGA8C    15   18977714  19091040  (29/4)       127    1097   -0.347      0.042       3.19E-05  0.389     0.093  [0.206 - 0.571]    -4.149  1.78E-05
TPO        2    1346242   1575502   (29/6)       70     1158   0.478       -0.028      4.00E-05  -0.506    0.122  [-0.746 - -0.266]  4.107   1.80E-05
EXOC3      5    446375    570407    (24/1)       4      1228   1.935       -0.006      1.00E-04  -1.941    0.498  [-2.918 - -0.964]  3.874   5.12E-05
C10orf110  10   1008606   1130138   (66/4)       111    1116   0.379       -0.036      4.00E-05  -0.415    0.099  [-0.609 - -0.221]  4.161   1.48E-05

This file provides summary statistics for all genes that achieved a p value ≤ the p value set by the -pout command line option. The summary file is a tab-delimited file with 15 columns: Gene/Window name, Chromosome, Starting bp position, End bp position, Number of SNPs in the Gene/Window, Number of individuals with low frequency/rare variant minor alleles, Number of individuals without low frequency/rare variant minor alleles, The mean QT value of individuals with low frequency/rare variant minor alleles, The mean QT value of individuals without low frequency/rare variant minor alleles, The p value of the linear regression, The Beta coefficient from the linear regression, The Standard Error of the Beta Coefficient, the lower and upper 95% confidence intervals of Beta, The t-statistic, and The t-statistic p value.

Table 7

CCRaVAT Permutation Summary Output File

Gene/Window  Chr  Start    End      Case: (RV/NonRV) Control: (RV/NonRV)  Pval     Permutation Results
LOC254099    1    1012320  1219359  case: (86/1213) cont: (75/581)        0.00026  Perm: 0/10 = 0
TTLL10       1    1055000  1261164  case: (164/1133) cont: (104/549)      0.047    Perm: 0/10 = 0
TNFRSF18     1    1078812  1282012  case: (164/1133) cont: (104/549)      0.047    Perm: 1/10 = 0.1
TNFRSF4      1    1086630  1289435  case: (164/1133) cont: (104/549)      0.047    Perm: 1/10 = 0.1
SDF4         1    1092212  1307334  case: (164/1133) cont: (104/549)      0.047    Perm: 0/10 = 0
B3GALT6      1    1107568  1310341  case: (164/1133) cont: (104/549)      0.047    Perm: 1/10 = 0.1
C1QDC2       1    1117751  1318766  case: (164/1133) cont: (104/549)      0.047    Perm: 0/10 = 0
UBE2J2       1    1129217  1349157  case: (164/1133) cont: (104/549)      0.047    Perm: 0/10 = 0
SCNN1D       1    1157499  1367332  case: (164/1133) cont: (104/549)      0.047    Perm: 0/10 = 0

This file provides summary statistics for all genes that achieved a p value ≤ the p value set by the -pperm command line option, which initiates permutation testing. The summary file is a tab-delimited file with 8 columns: Gene/Window name, Chromosome, Starting bp position, End bp position, Summary of the number of cases and controls that have low frequency/rare variant minor alleles, the original p value, Summary of permutations run, and Permutation p value. The output file for QuTie is the same except that column 5 contains the number of individuals with and without low frequency/rare variant minor alleles and corresponding QT values.
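The permutation summary can be understood through a small sketch of the procedure: case-control labels are shuffled, the test statistic is recomputed for each replicate, and the empirical p value is the fraction of replicates at least as extreme as the observed result. The Python code below is an illustration under assumptions, not the CCRaVAT code: the function names are hypothetical, the chi-squared statistic is used as the measure of extremeness, and the plain hits/N estimate is chosen because it matches entries such as "Perm: 1/10 = 0.1" in the example output.

```python
import random

def chi2_from_labels(carrier, is_case):
    """Uncorrected Pearson chi-squared for carrier status versus case status."""
    n = len(carrier)
    cases = sum(is_case)
    carriers = sum(carrier)
    a = sum(1 for c, k in zip(carrier, is_case) if c and k)  # carrier cases
    b, c_, = cases - a, carriers - a
    d = n - cases - c_
    denom = cases * (n - cases) * carriers * (n - carriers)
    return n * (a * d - b * c_) ** 2 / denom if denom else 0.0

def permutation_p_value(carrier, is_case, n_perm=1000, seed=0):
    """Return (hits, n_perm, hits / n_perm) for shuffled case-control labels."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = chi2_from_labels(carrier, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if chi2_from_labels(carrier, labels) >= observed - 1e-12:
            hits += 1
    return hits, n_perm, hits / n_perm

def format_perm(hits, n_perm):
    """Format a result in the style of the 'Permutation Results' column."""
    return f"Perm: {hits}/{n_perm} = {hits / n_perm:g}"

if __name__ == "__main__":
    carrier = [True] * 5 + [False] * 5   # made-up data: carriers are all cases
    status = [True] * 5 + [False] * 5
    hits, n_perm, p = permutation_p_value(carrier, status, n_perm=200, seed=1)
    print(format_perm(hits, n_perm))
```

With only 10 permutations, as in the example table, the smallest non-zero empirical p value is 0.1, so a larger number of permutations is needed to resolve small p values.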

Table 8

CCRaVAT Chromosome Output File

Gene       Chr  Start   Stop    N_SNPs   Case+RV  Case-RV  Cont+RV  Cont-RV  Chisq  PearsonPval  FishExPval  Permutations
MIB2       1    0       107622  (5/0)    0        1924     0        2938     0      1            1
OR4G11P    1    2878    103747  (6/1)    1        1922     2        2935     0.05   0.82         1
MMP23B     1    9202    111672  (0/0)    0        1924     0        2938     0      1            1
MMP23A     1    9225    111672  (12/2)   112      1784     167      2755     0.08   0.78         No < 30
CDC2L2     1    12742   197336  (32/1)   1        1921     7        2926     2.46   0.12         0.158
LOC440748  1    39316   143660  (47/12)  380      1531     594      2329     0.14   0.71         No < 30
NBPF20     1    114476  233524  (23/1)   4        1919     2        2932     1.84   0.17         0.222
CCNL2      1    115136  226993  (13/0)   0        1924     0        2938     0      1            1
OR4F29     1    357522  544452  (7/3)    159      1756     239      2685     0.03   0.86         No < 30
LOC440551  1    519055  657573  (5/0)    0        1924     0        2938     0      1            1
LOC440552  1    558787  660167  (29/9)   207      1679     263      2640     4.74   0.029        No < 30     Perm: 1/10 = 0.1
FAM87B     1    742614  845077  (28/3)   15       1906     21       2907     0.06   0.81         0.865
FAM87B1742614845077(28/3)1519062129070.060.810.865

This file provides summary statistics for all genes analyzed on each chromosome and is the most comprehensive output file. The summary file is a tab-delimited file with 13 columns: Gene/Window name, Chromosome, Starting bp position, End bp position, Number of SNPs in the Gene/Window and the number that are low frequency/rare, Number of cases with low frequency/rare variant minor alleles, Number of cases without low frequency/rare variant minor alleles, Number of controls with low frequency/rare variant minor alleles, Number of controls without low frequency/rare variant minor alleles, Chi-Squared Value, Chi-Squared p value, Fisher exact p value, and a description of any permutations run.

Table 9

QuTie Chromosome Output File

Gene       Start    End      (SNPs/RV)  QT+RV  QT-RV  QT+RV_Av  QT-RV_Av  RegPval  BetaCoef  StEr   [lCI - uCI]        T       Ttest_Pval
MIB2       924412   1025537  (20/1)     4      597    9.014     -0.137    0.222    -9.151    7.484  [-23.819 - 5.517]  1.223   0.111
OR4G11P    938946   1039986  (25/1)     4      597    9.014     -0.137    0.222    -9.151    7.484  [-23.819 - 5.517]  1.223   0.111
MMP23B     945587   1081419  (43/6)     75     281    3.308     -0.354    0.06     -3.662    1.94   [-7.484 - 0.159]   1.884   0.03
MMP23A     983671   1097098  (46/7)     89     280    2.929     -0.493    0.059    -3.423    1.807  [-6.983 - 0.138]   1.89    0.03
CDC2L2     997120   1097869  (43/7)     89     280    2.929     -0.493    0.059    -3.423    1.807  [-6.983 - 0.138]   1.89    0.03
LOC440748  1007128  1117407  (47/8)     115    260    2.73      -0.135    0.085    -2.865    1.659  [-6.134 - 0.404]   1.724   0.04
NBPF20     1062320  1169359  (32/10)    159    224    0.713     -0.169    0.558    -0.882    1.505  [-3.847 - 2.082]   0.588   0.28
CCNL2      1105000  1211164  (35/14)    150    523    -1.291    0.224     0.281    1.515     1.403  [-1.235 - 4.266]   -1.081  0.14
OR4F29     1128812  1232012  (30/17)    142    437    -1.402    0.231     0.258    1.633     1.442  [-1.193 - 4.459]   -1.133  0.13
LOC440551  1136630  1239435  (30/17)    143    443    -1.407    0.204     0.262    1.611     1.435  [-1.201 - 4.423]   -1.123  0.13
LOC440552  1142212  1257334  (28/15)    139    444    -1.48     0.156     0.261    1.636     1.454  [-1.214 - 4.485]   -1.126  0.13
FAM87B     1157568  1260341  (22/14)    110    449    -1.612    0.209     0.254    1.822     1.596  [-1.306 - 4.949]   -1.142  0.13

This file provides summary statistics for all genes analyzed on each chromosome and is the most comprehensive output file. The summary file is a tab-delimited file with 14 columns: Gene/Window name, Starting bp position, End bp position, Number of SNPs in the Gene/Window, Number of individuals with low frequency/rare variant minor alleles, Number of individuals without low frequency/rare variant minor alleles, Mean QT value for individuals with low frequency/rare variant minor alleles, Mean QT value for individuals without low frequency/rare variant minor alleles, Linear regression p value, Beta coefficient from linear regression, Standard Error of the Beta coefficient, The lower and upper 95% Confidence Intervals of Beta, T-statistic, and T-statistic p value.

Table 10

CCRaVAT/QuTie Significant Region Output File

Marker      Chromosome  Position  MAF
rs715643    1           1212830   0.042
rs3934834   1           1045729   0.163
rs3737728   1           1061338   0.301
rs6687776   1           1070488   0.158
rs9651273   1           1071463   0.295
rs4970405   1           1088878   0.086
rs12726255  1           1089873   0.125
rs2298217   1           1104902   0.133
rs4970357   1           1116987   0.093
rs4970362   1           1134661   0.378
rs9660710   1           1139265   0.068
rs4970420   1           1146396   0.192
rs1320565   1           1159781   0.095
rs11260549  1           1161717   0.116
rs9729550   1           1175165   0.262
rs11721     1           1192554   0.101
rs2887286   1           1196054   0.17
rs3813199   1           1198200   0.106
rs3766186   1           1202358   0.105
rs7515488   1           1203727   0.158
rs6675798   1           1216520   0.105

This file provides summary statistics for all SNPs that reside within a gene or region with p value ≤ the p value set by the -pout command line option. The file is tab-delimited with 5 columns: Marker name, Chromosome, bp position of Marker, and the Minor Allele Frequency (MAF) considering all analyzed individuals.

Figure 2

CCRaVAT and QuTie Manhattan Plot. An example Manhattan plot generated by CCRaVAT and QuTie displaying the -LOG10 p value of all genes/windows analyzed. Each point represents a gene or region, with loci achieving p values below a predefined threshold denoted in red.

Figure 3

QuTie Quantitative Trait Distribution Histogram. Histogram showing the distribution of the analysed quantitative trait across all individuals (individuals with and without low frequency/rare-variant minor alleles).

Figure 4

QuTie Quantitative Trait Distribution Comparison Histogram. Histogram displaying the distribution of quantitative trait values for individuals that either do (red) or do not (blue) carry at least one low frequency/rare variant minor allele within a region that has a p value ≤ the value set by the -pout option. A histogram is produced for every significant gene/window.

Data Quality Control

Performing the collapsing analysis based on low frequency and rare variants (particularly those typed as part of GWAS) requires special attention to quality control. Genotype calling algorithms for GWAS chips perform well for common variants, but are known to be error-prone for loci with low MAF. Therefore, we recommend that users who have performed the analysis on GWAS chip data check the cluster plots for all variants contributing to interesting signals, exclude any poorly clustering variants, and rerun the analysis for the specific regions of interest to ensure that the association is robust to these exclusions. Quality control is also an important consideration when analyzing sequencing data. Major considerations are the effects of small insertions-deletions leading to false positive SNPs, read depth at variant sites, mapping quality score, and SNP quality score.

Conclusions

In this paper we have described two novel analysis tools, CCRaVAT and QuTie, for investigating low frequency/rare variant associations in GWAS and resequencing data. Both programs employ a simple collapsing method to increase power over single point analysis. CCRaVAT analyzes case/control data and investigates significance using Pearson's chi-squared and Fisher's exact tests. QuTie analyzes quantitative trait data and implements a linear regression and Student's t-test. Both CCRaVAT and QuTie are easy-to-use Linux command line tools that use standard files typically employed in common variant GWAS analysis. CCRaVAT and QuTie can be used as a complement to existing common disease GWAS by analyzing low frequency/rare variant associations or in analyzing sequence-based low frequency/rare variant genotype calls in regions of interest or genome-wide. These tools are important first steps in the analysis of rare variants. We are currently developing more powerful natural extensions to the current methods as well as novel approaches that incorporate weights based on quality metrics.

Availability and requirements

Project name: CCRaVAT and QuTie
Project homepage: http://www.sanger.ac.uk/resources/software/rarevariant/
Operating system: Linux/Unix
Programming Language: Perl
License: GNU GPL

List of abbreviations

CCRaVAT: case control rare variant analysis tool; QuTie: quantitative trait; QT: quantitative trait; GWAS: genome-wide association study; SNP: single nucleotide polymorphism; MAF: minor allele frequency; bp: base-pair; CHR: chromosome; POS: position; GEN: genetic; AFF STAT: affection status; RV: rare variant; Cont: control; ChiSq: Chi-square statistic; FisherEX: Fisher's exact test; Wind: window; Coef: coefficient; StEr: standard error; CI: confidence interval; Av: average; RegPval: regression p-value.

Authors' contributions

RL wrote the code for CCRaVAT and QuTie. ADW wrote the documentation, developed the homepage, and drafted the manuscript. KSE compiled and created the gene files for the gene-centric analysis. APM supervised the development of CCRaVAT and QuTie. EZ supervised the development of CCRaVAT and QuTie and drafted the manuscript. All authors have read and approved this manuscript.
Table 4

Gene File

GENE ID   GENE NAME   CHR   START BP POS   STOP BP POS
7293      TNFRSF4     1     1136569        1139375
51150     SDF4        1     1142151        1157274
126792    B3GALT6     1     1157508        1160281
388581    C1QDC2      1     1167696        1171965
118424    UBE2J2      1     1179155        1199097
6339      SCNN1D      1     1207439        1217272
116983    CENTB5      1     1218807        1228503
126789    PUSL1       1     1233857        1236920

Gene file that defines the genes to be analyzed and their coordinates to allow the collapsing of the correct markers defined in the map file. The first five columns of the file must be: Gene ID, Gene Name/Symbol, Chromosome, Start bp position, End bp position. Additional columns will be ignored.
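As a rough illustration of the gene file layout described above, the following Python sketch reads the first five columns of each line and ignores any further columns. It is not part of CCRaVAT or QuTie. The whitespace delimiter, the optional header flag, the inclusive treatment of the start and stop positions, and all names (`Gene`, `parse_gene_file`, `markers_in_gene`) are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple

class Gene(NamedTuple):
    gene_id: str
    name: str
    chrom: str
    start: int
    stop: int

def parse_gene_file(text: str, has_header: bool = False) -> List[Gene]:
    """Parse gene file contents into Gene records.

    Only the first five whitespace-separated columns are used
    (gene ID, gene name, chromosome, start bp, stop bp); any
    additional columns are ignored. Blank lines are skipped.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if has_header:
        lines = lines[1:]  # drop the column-title row if one is present
    genes = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns, got: {line!r}")
        gene_id, name, chrom, start, stop = fields[:5]
        genes.append(Gene(gene_id, name, chrom, int(start), int(stop)))
    return genes

def markers_in_gene(gene: Gene, markers: List[Tuple[str, int, str]]) -> List[str]:
    """Return IDs of markers (chrom, bp position, marker ID) inside the gene.

    Boundaries are treated as inclusive, which is an assumption.
    """
    return [snp for chrom, pos, snp in markers
            if chrom == gene.chrom and gene.start <= pos <= gene.stop]
```

For example, parsing the two lines `7293 TNFRSF4 1 1136569 1139375` and `51150 SDF4 1 1142151 1157274` yields two records, and a marker on chromosome 1 at position 1137000 would be assigned to TNFRSF4 for collapsing.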
