Stefan Stefanov, James Lautenberger, Bert Gold.
Abstract
We developed an efficient pipeline to analyze single nucleotide polymorphism (SNP) scan results from genome-wide association studies. Perl scripts were used to convert genotypes called with the BRLMM algorithm into a modified Prettybase (PB) format. We computed summary statistics characterizing our case and control populations, including allele counts, missing values, heterozygosity, measures of compliance with Hardy-Weinberg equilibrium, and several population difference statistics. We also computed association tests at every SNP, including exact tests of association for genotypes and alleles, the Cochran-Armitage linear trend test, and tests under dominant, recessive, and overdominant models. Pairwise linkage disequilibrium statistics were computed with the command-line version of HaploView, which a reformatting script made possible. Additional Perl scripts load the results into a MySQL database coupled to a Generic Genome Browser (gbrowse) for comprehensive visualization. The browser incorporates a download feature that provides users with the actual case and control genotypes in associated genomic regions. Thus, re-analysis "on the fly" is possible for casual browser users from anywhere on the Internet.
Keywords: GWAS; SNP; genetic association; genetic epidemiology; single nucleotide polymorphism
Year: 2008 PMID: 19096721 PMCID: PMC2603547 DOI: 10.4137/cin.s966
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Input File Formats. A) Schematic structure of a traditional Prettybase file with the four columns delineated. The leftmost column is the displacement relative to a reference value; the second column is a patient identifier (PID); and the third and fourth columns are the allele 1 and allele 2 base calls (A1 A2). B) A modified format, designated a Modified PB file. The leftmost displacement field is an 11-digit number: the first two digits are the chromosome number (X is 23, Y is 24), and the remaining 9 digits are the displacement relative to the beginning of the chromosome. The PID consists of an alphanumeric identifier preceded by a group identifier (for example, U for Utah or Z for Zuni Native American), and the third and fourth columns are the allele 1 and allele 2 base calls (A1 A2). Additional fields appended on the right may include dbSNP identifiers, Affymetrix probe identifiers, or confidence values, as shown in the right-hand portion of panel B. C) Cartoon clarifying the relationship between the 11-digit chromosome-displacement identifier (ChrDisp) and the population subgroup PID, ordered through a UNIX sort.
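The 11-digit ChrDisp layout described in panel B can be illustrated with a short Python sketch. This is not the authors' code (their pipeline uses Perl scripts). The function names, the whitespace-delimited column layout, the single-character group prefix, and the sample record are all assumptions made for illustration.

```python
# Hypothetical helpers illustrating the ChrDisp key from the Figure 1 caption.
# Names and the whitespace-delimited layout are assumptions, not the authors' code.

CHROM_CODES = {"X": 23, "Y": 24}  # per the caption: X is 23, Y is 24
CODE_TO_CHROM = {code: name for name, code in CHROM_CODES.items()}


def encode_chrdisp(chrom, position):
    """Build the 11-digit key: 2-digit chromosome code + 9-digit displacement."""
    code = CHROM_CODES.get(str(chrom).upper())
    if code is None:
        code = int(chrom)  # autosomes are given as "1".."22"
    if not 1 <= code <= 24:
        raise ValueError(f"unsupported chromosome: {chrom!r}")
    if not 0 <= position < 10**9:
        raise ValueError(f"displacement does not fit in 9 digits: {position}")
    return f"{code:02d}{position:09d}"


def decode_chrdisp(key):
    """Split an 11-digit key back into (chromosome label, displacement)."""
    if len(key) != 11 or not key.isdigit():
        raise ValueError(f"expected 11 digits, got {key!r}")
    code = int(key[:2])
    return CODE_TO_CHROM.get(code, str(code)), int(key[2:])


def parse_modified_pb_line(line):
    """Parse one record: ChrDisp, PID, allele 1, allele 2, then optional fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}")
    chrom, position = decode_chrdisp(fields[0])
    return {
        "chrom": chrom,
        "position": position,
        "pid": fields[1],
        "group": fields[1][0],  # assumption: the group prefix is one character
        "allele1": fields[2],
        "allele2": fields[3],
        "extra": fields[4:],  # e.g. dbSNP ID, probe ID, confidence value
    }


if __name__ == "__main__":
    # Made-up record for illustration only.
    print(parse_modified_pb_line("23000012345 U0001 A G rs0000"))
```

Because the key is zero-padded to a fixed width, a plain lexical sort of the keys agrees with ordering by chromosome and then by position, which is consistent with the UNIX sort mentioned in panel C.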
Figure 2. Analysis Pipeline Workflow. The pipeline was initially written for the analysis of Affymetrix genotyping microarrays, but has since been adapted to Illumina BeadArrays. The upper portion of the figure shows the workflow for dual-style (500 K) microarrays; the same workflow has also been used for single-chip Affy 6.0 microarrays. Illumina BeadArray 317 K, 500 K, 550 K, 650 K and 1 M data feed into the workflow on the right-hand side, after flat table export of a BeadStudio V3.x “Full Data Table Output” file. The array data provide the basis for assembling the modified PB file, which is the raw input for the analytic pipeline.