Marcin Kierczak1, Jagoda Jabłońska2, Simon K G Forsberg3, Matteo Bianchi2, Katarina Tengvall2, Mats Pettersson1, Veronika Scholz2, Jennifer R S Meadows2, Patric Jern2, Örjan Carlborg3, Kerstin Lindblad-Toh4. 1. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden, Computational Genetics Section, Department of Clinical Sciences, Swedish University of Agricultural Sciences, Uppsala, Sweden and. 2. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden. 3. Computational Genetics Section, Department of Clinical Sciences, Swedish University of Agricultural Sciences, Uppsala, Sweden and. 4. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden, Broad Institute of MIT and Harvard, Boston, MA, USA.
Abstract
UNLABELLED: High-throughput genotyping and sequencing technologies facilitate studies of complex genetic traits and provide new research opportunities. The increasing popularity of genome-wide association studies (GWAS) leads to the discovery of new associated loci and a better understanding of the genetic architecture underlying not only diseases, but also other monogenic and complex phenotypes. Several softwares are available for performing GWAS analyses, R environment being one of them. RESULTS: We present cgmisc, an R package that enables enhanced data analysis and visualization of results from GWAS. The package contains several utilities and modules that complement and enhance the functionality of the existing software. It also provides several tools for advanced visualization of genomic data and utilizes the power of the R language to aid in preparation of publication-quality figures. Some of the package functions are specific for the domestic dog (Canis familiaris) data. AVAILABILITY AND IMPLEMENTATION: The package is operating system-independent and is available from: https://github.com/cgmisc-team/cgmisc CONTACT: marcin.kierczak@imbim.uu.se. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
UNLABELLED: High-throughput genotyping and sequencing technologies facilitate studies of complex genetic traits and provide new research opportunities. The increasing popularity of genome-wide association studies (GWAS) leads to the discovery of new associated loci and a better understanding of the genetic architecture underlying not only diseases, but also other monogenic and complex phenotypes. Several softwares are available for performing GWAS analyses, R environment being one of them. RESULTS: We present cgmisc, an R package that enables enhanced data analysis and visualization of results from GWAS. The package contains several utilities and modules that complement and enhance the functionality of the existing software. It also provides several tools for advanced visualization of genomic data and utilizes the power of the R language to aid in preparation of publication-quality figures. Some of the package functions are specific for the domestic dog (Canis familiaris) data. AVAILABILITY AND IMPLEMENTATION: The package is operating system-independent and is available from: https://github.com/cgmisc-team/cgmisc CONTACT: marcin.kierczak@imbim.uu.se. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
High-throughput genotyping and sequencing has opened several new research opportunities to study complex genetic traits and genome-wide association studies (GWAS) is a popular way to analyse genotyping data from segregating populations. Widely used GWAS softwares include PLINK (Purcell ), EMMAX (Kang ), GCTA (Yang ) and GenABEL (Aulchenko ). A single software package is rarely sufficiently complete to cover all aspects of a typical genome-wide analysis pipeline. Transferring data between different softwares is often a laborious process. One advantage of the GenABEL package, that often makes it the software of choice, is that in addition to GWAS-specific functionalities, it provides access to the R (R Development Core Team, 2008) language and community-contributed packages. Here we developed a number of algorithms and solutions to several common GWAS tasks. Some of these solutions aim at facilitating production of publication-quality data/results visualization (see, e.g. Fig. 1). Several cgmisc functions have been used to produce results and visualizations for peer-reviewed publications, e.g. Tengvall , Owczarek-Lipska and Olsson . Here, we present these and more functions in the form of the documented and supported R package cgmisc.
Fig. 1.
An example figure generated using the cgmisc package: p-values from Fisher’s exact test for allele counts highlight the most divergent regions between two populations. Colour of the points correspond to their LD (r2) with the most significant marker (the reference)
An example figure generated using the cgmisc package: p-values from Fisher’s exact test for allele counts highlight the most divergent regions between two populations. Colour of the points correspond to their LD (r2) with the most significant marker (the reference)
2 Description
cgmisc (ver. 2.9.10), provides 34 functions for the analysis and visualization of GWAS data. A few functions in the package are tailored for working with data from the domestic dog (Canis familiaris) but are easy to adjust for analysing other species. Some functions rely on third-party softwares which are freely available for research purposes. Internally, cgmisc functions use data structures implemented in the GenABEL package. For all functions and parameters in the package, we use the period-separated naming convention (Bååth, 2012). Functions provided by the cgmisc package can be grouped into the following categories:Analyses of population structure. Population strata can be compared based on their allele-frequency differences, using either (i) fixation index F or (ii) Fisher’s exact test for reference allele count observed versus the allele count expected under the null hypothesis of no population structure.Tools related to association scans. Enhanced quantile–quantile (qq) plot showing (i) theoretical and (ii) empirical confidence intervals as well as (iii) empirical significance thresholds. We also implemented an extended version of the Manhattan plot, with colour-coded information on linkage disequilibrium (LD) between a selected marker and its neighbours plus a minor-allele frequency panel. Easy ways of interfacing variance GWAS scans (vGWAS; Shen ) and bigRR (Shen ) packages (BLUP, ridge regression) as well as simple visualization of per-genotype distribution of phenotypic values are provided. We also complement the standard tests for association with a basic scan for gene-gene interaction (epistasis).Heterozygosity analyses. We provide functions for the detection and visualization of runs of homozygosity along the genome to facilitate the detection of suggestive selective sweeps and highlight regions that may be challenging for standard association mapping tools.Analyses and visualization of linkage structure. The cgmisc package provides tools for assessing average haplotype lengths by visualization of LD-decay as a function of the distance between markers. In addition, the package offers export functions that enable haplotype phasing using PHASE (Stephens ) and haplotype visualization using Haploview (Barrett ). In addition, we implemented the marker clumping procedure used by PLINK.Improved annotation. The package provides functions for genome annotation in the domestic dog (Canis familiaris, canFam3.1 assembly), offers the direct interaction with the UCSC Genome Browser (Kuhn ) and improved analyses of pseudo-autosomal regions on the X chromosome. In addition, we provide a convenient method for retrieving and plotting information on endogenous retroviral sequences identified by the RetroTector software (Sperber ).Data subsetting, manipulation and visualization. cgmisc can generate windows for sliding-window (also with overlap) and jumping-window type analyses. A number of convenience functions enables users to, e.g. retrieve information about LD or chromosome start/end point coordinates.All functions were designed to be user-friendly with attention to the quality of visualizations. Complete documentation is available upon cgmisc installation. In order to facilitate package usage, we included a quick tutorial (package vignette in the supplementary information) that takes the user through all steps necessary to use each of the package functions. The tutorial is based on the included example dataset. A detailed description of the methods and algorithms used by the functions is provided in the vignette and documentation.
Funding
M.K. was supported by the Swedish Foundation for Strategic Research. J.R.S.M., M.K. and Ö.C. received support from FORMAS. M.K., S.F. and Ö.C. were supported by the Swedish Research Council. K.L.-T. and M.K. were supported by the European Research Council. J.J. was supported by the European Commission, Erasmus mobility grant.Conflict of Interest: none declared.
Authors: Marta Owczarek-Lipska; Béatrice Lauber; Vivianne Molitor; Sabrina Meury; Marcin Kierczak; Katarina Tengvall; Matthew T Webster; Vidhya Jagannathan; Yvette Schlotter; Ton Willemse; Anke Hendricks; Kerstin Bergvall; Ake Hedhammar; Göran Andersson; Kerstin Lindblad-Toh; Claude Favrot; Petra Roosje; Eliane Marti; Tosso Leeb Journal: PLoS One Date: 2012-06-15 Impact factor: 3.240
Authors: Mia Olsson; Linda Tintle; Marcin Kierczak; Michele Perloski; Noriko Tonomura; Andrew Lundquist; Eva Murén; Max Fels; Katarina Tengvall; Gerli Pielberg; Caroline Dufaure de Citres; Laetitia Dorso; Jérôme Abadie; Jeanette Hanson; Anne Thomas; Peter Leegwater; Åke Hedhammar; Kerstin Lindblad-Toh; Jennifer R S Meadows Journal: PLoS One Date: 2013-10-09 Impact factor: 3.240
Authors: Biaty Raymond; Anna Maria Johansson; Heather Anne McCormack; Robert Hall Fleming; Matthias Schmutz; Ian Chisholm Dunn; Dirk Jan De Koning Journal: J Anim Sci Date: 2018-06-29 Impact factor: 3.159
Authors: Lea I Mikkola; Saila Holopainen; Anu K Lappalainen; Tiina Pessa-Morikawa; Thomas J P Augustine; Meharji Arumilli; Marjo K Hytönen; Osmo Hakosalo; Hannes Lohi; Antti Iivanainen Journal: PLoS Genet Date: 2019-07-19 Impact factor: 5.917
Authors: Karen M Vernau; Eduard Struys; Anna Letko; Kevin D Woolard; Miriam Aguilar; Emily A Brown; Derek D Cissell; Peter J Dickinson; G Diane Shelton; Michael R Broome; K Michael Gibson; Phillip L Pearl; Florian König; Thomas J Van Winkle; Dennis O'Brien; B Roos; Kaspar Matiasek; Vidhya Jagannathan; Cord Drögemüller; Tamer A Mansour; C Titus Brown; Danika L Bannasch Journal: Genes (Basel) Date: 2020-09-02 Impact factor: 4.096