gpart: human genome partitioning and visualization of high-density SNP data by identifying haplotype blocks.

Sun Ah Kim¹, Myriam Brossard², Delnaz Roshandel³, Andrew D Paterson³,⁴, Shelley B Bull²,⁴, Yun Joo Yoo⁵,⁶.

Abstract

SUMMARY: For the analysis of high-throughput genomic data produced by next-generation sequencing (NGS) technologies, researchers need to identify linkage disequilibrium (LD) structure in the genome. In this work, we developed an R package gpart which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs. The visualization tool in gpart can display the LD structure and gene positions for up to 20 000 SNPs in one image. The gpart functions facilitate construction of LD blocks and SNP partitions for vast amounts of genome sequencing data within reasonable time and memory limits in personal computing environments.
AVAILABILITY AND IMPLEMENTATION: The R package is available at https://bioconductor.org/packages/gpart.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31070701 PMCID: PMC6821423 DOI: 10.1093/bioinformatics/btz308

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

In recent genome-wide association studies (GWAS) and population genetic studies, researchers increasingly investigate dense single nucleotide polymorphism (SNP) data produced by new sequencing technologies (Kilpinen and Barrett, 2013). To reduce the dimension of high-throughput genomic data for genetic association analysis, or to find evidence for population genetic phenomena, one can utilize genomic linkage disequilibrium (LD) structure, especially LD blocks (or haplotype blocks). Most algorithms and software for identifying LD blocks from SNP genotype data were developed before the era of deep sequencing technology. To determine the LD blocks, Gabriel et al. (2002) proposed a method based on estimation of the confidence interval of D′. Zhang et al. (2002, 2003) developed a dynamic programming algorithm to detect common haplotypes in a block. Wang et al. (2002) proposed an approach using a four-gamete test. Barrett et al. (2005) proposed the Solid Spine method, which finds blocks based on strong LD with markers at the block boundary, and Pattaro et al. (2008) developed a method based on an MCMC algorithm. As reported in Kim et al. (2018), the previous methods and definitions for LD blocks (Gabriel et al., 2002; Pattaro et al., 2008; Wang et al., 2002) do not serve well to identify long-range LD blocks in sequencing data such as those available in the 1000 Genomes Project. We previously proposed a new method of LD block construction called Big-LD (Kim et al., 2018), using graph-based clustering techniques. We showed that Big-LD produces larger blocks, achieves better optimization in terms of LD strength within and across LD blocks, and agrees better with recombination hotspots, compared to existing approaches such as the methods implemented in Haploview (Barrett et al., 2005; Gabriel et al., 2002; Wang et al., 2002) or related methods (Pattaro et al., 2008; Taliun et al., 2014, 2016).
In this R/Bioconductor implementation, gpart, we provide a new SNP partitioning method based not only on LD block structures but also on gene positions, together with a visualization tool to display an LD heatmap with LD block partitioning information and gene positions. The algorithm GPART uses an updated version of Big-LD, which can deal with both the r² and D′ LD measures and has improved speed and memory efficiency for construction of LD blocks by means of a new heuristic algorithm.
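As an illustration of the two LD measures mentioned above, the following minimal Python sketch computes r² and |D′| for a pair of biallelic SNPs. It is not code from gpart: the function name `ld_measures` and the input format (phased 0/1 haplotype vectors) are assumptions made for this example, whereas the package itself works from SNP genotype data.

```python
def ld_measures(hap_a, hap_b):
    """Return (r2, d_prime) for two biallelic SNPs.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (an input format assumed for this sketch only).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n   # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n   # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b   # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom
    # D' rescales D by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max


# Four haplotypes; only three of the four possible gametes are observed.
r2, d_prime = ld_measures([1, 1, 0, 0], [1, 0, 0, 0])
# Here r2 is about 0.33 while d_prime is 1.0.
```

The usage example shows why the choice of measure matters for block construction: a SNP pair can have |D′| = 1 while r² is well below 1.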

2 Implementation and main functions

The R package gpart provides three main functions, BigLD, GPART and LDblockHeatmap, and is available at the Bioconductor repository (https://bioconductor.org/packages/gpart). The package contains a vignette that explains the functions and their options in detail, illustrated by various examples and figures.

2.1 Updated version of Big-LD

Big-LD is a method to identify LD blocks using SNPs (Kim et al., 2018). The results of the Big-LD algorithm can be obtained using the BigLD function in the gpart package. In gpart, the Big-LD algorithm adopts an updated version of the published CLQ algorithm (Kim et al., 2018) that finds LD bins using a newly added heuristic algorithm (the near-nonhrst algorithm, detailed in Supplementary Methods), which has been extended to account for both LD measures (r² and D′). Although the new heuristic algorithm is not as fast as the existing heuristic CLQ algorithm (the fast algorithm), it returns results more similar to those obtained by the non-heuristic CLQ algorithm in a reasonable time (Supplementary Table S1). Users can choose a CLQ mode (maximal/density) and a heuristic algorithm (nonhrst/fast/near-nonhrst) depending on their research aim or computational environment (see Supplementary Results, Supplementary Tables S2 and S3). We applied BigLD to 1000 Genomes Project phase 3 data for MAF > 5% (Supplementary Table S4) and to a GWAS dataset (Supplementary Table S5) (Roshandel et al., 2018).
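To give a flavour of graph-based clustering of SNPs, the Python sketch below groups SNPs into bins whose members are all pairwise in LD at or above a threshold, i.e. cliques in a thresholded LD graph. This is a simple greedy heuristic written for illustration only. It is not the CLQ algorithm or any of its nonhrst/fast/near-nonhrst variants, and the function name `greedy_ld_bins`, the threshold name `cut` and the dense-matrix input are all assumptions.

```python
def greedy_ld_bins(ld, cut=0.5):
    """Greedily partition SNP indices into bins of pairwise-linked SNPs.

    ld:  symmetric matrix (list of lists) of pairwise LD values.
    cut: LD threshold for connecting two SNPs by an edge (hypothetical name).
    """
    remaining = set(range(len(ld)))
    bins = []
    while remaining:
        # Adjacency restricted to SNPs that are not yet assigned to a bin.
        adj = {i: {j for j in remaining if j != i and ld[i][j] >= cut}
               for i in remaining}
        # Seed with the most connected SNP (ties: smallest index).
        seed = min(remaining, key=lambda i: (-len(adj[i]), i))
        clique = [seed]
        candidates = set(adj[seed])
        while candidates:
            # Add the candidate linked to the most other candidates, then
            # keep only candidates that are also linked to the new member.
            nxt = min(candidates, key=lambda i: (-len(adj[i] & candidates), i))
            clique.append(nxt)
            candidates &= adj[nxt]
        bins.append(sorted(clique))
        remaining -= set(clique)
    return bins


ld = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.6, 0.1],
    [0.1, 0.2, 0.6, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]
print(greedy_ld_bins(ld))  # [[0, 1, 2], [3, 4]]
```

Because finding a maximum clique exactly is computationally expensive, any practical method needs some heuristic at this step, which is the kind of speed versus accuracy trade-off the choice of heuristic algorithm exposes to the user.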

2.2 GPART: SNP partitioning method

We developed a SNP partitioning algorithm, GPART, which partitions sets of contiguous SNPs into blocks using the Big-LD results combined with gene position information. Big-LD considers only the LD structure within the given data; therefore, depending on the LD, the results can include a large number of singleton SNPs or extremely large LD blocks. Depending on the purpose of the downstream analysis, it can be appropriate to limit the number of SNPs in each block to increase analytical effectiveness. The GPART algorithm partitions an entire set of SNPs in a specified region so that all blocks satisfy specified minimum and maximum size limits, where size refers to the number of SNPs. The function GPART provides two method types, a gene-based method (geneBased) and an LDblock-based method (LDblockBased). The gene-based method first fuses gene position information and Big-LD blocks, then splits or merges blocks that do not meet the pre-defined size criteria. The LDblock-based method first splits large LD blocks to satisfy the pre-defined size criteria and takes the resulting pieces as new blocks. It then merges the remaining consecutive small LD blocks into new blocks of at least the minimum size. In this merging stage, as many small LD blocks as possible can be merged if the small blocks overlap with a gene region. Depending on whether gene position information is used when combining small blocks, the LDblock-based method is divided into two methods: the only-block method (onlyBlocks) and the use-gene-region method (useGeneRegions). The algorithm is detailed in Supplementary Methods. Application of GPART to 1000 Genomes Project phase 3 data and a GWAS dataset (Roshandel et al., 2018) is reported in Supplementary Results (Supplementary Tables S6 and S7).
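The split-then-merge idea behind the LDblock-based method can be sketched in Python as follows. This is a simplified illustration, not the GPART algorithm detailed in Supplementary Methods: it ignores gene regions, assumes the minimum size is at most about half the maximum size, and leaves a run of small blocks unchanged when it cannot reach the minimum size. The function names, the (start, end) index representation and the default limits are assumptions made for this example.

```python
def split_block(start, end, maxsize):
    """Split an inclusive SNP-index range into near-equal pieces of at most maxsize SNPs."""
    size = end - start + 1
    pieces = -(-size // maxsize)              # ceiling division
    base, extra = divmod(size, pieces)
    out, pos = [], start
    for k in range(pieces):
        length = base + (1 if k < extra else 0)
        out.append((pos, pos + length - 1))
        pos += length
    return out


def partition_blocks(blocks, minsize=4, maxsize=50):
    """Partition sorted, contiguous (start, end) inclusive index ranges.

    Singleton SNPs are represented as blocks with start == end.
    """
    # Step 1: split every block that exceeds maxsize.
    pieces = []
    for start, end in blocks:
        if end - start + 1 > maxsize:
            pieces.extend(split_block(start, end, maxsize))
        else:
            pieces.append((start, end))
    # Step 2: merge consecutive small blocks until a run reaches minsize.
    result, run = [], None
    for start, end in pieces:
        if end - start + 1 >= minsize:
            if run is not None:
                result.append(run)            # leftover run shorter than minsize
                run = None
            result.append((start, end))
        else:
            run = (start, end) if run is None else (run[0], end)
            if run[1] - run[0] + 1 >= minsize:
                result.append(run)
                run = None
    if run is not None:
        result.append(run)
    return result


blocks = [(0, 0), (1, 1), (2, 3), (4, 15), (16, 16), (17, 26)]
print(partition_blocks(blocks, minsize=4, maxsize=6))
# [(0, 3), (4, 9), (10, 15), (16, 16), (17, 21), (22, 26)]
```

In the output, the three small leading blocks are merged into one block of four SNPs and the two oversized blocks are each halved. The isolated singleton at index 16 remains below the minimum size, which is precisely the kind of case that a full method must resolve, for example by using gene region information when merging.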

2.3 LDblockHeatmap: visualization function to show LD structure and gene positions

The LDblockHeatmap function provides plotting capabilities to visualize the LD heatmap, the LD block boundaries of Big-LD results or the genomic sequence partitioning results of GPART, and the physical locations of LD blocks and genes (Fig. 1). The function displays gene regions when gene positions are provided and can draw a figure including up to 20 000 SNPs. See Supplementary Figures S1–S3 for examples using various numbers of SNPs. For datasets with fewer than 200 SNPs, the LD bin structure obtained by the CLQ algorithm can be visualized (Fig. 1 and Supplementary Fig. S1). The LD heatmap can also be visualized without Big-LD results or gene positions.

Fig. 1.

Example plot produced by LDblockHeatmap function with GPART result (see Supplementary Fig. S1 for detailed explanation of each component of the plot)

For various examples plotted by LDblockHeatmap, see the vignette of the package gpart.

3 Conclusion

In this paper, we introduce an R package, called gpart, which provides novel functions to cluster and partition a given genomic region by modeling the underlying LD structures of the SNPs as graphs. In addition, the package offers an efficient visualization function to display the obtained results with genomic information. The package gpart is available at Bioconductor.

Funding

This work was supported by the National Research Foundation of Korea (NRF) [NRF-2018R1A2B6008016]; the Canadian Institutes of Health Research (CIHR) [Operating/Projects MOP-84287 & PJT 159463]; the Canadian Statistical Sciences Institute; and the Canadian Institutes of Health Research Strategic Training for Advanced Genetic Epidemiology (CIHR STAGE) [GET-101831].

Conflict of Interest: none declared.

References

1. A dynamic programming algorithm for haplotype block partitioning.

Authors: Kui Zhang; Minghua Deng; Ting Chen; Michael S Waterman; Fengzhu Sun
Journal: Proc Natl Acad Sci U S A Date: 2002-05-28 Impact factor: 11.205

2. The structure of haplotype blocks in the human genome.

Authors: Stacey B Gabriel; Stephen F Schaffner; Huy Nguyen; Jamie M Moore; Jessica Roy; Brendan Blumenstiel; John Higgins; Matthew DeFelice; Amy Lochner; Maura Faggart; Shau Neen Liu-Cordero; Charles Rotimi; Adebowale Adeyemo; Richard Cooper; Ryk Ward; Eric S Lander; Mark J Daly; David Altshuler
Journal: Science Date: 2002-05-23 Impact factor: 47.728

3. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation.

Authors: Ning Wang; Joshua M Akey; Kun Zhang; Ranajit Chakraborty; Li Jin
Journal: Am J Hum Genet Date: 2002-10-15 Impact factor: 11.025

4. HaploBlockFinder: haplotype block analyses.

Authors: Kun Zhang; Li Jin
Journal: Bioinformatics Date: 2003-07-01 Impact factor: 6.937

5. Haploview: analysis and visualization of LD and haplotype maps.

Authors: J C Barrett; B Fry; J Maller; M J Daly
Journal: Bioinformatics Date: 2004-08-05 Impact factor: 6.937

6. A new haplotype block detection method for dense genome sequencing data based on interval graph modeling of clusters of highly correlated SNPs.

Authors: Sun Ah Kim; Chang-Sung Cho; Suh-Ryung Kim; Shelley B Bull; Yun Joo Yoo
Journal: Bioinformatics Date: 2018-02-01 Impact factor: 6.937

7. How next-generation sequencing is transforming complex disease genetics.

Authors: Helena Kilpinen; Jeffrey C Barrett
Journal: Trends Genet Date: 2012-10-25 Impact factor: 11.639

8. Meta-genome-wide association studies identify a locus on chromosome 1 and multiple variants in the MHC region for serum C-peptide in type 1 diabetes.

Authors: Delnaz Roshandel; Rose Gubitosi-Klug; Shelley B Bull; Angelo J Canty; Marcus G Pezzolesi; George L King; Hillary A Keenan; Janet K Snell-Bergeon; David M Maahs; Ronald Klein; Barbara E K Klein; Trevor J Orchard; Tina Costacou; Michael N Weedon; Richard A Oram; Andrew D Paterson
Journal: Diabetologia Date: 2018-02-05 Impact factor: 10.460

9. Haplotype block partitioning as a tool for dimensionality reduction in SNP association studies.

Authors: Cristian Pattaro; Ingo Ruczinski; Danièle M Fallin; Giovanni Parmigiani
Journal: BMC Genomics Date: 2008-08-29 Impact factor: 3.969

10. Efficient haplotype block recognition of very long and dense genetic sequences.

Authors: Daniel Taliun; Johann Gamper; Cristian Pattaro
Journal: BMC Bioinformatics Date: 2014-01-14 Impact factor: 3.169
