Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

Florian Privé, Hugues Aschard, Andrey Ziyatdinov, Michael G B Blum.

Abstract

Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge that severely slows down genomic analyses, making some software obsolete and limiting researchers' access to diverse analysis tools.
Results: Here we present two R packages, bigstatsr and bigsnpr, that allow large-scale genomic data to be analyzed within R. To handle large data sizes, the packages use memory-mapping to access data matrices stored on disk instead of in RAM. For data pre-processing and analysis, the packages integrate most of the commonly used tools, either through transparent system calls to existing software or through updated or improved implementations of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove single nucleotide polymorphisms in linkage disequilibrium, and algorithms to learn polygenic risk scores on millions of single nucleotide polymorphisms. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing polygenic risk scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500 000 individuals and 1 million markers on a single desktop computer.
Availability and implementation: https://privefl.github.io/bigstatsr/ and https://privefl.github.io/bigsnpr/.
Supplementary information: Supplementary data are available at Bioinformatics online.
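
The memory-mapping idea mentioned in the Results can be illustrated with a minimal sketch. This is not the packages' implementation or their R API: it is a Python illustration using only the standard library, and the file layout (column-major, 8-byte floats) and the helper names `write_matrix` and `column_means` are assumptions made for the example. It shows how per-column statistics can be computed from a matrix on disk while holding only one column in memory at a time.

```python
import mmap
import os
import struct
import tempfile

BYTES_PER_VALUE = 8  # assumed storage: one 8-byte float per matrix entry


def write_matrix(path, rows):
    """Write a dense matrix to disk in column-major order (hypothetical layout)."""
    n_rows, n_cols = len(rows), len(rows[0])
    with open(path, "wb") as f:
        for j in range(n_cols):
            column = [row[j] for row in rows]
            f.write(struct.pack(f"={n_rows}d", *column))
    return n_rows, n_cols


def column_means(path, n_rows, n_cols):
    """Compute column means through a read-only memory map of the file.

    Only the column currently being processed is copied into memory;
    the operating system pages the rest of the file in and out as needed.
    """
    means = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for j in range(n_cols):
                offset = j * n_rows * BYTES_PER_VALUE
                column = struct.unpack_from(f"={n_rows}d", mm, offset)
                means.append(sum(column) / n_rows)
    return means


if __name__ == "__main__":
    # Toy genotype-like matrix: 4 individuals (rows) by 3 variants (columns).
    toy = [[0.0, 1.0, 2.0],
           [2.0, 1.0, 0.0],
           [1.0, 1.0, 1.0],
           [1.0, 1.0, 5.0]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "matrix.bin")
        n_rows, n_cols = write_matrix(path, toy)
        print(column_means(path, n_rows, n_cols))
```

Storing the matrix column by column keeps each variant contiguous on disk, so a per-variant pass reads the file sequentially. That layout is a choice made for this sketch, not a statement about how the packages store their data.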

Year: 2018; PMID: 29617937; PMCID: PMC6084588; DOI: 10.1093/bioinformatics/bty185

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937


