TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes.

Aritra Bose, Vassilis Kalantzis, Eugenia-Maria Kontopoulou, Mai Elkady, Peristera Paschou, Petros Drineas.

Abstract

MOTIVATION: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets grow increasingly large, traditional approaches based on loading the entire dataset into system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative.
RESULTS: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task.
AVAILABILITY AND IMPLEMENTATION: Source code and documentation are both available at https://github.com/aritra90/TeraPCA.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
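
The RESULTS paragraph names two ideas: subspace iteration started from a random block, and out-of-core operation. The sketch below illustrates both in plain Python under stated assumptions; it is not TeraPCA's C++ code and does not reflect its API. All function names (`stream_gram_product`, `orthonormalize`, `subspace_iteration`) are hypothetical. It assumes the genotype matrix A (rows are individuals, columns are markers) is already mean-centered, is delivered in row blocks by a re-callable generator, and has at least k nonzero, well-separated singular values. It also simplifies the usual method by omitting the final Rayleigh-Ritz projection and simply running more iterations.

```python
import random

def stream_gram_product(row_blocks, Q, n, k):
    """Return Y = A^T (A Q), accumulated one row block at a time.

    A itself is never held in memory as a whole; only the n x k
    matrices Q and Y are. This is the out-of-core part of the sketch.
    """
    Y = [[0.0] * k for _ in range(n)]
    for block in row_blocks():            # each block: a list of rows of length n
        for row in block:
            # t = row . Q  (one row of the product A Q)
            t = [sum(row[i] * Q[i][j] for i in range(n)) for j in range(k)]
            # Y += outer(row, t)  (this row's contribution to A^T (A Q))
            for i in range(n):
                if row[i] != 0.0:
                    for j in range(k):
                        Y[i][j] += row[i] * t[j]
    return Y

def orthonormalize(Y):
    """Modified Gram-Schmidt on the columns of an n x k list of lists."""
    n, k = len(Y), len(Y[0])
    Q = [r[:] for r in Y]
    for j in range(k):
        for p in range(j):
            dot = sum(Q[i][p] * Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] -= dot * Q[i][p]
        norm = sum(Q[i][j] ** 2 for i in range(n)) ** 0.5
        for i in range(n):
            Q[i][j] /= norm               # assumes the column is not (near) zero
    return Q

def subspace_iteration(row_blocks, n, k, iters=50, seed=0):
    """Approximate the k leading right singular vectors of A.

    Starts from a random Gaussian n x k block and repeatedly applies
    A^T A followed by re-orthonormalization.
    """
    rng = random.Random(seed)
    Q = orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)])
    for _ in range(iters):
        Q = orthonormalize(stream_gram_product(row_blocks, Q, n, k))
    return Q
```

A caller would pass a function that re-opens the data source on each call and yields row blocks, for example `subspace_iteration(blocks, n=3, k=2)`. Each pass over the data needs memory proportional to n times k plus one block, which is why this style of iteration suits machines with only a few gigabytes of RAM.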

Year:  2019        PMID: 30957838     DOI: 10.1093/bioinformatics/btz157

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  7 in total

1.  Impact of Clinical and Genomic Factors on COVID-19 Disease Severity.

Authors:  Sanjoy Dey; Aritra Bose; Subrata Saha; Prithwish Chakraborty; Mohamed Ghalwash; Aldo Guzmán-Sáenz; Filippo Utro; Kenney Ng; Jianying Hu; Laxmi Parida; Daby Sow
Journal:  AMIA Annu Symp Proc       Date:  2022-02-21

2.  Reconstructing SNP allele and genotype frequencies from GWAS summary statistics.

Authors:  Zhiyu Yang; Peristera Paschou; Petros Drineas
Journal:  Sci Rep       Date:  2022-05-17       Impact factor: 4.996

3.  Integrating Linguistics, Social Structure, and Geography to Model Genetic Diversity within India.

Authors:  Aritra Bose; Daniel E Platt; Laxmi Parida; Petros Drineas; Peristera Paschou
Journal:  Mol Biol Evol       Date:  2021-05-04       Impact factor: 16.240

4.  Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs.

Authors:  Jinliang Wang
Journal:  Heredity (Edinb)       Date:  2022-05-04       Impact factor: 3.832

5.  Benchmarking principal component analysis for large-scale single-cell RNA-sequencing.

Authors:  Koki Tsuyuzaki; Hiroyuki Sato; Kenta Sato; Itoshi Nikaido
Journal:  Genome Biol       Date:  2020-01-20       Impact factor: 13.583

6.  Efficient toolkit implementing best practices for principal component analysis of population genetic data.

Authors:  Florian Privé; Keurcien Luu; Michael G B Blum; John J McGrath; Bjarni J Vilhjálmsson
Journal:  Bioinformatics       Date:  2020-08-15       Impact factor: 6.937

7.  Scalable probabilistic PCA for large-scale genetic variation data.

Authors:  Aman Agrawal; Alec M Chiu; Minh Le; Eran Halperin; Sriram Sankararaman
Journal:  PLoS Genet       Date:  2020-05-29       Impact factor: 6.020
