TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes.

Aritra Bose¹, Vassilis Kalantzis², Eugenia-Maria Kontopoulou¹, Mai Elkady¹, Peristera Paschou³, Petros Drineas¹.

Abstract

MOTIVATION: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets grow ever larger, traditional approaches that load the entire dataset into system memory (Random Access Memory) become impractical, and out-of-core implementations are the only viable alternative.
RESULTS: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task.
AVAILABILITY AND IMPLEMENTATION: Source code and documentation are both available at https://github.com/aritra90/TeraPCA.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
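
For readers unfamiliar with the method named in the abstract, the following is a minimal NumPy sketch of randomized subspace iteration applied block by block, so that only one slice of the data matrix needs to be in memory at a time. It is an illustration under stated assumptions, not TeraPCA's C++ implementation: the function name `randomized_subspace_pca`, its parameters, and the choice to iterate on the product of the matrix transpose with the matrix are all hypothetical, and the input columns are assumed to be mean-centered already.

```python
import numpy as np


def randomized_subspace_pca(blocks, n_features, n_components=10,
                            oversample=5, n_iter=10, seed=0):
    """Approximate the leading right singular vectors of a row-blocked matrix A.

    Hypothetical interface (not TeraPCA's API): `blocks` is a zero-argument
    callable returning an iterator over row blocks of A, each a 2-D array
    with `n_features` columns. Columns are assumed to be mean-centered.
    """
    rng = np.random.default_rng(seed)
    k = n_components + oversample  # a few extra vectors improve accuracy

    # Random Gaussian starting basis, orthonormalized.
    Q, _ = np.linalg.qr(rng.standard_normal((n_features, k)))

    for _ in range(n_iter):
        # One pass over the data accumulates (A^T A) Q block by block.
        Z = np.zeros((n_features, k))
        for block in blocks():
            Z += block.T @ (block @ Q)
        # Re-orthonormalize to keep the iteration numerically stable.
        Q, _ = np.linalg.qr(Z)

    # Rayleigh-Ritz step: project A^T A onto span(Q) and diagonalize
    # the small k-by-k matrix.
    T = np.zeros((k, k))
    for block in blocks():
        P = block @ Q
        T += P.T @ P
    eigvals, eigvecs = np.linalg.eigh(T)
    order = np.argsort(eigvals)[::-1][:n_components]

    components = (Q @ eigvecs[:, order]).T  # shape (n_components, n_features)
    singular_values = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return components, singular_values
```

In this sketch each iteration reads the data once, so the memory footprint is set by the block size and by the `n_features`-by-`k` basis rather than by the full matrix. Projecting a block onto the result (`block @ components.T`) gives the principal component scores for the rows in that block.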

Year: 2019 PMID: 30957838 DOI: 10.1093/bioinformatics/btz157

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited (7 in total)

1. Impact of Clinical and Genomic Factors on COVID-19 Disease Severity.

2. Reconstructing SNP allele and genotype frequencies from GWAS summary statistics.

3. Integrating Linguistics, Social Structure, and Geography to Model Genetic Diversity within India.

4. Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs.

5. Benchmarking principal component analysis for large-scale single-cell RNA-sequencing.

6. Efficient toolkit implementing best practices for principal component analysis of population genetic data.

7. Scalable probabilistic PCA for large-scale genetic variation data.