Laura Balagué-Dobón1, Alejandro Cáceres1, Juan R González1.
Abstract
Single nucleotide polymorphisms (SNPs) are the most abundant type of genomic variation and the most accessible to genotype in large cohorts. However, they individually explain a small proportion of phenotypic differences between individuals. Ancestry, collective SNP effects, structural variants, somatic mutations or even differences in historic recombination can potentially explain a high percentage of genomic divergence. These genetic differences can be infrequent or laborious to characterize; however, many of them leave distinctive marks on the SNPs across the genome allowing their study in large population samples. Consequently, several methods have been developed over the last decade to detect and analyze different genomic structures using SNP arrays, to complement genome-wide association studies and determine the contribution of these structures to explain the phenotypic differences between individuals. We present an up-to-date collection of available bioinformatics tools that can be used to extract relevant genomic information from SNP array data including population structure and ancestry; polygenic risk scores; identity-by-descent fragments; linkage disequilibrium; heritability and structural variants such as inversions, copy number variants, genetic mosaicisms and recombination histories. From a systematic review of recently published applications of the methods, we describe the main characteristics of R packages, command-line tools and desktop applications, both free and commercial, to help make the most of a large amount of publicly available SNP data.
Keywords: GWAS; SNP arrays; bioinformatic methods; genomic structures; software; structural variants
Year: 2022 PMID: 35211719 PMCID: PMC8921734 DOI: 10.1093/bib/bbac043
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Top-five tools (the most cited between January 2020 and September 2021 in PubMed) for the study of population structure and ancestry with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| STRUCTURE | Desktop App; Command-line (C) | Free | Text file with genotypes and other optional fields | Bayesian with multiple tuning parameters | The first one; works with any type of multilocus genotype data | 2000 | [ |
| EIGENSOFT | Command-line | Free | PLINK | Principal Components Analysis (PCA) | Combination of SMARTPCA and EIGENSTRAT; specific for case/control studies | 2006 | [ |
| ADMIXTOOLS | R Package; Command-line (C) | Free | ‘ind’ file, ‘snp’ file and ‘geno’ file | Several methods | Infers proportion and dates of mixtures | 2012 | [ |
| fastSTRUCTURE | Command-line (Python) | Free | Binary PLINK (BED/BIM/FAM) | Bayesian framework | Fast | 2014 | [ |
| fineSTRUCTURE | Command-line | Free | Phased Haplotypes | ChromoPainter [ | Fine-scale population structure | 2012 | [ |
Top-five tools (the most cited between January 2020 and September 2021 in PubMed) for the study of LD with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| Haploview | Desktop App; Command-line (JAVA) | Free | Linkage format; phased haplotypes; HapMap Project data dumps; PHASE; PLINK | Two-marker Expectation Maximization (EM) | Suite of tools | 2005 | [ |
| Big-LD | R package (gpart) | Free | Text file with genotypes | Interval graph modeling of LD bins | Visualization options | 2018 | [ |
| ALOHOMORA | Desktop App (Perl) | Free | Genotype data generated by GeneChip DNA Analysis Software (GDAS v3.0) from Affymetrix; for other chips: MAP file, allele frequency file and genotype file in the Alohomora format | Several | Visualization options | 2005 | [ |
| VarLD | Command-line (JAVA) | Free | Text file with genotypes | Quantification of the LD by the signed r² metric | Performs inter-population comparisons | 2010 | [ |
| LDExplorer | R Package | Free | Phased genotypes in VCF or HAPMAP2 format | MIG Algorithms | Deals with SNPs at any distance | 2014 | [ |
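
Several of the LD tools above report pairwise statistics such as r². As a minimal sketch of what that statistic measures, the following Python function computes D and r² for two biallelic SNPs from phased haplotypes coded 0/1. The function name `ld_d_and_r2` and the input layout are assumptions made for illustration; this is not the implementation used by any tool in the table (Haploview, for instance, uses an EM step to handle unphased genotypes).

```python
def ld_d_and_r2(hap_a, hap_b):
    """Return (D, r^2) for two biallelic SNPs.

    hap_a and hap_b are equal-length lists of 0/1 alleles, one entry
    per phased haplotype (hypothetical input layout for this sketch).
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and equal in length")

    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype

    d = p_ab - p_a * p_b                                   # coefficient of disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # r^2 is undefined when either SNP is monomorphic
        return d, float("nan")
    return d, d * d / denom


if __name__ == "__main__":
    # Two SNPs carried on identical haplotypes are in complete LD.
    print(ld_d_and_r2([0, 0, 1, 1], [0, 0, 1, 1]))
    # Alleles that co-occur at the product of their frequencies show no LD.
    print(ld_d_and_r2([0, 0, 1, 1], [0, 1, 0, 1]))
```

The sign of D indicates which alleles co-occur, which is the information a signed r² metric retains.
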
Top-five tools (the most cited between January 2020 and September 2021 in PubMed) for the study of identity by descent fragments with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| RefinedIBD | Command-line (JAVA); BEAGLE software | Free | Phased data; VCF file with genotypes | GERMLINE algorithm + probabilistic approach | Does not allow genotype errors | 2013 | [ |
| GERMLINE | Command-line (C++) | Free | Phased data; PLINK haplotype data | Dynamic programming | Allows genotype errors | 2009 | [ |
| fastIBD | Command-line (JAVA); BEAGLE software | Free | Phased data; text file with genotypes | Estimation of frequencies of shared haplotypes | Fast | 2011 | [ |
| Hap-IBD | Command-line (JAVA) | Free | Phased data; VCF file with genotypes; PLINK text files (PED/MAP) | Positional Burrows-Wheeler transform (PBWT) | Fast and simple | 2020 | [ |
| RaPID | Command-line (Python) | Free | Phased data; VCF file with haplotypes | Random projection (based on the positional Burrows-Wheeler transform, PBWT) | Fast; configurable parameters | 2019 | [ |
Tools for the study of heritability with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| LD Score (LDSC) | Command-line (Python) | Free | GWAS summary statistics | Linkage Disequilibrium Score Regression | Suite of tools | 2015 | [ |
| LDAK | Command-line (Compiled) | Free | PLINK | Modified kinship matrix; restricted maximum likelihood (REML); Haseman-Elston (HE) regression; phenotype-correlation, genotype-correlation (PCGC) regression | Suite of tools | 2012 | [ |
| HERRA | R Code | Free | Matrix with genotypes, disease status and covariates | Machine learning | Continuous or dichotomous outcomes | 2017 | [ |
| RHE-mc | Command-line (C++) | Free | PLINK | Randomized algorithm; method-of-moments (MoM) estimator | Estimates the variation that can be attributed to additive and dominance deviation | 2021 | [ |
Top-five (the most cited between January 2020 and September 2021 in PubMed) tools for the study of PRS with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| PRSice | Command-line (C++, Compiled, R for plotting) | Free | Binary PLINK (BED/BIM/FAM) or imputed (Oxford .bgen) | Pruning and Thresholding (P + T) | Visualization options with R | 2015 | [ |
| PRS-CS | Command-line (Python) | Free | GWAS summary statistics; external LD reference panel | Continuous shrinkage (CS) on SNP effect sizes + high-dimensional Bayesian regression framework | External LD reference panel | 2019 | [ |
| SBLUP/BLUP GCTA | Command-line (C++, Compiled) | Free | Binary PLINK (BED/BIM/FAM) or imputed (Oxford .bgen v1.2) | Linear mixed-effects model | Analyses individual chromosomes | 2020 (v1.93.2beta) | [ |
| SBayesR GCTB | Command-line (C++, Compiled) | Free | Binary PLINK (BED/BIM/FAM) | Bayesian mixture model | Uses low computational resources | 2019 | [ |
| lassosum | R Package bigstatsr | Free | Binary PLINK (BED/BIM/FAM) | Regularized regression model | External LD reference panel; pseudovalidation | 2017 | [ |
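
The thresholding half of the pruning and thresholding (P + T) approach listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PRSice implementation: the function name `threshold_prs`, the list-based inputs and the example p-value cutoff are all hypothetical, and the SNPs are assumed to have already been pruned (or clumped) for LD in a separate step.

```python
def threshold_prs(dosages, betas, pvalues, p_threshold):
    """Compute one polygenic score per individual.

    dosages     : list of per-individual lists of effect-allele counts (0, 1 or 2)
    betas       : per-SNP effect sizes from GWAS summary statistics
    pvalues     : per-SNP association p-values from the same GWAS
    p_threshold : SNPs with p-value <= p_threshold enter the score

    Assumes the SNP set was already LD-pruned; no pruning is done here.
    """
    if not (len(betas) == len(pvalues)):
        raise ValueError("betas and pvalues must have the same length")

    # Thresholding step: keep the indices of SNPs passing the p-value cutoff.
    kept = [j for j, p in enumerate(pvalues) if p <= p_threshold]

    scores = []
    for person in dosages:
        if len(person) != len(betas):
            raise ValueError("each individual needs one dosage per SNP")
        # Score = sum over retained SNPs of effect size times allele count.
        scores.append(sum(betas[j] * person[j] for j in kept))
    return scores


if __name__ == "__main__":
    example_dosages = [[0, 1, 2], [2, 2, 0]]   # two individuals, three SNPs
    example_betas = [0.5, -0.2, 0.1]
    example_pvalues = [0.001, 0.2, 0.04]
    print(threshold_prs(example_dosages, example_betas, example_pvalues, 0.05))
```

In practice the cutoff is usually scanned over several values and the best-performing score is selected in a validation sample; that selection step is left out of this sketch.
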
Tools for the study of inversions with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| invClust | R package | Free | PLINK | Mixture model, uses all the SNPs in the inverted segment | Detects 20 human inversions from the invFest database, experimentally validated and greater than 0.2 Mb; allows including ancestry information | 2015 | [ |
| PFIDO | R package | Free | PLINK | Pairwise identity-by-state distance matrix transformed by MDS. Model-based approach with 18 parameterized Gaussian mixture models | Detects 8p23 inversion-type; does not rely on any specific SNP | 2012 | [ |
| inveRsion | R package | Free | Text files with 0/1/2 coded genotypes | Sliding window scan, uses linkage between groups of SNPs | Detects inversions directly from genotypes; can detect new possible inversion regions; optimal for homogeneous samples and old inversions | 2012 | [ |
| scoreInvHap | R package | Free | PLINK or VCF files | Comparison with reference haplotype-genotypes | Detects 20 human inversions from the invFest database, experimentally validated and greater than 0.2 Mb | 2019 | [ |
| RecombClust | R package | Free | Phased VCF files | LDmixture model | Detects chromosomal subpopulations with distinct recombination histories | 2020 | [ |
Top-five tools (the most cited between January 2020 and September 2021 in PubMed) for the study of mosaicism with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| GISTIC | Command-line (MATLAB) | Free | Segmented Data | Ziggurat Deconstruction (ZD) | Specific for cancer samples | 2007 | [ |
| MoChA | Command-line (C) (bcftools extension); R for graphical outputs | Free | VCF files with LRR and BAF values (raw Affymetrix or Illumina files if using a complementary pipeline) | Hidden Markov Model (HMM) | Detects LOH; visualization options | 2020 | [ |
| PICNIC | Command-line (MATLAB) | Free (under license) | Affymetrix CEL files | Hidden Markov Model (HMM) | Specific for cancer samples; predicts absolute copy number; visualization options | 2010 | [ |
| BAFSegmentation | Command-line (Perl, R) | Free | Preprocessed files with BAF and LRR | Segmentation-based | Specific for cancer samples; provides percentage; detects LOH; visualization options | 2008 | [ |
| hapLOH | Command-line (Python, Perl) | Free | BAF file and phased genotypes | Hidden Markov Model (HMM) | Supports low aberrant cell proportions | 2013 | [ |
Figure 1. Chromosomal inversions appear when two breaks occur in the same chromosome and the cleaved fragment rotates before re-joining. They can be found in the heterozygous (center) or homozygous (right) state. One of the methods for inversion detection is the clustering performed by invClust, which classifies the inversion genotypes into clusters of similar haplotype origin.
Figure 2. Representation of a CNV region in a normal state, with a gain of genetic material, with a loss of genetic material and with copy-number neutral loss of heterozygosity (CNN LOH).
Figure 3. Changes in the BAF and LRR within CNVs of different types. (Orange) Normal state, where the BAF (a measure of heterozygosity) is on average 0 or 1 for homozygous probes and 0.5 for heterozygous probes, and the LRR (a normalized measure of DNA content) is on average 0. (Blue) CN gain is represented by a split of the BAF signal at 1/3 and 2/3 and a gain in LRR. (Red) CN loss is represented by a loss of the BAF signal for heterozygous probes (0.5) and a loss in LRR signal. (Green) Loss of heterozygosity is represented by a loss of the BAF signal for heterozygous probes and no change in the LRR signal.
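
The BAF and LRR signatures described for Figure 3 follow from simple allele counting, which the Python sketch below makes explicit. It is an idealized illustration, not the model used by any CNV caller listed in this review: the function names are hypothetical, probes are assumed noise-free, and the LRR is taken as log2(copy number / 2), whereas real array intensities are typically compressed relative to that value.

```python
from fractions import Fraction
from math import log2


def expected_baf_clusters(copy_number, loh=False):
    """Idealized BAF values for a region with the given total copy number.

    With n copies, a probe can carry 0..n copies of the B allele, so the
    BAF clusters sit at b/n. With loss of heterozygosity (loh=True) only
    the homozygous clusters (0 and 1) remain.
    """
    if copy_number < 1:
        raise ValueError("copy_number must be at least 1")
    b_counts = (0, copy_number) if loh else range(copy_number + 1)
    return [Fraction(b, copy_number) for b in b_counts]


def idealized_lrr(copy_number):
    """Idealized log R ratio relative to the normal diploid state."""
    if copy_number < 1:
        raise ValueError("copy_number must be at least 1")
    return log2(copy_number / 2)


if __name__ == "__main__":
    states = [
        ("normal", 2, False),
        ("CN gain", 3, False),
        ("CN loss", 1, False),
        ("copy-neutral LOH", 2, True),
    ]
    for label, cn, loh in states:
        bafs = ", ".join(str(b) for b in expected_baf_clusters(cn, loh))
        print(f"{label}: BAF clusters at {bafs}; idealized LRR {idealized_lrr(cn):+.2f}")
```

Under these assumptions a single-copy loss and a copy-neutral LOH share the same BAF clusters and are distinguished only by the LRR, which matches the red and green panels of the caption.
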
Top-five tools for the study of CNVs with their characteristics
| Tool | Type | Availability | Input data | Algorithm | Characteristics | Year | Reference |
|---|---|---|---|---|---|---|---|
| PennCNV | Command-line (Perl) | Free | Processed Intensity files with LRR and BAF + PFB (Population frequency of B allele) files (supplied with the package for several Affymetrix/Illumina arrays) | Hidden Markov Model (HMM) | Visualization options | 2007 | [ |
| QuantiSNP | Command-line (MATLAB) | Free | Illumina Infinium I/II or Affymetrix 500K and SNP 6.0 processed intensity files with LRR and BAF | Objective Bayes Hidden Markov Model (OB-HMM) | Visualization options; detects loss of heterozygosity | 2007 | [ |
| Birdsuite | Command-line (Bash, needs R, JAVA, Matlab, Python) | Free | Affymetrix CEL files (Genome-Wide Human SNP Array 6.0); Illumina 610 (beta version) | Birdseye: Hidden Markov Model (HMM); Canary: one-dimensional Gaussian mixture model (GMM) | Linux only; PLINK conversion pipeline | 2008 | [ |
| SCIMMkit | Command-line (Perl, R) | Free | Final call report from Illumina BeadStudio (Infinium II and GoldenGate BeadXpress chips) | SCIMM (SNP-Conditional Mixture Modeling): mixture-likelihood based clustering; SCOUT (SNP-Conditional Outlier detection): scoring function | Visualization options (scatterplots) | 2008 | [ |
| GLAD | R package | Free | Preprocessed files with LRR values | Segmentation based on Adaptive Weights Smoothing (AWS) | Specific for cancer samples | 2004 | [ |
Figure 4. Changes in the BAF and LRR depending on the type of mosaicism. (Top) A mosaic CN gain is represented by a split of the BAF signal into values between 1/3 and 2/3 and a gain in LRR. (Middle) A mosaic CN loss is represented by a BAF split between 0 and 1 and a loss of LRR. (Bottom) A mosaic loss of heterozygosity is represented by a BAF split between 0 and 1 and a normal LRR.
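
How far the BAF bands split in Figure 4 depends on the proportion of cells carrying the event. The Python sketch below works this out for heterozygous probes by averaging allele copies over a mixture of normal and altered cells. It is a simplified illustration under stated assumptions, not the model of MoChA, hapLOH or any other tool in this review: the function name `mosaic_signature` and the event labels are hypothetical, events are single-copy, and array noise and LRR compression are ignored.

```python
from math import log2


def mosaic_signature(event, cell_fraction):
    """Idealized (lower BAF band, upper BAF band, LRR) at heterozygous probes.

    event         : "gain", "loss" or "cn_loh" (hypothetical labels)
    cell_fraction : proportion of cells carrying the event, in [0, 1]

    Copy counts are averaged over normal cells (one A, one B copy) and
    altered cells. The two bands correspond to the event affecting the
    A-bearing or the B-bearing chromosome.
    """
    if not 0.0 <= cell_fraction <= 1.0:
        raise ValueError("cell_fraction must be between 0 and 1")
    f = cell_fraction

    if event == "gain":        # one extra copy in a fraction f of cells
        total = 2.0 + f
        b_low, b_high = 1.0, 1.0 + f
    elif event == "loss":      # one copy missing in a fraction f of cells
        total = 2.0 - f
        b_low, b_high = 1.0 - f, 1.0
    elif event == "cn_loh":    # one homolog replaced by a copy of the other
        total = 2.0
        b_low, b_high = 1.0 - f, 1.0 + f
    else:
        raise ValueError("event must be 'gain', 'loss' or 'cn_loh'")

    return b_low / total, b_high / total, log2(total / 2.0)


if __name__ == "__main__":
    for event in ("gain", "loss", "cn_loh"):
        for fraction in (0.2, 0.5, 1.0):
            low, high, lrr = mosaic_signature(event, fraction)
            print(f"{event:7s} f={fraction:.1f}  BAF {low:.3f}/{high:.3f}  LRR {lrr:+.3f}")
```

At a cell fraction of 1 the bands reach the constitutional values (1/3 and 2/3 for a gain, 0 and 1 for a loss or LOH), and at low fractions they stay close to 0.5, which is why low aberrant cell proportions are harder to detect.
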