Alan E Murphy, Brian M Schilder, Nathan G Skene.
Abstract
MOTIVATION: Genome-wide association study (GWAS) summary statistics have popularised and accelerated genetic research. However, a lack of standardisation of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies.
Year: 2021 PMID: 34601555 PMCID: PMC8652100 DOI: 10.1093/bioinformatics/btab665
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Most common summary statistic formats, from a repository of over 200 publicly available GWAS (Gloudemans, 2021). Note that a GWAS can have more than one summary statistics file and ‘
MungeSumstats implemented checks
| No. | MungeSumstats check | Description |
|---|---|---|
| 1 | Check VCF format | Check whether the input file is in variant call format (VCF); if so, import it |
| 2 | Check tab, space or comma delimited | If the input is space or comma delimited, convert it to tab delimited. Can handle .tsv, .txt, .csv, .tsv.gz, .txt.gz, .csv.gz, .tsv.bgz, .txt.bgz, .csv.bgz, .vcf, .vcf.gz, .vcf.bgz files. |
| 3 | Check for header name synonyms | If any alternative names are found for SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ, convert them to a standard name. Robust conversion approach with 176 unique mappings. |
| 4 | Check for multiple models or traits in GWAS | If multiple, user must specify one to analyze |
| 5 | Check for uniformity in SNP ID | Ensure no mix of RS ID, missing ‘rs’ prefix and/or CHR:BP |
| 6 | Check for CHR:BP:A2:A1 all in one column | Split into separate columns if found |
| 7 | Check for CHR:BP in one column | Split into separate columns if found |
| 8 | Check for A1/A2 in one column | Split into separate columns if found |
| 9 | Check if CHR and/or BP is missing | If so, infer from the chosen reference genome |
| 10 | Check if SNP ID is missing | If so, infer from the chosen reference genome |
| 11 | Check if A1 and/or A2 are missing | If so, infer from the chosen reference genome |
| 12 | Check that vital columns are present | Check for the necessary columns: SNP, CHR, BP, P, A1, A2 |
| 13 | Check for one signed/effect column | Effect columns Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT |
| 14 | Check for missing data | If data is missing from any entry, remove the SNP |
| 15 | Check for duplicated columns | If any are found, remove one |
| 16 | Check for extremely small P-values | These are not recognized in R and cause issues with downstream analysis software like LDSC/MAGMA. User can convert them to 0. |
| 17 | Check N column | Ensure it is an integer and check that the sample size for a SNP isn’t greater than five standard deviations above the mean. Removes SNPs that have substantially more samples than the rest. |
| 18 | Check SNPs are RS IDs | Checks validity of SNP IDs as RS IDs; other IDs can still be used |
| 19 | Check for duplicated rows, based on SNP ID | Duplicates are removed |
| 21 | Check for duplicated rows, based on base-pair position | Duplicates are removed |
| 22 | Check for SNPs on reference genome | Correct any missing from reference genome using BP and CHR |
| 23 | Check INFO score | Remove SNPs with imputation score less than 0.9 |
| 24 | Check for strand-ambiguous SNPs | Remove strand-ambiguous SNPs if found |
| 25 | Check for non-biallelic SNPs (infer from reference genome) | Infer from chosen reference genome and remove any if found |
| 26 | Check for allele flipping | The effect/alternative/minor allele is assumed to be A2. The allele flipping function checks A1 against a reference genome. For a given SNP, if A1 doesn't match the reference genome sequence (for example, it is the alternative allele rather than the reference allele), A1 and A2, along with the effect and frequency columns, are flipped, creating consistent directionality of allelic effects across GWAS. |
| 27 | Check for SNPs on chromosome X, Y and mitochondrial SNPs (MT) | If any are found these are removed. |
| 28 | Check output format is LDSC ready | Standardized file can be passed to LDSC without pre-processing |
| 29 | Check effect column values | Ensure effect columns (like BETA) aren’t equal to 0 |
| 30 | Check Standard Error | Ensure standard error (SE) is positive |
| 31 | Check dropped and imputed values | Return indicators of the imputed values for each SNP, and return the SNPs excluded by QC along with the reason for exclusion. |
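
A few of the checks in the table can be illustrated in code. The following is a minimal Python sketch, not MungeSumstats itself (which is an R package) and not its API. All function names, the abbreviated `HEADER_SYNONYMS` map, the use of the sample standard deviation, and the guard that only flips when A2 matches the reference allele are assumptions made for illustration.

```python
import statistics

# Hypothetical, heavily abbreviated synonym map; the table above reports
# 176 unique mappings in the real package.
HEADER_SYNONYMS = {
    "RSID": "SNP", "MARKERNAME": "SNP",
    "CHROM": "CHR", "CHROMOSOME": "CHR",
    "POS": "BP", "POSITION": "BP",
    "PVAL": "P", "P_VALUE": "P",
    "EFFECT": "BETA",
}


def standardise_header(names):
    """Check 3 (sketch): map alternative column names to a standard name."""
    upper = [name.strip().upper() for name in names]
    return [HEADER_SYNONYMS.get(name, name) for name in upper]


def split_chr_bp(value):
    """Check 7 (sketch): split a combined 'CHR:BP' value into (CHR, BP)."""
    chrom, bp = value.split(":")
    return chrom.strip(), int(bp)


def drop_n_outliers(rows, n_std=5):
    """Check 17 (sketch): drop rows whose N exceeds mean + n_std * SD.

    The sample standard deviation is used here; that choice is an assumption.
    """
    sizes = [row["N"] for row in rows]
    cutoff = statistics.mean(sizes) + n_std * statistics.stdev(sizes)
    return [row for row in rows if row["N"] <= cutoff]


def flip_if_needed(row, ref_allele):
    """Check 26 (sketch): make A1 the reference allele.

    If A1 already matches the reference, the row is returned unchanged.
    If instead A2 matches (an added guard, assumed here), A1/A2 are swapped
    and the effect (BETA) and frequency (FRQ) columns are flipped.
    """
    if row["A1"] == ref_allele or row["A2"] != ref_allele:
        return row
    flipped = dict(row)
    flipped["A1"], flipped["A2"] = row["A2"], row["A1"]
    if "BETA" in flipped:
        flipped["BETA"] = -row["BETA"]
    if "FRQ" in flipped:
        flipped["FRQ"] = 1 - row["FRQ"]
    return flipped


if __name__ == "__main__":
    print(standardise_header(["rsid", "chrom", "pos", "pval", "A1"]))
    print(split_chr_bp("1:12345"))
    print(flip_if_needed({"A1": "T", "A2": "C", "BETA": 0.2, "FRQ": 0.3}, "C"))
```

Each function operates on plain lists and dictionaries so that the logic of a single check is visible in isolation; a real pipeline would apply the same ideas column-wise to a full summary statistics table.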