SNPTransformer: a lightweight toolkit for genome-wide association studies.

Abstract

High-throughput genotyping chips have produced huge datasets for genome-wide association studies (GWAS) that have contributed greatly to discovering susceptibility genes for complex diseases. There are two strategies for performing data analysis for GWAS. One strategy is to use open-source or commercial packages that are designed for GWAS. The other is to take advantage of classic genetic programs with specific functions, such as linkage disequilibrium mapping, haplotype inference and transmission disequilibrium tests. However, most of the available classic programs are not suitable for analyzing chip data directly and require custom-made input, which makes it necessary to convert raw genotyping files into various data formats. We developed a powerful, user-friendly, lightweight program named SNPTransformer for GWAS that includes five major modules (Transformer, Operator, Previewer, Coder and Simulator). The toolkit not only transforms genotyping files into ten input formats for use with classic genetics packages, but also carries out useful functions such as relational operations on IDs, previewing data files, recoding data formats and simulating marker files. It bridges upstream raw genotyping data with downstream genetic programs, and can act as a handy toolkit for human geneticists, especially for non-programmers. SNPTransformer is freely available at http://snptransformer.sourceforge.net.

Year: 2010 PMID: 21382596 PMCID: PMC5054149 DOI: 10.1016/S1672-0229(10)60029-0

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

High-throughput genotyping technologies contribute greatly to the hunt for susceptibility genes for complex diseases by constantly improving the precision of and the capacity for parallel genotyping 1, 2, 3. Driven by these emerging technologies, several challenging projects, such as HapMap Phase I to III 4, 5, ENCODE 6, 7 and 1000 Genomes, were proposed to explore the patterns of common and rare variation in the human genome. Broad application of whole-genome single nucleotide polymorphism (SNP) chips in genome-wide association studies (GWAS) has also led to the discovery of more than 100 loci for nearly 40 common diseases and traits. Because of the great challenges presented by huge datasets, two strategies have been developed for analyzing whole-genome genotyping data. One strategy is to use open-source or commercial packages that have been designed for GWAS. PLINK is one of the most popular and powerful of these programs. GenABEL, GWAF, SNPAssoc and snpMatrix are programs based on the R language, which is a well-established open-source framework. GWAS Analyzer, GWAS GUI and MAVEN provide platforms for intuitively viewing the results of GWAS. SNPTEST can also be used for data analysis as part of a software suite consisting of several programs. These software packages are readily compatible with chip data and offer a broader or narrower range of functions depending on the purpose of the genetic analysis. The other strategy is to take advantage of classic genetic programs that implement specific functions, such as transmission disequilibrium tests (TDT) for alleles [UNPHASED], calculating linkage disequilibrium (LD) measures and constructing LD maps [Haploview, GOLD and JLIN], haplotype inference [PHASE], haplotype block partition [HapBlock], tagSNP selection [Tagger] and multilocus interaction methods [MDR].
However, most of the available classic programs cannot read chip data directly and require custom-made input, which forces users to convert raw genotyping files into various data formats. SNP_Tools is an MS-Excel add-in that can convert genotyping files into several formats, such as Haploview and PHASE. Because it is an Excel add-in (a worksheet in MS-Excel 2003 is limited to 256 columns and 65,536 rows), its support for chip data is limited. Furthermore, its output formats are limited to Haploview, SNPHAP, PHASE and PedPhase. Because classic programs implement specific algorithms for genetic analysis and offer a further option for analyzing GWAS data, it is important to develop a tool that bridges these programs with raw genotyping data. Here, we present a powerful, user-friendly, lightweight toolkit named SNPTransformer for GWAS. The major aim of SNPTransformer is to convert genotyping input (such as linkage and chip formats) into the input formats of various packages (for association, TDT, calculating LD measures, haplotype inference, haplotype block partition, tagSNP selection and multilocus interaction). With this toolkit, researchers can avoid converting between formats by hand and can easily construct workflows for data analysis. Additionally, accessory tools in SNPTransformer perform data previewing, relational operations on IDs, recoding of data files and simulation of map files, all of which assist data conversion and GWAS analysis.

Implementation

SNPTransformer V1.0 was written using C++ Builder with a concise and user-friendly interface. It was built and tested under Windows XP. Because SNPTransformer is a lightweight toolkit, no installation or additional package is required, and it is compatible with other Windows platforms. All binaries, source code, tutorials, examples and updates are freely available under the GNU General Public License at the SNPTransformer website (http://snptransformer.sourceforge.net).

Results

The current version of SNPTransformer provides five major modules for GWAS: Transformer, Previewer, Operator, Coder and Simulator (Table 1).

Table 1

Modules and functions of SNPTransformer

Module	Function	Example
Transformer	Converting genotyping files into formats of other analysis tools	Converting chip data into PLINK input
Previewer	Previewing the first N lines of large data files	Previewing annotation files of Affymetrix SNP Array 6.0
Operator	Relational operations on IDs	Retrieving annotation information for positive SNPs
Coder	Recoding genotypes between letters and numbers	Recoding AB-coding genotypes into ACGT-coding
Simulator	Simulating map or pedigree files	Simulating PLINK map files according to a SNP list

Transformer

As the most important module of the package, Transformer occupies the main window of the software and converts genotyping data into the input formats of specific classic genetic analysis tools (Figure 1 and Table 1). Input data include genotyping files, marker files and pedigree files (Figure 2). A genotyping file can be in one of the following formats: linkage format, which integrates genotyping data and pedigree information in a single file; chip format, which is compatible with whole-genome genotyping data from Affymetrix, Illumina and many other chip platforms (see the manual for details); or custom format, which is similar to sequencing results. A marker file stores the chromosome and physical position of each SNP, as well as its genetic position, which is usually set to zero. A pedigree file contains individual IDs, gender and disease status or qualitative traits; this file type is suitable not only for pedigree-based analyses such as linkage and TDT, but also for case/control data. Because a pedigree file is identical to the pedigree part (first six columns) of a linkage-format file, it is not required when the genotyping file is already in linkage format.
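The relationship between the linkage, pedigree and marker files described above can be illustrated with a short Python sketch. It is not taken from the SNPTransformer source code (which is written in C++). The column names assume the common PLINK-style layout (family ID, individual ID, father ID, mother ID, sex, phenotype, followed by two allele columns per SNP, and a four-column marker file), and the function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Pedigree(NamedTuple):
    """Pedigree part of a linkage-format line (assumed PLINK-style column order)."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str


def split_linkage_line(line: str) -> Tuple[Pedigree, List[Tuple[str, str]]]:
    """Split one linkage-format line into its pedigree part and its genotypes."""
    fields = line.split()
    # Six pedigree columns, then exactly two allele columns per SNP.
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 pedigree columns plus two alleles per SNP")
    pedigree = Pedigree(*fields[:6])
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return pedigree, genotypes


def parse_marker_line(line: str) -> Tuple[str, str, float, int]:
    """Parse one marker line: chromosome, SNP ID, genetic position, physical position."""
    chrom, snp_id, genetic_pos, physical_pos = line.split()
    return chrom, snp_id, float(genetic_pos), int(physical_pos)


if __name__ == "__main__":
    ped, gts = split_linkage_line("FAM1 IND1 0 0 1 2 A T G G")
    print(ped.individual_id, gts)
    print(parse_marker_line("1 rs123 0 1000"))
```

Because the first six fields already carry the pedigree information, a separate pedigree file adds nothing when the input is in linkage format, which is why Transformer does not ask for one in that case.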

Figure 1

Screenshot of SNPTransformer. The screenshot shows the user-friendly interface of SNPTransformer. Transformer is located in the main interface of SNPTransformer and consists of two windows: the upper window is used to set input files and their relevant parameters, and the lower window is used to select the output formats.

Figure 2

Workflow of Transformer. Input data of Transformer include genotyping files (linkage, chip or custom formats), marker files and pedigree files, filtered by sample and SNP lists. The output formats cover routine analyses for GWAS, and ten programs are selected as representatives.

The output formats of SNPTransformer are diverse and represent essential genetic analyses for GWAS (Figure 2). As in many other tools, the linkage format is the basic format of SNPTransformer. PLINK is one of the most popular software platforms for GWAS and can perform various genetic analyses; the marker file of SNPTransformer follows that of PLINK, so that input data in linkage format for SNPTransformer are consistent with those of PLINK. Haploview provides a suite of LD-based analysis tools: TDT, calculation of LD measures and construction of LD maps, haplotype inference, haplotype block partitioning, tagSNP selection and case/control association. UNPHASED, GOLD, JLIN, PHASE, HaploBlock and Tagger each carry out one of these analysis steps. MDR excels at multi-locus interaction analysis and is widely used in association studies. At present, no tool for two-locus interaction is included in SNPTransformer, because well-recognized analysis tools for this purpose are lacking apart from logistic regression, whose input format differs considerably from the output of SNPTransformer. Additional custom formats can be output through options that count genotype or allele numbers and frequencies. To filter the input data, SNP or sample lists can be supplied to improve data quality or to narrow the target scope. With these filters, Transformer can act as an extractor that retrieves specific genotypes for meta-analysis.
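The filtering and counting options mentioned above can be sketched in a few lines of Python. This is an illustration only, not the SNPTransformer implementation: the in-memory data layout, the function name `allele_frequencies` and the use of "0" as the missing-allele code are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

Genotype = Tuple[str, str]


def allele_frequencies(
    genotypes: Dict[str, List[Genotype]],
    snp_ids: List[str],
    keep_snps: Optional[Set[str]] = None,
    keep_samples: Optional[Set[str]] = None,
    missing: str = "0",
) -> Dict[str, Dict[str, float]]:
    """Return per-SNP allele frequencies after optional SNP and sample filtering.

    genotypes maps a sample ID to its genotypes, listed in the order of snp_ids.
    A filter set of None means "keep everything".
    """
    result: Dict[str, Dict[str, float]] = {}
    for idx, snp in enumerate(snp_ids):
        if keep_snps is not None and snp not in keep_snps:
            continue
        counts: Counter = Counter()
        for sample, calls in genotypes.items():
            if keep_samples is not None and sample not in keep_samples:
                continue
            for allele in calls[idx]:
                if allele != missing:  # skip missing calls (assumed code "0")
                    counts[allele] += 1
        total = sum(counts.values())
        result[snp] = {a: n / total for a, n in counts.items()} if total else {}
    return result


if __name__ == "__main__":
    data = {
        "s1": [("A", "T"), ("G", "G")],
        "s2": [("A", "A"), ("0", "0")],
        "s3": [("T", "T"), ("G", "C")],
    }
    print(allele_frequencies(data, ["rs1", "rs2"], keep_snps={"rs1"},
                             keep_samples={"s1", "s2"}))
```

Passing a SNP list and a sample list restricts the computation to the requested subset, which mirrors how the filters turn Transformer into an extractor for a target region or cohort.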

Previewer, Operator, Coder and Simulator

The other four modules in SNPTransformer have been developed to satisfy specific demands, such as previewing data files and retrieving annotation information for positive SNPs (Table 1). Since genotyping or annotation files of whole-genome SNP chips are too large to open with generic text editors, Previewer displays the first N lines of these files. With this function, users can inspect the format of annotation files for Affymetrix GeneChip 6.0 sets, which are hundreds of megabytes or larger. Furthermore, Previewer can reorder the columns of the input file to produce a new file that is arranged as required for further analysis. During genetic analyses, relational operations can help search for specific information. For example, annotation information for positive SNPs can be retrieved by performing an “inner join” or “left join” operation between the SNP list and the annotation file. Operator implements both one-item operations (on a single file) and two-item operations (between two files), similar to those supported by relational databases such as MySQL and Access. Coder recodes genotypes between letter and number codings; for example, a heterozygote with the two alleles “A” and “T” can be coded as “AT” (ACGT-coding), “14” (1234-coding), “12” (12-coding), “AB” (AB-coding) or even “1” (012-coding). Another module, called Simulator, is used to generate pseudo map and pedigree files without the use of real information. When Haploview is used to calculate LD measures, the physical positions need not be the real values: the correct order of the SNPs is sufficient if pairwise distances can be ignored. In such a case, a map file can be easily simulated from a SNP list by Simulator.

Considerable work remains to be done to meet the needs of GWAS analysis. The first step is to adopt parallel computing to further speed up processing. Another important aim is to design dedicated interfaces for Affymetrix and Illumina SNP chips that provide more options; in the current version, both platforms are handled through the generic chip interface.
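The genotype codings handled by Coder, as listed in the module description above, can be made concrete with a small Python sketch. It is an illustration rather than the SNPTransformer implementation. The 1234-coding is assumed to map A, C, G and T to 1, 2, 3 and 4, which is consistent with “AT” becoming “14” in the text. For the 12-, AB- and 012-codings, the caller is assumed to state which nucleotide plays the role of allele A (or 1) and which plays allele B (or 2), and the 012-coding is assumed to count copies of allele B. The function name `recode` is hypothetical.

```python
from typing import Dict

# Assumed nucleotide-to-digit mapping for the 1234-coding.
ACGT_TO_1234 = {"A": "1", "C": "2", "G": "3", "T": "4"}


def recode(genotype: str, allele_a: str, allele_b: str) -> Dict[str, str]:
    """Recode a two-letter ACGT genotype into the other codings.

    allele_a and allele_b are the nucleotides designated as allele A/1 and
    allele B/2 for this SNP (an assumption of this sketch; real data would
    take the designation from the chip annotation).
    """
    if len(genotype) != 2 or any(x not in (allele_a, allele_b) for x in genotype):
        raise ValueError(f"genotype {genotype!r} does not match alleles "
                         f"{allele_a}/{allele_b}")
    to_ab = {allele_a: "A", allele_b: "B"}
    to_12 = {allele_a: "1", allele_b: "2"}
    return {
        "ACGT": genotype,
        "1234": "".join(ACGT_TO_1234[x] for x in genotype),
        "12": "".join(to_12[x] for x in genotype),
        "AB": "".join(to_ab[x] for x in genotype),
        # 012-coding: number of copies of allele B (assumed convention).
        "012": str(sum(x == allele_b for x in genotype)),
    }


if __name__ == "__main__":
    print(recode("AT", "A", "T"))
```

For the heterozygote used as the example in the text, `recode("AT", "A", "T")` yields “AT”, “14”, “12”, “AB” and “1” for the five codings, matching the values quoted above.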

Competing interests

The author has declared that no competing interests exist.

27 in total

1. GOLD--graphical overview of linkage disequilibrium.

Authors: G R Abecasis; W O Cookson
Journal: Bioinformatics Date: 2000-02 Impact factor: 6.937

2. A comparison of bayesian methods for haplotype reconstruction from population genotype data.

Authors: Matthew Stephens; Peter Donnelly
Journal: Am J Hum Genet Date: 2003-10-20 Impact factor: 11.025

3. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays.

Authors: Hajime Matsuzaki; Shoulian Dong; Halina Loi; Xiaojun Di; Guoying Liu; Earl Hubbell; Jane Law; Tam Berntsen; Monica Chadha; Henry Hui; Geoffrey Yang; Giulia C Kennedy; Teresa A Webster; Simon Cawley; P Sean Walsh; Keith W Jones; Stephen P A Fodor; Rui Mei
Journal: Nat Methods Date: 2004-11 Impact factor: 28.547

4. Efficiency and power in genetic association studies.

Authors: Paul I W de Bakker; Roman Yelensky; Itsik Pe'er; Stacey B Gabriel; Mark J Daly; David Altshuler
Journal: Nat Genet Date: 2005-10-23 Impact factor: 38.330

5. A haplotype map of the human genome.

Authors:
Journal: Nature Date: 2005-10-27 Impact factor: 49.962

6. GenABEL: an R library for genome-wide association analysis.

Authors: Yurii S Aulchenko; Stephan Ripke; Aaron Isaacs; Cornelia M van Duijn
Journal: Bioinformatics Date: 2007-03-23 Impact factor: 6.937

7. A new multipoint method for genome-wide association studies by imputation of genotypes.

Authors: Jonathan Marchini; Bryan Howie; Simon Myers; Gil McVean; Peter Donnelly
Journal: Nat Genet Date: 2007-06-17 Impact factor: 38.330

8. Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data.

Authors: Frank Dudbridge
Journal: Hum Hered Date: 2008-03-31 Impact factor: 0.444

9. A second generation human haplotype map of over 3.1 million SNPs.

Authors: Kelly A Frazer; Dennis G Ballinger; David R Cox; David A Hinds; Laura L Stuve; Richard A Gibbs; John W Belmont; Andrew Boudreau; Paul Hardenbol; Suzanne M Leal; Shiran Pasternak; David A Wheeler; Thomas D Willis; Fuli Yu; Huanming Yang; Changqing Zeng; Yang Gao; Haoran Hu; Weitao Hu; Chaohua Li; Wei Lin; Siqi Liu; Hao Pan; Xiaoli Tang; Jian Wang; Wei Wang; Jun Yu; Bo Zhang; Qingrun Zhang; Hongbin Zhao; Hui Zhao; Jun Zhou; Stacey B Gabriel; Rachel Barry; Brendan Blumenstiel; Amy Camargo; Matthew Defelice; Maura Faggart; Mary Goyette; Supriya Gupta; Jamie Moore; Huy Nguyen; Robert C Onofrio; Melissa Parkin; Jessica Roy; Erich Stahl; Ellen Winchester; Liuda Ziaugra; David Altshuler; Yan Shen; Zhijian Yao; Wei Huang; Xun Chu; Yungang He; Li Jin; Yangfan Liu; Yayun Shen; Weiwei Sun; Haifeng Wang; Yi Wang; Ying Wang; Xiaoyan Xiong; Liang Xu; Mary M Y Waye; Stephen K W Tsui; Hong Xue; J Tze-Fei Wong; Luana M Galver; Jian-Bing Fan; Kevin Gunderson; Sarah S Murray; Arnold R Oliphant; Mark S Chee; Alexandre Montpetit; Fanny Chagnon; Vincent Ferretti; Martin Leboeuf; Jean-François Olivier; Michael S Phillips; Stéphanie Roumy; Clémentine Sallée; Andrei Verner; Thomas J Hudson; Pui-Yan Kwok; Dongmei Cai; Daniel C Koboldt; Raymond D Miller; Ludmila Pawlikowska; Patricia Taillon-Miller; Ming Xiao; Lap-Chee Tsui; William Mak; You Qiang Song; Paul K H Tam; Yusuke Nakamura; Takahisa Kawaguchi; Takuya Kitamoto; Takashi Morizono; Atsushi Nagashima; Yozo Ohnishi; Akihiro Sekine; Toshihiro Tanaka; Tatsuhiko Tsunoda; Panos Deloukas; Christine P Bird; Marcos Delgado; Emmanouil T Dermitzakis; Rhian Gwilliam; Sarah Hunt; Jonathan Morrison; Don Powell; Barbara E Stranger; Pamela Whittaker; David R Bentley; Mark J Daly; Paul I W de Bakker; Jeff Barrett; Yves R Chretien; Julian Maller; Steve McCarroll; Nick Patterson; Itsik Pe'er; Alkes Price; Shaun Purcell; Daniel J Richter; Pardis Sabeti; Richa Saxena; Stephen F Schaffner; Pak C Sham; Patrick Varilly; David Altshuler; Lincoln D 
Stein; Lalitha Krishnan; Albert Vernon Smith; Marcela K Tello-Ruiz; Gudmundur A Thorisson; Aravinda Chakravarti; Peter E Chen; David J Cutler; Carl S Kashuk; Shin Lin; Gonçalo R Abecasis; Weihua Guan; Yun Li; Heather M Munro; Zhaohui Steve Qin; Daryl J Thomas; Gilean McVean; Adam Auton; Leonardo Bottolo; Niall Cardin; Susana Eyheramendy; Colin Freeman; Jonathan Marchini; Simon Myers; Chris Spencer; Matthew Stephens; Peter Donnelly; Lon R Cardon; Geraldine Clarke; David M Evans; Andrew P Morris; Bruce S Weir; Tatsuhiko Tsunoda; James C Mullikin; Stephen T Sherry; Michael Feolo; Andrew Skol; Houcan Zhang; Changqing Zeng; Hui Zhao; Ichiro Matsuda; Yoshimitsu Fukushima; Darryl R Macer; Eiko Suda; Charles N Rotimi; Clement A Adebamowo; Ike Ajayi; Toyin Aniagwu; Patricia A Marshall; Chibuzor Nkwodimmah; Charmaine D M Royal; Mark F Leppert; Missy Dixon; Andy Peiffer; Renzong Qiu; Alastair Kent; Kazuto Kato; Norio Niikawa; Isaac F Adewole; Bartha M Knoppers; Morris W Foster; Ellen Wright Clayton; Jessica Watkin; Richard A Gibbs; John W Belmont; Donna Muzny; Lynne Nazareth; Erica Sodergren; George M Weinstock; David A Wheeler; Imtaz Yakub; Stacey B Gabriel; Robert C Onofrio; Daniel J Richter; Liuda Ziaugra; Bruce W Birren; Mark J Daly; David Altshuler; Richard K Wilson; Lucinda L Fulton; Jane Rogers; John Burton; Nigel P Carter; Christopher M Clee; Mark Griffiths; Matthew C Jones; Kirsten McLay; Robert W Plumb; Mark T Ross; Sarah K Sims; David L Willey; Zhu Chen; Hua Han; Le Kang; Martin Godbout; John C Wallenburg; Paul L'Archevêque; Guy Bellemare; Koji Saeki; Hongguang Wang; Daochang An; Hongbo Fu; Qing Li; Zhen Wang; Renwu Wang; Arthur L Holden; Lisa D Brooks; Jean E McEwen; Mark S Guyer; Vivian Ota Wang; Jane L Peterson; Michael Shi; Jack Spiegel; Lawrence M Sung; Lynn F Zacharia; Francis S Collins; Karen Kennedy; Ruth Jamieson; John Stewart
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

10. JLIN: a java based linkage disequilibrium plotter.

Authors: Kim W Carter; Pamela A McCaskie; Lyle J Palmer
Journal: BMC Bioinformatics Date: 2006-02-09 Impact factor: 3.169