Moara Machado (1), Wagner C S Magalhães (1), Allan Sene (1), Bruno Araújo (1), Alessandra C Faria-Campos (2), Stephen J Chanock (3,4), Leandro Scott (5), Guilherme Oliveira (5), Eduardo Tarazona-Santos (1), Maira R Rodrigues (1).
1. Departamento de Biologia Geral, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Av Antonio Carlos 6627, Pampulha, Caixa Postal 486, Belo Horizonte, MG, CEP 31270-910, Brazil.
2. Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Av Antonio Carlos 6627, Pampulha, Belo Horizonte, MG, CEP 31270-910, Brazil.
3. Laboratory of Translational Genomics of the Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Gaithersburg, MD, USA.
4. 8717 Grovemont Circle Advanced Technology Center, Room 127, Gaithersburg, MD, 20877, USA.
5. Genomics and Computational Biology Group and Center for Excellence in Bioinformatics, René Rachou Institute, Fundação Oswaldo Cruz, Av Augusto de Lima 1715, Belo Horizonte, MG, 30190-002, Brazil.
Abstract
BACKGROUND: Targeted re-sequencing is one of the most powerful and widely used strategies for population genetics studies because it allows an unbiased screening for variation and is suitable for a wide variety of organisms. Studies that require re-sequencing data include evolutionary inferences, epidemiological studies designed to capture rare polymorphisms responsible for complex traits, and screenings for mutations in families and small populations with high incidences of specific genetic diseases. Despite the advent of next-generation sequencing technologies, Sanger sequencing is still the most popular approach in population genetics studies, both because automatic sequencers based on capillary electrophoresis are widely available and because it is still less prone to sequencing errors, which is critical in these studies. Two popular software applications for re-sequencing studies are Phred-Phrap-Consed-Polyphred, which performs base calling, alignment, graphical editing and genotype calling, and DNAsp, which performs a set of population genetics analyses. These independent tools are the start and end points of basic analyses. Between them lies a set of basic but error-prone tasks that must be performed on re-sequencing data.

RESULTS: To assist with these intermediate tasks, we developed a pipeline that facilitates the data handling typical of re-sequencing studies. Our pipeline: (1) consolidates different outputs produced by distinct Phred-Phrap-Consed contigs sharing a reference sequence; (2) checks for genotyping inconsistencies; (3) reformats genotyping data produced by Polyphred into a matrix of genotypes with individuals as rows and segregating sites as columns; (4) prepares input files for haplotype inferences using the popular software PHASE; and (5) handles PHASE output files, which contain only polymorphic sites, to reconstruct the inferred haplotypes including both polymorphic and monomorphic sites, as required by population genetics software for re-sequencing data such as DNAsp.

CONCLUSION: We tested the pipeline in re-sequencing studies of haploid and diploid data in humans, plants, animals and microorganisms. It substantially decreased the time required for sequencing analyses and made the process more controlled, eliminating several classes of error that may occur when handling datasets. The pipeline is also useful for investigators using other tools for sequencing and population genetics analyses.
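Steps (3) and (5) of the pipeline can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not parse real Polyphred or PHASE files; the in-memory inputs (a list of `(individual, position, genotype)` calls, a reference string, and 1-based site positions) and the function names `genotype_matrix` and `rebuild_haplotype` are assumptions made purely for illustration. Only single-nucleotide substitutions are handled.

```python
from typing import Dict, List, Sequence, Tuple


def genotype_matrix(
    calls: Sequence[Tuple[str, int, str]],
) -> Tuple[List[int], Dict[str, List[str]]]:
    """Step (3), sketched: turn per-site genotype calls into a matrix.

    `calls` is a hypothetical list of (individual, position, genotype)
    tuples. Returns the sorted site positions (the columns) and one row
    of genotypes per individual; missing calls are marked with "?".
    """
    positions = sorted({pos for _, pos, _ in calls})
    column = {pos: i for i, pos in enumerate(positions)}
    matrix: Dict[str, List[str]] = {}
    for individual, pos, genotype in calls:
        row = matrix.setdefault(individual, ["?"] * len(positions))
        row[column[pos]] = genotype
    return positions, matrix


def rebuild_haplotype(reference: str, positions: Sequence[int], alleles: str) -> str:
    """Step (5), sketched: re-insert monomorphic sites into a haplotype.

    `alleles` holds one inferred allele per polymorphic site, in the same
    order as `positions` (1-based coordinates on `reference`). Every other
    site is copied from the reference sequence.
    """
    if len(positions) != len(alleles):
        raise ValueError("need exactly one allele per polymorphic position")
    sequence = list(reference)
    for pos, allele in zip(positions, alleles):
        if not 1 <= pos <= len(reference):
            raise ValueError(f"position {pos} lies outside the reference")
        sequence[pos - 1] = allele
    return "".join(sequence)


if __name__ == "__main__":
    calls = [("ind1", 8, "T/C"), ("ind1", 3, "G/G"), ("ind2", 3, "G/A")]
    sites, rows = genotype_matrix(calls)
    print(sites, rows)
    print(rebuild_haplotype("ACGTACGTAC", [3, 8], "AC"))
```

With the toy reference `ACGTACGTAC` and inferred alleles `A` and `C` at sites 3 and 8, the rebuilt haplotype is `ACATACGCAC`: the two polymorphic sites carry the inferred alleles and the eight monomorphic sites are copied from the reference, which is the full-length form that downstream population genetics software expects.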