PhredEM: a phred-score-informed genotype-calling approach for next-generation sequencing studies.

Peizhou Liao, Glen A Satten, Yi-Juan Hu.

Abstract

A fundamental challenge in analyzing next-generation sequencing (NGS) data is to determine an individual's genotype accurately, as the accuracy of the inferred genotype is essential to downstream analyses. Correctly estimating the base-calling error rate is critical to accurate genotype calls. Phred scores that accompany each call can be used to decide which calls are reliable. Some genotype callers, such as GATK and SAMtools, calculate the base-calling error rates directly from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic: a threshold that is too high may discard usable data, while one that is too low may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The approach, which we call PhredEM, uses the expectation-maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. It also includes a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that the EM algorithm needs to be applied only to loci estimated to be nonmonomorphic. Like GATK, PhredEM can be used together with a linkage-disequilibrium-based method such as Beagle, which can further improve genotype calling as a refinement step. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project and the 1000 Genomes project. The results demonstrate that PhredEM performs better than either GATK or SeqEM, and that PhredEM is an improved, robust, and widely applicable genotype-calling approach for NGS studies. The relevant software is freely available.
© 2017 WILEY PERIODICALS, INC.
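
The ideas in the abstract can be illustrated with a small sketch. This is not the authors' PhredEM software or their exact model. It shows the standard phred definition (error probability 10^(-Q/10)), a hypothetical logistic error model of the form logit(error) = beta0 + beta1 * Q, and a simplified EM that estimates genotype frequencies at a single biallelic locus while holding the error model fixed (the abstract states that PhredEM also estimates the logistic regression parameters, which this sketch does not do). All function names, parameter values, and the toy reads are assumptions made for illustration.

```python
import math

def phred_to_error(q):
    """Nominal error probability implied by a phred score Q: 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def logistic_error(q, beta0, beta1):
    """Hypothetical logistic error model: logit(error) = beta0 + beta1 * Q."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * q)))

def genotype_likelihoods(reads, error_fn):
    """Likelihood of one individual's reads under genotypes 0, 1, 2
    (number of alternate alleles). reads is a list of (is_alt, phred) pairs.
    Assumption: a base-calling error always flips to the other allele."""
    liks = []
    for g in (0, 1, 2):
        lik = 1.0
        for is_alt, q in reads:
            e = error_fn(q)
            # Probability that this read shows the alternate allele.
            p_alt = (g / 2.0) * (1.0 - e) + (1.0 - g / 2.0) * e
            lik *= p_alt if is_alt else 1.0 - p_alt
        liks.append(lik)
    return liks

def em_genotype_freqs(samples, error_fn, n_iter=50):
    """EM for genotype frequencies at one locus, error model held fixed."""
    freqs = [1.0 / 3.0] * 3
    liks = [genotype_likelihoods(reads, error_fn) for reads in samples]
    post = []
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities for each individual.
        post = []
        for lik in liks:
            weights = [f * x for f, x in zip(freqs, lik)]
            total = sum(weights)
            post.append([w / total for w in weights])
        # M-step: genotype frequencies are the average posteriors.
        freqs = [sum(p[g] for p in post) / len(post) for g in range(3)]
    return freqs, post

# Toy data: three individuals, ten reads each, all at phred score 30.
samples = [
    [(False, 30)] * 10,                     # all reference reads
    [(True, 30)] * 5 + [(False, 30)] * 5,   # mixed reads
    [(True, 30)] * 10,                      # all alternate reads
]
freqs, post = em_genotype_freqs(samples, phred_to_error)
# Call each genotype as the one with the highest posterior probability.
calls = [max(range(3), key=lambda g: p[g]) for p in post]
```

To try the logistic variant, pass something like `lambda q: logistic_error(q, -1.0, -0.2)` as `error_fn`; the coefficients here are arbitrary placeholders, not estimates from any data set.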

Keywords:  EM algorithm; common variant; rare variant; read data

Year:  2017        PMID: 28560825      PMCID: PMC5564424          DOI: 10.1002/gepi.22048

Source DB:  PubMed          Journal:  Genet Epidemiol        ISSN: 0741-0395            Impact factor:   2.135


  6 in total

1.  Oral microbiome research - A Beginner's glossary.

Authors:  Priya Nimish Deo; Revati Shailesh Deshmukh
Journal:  J Oral Maxillofac Pathol       Date:  2022-03-31

2.  Systematic evaluation of error rates and causes in short samples in next-generation sequencing.

Authors:  Franziska Pfeiffer; Carsten Gröber; Michael Blank; Kristian Händler; Marc Beyer; Joachim L Schultze; Günter Mayer
Journal:  Sci Rep       Date:  2018-07-19       Impact factor: 4.379

3.  A beginner's guide for FMDV quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-generation sequencing.

Authors:  Marco Cacciabue; Anabella Currá; Elisa Carrillo; Guido König; María Inés Gismondi
Journal:  Brief Bioinform       Date:  2020-09-25       Impact factor: 11.622

4.  Development and comparison of RNA-sequencing pipelines for more accurate SNP identification: practical example of functional SNP detection associated with feed efficiency in Nellore beef cattle.

Authors:  S Lam; J Zeidan; F Miglior; A Suárez-Vega; I Gómez-Redondo; P A S Fonseca; L L Guan; S Waters; A Cánovas
Journal:  BMC Genomics       Date:  2020-10-08       Impact factor: 3.969

5.  Genomic divergence, local adaptation, and complex demographic history may inform management of a popular sportfish species complex.

Authors:  Joe C Gunn; Leah K Berkman; Jeff Koppelman; Andrew T Taylor; Shannon K Brewer; James M Long; Lori S Eggert
Journal:  Ecol Evol       Date:  2022-10-05       Impact factor: 3.167

6.  Metagenomic analysis of primary colorectal carcinomas and their metastases identifies potential microbial risk factors.

Authors:  Luigi Marongiu; Jonathan J M Landry; Tobias Rausch; Mohammed L Abba; Susanne Delecluse; Henri-Jacques Delecluse; Heike Allgayer
Journal:  Mol Oncol       Date:  2021-08-30       Impact factor: 6.603
