
freeIbis: an efficient basecaller with calibrated quality scores for Illumina sequencers.

Gabriel Renaud, Martin Kircher, Udo Stenzel, Janet Kelso.

Abstract

MOTIVATION: The conversion of the raw intensities obtained from next-generation sequencing platforms into nucleotide sequences with well-calibrated quality scores is a critical step in the generation of good sequence data. While recent model-based approaches can yield highly accurate calls, they require a substantial amount of processing time and/or computational resources. We previously introduced Ibis, a fast and accurate basecaller for the Illumina platform. We have continued active development of Ibis to take into account developments in the Illumina technology, as well as to make Ibis fully open source.
RESULTS: We introduce here freeIbis, which offers significant improvements in sequence accuracy owing to the use of a novel multiclass support vector machine (SVM) algorithm. Sequence quality scores are now calibrated based on empirically observed scores, thus providing a high correlation to their respective error rates. These improvements result in downstream advantages including improved genotyping accuracy.
AVAILABILITY AND IMPLEMENTATION: FreeIbis is freely available for use under the GPL (http://bioinf.eva.mpg.de/freeibis/). It requires a Python interpreter and a C++ compiler. Tailored versions of LIBOCAS and LIBLINEAR are distributed along with the package.

Year: 2013 PMID: 23471300 PMCID: PMC3634191 DOI: 10.1093/bioinformatics/btt117

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

A crucial step in the Illumina sequencing pipeline is basecalling: the generation, from raw intensities, of individual nucleotide sequences and associated quality scores, which measure the probability of a sequencing error. The default basecaller provided by Illumina, Bustard, develops a model from the raw intensities and uses it to perform basecalling. Alternative basecallers aimed at achieving better performance than Bustard have been proposed (Whiteford et al., 2009). These basecallers can be divided into those that apply a modelling strategy like Bustard, such as naiveBayesCall (Kao et al., 2009; see Das and Vikalo, 2012 for a faster implementation) and All Your Base (AYB) (Massingham and Goldman, 2012), those that rely on supervised learning approaches (Ibis, Kircher et al., 2009) and intermediate approaches (Alta-Cyclic, Erlich et al., 2008).

We introduce an update to our basecaller Ibis. FreeIbis replaces the restricted-license SVM library with LIBOCAS (Franc and Sonnenburg, 2009), which is released under the GNU General Public License. Our results show that freeIbis outperforms the previous version of our software in terms of sequence accuracy. We measured how the decision score of the SVM corresponds to the observed error rate, taken as the number of mismatches between control reads and their reference genome at each predicted quality score. A function approximating this distribution is then used to assign quality scores to individual bases. The resulting scores show a high correlation between predicted and observed error rates, thus obviating the need for quality score recalibration as a post-processing step (McKenna et al., 2010).

We compare the newest versions of freeIbis and Ibis against the default basecaller for two Genome Analyzer II (GA) runs, a HiSeq run and a MiSeq run. On a set of DNA sequences genotyped using both Sanger and Illumina sequencing technologies, freeIbis provides an improvement in genotype accuracy over the default Illumina basecaller.
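
For reference, base quality scores on the Illumina platform are conventionally reported on the Phred scale, Q = -10 log10(p), where p is the estimated probability that the base call is wrong. The following minimal Python sketch shows the conversion in both directions. The function names are illustrative and are not taken from freeIbis.

```python
import math


def error_prob_to_phred(p_error: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def phred_to_error_prob(quality: float) -> float:
    """Convert a Phred-scaled quality back to an error probability."""
    return 10.0 ** (-quality / 10.0)
```

Under this convention, a quality of 30 corresponds to an expected error rate of one in a thousand base calls.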

2 METHODS

The performance and accuracy of a number of freely available SVM libraries for basecalling were evaluated on a control lane of 51 cycles from a ϕX174 reference strain (sequence provided by Illumina Inc.) sequenced on a GAII. An examination of our training data, built using ϕX174 control sequences, revealed numerous mislabelled training examples (i.e. intensities representing a certain base but labelled as another). These could be attributed to two types of artefacts: genuine sequence errors and divergent bases in the control genome population. To eliminate the effects of the latter, we devised a masking procedure for these positions on the genome of the control organism: any training example from a position at which a mismatching nucleotide accounted for >10% of the coverage was removed. With the divergent bases of the ϕX174 genome masked, we sought to measure whether the posterior probabilities of the SVM corresponded with the observed error rate. However, standard implementations of the SVM algorithm do not output posterior probabilities but decision values for each hyperplane. We therefore implemented a method to convert these values into base quality scores (see Supplementary Methods). Alignments were performed using BWA version 0.5.8a (Li and Durbin, 2009) with default parameters.
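
The masking rule can be sketched as follows. This is an illustration under stated assumptions, not the freeIbis implementation: it assumes per-position base counts from control reads aligned to the reference, and it reads the 10% criterion as "any single non-reference nucleotide exceeds 10% of the coverage at that position". The data layout and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def positions_to_mask(reference: str,
                      pileup: Sequence[Dict[str, int]],
                      max_mismatch_fraction: float = 0.10) -> Set[int]:
    """Return 0-based reference positions considered divergent.

    pileup[i] maps each observed base to its read count at position i
    (hypothetical layout). A position is masked when some non-reference
    base accounts for more than max_mismatch_fraction of the coverage.
    """
    masked: Set[int] = set()
    for pos, (ref_base, counts) in enumerate(zip(reference, pileup)):
        coverage = sum(counts.values())
        if coverage == 0:
            continue  # no reads here, nothing to judge
        for base, count in counts.items():
            if base != ref_base and count / coverage > max_mismatch_fraction:
                masked.add(pos)
                break
    return masked


def filter_training_examples(examples: List[Tuple[int, object, str]],
                             masked: Set[int]) -> List[Tuple[int, object, str]]:
    """Drop (position, intensities, label) examples from masked positions."""
    return [ex for ex in examples if ex[0] not in masked]
```

Only positions are masked; training examples that carry genuine sequencing errors at other positions remain in the training set.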

3 RESULTS

We compared freeIbis with the masking disabled to the most recent version of Ibis on the aforementioned run containing 200 000 sequences from a ϕX174 control lane with a high thymine retention (Kircher et al., 2009). The reads produced by both versions were aligned back to the ϕX174 genome, and the number of sequences mapped and the average edit distance were computed. We observed that LIBOCAS outperforms the previous SVM library for both metrics. Because the introduction of incorrectly labelled training examples could influence the quality of the SVM model, we sought to evaluate whether our masking procedure would have an effect on the number of mapped reads. The mapping statistics confirmed that masking divergent bases on the ϕX174 genome improves the final sequence accuracy (170 572 sequences mapped) compared with not masking any bases (170 220) or masking random bases (170 225). We tested freeIbis on a paired-end GAIIx run from mid-2011 from our own sequencing centre with 2 × 126 cycles and a single index of seven nucleotides. This multiplexed run had human DNA as target and ϕX174 as control, and was basecalled using the previous version, Ibis, and the current one, freeIbis, as well as naiveBayesCall (v. 0.3) and All Your Base (AYB, v2.08). We compared how each performed in terms of sequence accuracy (the number of sequences mapped and the edit distance to the reference) as well as runtime (Table 1). FreeIbis provides more high-quality base calls, leading to an increased number of reads being mapped to the reference with a lower edit distance than is the case for other basecallers. The predicted versus observed quality scores were plotted for Bustard and for freeIbis (Fig. 1). The sequences for the two GA runs used for comparison were produced using the Bustard Off-Line Basecaller (OLB v.1.9.3).
Our results show that freeIbis offers improved accuracy and calibrated quality scores for these sequencing runs (including one on a HiSeq and another on a MiSeq) and outperforms Bustard on runs with unusually high error rates (see Supplementary Data).
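
The two mapping metrics used in this section, the number of mapped sequences and the average edit distance, can be tallied from SAM-format alignments. The sketch below is an assumption about how such a tally might be done, not a description of the authors' scripts. It relies on the SAM convention that FLAG bit 0x4 marks an unmapped read and on the optional `NM:i` tag, which BWA writes, for the edit distance. It also assumes one record per read; the function name is illustrative.

```python
from typing import Iterable, Tuple


def mapping_stats(sam_lines: Iterable[str]) -> Tuple[int, float]:
    """Return (number of mapped records, mean NM edit distance).

    The mean is taken over mapped records that carry an NM:i tag.
    """
    mapped = 0
    with_nm = 0
    total_nm = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NM:i:"):
                total_nm += int(tag[5:])
                with_nm += 1
                break
    mean_nm = total_nm / with_nm if with_nm else float("nan")
    return mapped, mean_nm
```

For a text SAM file, the function can be applied directly to an open file handle, for example `mapping_stats(open(path))` with a path of the reader's choosing.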

Table 1.

Accuracy for each basecaller on an Illumina GAIIx dataset (2 × 126 cycles with 366 135 257 clusters)

Basecaller	Training time	Calling time	Mapped (%)^a	Edit distance
Bustard			583 348 201 (83.93%)	1.379
naiveBayesCall	591 h	658 h	578 957 145 (83.34%)	1.496
AYB	394 h		593 183 967 (85.52%)	1.076
Ibis	19.4 h	13.2 h	592 929 953 (85.31%)	1.167
freeIbis	21.3 h	12.2 h	594 095 219 (85.48%)	1.145

The human sequences were mapped to the hg19 version of the human genome. The number of mapped sequences and the average number of mismatches for those were tallied for each method. Time trials were conducted on a machine with 74 GB of RAM and using 8 of the 12 Intel Xeon cores running at 2.27 GHz. ^a Percentage relative to sequences assigned to the read group of interest.

Fig. 1.

Plot of the predicted versus the observed base quality score for control reads. Ideally the base qualities should follow the diagonal line. The root mean square error (RMSE) shows that quality scores predicted using freeIbis have a greater correlation to their observed error rates.

We called genotypes from the same sequencing data basecalled with three different basecallers (Ibis, freeIbis and Bustard) and compared them with calls from Sanger sequences. This comparison showed that freeIbis offers improved genotyping accuracy (see Supplementary Data).
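
The predicted-versus-observed comparison of Fig. 1 can be outlined in a few lines: for each predicted quality score, count the control-read bases and their mismatches to the reference, convert the observed mismatch rate to the Phred scale, and summarise the deviation from the diagonal as an RMSE. The sketch below assumes the input is a stream of (predicted quality, is-mismatch) pairs and weights every quality bin equally. Both choices are assumptions, as the text does not state how the RMSE was computed.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def observed_quality(calls: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each predicted quality to the observed Phred-scaled quality.

    calls yields (predicted_quality, is_mismatch) for every control base.
    Bins without any mismatch are skipped, since an observed error rate
    of zero has no finite Phred score.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for predicted, is_mismatch in calls:
        totals[predicted] += 1
        if is_mismatch:
            errors[predicted] += 1
    observed: Dict[int, float] = {}
    for predicted, n_bases in totals.items():
        n_errors = errors[predicted]
        if n_errors == 0:
            continue
        observed[predicted] = -10.0 * math.log10(n_errors / n_bases)
    return observed


def quality_rmse(observed: Dict[int, float]) -> float:
    """Unweighted RMSE between predicted and observed quality over bins."""
    if not observed:
        raise ValueError("no quality bins with observed errors")
    squared = [(pred - obs) ** 2 for pred, obs in observed.items()]
    return math.sqrt(sum(squared) / len(squared))
```

A perfectly calibrated basecaller would give an RMSE near zero; weighting bins by their base counts is an equally defensible alternative.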

4 CONCLUSION

FreeIbis provides substantial improvements in sequence accuracy, quality score calibration and genotyping accuracy over Bustard, and is more computationally efficient than equally accurate model-based methods such as AYB.

REFERENCES

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. OnlineCall: fast online parameter estimation and base calling for illumina's next-generation sequencing.

Authors: Shreepriya Das; Haris Vikalo
Journal: Bioinformatics Date: 2012-05-07 Impact factor: 6.937

3. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing.

Authors: Wei-Chun Kao; Kristian Stevens; Yun S Song
Journal: Genome Res Date: 2009-08-06 Impact factor: 9.043

4. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing.

Authors: Yaniv Erlich; Partha P Mitra; Melissa delaBastide; W Richard McCombie; Gregory J Hannon
Journal: Nat Methods Date: 2008-07-06 Impact factor: 28.547

5. All Your Base: a fast and accurate probabilistic approach to base calling.

Authors: Tim Massingham; Nick Goldman
Journal: Genome Biol Date: 2012-02-29 Impact factor: 13.583

6. Improved base calling for the Illumina Genome Analyzer using machine learning strategies.

Authors: Martin Kircher; Udo Stenzel; Janet Kelso
Journal: Genome Biol Date: 2009-08-14 Impact factor: 13.583

7. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

8. Swift: primary data analysis for the Illumina Solexa sequencing platform.

Authors: Nava Whiteford; Tom Skelly; Christina Curtis; Matt E Ritchie; Andrea Löhr; Alexander Wait Zaranek; Irina Abnizova; Clive Brown
Journal: Bioinformatics Date: 2009-06-23 Impact factor: 6.937
