Thomas S Rask, Bent Petersen, Donald S Chen, Karen P Day, Anders Gorm Pedersen.
Abstract
BACKGROUND: Amplicon pyrosequencing targets a known genetic region, so its reads are expected to show certain features, such as conserved nucleotide sequence and, for protein-coding DNA, an open reading frame. Pyrosequencing errors, which consist mainly of nucleotide insertions and deletions, are on the other hand likely to disrupt open reading frames. This inverse relationship between errors and expectations based on prior knowledge can be exploited to guide basecalling, i.e. the inference of nucleotide sequence from raw sequencing data.
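The idea of letting prior expectation guide basecalling can be illustrated with a minimal sketch. It is not the published method: the function names, the `p_orf` weight, and the requirement that a candidate be a whole number of codons are all assumptions made for illustration. Each candidate basecall arrives with a log-likelihood from some flowgram model (assumed given), and an unnormalised prior favours candidates that keep an open reading frame, so a candidate carrying a frame-disrupting indel is down-weighted.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_reading_frame(seq, frame=0):
    """True if seq, read from `frame`, is a whole number of codons with no stop codon.

    Assumption for this sketch: the amplicon covers complete codons only.
    """
    coding = seq[frame:]
    if len(coding) == 0 or len(coding) % 3 != 0:
        return False
    codons = (coding[i:i + 3] for i in range(0, len(coding), 3))
    return not any(codon in STOP_CODONS for codon in codons)


def log_prior(seq, p_orf=0.99):
    """Hypothetical unnormalised prior weight: p_orf if the frame is intact, else 1 - p_orf."""
    return math.log(p_orf if has_open_reading_frame(seq) else 1.0 - p_orf)


def rank_candidates(candidates, p_orf=0.99):
    """Rank candidate basecalls by posterior probability.

    candidates: list of (sequence, log_likelihood) pairs, where the
    log-likelihood comes from a flowgram model that is assumed given.
    Returns (sequence, posterior) pairs, most probable first; posteriors
    are normalised over the supplied candidates only.
    """
    scored = [(seq, ll + log_prior(seq, p_orf)) for seq, ll in candidates]
    # Log-sum-exp normalisation for numerical stability.
    top = max(score for _, score in scored)
    log_z = top + math.log(sum(math.exp(score - top) for _, score in scored))
    posteriors = [(seq, math.exp(score - log_z)) for seq, score in scored]
    return sorted(posteriors, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # The second candidate has a one-base deletion that breaks the frame,
    # even though its raw likelihood is slightly higher.
    example = [("ATGAAATTT", -2.0), ("ATGAATTT", -1.5)]
    for seq, post in rank_candidates(example):
        print(seq, round(post, 3))
```

In this toy case the in-frame candidate ranks first despite its lower likelihood, which is the behaviour the abstract describes in words. A real prior would also need to encode the expected conserved nucleotide sequence.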
Keywords: Amplicon sequencing; Basecalling; Bayesian methods
Year: 2016 PMID: 27102804 PMCID: PMC4841065 DOI: 10.1186/s12859-016-1032-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Accuracy of Plasmodium falciparum reference strain amplicon resequencing using different basecalling methods. Shown for each basecalling method is the fraction of all sequences (N, given in the legend) with a given number of errors. See Additional file 1: Figure S3, plotted on a log scale, for detailed frequencies of sequences with multiple errors
Fig. 2. Ranking of the correct basecall according to P(CBS|flows). After flowgram clustering and alignment, Multipass was used to calculate the fifty most likely basecalls given each flowgram alignment. The likelihood rank of the correct basecall is shown against the number of flowgrams in the alignment. At most two hundred flowgrams from each cluster were used for alignment. Marker size is proportional to the abundance at each point
Fig. 3. Receiver operating characteristic curves for prediction of incorrect sequences. Performance of three machine learning methods applied to differentiate between sequences with and without errors among var sequences generated using Multipass basecalling. The classifiers were trained on a number of characteristics provided for each sequence, such as read coverage and maximal positional flow variance. Positive (P) and negative (N) refer to sequences with and without errors, respectively. True (T) and false (F) refer to correct and incorrect predictions, respectively. For each method, the lowest false positive rate with perfect classification of the erroneous sequences is indicated (dotted lines)
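The quantity marked by the dotted lines in Fig. 3, the lowest false positive rate at which every erroneous sequence is still flagged, can be computed from classifier scores as in the following sketch. The function name and the score convention (a higher score means the sequence is more likely to contain an error) are assumptions for illustration and are not taken from the paper.

```python
def fpr_at_full_recall(scores, has_error):
    """Lowest false positive rate at which all erroneous sequences are flagged.

    scores:    classifier scores, higher meaning "more likely erroneous"
               (an assumed convention for this sketch).
    has_error: booleans, True for sequences with an error (the positives).
    """
    positives = [s for s, err in zip(scores, has_error) if err]
    negatives = [s for s, err in zip(scores, has_error) if not err]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    # The highest threshold that still catches every positive is the
    # lowest positive score; flag everything at or above it.
    threshold = min(positives)
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.4, 0.5, 0.1, 0.2]
    labels = [True, True, True, False, False, False]
    print(fpr_at_full_recall(scores, labels))
```

Here the weakest-scoring erroneous sequence sets the threshold at 0.4, and one of the three error-free sequences scores above it, so a third of the correct sequences would be discarded to guarantee that no erroneous sequence is kept.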