deML: robust demultiplexing of Illumina sequences using a likelihood-based approach.

Gabriel Renaud¹, Udo Stenzel¹, Tomislav Maricic¹, Victor Wiebe¹, Janet Kelso¹.

Abstract

MOTIVATION: Pooling multiple samples increases the efficiency and lowers the cost of DNA sequencing. One approach to multiplexing is to use short DNA indices to uniquely identify each sample. After sequencing, reads must be assigned in silico to the sample of origin, a process referred to as demultiplexing. Demultiplexing software typically identifies the sample of origin using a fixed number of mismatches between the read index and a reference index set. This approach may fail or misassign reads when the sequencing quality of the indices is poor.
RESULTS: We introduce deML, a maximum likelihood algorithm that demultiplexes Illumina sequences. deML computes the likelihood of an observed index sequence being derived from a specified sample. A quality score which reflects the probability of the assignment being correct is generated for each read. Using these quality scores, even very problematic datasets can be demultiplexed and an error threshold can be set.
AVAILABILITY AND IMPLEMENTATION: deML is freely available for use under the GPL (http://bioinf.eva.mpg.de/deml/).

Year: 2014 PMID: 25359895 PMCID: PMC4341068 DOI: 10.1093/bioinformatics/btu719

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

While the high throughput of next-generation sequencing is beneficial for many applications, such as high-coverage whole-genome sequencing, it may be economically disadvantageous for the sequencing of small numbers of loci. It is possible to sequence a large number of samples in a single run by incorporating unique sequence indices for each sample, a process referred to as multiplexing. Current Illumina protocols allow for 1 or 2 index sequences to be used. The computational process by which reads are assigned to the sample of origin is called demultiplexing. The default demultiplexer provided by Illumina in the CASAVA package allows for 0 or 1 mismatches between the sequenced index and the user-supplied reference indices. Various heuristics have been proposed to assign reads to their sample of origin (Costea et al.; Davis et al.; Dodt et al.; Reid et al.). Although these methods perform well for high-quality sequencing reads, poor demultiplexing remains a common reason for low retrieval or misassignment of sequences from a multiplexed run. Increased error rates, particularly during sequencing of the index, can lead to a higher number of mismatches and hinder assignment to the correct sample. For some applications, high read error rates can be tolerated as long as the reads can be mapped to the reference (e.g. transcriptome quantification).

We introduce deML, a new approach to demultiplexing samples based on the likelihood of assignment to a particular sample, and provide a freely available, open-source C++ implementation. Briefly, we compute the likelihood that a read originates from each of the original samples, assign each read to the most likely sample of origin and compute the overall confidence in this assignment. We show that by using thresholds on these confidence values, even very problematic datasets can be safely demultiplexed. By simulating increasing error in the indices we show that, especially at high error rates, deML with default quality cutoffs enables the user to demultiplex several-fold more sequences than the vendor's default demultiplexer or other methods based on fixed mismatches. The false discovery rate (FDR) remains below that of other tools based on Hamming distance. deML, licensed under the GPL, can run on aligned or unaligned BAM files or FASTQ files.
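For contrast with the likelihood approach, the fixed-mismatch strategy described above can be sketched in a few lines of Python. This is only an illustration of Hamming-distance assignment in general, not the CASAVA or deindexer implementation; the function names and the 7-base example indices are hypothetical.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_fixed_mismatch(read_index: str,
                          reference: Dict[str, str],
                          max_mismatches: int = 1) -> Optional[str]:
    """Return the single sample within max_mismatches of the read index.

    Returns None when no sample, or more than one sample, qualifies.
    Quality scores are ignored, which is the weakness discussed above.
    """
    hits = [name for name, idx in reference.items()
            if hamming(read_index, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical reference indices, for illustration only.
reference = {"sampleA": "ACGTACG", "sampleB": "TTGCAAC"}
print(assign_fixed_mismatch("ACGTACG", reference))  # exact match
print(assign_fixed_mismatch("ACGTACC", reference))  # one mismatch
print(assign_fixed_mismatch("ACGAACC", reference))  # two mismatches
```

With a cutoff of one mismatch, the third read is discarded even if both mismatching bases carry very low quality scores, which is the situation a likelihood-based score is meant to handle.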

2 Methods

2.1 Algorithm

We compute the likelihood of assignment of a read to all potential samples of origin, assign each read to the most likely sample and compute the uncertainty of the assignment. Let $I = i_1, \ldots, i_n$ be the bases for a specific sample and $R = r_1, \ldots, r_n$ be the two sequenced indices with their respective quality scores $Q = q_1, \ldots, q_n$. Let $m_1, \ldots, m_n$ be a set of dummy variables which are equal to 1 if the corresponding bases between R and I match, or 0 otherwise. The likelihood of having sequenced the index given that it originates from a given sample, referred to as Z0, is given by:

$$Z_0 = -10 \log_{10} \left[ \prod_{1 \le k \le n} \left( m_k \left(1 - 10^{-q_k/10}\right) + (1 - m_k)\, 10^{-q_k/10} \right) \right]$$

The Z0 score is computed for each potential match. Finally, the read is assigned to the most likely sample of origin. It can occur that a read is equally likely to belong to more than one sample. To quantify this uncertainty, the Z1 score models the probability of misassignment. Let M be the number of potential samples of origin and let $Y_1, Y_2, \ldots, Y_M$ be the likelihood scores for each sample. Let t be the sample with the highest likelihood; the misassignment score is given by:

$$Z_1 = -10 \log_{10} \left[ \frac{\sum_{j \in \{1, \ldots, M\} \setminus t} 10^{-Y_j/10}}{\sum_{j \in \{1, \ldots, M\}} 10^{-Y_j/10}} \right]$$

Additional details about the algorithm are found in the Supplementary Methods section.

To evaluate the correctness of the sample assignment based on the indices, we produced double-indexed DNA libraries from amplicons of a 245 bp region of chromosome 7 from 99 human samples and from PhiX DNA fragmented to 350 bp. Double-indexing is increasingly used in applications requiring extremely accurate read assignment (Kircher et al.). The reads were basecalled, demultiplexed using deML and mapped to both the human and PhiX genomes (see Supplementary Methods). The mapping of the forward and reverse reads indicates the sample of origin of the original cluster and was used to measure demultiplexing misassignment rates.

Using simulations, we evaluated the robustness of deML read assignments for datasets at various error rates.
Sequencing errors were added at various rates to indices with perfect matches to a known sample, using an error profile derived from an Illumina MiSeq sequencing run. We computed the number of sequences demultiplexed by deML and by deindexer (https://github.com/ws6/deindexer), which allows users to increase the number of mismatches. We also measured the number of sequences with 0 or 1 mismatches, as the standard Illumina demultiplexing approach (CASAVA) assigns sequences using this cutoff (see Supplementary Methods).
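The scoring described in Section 2.1 can be illustrated with a minimal Python sketch. It is not the deML C++ implementation. It assumes the standard PHRED error model, in which a quality score q corresponds to an error probability of 10^(-q/10), and it assumes that both scores are reported on a PHRED scale. The function names and the example indices are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def z0_score(read: str, quals: Sequence[int], index: str) -> float:
    """PHRED-scaled likelihood of observing `read` if it came from `index`.

    Assumption: each quality q > 0 encodes an error probability 10**(-q/10).
    """
    if not (len(read) == len(quals) == len(index)):
        raise ValueError("read, quals and index must have equal length")
    log10_lik = 0.0
    for base, q, ref in zip(read, quals, index):
        if q <= 0:
            raise ValueError("quality scores must be positive")
        err = 10 ** (-q / 10)
        # A match contributes (1 - err); a mismatch contributes err.
        log10_lik += math.log10(1 - err if base == ref else err)
    return -10 * log10_lik


def assign_read(read: str, quals: Sequence[int],
                indices: Dict[str, str]) -> Tuple[str, float, float]:
    """Return (most likely sample, its Z0, misassignment score Z1)."""
    z0 = {name: z0_score(read, quals, idx) for name, idx in indices.items()}
    # On a PHRED scale the smallest score is the largest likelihood.
    best = min(z0, key=z0.get)
    lik = {name: 10 ** (-score / 10) for name, score in z0.items()}
    others = sum(v for name, v in lik.items() if name != best)
    total = others + lik[best]
    # With a single candidate sample there is nothing to confuse it with.
    z1 = math.inf if others == 0 else -10 * math.log10(others / total)
    return best, z0[best], z1


# Hypothetical indices and a read with one mismatch to sampleA.
indices = {"sampleA": "ACGTACG", "sampleB": "TTGCAAC"}
best, z0, z1 = assign_read("ACGTACC", [30] * 7, indices)
print(best, round(z0, 2), round(z1, 2))
```

Under these assumptions a low-quality mismatch costs little likelihood, so a read with several poorly called index bases can still be assigned with a high Z1, whereas a fixed-mismatch cutoff would discard it.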

3 Results

Of the total of 15 245 844 clusters that were detected in our test dataset, 8 070 867 clusters had both forward and reverse reads aligning to the human control region and 4 629 687 to the PhiX genome. Using the sample assignment provided by deML for the reads mapping to the PhiX genome, the rate of false assignment was computed as a function of the Z0 and Z1 scores. As expected, reads with a high likelihood of stemming from the PhiX control group (Z0) and with a low likelihood of stemming from another sample (Z1) were enriched for true assignments, whereas misassignments were found at the other end of the distribution. The distributions of the Z0 and Z1 scores for true and false positives (TP and FP) are presented in the Supplementary Results.

As Z1 measures the probability of misassignment given the potential index sequence set on a PHRED scale, the relationship between the misassignment rate on a log scale and the Z1 score should be linear. For reads where both mates aligned to the PhiX genome, the misassignment rate was computed by considering any read pair not assigned by deML to the PhiX as a mislabeling. As Z1 can take many discrete values, the misassignment rate was plotted for multiple Z1 value bins (see Fig. 1).
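The expected linearity follows directly from the PHRED definition. As a short worked step, writing $p$ for the misassignment probability (a symbol introduced here only for illustration):

```latex
Z_1 = -10 \log_{10} p
\quad\Longleftrightarrow\quad
\log_{10} p = -\frac{Z_1}{10}
```

If the score is well calibrated, Z1 values of 10, 20 and 30 would therefore correspond to expected misassignment rates of 0.1, 0.01 and 0.001, and the observed rate plotted on a log scale should fall along a straight line with slope -1/10.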

Fig. 1.

Correlation between the Z1 score for reads aligned to the PhiX genome and the observed misassignment rate. Error bars were obtained using Wilson score intervals
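For readers unfamiliar with the Wilson score interval, the following Python sketch shows the standard formula for a binomial proportion, applied here to a bin of reads. The 95% level (z = 1.96) and the example counts are assumptions made for illustration; the confidence level used for the figure is not stated in this text.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed proportion errors / trials.

    z is the standard normal quantile; 1.96 gives an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= errors <= trials:
        raise ValueError("errors must lie between 0 and trials")
    p = errors / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical bin: 5 misassigned read pairs out of 100 in one Z1 bin.
low, high = wilson_interval(5, 100)
print(f"rate = {5 / 100:.3f}, interval = [{low:.4f}, {high:.4f}]")
```

Unlike the simple normal approximation, this interval stays within [0, 1] and remains informative for bins with very few or zero misassignments, which is the typical case at high Z1.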

deML retrieves more sequences and achieves a lower FDR than currently available approaches (see Table 1 and Supplementary Results).

Table 1.

Number of sequences demultiplexed by deML and deindexer in terms of TP, FP and FDR for 12 374 149 sequences

Average error rate per base	deML TP	deML FP	deML FDR	deindexer TP	deindexer FP	deindexer FDR	CASAVA 0 mm	CASAVA 1 mm
0.002408	12 374 119	1	0.00%	12 372 007	0	0.00%	11 962 540	405 318
0.101145	11 898 460	205	0.00%	9 784 321	146	0.00%	2 783 384	4 381 588
0.196708	9 779 898	2 761	0.03%	5 659 886	1 683	0.03%	577 456	1 978 848

Note: The CASAVA columns present the number of sequences with 0 or 1 mismatches (mm), i.e. those that could be identified using an approach allowing up to 1 mismatch (such as CASAVA).
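The FDR column can be checked against the TP and FP counts with a few lines of Python. The sketch assumes the usual definition FDR = FP / (TP + FP), which is not spelled out in this text; the counts are copied from Table 1.

```python
def fdr(tp: int, fp: int) -> float:
    """False discovery rate, assuming FDR = FP / (TP + FP)."""
    assigned = tp + fp
    return fp / assigned if assigned else 0.0


# (TP, FP) counts from Table 1, keyed by average error rate per base.
table = {
    0.002408: {"deML": (12374119, 1), "deindexer": (12372007, 0)},
    0.101145: {"deML": (11898460, 205), "deindexer": (9784321, 146)},
    0.196708: {"deML": (9779898, 2761), "deindexer": (5659886, 1683)},
}

for rate, tools in table.items():
    for tool, (tp, fp) in tools.items():
        print(f"{rate:.6f}  {tool:<9}  FDR = {100 * fdr(tp, fp):.2f}%")
```

Rounded to two decimals, the values printed by this sketch agree with the percentages reported in the table.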

References

1. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform.

Authors: Martin Kircher; Susanna Sawyer; Matthias Meyer
Journal: Nucleic Acids Res Date: 2011-10-21 Impact factor: 16.971

2. TagGD: fast and accurate software for DNA Tag generation and demultiplexing.

Authors: Paul Igor Costea; Joakim Lundeberg; Pelin Akan
Journal: PLoS One Date: 2013-03-04 Impact factor: 3.240

3. Kraken: a set of tools for quality control and analysis of high-throughput sequence data.

Authors: Matthew P A Davis; Stijn van Dongen; Cei Abreu-Goodger; Nenad Bartonicek; Anton J Enright
Journal: Methods Date: 2013-06-29 Impact factor: 3.608

4. Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline.

Authors: Jeffrey G Reid; Andrew Carroll; Narayanan Veeraraghavan; Mahmoud Dahdouli; Andreas Sundquist; Adam English; Matthew Bainbridge; Simon White; William Salerno; Christian Buhay; Fuli Yu; Donna Muzny; Richard Daly; Geoff Duyk; Richard A Gibbs; Eric Boerwinkle
Journal: BMC Bioinformatics Date: 2014-01-29 Impact factor: 3.169

5. FLEXBAR-Flexible Barcode and Adapter Processing for Next-Generation Sequencing Platforms.

Authors: Matthias Dodt; Johannes T Roehr; Rina Ahmed; Christoph Dieterich
Journal: Biology (Basel) Date: 2012-12-14
