Je, a versatile suite to handle multiplexed NGS libraries with unique molecular identifiers.

Charles Girardot¹, Jelle Scholtalbers², Sajoscha Sauer², Shu-Yi Su², Eileen E M Furlong².

Abstract

BACKGROUND: The yield obtained from next generation sequencers has increased almost exponentially in recent years, making sample multiplexing common practice. While barcodes (known sequences of fixed length) primarily encode the sample identity of sequenced DNA fragments, barcodes made of random sequences (Unique Molecular Identifiers, or UMIs) are often used to distinguish between PCR duplicates and transcript abundance in, for example, single-cell RNA sequencing (scRNA-seq). In paired-end sequencing, different barcodes can be inserted at each fragment end to either increase the number of multiplexed samples in the library or to use one of the barcodes as a UMI. Alternatively, UMIs can be combined with the sample barcodes into composite barcodes, or with standard Illumina® indexing. Subsequent analysis must take read duplicates and sample identity into account, by identifying UMIs.
RESULTS: Existing tools do not support these complex barcoding configurations and custom code development is frequently required. Here, we present Je, a suite of tools that accommodates complex barcoding strategies, extracts UMIs and filters read duplicates taking UMIs into account. Using Je on publicly available scRNA-seq and iCLIP data containing UMIs, the number of unique reads increased by up to 36 %, compared to when UMIs are ignored.
CONCLUSIONS: Je is implemented in Java and uses the Picard API. Code, executables and documentation are freely available at http://gbcs.embl.de/Je. Je can also be easily installed in Galaxy through the Galaxy toolshed.

Keywords: Duplicates; Genomics; Multiplexing; NGS; Software; UMI

Year: 2016 PMID: 27717304 PMCID: PMC5055726 DOI: 10.1186/s12859-016-1284-2

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

High-throughput sequencing has become the approach of choice in genomic experiments (RNA-seq, ChIP-seq, DNA-seq, …). Continuous improvements in sequencing chemistry and hardware have translated into significant cost decreases with huge increases in productivity (up to 400 million reads per lane on current Illumina® HiSeq 4000). This throughput often exceeds the sequencing depth required in many applications [1], or when working with small genomes. Protocols to sequence multiple samples within the same sequencer lane (multiplexed library) are now common practice in both single end (SE) and paired end (PE) strategies. Multiplexing can also be guided by experimental design considerations where samples are sequenced in different lanes to gain information on technical variance, or in staged sequencing approaches to reduce sequencing costs where samples are sequenced sequentially until the required sequencing depth is achieved [1]. In multiplexed libraries, DNA fragments originating from the same sample are associated with a unique sequence of fixed length (e.g. six bases). This barcode (or index) is later used to computationally identify the original sample of each sequenced read. In the Illumina® TruSeq™ protocol (Fig. 1a, left), the barcode is inserted further down the DNA fragment and debarcoding is usually performed using the Illumina CASAVA pipeline. In alternative protocols (Fig. 1a, right), the barcode is inserted directly upstream of the DNA fragment during library construction and the debarcoding operation is typically performed using third party tools [2, 3] (also see Additional file 1: Table S1 for a feature comparison) or custom code.

Fig. 1

Barcoding Strategies. a Schematic view of the multiplexed library processing. A unique and different barcode (BC, white box with black stripes) is used for each sample. The barcode is placed further down the DNA fragment and sequenced in a specific sequencing round (Illumina® TruSeq™, left); or directly upstream of the DNA fragment and sequenced concomitantly (custom protocol, right). After sequencing and image processing, reads of multiplexed samples are mixed together in the fastq result file. For each read, the barcoding sequence (black box with white stripes) is computationally clipped off the read end (custom protocols) or read from the additional barcode file (Illumina® TruSeq™, index file is provided with the I1 option); and the original sample is identified by comparing this barcoding sequence to known barcodes. Finally, read sequences are saved in sample specific fastq files. b In PE sequencing, barcodes can be added to one or both fragment ends. The Je demultiplex BPOS option indicates which read(s) contain(s) the barcode(s). c demultiplex options for barcodes present at both read ends. A decision is needed to specify which barcode is used to identify separate samples. d Combining UMIs (BC1 and BC2, white box with black stripes) with Illumina sample indexing (white box with black dots, top) or as composite barcode (bottom). In a composite barcode, the number of random bases upstream and downstream of the sample index is variable

Custom multiplexing protocols offer great design flexibility, in particular in PE sequencing where barcodes can be inserted at one or both ends of the DNA fragment (Fig. 1b). In the latter, the barcode found in each read of the pair is usually the same, and this redundancy allows for more specificity when one of the barcoding sequences contains errors or bases of poor quality. The encoding possibilities are multiplied by attaching a different barcode to each end of the DNA fragment.
Lastly, the correct interpretation of experiments, such as single cell RNA-seq (scRNA-seq), requires the disentanglement of biological read duplicates that reflect RNA abundance in the cell from technical duplicates that result from sequencing the same RNA molecule multiple times (PCR duplicates). A common procedure towards this goal is to barcode each DNA fragment before PCR amplification, i.e. each read is attached to a fixed-length (random) sequence that will act as a Unique Molecular Identifier (UMI) [4-7]. After read mapping, only duplicate reads with different UMIs will be kept in downstream processing. UMIs can be combined with sample barcodes in different ways, depending on the protocol: using separate ends of the DNA fragments (Fig. 1c, case 2), combining Illumina sample indexing with custom barcoding to add a UMI to DNA fragment ends (Fig. 1d, top) or using composite barcodes (Fig. 1d, bottom). Currently available tools do not offer the flexibility required to process these different barcoding configurations and perform duplicate filtering using UMIs. Here we present Je, a suite of tools that can demultiplex fastq files (accommodating all situations described above), extract UMIs from demultiplexed files and filter (or flag) read duplicates taking UMIs into account (Fig. 2).

Fig. 2

The different modules of Je (green squared blocks) and their usage in workflows. The clip, demultiplex and demultiplex-illu modules are the three possible entry points to process barcoded fastq files (blue squared blocks). In most setups (plain arrows), clipped or demultiplexed fastq files are mapped to the genome (grey squared block) using your favorite mapper and filtered for duplicate reads by Je’s markdupes module using extracted UMIs. In more complex barcoding designs (e.g. composite barcodes, Supplementary Text), additional clipping before or after the sample demultiplexing step could be required (dashed arrows)

Implementation

Je is implemented in Java 7 and uses the htsjdk (http://samtools.github.io/htsjdk/) and picard [8] libraries. Je has been designed with extensibility in mind, with each sub-module (demultiplex, demultiplex-illu, clip or markdupes) encapsulated in its own package. This is reflected at the command-line level, where the command to run (demultiplex, demultiplex-illu, clip or markdupes) is specified right after the je executable, followed by the relevant module’s options, e.g. je demultiplex <options>, where <options> is the option list. The top-level class Je.java is responsible for parsing this command line and invoking the appropriate sub-module’s class (for example Jeclipper.java in the jeclipper package) with the user-provided options. The sub-module class is then responsible for validating the user’s options before computing.

The demultiplex command

The demultiplex command is used when the sample-encoding barcode is found at the beginning of the read (Fig. 1a, right). It can deal with SE and PE reads having barcodes in one or both reads, with or without UMIs (Additional file 1: Supplementary Text). This includes situations where barcodes contain degenerate positions (like in the individual-nucleotide resolution Cross-Linking and ImmunoPrecipitation (iCLIP) protocol), are combined with UMIs into composite barcodes (Fig. 1d, bottom) or found in different reads (e.g. sample-encoding barcode in read_1 and UMIs in read_2, Fig. 1c). Je’s demultiplex module offers many options to tune sample identification stringency (e.g. mismatch number, barcode combination), read processing (e.g. trimming, clipping) and output format (gzip compression, md5 checksum generation). In all situations that include UMIs (or degenerate barcodes), demultiplex output is fully compatible with Je’s markdupes command.
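The sample identification step described above can be illustrated with a minimal Python sketch. It is not Je's implementation (Je is written in Java); the function names, the parameters max_mismatches and min_delta, and the example barcodes are all hypothetical and only show the general idea of matching an in-line barcode at the read start while tolerating a chosen number of mismatches.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_seq, barcodes, max_mismatches=1, min_delta=1):
    """Return the sample whose barcode best matches the read start, or None.

    barcodes maps sample name -> barcode sequence (all of equal length).
    Both thresholds are illustrative assumptions: a read is assigned only if
    the best barcode has at most max_mismatches mismatches and beats the
    second-best barcode by at least min_delta mismatches.
    """
    bc_len = len(next(iter(barcodes.values())))
    prefix = read_seq[:bc_len]
    if len(prefix) < bc_len:
        return None  # read too short to carry a full barcode
    scored = sorted((hamming(prefix, bc), name) for name, bc in barcodes.items())
    best_dist, best_name = scored[0]
    if best_dist > max_mismatches:
        return None  # no barcode is close enough
    if len(scored) > 1 and scored[1][0] - best_dist < min_delta:
        return None  # ambiguous between two samples
    return best_name


# Hypothetical six-base barcodes
barcodes = {"sampleA": "ACGTAC", "sampleB": "TTGCAA"}
print(assign_sample("ACGTACGGGTTTAAA", barcodes))  # sampleA (exact match)
print(assign_sample("ACGAACGGGTTTAAA", barcodes))  # sampleA (one mismatch)
print(assign_sample("GGGGGGGGGTTTAAA", barcodes))  # None (unassigned)
```

Requiring a margin between the best and second-best barcode is one common way to avoid assigning reads whose barcode is equally close to two samples; whether and how a given tool does this depends on its options.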

The demultiplex-illu command

The demultiplex-illu command is used when sample-encoding barcodes are provided in separate fastq file(s) and UMIs are found at the beginning of the read(s). While CASAVA’s bcl2fastq2 tool is usually used to convert bcl files to fastq files and perform demultiplexing at the same time, it can also generate non-demultiplexed fastq files together with associated fastq index files (Fig. 1a, left). This alternative proves useful when debugging new protocols that use the index position for purposes other than sample encoding, or to overcome bcl2fastq2 barcode matching limitations (e.g. it only allows up to two mismatches). Je’s demultiplex-illu module offers the same options as the demultiplex module and its output is fully compatible with Je’s markdupes command.

The clip command

The clip command is used to extract UMIs from fastq files that do not require sample demultiplexing at the same time. As with the demultiplex and demultiplex-illu commands, extracted UMIs are added to the read headers (as expected by markdupes) and read headers are reformatted to fulfill read mappers’ requirements (most read mappers expect headers for read_1 and read_2 to be strictly identical). The clip module offers the same read processing (e.g. trimming, clipping) and output formatting options as the demultiplexing modules.
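As a rough illustration of the clipping idea, the Python sketch below moves a UMI from the start of read_1 into the headers of both mates and strips everything after the first whitespace so that the two headers become identical. The header layout (name, a ':' separator, then the UMI), the function names and the example reads are assumptions made for this sketch, not a description of Je's actual output format.

```python
def tag_header(header, umi):
    """Keep only the read name (text before the first whitespace) and append the UMI."""
    name = header.split()[0]
    return f"{name}:{umi}"


def clip_pair(read1, read2, umi_len):
    """Clip a UMI of umi_len bases from the start of read1.

    read1 and read2 are (header, sequence, qualities) tuples. The UMI is
    removed from read1's sequence and qualities, and the same UMI is written
    into both headers so that the mates keep strictly identical names.
    """
    h1, s1, q1 = read1
    h2, s2, q2 = read2
    if len(s1) < umi_len:
        raise ValueError("read_1 is shorter than the UMI length")
    umi = s1[:umi_len]
    new1 = (tag_header(h1, umi), s1[umi_len:], q1[umi_len:])
    new2 = (tag_header(h2, umi), s2, q2)
    return new1, new2


# Hypothetical read pair with a 4-base UMI at the start of read_1
r1 = ("@read42 1:N:0:1", "ACGTTTGGCCAA", "IIIIIIIIIIII")
r2 = ("@read42 2:N:0:1", "GGCCAATT", "IIIIIIII")
c1, c2 = clip_pair(r1, r2, 4)
print(c1)  # ('@read42:ACGT', 'TTGGCCAA', 'IIIIIIII')
print(c2)  # ('@read42:ACGT', 'GGCCAATT', 'IIIIIIII')
```

Carrying the UMI in the read name lets it survive mapping, since mappers copy the read name into the alignment record.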

The markdupes command

The markdupes command extends the popular Picard’s MarkDuplicates tool [8] by adding support for UMIs embedded in read headers (as generated by the demultiplex, demultiplex-illu or clip commands). This module takes mapped reads as input (in SAM/BAM format) and identifies PCR (and optical) read duplicates based on their mapping positions and UMIs. In short, reads identified as duplicates based on their mapping locations are further regrouped based on their UMIs (Additional file 1: Supplementary Text). All reads of a UMI group are declared duplicates but one (according to the chosen scoring strategy). Finally, duplicate reads are either discarded or included in output (with bitwise flag 1024). Je’s markdupes module supports random UMIs (any combination of a k-mer can occur) or runs with a predefined list of UMIs (as in e.g. NEXTflex™ kit from Bioo Scientific). In both situations, different options (in addition to all native Picard’s MarkDuplicates options) are offered to tune UMI comparison stringency like the number of mismatches to still consider two UMIs identical, or how to handle Ns found in UMIs.
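The two-level grouping described above (first by mapping position, then by UMI) can be sketched in Python as follows. This is a simplified illustration rather than Je's or Picard's algorithm: the read representation, the greedy clustering of UMIs against the best-scoring read of each group, and the max_mismatches parameter name are assumptions made for the example.

```python
from collections import defaultdict


def umi_distance(a, b):
    """Mismatch count between two UMIs (assumed to have the same length)."""
    return sum(x != y for x, y in zip(a, b))


def mark_duplicates(reads, max_mismatches=0):
    """Return the names of reads flagged as duplicates.

    reads is a list of dicts with keys: name, chrom, pos, strand, umi, score.
    Reads are first grouped by mapping position; inside each position group
    they are regrouped by UMI, and every read of a UMI group except the
    highest-scoring one is flagged.
    """
    by_position = defaultdict(list)
    for read in reads:
        by_position[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in by_position.values():
        representatives = []  # best-scoring read of each UMI group so far
        for read in sorted(group, key=lambda r: -r["score"]):
            for rep in representatives:
                if umi_distance(read["umi"], rep["umi"]) <= max_mismatches:
                    duplicates.add(read["name"])  # same molecule as rep
                    break
            else:
                representatives.append(read)  # new UMI at this position
    return duplicates


# Hypothetical reads: r1, r2 and r3 share a position, r3 carries a different UMI
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG", "score": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG", "score": 40},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "TTGC", "score": 35},
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "umi": "AACG", "score": 20},
]
print(mark_duplicates(reads))  # {'r1'}
```

A position-only approach would keep a single read among r1, r2 and r3; taking the UMI into account keeps r3 as well, because it stems from a different molecule. In a SAM/BAM file the flagged reads would carry the bitwise flag 1024 mentioned above.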

Galaxy integration

A wrapper for integration in Galaxy [9] was written for each Je sub-module following Galaxy guidelines and best practices. All wrappers (and Je code) were uploaded to the Galaxy toolshed [10] as a repository suite, enabling Galaxy administrators to either install each sub-module separately or together as a suite.

Results and discussion

Using UMIs significantly increases the number of useable reads

scRNA-seq is a powerful tool to quantify the extent of gene expression variability amongst a population of cells and, for example, reveal sub-populations of cells or new cell types. The low amount of starting material (combined with the low efficiency of RNA capture and cDNA synthesis) and the bias introduced by the substantial amplification required have been identified as major limitations and generally result in a high level of technical noise [7, 11]. By eliminating the noise introduced at the amplification step, the use of UMIs was demonstrated to be critical for an accurate and absolute quantification of the number of original messenger RNA molecules present per cell [4, 12], and globally facilitates distinguishing true biological variability from technical variability [7]. Generally, scRNA-seq data has a very high level of duplicate reads (as identified solely by their identical mapping position). Therefore, increasing the number of unique reads available for gene expression quantification is key, in particular for lowly to moderately expressed genes [11]. To quantify the gain of tagging reads with UMIs that are processed using the Je suite, we reprocessed 50 single-cell experiments from Islam et al. [12] (Additional file 1: Supplementary Methods) and identified unique reads with and without taking the UMIs into account (using je markdupes and Picard MarkDuplicates [8], respectively); the number of unique reads, in this case, directly translates to the number of RNA molecules present in the cell. We quantified the gain obtained using the UMIs as the number of duplicate reads reassigned as unique reads once the UMIs were taken into account. Expressed as a percentage relative to the number of unique reads identified without accounting for the UMIs, the gains ranged from 13 to 36 % with an average of 24 % (Additional file 1: Figure S1). iCLIP also suffers from high duplication rates due to the low number of biologically relevant genomic positions.
To evaluate the impact of using UMIs for this type of experimental data, we analyzed iCLIP human samples published by Zarnack et al. [13] (Additional file 1: Supplementary Methods) and observed gains in the number of useable reads ranging from 10 to 36 % with an average of 21 % (Additional file 1: Figure S2). The advantage of using UMIs is not limited to scRNA-seq or iCLIP experiments. Indeed, duplicate read filtering (using e.g. Picard MarkDuplicates) is standard practice in the processing of DNase I hypersensitive sites sequencing (DNase-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) data, which in single-end sequencing results in capping the coverage (number of reads or fragments overlapping a specific genomic position). As the sequencing depth increases, this approach severely impacts the signal-to-noise ratio as the background coverage increases while the signal coverage has reached its upper limit. A straightforward solution is to systematically introduce UMIs and use Je to uniquely identify fragments in DNase-seq and ChIP-seq libraries to avoid an artificial limitation of the dynamic range.
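The gain reported for the scRNA-seq and iCLIP data sets is the number of duplicate reads reassigned as unique once UMIs are considered, expressed as a percentage of the unique reads found without UMIs. A minimal Python sketch of that ratio follows; the function name is hypothetical and the read counts in the usage line are made-up illustrative values, not figures from the study.

```python
def umi_gain_percent(unique_without_umi, unique_with_umi):
    """Percentage gain in unique reads obtained by taking UMIs into account.

    unique_without_umi: unique reads when duplicates are called on mapping
                        position alone.
    unique_with_umi:    unique reads when duplicates are called on mapping
                        position and UMI.
    """
    if unique_without_umi <= 0:
        raise ValueError("unique_without_umi must be positive")
    rescued = unique_with_umi - unique_without_umi  # duplicates reassigned as unique
    return 100.0 * rescued / unique_without_umi


# Illustrative counts only
print(umi_gain_percent(1_000_000, 1_240_000))  # 24.0
```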

Je offers a unique set of features

Although a number of demultiplexing tools have been published, Je comes with a unique set of features when compared to available tools (Additional file 1: Table S1). For instance, deML [14] and bayexer [15] focus on improving Illumina TruSeq index demultiplexing in the particular situation of low quality reads; TagGD [16], GBSX [17] and FLEXBAR [18] specialize in barcode design and provide debarcoding algorithms able to handle barcodes of variable length or found at variable positions in the read; while fastq-multx [3] and fastx_barcode_splitter [2] only accommodate the standard in-line barcoding approaches (barcodes of fixed length found at the start of reads). Although some of these tools might be better suited than Je in particular situations, none of them offers UMI support and they should therefore be combined with specialized tools such as UMI-tools [19] or Je (clip and markdupes) when reads contain UMIs. Similarly, modules from MIGEC [20], a suite of tools specialized in the processing of T-cell receptor repertoire sequencing (RepSeq) data, can demultiplex and utilize UMI-tagged data but with a fundamentally different deduplication approach, in that it works directly on non-aligned reads and assembles them into consensus sequences. In addition, MIGEC (like UMI-tools) cannot accommodate a predefined list of UMIs. Taken together, we believe that the extent and flexible nature of the features offered by Je are unique, and constitute a valuable suite for data with complex experimental designs.

Conclusions

Je offers the necessary tools to address most barcoding situations with and without UMIs (also see Additional file 1: Supplementary Text) and the identification of PCR duplicates based on extracted UMIs. In standard experimental setups (one barcode per sample, identical barcodes at both fragment ends) and using equivalent options (i.e. mismatch number), Je demultiplex produced identical results when compared to other demultiplexing tools [2, 3] and performed 3.8 times faster and 4.5 times slower than the popular FASTX [2] (barcode_splitter) and ea-utils [3] (fastq-multx) packages, respectively (Additional file 1: Supplementary Methods). However, Je demultiplex and demultiplex-illu can handle more complex designs such as mixing sample-encoding barcodes and UMIs. Using Je to process complex public scRNA-seq and iCLIP data that leverage the advantages of UMIs, we observed an increase in unique reads of up to 36 % when compared to Picard MarkDuplicates [8], which cannot account for the presence of UMIs. To broaden Je's accessibility, we developed wrappers for Galaxy [9] and made Je available through the Galaxy toolshed [10].

17 in total

1. Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes.

Authors: Katsuyuki Shiroguchi; Tony Z Jia; Peter A Sims; X Sunney Xie
Journal: Proc Natl Acad Sci U S A Date: 2012-01-09 Impact factor: 11.205

2. Counting absolute numbers of molecules using unique molecular identifiers.

Authors: Teemu Kivioja; Anna Vähärautio; Kasper Karlsson; Martin Bonke; Martin Enge; Sten Linnarsson; Jussi Taipale
Journal: Nat Methods Date: 2011-11-20 Impact factor: 28.547

3. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

4. Quantitative single-cell RNA-seq with unique molecular identifiers.

Authors: Saiful Islam; Amit Zeisel; Simon Joost; Gioele La Manno; Pawel Zajac; Maria Kasper; Peter Lönnerberg; Sten Linnarsson
Journal: Nat Methods Date: 2013-12-22 Impact factor: 28.547

5. Accounting for technical noise in single-cell RNA-seq experiments.

Authors: Philip Brennecke; Simon Anders; Jong Kyoung Kim; Aleksandra A Kołodziejczyk; Xiuwei Zhang; Valentina Proserpio; Bianka Baying; Vladimir Benes; Sarah A Teichmann; John C Marioni; Marcus G Heisler
Journal: Nat Methods Date: 2013-09-22 Impact factor: 28.547

6. Detection of ultra-rare mutations by next-generation sequencing.

Authors: Michael W Schmitt; Scott R Kennedy; Jesse J Salk; Edward J Fox; Joseph B Hiatt; Lawrence A Loeb
Journal: Proc Natl Acad Sci U S A Date: 2012-08-01 Impact factor: 11.205

7. Towards error-free profiling of immune repertoires.

Authors: Mikhail Shugay; Olga V Britanova; Ekaterina M Merzlyak; Maria A Turchaninova; Ilgar Z Mamedov; Timur R Tuganbaev; Dmitriy A Bolotin; Dmitry B Staroverov; Ekaterina V Putintseva; Karla Plevova; Carsten Linnemann; Dmitriy Shagin; Sarka Pospisilova; Sergey Lukyanov; Ton N Schumacher; Dmitriy M Chudakov
Journal: Nat Methods Date: 2014-05-04 Impact factor: 28.547

8. Sequencing depth and coverage: key considerations in genomic analyses.

Authors: David Sims; Ian Sudbery; Nicholas E Ilott; Andreas Heger; Chris P Ponting
Journal: Nat Rev Genet Date: 2014-02 Impact factor: 53.242

9. Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements.

Authors: Kathi Zarnack; Julian König; Mojca Tajnik; Iñigo Martincorena; Sebastian Eustermann; Isabelle Stévant; Alejandro Reyes; Simon Anders; Nicholas M Luscombe; Jernej Ule
Journal: Cell Date: 2013-01-31 Impact factor: 41.582

10. TagGD: fast and accurate software for DNA Tag generation and demultiplexing.

Authors: Paul Igor Costea; Joakim Lundeberg; Pelin Akan
Journal: PLoS One Date: 2013-03-04 Impact factor: 3.240
