
Analysing high-throughput sequencing data in Python with HTSeq 2.0.

Givanna H Putri, Simon Anders, Paul Theodor Pyl, John E Pimanda, Fabio Zanini.

Abstract

SUMMARY: HTSeq 2.0 provides a more extensive application programming interface including a new representation for sparse genomic data, enhancements for htseq-count to suit single-cell omics, a new script for data using cell and molecular barcodes, improved documentation, testing and deployment, bug fixes and Python 3 support.
AVAILABILITY AND IMPLEMENTATION: HTSeq 2.0 is released as open-source software under the GNU General Public License and is available from the Python Package Index at https://pypi.python.org/pypi/HTSeq. The source code is available on GitHub at https://github.com/htseq/htseq.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35561197      PMCID: PMC9113351          DOI: 10.1093/bioinformatics/btac166

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


Single-cell omics have exploded in popularity over the last few years, spearheaded by single-cell transcriptomics. While commercial software solutions from manufacturers such as 10X Genomics and BD Biosciences provide standardized pipelines (e.g. cellranger) for analysing single-cell omics data, numerous experimental approaches rely on open-source software to align reads and subsequently to quantify biological phenomena such as gene expression, chromatin accessibility, transcription factor binding affinities, and 3D chromatin conformation. HTSeq (Anders et al.) was initially developed as a general-purpose tool to analyse high-throughput sequencing data in Python. In parallel, the htseq-count script was designed to count the number of reads or read pairs attributable to distinct genes in bulk RNA-Seq experiments. At that time, single-cell approaches were limited to specialized biotechnology laboratories. In this application note, we report the development of HTSeq 2.0, which improves the general-purpose application programming interface (API) and specifically htseq-count to encompass diverse omics analyses, including single-cell RNA sequencing (scRNA-Seq).

First, we have improved htseq-count, a popular script used to quantify gene expression in bulk and scRNA-Seq experiments (Fig. 1A–C). Multiple BAM files can now be processed with a single call of the script, yielding a counts table in which each row or each column holds the counts from one BAM file. This is not only convenient but also faster because genomic features are loaded only once from the Gene Transfer Format (GTF) file, a step that can take as long as processing the reads for a typical plate-based single-cell experiment (Supplementary Fig. S1A and B). If multiple cores are available on the machine, htseq-count is now able to parallelize the quantification by allocating distinct input BAM files to each core (Fig. 1A, Supplementary Fig. S1A and B).
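The multi-file workflow described above can be illustrated with a minimal sketch. It is not the htseq-count implementation: the annotation, the read lists standing in for BAM files, and the function names `count_one_file`, `count_files` and `as_table` are all hypothetical. Threads are used only so that the sketch runs anywhere; a process pool would be the closer analogue of allocating files to cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy annotation: gene -> (chromosome, start, end), half-open.
# In a real run this would be parsed once from the GTF file.
FEATURES = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 300, 400),
}

# Hypothetical stand-ins for BAM files: file name -> list of (chrom, position).
TOY_FILES = {
    "cell1.bam": [("chr1", 120), ("chr1", 150), ("chr1", 350)],
    "cell2.bam": [("chr1", 310), ("chr1", 999)],
}


def count_one_file(name, features):
    """Count reads per gene for one input file."""
    counts = {gene: 0 for gene in features}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, pos in TOY_FILES[name]:
        hits = [g for g, (c, s, e) in features.items()
                if c == chrom and s <= pos < e]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return name, counts


def count_files(names, features, workers=2):
    """Return {file: {feature: count}}; the annotation is shared, not reloaded."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda n: count_one_file(n, features), names))


def as_table(per_file, files_as_columns=True):
    """Lay out the counts with one column (or one row) per input file."""
    files = list(per_file)
    features = list(next(iter(per_file.values())))
    if files_as_columns:
        header = ["feature"] + files
        rows = [[f] + [per_file[n][f] for n in files] for f in features]
    else:
        header = ["file"] + features
        rows = [[n] + [per_file[n][f] for f in features] for n in files]
    return [header] + rows


if __name__ == "__main__":
    for row in as_table(count_files(list(TOY_FILES), FEATURES)):
        print(*row, sep="\t")
```

Because each file is counted independently against the same in-memory annotation, the per-file work can be distributed without any coordination beyond collecting the results.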
The script also supports more output formats: compressed sparse matrices via scipy (Virtanen et al.), mtx files in the style of cellranger, h5-like file formats such as h5ad (Wolf et al.), and loom (http://loompy.org) (Fig. 1A). These output formats make it easier for users to import the counts table into downstream analysis libraries, especially single-cell ones such as scanpy (Wolf et al.) and singlet (https://github.com/iosonofabio/singlet).

We also added support for storing additional metadata for each genomic feature. This has two clear applications: (i) tracking additional gene information such as chromosome or aliases, which is useful for downstream analyses (e.g. for excluding sex chromosomes), and (ii) collecting disaggregated exon-level counts, which provides a simple yet powerful approach to quantifying differential isoform expression (Fig. 1B). To encourage users to customize their analysis pipeline, we also restructured the key steps of htseq-count into well-documented functions and added a tutorial that explains the feature counting step by step.

In addition, through a new script called htseq-count-barcodes, we support quantification of features in data multiplexed via cell barcodes and unique molecular identifiers (UMIs). Among other applications, the new script enables custom re-analysis of BAM files produced by cellranger using different parameters. The Pearson correlation between cellranger and htseq-count-barcodes with default parameters is 0.985, with uniformly high correlation across cells (Supplementary Fig. S1C).
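To make the idea of barcode-aware counting and sparse output concrete, the following sketch counts distinct UMIs per gene and cell and renders the result as Matrix Market coordinate text, the format underlying mtx files. It is an illustration under stated assumptions rather than the logic of htseq-count-barcodes: the read tuples are toy data (in cellranger BAM files the barcode and UMI would typically be read from alignment tags), and `count_umis` and `to_matrix_market` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell barcode, UMI, gene) tuples.
READS = [
    ("AAAC", "TTG", "geneA"),
    ("AAAC", "TTG", "geneA"),  # duplicate: same cell, UMI and gene
    ("AAAC", "CCA", "geneA"),
    ("AAAC", "TTG", "geneB"),
    ("GGGT", "TTG", "geneB"),
]


def count_umis(reads):
    """Count distinct UMIs per (gene, cell); duplicates collapse to one molecule."""
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        umis[(gene, cell)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}


def to_matrix_market(counts):
    """Render sparse counts as Matrix Market text (genes x cells, 1-based indices)."""
    genes = sorted({gene for gene, _ in counts})
    cells = sorted({cell for _, cell in counts})
    gene_idx = {g: i + 1 for i, g in enumerate(genes)}
    cell_idx = {c: j + 1 for j, c in enumerate(cells)}
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{len(genes)} {len(cells)} {len(counts)}",
    ]
    # Only non-zero entries are written, ordered by column then row.
    entries = sorted(
        (cell_idx[cell], gene_idx[gene], n) for (gene, cell), n in counts.items()
    )
    for col, row, n in entries:
        lines.append(f"{row} {col} {n}")
    return genes, cells, "\n".join(lines) + "\n"


if __name__ == "__main__":
    genes, cells, text = to_matrix_market(count_umis(READS))
    print(genes, cells)
    print(text, end="")
```

Only the non-zero gene–cell pairs are stored, which is what keeps such tables compact for single-cell data where most entries are zero.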
Fig. 1.

(A–C) Major improvements to htseq-count. (A) Parallel processing on multicore architectures enables faster processing of single-cell data, where each cell is represented by a BAM file [typical for Smart-seq2 (Picelli et al. 2013) and viscRNA-Seq (Zanini et al.)]. Note the new output formats available in HTSeq 2.0. (B) Conventional gene–cell matrix, which collapses reads that align to distinct exons of the same gene into a single gene count. (C) Additional attributes enable quantification at the exon level while retaining information on which gene each exon belongs to. (D, E) Sparse data representations in HTSeq 2.0. (D) StepVector represents piecewise-constant sparse genomic data. (E) StretchVector represents sparse islands of genomic data.

One of the key data structures in HTSeq is StepVector, an efficient sparse representation for piecewise-constant values on a 1D discrete space (typically a chromosome) (Fig. 1D). As an example, it can be used to store overlaps between gene bodies, which is critical for removing ambiguities in downstream gene expression analyses. However, genomic data is sometimes characterized by a distinct type of sparsity whereby the data appears as dense ‘islands of knowledge’ in a sea of missing data. This type of sparsity is apparent in the read coverage produced by amplicon sequencing or Chromatin Immunoprecipitation Sequencing (ChIP-Seq), where most of the genome is uncovered and non-zero, rapidly fluctuating coverage, varying down to single-nucleotide resolution (e.g. due to single-nucleotide polymorphisms), is present only around specific kilobase-long stretches. To represent this type of sparsity efficiently, we created a new data structure called StretchVector. At its core, a StretchVector is a collection of stretches implemented via dense numpy arrays (Harris et al.), each with associated start-end coordinates (Fig. 1E). Each stretch represents an island of data, while the rest of the genome is not stored.
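The contrast between the two kinds of sparsity can be sketched with two toy classes. These are not the HTSeq classes and do not reproduce their API: `ToyStepVector` and `ToyStretchVector` are hypothetical names, and the standard-library `array` type stands in for the dense numpy arrays mentioned above.

```python
from array import array
from bisect import bisect_right


class ToyStepVector:
    """Piecewise-constant values on [0, length): only breakpoints are stored."""

    def __init__(self, length, default=0):
        self.length = length
        self.starts = [0]        # start coordinate of each step
        self.values = [default]  # value held from that start up to the next one

    def __getitem__(self, pos):
        return self.values[bisect_right(self.starts, pos) - 1]

    def set_interval(self, start, end, value):
        """Assign value on [start, end), leaving everything outside unchanged."""
        if not 0 <= start < end <= self.length:
            raise ValueError("interval must satisfy 0 <= start < end <= length")
        # Value that must resume at `end`, unless the interval reaches the end.
        tail = self[end] if end < self.length else None
        steps = [(s, v) for s, v in zip(self.starts, self.values)
                 if s < start or s > end]
        steps.append((start, value))
        if tail is not None:
            steps.append((end, tail))
        steps.sort()
        self.starts = [s for s, _ in steps]
        self.values = [v for _, v in steps]


class ToyStretchVector:
    """Dense islands of data with start coordinates; the rest is not stored."""

    def __init__(self):
        self.stretches = {}  # start coordinate -> dense array of values

    def add_stretch(self, start, values):
        self.stretches[start] = array("d", values)

    def __getitem__(self, pos):
        for start, data in self.stretches.items():
            if start <= pos < start + len(data):
                return data[pos - start]
        return None  # position lies in the sea of missing data

    def stored_positions(self):
        return sum(len(data) for data in self.stretches.values())


if __name__ == "__main__":
    genes = ToyStepVector(1_000_000)
    genes.set_interval(100, 200, 1)
    genes.set_interval(150, 250, 2)
    print(list(zip(genes.starts, genes.values)))

    coverage = ToyStretchVector()
    coverage.add_stretch(5000, [3, 7, 2, 9])
    print(coverage[5001], coverage[10], coverage.stored_positions())
```

In the step representation, memory grows with the number of value changes, so rapidly fluctuating coverage would create one step per position. The stretch representation instead pays one dense entry per covered position and nothing for the uncovered remainder.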
We implemented functions for stretch extension, trimming, resetting, shifting, views or slices, copying, and conversion to and from monolithic arrays for simple data ingestion and extraction. Separately from StretchVector, we also improved the support for custom ChIP-Seq and chromatin conformation capture (Hi-C) analyses by adding parsers for bedGraph and BigWig files via pyBigWig (Ryan et al.) and by writing new dedicated tutorials.

Finally, we improved the API of HTSeq as a whole and made architectural changes to the package to ensure its compatibility with current software development standards. Among other things, we (i) modernized the codebase to Python 3, (ii) added provisions for continuous integration and development including automatic binary releases on multiple architectures, (iii) established unit tests and test suites, (iv) fixed bugs and (v) added support for improved dependency infrastructure such as autodetection of SAM/BAM/CRAM file type via HTSlib (Bonfield et al.). All aforementioned changes were carried out without compromising the efficiency of HTSeq, which stems from a cross-language design via Cython (Behnel et al.) and SWIG (Beazley, 2003).

In conclusion, HTSeq 2.0 is a fast and reliable Python library not only for analysing high-throughput sequencing data but also for quantifying gene expression from bulk and single-cell RNA-Seq experiments. Compared with the previous implementation, we added specific support for single-cell experiments and a richer API including a new data structure for managing ‘islands-of-data’ sparsity, improved API documentation and tutorials, fixed a number of bugs, established a robust testing and deployment framework to ensure scientific reproducibility, and enabled continuous code integration. We believe these improvements will make HTSeq 2.0 a convenient tool for exploring and quantifying high-throughput sequencing experiment results across multiple omic modalities.
References

1.  Smart-seq2 for sensitive full-length transcriptome profiling in single cells.

Authors:  Simone Picelli; Åsa K Björklund; Omid R Faridani; Sven Sagasser; Gösta Winberg; Rickard Sandberg
Journal:  Nat Methods       Date:  2013-09-22       Impact factor: 28.547

2.  HTSeq--a Python framework to work with high-throughput sequencing data.

Authors:  Simon Anders; Paul Theodor Pyl; Wolfgang Huber
Journal:  Bioinformatics       Date:  2014-09-25       Impact factor: 6.937

3.  SCANPY: large-scale single-cell gene expression data analysis.

Authors:  F Alexander Wolf; Philipp Angerer; Fabian J Theis
Journal:  Genome Biol       Date:  2018-02-06       Impact factor: 13.583

4.  Array programming with NumPy.

Authors:  Charles R Harris; K Jarrod Millman; Stéfan J van der Walt; Ralf Gommers; Pauli Virtanen; David Cournapeau; Eric Wieser; Julian Taylor; Sebastian Berg; Nathaniel J Smith; Robert Kern; Matti Picus; Stephan Hoyer; Marten H van Kerkwijk; Matthew Brett; Allan Haldane; Jaime Fernández Del Río; Mark Wiebe; Pearu Peterson; Pierre Gérard-Marchant; Kevin Sheppard; Tyler Reddy; Warren Weckesser; Hameer Abbasi; Christoph Gohlke; Travis E Oliphant
Journal:  Nature       Date:  2020-09-16       Impact factor: 49.962

5.  HTSlib: C library for reading/writing high-throughput sequencing data.

Authors:  James K Bonfield; John Marshall; Petr Danecek; Heng Li; Valeriu Ohan; Andrew Whitwham; Thomas Keane; Robert M Davies
Journal:  Gigascience       Date:  2021-02-16       Impact factor: 6.524

6.  Single-cell transcriptional dynamics of flavivirus infection.

Authors:  Fabio Zanini; Szu-Yuan Pu; Shirit Einav; Stephen R Quake; Elena Bekerman
Journal:  Elife       Date:  2018-02-16       Impact factor: 8.140

7.  SciPy 1.0: fundamental algorithms for scientific computing in Python.

Authors:  Pauli Virtanen; Ralf Gommers; Travis E Oliphant; Matt Haberland; Tyler Reddy; David Cournapeau; Evgeni Burovski; Pearu Peterson; Warren Weckesser; Jonathan Bright; Stéfan J van der Walt; Matthew Brett; Joshua Wilson; K Jarrod Millman; Nikolay Mayorov; Andrew R J Nelson; Eric Jones; Robert Kern; Eric Larson; C J Carey; İlhan Polat; Yu Feng; Eric W Moore; Jake VanderPlas; Denis Laxalde; Josef Perktold; Robert Cimrman; Ian Henriksen; E A Quintero; Charles R Harris; Anne M Archibald; Antônio H Ribeiro; Fabian Pedregosa; Paul van Mulbregt
Journal:  Nat Methods       Date:  2020-02-03       Impact factor: 28.547
