PyIOmica: longitudinal omics analysis and trend identification.

Sergii Domanskyi¹, Carlo Piermarocchi¹, George I Mias¹,²,³.

Abstract

SUMMARY: PyIOmica is an open-source Python package for integrating longitudinal multiple omics datasets and for characterizing and categorizing temporal trends. The package provides multiple bioinformatics tools, including data normalization, annotation, categorization, visualization and enrichment analysis for gene ontology terms and pathways. Additionally, the package includes an implementation of visibility graphs to visualize time series as networks.
AVAILABILITY AND IMPLEMENTATION: PyIOmica is implemented as a Python package (pyiomica), available for download and installation through the Python Package Index (https://pypi.python.org/pypi/pyiomica), and can be loaded with the Python import statement following installation. PyIOmica has been tested on Mac OS X, Unix/Linux and Microsoft Windows. The application is distributed under an MIT license. Source code for each release is also available for download on Zenodo (https://doi.org/10.5281/zenodo.3548040).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics.

Year: 2020 PMID: 31778155 PMCID: PMC7141865 DOI: 10.1093/bioinformatics/btz896

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

As sequencing costs continue to drop, systems biology based on large omics datasets is rapidly expanding its scope. In particular, time series obtained from multi-omics datasets are becoming increasingly affordable (Chen; Garrett-Bakelman; Price). The analysis of time series can have broad implications for precision medicine applications, since longitudinal data capture the dynamically changing collective microscopic behavior of molecular components in the body, reflecting the physiological state of a patient. Many bioinformatics tools aim at multimodal omics data integration (Pinu), including Bioconductor (Gentleman), Galaxy (Afgan), GenePattern (Reich), Biopython (Cock), Pathomx (Fitzpatrick) and SECIMTools (Kirpich). Although multiple coding paradigms are used in bioinformatics, R and Python are essentially the lingua francas of data science analysis, where their open-source appeal and growing online community support are particularly helpful in developing a dedicated user base.

Here we introduce PyIOmica, an open-source Python package for analyzing longitudinal omics datasets, such as transcriptomics, proteomics and metabolomics. It includes multiple tools for processing multi-modal mapped data, characterizing time series in terms of periodograms and autocorrelations, categorizing temporal behavior, visualizing visibility graphs and testing data for gene ontology and pathway enrichment. PyIOmica includes optimized new algorithms adapted from MathIOmica (Mias; which runs on the proprietary Mathematica platform), now made available as open-source Python code for all users. It additionally provides extensively expanded graphical utilities for visualization of categorized temporal data and for network representation of time series. To our knowledge, no tools with the functionality of PyIOmica are currently available in Python.

2 Materials and methods

2.1 Overview and codebase

PyIOmica provides a complete workflow for time series processing, illustrated in Supplementary Figure S1. The modular nature of PyIOmica allows for smooth integration with existing and future Python tools. With PyIOmica, results can be visualized, exported and analyzed for gene enrichment by means of a user-friendly Python interface. PyIOmica's codebase is a single Python module containing multiple groups of functions designed for annotations and enumerations, pre- and post-processing, clustering-related purposes, visualizations (heatmaps and categorization), generation of normal and horizontal visibility graphs, and other core and utility components. Installation is performed using pip install pyiomica, and package dependencies are resolved automatically from the Python Package Index (PyPI). Function documentation is embedded in the module and is accessible at runtime (and also at https://pyiomica.readthedocs.io). Data structures and implementation are described in Supplementary Material.

An extensive set of PyIOmica pre-processing functions enables filtering low-quality signals, tagging missing or low values, normalization, standardization, merging and comparison of datasets. The post-processing functions, such as categorization of temporal trends based on power spectra and spikes, are built on the SciPy and scikit-learn Python toolkits. Additional functionality includes gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses, both for non-temporal data and for clusters identified through the automated time series categorization.

Temporal trends are automatically discovered using periodogram and autocorrelation calculations based on a Lomb-Scargle transformation algorithm (Mias), which properly accounts for missing points and/or unevenly sampled data. The periodogram is used to identify the underlying dominant frequencies of each time series.
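As an illustration of how a Lomb-Scargle periodogram recovers a dominant frequency from a series with missing points, the following is a minimal sketch that calls SciPy directly. It does not use or reproduce PyIOmica's own functions; the synthetic 12-h signal, the dropped timepoints and all variable names are assumptions made for this example.

```python
import numpy as np
from scipy.signal import lombscargle

# Hypothetical data: hourly sampling over 48 h of a signal with a 12-h period.
# Several timepoints are dropped to mimic missing measurements.
times = np.arange(48.0)
keep = np.ones(times.size, dtype=bool)
keep[[3, 4, 10, 17, 18, 19, 30, 41]] = False
t_obs = times[keep]
y_obs = np.sin(2 * np.pi * t_obs / 12.0)

# Candidate frequencies in cycles per hour, up to the Nyquist limit
# of the underlying 1-h sampling grid.
freqs = np.linspace(0.01, 0.5, 491)

# scipy.signal.lombscargle expects angular frequencies (radians per hour here).
# The mean is subtracted first so that a constant offset does not leak power.
power = lombscargle(t_obs, y_obs - y_obs.mean(), 2 * np.pi * freqs)

# The dominant frequency is the location of the periodogram maximum.
dominant_freq = freqs[np.argmax(power)]
dominant_period = 1.0 / dominant_freq
print(f"dominant period: {dominant_period:.1f} h")
```

Because the periodogram is evaluated from the observed timepoints only, no interpolation of the missing measurements is needed before the spectral analysis.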
Autocorrelations are also used to identify how measured intensities within each time series may depend on previous measurements, by correlating a time series with delayed versions of itself. Signals showing statistically significant trends are identified for downstream analysis. Components from multiple omics (genes, proteins and metabolites) that show similar trends over time are grouped by clustering, and can be biologically evaluated through pathway and GO analyses.
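The idea of correlating a series with delayed copies of itself, and then grouping signals with similar behavior, can be sketched as follows. This is a simplified illustration, not PyIOmica's implementation: it assumes evenly sampled, complete series (so a plain sample autocorrelation suffices instead of a Lomb-Scargle-based one), and the helper name `autocorrelation`, the synthetic signals and the choice of average-linkage hierarchical clustering are all assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def autocorrelation(series, max_lag):
    """Sample autocorrelation at lags 1..max_lag of an evenly sampled series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    # Correlate the series with a copy of itself delayed by `lag` timepoints.
    return np.array(
        [np.sum(x[:-lag] * x[lag:]) / denom for lag in range(1, max_lag + 1)]
    )


# Hypothetical signals: three with a 12-h period and three with a 4-h period,
# each with a different phase, sampled hourly for 24 h.
t = np.arange(24.0)
phases = [0.0, 0.5, 1.0]
signals = np.array(
    [np.sin(2 * np.pi * t / 12.0 + p) for p in phases]
    + [np.sin(2 * np.pi * t / 4.0 + p) for p in phases]
)

# Autocorrelation profiles depend mainly on the period rather than the phase,
# so signals with the same periodicity end up close together.
profiles = np.array([autocorrelation(s, max_lag=12) for s in signals])

# Agglomerative clustering of the profiles, cut into two groups.
tree = linkage(profiles, method="average", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Clustering on autocorrelation profiles rather than on raw intensities groups signals that share temporal structure even when they are shifted in time relative to one another.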

2.2 Visibility graphs and visualization

Recent work on characterizing complex events focuses on using network/graph methodology that can capture non-linear behavior (Lacasa). Time series are transformed into networks that conserve their topology and allow the identification of varying temporal structures. We represent each timepoint in a series as a node. Then, for any timepoint pair $(t_a, t_b)$ with intensities $y_a$ and $y_b$ at times $t_a$ and $t_b$ respectively, we can have an edge if for any other timepoint $t_c$ with intensity $y_c$, such that $t_a < t_c < t_b$, we have $y_c < y_b + (y_a - y_b)\,\frac{t_b - t_c}{t_b - t_a}$. Representing the intensities as bars, this is equivalent to connecting the top of each bar to another top if there is a direct line-of-sight to that top. The resulting visibility graph has characteristics that reflect the temporal structure of the equivalent time series and can be used to identify trends. The shortest path identifies nodes (i.e. timepoints) that display high intensity, and thus dominate the global signal profile, are robust to noise, and are likely drivers of the global temporal behavior. A biological event deviating from baseline is likely to appear in one or more nodes within the shortest path.

PyIOmica uses Matplotlib plotting functions to visualize histograms, dendrograms, heatmaps and visibility graphs. Figure 1a shows example RNA-sequencing gene expression data from a 24-h time series, clustered into two groups based on autocorrelations. Subgroups were determined from the gene expression in each autocorrelation group. The data from Group 1, Subgroup 2, containing 191 genes, are visualized in Figure 1b as a visibility graph on a circular layout. Temporal events are detected and indicated with solid blue lines encompassing groups of points, or communities. Additional examples are provided in the PyIOmica documentation (Supplementary Material), using data that are provided with the PyIOmica Zenodo software release (under docs/examples).
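The line-of-sight construction described in this section can be sketched in plain Python as below. This is an illustrative implementation, not PyIOmica's code: the function names `visibility_edges` and `shortest_path`, the toy series, and the choice of the first and last timepoints as shortest-path endpoints are assumptions made for this example. The `horizontal` option uses the horizontal visibility criterion, in which two timepoints are linked only if every intermediate intensity is lower than both of theirs.

```python
from collections import deque


def visibility_edges(times, values, horizontal=False):
    """Return the undirected edges (i, j), i < j, of a visibility graph."""
    n = len(times)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            visible = True
            for c in range(a + 1, b):
                if horizontal:
                    # Horizontal criterion: the lower of the two bar tops.
                    limit = min(values[a], values[b])
                else:
                    # Height at t_c of the straight line joining the two bar tops.
                    limit = values[b] + (values[a] - values[b]) * (
                        (times[b] - times[c]) / (times[b] - times[a])
                    )
                if values[c] >= limit:
                    visible = False  # an intermediate bar blocks the view
                    break
            if visible:
                edges.add((a, b))
    return edges


def shortest_path(edges, source, target):
    """Breadth-first search for one shortest path between two nodes."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, []).append(j)
        neighbors.setdefault(j, []).append(i)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt in sorted(neighbors.get(node, [])):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if target not in previous:
        return []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


# Toy series (hypothetical values) with a pronounced peak at index 3.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [4.0, 1.0, 2.0, 5.0, 1.0]

natural = visibility_edges(times, values)
horizontal = visibility_edges(times, values, horizontal=True)
path = shortest_path(natural, 0, len(times) - 1)
print(sorted(natural))
print(sorted(horizontal))
print(path)
```

In this toy series the shortest path between the first and last timepoints passes through the peak at index 3, which is the kind of high-intensity node the text describes. The triple loop is cubic in the number of timepoints, which is acceptable for short omics time series but would need a faster algorithm for long ones.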

Fig. 1.

Example PyIOmica data visualization. (a) Dendrogram with heatmap of automatically categorized longitudinal gene expression data. Autocorrelations are used to identify temporal trends in the data. Subgroups are determined based on similar collective behavior over time. (b) Visibility graph of median signal intensity from group G1S2 from (a).

3 Conclusion

The open-source PyIOmica Python package characterizes time series from multiple omics and categorizes temporal trends with a streamlined automated pipeline based on spectral analysis. PyIOmica also offers broad bioinformatics functionality, including clustering, visualization and enrichment, and extends previous developments (Mias) to an open-source, community-accessible platform for data science. We anticipate that future versions of PyIOmica will use the flexibility of its codebase to expand its bioinformatics tools for genomic and differential omics analyses, as well as for graph construction and characterization.

Funding

This work was supported by the Translational Research Institute for Space Health through National Aeronautics and Space Administration (NASA) Cooperative Agreement NNX16AO69A.

Conflict of Interest: G.M. has consulted for Colgate-Palmolive North America. C.P. owns equity in Salgomed, Inc. S.D. reports no potential conflict of interest.

References

1. GenePattern 2.0.

Authors: Michael Reich; Ted Liefeld; Joshua Gould; Jim Lerner; Pablo Tamayo; Jill P Mesirov
Journal: Nat Genet Date: 2006-05 Impact factor: 38.330

2. From time series to complex networks: the visibility graph.

Authors: Lucas Lacasa; Bartolo Luque; Fernando Ballesteros; Jordi Luque; Juan Carlos Nuño
Journal: Proc Natl Acad Sci U S A Date: 2008-03-24 Impact factor: 11.205

3. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

4. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

5. The NASA Twins Study: A multidimensional analysis of a year-long human spaceflight.

Authors: Francine E Garrett-Bakelman; Manjula Darshi; Stefan J Green; Ruben C Gur; Ling Lin; Brandon R Macias; Miles J McKenna; Cem Meydan; Tejaswini Mishra; Jad Nasrini; Brian D Piening; Lindsay F Rizzardi; Kumar Sharma; Jamila H Siamwala; Lynn Taylor; Martha Hotz Vitaterna; Maryam Afkarian; Ebrahim Afshinnekoo; Sara Ahadi; Aditya Ambati; Maneesh Arya; Daniela Bezdan; Colin M Callahan; Songjie Chen; Augustine M K Choi; George E Chlipala; Kévin Contrepois; Marisa Covington; Brian E Crucian; Immaculata De Vivo; David F Dinges; Douglas J Ebert; Jason I Feinberg; Jorge A Gandara; Kerry A George; John Goutsias; George S Grills; Alan R Hargens; Martina Heer; Ryan P Hillary; Andrew N Hoofnagle; Vivian Y H Hook; Garrett Jenkinson; Peng Jiang; Ali Keshavarzian; Steven S Laurie; Brittany Lee-McMullen; Sarah B Lumpkins; Matthew MacKay; Mark G Maienschein-Cline; Ari M Melnick; Tyler M Moore; Kiichi Nakahira; Hemal H Patel; Robert Pietrzyk; Varsha Rao; Rintaro Saito; Denis N Salins; Jan M Schilling; Dorothy D Sears; Caroline K Sheridan; Michael B Stenger; Rakel Tryggvadottir; Alexander E Urban; Tomas Vaisar; Benjamin Van Espen; Jing Zhang; Michael G Ziegler; Sara R Zwart; John B Charles; Craig E Kundrot; Graham B I Scott; Susan M Bailey; Mathias Basner; Andrew P Feinberg; Stuart M C Lee; Christopher E Mason; Emmanuel Mignot; Brinda K Rana; Scott M Smith; Michael P Snyder; Fred W Turek
Journal: Science Date: 2019-04-12 Impact factor: 47.728

6. MathIOmica: An Integrative Platform for Dynamic Omics.

Authors: George I Mias; Tahir Yusufaly; Raeuf Roushangar; Lavida R K Brooks; Vikas V Singh; Christina Christou
Journal: Sci Rep Date: 2016-11-24 Impact factor: 4.379

7. A wellness study of 108 individuals using personal, dense, dynamic data clouds.

Authors: Nathan D Price; Andrew T Magis; John C Earls; Gustavo Glusman; Roie Levy; Christopher Lausted; Daniel T McDonald; Ulrike Kusebauch; Christopher L Moss; Yong Zhou; Shizhen Qin; Robert L Moritz; Kristin Brogaard; Gilbert S Omenn; Jennifer C Lovejoy; Leroy Hood
Journal: Nat Biotechnol Date: 2017-07-17 Impact factor: 54.908

8. SECIMTools: a suite of metabolomics data analysis tools.

Authors: Alexander S Kirpich; Miguel Ibarra; Oleksandr Moskalenko; Justin M Fear; Joseph Gerken; Xinlei Mi; Ali Ashrafi; Alison M Morse; Lauren M McIntyre
Journal: BMC Bioinformatics Date: 2018-04-20 Impact factor: 3.169

9. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.

Authors: Enis Afgan; Dannon Baker; Bérénice Batut; Marius van den Beek; Dave Bouvier; Martin Cech; John Chilton; Dave Clements; Nate Coraor; Björn A Grüning; Aysam Guerler; Jennifer Hillman-Jackson; Saskia Hiltemann; Vahid Jalili; Helena Rasche; Nicola Soranzo; Jeremy Goecks; James Taylor; Anton Nekrutenko; Daniel Blankenberg
Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971