Sergii Domanskyi1, Carlo Piermarocchi1, George I Mias1,2,3. 1. Department of Physics and Astronomy, East Lansing, MI 48824, USA. 2. Department of Biochemistry and Molecular Biology, East Lansing, MI 48824, USA. 3. Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
Abstract
SUMMARY: PyIOmica is an open-source Python package focusing on integrating longitudinal multiple omics datasets, characterizing and categorizing temporal trends. The package includes multiple bioinformatics tools including data normalization, annotation, categorization, visualization and enrichment analysis for gene ontology terms and pathways. Additionally, the package includes an implementation of visibility graphs to visualize time series as networks. AVAILABILITY AND IMPLEMENTATION: PyIOmica is implemented as a Python package (pyiomica), available for download and installation through the Python Package Index (https://pypi.python.org/pypi/pyiomica), and can be deployed using the Python import function following installation. PyIOmica has been tested on Mac OS X, Unix/Linux and Microsoft Windows. The application is distributed under an MIT license. Source code for each release is also available for download on Zenodo (https://doi.org/10.5281/zenodo.3548040). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics.
SUMMARY: PyIOmica is an open-source Python package focusing on integrating longitudinal multiple omics datasets, characterizing and categorizing temporal trends. The package includes multiple bioinformatics tools including data normalization, annotation, categorization, visualization and enrichment analysis for gene ontology terms and pathways. Additionally, the package includes an implementation of visibility graphs to visualize time series as networks. AVAILABILITY AND IMPLEMENTATION: PyIOmica is implemented as a Python package (pyiomica), available for download and installation through the Python Package Index (https://pypi.python.org/pypi/pyiomica), and can be deployed using the Python import function following installation. PyIOmica has been tested on Mac OS X, Unix/Linux and Microsoft Windows. The application is distributed under an MIT license. Source code for each release is also available for download on Zenodo (https://doi.org/10.5281/zenodo.3548040). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics.
As sequencing costs continue to drop, systems biology based on large omics datasets is rapidly expanding its scope. In particular, time series obtained from multi-omics datasets are becoming more and more affordable (Chen ; Garrett-Bakelman ; Price ). The analysis of time series can have broad implications for precision medicine applications, since longitudinal data capture the dynamically changing collective microscopic behavior of molecular components in the body, reflecting the physiological state of a patient. There are many bioinformatics tools aiming at multimodal omics data integration (Pinu ). Specifically, Bioconductor (Gentleman ), Galaxy(Afgan ), GenePattern (Reich ), Biopython (Cock ), Pathomx (Fitzpatrick ), SECIMTools (Kirpich ) and more. Although multiple coding paradigms are used in bioinformatics, R and Python are essentially the lingua francas for data science analysis, where the open-source appeal and growing online community support are particularly helpful in developing a dedicated user base.Here we introduce PyIOmica, an open source Python package, for analyzing longitudinal omics datasets, such as transcriptomics, proteomics, metabolomics etc., which includes multiple tools for processing multi-modal mapped data, characterizing time series in terms of periodograms and autocorrelations, categorizing temporal behavior, visualizing visibility graphs and testing data for gene ontology and pathway enrichment. PyIOmica includes optimized new algorithms adapted from MathIOmica (Mias ; which runs on the proprietary Mathematica platform), now made available as Python open source code for all users, and additionally expands extensively graphical utilities for visualization of categorized temporal data, and network representation of time series. To our knowledge, there are no tools with the functionality of PyIOmica currently available in Python.
2 Materials and methods
2.1 Overview and codebase
PyIOmica provides a complete workflow for time series processing, illustrated in the Supplementary Figure S1. The modular nature of PyIOmica allows for smooth integration with any future and existing Python tools. With PyIOmica, any results can be visualized, exported and analyzed for gene enrichment by means of a user-friendly Python interface. PyIOmica’s codebase is a single Python module containing multiple groups of functions designed for annotations and enumerations, pre- and post-processing, clustering-related purposes, visualizations (heatmaps and categorization), normal and horizontal visibility graphs generation and other core and utility components. Installation is simply performed using pip install pyiomica, and package dependencies are automatically addressed directly from Python package index (PyPI). Function documentation is embedded in the module, and is easily accessible at runtime (and also at https://pyiomica.readthedocs.io). Data structures and implementation are described in Supplementary Material.An extensive set of PyIOmica pre-processing functions enables filtering low-quality signals, tagging missing or low values, normalization, standardization, merging and comparison of the datasets. The post-processing functions, such as temporal trends categorization of power spectrum and spikes, are built on using the SciPy and scikit-learn Python toolkits. Additional functionality includes gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses for both non-temporal data, as well as for clusters identified through the automated time series categorization.Temporal trends are automatically discovered using periodogram and autocorrelation calculations based on a Lomb-Scargle transformation algorithm (Mias ), which properly accounts for missing points and/or unevenly sampled data. The periodogram is used to identify each time series’ underlying dominant frequencies. Autocorrelations are also used to identify how measured intensities within each time series may depend on previous measurements, by correlating a time series with delayed versions of itself. Signals showing statistically significant trends are identified for downstream analysis. Multiple omics (genes, proteins and metabolites) that show similar trends in time are identified by clustering, and can be biologically evaluated through pathway and GO analyses.
2.2 Visibility graphs and visualization
Recent work on characterizing complex events focuses on using network/graph methodology that can capture non-linear behavior (Lacasa ). Time series are transformed into networks that conserve their topology, and allow the identification of varying temporal structures. We represent each timepoint in a series as a node. Then, for any timepoint pair with intensities at times and respectively, we can have an edge if for any other timepoint , such that we have . Representing the intensities as bars, this is equivalent to connecting the top of each bar to another top if there is a direct line-of-sight to that top. The resulting visibility graph has characteristics that reflect the equivalent time series temporal structure and can be used to identify trends. The shortest path identifies nodes (i.e. timepoints) that display high intensity, and thus dominate the global signal profile, are robust to noise, and are likely drivers of the global temporal behavior. A biological event deviating from baseline is likely to appear in one or more nodes within the shortest path.PyIOmica uses Matplotlib plotting functions to visualize histograms, dendrograms, heatmaps and visibility graphs. Figure 1a shows example RNA-sequencing gene expression data from a 24-h time series, clustered into two groups based on autocorrelations. Subgroups were determined from the gene expression in each autocorrelation group. The data from Group 1, Subgroup 2 containing 191 genes is visualized in Figure 1b as a visibility graph on a circular layout. Temporal events are detected and indicated with solid blue lines encompassing groups of points, or communities. Additional examples are provided in the PyIOmica documentation (Supplementary Material, using data that are provided with the PyIOmica Zenodo software release (under docs/examples)).
Fig. 1.
Example PyIOmica data visualization. (a) Dendrogram with heatmap of automatically categorized longitudinal gene expression data. Autocorrelations are used to identify temporal trends in the data. Subgroups are determined based on similar collective behavior over time. (b) Visibility graph of median signal intensity from group G1S2 from (a)
Example PyIOmica data visualization. (a) Dendrogram with heatmap of automatically categorized longitudinal gene expression data. Autocorrelations are used to identify temporal trends in the data. Subgroups are determined based on similar collective behavior over time. (b) Visibility graph of median signal intensity from group G1S2 from (a)
3 Conclusion
The open source PyIOmica Python package characterizes time series from multiple omics and categorizes temporal trends with a streamlined automated pipeline based on spectral analysis. PyIOmica also offers broad bioinformatics functionality, including clustering, visualization and enrichment, and extends previous developments (Mias ) to an open-source, community-accessible platform for data science. We anticipate future versions of PyIOmica to utilize its codebase flexibility to expand its bioinformatics tools for genomic as well as differential omics analyses, and graph construction and characterization.
Funding
This work was supported by the Translational Research Institute for Space Health through National Aeronautics and Space Administration (NASA) Cooperative Agreement NNX16AO69A.Conflict of Interest: G.M. has consulted for Colgate-Palmolive North America. C.P. owns equity in Salgomed, Inc. S.D. reports no potential confict of interest.Click here for additional data file.
Authors: Lucas Lacasa; Bartolo Luque; Fernando Ballesteros; Jordi Luque; Juan Carlos Nuño Journal: Proc Natl Acad Sci U S A Date: 2008-03-24 Impact factor: 11.205
Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937
Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583
Authors: Francine E Garrett-Bakelman; Manjula Darshi; Stefan J Green; Ruben C Gur; Ling Lin; Brandon R Macias; Miles J McKenna; Cem Meydan; Tejaswini Mishra; Jad Nasrini; Brian D Piening; Lindsay F Rizzardi; Kumar Sharma; Jamila H Siamwala; Lynn Taylor; Martha Hotz Vitaterna; Maryam Afkarian; Ebrahim Afshinnekoo; Sara Ahadi; Aditya Ambati; Maneesh Arya; Daniela Bezdan; Colin M Callahan; Songjie Chen; Augustine M K Choi; George E Chlipala; Kévin Contrepois; Marisa Covington; Brian E Crucian; Immaculata De Vivo; David F Dinges; Douglas J Ebert; Jason I Feinberg; Jorge A Gandara; Kerry A George; John Goutsias; George S Grills; Alan R Hargens; Martina Heer; Ryan P Hillary; Andrew N Hoofnagle; Vivian Y H Hook; Garrett Jenkinson; Peng Jiang; Ali Keshavarzian; Steven S Laurie; Brittany Lee-McMullen; Sarah B Lumpkins; Matthew MacKay; Mark G Maienschein-Cline; Ari M Melnick; Tyler M Moore; Kiichi Nakahira; Hemal H Patel; Robert Pietrzyk; Varsha Rao; Rintaro Saito; Denis N Salins; Jan M Schilling; Dorothy D Sears; Caroline K Sheridan; Michael B Stenger; Rakel Tryggvadottir; Alexander E Urban; Tomas Vaisar; Benjamin Van Espen; Jing Zhang; Michael G Ziegler; Sara R Zwart; John B Charles; Craig E Kundrot; Graham B I Scott; Susan M Bailey; Mathias Basner; Andrew P Feinberg; Stuart M C Lee; Christopher E Mason; Emmanuel Mignot; Brinda K Rana; Scott M Smith; Michael P Snyder; Fred W Turek Journal: Science Date: 2019-04-12 Impact factor: 47.728
Authors: George I Mias; Tahir Yusufaly; Raeuf Roushangar; Lavida R K Brooks; Vikas V Singh; Christina Christou Journal: Sci Rep Date: 2016-11-24 Impact factor: 4.379
Authors: Nathan D Price; Andrew T Magis; John C Earls; Gustavo Glusman; Roie Levy; Christopher Lausted; Daniel T McDonald; Ulrike Kusebauch; Christopher L Moss; Yong Zhou; Shizhen Qin; Robert L Moritz; Kristin Brogaard; Gilbert S Omenn; Jennifer C Lovejoy; Leroy Hood Journal: Nat Biotechnol Date: 2017-07-17 Impact factor: 54.908
Authors: Alexander S Kirpich; Miguel Ibarra; Oleksandr Moskalenko; Justin M Fear; Joseph Gerken; Xinlei Mi; Ali Ashrafi; Alison M Morse; Lauren M McIntyre Journal: BMC Bioinformatics Date: 2018-04-20 Impact factor: 3.169
Authors: Enis Afgan; Dannon Baker; Bérénice Batut; Marius van den Beek; Dave Bouvier; Martin Cech; John Chilton; Dave Clements; Nate Coraor; Björn A Grüning; Aysam Guerler; Jennifer Hillman-Jackson; Saskia Hiltemann; Vahid Jalili; Helena Rasche; Nicola Soranzo; Jeremy Goecks; James Taylor; Anton Nekrutenko; Daniel Blankenberg Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971