Jose Espinosa-Carrasco1, Ionas Erb2, Toni Hermoso Pulido2, Julia Ponomarenko3, Mara Dierssen4, Cedric Notredame5. 1. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain. 2. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain. 3. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain. 4. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Valencia, Spain. Electronic address: mara.dierssen@crg.eu. 5. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain. Electronic address: cedric.notredame@crg.eu.
Abstract
The growing appetite of behavioral neuroscience for automated data production is prompting the need for new computational standards allowing improved interoperability, reproducibility, and shareability. We show here how these issues can be solved by repurposing existing genomic formats whose structure perfectly supports the handling of time series. This allows existing genomic analysis and visualization tools to be deployed onto behavioral data. As a proof of principle, we implemented the conversion procedure in Pergola, an open source software, and used genomics tools to reproduce results obtained in mouse, fly, and worm. We also show how common genomics techniques such as principal component analysis, hidden Markov modeling, and volcano plots can be deployed on the reformatted behavioral data. These analyses are easy to share because they depend on the scripting of public software. They are also easy to reproduce thanks to their integration within Nextflow, a workflow manager using containerized software.
The growing appetite of behavioral neuroscience for automated data production is prompting the need for new computational standards allowing improved interoperability, reproducibility, and shareability. We show here how these issues can be solved by repurposing existing genomic formats whose structure perfectly supports the handling of time series. This allows existing genomic analysis and visualization tools to be deployed onto behavioral data. As a proof of principle, we implemented the conversion procedure in Pergola, an open source software, and used genomics tools to reproduce results obtained in mouse, fly, and worm. We also show how common genomics techniques such as principal component analysis, hidden Markov modeling, and volcano plots can be deployed on the reformatted behavioral data. These analyses are easy to share because they depend on the scripting of public software. They are also easy to reproduce thanks to their integration within Nextflow, a workflow manager using containerized software.
Fine-grained longitudinal recordings are one of the fastest growing collections of biological and societal data (de Montjoye et al., 2015, Jensen et al., 2014). Behavior is a prime target for these types of measurements (Price et al., 2017) since it constitutes a high-level phenotype that links genetics, development, neurobiology, evolution, environment, and social influences (Gomez-Marin et al., 2014). Novel high-throughput platforms for monitoring behavior have recently enabled unprecedented amounts of data to be collected (Schaefer and Claridge-Chang, 2012). However, these data present novel challenges, and their effective comparison and reproducible analysis across different systems and platforms will require improved interoperability (Anderson and Perona, 2014, Munafò et al., 2017). Here, we show how standards established for genomics can be repurposed to handle behavioral data in a fully interoperable fashion. This simple solution made it possible to analyze behavioral neuroscience datasets gathered on three model systems (Espinosa-Carrasco et al., 2018, Robie et al., 2017, Yemini et al., 2013) using standard genomic formats and software tools.High-throughput behavior-monitoring platforms can be classified into two categories: sensor- and video-based systems. Sensor-based systems measure selected parameters such as vocalizations, activity, feeding, and oxygen consumption. For mice, such devices include PHECOMP (Espinosa-Carrasco et al., 2018), PhenoMaster (Suez et al., 2014) and CLAMS (Solt et al., 2012). Video-based systems use computer vision techniques to track movements. Optimized video setups have been developed for the main model organisms, including mouse (Hong et al., 2015), worm (Yemini et al., 2013), fly (Robie et al., 2017), and zebrafish (Pérez-Escudero et al., 2014). Even though the recorded signals are different across systems, the outputs are comparable in the sense that they consist of discrete or continuous time series of behavioral quantitative or qualitative readouts. These time series are usually processed using commercial software shipped with the platform, often in combination with ad hoc implemented analyses. Although this approach is entirely suitable for establishing key biological results, the lack of common standards for longitudinal data processing hampers the ability to share established computational analysis procedures and data (Wilkinson et al., 2016), thus potentially limiting comparisons and reproducibility across different systems. Improving interoperability is not, however, an easy task, because it requires the whole community to agree on common standards, formats, and procedures. In genomics, the establishment of such standards has taken about 15 years and work is still in progress (Field et al., 2011, Karsch-Mizrachi et al., 2018). In this work, we show how mature standards can be recycled across research fields to save development time. To the best of our knowledge we document here the first case of a standard repurposing procedure between genomics and behavioral neuroscience.The rationale for this work was our original observation that the storage of genomics data and the way these data are subsequently analyzed supports all the requirements of longitudinal data since both data types share a sequential structure. In a genomic sequence, nucleotides are ordered as a discrete series to which various layers of qualitative and quantitative annotation can be attached. This data structure, which is essentially a large table with a position in each line and various attributes in each column, can hold any sequentially organized information, including longitudinal recordings. By simply treating each nucleotide coordinate as a unit of time, we show that it is possible to use the most common genomic data formats for storing, visualizing, and analyzing behavioral information (including metadata) in a lossless fashion. What makes this data structure attractive for the handling of longitudinal information is not its specification but rather its ubiquitous use in genomics. In this field, the early adoption of strict data storage standards (Kent et al., 2002) has resulted in a huge number of tools built around formats agreed upon and supported by a community of thousands of laboratories around the world. Formats developed to define the position of genomic annotations therefore provide a perfect scaffold to store discrete time series of behavior. In a similar fashion, file formats implemented to represent scores along genomic sequences allow the storage of continuous time series of behavior (Figure 1). The available genomic tools (Afgan et al., 2018, Ernst and Kellis, 2012, Gentleman et al., 2004, Quinlan and Hall, 2010, Ramírez et al., 2016, Robinson and Thorvaldsdóttir, 2011, Zerbino et al., 2014) are also suitable to deal with time series because they allow both sophisticated multi-scale visualization, genomic arithmetic operations required to conditionally extract and combine data portions, and complex post-processing techniques such as hidden Markov model (HMM) analysis.
Figure 1
Formatting of Common Behavioral Recordings into BED as an Example of the Pergola Approach to Format Data
(A) Typical tabulated (CSV, TSV, or XLSX) behavioral recording with three events (lines 2–4) being coded with an animal tag (id column), start (event_start column) and end time (event_end column), a typology (type_of_event), and a quantitative value (value column).
(B) Once mapped onto the Pergola generic ontology, these column labels are translated into a BED file format where the nucleotide relative index positions (columns 2 and 3) are used to specify the start and termination of the event whose nature is coded in the attribute column (#4) and quantified in the quantification column (#6). The BED format also supports specific coloring for the considered event (RGB values in the last column). A single BED file will contain all the events of an individual animal ID. Pergola will generate a FASTA file name chromosome 1 (“chr1” on the file) that is used to map the time points onto the nucleotide positions.
(C) Once formatted this way, data can then be displayed and processed using popular genomics tools such as the Integrative Genomic Viewer (IGV).
Formatting of Common Behavioral Recordings into BED as an Example of the Pergola Approach to Format Data(A) Typical tabulated (CSV, TSV, or XLSX) behavioral recording with three events (lines 2–4) being coded with an animal tag (id column), start (event_start column) and end time (event_end column), a typology (type_of_event), and a quantitative value (value column).(B) Once mapped onto the Pergola generic ontology, these column labels are translated into a BED file format where the nucleotide relative index positions (columns 2 and 3) are used to specify the start and termination of the event whose nature is coded in the attribute column (#4) and quantified in the quantification column (#6). The BED format also supports specific coloring for the considered event (RGB values in the last column). A single BED file will contain all the events of an individual animal ID. Pergola will generate a FASTA file name chromosome 1 (“chr1” on the file) that is used to map the time points onto the nucleotide positions.(C) Once formatted this way, data can then be displayed and processed using popular genomics tools such as the Integrative Genomic Viewer (IGV).
Results
We have developed Pergola (Python bEhavioRal GenOme tools LibrAry, http://cbcrg.github.io/pergola/) to demonstrate the feasibility of standard repurposing between genomics and behavioral neuroscience. Pergola is an open source tool capable of converting behavioral data into the most common genomic formats. This conversion merely requires a customizable mapping between the user's behavioral recordings and genomic data structures (e.g., time stamps mapped onto nucleotide positions, see Tables S1 and S2 and Supplemental Information for details). As a proof of principle, we have used Pergola to reprocess, visualize, and analyze worm, fly, and mouse data that had been obtained in previous studies using three different platforms and were stored in a variety of formats.The first dataset corresponds to worm locomotion trajectories recorded with a video tracking system (Yemini et al., 2013). A part of the original study aimed to quantify the differences in locomotion between a mutant strain with reduced mobility and its wild-type control. From these data, we extracted motion states (discrete time intervals annotated as moving forward, backward, or paused) and per-frame body speed to BED and BedGraph formats, respectively, using Pergola, and visualized them on the Integrative Genomics Viewer (IGV) (Robinson and Thorvaldsdóttir, 2011). As this browser permits the simultaneous presentation of several annotation tracks, it lends itself well to the side-by-side display of worm recordings from different strains (Figure 2A). When doing so, it becomes straightforward to confirm that control N2 worms systematically reached higher speeds than the unc-16 individuals (“unc” stands for uncoordinated), an observation supported by the relative intensity of the speed signals displayed on the heatmap. This type of visualization is equally well suited for categorical (e.g., back or forward movement; Figure S1A) and continuous recordings (e.g., speed; Figure S1B). Using as input the individual speed trackings, we used deepTools (Ramírez et al., 2016), a popular suite for exploring sequencing data, to compute a principal component analysis. Our analysis not only discriminates mutants from their wild-type counterparts but also reflects the increased variability of wild-type worms with respect to the mutant group (Figure 2B), thus revealing individual variability as a potential confounding factor when performing group comparisons. Explorative analyses can be complemented with quantitative analyses using tools developed for genome arithmetic. Here, we reproduced the comparison of speed carried out in the original analysis (Figure 2C) by using BEDTools to extract intervals of the original speed trajectory fulfilling specific criteria (e.g., moving forward, Figure 3) and by aggregating the speed measurements within these intervals.
Figure 2
Repurposing of Genomic Software for the Analysis of C. elegans Locomotion Trajectories Using Pergola
(A) IGV snapshot of the comparison between C. elegans unc-16 mutants (20 tracks) and controls (40 tracks), where blue and red indicate the speed (in microns by frame) of the forward and backward movements, respectively. Tracks span 29,000 frames. Positive values (red) correspond to speeds measured during forward movements, whereas negative quantities indicate speed attained during backward movement.
(B) PCA and resulting screen plot (eigenvalues versus dimension) of binned speed trajectories obtained using deepTools. Geometric shapes depict single-worm trajectories (yellow for controls, gray for un-16 mutants). (Different geometric shapes are a built-in feature of the visualization and do not provide additional information.)
(C) Quantification of the speed distributions across the three possible motion states (forward, backward, and paused; same color code as in B).
See also Figures S1 and S7.
Figure 3
Use of BEDtools for Conditional Extraction of C. elegans Phenotypic Variables by Intersection with Motion State
(A and B) Tracks represent (A) the mid body speed of a single animal encoded in a BedGraph file (continuous annotation of behavior) and (B) the periods during the trajectory moving forward encoded in a BED file (discrete annotation of behavior).
(C) Phenotypic features in the BedGraph file can be intersected with direction states in the BED file using BEDTools. The resulting files contain only the regions of mid body speed overlapping with forward direction. Figure inspired by BEDTools documentation (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html).
Repurposing of Genomic Software for the Analysis of C. elegans Locomotion Trajectories Using Pergola(A) IGV snapshot of the comparison between C. elegans unc-16 mutants (20 tracks) and controls (40 tracks), where blue and red indicate the speed (in microns by frame) of the forward and backward movements, respectively. Tracks span 29,000 frames. Positive values (red) correspond to speeds measured during forward movements, whereas negative quantities indicate speed attained during backward movement.(B) PCA and resulting screen plot (eigenvalues versus dimension) of binned speed trajectories obtained using deepTools. Geometric shapes depict single-worm trajectories (yellow for controls, gray for un-16 mutants). (Different geometric shapes are a built-in feature of the visualization and do not provide additional information.)(C) Quantification of the speed distributions across the three possible motion states (forward, backward, and paused; same color code as in B).See also Figures S1 and S7.Use of BEDtools for Conditional Extraction of C. elegans Phenotypic Variables by Intersection with Motion State(A and B) Tracks represent (A) the mid body speed of a single animal encoded in a BedGraph file (continuous annotation of behavior) and (B) the periods during the trajectory moving forward encoded in a BED file (discrete annotation of behavior).(C) Phenotypic features in the BedGraph file can be intersected with direction states in the BED file using BEDTools. The resulting files contain only the regions of mid body speed overlapping with forward direction. Figure inspired by BEDTools documentation (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html).The second dataset is a collection of Drosophila behavioral trajectories processed using JAABA (Robie et al., 2017), a machine learning software for video tracking annotation. In the original article, the video recordings were processed with JAABA and further analyzed using custom MATLAB scripts. Here, we used Pergola to reformat the JAABA-annotated data into BED files. These files can be visualized using third-party genomics software like the Sushi R-package (Phanstiel et al., 2014) from Bioconductor (Figure 4A). Because Pergola is implemented as a Python library, it can be integrated with any of the Python tools developed for genomics analysis, including Pybedtools (Dale et al., 2011) (a wrapper for the BEDTools). To reproduce one of the original results of this study, we used Pybedtools to extract the regions annotated as chase behavior (a behavior that consists in closely following another individual for a certain amount of time) from the BED files and compared the chasing time across Drosophila groups with different genetic backgrounds (Figure 4B). Finally, to assess the relative contribution of the various locomotion features when annotating chase bouts, we adapted a volcano plot representation (Ritchie et al., 2015). In genomics, volcano plots are mostly used to display the difference of expression of a set of genes between two conditions. log2-fold changes are plotted against their log p values, thus allowing rapid identification of differences both meaningful and statistically significant. Such plots can be adapted to any experiment wherein a similar parameter is measured across different conditions, and whenever a sufficient level of replication allows statistical assessment. In the worm example, we considered every trajectory parameter (speed, orientation, angular velocity …) and measured the average log2-fold differences of these parameters between the intervals annotated as chase bouts and the remaining intervals. The p values were then obtained by considering all the intervals of a given type (chase or non-chase) across all individuals as replicates on which we ran a t test. The results (Figure 4C) identified angular speed to be one of the most discriminative features between chase and non-chase behavior.
Figure 4
Analysis of D. melanogaster Locomotion Trajectories Using Pergola
(A) Fly chase behavior annotated by JAABA and rendered with Sushi. Chasing periods are represented as blue and red bars for the chase-prone and control flies, respectively. Color intensity indicates the confidence score for the behavioral classifier returned by JAABA.
(B) Boxplots represent the time fraction of the trajectories annotated as chase behavior (same color code as in B).
(C) In this volcano plot, each dot represents one of the parameters measured on a fly trajectory. The horizontal axis represents the average log fold change of this parameter between measurements made during periods annotated by JABBA as chase and non-chase. The vertical axis represents the p value (from t-test) of these differences. Parameters whose variation is both extreme and statistically significant (e.g., velmag: angular velocity) are colored in red.
See also Figure S8.
Analysis of D. melanogaster Locomotion Trajectories Using Pergola(A) Fly chase behavior annotated by JAABA and rendered with Sushi. Chasing periods are represented as blue and red bars for the chase-prone and control flies, respectively. Color intensity indicates the confidence score for the behavioral classifier returned by JAABA.(B) Boxplots represent the time fraction of the trajectories annotated as chase behavior (same color code as in B).(C) In this volcano plot, each dot represents one of the parameters measured on a fly trajectory. The horizontal axis represents the average log fold change of this parameter between measurements made during periods annotated by JABBA as chase and non-chase. The vertical axis represents the p value (from t-test) of these differences. Parameters whose variation is both extreme and statistically significant (e.g., velmag: angular velocity) are colored in red.See also Figure S8.The third application of Pergola involves mice behavioral data that were obtained using PHECOMP cages (Espinosa-Carrasco et al., 2018). These cages enable fine-grained longitudinal monitoring of mice intake (solid and liquid), and can record a single individual during several weeks. In the original study, these devices were used to detect the influence of hypercaloric diets on meal patterns by comparing the feeding activity of a mice group exclusively offered a high-fat diet to their control group fed with standard chow. The sheer size of this dataset, with 9 weeks of recordings at a 1-s resolution, makes its exploration challenging.The IGV browser used for the worm dataset does not allow interactive user-defined analysis (e.g., aggregation of data across time intervals). As an alternative, we therefore combined the Gviz Bioconductor R-package with the Shiny Visualization R-library (Figure S2) to assemble a suitable visualization tool. We used this interactive browser to explore the time-dependent behavioral changes in mice exposed to the high-fat diet (Figure 5) and complemented them with a standard heatmap analysis across the two conditions (Figure 6). To further investigate whether high-fat diet disrupted the circadian rhythm of mice, as previously shown (Kohsaka et al., 2007), we used deepTools to create a profile of the 24-hr feeding activity both group-wise (Figure 7) and individually (Figure S3), akin to an actogram. Although such analyses are totally standard in behavioral neurobiology, the aspect of this work most relevant to interoperability and reproducibility is that the actogram could be produced by a simple scripting of deepTools (cf. Transparent Methods) and that its packaging for further reuse without any need for extra coding is entirely supported by generic workflow managers such as Galaxy (Afgan et al., 2018).
Figure 5
Shiny-Pergola Rendering of Mice Feeding Activity Showing Diet Influence on Meal Patterns
Feeding behavior during (A) the habituation phase (first week) and (B) the habituation phase and its transition, indicated by the arrow, to the first 2 weeks of the development of obesity in the high-fat mice (HF). The upper part of each panel (gray and orange) corresponds to raw meals depicted as rectangles proportional to their duration, whereas the middle part (blue tones) displays the accumulated food intakes of mice within 30-min time windows as a heatmap. In both cases, each row depicts data of a single individual. The bottom plot displays the group-wise average intake during these time windows. In all cases the gray is used for control mice and orange for HF mice. The zoomed snapshot (A) shows the micro-structure of feeding behavior during the habituation phase and how intakes are more frequent during dark periods in both mice groups (as mice are nocturnal animals). Substitution of standard chow by the high-fat food rapidly induces a disruption of the circadian feeding rhythmicity in the HF mice, as observed when browsing the period corresponding to the transition. In the heatmap, the darkest blue rectangles correspond to the maximum values: 0.5 g. See also Figures S2 and S9.
Figure 6
Heatmap Representation of the Fold Change in Feeding Behavior between Mice Exposed to High Fat and Their Control
Each square corresponds to the high-fat versus control mice fold change of the specific behavior (rows) during a given week (columns). Increased fold change >1 are depicted in red, whereas decreasing changes are shown in green. Color intensity is proportional to the fold change magnitude, and black indicates no variation. During the habituation phase (week 0), both groups of mice were fed with standard chow and no difference between their feeding behavior was observed. As soon as the high-fat diet is introduced, mice forced to eat exclusively this diet increased intake, number, and eating duration of meals and reduced the duration and separation between meals (animals in control group: 9, high-fat-diet group: 8).
Figure 7
Mouse Circadian Profiles of Feeding Activity Obtained Using deepTools
(A) An actogram summarizing group-wise feeding activities generated using deepTools. Left panel corresponds to control mice, and the right panel corresponds to high-fat-diet mice. The x axis represents the 24-hr daily cycle of behavior across the light/dark phases with the active phase start (APS) and the resting phase start (RPS) representing points wherein lights were switched off and on, respectively. The y axis represents the group average food intake, averaged over all the days of the habituation phase (blue line) or development phase (green line).
(B) Day-by-day recordings (averaged over all mice in a group) with every line representing a day, ordered by date with the early ones on the top. Food intake was normalized across all recordings, with dark blue indicating the highest values.
See also Figure S3.
Shiny-Pergola Rendering of Mice Feeding Activity Showing Diet Influence on Meal PatternsFeeding behavior during (A) the habituation phase (first week) and (B) the habituation phase and its transition, indicated by the arrow, to the first 2 weeks of the development of obesity in the high-fat mice (HF). The upper part of each panel (gray and orange) corresponds to raw meals depicted as rectangles proportional to their duration, whereas the middle part (blue tones) displays the accumulated food intakes of mice within 30-min time windows as a heatmap. In both cases, each row depicts data of a single individual. The bottom plot displays the group-wise average intake during these time windows. In all cases the gray is used for control mice and orange for HF mice. The zoomed snapshot (A) shows the micro-structure of feeding behavior during the habituation phase and how intakes are more frequent during dark periods in both mice groups (as mice are nocturnal animals). Substitution of standard chow by the high-fat food rapidly induces a disruption of the circadian feeding rhythmicity in the HF mice, as observed when browsing the period corresponding to the transition. In the heatmap, the darkest blue rectangles correspond to the maximum values: 0.5 g. See also Figures S2 and S9.Heatmap Representation of the Fold Change in Feeding Behavior between Mice Exposed to High Fat and Their ControlEach square corresponds to the high-fat versus control mice fold change of the specific behavior (rows) during a given week (columns). Increased fold change >1 are depicted in red, whereas decreasing changes are shown in green. Color intensity is proportional to the fold change magnitude, and black indicates no variation. During the habituation phase (week 0), both groups of mice were fed with standard chow and no difference between their feeding behavior was observed. As soon as the high-fat diet is introduced, mice forced to eat exclusively this diet increased intake, number, and eating duration of meals and reduced the duration and separation between meals (animals in control group: 9, high-fat-diet group: 8).Mouse Circadian Profiles of Feeding Activity Obtained Using deepTools(A) An actogram summarizing group-wise feeding activities generated using deepTools. Left panel corresponds to control mice, and the right panel corresponds to high-fat-diet mice. The x axis represents the 24-hr daily cycle of behavior across the light/dark phases with the active phase start (APS) and the resting phase start (RPS) representing points wherein lights were switched off and on, respectively. The y axis represents the group average food intake, averaged over all the days of the habituation phase (blue line) or development phase (green line).(B) Day-by-day recordings (averaged over all mice in a group) with every line representing a day, ordered by date with the early ones on the top. Food intake was normalized across all recordings, with dark blue indicating the highest values.See also Figure S3.When collecting longitudinal data, individual measurements are often non-independent of one another. The precise estimate of dependencies between successive events can therefore yield clues on the biological process underlying the generation of these observations. HMMs have long been known as methods of choice for analyzing such dependencies on behavioral time series (Arakawa et al., 2014, Carola et al., 2011, Yamato et al., 1992). HMMs were introduced in genomics in 1995 (Eddy, 1995) and since then, many off-the-shelf packages have been made available to carry this modeling. We selected chromHMM, a popular genomic software that infers chromatin states from chromatin marks (histone acetylation, CpG methylation, etc.) using an HMM (Ernst and Kellis, 2012). When doing so, individual chromatin marks are treated as probabilistic emissions of their underlying chromatin state. These marks constitute the input of the HMM training, whereas the output is a segmentation of the genome into regions having the same chromatin state. These states are the hidden variables. When training a model, the number of distinct hidden states has to be specified. HMM modeling is especially suitable when several states exist that can all emit the same observations but with different probabilities (just like loaded and fair dice can emit the same numbers but with different probabilities). The estimation of the transition probabilities across states as well as the individual emission probabilities of each state is named the training phase. It is usually an unsupervised process that explicitly relies on Bayes’ theorem to estimate an HMM model whose joint probability with the data is maximized.In the case of mice longitudinal feeding recordings, it is possible to treat every feeding bout as an emission, and one can then use HMM modeling to determine the typology of feeding behaviors most likely to have emitted the combinations of feeding bouts. Two parameters are needed to achieve this result: the number of hidden states (e.g., number of distinct feeding behaviors) and a description of the emissions produced by each state of the model (e.g., short feeding bout, long feeding bout). In this precise case, since the emissions are feeding bouts defined by their duration—a continuous value—it is common practice to do a discretization by binning the values into quantiles. In an ideal situation, both the hidden states and the discretized bouts should be mapped onto precise biological quantities or processes. In practice, however, even when this mapping is approximate, the resulting segmentation can reveal subtle changes in the structure of the data, thus allowing meaningful comparisons across datasets. As an illustration of this process, we discretized the feeding bouts according to their duration in five quantiles, a number that roughly captures the various modes of bout distributions (Figure S4). We then used these quantiles as emissions and estimated a four-states model onto the data (Figure 8A). The result was a collection of intervals, each assigned to one of the four states. As expected, every bout duration was found to occur within every segmented interval, but with varying probabilities. Each state is characterized by a specific combination of emission probabilities for the feeding bouts and transition probabilities across states (Figures S5 and S6, respectively).
Figure 8
Profiles of Mice Feeding Behavior Segmented Using ChromHMM
(A) Mice feeding bouts were distributed in five quintiles (see Figure S4) and eventually used to train a four-state HMM model thus resulting in each trajectory segment (300 s) to be probabilistically assigned to one of the four states. The most common feeding bout duration observed in each state was used to label it (long meals, regular meals, short meals, and inactive). A day/night track indicates dark (violet) and light phases (white) of the circadian rhythm. The last track is colored to show habituation (gray) and obesity development (black) phases of the experiment.
(B) In the spie chart representation, every slice represents the fraction of time spent in any of the four HMM states during light and day phases (left and right) and for the control and the high-fat animals (top and bottom). On these charts, the angular spread of each slice represents the relative duration of the considered state during habituation (i.e., standard chow), whereas the radial extent of the same slice represents the ratio between a measurement made during development and habituation. For instance, the green slice representing high-fat mice daily intake (bottom left) is the one showing the strongest difference between the habituation and the development phases but represents only a small fraction of the state distribution during habituation (see also Figures S4–S6).
Profiles of Mice Feeding Behavior Segmented Using ChromHMM(A) Mice feeding bouts were distributed in five quintiles (see Figure S4) and eventually used to train a four-state HMM model thus resulting in each trajectory segment (300 s) to be probabilistically assigned to one of the four states. The most common feeding bout duration observed in each state was used to label it (long meals, regular meals, short meals, and inactive). A day/night track indicates dark (violet) and light phases (white) of the circadian rhythm. The last track is colored to show habituation (gray) and obesity development (black) phases of the experiment.(B) In the spie chart representation, every slice represents the fraction of time spent in any of the four HMM states during light and day phases (left and right) and for the control and the high-fat animals (top and bottom). On these charts, the angular spread of each slice represents the relative duration of the considered state during habituation (i.e., standard chow), whereas the radial extent of the same slice represents the ratio between a measurement made during development and habituation. For instance, the green slice representing high-fat mice daily intake (bottom left) is the one showing the strongest difference between the habituation and the development phases but represents only a small fraction of the state distribution during habituation (see also Figures S4–S6).When performing unsupervised modeling (i.e., starting from raw data without prior information), the states resulting from the training are merely statistical quantities whose combination maximizes the data probability. The biological meaning of these states remains to be explored. In the case at hand, most state labels turned out to be rather trivial to assign through visual inspection, with one of the states accounting for the absence of feeding activity, a second made of a combination of feeding bouts mostly coming from the short meals quantile, a third characterized by meals of a regular length, and finally, a fourth state enriched in long feeding bouts. Notably, although no information was provided to the model, the “short meals” state fits relatively well with binge eating behavior characteristic of compulsive-like feeding (Cottone et al., 2012). Binge eating is defined as the consumption of a large quantity of food in a very short period of time and seems to dominate the feeding activity of high-fat mice (green intervals in Figure 8A). Taking advantage of the BED formatting, we used Pybedtools to compare the distribution of states across the circadian cycle in basal conditions (habituation phase) and after the introduction of the high-fat food (obesity development phase, Figure 8B) with the resulting representation supporting the notion that high-fat-fed mice increase periods of short meals at the expense of inactive periods. This analysis also reveals the tendency of these changes to occur during the light phases of the day when mice should be less active owing to their nocturnal nature. Last but not least, the HMM decoding makes it possible to highlight probable artifacts such as the unusual high feeding activity observed in some control mice (Figure 8A, blue patch in the middle of the trajectory of control mice #4 and #9).The three examples of analysis of different model organisms based on distinct data processing procedures clearly illustrate the versatility brought about by genomics standards to behavioral analysis. Yet, these standards not only cater to a wide range of different analyses but also allow uniform operations to be carried out across datasets. For instance, we provided a proof of principle of how the three visualization procedures can be indistinctly applied onto the three datasets, regardless of the organism or the recording platform (Figures S7–S9). Altogether these analyses demonstrate how the use of genomics standards allows an unprecedented amount of interoperability across methods and datasets when dealing with behavioral time series data.To ensure computational reproducibility, we ran Pergola using Nextflow (Di Tommaso et al., 2017), a workflow manager that allows entire pipelines to be deployed using software containers. The main advantage of containers is that they bundle together software that can run in the same fashion irrespective of the computational environment. Containers can also be handled by workflow managers (Di Tommaso et al., 2015) like Nextflow that provide increased scalability through optimized parallelization. In order for analyses to become easily shareable and reproducible, however, containers, software, and data must all be made available through public repositories (e.g., Docker-Hub, GitHub, and Zenodo, respectively). Thanks to the integration of these resources within Nextflow, all the analyses shown here can be repeated by entering a single command line (Transparent Methods) on an adequately configured computer (i.e., Nextflow and Docker must be installed so as to allow software to be automatically gathered and deployed). Community-wide adoption of such a global computational strategy could be a major step toward the implementation of the FAIR principle (Wilkinson et al., 2016) in longitudinal behavioral studies.
Discussion
Pergola is a simple yet effective solution to the problem of data interoperability in behavioral neuroscience. By providing access to an established data standard, Pergola's data harmonization scheme has the potential to increase interoperability and computational reproducibility. The need for these improvements has become critical now that neuroscience is entering a data-intensive cycle (Wiener et al., 2016). Existing commercial and open source behavioral software often provide a toolkit designed to perform a specific set of analyses, but these toolkits are often hard to adapt and frequently bound to a specific acquisition platform. By contrast, Pergola enables a flexible framework in which ad hoc computational solutions can be rapidly implemented by re-adapting existing tools designed to deal with genomics standards. The benefits are non-neglectable since genomics offers a rapidly growing corpus of statistical analysis software (Gentleman et al., 2004, Huber et al., 2015). In this work, we show on three distinct model systems how genomic tools can be easily applied for reproducing common behavioral analysis. The simplest, albeit arguably the most powerful aspect of the reliance on genomics standards is visualization. Visualization tools are complex software whose user interface is essential for effective data exploration. Multiscale support is especially important when zooming in and out of large datasets exhibiting meaningful signals at various levels of granularity. We show here that genomics not only provides a ready-made solution but also allows for visualization procedures that are shareable and re-usable across a wide range of different analyses. Our Shiny-based browser is a striking example of how effective and problem-tailored shareable visualization procedures can be implemented in behavioral biology, following the trends set in genomics (Cao et al., 2018).Genomics can provide a substantial number of advanced analysis techniques readily implemented (Gentleman et al., 2004, Quinlan and Hall, 2010, Ernst and Kellis, 2012, Ramírez et al., 2016) for the study of animal behavior. Thanks to its critical mass and the development of very large flagship projects like ENCODE or GTEx (The ENCODE Project Consortium et al., 2012; Ardlie et al., 2015), genomics software tools tend to be more mature and better supported than their counterparts in other research fields. In this work, we show how we can integrate chromHMM (Ernst and Kellis, 2012), a tool that has been extensively used by ENCODE to characterize chromatin regions (Ernst and Kellis, 2010, The ENCODE Project Consortium et al., 2012, Yue et al., 2014), to annotate mice feeding activity. Likewise, we adopted deepTools (Ramírez et al., 2016) functionalities to cluster posture-tracking-derived data from C. elegans and to derive mouse circadian profiles. The latter constitutes a compelling example of how to implement a useful data representation with little computational proficiency. In addition, we showed how an analysis strategy commonly used for genome-wide gene expression data, the volcano plot (Ritchie et al., 2015, Liu et al., 2018), can be used to identify the main behavioral features, leading to the differential annotation of a given behavior.In par with visualization, the other main benefit of the standard repurposing presented here is probably increased interoperability. Interoperability stems from datasets being standardized into the same format and therefore operable with the same tools. As a consequence, it becomes easier to share, compare, and reuse data (Sansone et al., 2012) and apply common analysis procedures to heterogeneous datasets. Increased interoperability allows laboratories to work under open-data collaborative scientific environments with major benefits for the community (Gomez-Marin et al., 2014). Last but not least, interoperability offers unified and flexible visualizations of diverse recordings that may include non-behavioral data. For instance, in the near future, the simultaneous rendering of behavioral trajectories along with environmental annotations or electrophysiological recordings may allow to identify essential explanatory patterns on the data (Anderson and Perona, 2014).The technical feasibility of interoperability comes, however, with extra caveat. More specifically, it must be stressed that meaningful comparison of alternative recordings depends on a good understanding of their relative timescales (Brown and de Bivort, 2018). Although our approach makes it possible to process through the same channels behavioral features lasting just milliseconds (e.g., response to an odor stimulus in the fly [Gaudry et al., 2013]) or a day (e.g., circadian rhythms [Geissmann et al., 2017]), the decision as to whether such analyses can be meaningfully compared remains a matter for scientific debate and analysis. Interoperability, nonetheless, offers a unique window of opportunity to explore relationship between short- and long-term behavioral patterns. Inter-operating data across various species also brings new possibilities, such as the evolutionary treatment of behavioral characters and their interpretation in terms of homology (Gomez-Marin et al., 2016). Of course, in contrast with nucleotide homology where genomic positions are inferred to be homologous to one another, behavioral homology is more similar to character-based phylogeny and may have to be inferred by comparing secondary models of longitudinal data (e.g., HMMs) rather than the longitudinal data itself. Similar analyses have been carried out in the field of social sciences, with complex longitudinal studies recorded as simple sequences and processed using mathematical tools inspired from evolutionary biology (Gauthier et al., 2010, Gauthier et al., 2009).We also demonstrate here how high-throughput behavioral analysis can become more systematic and reproducible by adopting workflow managers such as Nextflow (Di Tommaso et al., 2017). The task of deploying analysis relying on such managers can be highly simplified by adopting mature genomic solutions. Additional benefit comes from the large user and developer communities and the rigorous version control of these tools. The robust handling of changing versions of the same software will be achieved by packaging them into software containers. Besides, these tools share input and output file formats, which allows to develop orchestrated pipelines.This cross-fertilization between scientific fields provides a prime example of how interdisciplinarity can save development time by allowing collections of extensively validated software and methods to be shared and mutualized. Indeed, the problem of massive data integration is not specific to neuroscience, with novel types of longitudinal recordings being collected in hospitals (Jensen et al., 2014) and societal contexts (de Montjoye et al., 2015). Although properly analyzing these data will pose a challenge, its expected impact on human health will definitely make it worthwhile (Jensen et al., 2014, Price et al., 2017).
Limitations of the Study
Pergola is not meant to integrate genomics and behavioral data. In this study, genomic format capacities are merely used as information technology solutions for the programmatic and standardized management of behavioral data and subsequent statistical analysis. The nucleotide alphabet does not contribute to the encoding process that ignores the biological nature of DNA. As a consequence, the encoded behavioral data cannot be used to infer functional and evolutionary relationships across species and experiments through standard evolutionary-based genome alignment tools.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Authors: Alice A Robie; Jonathan Hirokawa; Austin W Edwards; Lowell A Umayam; Allen Lee; Mary L Phillips; Gwyneth M Card; Wyatt Korff; Gerald M Rubin; Julie H Simpson; Michael B Reiser; Kristin Branson Journal: Cell Date: 2017-07-13 Impact factor: 41.582
Authors: Weizhe Hong; Ann Kennedy; Xavier P Burgos-Artizzu; Moriel Zelikowsky; Santiago G Navonne; Pietro Perona; David J Anderson Journal: Proc Natl Acad Sci U S A Date: 2015-09-09 Impact factor: 11.205
Authors: Alfonso Pérez-Escudero; Julián Vicente-Page; Robert C Hinz; Sara Arganda; Gonzalo G de Polavieja Journal: Nat Methods Date: 2014-06-01 Impact factor: 28.547
Authors: Laura A Solt; Yongjun Wang; Subhashis Banerjee; Travis Hughes; Douglas J Kojetin; Thomas Lundasen; Youseung Shin; Jin Liu; Michael D Cameron; Romain Noel; Seung-Hee Yoo; Joseph S Takahashi; Andrew A Butler; Theodore M Kamenecka; Thomas P Burris Journal: Nature Date: 2012-03-29 Impact factor: 49.962
Authors: Daniel R Zerbino; Nathan Johnson; Thomas Juettemann; Steven P Wilder; Paul Flicek Journal: Bioinformatics Date: 2013-12-19 Impact factor: 6.937
Authors: Jose Espinosa-Carrasco; Toni Hermoso Pulido; Ionas Erb; Mara Dierssen; Julia Ponomarenko; Cedric Notredame Journal: Nucleic Acids Res Date: 2019-07-02 Impact factor: 16.971