Shannon E Keenan1,2, Maria Avdeeva3, Liu Yang2, Daniel S Alber1,2, Eric F Wieschaus2,4, Stanislav Y Shvartsman2,3,4. 1. Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08540. 2. The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540. 3. Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY 10010. 4. Department of Molecular Biology, Princeton University, Princeton, NJ 08540.
Abstract
During early Drosophila embryogenesis, a network of gene regulatory interactions orchestrates terminal patterning, playing a critical role in the subsequent formation of the gut. We utilized CRISPR gene editing at endogenous loci to create live reporters of transcription and light-sheet microscopy to monitor the individual components of the posterior gut patterning network across 90 min prior to gastrulation. We developed a computational approach for fusing imaging datasets of the individual components into a common multivariable trajectory. Data fusion revealed low intrinsic dimensionality of posterior patterning and cell fate specification in wild-type embryos. The simple structure that we uncovered allowed us to construct a model of interactions within the posterior patterning regulatory network and make testable predictions about its dynamics at the protein level. The presented data fusion strategy is a step toward establishing a unified framework that would explore how stochastic spatiotemporal signals give rise to highly reproducible morphogenetic outcomes.
During early Drosophila embryogenesis, a network of gene regulatory interactions orchestrates terminal patterning, playing a critical role in the subsequent formation of the gut. We utilized CRISPR gene editing at endogenous loci to create live reporters of transcription and light-sheet microscopy to monitor the individual components of the posterior gut patterning network across 90 min prior to gastrulation. We developed a computational approach for fusing imaging datasets of the individual components into a common multivariable trajectory. Data fusion revealed low intrinsic dimensionality of posterior patterning and cell fate specification in wild-type embryos. The simple structure that we uncovered allowed us to construct a model of interactions within the posterior patterning regulatory network and make testable predictions about its dynamics at the protein level. The presented data fusion strategy is a step toward establishing a unified framework that would explore how stochastic spatiotemporal signals give rise to highly reproducible morphogenetic outcomes.
At the start of animal gastrulation, epithelial primordia bend toward the interior of the embryo, taking the first step toward forming internal organs. In particular, involution of endoderm via the blastopore precedes formation of posterior structures of the gut (e.g., posterior midgut [PMG] and hindgut [HG]) (1–7). This morphogenetic event is foreshadowed by spatial patterning within the primordium that is defined by differential expression of regulators, such as signaling molecules and transcription factors. Patterning of the gut progresses in a similar fashion in organisms as diverse as corals and mammals, with the homologs of essential components (such as cdx, foxA, bra, and wnt genes) displaying striking similarities in their expression profiles (2, 8–16). Understanding the organization of such conserved patterns is fundamental for understanding organogenesis. Here, we examine patterning dynamics in the Drosophila embryo, where the PMG and HG primordia are established within a short time window of only 2 h.The PMG and HG primordia in Drosophila are induced by the locally activated extracellular signal–regulated kinase (ERK) pathway. ERK relieves gene repression by a high mobility group (HMG)-box DNA binding protein Capicua (Cic), which acts as a sensor of ERK activation in multiple stages of Drosophila development (17–19). Phosphorylated by ERK, Cic rapidly leaves the enhancers of its target genes (Fig. 1), which are positively regulated by maternal and zygotic activators (19, 20). Among the direct targets of Cic are tailless (tll), a member of the nuclear receptor family, and huckebein (hkb), a zinc-finger transcription factor, both of which are induced within minutes of ERK activation. Tll and Hkb proteins shape the expression domains of several genes within the PMG/HG primordia; these include brachyenteron (byn), the Drosophila homolog of T-box transcription factor Brachyury; wingless (wg), which encodes a secreted Wnt-family ligand; and forkhead (fkh), which encodes a winged helix transcription factor (Fig. 1) (2). Another key regulator of the posterior gut patterning is caudal (cad), encoded by the Drosophila homolog of the vertebrate CDX-2 homeobox gene (21). Posteriorly localized cad is translated from both maternal and zygotic transcripts and activates genes involved in Drosophila endoderm and ectoderm specification.
Fig. 1.
The posterior gut specification GRN and acquisition of transcriptional information. (A) ERK signaling, which is activated through the receptor tyrosine kinase (RTK) Torso at the posterior pole of the Drosophila embryo, antagonizes repression by Cic. Export of Cic from the nucleus and subsequent degradation in the cytoplasm allow for activation of downstream genes. (B) A minimal network of five genes that specifies the posterior gut (2, 21). Arrows indicate activation, and flat lines indicate repression. (C) Transcription for five genes was tracked by endogenously inserting MS2 stem loops (black) into the 5 UTR of each of our genes using CRISPR. (D, data acquisition) About 1/3 of the embryo is imaged starting from the posterior pole using light-sheet microscopy. A stack of images is taken once every minute from NC 11 to the start of gastrulation for a total of 90 min. (D, raw data) Representative raw images of a single slice (row 1) and a maximum intensity projection of all 200 slices (row 2) in the nuclear (Histone-GFP) and MS2-MCP (MCP-mCherry) channels are shown. (Scale bar: 50 µm.) (D, processing) Images were segmented in 2D using ilastik (41) and reconstructed in 3D using arivis Vision4D. Overlapping nuclear and dot “objects” indicated whether a nucleus was actively transcribing a gene. Nuclei containing an MS2 dot were “colored” with gene expression. (D, output) A list of all nuclear objects paired with information about their positions and transcriptional activity was exported for further analysis.
The posterior gut specification GRN and acquisition of transcriptional information. (A) ERK signaling, which is activated through the receptor tyrosine kinase (RTK) Torso at the posterior pole of the Drosophila embryo, antagonizes repression by Cic. Export of Cic from the nucleus and subsequent degradation in the cytoplasm allow for activation of downstream genes. (B) A minimal network of five genes that specifies the posterior gut (2, 21). Arrows indicate activation, and flat lines indicate repression. (C) Transcription for five genes was tracked by endogenously inserting MS2 stem loops (black) into the 5 UTR of each of our genes using CRISPR. (D, data acquisition) About 1/3 of the embryo is imaged starting from the posterior pole using light-sheet microscopy. A stack of images is taken once every minute from NC 11 to the start of gastrulation for a total of 90 min. (D, raw data) Representative raw images of a single slice (row 1) and a maximum intensity projection of all 200 slices (row 2) in the nuclear (Histone-GFP) and MS2-MCP (MCP-mCherry) channels are shown. (Scale bar: 50 µm.) (D, processing) Images were segmented in 2D using ilastik (41) and reconstructed in 3D using arivis Vision4D. Overlapping nuclear and dot “objects” indicated whether a nucleus was actively transcribing a gene. Nuclei containing an MS2 dot were “colored” with gene expression. (D, output) A list of all nuclear objects paired with information about their positions and transcriptional activity was exported for further analysis.Genetic studies of PMG and HG specification established several interactions within the corresponding gene regulatory network (GRN) (Fig. 1) (22–27). At the same time, most studies with the goal of creating a comprehensive description of gene expression in regulatory networks in Drosophila development rely on the use of fixed embryos. To construct spatially resolved expression profiles of multiple genes from fixed samples, spatial registration techniques have been applied to integrate partial observations in both Drosophila and other organisms (28–31). However, generating a time-resolved view of spatiotemporal gene expression dynamics from such data, although possible, has proven to be a daunting task (29, 32). As a result, we lack a detailed temporal view of the establishment of gene expression domains in PMG/HG patterning network or a clear description of the spatial relationship of these domains.Recent advances in live imaging, genome editing, and data science offer powerful tools for revisiting Drosophila gut determination with greatly increased temporal resolution. Nonetheless, one obstacle remains; it is not experimentally tractable to label the expression of more than two or three genes at a time (33–37) due to genetic constraints of expressing multiple live reporters in a single organism and from spectral overlap of fluorophores. Here, we present a systematic strategy that employs live imaging of transcriptional reporters of individual genes and algorithms that combine the partial observations into a multivariable trajectory. We demonstrate our strategy by reconstructing the wild-type dynamics of five genes involved in posterior gut patterning in the Drosophila embryo. We validate our findings using in situ hybridization and antibody staining in fixed embryos and provide a simplified view of the patterning dynamics through low-dimensional approximation. Finally, we use computational modeling to study the extent to which the simple patterning dynamics can be predicted by known gene interactions in the network.
Results
Live Imaging of PMG/HG Patterning.
In order to visualize transcriptional dynamics of the posterior patterning GRN, we created endogenous transcriptional reporters using the MS2–MS2 coat protein (MCP) system (33–35). This system utilizes stem loops derived from the MS2 bacteriophage that are inserted into a noncoding region of a gene and the expression of a fluorescently labeled MCP. As the gene is transcribed, the labeled MCP binds to the stem loops, creating fluorescent puncta (also referred to here as “dots”) and enabling observation of active transcription in individual nuclei in the posterior of the embryo. For five genes in the network, MS2 stem loops were inserted into the 5 untranslated region (UTR) of each gene with CRISPR-Cas9–mediated genome editing (Fig. 1) (38, 39). Transgenic flies for all genes were homozygous viable and were crossed to flies expressing MCP-mCherry and a Histone marker. To maximize spatiotemporal resolution of the posterior of the embryo in three dimensions, we used the Luxendo MuVi SPIM light-sheet microscope (Fig. 1) (40). We imaged nuclei and transcription starting from the pole cells, taking cross-sectional slices inward and capturing about one-third of the embryo. One stack was taken every minute from the beginning of nuclear cycle (NC) 11 to the start of gastrulation ( ∼ 90 min). Nuclei and MS2 dots were segmented using ilastik (41), and three-dimensional (3D) objects were reconstructed in arivis Vision4D (Fig. 1 and Materials and Methods). Four developmental movies were collected for each of the five genetic reporters (42). The output of segmented and reconstructed imaging datasets was a list of all nuclei paired with information about their positions and whether they contain a transcriptional dot, which indicates that the nucleus is actively transcribing a gene.Processed data paint a picture of how patterns arise and change over time. As expected for direct targets of ERK signaling, transcription of tll and hkb was observed at the onset of the movie during NC 11. Expression of tll and hkb continued through NC 12 and NC 13 and ceased about halfway through NC 14. Expression of byn, which is activated by Tll and repressed by Hkb, began in NC 12 and persisted through NC 14. About halfway through NC 14, transcription of byn subsided in the most posterior cells, resulting in a ring of expression. Similarly, wg expression was initiated in NC 13, eventually forming a ring of expression in NC 14. Lastly, fkh expression was initiated in NC 13 and continued in NC 14.Our set of transcriptional movies was a starting point for understanding multivariable profiles of the posterior patterning network. The next step was to combine these descriptions of individual patterns into a single embryo.
Fusion of Individual Expression Profiles.
In order to computationally fuse transcriptional data, imaging datasets first needed to be aligned in time and space due to variability of imaging initiation time and embryonic placement on the microscope. During registration steps, transcriptional information was ignored, and only embryonic morphology was employed to preserve spatiotemporal transcriptional variability between samples. NC lengths during Drosophila embryogenesis are consistent across samples (NC 12: 9.6 ±0.6 min, NC 13: min), as measured from all collected movies () (43, 44). Therefore, developmental time alignment between samples was performed by identifying the times of nuclear divisions in each movie (Materials and Methods and ) and specifying t = 0 as the 11th nuclear division in embryogenesis. renders the result of this alignment in terms of the total number of nuclei across movies and bulk transcriptional dynamics between replicates of the same gene, respectively.After temporal alignment, we obtained spatial registration of all samples to a template to account for the random rotation and translation of the embryo during mounting on the microscope as well as to correct for small fluctuations in the overall size of the embryo. This was achieved using a version of iterative closest point (ICP) algorithm (45) applied to the point clouds of nuclear centroids, with the formation of the ventral furrow at the end of the movie serving as a morphological cue for initial rotation (Materials and Methods and ).With all data aligned to the same axes, we developed a model to make predictions about spatiotemporal domains of gene expression on the template embryo. For every gene g, we used binary indicators of MS2 dots as a proxy for expression. We fit a random forest classification model (46, 47) to the observed expression profiles in the corresponding movies using the available spatiotemporal coordinates as features (Fig. 2). Indeed, every nucleus corresponds to an observation of the form , where are its spatiotemporal coordinates and is a binary variable indicating whether the nucleus is actively transcribing or not. We used a random forest classifier to solve the binary classification problem using D as predictors, with being the target variable. The performance of this classification approach was compared with gradient boosting and kNN classification through a cross-validation procedure (Materials and Methods). Cross-validation demonstrated similar performance between random forest and gradient boosting classifiers, with a small advantage over the kNN approach (Materials and Methods and ). Results of the evaluation of the best-performing random forest models on independent test sets are given in . For every gene g, we trained the corresponding tuned model on all the biological replicates and used the trained model to predict expression probabilities of this gene for every observation D on the morphology of the chosen template sample (Materials and Methods).
Fig. 2.
Fusion of transcriptional profiles of individual genes. (A) A schematic representation of our approach to data fusion. After the registration step (based on morphology) (), for every gene g, random forest classification was performed to predict transcription probabilities on spatiotemporal morphology of a shared model embryo. The output is a common view of multivariable transcriptional dynamics on the posterior of the embryo. For clarity, only two genes and one time point are shown. (B) Mean of the predicted expression probabilities on the model embryo as a function of (synchronized) time shown for individual genes and on the same axes (Bottom Right). The shaded regions show the 95% CIs.
Fusion of transcriptional profiles of individual genes. (A) A schematic representation of our approach to data fusion. After the registration step (based on morphology) (), for every gene g, random forest classification was performed to predict transcription probabilities on spatiotemporal morphology of a shared model embryo. The output is a common view of multivariable transcriptional dynamics on the posterior of the embryo. For clarity, only two genes and one time point are shown. (B) Mean of the predicted expression probabilities on the model embryo as a function of (synchronized) time shown for individual genes and on the same axes (Bottom Right). The shaded regions show the 95% CIs.Fig. 2 summarizes the dynamics of the expression probabilities provided by our trained models for individual genes, illustrating how the overall level of gene expression changes over time. For example, we see that levels of tll and hkb fall concurrently with a rise in wg expression and a reduction in byn levels during NC 14. To demonstrate spatial organization of the underlying process, Fig. 3 shows snapshots of individual and fused outputs of our classification on the coordinates of the template embryo. To validate some of our predictions, we have performed fluorescent in situ hybridization using intronic probes for byn and wg in embryos throughout NC 14 (Materials and Methods and , five time points shown). These data provided us with snapshots of the simultaneous dynamics of these two genes at the primary transcript level. In accordance with our results, these data demonstrated the emergence of byn pre-mRNA at the beginning of NC 14, with a gradual evolution from a posterior cap into a ring-like structure that was predicted by our random forest model. At the same time, we confirmed emergence of a thinner ring for wg concurrently with the recession of byn pre-mRNA from the posterior region.
Fig. 3.
Snapshots of fused expression patterns. Predicted transcriptional probabilities for representative time points are shown spatially on the template embryo (posterior view). For every individual plot, the color is normalized so that the highest intensity corresponds to probability of one. In the fused row, for every gene, the radius of its corresponding visible annulus is proportional to the predicted probability of transcription.
Snapshots of fused expression patterns. Predicted transcriptional probabilities for representative time points are shown spatially on the template embryo (posterior view). For every individual plot, the color is normalized so that the highest intensity corresponds to probability of one. In the fused row, for every gene, the radius of its corresponding visible annulus is proportional to the predicted probability of transcription.Fused data are valuable for examining the underlying properties of the genetic network, which we address next.
Posterior Patterning Network Dynamics Has Low Intrinsic Dimensionality.
The spatial patterns in Fig. 3 suggest a high level of coordination between the expression of individual genes within specific spatial components of the tissue. To construct the expression trajectories corresponding to emerging cell fates, the predictions between adjacent time frames must be connected. In lieu of a tracking algorithm, we created a grid of 1,000 points on the surface of the model embryo (Materials and Methods) and utilized our classification model again to make predictions for individual genes on this grid at every time point. As a result, each element of the grid displays a five-dimensional trajectory of predicted expression probabilities (Fig. 4). In order to dissect the structure of the multidimensional dynamics and extract interpretable features, we applied nonnegative matrix factorization (NMF) to the trajectories (Fig. 4). Dimensionality reduction showed that our data are accurately explained by two primary components (23% relative residual error) (Materials and Methods), revealing two developmental trajectories that are linearly combined at every locus with spatial mixing coefficients (Fig. 4). The approximation provided by the two NMF components recapitulates the spatiotemporal patterns of the fused expression probabilities (Fig. 5). The mixing coefficients of NMF components on the embryo are concentrated in a “cap” and a “ring” of nuclei around the pole, reminiscent of the tissue that will become ectoderm and endoderm of the future gut. Feature importance scores () offer a closer look into divergence between the two trajectories for every time point at the level of individual genes. A major distinction between the trajectories can be observed already in NC 13 through differential expression of fkh and hkb and persists throughout NC 14, culminating in its second half with input from wg and byn. This observation might provide insight into timing and origins of gut cell–type specification.
Fig. 4.
NMF reveals two major transcriptional trajectories. (A) The transcriptional fusion (Fig. 2) produces a five-dimensional trajectory at every element n of the grid E on the model embryo. (B) NMF uncovers low-dimensional structure in the set of N = 1000 five-dimensional trajectories predicted on the surface of the model embryo. For every rank where M = 96 is the number of time points in the fused trajectories, NMF provides an approximate decomposition of as a linear combination of k nonnegative basis components that are independent of the location on the embryo. The components are combined at every location n with nonnegative mixing coefficients that are, thus, shared between all genes. (C) Two NMF components (for k = 2; Left) and their corresponding mixing coefficients (Right) are shown. (Left) For each component, its elements for individual genes are shown as a function of time. Mixing coefficients (shown by color intensity and dot radius in Right) display clear symmetric cap and ring structures.
Fig. 5.
Snapshots of the expression patterns provided by low-dimensional NMF approximation. Snapshots of the approximate probabilities of transcription provided by two NMF components shown on the embryonic grid for the same time points as in Fig. 3. For individual gene predictions, the color in every figure is normalized so that the highest intensity corresponds to one. In the fused row, for every gene, the radius of its corresponding visible annulus is proportional to the approximate probability of transcription.
NMF reveals two major transcriptional trajectories. (A) The transcriptional fusion (Fig. 2) produces a five-dimensional trajectory at every element n of the grid E on the model embryo. (B) NMF uncovers low-dimensional structure in the set of N = 1000 five-dimensional trajectories predicted on the surface of the model embryo. For every rank where M = 96 is the number of time points in the fused trajectories, NMF provides an approximate decomposition of as a linear combination of k nonnegative basis components that are independent of the location on the embryo. The components are combined at every location n with nonnegative mixing coefficients that are, thus, shared between all genes. (C) Two NMF components (for k = 2; Left) and their corresponding mixing coefficients (Right) are shown. (Left) For each component, its elements for individual genes are shown as a function of time. Mixing coefficients (shown by color intensity and dot radius in Right) display clear symmetric cap and ring structures.Snapshots of the expression patterns provided by low-dimensional NMF approximation. Snapshots of the approximate probabilities of transcription provided by two NMF components shown on the embryonic grid for the same time points as in Fig. 3. For individual gene predictions, the color in every figure is normalized so that the highest intensity corresponds to one. In the fused row, for every gene, the radius of its corresponding visible annulus is proportional to the approximate probability of transcription.The mixing weights produce an embedding of our data into a plane (), with most of the variance in the weights explained by nonlinear functions of the anterior–posterior (AP) axis coordinate only (). Nonetheless, residual error analysis suggests that an additional fine-scaled structure might be present in the data. depicts the norm of the residual error of our NMF approximation over time plotted on the template embryo for each gene. Most of the error is concentrated at the dorsal side of the embryo for byn and away from the dorsal side for hkb and wg, suggesting that dorsal–ventral (DV) signaling may play a role in expression of the network. In fact, interactions between dorsal and Cic have been shown to lead to DV gradients in Cic concentration, which may propagate through the GRN (48). Additionally, note that the quality of our NMF approximation varies between individual genes, with wg displaying the highest residual to expression ratio on average (). This suggests that wg may be expressed in a cell-type subset within the posterior, reflecting a more complex relationship between the patterning dynamics and the ultimate cell fates.To test the sensitivity of our predictions to each individual gene in silico, we performed experiments removing from consideration the data corresponding to one gene at a time () and observed remarkable robustness of the low-dimensional decomposition. The patterns remained stable after removal of any gene except byn, which when removed, produced a shift in the distribution of the ring mixing coefficients (). This suggests that while byn is the main driver of the ring component, other genes collectively contribute to the formation of the two components.
Modeling Posterior Patterning Network.
To validate some of the presumed relationships in the terminal patterning GRN (Fig. 1), we collected quadruple antibody staining for embryos in late NC 14 in tll and hkb mutant backgrounds (Fig. 6 and Materials and Methods). In embryos, both Byn and Wg become expanded toward the posterior pole, suggesting that the posterior boundary for both genes is set by their relationship with Hkb serving as a repressor. At the same time, in background, no substantial Byn or Wg protein can be detected in the posterior of the embryo, confirming that Tll expression might be necessary for their activation.
Fig. 6.
Bayesian ODE model for the posterior patterning network. (A) A scheme of the GRN model that was applied to posterior patterning data. Squares, N(t): observed instantaneous rate of transcription; circles, P(t): unobserved protein concentration. Arrows indicate activation, and flat lines indicate repression. Dashed arrows correspond to the ODE part of the model (Materials and Methods). For every downstream gene in the network (byn, fkh, wg), its rate of transcription is modeled to result from a combined input of concentrations of upstream proteins [passed through nonlinear filters, denoted by ]. Other parameters include protein degradation rates (d), interaction strength (a), and thresholding and slope parameters (c, s). (B) Protein staining of NC 14 syncytial blastoderm into early gastrulation (the lateral view of the posterior portion is shown). The first four columns show wild-type terminal patterning. Columns 5 and 6 show mutant terminal patterning. NC 14A: 0 to 20 min; NC 14B: 20 to 40 min; NC 14C: 40 to 60 min of NC 14. Gast indicates early gastrulation. (Scale bars: 50 µm.) (C) Predicted transcription and protein dynamics obtained using Bayesian inference for the ODE-based regulatory network model shown in A. (Upper) Dynamics of the instantaneous rate of nascent RNA production for three downstream genes of the network. Dashed lines indicate posterior predictive means. (Lower) Mean predicted dynamics of protein concentration for three upstream genes. (D) Spatial distribution of protein concentration and transcription rates in NC 14C. (Upper) Predicted (minimum/maximum-normalized) expected concentrations of Tll and Hkb are shown as dotted lines. Mean protein concentration derived from immunostaining of 20 NC 14C embryos is shown in gray. (Lower Left) Transcriptional profiles of downstream genes of the GRN. (Lower Right) Immunostaining data (including Byn; solid lines) and model predictions (dotted lines) for protein concentration are overlaid with the transcriptional profile of byn. The x axis is the embryonic length (EL; 100% corresponds to posterior pole). has details on normalization and inference of common embryonic coordinates. In all plots of C and D, shaded regions denote twice the SD from the mean.
Bayesian ODE model for the posterior patterning network. (A) A scheme of the GRN model that was applied to posterior patterning data. Squares, N(t): observed instantaneous rate of transcription; circles, P(t): unobserved protein concentration. Arrows indicate activation, and flat lines indicate repression. Dashed arrows correspond to the ODE part of the model (Materials and Methods). For every downstream gene in the network (byn, fkh, wg), its rate of transcription is modeled to result from a combined input of concentrations of upstream proteins [passed through nonlinear filters, denoted by ]. Other parameters include protein degradation rates (d), interaction strength (a), and thresholding and slope parameters (c, s). (B) Protein staining of NC 14 syncytial blastoderm into early gastrulation (the lateral view of the posterior portion is shown). The first four columns show wild-type terminal patterning. Columns 5 and 6 show mutant terminal patterning. NC 14A: 0 to 20 min; NC 14B: 20 to 40 min; NC 14C: 40 to 60 min of NC 14. Gast indicates early gastrulation. (Scale bars: 50 µm.) (C) Predicted transcription and protein dynamics obtained using Bayesian inference for the ODE-based regulatory network model shown in A. (Upper) Dynamics of the instantaneous rate of nascent RNA production for three downstream genes of the network. Dashed lines indicate posterior predictive means. (Lower) Mean predicted dynamics of protein concentration for three upstream genes. (D) Spatial distribution of protein concentration and transcription rates in NC 14C. (Upper) Predicted (minimum/maximum-normalized) expected concentrations of Tll and Hkb are shown as dotted lines. Mean protein concentration derived from immunostaining of 20 NC 14C embryos is shown in gray. (Lower Left) Transcriptional profiles of downstream genes of the GRN. (Lower Right) Immunostaining data (including Byn; solid lines) and model predictions (dotted lines) for protein concentration are overlaid with the transcriptional profile of byn. The x axis is the embryonic length (EL; 100% corresponds to posterior pole). has details on normalization and inference of common embryonic coordinates. In all plots of C and D, shaded regions denote twice the SD from the mean.The small size of the network paired with the simple description of the network dynamics provided by the low-dimensional approximation allowed us to fit an ordinary differential equation (ODE) model to our data (Materials and Methods). The model (Fig. 6) incorporates activating and repressing relationships between the genes in the network in Fig. 1, assuming nonlinear relationships between the protein level of an upstream gene and the instantaneous rate of transcription of a target gene. These interactions are modeled through the Hill function, , with additional parameters for the strength of the interaction and protein degradation rates. We fit our system to two NMF components for the transcription intensity levels (derived from probabilities predicted by our random forest model) (Materials and Methods). Absence of RNA synthesis during mitosis allows us to fit the model parameters using only transcription intensities in the interphase. We employed a Bayesian framework for parameter inference; prior distributions for 14 free model parameters and fixed values chosen for other parameters can be found in .We applied the GRN model to our data by fitting it to both NMF components simultaneously and simulating samples from the posterior distribution of the parameters using a Monte Carlo approach (Materials and Methods). This allowed us to study the posterior predictive distribution of transcription trajectories for the downstream genes in the network: byn, fkh, and wg. Interactions within the model were sufficient to capture the overall dynamic trends of these genes in both spatial components (cap posterior mean RMSE = 0.2414, ring posterior mean RMSE = 0.1963) (Fig. 6 , Upper). We observed that, despite the flexibility of our chosen model, most of the marginal posterior distributions were unimodal and were concentrated in small regions of the parameter space (). As an example, the instantaneous rate of transcription of byn in both components is described within the model framework by an equation of the formwhere and are the upstream protein concentrations; a denotes the strength of interactions; and c1, c2 and s are thresholding and slope parameters, respectively. As a result, dynamics of byn are explained by a trade-off between the upstream Tll and Hkb protein profiles, which provides tight constraints on the degradation rates of these proteins. Our model provides an estimate of 40 ±2 min for the half-life of Tll, with a much longer half-life (on the order of hours) predicted for Hkb and Fkh. Indeed, the posterior distribution allowed us to simulate Tll, Hkb, and Fkh protein dynamics (Fig. 6 , Lower), which reflect their corresponding degradation rates. In particular, our results for the cap component (NMF component 1) suggest that Tll, Hkb, and Fkh persist in the posterior of the embryo throughout NC 14.
Testing GRN Model Predictions.
To test model predictions and obtain a view of the network dynamics at protein level, we performed simultaneous antibody staining for Tll, Hkb, Byn, and Wg proteins in wild-type embryos across NC 14 (Fig. 6). Overall, these data reveal that the spatial structure that we identified from RNA is preserved at the protein level, with Hkb, expressed in a small posterior cap region, getting combined with Tll in a broader cap region to define the two spatial components setting up a more refined structure of Byn and Wg downstream. In line with our GRN model predictions, we observed Hkb and Tll proteins in their respective spatial compartments of the embryo throughout NC 14 until gastrulation.Furthermore, by combining concentration predictions of Fig. 6 using spatial coefficients provided by NMF, one can predict how a protein is distributed along the AP axis at any point of our time course ( has details). Thus, to test our model further, we quantified Tll, Hkb, and Byn concentrations in embryos between 40 and 60 min into NC 14 (NC 14C) and contrasted the result against our predicted spatial protein profiles for this stage of development (Fig. 6). We observed striking agreement between model predictions for Tll and Hkb and these orthogonal measurements (with divergence around the posterior pole, which can be explained by not excluding pole cells from our analysis). We stress that the resulting shape of the protein profiles could not be immediately predicted without GRN modeling (since the shape of the corresponding transcription profiles dynamically changes throughout the whole time course of our observation).Unlike Tll and Hkb, however, our GRN model does not incorporate Byn and cannot provide a prediction for its distribution. Previous studies (49) suggested that Byn protein follows similar dynamics to its RNA, localizing, during cellularization, to the posterior ring of cells mostly comprising the proctodeal primordium. However, even though our data confirm that transcription of byn ceases in the cap spatial component by NC 14C, quantification of immunostaining data for Byn for this stage suggests its persistence in both spatial compartments of the posterior (Fig. 6 , Lower). Furthermore, Byn staining in Fig. 6 demonstrates that the Byn domain stays extended toward the posterior pole of the embryo until the beginning of gastrulation. We have followed our static experiments with live imaging Byn protein (Movie S1 and ) using a LlamaTag construct, which allowed us to visualize endogenous concentration dynamics of Byn by fusing it to a nanobody (50). Live imaging corroborated the static observations, with Byn clearly identifiable in the nuclei in both spatial components until gastrulation.
Discussion
Our approach to reconstructing dynamics of posterior gut specification is based on several experimental and computational features: generation of transcriptional reporters and homozygous viable transgenic flies, light-sheet imaging, image segmentation and reconstruction, temporal and spatial registration of samples, fusion of datasets through random forest classification, dimensionality reduction, and GRN modeling. These steps led to a compact description of multivariable transcriptional dynamics of the Drosophila PMG/HG regulatory network, which we found to be accurately approximated by two different gene expression trajectories. Our data fusion approach can readily incorporate additional replicates for the five genes considered here and imaging data for other genes involved in the PMG/HG network. While we provided accuracy estimates for our fusion and low-dimensional decomposition, these estimates as a function of the number of replicates or state variables remain to be investigated. Additional data may refine the boundaries of the two NMF components we described or reveal additional components specifying finer cell subtypes within the tissue.Possessing a comprehensible description of network dynamics opens up further questions on what drives differentiation. For instance, our data revealed that transcription of tll and hkb stops about halfway through NC 14. The underlying mechanisms of this are not understood and may involve loss of transcriptional activators, appearance of repressors, or the transient nature of the terminal ERK signal. A similar question is derived from our data involving the extent to which the dynamics of tll, hkb, and fkh constrain those of byn and wg (e.g., leading to their posterior disappearance in late NC 14). While our study suggests that a relatively simple GRN model might explain this latter feature of transcriptional dynamics of byn and wg, additional variables or refinements in the model might be required to explain finer spatiotemporal features of posterior patterning (such as the differences between transcriptional domains of wg and byn in late NC 14). Pairing modeling with imaging experiments in fixed and live embryos, we demonstrated how transcriptional dynamics constrain protein concentrations for upstream transcription factors (TFs) Tll, Hkb, and Fkh. At the same time, Byn appears to be initially broadly activated, with its boundaries gradually refined and the functional protein still present in both spatial components prior to gastrulation. Functional consequences of dynamics of Byn in PMG primordium and additional inputs or features in protein stability that play a role in such spatiotemporal induction remain to be elucidated.The presented strategy relies on live imaging data, and we expect that it can be extended in several directions. First of all, similar experimental techniques can be harnessed for observation of processes other than transcription, such as live imaging of protein as it is translated (50, 51), or for probing direct interactions within the network (e.g., by simultaneous live imaging of transcriptional activity of developmental genes and their cis-regulatory elements) (52). Another direction is incorporating morphological information (e.g., by live imaging cell shape changes with the use of nuclear and membrane markers) (53–55). Furthermore, multiplexity could be significantly increased with the use of modern transcriptomic technologies. Single-cell RNA sequencing captures information about the entire transcriptome, and profiles of single cells collected prior to gastrulation have been successfully mapped to embryonic coordinates (56–58). The result, however, is still only a snapshot of transcriptional dynamics in the embryo. Integrating whole-transcriptome sequencing and live imaging data in this context could further inform us about the dynamics of cell fate specification (59, 60). A similar strategy was successfully implemented in the chordate Phallusia mammillata embryo, where gene expression trajectories were reconstructed from single-cell RNA-sequencing (scRNA-seq) and light-sheet imaging data (61). Developing further computational approaches to integrating different data modalities is essential for advancing our understanding of development and cell fate decisions.There are advantages to working with Drosophila embryos that make our method of data fusion feasible. These include low morphogenetic variability, easy genetic manipulation, and amenability to live imaging. Such features should be present for use of our method in other model organisms. Consider the ascidian, Ciona intestinalis. Due to their size (140 µm in diameter) and transparency, whole Ciona embryos can be imaged on a light-sheet microscope with high resolution (62). They have a manageable number of cells with invariant cleavage patterns. The MS2-MCP system has not been implemented in Ciona, but previous studies have successfully utilized fluorescently tagged proteins for similar aims (63, 64). Zebrafish embryos can also be imaged with light-sheet microscopy (65, 66), and MS2 reporters have previously been used within this developmental system (37). Nonetheless, difficulties may arise when considering higher-level vertebrates. Xenopus laevis, utilized for their large size, may be more difficult to image due to their opaque yolk. While preimplantation mouse embryos have been imaged with high resolution, there is much higher variability between samples, and cell cycles are not synchronized, making temporal and spatial registration more difficult (67). Despite these challenges, we hope our method can be adapted to support the goal of understanding network dynamics with data fusion in other systems.
Materials and Methods
Fly Stocks and Genetic Crosses.
MS2 reporter flies were generated for five genes using CRISPR-Cas9–based insertion of 24x-MS2 RNA stem loops into the 5 UTR of each endogenous gene. Virgin females with His2Av-GFP were crossed to males with MCP-mCherry. Virgins from this cross were placed in a cage with males expressing homozygous gene-MS2. Embryos from this cross, which were all heterozygous for the MS2 allele, were imaged.
Generation of Transcriptional Reporters Using CRISPR.
For insertion of MS2 stem loops into the 5 UTR of our genes of interest, the pU6-BbsI-chiribonucleic acid expression plasmid (38) and the pHD-dsRed-24xMS2 donor plasmid (gifted by the Levine Lab) were coinjected to yw;nos-Cas9(II-attP40) or yw;;nos-Cas9(attP2) embryos. Microinjection was performed by BestGene Inc., and dsRed was used for subsequent screening. More details are in .
Light-Sheet Microscopy.
Embryos were manually dechorionated with tape and suspended in 1% low-melt agarose gel in a capillary tube. The embryo was oriented such that the AP axis runs perpendicular to the capillary tube. Once the gel solidified, the embryo was placed in the chamber of the Luxendo MuVi SPIM light-sheet microscope. The microscope has two Nikon 10×/0.3 water objectives and two Olympus 20×/1.0-W detection objectives. The embryo was rotated in the chamber such that the AP axis of the embryo was parallel to the detection cameras. In this way, the posterior end of the embryo was closer to one of the detection cameras. Data from the camera that was closest to the posterior of the embryo were used. A stack of 200 cross-sectional slices was taken from the posterior pole of the embryo up to 200 µm inward, with a z-step size of 1 µm. One stack was taken every minute for ∼90 min. The 488-nm laser was used to image nuclei (His-GFP), and the 561-nm laser was used to image MS2 dots (MCP-mCherry), both at 5% laser power. Exposure time for the green channel was 55 ms, and exposure for the red channel was 70 ms. The line illumination tool was used to improve background levels and was set to 40 pixels.
Protein Immunostaining and Quantification.
Embryos were dechorionated by treating with 50% bleach for 1 min and fixed in 1:1 heptane/4% formaldehyde in phosphate buffered saline (PBS) for 25 min, after which they were vortexed in methanol to remove vitelline membranes. To stain, embryos were washed in PBT (0.1% Triton X-100, 1% bovine serum albumin, 0.01% sodium azide in PBS), blocked in Image-iT FX signal enhancer (ThermoFisher; I36933), and incubated with a mixture of all primary antibodies excluding rat anti-Byn in staining buffer (5% bovine serum albumin, 0.1% Triton X-100, 0.01% sodium azide in PBS) at 4 ∘C overnight. Embryos were washed in PBT and incubated with secondary antibodies in staining buffer for 2 h at room temperature. A second fixation as before for 15 min at room temperature was followed by washing in PBT and overnight incubation in staining buffer with rat anti-Byn at 4 ∘C. Embryos were washed in PBT and incubated with goat anti-rat secondary antibody for 2 h at room temperature. To image, embryos were mounted on glass coverslips in Aqua-Poly/Mount (Polysciences; catalog no. 18606). More details, including embryo staging and protein quantification, are in .
Fluorescent In Situ Hybridization Using Intronic Probes.
Sheep anti-digitonin (DIG) (1:125; Roche; catalog no. 11333089001) and mouse antibiotin (1:125; Jackson ImmunoResearch; catalog no. 200-002-211) were used as primary antibodies. Alexa Fluor conjugates were used as secondary antibodies. DAPI (1:10,000; Molecular Probes; catalog no. D1306) was used to stain nuclei. DIG- or biotin-labeled antisense intronic probes were synthesized from the sequences within the introns of byn and wg using the primers listed in ; 0- to 1-h-old embryos from Oregon R flies were collected and aged for 2.5 h. The embryos were dechorionated in 50% bleach and fixed in 4% formaldehyde in PBS for 20 min. The fixed embryos were incubated in 90% xylene for 1 h and in 80% acetone for 10 min at –20 ∘C. Then, embryos were hybridized overnight with intronic byn probes labeled with DIG and intronic wg probes labeled with biotin. Next, the embryos were washed for 4 h and then, incubated with secondary antibodies and DAPI for 1 h. Embryos were then imaged on the Leica SP5 confocal microscopy with a 20× objective.
Image Segmentation and 3D Object Creation.
Image segmentation was performed for each two-dimensional (2D) slice of a movie using ilastik (41). This tool uses semantic image segmentation to train a random forest classifier on features of the image, including intensity, texture, and edges. At least five time points that captured representative images throughout the movie (different NCs, during division, during elongation, etc.) were loaded into the software. Images were annotated (nuclei or dots and background) for at least five slices per time point. These five slices were chosen based on distance from the camera to capture any loss of intensity with depth. We recursively added annotation until satisfied with performance on test slices. Then, segmentation was batch performed for all time points (). We used arivis Vision4D to reconstruct 3D nuclear and dot objects from segmented 2D binary images and to extract information about those objects, including positions, volumes, and intensities in both channels. Details are in .
Temporal and Spatial Registration.
Because each movie begins at a slightly different time, movies must be aligned temporally. We approached this problem by aligning the movies by the 11th nuclear division in embryogenesis. Details are in .Each movie can be viewed as a set of point clouds of centroids of the segmented nuclei at every time point. The ICP algorithm (45) was used to align point clouds between samples. We started with choosing the template sample Q, which represents the largest portion of the geometry of the embryo (i.e., exhibits the maximal average number of nuclei throughout the movie). For a certain choice of metric, for every movie M, ICP identified a rigid transformation T, minimizing the total distance between point cloud sets TM and Q at their last common time point (). We utilized a variation of the algorithm by introducing a scaling factor into the optimization problem. ICP converges to a local minimum and is sensitive to initialization; the formation of the ventral furrow at the end of our observation time frame served as a morphological cue for the initialization. After alignment, we applied principal component analysis to the nuclear centroids in all the samples pooled together. The first principal component aligned with the AP direction, while the DV axis was roughly given by the third component.
Fusion of Gene Expression Profiles.
Let be the binary indicator of transcription in every nucleus n of a movie collected for gene g [i.e., is set to zero if no MS2 dot is observed in n, and is set to one otherwise]. All data collected for gene g can, therefore, be viewed as a set , where N(g) denotes the set of all the nuclei observed in the samples corresponding to gene g, with denoting spatiotemporal coordinates of nucleus n after registration. Fusion of gene expression profiles in individual movies is achieved through random forest classification with serving as training data. Details on model selection and hyperparameter tuning are given below. The trained model allows us to predict expression probabilities for any new observation by taking the mean of the predicted positive class probabilities of the individual trees in the ensemble. For an individual tree, the probabilities are calculated as the proportion of positive votes in the corresponding leaf. In Fig. 2, the CIs for the mean predictions as a function of time were calculated using 1,000 bootstrap samples.We used the sklearn python implementation for the random forest classification and gradient boosting and k-nearest neighbors (kNN) classifiers from the same package to perform the following cross-validation comparison. For each gene, four samples were collected, of which one random sample was set aside for final testing. On the other three samples, threefold cross-validation was performed, with individual replicates consecutively chosen for validation sets, to achieve independence between training and validation sets. Cross-validation allowed for model selection and hyperparameter tuning. Hyperparameter tuning was performed through a randomized grid search (with 200 iterations per fold) over the sets of hyperparameter values shown in . Due to the imbalanced nature of the task, the area under the precision–recall curve was chosen as the optimization metric, and the final score was calculated as the mean score over the folds. The best scoring parameters for all three methods are given in . Note that an alternative approach, which evaluates overall performance by aggregating predictions over the folds and comparing them against ground truth (instead of averaging the scores over folds), produced similar results ().
Nuclear Trajectories.
In order to construct gene expression trajectories over time, predictions at adjacent time frames need to be connected. Coarse-grained tracking of the nuclei was done by applying k-means clustering to all the nuclear centroids in Q, fixing k = 1000 to produce a grid on the template driven by the embryonic morphology. Denoting the resulting cluster centroids by n, , the desired trajectories at every grid element n were then obtained from probabilistic predictions of the random forest classifier ( Fusion of Gene Expression Profiles) at these centroids over the aligned time course: that is,
Dimensionality Reduction.
NMF (68) is a popular method of dimensionality reduction and topic modeling. For a given data matrix X, NMF seeks to decompose it into a product of lower-rank nonnegative matrices: dictionary matrix W, which encodes hidden topics in the data, and code matrix H containing the mixture weights with which each hidden topic appears in each observation. More precisely, for and rank , we solve the following optimization problem:where is the Frobenius norm.We applied NMF from the sklearn python package to the matrix F with the flattened multidimensional trajectories in the rows (Nuclear Trajectories and ). The elements of matrix F are in one-to-one correspondence with a (j, g, t) triple of grid element index j, gene g, and time point t. NMF was initialized with nonnegative double singular value decomposition and used a coordinate descent solver. As the result of factorization, we obtained matrices C, with columns containing spatial mixing weights, and G, with rows containing gene expression trajectories. Columns of C and rows of G were simultaneously renormalized so that the maxima of the spatial mixing weights found in every column of C (after ordering by the AP coordinate and smoothing with the window of width 50) equal one. Information on the residual analysis, calculation of feature importance scores, and robustness tests for NMF decomposition are in .
Instantaneous Rate of Transcription.
Observed MS2 fluorescence intensity is roughly proportional to the number of actively transcribing RNA polymerase (RNAP) molecules (35). With a Poisson distribution assumption on the RNAP molecule count for gene g in a nucleus n at time point t, transcription probabilities predicted by our classification model ( Fusion of Gene Expression Profiles) can be approximated byThus, we can estimate the underlying mean transcription intensities from the random forest–predicted probabilities by employing the relationship We repeated NMF dimensionality reduction for these approximated intensity profiles, which revealed a 2D structure similar to the one described for the case of probability trajectories (). We employed this low-dimensional approximation in our GRN model. In what follows, we will denote the two resulting trajectories by and their corresponding spatial coefficients by .
ODE Model for a GRN.
We propose a simplified model for the GRN in Fig. 1, in which an activating or a repressive relationship between a protein and its downstream gene is modeled through a multiplicative factor employing the Hill function: for some positive threshold and slope parameters c and s. Note that our model ignores the rates at which protein is bound and unbound to DNA and does not account for diffusion.More precisely, for gene g and N, M, and P corresponding to instantaneous nascent transcription intensity, total messenger RNA (mRNA) product, and total protein product, respectively, the dynamics of the network for a nucleus n is chosen to be independent of other nuclei and can be described by the following system of equations:Here, e and d denote the decay coefficients of the mRNA and protein, respectively, and c is the rate of translation. The relationships between a gene g and its upstream factors within the network are encoded through the sets A(g) and R(g) of genes that serve as activating and repressive inputs to g, respectively. The strength of the interaction is modeled through the corresponding parameters and .Under quasisteady-state approximation, M can be assumed to be proportional to N, which after proper rescaling of P, reduces the system toHere, we allowed a slight abuse of notation by absorbing the rescaling constants of P into in the second equation above.We apply this framework to the network in Fig. 1, with the activating and repressive relationships in the model encoded through More precisely, protein levels , and serve as the input in the network, while , and serve as the (noiseless) output. From Eq. , protein level can be obtained from through discrete approximation:As a result, with Tll and Hkb upstream of fkh, can be derived from Eq. , and Tll and Hkb protein profiles can be obtained from Eq. . This subsequently allows us to obtain from Eq. .We fit the proposed model to two intensity components and only ( Instantaneous Rate of Transcription). The expressions for the three input proteins fully specify the model, and the output can be fit to the observations with another parameter allowing for residual noise: .Uninformative, uniform prior distributions were chosen for every parameter. Protein degradation rates d were constrained to be bounded by 0.0001 and 0.5; simultaneously, and a were chosen to be bounded by 0.01 and 50 to correspond to the possible range of protein concentration values (). We utilized the PyMC3 package for Bayesian modeling (69) and ran three chains of the No-U-Turn sampling. We allowed for 2,000 tuning steps, with 5,000 steps used for subsequent sampling. There were no divergences identified during sampling, with the diagnostic statistics close to one and the number of effective samples statistics >25%.
Authors: Carlos Castro-González; Miguel A Luengo-Oroz; Louise Duloquin; Thierry Savy; Barbara Rizzi; Sophie Desnoulez; René Doursat; Yannick L Kergosien; María J Ledesma-Carbayo; Paul Bourgine; Nadine Peyriéras; Andrés Santos Journal: PLoS Comput Biol Date: 2014-06-19 Impact factor: 4.475