Literature DB >> 31800624

Feature identification in time-indexed model output.

Abstract

We present a method for identifying features (time periods of interest) in data sets consisting of time-indexed model output. The method is used as a diagnostic to quickly focus the attention on a subset of the data before further analysis methods are applied. Mathematically, the infinity norm errors of empirical orthogonal function (EOF) reconstructions are calculated for each time output. The result is an EOF reconstruction error map which clearly identifies features as changes in the error structure over time. The ubiquity of EOF-type methods in a wide range of disciplines reduces barriers to comprehension and implementation of the method. We apply the error map method to three different Computational Fluid Dynamics (CFD) data sets as examples: the development of a spontaneous instability in a large amplitude internal solitary wave, an internal wave interacting with a density profile change, and the collision of two waves of different vertical mode. In all cases the EOF error map method identifies relevant features which are worthy of further study.

Entities: Chemical Disease Gene Species

Year: 2019 PMID： 31800624 PMCID： PMC6892550 DOI： 10.1371/journal.pone.0225439

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

1 Introduction

We present a data-centric diagnostic for identifying time subsets of model output which are worthy of further study. To minimize the cost of uptake and maximize the clarity of the presentation we have built this diagnostic on Empirical Orthogonal Functions (EOFs), which are used in an enormous variety of contexts (e.g. [1], [2], [3], [4], [5], etc.) and have implementations in every commonly used software toolbox (e.g. Matlab, R, Scipy). The method presented here can be applied to any data set for which an EOF analysis would be appropriate. However, we will focus on the application to CFD data sets. The method is data driven, using a novel construction: a map of the EOF reconstruction errors as a function of time and the number of modes in the reconstruction. The interpretation of this EOF error map yields the identification of interesting times in each field in the data set for the cost of one Singular Value Decomposition (SVD) and one norm calculation per time output and choice of reconstruction. The mathematical ideas behind EOFs have a long history, originating with [6], and go by many names, including Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Principal Orthogonal Decomposition (POD), depending on the community. These methods produce an orthogonal basis for the state space of a data set, where the basis vectors (EOFs) are rank-ordered by the amount of variance of the data they capture, as recorded in the eigenvalue for each basis vector. In particular, when the data has units of velocity, the variance has units of energy, so the basis is rank ordered by energy captured. Following the common parlance, we will use “energy” and “variance” interchangeably. Since the use of all basis vectors fully reconstructs the data, and the basis is rank-ordered by energy content, this representation can then be truncated to provide a reduced order reconstruction of the data. This reconstruction captures the most energy contained in the original data set per basis vector added, on average [7]. Efficient reconstructions of data are often the goal in statistical analysis, where EOF methods are referred to as PCA. For a review from this perspective see [8]. EOF methods are common in the atmospheric science, oceanography, and climate science communities where there has been an attempt to relate individual EOFs either to physical processes or to normal modes of the system being sampled. Such efforts have had some success, for example in the study of the El Niño Southern Oscillation [9], North Atlantic Oscillation [10], and the Arctic Oscillation [11]. The focus on the first, or “leading”, EOF can be viewed as the study of a an EOF reconstruction (heavily) truncated to include only the first mode. As mentioned, some large scale motions have been captured this way, and correspondences have been drawn between physical processes and the leading EOF. However EOFs form an orthogonal set, and thus adding subsequent EOFs to the reconstruction, while simultaneously expecting those additional modes to correspond to physical processes, is to assume that the physical processes or normal modes in question are orthogonal. This is not true in general. Instead, a kind of contamination occurs: [12] applied an EOF analysis to a constructed flow with multiple dominant structures. They found that EOFs roughly corresponding to specific fluid structures were contaminated by components of other structures (their Figures 3 and 6). Several modifications to EOF methods have been developed to produce modes which may have a more direct physical interpretation, but these methods often require a choice to be made, and it is not often clear which choice is correct. We refer the reader to the review by [5] of EOFs and their extensions for a history of these difficulties. In the error map method we simply use the standard EOF, as it is the most widely used. Moreover, we focus on the reconstruction perspective in order to build the EOF error map. This avoids the difficulties of focusing on individual EOFs outlined above. In addition, the construction of the error map includes errors from every truncated reconstruction, so there is no need to consider the problem of choosing a particular EOF to focus on. Because it avoids focusing on either individual EOFs, or individual EOF reconstructions, the EOF error map method is different from every previous EOF-based method. There are, of course, a wide variety of existing data analysis methods for CFD data sets which are not EOF-based, but none of them serve the same function as the EOF error map method presented here. There are local, Eulerian (i.e., measurements at fixed locations) methods to identify vortices based on the decomposition or invariants of the velocity-gradient tensor: the Q-, Δ-, and λ2-criterions for example [13]. There are Lagrangian methods (i.e., based on moving particles) to identify coherent structures (e.g. transport barriers), such as those based on Finite Time Lyapunov Exponents [14], [15], or graph theoretic methods [16], [17], [18]. For a comparison of multiple Lagrangian methods applied to the same benchmark see [19]. There are a host of methods based on the spectral properties of the Koopman operator [20], and its finite dimensional approximation the Dynamic Mode Decomposition [21], which allow identification of structures in fluid flows based on the frequency of the structure’s motion, such as the flapping frequency of a jet [22]. There are many reduced order methods besides EOF, including the related POD and Galerkin projection [23], [7]. For a review see [24]. In fact, there are many more analysis methods available which can be used to study CFD data sets. All of them make an a priori judgement on the field of interest (e.g. gradient of the velocity field, inter-particle separation, etc) and proceed with an analysis on that particular field in the data set. In contrast, the purpose of the EOF error map method is to identify interesting time periods within every field in the model output without an assumption on which variable is the most important. These features, in each field, then become targets for further study using any method appropriate, including those just mentioned. Put another way, the EOF error map method is a diagnostic tool which is applied earlier in the analysis pipeline than the standard methods just discussed. As such it is not a competitor with those methods, but a way to facilitate their intelligent application. This is particularly relevant to large, coupled models in fields such environmental fluid mechanics involving biogeochemistry and climate modeling for which the CFD component is only a small portion of the model. Even sophisticated mathematical tools based exclusively on the fluid mechanics may miss an important event in one of the other components of the model (e.g. an algal bloom in the coupled model of a bay). Thus for large coupled models, we envision our method being applied as part of the model execution, so that every field in the model output would be accompanied by identified features. Only the subsequent analysis would be discipline specific. Error maps also carry a very low overhead. They are constructed directly from model output immediately after the completion of a numerical experiment and the only extra computational burden is the SVD and error map construction: there is no need to take derivatives of fields, it is not necessary to have particle data, there is no necessity to tune parameters in a graph theoretic clustering algorithm, etc. Error maps are used as a diagnostic to quickly identify features which should be investigated further, by whatever method is deemed useful for the particular application. This allows error maps to inform decisions on where higher overhead methods should be applied. In summary, the EOF error map is a low overhead method applied directly to model output as a way of focusing the application of other methods. The remainder of the paper is organized as follows. Section 2 gives a brief background on EOFs (2.1) and discusses truncated EOF reconstructions (2.2) before introducing EOF error maps (2.3). Section 3 applies the method to the three data sets: the development of a spontaneous instability (3.1), an internal solitary-like wave encountering a sharp change in the background density profile (3.2), and the collision of two waves (3.3). The results show that the EOF error map method is able to identify time periods of interest in CFD data sets. Section 4 is a discussion of the main conclusions to be drawn from this work and of possible extensions of the EOF error map method. Section 5 gives a brief summary and concludes the work.

2 Methods

2.1 Empirical orthogonal functions

We briefly review EOFs in order to set notation, and point out those facts that will be needed when introducing the error map method. Suppose the data set has M grid points and N time outputs at times t, j = 1, …, N. This is a sequence of snapshots {x(t1), x(t2), …, x(t)} where each . Centre by the time mean, and make the resulting snapshots columns of a single matrix X. Then the jth column of X is where the angle brackets indicates the time mean. The matrix X is where i indexes the grid points, j indexes the time outputs, and 〈x〉 = 〈x〉. Then X is an M by N matrix whose entries are time mean-centred time series of measurements at the grid points. The standard derivation of EOF is often motivated by diagonalizing the covariance matrix of to obtain the EOF eigenmodes and eigenvalues λ. However we will instead present the SVD derivation, as the SVD method is generally more robust than the covariance matrix diagonalization method [25]. When M ≥ N, as is common in CFD data, applying the SVD to X we obtain [26] Where U and V are orthogonal matrices and Σ = diag(σ1, …, σ). The columns of U, , are the orthonormal spatial EOF basis vectors (modes), where the ith entry u in the column vector u corresponds to the ith grid point of mode k. This basis corresponds to the singular values from Σ with Carrying out the multiplication in Eq 3, we obtain [26] where r = rank(X), and the second equation is the columnwise version of the first with the time output indexed by j. By multiplying both sides of Eq 3 by U we find that so that the projection of the centred data onto the EOF basis yields time-dependent coefficients defined as Therefore the columns of V, , are the unscaled coefficients corresponding to each mode. The jth entry v in the column of v corresponds to the coefficient at time j for mode k. The rank ordering of the singular values (Eq 4) becomes a rank ordering of the scaling of the a. The data can then be written as Note that there are methods of producing EOFs which are dependent on time as well as space (see section 3.2 of [7]). The SVD method produces spatial EOFs and time dependent coefficients, which makes the interpretation of the error maps presented in section 2.3 and 3 completely straightforward. As mentioned this derivation was for the case M ≥ N. In general the number of singular values is min{M, N}, which is N in the cases presented here. There is an analogous decomposition for M < N. The submatrix of zeros in Eq 3 as well as the rank limited sum in Eq 7 both make it clear that at most the first N modes u1, …, u are needed. This leads to the reduced SVD [27], where U consists of only these columns and there is no submatrix of zeros with Σ. We obtained this decomposition using MATLAB’s built in svds command with N modes recovered to avoid the memory constraints of svd (see the accompanying code and the MATLAB documentation for details). As mentioned, the SVD is a more stable algorithm for calculating the modes and eigenvalues of the covariance matrix of X. The connection between the two is that the columns of U (up to sign) are the modes [7], and the nonzero eigenvalues and singular values satisfy [27]. To reduce notational clutter we have left out the scaling factor on X of in this presentation, but in practice it is included for equality with the covariance matrix. See section 15.4 of [25] for details. See [26] for more details on the SVD, and section 3.4.2 of [7] for more on the connection between SVD and the EOF.

2.2 Truncated EOF reconstructions

Eq 7 makes clear that the data can be thought of as a time mean vector signal with layers of corrections provided by the EOFs. This representation recovers the data completely, so that the error in the representation of the data set is at or near machine precision. However, the rank ordering of the singular values (Eq 4) implies that each successive mode added to the sum contributes less variance over time than the previous mode. To make this concrete, project the data at every time onto mode k and sum: where we’ve used the fact that U and V are orthogonal, along with Eqs 5 and 6. We see that the sum over time of the contributions of u is exactly the variance . Note that this equation shows that the contribution λ from u may be large either because of moderate contributions over most of the simulation, or large contributions over a short time, or some combination. The EOFs have been rank ordered by their total contribution to the reconstruction summed over time, but not by their contribution at any given time t. This time information has been summed out. This is related to the rank ordering of the singular values (Eq 4) providing a rank ordering in the scaling, but not a rank ordering of the values a1(t), …, a(t) at any specific time t. If σ are small for some i > D we can write so that truncated EOF reconstructions can be thought of as an energy or variance filter, because the D EOF reconstruction captures of the energy. We consider only these rank-ordered reconstructions of all modes up to and including D, for a total of N reconstruction for a given data set.

2.3 EOF error maps

With the background material clearly stated, we present the following novel construction. We are interested in finding features within model output fields which are worthy of further study. We will employ the SVD reconstructions just outlined to do so. As discussed in the introduction, individual EOFs do not generally relate to individual physical processes. However, every process contributes some amount to the total variance of the model output. Consider the following thought experiment: rank order the (unknown) processes in the dataset by variance contributed. Just as Eq 8 shows that the contribution of an EOF to the reconstruction may be large either as a result of moderate contributions over a long duration or large contributions over shorter durations, so too the rank ordering of processes is the result of some combination of the size and duration of each process. We expect large variance processes to include those with large scales and long duration. We expect small variance processes to include those with short scales and short duration. In between are medium variance processes with large scales and short duration, small scales and long duration, or medium scales and duration. See Fig 1 for examples.

Fig 1

Large, medium, and small variance processes.

Large, medium, and small variance processes.

Examples of large, medium, and small variance processes over time. The upper plot shows a large variance process which has a large scale and long duration, along with a medium variance process with less variance, but equal duration. The bottom plot shows a small variance process with a small scale and short duration, along with a medium variance process with larger scale and duration. We wish to identify time periods of interest. This means short or medium duration, and for the phenomenon to be of interest, probably medium to large scale. This means we are looking for medium variance processes in the data set. However, as discussed, the EOF does not generally find processes individually. Instead, the contamination phenomenon described in [12] implies that as D increases the approximations of multiple processes are simultaneously improved, and the higher the variance of a process, the greater its priority. Every mode added increases the variance represented rather than adding a process, but as variance represented increases more processes are approximated well. By convergence, some D approximates all processes of interest. At the extreme end, if everything is of interest, D = N. Moreover the speed of convergence, as indicated in the scree (a plot of the eigenvalues), shows that higher modes essentially represent “noise” (here the quotations are included to indicate that we do not mean noise in the sense of stochastic processes). This means that some low choice of D will tend to capture the large scales (as in the “elbow test”, see [8]), while different choices of D near N are basically the same because the last modes in the decomposition have very small coefficients. Intermediate choices of D will include those that poorly approximate a variety of medium variance processes. These are exactly the processes we seek, so the error of the reconstructions can be used to find them. In particular, changes in the structure of the error over time serves as an indicator of their presence. To better understand why reconstruction error can be used to find features, consider Fig 2, which reconstructions for several choices of D during the breakdown of the leading wave in the dual pycnocline data set. As D increases it is clear that large variance processes are approximated first, followed by smaller and smaller processes. As expected the EOF reconstruction effects multiple processes simultaneously. A choice of D near 1 corresponds to capturing processes with large variance such as the wave guide. Intermediate choices for D capture the large variance structures and some, but not all of the medium variance structures. Short to medium duration processes of interest such as the breakdown of the leading wave are poorly approximated for some intermediate D values, but as D increases the breakdown is also well approximated. Finally, a choice of D near N corresponds to an approximation which misses only noise.

Fig 2

Error of reconstructions.

Error of reconstructions.

Continually increasing choices of D at time output 80 in the density field (first 3 choices are the obvious elbow test choices). This time was chosen to look at the breakdown of the wave, which is a medium variance event with a variety of scales of structures. The top panel is the data, while reconstruction and reconstruction error are in pairs below it for comparison, with D = 1, 4, 6, 25, 50 increasing downward. As D increases the wave guide is approximated first, followed by lower variance structures like the breakdown, and finally the fine details of of the breakdown. By D = 25 the large variance wave guide is well approximated, but more modes are required to capture the fine details of the breakdown. By D = 85 (not shown) there is almost no error anywhere. While Fig 2 shows multiple choices for D at a single time, Fig 3 gives an example for two different times and two different choices for D, in order to give some sense of the change in error over time for a fixed D value. We see that for a low number of modes the error increases during the medium variance breakdown event. This is because the larger variance background state and propagation processes have taken precedence in the reconstruction. We also see that for a high number of modes the error goes down at the time of the breakdown event. This is because there are so many modes in the reconstruction that medium variance events like the breakdown have been well approximated, and the processes that are left are virtually noise.

Fig 3

Error of reconstructions over time.

Error of reconstructions over time.

Two examples of changes in error of reconstructions over time: the left panel is at time 20 and the right is at time 80. Similar to Fig 2, top panels are the data, while in pairs underneath we have D = 25, 85 reconstructions and reconstruction errors. For a 25 mode reconstruction the infinity norm error is greater during the shedding (right) than at time 20 (left). This is because for this lower number of modes, the medium variance shedding event has not yet been fully captured. For an 85 mode reconstruction the reverse is true: the error is higher during the early time. This is because for this higher number of modes, the medium variance event has been almost fully captured, and now the very small variance structures in the early times are left (note the change in error scale between the 25 mode and 85 mode reconstructions). Together, Figs 2 and 3 show that the medium variance processes of interest are poorly approximated for some intermediate values of D. Since these are the processes we are interested in, we can look at the error of the reconstructions to identify when they occur. When error is high for a short time, it can indicate the presence of dynamics worthy of further study. Rather than attempt to determine a single intermediate choice for D which will help identify times of interest, we simply calculate the error of the reconstruction for every choice of D, and for all times. In order to collapse the error information to a more manageable and interpretable size, we use a norm of the time slice error, rather than a full error plot like those in Figs 2 and 3. Moreover if we use the L2 norm at every time slice the error’s distribution is unknown, and may be spread thin over the whole domain or concentrated in some way. To avoid this ambiguity we use the infinity norm to make interpretation more straightforward. The error map ϵ(t) of an EOF reconstruction with D modes at time t is given by for each t. This is simply the infinity norm of the modes excluded from a reconstruction with D modes at every time step. By construction ϵ(t) is a function of both time and the number of modes used in the reconstruction D. We call this function the error map for the EOF reconstructions of the data set, or simply “the error map.” The number of modes produced by an EOF analysis is min{M,N}. The error map is therefore of size min{M,N} × N. In the case of CFD data sets M > N, and so the error map has size N × N. In practice, forming the error map is computationally inexpensive, as N tends to be small. The computations are simply an SVD decomposition, and one norm calculation for every time output and for every choice of D. In many contexts it is standard practice to perform an EOF analysis anyway, in which case the EOF error map is easily derived from the existing reconstructions.

3 Results

Although the method developed in this manuscript may be applied to any time-indexed model output for which an EOF analysis would be appropriate, we will consider concrete examples from three qualitatively different simulations in stratified flow dynamics. It is not necessary that the reader have training in fluid dynamics to understand the method presented, but we provide background for each of the data sets for those who are interested. In order to keep a consistent focus, and because the varying density is the essential component of stratified flows, we will focus on the dynamics of density. As discussed in the introduction, in practice the error map method would be used to identify features in all variables within the data set. For expository purposes, we have elected to present our method on one variable in multiple flows, rather than on multiple variables in one flow. All three data sets are simulated using a spectral collocation method (SPINS [28]). Grid doubling/halving experiments were performed to ensure that the numerical results were robust. The details of the physics of the dual pycnocline and collision cases will be discussed in future publications, while the details of the spontaneous instability case may be found in [29]. As the focus here is on the data analysis method presented, all data sets throughout this manuscript are presented in terms of grid points, time output number, and numeric field values. All MATLAB codes for production of these figures, along with all data sets, are included in the supplementary information. For reference, the normalized scree of the first thirty modes for all three data sets are plotted in Fig 4. Note that these three scree are plotted together, but that the total number of modes differs by case. The spontaneous instability data set has 131 total modes, the dual pycnocline data set has 100, and the collision data set has has 150. The fast convergence of the eigenvalues is clear in each case. Clearly the spontaneous instability has the most variance in the first few modes, while the dual pycnocline and collision cases have more variance in higher modes.

Fig 4

Scree plots.

Each scree is a plot of the normalized eigenvalues as a function of mode k = 1, …, 30, the k being the mode index from Eq 7. The sum in the normalization is over all eigenvalues of the given dataset. See text for details.

Scree plots.

3.1 Spontaneous instability

The first data set is the spontaneous shear instability of a very large amplitude internal solitary wave, studied in detail in [29], following previous related work [30], [31]. Here the flow is initialized from a solution to the Dubreil–Jacotin–Long (DJL) equation, which is formally equivalent to the stratified Euler equations. The initial wave develops a spontaneous instability at the rear of the wave. The instability grows and eventually exits the wave. Detailed discussion, including the effects of three-dimensionalization can be found in [29]. See the top four panels of Fig 5 for a visual representation of the density field’s evolution in this case. The internal solitary wave serves as a “base” flow with the spontaneous shear instability playing the part of a temporary perturbation. This data set is thus close to classical hydrodynamic instability theory, for which a base flow and a perturbation are specified analytically, but still requires a full integration of the stratified Navier-Stokes equations for a full description since a purely analytical treatment is not possible in this case. In what follows this case will be referred to as the “spontaneous instability” case.

Fig 5

Spontaneous instability error map.

Spontaneous instability error map.

A spontaneous shear instability forms and evolves, with time increasing from the top to the bottom of the first four panels. The bottom panel is the error map with time increasing left to right, and vertical axis of increasing D, with pairs of vertical green lines indicating the times of the upper panels as time increases from left to right. See text for details. The bottom panel of Fig 5 shows the results of applying the error map method to the spontaneous instability data set. For times less than t = 50 there is very little error due to the stable background profile’s large variance. This means even a reconstruction with D = 1 has small error over this time period. This is consistent with the large first eigenvalue (Fig 4). As the instability develops, we see error in the reconstructions for small to intermediate values of D. This is due to the instability’s low variance (and therefore priority) relative to the background profile, as discussed in section 2.3. This error continues to the end of the simulation as the instability evolves. The error map clearly indicates the presence of the instability as a time period of interest in the data set, as indicated in the obvious change in the structure of the error over time.

3.2 Dual pycnocline

The second data set we examine is a simulation of an internal wave train in a spatially varying wave guide, generated by what experimentalists refer to as a lock release: fluid of a set density is suddenly released from behind a barrier and is allowed to freely form waves in the stratified tank. The particular situation is set up so that a wave train of internal solitary waves with a trapped core forms, propagates some distance and then encounters a sharp change in the background density (a pycnocline). This change removes the near bottom stratification, while the main wave guide remains unchanged. To the best of our knowledge, there is no a priori theory for the wave evolution in this cases and we find that the change in the near bottom wave guide leads to the destruction of the trapped core in the leading wave. This in turn leads to a significant increase in short length scale activity and a loss of material from the leading wave, and a significant perturbation to the second wave in the wave train. Unlike the spontaneous instability data set, in this case there is no readily apparent way to define a “base” flow in this case since even prior to the collapse of the core, the disappearance of the near boundary wave guide implies a core cannot persist [32]. See the top four panels of Fig 6 for a visual representation of the density field’s evolution in this case. The dynamics are considerably more complex than the spontaneous instability dataset, and there is no obvious tie in with classical stability theory. This case thus acts as a stress test for our analysis method. In what follows this case will be referred to as the “dual pycnocline” case.

Fig 6

Dual pycnocline error map.

Dual pycnocline error map.

An internal wave train propagates from left to right and encounters a sharp change in the background density profile. The bottom panel is the error map with time increasing left to right, and vertical axis of increasing D, with pairs of vertical green lines indicating the times of the upper panels as time increases from left to right. See text for details. The bottom panel of Fig 6 shows the results of applying the error map method to the dual pycnocline data set. The clearest error structure is during the shedding event of the leading wave beginning around t = 65, up until the leading wave leaves the domain around t = 90. The change in structure of the error map with increasing D during this time period corresponds to the rank ordering of processes by variance illustrated in Fig 2 at t = 80. Once again the error map clearly indicates a time period of interest through the changes in the structure of the error over time. The observant reader may have noticed the persistent error for low values of D in Fig 6 which was not present in Fig 5. The EOF modes are functions of space but not time, so propagating structures require multiple modes. This is analogous to the way a sequence of hand drawn stills can be used to create an animation, despite each picture being a functions of space only. The propagation of the basic internal waves/gravity current structure is an example of a medium scale process that lasts the duration of the simulation, requiring a minimum amount of modes to even roughly approximate. This is consistent with the scree in Fig 4, which shows that more variance is found in higher modes than in the spontaneous instability case. As a result there is persistent error for low choices of D even before the wave train encounters the density change around t = 35. This is in sharp contrast to the spontaneous instability case there was almost no propagation of the steady background state, and so even a one mode reconstruction had low error. Similarly, the slight increase in error from t = 40 to t = 65 is due to the instability in the lead wave induced by interaction with the density change. There are more small scale processes present during this time, requiring more modes to approximate those processes well.

3.3 Collision

The third data set we examine involves the repeated collision of mode-1 (i.e. all lines of constant density rise and fall together) and mode-2 (i.e. lines of constant density above a given height rise, while those below fall, forming a lump-like wave) internal solitary waves in a two pycnocline stratification. This simulation is constructed based on the observations in [33] that suggest mode-mode collisions can irreversibly deform the higher mode. By choosing a double pycnocline we ensure that the interaction takes place without significant instability and three-dimensionalization. This allows us to confirm that our analysis method is capable of capturing nonlinear phenomena loosely linked to the concept of solitons, as opposed to turbulent transition. See the top four panels of Fig 7 for a visual representation of the density field’s evolution in this case. The dynamics are complex, but compared to the spontaneous instability and dual pycnocline cases, there are no instances of short scale instabilities, and no turbulence develops. In fact, the complex pattern of constructive and destructive interference between the waves would make an analysis method based on kinetic energy or vorticity very difficult to interpret. This case thus acts as a different test for our analysis method, since the nonlinear effects in this case involve soliton–like behaviour that becomes evident during collisions (both wave–wave and wave–wall). In what follows this case will be referred to as the “collision” case.

Fig 7

Collision error map.

Collision error map.

The repeated collision of a mode-1 wave with a mode-2 wave. Initially (top panel), the mode-2 wave propagates slowly from left to right, and the mode-1 wave propagates quickly from right to left. At t = 55 the mode-1 reflects from the left wall, as the mode-2 continues propagation to the right. At t = 75 the mode-1 wave has almost overtaken the mode-2 wave as both propagate to the right. At t = 93 the two waves nearly coincide. The bottom panel is the error map with time increasing left to right, and vertical axis of increasing D, with pairs of vertical green lines indicating the times of the upper panels as time increases from left to right. See text for details. The bottom panel of Fig 7 shows the results of applying the error map method to the collision data set. The waves are initialized so that the mode-2 wave is travelling rightward and the mode-1 wave is travelling leftward. As discussed for the dual pycnocline case, multiple modes are required for propagation, but in this case there is propagation of two different waves at two different speeds. This double propagation requires many modes, and again Fig 4 shows the variance in higher modes. The smaller error anomalies correspond to reflections from the boundary: the mode-1 wave at t = 55 and t = 111, and the mode-2 wave at t = 141. The large error anomaly from t = 60 to t = 100 corresponds to the overtaking of the mode one wave by the mode two wave. The clear error structure around t = 90 to t = 95 corresponds to the superposition of the two waves. As in the other two cases, we again see that the error map clearly indicates features in the data set.

4 Discussion

The EOF error map identified time periods of interest in each of the three cases presented in section 3. The method was successful even though only one of the three data sets had a classical “background–perturbation” split. And while the collision data set featured a complex patterns of constructive and destructive interference, making the kinetic energy and vorticity evolution very difficult to interpret, the error map method was still successful. Note that these two dimensional data sets were chosen so that the error map could be easily visualized alongside time outputs for expository purposes. The error map method still identifies features even if the data set is so large that it is otherwise difficult to visualize. Moreover, because the error map method collapses all non-time dimensions for a given reconstruction and time output, the method can be applied to any time-indexed model output, provided an EOF decomposition is appropriate and computationally feasible. For very large data sets, there are alternatives to reduce the computational burden. In particular it is clear that in many cases the full error map is unnecessary. For completeness we included reconstruction of all possible D values in the Figures of section 3. However the error structures would have been clear with fewer modes than the maximum. In particular, for half as many modes as the maximum we could have drawn all of the same conclusions. This is unsurprising given the convergence of the eigenvalues in all cases (Fig 4). Of course, given the steady increase in computational power, some data sets will be too large to fit into memory. However even here, a rapidly developing literature offers a way to compute the error map, albeit with an added burden of increased computational time [34], [35]. In the examples given here, error maps were calculated only for model output of consistent physical units. Our code [28] outputs multiple physical fields, and we chose to focus on only the density fields. As a result the EOF was carried out on a physical field with only one type of physical unit. Care must be taken if the model output includes data with different units. While multiple data types may be included together in an EOF reconstruction, the non-uniform units cause differing weights of importance on the different data types. Scalings may be chosen to attempt to correct this, but the more types of units in a data set, the more relative scalings must be considered. Moreover these scalings can have a profound effect on the resulting EOF reconstructions. All of this is a general principle when carrying out an EOF analysis. In particular, for the error map method, the relative scalings effect the reconstructions, and therefore the error maps as well. This scaling problem is most easily solved by avoiding it altogether: simply carry out a separate EOF analysis on each data type in the model output, as was done here. The error map method has several possible extensions. For example, reconstructions from using one of the many modifications of EOF (see [5]) could be employed or a different norm chosen to measure the error. Although the focus here was on time-indexed model output, a spatial dimension could also be used as the index. In that case the method identifies spatial extents of interest, and the error map would be a function of the spatial dimension and D. In general, any dimension of a data set may be used as an index for the method, provided continuous subsets of that dimension have a useful interpretation. Such extensions are possibilities for future work. The error map method also serves as a replacement for rough heuristics such as “the elbow test” [8] for deciding how many modes to keep. Modes with low energy, which may easily be removed by a standard elbow test, may still represent important dynamics [24]. In particular unstable modes start small but grow to be very important to the dynamics. In order to avoid missing dynamically relevant modes, simply pick a value of D large enough to avoid significant error structures in the map. This corresponds to picking the lowest row in the error map which has no significant error at any time.

5 Conclusion

EOF error maps identify time periods of interest in time-indexed model output which are worthy of further study. The method is easily implemented and computationally inexpensive. EOF error maps are appropriate for any data set for which an EOF analysis is appropriate. In the case of CFD data sets, this typically means many domain snapshots of a single physical field. We also recently published the γ method [36], which was designed primarily to find features in data sets consisting of time series sampling multiple physical fields. Together these two methods allow the quick identification of interesting features in a wide variety of data sets.

MATLAB code for Fig 4.

(M) Click here for additional data file. 10 Sep 2019 PONE-D-19-21139 Feature identification in time-indexed model output PLOS ONE Dear Mr. Shaw, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. ============================== Please address reviewer comments. In particular, focus on improved discussion of the methods, interpretability, and performance of the techniques to show improvement over the comparison discussions. Additionally, great detail should be given to the manuscript formatting, grammar, and ordering. ============================== We would appreciate receiving your revised manuscript by Oct 25 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Bobak Mortazavi Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Thank you for stating the following in the Acknowledgments Section of your manuscript: 'This work was funded by a Natural Sciences and Engineering Research Council of 358 Canada grant (RGPIN-311844-37157) and a PGS-D. Thanks to Andrew Grace for 359 assistance in editing.' We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 'The author(s) received no specific funding for this work.' 3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. Additional Editor Comments (if provided): [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Partly Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: No Reviewer #3: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: No ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This paper describes a method to identify time points of interest in time series data via the reconstruction error of EOF. The method is sound and sufficient details are provided. The manuscript can be improved by the following points: - The numbering of the sections throughout the manuscript is off. Please check to make sure they reflect the actual numbering. e.g. last paragraph of section 1, all numbers are off. - The structure of the manuscript can be reordered. The introduction of the datasets prior to the methods can be distracting since the focus of the manuscript is on the method. Also, the content of EOF can be reduced as this is not the innovation proposed by the authors. The authors should mention clearly what the contributions are compared to the existing literature. - The results of this proposed method should be compared with some baseline methods in the literature. They should be applied to the same datasets and metrics of performance evaluation and comparison should included. Reviewer #2: This paper proposes a method for identifying time periods of interest in time-series data by using empirical orthogonal functions (EOF) or principal component analysis. Leveraging PCA to determine important modes in the specific application of fluid dynamics may be new but this reviewer is concerned with the novelty in the proposed method. PCA have been extensively used to determine important components of different signals. However, the current paper does not provide any comparison with the state-of-the-art algorithms in this area. Moreover, there are a few parameters such as D in this method that needs configuration for which the authors are not providing a systematic solution; therefore, it is not clear in the paper how others can use this method to find important modes/components within a time-series signal. These two concerns makes the paper inappropriate to being published in the current form; Below are the detailed comments that need to be addressed: 1- The authors need to provide comparison with the state-of-the-art. 2- Setting of parameter D should be clearly explained. 3- The authors indicated that the proposed method "can minimize the cost of uptake and maximize the clarity of the presentation"; however, there is no results to support these claims. Please provide additional results/discussion about this. 4- The authors provided a great deal of discussion in the introduction section about inability of the current methods regarding providing interpretable representations that carry physical meanings. However, there is no sign of interpretability in the proposed method. It seems that the proposed method does not explain physical phenomena either. 5- Most of the figures do not have axis labels and physical units. 6- Most of the formulation provided in section 3 are the typical equations used in PCA/EOF; the authors need to clarify which parts are novel, where are the gaps that they are addressing, and where and how they have improved the current formulation. 7- In the text figure 7 has referred before figure 5. Fix the ordering. 8- For the results section show how the "time period of interest" is detected with the proposed method. 9- Provide a few examples about the application of determining "time period of interest", especially, in applications other than fluid dynamics. Reviewer #3: In this study, the authors gave an analysis on Feature identification in time-indexed model output. The introduction provides a good, generalized background of the topic. The following points should be clarified: 1- The motivations for this study need to be made clearer. 2- However, to make the motivation clearer and to differentiate the paper some more from other applied papers, the author may wish to provide examples of some of the applications. 3-The authors must be explained how their method is novel and on what basis (justifications). 4- The authors did not mention the importance of Feature identification in time-indexed. 5- Also the authors are advised to update the manuscript before final submission as it contains some typos and grammatical errors. 6- The authors claim that their method is easily implemented and computationally inexpensive. It should be explained in detail for the interest of reader make it more interesting. My recommendation is a Major revision. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step. 10 Oct 2019 Please see the accompanying response to reviewers pdf. Submitted filename: Response to Reviewers.pdf Click here for additional data file. 6 Nov 2019 Feature identification in time-indexed model output PONE-D-19-21139R1 Dear Dr. Shaw, We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements. Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication. Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. With kind regards, Bobak Mortazavi Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #2: All comments have been addressed Reviewer #3: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: (No Response) Reviewer #2: Yes Reviewer #3: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: (No Response) Reviewer #2: Yes Reviewer #3: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: (No Response) Reviewer #2: Yes Reviewer #3: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: (No Response) Reviewer #2: Yes Reviewer #3: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: (No Response) Reviewer #2: Thanks for fully addressing all the comments. The only minor point is to add the axis titles to all the figures. That is true that the authors described this in the paper, but the figures should always stand alone and the reader should not read all the paper to understand them. Reviewer #3: The authors studied on the Feature identification in time-indexed model output. All comments have been addressed so paper is accepted. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No 18 Nov 2019 PONE-D-19-21139R1 Feature identification in time-indexed model output Dear Dr. Shaw: I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. For any other questions or concerns, please email plosone@plos.org. Thank you for submitting your work to PLOS ONE. With kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Bobak Mortazavi Academic Editor PLOS ONE

7 in total

1. A rough-and-ready cluster-based approach for extracting finite-time coherent sets from sparse and incomplete trajectory data.

Authors: Gary Froyland; Kathrin Padberg-Gehle
Journal: Chaos Date: 2015-08 Impact factor: 3.642