Literature DB >> 33266967

Informed Weighted Non-Negative Matrix Factorization Using αβ-Divergence Applied to Source Apportionment.

Gilles Delmaire¹, Mahmoud Omidvar¹, Matthieu Puigt¹, Frédéric Ledoux², Abdelhakim Limem¹, Gilles Roussel¹, Dominique Courcot².

Abstract

In this paper, we propose informed weighted non-negative matrix factorization (NMF) methods using an α β -divergence cost function. The available information comes from the exact knowledge/boundedness of some components of the factorization-which are used to structure the NMF parameterization-together with the row sum-to-one property of one matrix factor. In this contribution, we extend our previous work which partly involved some of these aspects to α β -divergence cost functions. We derive new update rules which are extendthe previous ones and take into account the available information. Experiments conducted for several operating conditions on realistic simulated mixtures of particulate matter sources show the relevance of these approaches. Results from a real dataset campaign are also presented and validated with expert knowledge.

Entities: Chemical Disease Gene

Keywords: air pollution; informed NMF; non-negative matrix factorization; robust cost function; source apportionment

Year: 2019 PMID： 33266967 PMCID： PMC7514734 DOI： 10.3390/e21030253

Source DB: PubMed Journal: Entropy (Basel) ISSN： 1099-4300 Impact factor: 2.524

1. Introduction

Source apportionment consists of estimating the particulate matter (PM) sources present in the ambient air together with their relative concentrations. A source is fully characterized by a profile which gathers the m chemical species’ proportions (expressed in ng/μg) that constitute it. Usually, several, say n, PM samples are collected using an automated sampler, then characterized to asses the chemical composition. Each of them can be written as a mixture of p profiles, with different concentrations (expressed in ng/m). Mathematically, if we respectively denote by X, G, and F as the non-negative data matrix, contribution matrix, and profile matrix, the collected data reads While being known under the name of (blind) source separation in the signal/image processing community, Equation (1) is called the receptor model in the chemistry community. In practice, the latter should satisfy the following properties [1]: The entries of G and F are non-negative (one cannot assume a negative mass in G nor a negative proportion of chemical species in F). The product must fit the data matrix X. When one entry of the product does not fit the entry , we should then check i.e., the estimated mass of a chemical species in a sample should not be above the corresponding measured one. As a consequence, estimating the unknown matrices G and F is mainly performed using positive matrix factorization (PMF) [2] and, in particular, using its popular version from the US Environmental Protection Agency. Independently from the PMF investigations done by the chemistry community, Equation (1) has been massively considered by the signal/image processing and the machine learning communities which processed it with non-negative matrix factorization (NMF) techniques [3]. The general idea behind NMF is to minimize a discrepancy measure between X and the estimated product . Such a problem has been extensively studied in the past years. Historically—apart from pioneering work [4]—most methods are based on an alternating optimization of the factor matrices. NMF has been massively investigated because of the more interpretable results it provides when compared with methods without sign constraints. NMF was successfully applied to many fields, e.g., hyperspectral unmixing [5,6], astrophysics [7,8], fluorescence spectroscopy for agro-food analysis [9], audio signals [10], or environmental data processing [11]. It should be noticed that NMF is flexible and can take into consideration additional assumptions to provide a better estimation of the NMF factors. In the literature, assumptions such as sparseness [12,13], fixed row and/or column sums [13,14], structure in the matrix factors [15,16], or orthogonality constraints [17] were investigated. Solving Equation (1) can be performed by appropriately choosing a discrepancy measure between X and . When this measure is the Frobenius norm of their difference, the possible presence of a few outliers may corrupt the NMF enhancement. As a consequence, robust NMF methods were proposed to deal with a predefined number of outliers. While some of them decompose the data matrix into the sum of a low-rank and a sparse matrix—where the latter contains the outlying component [18]—most ones consider some modified cost functions as dissimilarity measures which gave rise to flexible and robust algorithms, e.g., Bregman-NMF [19], -NMF [20], -NMF [21,22], -NMF [23], Correntropy-NMF [24], Huber-NMF [25] (it should be noticed that the Huber cost function has also been considered for robust PMF [26]). Lastly, it should be noticed that in receptor models, each data point is provided with an uncertainty measure and PMF actually solves a weighted optimization problem [4,26]. Weighted extensions of NMF have been also considered, e.g., to enhance the factorization [27] or to deal with missing entries [28,29]. However, it is known than both the PMF [30] and the standard NMF techniques face some convergence issues (however, the convergence of NMF is guaranteed under some separability assumptions [3] which are not satisfied in practice in the considered application and which are thus out of the scope of this paper) [3]. As a consequence, we investigated the enhancement provided by informed NMF. In Ref. [31], the use of a Gaussian plume model enables us to assess the presence or absence of some punctual sources, depending on wind measures, and source and sensor locations which allowed us to fix some entries of G to zero. In the absence of a punctual source, such an information should be dropped. In Ref. [32], an informed NMF-based weighted criterion takes into consideration the known values of some terms of F (Informed NMF has also been proposed in [33] where the known entries are seen as a penalization term in the NMF optimization problem) in order to improve the separation. For that purpose, we introduced a specific parameterization for NMF methods using a Frobenius norm. This approach should be considered as a flexible NMF counterpart of [34] in between blind source separation—where no information on F is provided—and regression, where F is fully known. While it was shown in practice to be less sensitive than blind NMF to convergence issues, this method can still be affected by outliers which are present in many receptor modeling problems. In this paper, we thus extend our previous work [32] by (i) investigating and discussing several -divergence expressions, (ii) exploring different data normalization procedures combined with set values (as profiles are chemical species proportions, the rows of F are normalized), and (iii) adding minimum and maximum bounds to some of the unknown values of F. The methods we propose in this paper have been partially introduced in [35,36], in the framework of the -divergence only. We generalize here [35,36] to the -divergence and we provide a detailed study of their performance, shown on both realistic simulations and real data campaign. The remainder of the paper is structured as follows. We recall some properties of the -divergence in Section 2. Section 3 introduces our proposed NMF parameterization—which puts on light the special structure of the profile matrix in the NMF algorithm—while Section 4 is dedicated to the problem formulation. We introduce our proposed methods in Section 5 that we test in Section 6. Lastly, we conclude about the proposed work in Section 7. Appendix A introduces update rules for an alternative informed -NMF method.

2. Robust Cost Functions

2.1. Introduction to Modified Cost Functions

Chemical data often face some particular measures whose characteristics substantially differ from those which are commonly observed. From a signal processing point of view, such data may be considered as outliers which may degrade the performance of classical algorithms using the Frobenius norm in their cost function. Such an issue is often addressed in the field of robustness where the challenge is to design new algorithms which take into account the above corrupted data. Apart from the low-rank plus sparse decomposition [18], robust NMF algorithms using modified cost functions were investigated. Indeed, these robust functions provide less penalization to large entries of the residual matrix, which is defined as Among them, the Huber cost function accounts for the differentiable connection between the and norms, according to the residual value with respect to an adaptive cutoff parameter. Another popular modified cost function stands for the correntropy measure [24] which accounts for a bounded and non-convex discrepancy measure. In contrast with the above measures, the -divergence is not a norm as it is not symmetrical. Figure 1 shows an example of the behavior of such functions which are penalizing the values of the residual in different ways. As mentioned earlier, the -divergence is the only cost function to present a possible asymmetric behavior around the null residual value. Hopke [1] highlighted the need for methods dedicated to chemical source apportionment which enforce a positive residual value. This situation fits well with the configuration described in Figure 1.

Figure 1

Behavior of several dissimilarity measures with respect to the residual value.

2.2. -Divergence

The -divergence (For special values of , the reader is invited to consult [23]) is a parametric discrepancy measure which may be used to evaluate the gap between two scalar quantities p and q, i.e., , Special values of the parameters lead to very famous divergence measures [23], such as -divergences or -divergences [21]. These divergences are different from classical norms in the sense that they check some common properties—e.g., non-negativity—while others such as symmetry, scalability and triangular inequality are not satisfied. Cichocki et al. [23] study the influence of the parameters and on the robustness of the estimated data (they also establish general connections between the general -divergence and the scaled -order -divergence with an zoom of its arguments). To this aim, they express the sensitivity to outliers by computing the differentiation with respect to an unknown parameter here replaced for simplicity with an entry of F, namely , i.e., where and Considering Equation (6), the expression may be replaced in Equation (5), leading to For the sake of comparison, sensitivity equations of M-estimators [37] are usually designed for the weighted Frobenius cost function (corresponding to and in a -divergence), i.e., where accounts for the general entry of the weight matrix. This weight is usually viewed as a confidence index into the corresponding data. As a consequence, a large residual together with a large weight leads to large modifications in the estimates. In the frame of Equation (8), the weight entry reads Figure 2 describes the -zoom weight as a function of the ratio for different values of . It turns out that provides small weight to large values of the ratio . In other words, this situation does not induce big changes in the estimates. Outliers such as will be allowed in this context.

Figure 2

The -zoom weight.

Equation (8) combines two effects, namely an -zoom and an weight effect. When , the emphasis of the -zoom is put on larger values of the ratio while the emphasis is put on smaller values of this ratio when . These properties are recalled in Table 1. The weight effect in is expressed as a function of in Table 2.

Table 1

Properties of -zoom.

α	0<XijX^ij<1	XijX^ij>1
α>1	small zoom	large zoom
α<1	large zoom	small zoom

Table 2

Weighting effect on the -divergence.

α+β	0<X^ij<1	X^ij>1
α+β<1	large weighting	small weighting
α+β>1	small weighting	large weighting

To summarize, can be used to control the influence of large or small ratios in the estimator through the -zoom, while provides some control on the weighting of the ratios depending on the demand to better fit to larger or smaller values of the model [23]. Gathering these properties, the space of values may be partitioned in several areas as described in Figure 3.

Figure 3

Different areas as a function of and .

Each zone allows a certain kind of outliers. Areas 1 and 2 allow outliers of the form for large and small amplitudes of , respectively. Areas 3 and 4 accept outliers such as for large and small amplitudes of , respectively. Areas 1 and 3 favor a better fit to small values of X while areas 2 and 4 favor a better fit to large ones. As a consequence, for our considered application, we propose to favor a best fit for major species with respect to minor species. This leads to considering the case . Secondly, if the estimation does not fit the data, we prefer keeping situations where holds, as explained in Section 1 and in [1]. This fact results in the choice . These two conditions give rise to an area of interest which is area 2 and which is kept along the article (for convexity reasons in NMF [23], area 2 should be delimited to ).

2.3. Existing NMF Methods with Parametric Divergences

NMF methods are formulated as the global minimization of a cost function under the non-negativity of both factors G and F. Aside from pioneering work [4], NMF is classically performed through an iterative procedure which alternatively minimizes—for a fixed F (respectively G)—a discrepancy between X and . Multiplicative update rules were firstly proposed in [38] for the Frobenius norm and the Kullback–Leibler divergence. While being easy to implement, multiplicative algorithms only ensure that the cost function does not increase within iterations, which is not sufficient for getting a limit point. The study of NMF convergence through the Karush–Kuhn Tucker (KKT) conditions was explored by Lin [39]; stationarity is only a necessary condition of a local minimum. Moreover, some limit points which are not stationary may exist, especially if some components of F and G are initialized to zero. Moreover, most algorithms are sensitive to the initialization and to the presence of outliers. Parametric divergences may reduce the influence of this last drawback by an appropriate choice of the hyperparameters. Cichocki et al. [20] proposed multiplicative update rules with -divergence. The developed rules were based on the majorization-minimization (MM) strategy [40] but they may also be obtained in a heuristic way by using the KKT conditions or partial derivatives of the cost function as well. Févotte and Idier [21] proposed to use the -divergence as a cost function and derived different kinds of rules according to three different strategies involving the heuristic approach, the majorization-minimization strategy [40] and a new one called majorization-equalization. This last strategy provides a larger step size and a faster convergence. Hennequin et al. stated that the -divergence could be viewed as a special case of Bregman divergence [41], thus leading us to apply Bregman divergence theorems to -divergence. Cichocki et al. [23] proposed NMF based on generalized -divergences in the framework of majorization-minimization (MM). Extending the work in [27,42] from the one hand and in [23] from the other hand, we introduced in [35] a weighted -NMF (-WNMF) defined for . It is straightforward to extend it to a Weighted -NMF which amounts to minimizing a weighted -divergence, and yields where and and respectively denote the componentwise product and division between two matrices. W is a weight matrix used to model the uncertainties associated to the data samples , and whose general element is set to . This approach encompasses several other methods, especially -NMF [23] if , i.e., for any i and j, , and -NMF [21] if additionally . Apart from multiplicative updates, NMF based on alternating direction method of multipliers (ADMM) were recently proposed [43] for their ability to perform distributed computations for large scale data and in particular, Sun and Févotte introduced an approach based on the -divergence [22] while Zhu and Honeine [24] proposed a correntropy-based approach for large deviations. Such fast approaches are not required for the considered chemical application where the global computation time is not an issue.

3. Constraint Parameterization

In this paper, we assume the values of some components of the profile matrix F to be provided or bounded by experts. We thus propose a formalism which takes into account this knowledge. It extends our previous parameterization [32] which only considered equality constraints. Let and be two binary matrices which inform the presence/absence of equality and inequality constraints on each element of the matrix F, respectively, i.e., We then define the binary matrices and as and , where is the matrix of ones. By construction, we obtain We denote by the sparse matrix of set values, i.e., Please note that —the -th element of —is equal to zero when . We can easily prove that Similarly, we define and the sparse matrices of upper and lower bounds (equality constraints could be considered as inequalities, with the same upper and lower bounds. However, in some preliminary tests, we found our proposed approaches to outperform those using bound constraints only), respectively, i.e., Let and be the i-th column of F and , respectively. A column may be expressed as where and are respectively the vector of free parameters and the orthonormal basis of free parameters [32]. From Equation (20), we define as and as the matrix gathering each column , i.e., Following the stages in [32]—which combine Equations (17), (18) and (22)—we obtain the matrix form of Equation (20): This expression of F puts on light its specific structure, as F is expressed as the sum of its set and free parts. Moreover, combining Equations (16) and (23) leads to which shows that the free part of F may be decomposed as a bounded part and an unconstrained one.

4. General Problem Formulation

The proposed informed NMF methods consist of estimating the matrices G and F in order to get an approximate factorization (1) under the above constraints, i.e., where the weighted divergence is defined in Equation (11). The first constraint ensures that some predefined components of F are set while the second one forces the selected components to be bound-constrained. The last condition enforces each row of F to be normalized, i.e., , (Please note that the normalization met in remote sensing [44]—where the sum of each row of F is equal to one—is not similar, except in a noiseless case in the framework of exact factorization. Moreover, the normalization also differs from the one met in mobile sensor calibration [13] as the normalization is approximately satisfied in the latter). The main challenge in the Equation (25) consists of finding solutions which are satisfying all the above constraints. The first constraint leads to consider the parametrization (23) that we used in [32]. By substituting the parametrization (23), Equation (25) becomes a constrained NMF with respect to G and , i.e., The last condition is derived from the last one in Equation (25) combined with Equation (22). In the case of bound constraints only, no dedicated parameterization exists, but projective methods have been developed [45]. The row sum-to-one constraint has been taken under account by using a special parameterization in [14]. However, dealing with all the constraints together at the same time is a difficult task. We thus propose a less elegant, yet efficient strategy which consists of considering them sequentially. By dropping the bound constraint, we obtain the following reduced problem: As an alternative to the above problem, please note that by combining Equations (1) and (23), we obtain We can thus derive a slightly different problem, i.e., which yieldsslightly different update rules. We proposed in [35] some multiplicative update rules to solve Equation (29) in the case of -divergence only. The extension to the -divergence is derived in Appendix A. As explained above, instead of looking for the solution of Equation (26) directly, we sequentially consider each additional set of information, i.e., we first estimate and F that we then normalize and project onto the set of admissible solutions (or that we project and then normalize, respectively) within iterations.

5. Proposed Informed -NMF Methods

5.1. Weighted -NMF with Set Constraints

In this section, we firstly aim to solve Equation (27) without the sum-to-one constraint. The whole strategy follows the majoration-minimization technique [40] and consists of (i) finding a majoring function which is convex with respect to the unknown parameters, and (ii) minimizing this auxiliary function instead of the original one. Update rules for the free part of the profile matrix are where (denoting We consider a column of the data since the divergence may be split into independent partial divergences. Using the notations defined in Section 3, we hereafter drop the index i for the vectors , , , and for the matrix . Let k be the current iteration index and let us define Expression (32) together with Equation (21) provide The weighted -divergence between two corresponding column vectors reads where , Provided that and noticing that is convex for and [23], Jensen’s inequality may be applied twice, i.e., and where the superscript k is the current iteration number and is the j-th element of the free parameter vector introduced in Equation (20). Equation (34) together with expressions (36) and (37) yield the following auxiliary function: Canceling its gradient leads to the optimum, i.e., which reads in its vector form By combining Equation (21) with the above relationship, we derive the expression of one column of the matrix : By replacing U according to Equation (32), and by noticing that , it results in the new update rule: where Similarly to [35], we derive the update rules by writing the matrix form of Equation (43), which completes the proof. □ Appendix A proposes the update rules for the problem (29). These rules are almost similar to those introduced above as they present some differences in the multiplicative mask. We show in Appendix A that the update rules proposed in the main part of this paper extend the ones proposed in Appendix A by iteratively updating the weights. Update rules for G correspondto an unconstrained -WNMF driven by Equation (12) since no information is available on G. Their validity is only guaranteed within the convex domain, i.e., for . Outside this domain, some additional assumptions on the reconstructed data are needed to ensure the local convexity property [23]. As we chose to set and so that they belong to area 2 in Figure 3, the convexity domain reduces the possible area to the intersection between area 2 and the half-plane .

5.2. Normalization Procedures

In the considered application, the rows of the profile matrix are summed to one. This case is different from the one encountered in hyperspectral unmixing [44]—since our constraint cannot be split into independent vectorial subproblems—and in mobile sensor calibration [13] as the sum-constraint is only approximately satisfied in the latter. As a consequence, in our previous work [32,35], we used to normalize the matrices G and F in each iteration, after estimating them. We reformulate these steps below (see Section 5.2.1) while we investigate an alternative normalization procedure in Section 5.2.2. They are introduced in the framework of the above approach but the rules may be applied to our previous work [32,35] as well.

5.2.1. Classical Normalization

Let us define as the normalized profile matrix and the corresponding scaled contribution matrix. In order to hold the sum-to-one property, Lantéri et al. [14] proposed a change of variables under the form (please note that the normalization constraint can also be solved as a penalization term in the NMF problem formulation [13]. This setting is interesting when the sum constraint is approximately satisfied, which is not the case for the considered application). which may be written under matrix form as This equation enables to normalize the rows of F whereas the symmetric version enables to scale the columns of G correspondingly, i.e., The product then reads which results in the expression of its general entry: This means that the matrix product is maintained throughout the normalization process. Since the cost function to minimize only depends on this product, this property ensures the same decrease as in the unconstrained case within iterations. The normalized expression of F—denoted —at iteration then reads where and where stands for the free part of the normalized matrix defined by . Noticing that we express Equation (49) with respect to : Similarly, we derive the update rules for , i.e., where —defined in Equation (14)—is computed with the unnormalized matrix which reads Equations (52) and (53) thus provide the update rules for our first normalized and constrained WNMF method denoted -N-constrained and weighted NMF (CWNMF) below. Although the set profiles are lost within iterations due to the normalization process, we noticed in preliminary tests that they were recovered asymptotically.

5.2.2. Alternative Normalization

As an alternative, we now propose a second normalization which keeps the set constraints on F within iterations. Starting with Equation (30) that we normalize, it turns out that where accounts for the matrix involving the sum of the free components for each source, and the other part of the expression represents the different proportions within the free profiles. Using the property (51), alternative update rules may be derived This normalization keeps the constraints verified within iterations but may move along directions different from the steepest descent direction. During this process, the contribution matrix does not require a scale factor as in the first method since the scale factor is only applied to the free parameters of F. We then estimate using the unconstrained rules defined in Equations (12) and (14). The update rules (12) and (56) are associated with our second normalized and constrained WNMF method denoted -N-CWNMF below.

5.2.3. Description of Algorithm Acronyms

We proposed above some update rules for two methods for normalized and constrained WNMF. However, we also proposed different update rules in Appendix A for which the above normalizations can be applied. As these methods minimize the divergence between and the Residual , we add a “-R” to their acronym. Table 3 outlines the necessary information for each method. The pseudo code for -N-CWNMF(-R) methods is shown in Algorithm 1.

Table 3

Our different non-negative matrix factorization (NMF) methods with normalization.

Acronym	F	G	Mask on F	Mask on G
αβ-N1-CWNMF-R	Equation (52)	Equation (53)	MFα,β(G˜k,F˜k)	NGα,β(G˜k,Fk+1)
αβ-N1-CWNMF	Equation (52)	Equation (53)	NFα,β(G˜k,F˜k)	NGα,β(G˜k,Fk+1)
αβ-N2-CWNMF-R	Equation (56)	Equation (12)	MFα,β(Gk,F˜k)	NGα,β(Gk,F˜k+1)
αβ-N2-CWNMF	Equation (56)	Equation (12)	NFα,β(Gk,F˜k)	NGα,β(Gk,F˜k+1)

5.3. Bound-Constrained Normalized and Weighted -NMF

We now focus on problem (25) which involves several kinds of constraints which should coexist simultaneously. To our knowledge, only Lin [45] deals with bound constraints and proposes to adapt the stepsize of projected gradient techniques in order to both decrease the cost function while holding the constraints. However, the work was devoted to bound constraints only, and his solution does not suit our problem with normalization. As explained above, we propose to tackle them by applying a projection onto the admissible domain. Bound constraints act as safety barriers which prevent unrealistic solutions. However, the combination of normalization and projection should be applied in a predefined order. We thus propose below two structures: the bound constraint projection followed by a normalization stage, or the normalization followed by the projection.

5.3.1. Informed NMF with Bound Constraints and Normalization

In this subsection, we consider update rules for N-CWNMF methods. The same kind of procedure should be done for N-CWNMF approaches proposed above. We assume that we get at iteration k a normalized matrix and an unscaled (indeed, no scaling is applied on G in N-CWNMF, as explained in Section 5.2.2) contribution matrix . Combining Equations (24) and (54) provide which may be simplified by using Equation (51), i.e., Applying the bound constraint then consists of where is the projection operator defined by The second normalization proposed in Section 5.2.2 consists of scaling the free part without changing the set components, which reads This rule keeps the sum-to-one constraint and the set values. The bound constraints may be lost within because of the normalization but were found to be asymptotically recovered in our tests. The associated updates for G follows the unconstrained ones and it has not to be corrected by a scale factor, i.e., where has been introduced in Equation (14). The rules (61) and (62)—associated to our informed NMF approach named -BN-CWNMF—do not ensure the cancellation of the gradient of Equation (38) along iterations but they preserve two set of constraints among the three ones. Let us recall that the approach using the first proposed normalization—denoted -BN-CWNMF—can be derived with the same strategy. The pseudo code for -BN-CWNMF method is shown in Algorithm 2.

5.3.2. Informed NMF with Normalization and Bound Constraints

The same procedure as above should be applied in the reverse order so that bound projection is applied as the last step of an iteration. When applied to Equation (58), the second normalization provides The projection stage then leads to the unnormalized profile Equations (62) and (64) account for the update rules in this last method, denoted as -NB-CWNMF. The pseudo code for NB-CWNMF method is shown in Algorithm 3. Please notice that only set and bound constraints are checked within iterations. Convergence towards a limit point ensures that limit matrices keep all the desired properties. As explained above, the same procedure with our first considered normalization may be applied, thus yielding an approach named -NB-CWNMF.

6. Experimental Results

In this section, the enhancement provided by our methods are investigated in both simulations and a real data campaign. In these tests, we aim to identify the sources (by their chemical profile) contributing to the total atmospheric suspended PM as well as to quantify their contribution. In both the simulations and the real dataset, we consider atmospheric particles with diameter equal to or below 10 μm (PM). In practice, these particles are trapped in a filter which is changed every 24 h. Each filter is then analyzed by chemists who derive the masses of several chemical species of interest for the considered application, i.e., for evaluating the impact of marine traffic on air pollution in a port city. Species under study are divided into 16 metal tracers—i.e., , , , , P, , , , V, , , , , , , and —8 water soluble ionic species—i.e., , , , , , , , and —carbon compounds—either organic (OC) or elementary (EC)—levoglucosan, and polyols. In all these experiments, except when we tested the influence of these parameters, we set the values of and to 0.6 and 0.9, respectively. Indeed, such a couple of value lies in the recommended area 2 defined in Figure 3. Moreover, we found in preliminary tests that these values of and provided a better performance. As a consequence, we do not make them vary in the remainder of this section. Moreover, the signal-to-noise ratio (SNR) enabled us to evaluate the data set and is defined as: where and stand for the -th non-noisy data and the individual noise. This index is widely used in the literature [46].

6.1. Realistic Simulations

From the validated profile and contribution matrices obtained during the real campaign [47], simulation data were built by taking into account the individual uncertainty provided by the real campaign. In these simulations, the data matrix X thus consisted of a matrix—which correspond to the chemical composition (28 species) of 278 PM samples—associated with individual uncertainties, which are those provided by the chemical analysis. In addition, we also considered several cases with outliers. It is assumed that outliers come from an additional positive individual contamination. The mathematical model of the outliers was driven by a random vector including the locations of the outliers in the data matrix. For these locations, a multiplicative model was used depending of the trial number i (between 1 and 400). where (resp. X) accounts for the with outliers noiseless data (resp. the without outliers noiseless data). The variable is a parameter which may be tuned in order to get a SNR after outliers ranging from 15 dB to 70 dB. In our tests, the outlier deviation increased with the trial number i. In other words, for low trial number, the multiplicative factor remained close to 1 in order to keep large. The effect of such outliers essentially depends on the location of the outliers. Indeed, if an outlier acts on a large entry of the data matrix, its impact on the SNR will be greater. Then, a noise has to mimic the chemical measurement process. The chemical measurement process only gives a concentration value together with an uncertainty. So, every value within this interval is equally possible. A uniform noise which is designed on a limited support was proposed. This support may be truncated on the left side if the uncertainty is greater than the corresponding data. Among the 278 samples, 10 and 20 outliers were considered. Practically, we noticed that the signal-to-noise ratio (SNR) index then dropped in the worst case by 4 dB if the set of 20 outliers is taken into consideration with respect to the no outlier case.

6.1.1. Source Profiles

In this study, 10 sources are highlighted. Among them, some of them are purely natural or purely anthropogenic but some of them became anthropised. Table 4 describes major species present in each source profile. Other species than those listed in the corresponding source profile may be considered as negligible. Please note that—as we here consider simulations—the real profile matrix is perfectly known and is provided in Table A1. Also, one should notice that each source profile is presented under a per mil notation, i.e., it sums to a thousand instead of 1 and the only difference is a scale factor equal to 1000.

Table 4

Features of the different source profiles.

Profiles	Type	Major Species	References
Sea salts	Natural	Cl−, Na+, SO42−, Mg2+, K+, Ca2+, Sr	[48]
Crustal dust	Natural	Al, Ca2+, Fe, K+, OC, Ti, NO3−, Na+	[49]
Primary biogenic emission	Natural	OC, EC, Polyols, P	[50]
Aged sea salts	Anthropised	NO3−, Na+, SO42−, Mg2+, K+, OC, Ca2+, Sr,Cl−	[50]
Secondary nitrates	Anthropised	NO3−, OC, NH4+, EC, Ca2+, Fe, Zn, Cu	[50]
Secondary sulfates	Anthropised	SO42−, NH4+, OC, Ca2+, K+, Fe, Pb, Zn	[49]
Biomass combustion	Anthropogenic	OC, EC, Levoglucosan, NO3−, K+, Zn	[50]
Road traffic	Anthropogenic	EC, OC, NO3−, Cu, Sb, Zn, Fe	[50]
Sea traffic	Anthropogenic	OC, EC, V, Ni, Co, SO42−, NH4+, NO3−	[50,51]
Rich metal source	Anthropogenic	Fe, Al, Cr, Pb, Zn, Mn	[50]

Table A1

Theoretical source profile used in the simulations.

Profiles	Al	Cr	Fe	Mn	P	Sr	Ti	Zn	V	Ni	Co	Cu	Cd	Sb
Sea	0.0019	0	0	0	2.5 ×10−4	0.2034	0	0	0	0	0	0	0	0
Aged sea	0	7.2351 ×10−5	0	0	0.5	0.4	1.877 ×10−4	0	0	0	1.785 ×10−4	1.7941 ×10−5	0	0
Crustal	119.13	8.589 ×10−5	77.35	1.782	3.0680	0.7846	8.9121	1.868	0.3503	0	0.0276	0.0081	0	0
nitrates	4.00 ×10−3	2 ×10−5	3.5	0.11	0.0749	0	0	0.7742	0	0	7.0408 ×10−4	0.1	6.486 ×10−3	0.01975
sulfate	0	5 ×10−5	0	0.02825	0.05313	0	0	0.1334	0	0	0.003287	8.00 ×10−6	0	0
Biomass	0.001	0	2.554	0.05527	0	1.016 ×10−5	0	0.1415	0	0	0	0	0	0.0385
Road traffic	0	0	39.0414	0.1404	2.659	0	0	10.908	0	0	1.00 ×10−8	2.7712	0	0.8964
Sea traffic	0.001147	1.2012 ×10−4	0.1002	0	0	0.0217	9.42 ×10−5	0	7.4920	5.5348	0.1829	1.752 ×10−4	1.315 ×10−6	0
Biogenic	0	0	0	0	14.528	0.04308	8.941 ×10−4	0	0	0	0	0	0	5.2 ×10−4
Metal	64.430	33.332	780.16	33	0.7	2	0	0	0	10	0.15	1.5	1.55	0
Bis	La	Pb	Na+	NH4+	K+	Mg2+	Ca2+	Cl−	NO3−	SO42−	OC	EC	Levo.	Polyols
Sea	0	0	297.03	0	10.71	32.75	9.183	581.02	0	69.08	0	0	0	0
Aged sea	0	0.1	280	0	4	30	10	1.00 ×102	395	150	30	0	0	0
Crustal	0.0594	0	1.8333 ×10−4	4.36 ×10−5	5	5	301.81	0	49.95	39.96	384.92	0	0	0
nitrates	7.178 ×10−4	0.2075	0	216.26	3.2	0	0	1.21 ×10−5	730.73	0	45	0	0	9.027 ×10−11
sulfate	0	0.0729	0	260.83	4.43	0	0	8.66 ×10−8	0	680.59	53.84	0	0	0
Biomass	0	0.1007	2.650	2.85 ×10−12	12.26	0.001	11.67	25.48	35.16	56.84	692.10	91.14	69.78	1.477 ×10−7
Road traffic	0.0121	3.353	0	5.14 ×10−10	39.84	0	3.00 ×10−8	3.40 ×10−8	50.19	60.22	301.13	488.81	0	0
Sea traffic	0.0941	0	0	0.0626	0	0	0	0	75.17	300.69	500.76	109.87	0	0
Biogenic	0	0	5.023	0.0968	29.056	0	0	0.2975	0	20.094	854.02	0	0	76.83
Metal	0.2215	22.95	0	0	0	0	0	0	0	50.00	0	0	0	0

6.1.2. Equality Constraints

Equality constraints or set values enable to inform the algorithm about some entries of the profile matrix. This knowledge is taken into account by specifying matrices and . These matrices are available in Appendix B. It is to be stressed that the only used knowledge here is the absence of some compounds in some source profiles. As a result, matrix reduced to . Then, it follows that our informed methods with residuals were identical to those without residual. As a consequence, we do not test the latter in the simulations below.

6.1.3. Initialization

An approximate prior knowledge of F was used as a starting point for each informed NMF algorithm. Table A3 gathers the different entries used. Then, a weighted quadratic estimation of the initial contribution matrix G [31] was performed so that each method has the same initial factors.

Table A3

Matrix used in the simulations.

Finit	Al	Cr	Fe	Mn	P	Sr	Ti	Zn	V	Ni	Co	Cu	Cd	Sb
Sea	0.2	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	0.01	0.8	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11
Aged sea	1.00 ×10−11	0.001	1.00 ×10−11	1.00 ×10−11	1	1	0.01	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	0.01	0.01	1.00 ×10−11	1.00 ×10−11
Crustal	200	0.001	150	2	2	2	20	2	2	1.00 ×10−11	0.001	0.0001	1.00 ×10−11	1.00 ×10−11
nitrates	1.00 ×10−5	2.00 ×10−6	8	1	0.4	1.00 ×10−11	1.00 ×10−11	4	1.00 ×10−11	1.00 ×10−11	0.001	0.5	0.01	0.2
sulfate	1.00 ×10−11	1.00 ×10−4	1.00 ×10−11	1.00 ×10−4	0.5	1.00 ×10−11	1.00 ×10−11	0.4	1.00 ×10−11	1.00 ×10−11	0.01	1.00 ×10−4	1.00 ×10−11	1.00 ×10−11
Biomass	5	1.00 ×10−11	10	2	9.43 ×10−11	0.001	1.00 ×10−11	1	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.006 ×10−10
Road traffic	1.00 ×10−11	1.00 ×10−11	50	1	1.00 ×10+0	1.00 ×10−11	1.00 ×10−11	24	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	4	1.00 ×10−11	2
Sea traffic	0.01	1.00 ×10−4	0.4	1.00 ×10−11	1.00 ×10−11	0.1	1.00 ×10−4	1.00 ×10−11	18	10	1	1.00 ×10−3	1.00 ×10−4	1.00 ×10−11
Biogenic	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	5	7.96 ×10−10	7.96 ×10−10	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	7.96 ×10−10
Metal	73	70	650	50	3	5	1.00 ×10−11	1.00 ×10−11	1.00 ×10−11	30	1	3	4	1.00 ×10−11
	La	Pb	Na +	NH 4+	K +	Mg 2+	Ca 2+	Cl −	NO 3−	SO 42−	OC	EC	Levo.	Polyols
Sea	1.00 ×10−11	1 ×10−9	320	5 ×10−5	10	38	11	550	1 ×10−5	70	1 ×10−5	1 ×10−5	9.98 ×10−11	9.98 ×10−11
Aged sea	1.00 ×10−11	0.01	250	1 ×10−8	1	40	15	150	320DA80.eps	210	12	1.00 ×10−11	9.99 ×10−11	9.99 ×10−11
Crustal	0.0001	1 ×10−7	0.0001	0.0001	10	10	250	1.00 ×10−11	30	30	290	1.00 ×10−11	1.00 ×10−10	1.00 ×10−10
nitrates	0.2	0.5	1 ×10−10	300	5	1.00 ×10−11	1.00 ×10−11	0.2	600	1.00 ×10−11	80	1.00 ×10−11	1.00 ×10−10	1.00 ×10−10
sulfate	1.00 ×10−11	0.1	1.00 ×10−8	305	10	1.00 ×10−11	1.00 ×10−11	1.00 ×10−3	1.00 ×10−11	584	100	1.00 ×10−11	1.00 ×10−10	1 ×10−11
Biomass	1.00 ×10−11	1	3	28	72	5	38	66	66	66	510	70	57	9.43 ×10−11
Road traffic	1	9.99	1.00 ×10−10	1.00 ×10−8	57	0.00049	1.00 ×10−6	1.00 ×10−11	79.99	80	260	430	9.99 ×10−11	9.99 ×10−11
Sea traffic	0.5	1.00 ×10−8	1 ×10−11	1.00 ×10−2	1 ×10−11	1 ×10−11	1.00 ×10−11	1 ×10−11	110	250	450	160	8.37 ×10−11	8.37 ×10−11
Biogenic	1.00 ×10−11	7.96 ×10−10	1	1	9	4	1.00 ×10−11	7.96 ×10−10	5	5	800	1 ×10−11	7.96 ×10−10	170
Metal	1	40	1.00 ×10−11	1.00 ×10−2	1.00 ×10−11	1.00 ×10−2	1.00 ×10−11	0.005	0.001	70	0.00164	1.64 ×10−10	1.64 ×10−10	1.64 ×10−10

6.1.4. Performance Evaluation

Several performance indexes are available in the literature. However in this work, only the mixing-error ratio () index [52] is considered (please note that while specifically designed for measuring the estimation accuracy of a mixing matrix, the MER may also be used as a signal-to-inteference ratio (SIR) when applied to the profile matrix, and more specifically to ). It was computed over each column of G. For each source, a scalar quantity for source j expressed in may be obtained. For one exact vector and its estimate , it is possible to write under the form where and are respectively colinear and orthogonal to the exact vector . This decomposition allows to express the of source j, denoted as , defined as, Infinite values mean exact separation while 0 correspond to an angle equal to . These values may be summed up into a vector which gathers the performance of each source. Generally, a global indicator is obtained by averaging each index over all sources, i.e., In all the cases under study, the MER (the results and the Matlab interpretation codes are already available at http://www-lisic.univ-littoral.fr/~delmaire/recherche.html) index [52] was represented as a function of the input SNR. In this study, intensive computations were performed with ten thousand iterations for each method over 400 tests. In our comparison, we dropped the PMF method as it is only available as a user interface (see https://www.epa.gov/air-research/positive-matrix-factorization-model-environmental-data-analyses) which prevents to compute several tests in a single command. Moreover, even for a single test, our expertise shows that PMF requires the uncertainties to be increased in order to perform a computation, but it did not make sense in this case. As a consequence, nine methods were selected and tested: among them, three are uninformed, two account for our informed methods with set values while the four remaining ones are our informed methods with bounds. In order to get an idea, we chose to display the road traffic profile estimation in the case when input is equal to 24 dB (Figure 4). Species were represented in descending order of the real profile. We could notice that for this source, NCWNMF appears better than other methods.

Figure 4

Estimation of the road traffic profile.

In our tests, the input SNR ranges from 15 to 70 dB. We decided to display only the performance of the methods for 20 outliers as shown in Figure 5 since the other tests provide similar results. The statistical performance is provided in Figure 5 by specifying the standard deviation in each slice of SNR and for each method.

Figure 5

MER vs. input signal-to-noise ratio (SNR). The case with 20 outliers.

Let us first analyze the enhancement provided by the non-informed NMF methods. One notice that the robust -WNMF [23] performs very poorly in all cases. Its standard deviation appears very large for a wide range of SNRs. Besides, RNMF—which stands for a robust NMF method [18]—behaves correctly for low SNRs while its performance decreases surprisingly for large SNRs. Moreover, we experimented a sparse NMF (SNMF) method [53] including a -divergence cost function together with L sparsity of one factor. We select one trial and test the performance for the parameter ranging from 0 to 2. The optimal value has been selected over 400 trials for the case of 20 outliers. SNMF provides inconsistent solutions in every slice of SNR. We analyzed the performance of our proposed informed methods. Let us firstly focus on both informed methods with set values which were experimented, i.e., the -N1CWNMF and -N2CWNMF methods. Their performance appeared to be very similar in all the simulations. In practice, their MER was approximately equal to the SNR in every input SNR slice, which was expected according to our experience in preliminary tests. The four informed NMF methods with bound constraints behaved similarly, except in a few slices where the SNR is large. Indeed, in low SNR, they are slightly better than -N1CWNMF (the gap is not visible due to the scale), but they outperform all the other tested methods as soon as the SNR becomes greater than 40 dB. The low gap in low SNR is essentially due to the fact that we inform F while the performance index is measured on G. In noisy tests—i.e., for a low SNR—the estimated matrix G does not benefit from the additional information on F, because of the important noise in X. However, we noticed an improvement on F for these tests, even if we cannot safely measure it, as the profiles might be correlated. On the contrary, for medium and large SNR, the MER enhancement was significant for every bound-constrained informed NMF method. More precisely, -N1BCWNMF and -N2BCWNMF outperformed all the other methods with a significant gap as soon as the SNR increased. We also explored in the synthetic example the use of a large range of and parameters within area 2 such that and . We noticed that the MER index for WNMF method was very sensitive to the choice of the parameters and also to the trial number. A successful tuning of these parameters was somehow difficult. On the other hand, we experiment the same operating conditions for NCWNMF. We observe in Figure 6 that results are more stable than for the uninformed one. In this case, the choice of appears quite insensitive but the method remains satisfactory.

Figure 6

Mixing-error ratio (MER) index for N constrained and weighted NMF mixing-error ratio (CWNMF) vs. and .

In addition, we could wonder how constraints affect the results. First, we potentially may use 117 set values and roughly 60 bound constraints. We decided to inspect the influence of dropping set values only. For that purpose, we progressively turned on one set value at a time for each column and according to the increasing order of the row index, until the 117 constraints were reached. We plot the MER performance according to the number of constraints in Figure 7.

Figure 7

MER versus constraint number. The case with 20 outliers.

Contrary to what should be expected, adding constraints may sometimes degrade the performance suddenly or conversely. There seems to be set of constraints which fit better to the situation. This conclusion is quite surprising and the design of appropriate constraints seems an open question. To conclude, these methods provide a good performance in every situation and are thus better-suited for the considered application.

6.2. Real Data Case

The real data campaign was conducted by Dr C. Roche during her Ph.D. thesis [54], within the UCEIV laboratory (Université du Littoral Côte d’Opale). The first goal of this thesis was to study how much the shipping traffic in the English Channel, one of the most important in the world, can contribute to the atmospheric PM concentration in coastal area, such as the Hauts–de–France region. In her work, some characteristic species of maritime traffic emissions have been evidenced. Then, some flexible bound profiles and set profile entries were proposed. Using this knowledge, the challenge was to implement an informed NMF method—as those developed in [32]—in order to reconstruct the PM origin. Contrary to [54], we here would like to drop some of the bound information and to test whether or not the new methods that we propose in this paper are still competitive.

6.2.1. Context

A sampling campaign has been conducted using a Digitel DA80 sampler over a long period—i.e., 16 months—in Cape Gris–Nez and over a shorter period—i.e., three months—in the port of Calais, which enabled us to get 278 sample measurements. Cape Gris–Nez and Calais are two coastal sites in the eastern part of the English channel. The first one is a rural site whereas the port of Calais is the second busiest in passenger traffic in Europe with 10.8 million of passengers and over 80 arrivals and departures of ferries per day in average in 2014 [55]. The DA80 device (see Figure 8) is an equipment which is able to trap PM on filters, which are stored and a posteriori analyzed for chemical composition. A special sieve enabled us to select only PM, i.e., PM whose diameter was lower than m. The machine is also able to save wind conditions and time. The sampling period was chosen equal to 24 h. Along this period, meteorological conditions concentration levels were highly varying. Thus, after analyzing the filters, several data files were available to address the pollution source apportionment problem.

Figure 8

Digitel DA80 high volume sampler used for data acquisition (source of the right plot: Digitel).

6.2.2. Input Data

Appendix C provides operating conditions for the run which are performed. Based on the expert knowledge provided by chemists and on the information described in Table 4, a matrix —which defines 55 set value locations (among 278 profile entries)—is provided in Table A5. In the same way as in Section 6.1, the matrix is equal to . In addition, the initial profile matrix is chosen by an expert and is provided in Appendix C.

Table A5

Matrix used in the real data case.

ΩE	Al	Cr	Fe	Mn	P	Sr	Ti	Zn	V	Ni	Co	Cu	Cd	Sb
Sea	0	1	0	1	0	0	1	1	1	1	1	1	1	1
Aged sea	0	0	0	0	0	0	0	0	1	1	0	0	0	0
Crustal	0	0	0	0	0	0	0	0	0	0	0	0	0	0
nitrates	0	0	0	0	0	0	0	0	1	0	0	0	0	0
sulfate	0	0	0	0	0	0	0	0	1	0	0	0	0	0
Biomass	0	1	0	0	0	0	0	0	1	1	0	0	0	0
Road traffic	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Sea traffic	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Biogenic	0	0	0	1	0	0	0	0	1	1	0	0	1	0
Metal	0	0	0	0	0	0	0	0	0	0	0	0	0	0
	La	Pb	Na +	NH 4+	K +	Mg 2+	Ca 2+	Cl −	NO 3−	SO 42−	OC	EC	Levo.	Polyols
Sea	1	1	0	1	0	0	0	0	1	0	0	0	1	1
Aged sea	0	0	0	1	0	0	0	0	0	0	0	0	1	1
Crustal	0	0	0	0	0	0	0	0	0	0	0	0	1	1
nitrates	0	0	1	0	0	0	0	0	0	1	0	0	1	0
sulfate	0	0	1	0	0	0	0	0	1	0	0	0	1	0
Biomass	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Road traffic	0	0	1	0	0	1	0	0	0	0	0	0	1	1
Sea Traffic	0	0	1	0	0	0	0	0	0	0	0	0	1	1
Biogenic	1	1	0	0	0	0	0	0	0	0	0	0	1	0
Metal	0	0	1	0	0	0	0	1	1	0	1	1	1	1

6.2.3. Results Evaluation

The results were obtained in the case of 10 identified sources and iterations for each method. The profiles under study are specified in Table 4. However, their estimation remains a difficult task for several reasons which listed hereafter: Data are corrupted by an unknown number of outliers. Their origin may be of various kinds, e.g., the presence of a new source which affects the data at some sparse moments. Data are very noisy. In particular, an additional overall pollution—whose level highly varies over time—can not be assigned to a particular source and can significantly decrease the overall SNR. Some source profiles may be geometrically close, only a few tracer species are able to distinguish them. Even if a database with source profiles is available at http://source-apportionment.jrc.ec.europa.eu/Specieurope/sources.aspx, a universal profile for a given source does not exist. When comparing our result, they all appear highly consistent with the published one in the Specieurope database. We display the source profiles in a descending order of expected species (MPMPthis task was designed by the chemist co-authors of the paper). A correct source profile was then displayed as decreasing proportions from the left to the right of each figure. On the contrary, a large proportion on the right part of a profile plot implies that the estimation has partly failed. Among the 10 source profiles, some of them are well recovered. We only show in Figure 9 the estimated sea traffic source profile as it is difficult to recover. As mentioned above, it is expected that proportions are decreasing from the left to the right side of the figure. The order has been built based on ship profiles from the European database and from the literature [50,51,54]. To process these data, we compare the enhancement provided by two non-informed methods, i.e., the -WNMF and the -RNMF [18] and three informed methods, i.e., the method used in [54] and our methods -N1CWNMF, and its bound-constrained extension -N1BCWNMF. Other bound methods were dropped since they turn out in Section 6.1 to behave roughly similar to -N1BCWNMF. Note that Roche [54] used 67 constraints while we only use 63 and 65 bound constraints in the tested bound-constrained informed -NMF method, respectively.

Figure 9

Estimation of the sea traffic source profile.

It may be noticed that blind NMF methods, i.e., the -WNMF and the -RNMF, and our -N1CWNMF method are overestimating and species while underestimating OC and EC compounds. The estimated sea traffic profile thus appears not to be very realistic with these methods. Besides, bound-constrained WNMF methods behave similarly and report good estimations for major species. However, these estimations reach the proposed bounds for , , and species among the 28 species under study. For example, is limited by the maximum value provided in Table A4. Finally, these bound-constrained NMF methods outperform all the other methods for the sea traffic re-construction.

Table A4

Matrix used in the real data case.

Finit	Al	Cr	Fe	Mn	P	Sr	Ti	Zn	V	Ni	Co	Cu	Cd	Sb
Sea	0.19	ϵ	ϵ	ϵ	1.00	8.00	ϵ	ϵ	ϵ	ϵ	ϵ	ϵ	ϵ	ϵ
Aged sea	0.10	0.01	0.50	0.01	1.00	8.00	0.02	0.02	ϵ	ϵ	1.00	0.01	0.01	0.01
Crustal	266.67	0.14	150	2.00	ϵ	2.00	20	0.50	0.50	0.07	0.07	0.07	ϵ	0.01
nitrates	0.98	0.98	30	0.98	ϵ	0.98	0.98	20	ϵ	0.98	0.98	10	0.98	0.98
sulfate	1.00	1.00	30	1.00	ϵ	1.00	15.00	20	ϵ	1.00	1.00	1.00	1.00	1.00
Biomass	4.00	ϵ	9.00	1.00	ϵ	1.00	1.00	10	ϵ	ϵ	ϵ	1.00	ϵ	ϵ
Road traffic	20	1.00	50	5.00	ϵ	1.00	ϵ	50	5.00	10	5.00	50	5.00	50
Sea traffic	10	ϵ	10	ϵ	ϵ	ϵ	ϵ	5.00	55.00	55.00	30	ϵ	ϵ	ϵ
Biogenic	0.01	ϵ	0.01	ϵ	20	ϵ	ϵ	ϵ	ϵ	ϵ	ϵ	1.00	ϵ	ϵ
Metal	80	80	358	40	8	18.00	40	40	30	30	1.00	40	50	30
	La	Pb	Na +	NH 4+	K +	Mg 2+	Ca 2+	Cl −	NO 3−	SO 42−	OC	EC	Levo.	Polyols
Sea	ϵ	ϵ	320.00	ϵ	10.00	40.00	10.00	540.08	ϵ	70.00	0.64	0.09	ϵ	ϵ
Aged Sea	ϵ	0.01	250.00	ϵ	10.00	25.00	10.00	200.00	275.30	210.00	8.00	1.00	ϵ	ϵ
Crustal	ϵ	0.14	10.00	3.00	100.00	70.14	210.00	7.00	20.00	35.07	90.00	12.62	ϵ	ϵ
nitrates	ϵ	0.98	ϵ	200.00	0.98	0.98	40.00	0.98	547.30	ϵ	100.00	40.00	ϵ	ϵ
sulfate	ϵ	20.00	ϵ	200.00	34.00	1.00	40.00	1.00	ϵ	554.00	60.00	16.00	ϵ	ϵ
Biomass	0.00	0.94	2.83	28.31	70.00	4.72	37.74	66.05	70.00	66.05	500.61	69.29	56.46	ϵ
Road traffic	ϵ	10.00	ϵ	10.00	10.00	ϵ	21.00	2.00	80.00	40.00	271.73	303.27	ϵ	ϵ
Sea traffic	15.00	10.00	ϵ	10.00	ϵ	ϵ	20.00	ϵ	10.00	30.00	580.00	160.00	ϵ	ϵ
Biogenic	ϵ	ϵ	1.00	1.00	5.00	4.00	1.00	ϵ	5.00	5.00	760.00	50.00	ϵ	146.98
Metal	1.00	80.00	ϵ	1.00	48.00	10.00	5.00	ϵ	ϵ	10.00	ϵ	ϵ	ϵ	ϵ

Using the estimation provided by each method, it is possible to reconstruct each species’ concentration, and especially the V and compounds since they are tracers of the sea traffic activity [54]. In other words, the V and species can only be found in the sea traffic source. Moreover, the ratio V over is often assimilated to a value between 2 and 3 [51], and is found to be between 1.2 to 1.5 for the three bound-constrained WNMF methods, which is close to the expected ratio. To confirm this fact, we plot the reconstruction of the V species in Figure 10. This shows that this compounds is mainly due to Sea Traffic. More than 98 per cent of the V species originates from the sea traffic source which is consistent with the chemist’s expectations.

Figure 10

V species reconstruction over Cape Gris–Nez.

7. Conclusions

In this paper, we tackled an informed non-negative matrix factorization problem where the profile matrix lives in a specific subspace. We proposed several informed NMF methods combining -divergence and a specific structure of one matrix factor provided by the considered problem. This work extends our previous informed NMF [32]—assuming some values of one of the factor matrices to be known—by considering generalized divergences, and by leading to alternative update rules and normalization. The update rules may be viewed as projective multiplicative updates applied to a special structure of the profile matrix. The relevance of these extensions were shown on realistic simulations of natural and industrial PM source apportionment—with various input SNR conditions and various numbers of outliers—and on a real data case. In practice, these informed methods are more robust than blind NMF, and provide results which are consistent with the chemical expert, even in the presence several outliers. In future work, we will investigate new soft constraints to inform NMF and alternatives to multiplicative updates.

Table A2