Literature DB >> 31983896

Determining the Number of Latent Factors in Statistical Multi-Relational Learning.

Abstract

Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer s, RESCAL computes an s-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering. The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish its rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistencies when the number of relations is either bounded or diverges at a proper rate of the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.

Entities: Chemical

Keywords: Information criteria; Knowledge graph; Model selection consistency; RESCAL model; Statistical relational learning; Tensor factorization

Year: 2019 PMID： 31983896 PMCID： PMC6980192

Source DB: PubMed Journal: J Mach Learn Res ISSN： 1532-4435 Impact factor: 5.177

Introduction

Relational data is becoming ubiquitous in artificial intelligence and social network analysis. These data sets are in the form of graphs, with nodes and edges representing entities and relationships, respectively. Recently, a number of companies have developed and released their knowledge graphs, including the Google Knowledge Graph, Microsoft Bing’s Satori Knowledge Base, Yandex’s Object Answer, the Linkedln Knowledge Graph, etc. These knowledge graphs are graph-structured knowledge bases that store factual information as relationships between entities. They are created via the automatic extraction of semantic relationships from semi-structured or unstructured text (see Section II.C in Nickel et al., 2016). The data may be incomplete, noisy and contain false information. It is therefore of great importance to infer the existence of a particular relationship to improve the quality of these extracted information. Statistical relational learning is primarily concerned with learning from relational data sets, and solving tasks such as predicting whether two entities are related (link prediction), identifying equivalent entities (entity resolution), and grouping similar entities based on their relationships (link-based clustering). Statistical relational models can be roughly divided into three categories: the relational graphical models, the latent class models and the tensor factorization models. Relational graphical models include probabilistic relational models (Getoor and Mihalkova, 2011) and Markov logic networks (MLN, Richardson and Domingos, 2006). These models are constructed via Bayesian or Markov networks. In latent class models, each entity is assigned to one of the latent classes and the probability of a relationship between entities depends on their corresponding classes. Two important examples include the stochastic block model (SBM, Nowicki and Snijders, 2001) and the infinite relational model (IRM, Kemp et al., 2006). IRM can be viewed as a nonparametric extension of SBM where the total number of clusters is not prespecified. Both models have received considerable attentions in the statistics and machine learning literature for community detection in networks. Tensors are multidimensional arrays. Tensor factorization methods such as CANDE-COMP/PARAFAC (CP, Harshman and Lundy, 1994), Tucker (Tucker, 1966) and their extensions have found applications in a variety of fields. Kolda and Bader (2009) presented a thorough overview of tensor decompositions and their applications. Recently, tensor factorizations are being actively studied in the statistics literature and have becoming an emerging field of statistics. To name a few, Chi and Kolda (2012) developed a Poisson tensor factorization model for sparse count data. Yang and Dunson (2016) proposed a conditional tensor factorization model for high-dimensional classification with categorical predictors. Sun et al. (2017) proposed a sparse tensor decomposition method by incorporating a truncation step into the tensor power iteration step. Relational data sets are typically expressed as (subject, predicate, object) triples and can be grouped as a third-order tensor. As a result, tensor factorization methods can be naturally applied to these data sets. Nickel (2013) proposed a RESCAL factorization model for statistical relational learning. Compared to other tensor factorization approaches such as CP and Tucker methods, RESCAL is more capable of detecting the correlations produced between multiple interconnected nodes. For relational data consisting of n entities, K types of relations, and a positive integer s, RESCAL computes an n × s factor matrix and an s × s × K core tensor. The factor matrix and the core tensor can be further used for link prediction, entity resolution and link-based clustering. Nickel et al. (2011) showed that a linear RESCAL model achieved better or comparable results on common benchmark data sets when compared to other existing methods such as MLN, DEDICOM (Harshman, 1978), IRM, CP, MRC (Kok and Domingos, 2007), etc. It was shown in Nickel and Tresp (2013) that a logistic RESCAL model could further improve the link prediction results. Central to the empirical validity of RESCAL is the correct specification of the number of latent factors. Nickel et al. (2011) proposed to select this parameter via cross-validation. As commonly known for cross-validation methods, there’s no theoretical guarantee against overestimation. Besides, cross-validation can be computationally expensive, especially for large n and K. In the literature, model selection is less studied for tensor factorization methods. Allen (2012) and Sun et al. (2017) proposed to use Bayesian information criteria (BIC, Schwarz, 1978) for sparse CP decomposition. However, no theoretical results were provided for BIC. Indeed, we show in this paper that a BIC-type criterion may fail for the RESCAL model. The contribution of this paper is twofold. First, we propose a general class of information criteria for the RESCAL model and prove their model selection consistency. Although we focus on the RESCAL model, our information criteria can be extended to select models for general tensor factorization methods with slight modification. The problem is nonstandard and challenging since both the factor matrix and the core tensor are not observed and need to be estimated. Besides, the model parameters are non-identifiable. Moreover, the derivation of model/tuning parameter selection consistency of information criteria usually relies on the (uniform) consistency of estimated parameters. For example, Fan and Tang (2013) derived the uniform consistency of the maximum likelihood estimators (MLEs) to prove the consistency of GIC (see Proposition 2 in that paper). Zhang et al. (2016) established the uniform consistency of the support vector machine solutions to prove the consistency of SVMIC (see Lemma 2 in that paper). The consistency of these estimators are due to the concavity (convexity) of the likelihood (or the empirical loss) functions. In contrast, for most tensor decomposition models including RESCAL, the likelihood (or the empirical loss) function is usually non-concave (non-convex) and may have multiple local solutions. As a result, the corresponding global maximizer (minimizer) may not be consistent even with the identifiability constraints. It remains unknown how to establish the consistency of the information criterion without consistency of the estimator. A key innovation in our analysis is to design a “proper” pseudometric and show that the global optimum is consistent under this specific pseudometric. We further establish the rate of convergence of the global optimum under this pseudometric as a function of n and K. Based on these results, we establish the consistency of our information criteria when K is either bounded or diverges at a proper rate of n. No parametric assumptions are imposed on the latent factors. Second, we introduce a scalable algorithm for estimating the parameters in the logistic RESCAL model. Despite the fact that a linear RESCAL model can be conveniently solved by an alternating least square algorithm (Nickel et al., 2011), there are lack of optimization algorithms for solving general RESCAL models. The proposed algorithm is based on the alternating direction method of multipliers (ADMM, Boyd et al., 2011) and can be implemented in a parallelized fashion. The rest of the paper is organized as follows. We formally introduce the RESCAL model and study the parameter identifiability in Section 2. Our information criteria are presented in Section 3 and their model selection properties are investigated. Numerical examples are presented in Section 4 to examine the finite sample performance of the proposed information criteria. Section 5 concludes with a summary and discussion of future extensions. All the proofs are given in the Appendix.

The RESCAL Model

This section is structured as follows. We introduce the RESCAL model in Section 2.1. In Section 2.2, we study the identifiability of parameters in the model.

Model Setup

In knowledge graphs, facts can be expressed in the form of (subject, predicate, object) triples, where subject and object are entities and predicate is the relation between entities. For example, consider the following sentence from Wikipedia: Jon Snow is a fictional character in the A Song of Ice and Fire series of fantasy novels by American author George R. R. Martin, and its television adaptation Game of Thrones. The information contained in this message can be summarized into the following set of (subject, predicate, object) triples: In this example, we have a total of 7 entities, 4 types of relations and 6 triples. More generally, let denote the set of all entities and denote the set of all relation types. The number of relations K is either bounded or diverges with n. Assuming non-existing triples indicate false relationships, we can construct a third-order binary tensor such that The RESCAL model is defined as follows. For each entity e, a latent vector is generated. The Y’s are assumed to be conditionally independent given all latent factors . Besides, it is assumed that for some strictly monotone link function g and s0 × s0 matrices . In the above model, corresponds to the latent representation of the ith entity and specifies how these ’s interact for the k-th relation. To account for asymmetric relations, we do not restrict ’s to symmetric matrices. When the relations are symmetric, i.e., one can impose the symmetry constraints and obtain a similar derivation. For continuous Y, a related tensor factorization model is the TUCKER-2 decomposition, which decomposes the tensor into for some and some (random) errors . By Equation 1, RESCAL can be interpreted as a “nonlinear” TUCKER-2 model with the additional constraints that s1 = s2 = s0 and . CP decomposition is another important tensor factorization method that decomposes a tensor into a sum of rank-1 tensors. It assumes that for some and . Define and . In view of Equation 2, CP is a special TUCKER-2 model with the constraints that s1 = s2 = s0 and where is a diagonal matrix with the sth diagonal elements being r. In this paper, the proposed information criteria are designed in particular for the RESCAL model. However, they can be extended to estimate s0 in a more general tensor factorization framework including CP and TUCKER-2 models. We discuss this further in Section 5.

Identifiability

The parameterization in Equation 1 is not identifiable. To see this, for any nonsingular matrix , we define . Observe that and hence we have Let . We impose the following condition. (A0) (i) Assume 0 has full column rank. (ii) Assume the matrix has full row rank. (A0)(i) requires the latent factors to be linearly independent. (A0)(ii) holds when at least one of the ’s has full rank. Under Condition (A0), the following lemma states that the RESCAL model is identifiable up to a nonsingular linear transformation. In Section B.1 of the Appendix, we show (A0) is also necessary to guarantee such identifiability when are symmetric.

Lemma 1 (Identifiability).

Assume (A0) holds. Assume there exist some such that , and Then, there exists some invertible matrix such that To fix the nonsingular transformation indeterminacy, we adopt a specific constrained parameterization and focus on estimating and where where . Observe that where stands for an s0 × s0 identity matrix. Therefore, the first ’s are fixed as long as is nonsingular. By Lemma 1, the parameters and are estimable. From now on, we only consider the logistic link function for simplicity, i.e, g(x) = 1/{1 + exp(−x)}. Results for other link functions can be similarly discussed.

Model Selection

Parameters and can be estimated by maximizing the (conditional) log-likelihood function. Since we use the logistic link function, the log-likelihood is equal to where the first equality is due to the conditional independence assumption. We assume the number of latent factors s0 is fixed. For any here smax is allowed to diverge with n and satisfies smax ≥ s0, we define the following constrained maximum likelihood estimator for some , where the vec(·) operator stacks the entries of a matrix into a column vector. To estimate the number of latent factors, we define the following likelihood-based information criteria for some penalty functions κ(·,·). The estimated number of latent factors is given by In addition to the constraint in Equation 4, there exist many other constraints that would make the estimators identifiable. The choice of the identifiability constraints might affect the value of IC. However, it wouldn’t affect the value of . Detailed discussions can be found in Section A of the Appendix. A major technical difficulty in establishing the consistency of IC is due to the nonconcavity of the objective function given in Equation 3. For any , let be the set of parameters. For any , we define With some calculations, we can show that where . Here, I1 is nonnegative. However, I2 can be negative for some and . Therefore, the negative Hessian matrix is not positive semidefinite and the likelihood function is not concave. As a result, and may not be consistent to and , even with the identifiability constraints in Equation 4. Here, the presence of I2 is due to the bilinear formulation of the RESCAL model. Let . Notice that is concave in θ, ∀i, j, k. This motivates us to consider the following pseudometric: for any integers s1, s2 > 0 and , , , . Apparently, d(·,·) is nonnegative, symmetric and satisfies the triangle inequality. Below, we establish the convergence rate of We first introduce some notation. For any s > s0, we define and where 0 denotes a q-dimensional zero vector and is an p × q zero matrix. With a slight abuse of notation, we write and . Clearly, for any s ≥ s0, we have and hence Let where . When is invertible, ’s are invertible for all s > s0. The defined ’s satisfy the identifiability constraints in Equation 4 for all s ≥ s0. We make the following assumption. (A1) Assume and , and s0 ≤ s ≤ smax. In addition, assume , for some .

Lemma 2.

Assume (A1) holds, . Then there exists some constant C0 > 0 such that the following event occurs with probability tending to 1, Under the condition , we have that When ω and ω are bounded, it follows that Hence, and are consistent under the pseudometric d for all overfitted models. On the contrary, for underfitted models, we require the following conditions. (A2) Assume there exists some constant such that . (A3) Let . Assume .

Lemma 3.

Assume (A2) and (A3) hold. The for any , we have where and are defined in (A2) and (A3), respectively. Assumption (A3) holds if there exists some such that When for some constant c′ > 0, it follows from Lemma 3 that Based on these results, we establish the consistency of defined in Equation 5 below. For any sequences {a} and {b}, we write a ~ b if there exist some universal constants c1, c2 > 0 such that c1a ≤ b ≤ c2a.

Theorem 1.

Assume (A1)-(A3) hold, or some . Assume κ(n, K) satisfies Then, we have where . Let . When are bounded, it follows from Theorem 1 that IC is consistent provided that and . Define for some α ≥ 0. Note that Consider the following criteria: Note that the term satisfies that and . It follows from Equation 7 and Theorem 1 that IC is consistent for all α ≥ 0. When α > 0, the term adjust the model complexity penalty upwards. We notice that Bai and Ng (2002) used a similar finite sample correction term in their proposed information criteria for approximate factor models. Our simulation studies show that such adjustment is essential to achieve selection consistency for large K. Conditions (A1) and (A2) are directly imposed on the realizations of . In Section B.2 and B.3, we consider an asymptotic framework where are i.i.d according to some distribution function and show (A1) and (A2) hold with probability tending to 1. Therefore, under this framework, we still have . The consistency of our information criterion remains unchanged. Observe that we have a total of n × n × K = n2K observations. Consider the following BIC-type criterion: The model complexity penalty in BIC satisfies Hence, it does not meet Condition (6) in Theorem 1. As a result, BIC may fail to identity the true model. As shown in our simulation studies, BIC will choose overfitted models and is not selection consistent.

Numerical Experiments

This section is organized as follows. In Section 4.1, we introduce our algorithm for computing the maximum likelihood estimators of a logistic RESCAL model. Simulation studies are presented in Section 4.2. In Section 4.3, we apply the proposed information criteria to a real dataset.

Implementation

In this section, we propose an algorithm for computing and . The algorithm is based upon a 3-block alternating direction method of multipliers (ADMM). Set , , , ’s and ’s are defined by where For any , define Fix , the optimization problem in Equation 10 is equivalent to We then derive its augmented Lagrangian, which gives us where ρ > 0 is a penalty parameter and, . Applying dual descent method yields the following steps, with l denotes the iteration number: Let us examine Equation 11–13 in more details. In Equation 11, we rewrite the objective function as Note that L can be represented as a separable sum of functions. As a result, ‘s can be solved in parallel. More specifically, we have Hence, each can be computed by solving a ridge type logistic regression with responses and covariates . In Equation 12, each can be independently updated by solving a logistic regression with responses and covariates , i.e, where ⊗ denotes the Kronecker product. Similar to Equation 11, each in Equation 13 can be independently computed by solving a ridge type regression with responses and covariates . Using similar arguments in Theorem 2 in Wang et al. (2017), we can show that the proposed 3-block ADMM algorithm converges for any sufficiently large ρ. In our implementation, we set ρ = nK/2. To guarantee global convergence, we randomly generate multiple initial estimators and solve the optimization problem multiple times based on these initial values.

Simulations

We simulate from the following model: where N(0, 1) stands for a standard normal random variable and denotes a q × q diagonal matrix with the jth element equal to v. We consider six simulation settings. In the first three settings, we fix K = 3 and set n = 100,150 and 200, respectively. In the last three settings, we increase K to 10, 20, 50, and set n = 50. In each setting, we further consider three scenarios, by setting s0 = 2, 4 and 8. Let smax = 12. The ADMM algorithm proposed in Section 4.1 is implemented in R. Some subroutines of the algorithm are written in C with the GNU Scientific Library (GSL, Galassi et al., 2015) to facilitate the computation. We compare the proposed IC (see Equation 8) with the BIC-type criterion (see Equation 9). In IC, we set α = 0, 0.5 and 1. Note that when α = 0, we have Reported in Table 1 and 2 are the percentage of selecting the true models (TP) and the average of selected by IC0, IC0.5, IC1 and BIC over 100 replications.

Table 1:

Simulation results for Setting I, II and III (standard errors in parenthesis)

	s₀ = 2		s₀ = 4		s₀ = 8
n = 100, K = 3	TP	s^	TP	s^	TP	s^
IC₀	0.97 (0.02)	2.03 (0.02)	0.97 (0.02)	4.03 (0.02)	0.90(0.03)	7.90 (0.03)
IC_0.5	0.97 (0.02)	2.03 (0.02)	0.98 (0.01)	4.02 (0.01)	0.90(0.03)	7.90 (0.03)
IC₁	0.97 (0.02)	2.03 (0.02)	0.98 (0.01)	4.02 (0.01)	0.89(0.03)	7.89 (0.03)
BIC	0.00 (0.00)	11.99 (0.01)	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	11.99 (0.01)

n = 150, K = 3	TP	s^	TP	s^	TP	s^
IC₀	0.99 (0.01)	2.01 (0.01)	0.97 (0.02)	4.03 (0.02)	0.96(0.02)	8.04 (0.02)
IC_0.5	0.99 (0.01)	2.01 (0.01)	0.97 (0.02)	4.03 (0.02)	0.96(0.02)	8.04 (0.02)
IC₁	0.99 (0.01)	2.01 (0.01)	0.97 (0.02)	4.03 (0.02)	0.96(0.02)	8.04 (0.02)
BIC	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	11.98 (0.01)

n = 200, K = 3	TP	s^	TP	s^	TP	s^
IC₀	0.99 (0.01)	2.01 (0.01)	0.95 (0.02)	4.05 (0.02)	0.95(0.02)	8.05 (0.02)
IC_0.5	0.99 (0.01)	2.01 (0.01)	0.95 (0.02)	4.05 (0.02)	0.95(0.02)	8.05 (0.02)
IC₁	0.99 (0.01)	2.01 (0.01)	0.95 (0.02)	4.05 (0.02)	0.95(0.02)	8.05 (0.02)
BIC	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	11.99 (0.01)	0.00 (0.00)	11.98 (0.01)

Table 2:

Simulation results for Setting IV, V and VI (standard errors in parenthesis)

	s₀ = 2		s₀ = 4		s₀ = 8
n = 50, K = 10	TP	s^	TP	s^	TP	s^
IC₀	1.00 (0.00)	2.00 (0.00)	0.97 (0.02)	4.03 (0.02)	0.69(0.05)	7.91 (0.06)
IC_0.5	1.00 (0.00)	2.00 (0.00)	0.97 (0.02)	4.03 (0.02)	0.66(0.05)	7.75 (0.06)
IC₁	1.00 (0.00)	2.00 (0.00)	0.98 (0.01)	4.02 (0.01)	0.60(0.05)	7.62 (0.06)
BIC	0.00 (0.00)	11.81 (0.06)	0.00 (0.00)	11.60 (0.06)	0.01 (0.01)	11.67 (0.07)

n = 50, K = 20	TP	s^	TP	s^	TP	s^
IC₀	0.97 (0.02)	2.03 (0.02)	0.95 (0.02)	4.05 (0.02)	0.73(0.04)	8.46 (0.10)
IC_0.5	0.97 (0.02)	2.03 (0.02)	0.98 (0.01)	4.02 (0.01)	0.87(0.03)	8.09 (0.03)
IC₁	0.98 (0.01)	2.02 (0.02)	1.00 (0.00)	4.00 (0.00)	0.79(0.04)	7.99 (0.05)
BIC	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	11.92 (0.03)	0.00 (0.00)	11.99 (0.01)

n = 50, K = 50	TP	s^	TP	s^	TP	s^
IC₀	0.98 (0.01)	2.02 (0.01)	0.93 (0.03)	4.07 (0.03)	0.17(0.04)	11.24 (0.15)
IC_0.5	0.99 (0.01)	2.01 (0.01)	0.97 (0.02)	4.03 (0.02)	0.76(0.04)	8.24 (0.05)
IC₁	1.00 (0.00)	2.00 (0.00)	0.98 (0.01)	4.02 (0.01)	0.79(0.04)	7.99 (0.05)
BIC	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	12.00 (0.00)	0.00 (0.00)	11.99 (0.01)

It can be seen from Table 1 and 2 that BIC fails in all settings. It always selects overfitted models. On the contrary, the proposed information criteria are consistent for most of the settings. For example, under settings where s0 = 2 or 4, TPs of IC0, IC0.5 and IC1 are larger than or equal to 93%. When s0 = 8, expect for the last setting, TPs of the proposed information criteria are no less than 60% for all cases. IC0, IC0.5 and IC1 perform very similarly for small K. In the first three settings, TPs of these three information criteria are nearly the same for all cases. However, IC0.5 and IC1 are more robust than IC0 for large K. This can be seen in the last scenario of Setting 6, where the TP of IC0 is no more than 20%. Besides, in the last two settings, TP of IC0 is smaller than IC0.5 and IC1 for all cases. These differences are due to the finite sample correction term . As commented before, and increase the model complexity penalty term in IC0.5 and IC1 to avoid overfitting for large K. In Section D of the Appendix, we examine the performance of our proposed information criteria under the scenario where . Results are similar to those presented at Table 1 and 2.

Real Data Experiments

In this section, we apply the proposed information criteria to the “Social Evolution” dataset (Madan et al., 2012). This dataset comes from MIT’s Human Dynamics Laboratory. It tracks everyday life of a whole undergraduate MIT dormitory from October 2008 to May 2009. We use the survey data, resulting in n = 84 participants and K = 5 binary relations. The five relations are: close relationship, political discussion, social interaction and two social media interaction. We compute for and select the number of latent factors using the proposed information criteria and BIC. It turns out that IC0, IC0.5 and IC1 all suggest the presence of 9 factors. In contrast, BIC selects 12 factors. To further evaluate the number of latent factors selected by the proposed information criteria, we consider the following cross-validation procedure. For any , we randomly select 80% of the observations and estimate and by maximizing the observed likelihood function based on these training samples. Then we compute Based on these predicted probabilities, we calculate the area under the precision-recall curve (AUC) on the remaining 20% testing samples. Reported in Table 3 are the AUC scores averaged over 100 replications. For any , we denoted by AUC the corresponding AUC score. It can be seen from Table 3 that AUC first increases and then decreases as s increases. The maximum AUC score is achieved at s = 10. Observe that AUC9 is very close to AUC10, and it is larger than the remaining AUC scores. This demonstrates that the proposed information criteria select less latent factors while achieve better or similar link prediction results when compared to BIC.

Table 3:

AUC scores

s	1	2	3	4	5	6	7	8	9	10	11	12
AUC	0.7201	0.8341	0.8952	0.9095	0.9257	0.9364	0.9444	0.9486	0.9513	0.9518	0.9485	0.9467

Discussion

In this paper, we propose information criteria for selecting the number of latent factors in the RESCAL tensor factorization model and prove their model selection consistency. Although we focus on the logistic RESCAL model, the proposed information criteria can be applied to general tensor factorization models. More specifically, consider the following class of models: with any of (or without) the following constraints: (C1) is diagonal; (C2) for , for some strictly increasing function g, and some mean zero random errors . As commented in Section 2.1, such representation includes the RESCAL, CP and TUCKER-2 models. Specifically, it reduces to the TUCKER-2 model by setting g to be the identity function. If further (C1) holds, then the model in Equation 14 reduces to the CP model. When (C2) holds, it corresponds to the RESCAL model. Consider the following information criteria, where stands for the likelihood function and are the corresponding (constrained) MLEs. Similar to Theorem 1, we can show that with some properly chosen , IC is consistent under this general setting. Currently, we assume the tensor is completely observed. When some of the Y’s are missing, we can calculate ’s and ’s by maximizing the following observed likelihood function where N denotes the set of the observed responses. The above optimization problem can also be solved by a 3-block ADMM algorithm. Define the following class of information criteria, where denotes the percentage of observed responses. Consistency of IC can be similarly studied.

Subject	Predicate	Object
Jon Snow	character in	A Song of Ice and Fire
Jon Snow	character in	Game of Thrones
A Song of Ice and Fire	genre	novel
Game of Thrones	genre	television series
George R.R. Martin	author of	A Song of Ice and Fire
George R.R. Martin	profession	novelist

Table 4: