Fitting coupled matrix factorization models

Objective function

To fit a coupled matrix factorization, we solve the following optimization problem

\[\min_{\mathbf{A}, \{\mathbf{B}^{(i)}\}_{i=1}^I, \mathbf{C}} \frac{1}{2} \sum_{i=1}^I \frac{\| \mathbf{B}^{(i)} \mathbf{D}^{(i)} \mathbf{C}^\mathsf{T} - \mathbf{X}^{(i)}\|^2}{\|\mathbf{X}^{(i)}\|^2},\]

where \(\mathbf{A}\) is the matrix obtained by stacking the diagonal entries of all \(\mathbf{D}^{(i)}\)-matrices. However, as discussed in What are coupled matrix factorizations?, this problem does not have a unique solution, and each time we fit a coupled matrix factorization, we may obtain different factor matrices. As a consequence, we cannot interpret the factor matrices. To circumvent this problem, it is common to add regularization, forming the following optimization problem

\[\min_{\mathbf{A}, \{\mathbf{B}^{(i)}\}_{i=1}^I, \mathbf{C}} \frac{1}{2} \sum_{i=1}^I \frac{\| \mathbf{B}^{(i)} \mathbf{D}^{(i)} \mathbf{C}^\mathsf{T} - \mathbf{X}^{(i)}\|^2}{\|\mathbf{X}^{(i)}\|^2} + \sum_{n=1}^{N_\mathbf{A}} g^{(A)}_n(\mathbf{A}) + \sum_{n=1}^{N_\mathbf{B}} g^{(B)}_n(\{ \mathbf{B}^{(i)} \}_{i=1}^I) + \sum_{n=1}^{N_\mathbf{C}} g^{(C)}_n(\mathbf{C}),\]

where the \(g\)-functions are regularization penalties, and \(N_\mathbf{A}, N_\mathbf{B}\) and \(N_\mathbf{C}\) are the number of regularization penalties for \(\mathbf{A}, \{\mathbf{B}^{(i)}\}_{i=1}^I\) and \(\mathbf{C}\), respectively.
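
To make the data fidelity term concrete, the following is a minimal NumPy sketch (the function name and argument layout are illustrative, not part of the MatCoupLy API) that evaluates the relative sum of squared errors for given factor matrices, where each row of \(\mathbf{A}\) holds the diagonal of one \(\mathbf{D}^{(i)}\):

    import numpy as np

    def relative_sse(matrices, B_is, A, C):
        """Data fidelity term: 0.5 * sum_i ||B^(i) D^(i) C^T - X^(i)||^2 / ||X^(i)||^2."""
        loss = 0.0
        for X_i, B_i, a_i in zip(matrices, B_is, A):  # a_i is the diagonal of D^(i)
            estimate = B_i @ np.diag(a_i) @ C.T
            loss += np.sum((estimate - X_i) ** 2) / np.sum(X_i ** 2)
        return 0.5 * loss

The regularization penalties are then added on top of this data fidelity term.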

The formulation above also encompasses hard constraints, such as \(a_{ir} \geq 0\) for all indices \((i, r)\). To obtain such a constraint, we set

\[\begin{split}g^{(\mathbf{A})} = \begin{cases} 0 & \text{if } a_{ir} \geq 0 \text{ for all } (i, r) \\ \infty & \text{otherwise}. \end{cases}\end{split}\]
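
As a minimal illustration (the function name is ours, not a MatCoupLy function), this indicator penalty can be evaluated as follows:

    import numpy as np

    def nonnegativity_indicator(A):
        # Hard-constraint penalty: zero if every entry of A is non-negative,
        # infinite otherwise.
        return 0.0 if np.all(A >= 0) else np.inf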

Note

The data fidelity term (sum of squared error) differs by a factor of \(1/2\) compared to that in [RSC+22, RSCA21].

Optimization

To solve the regularized least squares problem, we use alternating optimization (AO) with the alternating direction method of multipliers (ADMM). AO-ADMM is a block coordinate descent scheme, where the factor matrices for each mode are updated in an alternating fashion. This means that the regularized loss function above is split into the following three optimization subproblems:

\[\min_{\mathbf{A}} \frac{1}{2} \sum_{i=1}^I \frac{\| \mathbf{B}^{(i)} \mathbf{D}^{(i)} \mathbf{C}^\mathsf{T} - \mathbf{X}^{(i)}\|^2}{\|\mathbf{X}^{(i)}\|^2} + \sum_{n=1}^{N_\mathbf{A}} g^{(A)}_n(\mathbf{A}),\]
\[\min_{\{\mathbf{B}^{(i)}\}_{i=1}^I} \frac{1}{2} \sum_{i=1}^I \frac{\| \mathbf{B}^{(i)} \mathbf{D}^{(i)} \mathbf{C}^\mathsf{T} - \mathbf{X}^{(i)}\|^2}{\|\mathbf{X}^{(i)}\|^2} + \sum_{n=1}^{N_\mathbf{B}} g^{(B)}_n(\{ \mathbf{B}^{(i)} \}_{i=1}^I),\]
\[\min_{\mathbf{C}} \frac{1}{2} \sum_{i=1}^I \frac{\| \mathbf{B}^{(i)} \mathbf{D}^{(i)} \mathbf{C}^\mathsf{T} - \mathbf{X}^{(i)}\|^2}{\|\mathbf{X}^{(i)}\|^2} + \sum_{n=1}^{N_\mathbf{C}} g^{(C)}_n(\mathbf{C}),\]

which we solve approximately, one at a time, using a few iterations of ADMM. We repeat this process, updating \(\mathbf{A}, \{\mathbf{B}^{(i)}\}_{i=1}^I\) and \(\mathbf{C}\) until some convergence criteria are satisfied.

The benefit of AO-ADMM is its flexibility in terms of regularization and constraints. We can impose any regularization penalty or hard constraint so long as we have a way to evaluate the scaled proximal operator of the penalty function or the projection onto the feasible set of the hard constraint [HSL16]. That is, we can use any regularization function \(g\) for which we can solve the problem

\[\min_{\mathbf{w}} g(\mathbf{w}) + \frac{\rho}{2}\|\mathbf{w} - \mathbf{v}\|^2,\]

where \(\mathbf{v}\) is the vectorized input to the regularized least squares subproblem we solve with ADMM and \(\mathbf{w}\) is the corresponding optimization variable. \(\rho\) is a parameter that penalises infeasible solutions (more about that later); we call \(\rho\) the feasibility penalty parameter.
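
Two standard examples of such scaled proximal operators, written as NumPy sketches (the function names and the reg_strength parameter are illustrative, not MatCoupLy's API), are the proximal operator of the non-negativity indicator, which reduces to a projection onto the non-negative orthant, and the proximal operator of an L1 penalty, which reduces to elementwise soft thresholding:

    import numpy as np

    def prox_nonnegativity(v, rho):
        # Scaled proximal operator of the non-negativity indicator:
        # projection onto the non-negative orthant (independent of rho).
        return np.maximum(v, 0)

    def prox_l1(v, rho, reg_strength=1.0):
        # Scaled proximal operator of g(w) = reg_strength * ||w||_1:
        # elementwise soft thresholding with threshold reg_strength / rho.
        threshold = reg_strength / rho
        return np.sign(v) * np.maximum(np.abs(v) - threshold, 0)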

The AO-ADMM algorithm is described in detail (in the context of PARAFAC2, a special case of coupled matrix factorization) in [RSC+22].

Note

The roles of \(\mathbf{A}\) and \(\mathbf{C}\) are switched between this software and [RSC+22], as this change makes for more straightforward usage.

ADMM and the feasibility gap

There is one property of ADMM that is important to be aware of when fitting models with AO-ADMM: the feasibility gap. If the feasibility gap is high (what counts as a high value depends on the application, but any value above 0.0001 is suspicious and any value above 0.01 is likely too high), then the constraints and regularization we impose may not be satisfied. To explain why, we briefly describe ADMM (for a thorough introduction, see [BPC11], and for an introduction specifically for coupled matrix factorization, see [RSC+22]).

ADMM can solve problems of the form

\[\begin{split}\min_{\mathbf{x}, \mathbf{z}} f(\mathbf{x}) + g(\mathbf{z}) \\ \text{s.t. } \mathbf{Mx} - \mathbf{Nz} = \mathbf{c},\end{split}\]

which is useful for solving regularized least squares problems. If we have only one regularization penalty, then we can rewrite a regularized least squares problem to the standard form of ADMM:

\[\begin{split}\min_{\mathbf{x}, \mathbf{z}} \frac{1}{2}\|\mathbf{Tx} - \mathbf{b}\|^2 + g(\mathbf{z}) \\ \text{s.t. } \mathbf{x} = \mathbf{z}.\end{split}\]

ADMM then works by forming the augmented Lagrangian dual problem:

\[\max_{\boldsymbol{\lambda}} \min_{\mathbf{x}, \mathbf{z}} \frac{1}{2}\|\mathbf{Tx} - \mathbf{b}\|^2 + g(\mathbf{z}) + \frac{\rho}{2}\|\mathbf{x} - \mathbf{z}\|^2 + \boldsymbol{\lambda}^\mathsf{T} (\mathbf{x} - \mathbf{z}),\]

where \(\rho\) is a penalty parameter for infeasible solutions (i.e. solutions where \(\mathbf{x} \neq \mathbf{z}\)).
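
To make the scheme concrete, here is a minimal NumPy sketch (not the MatCoupLy implementation) of scaled-form ADMM for the problem above with \(g\) chosen as the non-negativity indicator; the scaled dual variable is \(\mathbf{u} = \boldsymbol{\lambda}/\rho\):

    import numpy as np

    def admm_nonnegative_least_squares(T, b, rho=1.0, n_iter=100):
        """Minimal ADMM sketch for min_x 0.5||Tx - b||^2 + g(z) s.t. x = z,
        where g is the non-negativity indicator."""
        n = T.shape[1]
        z = np.zeros(n)
        u = np.zeros(n)  # scaled dual variable, u = lambda / rho
        lhs = T.T @ T + rho * np.eye(n)  # fixed system matrix for the x-update
        for _ in range(n_iter):
            x = np.linalg.solve(lhs, T.T @ b + rho * (z - u))  # x-update (least squares)
            z = np.maximum(x + u, 0)                           # z-update (prox of g)
            u = u + x - z                                      # dual update
        feasibility_gap = np.linalg.norm(x - z) / np.linalg.norm(x)
        return x, z, feasibility_gap

If the loop is stopped too early, \(\mathbf{x}\) may not yet satisfy the constraint encoded by \(g\), even though \(\mathbf{z}\) always does; this is exactly what the feasibility gap measures.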

An important property for assessing validity of models fitted with AO-ADMM is therefore the feasibility gap, given by

\[\frac{\|\mathbf{x} - \mathbf{z}\|}{\|\mathbf{x}\|}.\]

If this is high, then the solution is infeasible, and the model is likely not valid.

Note

A sufficiently small feasibility gap is part of the stopping criteria, so if the AO-ADMM procedure stopped before the maximum number of iterations was reached, then the feasibility gaps are sufficiently small.

Penalty-functions

We separate the penalty functions into three categories: row-wise penalties, matrix-wise penalties and multi-matrix penalties:

  • Multi-matrix penalties are penalties that penalise behaviour across multiple \(\mathbf{B}^{(i)}\)-matrices at once (e.g. the PARAFAC2 constraint: matcouply.penalties.Parafac2()).

  • Matrix-wise penalties are penalties that penalise full matrices (or columns of full matrices) at once (e.g. total variation regularization: matcouply.penalties.TotalVariationPenalty()) and can be applied either to the \(\mathbf{B}^{(i)}\)-matrices or to the \(\mathbf{C}\)-matrix.

  • Finally, row-wise penalties are penalties that penalise single rows (or elements) of a matrix at a time (e.g. non-negativity: matcouply.penalties.NonNegativity()). These penalties can be applied to any factor matrix; see the sketch after this list.
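
The practical consequence of this categorization is easiest to see for row-wise penalties: their proximal operator decomposes over rows, so applying it to a whole matrix gives the same result as applying it to each row separately. A small NumPy sketch (illustrative code, not MatCoupLy internals):

    import numpy as np

    def prox_nonnegativity(v, rho):
        # Row-wise (in fact elementwise) prox: projection onto the non-negative orthant.
        return np.maximum(v, 0)

    rng = np.random.default_rng(0)
    B_i = rng.standard_normal((5, 3))

    whole_matrix = prox_nonnegativity(B_i, rho=1.0)
    row_by_row = np.stack([prox_nonnegativity(row, rho=1.0) for row in B_i])
    assert np.allclose(whole_matrix, row_by_row)

Multi-matrix penalties, by contrast, must see all \(\mathbf{B}^{(i)}\)-matrices at once and cannot be split up in this way.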

Note

We can also apply matrix-wise penalties on \(\mathbf{A}\) and special multi-matrix penalties that require a constant feasibility penalty for all \(\mathbf{B}^{(i)}\)-matrices by using the constant_feasibility_penalty=True argument. There are currently no multi-matrix penalties that require a constant feasibility penalty in MatCoupLy. An example of such a penalty could be a similarity-based penalty across the different \(\mathbf{B}^{(i)}\)-matrices.

Stopping conditions

The AO-ADMM procedure has two kinds of stopping conditions: the ADMM stopping conditions (inner loop), used to determine whether the regularized least squares subproblems have converged, and the AO-ADMM stopping conditions (outer loop), used to determine whether the full fitting procedure should end.

Inner loop (ADMM):

The ADMM stopping conditions are disabled by default, and all inner iterations are run without checking for convergence. The reason is that, for a large portion of the iterations, the ADMM iterations will not converge, and checking the stopping conditions may become a bottleneck. If they are enabled, then the following two conditions must both be satisfied

\[\frac{\|\mathbf{x}^{(t, q)} - \mathbf{z}^{(t, q)}\|}{\|\mathbf{x}^{(t, q)}\|} < \epsilon_{\text{inner}},\]
\[\frac{\|\mathbf{z}^{(t, q)} - \mathbf{z}^{(t, q-1)}\|}{\|\mathbf{z}^{(t, q)}\|} < \epsilon_{\text{inner}},\]

where \(\mathbf{x}^{(t, q)}\) is the variable whose linear system we solve (i.e. \(\mathbf{A}, \{\mathbf{B}^{(i)}\}_{i=1}^I\) or \(\mathbf{C}\)), \(\mathbf{z}^{(t, q)}\) is the corresponding auxiliary variable, and \(t\) and \(q\) represent the outer and inner iteration number, respectively.
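
A NumPy sketch of these two checks for a single mode (the function name and tolerance value are illustrative):

    import numpy as np

    def inner_stopping_conditions_satisfied(x, z, z_old, tol_inner=1e-5):
        # Relative primal residual and relative change of the auxiliary variable.
        primal_ok = np.linalg.norm(x - z) / np.linalg.norm(x) < tol_inner
        dual_ok = np.linalg.norm(z - z_old) / np.linalg.norm(z) < tol_inner
        return primal_ok and dual_ok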

Outer loop (AO-ADMM):

For the outer (AO-ADMM) loop, the stopping conditions are enabled by default and consist of two parts that must both be satisfied: the loss condition and the feasibility conditions.

The loss condition states that either an absolute loss value condition or a relative loss decrease condition should be satisfied. These conditions are given by:

\[f(\mathbf{M}^{(t)}) + g(\mathbf{M}^{(t)}) < \epsilon_{\text{abs}},\]

and

\[\frac{|f(\mathbf{M}^{(t-1)}) - f(\mathbf{M}^{(t)}) + g(\mathbf{M}^{(t-1)}) - g(\mathbf{M}^{(t)})|} {f(\mathbf{M}^{(t-1)}) + g(\mathbf{M}^{(t-1)})} < \epsilon_{\text{rel}},\]

where \(f\) is the relative sum of squared error and \(g\) is the sum of all regularization functions. \(\mathbf{M}^{(t)}\) represents the full decomposition after \(t\) outer iterations.

The feasibility conditions must also be satisfied for stopping the AO-ADMM algorithm, and they are of the form

\[\frac{\|\mathbf{x}^{(t)} - \mathbf{z}^{(t)}\|}{\|\mathbf{x}^{(t)}\|} \leq \epsilon_{\text{feasibility}},\]

where \(\mathbf{x}^{(t)}\) represents \(\mathbf{A}, \{\mathbf{B}^{(i)}\}_{i=1}^I\) or \(\mathbf{C}\) after \(t\) outer iterations and \(\mathbf{z}^{(t)}\) represents a corresponding auxiliary variable after \(t\) outer iterations. The feasibility conditions must be satisfied for all auxiliary variables in all modes before the outer loop stops.
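
A NumPy sketch that combines the loss and feasibility conditions (the function name and default tolerances are illustrative, not MatCoupLy defaults):

    import numpy as np

    def outer_stopping_conditions_satisfied(
        loss, previous_loss, factors_and_aux,
        tol_abs=1e-10, tol_rel=1e-8, tol_feasibility=1e-4,
    ):
        """loss and previous_loss are f + g at the current and previous outer
        iteration; factors_and_aux is a list of (factor, auxiliary variable) pairs,
        one pair per auxiliary variable in every mode."""
        # Loss condition: absolute loss value OR relative loss decrease.
        loss_ok = loss < tol_abs or abs(previous_loss - loss) / previous_loss < tol_rel
        # Feasibility conditions: every auxiliary variable must match its factor.
        feasibility_ok = all(
            np.linalg.norm(factor - aux) / np.linalg.norm(factor) <= tol_feasibility
            for factor, aux in factors_and_aux
        )
        return loss_ok and feasibility_ok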