Graphical Inference in Non-Markovian Linear-Gaussian State-Space Models

State-space models (SSMs) are common tools in time-series analysis for inference and prediction. SSMs are versatile probabilistic models that allow for Bayesian inference by describing a (generally Markovian) latent process. However, the parameters of that latent process are often unknown and must be estimated. In this paper, we consider the parameter estimation in a SSM with a non-Markovian linear-Gaussian latent process. This process is described as a vector auto-regressive with p unknown matrices. Our algorithm LaGrangEM estimates these matrices through an expectation-maximization algorithm that exploits a graphical interpretation of the latent process in order to define prior knowledge about the unknown parameters. We connect the new algorithm with existing approaches such as Granger causality and graphical inference in SSMs. We discuss the strong potential of the algorithm to bring interpretability, e.g., in estimating causal relationships and their delays. The numerical experiments also show a superiority in performance.


INTRODUCTION
Dealing with multi-variate time series is central in countless discrete-time signal processing applications [1].State-space models (SSMs) are probabilistic models that have shown a superior predictive performance while incorporating uncertainty quantification.Unlike other popular time-series approaches, such as vector auto-regressive models, SSMs consider a latent state that evolves over time and that explains the observed time series [2].The linear-Gaussian SSM (LG-SSM) is arguably the most celebrated model, since it allows for exact inference when all model parameters are known [3,4].In practice, these parameters are rarely available and must be estimated [5].Existing works based on graphical models time-series include [6][7][8][9], with applications in biology [10,11], networks [12], and neuroscience [13].Recent works have dealt with parameter estimation in Markovian SSMs through a graphical perspective, either point-wise [14][15][16] or fully probabilistic [17,18].
In this paper, we consider a LG-SSM where the latent process follows an order-p vector auto-regressive process, VAR(p), generalizing standard LG-SSMs where p = 1.The p unknown state matrices, which govern the state dynamics, are estimated by our novel expectation-maximization algorithm, LaGrangEM.The E-step relies on a RTS algorithm in an extended space, while the M-step implements a modern nonconvex optimization algorithm.This allows us to promote interesting properties in the sought matrices [19], e.g., sparsity.LaGrangEM can incorporate additional structure into the problem, e.g., that each dimension in the latent process is affected by at maximum another dimension, or that each dimension can affect to each other through at maximum one specific (unknown) lag.These two examples highlight that LaGrangEM provides an enhanced interpretability with close connections with graphical approaches and Granger causality methods [11,[20][21][22].Moreover, the non-Markovianity in the latent process increases the dimension of the RTS algorithm by a factor of p + 1, which is a known limitation of these models.Thus, another advantage is that LaGrangEM can decrease in many cases the computational complexity in the RTS algorithm due to the structure and sparsity.
The rest of the paper is structured as follows.In Section 2, we provide background material.The novel LaGrangEM algorithm is introduced in Section 3. We provide numerical experiments in Section 4 and a conclusion in Section 5.

Linear non-Markovian state-space model
We consider an SSM with non-Markovian auto-regressive latent process of order p ≥ 1, denoted as AR(p).For every time k = 1, . . ., K, the evolving hidden state x k ∈ R Nx , N x ≥ 1, is linked to the observation y k ∈ R Ny , N y ≥ 1: Matrices A i ∈ R Nx×Nx , for i ∈ {1, . . ., p} describe the state transition for each lag i, H ∈ R Ny×Nx is the observation matrix, {q k } K k=1 ∼ N (0, Q) is the i.i.d.state noise process with Q ∈ R Nx×Nx symmetric definite positive (SDP), and {r k } K k=1 ∼ N (0, R) is the i.i.d.observation noise process with SDP R ∈ R Ny×Ny .The state process is initialized as x i ∼ N (x i ; µ 0 , P 0 ) for i ∈ {1−p, . . ., 0}, with known µ 0 ∈ R Nx and SDP P 0 ∈ R Nx×Nx .In the remaining of the paper, we use the short notation A = [A 1 , . . ., A p ] ∈ R Nx×pNx , i.e., the rowwise concatenation of the state transition matrices.

Inference in non-Markovian LG-SSMs with known parameters
A standard approach for filtering and smoothing in (1) consists on stacking (columnwise) the p consecutive states into and performing Kalman filtering (KF) and Rauch-Tung-Striebel (RTS) smoother on the equivalent SSM where we define qk ∼ N (0, Q), and r k ∼ N (0, R) [23].

EM approach for parameter estimation
Running the the KF and RTS on model ( 2) requires knowing all parameters.The parameter estimation in LG-SSMs is often done via (a) gradient-based algorithms, using sensitivity equations or Fisher's identity to efficiently compute derivatives of the likelihood function, allowing iterative optimizers like quasi-Newton or Newton-Raphson to obtain the ML estimate [24,25], or (b) EM-based algorithms, which optimize an iteratively improved bound on the marginal likelihood.See [4,26] for more details.
In [15], we introduced GraphEM, an EM method for estimating matrix A in the case of a Markovian LG-SSM (i.e., p = 1).GraphEM incorporated a suitable prior on A following an original perspective relating the transition matrix to the adjacency matrix of a directed graph.In this work, we share a similar goal as in GraphEM, which unfortunately cannot be used since matrix Q in the extended space is singular, and the derivations of the E-step do not hold (we re-derive it in the next section).Also, we take a different graphical interpretation on the non-Markovian model which requires different priors and post-processing.

Graphical non-Markovian modeling
In this paper, we take a graphical modeling approach where the matrices A i encode i-lagged dependencies among state dimensions.In particular, for any lag i ∈ {1, . . ., p}, a non- This can be interpreted as the discovery of conditional Granger causality [21] in the state space up to a p auto-regressive order.Thus, the temporal relations among the dimensions of the multivariate unobserved state process can be represented as a graphical model of p (possibly reflexive) directed graphs.Each graph is composed of N x vertices, and its weights are the values of matrix A ⊤ i , for each lag i ∈ {1, . . ., p}.Toy example.Figure 1 displays the latent state relations in the first equation of model (1), with N x = 3, p = 2, and i.e., with sparse matrices, which is key for interpretability.
The left-hand-side in Figure 1 represents the probabilistic model of the non-Markovian SSM, with the 1-lag-ahead and 2-lag-ahead interactions depicted in blue and magenta, respectively.The right-hand-side plot shows the superposition of the p = 2 directed graphs.This example satisfies having (at maximum) one lag interaction per pair of dimensions, thus the model can be represented by a (coloured) directed graph that captures the relations in all p lags among dimensions.
x k−2 (1) x k−2 (3) x k−1 (2) x k (2) Fig. 1: Summary representation of the proposed graphical model, for the toy example.The edges in blue (resp.magenta) are defined as non-zero entries of A ⊤ 1 (resp.A ⊤ 2 ).Edge thickness is proportional to the absolute entries of these matrices.Left: Dynamic state evolution.Right: Graphs associated to matrices A1 and A2.
Connection with existing algorithms.LaGrangEM can be seen as an SSM with a Granger causality model in its latent process, which enhances interpretability due to the promoted structure and sparsity.Another advantage of LaGrangEM is that it keeps its probabilistic nature in both state and observation models, and could be easily extended to a fully Bayesian approach (e.g., as in SpaRJ [18]).Unlike in Granger-based fitting methods, LaGrangEM allows to promote structure in the latent process by incorporating prior knowledge.Moreover, LaGrangEM can be seen as a generalization of GraphEM [14,15], where the latent process follows a generic AR(p), instead of p = 1 in GraphEM.This requires re-deriving the E-step.

LaGrangEM algorithm
We now introduce LaGrangEM, an EM-based algorithm to estimate the maximum a posteriori (MAP) of A in model 1.This is equivalent to minimizing L(A) ≜ L 0 (A) + L 1:K (A) with L 0 (A) ≜ − log p(A) and L 1:K (A) ≜ − log p(y 1:K |A).Due to the intricate structure of the model in (1), the function L 1:K takes a recursive form that makes its minimization not straightforward (see discussion in [2,15,27] for the Markovian case).
The final algorithm is summarized in Alg. 1.Given an initial estimate A (0) ∈ R Nx×pNx , the EM algorithm alternates at each iteration i ∈ N between (a) the majorization step which builds Q(A; A (i−1) ) as in Eq. ( 4), i.e., an upper bound on the neg-log-likelihood function (E-step), and (b) the minimization of this upper bound (M-step), which finds the new estimate as in (5).By construction, the sequence (A (i) ) i∈N decreases monotonically the MAP loss L, and convergence guarantees can be obtained under suitable assumptions [28].
Post-processing.Sophisticated constraints, such as the single lag assumption, described in Sec.3.1, or the DAG constraint [29], can also be desirable.As these constraints are non-convex and non-continuous [30], they are not straightforward to be integrated inside the EM framework.Instead, we propose a post-processing step after the EM reached convergence, to project the estimated A on the desired constrained set.The projection A = [A ′ 1 , . . ., A ′ p ] ∈ R Nx×pNx over the single lag constraint of an A ∈ R Nx×pNx reads as with o(n, ℓ) = argmax i∈1,...,p |A i (n, ℓ)| (with the convention o(n, ℓ) = 0 if A i (n, ℓ) = 0 for all i).The matrix O = (o(n, ℓ)) 1≤n,ℓ≤Nx defines the lag of the AR(p) model for the interaction (if any) of each pair of dimensions, improving the interpretability compared to Granger [20] or even conditional Granger models [21].
Prior on A. We are interested in priors on A that favor graphical interpretation (e.g.,, sparse A).Moreover, our algorithm restricts to log-concave prior p(A).As an example, we propose L 0 (A) = γ p i=1 ∥A i ∥ 1 , where γ ≥ 0 is the weight for the ℓ 1 term.Since L 0 is a convex, lower-semicontinuous, and proper function ( [31]), the existence of an explicit proximity operator ( [32]) leads to a simple recursion in our M-step.
Run KF and RTS smoother in extended space using Ǎ(i−1) .Deduce (m s k:k−p , P s k:k−p ) K k=1 using [2, Chap.8], and define with . M-step: Use DR solver to solve Stopping condition: ∥ F , stop the recursion.

Derivations of LaGrangEM
E-step.For every A ∈ R Nx×pNx and i ∈ N, the EM bound is defined as (see for instance [2,28,33]) where ct /A does not depend on A, such that equality holds at A = A (i−1) .Plugging Eq. ( 1), for every A ∈ R Nx×pNx , log p(x 1−p:K , y 1: We introduce the compact notations Using ( 6)- (7), and marginalizing part of the variables, For every k ∈ {1, . . ., K}, p(x k:k−p |y 1:K , A (i−1) ) follows a multivariate Gaussian distribution with mean m s k:k−p and covariance P s k:k−p , obtained by running KF/RTS algorithm with the extended transition matrix Hence, we can re-express (8) as  The results are summarized in Tab. 1, averaged over 20 noise realizations in Eq. ( 1).LaGrangEM obtains a significant improvement in all cases, with CGC being a good competitor in two datasets.The sparsity promoting term in La-GrangEM especially shows its benefits for the edge detection task.Thus, LaGrangEM obtains excellent performance in all metrics.Fig. 2 illustrates this behavior, on the retrieval of WeathN5 A graph, showing the good performance of the proposed approach, in terms of edge retrieval (few spurious edges), edge weight estimation (accurate edge weights), and lag estimation (correct edge colors).

CONCLUSION
In this paper, we have addressed the problem of estimating transition matrices in LG-SSMs where the latent process follows a VAR(p) model.Moreover, we have adopted a graphical approach that interprets those parameters as p directed graphs, connecting to (conditional) Granger causality.Our derived EM-based algorithm allows us to introduce prior knowledge that promotes interesting properties such as sparsity, enhancing interpretability.Moreover, the sparsity can allow for computationally efficient RTS smoothing in the extended space that will be studied in the future.The numerical examples evidence an excellent performance on three different tasks, showing also a great potential of our approach.

Fig. 2 :
Fig. 2: From left to right: True graph, MLEM, CGC, and La-GrangEM estimates for WeathN5 A dataset.Edge color represents the lag for each edge.Self-loops are omitted for readibility.

Table 1 :
1)k Results for proposed method and competitors.Mean (standard deviation) computed over 20 realizations.