A Water-filling Algorithm Maximizing the Volume of Submatrices Above the Rank

In this paper, we propose an algorithm to extract, from a given rectangular matrix, a submatrix with maximum volume, whose number of extracted columns is greater than the initial matrix rank. This problem arises in compression and summarization of databases, recommender systems, learning, numerical analysis or applied linear algebra. We use a continuous relaxation of the maximum volume matrix extraction problem, which admits a simple and closed form solution: the nonzero singular values of the extracted matrix must be equal. The proposed algorithm extracts matrices with singular values, which are close to be equal. It is inspired by a water-filling technique, traditionally dedicated to equalization strategies in communication channels. Simulations show that the proposed algorithm performs better than sampling methods based on determinantal point processes (DPPs) and achieves similar performance as the best known algorithm, but with a lower complexity.


I. INTRODUCTION
In this paper, we address the problem of column subset selection (CSSP) [1], [13], [17] with the following framework: the elements of a dataset of cardinal n are characterized by a real vector of d coordinates, so that the dataset can be modelized by a matrix A ∈ R d×n . Column vectors represent the elements, called items, rows represent features describing the items via a latent space isomorphic to R d . To allow the features to be comparable, we suppose that each column of A is normalized. The problem is to select m columns (d ≤ m ≤ n) of the initial data matrix A (assumed to be of rank d, much lower than n and m), thus forming a rectangular submatrix B ∈ R d×m . Moreover, the criterion used to select the submatrix is the maximization of the volume of B. This framework can be seen as a compression method which samples the population of items, while keeping all the features characteristics.
Maximizing the volume occurs in many applications: some search engines data summarization algorithms rely on extracting a small representative subset from a given dataset [3]. Sampling data requires to select items by maximizing a cost function related to a quantity of information [6]. Compression of databases can be done by selecting a subset of the initial base, keeping as much information as possible [1], [13]. In each of these applications, the choice of the subset is performed by maximizing the volume of an extracted submatrix B; the extraction can be performed on rows or columns, reflecting either a feature selection, either a sampling process, and the submatrices can be transformed into a square Gram matrix B ′ B or a covariance matrix BB ′ , in which case the volume is given by the determinant of the underlying symmetric positive semidefinite (SPSD) matrices. In numerical linear algebra, to optimize the condition number or to build a good low rank approximation of a square matrix, a popular method is to extract square submatrices by selecting same rows and columns, leading to pseudo-skeleton, cross or CUR approximations [4], [8], [9], [12]; there, the volume maximization relies on QR, LU or singular value decomposition. All methods based on determinantal point processes (DPPs) form a Gram matrix A ′ A, from which a square submatrix of maximum determinant is extracted [10]. All methods introduced above assume that the number of samples is below the rank. The problem and intuitions are different when the number of samples is above the rank. A method exists [14] but is greedy and does not really draw good intuitions on what is the best way to sample the matrix.
In this work, we propose a new algorithm relying on a water-filling technique [5], to extract maximum volume rectangular submatrices, under the assumption that the number of extracted columns is greater than the rank. Our method is one-shot, and is inspired by a continuous relaxation of the volume maximization problem, whose solution is a matrix with equal singular values. Therefore, we propose a water-filling approach [5], that tends to equalize the singular values of the submatrix. Interestingly, the achieved complexity is lower than the existing greedy algorithm. The overall complexity of the DPP algorithm is in O(n 3 ) operations [16], that of the greedy algorithm in O(nm 2 ) [14], while the proposed algorithm is in O(nd 2 ) operations, which is significantly lower, as d is assumed to be lower than n and m.
The paper is organized as follows. In section II, we formalize the several optimization problems. We present some mathematical tools to precise the links between them and deduce an objective function related to the spread of the covariance matrix spectrum. Section III presents how to solve the new problem, describes the algorithm implementation and its complexity, while section IV is dedicated to simulations and comparisons with other methods.

II. PROBLEM FORMULATION AND ANALYSIS
In the following, we fix n, d, m (d ≤ m ≤ n). We denote b i the columns of A ∈ R d×n and B ∈ R d×m , B k = (b 1 , ..., b k ) for d ≤ k ≤ m. Σ = BB ′ is the covariance matrix and G = B ′ B the Gram matrix of B. The notation B ⊂ A means that the columns of B form a subset of the columns of A. Tr is the trace operator and Sp denotes the spectrum. ||.|| p represents the p-norm for p = 1, 2.
In the rest of the paper, we further assume that the matrix B is of rank d. For such a rectangular matrix B of rank d, we define the volume, as Ben-Israël in [2], [17]: ( We can now formulate our problem, where we look for a submatrix of A that maximizes the volume (see Pb. 1).
Problem 1 (Discrete maximization of the volume).
where the equality follows from (2).
Since Pb. 1 is NP-hard [17], we consider a continuous relaxation of this problem, where B is an arbitrary matrix with normalized columns, and is not necessarily a submatrix of A.
Proof. To prove this equality, we first solve the problem defined in the RHS of (4). The Lagrangian is The differentiation of L with respect to each λ i gives Therefore, all λ i are equal, and the common value is Finally, the maximum square volume is vol(B) 2 = ( m d ) d , and is reached for the scalar matrix BB ′ = m d I d . We now show the equality in (4): • We first show max This follows from the fact that, • Note that the converse is not true: Tr(BB ′ ) = m ̸ =⇒ ∀i, ||b i || = 1. Therefore, to show that both maximum are equal, we show that the maximum value achieved in the RHS of (4) can be achieved by a matrix B with normalized columns. More precisely, we show that the optimal scalar matrix m d I d can be decomposed into the product of BB ′ , where B has normalized columns. This is illustrated in Fig. 1. To show this, we introduce the 2m reals:  (10)). We have constructed a B matrix that maximizes the volume (BB ′ = m d I d ) and whose columns are normalized.
Interpretation: the continuous relaxation Pb. 2 gives insights on how to solve the original discrete Pb. 1. More precisely, we look for a submatrix B of A such that the covariance matrix BB ′ is close to a scalar matrix. The latter constraint is however difficult to implement. Instead, we rely on the following equivalent statement which states that all eigenvalues should be equal. This leads to the minimization Pb. 3 proposed in the next section, that tends to equalize the eigenvalues of the spectrum.

III. PROPOSED ALGORITHM
The solution of Pb. 2 is a diagonal matrix with a unique eigenvalue equal to λ = m/d. This result gives the key principle of the proposed algorithm. Indeed, we seek to equalize the eigenvalues of the extracted matrix B, constraining the spectrum of BB ′ to be as narrow as possible: Problem 3 (Minimization of the spectrum's spread). Find: where A has normalized columns, λ i (BB ′ ), 1 ≤ i ≤ d are the d eigenvalues of BB ′ , and λ is the common value to reach. We now describe an algorithm able to promote equalization of the eigenvalues of BB ′ during the selection of columns, and solve Pb. 3.

A. Initialization
The initialization step consists in producing a square matrix B d of size d × d using any existing algorithm of volume maximization, for example an exhaustive search if the dimension d is low enough, or the MAXVOL algorithm [9].
Concatenating m − d columns to B d is thus equivalent to adding m − d rank one operators to Σ d . From [7], [15], if u i is a normalized eigenvector related to λ i , then, for any column b, we have that where equality occurs in the second inequality if and only if b is colinear to u i . In other words, the eigenvalues of Σ m are obtained from those of Σ d by summing the m−d contributions from all the rank one perturbations: where Interestingly, a continuous relaxation of Pb. 3 admits a closed form solution. Indeed, consider the problem min where the equality constraint follows (8) and and the first order conditions are obtained by differentiating L with respect to ϵ i , α and β i : The two first equations of (16) give (18) Therefore, in the continuous relaxation problem the common achieved singular value is λ − α.

C. A discrete water-filling solution
The solution (18) to the continuous relaxed problem (15) is similar to the waterfilling solution used in power allocation for multiple channel communication [5]. For the latter problem, the optimum is achieved, when the total power is evenly distributed among each channel. Similarly, in our problem, the goal is to add m columns to the d × d B d matrix such that the eigenvalues of the obtained matrix are all equal to a common value λ, except for the eigenvalues λ i that are already above the threshold λ.
For the sake of computational efficiency, we propose an algorithm that allows to select m columns all at ones. To do so, we order the columns in A according to their distance to the eigenvectors u i of Σ d . More formally, Second, to equalize the spectrum, we first determine the continuous increase ϵ i for each eigenvalue with (18), where the threshold is approximated by λ − α = m d . Indeed, there are d eigenvalues, and due to the norm constraint on the columns of A and from (8), the sum of the eigenvalues is d i=1 λ i (Σ m ) = m. Then, we round ϵ i to determine how many columns should contribute to each eigenvalue: Note that by doing so, we consider that adding a vector b ∈ A i to B d modifies the eigenvalues of the new matrices into: In other words, we neglect the angle between b and u i . Then, to meet the constraint d i=1 c i = m − d, the update of c i is processed iteratively: c d is first evaluated, then c d−1 , etc. until the (m − d) columns are attributed, where we assume that the eigenvalues are ordered by decreasing order λ 1 ≥ λ 2 .... This allows to favor the smallest eigenvalues, that need to be increased. Finally, when c i (the number of vectors in A i to be selected) is set, the columns chosen are the closest ones to u i in A i . To favor the smallest eigenvalues, we first select vectors in A d , and remove the selected columns from all other sets A 1 to A d−1 . Then we proceed with u d−1 .

D. Implementation and complexity
We now detail the complexity of each step of the proposed algorithm: . The overall complexity of the algorithm is thereby O(nd 2 ) operations, with d assumed to be much lower than n. By using efficient matrix product algorithms, it is possible to lower each cubic power complexity step into 2 + δ, with δ ∈]0, 1[.
In comparison, the traditional DPP sampling method is based on a three step processes where steps 1 and 3 have the highest complexity: the initialization step is the eigendecomposition of the L-ensemble matrix L = A ′ A and its complexity is in O(n 3 ) operations. The third step is a Gram-Schmidt decomposition whose cost is O(nm 3 ) for the classical method and O(nm 2 ) for the best known algorithms [16].
The last comparison is made with the RECT MAXVOL algorithm dedicated to rectangular matrices [14]. The initialization step has complexity O(nd 2 ), then the algorithm run in O(nm 2 ) operations. For both steps, as d is assumed to be much lower than n, it is more complex than our method.

A. Achieved volume
In the following figures, we compare the performance of the proposed WaterMaxVol algorithm with known algorithms of volume maximization, the first comparison being made with a uniform random choice of columns.
A classical and efficient way to perform CSS with the objective of maximizing volume is to use DPPs. For the sake of brevity, we refer the reader to [1], [10] for a complete outline of DPPs and kernel methods.
We consider a subclass of DPPs, called L-ensembles, and well suited to our framework. Indeed it is characterized by a symmetric semi-definite matrix indexed by the elements of the dataset. In our model, this kernel matrix is the Gram matrix G = B ′ B. The DPP is the probability distribution over all subsets of the items such that: where G = B ′ B and B is the extracted matrix from A whose m column indices belong to S: B = (b i ) i∈S . But G is a n × n matrix of rank d < n, so that det(B ′ B) = 0. For (22) to define a probability distribution, the kernel function G must be modified in such a way that it becomes semi definite positive.
Here we consider two kernel functions. First, G is defined by the kernel function of a modified Gram matrix G = B ′ B + δI, where δ > 0 is a real positive number and G is then definite positive.
The second kernel is defined through an embedding into a nonlinear feature function ϕ : R d −→ R n such that the family (ϕ(b i ); i = 1, ..., n) is linearly independent. Then, the kernel matrix is defined by the Euclidean dot product for each pair of columns b i , b j . In addition to ensure linear independency, the function ϕ has to reflect the way initial column vectors contribute to the volume of the extracted matrix. So instead of the traditional basis radial function kernel, we choose a cosine kernel which is maximum when the variables are orthogonal and zero when they are colinear. Thanks to the kernel trick, We do not need to explicit the function ϕ, but just to define G ij = cos(b i , b j ).
The implementation of DPPs is made with MATLAB using the code of Alex Kulesza. Our implementation uses also the MAXVOL algorithm [9] during the initialization step. MAXVOL algorithm is available in MATLAB as part of the TT-Toolbox of Ivan Oseledets.  [14]. Red: our algorithm.  [14]. Red: our algorithm.
The algorithm performs well for the first half values of m. When m increases and comes closer to n, the possible choices of columns decrease and all methods give approximatively the same results.

B. Sprectrum equalization of the proposed algorithm
The effect of the algorithm can be visualized in Fig. 4, in the case of random Gaussian i.i.d. matrices. 100 independent samples of an d×n standard Gaussian matrix A are generated. For each dimension m varying in d, n (corresponding to one vertical line), we extract a d × m random submatrix C and draw its sprectrum in grey. We also extract a d × m submatrix B by using the WaterMaxVol algorithm and draw its spectrum in red. The narrowing of the red tube around the theoretical mean curve m/d illustrates the performances of the optimization for the first half values of m. This results remain licit for any random vectors whose distribution is either spherical symmetric, either coordinates independent.

V. CONCLUSION
In this work, we proposed an algorithm to extract a maximum volume rectangular submatrix of size greater than the dataset matrix rank. Thanks to a continuous relaxation which admits a closed form solution, we propose to equalize the singular values from the extracted matrix. The algorithm can therefore be interpreted as a water-filling technique used in telecommunications to equalize the allocated powers.
Simulations showed that the proposed algorithm outperforms DPPs methods. Moreover, the proposed algorithm, the exhaustive search, and the best known algorithm (RECT MAXVOL) have all similar performances. However, our approach achieves the maximum with a much lower complexity.