Multivariate expectile-based distribution: properties, Bayesian inference, and applications

Expectiles form a family of risk measures that have recently gained interest over the more common value-at-risk or return levels, primarily due to their capability to be determined by the probabilities of tail values and magnitudes of realisations at once. However, a prevalent and ongoing challenge of expectile inference is the problem of uncertainty quantification, which is especially critical in sensitive applications, such as in medical, environmental, or engineering tasks. We address this issue by developing a novel distribution, termed the multivariate expectile-based distribution (MED), that possesses an expectile as a closed-form parameter. Desirable properties of the distribution, such as log-concavity, make it an excellent fitting distribution in multivariate applications. Maximum likelihood estimation and Bayesian inference algorithms are described. Simulated examples and applications to expec-tile and mode estimation illustrate the usefulness of the MED for uncertainty quantification.


Introduction
Risk assessment mainly relies on drawing inference from quantiles, often referred to as valueat-risk in finance or return levels in environmental science.A shortcoming of quantiles is that they only use the information on the frequency of events and not on the actual magnitudes.This 1 is an issue since taking account of the magnitudes of extreme events may be crucial in many fields of application.One way to tackle this problem is to work with expectiles, introduced in Newey and Powell (1987) since, unlike quantiles, expectiles depend on both the probability of tail values and the magnitude of realisations (Kuan et al., 2009).
In univariate settings, expectiles are defined as the minimizers of an asymmetric quadratic loss, and as such, they benefit from a number of desirable properties.They are well-defined and unique whenever the underlying distribution has a finite first moment; see for instance Abdous and Remillard (1995).Furthermore, they induce the only coherent, law-invariant, and elicitable risk measure (see Ziegel, 2016), and therefore benefit from a natural backtesting methodology.Expectiles are therefore a sensible risk management tool, as a complement or an alternative to quantiles.
In the last five years, there has been an increasing interest in estimating expectiles, both at fixed probability levels (see e.g.Holzmann and Klar, 2016;Krätschmer and Zähle, 2017) as well as at extreme levels (Daouia et al., 2018;Girard et al., 2021).Another recent line of work garnering increasing interest in mathematical finance consists of extending the expectile risk measure to the multivariate setting: see, for example, Maume-Deschamps et al. (2017, 2018); Herrmann et al. (2018); Daouia and Paindaveine (2019); Beck et al. (2021), among others.
Here, we focus on the notion of multivariate geometric expectiles, as introduced by Herrmann et al. (2018), which is the analog of the multivariate geometric notion of quantiles proposed by Chaudhuri (1996).

Our contribution
We construct a multivariate distribution (Section 2), which is based on the loss function used for defining multivariate geometric expectiles (Herrmann et al., 2018).This distribution, termed the multivariate expectile-based distribution (MED), is parameterized by a location vector, scale matrix, asymmetry level, and direction.We argue that inference employing the MED offers a practical way to handle uncertainty quantification in expectile estimation.Our specific contributions are summarized below.
We first prove a number of appealing properties of the MED: it is log-concave, unimodal, and the unique mode is an expectile whose parameters (asymmetry level and direction) are specified as closed-form parameters of the distribution.In doing so, we propose an elementary proof of the strict convexity of the above-mentioned loss function.We establish that the MED family is closed under linear transformation, leading to a simple standardization of the distribution.
We provide bounds on the MED density (Section 3), showing that it is sub-Gaussian.Limiting cases of the asymmetry level are studied, proving that the MED comprises the multivariate Gaussian and some asymmetric Gaussian distributions as special cases.Various closed-form moments are provided, along with marginal and conditional distributions (Section 4).
The main benefit of our proposal lies in inference.We propose a maximum likelihood approach (Section 5.1) which relies on optimization algorithms in Riemannian geometries.Consistency of the maximum likelihood estimator is established under compactness assumptions on the parameter space.We devise a Bayesian algorithm based on Hamiltonian Monte Carlo (Section 5.2).We also describe our accompanying R package (R Core Team, 2019) containing statis-tical functions related to the multivariate expectile-based distribution1 , and provide a simulation study (Section 6).
We finally showcase three multivariate applications of the MED (Section 7): Bayesian estimation of expectiles, of a distribution mode, and illustration of discriminant analysis.We show that a Bayesian approach relying on the MED yields well-calibrated credible intervals, thus addressing the uncertainty quantification shortcoming in expectile estimation.
All proofs and some additional implementation and application results are deferred to the Appendix.

Notations
We denote asymptotic equivalence between positive real functions f and g by f (y) ≈ y→a g(y), meaning that f (y)/g(y) → 1 when y → a for a ∈ [−∞, ∞].Vectors and matrices are denoted with boldface, for instance Z = (Z 1 , . . ., Z d ) denotes a vector in R d , 0 denotes the vector of zeros, and (u 1 , . . ., u d ) denotes the canonical basis of R d .We write the last d − 1 entries of Z as Z −1 , in the sense that Z = (Z 1 , Z −1 ).The sphere in dimension d is denoted by S d−1 .The set of d × d symmetric positive definite matrices is denoted by SPD(d), the identity matrix is denoted by I d .Let ⟨•, •⟩ denote the usual dot product and let ∥ • ∥ be the associated Euclidean norm.For any matrix M ∈ SPD(d), we set ⟨•, •⟩ M := ⟨•, M •⟩ and denote by ∥ • ∥ M the associated norm.The determinant of some matrix M is denoted by |M |.We denote by Γ the gamma function and by 2 F 1 the hypergeometric function.The d-dimensional Gaussian density with mean µ and covariance Σ is denoted by ϕ d ( • ; µ, Σ).The Gaussian, uniform, Wishart, and inverse-Wishart distributions are respectively denoted by N , U, W, and IW.
In the particular case where Σ = I d , this loss function has been used to define the multivariate geometric expectile (Herrmann et al., 2018) of a random vector X as arg min It has been shown that this optimisation problem is well-defined and benefits from a unique solution provided that the margins of X have finite second order moments (Proposition 4.6 of Herrmann et al., 2018).We show in Proposition A.1 in Appendix that this condition can be relaxed to assuming only a finite first-order moment.This notion can be extended without difficulty to any Σ ∈ SPD(d) by setting: (1) Indeed, existence and unicity of the expectile defined in (1) stems from Proposition A.2 in Appendix on the strict convexity of Λ ρ,ν,Σ , which extends the result of Herrmann et al. (2018, Theorem 4.3) which deals with the particular case of Λ ρ,ν,I d .A proof relying on more elementary arguments than Herrmann et al. (2018)'s proof is provided in Appendix.
where θ is a shorthand notation for µ, Σ, ν, and ρ, is defined on R d by its density for any x ∈ R d , where . (2) See Lemma A.1 in the Appendix for a proof that f d (•; µ, Σ, ν, ρ) integrates to one.A straightforward application of Proposition A.2 shows that the MED is log-concave, which implies in turn that it is unimodal, that its level sets are closed compact sets, and its marginal distributions remain log-concave (Brascamp and Lieb, 2002).Let us now show that the MED family is closed under linear transformations.Moreover, parameter ρ is invariant with respect to such transformation.The proof is straightforward.
Proposition 1 (Linear transformation).Let A be a non-singular As an immediate consequence of Proposition 1, we have: In the sequel, MED(0, I d , u 1 , ρ) is referred to as the standardized MED and, for the sake of presentation, it is denoted by MED(ρ), and the associated density and objective function are denoted respectively by columns of that matrix with the last (d − 1) basis vectors u 2 , . . ., u d provides a d × d matrix that can be transformed into a valid orthogonal matrix Q ⊤ by using the Gram-Schmidt process.
Proposition 2 (Particular cases).For X ∼ MED(µ, Σ, ν, ρ): the standardized MED; then ∆ ρ Z converges in distribution to a half-standard Gaussian restricted to negative values of the first component.
(iii) If d = 1, then X is the one-dimensional asymmetric Gaussian random variable of Kato et al. (2002).
We can interpret MED as a multivariate extension of the one-dimensional asymmetric Gaussian distribution (AGD) introduced in Kato et al. (2002).Tail properties of the one-dimensional AGD are established in Chen and Huang (2014), while this distribution is used in Fu and Zhou (2016) to build a model-based classifier.An alternative multivariate extension is applied in He et al. (2018), to the problem of image denoising.We also refer to Majumdar and Paul (2016) for an AGD-based construction of a stationary spatial noise process model, whose marginal distributions have certain (one-dimensional) expectiles equal to zero.
In the following, ρ ∈ [0, 1) is referred to as the asymmetry parameter, while ν is called the asymmetry direction.See Figure 1 for an illustration of the possible shapes obtained by varying the parameter values of ρ, ν, and Σ. Parameter µ is simply a location parameter and its impact on the distribution is not illustrated (see Proposition 3).Scale matrix Σ is a scale parameter that influences the shape of the distribution.The asymmetry direction ν clearly rotates the distribution, while the asymmetry parameter ρ stretches it in the direction opposite to ν.This counter-intuitive behaviour may be explained as follows.For a fixed (univariate) distribution, the expectile tends to +∞ as ρ → 1 (similarly to quantiles, see for instance Daouia et al. (2018)).Thus, to get a fixed expectile µ, the distribution should shift towards the negative real axis as ρ increases.
Characteristics of the location µ of the MED are investigated in the next result.
The distribution of X is unimodal and satisfies: The vector µ is both the mode and an expectile of the MED.In contrast, the expectation of the MED also depends on the asymmetry direction: it is located in the two-dimensional subspace spanned by µ and ν.Specifying the above expectation in the one-dimensional case MED(µ, σ, 1, ρ), we recover the expectation of the one-dimensional asymmetric Gaussian of Kato et al. (2002): Second moment and skewness centered with respect to µ are given by (1,1) Top panel: asymmetry direction ν = u 1 fixed, varying asymmetry parameter ρ ∈ {0, 0.7, 0.9} (row-wise) and matrices Σ (columnwise).Bottom panel: fixed Σ = I 2 , ρ = 0.7, and varying direction ν.
respectively.It can be seen from Figure 7 in Appendix that asymmetry increases with ρ, corroborating the illustration of Figure 1.The role of the asymmetry parameter is further illustrated with the next proposition which studies the variations of expectiles with respect to variations of the asymmetry parameter for a given MED.
Proposition 4 (Expectile with a different asymmetry parameter).Let ρ 1 , ρ 2 ∈ [0, 1).The expectile of parameters (ρ 2 , u 1 , I d ) for the standardized Z ∼ MED(ρ 1 ) is supported on the first axis, and satisfies the following equivalences 3 Moments, sub-Gaussianity and bounds The next Proposition provides both upper and lower bounds on the MED density based on the Gaussian density.This result will reveal itself to be useful to sample from the MED, using an acceptance-rejection method; see Section 5.3.To this end, let us consider, for any ρ ∈ [0, 1), (3) Proposition 5 (Density bounds).For any x ∈ R d , one has As a consequence, one can easily derive bounds on moments of a projected MED random vector.
Corollary 2 (Bounds on projected moments).Let X ∼ MED(µ, Σ, ν, ρ).For any non-zero β ∈ R d and p > 0, one has In the terminology of Vladimirova et al. (2020), MED random variables are thus sub-Weibull with optimal tail parameter 1/2, that is to say, they are sub-Gaussian, but no lighter than sub-Gaussian.The sub-Weibull property in a multivariate setting entails projections ⟨β, Z⟩ Σ −1 for any non-zero vector β ∈ R d to satisfy the (univariate) sub-Weibull property.
The next proposition provides the quadratic moment of the MED.
We always have the mean of a χ 2 distribution with d degrees of freedom.The next proposition provides the limiting behaviour of the expectation and quadratic moment when ρ → 1.

Marginal & conditional distributions
In this section, we provide some marginal and conditional distributions of the MED that illustrate the distribution's shape in the asymmetry direction and in the hyperplane orthogonal to it.Let us first focus on the margins of the standard MED.As alluded to in the Introduction, the log-concavity of the MED implies log-concavity of any of its marginal distributions.Two distributions are considered: the univariate margin in the asymmetry direction u 1 and the (d − 1)dimensional margin in the orthogonal direction, denoted by u ⊥ 1 .In the second case, the marginal distribution is elliptically contoured (Cambanis et al., 1981) or, more precisely, spherical when Σ = I d .
(i) The marginal density of Z 1 (in the direction u 1 ), denoted by f Z 1 , satisfies for any z 1 ∈ R: and (ii) The marginal density of Z −1 , denoted by f Z −1 , is an elliptically contoured distribution in the hyperplane orthogonal to u 1 .That is, there exists a density f ⊥ on R + such that for any and where Let us note that the bounds and the asymptotic equivalents in Proposition 8 (i) all coincide in the particular case where ρ = 0, yielding back the standard Gaussian density.When ρ ∈ (0, 1), the gap between asymptotic equivalences (4) and ( 5) shows that the univariate marginal distribution of (i) has asymmetric tails.The asymptotic equivalence of ( 6) and ( 7) shows that the (d − 1)-dimensional marginal of (ii) reduces to a centered Gaussian density on R d−1 .
We conclude this section with some one-dimensional and (d − 1)-dimensional conditional distributions.
(i) Let w ∈ R d , such that ∥w∥ = 1, and denote by P w ⊥ (Z) the projection of Z on the hyperplane orthogonal to w.The one-dimensional conditional distribution of ⟨w, Z⟩ given that P w ⊥ (Z) = 0 coincides with the one-dimensional MED with µ = 0, Σ = 1, asymmetry parameter equal to ρ |⟨w, u 1 ⟩|, and asymmetry vector sign(w 1 ), where Note that in (i) above, the considered distribution is one-dimensional, thus the asymmetry vector is either 1 or −1, which can be written as the sign of a scalar product.According to Proposition 2 (iii), the conditional distribution given in (i) also corresponds to the one-dimensional asymmetric Gaussian density introduced in Kato et al. (2002).When projecting in this way, the asymmetry parameter is reduced by a factor of |⟨w, u 1 ⟩| ≤ 1.In particular, it remains unchanged if the projection direction w is identical to u 1 , while it vanishes when w ⊥ u 1 ; in the latter case, the distribution obtained coincides with the standard Gaussian distribution.

Statistical inference
In this section, we start by describing inference approaches for fitting the MED, either based on maximum likelihood or on a Bayesian model.We conclude by explaining how the statistical functions for MLE and Bayesian inference can be used in R via the MEDdist package.
We note that the elements of θ are constrained to four distinct types of Euclidean sets, namely, is restricted to the positive definite matrix manifold, and ν ∈ S d−1 is restricted to the spherical manifold.Due to the non-separable dependence of the elements of θ and the nonlinear manner in which ℓ n is dependent on θ making a closed form solution unavailable, we have found it most convenient to compute the MLE θn = arg min via a coordinate descent algorithm, in order to simplify the optimization process by only considering one of the four domain constraints at a time.Let ( θ(t) ) ∞ t=1 be a sequence of coordinate descent algorithm iterates, where θ(t) = µ (t) , Σ (t) , ρ (t) , ν(t) denotes the tth iteration.With rem (a, b) denoting the remainder of a ∈ N divided by b ∈ N, and initializing at some θ(0) , the cyclic coordinate descent algorithm proceeds via the following rules, for each t ∈ N: • If rem (t, 4) = 0, set θ(t) = µ * , Σ (t−1) , ρ (t−1) , ν(t−1) , where , where , where • If rem (t, 4) = 3, set θ(t) = µ (t−1) , Σ (t−1) , ρ (t−1) , ν * , where

Numerical considerations
In practice, the coordinate descent algorithm is run for a sufficiently large number of iterations T ∈ N, at which point the algorithm is considered to be convergent.We then use the value θ(T) as the MLE.
The computation of ( 8) and ( 10) can be performed via typical optimization routines due to their mundane domains.However, the computation of ( 9) and ( 11) require the enforcement of matrix manifold restrictions and thus specialized optimization routines.Here, we make use of the Riemannian optimization framework of Absil et al. (2009), which is implemented via the libraries of Huang et al. (2018) and Martin et al. (2020).The availability of such Riemannian manifold optimization tools is also a motivating factor for our use of a coordinate descent algorithm for MLE.
By definition, the coordinate descent algorithm yields a decreasing sequence of objective values (ℓ n ( θ(t) )) ∞ t=1 .However, without further assumptions, it is not possible to make stronger statements regarding the behavior of the sequences ( θ(t) ) ∞ t=1 and (ℓ n ( θ(t) )) ∞ t=1 .Furthermore, due to the spherical matrix manifold restriction on ν, the typical directional derivative-based arguments for convergence of Tseng ( 2001) cannot be applied and thus we cannot guarantee convergence via application of generic optimization routines.However, if each of the sub-problems ( 8)-( 11) are solved using trust region-based algorithms, then the recent theory of Boumal et al. (2018) and Tian et al. (2019) can be applied to obtain global convergence guarantees if a random sub-problem is solved at iteration t instead of cycling through the sub-problems ( 8)-( 11), as we have proposed.However, empirically, we have found convergence to be robust to the choice of algorithms used to compute ( 8)-( 11) and to the use of cyclic coordinate descent.
Even for simple choices of algorithms to use for solving the sub-problems ( 8)-( 11), the combinatorial calculation of the computational complexity is prohibitively difficult, as can be seen via the example calculations of Boumal et al. (2018) and Huang and Gallivan (2022).However, at the very least, when implementing the coordinate descent algorithm using trust region-based solvers, we have the result that each of the sub-problems converge superlinearly Huang et al. (2015) and that the norm of the tangent vector of the sub-problem iterate after O (1/ϵ 2 tan ) iterations can be bounded from above by a constant ϵ tan > 0.Here the tangent vector in Riemannian geometry is analogous to the usual gradient in Euclidean geometry, and a tangent vector value of size zero implies a stationary point, like the typical Fermat condition.
To make more quantitative statements, we conduct a simulation study where we implement coordinate descent with sub-problems ( 8) and (10) solved using the optim function in R, where we use the Broyden-Fletcher-Goldfarb-Shanno algorithm in the case of ( 8) and the Dekker-Brent algorithm in the case of (10) (see Quarteroni et al., 2010, Chs. 6 and 7).Sub-problems ( 9) and ( 11) are solved using the manifold.optimfunction from the ManifoldOptim package (Martin et al. 2020) in R, implementing the Riemannian symmetric rank one trust region method of Huang et al. (2015).
In our study, we simulate n ∈ {50, 100, 200, 500, 1000} observations X 1 , . . ., X n from a d ∈ {2, 4, 6, 8, 10} dimensional normal distribution with randomized mean vectors and covariance matrices from normal and Wishart distributions, respectively.Here, we do not simulate from the broader class of MED distributions due to the efficiency characteristics of our current sampling algorithm for larger values of d; see   replicate data sets, from which we then run T = 80 iterations of our specified coordinate descent algorithm (so that each of the sub-problems ( 8)-( 11) is solved 20 times).The average times per iteration depending on n and d on a PC (Intel i7-9700 3.00 GHz CPU, 32 GB of RAM, running Windows 10 and R 4.1.0)are plotted in Figure 2. The computational cost of the coordinate descent algorithm appears to be linear in n and quadratic in d.Although the linear scaling in n is expected, we note that the quadratic scaling in d is surprising since the computation of ℓ n requires matrix multiplications and inversions, which are typically operations of complexity order O(d 3 ).
As such, we advise that our empirical results regarding the complexity with respect to d should be used cautiously and should not be extrapolated beyond the bounds of the simulation experiment.

Consistency of the MLE
Given that MLE θn can be obtained, we require some additional assumptions to guarantee its consistency.A minimal assumption is that the set of minimizers is contained in a compact set Θ. Given our domain specifications of θ, and following the approach Ritter (2014, Sec.B.6.2), we can always choose such a compact set to be of the form where γ ∈ (0, 1) and v 1 > 0 are sufficiently small and m, v d > 0 are sufficiently large.Here, λ (1) (Σ) and λ (d) (Σ) are the smallest and largest eigenvalues of Σ, respectively.The constraint does not need to be enforced explicitly, but for numerical stability, we found it useful to solve subproblem (10) using the domain [0, 1 − γ], with γ = 10 −6 .The next result follows by an application of Theorem 5.3 in Shapiro et al. (2021).See Appendix A.1 regarding the proof.

Bayesian inference
An interesting feature of Bayesian inference is that some expert knowledge can be incorporated in the prior distribution and that the posterior distribution can be used to provide uncertainty estimates with credible bands.This is complementing the previous section as well as the work of Herrmann et al. (2018).
We adopt the same parameterization as that of the previous section, which replaces ν in θ, by ν = Σ −1/2 ν.Instead of the covariance matrix, it is more efficient from an implementation viewpoint to work with the precision matrix Ω = Σ −1 ∈ SPD (d).A simple strategy consists in assigning independent priors on µ, Ω, ρ and ν.First, a normal-inverse-Wishart prior on (µ, K) parameters, where K is the covariance matrix parameter of the normal prior for µ: , and S K ∈ SPD (d) can be fixed or be given hyperpriors.Second, a Wishart prior is placed on Ω: where ν Ω ∈ (d − 1, ∞) and S Ω is symmetric and positive definite, and can be fixed or can be given hyperpriors.And finally, uniform priors are placed on ρ and ν: ρ ∼ U(0, 1), ν ∼ U(S d−1 ).
Note that in the presence of prior information, more informative priors than the ones presented here could be considered in order to encode such prior knowledge.
Finally, the data generating process for observations X 1 , . . ., X n is the MED: where φ represents potential hyperparameters described above.We have implemented the model in the Stan probabilistic programming language (Carpenter et al., 2017), which uses Hamiltonian Monte Carlo.See Appendix B.2 for implementation details.

The MEDdist R package
We describe an R package termed MEDdist 2 containing the MLE and Bayesian procedures introduced above.Density, distribution function, quantile function and random generation for the MED are also documented.
• dmed: returns the value of the p.d.f. of the MED with parameters µ, Σ, ρ and ν (inputs) at point x ∈ R d (input) using Equation (2).
• pmed: returns the value of the c.d.f.(only in dimension d = 1) of the MED with parameters µ, Σ, ρ and ν (inputs) at point x ∈ R (input).
• rmed: returns a sample of size n (input) from the MED with parameters µ, Σ, ρ and ν (inputs), see below.
• bayexpectile and baymode: return posterior mean estimators for multivariate expectile and mode, respectively, under a sample from the MED.These functions, containing Stan codes including prior specification, are provided in Appendix B.2, and detailed in Section 7.
Sampling from the distribution is done by using the bounds of Section 3. Indeed, the upper bound of Equation3 and Proposition 5 leads to a simple rejection algorithm, where the proposal distribution is multivariate Gaussian.The acceptance probability, M d (ρ) −1 , deteriorates with ρ and d as follows (see also Figure 3): • polynomially in ρ as ρ → 1 (for fixed d): 2 , for some constant a; • exponentially in d as d → ∞ (for fixed ρ): , for some constants (in ρ) A(ρ) and B(ρ), where B(ρ) < 1.Some potential refinements for the proposal distribution taking the form of mixtures of truncated multivariate Gaussian distributions are not investigated here.

MLE simulation study
To assess the performance of our procedure for computing the MLE, as well as the quality of the MLEs themselves, we have conducted a set of simulation studies in R. Within these studies, we have implemented the coordinate descent algorithm that is described in Section 5.1 using the same choices of sub-problem solvers as in Section 5.1.1.
From Tables 1 and 2, we observe that in every case, other than when ρ = 0.2 and n ∈ {50, 100, 200}, the biases of the MLEs are all within a standard deviation of the samples.Furthermore, we observe that in all cases of ρ and Σ, the sample standard deviations of the estimate samples are decreasing in n.These two initial observations provide empirical verification of the statistical consistency and numerical accuracy of the MLE estimator and the coordinate descent algorithm for its computation.
However, the estimates of ρ appear to exhibit a noticeable bias when ρ = 0.2.By inspecting the results in the ρ = 0.5 and ρ = 0.8 cases, it appears that the biases and standard deviations of the estimates of ρ are decreasing as ρ increases.Furthermore, we see that although ρ does not seem to affect the bias of other parameters, it does appear that the the standard deviations of the ϕ estimates are larger when ρ is small, and decreasing as ρ increases.Notably, there does not appear to be any other effects due to the generative value of ρ with respect to the biases and standard deviations of the samples of estimates of µ and Σ.In this section, we estimate geometric expectiles in the Bayesian framework described in Section 5.2.We adopt the bivariate simulation experiments of Herrmann et al. (2018).More specifically, we consider four bivariate random variables X (1) , . . ., X (4) (See Table 6 in the Appendix for a description of the data distributions).
We sampled n = 1, 000 observations for each of the four distribution settings.For each value of ϕ in the grid {ϕ ℓ = 2π(ℓ − 1)/100, ℓ = 1, . . ., 100}, we used Stan (code described in Appendix B.2) with 2, 000 HMC iterations after a 1, 000-iteration burn-in.Stan provides a R value of 1 for all of the chains, thus not indicating convergence issues.Expectiles were estimated by the posterior mean.For each value of ϕ, credible bands could be computed as those regions containing a prespecified proportion of iterations from the HMC chain (see, for instance, Section 7.2).We have followed a different approach in order to better visualize the credible regions as bands around the estimated expectiles.More specifically, we used multivariate quantiles (in the sense of Chaudhuri, 1996) of orders 2.5% and 97.5%, to get pointwise credible bounds for each value of ϕ ℓ , and then connected them with a curve visible as blue dashed lines on Figure 4. Expectiles were also estimated using the maximum likelihood approach of Herrmann et al. (2018).For each of the four distribution settings, Figure 4 displays contour plots of the true distribution, and the Bayesian and MLE estimates in solid blue and red lines, respectively.Eight values for ϕ defined as φ j = jπ/8, j = 0, . . ., 7, are marked along the expectile curves.
We can see from Figure 4 that Bayesian estimates of the expectiles are almost indistinguishable from MLE estimates.The main asset of our Bayesian approach is the availability of uncertainty quantification of these estimates.

Bayesian estimation of the mode of a distribution
We take advantage of the property that the MED mode coincides with its location parameter µ (Proposition 3 (i)) to propose a procedure for estimating the mode of a distribution.
Consider the four distribution settings of the previous section.For each of them, we generated n observations, n ∈ {100; 500; 1, 000}.We used the Bayesian model of Section 5.2 implemented in Stan (code described in Appendix B.2) with 2, 000 HMC iterations after a 1, 000-iteration burnin.Stan provides a R value of 1 for all of the chains, thus non indicating any convergence issue.As a competitor to the MED, we used a multivariate Gaussian distribution as a fitting distribution.The Gaussian distribution can be simply obtained from the MED by fixing the ρ parameter to 0, see Proposition 2 (i) (this automatically makes the ν parameter deprecated).The Gaussian distribution also satisfies the property that the mode is available in closed-form (as its mean).For both cases, expectiles were estimated by the posterior mean.In contrast to Section 7.1, 95%  credible regions were provided by sets containing 95% of the 2, 000 HMC iterations.These were constructed by using the dataEllipse R function from package car, which provides the smallest ellipse enveloping the desired proportion of points (95%).The whole procedure was repeated 100 times, leading to Table 3 which displays the mean squared error of the estimates over the 100 repetitions (MED / Gaussian distribution).
Figure 5 represents the results for one of the 100 repetitions obtained with n = 1, 000 data points.Each panel corresponds to one of the four distributions, the red cross indicates the true mode, and the blue (resp.green) crosses the estimate from the MED (resp.Gaussian distribution).The ellipses correspond to 95% credible intervals.The results are in accordance with those of Table 3, illustrating that the approach based on the MED provides better calibrated credible intervals than those based on the Gaussian distribution.

Discriminant analysis
We propose a discriminant analysis on real data, by considering the Banknote authentication data set, available at this url3 .The data of size n = 1, 372 contain four covariates (X 1 = variance, X 2 = skewness, X 3 = kurtosis and X 4 = entropy of wavelet transformed banknote images), and a binary variable indicating whether the banknote is genuine (0) or forged (1).As illustrated in Figure 6, the genuine (blue) vs forged (red) balance in the data set is 56% vs 44%.
We randomly split the data set into a training sample of size n = 1, 000 and a testing sample of size 372, and repeat 100 times the experiment.For each experiment, we fit the genuine and forged classes of banknotes with a Gaussian, Skew normal (SN ) distribution (Azzalini and Cap- Covariates (X 1 , . . ., X 4 ) in blue (genuine) and in red (forged) as bivariate scatterplots.Expectile µ estimated with the MED is reported as "×" (genuine) and "+" (forged); direction ν is represented as an arrow originating from the estimated expectile.respectively).Estimated expectiles µ and asymmetry directions ν are also reported in Figure 6.We focus on testing the membership of a banknote to the genuine class by using a likelihoodratio test (with significance level α = 0.01).More precisely, for each banknote x i in the test sample (i = 1, . . ., 372), we compute and the banknote is considered genuine if λ i > c α , and forged otherwise.Note that c α is the αquantile of λ, computed with the genuine banknotes of the training sample.The same approach is adopted for all three distributions.The results (summarized in Table 5) show that the MED has the lowest median error (only 0.269%), while the Skew normal distribution fails to improve the performance of the Gaussian distribution.The mean error, not reported, is also slightly lower for the MED, underlying its flexibility in this application (the MED has lower AIC and BIC values for the guenine class).Note that all the errors of the MED (and the Gaussian distribution) are type I errors, i.e. genuine banknotes wrongly considered as forged (some rare cases of type II errors are obtained with the Skew normal distribution).In comparison with the other distributions, the MED has the advantage of never considering a forged banknote as genuine (like the Skew normal distribution sometimes does), and rarely considers a genuine banknote as forged (less frequently than the Gaussian distribution).

A Appendix: Proofs
A.1 Proof of main results

Proof of results from Section 2
Remarking that the optimization problems are equivalent, the following Proposition provides a sufficient condition for the well-posedness of (1).
Proof of Proposition A.1.Let us consider the expansion and consider the three terms separately: by repeated uses of Cauchy-Schwarz and triangle inequalities.The conclusion follows.
Proposition A.2 (Strict convexity of Λ ρ,ν,Σ ).For any ρ ∈ [0, 1), for any matrix Σ ∈ SPD(d) Proof of Proposition A.2.Note that convexity is a property that is preserved by translation and rotation, so that it suffices to prove the result for the standardized MED.Denoting by x 1 = ⟨x, u 1 ⟩ the first component of x, and by x −1 the other d − 1 entries, we have where function Λρ : R 2 → R + , (x, y) → x 2 +y 2 +ρx x 2 + y 2 coincides with Λ ρ in the bivariate case (d = 2).Note that it is strictly convex as the Hessian determinant and trace are both positive, equal respectively to 3(ρx/∥x∥+1) 2 +1−ρ 2 and 3ρx/∥x∥+4, with x = (x, y).Also, the function of the second component y → Λρ (x, y) has partial derivative y → y(ρx/∥x∥ + 1), meaning that it is non-decreasing on R + for any x ∈ R. As a result, we have that for any λ ∈ (0, 1), and for any x, y ∈ R d , where both equalities (a) are by ( 13), (b) is by combining the above non-decreasing property of y ∈ R + → Λρ (x 1 , y) with the convexity of the Euclidean norm ∥ • ∥, and (c) is by the strict convexity of Λ.This concludes the proof.
Proof of Proposition 2. Part (i) is obvious.For part (ii), the standard MED random variable Z = (Z 1 , Z −1 ) has density The density of the transformed random variable This corresponds to a half-standard Gaussian restricted to negative values of the first component.From Proposition A.2, Λ ρ,ν,Σ is strictly convex on R d and therefore the mode is unique.Moreover, Λ ρ,ν,Σ (0) = 0 and Λ ρ,ν,Σ (y) ≥ 0, for any y ∈ R d , in view of Cauchy-Schwarz inequality.It follows that arg min and the first result is proved.
(iii) The proof is similar to that of Lemma A.1 in Appendix A.2. Recalling that Q is an orthogonal matrix such that QΣ −1/2 ν = u 1 , one can write: Inspired by the proof of Lemma A.1, changes of variables lead to: (1 + ρ cos(ψ)) dψ.
Proof of Proposition 4. Let X ∼ MED(ρ 1 ).By definition, e ρ 2 (X) is the solution of a slightly modified Equation (1) involving two asymmetry parameters ρ 1 and ρ 2 : where expectation is taken with respect to the MED(ρ 1 ).In other words, the expectation is taken with respect to X ∼ MED(ρ 1 ), but the order of the expectile is ρ 2 .Since the objective function in the above optimization task is strictly convex by Proposition A.2, it has a unique solution.By cylindrical symmetry of the distribution of X along the first axis (the direction of u 1 through the origin), the expectile e ρ 2 (X) is also supported on this axis, so there exists some m ∈ R such that e ρ 2 (X) = mu 1 .Let ψ be the function with arguments (m, ρ 2 ) ∈ R × [0, 1) defined from the objective function above as: Differentiating with respect to m yields The value in m = 0 is equal to which is an affine function with negative slope.Additionally, this function vanishes when ρ 2 equals ρ 1 , by definition of the expectile of order ρ 1 being the minimizer of (1).Hence, the equivalence of the statement: if ρ 2 > ρ 1 , then the derivative in ( 15) is negative, therefore going along the first axis in the positive direction decreases the objective function, which is necessarily the direction of e ρ 2 (X) by convexity; so m > 0. The other equivalences are derived similarly.
Proof of results from Section 3 Proof of Proposition 5.One can write the density of Z ∼ MED(ρ) as: and thus, noting that the dot product ⟨ • , • ⟩ lies in [−1, 1] provides the simple bounds . By change of variable, the density of X is given by By replacement in ( 16), the result is proved.
Proof of Proposition 6.A similar calculation as that of Proposition 3 leads to The result is deduced via the identity Proof of Proposition 7. (i) is a direct application of the asymptotic equivalent for the 2 F 1 function from Lemma A.2 in Appendix A.2, while (ii) comes from the fact that the second term in the squared norm expectation in Proposition 6 dominates the first.

Proof of results from Section 4
Proof of Proposition 8. Let H d−1 be the hyperplane orthogonal to u 1 .(i) Consider the onedimensional margin Z 1 of Z: The first result is obtained by integrating the bounds provided by Proposition 5 on H d−1 .Secondly, let us assume z 1 > 0, the case z 1 < 0 being obtained by replacing ρ by −ρ.Consider the change of variable z −1 = z 1 x: Switching to polar coordinates yields: ) is the surface of the unit sphere S d−2 .Introducing , and applying Lemma A.3 with φ(0) = 1 + ρ and φ ′′ (0) = 2 + ρ, it follows that, while, in view of the above remarks, which proves that the marginal distribution of Z −1 is an elliptically contoured distribution.Let z = ∥z −1 ∥.A simple linear change of variables decouples z and z 1 in the integrand above Let function h be defined on R by h(z 1 ) = z 2 1 + ρz 1 z 2 1 + 1.We know from Proposition A.2 that h is strictly convex, thus it admits a single minimum on R. Simple algebra yields that the arg min and minimum are t As a result, Laplace's method provides the following large z equivalent to the integral Together with the above identities, we get the large ∥z −1 ∥ distribution where Proof of Proposition 9. (i) For any z ∈ R, the density f ⟨w,Z⟩|P w ⊥ (Z)=0 (z) is equal to: which is the one-dimensional MED of the statement.
(ii) Conditioning on Z 1 = 0 cancels the non-quadratic term in the exponential term of the density.As a consequence, this conditional density is proportional to exp − 1 2 ∥z −1 ∥ 2 , which identifies the asserted multivariate Gaussian distribution.

Proofs of results from Section 5
Proof of Proposition 10.The proof is a direct application of Theorem 5.3 of Shapiro et al. (2021).The only non-trivial criterion to be checked is that as n → ∞, for compact set Θ. We can apply Theorem 6 of Andrews (1992) to verify the condition.To do so, we verify that n −1 ℓ n ( θ) a.s.
−→ E log f d (X; θ), pointwise.This is possible via a trivial application of the strong law of large numbers combined with Proposition 5 and Corollary 2. To complete the proof, we must demonstrate the existence of a function α(x), such that α(x) ≥ sup θ∈ Θ | log f d (x; θ)| and E(α(X)) < ∞.Such a function can be obtained as follows.Use Proposition 5 to obtain the inequalities: and thus where γ above is defined in Θ in (12).The function α is well-defined (and non-infinite) by construction of the compact set Θ. We apply Proposition 5, again, to obtain where the last integral is finite since the normal distribution has finite first and second moments.
The analogous result holds for E sup θ∈ Θ log ϕ d X; µ, Σ 1+ρ , and thus we have verified that α (x) satisfies the necessary properties.Proof.Let us calculate the integral

B.2 Bayesian implementation in Stan
In this section, we describe an implementation of the Bayesian model of Section 5.2 in Stan probabilistic programming language (Carpenter et al., 2017).Stan starts by describing the data: Figure 3.For each pair (n, d) we simulate 20 Average times per iteration as a function of the number of observations n.
Average times per iteration as a function of the dimension d.

Figure 2 :
Figure 2: Computational times per iteration of the coordinate descent algorithm.Points represent results from individual replicates of each case pair (n, d).

Figure 5 :Figure 6 :
Figure 5: Red, blue and green crosses: theoretical mode, and mode estimates based on the MED and Gaussian distributions, respectively.Blue and green areas: credible ellipses of the mode based on the MED and Gaussian distributions, respectively.Grey: level curves of the densities.

Figure 7 :
Figure 7: Absolute value of first three moments centered with respect to parameter µ for the 1-dimension MED.

Finally, classical resultsFigure 7
Figure7represents the first three moments centered with respect to the µ parameter for the 1-dimension MED.Expressions are provided in Section 2.

Table 1 :
Absolute biases (standard font) and standard deviations (italic) of the MLE estimates across 100 simulations for different values of n and ρ, and Σ = I 2 .

Table 2 :
Absolute biases (standard font) and standard deviations (italic) of the MLE estimates across 100 simulations for different values of n and ρ, and Σ = (I 2 + J 2 ) /2.

Table 3 :
Mean squared errors computed on 100 repetitions of the Bayesian mode estimator (based on the MED / Gaussian distribution) for different sample sizes n.

Table 4 :
Empirical coverage probabilities (i.e. the posterior probability that true mode belongs to the credible 95% ellipse) measured on 100 repetitions of the Bayesian mode estimator (based on the MED / Gaussian distribution) for different sample sizes n.

Table 5 :
Median classification errors computed on the test sample, their standard deviations (SD), their median absolute deviations (MAD), and average AIC and BIC for genuine / forged classes ) using the R function msn.mle from package sn, and MED distributions.MLE of the MED parameters (µ, Σ, ρ, ν) are computed for both classes and reported in Table7in Appendix B.3 (let us denote by θ g and θ f the obtained parameters for genuine and forged classes,

Table 7 :
Maximum likelihood estimators computed with the genuine and forged records averaged over 100 train samples.