Deep metric learning for visual servoing: when pose and image meet in latent space

We propose a new visual servoing method that controls a robot's motion in a latent space. We aim to combine the best properties of two previously proposed servoing methods: the accuracy of photometric methods such as Direct Visual Servoing (DVS), and the behavior and convergence of pose-based visual servoing (PBVS). Photometric methods suffer from a limited convergence area due to a highly non-linear cost function, while PBVS requires estimating the pose of the camera, which may introduce noise and incurs a loss of accuracy. Our approach relies on shaping (with metric learning) a latent space in which the representations of camera poses and the embeddings of their respective images are tied together. By leveraging the multimodal aspect of this shared space, our control law minimizes the difference between latent image representations thanks to information obtained from a set of pose embeddings. Experiments in simulation and on a robot validate the strength of our approach, showing that the sought-after benefits are effectively obtained.


I. INTRODUCTION
Visual servoing (VS) is the task of controlling the motion of a robot in order to reach a desired goal or a desired pose using only visual information extracted from an image stream [7]. The camera can be mounted on the robot's end effector or directly observing the robot. Visual servoing usually requires the extraction and the tracking of visual information (usually geometric features) from the image in order to design the control law. The choice of features is a crucial aspect of VS, as it impacts the servoing behaviour (namely the convergence to the target pose and the trajectory in 3D space). Geometric features are usually split into two categories. The first one, Image-Based VS (IBVS), uses 2D primitives in the image space, such as points [28], lines [1] or moments [6], in order to create the control law. The second category, named Pose-Based VS (PBVS) [29], [6], uses image information to estimate the camera pose, which can be used to directly control the robot's motion. The PBVS control law is often seen as the most practical and optimal one: if the camera pose is perfectly estimated, then the scheme is globally convergent and the trajectory is the shortest path to the goal pose, both in translation and rotation.
In all cases, geometric features must be extracted, as well as matched between current and desired images. As this is an error-prone process, another way of performing VS has developed. By using photometric features, such as raw pixel intensities, feature extraction is avoided [9]. In Direct VS (DVS), the difference between pixel intensities is minimized, which leads to very accurate positioning, but with an unpredictable trajectory and a small convergence domain, since the cost function to minimize is highly non-linear. To alleviate the latter problem, a solution is to represent images by lower dimensional features that better correlate to the pose. Multiple representations have been studied: [2] expresses an image with photometric moments. In [22], servoing is performed in the frequency domain, where only the smoothly varying low-frequency information is preserved. A similar, learning-based approach was proposed in [21], where principal component analysis is used to project on a subspace, which maximally preserves the information of an image set. Inspired by this work, we previously proposed AEVS [11], projecting images in the latent space of an autoencoder.
Between all these VS methods, a trade-off becomes apparent: the easier feature extraction is, the harder servoing becomes. On one end of the spectrum lies DVS, with no extraction but a highly non-linear cost function. On the other end, PBVS requires estimating the pose (from a 3D model of the scene/object or directly from data) but the cost function is smooth. Pose estimation also has the drawback of introducing some noise into the robot's trajectory. However, as camera relocalization is a fundamental task of many computer vision applications, it has become a topic of interest for the deep learning community, which seeks to replace or support classical geometric approaches with a neural network. One of the first works was PoseNet [17], which uses a Convolutional Neural Network (CNN) to regress the camera pose in a scene (position + orientation) from a single RGB image. This process was repurposed for VS in multiple works, such as [4], [24], [31], which employ a CNN to estimate the pose difference between current and desired images. This difference is then fed to the PBVS control law in order to move the robot closer to the target pose. Servoing is then fully dependent on this estimation, and [33] notes that in the case of orientation regression, trained networks may still yield a large error, as the rotation space is not smooth and continuous. Furthermore, in some cases, the difference between translational and rotational motions may be unobservable.
Our work proposes to replace the standard pose regression approach with metric learning. In metric learning, we seek to learn a similarity function between samples that is dependent on the task to be accomplished. These approaches can be used to group images of the same concept (high similarity), while pushing away images of different concepts (low similarity). This principle can be applied to many tasks, including classification [13], object tracking [10] or camera relocalization [3].
We propose to use metric learning to shape the latent space in which we perform VS. We aim to learn a space in which the similarity between two learned representations correlates with the distance between their respective camera poses. This representation space can be seen as an intermediate manifold between poses and images, onto which we project both modalities. Doing so, we propose a new servoing control law that combines the optimal behavior of PBVS with the accuracy and simplicity of photometric methods. In Section II, we give an overview of VS, detailing the generic minimization framework, as well as presenting how PBVS and visual servoing in an autoencoder latent space are achieved. With this knowledge, we introduce our method in Section III, which constrains the latent space to have good properties for VS via metric learning. By leveraging the multimodal aspect of this shared space, our control law minimizes image error thanks to information obtained from a set of pose projections. Finally, in Section IV, we present simulated and real-world experiments that validate our approach and illustrate its properties.

A. Visual servoing framework
Visual servoing aims to reach a desired pose $r^*$ from an arbitrary current pose $r$. Many robotics tasks, such as navigation, tracking or object picking, can be viewed through this prism, e.g. navigation is a succession of positioning tasks. VS thus seeks to solve an optimization problem, finding the pose $\hat{r}$ closest to $r^*$ that minimizes an error function $e$:
$$\hat{r} = \arg\min_{r} e(r).$$
If $e$ is well designed and $e = 0$, then $r = r^*$. Since the pose $r$ may be unknown during servoing, $e$ is defined as $e = s(r) - s^*$, the difference between what the camera sees at the current pose, $s(r) = s$, and what it should see at the desired pose, $s^*$. As stated before, the choice of features $s$ is important as it conditions the 3D trajectory, the convergence domain (how far $r$ can be from $r^*$ before VS diverges) and the final accuracy (how close $r$ is to $r^*$). Features may lie in the image space (IBVS), in the 3D world (PBVS) or in photometric space (e.g. considering pixel intensities [9]). In all cases, the relationship between the variation (in time) of the features $s$ and the camera velocity $v$ must be established:
$$\dot{s} = L_s v,$$
where $L_s = \frac{\partial s}{\partial r}$ is called the interaction matrix. By inverting this equation, we can then define the control law that best minimizes the error $e$:
$$v = -\lambda L_s^{+} e,$$
where $\lambda$ is a gain parameter and $L_s^{+}$ is the pseudo-inverse of $L_s$. The computed velocity $v$ can be used to move the robot's end-effector closer to the desired pose $r^*$. VS operates in a closed loop, with the minimization of $e$ being performed in an iterative manner.
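As a minimal sketch of the closed-loop scheme described above, the following Python snippet applies the control law $v = -\lambda L_s^{+} e$ to a toy problem in which the features are the 6D pose itself, so that the interaction matrix is the identity. The function name `control_law`, the gain and the integration step `dt` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def control_law(L_s, e, lam=0.5):
    """One servoing iteration: v = -lam * pinv(L_s) @ e."""
    return -lam * np.linalg.pinv(L_s) @ e

# Toy closed loop (assumption): the features are the pose, so L_s = identity.
s_star = np.zeros(6)
s = np.array([0.2, -0.1, 0.3, 0.05, 0.0, -0.02])
dt = 0.1  # hypothetical integration step
for _ in range(200):
    v = control_law(np.eye(6), s - s_star)
    s = s + dt * (np.eye(6) @ v)  # integrate s_dot = L_s v
final_error = np.linalg.norm(s - s_star)
```

With a well-conditioned interaction matrix, the error decays exponentially at a rate set by the gain, which is the behavior the iterative minimization relies on.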

B. Deep learning for pose-based VS
A pose $r = (t, \theta u)^\top$ is an element of $SE(3) = \mathbb{R}^3 \times SO(3)$, the group that represents rigid transformations, combining translation $t$ and rotation $\theta u$, where $u$ is the axis around which to rotate and $\theta$ is the angle of the rotation. It may also be represented as a homogeneous matrix ${}^{o}T_{c}$, which expresses the pose of the camera frame $\mathcal{F}_c$ (or any other frame) in a reference frame $\mathcal{F}_o$ (such as one given by the origin of a scene object $o$). The displacement between two poses $\Delta r = {}^{c^*}T_{c}$ can be computed as ${}^{c^*}T_{c} = {}^{c^*}T_{o}\,{}^{o}T_{c}$. Deep Learning (DL) has previously been used to estimate the camera pose $r$, given a single image $I$. The first major work to accomplish this was PoseNet [17]. Given a dataset of images and their associated poses (expressed in a common frame), a CNN is trained to minimize the following loss function:
$$\mathcal{L} = \lVert \hat{t} - t \rVert_2 + \beta \lVert \hat{q} - q \rVert_2, \tag{4}$$
where $\hat{t}, \hat{q}$ are the predicted position and orientation of the camera and $t, q$ are the ground truth. The orientation $q$ can be expressed as a unit quaternion, or as an axis-angle $\theta u$. $\beta$ is an important hyperparameter that balances the learning of translation and orientation. This weighting is required as the two quantities are on different scales, and it must be carefully tuned in order to get sensible results. In [16], the authors introduced a multi-task loss that automatically learns the weighting, improving upon the manual tuning of $\beta$. Given two images, it is also possible to regress the pose difference $\Delta r$ [4], [24], [31]. $\Delta r$ can then be plugged into the PBVS control law [29], [7] to compute $v$. The interaction matrix of a pose expressed in a fixed frame $\mathcal{F}_{c^*}$ (also valid for $\mathcal{F}_o$) is defined as [7]:
$$L_r = \begin{bmatrix} {}^{c^*}R_{c} & 0 \\ 0 & L_{\theta u} \end{bmatrix}, \tag{5}$$
with ${}^{c^*}R_{c}$ a rotation matrix and $L_{\theta u}$ the interaction matrix defined in [8]. If $\Delta r$ is perfectly estimated, then the control law $v = -\lambda L_r^{-1} \Delta r$ ensures that the 3D trajectory is a geodesic both in translation and rotation.
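The pose displacement described above can be sketched in a few lines of Python. The snippet composes two homogeneous matrices expressed in a common frame and converts the result to the $(t, \theta u)$ vector form. The function names are hypothetical, and the axis-angle extraction is a simple textbook formula that is not valid for rotations of exactly $\pi$.

```python
import numpy as np

def displacement(o_T_cstar, o_T_c):
    """Return c*_T_c = c*_T_o @ o_T_c, where c*_T_o is the inverse of o_T_c*."""
    return np.linalg.inv(o_T_cstar) @ o_T_c

def to_pose_vector(T):
    """Convert a 4x4 homogeneous matrix to the 6-vector (t, theta * u)."""
    R, t = T[:3, :3], T[:3, 3]
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:
        return np.concatenate([t, np.zeros(3)])
    # Rotation axis from the skew-symmetric part of R (breaks down at theta = pi).
    u = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    u = u / (2.0 * np.sin(theta))
    return np.concatenate([t, theta * u])

# Example: desired camera at the origin, current camera rotated by 90 degrees
# around z and translated by (0.1, 0, 0.2).
o_T_cstar = np.eye(4)
o_T_c = np.array([[0.0, -1.0, 0.0, 0.1],
                  [1.0,  0.0, 0.0, 0.0],
                  [0.0,  0.0, 1.0, 0.2],
                  [0.0,  0.0, 0.0, 1.0]])
delta_r = to_pose_vector(displacement(o_T_cstar, o_T_c))
```

In a PBVS-CNN pipeline, a vector of this form would be the regression target of the network rather than a quantity computed from known poses.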

C. Servoing in latent space
In [11], we introduced AEVS, a method to perform VS in the latent space of an autoencoder (AE). This AE is a neural network that learns a projection from an image to a lower dimensional representation (encoder), as well as the inverse mapping (decoder). This approach is similar to PCA-based VS [21], except that AEs learn non-linear projections, which PCA cannot. The autoencoding objective is the minimization of the reconstruction error. Considering two embeddings $z_I, z_{I^*}$, the control law is of the form:
$$v = -\lambda L_{z_I}^{+} (z_I - z_{I^*}),$$
where $L_{z_I}$ is computed analytically by applying the chain rule: $L_{z_I}$ is the composition of the encoder Jacobian and the interaction matrix of the input image, detailed in [9], [20]:
$$L_{z_I} = \frac{\partial z_I}{\partial I} \frac{\partial I}{\partial r}.$$
Although this process is applied to images, the same reasoning can be applied to an encoder for any type of input, as long as the interaction matrix of the input can be defined. While this approach improves upon other photometric dimensionality reduction schemes, it has some drawbacks. First, the training objective is weakly correlated to pose estimation. This makes it hard to know whether the latent space of a trained AE will give good results when applied to VS. Second, the interaction matrix $L_{z_I}$ depends on the interaction matrix of DVS, which is known to lead to unpredictable trajectories due to a highly non-linear cost function [9]. Finally, AEVS requires the camera's intrinsic calibration, as well as an estimate of the depth: a very coarse estimation works, but degrades the trajectory.

D. Metric learning
Metric learning aims to learn a similarity function between two compared inputs. The similarity measure is defined not in the input space (e.g. comparing pixel intensities) but rather from the factors that underlie the variations of the data. As an example, metric learning may learn to group images of the same subject (e.g. class, landmark or person) while enforcing a large margin between dissimilar concepts. Metric learning is often coupled with nearest neighbor search. Most metric learning algorithms focus on binary supervision to learn a meaningful metric: either two samples are similar or they are not. This is best seen in the common triplet loss [13], which compares an anchor representation $x_a$ with similar and dissimilar samples $x_p, x_n$:
$$\mathcal{L}_{\text{triplet}} = \max\left(0,\; d(x_a, x_p) - d(x_a, x_n) + \epsilon\right),$$
where $d$ is a distance in the representation space and $\epsilon$ is a margin parameter that gives the separation threshold between positive and negative data points. It is also possible to learn smoothly varying similarity functions with continuous metric learning, as studied in [18], [3]. An application of continuous metric learning that is closely tied to our work can be found in [3]. This approach learns a feature space where the Euclidean distance between two representations is directly correlated with the overlap between two images, i.e. how much of the scene is visible in both cameras. An additional pose difference regressor is then trained to obtain a better estimate for the camera relocalization task. This network compares the query with its nearest neighbors. Another work closely related to ours is [15], which learns a feature space that is equivariant to camera motion. This approach uses discrete metric learning, and a discrete number of motion patterns (e.g. move forward, rotate left/right) is learned.
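To make the binary-supervision idea concrete, here is a minimal sketch of a margin-based triplet loss in Python. The use of the Euclidean distance and the default margin value are assumptions made for illustration; the exact variant used in [13] may differ.

```python
import numpy as np

def triplet_loss(x_a, x_p, x_n, margin=0.2):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    d_pos = np.linalg.norm(x_a - x_p)
    d_neg = np.linalg.norm(x_a - x_n)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
far_negative = np.array([1.0, 0.0])    # already separated by more than the margin
near_negative = np.array([0.2, 0.0])   # violates the margin

loss_satisfied = triplet_loss(anchor, positive, far_negative)
loss_violated = triplet_loss(anchor, positive, near_negative)
```

The loss vanishes as soon as the negative is farther than the positive by at least the margin, which is why this formulation only encodes a similar/dissimilar decision and not a continuously meaningful distance.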
Using latent representations also allows projecting multiple modalities of the same data to compare them with a single metric, in a common space. In [27], the authors map text and images to a shared space. Metric learning then allows for the retrieval of images that match a given textual query. Similarly, [25] embeds point clouds, textual tasks, and trajectories in the same space, so that the best fitting trajectory may be selected given a task. Multiple modalities can also be used in audio processing, by either creating a music sample-tag association [30], or an acoustic-linguistic relationship [26].

III. METRIC LEARNING FOR VISUAL SERVOING
This section presents our novel approach to visual servoing, which leverages deep metric learning in order to frame VS in a learned space. We first detail the reasoning behind our method, then present our training procedure, which shapes the latent space. Finally, we describe how VS is performed in this new latent space. In this paper, our goal is to propose a latent space servoing scheme with a behavior similar to PBVS, overcoming the limitations of other deep learning-based methods. To do so, we propose to create a multimodal latent space $\mathcal{Z}$, into which both poses and images are mapped. A pose $r \in SE(3)$ maps to an image $I \in \mathcal{I}$ via a camera. A pose $r_j$ maps to an embedding $z_{r_j}$, while the embedding of the image $I_j$ acquired at $r_j$ is noted $z_{I_j}$. The relationship between latent space and images/poses is illustrated in Figure (1). We argue that for the best VS behavior, the distance between two embeddings should be equal to the distance between their underlying poses:
$$d_{\mathcal{Z}}(z_j, z_k) = d_{SE(3)}(r_j, r_k).$$
The distance constraint holds true whether $z_j, z_k$ are image or pose projections, i.e. $z_j = z_{I_j}$ or $z_j = z_{r_j}$. If this property is perfectly met, it follows that:
• for a given image $I_j$, acquired at pose $r_j$, $d_{\mathcal{Z}}(z_{I_j}, z_{r_j}) = 0$. This is similar to an absolute pose regression objective;
• for two images $I_j, I_k$ acquired at $r_j, r_k$, the constraint $d_{\mathcal{Z}}(z_{I_j}, z_{I_k}) = d_{SE(3)}(r_j, r_k)$ is akin to estimating the relative pose difference between $r_j$ and $r_k$ from the images;
• finally, $d_{\mathcal{Z}}(z_{I_j}, z_{I_k}) = d_{\mathcal{Z}}(z_{r_j}, z_{r_k})$. As the two cost functions are the same, using the interaction matrix linked to a pose representation in a servoing context is valid for the minimization of $d_{\mathcal{Z}}(z_{I_j}, z_{I_k})$.
We thus seek to learn a space that is equivariant to 3D motion. This is ideal for VS, as we wish for features that have a strong and straightforward relation to pose. Moreover, we seek to explicitly learn a space that is invariant to perturbations $P \in \mathcal{P}$, such as lighting changes or occlusions (i.e. $z_{I+P} = z_I$).

B. Learning an SE(3)-equivariant space
To learn the space $\mathcal{Z}$, we propose to use two distinct, parallel neural networks. The first is $\phi : SE(3) \to \mathcal{Z}$, which maps a pose $r$ to an embedding $z_r = \phi(r)$. The second model, $\psi : \mathcal{I} \to \mathcal{Z}$, maps an image $I$ to its latent representation $z_I = \psi(I)$.
In order to shape $\mathcal{Z}$, we devise a loss function that is based on the distances between latent representations. While most metric learning approaches focus on clustering problems, we require our distances to be continuously meaningful as in [3], [18]. Instead of comparing the embeddings of specific tuples $(r_j, I_j, r_k, I_k)$, we adopt a full-batch approach, comparing a representation with every other in the batch. To do so, we leverage the distance matrices in SE(3) and in the latent space, viewing an embedding batch as a fully connected graph. By using this dense approach, we encourage the latent representations to position themselves with respect to every neighbor, providing a more stable training signal for the models $\phi$ and $\psi$.
To train the pose encoder $\phi$, our first loss seeks to enforce the equivariance with SE(3) when considering pose projections. As SE(3) has no true distance metric, a weighting between translation $t$ and rotation $\theta u$ distances must be introduced to create a pseudo-metric $d_{SE(3)}$, as in [16]. We propose to define the weightings from the data and normalize the translational and rotational distances by their average value in the dataset. To compute the translation/rotation distances between two poses $r_j, r_k$, we first compute the pose difference $\Delta r = (\Delta t, \Delta\theta u)$. We then define a single dataset-aware metric $d_{SE(3)}(r_j, r_k)$ on SE(3), in which the norms of $\Delta t$ and $\Delta\theta u$ are divided by $w_t$ and $w_{\theta u}$, the averages of translational and rotational distances computed on a subset of the dataset.
With $d_{SE(3)}$ defined, we introduce our first loss. Considering a batch of $B$ samples, we seek to minimize, over all pairs $(j, k)$ of the batch, the mismatch between $d_{SE(3)}(r_j, r_k)$ and the latent distance $d_{\mathcal{Z}}(z_{r_j}, z_{r_k})$. This loss is fairly straightforward to minimize, as $\phi$ has access to the full pose information, and its main task is to transform the combination of translational and rotational metrics into a single Euclidean distance. Of course, since poses are not available during VS, we require $\psi$ to project an image to the same embedding as its associated pose, as well as match the distances with other poses. This is modeled by a second loss, $\mathcal{L}_{\psi,\phi}$, which compares image embeddings with pose embeddings. By transitivity, $\mathcal{L}_{\psi,\phi}$ also minimizes the mismatch between $d_{SE(3)}(r_j, r_k)$ and $d_{\mathcal{Z}}(z_{I_j}, z_{I_k})$.
To learn invariance to perturbations, we incorporate perturbed samples in the image batch. These noisy samples are exploited in $\mathcal{L}_{\psi,\phi}$. Moreover, the image representations associated with a single pose $r_j$ (the original image and its $P$ perturbed versions) are compared, and their distance to each other is minimized by a third loss. To obtain our final training objective, we sum the three losses. The impact of each loss is visualized in Figure (2). It can be seen that the objectives designed above act as push-pull forces, moving the embeddings to respect the distance constraints, as well as minimizing the influence of perturbations.
By comparing a representation with every neighbor, we ensure that a single iteration forces an embedding towards a more stable location.
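The dense, full-batch comparison described in this section can be sketched as follows. This is only an illustration under stated assumptions: the pseudo-metric is taken as the sum of the normalized translation and rotation distances, the rotation distance is approximated by the difference of axis-angle vectors, and the mismatch is penalized with a mean squared error. None of these specific choices is confirmed by the text above, and all function names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Matrix of Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def se3_pseudo_distances(t, theta_u, w_t, w_r):
    """Assumed pseudo-metric: normalized translation + normalized rotation terms.
    The rotation term is a simple stand-in, not a true geodesic distance."""
    return pairwise_euclidean(t, t) / w_t + pairwise_euclidean(theta_u, theta_u) / w_r

def distance_matching_loss(D_target, Z_a, Z_b):
    """Mean squared mismatch between latent distances and target distances.
    Z_a = Z_b = pose embeddings gives a pose-only loss; Z_a = image embeddings
    and Z_b = pose embeddings gives an image/pose loss."""
    return float(np.mean((pairwise_euclidean(Z_a, Z_b) - D_target) ** 2))

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 3))      # toy batch: translations only
theta_u = np.zeros((8, 3))       # no rotation in this toy batch
D = se3_pseudo_distances(t, theta_u, w_t=1.0, w_r=1.0)

# An embedding that is an isometric copy of the translations matches D exactly.
loss_perfect = distance_matching_loss(D, t, t)
loss_noisy = distance_matching_loss(D, t + 0.1 * rng.normal(size=(8, 3)), t)
```

Because every embedding is compared with all others in the batch, a single gradient step receives a constraint from each neighbor, which is the push-pull effect described above.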

C. Training policy
To constitute our batches of data, we randomly sample N = 64 poses from the dataset, and add their M = 3 closest neighbors found in a subset of the data to ensure that the constraints are met both globally and locally. For each image, we also generate P = 2 perturbations. Compared to AEVS [11], not training a decoder network (which is discarded for VS anyway) allows for larger batches. The perturbations consist of adding Gaussian noise, changing the brightness of the image and performing random erasing [32]. The image/pose generation process is the same as the one presented in [11], leveraging simulation to create data cheaply and efficiently.
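As a quick sanity check of the resulting batch sizes, the short Python sketch below counts the poses and images in one batch. It assumes that each of the N sampled poses contributes M distinct neighbors and that each image contributes the original plus P perturbed copies; the text does not state whether duplicate neighbors are removed.

```python
def batch_composition(n_poses=64, m_neighbors=3, p_perturbations=2):
    """Return (number of poses, number of images) in one training batch."""
    poses = n_poses * (1 + m_neighbors)        # sampled poses plus their neighbors
    images = poses * (1 + p_perturbations)     # clean image plus perturbed copies
    return poses, images

poses_per_batch, images_per_batch = batch_composition()
```

Under these assumptions a batch holds 256 poses and 768 images, so the dense distance matrices compared during training are of that order.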
We set the dimensionality of the latent space so that $\mathcal{Z} = \mathbb{R}^{32}$. The image encoder $\psi$ is a ResNet-34 [12], with the modifications of [11], i.e. replacing batch normalization [14] with weight normalization [23] and the end average pooling with a group convolution. The pose-embedding network $\phi$ is a 5-layer perceptron of dimensions 6 → 32 → 64 → 128 → 64 → 32, with ReLU activations after each hidden layer. The networks are jointly trained for 50 epochs, on a dataset of 100K samples. Training takes around 8h with a Quadro RTX 6000. Gradient descent is performed with the Adam optimizer [19] and a learning rate of $10^{-4}$. With our networks trained, we can finally perform VS in the latent space.
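The pose-embedding network is small enough to sketch directly. The NumPy snippet below builds a perceptron with the layer sizes given above and ReLU activations after each hidden layer. The random initialization scheme and the function names are assumptions for illustration; the trained weights and the actual implementation are not described in the text.

```python
import numpy as np

LAYER_SIZES = [6, 32, 64, 128, 64, 32]  # 6D pose in, 32D embedding out

def init_pose_encoder(rng):
    """Random weights (hypothetical He-style initialization) for each layer."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]

def pose_encoder(params, r):
    """Forward pass: ReLU after every layer except the last one."""
    h = np.asarray(r, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

params = init_pose_encoder(np.random.default_rng(0))
z_r = pose_encoder(params, np.array([0.1, 0.0, 0.6, 0.0, 0.0, 0.3]))
```

Since the input is only 6-dimensional, the Jacobian of such a network with respect to the pose is a 32 × 6 matrix, which keeps its computation for many poses inexpensive.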

D. Visual servoing in latent space
Our approach to VS is two-pronged, based on the multimodal nature of $\mathcal{Z}$. We first embed the current and desired images, $z_I = \psi(I)$ and $z_{I^*} = \psi(I^*)$, in $\mathcal{Z}$ to obtain $e = z_I - z_{I^*}$.
To minimize $e$, we require an interaction matrix that gives us the control directions. Before VS starts, we project a set of $N$ poses $\{z_{r_1}, \ldots, z_{r_N}\} = \{\phi(r_1), \ldots, \phi(r_N)\}$ in the latent space, along with their interaction matrices, thanks to the method described in Section II-C. The latent interaction matrix is then defined as $L_{z_{r_j}} = \frac{\partial z_{r_j}}{\partial r_j} L_{r_j}$, with $\frac{\partial z_{r_j}}{\partial r_j}$ the Jacobian of $\phi$ with respect to $r_j$ (computed via forward propagation) and $L_{r_j}$ given by Equation (5). Because we minimize $\mathcal{L}_{\psi,\phi}$ (Equation (10)), we can approximate the interaction matrix at $z_I$ as an interpolation of its neighbors' interaction matrices. We thus have a K-Nearest Neighbors (KNN) regression problem, in which the interaction matrix at $z_I$, denoted here $\widehat{L}_{z_I}$, is regressed from the matrices $L_{z_{r_k}}$ of the $K$ pose embeddings closest to $z_I$. To obtain the set of pose embeddings, we generate poses on a 6D grid (displacements in translation and rotation), and oversample near $z_{I^*}$. In our experiments, we thus create a set of 1M pose representations that can be used for KNN. To be computationally tractable, we store them in a KD-tree [5]. Other, smarter and less memory-demanding sampling strategies, based on the values of both $z_I$ and $z_{I^*}$, are possible. We however found that this straightforward approach works well in practice. Unlike AEVS, the interaction matrices of the pose representations (and thus the network Jacobian) are computed offline. Fitting the pieces together, the final control law is simply
$$v = -\lambda \widehat{L}_{z_I}^{+} (z_I - z_{I^*}).$$
In the next section, we explore the behavior of this VS scheme, both in simulation and on a 6DOF robot.
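The KNN-based control law can be illustrated with a short Python sketch. Two simplifications are assumed here and are not taken from the paper: the neighbors' interaction matrices are averaged uniformly (the actual interpolation weights are not shown above), and a brute-force search stands in for the KD-tree. The toy setup uses a linear "encoder" so that the expected velocity is known in closed form.

```python
import numpy as np

def knn_interaction_matrix(z_I, Z_poses, L_poses, k=5):
    """Average the interaction matrices of the k pose embeddings closest to z_I."""
    dists = np.linalg.norm(Z_poses - z_I, axis=1)
    idx = np.argsort(dists)[:k]          # brute force; a KD-tree would replace this
    return L_poses[idx].mean(axis=0)

def latent_control_law(z_I, z_I_star, Z_poses, L_poses, k=5, lam=0.5):
    """v = -lam * pinv(L_hat) @ (z_I - z_I_star), with L_hat estimated by KNN."""
    L_hat = knn_interaction_matrix(z_I, Z_poses, L_poses, k)
    return -lam * np.linalg.pinv(L_hat) @ (z_I - z_I_star)

# Toy setup (assumption): linear encoder z = A r with L_r = identity,
# so every latent interaction matrix equals A.
rng = np.random.default_rng(0)
A = rng.normal(size=(32, 6))
poses = rng.uniform(-0.5, 0.5, size=(1000, 6))
Z_poses = poses @ A.T
L_poses = np.repeat(A[None, :, :], len(poses), axis=0)

r, r_star = poses[0], np.zeros(6)
v = latent_control_law(A @ r, A @ r_star, Z_poses, L_poses, k=5, lam=0.5)
```

In this idealized case the command reduces to $-\lambda (r - r^*)$, i.e. the PBVS behavior, which is the property the learned latent space is designed to approximate.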

A. Simulation validation
We start our experimental validation with a large scale experiment: we run 500 VS tasks with multiple methods.
The initial positions are chosen with a "look-at/look-from" scenario: we sample camera positions in a volume of dimensions 1.2m × 1.2m × 0.3m, centered on the desired position. We then sample the focal (look-at) points of the cameras on the planar scene, with a distance to the desired focal points between 8cm and 32cm. The scene is a poster of dimensions 80cm × 60cm, and we set the desired camera elevation to 60cm. From the camera position and focal point, we build the camera orientation, to which we add a rotation around the optical axis in [−120°, 120°]. The average initial pose error is 47cm ± 16cm, 74° ± 28°.
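A possible way to generate such initial poses is sketched below in Python. Several choices are assumptions made for illustration: uniform sampling, a world frame whose origin is the desired focal point on the poster plane z = 0, a fixed reference "up" direction, and a camera frame whose z-axis is the optical axis.

```python
import numpy as np

def look_at_rotation(cam_pos, focal_point, roll=0.0):
    """Rotation whose third column (optical axis) points from cam_pos to focal_point,
    followed by a roll around that axis."""
    z = focal_point - cam_pos
    z = z / np.linalg.norm(z)
    up = np.array([0.0, 1.0, 0.0])       # hypothetical reference direction
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=1)      # columns: camera axes in the world frame
    c, s = np.cos(roll), np.sin(roll)
    R_roll = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return R @ R_roll

def sample_initial_pose(rng, desired_pos=np.array([0.0, 0.0, 0.6])):
    """Sample a camera position, a look-at point on the scene plane and a roll angle."""
    half_extent = np.array([0.6, 0.6, 0.15])             # 1.2m x 1.2m x 0.3m volume
    cam_pos = desired_pos + rng.uniform(-half_extent, half_extent)
    radius = rng.uniform(0.08, 0.32)                      # 8cm to 32cm from the target
    angle = rng.uniform(0.0, 2.0 * np.pi)
    focal = np.array([radius * np.cos(angle), radius * np.sin(angle), 0.0])
    roll = np.deg2rad(rng.uniform(-120.0, 120.0))
    return cam_pos, look_at_rotation(cam_pos, focal, roll)

cam_pos, R = sample_initial_pose(np.random.default_rng(0))
```

The construction degenerates only when the viewing direction is parallel to the reference direction, which cannot happen with the sampling ranges used here.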
We experiment with multiple methods, the first one being a pure photometric scheme (DVS) [9]. We also compare with AEVS [11], as well as a PBVS approach that relies on CNN-based pose regression, as developed in [24], [4] (referred to as PBVS-CNN). To ease training and improve results, we adopt the automatic weighting loss of [16] that balances translation and rotation errors in Equation (4). For both AEVS and PBVS-CNN, we use the network and data described in Section III-C. For our method, we explore different numbers of neighbors K. We first report the percentage of samples that converge for each method. A sample is defined as having converged if the VS method significantly reduces (by at least 90%) the initial positioning error and the final velocities are close to 0. For those samples, we measure the end positioning error, as well as the Absolute Pose Error (APE), averaged over all iterations of a trajectory, which describes how far the method strays from the geodesic of PBVS. We also report the mean length ratio: the length of the trajectory of the observed method divided by the length of the PBVS trajectory. As can be seen in Table I, our method is able to converge reliably and accurately, with a positioning error that is comparable to photometric methods in the case of clean target images I* (noted ✓ in the table), while having a far larger domain of convergence. We can observe that a larger value for K leads to a more stable and accurate positioning. Looking at the results of PBVS-CNN, it can be seen that its end positioning error is subpar (3.3cm, 1.71° on average). The trajectory statistics (APE and length ratio) show that our method is far closer to the behavior of PBVS, compared to AEVS or DVS. This is made explicit in Figure (3b). The overall statistics when considering cases where I* is perturbed (noted ✗) highlight that our method better handles variations in lighting and occlusions. While both convergence and accuracy degrade, they remain better than those of the pose estimator, even on clean images.
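The convergence criterion and the length ratio described above can be written down compactly. In the Python sketch below, the velocity tolerance `vel_tol` is a hypothetical value (the text only says "close to 0"), and trajectories are assumed to be given as sequences of 3D camera positions.

```python
import numpy as np

def has_converged(initial_error, final_error, final_velocity, vel_tol=1e-3):
    """Converged if the error dropped by at least 90% and the robot has stopped."""
    reduced = final_error <= 0.1 * initial_error
    stopped = np.linalg.norm(final_velocity) < vel_tol
    return bool(reduced and stopped)

def path_length(positions):
    """Sum of the distances between consecutive positions."""
    positions = np.asarray(positions, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1)))

def length_ratio(positions, reference_positions):
    """Length of the observed trajectory divided by the reference (PBVS) length."""
    return path_length(positions) / path_length(reference_positions)

straight = [[0, 0, 0], [1, 0, 0]]
detour = [[0, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 0]]
ratio = length_ratio(detour, straight)
```

A ratio close to 1 indicates a trajectory that is as short as the PBVS one, while larger values reveal detours such as those typical of photometric methods.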
Next, we explore the servoing behavior in the latent space. We perform 8 trajectories, where the initial errors are displacements on the x, y axes, with an initial error of 20cm. We then visualize the trajectories in the latent space by projecting in a 2D-subspace with PCA (explained variance ≈ 96%). As can be seen in Figure (3a), the minimization of e in the latent space leads to nearly straight lines in the latent space. The error between pose embeddings also correlates well with the error from image representations.
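The 2D visualization described above relies on a standard PCA projection. A minimal NumPy sketch is given below; it uses an SVD of the centered embeddings and returns the fraction of variance explained by the retained components. The toy data, which lies exactly in a 2D plane of a 32D space, is an assumption used only to check the code.

```python
import numpy as np

def pca_project(Z, n_components=2):
    """Project rows of Z on their first principal components.
    Returns the projected points and the explained variance ratio."""
    Z_centered = Z - Z.mean(axis=0)
    _, S, Vt = np.linalg.svd(Z_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Z_centered @ Vt[:n_components].T, float(explained[:n_components].sum())

rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 2))
basis = rng.normal(size=(2, 32))
Z_toy = coords @ basis                   # embeddings confined to a 2D plane
Z_2d, explained_variance = pca_project(Z_toy)
```

An explained variance close to 1 means that little is lost in the projection, so straight lines in the 2D plot reflect nearly straight paths in the full latent space.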

B. Robot experiment
We also deploy our method on a 6DoF gantry robot and study its behavior on a large motion. For the experiment, shown in Figure ( (Figure (4c)) and run our method for 1.2k iterations. Note that this motion (and thus the image) is far from what is seen during training. As servoing progresses, the error in the latent space (Figure (4f)) is quickly minimized. The final positioning error is $\Delta r_{\text{final}}$ = (0.09cm, 0.08cm, −0.04cm, 0.08°, −0.12°, −0.01°), and the resulting error in image space is low (Figure (4d)). As shown in Figures (4e, 4g, 4h), the trajectory starts with some unwanted motion on the x, y axes (compensated by y/x rotations), probably due to the fact that I lies outside the training domain. However, our method recovers and exhibits a smooth decrease in positioning error.

V. CONCLUSION
In this paper, we proposed a method that allows visual servoing from both image and pose representations in a common latent space. This new visual servoing scheme combines the accuracy of photometric methods with the behavior of pose-based approaches. Our experiments show strong results, with a large convergence domain and accurate positioning. In future work, we plan to extend our method to deal with more than two modalities (e.g. adding depth or segmentation information), so that the visual features at the current and desired poses may be drawn from different domains. In addition, we believe that it is possible to reduce data requirements by including some form of self-supervised learning. For instance, small motions could be used to warp an image and generate new weakly labeled samples. Supervision would then be used to learn representations tied to large motions, while smaller displacements would be taken into account with self-supervision.