Learning Generalizable Light Field Networks from Few Images

We explore a new strategy for few-shot novel view synthesis based on a neural light field representation. Given a target camera pose, an implicit neural network maps each ray to its target pixel's color directly. The network is conditioned on local ray features generated by coarse volumetric rendering from an explicit 3D feature volume. This volume is built from the input images using a 3D ConvNet. Our method achieves competitive performance on real MVS data with respect to state-of-the-art methods based on neural radiance fields, while offering roughly 50 times faster rendering.


Index Terms: Novel view synthesis, neural light field, volumetric rendering

INTRODUCTION
Research in computer vision and artificial intelligence has long sought to enable machines to understand 3D scenes given limited observations [1][2][3][4][5][6]. This ability is crucial for many downstream 3D based machine learning, vision and graphics tasks. Among these, novel view synthesis is a particularly prominent problem with numerous applications in free viewpoint and virtual reality, as well as image editing and manipulation.
While most traditional approaches require depth information, coarse geometric proxies or dense samplings of the input views, deep learning based approaches rely on the generalization abilities of deep neural networks across viewpoints and 3D scenes to achieve novel view synthesis from minimal visual input. In this context, the recently popularized implicit neural representations offer numerous advantages in modelling 3D shape [1] and appearance [2,4,7] in comparison to their traditional alternatives. In particular, Neural Radiance Fields (NeRF) [2], notably their generalizable versions (e.g. [5,7]), provide impressive novel view synthesis performance. However, rendering with these methods requires sampling hundreds of points along each target pixel ray, and evaluating densities and view-dependent colors for all these points through a multi-layer perceptron (MLP), which increases the time and memory requirements.
To reduce this complexity, we propose to use an implicit neural network operating in ray space rather than the 5D Euclidean × direction space, thus alleviating the need for per-ray multi-point evaluation and physical rendering. For a given target pixel, an MLP (i.e. a light field network) maps its ray coordinate and ray features to the color directly. Key to efficient generalization, and differently from [4], we build the ray features by computing and merging 3D convolutional feature volumes from the input images. These features are then rendered volumetrically into a coarse ray feature image, as illustrated in figure 2.
Our method is trained end-to-end and evaluated using real multi-view stereo data (DTU [8]). We achieve competitive results in comparison to generalizable encoder-decoder NeRF models, while providing orders of magnitude faster rendering (see Table 3).

RELATED WORK
In this section, we discuss the existing work most relevant to few-shot novel view synthesis.
Early deep learning based approaches used 2D convolutional encoder-decoder architectures mapping the sparse inputs to the target images [9][10][11]. These methods were outperformed by 3D aware convolutional approaches [12][13][14]. Although many of these could learn to generate 360-degree views from very sparse inputs, especially for synthetic central object data, most of them could not scale to high resolution images, complex scenes, and real data such as MVS datasets (DTU [8]).
Implicit neural radiance fields (NeRF) [2] emerged later on as a powerful representation for novel view synthesis. They initially presented a few limitations, however, such as compu-

Fig. 2: Overview: Given an input image, a 3D feature volume is built with a ConvNet (first black cube) and re-sampled into a volume representing the target view frustum (red cube). Target feature volumes originating from different input views are aggregated using learnable weights and rendered with α-compositing. Finally, the light field network maps a ray stemming from a target camera origin T to the corresponding pixel color of the target image.

In particular, recent methods proposed to augment NeRFs with 2D [7,19,20] and 3D [5] convolutional features collected from the input images, allowing extra-scene generalization and feed-forward prediction. However, they still need to evaluate hundreds of query points per ray during inference, which makes them slow to render. Methods such as [17,21] try to alleviate the rendering complexity of NeRFs by learning view-independent radiance features. The method of [21] combines them with a single ray-dependent specular component, while Yu et al. [17] predict radiance spherical harmonic coefficients instead. Furthermore, Sitzmann et al. [4] introduced a neural light field representation that maps rays, i.e. target pixels, directly to their colors without any need for physical rendering. The method was implemented in the auto-decoding setup, which means it requires test time optimization. It also uses a hypernetwork for conditioning, which is computationally expensive to scale to larger images.
Following [4], we explore here a strategy tangential to NeRFs, which consists in bypassing 3D implicit radiance modelling altogether. Differently from [4], however, we propose a more efficient local conditioning mechanism for the light field network, which allows real scene generalization and offers optimization-free inference.

METHOD
Given one or a few images $\{I_i\}$ of a scene or an object with their known camera parameters, i.e. camera poses $\{R_i, T_i\}$, $R_i \in SO(3)$, $T_i \in \mathbb{R}^3$, and intrinsics $K \in \mathbb{R}^{3\times 3}$, our goal is to generate images $\{I_t\}$ for novel target views, i.e. new camera poses $\{R_t, T_t\}$. A summary of our method is illustrated in figure 2. In the remainder of this section, we present the components of the two stages of our method, namely the convolutional stage and the neural light field network.

Feature volume re-sampling
Following seminal work (e.g. [5,13]), we build an explicit volume of features from an input image $I_i$ using a fully convolutional neural network $E$ consisting of a succession of a 2D convolutional U-Net and several 3D convolutional blocks:

$$F_i = E(I_i), \tag{1}$$

where $I_i \in \mathbb{R}^{H \times W \times 3}$, $H$ and $W$ being the height and width of the input RGB image, and $F_i \in \mathbb{R}^{H_V \times W_V \times D \times C}$, with $H_V$, $W_V$, $D$ and $C$ being respectively the height, width, depth, and number of channels of the 3D feature volume.
Using the input feature volume $F_i$ aligned with the input image, we would like to create a feature volume $F_{t/i}$ aligned to the target image, which can subsequently be used to render a target feature image given the target camera pose $\{R_t, T_t\}$. Following the principles of volumetric rendering [2], in order to recreate a target image of dimensions $H_V \times W_V$, we need to evaluate $N$ points $\{p^z_{u,v}\}_{z=1}^{N}$ along each ray $r_{u,v}$ with direction $d_{u,v}$, where $u \in [\![1, H_V]\!]$ and $v \in [\![1, W_V]\!]$:

$$p^z_{u,v} = T_t + t_z\, d_{u,v}, \qquad d_{u,v} = R_t K^{-1} (u, v, 1)^\top, \tag{2}$$

where the depths $t_z \in [z_n, z_f]$ are sampled as in [2], $z_n$ and $z_f$ being the near and far depth bounds of the visual frustum, and $K$ is the intrinsic camera matrix. The target volume $F_{t/i}$ is then obtained by resampling the input volume $F_i$ with trilinear interpolation, using the points $\{p^z_{u,v}\}$ rigidly aligned to the input camera coordinate frame:

$$F_{t/i}(u, v, z) = F_i\big(R_i^{-1}(p^z_{u,v} - T_i)\big), \tag{3}$$

where $F_{t/i} \in \mathbb{R}^{H_V \times W_V \times N \times C}$ and $\{R_i, T_i\}$ is the input camera pose. In practice, we normalize the coordinates of the aligned points prior to sampling, as $F_i$ is assumed to represent features in the normalized device coordinate (NDC) space of the input view.
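The point sampling and rigid alignment described above can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names, the uniform depth spacing, the (column, row) pixel ordering, and the reading of $\{R, T\}$ as a camera-to-world pose with $T$ the camera origin are all assumptions made for the example. `K` is assumed to be expressed at the feature resolution $H_V \times W_V$.

```python
import numpy as np

def target_ray_points(K, R_t, T_t, H_v, W_v, N, z_near, z_far):
    """Sample N points along every target ray.

    Returns world-space points of shape (H_v, W_v, N, 3) and the depths t of shape (N,).
    Assumes (R_t, T_t) is a camera-to-world pose and T_t is the camera origin.
    """
    # Pixel grid in homogeneous coordinates; x indexes columns, y indexes rows (assumed).
    ys, xs = np.meshgrid(np.arange(H_v), np.arange(W_v), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    # Back-project through K, then rotate into the world frame: d = R_t K^{-1} x.
    dirs = pix @ np.linalg.inv(K).T @ R_t.T
    # Uniformly spaced depths between the near and far bounds (a simplifying assumption).
    t = np.linspace(z_near, z_far, N)
    points = T_t + t[None, None, :, None] * dirs[:, :, None, :]
    return points, t

def to_input_frame(points, R_i, T_i):
    """Express world-space points in the input camera frame: R_i^T (p - T_i)."""
    return (points - T_i) @ R_i
```

The aligned points would then be normalized and used as lookup coordinates into $F_i$. In a PyTorch implementation, one plausible choice for the trilinear lookup is `torch.nn.functional.grid_sample` applied to a 5D tensor, although the text does not state which routine is used.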

Feature Aggregation and rendering
As different input views provide different information about the observed scene, we subsequently merge the 3D features obtained from the various inputs. We note that all target feature volumes $\{F^k_{t/i}\}_k$ provided by input images $\{I^k_i\}_k$ are represented in the same target view camera coordinate frame. Inspired by attention mechanisms, we propose to learn a 3D confidence measure per input view in the form of a weight volume $W_i \in \mathbb{R}^{H_V \times W_V \times D}$. This volume is obtained as one of the channels of the input volume features, $W_i = F_i(1)$ (i.e. $W_{t/i} = F_{t/i}(1)$). After resampling the input features $\{F^k_i\}_k$ into the target ones $\{F^k_{t/i}\}_k$, we use the resampled weights $\{W^k_{t/i}\}_k$, normalized with Softmax across the input views, to compute a weighted average of the target volumes:

$$F_t = \sum_k \frac{\exp\big(W^k_{t/i}\big)}{\sum_j \exp\big(W^j_{t/i}\big)}\, F^k_{t/i}, \tag{4}$$

where index $k$ runs over the input views, the average is taken over the remaining $C-1$ feature channels, and $F_t \in \mathbb{R}^{H_V \times W_V \times N \times (C-1)}$. This aggregation allows our method to use an arbitrary number of input views at both training and testing. Following volumetric rendering [2], we generate a target feature image $\tilde{F}$ for a given target view differentiably using α-compositing of the target feature volume $F_t$ along the depth dimension. We assume one of the target feature channels to represent volume density, $\sigma = F_t(1) \in \mathbb{R}^{H_V \times W_V \times N}$. We recall that the dimensions of tensor $F_t$ span the pixels of the target feature resolution $H_V \times W_V$ in the first two dimensions, and the $N$ points sampled along each ray in the third dimension. The rendered target feature image then writes:

$$\tilde{F}(u, v) = \sum_{z=1}^{N} T_z\, \alpha_z\, F_t(u, v, z), \tag{5}$$

$$\alpha_z = 1 - \exp(-\sigma_z \delta_z), \qquad T_z = \exp\Big(-\sum_{j<z} \sigma_j \delta_j\Big), \tag{6}$$

where $T$ represents transmittance, $\delta_z = t_{z+1} - t_z$ and $\tilde{F} \in \mathbb{R}^{H_V \times W_V \times (C-2)}$. In order to reduce the memory cost and increase the rendering speed of our method, the rendered feature image is chosen to be smaller than the target image, i.e. $H_V = H/4$ and $W_V = W/4$.
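The two steps of this section, Softmax-weighted aggregation across views followed by α-compositing along depth, can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, channel 0 is taken to be the confidence weight (and then the density), densities are clamped to be non-negative, and the last depth interval reuses the previous spacing.

```python
import numpy as np

def aggregate_views(volumes):
    """Fuse per-view target volumes with a softmax over views.

    volumes: (num_views, H_v, W_v, N, C); channel 0 is the confidence weight (assumed).
    Returns a fused volume of shape (H_v, W_v, N, C - 1).
    """
    logits = volumes[..., 0]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)                 # softmax across input views
    return (w[..., None] * volumes[..., 1:]).sum(axis=0)

def composite(volume, t):
    """Alpha-composite a fused volume along the depth axis.

    volume: (H_v, W_v, N, C - 1); channel 0 is the density sigma (assumed).
    t: (N,) sample depths with N >= 2.
    Returns the feature image (H_v, W_v, C - 2) and the per-sample weights (H_v, W_v, N).
    """
    sigma = np.maximum(volume[..., 0], 0.0)              # assumption: clamp density at zero
    feats = volume[..., 1:]
    # delta_z = t_{z+1} - t_z; the last interval repeats the previous spacing (assumption).
    delta = np.diff(t, append=2.0 * t[-1] - t[-2])
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance T_z = prod_{j<z} (1 - alpha_j) = exp(-sum_{j<z} sigma_j delta_j).
    ones = np.ones_like(alpha[..., :1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alpha[..., :-1]], axis=-1), axis=-1)
    weights = trans * alpha
    return (weights[..., None] * feats).sum(axis=-2), weights
```

Because the softmax is taken over the view axis only, the same function handles any number of input views, which matches the stated ability to vary the number of views between training and testing.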

Neural Light Field
The convolutionally rendered features form a low resolution feature image representative of all rays making up the target view. We propose to learn a light field function $f$ to upsample and refine these first stage results. Given a ray $r_{u,v}$ with direction $d_{u,v}$ corresponding to the target image pixel coordinates $(u, v)$, with $(u, v) \in [\![1, H]\!] \times [\![1, W]\!]$, we encode rays using Plücker coordinates similarly to Sitzmann et al. [4]:

$$r_{u,v} = \frac{\big(d_{u,v},\; T_t \times d_{u,v}\big)}{\lVert d_{u,v} \rVert}, \tag{7}$$

where $r_{u,v} \in \mathbb{R}^6$. This representation ensures a unique ray encoding when the origin $T_t$ moves along direction $d_{u,v}$. We recall that the expression of $d_{u,v}$ as a function of the target camera pose $\{R_t, T_t\}$ can be found in equation 2.
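The invariance property mentioned above is easy to check numerically. The following NumPy sketch computes the 6D Plücker encoding of a ray as the unit direction concatenated with the moment (origin cross direction); the function name `plucker` is chosen for this example only.

```python
import numpy as np

def plucker(origin, direction):
    """Return the 6D Plucker coordinates (d, o x d) of a ray, with d normalized."""
    d = direction / np.linalg.norm(direction, axis=-1, keepdims=True)
    moment = np.cross(origin, d)
    return np.concatenate([d, moment], axis=-1)

origin = np.array([0.3, -1.0, 2.0])
direction = np.array([1.0, 2.0, -0.5])
r_a = plucker(origin, direction)
# Sliding the origin along the ray leaves the encoding unchanged,
# since (o + s d) x d = o x d for any scalar s.
r_b = plucker(origin + 2.5 * direction, direction)
print(np.allclose(r_a, r_b))
```

Two cameras that observe the same 3D line therefore feed the same ray coordinate to the network, regardless of where along the line each camera sits.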
The feature $F_{u,v}$ of a ray $r_{u,v}$ at the final image resolution $H \times W$ is obtained from the lower resolution rendered feature image $\tilde{F} \in \mathbb{R}^{H_V \times W_V \times (C-2)}$ through a learned upsampling. Specifically, the rendered feature image undergoes two successive 2D convolutions and upsamplings to produce a feature image at the desired resolution, $F \in \mathbb{R}^{H \times W \times (C-2)}$. The final target RGB image $I_t = \{c_{u,v}\}_{u \in [\![1,H]\!],\, v \in [\![1,W]\!]}$ is predicted with an MLP from the concatenation of the ray coordinate and its feature:

$$c_{u,v} = f\big(r_{u,v}, F_{u,v}\big). \tag{8}$$

Notice that while convolution equipped NeRF [2] methods (e.g. [5,7]) require querying $H \times W \times N$ 3D points through their implicit neural radiance fields, our light field network only needs to evaluate $H \times W$ rays, which enables our method to train potentially faster and to render orders of magnitude faster than [5,7] (see Table 3).
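To make the difference in network evaluations concrete, the short sketch below counts MLP queries per rendered image for the two formulations. The 400 × 300 resolution is the DTU evaluation resolution used in the experiments; the value of 128 samples per ray is purely a hypothetical figure chosen for illustration, since the text only says that radiance field methods sample hundreds of points per ray.

```python
def query_counts(height, width, samples_per_ray):
    """Return (radiance-field queries, light-field queries) for one rendered image."""
    radiance_field_queries = height * width * samples_per_ray  # H x W x N points
    light_field_queries = height * width                       # H x W rays
    return radiance_field_queries, light_field_queries

# samples_per_ray=128 is a hypothetical value, not taken from the paper.
nerf_q, lf_q = query_counts(300, 400, 128)
print(nerf_q, lf_q, nerf_q // lf_q)  # 15360000 120000 128
```

This ratio only concerns the implicit network; the measured speed-up also depends on the cost of the convolutional stage and of the coarse volumetric rendering, so it need not equal the number of samples per ray.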

Training Objective
Our model is fully differentiable and trained end-to-end. We optimize the parameters of the convolutional network $E$ and the light field network $f$ jointly by back-propagating a combination of a fine loss $L_r$ and two coarse losses $\tilde{L}_r$ and $\tilde{L}_d$. $L_r$ and $\tilde{L}_r$ are the L2 reconstruction losses of the final light field predicted image $I_t$ and the first stage prediction $\tilde{I}_t$, respectively. $\tilde{L}_d$ additionally regularizes the gradient of the low resolution depth image $\tilde{d}_t$, which is rendered from the density volume $\sigma$ of the first stage using $T$ and $\alpha$ as detailed in equation 6.
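A rough sketch of how such a three-term objective could be assembled is given below. Everything specific in it is an assumption made for illustration: the loss weights `w_coarse` and `w_depth`, the use of a mean absolute finite difference as the depth gradient penalty, and the comparison of the coarse prediction against a ground truth image at the coarse resolution are not taken from the text.

```python
import numpy as np

def l2(pred, target):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((pred - target) ** 2))

def depth_gradient_penalty(depth):
    """Mean absolute finite-difference gradient of a depth map (assumed regularizer form)."""
    dy = np.abs(np.diff(depth, axis=0)).mean()
    dx = np.abs(np.diff(depth, axis=1)).mean()
    return float(dx + dy)

def total_loss(fine_pred, fine_gt, coarse_pred, coarse_gt, coarse_depth,
               w_coarse=1.0, w_depth=0.1):
    """Fine L2 loss plus weighted coarse L2 and depth terms (weights are hypothetical)."""
    fine = l2(fine_pred, fine_gt)
    coarse = l2(coarse_pred, coarse_gt)
    smooth = depth_gradient_penalty(coarse_depth)
    return fine + w_coarse * coarse + w_depth * smooth
```

In an actual training loop these terms would be computed on differentiable tensors so that gradients reach both the convolutional network and the light field network.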

Implementation details
We implemented our method with the PyTorch framework on a Quadro RTX 5000 GPU. We optimize with the Adam solver using a learning rate of $10^{-4}$ in training and $10^{-5}$ in fine-tuning. The depth of the convolutional feature volume is set to $D = 32$, and the number of channels to $C = 32$.

Comparison on DTU dataset
We demonstrate the capability of our method to generate novel views from sparse input views using the DTU benchmark [8]. Following the PixelNeRF [7] experimental settings, the data is split into 88 training scenes and 16 testing scenes. Each scene contains 49 views, including 4 views for testing as suggested by MVSNeRF [5] and GeoNeRF [22]. Our training does not require mask supervision, thus all evaluations are performed on full resolution images (400 × 300) rather than on the foreground only.
For quantitative comparison, we report the peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS) reconstruction metrics in Table 1 for 3 and 6 input views, averaged across all testing scenes. We report the numbers of PixelNeRF (PN) and MVSNeRF (MN) from RegNeRF [23]. We also show qualitative comparisons for 6 input views in figure 3. While our method is robust and competitive with its NeRF based counterparts, it seems to lack some high frequency details. We leave addressing this limitation to future work.
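For reference, PSNR is computed from the mean squared error between a predicted and a ground truth image using its standard definition, as sketched below in NumPy. The function name and the assumption that pixel values lie in [0, 1] (`max_val=1.0`) are choices made for this example; SSIM and LPIPS require dedicated implementations and are not shown.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

For example, a uniform error of 0.1 on every pixel gives an MSE of 0.01 and hence a PSNR of 20 dB.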

Per-scene fine-tune results
Table 2 shows a quantitative comparison of our method with the recent state of the art in few-shot novel view synthesis with test time optimization. We outperform all methods in the PSNR and SSIM metrics, including the conditional baselines PixelNeRF (PN) [7] and MVSNeRF (MN) [5], and the unconditional baselines DietNeRF (DN) [15] and RegNeRF (RN) [23]. Figure 4 shows a qualitative comparison to MVSNeRF and PixelNeRF with 6 input views after fine-tuning. We obtain overall comparable performance with generalizable methods [5,7]. We recall that the competing methods here require renderings that are orders of magnitude slower than ours.

Rendering time comparison
As shown in Table 3, compared with PixelNeRF [7] and MVSNeRF [5], our method requires less inference time on the DTU dataset with 3 input views.

Ablation
We propose an ablative analysis showing the importance of the light field stage in our method. Specifically, we disable the latter (ours w/o lf), and we render the final image directly from the target view aligned convolutional feature volume. Table 4 shows numerical comparisons for 3 and 6 input views on DTU [8], and figure 5 shows qualitative comparisons for 6 input views.

CONCLUSIONS
We proposed a method for generating novel views from a few calibrated input images with a deep neural network that predicts in a single forward pass. We learn an implicit neural light field function that models ray colors directly. In comparison to [4], we proposed a more efficient local ray conditioning and an optimization-free inference. Our method outperforms the baselines and provides competitive performance compared to locally conditioned radiance fields (e.g. [5,7]), while being roughly 50 times faster at rendering.

Fig. 1: Our method enables fast generation of novel views from sparse input images without 3D supervision in training. We generate above novel views for objects (ShapeNet dataset) and a scene (DTU dataset) never seen at training.

Fig. 4: Qualitative comparison with test time optimization from 6 input views on the DTU dataset [8].

Table 1: Quantitative comparison of reconstructed images in the DTU [8] dataset without test time optimization.

Fig. 3: Qualitative comparison without test time optimization from 6 input views on the DTU dataset [8].

Table 2: Quantitative comparison of reconstructed images in the DTU [8] dataset with test time optimization.

Table 3: Comparison of rendering complexity.