A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis

Relighting radiance fields is severely underconstrained for multi-view data, which is most often captured under a single illumination condition; it is especially hard for full scenes containing multiple objects. We introduce a method to create relightable radiance fields using such single-illumination data by exploiting priors extracted from 2D image diffusion models. We first fine-tune a 2D diffusion model on a multi-illumination dataset conditioned by light direction, allowing us to augment a single-illumination capture into a realistic (but possibly inconsistent) multi-illumination dataset from directly defined light directions. We use this augmented data to create a relightable radiance field represented by 3D Gaussian splats. To allow direct control of light direction for low-frequency lighting, we represent appearance with a multi-layer perceptron parameterized on light direction. To enforce multi-view consistency and overcome inaccuracies, we optimize a per-image auxiliary feature vector. We show results on synthetic and real multi-view data under single illumination, demonstrating that our method successfully exploits 2D diffusion model priors to allow realistic 3D relighting for complete scenes.


Introduction
Radiance fields have recently revolutionized 3D scene capture from images [MST*20]. Such captures typically involve a multi-view set of photographs taken under the same lighting conditions. Relighting such radiance fields is hard since lighting and material properties are entangled (e.g., is this a shadow or simply a darker color?) and the inverse problem is ill-posed.
One approach to overcome this difficulty is to capture a multi-illumination dataset, which better conditions the inverse problem but comes at the cost of a heavy capture setup [DHT*00]. Another option is to use priors, which is typically done by training a neural network on synthetic data to predict intrinsic properties or relit images. However, creating sufficiently large, varied and photorealistic 3D scenes is both challenging and time-consuming. As such, methods relying on these (or simpler) priors often demonstrate [...] Other methods have handled more complex illumination models, including full scenes [PMGD21, PGZ*19], but can be limited in the complexity of the geometry and materials, which must reconstruct well. Finally, methods that depend on accurate estimates of surface normals [JLX*23, GGL*23] often produce limited levels of realism when relighting.
At the other end of the spectrum, diffusion models (DMs, e.g., [RBL*22]), trained on billions of natural images, have shown exceptional abilities to capture real image distribution priors and can synthesize complex lighting effects. While recent progress shows they can be controlled in various ways [ZRA23], extracting lighting-specific priors from these models, especially for full 3D scenes, has not yet been demonstrated.
In this paper, we build on these observations and present a new method that demonstrates that it is possible to create relightable radiance fields for complete scenes from single low-frequency lighting condition captures by exploiting 2D diffusion model priors. We first propose to fine-tune a pre-trained DM conditioned on the dominant light source direction. For this, we leverage a dataset of images with many lighting conditions of the same scene [MGAD19], which enables the DM to produce relit versions of an image with explicit control over the dominant lighting direction. We use this 2D relighting network to augment any standard multi-view dataset taken under single lighting by generating multiple relit versions of each image, effectively transforming it into a multi-illumination dataset. Given this augmented dataset, we train a new relightable radiance field with direct control on lighting direction, which in turn enables realistic interactive relighting of full scenes with lighting and camera view control in real time for low-frequency lighting. We build on 3D Gaussian Splatting [KKLD23], enhancing the radiance field with a small Multi-Layer Perceptron and an auxiliary feature vector to account for the approximate nature of the generated lightings and to handle lighting inconsistencies between views.
In summary, our contributions are:
• A new 2D relighting neural network with direct control on lighting direction, created by fine-tuning a DM with multi-lighting data.
• A method to augment single-lighting multi-view capture to an approximate multi-lighting dataset, by exploiting the 2D relighting network.
• An interactive relightable radiance field that provides direct control on lighting direction, and corrects for inconsistencies in the neural relighting.
We demonstrate our solution on synthetic and real indoor scenes, showing that it provides realistic relighting of multi-view datasets captured under a single lighting condition in real time.

Related Work
Our method proposes a relightable radiance field. We review work on radiance fields and their relightable variants, and discuss diffusion models and fine-tuning methods we build on.

Radiance Fields
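Radiance field methods have revolutionized 3D scene capture using multi-view datasets (photos or video) as input. In particular, Neural Radiance Fields (NeRFs) [MST*20] learn to synthesize novel views of a given scene by regressing its radiance from a set of input images (multiple photos or videos of a 3D scene). Structure from motion [Ull79, SF16] is used to estimate the camera poses for all images, and rays are cast through the center of all pixels. A multi-layer perceptron (MLP) $c_\theta$ parameterized by 3D position and view direction is used to represent the radiance and opacity of the scene. The predicted color for a pixel is obtained by integrating the color field $c_\theta$ weighted by a density field $\sigma_\theta$ following the equation of volume rendering. The optimization objective is simply the mean squared error:

$$\mathcal{L} = \mathbb{E}_{o, d, c^*}\left[\lVert c(o, d) - c^* \rVert_2^2\right],$$

where $c(o, d)$ is the color predicted for a ray, $o$ is the ray's origin, $d$ its direction, and $c^*$ the target RGB color value of its corresponding pixel. The original NeRF was slow to train and to render; a vast number of methods [TTM*…] have been proposed to improve the original technique, e.g., acceleration structures [MESK22], antialiasing [BMT*21], handling larger scenes [BMV*22], etc. Recently, 3D Gaussian Splatting (3DGS) [KKLD23] introduced an explicit, primitive-based representation of radiance fields. The anisotropic nature of the 3D Gaussians allows the efficient representation of fine detail, and the fast GPU-accelerated rasterization allows real-time rendering. We use 3DGS to represent radiance fields mainly for performance, but any other radiance representation, e.g., [CXG*22], could be used instead.
Radiance fields are most commonly used in the context of single-light condition captures, i.e., the images are all captured under the same lighting. As a result, there is no direct way to change the lighting of captured scenes, severely restricting the utility of radiance fields compared to traditional 3D graphics assets.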
Our method uses diffusion models to simulate multi-light conditions from a single-light capture thus allowing the relighting of radiance fields.

Single Image Relighting
Single image relighting approaches have mostly been restricted to human faces, with the most recent methods using generative priors [...] OutCast [GRP22] produces realistic, user-controllable, hard cast shadows from the sun. In contrast, we focus on indoor scenes, which often exhibit soft shadows and more complex lighting effects. Finally, the concurrent work of Zeng et al. [ZDP*24] uses diffusion models to relight isolated objects using environment maps. In contrast to these solutions, we focus on cluttered indoor scenes.
[...] use the more recent 3D Gaussian representation along with ray tracing to estimate properties of objects. GS-IR [LZF*24], GaussianShader [JTL*24] and GIR [SWW*23] also build on 3D Gaussian splatting, proposing different approaches to estimate more reliable normals while approximating visibility and indirect illumination; these work well for isolated objects under distant lighting. However, these methods struggle with more complex scene-scale input and near-field illumination, although they can work with, or be adapted to, both single and multi-illumination input data.
[...] focused on scene-scale, single-illumination relighting using both implicit and explicit representations. While they can achieve reasonable results, they often lack overall realism, exhibiting bumpy or overly smooth shading during relighting. In contrast, our use of diffusion priors provides realistic-looking output.

Diffusion Models
Diffusion Models (DMs) [SDWMG15, HJA20] made it possible to train generative models on diverse, high-resolution datasets of billions of images. These models learn to invert a forward diffusion process that gradually transforms images into isotropic Gaussian noise, by adding random Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to an image in $T$ steps. DMs train a neural network $g_\phi$ with parameters $\phi$ to learn to denoise with the objective:

$$\mathcal{L} = \mathbb{E}_{x, \epsilon, t}\left[\lVert g_\phi(x_t; t) - y_t \rVert_2^2\right],$$

in which $x_t$ is the noisy version of image $x$ at timestep $t$ and the target $y_t$ is often set to $\epsilon$. After training, sampling can be performed step-by-step, by predicting $x_{t-1}$ from $x_t$ for each timestep $t$, which is expensive since $T$ can be high (e.g., 1000); faster alternatives include deterministic DDIM [SME20] sampling, which can perform sampling of comparable quality with fewer steps (i.e., 10-50× larger steps). Stable Diffusion [RBL*22] performs denoising in a lower-dimensional latent space, by first training a variational autoencoder to compress images; for instance, in Stable Diffusion XL [PEL*23], images are mapped to a latent space of size $\mathbb{R}^{128 \times 128 \times 4}$. In a pre-pass, the dataset is compressed using this autoencoder, and a text-conditioned diffusion model is then trained directly in this latent space.
Diffusion models have an impressive capacity to synthesize highly realistic images, typically conditioned on text prompts. The power of DMs lies in the fact that the billions of images used for training contain an extremely rich representation of the visual world. However, extracting the required information for specific tasks, without incurring the (unrealistic) cost of retraining DMs, is not straightforward. A set of recent methods show that it is possible to fine-tune DMs with a typically much shorter training process to perform specific tasks (e.g., [GAA*23, RLJ*23]). A notable example is ControlNet [ZRA23], which proposed an efficient method for fine-tuning Stable Diffusion with added conditioning. In particular, its authors demonstrated conditional generation from depth, Canny edges, etc., with and without text prompts; we build on this solution for our 2D relighting method.
In a similar spirit, there has been significant evidence in recent years that latent spaces of generative models encode material information [BMHF23, BF24]. Recent work shows the potential to fine-tune DMs to allow direct material editing [SJL*23]. Nonetheless, we are unaware of published methods that use DM fine-tuning to perform realistic relighting of full and cluttered scenes.
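To make the forward process and the training targets concrete, the following minimal sketch noises a toy "image" (a flat list of floats) and builds both the usual ε target and the v-parameterized target that is adopted later in the paper (see Improving the Diffusion Quality). It assumes the standard DDPM closed form x_t = √ᾱ_t · x + √(1 − ᾱ_t) · ϵ; all function names are hypothetical and chosen for illustration only.

```python
import math
import random

def add_noise(x, eps, alpha_bar):
    """Assumed DDPM forward step: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + s * ei for xi, ei in zip(x, eps)]

def epsilon_target(eps):
    """The usual training target y_t = eps."""
    return list(eps)

def v_target(x, eps, alpha_bar):
    """v-parameterized target y_t = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * ei - s * xi for xi, ei in zip(x, eps)]

def recover_image_from_v(x_t, v, alpha_bar):
    """Invert the v target: x = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xt - s * vi for xt, vi in zip(x_t, v)]

rng = random.Random(0)
x = [0.2, -0.5, 0.9, 0.0]               # toy clean image
eps = [rng.gauss(0.0, 1.0) for _ in x]  # Gaussian noise sample
alpha_bar = 0.7                         # illustrative noise level for one timestep
x_t = add_noise(x, eps, alpha_bar)
v = v_target(x, eps, alpha_bar)
x_rec = recover_image_from_v(x_t, v, alpha_bar)
```

The last helper shows why either target is sufficient for denoising: given the noisy input and a perfect prediction of v, the clean image follows from a linear combination.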

Method
Our method is composed of three main parts. First, we create a 2D relighting neural network with direct control of lighting direction (Sec. 3.1). Second, we use this network to augment a multi-view capture with single lighting into a multi-lighting dataset. The resulting dataset can be used to create a radiance field representation of the 3D scene (Sec. 3.2). Finally, we create a relightable radiance field that accounts for inaccuracies in the synthesized relit input images and provides a multi-view consistent lighting solution (Sec. 3.3).

Single-View Relighting with 2D Diffusion Priors
Relighting a scene captured under a single lighting condition is severely underconstrained, given the lighting/material ambiguity, and thus requires priors about how appearance changes with illumination. Arguably, large DMs must internally encode such priors since they can generate realistic complex lighting effects, but existing architectures do not allow for explicit control over lighting.
We propose to provide explicit control over lighting by fine-tuning a pre-trained Stable Diffusion (SD) [RBL*22] model using ControlNet [ZRA23] on a multi-illumination dataset. As illustrated in Fig. 2, the ControlNet accepts as input an image as well as a target light direction, and produces a relit version of the same scene under the desired lighting. To train the ControlNet, we leverage the dataset of Murmann et al. [MGAD19], which contains $N = 1015$ real indoor scenes captured from a single viewpoint, each lit under $M = 25$ different, controlled lighting directions. We only keep the 18 non-front-facing light directions.

Lighting Direction
To capture the scenes using similar light directions, Murmann et al. relied on a camera-mounted directional flash controlled by a servo motor. A pair of diffuse and metallic spheres is also visible in each scene; we leverage the former to obtain the effective lighting directions. Using as target the average of all diffuse spheres produced by the same flash direction, we find the lighting direction $l \in \mathbb{R}^3$ which best reproduces this target when rendering a gray ball with a simplistic Phong shading model. More specifically, we minimize the $L_1$ error when jointly optimizing for an ambient light term and shading parameters (albedo, specular intensity and hardness, as well as a Fresnel coefficient). Fig. 3 illustrates this process.
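The fitting procedure above can be illustrated with a deliberately simplified sketch. It is not the authors' optimizer: it uses an ambient plus Lambertian shading model (no specular, hardness or Fresnel terms), holds the albedo fixed, and replaces joint optimization with an exhaustive search over a coarse grid of candidate directions and ambient values. All names and grid resolutions are assumptions made for illustration.

```python
import math

def sphere_normals(n=16):
    """Unit normals of the camera-facing hemisphere (z > 0), sampled on an n x n pixel grid."""
    normals = []
    for i in range(n):
        for j in range(n):
            x = (i + 0.5) / n * 2.0 - 1.0
            y = (j + 0.5) / n * 2.0 - 1.0
            r2 = x * x + y * y
            if r2 < 1.0:
                normals.append((x, y, math.sqrt(1.0 - r2)))
    return normals

def shade(normals, light, ambient, albedo):
    """Ambient + Lambertian shading of a gray ball (specular term omitted in this sketch)."""
    return [ambient + albedo * max(0.0, n[0] * light[0] + n[1] * light[1] + n[2] * light[2])
            for n in normals]

def l1_error(a, b):
    """Mean absolute difference between two shading lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def candidate_directions(steps=12):
    """Coarse latitude/longitude grid of unit directions."""
    dirs = []
    for i in range(steps + 1):
        theta = math.pi * i / steps          # polar angle measured from +z
        for j in range(2 * steps):
            phi = math.pi * j / steps        # azimuth in [0, 2*pi)
            dirs.append((math.sin(theta) * math.cos(phi),
                         math.sin(theta) * math.sin(phi),
                         math.cos(theta)))
    return dirs

def fit_light(target, normals, ambients=(0.0, 0.1, 0.2), albedo=0.8):
    """Grid search for the (L1 error, direction, ambient) triple that best reproduces the target."""
    best = None
    for light in candidate_directions():
        for ambient in ambients:
            err = l1_error(shade(normals, light, ambient, albedo), target)
            if best is None or err < best[0]:
                best = (err, light, ambient)
    return best

normals = sphere_normals()
true_light = candidate_directions()[3 * 24 + 2]  # polar angle 45 degrees, azimuth 30 degrees
target = shade(normals, true_light, ambient=0.1, albedo=0.8)
err, light, ambient = fit_light(target, normals)
```

In this synthetic setting the search recovers the direction used to render the target. With real averaged sphere crops the residual would not reach zero, and a gradient-based optimizer over all shading parameters would be the more natural choice.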

Controlling Relighting Diffusion
We train ControlNet to predict relit versions of the input image by conditioning it on a target lighting direction. Let us denote a set $\mathcal{X}$ of images of a given scene in the multi-light dataset of Murmann et al. [MGAD19], where each image $X_k \in \mathcal{X}$ has an associated light direction $l_k$. Our approach, illustrated in Fig. 2, trains on pairs of lighting directions of the same scene (including the identity pair). The denoising objective becomes

$$\mathcal{L}_{2D} = \mathbb{E}_{\epsilon, \mathcal{X}, t, i, j}\left[\lVert g_\psi(X_{t,i};\, t, X_j, D_j, l_i) - y_t \rVert_2^2\right],$$

where $X_{t,i}$ is the noisy image at timestep $t \in [1, T]$, where $i, j \in [1, M]$, and where $\psi$ are the ControlNet optimizable parameters only. $X_j$ is another image from the set and $D_j$ is its depth map (obtained with the approach of Ke et al. [KOH*24]); both are given as input to the ControlNet subnetwork. In short, the network is trained to denoise input image $X_i$ given its light direction $l_i$ while conditioned on the image $X_j$ corresponding to another lighting direction $l_j$ of the same scene. Here, we do not use text conditioning: the empty text string is provided to the network.
Specifically, the light direction $l_i$ is encoded using the first 4 bands of spherical harmonics, following the method of Müller et al. [MESK22]. The resulting vector is added to the timestep embedding prior to feeding it to the layers of ControlNet's trainable copy.
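As an illustration of this encoding, the sketch below evaluates the real spherical harmonics basis up to degree 3 (4 bands, hence 16 coefficients) for a 3D direction. The constants are the standard real SH normalization factors; the exact sign and ordering conventions of the encoding used in the paper are not specified here, so this particular layout is an assumption, and `sh_encode` is a hypothetical name.

```python
import math

def sh_encode(direction):
    """Real spherical harmonics up to degree 3 (16 coefficients) of a 3D direction."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm  # the basis is defined on the unit sphere
    xx, yy, zz = x * x, y * y, z * z
    return [
        # band 0 (1 coefficient)
        0.28209479177387814,
        # band 1 (3 coefficients)
        -0.4886025119029199 * y,
        0.4886025119029199 * z,
        -0.4886025119029199 * x,
        # band 2 (5 coefficients)
        1.0925484305920792 * x * y,
        -1.0925484305920792 * y * z,
        0.31539156525252005 * (2.0 * zz - xx - yy),
        -1.0925484305920792 * x * z,
        0.5462742152960396 * (xx - yy),
        # band 3 (7 coefficients)
        -0.5900435899266435 * y * (3.0 * xx - yy),
        2.890611442640554 * x * y * z,
        -0.4570457994644658 * y * (4.0 * zz - xx - yy),
        0.3731763325901154 * z * (2.0 * zz - 3.0 * xx - 3.0 * yy),
        -0.4570457994644658 * x * (4.0 * zz - xx - yy),
        1.445305721320277 * z * (xx - yy),
        -0.5900435899266435 * x * (xx - 3.0 * yy),
    ]

coeffs = sh_encode((1.0, 2.0, 3.0))
```

A convenient sanity check for any such implementation is the addition theorem: for a unit direction, the squared coefficients of band l sum to (2l + 1) / (4π), so the 16 squared values sum to 4/π.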

Improving the Diffusion Quality
Since ControlNet was not specifically designed for relighting, adapting it naively as described above leads to inaccurate colors and a loss in contrast (see Fig. 4), as well as distorted edges (see Fig. 5). These errors also degrade multi-view consistency. We adopt two strategies to improve coloration and contrast. First, we follow the recommendations of [LLLY23] to improve image brightness; we found them to also help for color. In particular, using the "v-parameterized" objective $y_t = \sqrt{\bar{\alpha}_t}\,\epsilon - \sqrt{1 - \bar{\alpha}_t}\,x$, instead of the more usual $y_t = \epsilon$, proved critical; in this equation, $1 - \bar{\alpha}_t$ gives the variance of the noise at timestep $t$. Second, after sampling, we color-match predictions to the input image to compensate for the difference between the color distribution of the training data and that of the scene. This is done in the LAB colorspace by subtracting the per-channel mean of the prediction and dividing by its standard deviation, then multiplying by the standard deviation of the input and adding its mean. These statistics are computed over all 18 lighting conditions together (i.e., the mean over all lighting directions) to conserve relative brightness across all conditions. Fig. 4 shows the effect of these changes; without them, the bottle is blue instead of green and overall contrast is poor.
To correct the distorted edges, we adapt the asymmetric autoencoder approach of Zhu et al. [ZFC*23], which consists of conditioning the latent space decoder with the (masked) input image for the inpainting task. In our case, we ignore the masking and fine-tune the decoder on the multi-illumination dataset [MGAD19]. At each fine-tuning step, we encode an image and condition the decoder on an image from the same scene with another random lighting direction. The decoder is fine-tuned with the Adam optimizer at [...]
Example relighting results obtained using our 2D relighting network on images outside of the dataset are shown in Fig. 6. Observe how the relit images produced by our method are highly realistic and light directions are consistently reproduced across scenes. A naive solution for radiance field relighting would be to apply this 2D network to each synthesized novel view. However, the ControlNet is not multi-view consistent, and such a naive solution results in significant flickering. Please see the accompanying video for a clear illustration.
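The color-matching step can be sketched as a per-channel statistics transfer. The example below assumes the images have already been converted to LAB elsewhere (the conversion itself is not shown) and represents each image as a flat list of (L, a, b) tuples. The prediction statistics are pooled over all relit versions so that relative brightness between lighting directions is preserved. Function names and the small `eps` guard are assumptions of this sketch.

```python
import math

def channel_stats(images, channel):
    """Mean and standard deviation of one channel, pooled over all pixels of all images."""
    values = [pixel[channel] for image in images for pixel in image]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def color_match(predictions, reference, eps=1e-8):
    """Shift and scale each LAB channel of the predictions to the reference statistics.

    `predictions` holds one image per lighting direction; `reference` is the input image.
    """
    pred_stats = [channel_stats(predictions, c) for c in range(3)]
    ref_stats = [channel_stats([reference], c) for c in range(3)]
    matched = []
    for image in predictions:
        out = []
        for pixel in image:
            out.append(tuple(
                (pixel[c] - pred_stats[c][0]) / (pred_stats[c][1] + eps)
                * ref_stats[c][1] + ref_stats[c][0]
                for c in range(3)
            ))
        matched.append(out)
    return matched

# Toy data: a 4-pixel input image and two relit predictions with shifted colors.
reference = [(50.0, 10.0, -5.0), (60.0, 12.0, -1.0), (40.0, 8.0, 3.0), (70.0, 14.0, 7.0)]
predictions = [
    [(30.0, 0.0, 20.0), (35.0, 2.0, 22.0), (25.0, -2.0, 18.0), (45.0, 4.0, 26.0)],
    [(20.0, 1.0, 21.0), (28.0, 3.0, 25.0), (15.0, -1.0, 17.0), (33.0, 5.0, 24.0)],
]
matched = color_match(predictions, reference)
```

Because the statistics are pooled, a prediction that is darker than its siblings stays darker after matching; only the overall color distribution is aligned with the input.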

Augmenting Multi-View/Single-Lighting Datasets
Given a multi-view set $\mathcal{I}$ of images of a scene captured under the same lighting (suitable for training a radiance field model), we now leverage our light-conditioned ControlNet model to synthetically relight each image in $\mathcal{I}$. We assume the 3D pose of each image $I_i \in \mathcal{I}$ is known a priori, for example via COLMAP [SF16, SZPF16]. We then simply relight each $I_i \in \mathcal{I}$ to the corresponding 18 known light directions in the dataset from Murmann et al. [MGAD19], excluding the directions where the flash points forward (see Sec. 3.1). We now have a full multi-lighting, multi-view dataset. This process is illustrated in Fig. 7.
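Structurally, the augmentation is a double loop over views and target light directions. The sketch below shows this bookkeeping only; `relight_fn` is a hypothetical stand-in for the light-conditioned 2D relighting network, and the record layout is an assumption made for illustration.

```python
def augment_dataset(images, light_directions, relight_fn):
    """Relight every view under every target direction.

    `relight_fn(image, light_direction)` is a hypothetical placeholder for the
    2D relighting network; it is not an API of any real library.
    """
    augmented = []
    for view_index, image in enumerate(images):
        for light_index, direction in enumerate(light_directions):
            augmented.append({
                "view_index": view_index,
                "light_index": light_index,
                "light_direction": direction,  # still expressed in the camera frame
                "image": relight_fn(image, direction),
            })
    return augmented

# Toy usage: 3 views, 18 placeholder directions, and a dummy relighting function.
views = ["view_0", "view_1", "view_2"]
directions = [(0.0, 0.0, 1.0)] * 18
dataset = augment_dataset(views, directions, lambda img, l: (img, l))
```

Keeping the view index and light direction alongside each relit image is what later allows the radiance field to be conditioned on lighting per training sample.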

Training a Lighting-Consistent Radiance Field
Given the generated multi-light, multi-view dataset, we now describe our solution to provide a relightable radiance field. In particular, we build on the 3DGS framework of Kerbl et al. [KKLD23]. Our requirements are twofold: first, define an augmented radiance field that can represent lighting conditions from different lighting directions; second, allow direct control of the lighting direction used for relighting.
The original 3DGS [KKLD23] radiance field uses spherical harmonics (SH) to represent view-dependent illumination. To encode varying illumination, we replace the SH coefficients with a 3-layer MLP $c_\theta$ of width 128, which takes as input the light direction along with the viewing direction. Both vectors have a size of 16 after encoding.
Since light directions are computed with respect to a local camera reference frame (cf. Sec. 3.1), we subsequently register them to the world coordinate system (obtained from COLMAP) by rotating them according to their (known) camera rotation parameters:

$$l' = R_i\, l, \tag{2}$$

where $R_i$ is the $3 \times 3$ camera-to-world rotation matrix of image $I_i$ from its known pose.
We condition the MLP with the spherical harmonics encoding of the globally consistent lighting direction $l'$, which enables training a 3DGS representation on our multi-lighting dataset. While this strategy works well for static images, it results in inconsistent lighting across views despite accounting for camera rotation in Eq. 2. Radiance fields like 3DGS rely on multi-view consistency, and breaking it introduces additional floaters and holes in surfaces.
To allow the neural network to account for this inconsistency and correct accordingly, we optimize a per-image auxiliary latent vector $a$ of size 128. Similar approaches for variable appearance have been used for NeRFs [MBRS*21]. Therefore, in addition to the lighting direction $l'$, we condition the MLP with per-view auxiliary parameters $a$:

$$c(o, d) = \sum_{g=1}^{G} w_g\, c_\theta(x_g, d, l', a),$$

where $g \in [1, G]$ sums over the $G$ Gaussians (see [KKLD23]), $x_g$ and $w_g$ are their features and weights, $d$ is the view direction, $o$ the ray origin, and $c$ is the predicted pixel color. Note that for novel views at [...]
We first train 3DGS with the unlit images as a "warmup" stage for 5K iterations, then train the full multi-illumination solution for another 25K iterations, using all 18 back-facing light directions (see Sec. 3.1). The multi-illumination nature of the training results in an increase in "floaters". As observed by Philip and Deschaintre [PD23], floaters are often present close to the input cameras; the explicit nature of 3DGS allows us to reduce these effectively. In particular, we calculate a $z_\text{near}$ value for all cameras by taking the $z$ value of the 1st percentile of nearest SfM points and scaling this value down by a factor of 0.9. During training, at each step, all Gaussian primitives that project within the view frustum of a camera but are located in front of its $z_\text{near}$ plane are culled. Finally, given the complexity of modeling variable lighting, we observed that the optimization sometimes converges to blurry results. To counter this, we overweight three front-facing views (left, right, and center), by optimizing for one of these views every three iterations. This provides marginal improvement in results; all images shown are computed with this method, but it is optional.
The full method for relightable radiance fields is shown in Fig. 8. At inference, we can directly choose a lighting direction, and use efficient 3DGS rendering for interactive updates with modified lighting. Our latent vectors and floater removal remove most, but not all, artifacts introduced by the multi-view inconsistencies; this can be seen in the ablations at the end of the supplemental video.
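The registration described above is a single matrix-vector product per image. The following sketch applies a 3 × 3 camera-to-world rotation, stored as row-major nested lists, to a camera-frame light direction; the helper that builds a rotation about the z axis exists only to provide a test input and is not part of the method.

```python
import math

def rotate_to_world(rotation, light_dir):
    """Apply a 3x3 camera-to-world rotation (row-major nested lists) to a direction."""
    return tuple(sum(rotation[r][c] * light_dir[c] for c in range(3)) for r in range(3))

def rotation_about_z(angle):
    """Illustrative rotation matrix for a camera rotated by `angle` radians about z."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

l_camera = (1.0, 0.0, 0.0)                                            # direction in the camera frame
l_world = rotate_to_world(rotation_about_z(math.pi / 2.0), l_camera)  # approximately (0, 1, 0)
```

Since rotations preserve length, a unit light direction remains a unit vector after registration and can be passed directly to the direction encoding.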
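The near-plane culling heuristic described above can be sketched for a single camera as follows. The nearest-rank percentile definition, the precomputed `in_frustum` flags, and the function names are assumptions of this sketch; a real implementation would evaluate the frustum test from each Gaussian's projection and run on the GPU.

```python
import math

def z_near(sfm_depths, percentile=1.0, scale=0.9):
    """Near plane: the given percentile of positive camera-space depths, scaled down.

    Uses the nearest-rank percentile definition, which is an assumption of this sketch.
    """
    depths = sorted(d for d in sfm_depths if d > 0.0)
    rank = max(1, math.ceil(percentile * len(depths) / 100.0))
    return scale * depths[rank - 1]

def cull_floaters(gaussian_depths, in_frustum, near):
    """Return a keep-mask: drop Gaussians inside the frustum but in front of the near plane."""
    return [not (inside and 0.0 < z < near) for z, inside in zip(gaussian_depths, in_frustum)]

# Toy camera: 200 SfM points at depths 1.0 .. 200.0, so the 1st percentile is 2.0.
depths = [float(d) for d in range(1, 201)]
near = z_near(depths)  # 0.9 * 2.0
keep = cull_floaters([0.5, 1.5, 3.0, 0.5], [True, True, True, False], near)
```

Scaling the percentile depth down leaves a safety margin, so that legitimate geometry at the closest reconstructed depth is not removed together with the floaters.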

Results and Evaluation
Our method was implemented by leveraging publicly available implementations of ControlNet [ZRA23] and 3DGS [KKLD23]. We use Stable Diffusion [RBL*22] v2.1 as a backbone. Our source code and datasets will be released upon publication.
We first present the results of our 3D relightable radiance field, both for synthetic and real-world scenes. We then present a quantitative and qualitative evaluation of our method by comparing it to previous work and finally present an ablation of the auxiliary vector $a$ from Sec. 3.3.

Test Datasets
Since there are no real multi-view multi-illumination indoor datasets of full scenes available for our evaluation, we use synthetic scenes to allow quantitative evaluation. For this purpose, we designed 4 synthetic test scenes (KITCHEN, LIVINGROOM, OFFICE, BEDROOM). They were created in Blender by downloading artist-made 3D rooms from Evermotion and modifying them to increase clutter: in each room, we gathered objects and placed them on a table or a countertop. We also created simpler, diffuse-only versions to evaluate how scene clutter affects the relighting results. For each synthetic scene, we first built a standard multi-view (single-lighting) dataset consisting of 4 camera sweeps (left-to-right, at varying elevations) of 50 frames for training and one (at a different elevation) of 100 frames for testing. We simulated the light direction of the 2D training dataset with a spotlight with an intensity of 2 kW and a radius of 0.1, located on top of the camera and pointing away from it. We used the same set of camera flash directions as in the dataset of Murmann et al. [MGAD19]. We then rendered all frames at 736 × 512 using the Cycles path tracer. Please note that the effective lighting direction will be dependent on the exact configuration of the room. This configuration is our best effort to produce a ground truth usable for comparison.
In addition, we also captured a set of real scenes (KETTLE, HOT PLATES, PAINT GUN, CHEST OF DRAWERS and GARAGE WALL), for which we performed a standard radiance-field multi-view capture, by taking between 90 and 150 images of the environment, in an approximate sphere (or hemisphere) around the scene center of interest.

3D Relighting Results
We begin by showing qualitative results on the set of real scenes that we captured. Here, we used a resolution of 1536 × 1024, training for 150K iterations. We show qualitative results for these scenes using our 3D relightable radiance field in Fig. 9. In addition, we also show results for two scenes from the MipNeRF360 dataset, namely COUNTER and ROOM.
As our method is lightweight and only adds a small MLP over the core 3DGS architecture, it runs interactively for both novel view synthesis and relighting at 30 fps on an A6000 GPU. Memory usage is comparable to the original 3DGS. Please see the video for interactive relighting results on these scenes and additional synthetic scenes. We see that our method produces realistic and plausible relighting results. Also, note that our solution is temporally consistent.

Evaluation
Baselines. We compare our results to the method of [PMGD21], which is specifically designed for complete scenes, TensoIR [JLX*23] and RelightableGaussians [GGL*23]. Given that most other methods do not handle full scenes well, we also create a new baseline, by first training 3DGS [KKLD23] on the input data and rendering a test path using novel view synthesis; we then use OutCast [GRP22] to relight each individual rendered frame using the target direction. We trained TensoIR [JLX*23] using the default configuration but modified the "density_shift" parameter from −10 to −8 to achieve best results on our data. For Relightable 3D Gaussians [GGL*23], we train their "Stage 1" for 30K iterations and "Stage 2" for an additional 10K to recover the BRDF parameters.
We then relight the scenes using 360° environment maps rendered in Blender using a generic empty room and a similar camera/flash setup for ground truth. Finally, to improve the baselines we normalize the predictions of all methods; we first subtract the channel-wise mean and divide out the channel-wise standard deviation, and then multiply by and add the corresponding parameters of the ground truths. These operations are performed in LAB space for all methods.
Experimental methodology. We use our synthetic test scenes for providing quantitative results. To compare our method, we rendered 200 novel views with 18 different lighting directions to evaluate the relighting quality for each method by computing standard image quality metrics. Given the complexity of setup for [PMGD21], we only show qualitative results for 1 real scene in Fig. 10. Here, our method was trained at 768 × 512 resolution for 200K iterations, with a batch size of 8 and a learning rate of $10^{-4}$.
Results. We present quantitative results in Table 1. We present per-scene results on the following image quality metrics: PSNR, SSIM, and LPIPS [ZIE*18]. The results demonstrate that our method outperforms all others in all but a few scenarios, where it still achieves competitive performance.
Qualitative comparisons are shown in Fig. 11; on the left we show the ground truth relit image rendered in Blender, and we then show our results, as well as those from OutCast [GRP22], Relightable 3D Gaussians [GGL*23] and TensoIR [JLX*23]. Please refer to the supplementary HTML viewer for more results. We clearly see that our method is closer to the ground truth, visually confirming the quantitative results in Tab. 1. TensoIR has difficulty reconstructing the geometry, and Relightable 3D Gaussians tend to have a "splotchy" look due to inaccurate normals. OutCast has difficulty with the overall lighting condition and can add incorrect shadows, but in many cases produces convincing results since it operates in image space. Our results show that by using the diffusion prior we manage to achieve realistic relighting, surpassing the state of the art.
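Of the three metrics, PSNR is simple enough to state in a few lines. The sketch below computes it for two equally sized images given as flat lists of values in [0, 1]; this is the textbook definition, not the evaluation script used for Table 1. SSIM and LPIPS require dedicated implementations and are not sketched here.

```python
import math

def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    mse = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

score = psnr([0.0, 0.5], [0.1, 0.5])  # MSE = 0.005
```

Higher PSNR indicates a relit rendering that is numerically closer to the ground truth, although it is less sensitive to perceptual differences than SSIM or LPIPS.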
Our method was trained for indoor scenes; Fig. 13 gives additional ControlNet results on out-of-distribution samples, showing that it can generalize to some extent to unseen scenes and lighting conditions, although the realism is lower than for in-distribution samples.

Conclusion
We have presented the first method to effectively leverage the strong prior of large generative diffusion models in the context of radiance field relighting. Rather than relying on accurate geometry, material and/or lighting estimation, our approach models realistic illumination directly, by leveraging a general-purpose single-view, multi-illumination dataset and fine-tuning a large pretrained generative model. Our results show that we can synthesize realistic relighting of captured scenes, while allowing interactive novel-view synthesis by building on such priors. Our method shows levels of realism for relighting that surpass the state of the art for cluttered indoor scenes (as opposed to isolated objects).
Figure 12: Example limitations of our approach, with our prediction (top) vs ground truth (bottom). Our ControlNet mistakenly produces a shadow at the top of the image while there should not be any (red arrow), presumably assuming the presence of another top shelf. Additionally, the highlight position is somewhat incorrect (yellow arrow), ostensibly because we define light direction in a manner that is not fully physically accurate.
One limitation of the proposed method is that it does not enforce physical accuracy: the target light direction is noisy and the ControlNet relies mostly on its powerful Stable Diffusion prior to relight rather than performing physics-based reasoning. For example, Fig. 12 shows that ControlNet can hallucinate shadows due to unseen geometry, while there should not be any. Given that we define light direction in a manner that is not fully physically accurate, the positioning of highlights can be inaccurate, as is also shown in Fig. 12. In addition, the appearance embeddings can correct for global inconsistencies indirectly and do not explicitly rely on the learned 3D representation of the radiance field. Our method does not always remove or move shadows in a fully accurate, physically based manner. While our method clearly demonstrates that 2D diffusion model priors can be used for realistic relighting, the ability to perform more complex relighting (rather than just changing light direction) requires significant future research, e.g., by using more general training data as well as ways to encode and decode complex lighting.
An interesting direction for future work would be trying to enforce multi-view consistency more explicitly in ControlNet, e.g., by leveraging single-illumination multi-view data. Another interesting direction is to develop solutions that would guide the predicted relighting, making it more accurate, leveraging the 3D geometric information available in the radiance field more explicitly.

Figure 2: We use the single-view, multi-illumination dataset of Murmann et al. [MGAD19] to train ControlNet [ZRA23] on single-view supervised relighting. The network accepts an image (along with its estimated depth map) and a target light direction as input and produces a relit version of the same scene under the desired target lighting.

Figure 3: Top row: five diffuse spheres rendered with our optimized lighting direction and shading parameters; the direction is indicated by a blue dot at the point of maximum specular intensity. Bottom row: the corresponding target gray spheres obtained by averaging the diffuse spheres captured across all scenes. We found the lighting directions by minimizing the $L_1$ distance between the top and bottom rows.

Figure 4: Importance of post-relighting color and contrast adjustments. Left: input image. Middle: naive ControlNet relighting; the bottle has the wrong color and the contrast is poor. Right: our relighting after training with [LLLY23] and after color-matching the input.

Figure 5: Importance of conserving edge sharpness when relighting. Left: input image. Middle: naive ControlNet relighting; note how the edges do not match the input and how the text is illegible. Right: our final relighting after fine-tuning the conditional decoder network from [ZFC*23].

Figure 6: Relighting results with our light-conditioned ControlNet. From a single input image (left column), the ControlNet can generate realistic relit versions for different target light directions (other columns). Please notice realistic changes in highlights for different light directions (top row), as well as the synthesis of cast shadows (bottom row).

Figure 7: Given a multi-view, single-illumination dataset we use our relighting ControlNet to generate a multi-view, multi-illumination dataset.

Figure 8: Overview of our radiance training scheme. To alleviate potential inconsistencies in lighting directions, we condition our 3DGS-based radiance field both on the illumination direction encoding and on optimized auxiliary vectors (one per training image). These vectors model the differences between predictions and let us fit each view to convergence.

Figure 9: Qualitative relighting results for the real scenes, from left to right: CHEST OF DRAWERS, KETTLE, MIPNERF ROOM and GARAGE WALL, for a moving light source. The lighting direction is indicated in the gray ball in the lower right. Please see the supplemental video for more results. Please note how the highlights (left group) and shadows (right group) have changed.

Figure 10: Qualitative comparison on real scene KETTLE. From left to right, from the same viewpoint: input lighting condition (view reconstructed using 3D Gaussian Splatting), target lighting, our relighting, Philip et al. [PMGD21] relighting. Top and bottom rows are two different lighting conditions. Philip et al. [PMGD21] exhibits many more geometry and shading artifacts compared to our method; in particular, imprecise MVS preprocessing results in missing geometry.

Figure 13: We show the results of our 2D relighting network on out-of-distribution images (StyleGAN-generated woman and MipNeRF360 BICYCLE, GARDEN, and STUMP). On human faces, ControlNet may change the expression as well as the lighting, or create excessive shininess; on outdoor scenes, while the overall lighting direction is plausible, the network fails to generate sufficiently hard shadows.
The predicted color for that pixel is obtained by integrating a color field $c_\theta$ weighted by a density field $\sigma_\theta$ following the equation of volume rendering. The original NeRF was slow to train and to render; a vast number of methods [TTM*…] have been proposed to improve the original technique, e.g., acceleration structures [MESK22], antialiasing [BMT*21], handling larger scenes [BMV*22], etc. Recently, 3D Gaussian Splatting (3DGS) [KKLD23] introduced an explicit, primitive-based representation of radiance fields. The anisotropic nature of the 3D Gaussians allows the efficient representation of fine detail, and the fast GPU-accelerated rasterization used allows real-time rendering. We use 3DGS to represent radiance fields mainly for performance, but any other radiance representation, e.g., [CXG*22], [...]
[...] where $X_{t,i}$ is the noisy image at timestep $t \in [1, T]$, where $i, j \in [1, M]$, and where $\psi$ are the ControlNet optimizable parameters only. $X_j$ is another image from the set and $D_j$ is its depth map (obtained [...]
Author's version. Y. Poirier-Ginter et al. / A Diffusion Approach to Radiance Field Relighting

Table 1: Quantitative results of our 3D relighting on the synthetic datasets (where ground truth is available), compared to previous work, from left to right: OutCast [GRP22] (run on individual images from 3DGS [KKLD23]), Relightable 3D Gaussians [GGL*23], and TensoIR [JLX*23]. Arrows indicate whether higher or lower (↑ ↓) is better. Results are color coded by best, second- and third-best.
We show comparative results of our method on synthetic scenes where the (approximate) ground truth is available (left), and compare to previous methods. Our approach is closer to the ground truth lighting, capturing the overall appearance in a realistic manner.
[...] TensoIR [JLX*23] and RelightableGaussians [GGL*23]. Given that most other methods do not handle full scenes well, we also create a new baseline, by first training 3DGS [KKLD23] [...]