Neural Radiance Fields model the radiance of light in 3D space. The “neural” in the name comes from the fact that a neural network is used to model the radiance field; hence, neural radiance fields, or NeRFs. This technique was presented by Mildenhall et al. in 2020.
What is a radiance field?
A great explanation of radiance fields can be found in a blog post by Nathan Reed. What is enough for now is that a radiance field models the 3D scene as objects “radiating” color (due to the light in the scene) at a particular position (x, y, z) and in a particular viewing direction (θ, ϕ). So, with a NeRF, the objective is to determine the color of an object in space when it is looked at from a particular direction.
Unfortunately, we can’t just learn the color at every position and viewing direction. What if something has “no color”? This happens when there is simply no object at that position for light to radiate from. So on top of the color, we also learn something analogous to occupancy: the density. The density is a continuous measure of how much a particular position contributes color to a ray passing through it (unlike the color, it does not depend on the viewing direction).
The learned density (opacity) and color are then used in a volumetric rendering function to extract images from the learned 3D representation.
How to train a NeRF?
In order to train a NeRF, the following data are needed: a set of images, the camera pose from which each image was taken, and the camera intrinsic parameters. The last two can be obtained from the set of images using a structure-from-motion (SfM) tool such as COLMAP.
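If you want to stay in Python, the pycolmap bindings expose roughly the same SfM pipeline. The sketch below is a rough outline, not the exact commands we used: the paths are placeholders and the API may differ between pycolmap versions.

```python
# Rough sketch: running COLMAP's SfM pipeline via the pycolmap bindings
# to recover camera poses and intrinsics from a folder of images.
# Paths are placeholders; check your pycolmap version for the exact API.
from pathlib import Path
import pycolmap

image_dir = Path("images/")        # folder with the training images
output_dir = Path("colmap_out/")   # where the reconstruction is written
output_dir.mkdir(exist_ok=True)
database_path = output_dir / "database.db"

pycolmap.extract_features(database_path, image_dir)   # detect features in every image
pycolmap.match_exhaustive(database_path)              # match features across image pairs
maps = pycolmap.incremental_mapping(database_path, image_dir, output_dir)
maps[0].write(output_dir)  # cameras (intrinsics), images (poses), and sparse points
```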
The output of a NeRF model is the color and opacity for an input position (x, y, z) and viewing direction (θ, ϕ). The color and opacity are then used in a volumetric rendering function (eq. 3 in the original paper) to render a 2D image. The rendered image is compared to the training image, and a mean squared loss between the pixel colors is used to train the network.
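To make this concrete, here is a minimal PyTorch sketch of a NeRF-style network together with the quadrature form of the volumetric rendering function. This is not the exact architecture or code from the paper or our notebook; the layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: (encoded position, encoded direction) -> (RGB, density)."""
    def __init__(self, pos_dim=63, dir_dim=27, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # density depends on position only
        self.rgb_head = nn.Sequential(                    # color also depends on view direction
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.trunk(x)
        sigma = torch.relu(self.sigma_head(h))            # non-negative density
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))
        return rgb, sigma

def render_rays(rgb, sigma, z_vals):
    """Numerical quadrature of eq. 3: alpha-composite the samples along each ray.

    rgb:    (num_rays, num_samples, 3) colors at the sampled points
    sigma:  (num_rays, num_samples)    densities at the sampled points
    z_vals: (num_rays, num_samples)    depths of the samples along each ray
    """
    # Distances between adjacent samples; the last interval is effectively infinite.
    deltas = z_vals[..., 1:] - z_vals[..., :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], dim=-1)

    alpha = 1.0 - torch.exp(-sigma * deltas)              # opacity of each segment
    # Transmittance: probability the ray reaches a sample without being blocked earlier.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[..., :-1]
    weights = alpha * trans                               # contribution of each sample
    pixel_rgb = (weights.unsqueeze(-1) * rgb).sum(dim=-2) # expected color along each ray
    return pixel_rgb, weights
```

The rendered `pixel_rgb` is what gets compared to the ground-truth pixel colors with a mean squared loss, and the per-sample `weights` are also what the hierarchical sampling step described below reuses.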
The paper mentions two key factors that contribute to the success of the NeRF: positional encoding and hierarchical sampling for volumetric rendering.
Positional encoding is necessary because neural networks are biased toward approximating low-frequency functions. The issue is that images are high-frequency signals. To see what is meant by this, the plot below shows the R, G, B channels from a random row of an image. You can see that if you tried to estimate the function that represents the colors along this row, you’d have to estimate a high-frequency one.
So, to make it easier for the network to fit this high-frequency function, the positional encoding maps the low-dimensional input (position and viewing direction) into a higher-dimensional, high-frequency representation using a Fourier basis (eq. 4 in the original paper).
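As a rough illustration, the encoding of eq. 4 can be written as below. The function name and default number of frequencies are ours, not from the paper’s code.

```python
import math
import torch

def positional_encoding(p, num_freqs=10, include_input=True):
    """Map each coordinate to [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..num_freqs-1 (eq. 4).

    p: (..., 3) positions (or directions), assumed normalized to roughly [-1, 1].
    Returns a (..., 3 + 3 * 2 * num_freqs) encoding when include_input is True.
    """
    out = [p] if include_input else []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        out.append(torch.sin(freq * p))
        out.append(torch.cos(freq * p))
    return torch.cat(out, dim=-1)

# e.g. 3D positions with 10 frequencies -> 63-dimensional inputs to the network
x = torch.rand(1024, 3) * 2 - 1
print(positional_encoding(x, num_freqs=10).shape)  # torch.Size([1024, 63])
```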
Hierarchical sampling for volumetric rendering is done in order to speed up training and improve rendering quality. Inverse transform sampling is used to concentrate samples along each ray at points with high opacity, because these points contribute the most to the final rendered image. Free space and occluded regions (which contribute little to the rendered color) are largely skipped.
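Below is a simplified sketch of how such inverse transform sampling can be implemented, reusing the per-sample weights produced by the coarse rendering pass. The function and argument names are illustrative, not the paper’s or our notebook’s exact code.

```python
import torch

def sample_pdf(z_vals_mid, weights, num_fine, eps=1e-5):
    """Draw extra samples along each ray where the coarse pass found high contribution.

    z_vals_mid: (num_rays, num_bins) midpoints of the coarse sample intervals
    weights:    (num_rays, num_bins) rendering weights from the coarse pass
    Returns:    (num_rays, num_fine) new sample depths, via inverse transform sampling.
    """
    # Turn the weights into a piecewise-constant PDF, then a CDF.
    pdf = (weights + eps) / (weights + eps).sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (num_rays, num_bins + 1)

    # Uniform samples in [0, 1), pushed through the inverse CDF.
    u = torch.rand(*cdf.shape[:-1], num_fine, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True)
    lo = torch.clamp(idx - 1, min=0)
    hi = torch.clamp(idx, max=z_vals_mid.shape[-1])

    # Linearly interpolate the depth within the selected bin.
    cdf_lo = torch.gather(cdf, -1, lo)
    cdf_hi = torch.gather(cdf, -1, hi)
    bins = torch.cat([z_vals_mid, z_vals_mid[..., -1:]], dim=-1)     # pad so `hi` never overflows
    z_lo = torch.gather(bins, -1, lo)
    z_hi = torch.gather(bins, -1, hi)
    denom = torch.where(cdf_hi - cdf_lo < eps, torch.ones_like(cdf_lo), cdf_hi - cdf_lo)
    t = (u - cdf_lo) / denom
    return z_lo + t * (z_hi - z_lo)
```

These fine samples are then evaluated by the network and composited together with the coarse samples in the volumetric rendering function.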
Implementing NeRFs
For our project, we implement a NeRF model using PyTorch; at the time of this project, NeRF did not have an open-source PyTorch-based implementation. We additionally provide a visualization of which points in space are used in the volumetric rendering function, to ground the intuition of how rendering is done. The Python notebook for this implementation can be found in our repo! An example of a training image vs. the predicted image is given below: