Depth to Point Cloud
In this tutorial, we’ll learn how to capture RGB and depth images from AirGen’s simulation environment, generate point clouds from ground truth depth data, predict depth maps from RGB images using the Metric3D model, and visualize everything using the Rerun library.
The notebook for this example can be found here.
Getting Started
First, let’s import the Metric3D model for depth prediction:
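The exact import path depends on how the Metric3D wrapper is packaged in your environment; a minimal sketch might look like this (the module path and constructor below are placeholders):

```python
# Hypothetical import path -- adjust to match your installation.
from metric3d import Metric3D

# Instantiate the monocular depth estimation model.
model = Metric3D()
```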
The Metric3D model is a computer vision model specifically designed for monocular depth estimation. This model can predict depth from a single RGB image, which is particularly useful when ground truth depth information isn’t available.
Capturing Images and Generating Point Clouds
Now we’ll create a workflow that captures and processes images from the simulation environment:
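A minimal sketch of the setup, assuming getImages takes a camera name and a list of image types (see the notebook for the exact call):

```python
import airgen

# Connect to the running AirGen simulation.
client = airgen.VehicleClient()
client.confirmConnection()  # AirSim-style handshake; may be optional

# Capture an RGB (Scene) image and a perspective depth image from the
# front_center camera. Each returned entry is assumed to be an
# (image_array, camera_params) pair, in the order requested.
images = client.getImages(
    "front_center",
    [airgen.ImageType.Scene, airgen.ImageType.DepthPerspective],
)
```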
We also initialize an AirGen VehicleClient, which provides an interface to the simulation environment.
This code captures images from the “front_center” camera of the simulated vehicle. We request two types of images:
- airgen.ImageType.Scene: an RGB image showing what the camera sees
- airgen.ImageType.DepthPerspective: a depth image containing distance information for each pixel
The getImages method returns a list of image data, which we organize into a dictionary for easier access.
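For example (the dictionary keys here are illustrative):

```python
# Map each returned (image, camera_params) pair to a readable key.
image_data = {
    "rgb": images[0],    # airgen.ImageType.Scene
    "depth": images[1],  # airgen.ImageType.DepthPerspective
}
```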
Here we iterate through our captured images. For the depth image, we create a mask to filter out points that are too far away (>1000 units). The np.where function creates a binary mask: 1 for pixels where the depth is less than 1000, and 0 for pixels where it is greater than or equal to 1000. This helps us focus on relevant objects in the scene and ignore very distant points.
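For the depth entry, the mask creation might look like this (assuming camera_data is the (depth_map, camera_params) pair for the depth image):

```python
import numpy as np

camera_data = image_data["depth"]
# 1 where the point is closer than 1000 units, 0 otherwise.
mask = np.where(camera_data[0] < 1000, 1, 0)
```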
The depth2pointcloud function converts the 2D depth map into a 3D point cloud. It takes three parameters:
- depth: the depth map (camera_data[0])
- camera_param: camera parameters such as focal length and principal point (camera_data[1])
- mask: our binary mask to filter out distant points
The function uses the camera parameters to project each pixel into 3D space based on its depth value, resulting in a cloud of 3D points representing the scene.
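Assuming depth2pointcloud is provided by AirGen's utilities (the import location below is a guess), the conversion might look like:

```python
from airgen.utils import depth2pointcloud  # hypothetical import path

point_cloud = depth2pointcloud(
    camera_data[0],  # depth map
    camera_data[1],  # camera parameters (focal length, principal point)
    mask,            # binary mask filtering out distant points
)
```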
These lines print the shapes of our data structures and visualize the ground truth point cloud and depth map using Rerun. The rr.log function sends data to the Rerun visualizer: the first parameter is a path in the visualization tree, and the second is the data to visualize.
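The ground truth logging might look like this; the entity paths and the Rerun archetypes shown here are assumptions (the predicted-depth paths later in the tutorial follow the same 'grid/...' pattern):

```python
import rerun as rr

# Assumes rr.init(..., spawn=True) was called earlier in the notebook.
print("ground truth point cloud:", point_cloud.shape)
print("ground truth depth map:", camera_data[0].shape)

rr.log("grid/point_cloud_ground_truth", rr.Points3D(point_cloud))
rr.log("grid/depth_ground_truth", rr.DepthImage(camera_data[0].squeeze()))
```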
For the RGB image, we first visualize it and then use the Metric3D model to predict a depth map. The run method takes an RGB image as input and returns a predicted depth map. This is the key step where AI-based depth estimation occurs.
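In the RGB branch, the prediction step might look like this (the 'grid/rgb' path is illustrative; run is the model's inference call described above):

```python
# Switch to the RGB entry; here camera_data[0] is the RGB image.
camera_data = image_data["rgb"]
rr.log("grid/rgb", rr.Image(camera_data[0]))

# Predict a metric depth map from the single RGB image.
pred_depth = model.run(camera_data[0])  # shape (256, 256)
```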
The Metric3D model outputs a depth map with shape (256, 256), but the depth2pointcloud function expects a depth map with shape (256, 256, 1), where the last dimension is the channel dimension. The reshape operation adds that extra dimension.
Similar to what we did with the ground truth depth, we create a mask to filter out distant points in the predicted depth map. Note that we use a different threshold (150 instead of 1000) because the predicted depth values might be on a different scale than the ground truth.
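A sketch of the reshape and masking steps:

```python
# Add a channel dimension: (256, 256) -> (256, 256, 1).
pred_depth = pred_depth.reshape(256, 256, 1)

# The predicted depth may be on a different scale than the ground truth,
# so a lower threshold (150) is used here.
pred_mask = np.where(pred_depth < 150, 1, 0)
```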
We use the same depth2pointcloud function to convert the predicted depth map into a 3D point cloud. Note that we reuse the camera parameters from the original RGB image (camera_data[1]), as these parameters are the same for both the RGB and depth cameras in the simulation.
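The conversion mirrors the ground truth case, reusing the RGB camera's parameters:

```python
pred_point_cloud = depth2pointcloud(
    pred_depth,      # predicted depth map, shape (256, 256, 1)
    camera_data[1],  # camera parameters from the RGB capture
    pred_mask,       # mask for the predicted depth
)
```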
Finally, we print the shape of the reshaped depth map and visualize the predicted point cloud and depth map using Rerun. The visualization paths (‘grid/point_cloud_predicted’ and ‘grid/depth_predicted’) are different from the ground truth visualization paths, allowing us to compare them side by side in the Rerun visualizer.
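The final logging calls might look like this (the Rerun archetypes are one reasonable choice; the entity paths match those mentioned above):

```python
print("reshaped predicted depth:", pred_depth.shape)  # (256, 256, 1)

rr.log("grid/point_cloud_predicted", rr.Points3D(pred_point_cloud))
rr.log("grid/depth_predicted", rr.DepthImage(pred_depth.squeeze()))
```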