In this tutorial, we’ll learn how to capture RGB and depth images from AirGen’s simulation environment, generate point clouds from ground truth depth data, predict depth maps from RGB images using the Metric3D model, and visualize everything using the Rerun library.

The notebook for this example can be found here.

Getting Started

First, let’s import the Metric3D model for depth prediction:

from grid.model.perception.depth.metric3d import Metric3D
depth_metric3d_0 = Metric3D()

Metric3D is a computer vision model designed for monocular depth estimation: it predicts a metric depth map from a single RGB image, which is particularly useful when ground truth depth information isn’t available.
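
If you want to sanity-check the model outside the simulation loop, a minimal sketch looks like the following (the file path is just a placeholder for any RGB image on disk):

import numpy as np
from PIL import Image

# Load any RGB image as an (H, W, 3) uint8 array; "sample_frame.png" is illustrative
rgb = np.array(Image.open("sample_frame.png").convert("RGB"))

# Predict a per-pixel depth map from the single RGB image
depth_pred = depth_metric3d_0.run(rgb)
print(depth_pred.shape)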

Capturing Images and Generating Point Clouds

Now we’ll create a workflow that captures and processes images from the simulation environment:

import numpy as np
import airgen
from PIL import Image
import rerun as rr
from airgen.utils.vision import depth2pointcloud

# Initialize the AirGen client
client = airgen.VehicleClient()

After importing the required libraries, we initialize an AirGen VehicleClient, which provides an interface to the simulation environment.
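
Rerun also needs a recording to be initialized before anything is logged; if the notebook doesn’t already do this, a minimal sketch is shown below (the application id is arbitrary):

# Start a Rerun recording and spawn the viewer; drop spawn=True if a viewer is already running
rr.init("airgen_depth_tutorial", spawn=True)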

# Capture RGB and depth images from the front center camera
images = client.getImages("front_center", 
        [airgen.ImageType.Scene, airgen.ImageType.DepthPerspective])
capture = {
    "front_rgb": images[0],
    "front_depth": images[1],
}

This code captures images from the “front_center” camera of the simulated vehicle. We request two types of images:

  • airgen.ImageType.Scene: An RGB image showing what the camera sees
  • airgen.ImageType.DepthPerspective: A depth image containing distance information for each pixel

The getImages method returns a list of image data, which we organize into a dictionary for easier access.
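
Each entry in the returned list bundles the pixel data with the matching camera parameters (they are indexed as camera_data[0] and camera_data[1] below), so a quick way to inspect what was captured is:

# Inspect each capture: index 0 is the pixel array, index 1 the camera parameters
for name, data in capture.items():
    print(name, data[0].shape, type(data[1]))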

# Process each captured image
for camera_name, camera_data in capture.items():
    if camera_name == "front_depth":
        # Process ground truth depth data
        mask = np.where(camera_data[0] < 1000.0, 1, 0).astype(np.uint8)

Here we iterate through our captured images. For the depth image, we create a mask to filter out points that are too far away (>1000 units). The np.where function creates a binary mask: 1 for pixels where depth is less than 1000, and 0 for pixels where depth is greater than or equal to 1000. This helps us focus on relevant objects in the scene and ignore very distant points.
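
For intuition, here is the same masking pattern applied to a tiny hand-made depth array:

import numpy as np

toy_depth = np.array([[5.0, 2000.0],
                      [999.9, 1000.0]])
toy_mask = np.where(toy_depth < 1000.0, 1, 0).astype(np.uint8)
print(toy_mask)
# [[1 0]
#  [1 0]]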

# Generate point cloud from depth data
point_cloud = depth2pointcloud(
    depth=camera_data[0], 
    camera_param=camera_data[1], 
    mask=mask
)

The depth2pointcloud function converts the 2D depth map into a 3D point cloud. It takes three parameters:

  • depth: The depth map (camera_data[0])
  • camera_param: Camera parameters like focal length and principal point (camera_data[1])
  • mask: Our binary mask to filter out distant points

The function uses the camera parameters to project each pixel into 3D space based on its depth value, resulting in a cloud of 3D points representing the scene.
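
For intuition, this projection follows the standard pinhole camera model. The sketch below is purely illustrative and is not the AirGen implementation (whose camera_param format and axis conventions may differ); it assumes the intrinsics are given directly as fx, fy, cx, cy:

import numpy as np

def simple_depth_to_points(depth, fx, fy, cx, cy, mask=None):
    """Back-project an (H, W) depth map into an (N, 3) point cloud in the camera frame."""
    h, w = depth.shape
    # Pixel coordinate grids: u indexes columns, v indexes rows
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    if mask is not None:
        # Keep only the pixels the binary mask marks as valid
        points = points[mask.reshape(-1) > 0]
    return points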

# Display information about the data
print(f"Point cloud shape: {point_cloud.shape}")
print(f"Depth map shape: {camera_data[0].shape}")

# Visualize the ground truth point cloud and depth map
rr.log('grid/point_cloud', rr.Points3D(positions=point_cloud))
rr.log('grid/depth_ground_truth', rr.DepthImage(camera_data[0]))

These lines print the shapes of our data structures and visualize the ground truth point cloud and depth map using Rerun. The rr.log function sends data to the Rerun visualizer, with the first parameter being a path in the visualization tree and the second parameter being the data to visualize.

else:
    # Process RGB image
    print(f"RGB image shape: {camera_data[0].shape}")
    rr.log('grid/rgb', rr.Image(camera_data[0]))
    
    # Predict depth map from RGB image using Metric3D
    depth_map_pred = depth_metric3d_0.run(camera_data[0])
    print(f"Predicted depth map shape: {depth_map_pred.shape}")

For the RGB image, we first visualize it and then use the Metric3D model to predict a depth map. The run method takes an RGB image as input and returns a predicted depth map. This is the key step where AI-based depth estimation occurs.

# Reshape the predicted depth map to be compatible with depth2pointcloud
depth_map_3d_reshape = depth_map_pred.reshape((256, 256, 1))

The Metric3D model outputs a depth map with shape (256, 256), but the depth2pointcloud function expects a depth map with shape (256, 256, 1) where the last dimension is the channel dimension. This reshape operation adds that extra dimension.
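
If the prediction is a plain (H, W) array, the same result can be obtained without hard-coding the resolution:

# Resolution-independent alternative to the explicit reshape
depth_map_3d_reshape = depth_map_pred[..., np.newaxis]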

# Create a mask to filter out points that are too far away
mask = np.where(depth_map_3d_reshape < 150.0, 1, 0).astype(np.uint8)

Similar to what we did with the ground truth depth, we create a mask to filter out distant points in the predicted depth map. Note that we use a different threshold (150 instead of 1000) because the predicted depth values might be on a different scale than the ground truth.
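
A quick way to choose a sensible cutoff is to inspect the range of predicted values before masking:

# Check the value range of the predicted depth before picking a threshold
print(f"Predicted depth range: {depth_map_pred.min():.2f} to {depth_map_pred.max():.2f}")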

# Generate point cloud from predicted depth map
point_cloud = depth2pointcloud(
    depth=depth_map_3d_reshape, 
    camera_param=camera_data[1], 
    mask=mask
)

We use the same depth2pointcloud function to convert the predicted depth map to a 3D point cloud. Note that we use the camera parameters from the original RGB image (camera_data[1]), as these parameters are the same for both the RGB and depth cameras in the simulation.

print(f"Predicted depth map after reshape: {depth_map_3d_reshape.shape}")

# Visualize the predicted point cloud and depth map
rr.log('grid/point_cloud_predicted', rr.Points3D(positions=point_cloud))
rr.log('grid/depth_predicted', rr.DepthImage(depth_map_3d_reshape))

Finally, we print the shape of the reshaped depth map and visualize the predicted point cloud and depth map using Rerun. The visualization paths (‘grid/point_cloud_predicted’ and ‘grid/depth_predicted’) are different from the ground truth visualization paths, allowing us to compare them side by side in the Rerun visualizer.
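
To make the comparison even easier to read in the viewer, the two clouds can also be logged with distinct colors. In the sketch below, point_cloud_gt and point_cloud_pred are illustrative names standing in for the ground truth and predicted arrays (both are simply called point_cloud in the loop above):

# Optional: color the two clouds so they are easy to tell apart in the viewer
rr.log('grid/point_cloud', rr.Points3D(positions=point_cloud_gt, colors=[0, 255, 0]))
rr.log('grid/point_cloud_predicted', rr.Points3D(positions=point_cloud_pred, colors=[255, 0, 0]))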