This page explains how to install the grid-cortex-client Python package and access the AI models hosted by GRID Cortex.

Installation

Install the GRID Cortex client package using pip:
pip install grid-cortex-client
Python 3.10 or newer is recommended.
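To confirm the install, you can query the installed distribution from the standard library (a minimal check; the printed version string will differ depending on the release you installed):
from importlib.metadata import version
print(version("grid-cortex-client"))  # raises PackageNotFoundError if the package is missing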

Authentication & Endpoint

Set up your API key (and endpoint, if needed):
  1. During onboarding, General Robotics will give you a personal CORTEX API key.
  2. Export it so the client can pick it up automatically:
    export GRID_CORTEX_API_KEY="<YOUR_KEY>"
    
  3. If you run Cortex on-prem or on a managed cloud deployment, point the client at your instance:
    export GRID_CORTEX_BASE_URL="https://<custom_IP>/cortex"
    
  4. You can also pass the key directly when constructing the client:
    from grid_cortex_client import CortexClient
    client = CortexClient(api_key="<YOUR_KEY>")
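Before constructing the client, you can verify that the environment variables were actually exported. A minimal check using only the standard library (GRID_CORTEX_BASE_URL is optional and only set for on-prem or managed deployments):
import os

# The API key must be present; the base URL is only needed for custom deployments.
assert os.environ.get("GRID_CORTEX_API_KEY"), "GRID_CORTEX_API_KEY is not set"
print("Base URL override:", os.environ.get("GRID_CORTEX_BASE_URL", "<default cloud endpoint>"))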
    

Quick Start (2 lines)

Get started with just two lines of code:
from grid_cortex_client import CortexClient
result = CortexClient().run(model_id="zoedepth", image_input="demo.jpg")
The result type depends on the model (see reference below).
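For the zoedepth call above, result is an (H, W) float32 depth map, so a quick visual sanity check is easy. A minimal sketch assuming matplotlib is installed (it is not a dependency of the client):
import matplotlib.pyplot as plt

plt.imshow(result, cmap="magma")   # result is the (H, W) float32 depth map from the call above
plt.colorbar(label="depth")
plt.show()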

Model Reference

The client exposes one unified function:
CortexClient.run(model_id: str, **kwargs) -> Any
  • Use the exact model_id shown in the tables below.
  • **kwargs are model-specific inputs such as image_input, prompt, left_image, etc.
All snippets assume:
from grid_cortex_client import CortexClient
client = CortexClient()  # uses env creds and GRID_CORTEX_BASE_URL (if set)

Depth & Stereo

Model ID | What it does | Key inputs | Returns
zoedepth | Monocular depth | image_input (path/URL/PIL/np.ndarray) | np.ndarray depth map (H, W) float32
foundationstereo | Stereo depth (FoundationStereo) | left_image, right_image; optional aux_args = {K, baseline, hiera, valid_iters} | np.ndarray depth map (H, W) float32
# Monocular depth
from PIL import Image
image = Image.open("path/to/scene.jpg")
depth = client.run(model_id="zoedepth", image_input=image)
print(depth.shape, depth.dtype)  # (H, W) float32
# Stereo depth (FoundationStereo)
import numpy as np
from PIL import Image

K = np.array([[525, 0, 320], [0, 525, 240], [0, 0, 1]], dtype=np.float32)
aux = {"K": K, "baseline": 0.1, "hiera": 0, "valid_iters": 32}
left = Image.open("left.jpg")
right = Image.open("right.jpg")
depth = client.run(
    model_id="foundationstereo",
    left_image=left,
    right_image=right,
    aux_args=aux,
)
print(depth.shape)  # (480, 640)
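If you need 3-D points rather than a depth image, the returned depth map can be back-projected with the same intrinsics. A minimal numpy sketch, assuming a pinhole camera model with the K defined above and metric depth:
import numpy as np

h, w = depth.shape
u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
z = depth
x = (u - K[0, 2]) * z / K[0, 0]
y = (v - K[1, 2]) * z / K[1, 1]
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) point cloud in the camera frame
print(points.shape)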

Object Detection

Model ID | What it does | Key inputs | Returns
owlv2 | Text-prompted object detection | image_input, prompt; optional box_threshold, timeout | dict with boxes, scores, labels
from PIL import Image

image = Image.open("path/to/street.jpg")
dets = client.run(
    model_id="owlv2",
    image_input=image,
    prompt="car, person, traffic light",
    box_threshold=0.25,
)
print(dets["boxes"][0])  # [x1,y1,x2,y2]
print(dets["scores"][0])  # 0.91
print(dets["labels"][0])  # "car"
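To inspect the detections visually you can draw them back onto the image. A small sketch using PIL, assuming boxes are pixel [x1, y1, x2, y2] coordinates as shown above:
from PIL import ImageDraw

annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box, score, label in zip(dets["boxes"], dets["scores"], dets["labels"]):
    draw.rectangle(list(box), outline="red", width=2)          # [x1, y1, x2, y2]
    draw.text((box[0], box[1]), f"{label} {score:.2f}", fill="red")
annotated.save("street_annotated.jpg")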

Image Segmentation

Model ID | What it does | Key inputs | Returns
gsam2 | Text-prompted segmentation | image_input, prompt; optional box_threshold, text_threshold, nms_threshold | np.ndarray mask (H, W) uint8 (255 fg, 0 bg)
sam2 | Point/box-prompted segmentation | image_input, prompts ([[x, y], ...]), labels; optional multimask_output, mode, timeout | backend dict with masks/scores
sam3 | Single prompt-type segmentation (text OR points OR boxes) | image_input; one of text, points, or boxes; labels required for points/boxes | np.ndarray mask (H, W) uint8
oneformer | Universal segmentation | image_input, mode (panoptic/semantic/instance) | dict with output, label_map, latency_ms
GSAM2 (text prompt)
from PIL import Image

img = Image.open("cat.jpg")
mask = client.run(model_id="gsam2", image_input=img, prompt="a cat on the sofa")
print(mask.shape)  # (H, W)
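Because the mask is a plain uint8 array (255 for foreground, 0 for background), it composes directly with numpy. A small sketch that cuts the segmented region out of the original image:
import numpy as np
from PIL import Image

rgb = np.array(img)                       # (H, W, 3) uint8
cutout = rgb * (mask[..., None] // 255)   # zero out background pixels
Image.fromarray(cutout.astype(np.uint8)).save("cat_cutout.png")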
SAM2 (point prompts)
from PIL import Image

img = Image.open("cat.jpg")
result = client.run(
    model_id="sam2",
    image_input=img,
    prompts=[[320, 240], [410, 260]],
    labels=[1, 0],  # 1 = foreground, 0 = background
    multimask_output=True,
)
print(result.keys())  # backend dict
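The exact layout of the backend dict can vary; assuming it exposes masks and scores as listed in the table above, selecting the highest-scoring mask might look like this (a hedged sketch, not a guaranteed schema):
import numpy as np

masks = np.asarray(result["masks"])    # assumed key, per the masks/scores description above
scores = np.asarray(result["scores"])  # assumed key
best_mask = masks[int(np.argmax(scores))]
print(best_mask.shape)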
SAM3 (three prompt styles; choose exactly one per call)
from PIL import Image

img = Image.open("cat.jpg")

# Text prompt
text_mask = client.run(
    model_id="sam3",
    image_input=img,
    text="cat",
)
print(text_mask.shape)

# Points prompt
points_mask = client.run(
    model_id="sam3",
    image_input=img,
    points=[[466, 125]],
    labels=[1],  # 1 for foreground, 0 for background
)
print(points_mask.shape)

# Boxes prompt
boxes_mask = client.run(
    model_id="sam3",
    image_input=img,
    boxes=[[340, 32, 609, 365]],
    labels=[1],
)
print(boxes_mask.shape)
OneFormer (semantic mode)
from PIL import Image

img = Image.open("cat.jpg")
result = client.run(
    model_id="oneformer",
    image_input=img,
    mode="semantic",
)
print(result.keys())  # dict_keys(['output', 'label_map', 'latency_ms'])
print(result["output"].shape)  # (H, W) segmentation mask
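To see which classes the semantic mask contains, you can count pixels per class id in output. The mapping from ids to readable names lives in label_map, whose exact structure is backend-defined (treated here as an assumption):
import numpy as np

seg = np.asarray(result["output"])
ids, counts = np.unique(seg, return_counts=True)
for cls_id, n in zip(ids, counts):
    print(cls_id, n)  # resolve names via result["label_map"] (structure is backend-defined)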

Grasp Prediction

Model ID | What it does | Key inputs | Returns
graspgen | 6-DoF grasp generation | depth_image, seg_image, camera_intrinsics; optional aux_args (num_grasps, gripper_config, camera_extrinsics), or provide point_cloud directly | dict with grasps (N, 4, 4), confidence, optional latency_ms
import numpy as np
from PIL import Image

K = np.eye(3)
aux = {"num_grasps": 128, "gripper_config": "single_suction_cup_30mm", "camera_extrinsics": np.eye(4)}
depth_image = np.load("depth.npy")
seg_image = np.array(Image.open("seg.png"))
res = client.run(
    model_id="graspgen",
    depth_image=depth_image,
    seg_image=seg_image,
    camera_intrinsics=K,
    aux_args=aux,
)
print(res["grasps"].shape)
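The grasps come back as (N, 4, 4) homogeneous transforms with one confidence score per grasp, so picking the best candidate is a one-liner. A minimal sketch, assuming confidence is aligned with the first axis of grasps:
import numpy as np

conf = np.asarray(res["confidence"])
best_grasp = res["grasps"][int(np.argmax(conf))]  # (4, 4) pose of the highest-confidence grasp
print(best_grasp)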

Vision-Language

Model ID | What it does | Key inputs | Returns
moondream | VQA, captioning, detection, pointing | image_input; task = vqa/caption/detect/point; prompt for vqa/detect/point; length (short/normal) for caption | dict with "output" (text or structured data)
import numpy as np
from PIL import Image

image = np.array(Image.open("path/to/kitchen.jpg"))
vqa = client.run(
    model_id="moondream",
    image_input=image,
    task="vqa",
    prompt="How many cups are on the table?",
)
print(vqa["output"])  # Text answer

# Image Captioning
result = client.run(
    model_id="moondream",
    image_input=image,
    task="caption",
    length="short",  # or "normal"
)
print(result["output"])  # Text caption

# Object Detection
result = client.run(
    model_id="moondream",
    image_input=image,
    task="detect",
    prompt="cup, plate, bowl",
)
print(result["output"])  # Dict with boxes, scores, labels

# Pointing (clickable points)
result = client.run(
    model_id="moondream",
    image_input=image,
    task="point",
    prompt="the red cup",
)
print(result["output"])  # Numpy array of (x,y) points
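For the pointing task, the returned points can be overlaid on the image for a quick check. A small sketch assuming the output is an array of pixel (x, y) coordinates as the comment above indicates (if the backend returns normalized coordinates, scale by the image size first; matplotlib is not a client dependency):
import matplotlib.pyplot as plt
import numpy as np

points = np.asarray(result["output"])           # assumed (N, 2) pixel coordinates
plt.imshow(image)
plt.scatter(points[:, 0], points[:, 1], c="red", marker="x")
plt.show()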

Troubleshooting

401 Unauthorized – Check that your shell actually has GRID_CORTEX_API_KEY exported and that the key is correct.
Timeout / connection errors – If you are on-prem/managed cloud, confirm GRID_CORTEX_BASE_URL points to your instance. You can also adjust the default 30 s timeout:
client = CortexClient(timeout=60)