This page explains how to install the grid-cortex-client Python package and access the AI models hosted by GRID Cortex.

Installation

Install the GRID Cortex client package using pip:
pip install grid-cortex-client
Python 3.10 or newer is recommended.
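To confirm the install, you can query the installed distribution from the standard library (a minimal check; the printed version string will differ depending on the release you installed):
from importlib.metadata import version
print(version("grid-cortex-client"))  # raises PackageNotFoundError if the package is missing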

Authentication & Endpoint

Set up your API key (and endpoint, if needed):
  1. During onboarding, General Robotics will give you a personal CORTEX API key.
  2. Export it so the client can pick it up automatically:
    export GRID_CORTEX_API_KEY="<YOUR_KEY>"
    
  3. If you run Cortex on-prem or on a managed cloud deployment, point the client at your instance:
    export GRID_CORTEX_BASE_URL="https://<custom_IP>/cortex"
    
  4. You can also pass the key directly when constructing the client:
    from grid_cortex_client import CortexClient
    client = CortexClient(api_key="<YOUR_KEY>")
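Before constructing the client, you can verify that the environment variables were actually exported. A minimal check using only the standard library (GRID_CORTEX_BASE_URL is optional and only set for on-prem or managed deployments):
import os

# The API key must be present; the base URL is only needed for custom deployments.
assert os.environ.get("GRID_CORTEX_API_KEY"), "GRID_CORTEX_API_KEY is not set"
print("Base URL override:", os.environ.get("GRID_CORTEX_BASE_URL", "<default cloud endpoint>"))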
    

Quick Start (2 lines)

Get started with just two lines of code:
from grid_cortex_client import CortexClient
result = CortexClient().run(model_id="zoedepth", image_input="demo.jpg")
The result type depends on the model (see reference below).
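For the zoedepth call above, result is an (H, W) float32 depth map, so a quick visual sanity check is easy. A minimal sketch assuming matplotlib is installed (it is not a dependency of the client):
import matplotlib.pyplot as plt

plt.imshow(result, cmap="magma")   # result is the (H, W) float32 depth map from the call above
plt.colorbar(label="depth")
plt.show()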

Model Reference

The client exposes one unified function:
CortexClient.run(model_id: str, **kwargs) -> Any
  • Use the exact model_id shown in the tables below.
  • **kwargs are model-specific inputs such as image_input, prompt, left_image, etc.
All snippets assume:
from grid_cortex_client import CortexClient
client = CortexClient()  # uses env creds and GRID_CORTEX_BASE_URL (if set)

Depth & Stereo

Model ID | What it does | Key inputs | Returns
zoedepth | Monocular depth | image_input (path/URL/PIL/np.ndarray) | np.ndarray depth map (H, W) float32
foundationstereo | Stereo depth (FoundationStereo) | left_image, right_image; optional aux_args = {K, baseline, hiera, valid_iters} | np.ndarray depth map (H, W) float32
# Monocular depth
from PIL import Image
image = Image.open("path/to/scene.jpg")
depth = client.run(model_id="zoedepth", image_input=image)
print(depth.shape, depth.dtype)  # (H, W) float32
# Stereo depth (FoundationStereo)
import numpy as np
from PIL import Image

K = np.array([[525, 0, 320], [0, 525, 240], [0, 0, 1]], dtype=np.float32)
aux = {"K": K, "baseline": 0.1, "hiera": 0, "valid_iters": 32}
left = Image.open("left.jpg")
right = Image.open("right.jpg")
depth = client.run(
    model_id="foundationstereo",
    left_image=left,
    right_image=right,
    aux_args=aux,
)
print(depth.shape)  # (480, 640)
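If you need 3-D points rather than a depth image, the returned depth map can be back-projected with the same intrinsics. A minimal numpy sketch, assuming a pinhole camera model with the K defined above and metric depth:
import numpy as np

h, w = depth.shape
u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
z = depth
x = (u - K[0, 2]) * z / K[0, 0]
y = (v - K[1, 2]) * z / K[1, 1]
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) point cloud in the camera frame
print(points.shape)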

Object Detection

Model ID | What it does | Key inputs | Returns
owlv2 | Text-prompted object detection | image_input, prompt; optional box_threshold, timeout | dict with boxes, scores, labels
from PIL import Image

image = Image.open("path/to/street.jpg")
dets = client.run(
    model_id="owlv2",
    image_input=image,
    prompt="car, person, traffic light",
    box_threshold=0.25,
)
print(dets["boxes"][0])  # [x1,y1,x2,y2]
print(dets["scores"][0])  # 0.91
print(dets["labels"][0])  # "car"
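To inspect the detections visually you can draw them back onto the image. A small sketch using PIL, assuming boxes are pixel [x1, y1, x2, y2] coordinates as shown above:
from PIL import ImageDraw

annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box, score, label in zip(dets["boxes"], dets["scores"], dets["labels"]):
    draw.rectangle(list(box), outline="red", width=2)          # [x1, y1, x2, y2]
    draw.text((box[0], box[1]), f"{label} {score:.2f}", fill="red")
annotated.save("street_annotated.jpg")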

Image Segmentation

Model ID | What it does | Key inputs | Returns
gsam2 | Text-prompted segmentation | image_input, prompt; optional box_threshold, text_threshold, nms_threshold | np.ndarray mask (H, W) uint8 (255 fg, 0 bg)
sam2 | Point/box-prompted segmentation | image_input, prompts ([[x, y], ...]), labels; optional multimask_output, mode, timeout | backend dict with masks/scores
sam3 | Single prompt-type segmentation (text OR points OR boxes) | image_input; one of text, points, or boxes; labels required for points/boxes | np.ndarray mask (H, W) uint8
oneformer | Universal segmentation | image_input, mode (panoptic/semantic/instance) | dict with output, label_map, latency_ms
GSAM2 (text prompt)
from PIL import Image

img = Image.open("cat.jpg")
mask = client.run(model_id="gsam2", image_input=img, prompt="a cat on the sofa")
print(mask.shape)  # (H, W)
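Because the mask is a plain uint8 array (255 for foreground, 0 for background), it composes directly with numpy. A small sketch that cuts the segmented region out of the original image:
import numpy as np
from PIL import Image

rgb = np.array(img)                       # (H, W, 3) uint8
cutout = rgb * (mask[..., None] // 255)   # zero out background pixels
Image.fromarray(cutout.astype(np.uint8)).save("cat_cutout.png")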
SAM2 (point prompts)
from PIL import Image

img = Image.open("cat.jpg")
result = client.run(
    model_id="sam2",
    image_input=img,
    prompts=[[320, 240], [410, 260]],
    labels=[1, 0],  # 1 = foreground, 0 = background
    multimask_output=True,
)
print(result.keys())  # backend dict
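The exact layout of the backend dict can vary; assuming it exposes masks and scores as listed in the table above, selecting the highest-scoring mask might look like this (a hedged sketch, not a guaranteed schema):
import numpy as np

masks = np.asarray(result["masks"])    # assumed key, per the masks/scores description above
scores = np.asarray(result["scores"])  # assumed key
best_mask = masks[int(np.argmax(scores))]
print(best_mask.shape)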
SAM3 (three prompt styles; choose exactly one per call)
from PIL import Image

img = Image.open("cat.jpg")

# Text prompt
text_mask = client.run(
    model_id="sam3",
    image_input=img,
    text="cat",
)
print(text_mask.shape)

# Points prompt
points_mask = client.run(
    model_id="sam3",
    image_input=img,
    points=[[466, 125]],
    labels=[1],  # 1 for foreground, 0 for background
)
print(points_mask.shape)

# Boxes prompt
boxes_mask = client.run(
    model_id="sam3",
    image_input=img,
    boxes=[[340, 32, 609, 365]],
    labels=[1],
)
print(boxes_mask.shape)
OneFormer (semantic mode)
from PIL import Image

img = Image.open("cat.jpg")
result = client.run(
    model_id="oneformer",
    image_input=img,
    mode="semantic",
)
print(result.keys())  # dict_keys(['output', 'label_map', 'latency_ms'])
print(result["output"].shape)  # (H, W) segmentation mask
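To see which classes the semantic mask contains, you can count pixels per class id in output. The mapping from ids to readable names lives in label_map, whose exact structure is backend-defined (treated here as an assumption):
import numpy as np

seg = np.asarray(result["output"])
ids, counts = np.unique(seg, return_counts=True)
for cls_id, n in zip(ids, counts):
    print(cls_id, n)  # resolve names via result["label_map"] (structure is backend-defined)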

Grasp Prediction

Model ID | What it does | Key inputs | Returns
graspgen | 6-DoF grasp generation | depth_image, seg_image, camera_intrinsics; optional aux_args (num_grasps, gripper_config, camera_extrinsics), or provide point_cloud directly | dict with grasps (N, 4, 4), confidence, optional latency_ms
import numpy as np
from PIL import Image

K = np.eye(3)
aux = {"num_grasps": 128, "gripper_config": "single_suction_cup_30mm", "camera_extrinsics": np.eye(4)}
depth_image = np.load("depth.npy")
seg_image = np.array(Image.open("seg.png"))
res = client.run(
    model_id="graspgen",
    depth_image=depth_image,
    seg_image=seg_image,
    camera_intrinsics=K,
    aux_args=aux,
)
print(res["grasps"].shape)
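The grasps come back as (N, 4, 4) homogeneous transforms with one confidence score per grasp, so picking the best candidate is a one-liner. A minimal sketch, assuming confidence is aligned with the first axis of grasps:
import numpy as np

conf = np.asarray(res["confidence"])
best_grasp = res["grasps"][int(np.argmax(conf))]  # (4, 4) pose of the highest-confidence grasp
print(best_grasp)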

Vision-Language

Model ID | What it does | Key inputs | Returns
moondream | VQA, captioning, detection, pointing | image_input; task = vqa/caption/detect/point; prompt for vqa/detect/point; length (short/normal) for caption | dict with "output" (text or structured data)
import numpy as np
from PIL import Image

image = np.array(Image.open("path/to/kitchen.jpg"))
vqa = client.run(
    model_id="moondream",
    image_input=image,
    task="vqa",
    prompt="How many cups are on the table?",
)
print(vqa["output"])  # Text answer

# Image Captioning
result = client.run(
    model_id="moondream",
    image_input=image,
    task="caption",
    length="short",  # or "normal"
)
print(result["output"])  # Text caption

# Object Detection
result = client.run(
    model_id="moondream",
    image_input=image,
    task="detect",
    prompt="cup, plate, bowl",
)
print(result["output"])  # Dict with boxes, scores, labels

# Pointing (clickable points)
result = client.run(
    model_id="moondream",
    image_input=image,
    task="point",
    prompt="the red cup",
)
print(result["output"])  # Numpy array of (x,y) points
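For the pointing task, the returned points can be overlaid on the image for a quick check. A small sketch assuming the output is an array of pixel (x, y) coordinates as the comment above indicates (if the backend returns normalized coordinates, scale by the image size first; matplotlib is not a client dependency):
import matplotlib.pyplot as plt
import numpy as np

points = np.asarray(result["output"])           # assumed (N, 2) pixel coordinates
plt.imshow(image)
plt.scatter(points[:, 0], points[:, 1], c="red", marker="x")
plt.show()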

Troubleshooting

401 Unauthorized – Check that your shell actually has GRID_CORTEX_API_KEY exported and that the key is correct.
Timeout / connection errors – If you are on-prem/managed cloud, confirm GRID_CORTEX_BASE_URL points to your instance. You can also adjust the default 30 s timeout:
client = CortexClient(timeout=60)