Detect objects in an image using a text prompt with Google’s OWL-ViT v2.

Parameters

image_input
str | PIL.Image | np.ndarray
required
RGB image as file path, URL, PIL Image, or numpy array.
prompt
str
required
Comma-separated text description of objects to detect (e.g. "car, person, traffic light").
box_threshold
float
Minimum confidence (0.0–1.0); detections scoring below this value are dropped.
timeout
float | None
Optional HTTP timeout.
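
Since `image_input` accepts several forms, the client presumably normalizes them before inference. A minimal sketch of that normalization, assuming a hypothetical `to_pil` helper (not part of the client API; URL fetching is omitted):

```python
import numpy as np
from PIL import Image


def to_pil(image_input):
    """Normalize the accepted image_input forms to an RGB PIL image.

    Hypothetical helper for illustration only; the client performs its
    own conversion internally.
    """
    if isinstance(image_input, Image.Image):
        return image_input.convert("RGB")
    if isinstance(image_input, np.ndarray):
        # Assumes an HxWx3 uint8 array in RGB channel order.
        return Image.fromarray(image_input).convert("RGB")
    if isinstance(image_input, str):
        # Treated as a file path here; URL handling is left out of this sketch.
        return Image.open(image_input).convert("RGB")
    raise TypeError(f"unsupported image_input type: {type(image_input)!r}")
```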

Returns

dict with three parallel lists (index i of each list describes one detection):
  • boxes — List of bounding boxes as [x1, y1, x2, y2] pixel coordinates
  • scores — List of confidence scores (0.0–1.0)
  • labels — List of detected label strings
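
Because the three lists are parallel, it is often convenient to pair them into per-detection records and sort by score. A small sketch (the `detections` helper is illustrative, not part of the client):

```python
def detections(dets, min_score=0.0):
    """Pair the parallel boxes/scores/labels lists and sort by score, descending."""
    rows = [
        {"box": b, "score": s, "label": lbl}
        for b, s, lbl in zip(dets["boxes"], dets["scores"], dets["labels"])
        if s >= min_score
    ]
    return sorted(rows, key=lambda r: r["score"], reverse=True)


# Hand-made result dict in the documented shape, reusing values from the
# example below:
dets = {
    "boxes": [[122.0, 5.7, 187.9, 227.1], [167.7, 26.6, 334.8, 242.4]],
    "scores": [0.152, 0.271],
    "labels": ["building", "building"],
}
best = detections(dets)[0]
print(best["score"], best["label"])
# 0.271 building
```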

Example Output

[Image: OWLv2 object detection output with bounding boxes]

Example

from grid_cortex_client import CortexClient
from PIL import Image

client = CortexClient()
image = Image.open("scene.jpg")  # 640x480 RGB
dets = client.run(
    model_id="owlv2",
    image_input=image,
    prompt="building",
    box_threshold=0.1,
)

print(len(dets["boxes"]))
# 10

# Inspect the first few detections (scores are not sorted):
print(dets["scores"][0], dets["boxes"][0])
# 0.152  [122.0, 5.7, 187.9, 227.1]

print(dets["scores"][1], dets["boxes"][1])
# 0.271  [167.7, 26.6, 334.8, 242.4]

print(dets["scores"][2], dets["boxes"][2])
# 0.165  [-0.5, -0.9, 179.5, 242.3]