Detect objects in an image using a text prompt with Google’s OWL-ViT v2.
Parameters

image_input
str | PIL.Image | np.ndarray
required
RGB image as file path, URL, PIL Image, or numpy array.

prompt
str
required
Comma-separated text description of objects to detect (e.g. "car, person, traffic light").

box_threshold
float
Confidence threshold (0.0–1.0); detections scoring below it are filtered out.
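The prompt is parsed server-side, but conceptually a comma-separated prompt corresponds to a list of independent label queries. A minimal sketch of that split (the `split_prompt` helper is illustrative, not part of the client API):

```python
def split_prompt(prompt: str) -> list[str]:
    # Split a comma-separated prompt into individual label queries,
    # trimming whitespace and dropping empty entries.
    return [label.strip() for label in prompt.split(",") if label.strip()]

print(split_prompt("car, person, traffic light"))
# ['car', 'person', 'traffic light']
```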
Returns
dict with keys:
boxes — List of bounding boxes as [x1, y1, x2, y2]
scores — List of confidence scores (0.0–1.0)
labels — List of detected label strings
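The three lists are index-aligned, so detections can be zipped together and, for example, re-ranked by confidence after the call. A sketch using an invented result dict of the documented shape:

```python
# Hypothetical result dict in the documented shape (values invented for illustration).
dets = {
    "boxes": [[122.0, 5.7, 187.9, 227.1], [167.7, 26.6, 334.8, 242.4]],
    "scores": [0.152, 0.271],
    "labels": ["building", "building"],
}

# Zip the index-aligned lists and sort detections by descending confidence.
ranked = sorted(
    zip(dets["scores"], dets["boxes"], dets["labels"]),
    key=lambda d: d[0],
    reverse=True,
)
print(ranked[0][0])
# 0.271
```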
Example
```python
from grid_cortex_client import CortexClient
from PIL import Image

client = CortexClient()
image = Image.open("scene.jpg")  # 640x480 RGB

dets = client.run(
    model_id="owlv2",
    image_input=image,
    prompt="building",
    box_threshold=0.1,
)

print(len(dets["boxes"]))
# 10

# First three detections (in returned order, not sorted by score):
print(dets["scores"][0], dets["boxes"][0])
# 0.152 [122.0, 5.7, 187.9, 227.1]
print(dets["scores"][1], dets["boxes"][1])
# 0.271 [167.7, 26.6, 334.8, 242.4]
print(dets["scores"][2], dets["boxes"][2])
# 0.165 [-0.5, -0.9, 179.5, 242.3]
```
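Note that returned boxes can extend slightly outside the image, as in the third detection above (x1 = -0.5, y1 = -0.9). If downstream code expects in-bounds coordinates, clamp each box to the image size. A small sketch (the `clip_box` helper is mine, not part of the client API):

```python
def clip_box(box, width, height):
    # Clamp [x1, y1, x2, y2] to the image bounds [0, width] x [0, height].
    x1, y1, x2, y2 = box
    return [
        min(max(x1, 0.0), width),
        min(max(y1, 0.0), height),
        min(max(x2, 0.0), width),
        min(max(y2, 0.0), height),
    ]

print(clip_box([-0.5, -0.9, 179.5, 242.3], 640, 480))
# [0.0, 0.0, 179.5, 242.3]
```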