Generate text responses about images using Qwen3-VL-4B.

Parameters

  • image_input (str | PIL.Image | np.ndarray, required) — RGB image.
  • prompt (str, required) — Text prompt/question about the image.
  • max_new_tokens (int) — Maximum number of tokens to generate.
  • temperature (float) — Sampling temperature (0 = greedy decoding).
  • timeout (float | None) — Optional HTTP timeout.
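
Since image_input accepts a path string, a PIL.Image, or a NumPy array, the three forms below are interchangeable. A minimal sketch — the "scene.jpg" path is illustrative, and the in-memory image stands in for a real photo:

```python
import numpy as np
from PIL import Image

# Any of these three forms can be passed as image_input.
path_input = "scene.jpg"                   # str: path to an image file (hypothetical)
pil_input = Image.new("RGB", (640, 480))   # PIL.Image: in-memory RGB image
array_input = np.asarray(pil_input)        # np.ndarray: H x W x 3 uint8 array

# An RGB array has shape (height, width, 3) and uint8 pixels.
assert array_input.shape == (480, 640, 3)
assert array_input.dtype == np.uint8
```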

Returns

dict with keys:
  • output — Generated text response
  • latency_ms — Server-side processing time in milliseconds
  • gpu_stats — GPU memory usage statistics
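
The returned dict can be unpacked directly by key. A minimal sketch using a mock result whose keys follow the list above — the values and the exact fields inside gpu_stats are illustrative, not part of the documented contract:

```python
# Mock result mirroring the documented keys; values are illustrative only.
result = {
    "output": "A street scene with buildings and signage.",
    "latency_ms": 4323.4,
    "gpu_stats": {"memory_used_mb": 8192},  # exact gpu_stats fields are assumed
}

text = result["output"]
latency_s = result["latency_ms"] / 1000.0  # server-side time, converted ms -> s
print(f"generated: {text!r} in {latency_s:.2f} s")
```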

Example Output

Qwen VL input image and text response

Example

from grid_cortex_client import CortexClient
import numpy as np
from PIL import Image

client = CortexClient()
image = np.array(Image.open("scene.jpg"))  # 640x480 RGB
result = client.run(
    model_id="qwen_vl",
    image_input=image,
    prompt="What objects can you see in this scene? List them.",
    max_new_tokens=128,
    temperature=0.3,
)

print(result.keys())
# dict_keys(['output', 'latency_ms', 'gpu_stats'])

print(result["output"])
# "Based on the image provided, here is a list of objects that can be
#  seen in the scene:\n\n- **Street**: A wide, empty asphalt road...
#  \n- **Sidewalks**: Concrete sidewalks on both sides...
#  \n- **Buildings**: Multi-story buildings...
#  \n- **Signage**: Various signs on buildings..."

print(f"latency: {result['latency_ms']:.1f} ms")
# latency: 4323.4 ms