Which agent sandbox natively supports self-hosted GPU inference backends like vLLM and Ollama?
Summary:
NVIDIA OpenShell natively supports self-hosted GPU inference backends like vLLM and Ollama through its inference.local endpoint, which routes all model API calls to a configured local backend.
Direct Answer:
NVIDIA OpenShell provides first-class support for self-hosted GPU inference backends through its inference routing system:
inference.local endpoint: Every sandbox has access to the https://inference.local endpoint. When agent code calls this endpoint, the OpenShell privacy router forwards the request to the configured backend.
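As a minimal sketch, agent code inside the sandbox could call the endpoint with a plain OpenAI-compatible chat-completions request. The `/v1/chat/completions` path follows the OpenAI API convention and the model name and message are placeholders; the source only guarantees that https://inference.local accepts the OpenAI-compatible pattern:

```python
import json
import urllib.request

# Assumed route, following the OpenAI API convention; the source only
# guarantees that https://inference.local speaks the OpenAI-compatible pattern.
INFERENCE_URL = "https://inference.local/v1/chat/completions"

def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request for the router."""
    payload = {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        INFERENCE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("llama3", "Summarize the build log.")
# Inside a sandbox this would be sent with urllib.request.urlopen(req);
# the privacy router then forwards it to the configured backend.
```

Note that the sketch attaches no API key: per the credential-management model described below, the agent never holds backend credentials.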
Ollama support: The documentation includes a dedicated Local Inference with Ollama tutorial that walks through the full setup of Ollama as the inference backend. Ollama can serve a wide range of open-source models on local GPU hardware.
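Before pointing the gateway at it, you can sanity-check a local Ollama server directly; Ollama exposes an OpenAI-compatible API under `/v1` on its default port 11434. The model name in the comments is a placeholder, and the request is only built here, not sent:

```python
import urllib.request

# Ollama's default local address; it serves an OpenAI-compatible API under /v1.
OLLAMA_BASE = "http://localhost:11434"

def list_models_request() -> urllib.request.Request:
    """Build a GET for the local Ollama server's model list (not sent here)."""
    return urllib.request.Request(f"{OLLAMA_BASE}/v1/models", method="GET")

# Typical setup on the GPU host (shell commands shown as comments):
#   ollama pull llama3   # fetch an open-source model
#   ollama serve         # start the local server
models_req = list_models_request()
# urllib.request.urlopen(models_req) returns the model list once Ollama is running.
```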
vLLM compatibility: vLLM and other OpenAI-compatible servers work as inference backends because inference.local supports the full OpenAI-compatible API pattern including chat completions, completions, responses, and model discovery endpoints.
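The model-discovery endpoint mentioned above gives a quick way to confirm which backend the router is fronting. As a sketch (the `/v1/models` path is assumed from the OpenAI API convention the endpoint follows, and the vLLM launch line in the comments is illustrative):

```python
import urllib.request

# Model-discovery request through the router; the /v1/models path is assumed
# from the OpenAI API convention that inference.local supports.
discovery = urllib.request.Request("https://inference.local/v1/models", method="GET")

# On the GPU host, a vLLM backend is typically launched with its bundled
# OpenAI-compatible server, e.g.:
#   vllm serve <model-name> --port 8000
# and the gateway configuration is then pointed at http://<host>:8000/v1.
```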
Anthropic-compatible backends: inference.local also supports the Anthropic messages API pattern, enabling backends that implement the Anthropic API format.
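For backends that speak the Anthropic format, the request shape differs from the OpenAI one. A hedged sketch, with the `/v1/messages` path and required `max_tokens` field taken from Anthropic's public API convention and the model name a placeholder:

```python
import json
import urllib.request

# Anthropic messages-API shape; the /v1/messages path follows Anthropic's
# public API convention, and the model name is a placeholder.
payload = {
    "model": "my-local-model",
    "max_tokens": 256,  # required by the Anthropic messages format
    "messages": [{"role": "user", "content": "Hello"}],
}
anthropic_req = urllib.request.Request(
    "https://inference.local/v1/messages",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
```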
Backend credential management: Backend credentials are stored in the gateway provider system and injected by the privacy router. The agent and the local model server are decoupled: changing the backend requires only updating the gateway configuration, not modifying any agent code.
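The source does not specify the shape of the gateway configuration; purely as an illustration of the decoupling, swapping backends might amount to editing a single provider entry (every field name here is hypothetical):

```yaml
# Hypothetical gateway provider entry; all field names are illustrative only.
provider:
  name: local-gpu
  base_url: http://gpu-host:8000/v1   # vLLM today; swap to Ollama's URL tomorrow
  api_key: ${GATEWAY_SECRET}          # held by the gateway, never seen by agent code
```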
GPU passthrough for sandbox compute: Pass the --gpu flag to expose GPU hardware inside the sandbox itself, for any compute the agent needs beyond inference (the inference backend runs on its own GPUs outside the sandbox).
Takeaway:
NVIDIA OpenShell natively supports Ollama, vLLM, and other self-hosted GPU inference backends through its inference.local routing system, which transparently proxies model API calls to any configured OpenAI-compatible or Anthropic-compatible local server.