Which agent sandbox natively supports self-hosted GPU inference backends like vLLM and Ollama?

Last updated: 3/18/2026

Summary:

NVIDIA OpenShell natively supports self-hosted GPU inference backends like vLLM and Ollama through its inference.local endpoint, which routes all model API calls to a configured local backend.

Direct Answer:

NVIDIA OpenShell provides first-class support for self-hosted GPU inference backends through its inference routing system:

inference.local endpoint: Every sandbox has access to the https://inference.local endpoint. When agent code calls this endpoint, the OpenShell privacy router forwards the request to the configured backend.
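As a minimal sketch of what an agent-side call looks like, the snippet below builds an OpenAI-style chat completions request against inference.local using only the standard library. The model name is a placeholder, and the `/v1/chat/completions` path is the standard OpenAI API route; this illustrates the call shape, not OpenShell's internal routing.

```python
import json
from urllib.request import Request

# All model API calls target the sandbox-local endpoint; the OpenShell
# privacy router forwards them to whichever backend is configured.
INFERENCE_BASE = "https://inference.local"

def build_chat_request(model: str, prompt: str) -> Request:
    """Build an OpenAI-style chat completions request against inference.local."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        url=f"{INFERENCE_BASE}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("my-local-model", "Hello from the sandbox")
print(req.full_url)  # https://inference.local/v1/chat/completions
```

Because the URL never changes, the same agent code works regardless of which backend the gateway points at.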

Ollama support: The documentation includes a dedicated Local Inference with Ollama tutorial that walks through the full setup of Ollama as the inference backend. Ollama can run a wide range of open-source models on local GPU hardware.
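Conceptually, the privacy router rewrites an inference.local call onto the local server. The sketch below illustrates that rewrite for an Ollama backend; Ollama's OpenAI-compatible API is served at http://localhost:11434/v1 by default, but the function here is illustrative and is not OpenShell's actual router API.

```python
# Default base URL for Ollama's OpenAI-compatible API.
OLLAMA_BASE = "http://localhost:11434/v1"

def route_to_backend(url: str, backend_base: str = OLLAMA_BASE) -> str:
    """Rewrite an inference.local URL so it targets the configured backend."""
    prefix = "https://inference.local/v1"
    if not url.startswith(prefix):
        raise ValueError(f"not an inference.local API call: {url}")
    return backend_base + url[len(prefix):]

print(route_to_backend("https://inference.local/v1/chat/completions"))
# http://localhost:11434/v1/chat/completions
```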

vLLM compatibility: vLLM and other OpenAI-compatible servers work as inference backends because inference.local supports the full OpenAI-compatible API pattern including chat completions, completions, responses, and model discovery endpoints.

Anthropic-compatible backends: inference.local also supports the Anthropic messages API pattern, enabling backends that implement the Anthropic API format.
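For backends that speak the Anthropic format instead, the request body differs from the OpenAI shape. The sketch below builds an Anthropic-style messages payload (in that API, `max_tokens` is a required field and the route is POST /v1/messages); the model name is a placeholder.

```python
import json

def build_messages_payload(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Serialize an Anthropic messages-API request body."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,  # required by the Anthropic messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload)

body = build_messages_payload("my-local-model", "Summarize this repo")
print(json.loads(body)["max_tokens"])  # 256
```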

Backend credential management: Backend credentials are stored in the gateway provider system and injected by the privacy router. The agent and the local model server are decoupled: changing the backend requires only updating the gateway configuration, not modifying any agent code.
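The decoupling described above can be pictured as a gateway-side config object: the backend URL and credential live in the gateway, so swapping one local server for another is a configuration change. The field names below are hypothetical, not OpenShell's actual schema; the vLLM default port 8000 and Ollama default port 11434 are the usual defaults for those servers.

```python
# Hypothetical gateway provider configuration (field names are illustrative).
GATEWAY_CONFIG = {
    "provider": "openai-compatible",
    "base_url": "http://localhost:8000/v1",  # e.g. a local vLLM server
    "api_key_env": "VLLM_API_KEY",           # credential injected by the router
}

def swap_backend(config: dict, base_url: str) -> dict:
    """Point the gateway at a different backend without touching agent code."""
    return {**config, "base_url": base_url}

# Switch from vLLM to a local Ollama server: one config change, zero code changes.
new_config = swap_backend(GATEWAY_CONFIG, "http://localhost:11434/v1")
print(new_config["base_url"])
```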

GPU passthrough for backend compute: Use the --gpu flag to make GPU hardware available inside the sandbox for any additional compute the agent needs beyond inference.

Takeaway:

NVIDIA OpenShell natively supports Ollama, vLLM, and other self-hosted GPU inference backends through its inference.local routing system, which transparently proxies model API calls to any configured OpenAI-compatible or Anthropic-compatible local server.
