Provider matrix

Five providers ship first-class. Each runs as a podman container under a hal0-slot@<name>.service unit; container images and tuned flags come from slot profiles managed by hal0-api. Each provider is a class that implements a small contract (build_env(), start_cmd(), health(), infer()), which keeps providers stateless and swappable. The slot lifecycle is provider-agnostic; what changes between providers is the workload they serve and the hardware they target.

The matrix

Provider	TOML name	Hardware	What it serves
llama-server	`llama-server`	Vulkan (default) / ROCm (opt-in)	chat, embed, rerank, vision
FLM	`flm`	AMD XDNA NPU (opt-in)	chat / embed / ASR multiplex
Moonshine	`moonshine`	CPU only	STT (`/v1/audio/transcriptions`)
Kokoro	`kokoro`	CPU / Vulkan	TTS (`/v1/audio/speech`)
ComfyUI	`comfyui`	ROCm (Strix Halo iGPU class)	Image gen (`/v1/images/generations`)

All five are first-class. FLM is opt-in because XDNA NPU support depends on AMD’s driver stack being present (kernel >= 6.11 with the amdxdna driver on the host) and a local FLM toolbox image; the picker only advertises NPU when both are detected.
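The two-part NPU gate described above can be sketched as a small predicate. This is a minimal illustration, not the picker's actual code: the function names are hypothetical, and the caller is assumed to supply the kernel release string and the two detection results.

```python
import re

MIN_KERNEL = (6, 11)  # from the text: kernel >= 6.11 with the amdxdna driver


def kernel_tuple(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.11.0-rc1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def npu_advertised(release: str, amdxdna_loaded: bool, flm_image_present: bool) -> bool:
    """Hypothetical picker rule: advertise the NPU only when both checks pass."""
    driver_ok = kernel_tuple(release) >= MIN_KERNEL and amdxdna_loaded
    return driver_ok and flm_image_present
```

On a real host the inputs might come from platform.release() and a check for the loaded driver module; how hal0 performs that detection is not stated here, so treat the wiring as an assumption.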

All six toolbox images are published to ghcr.io/hal0ai/ with pinned sha256 digests in manifest.json: hal0-toolbox-vulkan, hal0-toolbox-rocm, hal0-toolbox-flm, hal0-toolbox-moonshine, hal0-toolbox-kokoro, hal0-toolbox-comfyui. (Six images, five providers — llama-server ships both a Vulkan and a ROCm toolbox.)
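As an illustration of what digest pinning buys, the sketch below builds a pull reference of the form registry/image@sha256:... from a manifest. The layout of manifest.json is not documented in this section, so the flat image-name-to-digest mapping used here is an assumption, and pinned_ref is a hypothetical helper.

```python
import re

REGISTRY = "ghcr.io/hal0ai"
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def pinned_ref(image: str, manifest: dict[str, str]) -> str:
    """Return 'registry/image@sha256:...' for an image listed in the manifest.

    Assumes the manifest maps image names to 'sha256:<64 hex chars>' strings.
    """
    digest = manifest[image]  # KeyError if the image is not pinned
    if not DIGEST_RE.match(digest):
        raise ValueError(f"{image}: not a sha256 digest: {digest!r}")
    return f"{REGISTRY}/{image}@{digest}"
```

Pulling by digest rather than by tag means the container runtime fetches exactly the bytes the manifest names, even if a tag is later moved.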

llama-server

The default for primary and embed. Handles:

Chat completions (/v1/chat/completions).
Plain completions (/v1/completions).
Embeddings (/v1/embeddings).
Rerank (/v1/rerankings, same backend process).
Vision (multimodal models, where the GGUF supports them).

Backend modes:

Vulkan, the default. Runs on iGPUs (Strix Halo, RDNA3), discrete AMD, and discrete NVIDIA cards via Vulkan. Toolbox image: ghcr.io/hal0ai/hal0-toolbox-vulkan (pinned by sha256 in manifest.json).
ROCm, opt-in via ghcr.io/hal0ai/hal0-toolbox-rocm (also pinned by sha256). Faster on RDNA3 discrete cards and on Strix Halo’s iGPU where Vulkan leaves performance on the table.

NVIDIA cards can also run CUDA-backed llama.cpp through the same provider. Use provider = "llama-server" in slot TOML for all three backends (Vulkan, ROCm, and CUDA).

FLM

For AMD XDNA NPUs (the second AI engine on Strix Halo and newer Ryzen AI parts). Multiplexes chat, embed, and ASR workloads on the NPU, keeping the iGPU free for other slots.

Toolbox image: ghcr.io/hal0ai/hal0-toolbox-flm:v1. The image bundles FLM at /opt/fastflowlm/, so no host bind-mount of the FLM tree is required; the container’s ENTRYPOINT runs the in-image flm via tini. Default port 8086. FLM has its own model namespace (you can’t run arbitrary GGUFs through it); available tags come from flm list -j against the bundled model_list.json.

Moonshine

The STT provider. Targets edge-real-time speech-to-text: small model, low latency, designed for streaming.

CPU-only — no GPU path. Upstream useful-moonshine-onnx ships an ONNX Runtime CPU EP only; there’s no Vulkan/ROCm/CUDA EP in the wheel, so the catalog pins the moonshine runtime fan-out to ("cpu",) and the picker never advertises a backend the slot can’t honour.

Toolbox image: ghcr.io/hal0ai/hal0-toolbox-moonshine (pinned by sha256). The provider initialises the ONNX model by passing both models_dir and model_name to MoonshineOnnxModel — passing only one raises a TypeError. See Audio for the endpoint shape.

Kokoro

The TTS provider. Defaults to Kokoro-82M v1.0 (8 languages, 54 voices), with support for swapping to F5-TTS for voice cloning.

Toolbox image: ghcr.io/hal0ai/hal0-toolbox-kokoro (pinned by sha256).

ComfyUI

The image-gen provider. Backs /v1/images/generations with curated SDXL Turbo, SD 1.5, and Flux Schnell weights. hal0 owns the OpenAI ↔ ComfyUI translation; the upstream is treated as a black box that speaks POST /prompt, GET /history/<id>, GET /view.
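The three upstream calls can be illustrated with pure helpers that build the request body and URLs. This is a sketch of the general shape, not hal0's translation layer: the helper names are hypothetical, and the payload layout (a prompt_id keyed history entry whose outputs list images by filename, subfolder, and type) reflects the upstream ComfyUI HTTP API as commonly documented, so verify it against the ComfyUI version in the toolbox image.

```python
from urllib.parse import urlencode


def prompt_body(workflow: dict, client_id: str) -> dict:
    """Body for POST /prompt: the workflow graph plus a client id."""
    return {"prompt": workflow, "client_id": client_id}


def history_url(base: str, prompt_id: str) -> str:
    """URL for GET /history/<id>, polled until the prompt has outputs."""
    return f"{base}/history/{prompt_id}"


def view_urls(base: str, history: dict, prompt_id: str) -> list[str]:
    """Turn a GET /history/<id> payload into one GET /view URL per output image."""
    urls = []
    outputs = history.get(prompt_id, {}).get("outputs", {})
    for node in outputs.values():
        for image in node.get("images", []):
            query = urlencode({
                "filename": image["filename"],
                "subfolder": image.get("subfolder", ""),
                "type": image.get("type", "output"),
            })
            urls.append(f"{base}/view?{query}")
    return urls
```

An OpenAI-style response would then be assembled from the bytes fetched at those URLs; that step is omitted because the text does not describe it.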

Toolbox image: ghcr.io/hal0ai/hal0-toolbox-comfyui:v1 (pinned by sha256). Targets ROCm. The Strix Halo iGPU class is the first-class target: its unified memory pool can hold an SDXL Turbo checkpoint alongside a primary chat model.

How a provider plugs in

Every provider implements:

Method	What it does
`build_env()`	Compute the env file the systemd unit and container will use.
`start_cmd()`	The argv passed to podman to launch the container.
`health()`	Cheap probe to decide `warming → ready`.
`infer()`	The request path the dispatcher proxies to.
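The contract in the table above could be expressed as a structural type. The sketch below is a guess at its shape: the four method names come from the table, but the argument and return types, the EchoProvider class, the environment variable name, and the image reference are all hypothetical.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Provider(Protocol):
    """Assumed signatures; only the method names come from the docs."""

    def build_env(self) -> dict[str, str]: ...
    def start_cmd(self) -> list[str]: ...
    def health(self) -> bool: ...
    def infer(self, request: dict[str, Any]) -> Any: ...


class EchoProvider:
    """Toy provider used only to show the shape of the contract."""

    def build_env(self) -> dict[str, str]:
        return {"HAL0_SLOT_PORT": "9000"}  # hypothetical variable name

    def start_cmd(self) -> list[str]:
        # Placeholder image reference, not a real hal0 toolbox.
        return ["podman", "run", "--rm", "example.invalid/echo:latest"]

    def health(self) -> bool:
        return True

    def infer(self, request: dict[str, Any]) -> Any:
        return {"echo": request}
```

Because the providers hold no state between calls, a slot manager written against such a type never needs to know which concrete class it is driving.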

The slot lifecycle (offline → pulling → starting → warming → ready → serving ↔ idle → unloading) is identical across providers. Adding a new provider means implementing this contract; no slot-manager changes are required.
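The lifecycle chain can be read as a transition table. The sketch below encodes only the arrows written above; the SlotState and advance names are hypothetical, and what follows unloading is not stated in the text, so that state has no outgoing transitions here.

```python
from enum import Enum


class SlotState(str, Enum):
    OFFLINE = "offline"
    PULLING = "pulling"
    STARTING = "starting"
    WARMING = "warming"
    READY = "ready"
    SERVING = "serving"
    IDLE = "idle"
    UNLOADING = "unloading"


# Allowed moves, taken directly from the chain in the text.
TRANSITIONS: dict[SlotState, set[SlotState]] = {
    SlotState.OFFLINE: {SlotState.PULLING},
    SlotState.PULLING: {SlotState.STARTING},
    SlotState.STARTING: {SlotState.WARMING},
    SlotState.WARMING: {SlotState.READY},  # gated by the provider's health()
    SlotState.READY: {SlotState.SERVING},
    SlotState.SERVING: {SlotState.IDLE},
    SlotState.IDLE: {SlotState.SERVING, SlotState.UNLOADING},
    SlotState.UNLOADING: set(),  # next state not specified in the text
}


def advance(current: SlotState, target: SlotState) -> SlotState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Nothing in the table refers to a provider, which is the point: the same transitions apply whichever class sits behind the slot.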