DGX Spark Local LLM Infrastructure Research
Evaluated local LLM inference infrastructure for Claude Code on NVIDIA DGX Spark hardware.
Hardware & Model Evaluation
Researched NVIDIA DGX Spark specifications: GB10 Grace Blackwell Superchip, 128 GB unified memory, 1 PFLOP FP4, $4,699 MSRP.
Assessed coding models for the Spark platform:
- Qwen3-Coder-Next (80B MoE, ~43 tok/s)
- Qwen3-Coder-30B-A3B (NVFP4, ~56 tok/s)
- Qwen 2.5 Coder 32B (Apache 2.0)
Evaluated Spark against alternatives: Mac Studio M4 Max (128 GB, higher memory bandwidth, lower cost), Mac Studio M3 Ultra (up to 512 GB), RTX 5090 build (fast but 32 GB VRAM ceiling).
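The memory and bandwidth trade-offs above can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only and rests on assumptions that are not in these notes: a memory bandwidth of 273 GB/s for the Spark, 0.5 bytes per parameter for 4-bit weights and 1 byte for FP8 (ignoring quantization scale overhead, KV cache, and activations), and roughly 3B active parameters per token for Qwen3-Coder-30B-A3B, read off the "A3B" suffix. It estimates weight footprint and an upper bound on single-stream decode speed, assuming decoding is limited by reading every active weight once per token.

```python
def weight_footprint_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: parameters (in billions) x bytes per parameter."""
    return total_params_b * bytes_per_param


def decode_ceiling_tok_s(
    bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float
) -> float:
    """Upper bound on tokens/s if each active weight is read once per generated token."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)


if __name__ == "__main__":
    # Assumption: Spark memory bandwidth; this figure is not stated in the notes.
    spark_bandwidth_gb_s = 273.0

    # 80B model at FP8 (1 byte/param): exceeds a 32 GB card, fits in 128 GB.
    print("80B @ FP8 weights (GB):", weight_footprint_gb(80, 1.0))

    # 30B MoE at 4-bit (0.5 byte/param), assuming ~3B active parameters per token.
    print("30B-A3B @ 4-bit weights (GB):", weight_footprint_gb(30, 0.5))
    print("30B-A3B decode ceiling (tok/s):",
          decode_ceiling_tok_s(spark_bandwidth_gb_s, 3, 0.5))

    # Dense 32B at 4-bit: every parameter is active on every token.
    print("Dense 32B decode ceiling (tok/s):",
          decode_ceiling_tok_s(spark_bandwidth_gb_s, 32, 0.5))
```

These are ceilings, not predictions: measured throughput will be lower once KV-cache reads, routing overhead, and kernel efficiency are included. The useful part is the ratio between the MoE and dense cases under the same bandwidth.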
vLLM + Claude Code Integration
Identified vLLM (run from NVIDIA’s DGX Spark Docker image) as the recommended inference server, based on its native Anthropic Messages API support and tool-calling compatibility.
Redirecting Claude Code via the ANTHROPIC_BASE_URL environment variable requires no patches or forks. Required vLLM flags:
--enable-auto-tool-choice, --tool-call-parser, --enable-prefix-caching
Stack is viable: vLLM + Qwen3-Coder-Next with ANTHROPIC_BASE_URL=http://localhost:8000.
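As a rough illustration of how these pieces fit together, the sketch below assembles a `vllm serve` command line with the three flags and prepares an environment that points Claude Code at the local server. The helper names `build_vllm_command` and `claude_code_env` are hypothetical. The model identifier and the value passed to `--tool-call-parser` are left as placeholders because the notes do not specify them; the right parser name depends on the model and the vLLM version.

```python
import os
import shlex


def build_vllm_command(model: str, tool_call_parser: str, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` command with the three flags listed in the notes."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--enable-auto-tool-choice",
        "--tool-call-parser", tool_call_parser,  # takes a parser name as its value
        "--enable-prefix-caching",
    ]


def claude_code_env(port: int = 8000, host: str = "localhost") -> dict[str, str]:
    """Copy the current environment and redirect Claude Code to the local server."""
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = f"http://{host}:{port}"
    return env


if __name__ == "__main__":
    # Placeholders: substitute the real model ID and parser name for your setup.
    cmd = build_vllm_command("<model-id>", "<parser-name>")
    print(shlex.join(cmd))
    print("ANTHROPIC_BASE_URL =", claude_code_env()["ANTHROPIC_BASE_URL"])
```

The command list could be handed to `subprocess.Popen` (or run inside the Docker image), and the returned environment passed to whatever process launches Claude Code. Keeping the port in one place avoids the server and the base URL drifting apart.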
Reflection
What went well:
- ANTHROPIC_BASE_URL redirect is clean and well-documented
- MoE architecture (Qwen3 family) identified as optimal for Spark’s bandwidth characteristics
- Parallel research covered hardware, software stack, models, and alternatives efficiently
What could be better:
- If implementing, benchmark Qwen3-Coder-30B-A3B (NVFP4) vs Qwen3-Coder-Next (FP8) on real Claude Code tasks
- Consider testing LM Studio as a lower-friction entry point before committing to vLLM + Docker
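If the benchmark is carried out, a small timing harness keeps the comparison between models (or between vLLM and LM Studio) consistent. The sketch below is a minimal, backend-agnostic version: `generate` is a hypothetical adapter that sends one prompt to whichever server is under test and returns the number of output tokens it produced. How that adapter talks to the server is deliberately left out.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def measure_tok_per_s(generate: Callable[[str], int], prompts: Sequence[str]) -> float:
    """Return the mean output tokens per second across all prompts.

    `generate` is a hypothetical adapter: it runs one prompt against the
    backend under test and returns the count of output tokens produced.
    """
    rates = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:  # skip degenerate timings rather than divide by zero
            rates.append(n_tokens / elapsed)
    if not rates:
        raise ValueError("no usable timings were collected")
    return mean(rates)


if __name__ == "__main__":
    def fake_generate(prompt: str) -> int:
        """Stand-in backend: pretends to emit 10 tokens after a short delay."""
        time.sleep(0.01)
        return 10

    print(measure_tok_per_s(fake_generate, ["refactor this function", "write a test"]))
```

Running the same prompt set, drawn from real Claude Code tasks, against each model gives comparable end-to-end numbers. This measures wall-clock throughput including prompt processing, so it will read lower than pure decode speed.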
Shipped:
- Research infrastructure plan (not hardware-bound; no purchases made)