llama-strix
Arch Linux packaging and helper scripts for rocmfp4-llama — experimental AMD-focused FP4 quantization and MTP inference for llama.cpp, targeting AMD Strix Halo (gfx1151).
This repository does not contain the upstream source code. It is a thin packaging layer that:
- Builds rocmfp4-llama from the upstream tree via PKGBUILD
- Provides convenience wrappers (rocmfp4-run.sh, rocmfp4-convert.sh)
- Keeps your custom settings, presets, and packaging logic version-controlled and separate from upstream
Directory Layout
~/prppl/
├── rocmfp4-llama/ # upstream source (clone this separately)
│ ├── README.md
│ ├── scripts/build-strix-rocmfp4-mtp.sh
│ └── ...
└── llama-strix/ # this repo (packaging + helpers)
    ├── PKGBUILD
    ├── rocmfp4-run.sh
    └── rocmfp4-convert.sh
Prerequisites
Install ROCm, Vulkan, and build dependencies:
sudo pacman -S --needed git cmake ninja hip-runtime-amd hipblas \
rocm-device-libs vulkan-headers vulkan-icd-loader spirv-headers ccache
Optional:
sudo pacman -S --needed vulkan-tools rocm-smi-lib
Build & Install
cd llama-strix
makepkg -si
Override the GPU architecture if you are not on Strix Halo (gfx1151):
CMAKE_HIP_ARCHITECTURES=gfx1100 makepkg -si
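If you are unsure which value to pass, the target name can usually be read from the ROCm tooling. The following is a minimal sketch, assuming `rocminfo` is installed and on PATH (it is not in the prerequisite list above) and that its output contains the target as a `gfx...` token. The helper name `first_gfx_target` is hypothetical and not part of this repository.

```sh
#!/bin/sh
# Print the first gfx target name found on stdin (for example "gfx1100").
first_gfx_target() {
    grep -oE 'gfx[0-9a-f]+' | head -n 1
}

# Demo with a canned line in the style of rocminfo output:
printf '  Name:                    gfx1100\n' | first_gfx_target

# On a real system (assumption: rocminfo is available):
#   arch="$(rocminfo | first_gfx_target)"
#   CMAKE_HIP_ARCHITECTURES="$arch" makepkg -si
```

On machines with more than one GPU agent this picks only the first match, so check the full `rocminfo` output before relying on it.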
After installation, binaries are prefixed with rocmfp4- to avoid clashing with upstream llama.cpp:
| Binary | Purpose |
|---|---|
| rocmfp4-llama-cli | Interactive inference |
| rocmfp4-llama-server | OpenAI-compatible server |
| rocmfp4-llama-quantize | Quantize to ROCmFP4 formats |
| rocmfp4-llama-bench | Benchmarking |
| rocmfp4-llama-completion | Completion tool |
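A quick way to confirm the install is to check that each prefixed binary resolves on PATH. This is a small sketch rather than something shipped in this repository; the function name `report_missing` is hypothetical, and the binary names are the ones listed in the table above.

```sh
#!/bin/sh
# Print each named command that is not on PATH.
# Returns non-zero if at least one is missing.
report_missing() {
    status=0
    for name in "$@"; do
        if ! command -v "$name" >/dev/null 2>&1; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}

# Check the binaries this package is documented to install.
report_missing rocmfp4-llama-cli rocmfp4-llama-server rocmfp4-llama-quantize \
    rocmfp4-llama-bench rocmfp4-llama-completion || true
```

No output means all five names were found.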
Convert a Model
You need an F16 or BF16 GGUF as the source. Re-quantizing an already-quantized model compounds quantization error, so avoid it when output quality matters.
# Default compact Strix preset (recommended)
./rocmfp4-convert.sh -i /path/to/qwen3.6-35b-a3b-bf16.gguf
# Quality-biased preset
./rocmfp4-convert.sh -i /path/to/qwen3.6-35b-a3b-bf16.gguf -p Q4_0_ROCMFP4_STRIX
# Pure fast format (smallest; may cost some coherence)
./rocmfp4-convert.sh -i /path/to/model-f16.gguf -p Q4_0_ROCMFP4_FAST
# Custom output path
./rocmfp4-convert.sh -i model.gguf -o my-model.gguf -p Q4_0_ROCMFP4
Presets
| Preset | BPW | Description |
|---|---|---|
| Q4_0_ROCMFP4_STRIX_LEAN | ~4.50 | Compact, fast, default |
| Q4_0_ROCMFP4_STRIX | ~4.50 | Quality-biased, protects sensitive tensors |
| Q4_0_ROCMFP4 | 4.50 | Pure dual-scale |
| Q4_0_ROCMFP4_FAST | 4.25 | Pure single-scale, fastest, smallest |
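The BPW (bits per weight) column gives a rough idea of output size: weights × BPW / 8 bytes. The sketch below treats a model as a round number of billions of weights and ignores GGUF metadata and any tensors a preset keeps at higher precision, so read the results as ballpark figures only. The helper name `estimate_gb` is hypothetical.

```sh
#!/bin/sh
# Rough weight size in GB (10^9 bytes):
#   params (in billions) * bits-per-weight / 8
estimate_gb() {
    awk -v params_b="$1" -v bpw="$2" \
        'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}

estimate_gb 35 4.50   # 35B weights at 4.50 BPW -> 19.7
estimate_gb 35 4.25   # 35B weights at 4.25 BPW -> 18.6
```

For a 35B model the difference between the 4.50 and 4.25 BPW presets is therefore about 1 GB.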
Run a Model
Basic interactive run
./rocmfp4-run.sh -m /path/to/model-ROCmFP4-STRIX_LEAN.gguf
Full power: MTP + reasoning (Qwen3.6 35B A3B style)
./rocmfp4-run.sh -m /path/to/model-ROCmFP4-STRIX_LEAN.gguf --mtp --reasoning
Smaller context for testing
./rocmfp4-run.sh -m /path/to/model.gguf -c 8192
Pass through extra llama-cli args
./rocmfp4-run.sh -m model.gguf --mtp -p "Explain quantum computing"
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL | (none) | Path to model (set via -m) |
| CTX | 262144 | Context window size |
| BATCH | 512 | Batch size |
| UBATCH | 512 | Micro-batch size |
| NGPU_LAYERS | 999 | GPU offload layers |
| DEVICE | ROCm0 | Backend device |
| CTX_TYPE_K | q8_0 | KV cache key type |
| CTX_TYPE_V | q8_0 | KV cache value type |
| HSA_OVERRIDE_GFX_VERSION | 11.5.1 | ROCm GPU version override |
| GGML_HIP_ENABLE_UNIFIED_MEMORY | 1 | Enable unified memory |
| PRESET | Q4_0_ROCMFP4_STRIX_LEAN | Default quant preset |
| THREADS | $(nproc) | Quantize thread count |
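Defaults like these are commonly implemented with the shell's `${VAR:-default}` expansion, which uses the caller's value when the variable is set and falls back otherwise. The sketch below illustrates that pattern with a subset of the documented defaults. It is not the actual content of rocmfp4-run.sh, and the function name `resolve_settings` is hypothetical.

```sh
#!/bin/sh
# Resolve each setting: take the value from the environment when set,
# otherwise fall back to the documented default.
resolve_settings() {
    echo "CTX=${CTX:-262144}"
    echo "BATCH=${BATCH:-512}"
    echo "UBATCH=${UBATCH:-512}"
    echo "NGPU_LAYERS=${NGPU_LAYERS:-999}"
    echo "DEVICE=${DEVICE:-ROCm0}"
    echo "CTX_TYPE_K=${CTX_TYPE_K:-q8_0}"
    echo "CTX_TYPE_V=${CTX_TYPE_V:-q8_0}"
}

resolve_settings                 # every value falls back to its default
( CTX=8192; resolve_settings )   # one override, scoped to a subshell
```

Assuming the wrapper reads its settings from the environment as the table implies, a one-off override would look like `CTX=8192 ./rocmfp4-run.sh -m model.gguf`.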
Flags Reference
--mtp
Enables Multi-Token Prediction (draft-MTP). Only use with models that support it (e.g., Qwen3.6 35B A3B). This sets:
--spec-type draft-mtp
--spec-draft-n-max 3
--spec-draft-n-min 0
--spec-draft-p-min 0.0
--spec-draft-p-split 0.10
--spec-draft-type-k q4_0
--spec-draft-type-v q4_0
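To make the expansion concrete, here is a sketch of how a wrapper could turn a bare `--mtp` into the flag set listed above while passing every other argument through untouched. It is an illustration under that assumption, not the wrapper's real code, and `expand_mtp` is a hypothetical name.

```sh
#!/bin/sh
# Replace a bare --mtp with the documented speculative-decoding flags.
# All other arguments pass through unchanged. Prints one argument per line.
expand_mtp() {
    for arg in "$@"; do
        if [ "$arg" = "--mtp" ]; then
            printf '%s\n' \
                --spec-type draft-mtp \
                --spec-draft-n-max 3 \
                --spec-draft-n-min 0 \
                --spec-draft-p-min 0.0 \
                --spec-draft-p-split 0.10 \
                --spec-draft-type-k q4_0 \
                --spec-draft-type-v q4_0
        else
            printf '%s\n' "$arg"
        fi
    done
}

expand_mtp -m model.gguf --mtp
```

Printing one argument per line keeps the example easy to inspect; a real wrapper would more likely collect the arguments and hand them directly to the binary.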
--reasoning
Enables reasoning mode (step-by-step thinking before answering). Remove this flag for models that do not support it.
Expected Performance (Strix Halo 395+, 128 GB unified RAM)
| Model | Context | Profile | Decode |
|---|---|---|---|
| Qwen3.6 35B A3B MTP ROCmFP4 STRIX_LEAN | 262k | draft-MTP, reasoning, q8 KV | 104.4 tok/s short, 89.3 sustained |
| Qwen3.6 27B MTP ROCmFP4 STRIX_LEAN | 262k | draft-MTP | 33.6 tok/s short, 28.0 sustained |
Your mileage may vary. Results are hardware-, driver-, model-, and prompt-sensitive.
Rebuilding After Upstream Changes
cd llama-strix
rm -rf pkg/ src/ *.pkg.tar.zst
makepkg -si
License
This packaging layer is provided under the same MIT license as upstream llama.cpp. See upstream LICENSE for details.