llama-strix

Arch Linux packaging and helper scripts for rocmfp4-llama — experimental AMD-focused FP4 quantization and MTP inference for llama.cpp, targeting AMD Strix Halo (gfx1151).

This repository does not contain the rocmfp4-llama source code. It is a thin packaging layer that:

  • Builds rocmfp4-llama from the upstream tree via PKGBUILD
  • Provides convenience wrappers (rocmfp4-run.sh, rocmfp4-convert.sh)
  • Keeps your custom settings, presets, and packaging logic version-controlled and separate from upstream

Directory Layout

~/prppl/
├── rocmfp4-llama/      # upstream source (clone this separately)
│   ├── README.md
│   ├── scripts/build-strix-rocmfp4-mtp.sh
│   └── ...
└── llama-strix/        # this repo (packaging + helpers)
    ├── PKGBUILD
    ├── rocmfp4-run.sh
    └── rocmfp4-convert.sh

Prerequisites

Install ROCm, Vulkan, and build dependencies:

sudo pacman -S --needed git cmake ninja hip-runtime-amd hipblas \
  rocm-device-libs vulkan-headers vulkan-icd-loader spirv-headers ccache

Optional:

sudo pacman -S --needed vulkan-tools rocm-smi-lib

Build & Install

cd llama-strix
makepkg -si

Override the GPU architecture if you are not on Strix Halo (gfx1151):

CMAKE_HIP_ARCHITECTURES=gfx1100 makepkg -si
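
If you are unsure which architecture string to pass, it can usually be read from the output of rocminfo, which lists agent names such as gfx1151. The following is a minimal sketch, assuming rocminfo is installed and that the first gfx name in its output is the GPU you want to target. first_gfx_arch is a hypothetical helper name and is not part of this repository.

```shell
# Print the first gfx architecture name found on stdin (e.g. rocminfo output).
first_gfx_arch() {
    grep -o 'gfx[0-9a-f][0-9a-f]*' | head -n 1
}

# Usage (assumes rocminfo is on PATH):
#   CMAKE_HIP_ARCHITECTURES="$(rocminfo | first_gfx_arch)" makepkg -si
```

On systems with both an integrated and a discrete GPU, check the full rocminfo output by hand rather than trusting the first match.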

After installation, binaries are prefixed with rocmfp4- to avoid clashing with upstream llama.cpp:

| Binary | Purpose |
| --- | --- |
| rocmfp4-llama-cli | Interactive inference |
| rocmfp4-llama-server | OpenAI-compatible server |
| rocmfp4-llama-quantize | Quantize to ROCmFP4 formats |
| rocmfp4-llama-bench | Benchmarking |
| rocmfp4-llama-completion | Completion tool |
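
A quick way to confirm the install is to check that each prefixed binary is on PATH. This is a small sketch that assumes the five binary names listed above; missing_rocmfp4_binaries is a hypothetical helper name.

```shell
# List any of the expected rocmfp4- binaries that are not on PATH.
missing_rocmfp4_binaries() {
    for tool in cli server quantize bench completion; do
        bin="rocmfp4-llama-$tool"
        command -v "$bin" >/dev/null 2>&1 || printf '%s\n' "$bin"
    done
}

missing="$(missing_rocmfp4_binaries)"
if [ -z "$missing" ]; then
    echo "all rocmfp4- binaries found"
else
    # Unquoted on purpose so the names are joined onto one line.
    echo "missing:" $missing
fi
```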

Convert a Model

The source must be an F16 or BF16 GGUF. Re-quantizing an already-quantized model compounds the quantization error, so avoid it when output quality matters.

# Default compact Strix preset (recommended)
./rocmfp4-convert.sh -i /path/to/qwen3.6-35b-a3b-bf16.gguf

# Quality-biased preset
./rocmfp4-convert.sh -i /path/to/qwen3.6-35b-a3b-bf16.gguf -p Q4_0_ROCMFP4_STRIX

# Pure fast format (smallest, may trade coherence)
./rocmfp4-convert.sh -i /path/to/model-f16.gguf -p Q4_0_ROCMFP4_FAST

# Custom output path
./rocmfp4-convert.sh -i model.gguf -o my-model.gguf -p Q4_0_ROCMFP4
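
Before starting a long conversion, it can help to confirm that the input really is a GGUF file. GGUF files begin with the four ASCII bytes "GGUF", so a cheap sanity check is possible in plain shell. This sketch only validates the container magic; it does not verify that the tensors are F16 or BF16. is_gguf is a hypothetical helper name.

```shell
# Return success if the file starts with the 4-byte GGUF magic "GGUF".
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}

# Usage (the input path is a placeholder):
#   is_gguf /path/to/model-f16.gguf && ./rocmfp4-convert.sh -i /path/to/model-f16.gguf
```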

Presets

| Preset | BPW | Description |
| --- | --- | --- |
| Q4_0_ROCMFP4_STRIX_LEAN | ~4.50 | Compact, fast, default |
| Q4_0_ROCMFP4_STRIX | ~4.50 | Quality-biased, protects sensitive tensors |
| Q4_0_ROCMFP4 | 4.50 | Pure dual-scale |
| Q4_0_ROCMFP4_FAST | 4.25 | Pure single-scale, fastest, smallest |
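
BPW (bits per weight) translates into an approximate file size as parameters × BPW / 8 bytes. The sketch below applies that formula with awk. It assumes a nominal parameter count in billions and ignores metadata and any tensors a preset keeps at higher precision, so treat the result as a rough estimate. estimate_size_gb is a hypothetical helper name.

```shell
# Estimate file size in GB (10^9 bytes) from the parameter count (in billions)
# and bits per weight: size = params * bpw / 8.
estimate_size_gb() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

estimate_size_gb 35 4.50   # 35B parameters at 4.50 BPW -> 19.7
estimate_size_gb 35 4.25   # 35B parameters at 4.25 BPW -> 18.6
```

For a nominal 35B-parameter model, the gap between the 4.50 and 4.25 BPW formats is therefore roughly 1 GB.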

Run a Model

Basic interactive run

./rocmfp4-run.sh -m /path/to/model-ROCmFP4-STRIX_LEAN.gguf

Full power: MTP + reasoning (Qwen3.6 35B A3B style)

./rocmfp4-run.sh -m /path/to/model-ROCmFP4-STRIX_LEAN.gguf --mtp --reasoning

Smaller context for testing

./rocmfp4-run.sh -m /path/to/model.gguf -c 8192

Pass through extra llama-cli args

./rocmfp4-run.sh -m model.gguf --mtp -p "Explain quantum computing"

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| MODEL | (none) | Path to model (set via -m) |
| CTX | 262144 | Context window size |
| BATCH | 512 | Batch size |
| UBATCH | 512 | Micro-batch size |
| NGPU_LAYERS | 999 | GPU offload layers |
| DEVICE | ROCm0 | Backend device |
| CTX_TYPE_K | q8_0 | KV cache key type |
| CTX_TYPE_V | q8_0 | KV cache value type |
| HSA_OVERRIDE_GFX_VERSION | 11.5.1 | ROCm GPU version override |
| GGML_HIP_ENABLE_UNIFIED_MEMORY | 1 | Enable unified memory |
| PRESET | Q4_0_ROCMFP4_STRIX_LEAN | Default quant preset |
| THREADS | $(nproc) | Quantize thread count |
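
Wrapper scripts commonly implement defaults like these with the ${VAR:-default} expansion, which uses the environment value when one is set and the default otherwise. The following is a sketch of that pattern using a subset of the variables above. It is an assumption about how such a wrapper might work, not the actual contents of rocmfp4-run.sh, and resolve_run_settings is a hypothetical name.

```shell
# Sketch only: resolve settings from the environment, falling back to the
# documented defaults. Not the actual rocmfp4-run.sh source.
resolve_run_settings() {
    printf 'CTX=%s BATCH=%s UBATCH=%s NGPU_LAYERS=%s DEVICE=%s\n' \
        "${CTX:-262144}" "${BATCH:-512}" "${UBATCH:-512}" \
        "${NGPU_LAYERS:-999}" "${DEVICE:-ROCm0}"
}

# With nothing set, the defaults apply.
resolve_run_settings
# A one-off override changes a single value (subshell keeps it local).
( CTX=8192; resolve_run_settings )
```

Under that assumption, a one-off override on the command line would look like CTX=8192 ./rocmfp4-run.sh -m model.gguf.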

Flags Reference

--mtp

Enables Multi-Token Prediction (draft-MTP). Only use with models that support it (e.g., Qwen3.6 35B A3B). This sets:

--spec-type draft-mtp
--spec-draft-n-max 3
--spec-draft-n-min 0
--spec-draft-p-min 0.0
--spec-draft-p-split 0.10
--spec-draft-type-k q4_0
--spec-draft-type-v q4_0
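
To illustrate how a wrapper flag can expand into this set of arguments, here is a minimal sketch. It assumes the wrapper tracks MTP as a 0/1 switch and is not the actual rocmfp4-run.sh source. mtp_flags is a hypothetical name, and the flag values are copied from the list above.

```shell
# Sketch only: emit the speculative-decoding flags, one token per line,
# when MTP is requested ("1"); emit nothing otherwise.
mtp_flags() {
    [ "$1" = "1" ] || return 0
    printf '%s\n' \
        --spec-type draft-mtp \
        --spec-draft-n-max 3 \
        --spec-draft-n-min 0 \
        --spec-draft-p-min 0.0 \
        --spec-draft-p-split 0.10 \
        --spec-draft-type-k q4_0 \
        --spec-draft-type-v q4_0
}

# Usage: word splitting is safe here because no token contains whitespace.
#   set -- $(mtp_flags 1)
#   rocmfp4-llama-cli -m model.gguf "$@"
```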

--reasoning

Enables reasoning mode (step-by-step thinking before answering). Omit this flag for models that do not support it.

Expected Performance (Strix Halo 395+, 128 GB unified RAM)

| Model | Context | Profile | Decode |
| --- | --- | --- | --- |
| Qwen3.6 35B A3B MTP ROCmFP4 STRIX_LEAN | 262k | draft-MTP, reasoning, q8 KV | 104.4 tok/s short, 89.3 tok/s sustained |
| Qwen3.6 27B MTP ROCmFP4 STRIX_LEAN | 262k | draft-MTP | 33.6 tok/s short, 28.0 tok/s sustained |

Your mileage may vary. Results are hardware-, driver-, model-, and prompt-sensitive.

Rebuilding After Upstream Changes

cd llama-strix
rm -rf pkg/ src/ *.pkg.tar.zst
makepkg -si

License

This packaging layer is provided under the same MIT license as upstream llama.cpp. See upstream LICENSE for details.