Jan De Landtsheer f256ed0211 Add packaging, helper scripts, and README		2026-06-06 22:31:34 +02:00
llama-strix

Arch Linux packaging and helper scripts for rocmfp4-llama: experimental, AMD-focused FP4 quantization and MTP (Multi-Token Prediction) inference for llama.cpp, targeting AMD Strix Halo (gfx1151).

This repository does not contain the rocmfp4-llama source code. It is a thin packaging layer that:

Builds rocmfp4-llama from the upstream tree via PKGBUILD
Provides convenience wrappers (rocmfp4-run.sh, rocmfp4-convert.sh)
Keeps your custom settings, presets, and packaging logic version-controlled and separate from upstream

Directory Layout

~/prppl/
├── rocmfp4-llama/      # upstream source (clone this separately)
│   ├── README.md
│   ├── scripts/build-strix-rocmfp4-mtp.sh
│   └── ...
└── llama-strix/        # this repo (packaging + helpers)
    ├── PKGBUILD
    ├── rocmfp4-run.sh
    └── rocmfp4-convert.sh
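
A quick way to confirm the two checkouts sit side by side is a small check like the one below. This is a minimal sketch: `check_layout` is a hypothetical helper that is not part of either repository, and it assumes the build expects the sibling layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: verify that the upstream tree and this repo
# share a parent directory, as in the layout above.
check_layout() {
    root=$1   # parent directory, e.g. ~/prppl
    for path in \
        "$root/rocmfp4-llama/scripts/build-strix-rocmfp4-mtp.sh" \
        "$root/llama-strix/PKGBUILD"
    do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            return 1
        fi
    done
    echo "layout ok: $root"
}

# Example: check_layout "$HOME/prppl"
```

The function returns non-zero on the first missing path, so it can gate a build step with `check_layout "$HOME/prppl" && makepkg -si`.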

Prerequisites

Install ROCm, Vulkan, and build dependencies:

sudo pacman -S --needed git cmake ninja hip-runtime-amd hipblas \
  rocm-device-libs vulkan-headers vulkan-icd-loader spirv-headers ccache

Optional:

sudo pacman -S --needed vulkan-tools rocm-smi-lib
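
After installing, a short check can confirm that the command-line build tools are actually on `PATH`. The sketch below is illustrative only: `missing_tools` is a hypothetical helper, and it covers executables only, not libraries such as hipblas (for those, `pacman -Q <package>` is the usual check).

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools git cmake ninja ccache)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "not on PATH:" $missing
else
    echo "all build tools found"
fi
```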

Build & Install

cd llama-strix
makepkg -si

The build targets Strix Halo (gfx1151) by default. To build for a different GPU, override the architecture:

CMAKE_HIP_ARCHITECTURES=gfx1100 makepkg -si
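
If you are unsure which architecture string to pass, it can usually be read from `rocminfo` output. The following is a sketch under assumptions: `first_gfx_arch` is a hypothetical helper, and it assumes `rocminfo` prints the target name (for example `gfx1151`) somewhere in its agent listing. A sample line stands in for real output so the snippet runs without ROCm installed.

```shell
#!/bin/sh
# Hypothetical helper: extract the first gfx target name from
# rocminfo-style text on stdin.
first_gfx_arch() {
    grep -o 'gfx[0-9a-f][0-9a-f]*' | head -n 1
}

# With ROCm installed, this would be used as:
#   arch=$(rocminfo | first_gfx_arch)
#   CMAKE_HIP_ARCHITECTURES="$arch" makepkg -si
# Here a sample line stands in for real rocminfo output.
printf '  Name:                    gfx1151\n' | first_gfx_arch
```

On systems with more than one agent, the first match may not be the GPU you intend to build for, so check the full `rocminfo` listing before relying on it.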

After installation, binaries are prefixed with rocmfp4- to avoid clashing with upstream llama.cpp:

| Binary | Purpose |
| --- | --- |
| `rocmfp4-llama-cli` | Interactive inference |
| `rocmfp4-llama-server` | OpenAI-compatible server |
| `rocmfp4-llama-quantize` | Quantize to ROCmFP4 formats |
| `rocmfp4-llama-bench` | Benchmarking |
| `rocmfp4-llama-completion` | Completion tool |

Convert a Model

Use an F16 or BF16 GGUF as the source. Re-quantizing an already-quantized model compounds the quantization error, so avoid it for any quality-sensitive work.

# Default compact Strix preset (recommended)
./rocmfp4-convert.sh -i /path/to/qwen3.6-35b-a3b-bf16.gguf

# Quality-biased preset
./rocmfp4-convert.sh -i /path/to/qwen3.6-35b-a3b-bf16.gguf -p Q4_0_ROCMFP4_STRIX

# Pure fast format (smallest; may cost some coherence)
./rocmfp4-convert.sh -i /path/to/model-f16.gguf -p Q4_0_ROCMFP4_FAST

# Custom output path
./rocmfp4-convert.sh -i model.gguf -o my-model.gguf -p Q4_0_ROCMFP4
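
Before starting a long conversion, it can help to confirm that the input really is a GGUF file. GGUF files begin with the four ASCII bytes `GGUF`, so a minimal pre-flight check might look like this. `is_gguf` is a hypothetical helper, not something `rocmfp4-convert.sh` is known to do, and it does not verify that the tensors are F16 or BF16.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file starts with the
# 4-byte GGUF magic ("GGUF").
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Example pre-flight check (the path is a placeholder):
#   is_gguf /path/to/model-bf16.gguf || echo "not a GGUF file" >&2
```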

Presets

| Preset | Bits per weight (BPW) | Description |
| --- | --- | --- |
| `Q4_0_ROCMFP4_STRIX_LEAN` | ~4.50 | Compact, fast, default |
| `Q4_0_ROCMFP4_STRIX` | ~4.50 | Quality-biased, protects sensitive tensors |
| `Q4_0_ROCMFP4` | 4.50 | Pure dual-scale |
| `Q4_0_ROCMFP4_FAST` | 4.25 | Pure single-scale, fastest, smallest |
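
The BPW figure gives a rough way to estimate the size of the quantized weights: parameters × BPW ÷ 8 bytes. The sketch below does that arithmetic. `estimate_gb` is a hypothetical helper, the 35-billion parameter count is only an illustrative input, and the result ignores GGUF metadata and any tensors a preset keeps at higher precision, so real files will differ.

```shell
#!/bin/sh
# Hypothetical helper: estimate quantized weight size in GB (10^9 bytes).
# Usage: estimate_gb PARAMS_IN_BILLIONS BPW
estimate_gb() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

estimate_gb 35 4.50   # 35B parameters at 4.50 BPW -> 19.7
estimate_gb 35 4.25   # same model at 4.25 BPW    -> 18.6
```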

Run a Model

Basic interactive run

./rocmfp4-run.sh -m /path/to/model-ROCmFP4-STRIX_LEAN.gguf

Full power: MTP + reasoning (Qwen3.6 35B A3B style)

./rocmfp4-run.sh -m /path/to/model-ROCmFP4-STRIX_LEAN.gguf --mtp --reasoning

Smaller context for testing

./rocmfp4-run.sh -m /path/to/model.gguf -c 8192

Pass through extra llama-cli args

./rocmfp4-run.sh -m model.gguf --mtp -p "Explain quantum computing"

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `MODEL` | (none) | Path to model (set via `-m`) |
| `CTX` | `262144` | Context window size |
| `BATCH` | `512` | Batch size |
| `UBATCH` | `512` | Micro-batch size |
| `NGPU_LAYERS` | `999` | GPU offload layers |
| `DEVICE` | `ROCm0` | Backend device |
| `CTX_TYPE_K` | `q8_0` | KV cache key type |
| `CTX_TYPE_V` | `q8_0` | KV cache value type |
| `HSA_OVERRIDE_GFX_VERSION` | `11.5.1` | ROCm GPU version override |
| `GGML_HIP_ENABLE_UNIFIED_MEMORY` | `1` | Enable unified memory |
| `PRESET` | `Q4_0_ROCMFP4_STRIX_LEAN` | Default quant preset |
| `THREADS` | `$(nproc)` | Quantize thread count |
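
To illustrate how environment defaults like these typically work, the sketch below uses the shell's `${VAR:-default}` expansion and maps each value onto the corresponding upstream llama.cpp option (`-c`, `-b`, `-ub`, `-ngl`, `--device`, `-ctk`, `-ctv`). This is an assumption about the pattern, not a copy of `rocmfp4-run.sh`: `build_run_args` is a hypothetical function, and the real script may map the variables differently.

```shell
#!/bin/sh
# Hypothetical sketch: each setting comes from the environment and
# falls back to the documented default when unset or empty.
build_run_args() {
    echo "-c ${CTX:-262144}" \
         "-b ${BATCH:-512}" \
         "-ub ${UBATCH:-512}" \
         "-ngl ${NGPU_LAYERS:-999}" \
         "--device ${DEVICE:-ROCm0}" \
         "-ctk ${CTX_TYPE_K:-q8_0}" \
         "-ctv ${CTX_TYPE_V:-q8_0}"
}

# Show the command line that would result from the current environment.
echo "rocmfp4-llama-cli $(build_run_args)"
```

With this pattern, a one-off override needs no edits to the script, for example `CTX=8192 BATCH=256 ./rocmfp4-run.sh -m model.gguf`.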

Flags Reference

`--mtp`

Enables Multi-Token Prediction (draft-MTP) speculative decoding. Use it only with models that support MTP (e.g., Qwen3.6 35B A3B). The flag expands to:

--spec-type draft-mtp
--spec-draft-n-max 3
--spec-draft-n-min 0
--spec-draft-p-min 0.0
--spec-draft-p-split 0.10
--spec-draft-type-k q4_0
--spec-draft-type-v q4_0
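
The expansion above can be pictured as a simple argument rewrite. The sketch below is illustrative only: `mtp_args` and `expand_args` are hypothetical names, and the actual logic in `rocmfp4-run.sh` is not shown in this README. It replaces `--mtp` with the listed options and passes every other argument through.

```shell
#!/bin/sh
# The options that --mtp stands for, copied from the list above.
mtp_args() {
    echo "--spec-type draft-mtp" \
         "--spec-draft-n-max 3 --spec-draft-n-min 0" \
         "--spec-draft-p-min 0.0 --spec-draft-p-split 0.10" \
         "--spec-draft-type-k q4_0 --spec-draft-type-v q4_0"
}

# Hypothetical wrapper logic: swap --mtp for its expansion and keep
# all other arguments. Joining into one string loses the quoting of
# arguments that contain spaces, so this is for display only.
expand_args() {
    out=""
    for arg in "$@"; do
        case $arg in
            --mtp) out="$out $(mtp_args)" ;;
            *)     out="$out $arg" ;;
        esac
    done
    echo "${out# }"
}

expand_args -m model.gguf --mtp
```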

`--reasoning`

Enables reasoning mode (step-by-step thinking before answering). Omit this flag for models that do not support it.

Expected Performance (Strix Halo 395+, 128 GB unified RAM)

| Model | Context | Profile | Decode |
| --- | --- | --- | --- |
| Qwen3.6 35B A3B MTP ROCmFP4 STRIX_LEAN | 262k | draft-MTP, reasoning, q8 KV | 104.4 tok/s short, 89.3 tok/s sustained |
| Qwen3.6 27B MTP ROCmFP4 STRIX_LEAN | 262k | draft-MTP | 33.6 tok/s short, 28.0 tok/s sustained |

Your mileage may vary. Results are hardware-, driver-, model-, and prompt-sensitive.
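
To put the decode rates in practical terms, the time to generate N tokens is simply N divided by the rate. The sketch below works this out for a 1000-token response at the sustained rates above. `decode_seconds` is a hypothetical helper, and the figure covers decoding only, not prompt processing.

```shell
#!/bin/sh
# Hypothetical helper: seconds needed to decode N tokens at RATE tok/s.
# Usage: decode_seconds N RATE
decode_seconds() {
    awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n / rate }'
}

decode_seconds 1000 89.3   # 35B A3B, sustained -> 11.2
decode_seconds 1000 28.0   # 27B, sustained     -> 35.7
```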

Rebuilding After Upstream Changes

cd llama-strix
rm -rf pkg/ src/ *.pkg.tar.zst
makepkg -si

License

This packaging layer is provided under the same MIT license as upstream llama.cpp. See upstream LICENSE for details.