Stateless OpenAI-compatible embeddings and reranking daemon using BGE ONNX models.

hero_embedder_provider

Stateless OpenAI-compatible embeddings + reranking daemon backed by BGE ONNX models. No auth, no sessions, no database — models are loaded once at startup and kept in memory.

What it does

  • Embeddings — four BGE quality levels (INT8 fast → FP32 best), 384d or 768d
  • Reranking — BGE cross-encoder reranker scores (query, doc) pairs
  • OpenAI-compatible REST on rest.sock + TCP 8092 for drop-in client use
  • Hero-native JSON-RPC 2.0 on rpc.sock with full status / mem_info introspection
  • Admin dashboard on admin.sock — live status, embed/rerank playground, performance benchmarks

Crates

| Crate | Role |
| --- | --- |
| hero_embedder_provider_lib | ONNX embedder + reranker, download, install |
| hero_embedder_provider_server | Axum server — REST, RPC, TCP 8092 |
| hero_embedder_provider_admin | Admin dashboard web UI |
| hero_embedder_provider_sdk | Generated OpenRPC client for Rust consumers |

Quality levels

Four quality levels are exposed on the embed call and as REST model IDs. Select by quality (1–4) in JSON-RPC, or by model in the REST API:

| Level | Model ID | BGE model | Precision | Dimensions | Max tokens | Use case |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 | bge-small-q1 | bge-small-en-v1.5 | INT8 | 384 | 128 | Fast, low memory |
| Q2 | bge-small-q2 | bge-small-en-v1.5 | FP32 | 384 | 256 | Balanced |
| Q3 | bge-base-q3 | bge-base-en-v1.5 | INT8 | 768 | 256 | High quality |
| Q4 | bge-base-q4 | bge-base-en-v1.5 | FP32 | 768 | 512 | Best accuracy |
| | bge-reranker-base | bge-reranker-base | FP32 | n/a | 512 | Cross-encoder rerank |

OpenAI alias IDs are also accepted on the REST endpoint:

  • text-embedding-3-small → bge-small-q2
  • text-embedding-3-large → bge-base-q4
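
The table and aliases above can be read as a small lookup. The following is a minimal sketch, not the daemon's actual code: the type and function names (ModelSpec, canonical_id, resolve_model, by_quality) are made up for illustration, and only the values come from this README.

```rust
/// Precision of a model's ONNX weights, as listed in the quality table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Int8,
    Fp32,
}

/// One embedding row of the quality table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ModelSpec {
    pub quality: u8,
    pub model_id: &'static str,
    pub precision: Precision,
    pub dimensions: usize,
    pub max_tokens: usize,
}

const fn spec(
    quality: u8,
    model_id: &'static str,
    precision: Precision,
    dimensions: usize,
    max_tokens: usize,
) -> ModelSpec {
    ModelSpec { quality, model_id, precision, dimensions, max_tokens }
}

/// The four embedding levels, copied from the table.
const SPECS: [ModelSpec; 4] = [
    spec(1, "bge-small-q1", Precision::Int8, 384, 128),
    spec(2, "bge-small-q2", Precision::Fp32, 384, 256),
    spec(3, "bge-base-q3", Precision::Int8, 768, 256),
    spec(4, "bge-base-q4", Precision::Fp32, 768, 512),
];

/// Map an OpenAI alias to the local model ID; other names pass through.
pub fn canonical_id(name: &str) -> &str {
    match name {
        "text-embedding-3-small" => "bge-small-q2",
        "text-embedding-3-large" => "bge-base-q4",
        other => other,
    }
}

/// Resolve a REST `model` value (local ID or OpenAI alias) to its spec.
pub fn resolve_model(name: &str) -> Option<ModelSpec> {
    let id = canonical_id(name);
    SPECS.iter().copied().find(|s| s.model_id == id)
}

/// Resolve a JSON-RPC `quality` value (1 to 4) to its spec.
pub fn by_quality(quality: u8) -> Option<ModelSpec> {
    SPECS.iter().copied().find(|s| s.quality == quality)
}
```

With this sketch, resolve_model("text-embedding-3-small") and by_quality(2) both land on the bge-small-q2 row, which is the equivalence the two selection styles are meant to express.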

Bind address

| Interface | Address |
| --- | --- |
| TCP | 127.0.0.1:8092 (Linux: mycelium overlay if up) |
| REST socket | $PATH_SOCKETS/hero_embedder_provider/rest.sock |
| RPC socket | $PATH_SOCKETS/hero_embedder_provider/rpc.sock |
| Admin socket | $PATH_SOCKETS/hero_embedder_provider/admin.sock |

On Linux the TCP listener probes the local mycelium daemon at startup; if a mycelium overlay address is found it binds there so other nodes on the overlay can reach it directly.
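
A client that prefers the Unix sockets needs to build the same paths. This is a minimal sketch of that derivation using only the standard library; the Interface enum and both function names are illustrative, and the only things taken from the table are the PATH_SOCKETS variable, the service directory, and the socket file names.

```rust
use std::path::{Path, PathBuf};

/// Directory name under $PATH_SOCKETS, as shown in the table above.
pub const SERVICE: &str = "hero_embedder_provider";

/// The three Unix-socket interfaces (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Interface {
    Rest,
    Rpc,
    Admin,
}

/// Build `<root>/hero_embedder_provider/<name>.sock` for an interface.
pub fn socket_path(sockets_root: &Path, interface: Interface) -> PathBuf {
    let file = match interface {
        Interface::Rest => "rest.sock",
        Interface::Rpc => "rpc.sock",
        Interface::Admin => "admin.sock",
    };
    sockets_root.join(SERVICE).join(file)
}

/// Same as `socket_path`, reading the root from the PATH_SOCKETS variable.
/// Returns None when the variable is unset.
pub fn socket_path_from_env(interface: Interface) -> Option<PathBuf> {
    let root = std::env::var_os("PATH_SOCKETS")?;
    Some(socket_path(Path::new(&root), interface))
}
```

The resulting path can be handed to std::os::unix::net::UnixStream::connect or to any HTTP client that supports Unix sockets.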

REST API (OpenAI-compatible)

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/embeddings | Compute dense vectors for one or more inputs |
| GET | /v1/models | List available local model IDs |
| GET | /health | `{"status": "ok"}` |

Embeddings request

POST /v1/embeddings
Content-Type: application/json

{ "model": "bge-small-q1", "input": "Hello world" }

Batch input and the text-embedding-3-* aliases work the same way.
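
For clients that do not use an OpenAI SDK, the request can be assembled by hand. The sketch below builds the JSON body, and a complete HTTP/1.1 request, with the standard library only. It assumes that batch input is sent as a JSON array in the input field, as in the OpenAI embeddings API. The function names are illustrative, and a real client would normally use a JSON library rather than the small escaper shown here.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters need the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the JSON body for POST /v1/embeddings.
/// A single input is sent as a string, several inputs as an array.
pub fn embeddings_body(model: &str, inputs: &[&str]) -> String {
    let input = match inputs {
        [single] => format!("\"{}\"", json_escape(single)),
        many => {
            let items: Vec<String> = many
                .iter()
                .map(|s| format!("\"{}\"", json_escape(s)))
                .collect();
            format!("[{}]", items.join(","))
        }
    };
    format!("{{\"model\":\"{}\",\"input\":{}}}", json_escape(model), input)
}

/// Wrap the body in a complete HTTP/1.1 request for the TCP listener.
pub fn embeddings_request(model: &str, inputs: &[&str]) -> String {
    let body = embeddings_body(model, inputs);
    format!(
        "POST /v1/embeddings HTTP/1.1\r\n\
         Host: 127.0.0.1:8092\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {}\r\n\
         Connection: close\r\n\r\n{}",
        body.len(),
        body
    )
}
```

The request string can be written to a TcpStream connected to 127.0.0.1:8092, or to a UnixStream connected to rest.sock.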

JSON-RPC 2.0 API (Hero-native)

All methods are served on rpc.sock at POST /rpc.

| Method | Params | Description |
| --- | --- | --- |
| health | | Service status + models_ready boolean |
| status | | Download progress, model load progress, models root path |
| models.list | | List locally available model IDs |
| mem_info | | Process RSS, system RAM, per-model disk sizes |
| ort_info | | ONNX Runtime version, path, min-version check |
| embed | texts: string[], quality?: 1–4 | Compute embeddings; preserves INT8 quantization scale |
| rerank | query, docs: [{id,text}], top_k? | BGE cross-encoder rerank |

embed example

{ "jsonrpc": "2.0", "id": 1, "method": "embed",
  "params": [["Hello world", "Machine learning"], 1] }

The response includes embeddings, precision (int8/fp32), dimensions, quality, and model name. INT8 embeddings carry a scale field for dequantisation.
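
The exact dequantisation formula is not spelled out here. The sketch below assumes the common symmetric scheme, where each float component is the INT8 value multiplied by scale; check the actual response format before relying on it. Both function names are illustrative.

```rust
/// Convert INT8 components back to floats.
/// Assumption: symmetric quantisation, so value = q * scale (no zero point).
pub fn dequantize(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| f32::from(q) * scale).collect()
}

/// Cosine similarity of two vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Under this assumption the scale cancels out of cosine similarity, so two INT8 embeddings can be compared without dequantising first. The scale only matters when the float values themselves are needed, for example when mixing INT8 and FP32 vectors in one index.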

rerank example

{ "jsonrpc": "2.0", "id": 2, "method": "rerank",
  "params": ["what is ML?",
             [{"id":"d1","text":"ML is..."},{"id":"d2","text":"Weather..."}],
             5] }

Returns an array of {id, score} sorted by descending score.
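
The ordering contract can be sketched as a sort followed by an optional cut. This shows only the post-processing, not the cross-encoder itself. The Hit type and rank_hits function are illustrative, and treating top_k as a plain truncation of the sorted list is an assumption.

```rust
/// One rerank result, mirroring the {id, score} shape described above.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub id: String,
    pub score: f32,
}

/// Sort hits by descending score and keep at most `top_k` of them.
/// `total_cmp` gives a total order over f32, so NaN scores cannot panic.
pub fn rank_hits(mut hits: Vec<Hit>, top_k: Option<usize>) -> Vec<Hit> {
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    if let Some(k) = top_k {
        hits.truncate(k);
    }
    hits
}
```

In the request above, top_k is 5 but only two documents are supplied, so under this reading both hits come back, best first.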

Health during startup

Model loading takes from seconds (CPU-only) to minutes (cold first-run download). Both sockets and the TCP listener are bound immediately at startup, and health checks return {"status": "starting"} until the models are ready, so hero_proc probes never flap. During this window the status RPC method exposes per-file download progress and per-model load progress.
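
A client that starts alongside the daemon can wait for readiness with a simple polling loop. In this minimal sketch the probe is a caller-supplied closure that returns the status string from a health check (over REST or RPC). The names wait_until_ready and Readiness are illustrative, and matching on the literal value "ok" is an assumption about the ready response.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of waiting for the daemon (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Readiness {
    Ready,
    TimedOut,
}

/// Call `probe` up to `max_attempts` times, sleeping `delay` between tries.
/// `probe` returns the health status string, or None if the call failed.
pub fn wait_until_ready<F>(mut probe: F, max_attempts: u32, delay: Duration) -> Readiness
where
    F: FnMut() -> Option<String>,
{
    for attempt in 0..max_attempts {
        // "starting" (or a failed probe) means the models are not loaded yet.
        if probe().as_deref() == Some("ok") {
            return Readiness::Ready;
        }
        // No need to sleep after the final attempt.
        if attempt + 1 < max_attempts {
            sleep(delay);
        }
    }
    Readiness::TimedOut
}
```

Because a cold first run may spend minutes downloading models, the attempt budget should be generous. The status RPC method is the better source when progress needs to be shown to a user.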

ONNX Runtime

ONNX Runtime 1.25.0+ is required. At startup the server:

  1. Checks for an existing installation (hero lib dir → Homebrew → system paths).
  2. If not found, downloads the official GitHub release archive automatically.

To override: set ORT_DYLIB_PATH to the full path of the dylib before starting. The ort_info RPC method reports the detected path and version.
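
The lookup order and the minimum-version rule can be sketched as two small functions. This is an illustration of the logic described above, not the server's implementation. The function names are hypothetical, the candidate paths are supplied by the caller because the real search locations are not listed here, and the download fallback is left out.

```rust
/// Minimum ONNX Runtime version stated above.
pub const MIN_ORT: (u32, u32, u32) = (1, 25, 0);

/// Parse "major.minor" or "major.minor.patch" into a comparable tuple.
pub fn parse_version(text: &str) -> Option<(u32, u32, u32)> {
    let mut parts = text.trim().split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    let patch = parts.next().unwrap_or("0").parse::<u32>().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// True when `version` parses and is at least MIN_ORT.
/// Tuples compare lexicographically: major, then minor, then patch.
pub fn meets_minimum(version: &str) -> bool {
    parse_version(version).map_or(false, |v| v >= MIN_ORT)
}

/// Pick the library path: an explicit override (ORT_DYLIB_PATH) wins,
/// otherwise the first candidate for which `exists` returns true.
/// Returning None corresponds to the "download the release" case.
pub fn resolve_ort<'a>(
    override_path: Option<&'a str>,
    candidates: &[&'a str],
    exists: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    if let Some(path) = override_path {
        return Some(path);
    }
    candidates.iter().copied().find(|p| exists(p))
}
```

Passing the existence check as a closure keeps the search order testable without touching the filesystem. In real use it would be something like |p| std::path::Path::new(p).exists().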

Build

cargo build --workspace --release

Use from Rust

use hero_embedder_provider_sdk::EmbedderProviderClient;

// Assumes a Tokio runtime in the consuming crate.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = EmbedderProviderClient::connect().await?;

    // Embed at Q1 (fast INT8)
    let resp = client.embed(vec!["hello world".into()], Some(1)).await?;
    println!("{}d {}", resp.dimensions, resp.precision);

    // Rerank two documents against the query, keeping at most 5 hits
    let _hits = client.rerank(
        "what is ML?".into(),
        vec![("d1".into(), "ML is...".into()), ("d2".into(), "Weather".into())],
        Some(5),
    ).await?;

    Ok(())
}

Or point any OpenAI client at http://127.0.0.1:8092/v1 with no auth.