Speech-to-text fails on non-16 kHz audio: resample input in the transcription endpoint #1

Open

opened 2026-05-31 23:10:04 +00:00 by mik-tf · 0 comments

mik-tf commented

2026-05-31 23:10:04 +00:00

Owner

The `/v1/audio/transcriptions` endpoint returns HTTP 500 "Unsupported sample rate 24000 (expected 16000)" (or 48000) whenever the uploaded audio is a WAV that is not already 16 kHz mono. Parakeet only accepts 16 kHz mono, and the multipart `file` path passes the WAV straight to the recognizer without resampling. Only the Opus container path (audio/ogg, audio/webm) resamples, via `decode_browser_audio_to_wav`. As a result, any caller that sends a WAV at the browser capture rate (44.1 or 48 kHz) or at the text-to-speech round-trip rate (24 kHz) gets a 500, and speech-to-text silently falls through to the cloud fallback, which in turn fails when no cloud key is configured. Text-to-speech is unaffected.
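To tell in advance whether a given upload would hit this path, reading the WAV header is enough. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of the voice server, and the 16 kHz mono requirement is taken from the description above.

```python
import io
import wave

TARGET_RATE = 16000  # rate the recognizer expects, per the issue text


def wav_format(data: bytes) -> tuple[int, int]:
    """Return (sample_rate, channels) from a WAV header.

    The stdlib wave module reads PCM WAV files; other encodings
    (for example IEEE float) raise wave.Error.
    """
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.getframerate(), w.getnchannels()


def needs_resampling(data: bytes) -> bool:
    """True when the WAV is not already 16 kHz mono."""
    rate, channels = wav_format(data)
    return rate != TARGET_RATE or channels != 1


def make_silent_wav(rate: int, channels: int = 1, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit PCM WAV in memory, for experiments."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    return buf.getvalue()
```

For example, `needs_resampling(make_silent_wav(24000))` is true, while a 16 kHz mono file returns false.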

Fix: normalize the WAV to 16 kHz mono inside the transcription handler before running the recognizer, reusing the existing shared linear resampler and WAV encoder (`resample_to_16k` + `encode_wav_f32`). A WAV that is already 16 kHz mono passes through untouched. This makes the endpoint tolerant of any caller sample rate, and the handler is the right layer for the fix: it lives in one place, and every caller and path benefits. The change is kept small and self-contained so it can be reverted or adjusted easily.
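The normalization step can be illustrated roughly as follows. This is a standalone Python sketch, not the server's code: it does not call `resample_to_16k` or `encode_wav_f32`, whose signatures are not shown in this issue, and the names `downmix_to_mono`, `resample_linear`, and `normalize_wav` are hypothetical. It assumes 16-bit PCM input and uses plain linear interpolation with no low-pass filter, so downsampling may alias.

```python
import array
import io
import sys
import wave

TARGET_RATE = 16000  # assumed recognizer rate, per the issue text


def downmix_to_mono(samples, channels):
    """Average interleaved channels into a single mono stream."""
    if channels == 1:
        return list(samples)
    return [
        sum(samples[i:i + channels]) / channels
        for i in range(0, len(samples) - channels + 1, channels)
    ]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation between neighbouring samples."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for n in range(out_len):
        pos = n * step
        i = min(int(pos), last)
        j = min(i + 1, last)
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[j] * frac)
    return out


def normalize_wav(data: bytes) -> bytes:
    """Return a 16 kHz mono 16-bit WAV; already-conforming input is returned as-is."""
    with wave.open(io.BytesIO(data), "rb") as w:
        rate, channels, width = w.getframerate(), w.getnchannels(), w.getsampwidth()
        frames = w.readframes(w.getnframes())
    if rate == TARGET_RATE and channels == 1:
        return data  # pass through untouched
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")
    pcm = array.array("h")
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    resampled = resample_linear(downmix_to_mono(pcm, channels), rate)
    out = array.array("h", (max(-32768, min(32767, round(s))) for s in resampled))
    if sys.byteorder == "big":
        out.byteswap()
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(TARGET_RATE)
        w.writeframes(out.tobytes())
    return buf.getvalue()
```

Doing the check and conversion in one function mirrors the point made above: callers do not need to know the recognizer's rate, and conforming input incurs no extra work.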

Two related operational notes for the tester side that this surfaced:

- The consumer voice server on a tester must be the gnu build (it links libopus, sherpa, onnx); the musl build is an older generation that does not implement the transcription methods.
- The host needs libopus0 installed (the gnu binary fails to start with "libopus.so.0: cannot open shared object file" otherwise). This should be added to the tester install dependencies.
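A tester install script could check for the library before starting the server. The sketch below is an assumption about how such a preflight check might look, not something the project ships; `shared_lib_loadable` is a hypothetical helper that asks the dynamic loader for the same soname that appears in the error message.

```python
import ctypes


def shared_lib_loadable(soname: str) -> bool:
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    if not shared_lib_loadable("libopus.so.0"):
        print("libopus.so.0 not found; install the libopus0 package")
```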

Signed-by: mik-tf <mik-tf@noreply.invalid>
