Speech-to-text fails on non-16 kHz audio: resample input in the transcription endpoint #1

Open
opened 2026-05-31 23:10:04 +00:00 by mik-tf · 0 comments
Owner

The `/v1/audio/transcriptions` endpoint returns HTTP 500 with "Unsupported sample rate 24000 (expected 16000)" (or the same message with 48000) whenever the uploaded audio is a WAV that is not already 16 kHz mono. Parakeet only accepts 16 kHz mono, and the multipart `file` path passes the WAV straight to the recognizer without resampling. Only the Opus container path (audio/ogg, audio/webm) resamples, via `decode_browser_audio_to_wav`. As a result, any caller that sends a WAV at the browser capture rate (44.1 or 48 kHz) or at the text-to-speech round-trip rate (24 kHz) gets a 500, and speech-to-text silently falls through to the cloud fallback, which itself fails when no cloud key is configured. Text-to-speech is unaffected.

Fix: normalize the WAV to 16 kHz mono inside the transcription handler before running the recognizer, reusing the existing shared linear resampler and WAV encoder (`resample_to_16k` + `encode_wav_f32`). A WAV that is already 16 kHz mono passes through untouched. This makes the endpoint tolerant of any caller sample rate, and the handler is the right layer for the fix: it lives in one place, and every caller and path benefits. The change is kept small and self-contained so it can be reverted or adjusted easily.
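The normalization step described above could look roughly like the sketch below. It is illustrative only: `downmix_to_mono`, `resample_linear`, and `normalize_to_16k_mono` are hypothetical names, and the signatures are assumptions, not the actual `resample_to_16k` / `encode_wav_f32` helpers in the repository. WAV decoding and encoding are left out, and samples are assumed to be interleaved `f32`.

```rust
// Sketch only: hypothetical helpers, not the repository's real functions.

const TARGET_HZ: u32 = 16_000;

/// Average interleaved channels down to a single mono channel.
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    if channels <= 1 {
        return samples.to_vec();
    }
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Linear-interpolation resampler (assumed stand-in for the shared one).
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if from_hz == to_hz || input.is_empty() {
        return input.to_vec();
    }
    let out_len = ((input.len() as u64 * to_hz as u64) / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp both neighbours so the final output sample never reads past the end.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Pass 16 kHz mono through unchanged; otherwise downmix, then resample.
fn normalize_to_16k_mono(samples: &[f32], sample_rate: u32, channels: usize) -> Vec<f32> {
    if sample_rate == TARGET_HZ && channels == 1 {
        return samples.to_vec();
    }
    let mono = downmix_to_mono(samples, channels);
    resample_linear(&mono, sample_rate, TARGET_HZ)
}

fn main() {
    // One second of 48 kHz stereo, as a browser capture might produce.
    let input = vec![0.25_f32; 48_000 * 2];
    let out = normalize_to_16k_mono(&input, 48_000, 2);
    println!("{} samples at {} Hz", out.len(), TARGET_HZ);
}
```

Checking the rate and channel count first keeps the already-correct case free of any resampling error, which matches the pass-through behaviour described above.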

Two related operational notes for the tester side that this surfaced:

  • The consumer voice server on a tester must be the gnu build (it links libopus, sherpa, onnx); the musl build is an older generation that does not implement the transcription methods.
  • The host needs libopus0 installed (the gnu binary fails to start with "libopus.so.0: cannot open shared object file" otherwise). This should be added to the tester install dependencies.

Signed-by: mik-tf <mik-tf@noreply.invalid>

Reference
lhumina_code/hero_voice_provider#1