Speech-to-text fails on non-16 kHz audio: resample input in the transcription endpoint #1
The `/v1/audio/transcriptions` endpoint returns HTTP 500 "Unsupported sample rate 24000 (expected 16000)" (or 48000) whenever the uploaded audio is a WAV that is not already 16 kHz mono. Parakeet only accepts 16 kHz mono, and the multipart `file` path passes the WAV straight to the recognizer without resampling. Only the Opus container path (audio/ogg, audio/webm) resamples, via `decode_browser_audio_to_wav`. So any caller that sends a WAV at the browser capture rate (44.1 or 48 kHz), or the text-to-speech round-trip rate (24 kHz), gets a 500, and speech-to-text silently falls through to the cloud fallback (which fails when no cloud key is configured). Text-to-speech is unaffected.

Fix: normalize the WAV to 16 kHz mono inside the transcription handler before running the recognizer, reusing the existing shared linear resampler and WAV encoder (`resample_to_16k` + `encode_wav_f32`). A WAV that is already 16 kHz mono passes through untouched. This makes the endpoint tolerant of any caller sample rate, which is the right layer to fix it (one place, every caller and path benefit). Kept as a small self-contained change so it can be reverted or adjusted easily.

Two related operational notes for the tester side that this surfaced:
Signed-by: mik-tf mik-tf@noreply.invalid
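
For reviewers who want a concrete picture of the normalization step described in the fix, here is a minimal, dependency-free sketch. It is not the code in this change: `downmix_to_mono`, `resample_linear`, and `normalize_for_recognizer` are hypothetical names, and the real handler reuses the project's own `resample_to_16k` and `encode_wav_f32`, whose signatures are not shown here. The sketch assumes the WAV has already been decoded to interleaved `f32` samples, and it uses plain linear interpolation with no anti-aliasing filter.

```rust
/// Target format taken from the description above: 16 kHz, mono.
const TARGET_RATE: u32 = 16_000;

/// Average interleaved channels down to a single mono channel.
/// (Hypothetical helper for illustration.)
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    if channels <= 1 {
        return samples.to_vec();
    }
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Linear-interpolation resampler. A stand-in for the project's shared
/// resampler, assumed here to behave roughly like this.
fn resample_linear(input: &[f32], from_rate: u32, to_rate: u32) -> Vec<f32> {
    // Already at the target rate (or nothing to do): pass through untouched.
    if from_rate == to_rate || input.is_empty() {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_rate as u64 / from_rate as u64) as usize;
    let step = from_rate as f64 / to_rate as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of output sample i on the input timeline.
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp indices so the final sample never reads out of bounds.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Normalize decoded audio to 16 kHz mono before handing it to the recognizer.
fn normalize_for_recognizer(samples: &[f32], sample_rate: u32, channels: usize) -> Vec<f32> {
    let mono = downmix_to_mono(samples, channels);
    resample_linear(&mono, sample_rate, TARGET_RATE)
}
```

With this shape, one second of 48 kHz stereo input (96,000 interleaved samples) comes out as 16,000 mono samples, and input that is already 16 kHz mono is returned unchanged, matching the pass-through behavior described above.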