This content originally appeared on DEV Community and was authored by nareshipme
APIs that do speech-to-text — Groq Whisper, OpenAI Whisper, and friends — all have one thing in common: a file size limit. Groq's hard cap is 25MB. A typical one-hour interview at decent quality can easily be 80–150MB. If you just try to send that, you'll get a 413 or a rate-limit error before the transcription even starts.
The fix is chunking: split the audio into manageable pieces, transcribe each one, then stitch the results back together — with correct timestamps. That last part is where most implementations go wrong.
Here's the approach I landed on, built around ffmpeg and TypeScript.
The Strategy
if file < 24MB → send directly (fast path)
else → chunk into 20-min segments at 32kbps mono → transcribe each → stitch
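The fast-path decision is just a size comparison. A minimal TypeScript sketch of the routing (the 24MB threshold comes from the strategy above; the function name is my own):

```typescript
// Stay a little under Groq's 25MB hard cap.
const DIRECT_LIMIT_BYTES = 24 * 1024 * 1024;

// Decide which path a file takes based on its size in bytes.
function chooseRoute(sizeBytes: number): "direct" | "chunk" {
  return sizeBytes < DIRECT_LIMIT_BYTES ? "direct" : "chunk";
}
```

In practice you'd feed this from `fs.stat` (or `fs.promises.stat`) before touching ffmpeg at all, so small files never pay the re-encoding cost.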
The 20-minute / 32kbps combination works out to about 4.8MB per chunk (32,000 bits/s × 1,200 s ÷ 8), leaving plenty of headroom below the 25MB limit regardless of source format.
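Here's a sketch of both halves in TypeScript, assuming ffmpeg is on the PATH and that the transcription API returns Whisper-style segments with `start`/`end` in seconds (the function names and the `Segment` shape are my own, not from a specific library):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const CHUNK_SECONDS = 20 * 60; // 20-minute segments

// Build the ffmpeg arguments: re-encode to 32kbps mono and split into
// fixed-length segments (e.g. chunk-000.mp3, chunk-001.mp3, ...).
function segmentArgs(input: string, outPattern: string): string[] {
  return [
    "-i", input,
    "-ac", "1",            // mono
    "-b:a", "32k",         // 32kbps keeps each 20-min chunk ~4.8MB
    "-f", "segment",
    "-segment_time", String(CHUNK_SECONDS),
    "-reset_timestamps", "1", // each chunk's timestamps start at zero
    outPattern,
  ];
}

async function splitAudio(input: string, outPattern: string): Promise<void> {
  await run("ffmpeg", segmentArgs(input, outPattern));
}

interface Segment { start: number; end: number; text: string }

// Stitch: shift each chunk's local timestamps by that chunk's offset
// in the original file so the merged transcript lines up end to end.
function stitch(chunks: Segment[][], chunkSeconds = CHUNK_SECONDS): Segment[] {
  return chunks.flatMap((segments, i) =>
    segments.map((s) => ({
      ...s,
      start: s.start + i * chunkSeconds,
      end: s.end + i * chunkSeconds,
    })),
  );
}
```

`-reset_timestamps 1` is what makes the fixed-offset stitch valid: every chunk's timestamps begin at zero, so adding `chunkIndex × 1200` recovers the position in the original recording. Without it, some container formats carry the source timestamps through and you'd double-count the offset.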
nareshipme | Sciencx (2026-03-21T15:38:27+00:00) Audio Chunking for Long-Form Transcription: Splitting and Stitching with ffmpeg + TypeScript. Retrieved from https://www.scien.cx/2026/03/21/audio-chunking-for-long-form-transcription-splitting-and-stitching-with-ffmpeg-typescript/