How to build a sub‑$200 automated caption pipeline using Whisper, ffmpeg and scheduled uploads

I’m going to walk you through a practical, repeatable pipeline I use to generate captions for live-to-VOD content for less than $200 a year in ongoing costs. It’s a lean, automated workflow: ffmpeg for audio prep, a local Whisper-based transcriber for the heavy lifting, and simple scheduling / upload steps so captions appear alongside your videos without manual effort.

Why this approach

I build streaming tools so creators can focus on content, not fiddly post-production. Captions improve discoverability, accessibility and watch-time, but manual captioning is a time sink. The stack below is deliberately pragmatic: it uses reliable open-source tools, keeps operating costs near-zero, and can be run on modest hardware (even a second-hand GPU). You’ll get accurate machine captions fast, editable SRT files, and a scheduled upload step to push captions to your CMS or cloud storage.

Overview of the pipeline

1) Record / export video (OBS, your streaming platform VOD export, or an automated recording job).

2) Extract and normalize audio with ffmpeg (consistent sample rate, mono, noise reduction if needed).

3) Transcribe with Whisper (I recommend using whisper.cpp or OpenAI Whisper locally with a small/medium model depending on hardware).

4) Post-process timestamps and punctuation to SRT format (Whisper does this for you in many builds).

5) Embed captions or upload the SRT to your platform via a scheduled job (cron, GitHub Actions, or a simple task scheduler, plus rclone or an API client).

What “sub‑200” means here

My target is total recurring cost under $200 / year (or one-time hardware spend under $200 if you already have a capable desktop). The components that generate cost are:

Optional GPU or faster CPU for local transcribe speed (one-off).

Cloud compute if you prefer not to run locally (pay-as-you-go).

Storage or transfer costs for uploads (minimal).

Below I’ll show a cost breakdown and a few options so you can pick what fits your environment.

Tools and recommended installs

ffmpeg — audio/video processing (mandatory).

whisper.cpp or local Whisper via Python — transcription engine (I use whisper.cpp for low-spec machines; the Python OpenAI whisper is fine if you have a GPU).

rclone — flexible uploader to cloud providers, or a small script using the YouTube / Vimeo API if you push captions directly.

cron (Linux) / Task Scheduler (Windows) or GitHub Actions — scheduling.

optional: sox for advanced audio cleanup.

Step-by-step commands (practical)

These are the exact commands I use as a core building block. Adjust paths and filenames to match your setup.

1) Extract and normalize audio with ffmpeg (convert to 16 kHz mono WAV; Whisper models work on 16 kHz audio, and whisper.cpp expects 16-bit WAV input):

<code>ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ac 1 -ar 16000 -af "highpass=f=200, lowpass=f=3600" output.wav</code>

Notes: I apply a highpass/lowpass pair to reduce rumble and high-frequency noise. These cutoffs are fairly narrow (close to telephone bandwidth), so if your recording already sounds clean, you can skip the filters.
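If you would rather drive this step from a script than a shell one-liner, here is a minimal Python sketch of the same extract. The function names `build_extract_cmd` and `extract_audio` are my own placeholders, and the `-y` flag (overwrite an existing output) is an addition that the command above does not use.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(video, wav, filters="highpass=f=200,lowpass=f=3600"):
    """Return the ffmpeg argv for a 16 kHz mono 16-bit WAV extract."""
    cmd = ["ffmpeg", "-y", "-i", str(video), "-vn",
           "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000"]
    if filters:  # pass filters=None to skip the band-limiting step
        cmd += ["-af", filters]
    return cmd + [str(wav)]


def extract_audio(video):
    """Run ffmpeg and return the path of the WAV written next to the video."""
    wav = Path(video).with_suffix(".wav")
    subprocess.run(build_extract_cmd(video, wav), check=True)
    return wav
```

Keeping the command builder separate from the `subprocess.run` call makes it easy to log or unit-test the exact argv without ffmpeg installed.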

2) Transcribe with whisper.cpp (fast, runs on CPU or takes advantage of low-end GPUs):

<code># build (one-time)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
# download the small English model (one-time)
bash ./models/download-ggml-model.sh small.en
# transcribe; -of sets the output name, so this writes output.srt
# (newer releases name the binary whisper-cli instead of main)
./main -m models/ggml-small.en.bin -f output.wav --language en --output-srt -of output</code>

If you prefer the Python/OpenAI Whisper (requires more RAM or a GPU), a sample command is:

<code>whisper output.wav --model small.en --task transcribe --language en --output_format srt</code>
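If you call Whisper from Python instead of the CLI, you have to build the SRT yourself from the returned segments. The sketch below assumes the `openai-whisper` package, whose `transcribe()` result contains a `segments` list of dicts with `start`, `end` and `text` keys; the helper names are placeholders of mine, not part of any library.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render an iterable of {'start', 'end', 'text'} dicts as SRT text."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


def transcribe_to_srt(wav_path, model_name="small.en"):
    """Transcribe with the openai-whisper package (imported lazily)."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(str(wav_path), language="en")
    return segments_to_srt(result["segments"])
```

The formatting helpers are pure Python, so you can reuse them with any transcriber that gives you start/end times in seconds.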

3) (Optional) Run a lightweight punctuation and profanity pass or time-shift correction. Many Whisper builds give usable SRTs out of the box; I add a short script that ensures no overlapping timestamps and normalizes encoding (UTF-8). I keep this snippet tiny:

<code>python fix_srt.py output.srt -o fixed_output.srt</code>
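My own fix_srt.py isn't reproduced here, but a minimal sketch of the same idea looks like the following. It assumes cues are already sorted by start time, trims any cue that runs past the start of the next one, renumbers the cues, and rewrites the file as UTF-8. All function names are illustrative.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(match):
    """Convert a TIME_RE match to milliseconds."""
    h, m, s, ms = map(int, match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(ms):
    """Convert milliseconds back to HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def fix_srt(text):
    """Renumber cues and trim any cue that overlaps the next one."""
    text = text.lstrip("\ufeff").replace("\r\n", "\n")
    cues = []
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        if len(lines) < 2:
            continue  # skip malformed blocks
        stamps = [to_ms(m) for m in TIME_RE.finditer(lines[1])]
        if len(stamps) != 2:
            continue  # second line is not a timing line
        cues.append([stamps[0], stamps[1], "\n".join(lines[2:])])
    for current, following in zip(cues, cues[1:]):
        current[1] = min(current[1], following[0])  # clamp the overlap
    return "\n".join(
        f"{i}\n{to_stamp(s)} --> {to_stamp(e)}\n{body}\n"
        for i, (s, e, body) in enumerate(cues, start=1)
    )


def fix_file(src, dst):
    # errors="replace" keeps the pass running on stray non-UTF-8 bytes
    with open(src, encoding="utf-8", errors="replace") as fh:
        fixed = fix_srt(fh.read())
    with open(dst, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(fixed)
```

Calling `fix_file("output.srt", "fixed_output.srt")` mirrors the command above. A profanity or punctuation pass would slot in as one more transformation on each cue body.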

4) Upload or attach captions automatically. I usually push SRT files to cloud storage (S3 or Google Drive) with rclone, then hit my CMS or platform API to register the caption file with the VOD.

<code>rclone copyto fixed_output.srt "remote:captions/$(basename input.mp4 .mp4).srt"
# then trigger an API call to your CMS or YouTube to attach captions
python register_caption.py --video-id VIDEO_ID --caption-path remote:captions/VIDEO.srt</code>
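When the upload runs from a script, the one part worth getting right is the destination name. Here is a small sketch that derives it from the video filename and shells out to `rclone copyto`, which writes a single file to an exact destination path. The remote name `remote:captions` comes from the example above; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def caption_destination(video_path, remote="remote:captions"):
    """Map recordings/show.mp4 to remote:captions/show.srt."""
    return f"{remote}/{Path(video_path).stem}.srt"


def build_upload_cmd(srt_path, video_path, remote="remote:captions"):
    """Return the rclone argv that uploads one SRT under the video's name."""
    return ["rclone", "copyto", str(srt_path),
            caption_destination(video_path, remote)]


def upload_caption(srt_path, video_path, remote="remote:captions"):
    """Upload the SRT and return the remote path for the registration step."""
    dest = caption_destination(video_path, remote)
    subprocess.run(build_upload_cmd(srt_path, video_path, remote), check=True)
    return dest
```

The returned path is what you would hand to whatever script registers the caption with your CMS or platform.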

rclone is easy to run from a scheduler and supports many destinations without diving into cloud SDKs. For direct platform uploads (YouTube), you’ll need to use the Google API with OAuth credentials or an existing uploader library. I keep that step modular so you can plug in your platform.

Scheduling and automation

My preference: a post-processing folder watch script that runs on a small VM or a local workstation. When OBS writes a finished recording (or my VOD exporter drops a file), the script enqueues it.

Linux example with inotifywait: watch the recordings directory and launch the pipeline when a new .mp4 finishes.

Or use a cron job every 5–10 minutes if you prefer polling.

For teams that want serverless checks, GitHub Actions or a tiny AWS Lambda (invoked by S3 events) works well — but note serverless can introduce small costs if you transcribe in the cloud.
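The polling option can be sketched in a few lines of standard-library Python. This version treats a recording as finished when its size stops changing for a few seconds, and it assumes your pipeline writes the .srt next to the video so already-captioned files are skipped. The `process` callback and all names here are placeholders.

```python
import time
from pathlib import Path


def is_stable(path, wait=5.0, sleep=time.sleep):
    """True if the file is non-empty and its size held steady for `wait` s."""
    before = path.stat().st_size
    sleep(wait)
    return before > 0 and path.stat().st_size == before


def find_ready(recordings_dir, wait=5.0, sleep=time.sleep):
    """Return finished .mp4 files that do not have a matching .srt yet."""
    ready = []
    for video in sorted(Path(recordings_dir).glob("*.mp4")):
        if video.with_suffix(".srt").exists():
            continue  # already captioned
        if is_stable(video, wait, sleep):
            ready.append(video)
    return ready


def watch(recordings_dir, process, interval=300):
    """Poll every `interval` seconds and hand each ready file to `process`."""
    while True:
        for video in find_ready(recordings_dir):
            process(video)
        time.sleep(interval)
```

If you schedule with cron instead of a long-running loop, call `find_ready` once per run and drop `watch` entirely. The size check matters because OBS keeps writing to the file until the recording stops.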

Quality and model choices

Whisper models trade off speed vs accuracy: tiny/base are fast but rough, small/medium give much better accuracy and punctuation. In practice I use small.en for English-only content on a laptop, and medium if I have a GPU or when accuracy matters. If your content includes multiple languages, use a multilingual model (small or medium rather than the .en variants) and let the pipeline detect the language.

Cost breakdown

Item	Example	Est. cost
One-off second-hand GPU (optional)	Used GTX 1660 / RTX 2060	~$100–200
Storage / bandwidth	S3 / Drive	~$0–$50 / year (small volumes)
Software	ffmpeg, whisper.cpp, rclone	Free / open-source
Cloud transcribe (alternative)	OpenAI or GPU rental	Varies; can exceed $200 if used a lot

For a small creator doing dozens of hours of content per year, a one-time hardware purchase + open-source stack will comfortably sit under $200 in total annualized cost. If you use cloud GPUs heavily, budget scales with minutes transcribed.
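To sanity-check your own numbers, spread the one-off hardware spend over the years you expect to keep it and add the yearly fees. The figures in this sketch are illustrative picks from the ranges in the table, and the three-year lifetime is purely an assumption.

```python
def annualized_cost(hardware, lifetime_years, storage_per_year, cloud_per_year=0.0):
    """Spread a one-off hardware spend over its lifetime and add yearly fees."""
    return hardware / lifetime_years + storage_per_year + cloud_per_year


# Illustrative inputs only: a $150 used GPU kept for three years
# plus $20 a year of storage.
local = annualized_cost(hardware=150, lifetime_years=3, storage_per_year=20)
print(f"local pipeline: ${local:.2f} per year")  # 150 / 3 + 20 = 70
```

Plug in your own cloud spend through `cloud_per_year` to see how quickly rented GPU time eats the budget.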

Maintenance and edge cases

Expect to tune the pipeline for particular audio artifacts: background music, overlapping speakers, or heavy accents can trip the models. My usual playbook:

Run a short noise gate or use manual markers to tell the transcriber where dialog is dense.

Use speaker separation tools if you need speaker-labeled captions — Whisper doesn’t do robust diarization by default.

Keep an editable SRT output so your editor or captioner can quickly correct high-impact videos.

What I test and measure

I treat captioning like a growth experiment. Measure: time-to-availability (how long after recording captions are attached), word accuracy on a sample set, and viewer metrics (clicks, retention). For one client, automating caption uploads cut time-to-caption from 2–3 days to under 30 minutes, and the videos with captions saw a measurable lift in average view time.
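For the word-accuracy check, a common measure is word error rate (WER): the word-level edit distance between a hand-corrected reference and the machine transcript, divided by the reference length. This is a minimal sketch rather than the exact script I run. It lowercases and splits on whitespace, and does not strip punctuation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] = edits needed to turn the ref words seen so far into hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)
```

Word accuracy is then roughly `1 - word_error_rate(reference, hypothesis)`. Strip the SRT indices and timestamps before comparing, otherwise they count as words.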

If you want, I can share a starter repo with the scripts I use (ffmpeg extract, whisper invocation, rclone upload and a simple watcher). Tell me where you host your videos (YouTube, Vimeo, S3, a CMS), and I’ll tailor the uploader snippet — or walk you through adding YouTube caption uploads with the Google API.
