How I Transcribed 45 Hours of Video for $1.24

I needed to build a batch processing pipeline that could transcribe 204 lecture videos - 45 hours and 39 minutes of audio - into searchable text and structured JSON with segment-level timestamps. There were two constraints: the whole batch had to cost under $5, and the output had to be reliable enough to build a search UI on top of. No manual review, no retranscription.

The interesting engineering problem wasn't transcription itself - Whisper handles that. It was designing a pipeline that could process 200+ files reliably over an hour-long run without manual intervention, handle failures gracefully, and adapt to whatever network conditions it was running on.

Provider selection as a system design decision

The speech-to-text market has a pricing spread that exposes how little the underlying model matters compared to infrastructure margins. I benchmarked 45.7 hours across every major provider:

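| Provider  | Model          | Cost      | Constraint                |
|-----------|----------------|-----------|---------------------------|
| AWS       | Transcribe     | $65.75    | S3 + IAM setup            |
| OpenAI    | whisper-1      | $16.43    | 25MB upload limit         |
| Google    | Chirp / V2     | Variable  | GCS + Cloud Function      |
| DeepInfra | large-v3       | $1.23     | OpenAI-compatible API     |
| DeepInfra | large-v3-turbo | $0.55     | Lower accuracy on jargon  |
| Groq      | large-v3       | $0 (free) | 23hrs+ due to rate limits |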

The decision wasn't just about cost - it was about operational complexity. AWS requires orchestrating S3 uploads, IAM roles, and async job polling. Google Cloud needs a storage bucket, possibly a Cloud Function, and pricing that's nearly impossible to predict before you run the job. Groq is free but the rate limits make batch processing impractical.

I chose DeepInfra because it exposes an OpenAI-compatible API. Same SDK, same interface, no additional infrastructure. That's a deliberate choice: if DeepInfra disappears tomorrow, switching to Groq, Together, or Fireworks is a one-line base URL change. No vendor lock-in, no infrastructure to tear down.

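from openai import OpenAI

# deepinfra_key and audio_path are set elsewhere in the script.
# Only base_url differs from a stock OpenAI client.
client = OpenAI(
    api_key=deepinfra_key,
    base_url="https://api.deepinfra.com/v1/openai"
)

with open(audio_path, "rb") as f:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
        response_format="verbose_json",  # segment data, not just plain text
        timestamp_granularities=["segment"]
    )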

Designing for reliability over a long-running batch

Processing 204 files takes about 78 minutes. In that window, your connection will hiccup, the API will rate-limit you, your machine might sleep, or you might need to ctrl-C for an unrelated reason. The pipeline needed to handle all of these without human intervention or re-processing.

Idempotent processing

Every file produces two outputs: a .txt transcript and a .json with segment timestamps. Before processing, the pipeline checks whether both outputs exist. If they do, skip. If either is missing or corrupt, re-process from scratch.

This sounds obvious, but the design decision is where you check. Checking before the API call means you never waste money on re-transcription. If a run gets interrupted mid-file, the output files won't exist for that file, so the next run picks it up automatically. No manual tracking needed.
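A minimal sketch of that check, assuming both outputs live in one directory and that "corrupt" means an empty transcript or JSON that fails to parse. The helper names (`is_done`, `write_outputs`) and the write-to-temp-then-rename step are illustrative assumptions, not code from the original pipeline.

```python
import json
import os
from pathlib import Path


def is_done(out_dir: Path, stem: str) -> bool:
    """Return True only if both outputs exist and look intact."""
    txt = out_dir / f"{stem}.txt"
    js = out_dir / f"{stem}.json"
    if not (txt.is_file() and js.is_file()):
        return False
    if txt.stat().st_size == 0:  # assumption: an empty transcript counts as corrupt
        return False
    try:
        json.loads(js.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return True


def write_outputs(out_dir: Path, stem: str, text: str, segments: list) -> None:
    """Write each file to a temp name, then rename it into place."""
    out_dir.mkdir(parents=True, exist_ok=True)
    payloads = {
        f"{stem}.json": json.dumps({"text": text, "segments": segments}, ensure_ascii=False),
        f"{stem}.txt": text,
    }
    for name, body in payloads.items():
        tmp = out_dir / (name + ".tmp")
        tmp.write_text(body, encoding="utf-8")
        os.replace(tmp, out_dir / name)  # rename is atomic on the same filesystem
```

Calling `is_done` before the API call is what keeps a re-run from paying twice. The rename step means an interrupted write leaves a stray `.tmp` file rather than a half-written output that passes the existence check.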

Bandwidth-aware concurrency

This was the most interesting design problem. Transcription is I/O bound in two independent dimensions: upload bandwidth (client-limited) and API processing time (server-limited). Too few workers and you waste API capacity while waiting for uploads. Too many and you saturate the upload pipe, causing timeouts and retries that waste both time and money.

The solution: measure actual upload bandwidth at startup, then compute optimal parallelism based on the ratio of upload time to processing time:

import os
import subprocess

# Upload 3MB to Cloudflare's speed test endpoint, read actual throughput
data = os.urandom(3 * 1024 * 1024)
proc = subprocess.run(
    ["curl", "-s", "-o", "/dev/null", "-w", "%{speed_upload}",
     "-X", "POST", "--data-binary", "@-", "--max-time", "10",
     "https://speed.cloudflare.com/__up"],
    input=data, capture_output=True  # bytes in, bytes out: text=True would reject the binary payload
)
upload_bps = float(proc.stdout.decode().strip())  # curl reports bytes per second

# total_audio_size (bytes) and num_files are computed elsewhere in the script
avg_file_bytes = total_audio_size / num_files
upload_time = avg_file_bytes / upload_bps  # seconds to upload one average file
api_processing_time = 15.0  # measured average per file, in seconds

# Ratio of processing time to upload time, plus one, clamped to 2..64
optimal_workers = max(2, min(64, int(api_processing_time / upload_time) + 1))

The execution uses a ThreadPoolExecutor with a Semaphore to limit concurrent API calls. Each worker acquires the semaphore before calling the transcription API and releases it when done. This prevents too many simultaneous uploads from saturating the connection while keeping the thread pool busy.

On 14 Mbps upload, this auto-tunes to 4 workers. On gigabit, it scales to 64. The pipeline adapts to whatever machine it runs on without manual configuration.
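The executor-plus-semaphore arrangement might look roughly like this. It is a sketch under assumptions: `run_batch`, its parameters, and sizing the pool slightly larger than the semaphore are all illustrative, and `transcribe` stands in for whatever function wraps the API call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, transcribe, max_api_calls, pool_size=None):
    """Run transcribe(path) for every path, capping simultaneous API calls."""
    gate = threading.Semaphore(max_api_calls)
    # Assumption: a couple of spare threads so non-API work can overlap.
    pool_size = pool_size or max_api_calls + 2

    def worker(path):
        with gate:  # acquired before the upload starts, released when the call returns
            return transcribe(path)

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:  # record the failure and keep the batch going
                failures[path] = exc
    return results, failures
```

Passing the computed worker count as `max_api_calls` keeps the number of in-flight uploads at that limit regardless of how many threads the pool holds.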

Retry strategy

Exponential backoff: 2, then 4 seconds. Three attempts per file before marking it as failed and moving on. Failed files get logged so you can see what went wrong. On re-run, the idempotent check skips everything that already succeeded and retries only the missing ones. Across 204 files, I had zero permanent failures - a couple of transient 429s resolved on first retry.
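As a sketch, the retry loop can be a small wrapper around the API call. The name `with_retries` and the injectable `sleep` parameter are assumptions for illustration. A real pipeline would probably catch only rate-limit and connection errors rather than every exception.

```python
import time


def with_retries(call, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `call()` up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: the caller logs the file as failed
            sleep(base_delay * 2 ** (attempt - 1))  # 2s after the first failure, 4s after the second
```

With the defaults, a file that fails three times costs six seconds of waiting before it is logged and skipped.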

Audio preparation as a cost optimization

Whisper works on audio, not video. The pre-processing step strips audio with ffmpeg:

ffmpeg -i input.mp4 -vn -acodec libmp3lame -ar 44100 -ab 128k output.mp3

MP3 at 128kbps is a deliberate choice: Whisper doesn't benefit from lossless audio for speech recognition, and MP3 at this bitrate keeps files under 25MB for any lecture under 25 minutes (the provider upload limit). A 15-minute lecture drops from 150MB video to 14MB MP3. That's a 10x reduction in upload time, which directly affects pipeline throughput given the bandwidth-limited concurrency model.
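The size figures follow from the bitrate alone. A rough check, assuming constant bitrate, decimal megabytes, and no container or tag overhead (the function names are illustrative):

```python
def mp3_size_mb(minutes: float, kbps: int = 128) -> float:
    """Approximate size of a constant-bitrate MP3 in decimal megabytes."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * minutes * 60 / 1_000_000


def max_minutes_under(limit_mb: float, kbps: int = 128) -> float:
    """Longest duration in minutes that stays under limit_mb at this bitrate."""
    bytes_per_second = kbps * 1000 / 8
    return limit_mb * 1_000_000 / bytes_per_second / 60


print(round(mp3_size_mb(15), 1))        # 15-minute lecture
print(round(mp3_size_mb(25), 1))        # 25-minute lecture
print(round(max_minutes_under(25), 1))  # duration that reaches 25MB
```

At 128kbps the audio grows by just under 1MB per minute, so a 25-minute lecture lands at about 24MB and the 25MB ceiling is reached at roughly 26 minutes.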

Results

  • 204 files processed, zero permanent failures
  • 45 hours 39 minutes of audio transcribed
  • $1.24 total cost (vs. $65.75 on AWS for functionally equivalent output)
  • 78 minutes wall time on 4 workers at 14 Mbps upload
  • Zero vendor lock-in - provider switch is a one-line change

What I'd design differently at higher scale

This pipeline is optimized for a single-machine batch job. At 10x scale (2,000+ files), the design changes:

  • Job queue over ThreadPoolExecutor - move to a proper queue (Redis + workers or Celery) so you can distribute across machines and get automatic persistence of job state.
  • Provider stacking - burn through Groq's free tier first, overflow to DeepInfra. The pipeline already abstracts the provider behind the OpenAI SDK, so this is routing logic, not a rewrite.
  • Streaming results - at this scale you want incremental output, not batch-complete. Write results as they arrive so downstream consumers can start processing before the full batch finishes.
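The provider-stacking idea could start as a small router like the one below. Everything here is a hypothetical sketch: the `Provider` fields, the per-provider budget in audio seconds, the two-hour free-tier figure, and the Groq endpoint and model name are assumptions to check against each provider's documentation.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    base_url: str
    model: str
    budget_seconds: float  # audio seconds this provider may still take; inf = no cap


def pick_provider(providers, audio_seconds):
    """Return the first provider with enough remaining budget and charge it."""
    for p in providers:
        if p.budget_seconds >= audio_seconds:
            p.budget_seconds -= audio_seconds
            return p
    raise RuntimeError("no provider has budget left for this file")


providers = [
    # Free tier first. URL, model name and the 2-hour budget are assumptions.
    Provider("groq", "https://api.groq.com/openai/v1", "whisper-large-v3", 2 * 3600),
    # Paid overflow, using the endpoint and model shown earlier in this post.
    Provider("deepinfra", "https://api.deepinfra.com/v1/openai",
             "openai/whisper-large-v3", float("inf")),
]
```

The chosen provider's `base_url` and `model` would feed the same OpenAI client call used earlier. With concurrent workers, `pick_provider` would need a lock around the budget update.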

The current design handles the problem it was built for. Because it already uses the OpenAI SDK, makes restarts idempotent, and adapts concurrency to bandwidth, scaling it up later would be a routing change, not a rewrite.