Self-Hosted Speech-to-Text: Docker, Whisper, and Open-Source Options Compared

I've set up self-hosted Whisper for six production deployments. Here's the honest breakdown of Docker, native, and managed open-source options.

I've set up self-hosted Whisper for six different production deployments over the past three years — on bare metal with NVIDIA GPUs, in Docker Compose stacks on cloud VMs, and on CPU-only servers for light batch workloads. The gap between "it works on my laptop" and "it runs reliably in production" is wider than most tutorials admit. In this guide, I'll give you the honest breakdown of your options, real Docker configurations, and the cost math vendor marketing never includes.

This article is a companion to our complete guide to private speech-to-text APIs.

[Image: Self-hosted Whisper running in Docker with GPU acceleration on a remote server]

Why Self-Hosted Speech-to-Text?

Self-hosted speech-to-text means running OpenAI Whisper or one of its optimized variants on your own infrastructure. Your audio never touches a third-party API.

Choose self-hosted if:

  • You have hard data residency requirements that mandate audio never leaves your infrastructure
  • You're already running GPU infrastructure for other ML workloads
  • Your audio volume exceeds 2,000 hours per month and you've done the real math on infrastructure vs. API costs

The reality check: For most teams, self-hosting ends up costing more than a managed private API when you factor in GPU infrastructure, DevOps engineering time, and maintenance. I've seen three companies quietly migrate off self-hosted Whisper to managed infrastructure once they ran the real numbers.

Whisper Variants Compared

You won't run vanilla Whisper in production — it's too slow. Here's how the main variants stack up:

faster-whisper is my default recommendation. It reimplements Whisper on top of CTranslate2 and runs up to 4x faster than vanilla Whisper at the same accuracy while using less memory. On a 4090, one hour of audio processes in about 2 minutes (~30x realtime). INT8 quantization reduces GPU memory by ~50% with minimal accuracy loss.
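
Here is a minimal sketch of what that looks like in Python, assuming the faster-whisper package is installed and a CUDA GPU is available. WhisperModel and transcribe come from the library; choose_compute_type and transcribe_file are hypothetical helpers I've added for illustration, not part of it.

```python
def choose_compute_type(device: str, low_memory: bool = False) -> str:
    """Pick a CTranslate2 compute type (hypothetical helper, not a library function)."""
    if device == "cuda":
        # int8_float16 stores weights in 8 bits to save VRAM at a small accuracy cost.
        return "int8_float16" if low_memory else "float16"
    # On CPU, 8-bit weights are usually the practical choice.
    return "int8"


def transcribe_file(path: str, model_size: str = "large-v3", device: str = "cuda"):
    """Return a list of (start_seconds, end_seconds, text) tuples for one audio file."""
    # Imported lazily so the helper above works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=choose_compute_type(device))
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator: the actual transcription runs as it is consumed.
    return [(segment.start, segment.end, segment.text) for segment in segments]
```

Calling transcribe_file("meeting.wav") (a placeholder file name) downloads the model on first use, which is why a persistent cache matters once this runs in a container.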

whisper.cpp is a C/C++ port that excels on CPU-only hardware — it runs the small model on a Raspberry Pi 4. On GPU, it's slower than faster-whisper. Best for edge deployments with no Python dependencies.

Insanely Fast Whisper hits 30x+ realtime on modern GPUs by batching audio chunks through Hugging Face Transformers with Flash Attention 2. It delivers maximum throughput for batch processing but needs more GPU memory.

WhisperX adds word-level timestamps and speaker diarization using pyannote.audio. Essential if you need to know when each word was spoken or who spoke.

| Variant | Speed (4090 GPU) | CPU Support | Best For |
| --- | --- | --- | --- |
| faster-whisper | ~30x realtime | Yes | General self-hosted production |
| whisper.cpp | ~10x realtime (GPU) | Best-in-class | Edge/embedded, no-Python deps |
| Insanely Fast Whisper | ~30-80x realtime | Limited | Maximum throughput batch processing |
| WhisperX | ~20x realtime | Yes | Timestamps + speaker diarization |
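
If "x realtime" feels abstract, the conversion to wall-clock time is simple division. A small sketch, using the approximate 4090 figures from the table; the function name is my own:

```python
def minutes_per_audio_hour(realtime_factor: float) -> float:
    """Wall-clock minutes needed to transcribe 60 minutes of audio."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return 60.0 / realtime_factor


# Approximate realtime factors from the comparison table (RTX 4090).
for name, factor in [("faster-whisper", 30), ("whisper.cpp (GPU)", 10), ("WhisperX", 20)]:
    print(f"{name}: ~{minutes_per_audio_hour(factor):.0f} min per audio hour")
```

At ~30x realtime, an hour of audio takes about 2 minutes, which matches what I see from faster-whisper on a 4090.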

Docker Deployment That Actually Works

The cleanest approach: faster-whisper behind a FastAPI server, containerized with Docker Compose. The hwdsl2/docker-whisper project maintains a production-ready image with an OpenAI-compatible API, speaker diarization, and WebSocket streaming.

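version: '3.8'
services:
  whisper:
    image: ghcr.io/systran/faster-whisper:latest
    container_name: whisper-api
    ports:
      - "8000:8000"   # host:container port for the HTTP API
    environment:
      # Variable names depend on the image you run; check its README.
      - MODEL=large-v3
      - COMPUTE_TYPE=float16
      - CPU_THREADS=8
    volumes:
      # Persist downloaded models across restarts.
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            # Requires the NVIDIA Container Toolkit on the host.
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped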
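
Once the container is up, a client can talk to it with nothing but the standard library. This is a sketch under assumptions: I'm assuming the server follows the OpenAI convention of POST /v1/audio/transcriptions with multipart `file` and `model` fields and a JSON response containing `text`. Check your image's documentation before relying on that. The function names are mine.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(audio: bytes, filename: str, model: str):
    """Encode one audio file plus a `model` field as multipart/form-data.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8000", model: str = "large-v3") -> str:
    """Upload one file and return the transcript text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(fh.read(), os.path.basename(path), model)
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",  # assumed OpenAI-style route
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Long files can take minutes, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)["text"]
```

Port 8000 matches the Compose mapping above. In production I'd put retries and a job queue in front of this rather than calling it synchronously.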

Critical configuration notes from experience:

  • Set COMPUTE_TYPE=float16 on modern NVIDIA GPUs: it roughly doubles throughput compared with float32, with negligible accuracy loss
  • Mount a persistent model cache volume — downloading the large model every restart adds 10-15 minutes of startup time
  • GPU memory matters more than core count — a 4090 (24GB) handles large-v3; an RTX 3080 (10GB) maxes out at medium
  • For real-time streaming, the hwdsl2/docker-whisper-live image extends this with WebSocket support

Hardware Requirements

| GPU | VRAM | faster-whisper model | Time for 1hr audio |
| --- | --- | --- | --- |
| RTX 4090 | 24GB | large-v3 | ~2 min |
| RTX 3090 | 24GB | large-v3 | ~2 min |
| RTX 4080 | 16GB | large (quantized) | ~4 min |
| RTX 3080 | 10GB | medium | ~6 min |
| CPU only | N/A | base/tiny | 15-45 min |

The CPU trap: Many teams start with CPU to test, then hit a wall in production. At CPU speeds, one hour of audio takes 15-45 minutes depending on your hardware. Fine for a proof-of-concept with 10 files. Completely unacceptable for production with 100+ files per day.
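
To see why, here's the arithmetic as a rough sketch. The one-hour-per-file figure is my assumption for illustration; plug in your own average length.

```python
def daily_compute_hours(files_per_day: int, avg_audio_hours: float, minutes_per_audio_hour: float) -> float:
    """Hours of processing needed per day on a single worker."""
    return files_per_day * avg_audio_hours * minutes_per_audio_hour / 60.0


# Assumption for illustration: 100 files per day, each one hour of audio.
for label, minutes in [("CPU, best case", 15), ("CPU, worst case", 45), ("RTX 4090", 2)]:
    hours = daily_compute_hours(100, 1.0, minutes)
    status = "keeps up" if hours <= 24 else "falls behind"
    print(f"{label}: {hours:.1f} h of compute per day ({status})")
```

Under that assumption, even the best-case CPU needs 25 hours of compute per day, so the backlog grows every day. The 4090 clears the same load in under four hours.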

The Real Cost Breakdown

Cloud GPU (Lambda Labs, Vast.ai, AWS g5): At 400 hours/month of audio, expect $200-400/month on spot instances — before engineering overhead.

Dedicated GPU hardware: RTX 4090 (~$1,600) + server components (~$1,000) = ~$2,600 upfront, plus $50-100/month electricity. Against cloud spend at the top of that range, the hardware pays for itself in roughly 8-10 months at 400 hours/month.
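
The payback math is worth running with your own numbers. A minimal sketch using the figures above, with $75/month as the midpoint of the electricity range; it deliberately ignores engineering time:

```python
def payback_months(upfront: float, cloud_monthly: float, electricity_monthly: float) -> float:
    """Months until owned hardware is cheaper than renting, ignoring engineering time."""
    monthly_saving = cloud_monthly - electricity_monthly
    if monthly_saving <= 0:
        return float("inf")  # the hardware never pays for itself
    return upfront / monthly_saving


print(round(payback_months(2600, 400, 75), 1))  # cloud bill at the top of the range
print(round(payback_months(2600, 200, 75), 1))  # cloud bill at the bottom of the range
```

The payback period is very sensitive to what you were actually paying for cloud GPUs. If your cloud bill sits at the low end of the range, the hardware takes well over a year to pay off.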

But the numbers above don't include:

  • DevOps engineering: 0.25 FTE to maintain self-hosted ML infrastructure properly = ~$1,300/month at market rates
  • Scaling complexity: Load balancing, job queuing, health checks, failure recovery
  • Ongoing maintenance: Model updates, hardware failures, CUDA version mismatches

Per AssemblyAI's self-hosting analysis, self-hosting Whisper adds $276+/month in hidden overhead.
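
Putting the infrastructure and engineering figures together gives a fully loaded cost per audio hour. This is a back-of-the-envelope sketch: $300/month is the midpoint of the cloud GPU range, $1,300/month is the 0.25 FTE estimate, and it assumes infrastructure cost stays flat at this volume.

```python
def cost_per_audio_hour(infra_monthly: float, engineering_monthly: float, audio_hours: float) -> float:
    """Fully loaded monthly cost divided by hours of audio transcribed."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return (infra_monthly + engineering_monthly) / audio_hours


# 400 hours/month is the example volume used throughout this section.
print(f"infrastructure only: ${cost_per_audio_hour(300, 0, 400):.2f} per audio hour")
print(f"with engineering:    ${cost_per_audio_hour(300, 1300, 400):.2f} per audio hour")
```

Engineering time, not the GPU, dominates the bill at this volume, which is why it has to be in the comparison.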

Privocio's Go plan: $19/4 weeks for 400 hours — no GPU management, no DevOps overhead, no scaling complexity. The break-even for self-hosting is roughly 2,000+ hours/month, and only with dedicated ML engineering.

When to Choose Self-Hosted vs a Managed Private API

Choose self-hosted if:

  • Volume exceeds 2,000 hours/month AND you have dedicated ML infrastructure engineering
  • Hard compliance requirements mandate on-premise processing
  • You're already running GPU infrastructure for other ML workloads and can share capacity

Choose a managed private API if you're like most teams:

  • Under 2,000 hours/month
  • No dedicated ML/DevOps engineering
  • Want reliability guarantees and someone else's uptime responsibility

Frequently Asked Questions

What hardware do I need to run Whisper at home?

A GPU with at least 10GB of VRAM. An RTX 3080 (10GB) runs the medium model, a 16GB card runs large quantized, and a 24GB card such as the RTX 3090 or RTX 4090 runs large-v3 comfortably. Without a GPU you're limited to the tiny/base models, which are significantly less accurate.
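
As a quick sketch, the hardware table from earlier can be turned into a lookup. The thresholds are my rules of thumb from that table, not limits enforced by any library, and suggest_model is a hypothetical helper:

```python
from typing import Optional

# (minimum VRAM in GB, suggested model), largest tier first.
# Rules of thumb from the hardware table in this article.
VRAM_TIERS = [
    (24, "large-v3"),
    (16, "large (quantized)"),
    (10, "medium"),
]


def suggest_model(vram_gb: Optional[float]) -> str:
    """Suggest a Whisper model size for the available VRAM (None means CPU only)."""
    if vram_gb is not None:
        for min_vram, model in VRAM_TIERS:
            if vram_gb >= min_vram:
                return model
    return "base"  # CPU-only machines and small GPUs
```

For example, suggest_model(16) returns "large (quantized)" and suggest_model(None) falls back to "base".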

Is faster-whisper really 4x faster than vanilla Whisper?

Close to it. The faster-whisper project reports up to 4x faster transcription than vanilla Whisper at the same accuracy, and INT8 quantization lowers memory use further. The actual speedup depends on your GPU, model size, and batch settings. The CTranslate2 engine is genuinely impressive engineering.

What's the break-even point for self-hosted vs. managed API?

Above roughly 2,000 hours/month — and only with dedicated ML engineering to maintain it. Below that volume, a managed private API like Privocio is cheaper, more reliable, and frees your team to work on product instead of infrastructure.

Can I run real-time streaming with self-hosted Whisper?

Yes. WhisperLive and the hwdsl2/docker-whisper-live Docker image support WebSocket-based real-time streaming. Latency is typically 500ms-1.5s end-to-end on a modern GPU.

Conclusion: Self-Hosting Is an Engineering Commitment, Not a Cost Saving

After three years running self-hosted Whisper deployments, my honest assessment: self-hosting is a legitimate choice for teams with the engineering capacity to do it right. It's not a cost-cutting measure — when you account for GPU infrastructure, DevOps engineering, and maintenance, a well-run self-hosted deployment costs roughly the same as a managed private API at moderate volumes, and only wins on economics at very high scale.

The teams that succeed with self-hosted Whisper already have GPU infrastructure for other ML workloads, have dedicated ML engineers, and have specific compliance requirements that make on-premise processing mandatory. For everyone else — and that's most AI agent teams and growing companies — a managed private API like Privocio delivers better economics, better reliability, and frees your engineering team to work on product instead of infrastructure.

If self-hosting is the right call for your volume and team, the faster-whisper Docker stack from hwdsl2/docker-whisper is the production-ready starting point I'd recommend. Otherwise, explore Privocio's fixed-rate private transcription plans — 400 hours for $19 every four weeks with no per-minute surprises and no GPU management.

For the full picture on private speech-to-text options, read our Private Speech-to-Text API: The Complete Guide.


Image Credits:

Cover image sourced from Unsplash (Unsplash License).
