Python Speech-to-Text API: Transcribe Audio Files with Privocio

Learn how to use a Python speech-to-text API to transcribe audio files with httpx, Bearer authentication, Whisper-compatible models, and Privocio's private STT infrastructure.

Quick answer: Privocio lets Python developers transcribe audio files with a simple HTTP request. Use httpx, upload a .wav file, pass model=whisper-1, set the language, and receive the transcript as JSON.

If you are building an AI agent, chatbot, meeting tool, voice note app, internal automation, or any product that needs voice input, you eventually need a reliable way to turn audio into text.

In this guide, you will learn how to use a Python Speech-to-Text API with Privocio. We will upload an audio file, send it to the transcription endpoint, authenticate with an API key, and print the transcript response.

Privocio is built for teams that need private speech-to-text infrastructure with hosted or self-hosted deployment, predictable pricing, and output modes designed for downstream AI workflows.

Use case: Audio file transcription
Language: Python
API style: HTTP + OpenAI-compatible

What is a Python Speech-to-Text API?

A Python Speech-to-Text API lets your application send an audio file to a transcription service and receive text back.

Instead of running speech recognition models locally, your Python app can make a simple HTTP request with:

    • an audio file
    • an API key
    • a model name
    • optional language settings
    • optional output preferences

This is useful for AI agents, voice chatbots, meeting transcription, support call analysis, internal voice notes, legal or healthcare transcription, and workflow automation.

Why use Privocio for speech-to-text in Python?

Privocio is designed for developers and teams that care about cost, privacy, and AI-ready output.

Simple API integration

Privocio exposes a versioned HTTP API and uses Bearer authentication. The production API base is:

https://api.privocio.com

Transcription requests can be sent to:

/v1/transcriptions

See the full API documentation and Python examples for request details.

Privocio also supports an OpenAI-compatible transcription route for developers who already use Whisper-style transcription workflows.

OpenAI Whisper-compatible workflow

If you already know the Whisper-style API flow, Privocio feels familiar. You send a multipart file upload, pass a model such as whisper-1, and receive a transcription response.

This makes Privocio a practical Whisper API Python alternative for teams that want a familiar developer experience with more control over voice infrastructure.

Private hosted or self-hosted deployment

For teams working with sensitive voice data, Privocio can be used as hosted infrastructure or deployed in a self-hosted setup. Learn more on the pricing page.

This matters for companies that do not want audio data moving through third-party systems they do not control.

Output modes for AI workflows

Privocio supports output modes designed for different use cases.

    • Use Raw mode when you need maximum fidelity.
    • Use Clean mode when you want a more readable transcript.
    • Use Agent mode when the transcript will be passed into an AI agent, workflow automation, or LLM pipeline.

Predictable pricing

Many speech-to-text APIs charge per minute or per hour of audio, which makes costs hard to predict as usage grows.

Privocio is built around predictable pricing and flat-rate usage packages, making it easier to plan costs for AI agents, automation products, internal tools, and high-volume transcription workflows.

Prerequisites

Before you start, you need:

    • Python 3.9 or newer
    • an audio file, for example recording.wav
    • a Privocio API key (see the authentication docs)
    • the httpx package

Install httpx:

pip install httpx

Python Speech-to-Text API example with httpx

Here is a simple Python example that uploads an audio file to Privocio and prints the transcription response.

import httpx

API_BASE = "https://api.privocio.com"
API_KEY = "YOUR_API_KEY"

with open("recording.wav", "rb") as audio_file:
    files = {"file": ("recording.wav", audio_file, "audio/wav")}
    data = {"model": "whisper-1", "language": "en"}

    response = httpx.post(
        f"{API_BASE}/v1/transcriptions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files=files,
        data=data,
        timeout=600.0,
    )

    response.raise_for_status()
    print(response.json())

How the Python example works

1. Import httpx

import httpx

httpx is used to send the HTTP request from Python to the Privocio API.

2. Set the API base and API key

API_BASE = "https://api.privocio.com"
API_KEY = "YOUR_API_KEY"

The API base points to Privocio's production API. The API key is used in the Authorization header.

In production, do not hardcode your API key. Use an environment variable instead.

import os

API_KEY = os.environ["PRIVOCIO_API_KEY"]

3. Open the audio file

with open("recording.wav", "rb") as audio_file:

The file is opened in binary mode so it can be uploaded as multipart form data.

4. Prepare the file upload

files = {"file": ("recording.wav", audio_file, "audio/wav")}

This tells the API which file you are uploading and what MIME type it uses.

5. Set the transcription parameters

data = {"model": "whisper-1", "language": "en"}

The model parameter selects the transcription model. The language parameter tells the transcription system which spoken language to expect.

6. Send the request

response = httpx.post(
    f"{API_BASE}/v1/transcriptions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    files=files,
    data=data,
    timeout=600.0,
)

This sends the audio file to the transcription endpoint documented in the API reference.

The long timeout is useful for larger audio files, because batch transcription waits until the full result is ready.

7. Handle errors and print the result

response.raise_for_status()
print(response.json())

raise_for_status() raises an exception if the API returns an error status code. If the request succeeds, the response JSON contains the transcription result.

Better production example: use environment variables

For real applications, keep your API key outside your source code.

import os
import httpx

API_BASE = "https://api.privocio.com"
API_KEY = os.environ["PRIVOCIO_API_KEY"]

def transcribe_audio(path: str, language: str = "en") -> dict:
    with open(path, "rb") as audio_file:
        # Send only the file name, not the full local path, as the upload name.
        files = {"file": (os.path.basename(path), audio_file, "audio/wav")}
        data = {
            "model": "whisper-1",
            "language": language,
        }

        response = httpx.post(
            f"{API_BASE}/v1/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files=files,
            data=data,
            timeout=600.0,
        )

        response.raise_for_status()
        return response.json()

if __name__ == "__main__":
    transcript = transcribe_audio("recording.wav")
    print(transcript)

This version loads the API key from an environment variable instead of writing it into the source code, and wraps the request in a function you can reuse across your application.

OpenAI-compatible Python option

Privocio also supports an OpenAI-compatible transcription route. This is useful if you already use OpenAI-style transcription code and want a familiar setup.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.privocio.com/v1",
)

with open("recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en",
    )

print(transcript.text)

Use this approach when you want to keep your existing OpenAI SDK structure while routing transcription through Privocio.

If you prefer JavaScript or TypeScript, see the JavaScript STT API examples.

When should you use batch transcription?

Batch transcription is best when you already have a complete audio file.

Use it for uploaded voice notes, recorded calls, podcasts, meetings, interviews, support recordings, and internal audio archives.

For real-time voice agents or live transcription, streaming transcription may be a better fit.

Batch transcription vs. streaming transcription

Use batch transcription when the full audio file already exists.

Use streaming transcription when you need live or near-real-time transcription.

A simple rule:

    • Use batch transcription for files.
    • Use streaming transcription for live voice.

For example, a meeting recording can use batch transcription. A real-time AI voice agent should usually use streaming transcription.

Common errors and how to fix them

401 Unauthorized

This usually means the API key is missing, invalid, or not passed correctly.

Check that your request includes:

Authorization: Bearer YOUR_API_KEY

Also check that the key has not expired and that it has access to the transcription endpoint. See authentication and API keys.
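
A missing or malformed key is easier to catch before the request is sent. The following is a minimal sketch that reads the key from the PRIVOCIO_API_KEY environment variable used earlier in this guide, strips stray whitespace (a trailing newline in a secret is a common cause of 401 responses), and builds the header. The helper names load_api_key and auth_headers are hypothetical.

```python
import os


def load_api_key(env=os.environ, name: str = "PRIVOCIO_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set or is empty")
    return key


def auth_headers(api_key: str) -> dict:
    """Build the Authorization header in Bearer format."""
    return {"Authorization": f"Bearer {api_key}"}
```

Calling auth_headers(load_api_key()) at startup makes a configuration problem fail immediately with a readable error instead of surfacing later as a 401.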

413 Payload Too Large

Your audio file may be larger than your current plan allows.

To fix this:

    • compress the audio file
    • shorten the recording
    • split the file into smaller parts
    • check your current usage limits
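
For uncompressed WAV files, splitting can be done with Python's standard wave module. This is a rough sketch, assuming PCM WAV input; the split_wav name, the chunk length, and the output naming scheme are choices made for this example.

```python
import wave


def split_wav(path: str, chunk_seconds: int, out_prefix: str) -> list:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source audio
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width, and frame rate from the source.
                # The frame count in the header is corrected when the file closes.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Note that this cuts at fixed time boundaries, so a word can be split across two chunks. If that matters for your transcripts, consider cutting at silences instead.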

415 Unsupported Media Type

This usually means the uploaded file type or MIME type is not supported.

Check that your file is a valid audio file and that the MIME type matches the file format.

files = {"file": ("recording.wav", audio_file, "audio/wav")}
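
If your application accepts more than one format, you can derive the MIME type from the file extension rather than hardcoding it. The sketch below uses an explicit lookup table. Only .wav appears in this guide; whether the API accepts the other formats listed is an assumption to confirm in the API reference, and build_file_field is a hypothetical helper name.

```python
import os

# Standard MIME types by extension. Only .wav is confirmed by this guide;
# the other entries are assumptions about what the API may accept.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
}


def build_file_field(path: str, file_obj):
    """Return the (filename, file, content_type) tuple for a multipart upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unrecognized audio extension: {ext or '(none)'}")
    return (os.path.basename(path), file_obj, AUDIO_MIME_TYPES[ext])
```

The returned tuple can be used directly as the value of the "file" key in the files dictionary passed to httpx.post.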

502 Runtime Unavailable

The transcription runtime may be temporarily unavailable.

For production applications, add retry logic for temporary errors.
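
One common pattern is retrying with exponential backoff. The helper below is a generic sketch, not Privocio-specific code: retry_with_backoff, its parameters, and the default delays are all assumptions made for illustration.

```python
import time


def retry_with_backoff(call, is_retryable, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); after a retryable exception, wait and try again.

    The wait doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure, or any non-retryable exception, is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * (2 ** attempt))
```

With httpx, is_retryable could return True for httpx.TransportError and for an httpx.HTTPStatusError whose response.status_code is 502. Have call reopen the audio file on each attempt, because a file object that has already been read is positioned at end of file.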

Best practices for Python speech-to-text integrations

Store API keys safely

Use environment variables or a secrets manager. Never commit API keys to a Git repository.

Set a realistic timeout

Audio transcription can take longer than a normal API request. For longer files, use a longer timeout.

Validate file type and size before upload

Check the file extension, MIME type, and file size before sending the request.
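
A pre-upload check might look like the sketch below. The 25 MB limit and the allowed extension set are hypothetical placeholders, since the real limits depend on your plan; find_upload_problems is a name invented for this example.

```python
import os

# Hypothetical limits for illustration; replace with the values for your plan.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
ALLOWED_EXTENSIONS = {".wav"}


def find_upload_problems(path, max_bytes=MAX_UPLOAD_BYTES, allowed=ALLOWED_EXTENSIONS):
    """Return a list of human-readable problems; an empty list means OK."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a file"]
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in allowed:
        problems.append(f"extension {ext!r} is not in the allowed set")
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    elif size > max_bytes:
        problems.append(f"file is {size} bytes, over the {max_bytes} byte limit")
    return problems
```

Returning a list rather than raising on the first problem lets you show the user everything that needs fixing in one pass.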

Add retry logic

For production systems, retry temporary errors such as network failures or runtime availability issues.

Choose the right output mode

Use Raw, Clean, or Agent output modes depending on your downstream workflow.

Python Speech-to-Text API use cases

AI agents

Turn voice commands into structured text that agents can understand and act on. See our guide on speech-to-text for AI agents.

Chatbots

Add voice input to a chatbot so users can speak instead of typing.

Meeting tools

Transcribe meeting recordings and send the text to a summarization pipeline.

Support call analysis

Convert support calls into text so they can be analyzed, summarized, tagged, and routed. Keep sensitive conversations private with hosted or self-hosted transcription infrastructure.

Healthcare and finance applications

Use private transcription infrastructure when audio data is sensitive and strict data handling matters.

Why this matters for AI agents

AI agents work better when they receive enough context.

Typing long instructions is slow. Speaking is faster.

A private speech-to-text API lets users talk to their agents, send longer instructions, explain edge cases, and provide richer context without typing everything manually.

That is one of the main reasons Privocio focuses on voice infrastructure for AI workflows.

Frequently Asked Questions

What is a Python Speech-to-Text API?

A Python Speech-to-Text API lets a Python application upload audio and receive a text transcript. Instead of running a model locally, your app sends the audio file to an API endpoint and gets the transcription result back.

How do I transcribe an audio file in Python?

You can transcribe an audio file in Python by opening the file in binary mode, uploading it as multipart form data to a transcription API, authenticating with an API key, and reading the JSON response.

What is the best Python package for calling a speech-to-text API?

For simple HTTP requests, httpx is a good choice. It supports file uploads, custom headers, timeouts, and clean error handling.

Can I use Privocio as a Whisper API alternative?

Yes. Privocio supports Whisper-style transcription workflows and also provides an OpenAI-compatible transcription route. Compare options in our Privocio vs OpenAI Whisper guide.

Can I transcribe WAV files with Python?

Yes. You can open a WAV file in binary mode and upload it to Privocio using multipart form data.

Should I use batch or streaming transcription?

Use batch transcription for completed audio files. Use streaming transcription when you need real-time or near-real-time transcription.

Is Privocio only for Python developers?

No. Privocio also supports cURL, JavaScript/TypeScript, OpenAI-compatible integrations, and streaming workflows.

Is a private speech-to-text API useful for AI agents?

Yes. A private speech-to-text API is useful for AI agents because it lets users speak longer, more detailed instructions while keeping voice data under better control.

Conclusion

A Python Speech-to-Text API is one of the fastest ways to add voice input to your application.

With Privocio, you can upload an audio file, authenticate with a Bearer token, and receive a transcription response using a simple Python httpx request.

For teams building AI agents, internal tools, voice workflows, or privacy-sensitive products, Privocio adds important advantages: private deployment options, predictable pricing, Whisper-compatible API patterns, and output modes designed for downstream AI workflows.

Start with the simple Python example above, then move toward environment variables, error handling, retries, and the right output mode for your application.

Start building with Privocio

Add private, reliable speech-to-text to your Python app, AI agent, chatbot, or internal workflow.
