Recipe

Voice agent recipe

Build a real-time voice agent pipeline: speech-to-text via OpenAI Whisper, reasoning with GPT-4o, and text-to-speech output using tts-1. End-to-end Python implementation with streaming audio I/O.

WhisperGPT-4otts-1Python

STTWhisper

→

ReasoningGPT-4o

→

TTStts-1

Architecture

The voice agent captures microphone input, transcribes it with Whisper, sends the transcript to GPT-4o for reasoning, and speaks the response aloud using tts-1. All three stages stream where possible to minimize perceived latency.

Stage 1: STT (Whisper)

Audio is recorded in chunks via pyaudio and streamed to the OpenAI Whisper API. The whisper-1 model returns a transcript with word-level timestamps. For local-only deployments, swap in the open-source faster-whisper library.

Stage 2: Reasoning (GPT-4o)

The transcript is injected into a system prompt that defines the agent's persona and behavior. GPT-4o processes the input and returns a concise spoken response. Streaming is enabled via stream=True so the TTS stage can begin before the full response is generated.

Stage 3: TTS (tts-1)

The response text is sent to OpenAI's tts-1 model with the alloy voice. Audio bytes are played back through pydub or pygame. For sub-200ms latency, consider chunked streaming TTS with ElevenLabs or Deepgram as alternatives.

Prerequisites

Python 3.10+
OpenAI API key (set as OPENAI_API_KEY)
pip install openai pyaudio pydub pygame
Working microphone and speakers