Voice agent recipe
Build a real-time voice agent pipeline: speech-to-text via OpenAI Whisper, reasoning with GPT-4o, and text-to-speech output using tts-1. End-to-end Python implementation with streaming audio I/O.
Architecture
The voice agent captures microphone input, transcribes it with Whisper, sends the transcript to GPT-4o for reasoning, and speaks the response aloud using tts-1. All three stages stream where possible to minimize perceived latency.
Stage 1: STT (Whisper)
Audio is recorded in chunks via pyaudio and streamed to the OpenAI Whisper API. The whisper-1 model returns a transcript with word-level timestamps. For local-only deployments, swap in the open-source faster-whisper library.
Stage 2: Reasoning (GPT-4o)
The transcript is injected into a system prompt that defines the agent's persona and behavior. GPT-4o processes the input and returns a concise spoken response. Streaming is enabled via stream=True so the TTS stage can begin before the full response is generated.
Stage 3: TTS (tts-1)
The response text is sent to OpenAI's tts-1 model with the alloy voice. Audio bytes are played back through pydub or pygame. For sub-200ms latency, consider chunked streaming TTS with ElevenLabs or Deepgram as alternatives.
Prerequisites
- Python 3.10+
- OpenAI API key (set as
OPENAI_API_KEY) pip install openai pyaudio pydub pygame- Working microphone and speakers