12 — Open-Source Voice Integration & Channel Connectors
What Was Researched
Voice input (STT), voice output (TTS), and voice conversation systems in open-source AI agents. Additionally, this topic covers secure device-pairing protocols and platform messaging connectors (SMS, Telegram, Discord, Slack, WhatsApp) that route notifications, voice memos, and actions.
For detailed analysis on messaging gateways, see:
Which Sources Were Used
| Source | Type | URL | Relevance |
|---|---|---|---|
| Hermes Agent (hermes-agent/tools/tts_tool.py, tools/transcription_tools.py, tools/voice_mode.py) | Local codebase | https://github.com/NousResearch/hermes-agent | CRITICAL |
| OpenClaw Voice Wake + Talk Mode | Local codebase | https://github.com/openclaw/openclaw | HIGH |
| OpenClaw Extensions (openclaw/extensions/device-pair/, whatsapp/, sms/, slack/, telegram/, discord/) | Local codebase | https://github.com/openclaw/openclaw | CRITICAL |
| Model landscape (GPT Audio, Grok Voice TTS) | Research output | — | HIGH |
Key Findings
Hermes Voice System
- tts_tool.py (111KB) — Text-to-speech with multiple backends
- transcription_tools.py (73KB) — Audio transcription / speech-to-text
- voice_mode.py (48KB) — Continuous voice conversation mode
- tools/neutts_synth.py + tools/neutts_samples/ — Neural TTS synthesis
- Voice memo transcription — Receives voice memos via messaging platforms, transcribes automatically
- Cross-platform — Voice works via CLI, messaging gateway (Telegram voice messages), and companion apps
- ElevenLabs integration — Premium TTS via API key
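The voice memo flow in the list above can be sketched as a small normalization step that runs before the agent sees a message. This is a minimal illustration, not Hermes code: `IncomingMessage`, `Transcriber`, and `to_agent_text` are hypothetical names, and the transcriber is injected so any STT backend could be used.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IncomingMessage:
    """Hypothetical platform-neutral message; not a type from Hermes."""
    platform: str
    text: Optional[str] = None
    audio: Optional[bytes] = None


# Assumed STT interface: raw audio bytes in, transcript out.
Transcriber = Callable[[bytes], str]


def to_agent_text(msg: IncomingMessage, transcribe: Transcriber) -> str:
    """Return the text the agent should see, transcribing audio when present."""
    parts = []
    if msg.audio:
        # Voice memo: transcribe first so the agent only ever handles text.
        parts.append(transcribe(msg.audio))
    if msg.text:
        # A caption sent alongside the memo is appended after the transcript.
        parts.append(msg.text)
    if not parts:
        raise ValueError("message has neither text nor audio")
    return "\n".join(parts)
```

Because the transcriber is a plain callable, the same function can be exercised in tests with a stub and wired to a real STT backend in production.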
OpenClaw Voice System
- Voice Wake — Wake words on macOS/iOS (always-listening trigger)
- Talk Mode — Continuous voice conversation on Android
- ElevenLabs + system TTS fallback — Premium voice with free fallback
- Node-based — Voice processing runs on companion app nodes (macOS, iOS, Android)
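The premium-with-free-fallback arrangement can be illustrated with an ordered list of synthesizers tried in turn. This is a sketch under assumptions, not OpenClaw's implementation: `Synthesizer`, `AllBackendsFailed`, and `synthesize_with_fallback` are hypothetical names, and each backend is modeled as a callable that either returns audio bytes or raises.

```python
from typing import Callable, Sequence

# Assumed TTS interface: text in, encoded audio bytes out.
Synthesizer = Callable[[str], bytes]


class AllBackendsFailed(RuntimeError):
    """Raised when every configured TTS backend has raised an error."""


def synthesize_with_fallback(
    text: str, backends: Sequence[tuple[str, Synthesizer]]
) -> tuple[str, bytes]:
    """Try backends in order; return (backend_name, audio) from the first success."""
    errors = []
    for name, synth in backends:
        try:
            return name, synth(text)
        except Exception as exc:  # Any backend failure triggers the next fallback.
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))
```

Returning the backend name alongside the audio lets the caller log which voice was actually used, which helps when the premium provider fails silently into the fallback.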
Voice Model Landscape (June 2026)
From model research (13_model_agnostic_harness_architecture/model_landscape_june_2026.md):
| Model | Type | Modality |
|---|---|---|
| Grok Voice TTS 1.0 | Text → Audio | TTS only |
| GPT Audio | Text + Audio → Text + Audio | Bidirectional |
| GPT Audio Mini | Text + Audio → Text + Audio | Bidirectional (cost-optimized) |
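One way a harness might encode the table above is as a small capability registry, so routing decisions depend on declared modalities instead of model names. The class and field names below are hypothetical; only the three model names and their modalities come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    AUDIO = "audio"


@dataclass(frozen=True)
class VoiceModel:
    name: str
    inputs: frozenset
    outputs: frozenset

    @property
    def is_native_audio(self) -> bool:
        # Accepts and emits audio, so it can stand in for STT + LLM + TTS.
        return Modality.AUDIO in self.inputs and Modality.AUDIO in self.outputs

    @property
    def is_tts_only(self) -> bool:
        # Text in, audio out: usable only as the TTS stage of a pipeline.
        return self.inputs == {Modality.TEXT} and self.outputs == {Modality.AUDIO}


T, A = Modality.TEXT, Modality.AUDIO
MODELS = [
    VoiceModel("Grok Voice TTS 1.0", frozenset({T}), frozenset({A})),
    VoiceModel("GPT Audio", frozenset({T, A}), frozenset({T, A})),
    VoiceModel("GPT Audio Mini", frozenset({T, A}), frozenset({T, A})),
]

native = [m.name for m in MODELS if m.is_native_audio]
tts_only = [m.name for m in MODELS if m.is_tts_only]
```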
Voice Architecture Pattern
┌────────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ Microphone │────▸│   STT    │────▸│  Agent   │────▸│   TTS    │────▸ Speaker
│            │     │ (Whisper)│     │  (LLM)   │     │ (Eleven) │
└────────────┘     └──────────┘     └──────────┘     └──────────┘
Emerging pattern (GPT Audio): a single model accepts audio and returns audio natively, removing the need for separate STT and TTS stages.
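The two patterns in this section can be expressed as interfaces to make the difference concrete. The sketch below is illustrative only: the protocol and method names (`SpeechToText.transcribe`, `TextAgent.respond`, `TextToSpeech.synthesize`, `NativeAudioModel.converse`) are assumptions, not APIs from Hermes, OpenClaw, or any model provider.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextAgent(Protocol):
    def respond(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class NativeAudioModel(Protocol):
    def converse(self, audio: bytes) -> bytes: ...


def pipeline_turn(
    audio: bytes, stt: SpeechToText, agent: TextAgent, tts: TextToSpeech
) -> bytes:
    """Pipeline pattern: three stages, each independently swappable."""
    transcript = stt.transcribe(audio)
    reply = agent.respond(transcript)
    return tts.synthesize(reply)


def native_turn(audio: bytes, model: NativeAudioModel) -> bytes:
    """Native pattern: one model call replaces all three stages."""
    return model.converse(audio)
```

Both functions take audio bytes and return audio bytes, so a caller can switch between the pipeline and a native model without changing the code around the call.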
What Is Confirmed
- STT + TTS pipeline is the current standard (Hermes, OpenClaw)
- ElevenLabs is the premium TTS provider in both surveyed agents (Hermes, OpenClaw)
- Voice wake words enable hands-free interaction (OpenClaw on macOS/iOS)
- Native audio models (GPT Audio, GPT Audio Mini) are emerging as an alternative to the pipeline approach; Grok Voice TTS 1.0 is TTS-only and fits the pipeline's TTS stage
- Voice memo transcription via messaging platforms is supported (Hermes) and is especially useful on mobile
What Is Uncertain
- Whether native audio models will replace STT + TTS pipelines
- Latency characteristics of pipeline vs. native audio models
- Best approach for voice wake word detection (on-device vs. cloud)
How This Applies to Building a Modern Model-Agnostic Agent Harness
- Support STT + TTS pipeline as the baseline voice capability
- Plan for native audio models (GPT Audio) as an alternative path
- Implement voice memo transcription for messaging platform integration
- Consider ElevenLabs as the default premium TTS provider
- Design voice as a modality, not a feature — voice input/output should be orthogonal to the agent loop
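The last point, treating voice as a modality, can be sketched as a wrapper around an unchanged text-only agent loop. This is one possible design under stated assumptions: `AgentLoop`, `Turn`, and `VoiceModality` are hypothetical names, and the transcribe and synthesize callables stand in for whichever STT and TTS backends the harness selects.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# The agent loop stays text-only and knows nothing about audio.
AgentLoop = Callable[[str], str]


@dataclass
class Turn:
    text: Optional[str] = None
    audio: Optional[bytes] = None


@dataclass
class VoiceModality:
    transcribe: Callable[[bytes], str]
    synthesize: Callable[[str], bytes]

    def wrap(self, loop: AgentLoop) -> Callable[[Turn], Turn]:
        """Return a handler that adds voice I/O around an unmodified loop."""
        def handle(turn: Turn) -> Turn:
            spoken = turn.audio is not None
            prompt = self.transcribe(turn.audio) if spoken else (turn.text or "")
            reply = loop(prompt)
            # Reply with audio only when the user spoke; text is always included.
            return Turn(text=reply, audio=self.synthesize(reply) if spoken else None)
        return handle
```

Because the wrapper owns all audio handling, swapping the STT or TTS backend, or removing voice entirely, does not touch the agent loop.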