Voice Agent
The CyberDoc Voice Agent provides a real-time AI voice consultation after a scan is complete. Users can speak to the AI doctor about their findings, ask questions about specific vulnerabilities, and receive verbal prescriptions — all in natural conversational English.
How It Works
The voice agent is powered by the Grok Voice Agent API from xAI. When activated, the following sequence occurs:
- Session creation — The client calls /api/voice/session with the scan ID. The backend creates a Grok Voice session, injecting the scan results as system context so the agent understands the user's findings.
- WebSocket connection — The client establishes a WebSocket connection to the Grok Voice endpoint. Audio is streamed bidirectionally in real-time.
- Microphone access — The browser requests microphone permission. Audio is captured at 16kHz mono PCM and streamed to the agent.
- Agent response — The Grok agent processes speech, generates a contextual response about the scan findings, and streams audio back. The client plays it through the speakers.
- Session end — When the user ends the session, the transcript is saved to KV and a voice log entry is created for the admin dashboard.
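The microphone step above can be illustrated with a small conversion helper. Browsers typically expose captured audio as 32-bit float samples in the range -1 to 1, so some conversion is needed before PCM can be streamed. This is a minimal sketch that assumes signed 16-bit samples (the bit depth is not stated here) and that resampling to 16kHz has already happened; the function name is hypothetical.

```typescript
// Assumption: signed 16-bit samples. This page states only "16kHz mono PCM".
/** Convert float samples (range -1..1) to signed 16-bit PCM. */
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  samples.forEach((value, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, value));
    // Negative values scale towards -32768, positive values towards 32767.
    pcm[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  });
  return pcm;
}
```

How the resulting buffer is framed on the WebSocket (raw binary frames or an encoded payload) depends on the Grok Voice endpoint and is not specified on this page.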
User Interface
Main Voice Panel
The voice agent UI appears as an overlay panel within the CyberDoc terminal. It includes:
- Oscilloscope — A green waveform visualisation (canvas-based) that reacts to both incoming agent audio and outgoing user speech. Renders at 60fps with the CyberDoc terminal aesthetic.
- Status indicator — Shows the current state: CONNECTING, LISTENING, SPEAKING, PAUSED, or ENDED.
- Transcript — A scrollable live transcript showing both user speech (cyan) and agent responses (green), updating in real-time as speech is recognised.
- Duration counter — Elapsed session time in MM:SS format.
- Cost estimate — Running cost estimate at $0.05/min, updated live.
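As an illustration of the duration counter, the MM:SS label could be produced by a helper along these lines. The function name is hypothetical, and letting minutes run past 59 rather than rolling into hours is an assumption.

```typescript
/** Format elapsed seconds as MM:SS (minutes are not capped at 59). */
function formatDuration(elapsedSeconds: number): string {
  // Guard against negative or fractional input from timer drift.
  const total = Math.max(0, Math.floor(elapsedSeconds));
  const minutes = Math.floor(total / 60);
  const seconds = total % 60;
  return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
}
```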
Controls
| Button | Action |
|---|---|
| Start | Begin voice session (requests microphone permission) |
| Pause / Resume | Temporarily pause audio streaming without ending the session |
| Mute | Mute microphone input (agent can still speak) |
| End Session | Close the voice session and save transcript |
| Detach | Pop the voice panel into a separate floating window |
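One way to keep the status indicator consistent with these controls is a small transition function. The sketch below is an assumption about how the five states might relate; the event names are hypothetical, and Mute and Detach are left out because they do not appear to change the session status.

```typescript
type Status = "CONNECTING" | "LISTENING" | "SPEAKING" | "PAUSED" | "ENDED";

// Hypothetical event names; the real client may model these differently.
type SessionEvent =
  | "connected"
  | "agent_started"
  | "agent_finished"
  | "pause"
  | "resume"
  | "end";

/** Return the next status, ignoring events that do not apply to the current one. */
function nextStatus(current: Status, event: SessionEvent): Status {
  if (current === "ENDED") return "ENDED"; // terminal state
  switch (event) {
    case "end":
      return "ENDED";
    case "connected":
      return current === "CONNECTING" ? "LISTENING" : current;
    case "agent_started":
      return current === "LISTENING" ? "SPEAKING" : current;
    case "agent_finished":
      return current === "SPEAKING" ? "LISTENING" : current;
    case "pause":
      return current === "LISTENING" || current === "SPEAKING" ? "PAUSED" : current;
    case "resume":
      // Assumption: resuming always returns to LISTENING.
      return current === "PAUSED" ? "LISTENING" : current;
  }
}
```

Treating ENDED as terminal means a late event from the WebSocket cannot revive a closed session in the UI.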
Detachable Window
The voice agent panel can be detached into a separate draggable/resizable window. This allows the user to browse their scan report while simultaneously talking to the AI doctor. The detached window maintains the oscilloscope visualisation and live transcript.
Text Fallback
For users who cannot or prefer not to use voice (for example, in a quiet office, without a microphone, or because of accessibility needs), the voice agent includes a text input fallback. Users can type questions, and the agent responds with text-to-speech audio plus text in the transcript.
Available Voices
The Grok Voice Agent API provides several voice options. CyberDoc defaults to drew, a calm, professional voice, but users can select any of the following:
| Voice ID | Description | Character |
|---|---|---|
| tina | Female, professional | Clear and authoritative — good for clinical findings |
| chloe | Female, warm | Approachable and reassuring — good for anxious users |
| drew | Male, professional | Calm and measured — default CyberDoc voice |
| james | Male, conversational | Casual and friendly — good for non-technical users |
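A client or backend may want to validate the requested voice before creating a session. The following is a minimal sketch that falls back to the default voice for unknown input; the helper name and the trimming and lower-casing behaviour are assumptions.

```typescript
// Voice IDs taken from the table above.
const VOICES = ["tina", "chloe", "drew", "james"] as const;
type VoiceId = (typeof VOICES)[number];

const DEFAULT_VOICE: VoiceId = "drew";

/** Return a known voice ID, falling back to the default for unknown input. */
function resolveVoice(requested?: string | null): VoiceId {
  // Assumption: IDs are matched case-insensitively after trimming.
  const normalised = requested?.trim().toLowerCase();
  return VOICES.find((voice) => voice === normalised) ?? DEFAULT_VOICE;
}
```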
Context Injection
When a voice session starts, the backend constructs a system prompt that includes:
You are CyberDoc, a cybersecurity doctor conducting a
follow-up consultation. The patient has just completed
a cyber health check-up. Here are their results:
Overall Score: {score}%
Status: {status}
Security Check Categories:
{category_breakdown}
Pen Test Findings:
{findings_list}
Deep Scan Results (if applicable):
{deep_scan_summary}
AI Diagnosis:
{diagnosis_text}
Speak as a doctor would — explain findings in plain
language, prioritise the most critical issues, and
provide clear remediation steps. Be reassuring but
honest about risks.
This ensures the voice agent has full context about the user's scan and can give specific, relevant advice rather than generic security tips.
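A sketch of how the backend might fill this template is shown below. The interface and function names are hypothetical, the field names mirror the placeholders in the prompt, and the "Not performed" text used when no deep scan exists is an assumption.

```typescript
// Field names mirror the template placeholders; the interface itself is hypothetical.
interface ScanContext {
  score: number;
  status: string;
  category_breakdown: string;
  findings_list: string;
  deep_scan_summary?: string; // only present when a deep scan was run
  diagnosis_text: string;
}

/** Fill the system prompt template with one scan's results. */
function buildSystemPrompt(scan: ScanContext): string {
  // Assumption: a fixed phrase stands in when no deep scan was performed.
  const deepScan = scan.deep_scan_summary ?? "Not performed";
  return [
    "You are CyberDoc, a cybersecurity doctor conducting a",
    "follow-up consultation. The patient has just completed",
    "a cyber health check-up. Here are their results:",
    `Overall Score: ${scan.score}%`,
    `Status: ${scan.status}`,
    "Security Check Categories:",
    scan.category_breakdown,
    "Pen Test Findings:",
    scan.findings_list,
    "Deep Scan Results (if applicable):",
    deepScan,
    "AI Diagnosis:",
    scan.diagnosis_text,
    "Speak as a doctor would — explain findings in plain",
    "language, prioritise the most critical issues, and",
    "provide clear remediation steps. Be reassuring but",
    "honest about risks.",
  ].join("\n");
}
```

Building the prompt on the backend, rather than in the client, keeps the scan data and the prompt wording out of the browser.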
Language Support
The voice agent currently supports English only. Speech recognition handles common English accents (Australian, American, British, Indian). Non-English language support is planned for a future release.
Session Recording
Each voice session is recorded (transcript only, not audio) and stored in the VOICE_LOG KV namespace. The record includes:
| Field | Description |
|---|---|
| session_id | Unique voice session identifier |
| scan_id | Associated scan ID |
| voice | Selected voice model |
| duration_seconds | Total session duration |
| transcript | Full text transcript of both sides |
| cost_usd | Estimated cost ($0.05 × minutes) |
| started_at | ISO timestamp of session start |
| ended_at | ISO timestamp of session end |
Transcripts are retained for 180 days (6 months) and accessible from the admin dashboard voice logs section.
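The record and its retention window could be modelled roughly as follows. This is a sketch under assumptions: the type and function names are hypothetical, the transcript is treated as a single string, and the cost is derived from the whole duration (substitute active seconds if paused time is tracked separately).

```typescript
const COST_PER_MINUTE_USD = 0.05;
const RETENTION_DAYS = 180;
const RETENTION_TTL_SECONDS = RETENTION_DAYS * 24 * 60 * 60;

// Field names follow the table above; the types are assumptions.
interface VoiceLogRecord {
  session_id: string;
  scan_id: string;
  voice: string;
  duration_seconds: number;
  transcript: string;
  cost_usd: number;
  started_at: string;
  ended_at: string;
}

interface SessionSummary {
  sessionId: string;
  scanId: string;
  voice: string;
  transcript: string;
  startedAt: Date;
  endedAt: Date;
}

/** Build the log record for a finished session. */
function buildVoiceLogRecord(s: SessionSummary): VoiceLogRecord {
  const elapsedMs = s.endedAt.getTime() - s.startedAt.getTime();
  const durationSeconds = Math.max(0, Math.round(elapsedMs / 1000));
  // Round to four decimal places to avoid floating-point noise in the stored value.
  const costUsd = Math.round((durationSeconds / 60) * COST_PER_MINUTE_USD * 10000) / 10000;
  return {
    session_id: s.sessionId,
    scan_id: s.scanId,
    voice: s.voice,
    duration_seconds: durationSeconds,
    transcript: s.transcript,
    cost_usd: costUsd,
    started_at: s.startedAt.toISOString(),
    ended_at: s.endedAt.toISOString(),
  };
}

/** Date after which a transcript falls outside the 180-day retention window. */
function expiryDate(endedAt: Date): Date {
  return new Date(endedAt.getTime() + RETENTION_TTL_SECONDS * 1000);
}
```

If the namespace is a Cloudflare Workers KV store (an assumption; this page does not name the platform), the TTL in seconds can be passed as the `expirationTtl` option when writing the record, so expiry needs no separate cleanup job.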
Cost
Cost is estimated at $0.05 per minute of active session time; paused time is excluded. The estimate shown to users is approximate, and actual billing from xAI may differ slightly.
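To make the active-time rule concrete, the sketch below subtracts paused intervals from elapsed time before applying the per-minute rate. The names and the interval representation are hypothetical; only the $0.05 rate and the exclusion of paused time come from this page.

```typescript
// Hypothetical representation of one pause; resumedAtMs is absent while still paused.
interface PauseInterval {
  pausedAtMs: number;
  resumedAtMs?: number;
}

/** Seconds of active (non-paused) session time between startMs and nowMs. */
function activeSeconds(startMs: number, nowMs: number, pauses: PauseInterval[]): number {
  const pausedMs = pauses.reduce((sum, p) => {
    // Clamp each pause to the session window; an open pause runs until now.
    const from = Math.max(p.pausedAtMs, startMs);
    const to = Math.min(p.resumedAtMs ?? nowMs, nowMs);
    return sum + Math.max(0, to - from);
  }, 0);
  return Math.max(0, (nowMs - startMs - pausedMs) / 1000);
}

/** Estimated cost in USD for the given active time. */
function estimateCostUsd(activeSec: number, ratePerMinuteUsd = 0.05): number {
  return (activeSec / 60) * ratePerMinuteUsd;
}
```

For example, a five-minute session with a two-minute pause has three active minutes, giving an estimate of about $0.15.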
Mobile Support
The voice agent is fully functional on mobile devices with the following considerations:
- iOS Safari — Requires user gesture to start audio playback (handled by the Start button). Microphone access requires explicit permission grant.
- Android Chrome — Full support including background audio (if screen is on).
- Detachable window — Not available on mobile; the voice panel renders inline instead.
- Text fallback — Particularly useful on mobile where voice input may be impractical in public.
Error Handling
| Error | Cause | Recovery |
|---|---|---|
| Microphone denied | User denied browser permission | Show text fallback mode, prompt to allow mic in settings |
| WebSocket disconnect | Network interruption | Auto-reconnect with exponential backoff (3 attempts) |
| Session timeout | 15 minutes of inactivity | Prompt user to restart or end session |
| API rate limit | Too many concurrent sessions | Show "busy" message, suggest trying again in 60 seconds |
| Audio playback blocked | Browser autoplay policy (iOS) | Require user tap before starting audio output |
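The auto-reconnect row can be sketched as a retry loop with exponentially growing delays. Only the limit of three attempts comes from the table; the one-second base delay, the doubling factor, and the function names are assumptions.

```typescript
/** Delays before each attempt, e.g. 1000, 2000, 4000 ms for three attempts. */
function backoffDelaysMs(maxAttempts = 3, baseMs = 1000, factor = 2): number[] {
  return Array.from({ length: maxAttempts }, (_, attempt) => baseMs * factor ** attempt);
}

/** Call connect() up to maxAttempts times, waiting longer before each try. */
async function reconnectWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 1000,
  // The sleep function is injectable so the loop can be exercised without real timers.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("no reconnect attempts were made");
  for (const delay of backoffDelaysMs(maxAttempts, baseMs)) {
    await sleep(delay);
    try {
      return await connect();
    } catch (err) {
      lastError = err; // remember the failure and try again after a longer wait
    }
  }
  throw lastError;
}
```

When all attempts fail, the rejected promise gives the UI a single place to switch to an error state or offer the text fallback.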
API Endpoints
The voice agent uses two API endpoints:
// Create voice session
POST /api/voice/session
Body: { "scan_id": "scan_abc123", "voice": "drew" }
Response: { "session_id": "vs_xyz", "ws_url": "wss://..." }
// End session and save transcript
POST /api/voice/end
Body: { "session_id": "vs_xyz", "transcript": [...] }
Response: { "voice_log_id": "vl_123" }
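A thin typed wrapper around these two endpoints might look like the sketch below. The request and response fields follow the examples above; the helper names, the JSON Content-Type header, and the minimal `HttpPost` type are assumptions, and transcript entries are left as `unknown` because their shape is not shown here.

```typescript
interface CreateSessionResponse {
  session_id: string;
  ws_url: string;
}

interface JsonRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Minimal fetch-like function type so the sketch does not depend on browser typings.
type HttpPost = (
  url: string,
  init: JsonRequest["init"],
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

function jsonPost(url: string, payload: unknown): JsonRequest {
  return {
    url,
    // Assumption: the backend expects a JSON body with this header.
    init: { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(payload) },
  };
}

/** Request for POST /api/voice/session. */
function buildCreateSessionRequest(scanId: string, voice = "drew"): JsonRequest {
  return jsonPost("/api/voice/session", { scan_id: scanId, voice });
}

/** Request for POST /api/voice/end. */
function buildEndSessionRequest(sessionId: string, transcript: unknown[]): JsonRequest {
  return jsonPost("/api/voice/end", { session_id: sessionId, transcript });
}

/** Create a session and return the parsed response, throwing on HTTP errors. */
async function createVoiceSession(
  post: HttpPost,
  scanId: string,
  voice?: string,
): Promise<CreateSessionResponse> {
  const { url, init } = buildCreateSessionRequest(scanId, voice);
  const response = await post(url, init);
  if (!response.ok) {
    throw new Error(`Voice session request failed with status ${response.status}`);
  }
  return (await response.json()) as CreateSessionResponse;
}
```

Separating request construction from sending keeps the payload shape easy to check in isolation; in a browser, the `post` argument would typically be backed by `fetch`.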
See the API Reference for full documentation.