Voice Agent

The CyberDoc Voice Agent provides a real-time AI voice consultation after a scan is complete. Users can speak to the AI doctor about their findings, ask questions about specific vulnerabilities, and receive verbal prescriptions — all in natural conversational English.

How It Works

The voice agent is powered by the Grok Voice Agent API from xAI. When activated, the following sequence occurs:

1. Session creation: The client calls `POST /api/voice/session` with the scan ID. The backend creates a Grok Voice session and injects the scan results as system context so the agent understands the user's findings.
2. WebSocket connection: The client opens a WebSocket connection to the Grok Voice endpoint. Audio is streamed in both directions in real time.
3. Microphone access: The browser requests microphone permission. Audio is captured as 16 kHz mono PCM and streamed to the agent.
4. Agent response: The Grok agent processes the speech, generates a contextual response about the scan findings, and streams audio back. The client plays it through the speakers.
5. Session end: When the user ends the session, the transcript is saved to KV and a voice log entry is created for the admin dashboard.
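
The microphone capture step can be illustrated with a small conversion helper. This is a minimal sketch, not the actual client code: it assumes the 16 kHz mono PCM stream uses 16-bit signed little-endian samples (the bit depth is not stated above) and that the browser hands over audio as `Float32Array` frames in the range [-1, 1], as the Web Audio API does. The name `floatTo16BitPCM` is hypothetical.

```typescript
// Hypothetical helper: convert Web Audio float samples to 16-bit signed PCM.
// Assumption: the voice endpoint expects 16-bit little-endian samples.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  samples.forEach((sample, i) => {
    // Clamp to [-1, 1] so out-of-range input cannot overflow the int16 range.
    const clamped = Math.max(-1, Math.min(1, sample));
    // The int16 range is asymmetric: -32768..32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  });
  return buffer;
}
```

Each converted buffer could then be sent as a binary WebSocket frame, provided the endpoint accepts raw PCM in that form.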

User Interface

Main Voice Panel

The voice agent UI appears as an overlay panel within the CyberDoc terminal. It includes:

Oscilloscope — A green waveform visualisation (canvas-based) that reacts to both incoming agent audio and outgoing user speech. Renders at 60fps with the CyberDoc terminal aesthetic.
Status indicator — Shows the current state: CONNECTING, LISTENING, SPEAKING, PAUSED, or ENDED.
Transcript — A scrollable live transcript showing both user speech (cyan) and agent responses (green), updating in real-time as speech is recognised.
Duration counter — Elapsed session time in MM:SS format.
Cost estimate — Running cost estimate at $0.05/min, updated live.
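
The duration counter and cost estimate can be sketched as two small formatting helpers. This is an illustrative sketch only: the function names are hypothetical, and it assumes the estimate is prorated per second at the $0.05/min rate rather than rounded up to whole minutes.

```typescript
const RATE_USD_PER_MIN = 0.05; // rate stated in this document

// Format elapsed seconds as MM:SS (minutes keep counting past 59).
function formatDuration(totalSeconds: number): string {
  const secs = Math.max(0, Math.floor(totalSeconds));
  const mm = String(Math.floor(secs / 60)).padStart(2, "0");
  const ss = String(secs % 60).padStart(2, "0");
  return `${mm}:${ss}`;
}

// Running estimate for the panel, assuming per-second proration.
function formatCostEstimate(activeSeconds: number): string {
  const usd = (Math.max(0, activeSeconds) / 60) * RATE_USD_PER_MIN;
  return `$${usd.toFixed(2)}`;
}
```

For example, `formatDuration(75)` returns `"01:15"` and `formatCostEstimate(180)` returns `"$0.15"`.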

Controls

Button	Action
Start	Begin voice session (requests microphone permission)
Pause / Resume	Temporarily pause audio streaming without ending the session
Mute	Mute microphone input (agent can still speak)
End Session	Close the voice session and save transcript
Detach	Pop the voice panel into a separate floating window

Detachable Window

The voice agent panel can be detached into a separate draggable/resizable window. This allows the user to browse their scan report while simultaneously talking to the AI doctor. The detached window maintains the oscilloscope visualisation and live transcript.

Text Fallback

For users who cannot or prefer not to use voice (for example, because they are in a quiet office, have no microphone, or have accessibility needs), the voice agent includes a text input fallback. Users type their questions, and the agent replies with text-to-speech audio and a matching entry in the transcript.

Available Voices

The Grok Voice Agent API provides several voice options. CyberDoc defaults to `drew`, a calm, professional voice, but users can select any of the following:

Voice ID	Description	Character
`tina`	Female, professional	Clear and authoritative — good for clinical findings
`chloe`	Female, warm	Approachable and reassuring — good for anxious users
`drew`	Male, professional	Calm and measured — default CyberDoc voice
`james`	Male, conversational	Casual and friendly — good for non-technical users

Context Injection

When a voice session starts, the backend builds a system prompt from the following template, replacing each placeholder in braces with data from the scan:

You are CyberDoc, a cybersecurity doctor conducting a
follow-up consultation. The patient has just completed
a cyber health check-up. Here are their results:

Overall Score: {score}%
Status: {status}

Security Check Categories:
{category_breakdown}

Pen Test Findings:
{findings_list}

Deep Scan Results (if applicable):
{deep_scan_summary}

AI Diagnosis:
{diagnosis_text}

Speak as a doctor would — explain findings in plain
language, prioritise the most critical issues, and
provide clear remediation steps. Be reassuring but
honest about risks.

This ensures the voice agent has full context about the user's scan and can give specific, relevant advice rather than generic security tips.
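
The placeholder substitution could look roughly like the sketch below. It is an assumption about how the backend might fill the template, not a description of the actual implementation, and `fillTemplate` is a hypothetical name. It fails loudly on a missing value so that a raw placeholder such as `{findings_list}` never reaches the agent.

```typescript
// Hypothetical helper: replace {placeholder} tokens with scan data.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      // Refuse to send a prompt that still contains an unfilled placeholder.
      throw new Error(`Missing value for placeholder {${key}}`);
    }
    return value;
  });
}

// Usage with two lines of the template above (the values are made-up examples).
const promptHeader = fillTemplate("Overall Score: {score}%\nStatus: {status}", {
  score: "72",
  status: "Needs attention",
});
```

Because `replace` makes a single pass over the template, braces that happen to appear inside the inserted scan data are not expanded a second time.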

Language Support

The voice agent currently supports English only. Speech recognition handles common English accents (Australian, American, British, Indian). Support for other languages is planned for a future release.

Session Recording

Each voice session is recorded (transcript only, not audio) and stored in the VOICE_LOG KV namespace. The record includes:

Field	Description
`session_id`	Unique voice session identifier
`scan_id`	Associated scan ID
`voice`	Selected voice model
`duration_seconds`	Total session duration
`transcript`	Full text transcript of both sides
`cost_usd`	Estimated cost ($0.05 × active minutes)
`started_at`	ISO timestamp of session start
`ended_at`	ISO timestamp of session end

Transcripts are retained for 180 days (6 months) and are accessible from the voice logs section of the admin dashboard.
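
The record shape and retention period can be sketched as follows. Several details are assumptions made for illustration: the `TranscriptTurn` shape, keying records by `session_id`, and enforcing the 180-day retention through a key expiry. The `KVLike` interface is a hypothetical stand-in modelled on the `put(key, value, { expirationTtl })` call offered by stores such as Cloudflare Workers KV; this document does not state which KV product is used.

```typescript
const COST_PER_MINUTE_USD = 0.05;
const RETENTION_TTL_SECONDS = 180 * 24 * 60 * 60; // 180 days

// Assumed transcript shape; the real format is not specified here.
interface TranscriptTurn { speaker: "user" | "agent"; text: string }

interface VoiceLogRecord {
  session_id: string;
  scan_id: string;
  voice: string;
  duration_seconds: number;
  transcript: TranscriptTurn[];
  cost_usd: number;
  started_at: string; // ISO timestamp
  ended_at: string; // ISO timestamp
}

// Minimal stand-in for a KV namespace binding (hypothetical).
interface KVLike {
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

function buildVoiceLogRecord(input: {
  sessionId: string; scanId: string; voice: string;
  transcript: TranscriptTurn[]; startedAt: Date; endedAt: Date;
  activeSeconds: number; // session time excluding pauses
}): VoiceLogRecord {
  const elapsedMs = input.endedAt.getTime() - input.startedAt.getTime();
  return {
    session_id: input.sessionId,
    scan_id: input.scanId,
    voice: input.voice,
    duration_seconds: Math.round(elapsedMs / 1000),
    transcript: input.transcript,
    // Round the estimate to whole cents.
    cost_usd: Math.round((input.activeSeconds / 60) * COST_PER_MINUTE_USD * 100) / 100,
    started_at: input.startedAt.toISOString(),
    ended_at: input.endedAt.toISOString(),
  };
}

// Store the record with an expiry matching the retention period.
async function saveVoiceLog(kv: KVLike, record: VoiceLogRecord): Promise<void> {
  await kv.put(record.session_id, JSON.stringify(record), {
    expirationTtl: RETENTION_TTL_SECONDS,
  });
}
```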

Cost

The Grok Voice Agent API charges $0.05 per minute of active session time. A typical CyberDoc consultation lasts 3 to 7 minutes and therefore costs $0.15 to $0.35 per session. The cost counter is visible to the user in the voice panel.

Cost is calculated on active session time only — paused time is excluded. The cost estimate shown to users is approximate; actual billing from xAI may differ slightly.
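
One way to exclude paused time is to record each pause as an interval and subtract it from the elapsed time. The sketch below is illustrative: `PauseInterval` and `computeActiveSeconds` are hypothetical names, timestamps are assumed to be in milliseconds, and pauses are assumed not to overlap.

```typescript
interface PauseInterval {
  pausedAtMs: number;
  resumedAtMs: number;
}

// Active session time in seconds: elapsed time minus all paused time.
function computeActiveSeconds(
  startMs: number,
  endMs: number,
  pauses: PauseInterval[],
): number {
  const pausedMs = pauses.reduce((sum, p) => {
    // Clip each pause to the session window before counting it.
    const from = Math.max(p.pausedAtMs, startMs);
    const to = Math.min(p.resumedAtMs, endMs);
    return sum + Math.max(0, to - from);
  }, 0);
  return Math.max(0, (endMs - startMs - pausedMs) / 1000);
}
```

For a 7-minute session with one 2-minute pause, this gives 300 active seconds, which corresponds to an estimate of $0.25 at $0.05 per minute.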

Mobile Support

The voice agent is fully functional on mobile devices with the following considerations:

iOS Safari — Requires user gesture to start audio playback (handled by the Start button). Microphone access requires explicit permission grant.
Android Chrome — Full support including background audio (if screen is on).
Detachable window — Not available on mobile; the voice panel renders inline instead.
Text fallback — Particularly useful on mobile where voice input may be impractical in public.

Error Handling

Error	Cause	Recovery
Microphone denied	User denied browser permission	Show text fallback mode, prompt to allow mic in settings
WebSocket disconnect	Network interruption	Auto-reconnect with exponential backoff (3 attempts)
Session timeout	15 minutes of inactivity	Prompt user to restart or end session
API rate limit	Too many concurrent sessions	Show "busy" message, suggest trying again in 60 seconds
Audio playback blocked	Browser autoplay policy (iOS)	Require user tap before starting audio output
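
The WebSocket recovery row can be sketched as a retry loop with exponential backoff. Only the limit of 3 attempts comes from the table above; the 1-second base delay, the doubling factor, and the function names are assumptions for illustration.

```typescript
const MAX_ATTEMPTS = 3; // from the table above
const BASE_DELAY_MS = 1000; // assumption: not specified in this document

// Delay before a given attempt (1-based): 1000, 2000, 4000 ms by default.
function backoffDelayMs(attempt: number, baseMs: number = BASE_DELAY_MS): number {
  return baseMs * Math.pow(2, attempt - 1);
}

// Wait, try to reconnect, and double the wait after each failure.
async function reconnectWithBackoff<T>(
  connect: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    await sleep(backoffDelayMs(attempt));
    try {
      return await connect();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // all attempts failed; the caller can surface an error
}
```

Passing `sleep` as a parameter keeps the loop testable without real timers. Production code might also add random jitter so that many clients do not reconnect at the same instant.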

API Endpoints

The voice agent uses two API endpoints:

// Create voice session
POST /api/voice/session
Body: { "scan_id": "scan_abc123", "voice": "drew" }
Response: { "session_id": "vs_xyz", "ws_url": "wss://..." }

// End session and save transcript
POST /api/voice/end
Body: { "session_id": "vs_xyz", "transcript": [...] }
Response: { "voice_log_id": "vl_123" }

See the API Reference for full documentation.
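
A client wrapper for the two endpoints might look like the sketch below. The paths and the request and response field names are taken from the examples above; everything else (the function names, the `FetchLike` type, JSON request headers, and the error handling) is an assumption for illustration. The transcript element type is left as `unknown` because its format is not shown here.

```typescript
// Minimal fetch-compatible signature so the sketch can be tested with a fake.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface CreateSessionResponse { session_id: string; ws_url: string }
interface EndSessionResponse { voice_log_id: string }

async function postJson<T>(fetchImpl: FetchLike, path: string, body: unknown): Promise<T> {
  const res = await fetchImpl(path, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed with HTTP ${res.status}`);
  return (await res.json()) as T;
}

// Create a voice session for a scan; "drew" is the documented default voice.
function createVoiceSession(
  fetchImpl: FetchLike,
  scanId: string,
  voice: string = "drew",
): Promise<CreateSessionResponse> {
  return postJson<CreateSessionResponse>(fetchImpl, "/api/voice/session", {
    scan_id: scanId,
    voice,
  });
}

// End the session and submit the transcript for storage.
function endVoiceSession(
  fetchImpl: FetchLike,
  sessionId: string,
  transcript: unknown[],
): Promise<EndSessionResponse> {
  return postJson<EndSessionResponse>(fetchImpl, "/api/voice/end", {
    session_id: sessionId,
    transcript,
  });
}
```

In a browser, `(url, init) => fetch(url, init)` can be passed as `fetchImpl`, and the returned `ws_url` is the address the client would then open with `new WebSocket(...)`.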