CYBERDOC DOCS

Voice Agent

The CyberDoc Voice Agent provides a real-time AI voice consultation after a scan is complete. Users can speak to the AI doctor about their findings, ask questions about specific vulnerabilities, and receive verbal prescriptions — all in natural conversational English.

How It Works

The voice agent is powered by the Grok Voice Agent API from xAI. When activated, the following sequence occurs:

  1. Session creation: The client calls POST /api/voice/session with the scan ID. The backend creates a Grok Voice session and injects the scan results as system context so the agent understands the user's findings.
  2. WebSocket connection: The client opens a WebSocket connection to the Grok Voice endpoint. Audio streams in both directions in real time.
  3. Microphone access: The browser requests microphone permission. Audio is captured as 16 kHz mono PCM and streamed to the agent.
  4. Agent response: The Grok agent processes the speech, generates a contextual response about the scan findings, and streams audio back. The client plays it through the speakers.
  5. Session end: When the user ends the session, the transcript is saved to KV and a voice log entry is created for the admin dashboard.
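
Step 3 implies a conversion from the floating-point samples that browser audio APIs typically produce to PCM. A minimal sketch of that conversion follows, assuming signed 16-bit samples (the text above specifies only 16 kHz mono PCM). The helper name floatTo16BitPCM is hypothetical, and resampling to 16 kHz is not shown.

```typescript
// Convert floating-point samples in [-1, 1] to signed 16-bit PCM.
// Assumption: 16-bit sample width; the documentation only says "16 kHz mono PCM".
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  samples.forEach((value, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, value));
    // The int16 range is asymmetric: -32768..32767.
    pcm[i] = clamped < 0 ? Math.round(clamped * 32768) : Math.round(clamped * 32767);
  });
  return pcm;
}
```

The resulting Int16Array's underlying buffer is what a client would likely send over the WebSocket as a binary frame, though the exact framing depends on the Grok Voice endpoint.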

User Interface

Main Voice Panel

The voice agent UI appears as an overlay panel within the CyberDoc terminal. It includes:

  • Oscilloscope: A green, canvas-based waveform visualisation that reacts to both incoming agent audio and outgoing user speech. It renders at 60 fps in the CyberDoc terminal aesthetic.
  • Status indicator: Shows the current state: CONNECTING, LISTENING, SPEAKING, PAUSED, or ENDED.
  • Transcript: A scrollable live transcript showing user speech (cyan) and agent responses (green), updating in real time as speech is recognised.
  • Duration counter: Elapsed session time in MM:SS format.
  • Cost estimate: Running cost estimate at $0.05/min, updated live.
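
The MM:SS duration counter can be sketched as a small formatting function. This is an illustrative helper, not the actual CyberDoc implementation; the name formatDuration is an assumption.

```typescript
// Format elapsed seconds as MM:SS for the duration counter.
// Hypothetical helper; fractional seconds are truncated, negatives clamp to 0.
function formatDuration(totalSeconds: number): string {
  const whole = Math.max(0, Math.floor(totalSeconds));
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(minutes) + ":" + pad(seconds);
}

const shownDuration = formatDuration(65); // "01:05"
```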

Controls

Button          Action
Start           Begin voice session (requests microphone permission)
Pause / Resume  Temporarily pause audio streaming without ending the session
Mute            Mute microphone input (agent can still speak)
End Session     Close the voice session and save transcript
Detach          Pop the voice panel into a separate floating window
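
One way to relate these controls to the status indicator is a small transition function. The sketch below is an assumption about how the states might connect: the event names are hypothetical, and the documentation does not specify the actual transitions. Mute and Detach are omitted because, as described, they do not change the session status.

```typescript
type VoiceStatus = "CONNECTING" | "LISTENING" | "SPEAKING" | "PAUSED" | "ENDED";

// Hypothetical event names, chosen for illustration only.
type VoiceEvent =
  | "connected"
  | "agent_started"
  | "agent_finished"
  | "pause"
  | "resume"
  | "end";

// Return the next status; events that do not apply leave the status unchanged.
function nextStatus(current: VoiceStatus, event: VoiceEvent): VoiceStatus {
  if (current === "ENDED") return "ENDED"; // terminal state
  if (event === "end") return "ENDED"; // End Session works from any live state
  switch (current) {
    case "CONNECTING":
      return event === "connected" ? "LISTENING" : current;
    case "LISTENING":
      if (event === "agent_started") return "SPEAKING";
      if (event === "pause") return "PAUSED";
      return current;
    case "SPEAKING":
      if (event === "agent_finished") return "LISTENING";
      if (event === "pause") return "PAUSED";
      return current;
    case "PAUSED":
      return event === "resume" ? "LISTENING" : current;
    default:
      return current;
  }
}
```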

Detachable Window

The voice agent panel can be detached into a separate draggable/resizable window. This allows the user to browse their scan report while simultaneously talking to the AI doctor. The detached window maintains the oscilloscope visualisation and live transcript.

Text Fallback

For users who cannot or prefer not to use voice (for example, in a quiet office, without a microphone, or because of accessibility needs), the voice agent includes a text input fallback. Users type their questions, and the agent responds with text-to-speech audio plus text in the transcript.

Available Voices

The Grok Voice Agent API provides several voice options. CyberDoc defaults to a professional, calm voice but users can select from:

Voice ID  Description           Character
tina      Female, professional  Clear and authoritative; good for clinical findings
chloe     Female, warm          Approachable and reassuring; good for anxious users
drew      Male, professional    Calm and measured; default CyberDoc voice
james     Male, conversational  Casual and friendly; good for non-technical users
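
A client or backend might validate the requested voice against this list and fall back to the default. The sketch below assumes that unknown or missing values should resolve to drew; the helper name resolveVoice is hypothetical.

```typescript
// Voice IDs taken from the table above.
const VOICE_IDS = ["tina", "chloe", "drew", "james"] as const;
type VoiceId = (typeof VOICE_IDS)[number];

// drew is documented as the default CyberDoc voice.
const DEFAULT_VOICE: VoiceId = "drew";

// Assumption: unknown or missing voice IDs fall back to the default.
function resolveVoice(requested?: string): VoiceId {
  for (const id of VOICE_IDS) {
    if (id === requested) return id;
  }
  return DEFAULT_VOICE;
}
```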

Context Injection

When a voice session starts, the backend constructs a system prompt that includes:

You are CyberDoc, a cybersecurity doctor conducting a
follow-up consultation. The patient has just completed
a cyber health check-up. Here are their results:

Overall Score: {score}%
Status: {status}

Security Check Categories:
{category_breakdown}

Pen Test Findings:
{findings_list}

Deep Scan Results (if applicable):
{deep_scan_summary}

AI Diagnosis:
{diagnosis_text}

Speak as a doctor would — explain findings in plain
language, prioritise the most critical issues, and
provide clear remediation steps. Be reassuring but
honest about risks.

This ensures the voice agent has full context about the user's scan and can give specific, relevant advice rather than generic security tips.
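
Filling the placeholders in the prompt above can be sketched as a simple template substitution. This is a minimal illustration, not the backend's actual code: fillTemplate is a hypothetical helper, the sample values are invented, and the choice to throw on a missing placeholder is an assumption made so that raw placeholders never reach the agent.

```typescript
// Replace {name} placeholders with values; throw if a value is missing.
function fillTemplate(template: string, values: { [key: string]: string }): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(values, key)
      ? values[key]
      : undefined;
    if (value === undefined) {
      throw new Error("Missing value for placeholder: " + key);
    }
    return value;
  });
}

// Two lines of the system prompt, filled with sample (invented) values.
const promptExcerpt = "Overall Score: {score}%\nStatus: {status}";
const filledExcerpt = fillTemplate(promptExcerpt, {
  score: "72",
  status: "Needs attention",
});
```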

Language Support

The voice agent currently supports English only. Speech recognition handles common English accents (Australian, American, British, Indian). Support for other languages is planned for a future release.

Session Recording

Each voice session is recorded (transcript only, not audio) and stored in the VOICE_LOG KV namespace. The record includes:

Field             Description
session_id        Unique voice session identifier
scan_id           Associated scan ID
voice             Selected voice model
duration_seconds  Total session duration
transcript        Full text transcript of both sides
cost_usd          Estimated cost ($0.05 × minutes)
started_at        ISO timestamp of session start
ended_at          ISO timestamp of session end

Transcripts are retained for 180 days (about 6 months) and are accessible from the voice logs section of the admin dashboard.
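
The record shape and the retention window can be sketched as follows. The field names come from the table above, but the field types are assumptions (in particular, the transcript is typed as a string here, and its real shape is not specified). Measuring retention from ended_at is also an assumption.

```typescript
// Shape of a VOICE_LOG record; types are assumed, names are from the table.
interface VoiceLogRecord {
  session_id: string;
  scan_id: string;
  voice: string;
  duration_seconds: number;
  transcript: string; // assumption: stored as text
  cost_usd: number;
  started_at: string; // ISO timestamp
  ended_at: string; // ISO timestamp
}

const RETENTION_DAYS = 180;
// 180 days expressed in seconds, the unit most KV stores use for a TTL.
const RETENTION_SECONDS = RETENTION_DAYS * 24 * 60 * 60;

// Assumption: the retention window starts when the session ends.
function retentionExpiry(endedAtIso: string): string {
  const endedMs = new Date(endedAtIso).getTime();
  return new Date(endedMs + RETENTION_SECONDS * 1000).toISOString();
}
```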

Cost

The Grok Voice Agent API charges $0.05 per minute of active session time. A typical CyberDoc consultation lasts 3 to 7 minutes and costs $0.15 to $0.35 per session. The cost counter is visible to the user in the voice panel.

Cost is calculated on active session time only; paused time is excluded. The estimate shown to users is approximate, and actual billing from xAI may differ slightly.
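
The running estimate can be sketched as below. It assumes the cost is prorated by the second and rounded to the nearest cent for display; whether xAI bills in whole minutes is not stated here, so treat this as an approximation. The function name estimateCostUsd is hypothetical.

```typescript
const RATE_USD_PER_MINUTE = 0.05;

// Estimate session cost from total and paused seconds.
// Assumptions: per-second proration, rounding to the nearest cent.
function estimateCostUsd(totalSeconds: number, pausedSeconds: number): number {
  const activeSeconds = Math.max(0, totalSeconds - pausedSeconds);
  const cost = (activeSeconds / 60) * RATE_USD_PER_MINUTE;
  return Math.round(cost * 100) / 100;
}

const threeMinutes = estimateCostUsd(180, 0); // 0.15
const sevenMinutes = estimateCostUsd(420, 0); // 0.35
```

The two sample calls reproduce the $0.15 and $0.35 figures for 3-minute and 7-minute consultations.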

Mobile Support

The voice agent is fully functional on mobile devices with the following considerations:

  • iOS Safari: Requires a user gesture to start audio playback (handled by the Start button). Microphone access requires an explicit permission grant.
  • Android Chrome: Full support, including background audio while the screen is on.
  • Detachable window: Not available on mobile; the voice panel renders inline instead.
  • Text fallback: Particularly useful on mobile, where voice input may be impractical in public.

Error Handling

Error                   Cause                           Recovery
Microphone denied       User denied browser permission  Show text fallback mode, prompt to allow mic in settings
WebSocket disconnect    Network interruption            Auto-reconnect with exponential backoff (3 attempts)
Session timeout         15 minutes of inactivity        Prompt user to restart or end session
API rate limit          Too many concurrent sessions    Show "busy" message, suggest trying again in 60 seconds
Audio playback blocked  Browser autoplay policy (iOS)   Require user tap before starting audio output
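
The reconnect schedule for a WebSocket disconnect can be sketched as a list of doubling delays. The 3-attempt limit comes from the table above; the 1-second base delay and the helper name backoffDelaysMs are assumptions, and production code often adds random jitter, which is omitted here.

```typescript
// Compute the wait before each reconnect attempt, doubling each time.
// Assumption: the base delay is chosen by the caller; no jitter is applied.
function backoffDelaysMs(maxAttempts: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(baseMs * Math.pow(2, attempt));
  }
  return delays;
}

const reconnectDelays = backoffDelaysMs(3, 1000); // [1000, 2000, 4000]
```

After the last delay is exhausted without a successful reconnect, the client would presumably surface the failure to the user rather than retry indefinitely.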

API Endpoints

The voice agent uses two API endpoints:

// Create voice session
POST /api/voice/session
Body: { "scan_id": "scan_abc123", "voice": "drew" }
Response: { "session_id": "vs_xyz", "ws_url": "wss://..." }

// End session and save transcript
POST /api/voice/end
Body: { "session_id": "vs_xyz", "transcript": [...] }
Response: { "voice_log_id": "vl_123" }

See the API Reference for full documentation.
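
As an illustration of the two calls above, the following sketch builds the method, path, and JSON body for each request without sending anything. The HttpRequestSpec type and both function names are hypothetical, and the shape of individual transcript entries is not specified, so they are typed as unknown.

```typescript
// Hypothetical description of a request, ready to hand to any HTTP client.
interface HttpRequestSpec {
  method: "POST";
  path: string;
  body: string;
}

// Build the request that creates a voice session.
function createSessionRequest(scanId: string, voice: string): HttpRequestSpec {
  return {
    method: "POST",
    path: "/api/voice/session",
    body: JSON.stringify({ scan_id: scanId, voice: voice }),
  };
}

// Build the request that ends a session and saves its transcript.
// Assumption: transcript entries are opaque to this helper.
function endSessionRequest(sessionId: string, transcript: unknown[]): HttpRequestSpec {
  return {
    method: "POST",
    path: "/api/voice/end",
    body: JSON.stringify({ session_id: sessionId, transcript: transcript }),
  };
}

const createSpec = createSessionRequest("scan_abc123", "drew");
```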