This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
In Australia, 1,300 people died on the roads in 2024 — the highest toll in over a decade. 179 of those deaths involved heavy vehicles. Fatigue contributes to up to 30% of fatal crashes. The economic cost of road trauma exceeds $27 billion a year.
The technology to detect driver fatigue already exists. Fatigue cameras can spot droopy eyes, yawning, and head nods. Telematics systems can flag lane drift and harsh braking. But all of these systems have the same bottleneck: a human fleet manager has to see the alert and act on it. That doesn't scale when you're running hundreds of trucks across remote Australian highways.
So I built Betty.
Betty is an AI voice companion that proactively calls truck drivers when safety systems detect danger. She doesn't send a notification. She doesn't flash a light on a dashboard. She picks up the phone and has a real conversation.
When a fatigue camera detects a driver's eyes drooping at 2am on the Nullarbor, Betty calls them:
"Hey Dazza, it's Betty. How's the drive going? I saw a little blip come through on your end — just wanted to make sure you're doing alright."
She sounds warm, Australian, and genuinely caring — not like a corporate system reading a script. If the driver sounds tired, she'll suggest a rest stop and send a visual card to their screen with the location, distance, and facilities. If they refuse to stop despite being clearly fatigued, she escalates to the fleet manager.
And she remembers. If Betty called 30 minutes ago and suggested pulling over at Southern Cross, she'll follow up: "Did you end up stopping like we talked about?"
The core of Betty is the Gemini Live API using gemini-2.5-flash-native-audio-latest with the Aoede voice. This gives Betty real-time bidirectional voice streaming — she can hear the driver and respond naturally, including interruptions, pauses, and emotional tone.
What makes Gemini's native audio model special for this use case is that it picks up emotional nuance. It can hear when someone sounds tired vs annoyed vs anxious. That matters when you're talking to a truck driver at 3am who doesn't want to be told what to do.
I'm using the Google GenAI SDK (google-genai) for session management, audio streaming, and function calling. Betty can call tools mid-conversation — checking the driver's hours, looking up recent events, sending visual cards, or escalating to management — all while continuing to talk naturally.
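To give a feel for the shape of this (a simplified sketch, not Betty's production code — the `check_driver_hours` tool, the mock response, and the `play` helper are all illustrative), a Live session with audio streaming and a tool looks roughly like:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Illustrative tool Betty can call mid-conversation.
check_hours = types.FunctionDeclaration(
    name="check_driver_hours",
    description="Return how long the driver has been on shift.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"driver_id": types.Schema(type=types.Type.STRING)},
        required=["driver_id"],
    ),
)

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Aoede")
        )
    ),
    tools=[types.Tool(function_declarations=[check_hours])],
)

async def run_call(pcm_chunks):
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-latest", config=config
    ) as session:
        for chunk in pcm_chunks:  # 16 kHz, 16-bit mono PCM from the driver
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
        async for response in session.receive():
            if response.data:       # 24 kHz audio of Betty speaking
                play(response.data)  # hypothetical playback helper
            if response.tool_call:  # Betty decided to call a tool
                for fc in response.tool_call.function_calls:
                    await session.send_tool_response(
                        function_responses=[types.FunctionResponse(
                            id=fc.id, name=fc.name,
                            response={"hours_on_shift": 9.5},  # mock data
                        )]
                    )
```

The important property is that tool calls arrive inside the same streaming loop as the audio, so Betty can check the driver's hours without breaking the flow of the conversation.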
When Betty recommends a rest stop, she doesn't make one up. Gemini Flash with Google Search grounding looks up the actual location — what facilities it has, what the area looks like, whether there's fuel and food available. This grounds Betty's recommendations in reality.
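The grounded lookup is a single non-streaming call; roughly (the prompt wording here is illustrative):

```python
from google import genai
from google.genai import types

client = genai.Client()

def describe_rest_stop(name: str, highway: str) -> str:
    """Ask Gemini Flash, grounded with Google Search, what a rest stop actually offers."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=(
            f"Describe the {name} rest area on the {highway}: "
            "facilities, fuel, food, and what the surrounding landscape looks like. "
            "Two or three sentences."
        ),
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())]
        ),
    )
    return resp.text
```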
The rest stop recommendation cards that appear on the driver's screen include a scenic background photograph generated by Imagen 4. The prompt is built from the Search-grounded description, so each card shows a relevant Australian landscape.
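The image step takes that grounded text and feeds it into an Imagen prompt — something along these lines (the model ID and prompt template are assumptions, not the exact production values):

```python
from google import genai
from google.genai import types

client = genai.Client()

def card_background(grounded_description: str) -> bytes:
    """Turn the Search-grounded rest-stop description into a card background image."""
    result = client.models.generate_images(
        model="imagen-4.0-generate-001",  # assumed model ID
        prompt=(
            "Scenic photograph of an Australian highway rest stop at dusk. "
            f"Setting: {grounded_description}"
        ),
        config=types.GenerateImagesConfig(number_of_images=1, aspect_ratio="16:9"),
    )
    return result.generated_images[0].image.image_bytes
```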
For the demo, I needed realistic fatigue camera footage without filming actual drivers. Google VEO generated 12 simulated in-cab dashcam clips — a driver with droopy eyes, yawning, nodding off, using a phone, and various erratic driving scenarios. These get extracted as JPEG frames and sent to Gemini during calls so Betty can "see" what the fatigue camera detected.
The whole application deploys to Cloud Run with a single command. The deploy script builds the Docker container via Cloud Build, pushes to Artifact Registry, and deploys with session affinity (critical for WebSocket call continuity), autoscaling from 0-3 instances, and environment-based secrets. It scales to zero when nobody's using it, so the cost for the entire hackathon has been under $10.
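The core of the deploy command looks roughly like this (service name, region, and secret names are illustrative):

```shell
# --source: Cloud Build builds the container and pushes it to Artifact Registry
# --session-affinity: keeps a WebSocket call pinned to one instance
gcloud run deploy betty \
  --source . \
  --region australia-southeast1 \
  --session-affinity \
  --min-instances 0 \
  --max-instances 3 \
  --set-secrets "GEMINI_API_KEY=gemini-api-key:latest"
```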
The biggest technical challenge was making Betty demo-able without requiring a microphone. Judges, investors, and conference attendees can't all plug in a headset.
The solution: two Gemini Live sessions talking to each other. One runs Betty (Aoede voice), the other runs a simulated driver persona (Puck voice) with configurable mood, situation, and resistance level. Audio from one session gets resampled (24kHz to 16kHz and back) and streamed to the other in real-time.
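The core of the resampling step can be sketched in pure Python with linear interpolation — demo-grade, not what a production audio bridge should ship, but it shows the shape of the problem:

```python
import array

def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample 16-bit mono PCM between rates (e.g. 24000 -> 16000) using
    linear interpolation. A real bridge wants a polyphase or windowed-sinc
    resampler; this is just the idea in miniature."""
    src = array.array("h")
    src.frombytes(pcm)
    if not src:
        return b""
    n_out = max(1, round(len(src) * dst_rate / src_rate))
    out = array.array("h", bytes(2 * n_out))
    step = (len(src) - 1) / max(n_out - 1, 1)
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(src) - 1)
        frac = pos - lo
        out[i] = int(src[lo] + (src[hi] - src[lo]) * frac)
    return out.tobytes()
```

Gemini Live emits 24 kHz audio and accepts 16 kHz input, so the bridge converts in both directions on every chunk.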
The result is a full, natural conversation that anyone can watch just by clicking a button. The simulated driver will be grumpy, cooperative, defensive, or exhausted depending on the persona settings. Betty adapts her approach accordingly.
Getting the audio bridging right was harder than expected. The resampling has to be precise — any drift and the conversation degrades. Turn-taking management was critical to prevent crosstalk. And mixing in continuous cabin noise (engine and road sounds) makes it sound like a real phone call from a truck cab.
Real phone conversations aren't polite turn-by-turn exchanges. People talk over each other — especially a grumpy truck driver who's been told to pull over for the third time.
Betty handles this naturally. When a simulated driver persona has the interrupts trait, the system lets Betty speak for 1.5–3 seconds, then breaks mid-sentence and has the driver cut in with something abrupt:
Betty: "I can see you on the camera, Graeme. Your eyes are drooping, you're—"

Driver: "Nah, listen, I just had a blue with dispatch, they don't get it."
The driver's audio is streamed directly into Betty's Gemini session, triggering a real server-side barge-in. Betty stops generating, processes what the driver said, and her next response addresses the interruption naturally — she doesn't repeat herself or continue from where she was cut off.
This required solving a few problems:
- No dead air. The driver's response has to start immediately after the break, not after a multi-second drain of Betty's buffered output.
- Stale content flushing. If Betty was cut off mid-phrase (e.g. after "it's"), the leftover ("Betty!") would leak into her next response. A brief post-interrupt flush clears this without perceptible silence.
- Session stability. Too many rapid interrupts can destabilise a Gemini Live session, so interrupts are capped at three per call, with the first one guaranteed.
- Goodbye detection. Without it, both sides would enter an endless loop of Australian farewells — "See ya." "Hooroo." "Cheers." "Take care." — which is very Australian but not a great demo. The system detects when both sides have said goodbye and ends the call.
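The goodbye detector is the least glamorous part but it saved the demo. A simplified version (the real phrase list and matching are more careful than a substring check):

```python
# End the call once BOTH sides have said a farewell, rather than
# letting the pleasantries loop forever.
FAREWELLS = ("see ya", "hooroo", "cheers", "take care", "bye", "catch you")

def is_farewell(utterance: str) -> bool:
    text = utterance.lower()
    return any(phrase in text for phrase in FAREWELLS)

class GoodbyeTracker:
    def __init__(self) -> None:
        self.said = {"betty": False, "driver": False}

    def observe(self, speaker: str, utterance: str) -> bool:
        """Record an utterance; return True when the call should end."""
        if is_farewell(utterance):
            self.said[speaker] = True
        return all(self.said.values())
```

Only when both flags are set does the bridge hang up — one-sided "Hooroo" keeps the call alive in case the other party has more to say.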
The result is conversations that sound like real phone calls, not scripted dialogues.
One of the features I'm most proud of is Betty's shift memory. Each driver's conversation history is encrypted with AES-256-GCM using per-driver keys derived via HKDF. The memory auto-expires after 14 hours (one shift).
When Betty starts a new call, the memory summary gets injected into her system prompt:
- 25 min ago: Driver was drowsy, agreed to stop at Southern Cross. (fatigue: fatigued) (action: encouraged_break)
- 2.1h ago: Companion check-in, driver mentioned his daughter's birthday tomorrow.
This lets Betty have continuity across calls. She can reference earlier conversations naturally, which makes her feel less like a bot and more like a companion who's been riding along all day.
Writing prompts for voice AI is fundamentally different from text AI. A prompt that produces great written responses will sound terrible when spoken aloud.
Key lessons:
- Brevity is everything. 2-3 sentences max per turn. Nobody wants to listen to a paragraph.
- Natural phrasing. "How's the drive going?" not "How is your current driving experience?"
- Personality over precision. Betty says "I saw a little blip come through", not "The fatigue monitoring system registered a drowsiness event at severity level high."
- Mood adaptation. The prompt tells Betty to back off if the driver sounds irritated, chat longer if they sound lonely, and be firm but kind if they sound dangerously tired.
This is a hackathon prototype, but the path to real-world deployment is clear:
- Connect to a telephony service (e.g. Twilio) so real-world conversations can take place
- Connect to actual telematics APIs (Seeing Machines, Lytx) instead of mock data
- Multi-language support for Australia's diverse trucking workforce
- Shift-over-shift wellness trending to spot patterns across weeks
- Let drivers call Betty proactively when they're feeling lonely or stressed on long hauls
- Expand to mining, aviation, maritime — any industry where fatigue kills
No microphone needed. Simulation mode is on by default — just click and watch Betty in action.
Built by Graeme Klass (Klassic Studios) for the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge