15 KiB
Chobit Architecture
Overview
Chobit is an interactive AI companion that lives on the user's desktop as a transparent overlay with a 3D animated character. It coordinates voice interaction (STT/TTS) with LLM-driven conversation and real-time avatar animation.
The client is a Godot 4 application. Backend ML services (@speech-synthesis, @model-boss) run separately.
System Diagram
┌──────────────────────────────────────────────────────────────┐
│ Godot 4 App (transparent desktop overlay) │
│ │
│ ┌────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │ Microphone │ │ Conversation │ │ VRM Avatar │ │
│ │ Input │ │ Orchestrator │ │ │ │
│ │ │ │ │ │ Skeleton │ │
│ │ VAD │ │ State Machine │ │ Blendshapes │ │
│ │ (Silero/energy) │──│ Sentence Stream │──│ AnimationTree│ │
│ │ │ │ Emotion Extract │ │ IK / LookAt │ │
│ │ AudioEffectCapt │ │ Interrupt Ctrl │ │ Lipsync │ │
│ └────────────────┘ └────────┬────────┘ └──────────────┘ │
│ │ │
│ ┌────────────────┐ │ │
│ │ Camera Input │ │ │
│ │ │ │ │
│ │ Webcam Feed │ │ │
│ │ Gesture Classif│───────────┘ │
│ │ Face Detection │ │
│ └────────────────┘ │
│ │
│ ┌──────────────┼──────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ STT │ │ LLM │ │ TTS │ │
│ │ Client │ │ Client │ │ Client │ │
│ │ (HTTP) │ │ (HTTP/WS)│ │ (HTTP) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
└────────────────┼──────────────┼──────────────┼──────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────────────────────────┐
│ Backend Services │
│ │
│ @speech-synthesis @model-boss │
│ ├─ Whisper STT ├─ GPU leases │
│ └─ Chatterbox TTS └─ LLM routing │
│ │
│ Any OpenAI-compatible LLM endpoint │
│ or LifeAI companion service │
└───────────────────────────────────────┘
Attention System (Dual-Mode Gaze)
Chobit has two attention modes that determine where the avatar looks and how it responds to the user:
Desktop Gaze (Ambient Mode)
The avatar tracks what the user is doing on screen. The companion is "with you" while you work.
- Eyes/head follow cursor position — LookAt target is the mouse pointer mapped to 3D space
- Active during idle state — the default when no conversation is happening
- Ambient reactions — occasional glances at notification areas, screen edges, active windows
- Subtle personality — random look-away moments, stretches, yawns (not a robotic cursor tracker)
Face-to-Face (Conversation Mode)
The webcam activates and the avatar looks at the user directly. Mutual eye contact.
- Gaze target is the user's face — detected via webcam, avatar maintains eye contact
- Active during conversation — listening, processing, speaking states
- Facial awareness — can detect user's general expression for responsive reactions
- Triggered by VAD — speech detection switches from Desktop Gaze to Face-to-Face
Mode Transitions
Transitions map to the ConversationState FSM:
| State | Attention Mode | Behavior |
|---|---|---|
idle |
Desktop Gaze | Tracks cursor, ambient companion |
listening |
Face-to-Face | Webcam active, looks at user, attentive posture |
processing |
Face-to-Face | Maintains eye contact, thinking pose |
speaking |
Face-to-Face | Engaged, gesturing, eye contact |
interrupted |
Face-to-Face | Brief surprise, then back to listening |
Return to idle |
Desktop Gaze | Gradual drift back to screen tracking |
The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.
Motion Mirroring System
A showcase feature where the avatar mimics the user's gestures detected via webcam. This is methodologically distinct from skeleton-driven tracking:
Mirroring (what we do) vs Tracking (what we don't)
| Approach | How it works | Result |
|---|---|---|
| Mirroring (ours) | Classify gesture → trigger pre-made animation | Curated, expressive, companion-like |
| Tracking (rejected) | Map user skeleton → avatar skeleton in real-time | Puppet-like, jittery, uncanny |
Mirroring means the avatar is a personality that responds to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.
Gesture Classification Pipeline
Webcam Frame
│
▼
Pose Detection (MediaPipe / lightweight model)
│
▼
Gesture Classifier
├── wave → play wave_back animation
├── head_cock → play head_tilt animation (mirrored)
├── nod → play nod animation
├── head_shake → play head_shake animation
├── lean_forward → play lean_in animation
├── hand_raise → play greeting animation
├── thumbs_up → play happy_react animation
└── unknown → no action (ignore)
│
▼
Animation Trigger (via EventBus)
│
▼
AnimationTree plays the corresponding animation
with personality variation (speed, amplitude randomization)
Key Properties
- Deliberate delay — 0.2-0.5s response time feels natural, not robotic
- Personality variance — same gesture doesn't always trigger the exact same animation
- Selective response — avatar doesn't mirror everything; chooses what to react to
- Layered on conversation — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations
- Graceful when no camera — falls back to Desktop Gaze only, no degraded experience
Gesture Detection Approach
Two viable approaches (decision deferred to implementation):
- MediaPipe Holistic — full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
- Lightweight CNN classifier — trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.
Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.
Conversation Loop
1. VAD detects speech end
└─▶ AudioEffectCapture buffer captured by Godot audio server
2. Audio sent to STT service
└─▶ HTTP POST to chatterbox-tts-service /api/stt
└─▶ Returns transcribed text
3. Text + history sent to LLM backend
└─▶ HTTP streaming request (SSE or chunked response)
└─▶ Tokens arrive incrementally
4. SentenceStream buffers tokens into complete sentences
└─▶ Each sentence immediately sent to TTS
└─▶ First sentence plays while LLM still generates
5. EmotionExtractor strips [emotion] tags from each sentence
└─▶ AnimationTree transitions to matching expression
└─▶ TTS exaggeration parameter adjusted
6. TTS synthesizes speech per-sentence
└─▶ Audio returned from chatterbox-tts-service
└─▶ Played via AudioStreamPlayer
7. Lipsync drives mouth blendshape
└─▶ AudioEffectSpectrumAnalyzer reads playback amplitude
└─▶ Mapped to 'aa' (mouth open) blendshape per frame
8. On completion, AnimationTree returns to idle state
└─▶ VAD resumes listening
Voice Interruption
When the user speaks while the AI is talking:
- VAD detects speech onset during
speakingstate interrupt()called on the conversation orchestrator- HTTP request to LLM aborted (stream cancelled)
- AudioStreamPlayer stopped immediately
- Partial response saved with
[interrupted]marker in history - AnimationTree: speaking → interrupted (brief surprise) → listening
Desktop Overlay
Godot 4 transparent window configuration:
# In project.godot or at runtime:
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)
# Transparent viewport
get_viewport().transparent_bg = true
# Click-through on transparent pixels (optional)
# Handled via input event detection on the character mesh
The result: the character floats on the desktop with no window chrome, visible above all other windows, with only the character model and minimal UI elements being interactive.
Animation Architecture
AnimationTree (AnimationNodeStateMachine)
│
├─ idle
│ ├─ Breathing: sine wave on chest/shoulder bones (always active)
│ ├─ Blink: random interval (2-6s), VRM 'blink' blendshape
│ ├─ Sway: subtle Perlin noise on hip/spine rotation
│ └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)
│
├─ listening
│ ├─ Head tilt toward user (Face-to-Face gaze)
│ ├─ Attentive posture (slight forward lean)
│ └─ Crossfade from idle (0.3s transition)
│
├─ processing
│ ├─ Look-away (eyes drift, head turns slightly)
│ ├─ Thinking pose (hand to chin, or finger tap)
│ └─ Subtle idle maintained underneath
│
├─ speaking
│ ├─ Engaged posture (shoulders open, slight forward lean)
│ ├─ Gesture layer (hand movements on sentence breaks)
│ ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)
│ └─ Expression layer (emotion blendshapes from tags)
│
├─ interrupted
│ ├─ Brief surprise expression (0.2s)
│ └─ Transition to listening (0.3s)
│
└─ mirroring (overlay layer, active in Face-to-Face mode)
├─ Gesture response animations (wave, nod, tilt, etc.)
├─ Blended on top of current state animation
└─ Priority: mirroring < speaking gestures < lipsync
Expression Blend Layer (runs on top of body animations):
AnimationNodeBlendTree with 6 emotion inputs
Smooth weight interpolation (lerp, ~0.3s transition)
Driven by EmotionExtractor output
Emotion System
The LLM is prompted to embed emotion tags inline:
"[joy] That sounds wonderful! [curiosity] Tell me more about your day."
28 extended emotions map to 6 VRM blendshapes:
- happy ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism
- sad ← grief, disappointment, remorse, sadness
- angry ← anger, annoyance, disgust, disapproval
- surprised ← surprise, confusion, curiosity, realization, fear, nervousness
- relaxed ← caring, relief, calm, contentment
- neutral ← embarrassment, desire
Emotions also influence:
- TTS exaggeration — Chatterbox
exaggerationparameter (0.0-1.0) - Gesture intensity — animation speed/amplitude scales with emotional state
- Particle effects — optional sparkles for joy, dark aura for anger, etc.
Godot Node Tree
CompanionRoot (Node3D)
├── Camera3D (fixed, FOV 30, positioned at face level)
├── DirectionalLight3D
├── AmbientLight (WorldEnvironment)
├── AvatarRoot (Node3D)
│ ├── VRMModel (imported .vrm, Skeleton3D child)
│ │ ├── Skeleton3D (VRM humanoid bones)
│ │ ├── MeshInstance3D (body, hair, clothes)
│ │ └── LookAtModifier3D (gaze tracking)
│ ├── AnimationPlayer (imported VRM animations)
│ └── AnimationTree (state machine + expression blend + mirroring layer)
├── AudioStreamPlayer (TTS playback)
│ └── AudioEffectSpectrumAnalyzer (lipsync source)
├── AudioStreamPlayer (mic capture for VAD)
│ └── AudioEffectCapture
├── CameraFeed (webcam input for Face-to-Face mode)
│ └── GestureClassifier (pose detection → gesture labels)
└── UI (CanvasLayer)
├── ChatBubble (appears during conversation)
├── MicIndicator (shows VAD state)
└── SettingsPanel (model/voice/backend config)
@model-boss Integration
GPU coordination is handled by @model-boss on the backend. The Godot app is a pure client — it makes HTTP requests to services that internally acquire GPU leases:
- Whisper STT: Lease acquired per transcription request
- Chatterbox TTS: Lease acquired per synthesis request
- LLM inference: Lease held during streaming response
Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.
VRM Model Format
Chobit uses VRM models (.vrm files) loaded via the VRM4Godot addon:
- VRoid Studio (free, Pixiv) — create custom models
- VRoid Hub — download community models
- UniVRM — convert from other 3D formats
Required blendshapes: happy, sad, angry, surprised, relaxed, neutral, aa (mouth open), blink
File Formats
| Asset | Format | Location |
|---|---|---|
| VRM models | .vrm |
godot/models/ |
| Animations | .tres (Godot resource) |
godot/scenes/ |
| Audio | .wav, .ogg |
godot/audio/ |
| Scripts | .gd (GDScript) |
godot/scripts/ |
| Scenes | .tscn (Godot scene) |
godot/scenes/ |