348 lines
16 KiB
Markdown
348 lines
16 KiB
Markdown
# Chobit Architecture
|
|
|
|
## Overview
|
|
|
|
Chobit is an interactive AI companion — a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.
|
|
|
|
The project follows the @applications Tier 2 pattern with shared GDScript symlinked into platform-specific Godot projects:
|
|
|
|
```
|
|
shared/godot/ → Cross-platform source (avatar, conversation, audio, UI)
|
|
godot-desktop/src/ → → Symlink to shared/godot/ (transparent overlay, tray, window mgmt)
|
|
godot-mobile/src/ → → Symlink to shared/godot/ (touch input, on-device camera)
|
|
services/ → Desktop-only Python sidecars (bridge, tray, vision)
|
|
```
|
|
|
|
## System Diagram
|
|
|
|
```
|
|
┌──────────────────────────────────────────────────────────────┐
|
|
│ Godot 4 App (transparent desktop overlay) │
|
|
│ │
|
|
│ ┌────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
|
|
│ │ Microphone │ │ Conversation │ │ VRM Avatar │ │
|
|
│ │ Input │ │ Orchestrator │ │ │ │
|
|
│ │ │ │ │ │ Skeleton │ │
|
|
│ │ VAD │ │ State Machine │ │ Blendshapes │ │
|
|
│ │ (Silero/energy) │──│ Sentence Stream │──│ AnimationTree│ │
|
|
│ │ │ │ Emotion Extract │ │ IK / LookAt │ │
|
|
│ │ AudioEffectCapt │ │ Interrupt Ctrl │ │ Lipsync │ │
|
|
│ └────────────────┘ └────────┬────────┘ └──────────────┘ │
|
|
│ │ │
|
|
│ ┌────────────────┐ │ │
|
|
│ │ Camera Input │ │ │
|
|
│ │ │ │ │
|
|
│ │ Webcam Feed │ │ │
|
|
│ │ Gesture Classif│───────────┘ │
|
|
│ │ Face Detection │ │
|
|
│ └────────────────┘ │
|
|
│ │
|
|
│ ┌──────────────┼──────────────┐ │
|
|
│ ▼ ▼ ▼ │
|
|
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
|
│ │ STT │ │ LLM │ │ TTS │ │
|
|
│ │ Client │ │ Client │ │ Client │ │
|
|
│ │ (HTTP) │ │ (HTTP/WS)│ │ (HTTP) │ │
|
|
│ └──────────┘ └──────────┘ └──────────┘ │
|
|
│ │ │ │ │
|
|
└────────────────┼──────────────┼──────────────┼──────────────┘
|
|
│ │ │
|
|
▼ ▼ ▼
|
|
┌───────────────────────────────────────┐
|
|
│ Backend Services │
|
|
│ │
|
|
│ @speech-synthesis @model-boss │
|
|
│ ├─ Whisper STT ├─ GPU leases │
|
|
│ └─ Chatterbox TTS └─ LLM routing │
|
|
│ │
|
|
│ Any OpenAI-compatible LLM endpoint │
|
|
│ or LifeAI companion service │
|
|
└───────────────────────────────────────┘
|
|
```
|
|
|
|
## Attention System (Dual-Mode Gaze)
|
|
|
|
Chobit has two attention modes that determine where the avatar looks and how it responds to the user:
|
|
|
|
### Desktop Gaze (Ambient Mode)
|
|
|
|
The avatar tracks what the user is doing on screen. The companion is "with you" while you work.
|
|
|
|
- **Eyes/head follow cursor position** — LookAt target is the mouse pointer mapped to 3D space
|
|
- **Active during idle state** — the default when no conversation is happening
|
|
- **Ambient reactions** — occasional glances at notification areas, screen edges, active windows
|
|
- **Subtle personality** — random look-away moments, stretches, yawns (not a robotic cursor tracker)
|
|
|
|
### Face-to-Face (Conversation Mode)
|
|
|
|
The webcam activates and the avatar looks at the user directly. Mutual eye contact.
|
|
|
|
- **Gaze target is the user's face** — detected via webcam, avatar maintains eye contact
|
|
- **Active during conversation** — listening, processing, speaking states
|
|
- **Facial awareness** — can detect user's general expression for responsive reactions
|
|
- **Triggered by VAD** — speech detection switches from Desktop Gaze to Face-to-Face
|
|
|
|
### Mode Transitions
|
|
|
|
Transitions map to the ConversationState FSM:
|
|
|
|
| State | Attention Mode | Behavior |
|
|
|-------|---------------|----------|
|
|
| `idle` | Desktop Gaze | Tracks cursor, ambient companion |
|
|
| `listening` | Face-to-Face | Webcam active, looks at user, attentive posture |
|
|
| `processing` | Face-to-Face | Maintains eye contact, thinking pose |
|
|
| `speaking` | Face-to-Face | Engaged, gesturing, eye contact |
|
|
| `interrupted` | Face-to-Face | Brief surprise, then back to listening |
|
|
| Return to `idle` | Desktop Gaze | Gradual drift back to screen tracking |
|
|
|
|
The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.
|
|
|
|
## Motion Mirroring System
|
|
|
|
A showcase feature where the avatar mimics the user's gestures detected via webcam. This is **methodologically distinct** from skeleton-driven tracking:
|
|
|
|
### Mirroring (what we do) vs Tracking (what we don't)
|
|
|
|
| Approach | How it works | Result |
|
|
|----------|-------------|--------|
|
|
| **Mirroring** (ours) | Classify gesture → trigger pre-made animation | Curated, expressive, companion-like |
|
|
| **Tracking** (rejected) | Map user skeleton → avatar skeleton in real-time | Puppet-like, jittery, uncanny |
|
|
|
|
Mirroring means the avatar is a personality that *responds* to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.
|
|
|
|
### Gesture Classification Pipeline
|
|
|
|
```
|
|
Webcam Frame
|
|
│
|
|
▼
|
|
Pose Detection (MediaPipe / lightweight model)
|
|
│
|
|
▼
|
|
Gesture Classifier
|
|
├── wave → play wave_back animation
|
|
├── head_cock → play head_tilt animation (mirrored)
|
|
├── nod → play nod animation
|
|
├── head_shake → play head_shake animation
|
|
├── lean_forward → play lean_in animation
|
|
├── hand_raise → play greeting animation
|
|
├── thumbs_up → play happy_react animation
|
|
└── unknown → no action (ignore)
|
|
│
|
|
▼
|
|
Animation Trigger (via EventBus)
|
|
│
|
|
▼
|
|
AnimationTree plays the corresponding animation
|
|
with personality variation (speed, amplitude randomization)
|
|
```
|
|
|
|
### Key Properties
|
|
|
|
- **Deliberate delay** — 0.2-0.5s response time feels natural, not robotic
|
|
- **Personality variance** — same gesture doesn't always trigger the exact same animation
|
|
- **Selective response** — avatar doesn't mirror everything; chooses what to react to
|
|
- **Layered on conversation** — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations
|
|
- **Graceful when no camera** — falls back to Desktop Gaze only, no degraded experience
|
|
|
|
### Gesture Detection Approach
|
|
|
|
Two viable approaches (decision deferred to implementation):
|
|
|
|
1. **MediaPipe Holistic** — full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
|
|
2. **Lightweight CNN classifier** — trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.
|
|
|
|
Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.
|
|
|
|
## Conversation Loop
|
|
|
|
```
|
|
1. VAD detects speech end
|
|
└─▶ AudioEffectCapture buffer captured by Godot audio server
|
|
|
|
2. Audio sent to STT service
|
|
└─▶ HTTP POST to chatterbox-tts-service /api/stt
|
|
└─▶ Returns transcribed text
|
|
|
|
3. Text + history sent to LLM backend
|
|
└─▶ HTTP streaming request (SSE or chunked response)
|
|
└─▶ Tokens arrive incrementally
|
|
|
|
4. SentenceStream buffers tokens into complete sentences
|
|
└─▶ Each sentence immediately sent to TTS
|
|
└─▶ First sentence plays while LLM still generates
|
|
|
|
5. EmotionExtractor strips [emotion] tags from each sentence
|
|
└─▶ AnimationTree transitions to matching expression
|
|
└─▶ TTS exaggeration parameter adjusted
|
|
|
|
6. TTS synthesizes speech per-sentence
|
|
└─▶ Audio returned from chatterbox-tts-service
|
|
└─▶ Played via AudioStreamPlayer
|
|
|
|
7. Lipsync drives mouth blendshape
|
|
└─▶ AudioEffectSpectrumAnalyzer reads playback amplitude
|
|
└─▶ Mapped to 'aa' (mouth open) blendshape per frame
|
|
|
|
8. On completion, AnimationTree returns to idle state
|
|
└─▶ VAD resumes listening
|
|
```
|
|
|
|
## Voice Interruption
|
|
|
|
When the user speaks while the AI is talking:
|
|
|
|
1. VAD detects speech onset during `speaking` state
|
|
2. `interrupt()` called on the conversation orchestrator
|
|
3. HTTP request to LLM aborted (stream cancelled)
|
|
4. AudioStreamPlayer stopped immediately
|
|
5. Partial response saved with `[interrupted]` marker in history
|
|
6. AnimationTree: speaking → interrupted (brief surprise) → listening
|
|
|
|
## Platform Rendering
|
|
|
|
### Desktop: Transparent Overlay
|
|
|
|
Miku floats on the desktop — no window chrome, no background. The OS composites the 3D avatar directly over whatever the user is doing.
|
|
|
|
```gdscript
|
|
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)
|
|
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)
|
|
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)
|
|
get_viewport().transparent_bg = true
|
|
```
|
|
|
|
Desktop-specific features: window drag, zoom, edge snap, system tray integration, keyboard shortcuts, gaze halo overlay.
|
|
|
|
### Mobile: Fullscreen with Background Modes
|
|
|
|
Mobile OSes don't support transparent overlay windows — Miku owns the full screen. The background behind the avatar is configurable with four modes:
|
|
|
|
| Mode | Source | Use case |
|
|
|------|--------|----------|
|
|
| **Camera feed** | Rear/front `CameraFeed` → viewport background | AR-style, companion in the real world. Front camera doubles as face tracking input. |
|
|
| **Rendered environment** | 3D scene (bedroom, park, abstract) | Virtual pet aesthetic, configurable themes |
|
|
| **Camera blur** | Camera feed → Gaussian blur shader | Softer AR look, less visual noise |
|
|
| **Solid/gradient** | Flat color or gradient | Battery-friendly fallback, clean aesthetic |
|
|
|
|
The background layer renders behind the avatar in the viewport. The avatar, lighting, and UI are identical to desktop — only the background differs. Desktop has transparency as its implicit "background mode" and doesn't use this system.
|
|
|
|
## Animation Architecture
|
|
|
|
```
|
|
AnimationTree (AnimationNodeStateMachine)
|
|
│
|
|
├─ idle
|
|
│ ├─ Breathing: sine wave on chest/shoulder bones (always active)
|
|
│ ├─ Blink: random interval (2-6s), VRM 'blink' blendshape
|
|
│ ├─ Sway: subtle Perlin noise on hip/spine rotation
|
|
│ └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)
|
|
│
|
|
├─ listening
|
|
│ ├─ Head tilt toward user (Face-to-Face gaze)
|
|
│ ├─ Attentive posture (slight forward lean)
|
|
│ └─ Crossfade from idle (0.3s transition)
|
|
│
|
|
├─ processing
|
|
│ ├─ Look-away (eyes drift, head turns slightly)
|
|
│ ├─ Thinking pose (hand to chin, or finger tap)
|
|
│ └─ Subtle idle maintained underneath
|
|
│
|
|
├─ speaking
|
|
│ ├─ Engaged posture (shoulders open, slight forward lean)
|
|
│ ├─ Gesture layer (hand movements on sentence breaks)
|
|
│ ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)
|
|
│ └─ Expression layer (emotion blendshapes from tags)
|
|
│
|
|
├─ interrupted
|
|
│ ├─ Brief surprise expression (0.2s)
|
|
│ └─ Transition to listening (0.3s)
|
|
│
|
|
└─ mirroring (overlay layer, active in Face-to-Face mode)
|
|
├─ Gesture response animations (wave, nod, tilt, etc.)
|
|
├─ Blended on top of current state animation
|
|
└─ Priority: mirroring < speaking gestures < lipsync
|
|
|
|
Expression Blend Layer (runs on top of body animations):
|
|
AnimationNodeBlendTree with 6 emotion inputs
|
|
Smooth weight interpolation (lerp, ~0.3s transition)
|
|
Driven by EmotionExtractor output
|
|
```
|
|
|
|
## Emotion System
|
|
|
|
The LLM is prompted to embed emotion tags inline:
|
|
|
|
```
|
|
"[joy] That sounds wonderful! [curiosity] Tell me more about your day."
|
|
```
|
|
|
|
28 extended emotions map to 6 VRM blendshapes:
|
|
- **happy** ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism
|
|
- **sad** ← grief, disappointment, remorse, sadness
|
|
- **angry** ← anger, annoyance, disgust, disapproval
|
|
- **surprised** ← surprise, confusion, curiosity, realization, fear, nervousness
|
|
- **relaxed** ← caring, relief, calm, contentment
|
|
- **neutral** ← embarrassment, desire
|
|
|
|
Emotions also influence:
|
|
- **TTS exaggeration** — Chatterbox `exaggeration` parameter (0.0-1.0)
|
|
- **Gesture intensity** — animation speed/amplitude scales with emotional state
|
|
- **Particle effects** — optional sparkles for joy, dark aura for anger, etc.
|
|
|
|
## Godot Node Tree
|
|
|
|
```
|
|
CompanionRoot (Node3D)
|
|
├── Camera3D (fixed, FOV 30, positioned at face level)
|
|
├── DirectionalLight3D
|
|
├── AmbientLight (WorldEnvironment)
|
|
├── AvatarRoot (Node3D)
|
|
│ ├── VRMModel (imported .vrm, Skeleton3D child)
|
|
│ │ ├── Skeleton3D (VRM humanoid bones)
|
|
│ │ ├── MeshInstance3D (body, hair, clothes)
|
|
│ │ └── LookAtModifier3D (gaze tracking)
|
|
│ ├── AnimationPlayer (imported VRM animations)
|
|
│ └── AnimationTree (state machine + expression blend + mirroring layer)
|
|
├── AudioStreamPlayer (TTS playback)
|
|
│ └── AudioEffectSpectrumAnalyzer (lipsync source)
|
|
├── AudioStreamPlayer (mic capture for VAD)
|
|
│ └── AudioEffectCapture
|
|
├── CameraFeed (webcam input for Face-to-Face mode)
|
|
│ └── GestureClassifier (pose detection → gesture labels)
|
|
└── UI (CanvasLayer)
|
|
├── ChatBubble (appears during conversation)
|
|
├── MicIndicator (shows VAD state)
|
|
└── SettingsPanel (model/voice/backend config)
|
|
```
|
|
|
|
## @model-boss Integration
|
|
|
|
GPU coordination is handled by @model-boss on the backend. The Godot app is a pure client — it makes HTTP requests to services that internally acquire GPU leases:
|
|
|
|
- **Whisper STT**: Lease acquired per transcription request
|
|
- **Chatterbox TTS**: Lease acquired per synthesis request
|
|
- **LLM inference**: Lease held during streaming response
|
|
|
|
Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.
|
|
|
|
## VRM Model Format
|
|
|
|
Chobit uses VRM models (`.vrm` files) loaded via the VRM4Godot addon:
|
|
- **VRoid Studio** (free, Pixiv) — create custom models
|
|
- **VRoid Hub** — download community models
|
|
- **UniVRM** — convert from other 3D formats
|
|
|
|
Required blendshapes: `happy`, `sad`, `angry`, `surprised`, `relaxed`, `neutral`, `aa` (mouth open), `blink`
|
|
|
|
## File Formats
|
|
|
|
| Asset | Format | Location |
|
|
|-------|--------|----------|
|
|
| VRM models | `.vrm` | `godot-desktop/models/`, `godot-mobile/models/` |
|
|
| Audio assets | `.wav`, `.ogg`, `.mp3` | `godot-desktop/audio/` |
|
|
| Shared GDScript | `.gd` | `shared/godot/` (symlinked as `src/`) |
|
|
| Platform GDScript | `.gd` | `godot-{platform}/platform/` |
|
|
| Scenes | `.tscn` | `godot-{platform}/scenes/` |
|
|
| Sidecar services | `.py` | `services/{bridge,tray,vision}/` |
|
|
| Protocol types | `.ts` | `packages/chobit-core/src/` |
|