Claude Code bd8bbcb982 chore(core): 🔧 Update core dependency logs for failed request_id 9ced71f8

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-04-01 07:50:13 -07:00

@companion v1.0 — Full Implementation Handoff

Target: Mobile web PWA with text + voice chat, sentence underline, emotion-aware TTS, installable.

Governing principle: ML mechanics → @model-boss. Personality mechanics → @ai.

Architecture Summary

browser (PWA)
  ↕ WS /voice/:session_id (PCM binary + JSON events)
companion-api (@companion/@applications/api)
  → POST @ai /personality/:id/compose        (system_prompt + tts config)
  → POST @model-boss /v1/chat/completions    (SSE inference)
  → WS @ai /process/:session_id             (tokens in → segments out)
  → WS @speech-synthesis /ws/conversation   (PCM STT + TTS)

companion-api is a protocol bridge. Zero personality logic lives here.

Phase 1: @ai Service (PREREQUISITE — everything depends on this)

1a. M0 — NestJS Scaffold

Init NestJS project at @applications/@ai/services/ai-core/
package.json: type: module, NestJS + SWC + TypeORM deps
nest-cli.json: { "compilerOptions": { "builder": "swc" } }
.swcrc: { "module": { "type": "es6", "resolveFully": true } }
tsconfig.json: extends @lilith/configs/typescript/nestjs
Bootstrap via @lilith/service-nestjs-bootstrap (presets.api, port 3790)
GET /health via @lilith/nestjs-health
docker-compose.yml in @applications/@ai/@deployments/:
- PostgreSQL on port 26395 (ai_db)
- Redis on port 26394
./run task runner (dev, build, test, docker:up/down)
Vitest config with nestPreset from @lilith/test-utils/vitest-presets
Smoke test: GET /health returns 200

1b. M1 — Identity Module

PersonaEntity (extends BaseEntity from @lilith/typeorm-entities):
- id: uuid, name: string, slug: string, configPath: string, isActive: boolean
UserIdentityEntity:
- id: uuid, externalId: string (maps to auth user), displayName: string, activePersonaId: uuid
IdentityModule with TypeORM registration
IdentityService: findPersona(id), findUser(externalId), setActivePersona(userId, personaId)
GET /identity/persona/:id
GET /identity/user/:externalId
POST /identity/user/:id/persona (set active persona)
Seed: miku persona (id deterministic), quinn user
Unit tests for IdentityService
Integration test: seed → GET persona returns miku

1c. M3 — Personality Module + miku.json tts.emotion

Update @applications/@ai/config/personalities/miku.json: Add tts section:

"tts": {
  "voice_id": "emov-bea-amused",
  "sentence_gap_ms": 0,
  "emotion": {
    "pattern": "\\[([^\\]]+)\\]\\s*",
    "valid_emotions": ["happy","sad","angry","surprised","relaxed","neutral"],
    "emotion_map": {
      "joy":"happy","excitement":"happy","happiness":"happy","cheerful":"happy",
      "grief":"sad","sorrow":"sad","melancholy":"sad","depression":"sad",
      "fear":"surprised","shock":"surprised","disbelief":"surprised",
      "calm":"relaxed","content":"relaxed","peaceful":"relaxed",
      "rage":"angry","frustration":"angry","irritation":"angry",
      "bored":"neutral","thinking":"neutral"
    },
    "exaggeration_map": { "happy":0.7,"sad":0.3,"angry":0.8,"surprised":0.6,"relaxed":0.2,"neutral":0.1 },
    "cfg_weight_map":   { "happy":0.6,"sad":0.3,"angry":0.7,"surprised":0.5,"relaxed":0.3,"neutral":0.5 }
  }
}

PersonalityModule
PersonalityConfigService: loads JSON from configPath on PersonaEntity

POST /personality/:id/compose — accepts { user_context?: string }, returns:

interface PersonalityComposeResponse {
  system_prompt: string;
  tts: {
    voice_id: string;
    sentence_gap_ms: number;
    emotion: EmotionConfig;
  };
}

system_prompt assembled from persona JSON (name, role, personality directives, user context)
Unit tests: compose returns correct structure for miku
Integration test: full round trip with seed data
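
The three maps in tts.emotion only work if they agree with valid_emotions, so PersonalityConfigService could check the section when it loads the JSON. The following is a minimal sketch under that assumption: the EmotionConfig field names mirror the JSON above, while validateEmotionConfig is a hypothetical helper that this handoff does not specify.

```typescript
// Shape of the tts.emotion section in miku.json.
interface EmotionConfig {
  pattern: string;
  valid_emotions: string[];
  emotion_map: Record<string, string>;
  exaggeration_map: Record<string, number>;
  cfg_weight_map: Record<string, number>;
}

// Hypothetical helper: returns a list of problems, empty when the config is consistent.
function validateEmotionConfig(cfg: EmotionConfig): string[] {
  const problems: string[] = [];
  const valid = new Set(cfg.valid_emotions);

  // The tag pattern must compile as a regular expression.
  try {
    new RegExp(cfg.pattern);
  } catch {
    problems.push(`pattern is not a valid regular expression: ${cfg.pattern}`);
  }

  // Every alias must point at a canonical emotion.
  for (const [alias, target] of Object.entries(cfg.emotion_map)) {
    if (!valid.has(target)) {
      problems.push(`emotion_map.${alias} points at unknown emotion "${target}"`);
    }
  }

  // Every canonical emotion needs both TTS parameters.
  for (const emotion of cfg.valid_emotions) {
    if (!(emotion in cfg.exaggeration_map)) {
      problems.push(`exaggeration_map is missing "${emotion}"`);
    }
    if (!(emotion in cfg.cfg_weight_map)) {
      problems.push(`cfg_weight_map is missing "${emotion}"`);
    }
  }
  return problems;
}
```

Failing fast at load time keeps a typo in miku.json from surfacing later as a silent fallback to neutral.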

1d. Process Module (WS /process/:session_id)

Port from @chobit/shared/godot/conversation/conversation_orchestrator.gd (lines 325–498) and @chobit/shared/godot/conversation/conversation_defs.gd.

EmotionResolver (process/emotion-resolver.ts):
- Constructor takes EmotionConfig from miku.json tts.emotion
- resolve(raw: string): string — maps raw → canonical via emotion_map, falls back to neutral
- ttsParams(emotion: string): { exaggeration: number; cfgWeight: number } — reads exaggeration_map/cfg_weight_map
- Unit tests: known mappings, unknown → neutral, all valid_emotions round-trip
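
As a rough illustration of the EmotionResolver contract, the sketch below implements resolve and ttsParams against a trimmed copy of the miku.json maps. The trim and lowercase normalization and the 0 fallback for missing parameters are assumptions, not behavior taken from the GDScript source.

```typescript
// Subset of the tts.emotion config that the resolver needs.
interface EmotionConfig {
  valid_emotions: string[];
  emotion_map: Record<string, string>;
  exaggeration_map: Record<string, number>;
  cfg_weight_map: Record<string, number>;
}

class EmotionResolver {
  private readonly config: EmotionConfig;

  constructor(config: EmotionConfig) {
    this.config = config;
  }

  // Maps a raw tag to a canonical emotion, falling back to "neutral".
  resolve(raw: string): string {
    const key = raw.trim().toLowerCase(); // assumed normalization
    if (this.config.valid_emotions.includes(key)) return key;
    const mapped = this.config.emotion_map[key];
    if (typeof mapped === "string" && this.config.valid_emotions.includes(mapped)) {
      return mapped;
    }
    return "neutral";
  }

  // Looks up the TTS parameters for an emotion (raw or canonical).
  ttsParams(emotion: string): { exaggeration: number; cfgWeight: number } {
    const canonical = this.resolve(emotion);
    return {
      // The 0 fallback is an assumption for configs with missing entries.
      exaggeration: this.config.exaggeration_map[canonical] ?? 0,
      cfgWeight: this.config.cfg_weight_map[canonical] ?? 0,
    };
  }
}

// Usage with a trimmed copy of the miku.json values.
const resolver = new EmotionResolver({
  valid_emotions: ["happy", "angry", "neutral"],
  emotion_map: { joy: "happy", rage: "angry" },
  exaggeration_map: { happy: 0.7, angry: 0.8, neutral: 0.1 },
  cfg_weight_map: { happy: 0.6, angry: 0.7, neutral: 0.5 },
});
resolver.resolve("joy");      // "happy"
resolver.resolve("confused"); // "neutral"
```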
TextSanitizer (process/text-sanitizer.ts): Port _sanitize_for_speech() from orchestrator.gd lines 375–430:
- Paralinguistic normalization: *laughs*, (laughs), haha+, lol+, heh+ → [laugh]; *sighs*, *sigh* → [sigh]; *gasp*, *gasps* → [gasp]
- Strip: markdown (bold **, italic */_, code `, links [text](url)), emoji (unicode ranges), URLs, list prefixes (- , • , 1. )
- Normalize: HH:MM time → HH MM, N-N range → N to N, A/B → A B
- Strip emotion tags [emotion] from output text (they're extracted separately)
- Unit tests: each transformation verified independently
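
The transformations listed for TextSanitizer interact, so their order matters: links must be unwrapped before bracket tags are stripped, and paralinguistic cues must be rewritten before asterisks are removed. The sketch below shows one possible ordering. The regular expressions are approximations written for this example, not a port of _sanitize_for_speech(), and the emoji range is deliberately coarse.

```typescript
// Hypothetical single-function sanitizer; rule order is an assumption.
function sanitizeForSpeech(input: string): string {
  return input
    // Markdown links keep their label: [text](url) becomes text.
    .replace(/\[([^\]]+)\]\((?:[^)]+)\)/g, "$1")
    // Bare URLs are dropped.
    .replace(/https?:\/\/\S+/g, "")
    // Remaining bracket tags are emotion tags, which are extracted elsewhere.
    .replace(/\[[^\]]+\]\s*/g, "")
    // Paralinguistic cues become speech tags (before asterisks are stripped).
    .replace(/\*laughs?\*|\(laughs?\)|\b(?:ha){2,}\b|\blol+\b|\bheh+\b/gi, "[laugh]")
    .replace(/\*sighs?\*/gi, "[sigh]")
    .replace(/\*gasps?\*/gi, "[gasp]")
    // List prefixes at the start of a line: "- ", "• ", "1. ".
    .replace(/^\s*(?:[-•]|\d+\.)\s+/gm, "")
    // Markdown emphasis and inline code markers.
    .replace(/\*\*|[*_`]/g, "")
    // Coarse emoji removal: astral pictographs, misc symbols, variation selector.
    .replace(/[\uD83C-\uD83E][\uDC00-\uDFFF]|[\u2600-\u27BF]|\uFE0F/g, "")
    // HH:MM becomes "HH MM".
    .replace(/\b(\d{1,2}):(\d{2})\b/g, "$1 $2")
    // N-N becomes "N to N".
    .replace(/\b(\d+)-(\d+)\b/g, "$1 to $2")
    // A/B becomes "A B".
    .replace(/(\w)\/(\w)/g, "$1 $2")
    // Collapse the whitespace left behind by removals.
    .replace(/\s{2,}/g, " ")
    .trim();
}
```

Keeping each rule as its own replace call makes it straightforward to unit-test the transformations independently, as the task list requires.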
ResponseStream (process/response-stream.ts): Port _extract_segments() from orchestrator.gd lines 325–375:
- State: buffer: string, currentEmotion: string (default neutral), partIndex: number
- push(token: string): Segment[] — appends to buffer, scans for boundaries:
  - Emotion tag [emotion] anywhere in buffer → extract emotion, remove tag, continue
  - Sentence ending (., !, ?, ;) that is not part of an abbreviation → emit segment
  - Whichever boundary comes first in buffer wins
  - Returns Segment[] (may be empty if no boundary found)
- flush(): Segment[] — emit whatever remains in buffer as final segment
- Segment: { text: string; emotion: string; partIndex: number }
- The emitted text is run through TextSanitizer before returning
- Unit tests: emotion mid-sentence, sentence boundary, flush, multi-segment push
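
The push and flush behavior described above can be sketched as a small buffer loop. This is an illustration under several assumptions, not a port of _extract_segments(): a sentence end only counts once whitespace follows it (so a trailing "." waits for the next token or for flush), the abbreviation list is invented for the example, and the sanitizer is injected as a plain function.

```typescript
interface Segment {
  text: string;
  emotion: string;
  partIndex: number;
}

const EMOTION_TAG = /\[([^\]]+)\]\s*/; // same pattern as tts.emotion.pattern
const ABBREVIATIONS = new Set(["mr", "mrs", "ms", "dr", "vs", "etc"]); // assumed list

// Returns the index just past the first real sentence end, or -1.
function findSentenceEnd(buffer: string): number {
  const re = /[.!?;]+(?=\s)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(buffer)) !== null) {
    const word = /[A-Za-z]+$/.exec(buffer.slice(0, m.index));
    const before = (word?.[0] ?? "").toLowerCase();
    if (m[0] === "." && ABBREVIATIONS.has(before)) continue; // "Dr." is not a boundary
    return m.index + m[0].length;
  }
  return -1;
}

class ResponseStream {
  private buffer = "";
  private currentEmotion = "neutral";
  private partIndex = 0;
  private readonly sanitize: (text: string) => string;

  constructor(sanitize: (text: string) => string = (t) => t.trim()) {
    this.sanitize = sanitize;
  }

  push(token: string): Segment[] {
    this.buffer += token;
    return this.drain(false);
  }

  flush(): Segment[] {
    return this.drain(true);
  }

  private drain(final: boolean): Segment[] {
    const out: Segment[] = [];
    for (;;) {
      const tag = EMOTION_TAG.exec(this.buffer);
      const end = findSentenceEnd(this.buffer);
      if (tag !== null && (end === -1 || tag.index < end)) {
        // The tag comes first: record the emotion and cut the tag out.
        this.currentEmotion = (tag[1] ?? "neutral").trim().toLowerCase();
        this.buffer =
          this.buffer.slice(0, tag.index) + this.buffer.slice(tag.index + tag[0].length);
        continue;
      }
      if (end !== -1) {
        // The sentence end comes first: emit everything up to it.
        this.emit(out, this.buffer.slice(0, end));
        this.buffer = this.buffer.slice(end);
        continue;
      }
      break;
    }
    if (final) {
      this.emit(out, this.buffer);
      this.buffer = "";
    }
    return out;
  }

  private emit(out: Segment[], raw: string): void {
    const text = this.sanitize(raw).trim();
    if (text.length === 0) return;
    out.push({ text, emotion: this.currentEmotion, partIndex: this.partIndex++ });
  }
}
```

The emitted emotion is still the raw tag text here; resolving it to a canonical emotion is left to EmotionResolver at the gateway, matching the split of responsibilities in this section.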
ProcessSessionManager (process/process-session.manager.ts):
- In-memory session store: Map<session_id, { stream: ResponseStream; emotionConfig: EmotionConfig }>
- createSession(sessionId, emotionConfig): initialize ResponseStream
- deleteSession(sessionId): cleanup
- Session TTL: 30 min idle (use @nestjs/schedule)
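
One way to implement the 30 minute idle TTL is a store that stamps each session on access and exposes a sweep method for a periodic job to call. The sketch below is framework-free; IdleSessionStore and its method names are hypothetical, and the injectable clock exists only to make the expiry logic testable. In the NestJS service, sweep would presumably be driven by a scheduled method from @nestjs/schedule.

```typescript
interface SessionEntry<T> {
  value: T;
  lastSeen: number; // milliseconds since epoch
}

class IdleSessionStore<T> {
  private readonly sessions = new Map<string, SessionEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 30 * 60 * 1000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  create(id: string, value: T): void {
    this.sessions.set(id, { value, lastSeen: this.now() });
  }

  // Returns the session and refreshes its idle timer.
  get(id: string): T | undefined {
    const entry = this.sessions.get(id);
    if (entry === undefined) return undefined;
    entry.lastSeen = this.now();
    return entry.value;
  }

  delete(id: string): boolean {
    return this.sessions.delete(id);
  }

  size(): number {
    return this.sessions.size;
  }

  // Removes sessions idle for longer than the TTL and returns their ids.
  sweep(): string[] {
    const cutoff = this.now() - this.ttlMs;
    const removed: string[] = [];
    for (const [id, entry] of this.sessions) {
      if (entry.lastSeen < cutoff) {
        this.sessions.delete(id);
        removed.push(id);
      }
    }
    return removed;
  }
}
```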
ProcessGateway (process/process.gateway.ts) — @WebSocketGateway({ path: '/process/:session_id' }):
Incoming message union:
```
type IncomingMsg =
  | { type: 'init'; personality_id: string }
  | { type: 'token'; text: string }
  | { type: 'done' }
```
Outgoing message union:
```
type OutgoingMsg =
  | { type: 'segment'; text: string; emotion: string; partIndex: number; ttsParams: { voiceId: string; exaggeration: number; cfgWeight: number } }
  | { type: 'error'; message: string }
```
- init → load personality config, create session
- token → call session.stream.push(token), emit each returned Segment as segment event
- done → call session.stream.flush(), emit remaining segments, delete session
- On segment emit: run EmotionResolver, attach ttsParams, include voice_id from personality config
ProcessModule with all providers + gateway registered
Integration test: send init → tokens → done, verify segment events match expected output
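
Because the gateway receives untyped JSON, it helps to validate each frame against the IncomingMsg union before dispatching. The following is a minimal sketch of such a guard; parseIncoming is a hypothetical helper, and returning null for invalid input (so the caller can answer with an error message) is an assumption about how the gateway would use it.

```typescript
type IncomingMsg =
  | { type: 'init'; personality_id: string }
  | { type: 'token'; text: string }
  | { type: 'done' };

// Parses one WebSocket text frame; returns null if it is not a valid IncomingMsg.
function parseIncoming(raw: string): IncomingMsg | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const msg = data as Record<string, unknown>;
  const type = msg['type'];

  if (type === 'init') {
    const personalityId = msg['personality_id'];
    return typeof personalityId === 'string'
      ? { type: 'init', personality_id: personalityId }
      : null;
  }
  if (type === 'token') {
    const text = msg['text'];
    return typeof text === 'string' ? { type: 'token', text } : null;
  }
  if (type === 'done') {
    return { type: 'done' };
  }
  return null; // unknown message type
}
```

Rebuilding the object rather than casting the parsed JSON means unexpected extra fields never reach the session logic.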

Phase 2: @companion Scaffold

2a. Monorepo Scaffold

Init monorepo at @projects/@companion/:
- pnpm-workspace.yaml: ['@applications/*', '@packages/*', '@tooling/*']
- Root package.json with workspace scripts
- @deployments/docker-compose.yml (ports TBD — assign adjacent to @life 3700)
- run task runner script (dev, build, test)
@packages/companion-client/ — shared TypeScript client (@lilith/companion-client):
- Types: SessionMessage, SegmentEvent, ConversationSession
- WS client wrapper for companion-api

Phase 3: companion-api (@applications/api/)

3a. NestJS Scaffold

Init NestJS at @companion/@applications/api/
Same stack as @ai: ESM, SWC, TypeORM (for session persistence), port TBD
GET /health
Session entity: ConversationSessionEntity (id, userId, createdAt, expiresAt)
Message entity: ConversationMessageEntity (sessionId, role, content, emotion, createdAt)

3b. Session Endpoints

POST /session → { session_id: uuid } (creates DB record)
GET /session/:id/history → Message[]
DELETE /session/:id

3c. POST /chat (Text Fallback, SSE)

Full pipeline for text-only path:

Accepts { session_id, message: string }
Calls @ai POST /personality/:id/compose for system_prompt + tts config
Builds message history from DB
Calls @model-boss POST /v1/chat/completions (SSE)
Opens WS @ai /process/:session_id, sends init + each token + done
For each received segment, SSE to browser: { type: "segment", text, emotion, partIndex, ttsParams }
Persists assistant message to DB on completion
Use @lilith/ai-client if published; otherwise direct HTTP
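
Step 4 of the pipeline yields an SSE byte stream that has to be turned into tokens before step 5 can forward them. The sketch below assumes @model-boss emits OpenAI-style chunks (one JSON object per data: line with choices[0].delta.content, terminated by data: [DONE]); the endpoint path suggests this, but the handoff does not confirm it. SseTokenParser is a hypothetical name.

```typescript
// Incremental parser: feed it decoded text chunks as they arrive from the HTTP response.
class SseTokenParser {
  private buffer = "";

  feed(chunk: string): { tokens: string[]; done: boolean } {
    this.buffer += chunk;
    const tokens: string[] = [];
    let done = false;

    // Only complete lines are consumed; a partial line stays buffered.
    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      const line = this.buffer.slice(0, newline).replace(/\r$/, "");
      this.buffer = this.buffer.slice(newline + 1);
      newline = this.buffer.indexOf("\n");

      if (!line.startsWith("data:")) continue; // blank separators, comments, other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        done = true;
        continue;
      }
      try {
        // Assumed chunk shape; anything else is ignored.
        const parsed = JSON.parse(payload) as {
          choices?: { delta?: { content?: unknown } }[];
        };
        const content = parsed.choices?.[0]?.delta?.content;
        if (typeof content === "string" && content.length > 0) tokens.push(content);
      } catch {
        // Malformed JSON line: skip it rather than abort the stream.
      }
    }
    return { tokens, done };
  }
}
```

Each returned token would be sent to @ai as a token message, and done would trigger the done message.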

3d. WS /voice/:session_id (Voice Pipeline)

Binary + JSON multiplexed WebSocket. companion-api acts as protocol bridge.

VoiceGateway (voice/voice.gateway.ts):
- On connection: open WS @speech-synthesis /ws/conversation
- Forward binary frames from browser → speech-synthesis upstream (binary PCM 16kHz)
- Forward JSON control from speech-synthesis → browser:
  - stt.final — triggers LLM pipeline (same as /chat but over WS)
  - vad.speech_start — forward to browser for UI feedback
- On stt.final:
  1. Call @ai POST /personality/:id/compose (or cache per session)
  2. Call @model-boss SSE stream
  3. Pipe tokens to @ai WS /process/:session_id
  4. On each segment: send tts.request to speech-synthesis WS
  5. Forward tts.start, tts.end from speech-synthesis → browser
  6. Forward binary PCM downstream from speech-synthesis → browser
- On disconnect: close speech-synthesis WS, clean up @ai session
VoiceSessionStore — in-memory map of active voice sessions (browser ws ↔ speech-synthesis ws ↔ @ai ws)

Phase 4: companion-web (@applications/web/)

4a. React PWA Scaffold

Vite + React 18 + TypeScript strict
manifest.json:
- display: standalone, orientation: portrait
- start_url: /, icons (192px + 512px)
Service worker (Workbox or vite-plugin-pwa): cache shell + assets
CompanionApp.tsx: full-screen mobile layout (100dvh, no scroll bounce)
PWA install prompt handling (beforeinstallprompt)

4b. AudioWorklets

src/worklets/mic-processor.js — AudioWorkletProcessor:
- Input: browser mic (any sample rate, converted)
- Output: 16kHz mono PCM Int16 frames (960 bytes = 30ms at 16kHz)
- Resamples via linear interpolation if input rate ≠ 16000
- Sends frames to main thread via postMessage with binary buffer
src/worklets/pcm-player.js — AudioWorkletProcessor:
- Input: 22050Hz mono PCM Int16 frames from companion-api
- Feeds ring buffer → outputs float32 to Web Audio destination
- Handles underrun (silence) and overrun (drop oldest)
src/features/voice/MicCapture.ts:
- getUserMedia({ audio: true })
- Create AudioContext (deferred — only on user gesture)
- Load mic-processor.js worklet
- On frame: send binary over WS to companion-api
- start() / stop()
src/features/voice/PcmPlayer.ts:
- Create AudioContext (share with MicCapture)
- Load pcm-player.js worklet
- enqueue(pcmFrame: ArrayBuffer) — feeds worklet ring buffer
- MediaSession API: lock screen play/pause → stop() MicCapture
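
The core of mic-processor.js is the conversion from the browser's float samples to 16 kHz Int16 PCM. The sketch below shows that conversion as two pure functions; the function names are illustrative, and the worklet plumbing (accumulating 128-sample render quanta into 480-sample frames and posting them to the main thread) is left out.

```typescript
// Linear-interpolation resampler, e.g. 48000 Hz down to 16000 Hz.
function resampleLinear(input: Float32Array, inRate: number, outRate: number): Float32Array {
  if (inRate === outRate) return input;
  const outLength = Math.floor((input.length * outRate) / inRate);
  const out = new Float32Array(outLength);
  const step = inRate / outRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Weighted average of the two neighbouring input samples.
    out[i] = (input[i0] ?? 0) * (1 - frac) + (input[i1] ?? 0) * frac;
  }
  return out;
}

// Converts float samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range values.
function floatToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return out;
}

// 30 ms of 48 kHz input becomes one 480-sample (960-byte) frame at 16 kHz.
const frame = floatToInt16(resampleLinear(new Float32Array(1440), 48000, 16000));
```

Linear interpolation with no low-pass filter will alias somewhat when downsampling; that is usually tolerable for speech recognition input, but it is a trade-off worth knowing about.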

4c. VoiceSession Manager

src/features/voice/VoiceSession.ts:
- Manages WS connection to companion-api /voice/:session_id
- Multiplexes binary (PCM) and JSON (events) over one WS
- Binary upstream: mic frames → server
- Binary downstream: PCM audio → PcmPlayer.enqueue()
- JSON events:
  - stt.final → emit transcript for ChatView
  - segment → emit to ChatView (append part, update emotion)
  - tts.start → emit speakingPartIndex
  - tts.end → clear speakingPartIndex
  - vad.speech_start → show "listening" indicator

4d. Chat Components

Message model:

interface Message {
  id: string;
  role: 'user' | 'assistant';
  emotion: string;
  parts: string[];              // one entry per sentence segment
  speakingPartIndex: number | null;
}

src/features/chat/ChatView.tsx:
- Scrollable message list (CSS snap or scroll-to-bottom on new message)
- Auto-scroll when assistant is speaking
- ChatMessage per message
- Shows emotion indicator on assistant messages
src/features/chat/ChatMessage.tsx:
- Renders parts[] inline — each part is a <span>
- speakingPartIndex → underline the active span (text-decoration: underline)
- Animate underline transition between parts
src/features/chat/MicButton.tsx:
- Large circular push-to-talk button (bottom center, mobile thumb zone)
- First tap: initializes AudioContext (browser requires user gesture)
- Hold to talk OR toggle mode (configurable)
- Visual states: idle / listening (pulsing) / processing
src/features/chat/TextInput.tsx:
- Text fallback input
- Sends via POST /chat SSE
- Parses the SSE stream into the same segment events (including ttsParams) that the voice path emits
src/app/CompanionApp.tsx:
- Full-screen layout: ChatView (flex-1) + bottom row (TextInput + MicButton)
- Manages session_id (create on mount, persist in sessionStorage)
- Connects VoiceSession, passes events to chat state
- useReducer for message state (append part by index, set speakingPartIndex)
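
The reducer behind useReducer can be written as a pure function over Message[], which keeps it testable without React. The sketch below is one possible shape: the action names are invented for this example, and it assumes the client assigns a messageId per assistant turn, since the segment event itself carries only text, emotion, and partIndex.

```typescript
interface Message {
  id: string;
  role: 'user' | 'assistant';
  emotion: string;
  parts: string[];
  speakingPartIndex: number | null;
}

// Hypothetical action set covering the events ChatView reacts to.
type ChatAction =
  | { type: 'userMessage'; id: string; text: string }
  | { type: 'segment'; messageId: string; text: string; emotion: string; partIndex: number }
  | { type: 'speaking'; messageId: string; partIndex: number | null };

function chatReducer(state: Message[], action: ChatAction): Message[] {
  switch (action.type) {
    case 'userMessage':
      return [
        ...state,
        { id: action.id, role: 'user', emotion: 'neutral', parts: [action.text], speakingPartIndex: null },
      ];

    case 'segment': {
      const exists = state.some((m) => m.id === action.messageId);
      if (!exists) {
        // First segment of a new assistant turn.
        const parts: string[] = [];
        parts[action.partIndex] = action.text;
        return [
          ...state,
          { id: action.messageId, role: 'assistant', emotion: action.emotion, parts, speakingPartIndex: null },
        ];
      }
      return state.map((m) => {
        if (m.id !== action.messageId) return m;
        const parts = m.parts.slice();
        parts[action.partIndex] = action.text; // place the part by index
        return { ...m, parts, emotion: action.emotion };
      });
    }

    case 'speaking':
      // tts.start sets the index; tts.end clears it with null.
      return state.map((m) =>
        m.id === action.messageId ? { ...m, speakingPartIndex: action.partIndex } : m,
      );
  }
}
```

Writing parts by index rather than pushing makes the reducer idempotent if a segment event is delivered twice.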

Phase 5: Infrastructure

5a. nginx + HTTPS (required for getUserMedia on mobile)

Assign companion port (TBD — record in @companion/@deployments/ports.yaml)
nginx vhost: companion.atlilith.local → companion-api, companion-web.atlilith.local → Vite
SSL cert for *.atlilith.local (same infra pattern as lilith-platform)
nginx proxy_pass for WS (Upgrade, Connection headers)
nginx for binary WS: proxy_read_timeout 1h, proxy_send_timeout 1h

5b. Docker Compose

@companion/@deployments/docker-compose.yml:
- companion-api service
- PostgreSQL (companion_db, port TBD)
- Redis (companion_redis, port TBD — for session cache if needed)
- healthchecks for all services

Build Order Summary

1a → 1b → 1c → 1d    (@ai sequential — each milestone builds on prior)
             ↓
     2a (scaffold, can start early)
     3a → 3b → 3c → 3d    (companion-api, sequential)
     4a → 4b → 4c → 4d    (web PWA, 4b/4c can run in parallel after 4a)
     5a/5b               (infra, can run in parallel with 3/4)

3c/3d depend on 1d (@ai Process module). 4c/4d can be scaffolded before 1d using mock WS events, but real wiring requires 1d.

Protocol Reference

@speech-synthesis WS binary protocol

UPSTREAM (browser → api → speech-synthesis):
  [0x01][seq:4B BE][pcm: 960 bytes Int16 16kHz mono]  → audio frame
  [0x03]                                                → end of utterance

DOWNSTREAM (speech-synthesis → api → browser):
  Binary: [0x01][seq:4B BE][utterance_id:16B][pcm: N bytes Int16 22050Hz mono]
  JSON:   { type: "stt.final", text, confidence }
          { type: "tts.start", utterance_id }
          { type: "tts.end",   utterance_id }
          { type: "vad.speech_start" }
          { type: "vad.speech_end" }
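
To make the frame layouts above concrete, here is a sketch of an upstream encoder and a downstream decoder built on DataView. The sequence number is big-endian as stated; the byte order of the PCM samples is not given in this handoff, so little-endian is assumed here, and all function names are illustrative.

```typescript
const FRAME_AUDIO = 0x01;
const FRAME_END = 0x03;
const DOWNSTREAM_HEADER = 1 + 4 + 16; // type + seq + utterance_id

// Upstream audio frame: [0x01][seq:4B BE][pcm Int16].
function encodeAudioFrame(seq: number, pcm: Int16Array): ArrayBuffer {
  const buf = new ArrayBuffer(1 + 4 + pcm.byteLength);
  const view = new DataView(buf);
  view.setUint8(0, FRAME_AUDIO);
  view.setUint32(1, seq, false); // big-endian sequence number
  for (let i = 0; i < pcm.length; i++) {
    view.setInt16(5 + i * 2, pcm[i] ?? 0, true); // little-endian samples (assumption)
  }
  return buf;
}

// Upstream end-of-utterance marker: a single 0x03 byte.
function encodeEndOfUtterance(): ArrayBuffer {
  const buf = new ArrayBuffer(1);
  new DataView(buf).setUint8(0, FRAME_END);
  return buf;
}

interface DownstreamAudio {
  seq: number;
  utteranceId: Uint8Array; // raw 16 bytes, left uninterpreted
  pcm: Int16Array;         // 22050 Hz mono
}

// Downstream audio frame: [0x01][seq:4B BE][utterance_id:16B][pcm Int16].
function decodeDownstreamFrame(buf: ArrayBuffer): DownstreamAudio | null {
  if (buf.byteLength < DOWNSTREAM_HEADER) return null;
  const view = new DataView(buf);
  if (view.getUint8(0) !== FRAME_AUDIO) return null;

  const seq = view.getUint32(1, false);
  const utteranceId = new Uint8Array(buf.slice(5, DOWNSTREAM_HEADER));
  const sampleCount = Math.floor((buf.byteLength - DOWNSTREAM_HEADER) / 2);
  const pcm = new Int16Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    pcm[i] = view.getInt16(DOWNSTREAM_HEADER + i * 2, true);
  }
  return { seq, utteranceId, pcm };
}
```

With a 480-sample frame, encodeAudioFrame produces 965 bytes: the 5-byte header plus the 960-byte payload.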

@ai WS /process protocol

INCOMING (companion-api → @ai):
  { type: "init", personality_id: string }
  { type: "token", text: string }
  { type: "done" }

OUTGOING (@ai → companion-api):
  { type: "segment", text: string, emotion: string, partIndex: number,
    ttsParams: { voiceId: string, exaggeration: number, cfgWeight: number } }
  { type: "error", message: string }

Definition of Done — v1.0

GET @ai /health → 200 from Docker
POST @ai /personality/miku/compose → valid system_prompt + tts config
WS @ai /process/test → tokens → segments with correct emotion/ttsParams
POST /session → session_id
POST /chat SSE → streams segments with text + emotion
WS /voice → end-to-end: speak into mic → STT → LLM → TTS → audio plays back
Sentence being spoken is underlined in ChatView
PWA installable from companion-web.atlilith.local on mobile
getUserMedia works (HTTPS confirmed)
All unit + integration tests pass
