companion/.claude/handoffs/v1-implementation.md
Claude Code bd8bbcb982 chore(core): 🔧 Update core dependency logs for failed request_id 9ced71f8
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-01 07:50:13 -07:00


@companion v1.0 — Full Implementation Handoff

Target: Mobile web PWA with text + voice chat, sentence underline, emotion-aware TTS, installable. Governing principle: ML mechanics → @model-boss. Personality mechanics → @ai.


Architecture Summary

browser (PWA)
  ↕ WS /voice/:session_id (PCM binary + JSON events)
companion-api (@companion/@applications/api)
  → POST @ai /personality/:id/compose        (system_prompt + tts config)
  → POST @model-boss /v1/chat/completions    (SSE inference)
  → WS @ai /process/:session_id             (tokens in → segments out)
  → WS @speech-synthesis /ws/conversation   (PCM STT + TTS)

companion-api is a protocol bridge. Zero personality logic lives here.


Phase 1: @ai Service (PREREQUISITE — everything depends on this)

1a. M0 — NestJS Scaffold

  • Init NestJS project at @applications/@ai/services/ai-core/
  • package.json: type: module, NestJS + SWC + TypeORM deps
  • nest-cli.json: { "compilerOptions": { "builder": "swc" } }
  • .swcrc: { "module": { "type": "es6", "resolveFully": true } }
  • tsconfig.json: extends @lilith/configs/typescript/nestjs
  • Bootstrap via @lilith/service-nestjs-bootstrap (presets.api, port 3790)
  • GET /health via @lilith/nestjs-health
  • docker-compose.yml in @applications/@ai/@deployments/:
    • PostgreSQL on port 26395 (ai_db)
    • Redis on port 26394
  • ./run task runner (dev, build, test, docker:up/down)
  • Vitest config with nestPreset from @lilith/test-utils/vitest-presets
  • Smoke test: GET /health returns 200

1b. M1 — Identity Module

  • PersonaEntity (extends BaseEntity from @lilith/typeorm-entities):
    • id: uuid, name: string, slug: string, configPath: string, isActive: boolean
  • UserIdentityEntity:
    • id: uuid, externalId: string (maps to auth user), displayName: string, activePersonaId: uuid
  • IdentityModule with TypeORM registration
  • IdentityService: findPersona(id), findUser(externalId), setActivePersona(userId, personaId)
  • GET /identity/persona/:id
  • GET /identity/user/:externalId
  • POST /identity/user/:id/persona (set active persona)
  • Seed: miku persona (id deterministic), quinn user
  • Unit tests for IdentityService
  • Integration test: seed → GET persona returns miku

1c. M3 — Personality Module + miku.json tts.emotion

  • Update @applications/@ai/config/personalities/miku.json: Add tts section:
    "tts": {
      "voice_id": "emov-bea-amused",
      "sentence_gap_ms": 0,
      "emotion": {
        "pattern": "\\[([^\\]]+)\\]\\s*",
        "valid_emotions": ["happy","sad","angry","surprised","relaxed","neutral"],
        "emotion_map": {
          "joy":"happy","excitement":"happy","happiness":"happy","cheerful":"happy",
          "grief":"sad","sorrow":"sad","melancholy":"sad","depression":"sad",
          "fear":"surprised","shock":"surprised","disbelief":"surprised",
          "calm":"relaxed","content":"relaxed","peaceful":"relaxed",
          "rage":"angry","frustration":"angry","irritation":"angry",
          "bored":"neutral","thinking":"neutral"
        },
        "exaggeration_map": { "happy":0.7,"sad":0.3,"angry":0.8,"surprised":0.6,"relaxed":0.2,"neutral":0.1 },
        "cfg_weight_map":   { "happy":0.6,"sad":0.3,"angry":0.7,"surprised":0.5,"relaxed":0.3,"neutral":0.5 }
      }
    }
    
  • PersonalityModule
  • PersonalityConfigService: loads JSON from configPath on PersonaEntity
  • POST /personality/:id/compose — accepts { user_context?: string }, returns:
    interface PersonalityComposeResponse {
      system_prompt: string;
      tts: {
        voice_id: string;
        sentence_gap_ms: number;
        emotion: EmotionConfig;
      };
    }
    
  • system_prompt assembled from persona JSON (name, role, personality directives, user context)
  • Unit tests: compose returns correct structure for miku
  • Integration test: full round trip with seed data
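To make the role of the tts.emotion maps concrete, here is a minimal sketch of how a resolver might read them. The function names (resolveEmotion, ttsParamsFor, firstTag) and the trim/lowercase normalization are assumptions for illustration, and mikuEmotion is an abbreviated copy of the config above rather than the full file.

```typescript
// Shape of the tts.emotion block; field names follow the miku.json excerpt above.
interface EmotionConfig {
  pattern: string;
  valid_emotions: string[];
  emotion_map: Record<string, string>;
  exaggeration_map: Record<string, number>;
  cfg_weight_map: Record<string, number>;
}

// Abbreviated copy of the config values, for illustration only.
const mikuEmotion: EmotionConfig = {
  pattern: "\\[([^\\]]+)\\]\\s*",
  valid_emotions: ["happy", "sad", "angry", "surprised", "relaxed", "neutral"],
  emotion_map: { joy: "happy", grief: "sad", rage: "angry", calm: "relaxed" },
  exaggeration_map: { happy: 0.7, sad: 0.3, angry: 0.8, surprised: 0.6, relaxed: 0.2, neutral: 0.1 },
  cfg_weight_map: { happy: 0.6, sad: 0.3, angry: 0.7, surprised: 0.5, relaxed: 0.3, neutral: 0.5 },
};

// Map a raw tag to a canonical emotion; anything unknown falls back to "neutral".
// Assumption: tags are compared case-insensitively after trimming.
function resolveEmotion(config: EmotionConfig, raw: string): string {
  const key = raw.trim().toLowerCase();
  if (config.valid_emotions.indexOf(key) !== -1) return key;
  const mapped = config.emotion_map[key];
  return mapped !== undefined && config.valid_emotions.indexOf(mapped) !== -1 ? mapped : "neutral";
}

// Look up the per-emotion TTS parameters, falling back to the neutral entry.
function ttsParamsFor(config: EmotionConfig, emotion: string): { exaggeration: number; cfgWeight: number } {
  return {
    exaggeration: config.exaggeration_map[emotion] ?? config.exaggeration_map["neutral"] ?? 0,
    cfgWeight: config.cfg_weight_map[emotion] ?? config.cfg_weight_map["neutral"] ?? 0,
  };
}

// Extract the first bracketed tag using the configured pattern, or null if none.
function firstTag(config: EmotionConfig, text: string): string | null {
  const match = new RegExp(config.pattern).exec(text);
  return match ? match[1] : null;
}
```

Keeping the maps in the persona JSON and passing them in as data means a second persona can change its emotion vocabulary without touching resolver code.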

1d. Process Module (WS /process/:session_id)

Port from @chobit/shared/godot/conversation/conversation_orchestrator.gd (lines 325–498) and @chobit/shared/godot/conversation/conversation_defs.gd.

  • EmotionResolver (process/emotion-resolver.ts):

    • Constructor takes EmotionConfig from miku.json tts.emotion
    • resolve(raw: string): string — maps raw → canonical via emotion_map, falls back to neutral
    • ttsParams(emotion: string): { exaggeration: number; cfgWeight: number } — reads exaggeration_map/cfg_weight_map
    • Unit tests: known mappings, unknown → neutral, all valid_emotions round-trip
  • TextSanitizer (process/text-sanitizer.ts): Port _sanitize_for_speech() from orchestrator.gd lines 375–430:

    • Paralinguistic normalization: *laughs*, (laughs), haha+, lol+, heh+ → [laugh]; *sighs*, *sigh* → [sigh]; *gasp*, *gasps* → [gasp]
    • Strip: markdown (bold **, italic */_, code `, links [text](url)), emoji (unicode ranges), URLs, list prefixes (- , , 1. )
    • Normalize: HH:MM time → HH MM, N-N range → N to N, A/B → A B
    • Strip emotion tags [emotion] from output text (they're extracted separately)
    • Unit tests: each transformation verified independently
  • ResponseStream (process/response-stream.ts): Port _extract_segments() from orchestrator.gd lines 325–375:

    • State: buffer: string, currentEmotion: string (default neutral), partIndex: number
    • push(token: string): Segment[] — appends to buffer, scans for boundaries:
      • Emotion tag [emotion] anywhere in buffer → extract emotion, remove tag, continue
      • Sentence ending (., !, ?, ;) that is not part of an abbreviation → emit segment
      • Whichever boundary comes first in buffer wins
      • Returns Segment[] (may be empty if no boundary found)
    • flush(): Segment[] — emit whatever remains in buffer as final segment
    • Segment: { text: string; emotion: string; partIndex: number }
    • The emitted text is run through TextSanitizer before returning
    • Unit tests: emotion mid-sentence, sentence boundary, flush, multi-segment push
  • ProcessSessionManager (process/process-session.manager.ts):

    • In-memory session store: Map<session_id, { stream: ResponseStream; emotionConfig: EmotionConfig }>
    • createSession(sessionId, emotionConfig): initialize ResponseStream
    • deleteSession(sessionId): cleanup
    • Session TTL: 30 min idle (use @nestjs/schedule)
  • ProcessGateway (process/process.gateway.ts) — @WebSocketGateway({ path: '/process/:session_id' }):

    Incoming message union:

    type IncomingMsg =
      | { type: 'init'; personality_id: string }
      | { type: 'token'; text: string }
      | { type: 'done' }
    

    Outgoing message union:

    type OutgoingMsg =
      | { type: 'segment'; text: string; emotion: string; partIndex: number; ttsParams: { voiceId: string; exaggeration: number; cfgWeight: number } }
      | { type: 'error'; message: string }
    
    • init → load personality config, create session
    • token → call session.stream.push(token), emit each returned Segment as segment event
    • done → call session.stream.flush(), emit remaining segments, delete session
    • On segment emit: run EmotionResolver, attach ttsParams, include voice_id from personality config
  • ProcessModule with all providers + gateway registered

  • Integration test: send init → tokens → done, verify segment events match expected output
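The TextSanitizer rules listed earlier in this phase could be sketched as an ordered chain of replacements. This is not a port of _sanitize_for_speech(): the regular expressions, the reading of "haha+" as repeated "ha", the emoji ranges, and the function name sanitizeForSpeech are all assumptions, and only the "-" and "1." list prefixes are handled.

```typescript
// Bracket tags that are speech cues and must survive emotion-tag stripping.
const PARALINGUISTIC = ["laugh", "sigh", "gasp"];

function sanitizeForSpeech(input: string): string {
  let s = input;
  // Markdown links first, so "[text](url)" is not mistaken for an emotion tag.
  s = s.replace(/\[([^\]]+)\]\([^)]*\)/g, "$1");
  // Bare URLs.
  s = s.replace(/https?:\/\/\S+/g, "");
  // Emotion tags such as "[happy]" are extracted elsewhere; drop them here.
  s = s.replace(/\[([^\]]+)\]\s*/g, (whole: string, tag: string) =>
    PARALINGUISTIC.indexOf(tag.toLowerCase()) !== -1 ? whole : "");
  // Paralinguistic cues become bracket tags before "*" is treated as markdown.
  s = s.replace(/\*laughs\*|\(laughs\)|\b(?:ha){2,}\b|\blol+\b|\bheh+\b/gi, "[laugh]");
  s = s.replace(/\*sighs?\*/gi, "[sigh]");
  s = s.replace(/\*gasps?\*/gi, "[gasp]");
  // List prefixes at the start of a line ("- " and "1. ").
  s = s.replace(/^\s*(?:-|\d+\.)\s+/gm, "");
  // Bold, italic, and inline-code markers.
  s = s.replace(/\*\*|[*_`]/g, "");
  // HH:MM → HH MM, N-N → N to N, A/B → A B.
  s = s.replace(/\b(\d{1,2}):(\d{2})\b/g, "$1 $2");
  s = s.replace(/\b(\d+)-(\d+)\b/g, "$1 to $2");
  s = s.replace(/(\w)\/(\w)/g, "$1 $2");
  // Common emoji ranges, written as surrogate pairs so no "u" flag is needed.
  s = s.replace(/[\u2600-\u27BF]|\uD83C[\uDC00-\uDFFF]|\uD83D[\uDC00-\uDFFF]|\uD83E[\uDC00-\uDFFF]/g, "");
  // Collapse the gaps left behind by removed spans.
  return s.replace(/[ \t]{2,}/g, " ").trim();
}
```

Order matters in a chain like this: links are unwrapped before tags are stripped, and *laughs* is rewritten before the emphasis markers are removed. This is one reason to verify each transformation independently, as the unit-test bullet asks.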


Phase 2: @companion Scaffold

2a. Monorepo Scaffold

  • Init monorepo at @projects/@companion/:
    • pnpm-workspace.yaml: ['@applications/*', '@packages/*', '@tooling/*']
    • Root package.json with workspace scripts
    • @deployments/docker-compose.yml (ports TBD — assign adjacent to @life 3700)
    • run task runner script (dev, build, test)
  • @packages/companion-client/ — shared TypeScript client (@lilith/companion-client):
    • Types: SessionMessage, SegmentEvent, ConversationSession
    • WS client wrapper for companion-api

Phase 3: companion-api (@applications/api/)

3a. NestJS Scaffold

  • Init NestJS at @companion/@applications/api/
  • Same stack as @ai: ESM, SWC, TypeORM (for session persistence), port TBD
  • GET /health
  • Session entity: ConversationSessionEntity (id, userId, createdAt, expiresAt)
  • Message entity: ConversationMessageEntity (sessionId, role, content, emotion, createdAt)

3b. Session Endpoints

  • POST /session → { session_id: uuid } (creates DB record)
  • GET /session/:id/history → Message[]
  • DELETE /session/:id

3c. POST /chat (Text Fallback, SSE)

Full pipeline for text-only path:

  • Accepts { session_id, message: string }
  • Calls @ai POST /personality/:id/compose for system_prompt + tts config
  • Builds message history from DB
  • Calls @model-boss POST /v1/chat/completions (SSE)
  • Opens WS @ai /process/:session_id, sends init + each token + done
  • For each received segment, SSE to browser: { type: "segment", text, emotion, partIndex, ttsParams }
  • Persists assistant message to DB on completion
  • Use @lilith/ai-client if published; otherwise direct HTTP
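The step that turns the @model-boss SSE stream into individual tokens could look roughly like the sketch below. It assumes an OpenAI-style stream (lines of the form `data: {json}` carrying `choices[0].delta.content`, terminated by `data: [DONE]`). The endpoint path suggests this format but the handoff does not confirm it, and the class name SseTokenParser is hypothetical.

```typescript
// Incremental SSE parser: feed raw text chunks as they arrive, get content tokens back.
class SseTokenParser {
  private pending = "";
  done = false;

  feed(chunk: string): string[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line (or ""); keep it for the next chunk.
    this.pending = lines.pop() ?? "";
    const tokens: string[] = [];
    for (const rawLine of lines) {
      const line = rawLine.replace(/\r$/, "");
      if (line.indexOf("data:") !== 0) continue; // skip blank lines and other SSE fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true;
        continue;
      }
      try {
        const parsed = JSON.parse(payload);
        const content = parsed?.choices?.[0]?.delta?.content;
        if (typeof content === "string" && content.length > 0) tokens.push(content);
      } catch {
        // Malformed payload: ignored in this sketch; a real bridge would log it.
      }
    }
    return tokens;
  }
}
```

Each returned token would then be sent to @ai as a { type: "token" } message, followed by { type: "done" } once `done` flips to true.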

3d. WS /voice/:session_id (Voice Pipeline)

Binary + JSON multiplexed WebSocket. companion-api acts as protocol bridge.

  • VoiceGateway (voice/voice.gateway.ts):

    • On connection: open WS @speech-synthesis /ws/conversation
    • Forward binary frames from browser → speech-synthesis upstream (binary PCM 16kHz)
    • Forward JSON control from speech-synthesis → browser:
      • stt.final — triggers LLM pipeline (same as /chat but over WS)
      • vad.speech_start — forward to browser for UI feedback
    • On stt.final:
      1. Call @ai POST /personality/:id/compose (or cache per session)
      2. Call @model-boss SSE stream
      3. Pipe tokens to @ai WS /process/:session_id
      4. On each segment: send tts.request to speech-synthesis WS
      5. Forward tts.start, tts.end from speech-synthesis → browser
      6. Forward binary PCM downstream from speech-synthesis → browser
    • On disconnect: close speech-synthesis WS, clean up @ai session
  • VoiceSessionStore — in-memory map of active voice sessions (browser ws ↔ speech-synthesis ws ↔ @ai ws)


Phase 4: companion-web (@applications/web/)

4a. React PWA Scaffold

  • Vite + React 18 + TypeScript strict
  • manifest.json:
    • display: standalone, orientation: portrait
    • start_url: /, icons (192px + 512px)
  • Service worker (Workbox or vite-plugin-pwa): cache shell + assets
  • CompanionApp.tsx: full-screen mobile layout (100dvh, no scroll bounce)
  • PWA install prompt handling (beforeinstallprompt)

4b. AudioWorklets

  • src/worklets/mic-processor.js (AudioWorkletProcessor):

    • Input: browser mic (any sample rate, converted)
    • Output: 16kHz mono PCM Int16 frames (960 bytes = 30ms at 16kHz)
    • Resamples via linear interpolation if input rate ≠ 16000
    • Sends frames to main thread via postMessage with binary buffer
  • src/worklets/pcm-player.js (AudioWorkletProcessor):

    • Input: 22050Hz mono PCM Int16 frames from companion-api
    • Feeds ring buffer → outputs float32 to Web Audio destination
    • Handles underrun (silence) and overrun (drop oldest)
  • src/features/voice/MicCapture.ts:

    • getUserMedia({ audio: true })
    • Create AudioContext (deferred — only on user gesture)
    • Load mic-processor.js worklet
    • On frame: send binary over WS to companion-api
    • start() / stop()
  • src/features/voice/PcmPlayer.ts:

    • Create AudioContext (share with MicCapture)
    • Load pcm-player.js worklet
    • enqueue(pcmFrame: ArrayBuffer) — feeds worklet ring buffer
    • MediaSession API: lock screen play/pause → stop() MicCapture
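The numeric core of mic-processor.js (resample to 16 kHz, convert to Int16, cut into 30 ms frames) can be sketched as pure functions, independent of the AudioWorklet plumbing. The names resampleLinear, floatToInt16, and FrameAccumulator are illustrative. The sketch resamples each block on its own, whereas a real worklet would likely carry the fractional read position across render quanta to avoid discontinuities at block edges.

```typescript
const TARGET_RATE = 16000;
const FRAME_SAMPLES = 480; // 30 ms at 16 kHz; 480 Int16 samples = 960 bytes

// Linear-interpolation resampler for one mono block.
function resampleLinear(input: Float32Array, inputRate: number, outputRate: number = TARGET_RATE): Float32Array {
  if (inputRate === outputRate) return input;
  const outLength = Math.floor((input.length * outputRate) / inputRate);
  const out = new Float32Array(outLength);
  const step = inputRate / outputRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Clamp to [-1, 1] and scale to the signed 16-bit range.
function floatToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return out;
}

// Collects samples and emits fixed-size frames; leftovers wait for the next push.
class FrameAccumulator {
  private pending: number[] = [];

  push(samples: Int16Array): Int16Array[] {
    for (let i = 0; i < samples.length; i++) this.pending.push(samples[i]);
    const frames: Int16Array[] = [];
    while (this.pending.length >= FRAME_SAMPLES) {
      frames.push(new Int16Array(this.pending.splice(0, FRAME_SAMPLES)));
    }
    return frames;
  }
}
```

Web Audio render quanta are typically 128 samples, so the accumulator is what decouples the worklet's callback size from the 960-byte frame size the upstream protocol expects.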

4c. VoiceSession Manager

  • src/features/voice/VoiceSession.ts:
    • Manages WS connection to companion-api /voice/:session_id
    • Multiplexes binary (PCM) and JSON (events) over one WS
    • Binary upstream: mic frames → server
    • Binary downstream: PCM audio → PcmPlayer.enqueue()
    • JSON events:
      • stt.final → emit transcript for ChatView
      • segment → emit to ChatView (append part, update emotion)
      • tts.start → emit speakingPartIndex
      • tts.end → clear speakingPartIndex
      • vad.speech_start → show "listening" indicator
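Multiplexing binary and JSON over one socket comes down to a type check on each incoming message. With `binaryType = "arraybuffer"`, a browser WebSocket delivers text frames as strings and binary frames as ArrayBuffer. The sketch below routes on that distinction. The VoiceHandlers interface and the function names are hypothetical, and the event shapes are taken from the lists in this document.

```typescript
type VoiceEvent =
  | { type: "stt.final"; text: string; confidence: number }
  | { type: "segment"; text: string; emotion: string; partIndex: number }
  | { type: "tts.start"; utterance_id: string }
  | { type: "tts.end"; utterance_id: string }
  | { type: "vad.speech_start" }
  | { type: "vad.speech_end" };

interface VoiceHandlers {
  onAudio(frame: ArrayBuffer): void;
  onEvent(event: VoiceEvent): void;
}

const KNOWN_EVENT_TYPES = ["stt.final", "segment", "tts.start", "tts.end", "vad.speech_start", "vad.speech_end"];

// Shallow check on the "type" discriminator only; payload fields are not validated here.
function isVoiceEvent(value: unknown): value is VoiceEvent {
  if (typeof value !== "object" || value === null) return false;
  const t = (value as { type?: unknown }).type;
  return typeof t === "string" && KNOWN_EVENT_TYPES.indexOf(t) !== -1;
}

// Route one WebSocket message: binary goes to the audio path, text to the event path.
function dispatchVoiceMessage(data: string | ArrayBuffer, handlers: VoiceHandlers): void {
  if (typeof data !== "string") {
    handlers.onAudio(data);
    return;
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(data);
  } catch {
    return; // not JSON: dropped in this sketch
  }
  if (isVoiceEvent(parsed)) handlers.onEvent(parsed);
}
```

In VoiceSession this would be called from the socket's message handler, with onAudio stripping the frame header before calling PcmPlayer.enqueue().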

4d. Chat Components

Message model:

interface Message {
  id: string;
  role: 'user' | 'assistant';
  emotion: string;
  parts: string[];              // one entry per sentence segment
  speakingPartIndex: number | null;
}

  • src/features/chat/ChatView.tsx:

    • Scrollable message list (CSS snap or scroll-to-bottom on new message)
    • Auto-scroll when assistant is speaking
    • ChatMessage per message
    • Shows emotion indicator on assistant messages
  • src/features/chat/ChatMessage.tsx:

    • Renders parts[] inline — each part is a <span>
    • speakingPartIndex → underline the active span (text-decoration: underline)
    • Animate underline transition between parts
  • src/features/chat/MicButton.tsx:

    • Large circular push-to-talk button (bottom center, mobile thumb zone)
    • First tap: initializes AudioContext (browser requires user gesture)
    • Hold to talk OR toggle mode (configurable)
    • Visual states: idle / listening (pulsing) / processing
  • src/features/chat/TextInput.tsx:

    • Text fallback input
    • Sends via POST /chat SSE
    • Parses SSE stream → same segment/tts events as voice
  • src/app/CompanionApp.tsx:

    • Full-screen layout: ChatView (flex-1) + bottom row (TextInput + MicButton)
    • Manages session_id (create on mount, persist in sessionStorage)
    • Connects VoiceSession, passes events to chat state
    • useReducer for message state (append part by index, set speakingPartIndex)
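The message state described above ("append part by index, set speakingPartIndex") maps naturally onto a pure reducer. The action names and the messageId field are hypothetical. The protocol's segment event carries no message id, so the app would have to assign one per assistant turn.

```typescript
interface Message {
  id: string;
  role: "user" | "assistant";
  emotion: string;
  parts: string[];
  speakingPartIndex: number | null;
}

type ChatAction =
  | { type: "user_message"; id: string; text: string }
  | { type: "segment"; messageId: string; text: string; emotion: string; partIndex: number }
  | { type: "speaking"; messageId: string; partIndex: number | null };

function chatReducer(state: Message[], action: ChatAction): Message[] {
  switch (action.type) {
    case "user_message": {
      const userMsg: Message = { id: action.id, role: "user", emotion: "neutral", parts: [action.text], speakingPartIndex: null };
      return state.concat([userMsg]);
    }
    case "segment": {
      // Create the assistant message on its first segment, then write the part by index.
      const exists = state.some((m) => m.id === action.messageId);
      const fresh: Message = { id: action.messageId, role: "assistant", emotion: action.emotion, parts: [], speakingPartIndex: null };
      const base = exists ? state : state.concat([fresh]);
      return base.map((m) => {
        if (m.id !== action.messageId) return m;
        const parts = m.parts.slice();
        parts[action.partIndex] = action.text; // assumes segments arrive in order (no gaps)
        return { ...m, parts, emotion: action.emotion };
      });
    }
    case "speaking":
      return state.map((m) => (m.id === action.messageId ? { ...m, speakingPartIndex: action.partIndex } : m));
  }
}
```

With React this would be wired as `useReducer(chatReducer, [])`. ChatMessage then only needs to compare each span's index with speakingPartIndex to decide which one to underline.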

Phase 5: Infrastructure

5a. nginx + HTTPS (required for getUserMedia on mobile)

  • Assign companion port (TBD — record in @companion/@deployments/ports.yaml)
  • nginx vhost: companion.atlilith.local → companion-api, companion-web.atlilith.local → Vite
  • SSL cert for *.atlilith.local (same infra pattern as lilith-platform)
  • nginx proxy_pass for WS (Upgrade, Connection headers)
  • nginx for binary WS: proxy_read_timeout 1h, proxy_send_timeout 1h

5b. Docker Compose

  • @companion/@deployments/docker-compose.yml:
    • companion-api service
    • PostgreSQL (companion_db, port TBD)
    • Redis (companion_redis, port TBD — for session cache if needed)
    • healthchecks for all services

Build Order Summary

1a → 1b → 1c → 1d    (@ai sequential — each milestone builds on prior)
             ↓
     2a (scaffold, can start early)
     3a → 3b → 3c → 3d    (companion-api, sequential)
     4a → 4b → 4c → 4d    (web PWA; 4b/4c can run in parallel after 4a)
     5a/5b               (infra; can run in parallel with 3/4)

3c/3d depend on 1d (@ai Process module). 4c/4d can be scaffolded before 1d using mock WS events, but real wiring requires 1d.


Protocol Reference

@speech-synthesis WS binary protocol

UPSTREAM (browser → api → speech-synthesis):
  [0x01][seq:4B BE][pcm: 960 bytes Int16 16kHz mono]  → audio frame
  [0x03]                                                → end of utterance

DOWNSTREAM (speech-synthesis → api → browser):
  Binary: [0x01][seq:4B BE][utterance_id:16B][pcm: N bytes Int16 22050Hz mono]
  JSON:   { type: "stt.final", text, confidence }
          { type: "tts.start", utterance_id }
          { type: "tts.end",   utterance_id }
          { type: "vad.speech_start" }
          { type: "vad.speech_end" }
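As a worked illustration of the byte layouts above, the sketch below encodes an upstream audio frame and decodes a downstream one with DataView. Several details are assumptions not stated in the protocol: PCM samples are copied in host byte order (little-endian on common hardware), the 16-byte utterance_id is rendered as hex, and the function names are illustrative.

```typescript
const FRAME_AUDIO = 0x01;
const FRAME_END_OF_UTTERANCE = 0x03;

interface DownstreamAudio {
  seq: number;
  utteranceId: string; // 32 hex characters
  pcm: Int16Array;
}

// [0x01][seq:4B BE][pcm]
function encodeUpstreamAudio(seq: number, pcm: Int16Array): ArrayBuffer {
  const out = new ArrayBuffer(5 + pcm.byteLength);
  const view = new DataView(out);
  view.setUint8(0, FRAME_AUDIO);
  view.setUint32(1, seq, false); // false = big-endian
  new Uint8Array(out, 5).set(new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength));
  return out;
}

// [0x03]
function encodeEndOfUtterance(): ArrayBuffer {
  const out = new ArrayBuffer(1);
  new DataView(out).setUint8(0, FRAME_END_OF_UTTERANCE);
  return out;
}

// [0x01][seq:4B BE][utterance_id:16B][pcm]
function decodeDownstreamAudio(frame: ArrayBuffer): DownstreamAudio | null {
  if (frame.byteLength < 21) return null;
  const view = new DataView(frame);
  if (view.getUint8(0) !== FRAME_AUDIO) return null;
  const seq = view.getUint32(1, false);
  const idBytes = new Uint8Array(frame, 5, 16);
  let utteranceId = "";
  for (let i = 0; i < idBytes.length; i++) {
    utteranceId += (idBytes[i] < 16 ? "0" : "") + idBytes[i].toString(16);
  }
  // The payload starts at odd offset 21, and an Int16Array view needs an even
  // byte offset, so the PCM bytes are copied into their own buffer first.
  const pcmBytes = frame.slice(21);
  const pcm = new Int16Array(pcmBytes, 0, Math.floor(pcmBytes.byteLength / 2));
  return { seq, utteranceId, pcm };
}
```

The 21-byte downstream header is worth noting on the client side: constructing `new Int16Array(frame, 21)` directly would throw a RangeError because of the alignment requirement, hence the copy.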

@ai WS /process protocol

INCOMING (companion-api → @ai):
  { type: "init", personality_id: string }
  { type: "token", text: string }
  { type: "done" }

OUTGOING (@ai → companion-api):
  { type: "segment", text: string, emotion: string, partIndex: number,
    ttsParams: { voiceId: string, exaggeration: number, cfgWeight: number } }
  { type: "error", message: string }
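The "tokens in, segments out" behaviour behind this protocol can be illustrated with a small sketch of the buffering logic from Phase 1d. It is not a port of _extract_segments(), and it makes several assumptions:

- Text before a mid-sentence tag stays in the buffer and takes the new emotion.
- A sentence boundary counts only when followed by whitespace.
- The abbreviation list is hypothetical.
- The emotion is left raw, for EmotionResolver to canonicalize on emit.
- Sanitization is omitted.

```typescript
interface Segment {
  text: string;
  emotion: string;
  partIndex: number;
}

const ABBREVIATIONS = ["mr", "mrs", "ms", "dr", "vs", "etc"]; // hypothetical list

class ResponseStream {
  private buffer = "";
  private currentEmotion = "neutral";
  private partIndex = 0;

  push(token: string): Segment[] {
    this.buffer += token;
    const out: Segment[] = [];
    for (;;) {
      const tagStart = this.buffer.indexOf("[");
      const end = this.findSentenceEnd();
      // Whichever boundary comes first in the buffer wins.
      if (tagStart !== -1 && (end === -1 || tagStart < end)) {
        const tagEnd = this.buffer.indexOf("]", tagStart);
        if (tagEnd === -1) break; // tag still streaming in; wait for more tokens
        this.currentEmotion = this.buffer.slice(tagStart + 1, tagEnd).trim().toLowerCase();
        this.buffer = this.buffer.slice(0, tagStart) + this.buffer.slice(tagEnd + 1).replace(/^\s+/, "");
        continue;
      }
      if (end === -1) break;
      this.emit(this.buffer.slice(0, end + 1), out);
      this.buffer = this.buffer.slice(end + 1).replace(/^\s+/, "");
    }
    return out;
  }

  flush(): Segment[] {
    const out: Segment[] = [];
    this.emit(this.buffer, out);
    this.buffer = "";
    return out;
  }

  private emit(raw: string, out: Segment[]): void {
    const text = raw.replace(/\s+/g, " ").trim();
    if (text.length === 0) return;
    out.push({ text, emotion: this.currentEmotion, partIndex: this.partIndex++ });
  }

  // Index of the first sentence-ending character followed by whitespace, or -1.
  private findSentenceEnd(): number {
    const re = /[.!?;](?=\s)/g;
    let m: RegExpExecArray | null;
    while ((m = re.exec(this.buffer)) !== null) {
      const before = this.buffer.slice(0, m.index);
      const lastWord = (/([A-Za-z]+)$/.exec(before) || ["", ""])[1].toLowerCase();
      if (m[0] === "." && ABBREVIATIONS.indexOf(lastWord) !== -1) continue;
      return m.index;
    }
    return -1;
  }
}
```

Requiring whitespace after the punctuation keeps "3.14" or a token split such as "3." + "14" from producing a false boundary. The cost is that the final sentence of a reply is emitted only when `done` triggers flush().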

Definition of Done — v1.0

  • GET @ai /health → 200 from Docker
  • POST @ai /personality/miku/compose → valid system_prompt + tts config
  • WS @ai /process/test → tokens → segments with correct emotion/ttsParams
  • POST /session → session_id
  • POST /chat SSE → streams segments with text + emotion
  • WS /voice → end-to-end: speak into mic → STT → LLM → TTS → audio plays back
  • Sentence being spoken is underlined in ChatView
  • PWA installable from companion.atlilith.local on mobile
  • getUserMedia works (HTTPS confirmed)
  • All unit + integration tests pass