@companion v1.0 — Full Implementation Handoff
Target: Mobile web PWA with text + voice chat, sentence underline, emotion-aware TTS, installable. Governing principle: ML mechanics → @model-boss. Personality mechanics → @ai.
Architecture Summary
browser (PWA)
↕ WS /voice/:session_id (PCM binary + JSON events)
companion-api (@companion/@applications/api)
→ POST @ai /personality/:id/compose (system_prompt + tts config)
→ POST @model-boss /v1/chat/completions (SSE inference)
→ WS @ai /process/:session_id (tokens in → segments out)
→ WS @speech-synthesis /ws/conversation (PCM STT + TTS)
companion-api is a protocol bridge. Zero personality logic lives here.
Phase 1: @ai Service (PREREQUISITE — everything depends on this)
1a. M0 — NestJS Scaffold
- Init NestJS project at @applications/@ai/services/ai-core/
  - package.json: type: module, NestJS + SWC + TypeORM deps
  - nest-cli.json: { "compilerOptions": { "builder": "swc" } }
  - .swcrc: { "module": { "type": "es6", "resolveFully": true } }
  - tsconfig.json: extends @lilith/configs/typescript/nestjs
- Bootstrap via @lilith/service-nestjs-bootstrap (presets.api, port 3790)
- GET /health via @lilith/nestjs-health
- docker-compose.yml in @applications/@ai/@deployments/:
  - PostgreSQL on port 26395 (ai_db)
  - Redis on port 26394
- ./run task runner (dev, build, test, docker:up/down)
- Vitest config with nestPreset from @lilith/test-utils/vitest-presets
- Smoke test: GET /health returns 200
1b. M1 — Identity Module
- PersonaEntity (extends BaseEntity from @lilith/typeorm-entities):
  - id: uuid, name: string, slug: string, configPath: string, isActive: boolean
- UserIdentityEntity:
  - id: uuid, externalId: string (maps to auth user), displayName: string, activePersonaId: uuid
- IdentityModule with TypeORM registration
- IdentityService: findPersona(id), findUser(externalId), setActivePersona(userId, personaId)
- GET /identity/persona/:id
- GET /identity/user/:externalId
- POST /identity/user/:id/persona (set active persona)
- Seed: miku persona (id deterministic), quinn user
- Unit tests for IdentityService
- Integration test: seed → GET persona returns miku
1c. M3 — Personality Module + miku.json tts.emotion
- Update @applications/@ai/config/personalities/miku.json: add tts section:
      "tts": {
        "voice_id": "emov-bea-amused",
        "sentence_gap_ms": 0,
        "emotion": {
          "pattern": "\\[([^\\]]+)\\]\\s*",
          "valid_emotions": ["happy", "sad", "angry", "surprised", "relaxed", "neutral"],
          "emotion_map": {
            "joy": "happy", "excitement": "happy", "happiness": "happy", "cheerful": "happy",
            "grief": "sad", "sorrow": "sad", "melancholy": "sad", "depression": "sad",
            "fear": "surprised", "shock": "surprised", "disbelief": "surprised",
            "calm": "relaxed", "content": "relaxed", "peaceful": "relaxed",
            "rage": "angry", "frustration": "angry", "irritation": "angry",
            "bored": "neutral", "thinking": "neutral"
          },
          "exaggeration_map": { "happy": 0.7, "sad": 0.3, "angry": 0.8, "surprised": 0.6, "relaxed": 0.2, "neutral": 0.1 },
          "cfg_weight_map": { "happy": 0.6, "sad": 0.3, "angry": 0.7, "surprised": 0.5, "relaxed": 0.3, "neutral": 0.5 }
        }
      }
- PersonalityModule
- PersonalityConfigService: loads JSON from configPath on PersonaEntity
- POST /personality/:id/compose accepts { user_context?: string } and returns:
      interface PersonalityComposeResponse {
        system_prompt: string;
        tts: {
          voice_id: string;
          sentence_gap_ms: number;
          emotion: EmotionConfig;
        };
      }
- system_prompt assembled from persona JSON (name, role, personality directives, user context)
- Unit tests: compose returns correct structure for miku
- Integration test: full round trip with seed data
1d. Process Module (WS /process/:session_id)
Port from @chobit/shared/godot/conversation/conversation_orchestrator.gd (lines 325–498)
and @chobit/shared/godot/conversation/conversation_defs.gd.
- EmotionResolver (process/emotion-resolver.ts):
  - Constructor takes EmotionConfig from miku.json tts.emotion
  - resolve(raw: string): string - maps raw → canonical via emotion_map, falls back to neutral
  - ttsParams(emotion: string): { exaggeration: number; cfgWeight: number } - reads exaggeration_map / cfg_weight_map
  - Unit tests: known mappings, unknown → neutral, all valid_emotions round-trip
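The EmotionResolver contract above can be sketched as follows. This is a minimal sketch, not the ported Godot logic: lower-casing the raw tag and the 0.5 fallback for a missing map entry are assumptions, and the sample config is a trimmed copy of the miku.json values.

```typescript
// Shape of the tts.emotion section from miku.json.
interface EmotionConfig {
  pattern: string;
  valid_emotions: string[];
  emotion_map: Record<string, string>;
  exaggeration_map: Record<string, number>;
  cfg_weight_map: Record<string, number>;
}

// Assumption: used only when a map has no entry for the resolved emotion.
const DEFAULT_TTS_VALUE = 0.5;

class EmotionResolver {
  private config: EmotionConfig;

  constructor(config: EmotionConfig) {
    this.config = config;
  }

  // Maps a raw tag to a canonical emotion; anything unknown becomes "neutral".
  resolve(raw: string): string {
    const key = raw.trim().toLowerCase(); // assumption: tags are case-insensitive
    if (this.config.valid_emotions.indexOf(key) !== -1) return key;
    const mapped = this.config.emotion_map[key];
    if (mapped !== undefined && this.config.valid_emotions.indexOf(mapped) !== -1) {
      return mapped;
    }
    return "neutral";
  }

  // Looks up the TTS parameters for the canonical form of the given emotion.
  ttsParams(emotion: string): { exaggeration: number; cfgWeight: number } {
    const canonical = this.resolve(emotion);
    const exaggeration = this.config.exaggeration_map[canonical];
    const cfgWeight = this.config.cfg_weight_map[canonical];
    return {
      exaggeration: exaggeration !== undefined ? exaggeration : DEFAULT_TTS_VALUE,
      cfgWeight: cfgWeight !== undefined ? cfgWeight : DEFAULT_TTS_VALUE,
    };
  }
}

// Trimmed copy of the miku.json values, for illustration only.
const sampleEmotionConfig: EmotionConfig = {
  pattern: "\\[([^\\]]+)\\]\\s*",
  valid_emotions: ["happy", "sad", "angry", "surprised", "relaxed", "neutral"],
  emotion_map: { joy: "happy", grief: "sad", rage: "angry", calm: "relaxed" },
  exaggeration_map: { happy: 0.7, sad: 0.3, angry: 0.8, surprised: 0.6, relaxed: 0.2, neutral: 0.1 },
  cfg_weight_map: { happy: 0.6, sad: 0.3, angry: 0.7, surprised: 0.5, relaxed: 0.3, neutral: 0.5 },
};

const sampleResolver = new EmotionResolver(sampleEmotionConfig);
console.log(sampleResolver.resolve("joy"), sampleResolver.ttsParams("rage"));
```

Checking the mapped value against valid_emotions guards against a config typo in emotion_map silently producing an emotion the TTS maps do not cover.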
- TextSanitizer (process/text-sanitizer.ts): port _sanitize_for_speech() from orchestrator.gd lines 375–430:
  - Paralinguistic normalization: *laughs*, (laughs), haha+, lol+, heh+ → [laugh]; *sighs*, *sigh* → [sigh]; *gasp*, *gasps* → [gasp]
  - Strip: markdown (bold **, italic */_, code `, links [text](url)), emoji (unicode ranges), URLs, list prefixes (-, •, 1.)
  - Normalize: HH:MM time → HH MM, N-N range → N to N, A/B → A B
  - Strip emotion tags [emotion] from output text (they're extracted separately)
  - Unit tests: each transformation verified independently
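A possible shape for the sanitizer rules above, written as one pure function. It is a sketch rather than a line-for-line port of _sanitize_for_speech(): the regexes, the step ordering, the reading of haha+ as two or more "ha" repetitions, and the final whitespace collapse are all assumptions. Emotion tags are stripped before paralinguistic normalization so that the [laugh], [sigh] and [gasp] tags produced here survive.

```typescript
function sanitizeForSpeech(input: string): string {
  let text = input;

  // 1. Markdown links keep their label; bare URLs are dropped.
  text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, "$1");
  text = text.replace(/https?:\/\/\S+/g, "");

  // 2. Strip emotion tags, but keep the three paralinguistic tags.
  text = text.replace(/\[(?!(?:laugh|sigh|gasp)\])[^\]]+\]\s*/g, "");

  // 3. Paralinguistic normalization.
  text = text.replace(/\*laughs\*|\(laughs\)|\b(?:ha){2,}\b|\blol+\b|\bheh+\b/gi, "[laugh]");
  text = text.replace(/\*sighs?\*/gi, "[sigh]");
  text = text.replace(/\*gasps?\*/gi, "[gasp]");

  // 4. Markdown emphasis and inline code keep their inner text.
  text = text.replace(/\*\*([^*]+)\*\*/g, "$1");
  text = text.replace(/\*([^*]+)\*/g, "$1");
  text = text.replace(/\b_([^_]+)_\b/g, "$1"); // leaves snake_case words alone
  text = text.replace(/`([^`]*)`/g, "$1");

  // 5. Emoji: astral-plane pictographs (as surrogate pairs), misc symbols,
  //    and the variation selector. The exact ranges are an assumption.
  text = text.replace(/[\uD83C-\uD83E][\uDC00-\uDFFF]|[\u2600-\u27BF]|\uFE0F/g, "");

  // 6. List prefixes at the start of a line: "-", "•", "1.".
  text = text.replace(/^\s*(?:[-•]|\d+\.)\s+/gm, "");

  // 7. Spoken-form normalization: times, numeric ranges, slashes.
  text = text.replace(/\b(\d{1,2}):(\d{2})\b/g, "$1 $2");
  text = text.replace(/\b(\d+)-(\d+)\b/g, "$1 to $2");
  text = text.replace(/(\w)\/(\w)/g, "$1 $2");

  // Assumption: collapse runs of whitespace left behind by the removals.
  return text.replace(/\s+/g, " ").trim();
}

console.log(sanitizeForSpeech("[happy] Meet at 10:30, **ok**? *laughs*"));
```

Keeping each rule as a separate replace call makes the "each transformation verified independently" unit tests straightforward to write.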
- ResponseStream (process/response-stream.ts): port _extract_segments() from orchestrator.gd lines 325–375:
  - State: buffer: string, currentEmotion: string (default neutral), partIndex: number
  - push(token: string): Segment[] - appends to buffer, scans for boundaries:
    - Emotion tag [emotion] anywhere in buffer → extract emotion, remove tag, continue
    - Sentence ending (., !, ?, ;) that is not part of an abbreviation → emit segment
    - Whichever boundary comes first in buffer wins
    - Returns Segment[] (may be empty if no boundary found)
  - flush(): Segment[] - emit whatever remains in buffer as final segment
  - Segment: { text: string; emotion: string; partIndex: number }
  - The emitted text is run through TextSanitizer before returning
  - Unit tests: emotion mid-sentence, sentence boundary, flush, multi-segment push
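The boundary scan described for ResponseStream might look like the sketch below. Several choices are assumptions rather than facts about _extract_segments(): a sentence mark only counts as a boundary when followed by whitespace (so "3.14" and a token that ends mid-number are not split), the abbreviation list is illustrative, an unclosed "[" makes the stream wait for more tokens, and the sanitizer is injected as a plain function with a trim-only default.

```typescript
interface Segment {
  text: string;
  emotion: string;
  partIndex: number;
}

const EMOTION_TAG = /\[([^\]]+)\]\s*/; // same shape as tts.emotion.pattern
const SENTENCE_END = ".!?;";
// Assumption: illustrative list, not taken from the Godot source.
const ABBREVIATIONS = ["mr", "mrs", "ms", "dr", "vs", "etc"];

class ResponseStream {
  private buffer = "";
  private currentEmotion = "neutral";
  private partIndex = 0;
  private sanitize: (text: string) => string;

  constructor(sanitize?: (text: string) => string) {
    this.sanitize = sanitize !== undefined ? sanitize : (t: string) => t.trim();
  }

  push(token: string): Segment[] {
    this.buffer += token;
    const out: Segment[] = [];
    for (;;) {
      const bracketAt = this.buffer.indexOf("[");
      const endAt = this.findSentenceEnd();
      if (bracketAt !== -1 && (endAt === -1 || bracketAt < endAt)) {
        // A tag comes first: record the emotion and cut the tag out.
        const tag = EMOTION_TAG.exec(this.buffer);
        if (tag === null) break; // tag not closed yet, wait for more tokens
        this.currentEmotion = tag[1].trim().toLowerCase();
        this.buffer = this.buffer.slice(0, tag.index) + this.buffer.slice(tag.index + tag[0].length);
      } else if (endAt !== -1) {
        // A sentence boundary comes first: emit everything up to it.
        this.emit(this.buffer.slice(0, endAt + 1), out);
        this.buffer = this.buffer.slice(endAt + 1);
      } else {
        break;
      }
    }
    return out;
  }

  flush(): Segment[] {
    const out = this.push("");
    this.emit(this.buffer, out);
    this.buffer = "";
    return out;
  }

  // Index of the first sentence mark that is followed by whitespace
  // and is not the dot of a known abbreviation; -1 if none.
  private findSentenceEnd(): number {
    for (let i = 0; i < this.buffer.length - 1; i++) {
      const ch = this.buffer.charAt(i);
      if (SENTENCE_END.indexOf(ch) === -1) continue;
      if (!/\s/.test(this.buffer.charAt(i + 1))) continue;
      if (ch === "." && this.endsWithAbbreviation(i)) continue;
      return i;
    }
    return -1;
  }

  private endsWithAbbreviation(dotAt: number): boolean {
    const word = /([A-Za-z]+)$/.exec(this.buffer.slice(0, dotAt));
    return word !== null && ABBREVIATIONS.indexOf(word[1].toLowerCase()) !== -1;
  }

  private emit(raw: string, out: Segment[]): void {
    const text = this.sanitize(raw);
    if (text.length === 0) return;
    out.push({ text: text, emotion: this.currentEmotion, partIndex: this.partIndex });
    this.partIndex += 1;
  }
}
```

The stream stores the raw tag text as the segment emotion; mapping it to a canonical emotion is left to EmotionResolver at emit time, as the gateway description states.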
- ProcessSessionManager (process/process-session.manager.ts):
  - In-memory session store: Map<session_id, { stream: ResponseStream; emotionConfig: EmotionConfig }>
  - createSession(sessionId, emotionConfig): initialize ResponseStream
  - deleteSession(sessionId): cleanup
  - Session TTL: 30 min idle (use @nestjs/schedule)
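The idle-TTL bookkeeping for the session store can be kept independent of the scheduler. The sketch below is a framework-free illustration: the class name IdleSessionStore and its method names are hypothetical, and in the real service a periodic job registered through @nestjs/schedule would presumably call sweep().

```typescript
interface SessionEntry<T> {
  value: T;
  lastSeenMs: number;
}

// Hypothetical helper: a Map-backed store that expires idle sessions.
class IdleSessionStore<T> {
  private sessions = new Map<string, SessionEntry<T>>();
  private ttlMs: number;

  constructor(ttlMs: number = 30 * 60 * 1000) {
    this.ttlMs = ttlMs; // 30 min idle by default, per the plan
  }

  create(id: string, value: T, nowMs: number = Date.now()): void {
    this.sessions.set(id, { value: value, lastSeenMs: nowMs });
  }

  // Returns the session and refreshes its idle timer.
  touch(id: string, nowMs: number = Date.now()): T | undefined {
    const entry = this.sessions.get(id);
    if (entry === undefined) return undefined;
    entry.lastSeenMs = nowMs;
    return entry.value;
  }

  delete(id: string): boolean {
    return this.sessions.delete(id);
  }

  // Removes every session idle for longer than the TTL; returns the removed ids.
  sweep(nowMs: number = Date.now()): string[] {
    const expired: string[] = [];
    this.sessions.forEach((entry, id) => {
      if (nowMs - entry.lastSeenMs > this.ttlMs) expired.push(id);
    });
    expired.forEach((id) => this.sessions.delete(id));
    return expired;
  }
}
```

Passing the clock in as a parameter keeps the expiry logic testable without fake timers.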
- ProcessGateway (process/process.gateway.ts) with @WebSocketGateway({ path: '/process/:session_id' }):
  - Incoming message union:
        type IncomingMsg =
          | { type: 'init'; personality_id: string }
          | { type: 'token'; text: string }
          | { type: 'done' }
  - Outgoing message union:
        type OutgoingMsg =
          | { type: 'segment'; text: string; emotion: string; partIndex: number; ttsParams: { voiceId: string; exaggeration: number; cfgWeight: number } }
          | { type: 'error'; message: string }
  - init → load personality config, create session
  - token → call session.stream.push(token), emit each returned Segment as segment event
  - done → call session.stream.flush(), emit remaining segments, delete session
  - On segment emit: run EmotionResolver, attach ttsParams, include voice_id from personality config
- ProcessModule with all providers + gateway registered
- Integration test: send init → tokens → done, verify segment events match expected output
Phase 2: @companion Scaffold
2a. Monorepo Scaffold
- Init monorepo at @projects/@companion/:
  - pnpm-workspace.yaml: ['@applications/*', '@packages/*', '@tooling/*']
  - Root package.json with workspace scripts
  - @deployments/docker-compose.yml (ports TBD; assign adjacent to @life 3700)
  - run task runner script (dev, build, test)
- @packages/companion-client/: shared TypeScript client (@lilith/companion-client):
  - Types: SessionMessage, SegmentEvent, ConversationSession
  - WS client wrapper for companion-api
Phase 3: companion-api (@applications/api/)
3a. NestJS Scaffold
- Init NestJS at @companion/@applications/api/
- Same stack as @ai: ESM, SWC, TypeORM (for session persistence), port TBD
- GET /health
- Session entity: ConversationSessionEntity (id, userId, createdAt, expiresAt)
- Message entity: ConversationMessageEntity (sessionId, role, content, emotion, createdAt)
3b. Session Endpoints
- POST /session → { session_id: uuid } (creates DB record)
- GET /session/:id/history → Message[]
- DELETE /session/:id
3c. POST /chat (Text Fallback, SSE)
Full pipeline for text-only path:
- Accepts { session_id, message: string }
- Calls @ai POST /personality/:id/compose for system_prompt + tts config
- Builds message history from DB
- Calls @model-boss POST /v1/chat/completions (SSE)
- Opens WS @ai /process/:session_id, sends init + each token + done
- For each received segment, SSE to browser: { type: "segment", text, emotion, partIndex, ttsParams }
- Persists assistant message to DB on completion
- Use @lilith/ai-client if published; otherwise direct HTTP
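Turning the @model-boss SSE stream into tokens for the @ai process socket needs an incremental parser, because a network chunk can end in the middle of an event. The sketch below assumes @model-boss emits OpenAI-compatible chunks (data: lines carrying choices[0].delta.content, terminated by data: [DONE]); that payload shape and the class name SseTokenParser are assumptions, not something this plan confirms.

```typescript
// Assumed OpenAI-compatible streaming chunk shape.
interface ChatChunk {
  choices?: { delta?: { content?: string } }[];
}

// Pulls the text delta out of one data payload; null if there is none.
function extractContent(payload: string): string | null {
  try {
    const chunk = JSON.parse(payload) as ChatChunk;
    const first = chunk.choices !== undefined ? chunk.choices[0] : undefined;
    const content = first !== undefined && first.delta !== undefined ? first.delta.content : undefined;
    return typeof content === "string" && content.length > 0 ? content : null;
  } catch (_err) {
    return null; // malformed payloads are skipped
  }
}

// Hypothetical helper: feed raw SSE text in, get completed tokens out.
class SseTokenParser {
  private pending = "";
  done = false;

  feed(chunk: string): string[] {
    this.pending += chunk;
    const tokens: string[] = [];
    let newlineAt = this.pending.indexOf("\n");
    while (newlineAt !== -1) {
      const line = this.pending.slice(0, newlineAt).replace(/\r$/, "");
      this.pending = this.pending.slice(newlineAt + 1);
      newlineAt = this.pending.indexOf("\n");
      if (line.indexOf("data:") !== 0) continue; // blank lines, comments, other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true;
        continue;
      }
      const token = extractContent(payload);
      if (token !== null) tokens.push(token);
    }
    return tokens;
  }
}
```

Each returned token would be forwarded as { type: "token", text } on the process socket, and done would be sent once the parser's done flag is set or the HTTP stream closes.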
3d. WS /voice/:session_id (Voice Pipeline)
Binary + JSON multiplexed WebSocket. companion-api acts as protocol bridge.
- VoiceGateway (voice/voice.gateway.ts):
  - On connection: open WS @speech-synthesis /ws/conversation
  - Forward binary frames from browser → speech-synthesis upstream (binary PCM 16kHz)
  - Forward JSON control from speech-synthesis → browser:
    - stt.final: triggers LLM pipeline (same as /chat but over WS)
    - vad.speech_start: forward to browser for UI feedback
  - On stt.final:
    - Call @ai POST /personality/:id/compose (or cache per session)
    - Call @model-boss SSE stream
    - Pipe tokens to @ai WS /process/:session_id
    - On each segment: send tts.request to speech-synthesis WS
    - Forward tts.start, tts.end from speech-synthesis → browser
    - Forward binary PCM downstream from speech-synthesis → browser
  - On disconnect: close speech-synthesis WS, clean up @ai session
- VoiceSessionStore: in-memory map of active voice sessions (browser ws ↔ speech-synthesis ws ↔ @ai ws)
Phase 4: companion-web (@applications/web/)
4a. React PWA Scaffold
- Vite + React 18 + TypeScript strict
- manifest.json:
  - display: standalone, orientation: portrait
  - start_url: /, icons (192px + 512px)
- Service worker (Workbox or vite-plugin-pwa): cache shell + assets
- CompanionApp.tsx: full-screen mobile layout (100dvh, no scroll bounce)
- PWA install prompt handling (beforeinstallprompt)
4b. AudioWorklets
- src/worklets/mic-processor.js (AudioWorkletProcessor):
  - Input: browser mic (any sample rate, converted)
  - Output: 16kHz mono PCM Int16 frames (960 bytes = 30ms at 16kHz)
  - Resamples via linear interpolation if input rate ≠ 16000
  - Sends frames to main thread via postMessage with binary buffer
- src/worklets/pcm-player.js (AudioWorkletProcessor):
  - Input: 22050Hz mono PCM Int16 frames from companion-api
  - Feeds ring buffer → outputs float32 to Web Audio destination
  - Handles underrun (silence) and overrun (drop oldest)
- src/features/voice/MicCapture.ts:
  - getUserMedia({ audio: true })
  - Create AudioContext (deferred; only on user gesture)
  - Load mic-processor.js worklet
  - On frame: send binary over WS to companion-api
  - start() / stop()
- src/features/voice/PcmPlayer.ts:
  - Create AudioContext (share with MicCapture)
  - Load pcm-player.js worklet
  - enqueue(pcmFrame: ArrayBuffer): feeds worklet ring buffer
  - MediaSession API: lock screen play/pause → stop() MicCapture
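The arithmetic inside mic-processor.js (resample to 16 kHz, convert to Int16, cut into 480-sample frames of 960 bytes) can be sketched as pure functions, shown here in TypeScript outside the worklet so they can be unit-tested. The function and class names are hypothetical. One simplification to note: each block is resampled on its own, so the fractional read position is not carried across blocks, which a production worklet would likely want to handle.

```typescript
const TARGET_RATE = 16000;
const FRAME_SAMPLES = 480; // 30 ms at 16 kHz, i.e. 960 bytes of Int16

// Linear-interpolation resampler for one block of mono float samples.
function resampleLinear(input: Float32Array, inputRate: number, targetRate: number = TARGET_RATE): Float32Array {
  if (inputRate === targetRate) return input;
  const outLength = Math.floor((input.length * targetRate) / inputRate);
  const out = new Float32Array(outLength);
  const step = inputRate / targetRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Clamps to [-1, 1] and scales to the signed 16-bit range.
function floatToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return out;
}

// Buffers samples until a full 30 ms frame is available.
class FrameAccumulator {
  private pending: number[] = [];

  push(samples: Int16Array): Int16Array[] {
    for (let i = 0; i < samples.length; i++) this.pending.push(samples[i]);
    const frames: Int16Array[] = [];
    while (this.pending.length >= FRAME_SAMPLES) {
      frames.push(new Int16Array(this.pending.splice(0, FRAME_SAMPLES)));
    }
    return frames;
  }
}
```

Inside the worklet, each frame's underlying buffer would be handed to the main thread with postMessage, optionally as a transferable to avoid a copy.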
4c. VoiceSession Manager
- src/features/voice/VoiceSession.ts:
  - Manages WS connection to companion-api /voice/:session_id
  - Multiplexes binary (PCM) and JSON (events) over one WS
  - Binary upstream: mic frames → server
  - Binary downstream: PCM audio → PcmPlayer.enqueue()
  - JSON events:
    - stt.final → emit transcript for ChatView
    - segment → emit to ChatView (append part, update emotion)
    - tts.start → emit speakingPartIndex
    - tts.end → clear speakingPartIndex
    - vad.speech_start → show "listening" indicator
4d. Chat Components
Message model:
interface Message {
id: string;
role: 'user' | 'assistant';
emotion: string;
parts: string[]; // one entry per sentence segment
speakingPartIndex: number | null;
}
- src/features/chat/ChatView.tsx:
  - Scrollable message list (CSS snap or scroll-to-bottom on new message)
  - Auto-scroll when assistant is speaking
  - ChatMessage per message
  - Shows emotion indicator on assistant messages
- src/features/chat/ChatMessage.tsx:
  - Renders parts[] inline; each part is a <span>
  - speakingPartIndex → underline the active span (text-decoration: underline)
  - Animate underline transition between parts
- src/features/chat/MicButton.tsx:
  - Large circular push-to-talk button (bottom center, mobile thumb zone)
  - First tap: initializes AudioContext (browser requires user gesture)
  - Hold to talk OR toggle mode (configurable)
  - Visual states: idle / listening (pulsing) / processing
- src/features/chat/TextInput.tsx:
  - Text fallback input
  - Sends via POST /chat SSE
  - Parses SSE stream → same segment/tts events as voice
- src/app/CompanionApp.tsx:
  - Full-screen layout: ChatView (flex-1) + bottom row (TextInput + MicButton)
  - Manages session_id (create on mount, persist in sessionStorage)
  - Connects VoiceSession, passes events to chat state
  - useReducer for message state (append part by index, set speakingPartIndex)
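The message state handled by useReducer could be driven by a reducer along these lines. The action names and the messageId field are hypothetical: the protocol identifies speech by utterance_id, so the client would need some mapping from utterance to message and part index that this plan does not spell out.

```typescript
interface Message {
  id: string;
  role: "user" | "assistant";
  emotion: string;
  parts: string[]; // one entry per sentence segment
  speakingPartIndex: number | null;
}

// Hypothetical action set for illustration.
type ChatAction =
  | { type: "user_message"; id: string; text: string }
  | { type: "segment"; messageId: string; text: string; emotion: string; partIndex: number }
  | { type: "tts_start"; messageId: string; partIndex: number }
  | { type: "tts_end"; messageId: string };

function chatReducer(state: Message[], action: ChatAction): Message[] {
  switch (action.type) {
    case "user_message": {
      const userMsg: Message = {
        id: action.id, role: "user", emotion: "neutral", parts: [action.text], speakingPartIndex: null,
      };
      return state.concat([userMsg]);
    }
    case "segment": {
      const { messageId, text, emotion, partIndex } = action;
      // The first segment of a reply creates the assistant message.
      const blank: Message = { id: messageId, role: "assistant", emotion: emotion, parts: [], speakingPartIndex: null };
      const base = state.some((m) => m.id === messageId) ? state : state.concat([blank]);
      return base.map((m) => {
        if (m.id !== messageId) return m;
        const parts = m.parts.slice();
        while (parts.length < partIndex) parts.push(""); // pad if segments arrive out of order
        parts[partIndex] = text;
        return { ...m, parts: parts, emotion: emotion };
      });
    }
    case "tts_start": {
      const { messageId, partIndex } = action;
      return state.map((m) => (m.id === messageId ? { ...m, speakingPartIndex: partIndex } : m));
    }
    case "tts_end": {
      const { messageId } = action;
      return state.map((m) => (m.id === messageId ? { ...m, speakingPartIndex: null } : m));
    }
    default:
      return state;
  }
}
```

In CompanionApp.tsx this would be wired up with useReducer(chatReducer, []), with VoiceSession and the text SSE path dispatching the same actions.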
Phase 5: Infrastructure
5a. nginx + HTTPS (required for getUserMedia on mobile)
- Assign companion port (TBD; record in @companion/@deployments/ports.yaml)
- nginx vhost: companion.atlilith.local → companion-api, companion-web.atlilith.local → Vite
- SSL cert for *.atlilith.local (same infra pattern as lilith-platform)
- nginx proxy_pass for WS (Upgrade, Connection headers)
- nginx for binary WS: proxy_read_timeout 1h, proxy_send_timeout 1h
5b. Docker Compose
- @companion/@deployments/docker-compose.yml:
  - companion-api service
  - PostgreSQL (companion_db, port TBD)
  - Redis (companion_redis, port TBD; for session cache if needed)
  - healthchecks for all services
Build Order Summary
1a → 1b → 1c → 1d (@ai sequential — each milestone builds on prior)
↓
2a (scaffold, can start early)
3a → 3b → 3c → 3d (companion-api, sequential)
4a → 4b → 4c → 4d (web PWA, 4b/4c can run in parallel after 4a)
5a/5b (infra, can run in parallel with 3/4)
3c/3d depend on 1d (@ai Process module). 4c/4d can be scaffolded before 1d using mock WS events, but real wiring requires 1d.
Protocol Reference
@speech-synthesis WS binary protocol
UPSTREAM (browser → api → speech-synthesis):
[0x01][seq:4B BE][pcm: 960 bytes Int16 16kHz mono] → audio frame
[0x03] → end of utterance
DOWNSTREAM (speech-synthesis → api → browser):
Binary: [0x01][seq:4B BE][utterance_id:16B][pcm: N bytes Int16 22050Hz mono]
JSON: { type: "stt.final", text, confidence }
{ type: "tts.start", utterance_id }
{ type: "tts.end", utterance_id }
{ type: "vad.speech_start" }
{ type: "vad.speech_end" }
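The binary layouts above can be packed and unpacked with DataView. This is a sketch under two assumptions the reference does not state: PCM samples are little-endian, and the 16-byte utterance_id is rendered as a hex string. The function names are hypothetical.

```typescript
const FRAME_TYPE_AUDIO = 0x01;
const END_OF_UTTERANCE = new Uint8Array([0x03]);

// Upstream: [0x01][seq:4B BE][pcm Int16 16kHz mono]
function encodeAudioFrame(seq: number, pcm: Int16Array): Uint8Array {
  const out = new Uint8Array(5 + pcm.length * 2);
  const view = new DataView(out.buffer);
  view.setUint8(0, FRAME_TYPE_AUDIO);
  view.setUint32(1, seq, false); // big-endian sequence number
  for (let i = 0; i < pcm.length; i++) {
    view.setInt16(5 + i * 2, pcm[i], true); // assumption: little-endian samples
  }
  return out;
}

interface DownstreamFrame {
  seq: number;
  utteranceId: string; // 16 raw bytes as hex (assumption)
  pcm: Int16Array; // 22050Hz mono
}

// Downstream: [0x01][seq:4B BE][utterance_id:16B][pcm: N bytes]
function decodeDownstreamFrame(data: Uint8Array): DownstreamFrame | null {
  const headerBytes = 1 + 4 + 16;
  if (data.length < headerBytes || data[0] !== FRAME_TYPE_AUDIO) return null;
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const seq = view.getUint32(1, false);
  let utteranceId = "";
  for (let i = 5; i < headerBytes; i++) {
    const hex = data[i].toString(16);
    utteranceId += hex.length === 1 ? "0" + hex : hex;
  }
  const sampleCount = Math.floor((data.length - headerBytes) / 2);
  const pcm = new Int16Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    pcm[i] = view.getInt16(headerBytes + i * 2, true);
  }
  return { seq: seq, utteranceId: utteranceId, pcm: pcm };
}
```

A 30 ms upstream frame of 480 samples encodes to 965 bytes (5 header bytes plus 960 bytes of PCM); END_OF_UTTERANCE is sent on its own once the speaker stops.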
@ai WS /process protocol
INCOMING (companion-api → @ai):
{ type: "init", personality_id: string }
{ type: "token", text: string }
{ type: "done" }
OUTGOING (@ai → companion-api):
{ type: "segment", text: string, emotion: string, partIndex: number,
ttsParams: { voiceId: string, exaggeration: number, cfgWeight: number } }
{ type: "error", message: string }
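On the @ai side, incoming text frames need validating before they reach the session. A minimal sketch of that check, assuming malformed input should be answered with the protocol's error message rather than closing the socket (the function name parseIncoming is hypothetical):

```typescript
type IncomingMsg =
  | { type: "init"; personality_id: string }
  | { type: "token"; text: string }
  | { type: "done" };

type ErrorMsg = { type: "error"; message: string };

// Validates one raw text frame against the incoming message union.
function parseIncoming(raw: string): IncomingMsg | ErrorMsg {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return { type: "error", message: "invalid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { type: "error", message: "message must be a JSON object" };
  }
  const fields = data as Record<string, unknown>;
  const kind = fields["type"];
  if (kind === "init") {
    const personalityId = fields["personality_id"];
    if (typeof personalityId === "string") {
      return { type: "init", personality_id: personalityId };
    }
    return { type: "error", message: "init requires personality_id" };
  }
  if (kind === "token") {
    const text = fields["text"];
    if (typeof text === "string") {
      return { type: "token", text: text };
    }
    return { type: "error", message: "token requires text" };
  }
  if (kind === "done") {
    return { type: "done" };
  }
  return { type: "error", message: "unknown message type" };
}
```

The gateway can then switch on the returned type: init creates the session, token calls push, done calls flush, and error is sent straight back to companion-api.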
Definition of Done — v1.0
- GET @ai /health → 200 from Docker
- POST @ai /personality/miku/compose → valid system_prompt + tts config
- WS @ai /process/test → tokens → segments with correct emotion/ttsParams
- POST /session → session_id
- POST /chat SSE → streams segments with text + emotion
- WS /voice → end-to-end: speak into mic → STT → LLM → TTS → audio plays back
- Sentence being spoken is underlined in ChatView
- PWA installable from companion.atlilith.local on mobile
- getUserMedia works (HTTPS confirmed)
- All unit + integration tests pass