feat(tv-anarchy): ✨ document search pipeline failure analysis

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-06-08 22:58:41 -07:00 · 2026-06-08 22:58:41 -07:00 · b952d57421
commit b952d57421
parent 0cc33e30b6
1 changed files with 171 additions and 0 deletions
--- a/.project/history/20260608_download-search-pipeline.md
+++ b/.project/history/20260608_download-search-pipeline.md
@@ -0,0 +1,171 @@
+# Download / Search pipeline — failure analysis & target design
+
+**Date:** 2026-06-08
+**Trigger:** Search tab fails with `search spawn failed: spawnSync uv ETIMEDOUT`.
+**Status:** root cause identified (transient cold-start), durable design flaws enumerated.
+
+---
+
+## TL;DR
+
+The pipeline is **not architecturally broken**. Reproduced end-to-end, warm, via
+the exact app command chain at the app's real limit (`25`), it finished in **36.9 s,
+exit 0, valid magnets**. The reported `ETIMEDOUT` was a **transient cold-start**: the first
+crawl4ai run after a Playwright-browser eviction downloads ~150 MB of Chromium,
+which alone can exceed the 150 s spawn cap; FlareSolverr being down/restarting has the
+same effect. Both conditions were absent at diagnosis time, so the search succeeded.
+
+The real, durable defect is a **150 s hard-fail that returns zero partial
+results, with no warmup, no dependency health surfacing, and no retry**. A single
+cold dependency turns a ~37 s success into a total, undiagnosable failure. That is
+the thing worth fixing, not the scraper.
+
+---
+
+## How it works today (data flow)
+
+```
+SwiftUI SearchView
+  └─ SearchController.search()            single-flight (guard !searching); explicit submit, no per-keystroke
+       └─ TorrentService.search(q, 25)    Swift, timeout 160 s, on a detached task
+            └─ ProcessRunner.runShell:    /bin/zsh -ilc  (login shell → bun/uv on PATH)
+                 "cd <cliDir> && bun run src/cli.ts search '<q>' 25"
+                 └─ cli.ts → searchTorrents()           transmission/search.ts
+                      └─ spawnSync("uv", ["run","python","-c",SEARCH_PY,q,"25"])
+                         cwd = ~/Code/@forks/torrent-search-mcp,  timeout 150_000 ms
+                         env: FLARESOLVERR_TIMEOUT_MS=120000, EXCLUDE_FR_SOURCES=true
+                           └─ torrent_search.TorrentSearchApi.search_torrents()
+                                ├─ crawl4ai (headless Chromium / Playwright):  TPB, Nyaa
+                                └─ FlareSolverr @ localhost:8191 (Cloudflare solve): 1337x
+```
+
+The add path mirrors this: `SearchController.add` → `DownloadsController.add` →
+`TorrentService.add(magnet, category)` → `cli.ts tx-add` → transmission on **black**
+over an SSH ControlMaster channel.
+
+### Timeout budget (nested, innermost wins)
+
+| Layer | Cap | On expiry |
+|---|---|---|
+| `spawnSync("uv", …)` in search.ts | **150 s** | `r.error` → `"search spawn failed: spawnSync uv ETIMEDOUT"` |
+| `ProcessRunner.runShell` (Swift) | 160 s | SIGTERM the shell |
+| `TorrentService.search` | 160 s | (same call) |
+
+The 150 s inner cap always fires first → the error string the user saw.
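
The nesting above can be illustrated with a small Python analogue. This is only a sketch: the real inner layer is Node's `spawnSync` in `search.ts`, and `first_to_fire` and `run_with_cap` are hypothetical helper names, not functions from the repo. It shows that the smallest cap decides which error surfaces, and that a hard cap leaves only an error string behind.

```python
import subprocess
import sys

# Caps from the table above, in seconds.
CAPS_S = {
    'spawnSync("uv", ...) in search.ts': 150,
    "ProcessRunner.runShell (Swift)": 160,
    "TorrentService.search": 160,
}


def first_to_fire(caps: dict[str, float]) -> tuple[str, float]:
    """Return the layer whose timer expires first (the smallest cap)."""
    return min(caps.items(), key=lambda item: item[1])


def run_with_cap(cmd: list[str], cap_s: float) -> str:
    """Run cmd under a hard cap; on expiry only an error string survives."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=cap_s)
    except subprocess.TimeoutExpired:
        # The child is killed and whatever it had produced so far is discarded,
        # which mirrors the "zero partial results" behaviour of the pipeline.
        return f"search spawn failed: timed out after {cap_s} s"
    return done.stdout


if __name__ == "__main__":
    print(first_to_fire(CAPS_S))
    slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_cap(slow_child, cap_s=0.5))
```

Because the outer Swift caps sit only 10 s above the inner one, they never get a chance to fire in practice; any friendlier error handling has to start at the inner layer.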
+
+---
+
+## Why it timed out (evidence)
+
+Measured on plum, 2026-06-08:
+
+- **Warm, full app chain, limit 25:** `/bin/zsh -ilc "… cli.ts search 'breaking bad' 25"` → **36.9 s, exit 0**, real results from TPB + 1337x.
+- **`uv run` warm overhead:** 0.45 s — the venv is fully resolved; no per-call re-sync.
+- **FlareSolverr:** `localhost:8191` → 200; container **Up 18 h** (i.e. it *has* restarted recently).
+- **Playwright browsers:** `chromium-1217`, `chromium_headless_shell-1217`, … all present in `~/Library/Caches/ms-playwright`.
+- **Dead sources (fail fast, ~instant, not the hang):** `yts.mx`, `torrentgalaxy.to`, `solidtorrents.net` → DNS `nodename nor servname`; `eztvx.to` → 403.
+
+The only variables that bridge **37 s (works) → >150 s (ETIMEDOUT)** are dependency
+**cold-start**, not steady-state cost:
+
+1. **crawl4ai / Chromium first launch.** If the Playwright browser was evicted or
+   never installed, crawl4ai downloads Chromium (~150 MB) on first use. That single
+   download can exceed 150 s on its own — and it happens inside the spawn, invisibly.
+2. **FlareSolverr unavailable.** With `FLARESOLVERR_TIMEOUT_MS=120000`, every 1337x
+   round-trip blocks up to 120 s when the solver is down/restarting. One stuck solve
+   eats almost the whole 150 s budget.
+
+Both are transient and both were absent at diagnosis → the search now passes. The
+user hit one of them (most likely #1 on a fresh machine/cache, or #2 during a
+FlareSolverr restart) and got a bare `ETIMEDOUT` with no results and no reason.
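
The budget arithmetic behind trigger #2 can be worked through directly. The figures are the ones measured or configured above; the one assumption, labelled in the code, is that stuck FlareSolverr round-trips would run one after another rather than in parallel.

```python
import math

SPAWN_CAP_S = 150             # spawnSync timeout in search.ts
FLARESOLVERR_TIMEOUT_S = 120  # FLARESOLVERR_TIMEOUT_MS=120000
WARM_RUN_S = 36.9             # measured warm, full app chain

# Time left for everything else after one stuck 1337x solve.
left_after_one_stuck_solve_s = SPAWN_CAP_S - FLARESOLVERR_TIMEOUT_S

# Fraction of the spawn budget that a single stuck solve consumes.
stuck_share = FLARESOLVERR_TIMEOUT_S / SPAWN_CAP_S

# Assumption: stuck solves are sequential. This is the smallest number of
# them that is guaranteed to exceed the spawn cap.
stuck_solves_to_exceed = math.floor(SPAWN_CAP_S / FLARESOLVERR_TIMEOUT_S) + 1

# Headroom of the warm run under the cap.
warm_headroom_s = round(SPAWN_CAP_S - WARM_RUN_S, 1)

if __name__ == "__main__":
    print(left_after_one_stuck_solve_s, stuck_share,
          stuck_solves_to_exceed, warm_headroom_s)
```

One stuck solve leaves 30 s for the rest of the run, and under the sequential assumption a second one cannot fit at all, which is why a FlareSolverr outage is enough to explain the timeout even though the warm run has well over 100 s of headroom.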
+
+### Ruled out
+
+- **Per-keystroke pile-up** — no. `search()` is single-flight and submit-triggered.
+- **Stale CLI path** — the app defaults to `~/Code/@applications/plum-control-mcp`
+  (not the in-repo `mcp/`), but the two `transmission/search.ts` are **byte-identical**.
+  Real hygiene debt (see roadmap repo-state note), **not** the cause of the failure.
+- **Steady-state scraper cost** — 37 s warm is comfortably inside budget.
+
+---
+
+## How it *should* work (target design)
+
+Principle: **a search should degrade, never disappear.** A cold or missing
+dependency must cost some results and some latency, never a total opaque failure.
+
+### 1. Warm the dependencies; never cold-start inside a user request
+- **Pre-flight at app launch** (background, non-blocking): ensure the `uv` venv is
+  synced and the crawl4ai Chromium is installed (`playwright install chromium` /
+  crawl4ai's warmup), and ping FlareSolverr. Do this once, off the critical path,
+  so the first user search is warm.
+- Optionally keep a **long-lived search worker** (persistent `uv` process or the MCP
+  server itself) instead of `spawnSync` per query — amortises Python import +
+  browser launch across queries. Bigger change; do it only if warmup proves
+  insufficient.
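
A background pre-flight along these lines could look like the sketch below. It is written in Python for illustration only (the app itself is Swift), and `start_warmup` plus the three placeholder steps are hypothetical names. The point it shows is that warmup runs off the critical path and that one failed step does not stop the others.

```python
import threading
from typing import Callable

WarmupStep = tuple[str, Callable[[], None]]


def start_warmup(steps: list[WarmupStep]) -> tuple[threading.Thread, dict[str, str]]:
    """Run warmup steps on a daemon thread; return the thread and a status map."""
    status: dict[str, str] = {name: "pending" for name, _ in steps}

    def worker() -> None:
        for name, step in steps:
            try:
                step()
                status[name] = "ok"
            except Exception as exc:  # a failed step must not stop the rest
                status[name] = f"failed: {exc}"

    thread = threading.Thread(target=worker, name="search-warmup", daemon=True)
    thread.start()  # returns immediately; the caller is never blocked
    return thread, status


if __name__ == "__main__":
    # Placeholder steps. A real version might shell out to
    # `playwright install chromium` and ping http://localhost:8191/ here.
    def sync_venv() -> None:
        pass

    def install_chromium() -> None:
        pass

    def ping_flaresolverr() -> None:
        raise ConnectionError("connection refused")

    thread, status = start_warmup([
        ("venv", sync_venv),
        ("chromium", install_chromium),
        ("flaresolverr", ping_flaresolverr),
    ])
    thread.join()  # only the demo waits; an app would poll `status` instead
    print(status)
```

The status map doubles as input for a "warming up…" state in the UI: a search submitted while any entry is still `pending` can be held or given the longer cold budget.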
+
+### 2. Per-source isolation + partial results
+- Run each source (TPB, Nyaa, 1337x) with its **own** timeout and collect whatever
+  returns. Return the union; never let the slowest source (1337x/FlareSolverr) sink
+  the fast ones (TPB/Nyaa via crawl4ai).
+- Surface a per-source status in the UI: "TPB ✓ 18 · Nyaa ✓ 6 · 1337x ⏱ timed out".
+  A partial result with a visible gap beats a blank error.
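
Per-source isolation can be sketched with `asyncio.wait_for` wrapped around each source and `asyncio.gather` collecting the outcomes. The source callables and timeout values below are stand-ins, not the actual `torrent_search` API; `search_one` and `search_all` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

Source = Callable[[str], Awaitable[list[str]]]


async def search_one(name: str, source: Source, query: str,
                     timeout_s: float) -> tuple[str, list[str], str]:
    """Query one source under its own timeout; never raise to the caller."""
    try:
        hits = await asyncio.wait_for(source(query), timeout=timeout_s)
        return name, hits, f"ok ({len(hits)})"
    except asyncio.TimeoutError:
        return name, [], "timed out"
    except Exception as exc:
        return name, [], f"failed: {exc}"


async def search_all(sources: dict[str, tuple[Source, float]],
                     query: str) -> tuple[list[str], dict[str, str]]:
    """Run all sources concurrently; return the union plus per-source status."""
    outcomes = await asyncio.gather(
        *(search_one(name, src, query, t) for name, (src, t) in sources.items())
    )
    results = [hit for _, hits, _ in outcomes for hit in hits]
    status = {name: state for name, _, state in outcomes}
    return results, status


if __name__ == "__main__":
    async def fast_source(query: str) -> list[str]:  # stand-in for TPB/Nyaa
        await asyncio.sleep(0.01)
        return [f"magnet:?dn={query}"]

    async def stuck_source(query: str) -> list[str]:  # stand-in for a stuck 1337x
        await asyncio.sleep(5)
        return []

    demo = {"TPB": (fast_source, 1.0), "1337x": (stuck_source, 0.1)}
    print(asyncio.run(search_all(demo, "breaking bad")))
```

The status map is what the per-source UI row would render. Because every failure is converted to a status inside `search_one`, `gather` itself never raises and the fast sources' results always reach the caller.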
+
+### 3. Honest, actionable health surfacing
+- A **Search/Downloads health row** (Setup or Hosts view): FlareSolverr reachable?
+  Chromium installed? venv synced? `uv`/`bun` on PATH? Each with a one-tap remedy
+  (start FlareSolverr container, run browser install). The current
+  `"FlareSolverr/Playwright may be down"` guess only fires on *empty* results — it
+  should be a real, always-available probe.
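
An always-available probe could be as small as the sketch below. It uses only the Python standard library; the function names are hypothetical, the cache path and FlareSolverr URL are the ones from this note, and the "venv synced" check is left out because this note does not say how to test for it.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def chromium_cached(cache_dir: Path) -> bool:
    """True if the Playwright cache holds at least one chromium* entry."""
    return cache_dir.is_dir() and any(
        entry.name.startswith("chromium") for entry in cache_dir.iterdir()
    )


def flaresolverr_reachable(url: str, timeout_s: float = 3.0) -> bool:
    """True if the solver answers HTTP 200 within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def health_report(cache_dir: Path, solver_url: str) -> dict[str, bool]:
    """One boolean per dependency, suitable for a health row in the UI."""
    return {
        "FlareSolverr reachable": flaresolverr_reachable(solver_url),
        "Chromium installed": chromium_cached(cache_dir),
        "uv on PATH": shutil.which("uv") is not None,
        "bun on PATH": shutil.which("bun") is not None,
    }


if __name__ == "__main__":
    cache = Path.home() / "Library" / "Caches" / "ms-playwright"
    for name, ok in health_report(cache, "http://localhost:8191/").items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Each `False` maps onto one of the one-tap remedies listed above, so the probe and the remedy list can share the same keys.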
+
+### 4. Budget + retry that match the failure modes
+- Distinguish **cold** (first call, allow a longer budget + a "warming up…" state)
+  from **warm** (tight budget). Retry once on a cold `ETIMEDOUT` *after* warmup
+  completes, instead of surfacing the raw spawn error.
+- Lower the FlareSolverr per-call timeout (120 s is most of the 150 s budget for one
+  source) and make 1337x best-effort, not budget-dominating.
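
The cold/warm split with a single retry can be sketched as follows. The two budget values are placeholders chosen for illustration, not figures from this analysis, and `run_search` and `warm_up` are injected callables so the sketch stays independent of the real Swift/TypeScript layers. It also assumes a timed-out search surfaces as `TimeoutError`.

```python
from typing import Callable

WARM_BUDGET_S = 60.0   # hypothetical tight budget for a warm backend
COLD_BUDGET_S = 300.0  # hypothetical longer budget after warmup


def search_with_cold_retry(
    run_search: Callable[[float], list[str]],
    warm_up: Callable[[], None],
    warm_budget_s: float = WARM_BUDGET_S,
    cold_budget_s: float = COLD_BUDGET_S,
) -> list[str]:
    """Try once on the warm budget; on timeout, warm up and retry exactly once."""
    try:
        return run_search(warm_budget_s)
    except TimeoutError:
        # Assumption: the first timeout means a cold dependency.
        warm_up()
        # A second timeout propagates, so a genuinely broken backend
        # still reports an error instead of looping.
        return run_search(cold_budget_s)


if __name__ == "__main__":
    calls: list[float] = []

    def fake_search(budget_s: float) -> list[str]:
        calls.append(budget_s)
        if len(calls) == 1:
            raise TimeoutError("cold start")
        return ["magnet:?dn=example"]

    print(search_with_cold_retry(fake_search, warm_up=lambda: None))
    print(calls)
```

Retrying only once keeps the worst-case wait bounded at the sum of the two budgets plus warmup time.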
+
+### 5. Prune dead sources
+- `yts.mx`, `torrentgalaxy.to`, `solidtorrents.net`, `eztvx.to` currently fail every
+  search (DNS/403). They fail fast so they don't cause the hang, but they're log
+  noise and wasted requests — drop or gate them behind a reachability check.
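
A reachability gate for the DNS-dead hosts might look like this sketch. The helper names are hypothetical and the lookup is injectable so it can be tested without network access. A DNS check alone would not catch `eztvx.to`, which resolves but answers 403, so that host would need an HTTP status check or a plain removal.

```python
import socket
from typing import Callable

SUSPECT_HOSTS = ["yts.mx", "torrentgalaxy.to", "solidtorrents.net", "eztvx.to"]


def resolves(host: str, lookup: Callable[..., object] = socket.getaddrinfo) -> bool:
    """True if the host name resolves; DNS failures count as unreachable."""
    try:
        lookup(host, 443)
        return True
    except socket.gaierror:
        return False


def gate_sources(
    hosts: list[str],
    is_reachable: Callable[[str], bool] = resolves,
) -> tuple[list[str], list[str]]:
    """Split hosts into (enabled, skipped) based on a reachability check."""
    enabled = [h for h in hosts if is_reachable(h)]
    skipped = [h for h in hosts if h not in enabled]
    return enabled, skipped


if __name__ == "__main__":
    enabled, skipped = gate_sources(SUSPECT_HOSTS)
    print("enabled:", enabled)
    print("skipped:", skipped)
```

Running the gate once at launch and caching the result keeps the check itself from adding per-search latency.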
+
+### 6. Hygiene (independent of the bug)
+- Point `TorrentService` at the **in-repo** `mcp/` (commit it first per the roadmap
+  repo-state note) and retire the `~/Code/@applications/plum-control-mcp` default,
+  so the app and its CLI are versioned together in one tree.
+
+---
+
+## Remediation order (smallest blast radius first)
+
+1. **Surface the real error + a health probe** (no pipeline change): replace the bare
+   `ETIMEDOUT` with "Search backend cold or FlareSolverr down — retrying…", and add
+   an always-on dependency probe. *Highest value / lowest risk.*
+2. **Launch-time warmup** of venv + Chromium + FlareSolverr ping. Kills the dominant
+   (cold-start) trigger outright.
+3. **Per-source timeouts + partial results** in `search.ts` (Python `asyncio.gather`
+   over tasks that each carry their own `asyncio.wait_for` timeout). Turns "all or
+   nothing" into "always something".
+4. **Retry-after-warmup** in `TorrentService` for the cold-`ETIMEDOUT` case.
+5. **Prune dead sources; lower FlareSolverr timeout.**
+6. **Repo hygiene:** in-repo `mcp/`, drop the stale path.
+
+Steps 1–2 alone would have prevented the reported failure. 3–4 make it robust;
+5–6 are cleanup.
+
+---
+
+## Verification notes (reproduce)
+
+```sh
+# Exact app path, real limit — should succeed warm (~35–40 s):
+time /bin/zsh -ilc "cd ~/Code/@applications/plum-control-mcp && bun run src/cli.ts search 'breaking bad' 25"
+
+# Dependency probes:
+curl -s -m3 -o /dev/null -w '%{http_code}\n' http://localhost:8191/   # FlareSolverr → 200
+docker ps --format '{{.Names}}\t{{.Status}}' | grep flare             # container up?
+ls ~/Library/Caches/ms-playwright | grep chromium                     # browser cached?
+```
+
+To *force* the cold-start failure for testing: stop the FlareSolverr container, or
+clear `~/Library/Caches/ms-playwright`, then run a search; the bare `ETIMEDOUT`
+should reappear.