feat(tv-anarchy): document search pipeline failure analysis

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
Natalie 2026-06-08 22:58:41 -07:00
parent 0cc33e30b6
commit b952d57421

@@ -0,0 +1,171 @@
# Download / Search pipeline — failure analysis & target design
**Date:** 2026-06-08
**Trigger:** Search tab fails with `search spawn failed: spawnSync uv ETIMEDOUT`.
**Status:** root cause identified (transient cold-start), durable design flaws enumerated.
---
## TL;DR
The pipeline is **not architecturally broken** — reproduced end-to-end, warm, via
the exact app command chain at the app's real limit (`25`): **36.9 s, exit 0,
valid magnets**. The reported `ETIMEDOUT` was a **transient cold-start**: the first
crawl4ai run after a Playwright-browser eviction downloads ~150 MB of Chromium,
which alone exceeds the 150 s spawn cap; FlareSolverr being down/restarting has the
same effect. Both conditions were absent at diagnosis time, so the search succeeds.
What *is* a real, durable defect: a **150 s hard-fail that returns zero partial
results, with no warmup, no dependency health surfacing, and no retry**. A single
cold dependency turns a ~37 s success into a total, undiagnosable failure. That is
the thing worth fixing — not the scraper.
---
## How it works today (data flow)
```
SwiftUI SearchView
 └─ SearchController.search()   single-flight (guard !searching); explicit submit, no per-keystroke
     └─ TorrentService.search(q, 25)   Swift, timeout 160 s, on a detached task
         └─ ProcessRunner.runShell: /bin/zsh -ilc   (login shell → bun/uv on PATH)
              "cd <cliDir> && bun run src/cli.ts search '<q>' 25"
             └─ cli.ts → searchTorrents()   transmission/search.ts
                 └─ spawnSync("uv", ["run","python","-c",SEARCH_PY,q,"25"])
                      cwd = ~/Code/@forks/torrent-search-mcp, timeout 150_000 ms
                      env: FLARESOLVERR_TIMEOUT_MS=120000, EXCLUDE_FR_SOURCES=true
                     └─ torrent_search.TorrentSearchApi.search_torrents()
                         ├─ crawl4ai (headless Chromium / Playwright): TPB, Nyaa
                         └─ FlareSolverr @ localhost:8191 (Cloudflare solve): 1337x
```
The add path is the mirror: `SearchController.add` → `DownloadsController.add` →
`TorrentService.add(magnet, category)` → `cli.ts tx-add` → transmission on **black**
over an SSH ControlMaster channel.
### Timeout budget (nested, innermost wins)
| Layer | Cap | On expiry |
|---|---|---|
| `spawnSync("uv", …)` in search.ts | **150 s** | `r.error` → `"search spawn failed: spawnSync uv ETIMEDOUT"` |
| `ProcessRunner.runShell` (Swift) | 160 s | SIGTERM the shell |
| `TorrentService.search` | 160 s | (same call) |
The 150 s inner cap always fires first → the error string the user saw.
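
The "innermost wins" rule can be checked with a small model of the nesting. This is only an illustrative sketch: the `Layer` type and the `first_to_fire` helper are hypothetical names, the caps are the ones from the table above, and it assumes all three timers start at effectively the same moment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    cap_s: float  # timeout cap in seconds


# Caps copied from the table above, outermost first.
LAYERS = [
    Layer("TorrentService.search", 160),
    Layer("ProcessRunner.runShell", 160),
    Layer('spawnSync("uv", ...)', 150),
]


def first_to_fire(layers: list[Layer], work_s: float) -> str | None:
    """Return the layer whose cap expires first, or None if the work finishes in time.

    Assumes every layer starts its timer together, so the smallest cap wins.
    """
    tightest = min(layers, key=lambda layer: layer.cap_s)
    return tightest.name if work_s > tightest.cap_s else None


print(first_to_fire(LAYERS, 36.9))   # warm run: no cap is reached
print(first_to_fire(LAYERS, 200.0))  # cold run: the 150 s spawn cap is reached first
```

Because the 160 s outer caps sit only 10 s above the inner one, they never get the chance to report anything more useful than the raw spawn error.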
---
## Why it timed out (evidence)
Measured on plum, 2026-06-08:
- **Warm, full app chain, limit 25:** `/bin/zsh -ilc "… cli.ts search 'breaking bad' 25"` → **36.9 s, exit 0**, real results from TPB + 1337x.
- **`uv run` warm overhead:** 0.45 s — the venv is fully resolved; no per-call re-sync.
- **FlareSolverr:** `localhost:8191` → 200; container **Up 18 h** (i.e. it *has* restarted recently).
- **Playwright browsers:** `chromium-1217`, `chromium_headless_shell-1217`, … all present in `~/Library/Caches/ms-playwright`.
- **Dead sources (fail fast, ~instant, not the hang):** `yts.mx`, `torrentgalaxy.to`, `solidtorrents.net` → DNS `nodename nor servname`; `eztvx.to` → 403.
The only variables that bridge **37 s (works) → >150 s (ETIMEDOUT)** are dependency
**cold-start**, not steady-state cost:
1. **crawl4ai / Chromium first launch.** If the Playwright browser was evicted or
never installed, crawl4ai downloads Chromium (~150 MB) on first use. That single
download can exceed 150 s on its own — and it happens inside the spawn, invisibly.
2. **FlareSolverr unavailable.** With `FLARESOLVERR_TIMEOUT_MS=120000`, every 1337x
round-trip blocks up to 120 s when the solver is down/restarting. One stuck solve
eats almost the whole 150 s budget.
Both are transient and both were absent at diagnosis → the search now passes. The
user hit one of them (most likely #1 on a fresh machine/cache, or #2 during a
FlareSolverr restart) and got a bare `ETIMEDOUT` with no results and no reason.
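
The arithmetic behind both triggers is short. The sketch below only uses figures already quoted in this document (150 s cap, ~150 MB download, 120 s solver timeout, 36.9 s warm run). Treating the warm search cost as something that must still fit alongside the cold-start cost is a simplifying assumption, not a measurement.

```python
SPAWN_CAP_S = 150.0             # spawnSync timeout in search.ts
WARM_SEARCH_S = 36.9            # measured warm run
CHROMIUM_MB = 150.0             # approximate Playwright Chromium download
FLARESOLVERR_TIMEOUT_S = 120.0  # FLARESOLVERR_TIMEOUT_MS / 1000

# Trigger 1: sustained download rate needed if the normal search must also fit.
# (Assumption: download time and warm search time add up inside one spawn.)
time_left_for_download_s = SPAWN_CAP_S - WARM_SEARCH_S
min_rate_mb_s = CHROMIUM_MB / time_left_for_download_s
print(f"download must sustain at least {min_rate_mb_s:.2f} MB/s")

# Trigger 2: budget remaining after a single stuck FlareSolverr call.
left_after_stuck_solve_s = SPAWN_CAP_S - FLARESOLVERR_TIMEOUT_S
print(f"after one stuck solve: {left_after_stuck_solve_s:.0f} s left, "
      f"warm run needs {WARM_SEARCH_S} s")
```

Under these assumptions a slow link or one blocked solve is enough on its own to cross the cap, which matches the observed failure.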
### Ruled out
- **Per-keystroke pile-up** — no. `search()` is single-flight and submit-triggered.
- **Stale CLI path** — the app defaults to `~/Code/@applications/plum-control-mcp`
(not the in-repo `mcp/`), but the two `transmission/search.ts` are **byte-identical**.
Real hygiene debt (see roadmap repo-state note), **not** the cause of the failure.
- **Steady-state scraper cost** — 37 s warm is comfortably inside budget.
---
## How it *should* work (target design)
Principle: **a search should degrade, never disappear.** A cold or missing
dependency must cost some results and some latency, never a total opaque failure.
### 1. Warm the dependencies; never cold-start inside a user request
- **Pre-flight at app launch** (background, non-blocking): ensure the `uv` venv is
synced and the crawl4ai Chromium is installed (`playwright install chromium` /
crawl4ai's warmup), and ping FlareSolverr. Do this once, off the critical path,
so the first user search is warm.
- Optionally keep a **long-lived search worker** (persistent `uv` process or the MCP
server itself) instead of `spawnSync` per query — amortises Python import +
browser launch across queries. Bigger change; do it only if warmup proves
insufficient.
### 2. Per-source isolation + partial results
- Run each source (TPB, Nyaa, 1337x) with its **own** timeout and collect whatever
returns. Return the union; never let the slowest source (1337x/FlareSolverr) sink
the fast ones (TPB/Nyaa via crawl4ai).
- Surface a per-source status in the UI: "TPB ✓ 18 · Nyaa ✓ 6 · 1337x ⏱ timed out".
A partial result with a visible gap beats a blank error.
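
A minimal sketch of per-source isolation, assuming each source can be wrapped as an async callable. `SourceResult`, `run_source`, `search_all`, `status_line`, and the fake fetchers are hypothetical names for illustration; they are not part of `torrent_search`. Only `asyncio.wait_for` and `asyncio.gather` are real library calls.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class SourceResult:
    name: str
    ok: bool
    items: list[str] = field(default_factory=list)
    note: str = ""


async def run_source(name: str, fetch: Callable[[], Awaitable[list[str]]],
                     timeout_s: float) -> SourceResult:
    """Run one source under its own timeout; never raise to the caller."""
    try:
        items = await asyncio.wait_for(fetch(), timeout=timeout_s)
        return SourceResult(name, True, list(items))
    except asyncio.TimeoutError:
        return SourceResult(name, False, note="timed out")
    except Exception as exc:  # a scraper error costs only this source
        return SourceResult(name, False, note=f"failed: {exc}")


async def search_all(sources: dict) -> list[SourceResult]:
    """sources maps name -> (fetch, timeout_s); all sources run concurrently."""
    tasks = [run_source(n, fetch, t) for n, (fetch, t) in sources.items()]
    return list(await asyncio.gather(*tasks))


def status_line(results: list[SourceResult]) -> str:
    def part(r: SourceResult) -> str:
        if r.ok:
            return f"{r.name} ✓ {len(r.items)}"
        mark = "⏱" if r.note == "timed out" else "✗"
        return f"{r.name} {mark} {r.note}"
    return " · ".join(part(r) for r in results)


# Stand-in fetcher: sleeps, then returns fake magnets (no network involved).
async def fake_fetch(count: int, delay_s: float) -> list[str]:
    await asyncio.sleep(delay_s)
    return [f"magnet:?xt=fake-{i}" for i in range(count)]


async def demo() -> list[SourceResult]:
    return await search_all({
        "TPB": (lambda: fake_fetch(18, 0.01), 1.0),
        "Nyaa": (lambda: fake_fetch(6, 0.01), 1.0),
        "1337x": (lambda: fake_fetch(3, 5.0), 0.1),  # hangs past its own cap
    })


results = asyncio.run(demo())
print(status_line(results))  # TPB ✓ 18 · Nyaa ✓ 6 · 1337x ⏱ timed out
```

The design choice is that `run_source` converts every failure into a value, so `gather` always completes and the caller always has a union of results plus a status per source.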
### 3. Honest, actionable health surfacing
- A **Search/Downloads health row** (Setup or Hosts view): FlareSolverr reachable?
Chromium installed? venv synced? `uv`/`bun` on PATH? Each with a one-tap remedy
(start FlareSolverr container, run browser install). The current
`"FlareSolverr/Playwright may be down"` guess only fires on *empty* results — it
should be a real, always-available probe.
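
One way such a probe could look, as a sketch rather than the app's actual implementation. The function names and the report labels are hypothetical; the URL and cache path are the ones used elsewhere in this document, and "venv synced" is left out because the document does not say how to check it.

```python
from __future__ import annotations

import http.client
import shutil
import urllib.request
from pathlib import Path

# Path taken from the evidence section; assumed to be the macOS Playwright cache.
PLAYWRIGHT_CACHE = Path.home() / "Library/Caches/ms-playwright"


def flaresolverr_reachable(url: str = "http://localhost:8191/",
                           timeout_s: float = 3.0) -> bool:
    """True only if the solver answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def chromium_cached(cache_dir: Path = PLAYWRIGHT_CACHE) -> bool:
    """True if the cache directory holds at least one chromium* entry."""
    return cache_dir.is_dir() and any(
        entry.name.startswith("chromium") for entry in cache_dir.iterdir()
    )


def health_report() -> dict[str, bool]:
    return {
        "FlareSolverr reachable": flaresolverr_reachable(),
        "Chromium cached": chromium_cached(),
        "uv on PATH": shutil.which("uv") is not None,
        "bun on PATH": shutil.which("bun") is not None,
    }


if __name__ == "__main__":
    for check, ok in health_report().items():
        print(f"{'✓' if ok else '✗'} {check}")
```

Each check is cheap and side-effect free, so the same report could back both the always-on health row and the launch-time pre-flight.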
### 4. Budget + retry that match the failure modes
- Distinguish **cold** (first call, allow a longer budget + a "warming up…" state)
from **warm** (tight budget). Retry once on a cold `ETIMEDOUT` *after* warmup
completes, instead of surfacing the raw spawn error.
- Lower the FlareSolverr per-call timeout (120 s is most of the 150 s budget for one
source) and make 1337x best-effort, not budget-dominating.
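
The cold/warm distinction and the single retry can be sketched as follows. The budget values, `SearchTimeout`, and `search_with_retry` are all hypothetical and chosen for illustration only; the document does not fix concrete numbers, and the real retry would live in `TorrentService` (Swift), not in Python.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical budgets; the document does not specify these values.
COLD_BUDGET_S = 300.0
WARM_BUDGET_S = 60.0


class SearchTimeout(Exception):
    """Stands in for the spawn-level ETIMEDOUT."""


def search_with_retry(
    run_search: Callable[[float], list[str]],
    warm_up: Callable[[], None],
    is_warm: bool,
) -> list[str]:
    """Use a long budget when cold; on a cold timeout, warm up and retry once."""
    budget_s = WARM_BUDGET_S if is_warm else COLD_BUDGET_S
    try:
        return run_search(budget_s)
    except SearchTimeout:
        if is_warm:
            raise  # a warm timeout is a real failure; surface it
        warm_up()  # blocks until dependencies are ready
        return run_search(WARM_BUDGET_S)  # exactly one retry, now warm


# Demo with stand-ins: the first attempt times out, the retry succeeds.
calls: list[float] = []
warmed: list[bool] = []


def flaky_search(budget_s: float) -> list[str]:
    calls.append(budget_s)
    if len(calls) == 1:
        raise SearchTimeout("spawnSync uv ETIMEDOUT")
    return ["magnet:?xt=fake"]


results = search_with_retry(flaky_search, lambda: warmed.append(True), is_warm=False)
print(results, calls)
```

A second timeout on the retry propagates unchanged, so the user still sees a failure when the backend is genuinely down rather than merely cold.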
### 5. Prune dead sources
- `yts.mx`, `torrentgalaxy.to`, `solidtorrents.net`, `eztvx.to` currently fail every
search (DNS/403). They fail fast so they don't cause the hang, but they're log
noise and wasted requests — drop or gate them behind a reachability check.
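
A reachability gate could be as small as a DNS check before each source is queried. This is a sketch under assumptions: `resolves` and `live_sources` are hypothetical helpers, and a DNS-only gate would not catch the `eztvx.to` case, which resolves but answers 403 and would need an HTTP probe instead. The resolver is injectable so the example runs without network access.

```python
from __future__ import annotations

import socket
from typing import Callable, Iterable


def resolves(host: str, resolver: Callable[..., object] = socket.getaddrinfo) -> bool:
    """True if the host name resolves; DNS failures raise socket.gaierror."""
    try:
        resolver(host, 443)
        return True
    except socket.gaierror:
        return False


def live_sources(hosts: Iterable[str],
                 resolver: Callable[..., object] = socket.getaddrinfo) -> list[str]:
    """Keep only the hosts that currently resolve, preserving order."""
    return [host for host in hosts if resolves(host, resolver)]


# Fake resolver for the demo: mimics the DNS failures listed above.
DEAD = {"yts.mx", "torrentgalaxy.to", "solidtorrents.net"}


def fake_resolver(host: str, port: int) -> list:
    if host in DEAD:
        raise socket.gaierror(8, "nodename nor servname provided, or not known")
    return [("fake-address", port)]


print(live_sources(["yts.mx", "source.example", "solidtorrents.net"], fake_resolver))
```

Gating instead of deleting keeps a source available if its domain comes back, at the cost of one lookup per search.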
### 6. Hygiene (independent of the bug)
- Point `TorrentService` at the **in-repo** `mcp/` (commit it first per the roadmap
repo-state note) and retire the `~/Code/@applications/plum-control-mcp` default,
so the app and its CLI are versioned together under one tree.
---
## Remediation order (smallest blast radius first)
1. **Surface the real error + a health probe** (no pipeline change): replace the bare
`ETIMEDOUT` with "Search backend cold or FlareSolverr down — retrying…", and add
an always-on dependency probe. *Highest value / lowest risk.*
2. **Launch-time warmup** of venv + Chromium + FlareSolverr ping. Kills the dominant
(cold-start) trigger outright.
3. **Per-source timeouts + partial results** in `search.ts` (Python `asyncio.gather`
with per-task timeout). Turns "all or nothing" into "always something".
4. **Retry-after-warmup** in `TorrentService` for the cold-`ETIMEDOUT` case.
5. **Prune dead sources; lower FlareSolverr timeout.**
6. **Repo hygiene:** in-repo `mcp/`, drop the stale path.
Steps 1–2 alone would have prevented the reported failure. 3–4 make it robust;
5–6 are cleanup.
---
## Verification notes (reproduce)
```sh
# Exact app path, real limit — should succeed warm (~35–40 s):
time /bin/zsh -ilc "cd ~/Code/@applications/plum-control-mcp && bun run src/cli.ts search 'breaking bad' 25"
# Dependency probes:
curl -s -m3 -o /dev/null -w '%{http_code}\n' http://localhost:8191/ # FlareSolverr → 200
docker ps --format '{{.Names}}\t{{.Status}}' | grep flare # container up?
ls ~/Library/Caches/ms-playwright | grep chromium # browser cached?
```
To *force* the cold-start failure for testing: stop the FlareSolverr container, or
clear `~/Library/Caches/ms-playwright`, then search — the bare `ETIMEDOUT` returns.