feat(tv-anarchy): ✨ document search pipeline failure analysis
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 0cc33e30b6
commit b952d57421
1 changed files with 171 additions and 0 deletions
171
.project/history/20260608_download-search-pipeline.md
Normal file
@@ -0,0 +1,171 @@
# Download / Search pipeline: failure analysis & target design

**Date:** 2026-06-08
**Trigger:** Search tab fails with `search spawn failed: spawnSync uv ETIMEDOUT`.
**Status:** root cause identified (transient cold-start), durable design flaws enumerated.

---

## TL;DR

The pipeline is **not architecturally broken**. Reproduced end-to-end, warm, via
the exact app command chain at the app's real limit (`25`), it finished in
**36.9 s with exit 0 and valid magnets**. The reported `ETIMEDOUT` was a
**transient cold-start**: the first crawl4ai run after a Playwright-browser
eviction downloads ~150 MB of Chromium, which can by itself exceed the 150 s
spawn cap; FlareSolverr being down or restarting has the same effect. Neither
condition was present at diagnosis time, so the search succeeded.

What *is* a real, durable defect: a **150 s hard-fail that returns zero partial
results, with no warmup, no dependency health surfacing, and no retry**. A single
cold dependency turns a ~37 s success into a total, undiagnosable failure. That is
the thing worth fixing, not the scraper.

---

## How it works today (data flow)

```
SwiftUI SearchView
 └─ SearchController.search()        single-flight (guard !searching); explicit submit, no per-keystroke
     └─ TorrentService.search(q, 25)  Swift, timeout 160 s, on a detached task
         └─ ProcessRunner.runShell: /bin/zsh -ilc   (login shell → bun/uv on PATH)
              "cd <cliDir> && bun run src/cli.ts search '<q>' 25"
             └─ cli.ts → searchTorrents()   transmission/search.ts
                 └─ spawnSync("uv", ["run","python","-c",SEARCH_PY,q,"25"])
                      cwd = ~/Code/@forks/torrent-search-mcp, timeout 150_000 ms
                      env: FLARESOLVERR_TIMEOUT_MS=120000, EXCLUDE_FR_SOURCES=true
                     └─ torrent_search.TorrentSearchApi.search_torrents()
                         ├─ crawl4ai (headless Chromium / Playwright): TPB, Nyaa
                         └─ FlareSolverr @ localhost:8191 (Cloudflare solve): 1337x
```

The add path mirrors it: `SearchController.add` → `DownloadsController.add` →
`TorrentService.add(magnet, category)` → `cli.ts tx-add` → transmission on **black**
over an SSH ControlMaster channel.

### Timeout budget (nested, innermost wins)

| Layer | Cap | On expiry |
|---|---|---|
| `spawnSync("uv", …)` in search.ts | **150 s** | `r.error` → `"search spawn failed: spawnSync uv ETIMEDOUT"` |
| `ProcessRunner.runShell` (Swift) | 160 s | SIGTERM the shell |
| `TorrentService.search` | 160 s | (same call) |

The 150 s inner cap is the smallest, so it always fires first and produces the
error string the user saw.

---

## Why it timed out (evidence)

Measured on plum, 2026-06-08:

- **Warm, full app chain, limit 25:** `/bin/zsh -ilc "… cli.ts search 'breaking bad' 25"` → **36.9 s, exit 0**, real results from TPB + 1337x.
- **`uv run` warm overhead:** 0.45 s. The venv is fully resolved; no per-call re-sync.
- **FlareSolverr:** `localhost:8191` → 200; container **Up 18 h** (i.e. it *has* restarted recently).
- **Playwright browsers:** `chromium-1217`, `chromium_headless_shell-1217`, … all present in `~/Library/Caches/ms-playwright`.
- **Dead sources (fail fast, ~instant, not the hang):** `yts.mx`, `torrentgalaxy.to`, `solidtorrents.net` → DNS `nodename nor servname`; `eztvx.to` → 403.

The only factors that can stretch **37 s (works)** into **>150 s (`ETIMEDOUT`)** are
dependency **cold-starts**, not steady-state cost:

1. **crawl4ai / Chromium first launch.** If the Playwright browser was evicted or
   never installed, crawl4ai downloads Chromium (~150 MB) on first use. That single
   download can exceed 150 s on its own, and it happens inside the spawn, invisibly.
2. **FlareSolverr unavailable.** With `FLARESOLVERR_TIMEOUT_MS=120000`, every 1337x
   round-trip blocks for up to 120 s when the solver is down or restarting. One stuck
   solve eats almost the whole 150 s budget.

Both are transient and both were absent at diagnosis, which is why the search now
passes. The user hit one of them (most likely #1 on a fresh machine/cache, or #2
during a FlareSolverr restart) and got a bare `ETIMEDOUT` with no results and no
reason.

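The budget arithmetic behind both triggers can be checked in a few lines. This is a back-of-the-envelope sketch that uses only figures quoted in this note (150 s spawn cap, 160 s outer caps, ~150 MB Chromium, 120 s FlareSolverr timeout, 36.9 s warm run). The function names are illustrative, and the last calculation assumes the worst case in which a stuck solve does not overlap with any other work, which the measurements here do not confirm.

```python
# Figures taken from this analysis; all names are illustrative only.
CAPS_S = {
    "spawnSync uv (search.ts)": 150,
    "ProcessRunner.runShell": 160,
    "TorrentService.search": 160,
}
CHROMIUM_MB = 150             # approximate first-use download size
FLARESOLVERR_TIMEOUT_S = 120  # FLARESOLVERR_TIMEOUT_MS=120000
WARM_RUN_S = 36.9             # measured warm end-to-end run


def first_cap_to_fire(caps: dict[str, int]) -> str:
    """Return the layer whose timeout expires first (the smallest cap)."""
    return min(caps, key=caps.get)


def min_download_rate_mb_s(size_mb: float, budget_s: float) -> float:
    """Sustained rate needed to finish a download inside the budget."""
    return size_mb / budget_s


def budget_left_after_stuck_solve(cap_s: float, solve_timeout_s: float) -> float:
    """Seconds remaining if one solve blocks for its full timeout (worst case)."""
    return cap_s - solve_timeout_s


spawn_cap = CAPS_S[first_cap_to_fire(CAPS_S)]
print(first_cap_to_fire(CAPS_S), spawn_cap)
print(min_download_rate_mb_s(CHROMIUM_MB, spawn_cap))
print(budget_left_after_stuck_solve(spawn_cap, FLARESOLVERR_TIMEOUT_S))
```

Under these assumptions the Chromium download alone needs a sustained 1 MB/s (about 8 Mbit/s) with nothing else running, and a single stuck solve leaves 30 s, which is less than the 36.9 s a warm run took.
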
### Ruled out

- **Per-keystroke pile-up:** no. `search()` is single-flight and submit-triggered.
- **Stale CLI path:** the app defaults to `~/Code/@applications/plum-control-mcp`
  (not the in-repo `mcp/`), but the two copies of `transmission/search.ts` are
  **byte-identical**. This is real hygiene debt (see roadmap repo-state note),
  **not** the cause of the failure.
- **Steady-state scraper cost:** 37 s warm is comfortably inside budget.

---

## How it *should* work (target design)

Principle: **a search should degrade, never disappear.** A cold or missing
dependency must cost some results and some latency, never a total opaque failure.

### 1. Warm the dependencies; never cold-start inside a user request
- **Pre-flight at app launch** (background, non-blocking): ensure the `uv` venv is
  synced and the crawl4ai Chromium is installed (`playwright install chromium` /
  crawl4ai's warmup), and ping FlareSolverr. Do this once, off the critical path,
  so the first user search is warm.
- Optionally keep a **long-lived search worker** (persistent `uv` process or the MCP
  server itself) instead of `spawnSync` per query, which amortises Python import and
  browser launch across queries. This is a bigger change; do it only if warmup
  proves insufficient.

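A launch-time pre-flight for the browser might look like the following Python sketch. It assumes the Playwright cache lives in `~/Library/Caches/ms-playwright` (as observed on plum) and that `uv run playwright install chromium` is an acceptable install command when run from the search project directory. `chromium_cached`, `warm_up`, and the injectable `run` parameter are hypothetical names, not existing app code.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Assumed location and command; adjust to the real environment.
PLAYWRIGHT_CACHE = Path.home() / "Library" / "Caches" / "ms-playwright"
INSTALL_CMD = ("uv", "run", "playwright", "install", "chromium")


def chromium_cached(cache_dir: Path) -> bool:
    """True if the cache holds at least one directory whose name starts with 'chromium'."""
    return cache_dir.is_dir() and any(
        p.is_dir() and p.name.startswith("chromium") for p in cache_dir.iterdir()
    )


def warm_up(
    cache_dir: Path = PLAYWRIGHT_CACHE,
    install_cmd: Sequence[str] = INSTALL_CMD,
    # Default runner executes the command; a real caller would also set cwd
    # to the search project so uv resolves the right venv (assumption).
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> str:
    """Install Chromium off the critical path, only when the cache is cold."""
    if chromium_cached(cache_dir):
        return "warm"
    return "installed" if run(install_cmd) == 0 else "install-failed"
```

Calling `warm_up()` from a background task at launch would move the ~150 MB download out of the first user search. The FlareSolverr ping and venv sync are left out of this sketch.
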
### 2. Per-source isolation + partial results
- Run each source (TPB, Nyaa, 1337x) with its **own** timeout and collect whatever
  returns. Return the union; never let the slowest source (1337x/FlareSolverr) sink
  the fast ones (TPB/Nyaa via crawl4ai).
- Surface a per-source status in the UI: "TPB ✓ 18 · Nyaa ✓ 6 · 1337x ⏱ timed out".
  A partial result with a visible gap beats a blank error.

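Per-source isolation could be sketched with `asyncio.wait_for` around each source and `asyncio.gather` over the set. This is a minimal illustration, not the `torrent_search` API: it assumes each source can be wrapped as an async callable that takes a query and returns a list of magnet links, and `search_one`, `search_all`, and `status_line` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

Source = Callable[[str], Awaitable[list[str]]]  # query -> magnet links (assumed shape)


async def search_one(name: str, fetch: Source, query: str, timeout_s: float):
    """Run one source under its own timeout; never raise to the caller."""
    try:
        return name, "ok", await asyncio.wait_for(fetch(query), timeout_s)
    except asyncio.TimeoutError:
        return name, "timed out", []
    except Exception as exc:  # a broken source must not sink the others
        return name, f"failed: {exc}", []


async def search_all(sources: dict[str, tuple[Source, float]], query: str):
    """Query all sources concurrently; return (union of hits, per-source status)."""
    outcomes = await asyncio.gather(
        *(search_one(n, f, query, t) for n, (f, t) in sources.items())
    )
    results = [hit for _, _, hits in outcomes for hit in hits]
    status = {name: (state, len(hits)) for name, state, hits in outcomes}
    return results, status


def status_line(status: dict[str, tuple[str, int]]) -> str:
    """Render a UI summary such as 'TPB ✓ 18 · Nyaa ✓ 6 · 1337x ⏱ timed out'."""
    parts = []
    for name, (state, count) in status.items():
        if state == "ok":
            parts.append(f"{name} ✓ {count}")
        else:
            icon = "⏱" if state == "timed out" else "✗"
            parts.append(f"{name} {icon} {state}")
    return " · ".join(parts)
```

Because every failure is converted into a status inside `search_one`, `gather` never sees an exception, so fast sources still return their hits when the slow one times out.
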
### 3. Honest, actionable health surfacing
- A **Search/Downloads health row** (Setup or Hosts view): FlareSolverr reachable?
  Chromium installed? venv synced? `uv`/`bun` on PATH? Each with a one-tap remedy
  (start FlareSolverr container, run browser install). The current
  `"FlareSolverr/Playwright may be down"` guess only fires on *empty* results; it
  should be a real, always-available probe.

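A dependency probe of this kind could be sketched as below, using only the Python standard library. The FlareSolverr URL and Playwright cache path are the ones observed in this note; `http_ok` and `probe_health` are hypothetical names, and the venv-synced check is omitted because this note does not say how to detect it.

```python
import http.client
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def http_ok(url: str, timeout_s: float = 3.0) -> bool:
    """True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def probe_health(
    flaresolverr_url: str = "http://localhost:8191/",
    playwright_cache: Path = Path.home() / "Library" / "Caches" / "ms-playwright",
    tools: tuple[str, ...] = ("uv", "bun"),
) -> dict[str, bool]:
    """Collect one boolean per dependency, suitable for a health row."""
    report = {"flaresolverr": http_ok(flaresolverr_url)}
    report["chromium"] = playwright_cache.is_dir() and any(
        p.name.startswith("chromium") for p in playwright_cache.iterdir()
    )
    # shutil.which sees this process's PATH, which may differ from the
    # login-shell PATH that /bin/zsh -ilc provides to the real pipeline.
    report.update({tool: shutil.which(tool) is not None for tool in tools})
    return report
```

Each `False` entry maps naturally onto one remedy button (start the container, run the browser install), which is the point of probing per dependency instead of guessing from empty results.
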
### 4. Budget + retry that match the failure modes
- Distinguish **cold** (first call, allow a longer budget + a "warming up…" state)
  from **warm** (tight budget). Retry once on a cold `ETIMEDOUT` *after* warmup
  completes, instead of surfacing the raw spawn error.
- Lower the FlareSolverr per-call timeout (120 s is most of the 150 s budget for one
  source) and make 1337x best-effort, not budget-dominating.

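The cold/warm split with a single retry can be sketched as follows. This is a language-neutral illustration written in Python; the real change would live in `TorrentService`. The two budget values are assumptions (this note only says "longer" and "tight"), and `SearchTimeout`, `search_with_retry`, and the three injected callables are hypothetical.

```python
from typing import Callable

COLD_BUDGET_S = 300.0  # assumed value; this note only asks for a "longer" budget
WARM_BUDGET_S = 60.0   # assumed value; warm runs were measured at ~37 s


class SearchTimeout(Exception):
    """Raised by the search runner when its budget expires."""


def search_with_retry(
    run_search: Callable[[float], list[str]],
    wait_for_warmup: Callable[[], None],
    is_warm: Callable[[], bool],
) -> list[str]:
    """Use a tight budget when warm; on a cold timeout, warm up and retry once."""
    if is_warm():
        return run_search(WARM_BUDGET_S)  # warm failures surface immediately
    try:
        return run_search(COLD_BUDGET_S)
    except SearchTimeout:
        wait_for_warmup()                 # block until warmup has completed
        return run_search(WARM_BUDGET_S)  # single retry; a second timeout propagates
```

Limiting the retry to the cold path keeps a genuinely broken backend from doubling every failure's latency once the dependencies are known to be warm.
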
### 5. Prune dead sources
- `yts.mx`, `torrentgalaxy.to`, `solidtorrents.net`, `eztvx.to` currently fail every
  search (DNS/403). They fail fast, so they don't cause the hang, but they add log
  noise and wasted requests. Drop them or gate them behind a reachability check.

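If the sources are gated instead of dropped, a DNS-level reachability check is one cheap option. The sketch below is an assumption about how such a gate could look, not existing code: `resolves` and `reachable_sources` are hypothetical, and the resolver is injectable so the gate can be exercised without network access.

```python
import socket
from typing import Callable, Iterable

# Hosts reported as failing in this analysis.
DEAD_CANDIDATES = ("yts.mx", "torrentgalaxy.to", "solidtorrents.net", "eztvx.to")


def resolves(host: str, resolve: Callable[..., object] = socket.getaddrinfo) -> bool:
    """True if DNS resolution for the host succeeds."""
    try:
        resolve(host, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def reachable_sources(
    hosts: Iterable[str], resolve: Callable[..., object] = socket.getaddrinfo
) -> list[str]:
    """Keep only hosts that currently resolve; callers skip the rest."""
    return [h for h in hosts if resolves(h, resolve)]
```

A DNS gate only covers the `nodename nor servname` failures. A host that resolves but answers 403, as `eztvx.to` did, would pass it and needs an HTTP-level check or outright removal.
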
### 6. Hygiene (independent of the bug)
- Point `TorrentService` at the **in-repo** `mcp/` (commit it first per the roadmap
  repo-state note) and retire the `~/Code/@applications/plum-control-mcp` default,
  so the app and its CLI are versioned together under one tree.

---

## Remediation order (smallest blast radius first)

1. **Surface the real error + a health probe** (no pipeline change): replace the bare
   `ETIMEDOUT` with "Search backend cold or FlareSolverr down, retrying…", and add
   an always-on dependency probe. *Highest value / lowest risk.*
2. **Launch-time warmup** of venv + Chromium + FlareSolverr ping. Kills the dominant
   (cold-start) trigger outright.
3. **Per-source timeouts + partial results** in the Python embedded in `search.ts`
   (`asyncio.gather` with a per-task timeout). Turns "all or nothing" into
   "always something".
4. **Retry-after-warmup** in `TorrentService` for the cold-`ETIMEDOUT` case.
5. **Prune dead sources; lower FlareSolverr timeout.**
6. **Repo hygiene:** in-repo `mcp/`, drop the stale path.

Steps 1–2 alone would have prevented the reported failure. Steps 3–4 make it
robust; 5–6 are cleanup.

---

## Verification notes (reproduce)

```sh
# Exact app path, real limit; should succeed warm (~35–40 s):
time /bin/zsh -ilc "cd ~/Code/@applications/plum-control-mcp && bun run src/cli.ts search 'breaking bad' 25"

# Dependency probes:
curl -s -m3 -o /dev/null -w '%{http_code}\n' http://localhost:8191/   # FlareSolverr → 200
docker ps --format '{{.Names}}\t{{.Status}}' | grep flare             # container up?
ls ~/Library/Caches/ms-playwright | grep chromium                     # browser cached?
```

To *force* the cold-start failure for testing: stop the FlareSolverr container, or
clear `~/Library/Caches/ms-playwright`, then search. The bare `ETIMEDOUT` should
reappear.