apricot-health/docs/DIAGNOSIS.md

79 lines
4.6 KiB
Markdown
Raw Permalink Normal View History

# apricot hard-off diagnosis
Running log of the investigation. Newest findings at top.
## Platform
- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
- AMD Threadripper 2990WX (32-core, 250 W TDP).
- 2× NVIDIA RTX 3090 (stock 370 W cap each).
- 2× NVMe + 3× SATA.
- 2× Corsair PSUs:
- **HX1500i** — was producing audible coil-whine before the split; now carries only drives + Molex.
- **HX1200** — now carries mobo + CPU + both GPUs.
- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
- Non-ECC memory (`amd64_edac` cannot bind).
## Failure signature (consistent across all events)
1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic.
2. Next boot runs `XFS (dm-0): Starting recovery` — filesystem wasn't unmounted cleanly.
3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Current ratio ~66 % of all power cycles are unclean.
4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
5. No MCE / thermal-throttle / OOM / hung-task entries.
→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.
## Timeline of captured crashes
| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
|---|---|---|---|---|
| 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | — | Sustained high — GPU 1 inference under load |
| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle** — background auto-commit + tor-manager only |
| 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load |
| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |
Crashes span idle-to-sustained-peak — no consistent load correlation.
## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)
Stable rails during normal operation:
- `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual.
- `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual.
**Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a **50 % rail drop** — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.
## What has been ruled out
- **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
- **OOM / hung task**: journal shows none.
- **MCE**: `edac_mce_amd` loaded, no events logged.
- **Graceful shutdown path**: no systemd shutdown-target progression.
- **nvidia-oc daemon**: fixed independently — was thrashing sqlite locks; not related to crashes.
- **HX1500i as sole cause**: crashes continued after moving all load off it onto HX1200.
## What's consistent with observations
- **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old.
- **Load-independent failure**: crashes happen at both idle and peak load, but the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload.
## What remains to rule out (physical)
- Visual inspection of VRM caps on the board (open bench, trivial).
- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
- Swap to a third known-good PSU for a day.
- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).
## Software stack currently deployed
- **10 Hz telemetry logger** (`apricot-crash-monitor.service`) — writes ~/apricot-crash.log, fsync per second.
- **Rail watchdog** (`apricot-rail-watchdog.service`) — baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger.
- **Emergency mitigation** (`apricot-rail-mitigate`) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
- **C-state tune** (`apricot-cstate-tune.service`) — disables C2+ at boot to reduce VRM transient demand.
- **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`) — SuperIO sensors auto-load with correct `force_id`.
- **rasdaemon** — optional, via `apricot-rasdaemon-setup`.
## Non-software fixes kept separate from this package
- nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).