apricot-health/docs/DIAGNOSIS.md
Natalie dafbabee41 feat(@packages/apricot-health): add power-fault monitoring and mitigation tools
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 23:18:47 -07:00

78 lines
4.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# apricot hard-off diagnosis
Running log of the investigation. Newest findings at top.
## Platform
- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
- AMD Threadripper 2990WX (32-core, 250 W TDP).
- 2× NVIDIA RTX 3090 (stock 370 W cap each).
- 2× NVMe + 3× SATA.
- 2× Corsair PSUs:
- **HX1500i** — was producing audible coil-whine before the split; now carries only drives + Molex.
- **HX1200** — now carries mobo + CPU + both GPUs.
- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
- Non-ECC memory (`amd64_edac` cannot bind).
## Failure signature (consistent across all events)
1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic.
2. Next boot runs `XFS (dm-0): Starting recovery` — filesystem wasn't unmounted cleanly.
3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Current ratio ~66 % of all power cycles are unclean.
4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
5. No MCE / thermal-throttle / OOM / hung-task entries.
→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.
## Timeline of captured crashes
| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
|---|---|---|---|---|
| 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | — | Sustained high — GPU 1 inference under load |
| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle** — background auto-commit + tor-manager only |
| 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load |
| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |
Crashes span idle-to-sustained-peak — no consistent load correlation.
## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)
Stable rails during normal operation:
- `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual.
- `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual.
**Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a **50 % rail drop** — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.
## What has been ruled out
- **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
- **OOM / hung task**: journal shows none.
- **MCE**: `edac_mce_amd` loaded, no events logged.
- **Graceful shutdown path**: no systemd shutdown-target progression.
- **nvidia-oc daemon**: fixed independently — was thrashing sqlite locks; not related to crashes.
- **HX1500i as sole cause**: crashes continued after moving all load off it onto HX1200.
## What's consistent with observations
- **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old.
- **Load-independent failure**: crashes happen at both idle and peak load, but the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload.
## What remains to rule out (physical)
- Visual inspection of VRM caps on the board (open bench, trivial).
- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
- Swap to a third known-good PSU for a day.
- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).
## Software stack currently deployed
- **10 Hz telemetry logger** (`apricot-crash-monitor.service`) — writes ~/apricot-crash.log, fsync per second.
- **Rail watchdog** (`apricot-rail-watchdog.service`) — baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger.
- **Emergency mitigation** (`apricot-rail-mitigate`) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
- **C-state tune** (`apricot-cstate-tune.service`) — disables C2+ at boot to reduce VRM transient demand.
- **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`) — SuperIO sensors auto-load with correct `force_id`.
- **rasdaemon** — optional, via `apricot-rasdaemon-setup`.
## Non-software fixes kept separate from this package
- nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).