Natalie dafbabee41 feat(@packages/apricot-health): ✨ add power-fault monitoring and mitigation tools

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-04-17 23:18:47 -07:00

4.6 KiB

Raw Permalink Blame History

apricot hard-off diagnosis

Running log of the investigation. Newest findings at top.

Platform

Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
AMD Threadripper 2990WX (32-core, 250 W TDP).
2× NVIDIA RTX 3090 (stock 370 W cap each).
2× NVMe + 3× SATA.
2× Corsair PSUs:
- HX1500i — was producing audible coil-whine before the split; now carries only drives + Molex.
- HX1200 — now carries mobo + CPU + both GPUs.
Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
Non-ECC memory (amd64_edac cannot bind).

Failure signature (consistent across all events)

Journal cuts abruptly mid-operation. No Reached target Shutdown, no systemd-shutdown, no kernel panic.
Next boot runs XFS (dm-0): Starting recovery — filesystem wasn't unmounted cleanly.
NVMe SMART Unsafe Shutdowns increments by 1 on each event. Current ratio ~66 % of all power cycles are unclean.
BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
No MCE / thermal-throttle / OOM / hung-task entries.

→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.

Timeline of captured crashes

Timestamp (PDT)	GPU 0	GPU 1	CPU Tctl	Load profile
2026-04-16 15:58:06	158 W	368 W (pegged)	—	Sustained high — GPU 1 inference under load
2026-04-17 03:22:54	117 W	25 W (idle)	70 °C	Near-idle — background auto-commit + tor-manager only
2026-04-17 11:15:42	20 W	368 W (pegged)	72 °C	High GPU 1 load
2026-04-17 21:35:10	117 W	129 W	69 °C	Moderate, both GPUs in P2

Crashes span idle-to-sustained-peak — no consistent load correlation.

Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)

Stable rails during normal operation:

in5 on chip 1 (hwmon3/hwmon8 depending on boot order): 852 mV steady → likely +12 V scaled ~14:1 → ~11.9 V actual.
in5 on chip 2: 1632 mV steady → likely +5 V scaled ~3:1 → ~4.9 V actual.

Key observation 2026-04-17: Between crashes, in5 on chip 1 collapsed from 852 mV → 408 mV twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a 50 % rail drop — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.

What has been ruled out

Thermal: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
OOM / hung task: journal shows none.
MCE: edac_mce_amd loaded, no events logged.
Graceful shutdown path: no systemd shutdown-target progression.
nvidia-oc daemon: fixed independently — was thrashing sqlite locks; not related to crashes.
HX1500i as sole cause: crashes continued after moving all load off it onto HX1200.

What's consistent with observations

Aging filter caps on PSU and/or motherboard VRM. Both the squealing HX1500i and the HX1200 have produced visible rail excursions. Board is 8 years old.
Load-independent failure: crashes happen at both idle and peak load, but the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload.

What remains to rule out (physical)

Visual inspection of VRM caps on the board (open bench, trivial).
Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
Swap to a third known-good PSU for a day.
Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).

Software stack currently deployed

10 Hz telemetry logger (apricot-crash-monitor.service) — writes ~/apricot-crash.log, fsync per second.
Rail watchdog (apricot-rail-watchdog.service) — baseline-learning on in5, 30 mV deviation threshold, invokes mitigation on trigger.
Emergency mitigation (apricot-rail-mitigate) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
C-state tune (apricot-cstate-tune.service) — disables C2+ at boot to reduce VRM transient demand.
IT8628E binding (/etc/modprobe.d/it87.conf + /etc/modules-load.d/it87.conf) — SuperIO sensors auto-load with correct force_id.
rasdaemon — optional, via apricot-rasdaemon-setup.

Non-software fixes kept separate from this package

nvidia-oc WAL-mode patch (upstreamed via ACS to origin/master of the nvidia-oc repo, commit bea1934).

4.6 KiB Raw Permalink Blame History Unescape Escape