apricot-health/docs/DIAGNOSIS.md
Natalie dafbabee41 feat(@packages/apricot-health): add power-fault monitoring and mitigation tools
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 23:18:47 -07:00

4.6 KiB
Raw Permalink Blame History

apricot hard-off diagnosis

Running log of the investigation. Newest findings at top.

Platform

  • Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
  • AMD Threadripper 2990WX (32-core, 250 W TDP).
  • 2× NVIDIA RTX 3090 (stock 370 W cap each).
  • 2× NVMe + 3× SATA.
  • 2× Corsair PSUs:
    • HX1500i — was producing audible coil-whine before the split; now carries only drives + Molex.
    • HX1200 — now carries mobo + CPU + both GPUs.
  • Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
  • Non-ECC memory (amd64_edac cannot bind).

Failure signature (consistent across all events)

  1. Journal cuts abruptly mid-operation. No Reached target Shutdown, no systemd-shutdown, no kernel panic.
  2. Next boot runs XFS (dm-0): Starting recovery — filesystem wasn't unmounted cleanly.
  3. NVMe SMART Unsafe Shutdowns increments by 1 on each event. Current ratio ~66 % of all power cycles are unclean.
  4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
  5. No MCE / thermal-throttle / OOM / hung-task entries.

→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.

Timeline of captured crashes

Timestamp (PDT) GPU 0 GPU 1 CPU Tctl Load profile
2026-04-16 15:58:06 158 W 368 W (pegged) Sustained high — GPU 1 inference under load
2026-04-17 03:22:54 117 W 25 W (idle) 70 °C Near-idle — background auto-commit + tor-manager only
2026-04-17 11:15:42 20 W 368 W (pegged) 72 °C High GPU 1 load
2026-04-17 21:35:10 117 W 129 W 69 °C Moderate, both GPUs in P2

Crashes span idle-to-sustained-peak — no consistent load correlation.

Rail observations (it8628 SuperIO, after binding via it87 force_id=0x8628)

Stable rails during normal operation:

  • in5 on chip 1 (hwmon3/hwmon8 depending on boot order): 852 mV steady → likely +12 V scaled ~14:1 → ~11.9 V actual.
  • in5 on chip 2: 1632 mV steady → likely +5 V scaled ~3:1 → ~4.9 V actual.

Key observation 2026-04-17: Between crashes, in5 on chip 1 collapsed from 852 mV → 408 mV twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a 50 % rail drop — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.

What has been ruled out

  • Thermal: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
  • OOM / hung task: journal shows none.
  • MCE: edac_mce_amd loaded, no events logged.
  • Graceful shutdown path: no systemd shutdown-target progression.
  • nvidia-oc daemon: fixed independently — was thrashing sqlite locks; not related to crashes.
  • HX1500i as sole cause: crashes continued after moving all load off it onto HX1200.

What's consistent with observations

  • Aging filter caps on PSU and/or motherboard VRM. Both the squealing HX1500i and the HX1200 have produced visible rail excursions. Board is 8 years old.
  • Load-independent failure: crashes happen at both idle and peak load, but the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload.

What remains to rule out (physical)

  • Visual inspection of VRM caps on the board (open bench, trivial).
  • Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
  • Swap to a third known-good PSU for a day.
  • Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).

Software stack currently deployed

  • 10 Hz telemetry logger (apricot-crash-monitor.service) — writes ~/apricot-crash.log, fsync per second.
  • Rail watchdog (apricot-rail-watchdog.service) — baseline-learning on in5, 30 mV deviation threshold, invokes mitigation on trigger.
  • Emergency mitigation (apricot-rail-mitigate) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
  • C-state tune (apricot-cstate-tune.service) — disables C2+ at boot to reduce VRM transient demand.
  • IT8628E binding (/etc/modprobe.d/it87.conf + /etc/modules-load.d/it87.conf) — SuperIO sensors auto-load with correct force_id.
  • rasdaemon — optional, via apricot-rasdaemon-setup.

Non-software fixes kept separate from this package

  • nvidia-oc WAL-mode patch (upstreamed via ACS to origin/master of the nvidia-oc repo, commit bea1934).