4.6 KiB
4.6 KiB
apricot hard-off diagnosis
Running log of the investigation. Newest findings at top.
Platform
- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
- AMD Threadripper 2990WX (32-core, 250 W TDP).
- 2× NVIDIA RTX 3090 (stock 370 W cap each).
- 2× NVMe + 3× SATA.
- 2× Corsair PSUs:
- HX1500i — was producing audible coil-whine before the split; now carries only drives + Molex.
- HX1200 — now carries mobo + CPU + both GPUs.
- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
- Non-ECC memory (
amd64_edaccannot bind).
Failure signature (consistent across all events)
- Journal cuts abruptly mid-operation. No
Reached target Shutdown, nosystemd-shutdown, no kernel panic. - Next boot runs
XFS (dm-0): Starting recovery— filesystem wasn't unmounted cleanly. - NVMe SMART
Unsafe Shutdownsincrements by 1 on each event. Current ratio ~66 % of all power cycles are unclean. - BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
- No MCE / thermal-throttle / OOM / hung-task entries.
→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.
Timeline of captured crashes
| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
|---|---|---|---|---|
| 2026-04-16 15:58:06 | 158 W | 368 W (pegged) | — | Sustained high — GPU 1 inference under load |
| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | Near-idle — background auto-commit + tor-manager only |
| 2026-04-17 11:15:42 | 20 W | 368 W (pegged) | 72 °C | High GPU 1 load |
| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |
Crashes span idle-to-sustained-peak — no consistent load correlation.
Rail observations (it8628 SuperIO, after binding via it87 force_id=0x8628)
Stable rails during normal operation:
in5on chip 1 (hwmon3/hwmon8 depending on boot order): 852 mV steady → likely +12 V scaled ~14:1 → ~11.9 V actual.in5on chip 2: 1632 mV steady → likely +5 V scaled ~3:1 → ~4.9 V actual.
Key observation 2026-04-17: Between crashes, in5 on chip 1 collapsed from 852 mV → 408 mV twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a 50 % rail drop — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.
What has been ruled out
- Thermal: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
- OOM / hung task: journal shows none.
- MCE:
edac_mce_amdloaded, no events logged. - Graceful shutdown path: no systemd shutdown-target progression.
- nvidia-oc daemon: fixed independently — was thrashing sqlite locks; not related to crashes.
- HX1500i as sole cause: crashes continued after moving all load off it onto HX1200.
What's consistent with observations
- Aging filter caps on PSU and/or motherboard VRM. Both the squealing HX1500i and the HX1200 have produced visible rail excursions. Board is 8 years old.
- Load-independent failure: crashes happen at both idle and peak load, but the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload.
What remains to rule out (physical)
- Visual inspection of VRM caps on the board (open bench, trivial).
- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
- Swap to a third known-good PSU for a day.
- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).
Software stack currently deployed
- 10 Hz telemetry logger (
apricot-crash-monitor.service) — writes ~/apricot-crash.log, fsync per second. - Rail watchdog (
apricot-rail-watchdog.service) — baseline-learning onin5, 30 mV deviation threshold, invokes mitigation on trigger. - Emergency mitigation (
apricot-rail-mitigate) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores. - C-state tune (
apricot-cstate-tune.service) — disables C2+ at boot to reduce VRM transient demand. - IT8628E binding (
/etc/modprobe.d/it87.conf+/etc/modules-load.d/it87.conf) — SuperIO sensors auto-load with correctforce_id. - rasdaemon — optional, via
apricot-rasdaemon-setup.
Non-software fixes kept separate from this package
- nvidia-oc WAL-mode patch (upstreamed via ACS to
origin/masterof the nvidia-oc repo, commitbea1934).