apricot-health
Power-fault diagnostics and mitigation for apricot — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).
What's in here
| Component | What it does |
|---|---|
| scripts/apricot-crash-logger | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to ~/apricot-crash.log, fsync'd every second, so the last fractions of a second before a hard reset survive the crash. |
| scripts/apricot-rail-watchdog | Tails the crash-log, learns per-chip baseline for in5 on each it8628/hwmonN, alerts on deviations > DEVIATION_MV (default 30 mV). Optionally invokes a mitigation hook. |
| scripts/apricot-rail-mitigate | Root-only emergency responder: drops GPU power caps and pins CPU governor to powersave for HOLD_SECONDS (default 60), then restores. Fired by the watchdog via sudoers. |
| scripts/apricot-rail-mitigate-trigger | User-space shim that sudos into apricot-rail-mitigate (scoped NOPASSWD). |
| scripts/apricot-cstate-tune | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
| scripts/apricot-rasdaemon-setup | Installs + enables rasdaemon for detailed AMD MCA/MCE decoding into a sqlite DB. |
| modprobe.d/it87.conf | force_id=0x8628 ignore_resource_conflict=1: binds IT8628E SuperIO so voltage/fan/temp rails are exposed in /sys/class/hwmon. |
| modules-load.d/it87.conf | Loads it87 at boot. |
| sudoers.d/apricot-health | NOPASSWD rule for lilith to invoke the mitigation entrypoint (scoped to one command). |
| systemd/*.service | Three units: one root (apricot-cstate-tune), two user (apricot-crash-monitor, apricot-rail-watchdog). |
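
The watchdog's learn-then-compare behavior can be illustrated with a minimal sketch. This is not the actual apricot-rail-watchdog script: the function name `rail_watch`, the one-reading-per-line input format, and the fallback of 40 for `BASELINE_SAMPLES` are assumptions made for illustration. Only `DEVIATION_MV` (default 30 mV) is taken from the table above.

```shell
#!/bin/sh
# Hypothetical sketch of baseline learning + deviation alerting.
# Input: one rail reading in millivolts per line on stdin (assumed format).
# DEVIATION_MV default (30) comes from the README; the BASELINE_SAMPLES
# fallback (40) is an assumption, not the script's documented default.
rail_watch() {
  awk -v dev="${DEVIATION_MV:-30}" -v n="${BASELINE_SAMPLES:-40}" '
    # Learning phase: average the first n samples into a baseline.
    NR <= n {
      sum += $1
      if (NR == n) { base = sum / n; printf "baseline %.1f mV\n", base }
      next
    }
    # Watch phase: alert when the absolute deviation exceeds the threshold.
    {
      d = $1 - base
      if (d < 0) d = -d
      if (d > dev) printf "ALERT sample=%d value=%d deviation=%.1f mV\n", NR, $1, d
    }
  '
}
```

For example, `printf '%s\n' 1200 1200 1200 1200 1190 1240 | BASELINE_SAMPLES=4 rail_watch` learns a 1200.0 mV baseline from the first four readings, ignores the 10 mV dip, and alerts on the 40 mV excursion at sample 6. A fixed mean is the simplest possible baseline; the real script may well use something more robust.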
Install
./install.sh # targets HOST=apricot by default
HOST=other-host ./install.sh # or override
Idempotent. Re-run to push updates.
Tuning
All runtime behavior is env-overridable through systemd drop-ins:
systemctl --user edit apricot-rail-watchdog
# [Service]
# Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
Key knobs:
- `INTERVAL` (crash-logger): sample period in seconds; `0.1` = 10 Hz.
- `DEVIATION_MV` (watchdog): deviation from learned baseline that triggers an alert.
- `MITIGATE_CMD` (watchdog): path to mitigation hook; empty = alert only.
- `GPU_LIMIT_SAFE` (mitigate): wattage to clamp GPUs to during mitigation.
- `HOLD_SECONDS` (mitigate): how long to hold the safe state.
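
A common way for shell scripts to pick up such overrides is the `${VAR:=default}` idiom, sketched below. Whether the apricot scripts use exactly this pattern is an assumption, as are the function names `load_knobs` and `alert_mode` and the `INTERVAL` fallback of 0.1; the 30 mV and 60 s defaults are the ones stated in the component table.

```shell
#!/bin/sh
# Hypothetical sketch: apply defaults only where the environment
# (for example a systemd drop-in Environment= line) has not set a value.
load_knobs() {
  : "${INTERVAL:=0.1}"      # assumed default; 0.1 s corresponds to 10 Hz
  : "${DEVIATION_MV:=30}"   # default stated in the README
  : "${HOLD_SECONDS:=60}"   # default stated in the README
  : "${MITIGATE_CMD:=}"     # empty means alert only
}

# Report which mode the watchdog would run in for the current settings.
alert_mode() {
  if [ -n "${MITIGATE_CMD:-}" ]; then
    echo "alert+mitigate"
  else
    echo "alert-only"
  fi
}
```

With this pattern a drop-in that sets `DEVIATION_MV=50` wins over the built-in 30, and leaving `MITIGATE_CMD` unset keeps the watchdog in alert-only mode.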
Outputs
- `~/apricot-crash.log`: per-sample telemetry.
- `~/apricot-rail-alerts.log`: watchdog alerts + baselines.
- `journalctl --user -u apricot-rail-watchdog`: live alerts (WARNING priority).
- `journalctl -u apricot-cstate-tune`: one-shot C-state tune result at boot.
- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup): decoded MCEs.
Post-mortem flow when a crash happens
1. `ssh apricot` (after it comes back; BIOS "AC Back: Power On" auto-restarts).
2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5`: find the new session boundary.
   - Everything between the previous session's last line and the new session marker is the last ~N seconds before death.
3. `tail ~/apricot-rail-alerts.log`: did the watchdog see rail deviation before the event?
4. `journalctl -b -1 --no-pager | tail -40`: kernel's last words (often normal; hard-off gives no panic).
5. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe` should increment by 1.
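
The session-boundary step can be scripted. The following is a minimal sketch, assuming only the `=== session start` marker format shown in the grep above; the helper name `pre_crash_tail` and its arguments are hypothetical and not part of this repo.

```shell
#!/bin/sh
# Hypothetical helper: print the last N lines logged before the most
# recent session marker, i.e. the final samples of the crashed session.
# Usage: pre_crash_tail [logfile] [lines]
pre_crash_tail() (
  log=${1:-"$HOME/apricot-crash.log"}
  n=${2:-20}
  # Line number of the newest "=== session start" marker.
  last=$(grep -n '^=== session start' "$log" | tail -n 1 | cut -d: -f1)
  # Fail if there is no marker, or nothing precedes it.
  [ -n "$last" ] && [ "$last" -gt 1 ] || exit 1
  head -n "$((last - 1))" "$log" | tail -n "$n"
)
```

Running `pre_crash_tail ~/apricot-crash.log 50` after a reboot would show the 50 samples immediately preceding the crash; at 10 Hz that is roughly the last five seconds.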
Diagnosis so far
See docs/DIAGNOSIS.md.