From dafbabee41581cc13724f9f263c9e6ede5bdd72a Mon Sep 17 00:00:00 2001
From: Natalie
Date: Fri, 17 Apr 2026 23:18:47 -0700
Subject: [PATCH] =?UTF-8?q?feat(@packages/apricot-health):=20=E2=9C=A8=20a?=
 =?UTF-8?q?dd=20power-fault=20monitoring=20and=20mitigation=20tools?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 README.md                             |  66 ++++++++++++++++
 docs/DIAGNOSIS.md                     |  78 +++++++++++++++++++
 install.sh                            | 105 ++++++++++++++++++++++++++
 modprobe.d/it87.conf                  |   1 +
 modules-load.d/it87.conf              |   1 +
 scripts/apricot-crash-logger          |  83 ++++++++++++++++++++
 scripts/apricot-cstate-tune           |  52 +++++++++++++
 scripts/apricot-rail-mitigate         |  92 ++++++++++++++++++++++
 scripts/apricot-rail-mitigate-trigger |   5 ++
 scripts/apricot-rail-watchdog         |  85 +++++++++++++++++++++
 scripts/apricot-rasdaemon-setup       |  42 +++++++++++
 sudoers.d/apricot-health              |   4 +
 systemd/apricot-crash-monitor.service |  16 ++++
 systemd/apricot-cstate-tune.service   |  16 ++++
 systemd/apricot-rail-watchdog.service |  17 +++++
 15 files changed, 663 insertions(+)
 create mode 100644 README.md
 create mode 100644 docs/DIAGNOSIS.md
 create mode 100755 install.sh
 create mode 100644 modprobe.d/it87.conf
 create mode 100644 modules-load.d/it87.conf
 create mode 100755 scripts/apricot-crash-logger
 create mode 100755 scripts/apricot-cstate-tune
 create mode 100755 scripts/apricot-rail-mitigate
 create mode 100755 scripts/apricot-rail-mitigate-trigger
 create mode 100755 scripts/apricot-rail-watchdog
 create mode 100755 scripts/apricot-rasdaemon-setup
 create mode 100644 sudoers.d/apricot-health
 create mode 100644 systemd/apricot-crash-monitor.service
 create mode 100644 systemd/apricot-cstate-tune.service
 create mode 100644 systemd/apricot-rail-watchdog.service

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..4821479
--- /dev/null
+++ b/README.md
@@ -0,0 +1,66 @@
+# apricot-health
+
+Power-fault diagnostics and mitigation for **apricot** — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).
+
+## What's in here
+
+| Component | What it does |
+|---|---|
+| `scripts/apricot-crash-logger` | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to `~/apricot-crash.log`, fsync'd every second, so the last fractions of a second before a hard reset survive the crash. |
+| `scripts/apricot-rail-watchdog` | Tails the crash-log, learns per-chip baseline for `in5` on each `it8628/hwmonN`, alerts on deviations > `DEVIATION_MV` (default 30 mV). Optionally invokes a mitigation hook. |
+| `scripts/apricot-rail-mitigate` | Root-only emergency responder: drops GPU power caps and pins CPU governor to `powersave` for `HOLD_SECONDS` (default 60), then restores. Fired by the watchdog via sudoers. |
+| `scripts/apricot-rail-mitigate-trigger` | User-space shim that `sudo`s into `apricot-rail-mitigate` (scoped NOPASSWD). |
+| `scripts/apricot-cstate-tune` | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
+| `scripts/apricot-rasdaemon-setup` | Installs + enables `rasdaemon` for detailed AMD MCA/MCE decoding into a sqlite DB. |
+| `modprobe.d/it87.conf` | `force_id=0x8628 ignore_resource_conflict=1` — binds IT8628E SuperIO so voltage/fan/temp rails are exposed in `/sys/class/hwmon`. |
+| `modules-load.d/it87.conf` | Loads `it87` at boot. |
+| `sudoers.d/apricot-health` | NOPASSWD rule for `lilith` to invoke the mitigation entrypoint (scoped to one command). |
+| `systemd/*.service` | Three units — one root (`apricot-cstate-tune`), two user (`apricot-crash-monitor`, `apricot-rail-watchdog`). |
+
+## Install
+
+```sh
+./install.sh                  # targets HOST=apricot by default
+HOST=other-host ./install.sh  # or override
+```
+
+Idempotent. Re-run to push updates.
+
+## Tuning
+
+All runtime behavior is env-overridable through systemd drop-ins:
+
+```sh
+systemctl --user edit apricot-rail-watchdog
+# [Service]
+# Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
+```
+
+Key knobs:
+
+- `INTERVAL` (crash-logger) — sample period in seconds; `0.1` = 10 Hz.
+- `DEVIATION_MV` (watchdog) — deviation from learned baseline that triggers an alert.
+- `MITIGATE_CMD` (watchdog) — path to mitigation hook; empty = alert only.
+- `GPU_LIMIT_SAFE` (mitigate) — wattage to clamp GPUs to during mitigation.
+- `HOLD_SECONDS` (mitigate) — how long to hold the safe state.
+
+## Outputs
+
+- `~/apricot-crash.log` — per-sample telemetry.
+- `~/apricot-rail-alerts.log` — watchdog alerts + baselines.
+- `journalctl --user -u apricot-rail-watchdog` — live alerts (WARNING priority).
+- `journalctl -u apricot-cstate-tune` — one-shot C-state tune result at boot.
+- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup) — decoded MCEs.
+
+## Post-mortem flow when a crash happens
+
+1. `ssh apricot` (after it comes back — BIOS "AC Back: Power On" auto-restarts).
+2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5` — find the new session boundary.
+3. Everything between the previous session's last line and the new session marker is the last ~N seconds before death.
+4. `tail ~/apricot-rail-alerts.log` — did the watchdog see rail deviation before the event?
+5. `journalctl -b -1 --no-pager | tail -40` — kernel's last words (often normal; hard-off gives no panic).
+6. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe` — should increment by 1.
+
+## Diagnosis so far
+
+See [`docs/DIAGNOSIS.md`](docs/DIAGNOSIS.md).
diff --git a/docs/DIAGNOSIS.md b/docs/DIAGNOSIS.md
new file mode 100644
index 0000000..88d2884
--- /dev/null
+++ b/docs/DIAGNOSIS.md
@@ -0,0 +1,78 @@
+# apricot hard-off diagnosis
+
+Running log of the investigation. Newest findings at top.
+
+## Platform
+- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
+- AMD Threadripper 2990WX (32-core, 250 W TDP).
+- 2× NVIDIA RTX 3090 (stock 370 W cap each).
+- 2× NVMe + 3× SATA.
+- 2× Corsair PSUs:
+  - **HX1500i** — was producing audible coil-whine before the split; now carries only drives + Molex.
+  - **HX1200** — now carries mobo + CPU + both GPUs.
+- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
+- Non-ECC memory (`amd64_edac` cannot bind).
+
+## Failure signature (consistent across all events)
+
+1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic.
+2. Next boot runs `XFS (dm-0): Starting recovery` — filesystem wasn't unmounted cleanly.
+3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Currently ~66 % of all power cycles are unclean.
+4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
+5. No MCE / thermal-throttle / OOM / hung-task entries.
+
+→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.
+
+## Timeline of captured crashes
+
+| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
+|---|---|---|---|---|
+| 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | — | Sustained high — GPU 1 inference under load |
+| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle** — background auto-commit + tor-manager only |
+| 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load |
+| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |
+
+Crashes span idle-to-sustained-peak — no consistent load correlation.
+
+## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)
+
+Stable rails during normal operation:
+
+- `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual.
+- `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual.
+
+**Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a **50 % rail drop** — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.
+
+## What has been ruled out
+
+- **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
+- **OOM / hung task**: journal shows none.
+- **MCE**: `edac_mce_amd` loaded, no events logged.
+- **Graceful shutdown path**: no systemd shutdown-target progression.
+- **nvidia-oc daemon**: fixed independently — was thrashing sqlite locks; not related to crashes.
+- **HX1500i as sole cause**: crashes continued after moving all load off it onto HX1200.
+
+## What's consistent with observations
+
+- **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old.
+- **Load-independent failure**: crashes happen at both idle and peak load, and the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload.
+
+## What remains to rule out (physical)
+
+- Visual inspection of VRM caps on the board (open bench, trivial).
+- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
+- Swap to a third known-good PSU for a day.
+- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).
+
+## Software stack currently deployed
+
+- **10 Hz telemetry logger** (`apricot-crash-monitor.service`) — writes ~/apricot-crash.log, fsync per second.
+- **Rail watchdog** (`apricot-rail-watchdog.service`) — baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger.
+- **Emergency mitigation** (`apricot-rail-mitigate`) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
+- **C-state tune** (`apricot-cstate-tune.service`) — disables C2+ at boot to reduce VRM transient demand.
+- **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`) — SuperIO sensors auto-load with correct `force_id`.
+- **rasdaemon** — optional, via `apricot-rasdaemon-setup`.
+
+## Non-software fixes kept separate from this package
+
+- nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).
diff --git a/install.sh b/install.sh
new file mode 100755
index 0000000..1315865
--- /dev/null
+++ b/install.sh
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+# Install apricot-health on the target host (default: apricot).
+#
+# Layout on target:
+#   /var/home/lilith/bin/                              user-runnable scripts
+#   /var/opt/apricot-health/sbin/                      root-only entrypoints (ostree-safe)
+#   /etc/modprobe.d/it87.conf                          IT8628E force_id
+#   /etc/modules-load.d/it87.conf                      load it87 at boot
+#   /etc/sudoers.d/apricot-health                      NOPASSWD shim for mitigation
+#   /etc/systemd/system/apricot-cstate-tune.service    root systemd unit
+#   /var/home/lilith/.config/systemd/user/*.service    user systemd units
+#
+# Idempotent: re-running copies updates and daemon-reloads.
+
+set -euo pipefail
+
+HOST="${HOST:-apricot}"
+PKG_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+echo "==> apricot-health install to $HOST (pkg=$PKG_DIR)"
+
+# --- stage tarball locally so we upload in one round-trip ---------------
+stage=$(mktemp -d)
+trap 'rm -rf "$stage"' EXIT
+mkdir -p "$stage"/{bin,root-sbin,etc-modprobe,etc-modules-load,etc-sudoers,etc-systemd,user-systemd}
+
+cp "$PKG_DIR/scripts/apricot-crash-logger" "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rail-watchdog" "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rail-mitigate-trigger" "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rasdaemon-setup" "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rail-mitigate" "$stage/root-sbin/"
+cp "$PKG_DIR/scripts/apricot-cstate-tune" "$stage/root-sbin/"
+cp "$PKG_DIR/modprobe.d/it87.conf" "$stage/etc-modprobe/"
+cp "$PKG_DIR/modules-load.d/it87.conf" "$stage/etc-modules-load/"
+cp "$PKG_DIR/sudoers.d/apricot-health" "$stage/etc-sudoers/"
+cp "$PKG_DIR/systemd/apricot-cstate-tune.service" "$stage/etc-systemd/"
+cp "$PKG_DIR/systemd/apricot-crash-monitor.service" "$stage/user-systemd/"
+cp "$PKG_DIR/systemd/apricot-rail-watchdog.service" "$stage/user-systemd/"
+
+tar -czf "$stage/pkg.tar.gz" -C "$stage" bin root-sbin etc-modprobe etc-modules-load etc-sudoers etc-systemd user-systemd
+echo "==> staged $(du -h "$stage/pkg.tar.gz" | cut -f1)"
+
+# --- ship it ------------------------------------------------------------
+scp -q "$stage/pkg.tar.gz" "$HOST:/tmp/apricot-health.tar.gz"
+
+ssh "$HOST" bash -s <<'REMOTE'
+set -euo pipefail
+echo "==> remote install"
+
+t=$(mktemp -d)
+tar -xzf /tmp/apricot-health.tar.gz -C "$t"
+
+# User-runnable scripts
+mkdir -p /var/home/lilith/bin
+install -m 0755 -o lilith -g lilith "$t"/bin/* /var/home/lilith/bin/
+
+# Root-only entrypoints (ostree-safe path under /var)
+sudo mkdir -p /var/opt/apricot-health/sbin
+sudo install -m 0755 -o root -g root "$t"/root-sbin/* /var/opt/apricot-health/sbin/
+
+# Kernel module config
+sudo install -m 0644 "$t"/etc-modprobe/it87.conf /etc/modprobe.d/it87.conf
+sudo install -m 0644 "$t"/etc-modules-load/it87.conf /etc/modules-load.d/it87.conf
+
+# Sudoers (visudo-check first — malformed sudoers can lock the user out)
+tmp_sudo=$(mktemp)
+cp "$t"/etc-sudoers/apricot-health "$tmp_sudo"
+if sudo visudo -cf "$tmp_sudo" >/dev/null 2>&1; then
+    sudo install -m 0440 -o root -g root "$tmp_sudo" /etc/sudoers.d/apricot-health
+    echo " sudoers: installed"
+else
+    echo " sudoers: SYNTAX ERROR — not installing" >&2
+    exit 1
+fi
+rm -f "$tmp_sudo"
+
+# Root systemd units
+sudo install -m 0644 "$t"/etc-systemd/apricot-cstate-tune.service /etc/systemd/system/
+sudo systemctl daemon-reload
+sudo systemctl enable --now apricot-cstate-tune.service
+echo " apricot-cstate-tune.service: enabled + started"
+
+# User systemd units (under lilith)
+sudo -u lilith mkdir -p /var/home/lilith/.config/systemd/user
+sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-crash-monitor.service /var/home/lilith/.config/systemd/user/
+sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-rail-watchdog.service /var/home/lilith/.config/systemd/user/
+sudo loginctl enable-linger lilith 2>/dev/null || true
+sudo systemctl --user -M lilith@.host daemon-reload
+sudo systemctl --user -M lilith@.host enable --now apricot-crash-monitor.service
+sudo systemctl --user -M lilith@.host restart apricot-rail-watchdog.service 2>/dev/null \
+    || sudo systemctl --user -M lilith@.host enable --now apricot-rail-watchdog.service
+echo " user units: enabled + started"
+
+# Load it87 now if not yet loaded
+if ! lsmod | grep -q '^it87 '; then
+    sudo modprobe it87 force_id=0x8628 ignore_resource_conflict=1 \
+        && echo " it87 module: loaded" \
+        || echo " it87 module: load FAILED (try reboot)"
+fi
+
+rm -rf "$t" /tmp/apricot-health.tar.gz
+echo "==> install complete"
+REMOTE
+
+echo "==> done"
diff --git a/modprobe.d/it87.conf b/modprobe.d/it87.conf
new file mode 100644
index 0000000..55815df
--- /dev/null
+++ b/modprobe.d/it87.conf
@@ -0,0 +1 @@
+options it87 force_id=0x8628 ignore_resource_conflict=1
diff --git a/modules-load.d/it87.conf b/modules-load.d/it87.conf
new file mode 100644
index 0000000..9900452
--- /dev/null
+++ b/modules-load.d/it87.conf
@@ -0,0 +1 @@
+it87
diff --git a/scripts/apricot-crash-logger b/scripts/apricot-crash-logger
new file mode 100755
index 0000000..becb1c4
--- /dev/null
+++ b/scripts/apricot-crash-logger
@@ -0,0 +1,83 @@
+#!/usr/bin/env bash
+# Continuously appends power/thermal/voltage state to $LOG so that the last
+# fractions of a second before a hard reset survive the crash.
+#
+# Env overrides:
+#   LOG           output path (default ~/apricot-crash.log)
+#   INTERVAL      sample period in seconds (default 0.1 = 10 Hz)
+#   SENSOR_CHIPS  regex of hwmon name(s) to capture (default k10temp|nvme|it8628|nct6.*|w83.*)
+
+set -o pipefail
+
+LOG="${LOG:-${HOME}/apricot-crash.log}"
+INTERVAL="${INTERVAL:-0.1}"
+GPU_SAMPLE_EVERY="${GPU_SAMPLE_EVERY:-10}" # nvidia-smi is slow; only invoke every Nth iter
+SENSOR_CHIPS="${SENSOR_CHIPS:-k10temp|nvme|it8628|nct6.*|w83.*}"
+
+printf '=== session start %s (pid=%s interval=%ss gpu_every=%s chips=%s) ===\n' \
+    "$(date --iso-8601=ns)" "$$" "$INTERVAL" "$GPU_SAMPLE_EVERY" "$SENSOR_CHIPS" >> "$LOG"
+
+# Pre-resolve matching hwmon paths once per second (cheaper than per-sample).
+declare -a HWMONS
+refresh_hwmons() {
+    HWMONS=()
+    for h in /sys/class/hwmon/hwmon*; do
+        [ -d "$h" ] || continue
+        [ -r "$h/name" ] || continue
+        name=$(<"$h/name") # bash builtin — no fork
+        [[ "$name" =~ ^(${SENSOR_CHIPS})$ ]] || continue
+        HWMONS+=("$h")
+    done
+}
+refresh_hwmons
+last_refresh=$SECONDS
+iter=0
+
+while :; do
+    ts=$(date --iso-8601=ns)
+
+    # GPU telemetry — skip most iterations because nvidia-smi startup is
+    # ~300-500ms, which would cap the loop at ~2 Hz otherwise.
+    if (( iter % GPU_SAMPLE_EVERY == 0 )); then
+        while IFS= read -r gpu_line; do
+            printf '%s gpu %s\n' "$ts" "$gpu_line"
+        done < <(nvidia-smi \
+            --query-gpu=index,temperature.gpu,power.draw,clocks.gr,clocks.mem,pstate,utilization.gpu,memory.used \
+            --format=csv,noheader,nounits 2>/dev/null)
+    fi
+    iter=$(( iter + 1 ))
+
+    # Platform sensors — use $( 5 )); then
+        refresh_hwmons
+        last_refresh=$SECONDS
+    fi
+
+    # Fsync once per second regardless of sample rate (amortized).
+    if (( ${ts:20:1} == 0 )); then
+        sync "$LOG" 2>/dev/null || true
+    fi
+
+    sleep "$INTERVAL"
+done >> "$LOG"
diff --git a/scripts/apricot-cstate-tune b/scripts/apricot-cstate-tune
new file mode 100755
index 0000000..4ae9d14
--- /dev/null
+++ b/scripts/apricot-cstate-tune
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+# Disable deep CPU C-states so Vcore stays at a higher baseline and the VRM
+# doesn't have to slam from C6/C7 idle back to full current on every workload
+# transient. Reduces transient-demand magnitude; does NOT fix root-cause PSU
+# or VRM degradation, but often reduces crash frequency on aging boards.
+#
+# Leaves C0 + C1 enabled (basic halt). Disables C2+ (package C-states).
+#
+# Reversible: run with `restore` to re-enable everything.
+
+set -o pipefail
+
+log() { printf '[%s] apricot-cstate-tune: %s\n' "$(date --iso-8601=s)" "$*"; }
+
+mode="${1:-apply}"
+
+case "$mode" in
+    apply)
+        n_cpus=$(ls -d /sys/devices/system/cpu/cpu[0-9]* 2>/dev/null | wc -l)
+        disabled=0
+        for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
+            [ -w "$s" ] || continue
+            idx="${s%/disable}"; idx="${idx##*state}"
+            # Keep states 0 (POLL/C0) and 1 (C1/halt); disable 2+.
+            if (( idx >= 2 )); then
+                echo 1 > "$s" 2>/dev/null && disabled=$(( disabled + 1 ))
+            fi
+        done
+        log "disabled $disabled idle-state entries across $n_cpus CPUs (kept C0+C1)"
+        ;;
+    restore)
+        enabled=0
+        for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
+            [ -w "$s" ] || continue
+            echo 0 > "$s" 2>/dev/null && enabled=$(( enabled + 1 ))
+        done
+        log "re-enabled $enabled idle-state entries"
+        ;;
+    status)
+        printf 'cpu0 idle states:\n'
+        for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
+            [ -d "$d" ] || continue
+            name=$(cat "$d/name" 2>/dev/null)
+            dis=$(cat "$d/disable" 2>/dev/null)
+            printf ' %s disable=%s name=%s\n' "$(basename "$d")" "$dis" "$name"
+        done
+        ;;
+    *)
+        echo "usage: $0 {apply|restore|status}" >&2
+        exit 2
+        ;;
+esac
diff --git a/scripts/apricot-rail-mitigate b/scripts/apricot-rail-mitigate
new file mode 100755
index 0000000..8ce2a50
--- /dev/null
+++ b/scripts/apricot-rail-mitigate
@@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# Emergency rail-deviation responder. Invoked by apricot-rail-watchdog when
+# a rail excursion is detected. Goal: reduce power demand for N seconds to
+# let the rail recover, then restore.
+#
+# Argv (from watchdog): chip, value (mV), baseline (mV), |delta| (mV), source timestamp
+#
+# Actions:
+#   1. Drop both GPU power caps to GPU_LIMIT_SAFE (default 250W).
+#   2. Pin CPU governor to "powersave".
+#   3. Hold for HOLD_SECONDS (default 60).
+#   4. Restore prior values if we recorded them.
+#
+# Requires root (nvidia-smi -pl, writing to /sys/devices/system/cpu/...).
+# Intended to run as a root-side systemd unit triggered via a fifo or via
+# sudoers allowlist for the lilith user — install.sh sets this up.
+
+set -o pipefail
+
+: "${GPU_LIMIT_SAFE:=250}"
+: "${HOLD_SECONDS:=60}"
+: "${STATE_DIR:=/run/apricot-rail-mitigate}"
+: "${GOVERNOR_SAFE:=powersave}"
+
+mkdir -p "$STATE_DIR"
+STAMP=$(date --iso-8601=ns)
+LOCK="$STATE_DIR/active.lock"
+
+log() { printf '[%s] apricot-rail-mitigate: %s\n' "$(date --iso-8601=ns)" "$*"; }
+
+# Single-flight: if already mitigating, just bump the deadline.
+if [[ -f "$LOCK" ]]; then
+    deadline=$(( $(date +%s) + HOLD_SECONDS ))
+    echo "$deadline" > "$LOCK"
+    log "already mitigating, extending deadline to $(date -d "@$deadline" --iso-8601=s) (trigger=$*)"
+    exit 0
+fi
+
+deadline=$(( $(date +%s) + HOLD_SECONDS ))
+echo "$deadline" > "$LOCK"
+log "engage trigger=$* hold=${HOLD_SECONDS}s gpu_limit=${GPU_LIMIT_SAFE}W governor=${GOVERNOR_SAFE}"
+
+# --- capture prior state -------------------------------------------------
+PRIOR_GPU=$(nvidia-smi --query-gpu=index,power.limit --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
+echo "$PRIOR_GPU" > "$STATE_DIR/prior_gpu"
+
+PRIOR_GOV=""
+for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
+    [ -r "$g" ] && PRIOR_GOV="$(cat "$g")" && break
+done
+echo "$PRIOR_GOV" > "$STATE_DIR/prior_gov"
+
+# --- apply safe state ----------------------------------------------------
+while IFS=, read -r idx _; do
+    [[ "$idx" =~ ^[0-9]+$ ]] || continue
+    nvidia-smi -i "$idx" -pl "$GPU_LIMIT_SAFE" >/dev/null 2>&1 \
+        && log "gpu $idx -> ${GPU_LIMIT_SAFE}W"
+done <<< "$PRIOR_GPU"
+
+for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
+    [ -w "$g" ] || continue
+    echo "$GOVERNOR_SAFE" > "$g" 2>/dev/null || true
+done
+log "cpu governor -> $GOVERNOR_SAFE (prior=$PRIOR_GOV)"
+
+# --- hold, honoring deadline bumps --------------------------------------
+while true; do
+    now=$(date +%s)
+    target=$(cat "$LOCK" 2>/dev/null || echo 0)
+    (( now >= target )) && break
+    sleep $(( target - now ))
+done
+
+# --- restore -------------------------------------------------------------
+while IFS=, read -r idx prior_w; do
+    [[ "$idx" =~ ^[0-9]+$ ]] || continue
+    prior_w="${prior_w%.*}"
+    [[ -n "$prior_w" ]] || continue
+    nvidia-smi -i "$idx" -pl "$prior_w" >/dev/null 2>&1 \
+        && log "gpu $idx -> ${prior_w}W (restored)"
+done < "$STATE_DIR/prior_gpu"
+
+if [[ -n "$PRIOR_GOV" ]]; then
+    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
+        [ -w "$g" ] || continue
+        echo "$PRIOR_GOV" > "$g" 2>/dev/null || true
+    done
+    log "cpu governor -> $PRIOR_GOV (restored)"
+fi
+
+rm -f "$LOCK"
+log "disengage"
diff --git a/scripts/apricot-rail-mitigate-trigger b/scripts/apricot-rail-mitigate-trigger
new file mode 100755
index 0000000..8402d65
--- /dev/null
+++ b/scripts/apricot-rail-mitigate-trigger
@@ -0,0 +1,5 @@
+#!/usr/bin/env bash
+# User-space shim invoked by the watchdog. Delegates to the root-owned
+# apricot-rail-mitigate via sudoers (install.sh installs a NOPASSWD rule
+# scoped to this one command).
+exec sudo -n /var/opt/apricot-health/sbin/apricot-rail-mitigate "$@"
diff --git a/scripts/apricot-rail-watchdog b/scripts/apricot-rail-watchdog
new file mode 100755
index 0000000..1e520b4
--- /dev/null
+++ b/scripts/apricot-rail-watchdog
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+# Watches a stable PSU-derived rail (default: in5 on it8628 chips) by
+# learning each chip's baseline from the first BASELINE_SAMPLES and alerting
+# when later samples deviate by more than DEVIATION_MV.
+#
+# Works for any rail that shouldn't swing under normal operation. For Vcore
+# (which swings 600mV+ during P-state transitions on Threadripper) this
+# approach is unsuitable — use in5 (+12V divided) or in7 (3VSB) instead.
+#
+# hwmon numbering is boot-order-dependent, so we resolve it per-line.
+#
+# Optional mitigation hook (set MITIGATE_CMD) runs when a deviation fires —
+# receives the chip, value, baseline, |delta|, and source timestamp on its
+# argv. Use to auto-throttle GPU power or CPU governor as an emergency response.
+
+set -o pipefail
+
+LOG="${HOME}/apricot-crash.log"
+ALERTS="${HOME}/apricot-rail-alerts.log"
+
+: "${DEVIATION_MV:=30}"
+: "${BASELINE_SAMPLES:=20}"
+: "${RAIL_KEY:=in5}"
+: "${CHIP_REGEX:=it8628/hwmon[0-9]+}"
+: "${MITIGATE_CMD:=}"
+
+printf '=== rail-watchdog start %s key=%s deviation=%smV baseline_samples=%s chip=%s mitigate=%s ===\n' \
+    "$(date --iso-8601=ns)" "$RAIL_KEY" "$DEVIATION_MV" "$BASELINE_SAMPLES" "$CHIP_REGEX" "${MITIGATE_CMD:-}" >> "$ALERTS"
+
+emit() {
+    local ts msg="$*"
+    ts=$(date --iso-8601=ns)
+    printf '%s [WARN] %s\n' "$ts" "$msg" | tee -a "$ALERTS" >&2
+}
+
+info() {
+    local ts msg="$*"
+    ts=$(date --iso-8601=ns)
+    printf '%s [INFO] %s\n' "$ts" "$msg" >> "$ALERTS"
+}
+
+declare -A seen_count
+declare -A baseline
+declare -A buffer
+
+chip_re="($CHIP_REGEX)"
+val_re=" ${RAIL_KEY}=([0-9]+)$"
+
+median_of() {
+    printf '%s\n' $1 | sort -n | awk -v n=$(wc -w <<< "$1") 'NR==int((n+1)/2){print;exit}'
+}
+
+tail -F -n 0 "$LOG" 2>/dev/null | while IFS= read -r line; do
+    [[ "$line" =~ $chip_re ]] || continue
+    chip="${BASH_REMATCH[1]}"
+    [[ "$line" =~ $val_re ]] || continue
+    val="${BASH_REMATCH[1]}"
+    src_ts="${line%% *}"
+
+    n="${seen_count[$chip]:-0}"
+    n=$(( n + 1 ))
+    seen_count[$chip]=$n
+
+    if (( n <= BASELINE_SAMPLES )); then
+        buffer[$chip]="${buffer[$chip]:+${buffer[$chip]} }$val"
+        if (( n == BASELINE_SAMPLES )); then
+            b=$(median_of "${buffer[$chip]}")
+            baseline[$chip]=$b
+            info "baseline_learned chip=${chip} key=${RAIL_KEY} baseline=${b}mV samples=${BASELINE_SAMPLES}"
+            unset 'buffer[$chip]'
+        fi
+        continue
+    fi
+
+    b="${baseline[$chip]}"
+    dev=$(( val - b ))
+    (( dev < 0 )) && dev=$(( -dev ))
+    if (( dev > DEVIATION_MV )); then
+        emit "rail_deviation chip=${chip} key=${RAIL_KEY} val=${val}mV baseline=${b}mV |Δ|=${dev}mV at=${src_ts}"
+        if [[ -n "$MITIGATE_CMD" ]]; then
+            # Detach mitigation so a slow command can't block alert delivery.
+            "$MITIGATE_CMD" "$chip" "$val" "$b" "$dev" "$src_ts" >> "$ALERTS" 2>&1 &
+        fi
+    fi
+done
diff --git a/scripts/apricot-rasdaemon-setup b/scripts/apricot-rasdaemon-setup
new file mode 100755
index 0000000..2aa35ae
--- /dev/null
+++ b/scripts/apricot-rasdaemon-setup
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# Install + enable rasdaemon for detailed AMD MCA/MCE parsing.
+#
+# rasdaemon runs a trace-buffer consumer that decodes machine-check events
+# into a sqlite DB (ras-mc_event.db, usually under /var/lib/rasdaemon/) and
+# syslogs them in human-readable form. Much more detail than edac_mce_amd
+# alone. If any crash is in-CPU or NB-side (not pure board-level power
+# loss), this catches it.
+#
+# Idempotent. Safe to re-run.
+
+set -o pipefail
+
+log() { printf '[%s] apricot-rasdaemon-setup: %s\n' "$(date --iso-8601=s)" "$*"; }
+
+if ! command -v rasdaemon >/dev/null 2>&1; then
+    log "rasdaemon not installed — attempting rpm-ostree install"
+    if command -v rpm-ostree >/dev/null 2>&1; then
+        sudo rpm-ostree install rasdaemon \
+            && log "installed — a reboot is required for the layered package to activate" \
+            || { log "rpm-ostree install failed"; exit 1; }
+    elif command -v dnf >/dev/null 2>&1; then
+        sudo dnf install -y rasdaemon \
+            || { log "dnf install failed"; exit 1; }
+    else
+        log "no package manager found; install rasdaemon manually"
+        exit 1
+    fi
+fi
+
+# Enable + start the service. On rpm-ostree systems this is deferred until
+# reboot; systemctl will still succeed (the symlink is made).
+sudo systemctl enable rasdaemon.service 2>&1 | grep -v '^Created' || true
+sudo systemctl start rasdaemon.service 2>&1 \
+    && log "rasdaemon.service started" \
+    || log "rasdaemon.service will start after reboot (layered package)"
+
+log "status:"
+systemctl status rasdaemon.service --no-pager 2>&1 | head -10 || true
+
+log "recent events (may be empty):"
+sudo ras-mc-ctl --summary 2>&1 | head -15 || true
diff --git a/sudoers.d/apricot-health b/sudoers.d/apricot-health
new file mode 100644
index 0000000..5d5f952
--- /dev/null
+++ b/sudoers.d/apricot-health
@@ -0,0 +1,4 @@
+# Allow user lilith to invoke the rail-mitigation script without password
+# (fired by apricot-rail-watchdog.service when a rail deviation is detected).
+# Scoped to this one command.
+lilith ALL=(root) NOPASSWD: /var/opt/apricot-health/sbin/apricot-rail-mitigate
diff --git a/systemd/apricot-crash-monitor.service b/systemd/apricot-crash-monitor.service
new file mode 100644
index 0000000..4b041d4
--- /dev/null
+++ b/systemd/apricot-crash-monitor.service
@@ -0,0 +1,16 @@
+[Unit]
+Description=Apricot crash logger (high-frequency power/thermal/voltage capture)
+After=default.target
+
+[Service]
+Type=simple
+ExecStart=/var/home/lilith/bin/apricot-crash-logger
+Environment=INTERVAL=0.1
+Restart=always
+RestartSec=2
+StandardOutput=null
+StandardError=journal
+SyslogIdentifier=apricot-crash-monitor
+
+[Install]
+WantedBy=default.target
diff --git a/systemd/apricot-cstate-tune.service b/systemd/apricot-cstate-tune.service
new file mode 100644
index 0000000..28637b9
--- /dev/null
+++ b/systemd/apricot-cstate-tune.service
@@ -0,0 +1,16 @@
+[Unit]
+Description=Apricot CPU C-state tuning (disable deep C-states to reduce VRM transient demand)
+After=multi-user.target
+ConditionPathExists=/sys/devices/system/cpu/cpu0/cpuidle/state0
+
+[Service]
+Type=oneshot
+ExecStart=/var/opt/apricot-health/sbin/apricot-cstate-tune apply
+ExecStop=/var/opt/apricot-health/sbin/apricot-cstate-tune restore
+RemainAfterExit=yes
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=apricot-cstate-tune
+
+[Install]
+WantedBy=multi-user.target
diff --git a/systemd/apricot-rail-watchdog.service b/systemd/apricot-rail-watchdog.service
new file mode 100644
index 0000000..992cbe7
--- /dev/null
+++ b/systemd/apricot-rail-watchdog.service
@@ -0,0 +1,17 @@
+[Unit]
+Description=Apricot PSU rail deviation watchdog (it8628 in5 baseline)
+After=apricot-crash-monitor.service
+Wants=apricot-crash-monitor.service
+
+[Service]
+Type=simple
+ExecStart=/var/home/lilith/bin/apricot-rail-watchdog
+Environment=MITIGATE_CMD=/var/home/lilith/bin/apricot-rail-mitigate-trigger
+Restart=always
+RestartSec=2
+StandardOutput=null
+StandardError=journal
+SyslogIdentifier=apricot-rail-watchdog
+
+[Install]
+WantedBy=apricot-crash-monitor.service