66 lines
3.7 KiB
Markdown
66 lines
3.7 KiB
Markdown
# apricot-health
|
|
|
|
Power-fault diagnostics and mitigation for **apricot** — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).
|
|
|
|
## What's in here
|
|
|
|
| Component | What it does |
|
|
|---|---|
|
|
| `scripts/apricot-crash-logger` | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to `~/apricot-crash.log`, fsync'd every second, so the last fractions of a second before a hard reset survive the crash. |
|
|
| `scripts/apricot-rail-watchdog` | Tails the crash-log, learns per-chip baseline for `in5` on each `it8628/hwmonN`, alerts on deviations > `DEVIATION_MV` (default 30 mV). Optionally invokes a mitigation hook. |
|
|
| `scripts/apricot-rail-mitigate` | Root-only emergency responder: drops GPU power caps and pins CPU governor to `powersave` for `HOLD_SECONDS` (default 60), then restores. Fired by the watchdog via sudoers. |
|
|
| `scripts/apricot-rail-mitigate-trigger` | User-space shim that `sudo`s into `apricot-rail-mitigate` (scoped NOPASSWD). |
|
|
| `scripts/apricot-cstate-tune` | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
|
|
| `scripts/apricot-rasdaemon-setup` | Installs + enables `rasdaemon` for detailed AMD MCA/MCE decoding into a sqlite DB. |
|
|
| `modprobe.d/it87.conf` | `force_id=0x8628 ignore_resource_conflict=1` — binds IT8628E SuperIO so voltage/fan/temp rails are exposed in `/sys/class/hwmon`. |
|
|
| `modules-load.d/it87.conf` | Loads `it87` at boot. |
|
|
| `sudoers.d/apricot-health` | NOPASSWD rule for `lilith` to invoke the mitigation entrypoint (scoped to one command). |
|
|
| `systemd/*.service` | Three units — one root (`apricot-cstate-tune`), two user (`apricot-crash-monitor`, `apricot-rail-watchdog`). |
|
|
|
|
## Install
|
|
|
|
```sh
|
|
./install.sh # targets HOST=apricot by default
|
|
HOST=other-host ./install.sh # or override
|
|
```
|
|
|
|
Idempotent. Re-run to push updates.
|
|
|
|
## Tuning
|
|
|
|
All runtime behavior is env-overridable through systemd drop-ins:
|
|
|
|
```sh
|
|
systemctl --user edit apricot-rail-watchdog
|
|
# [Service]
|
|
# Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
|
|
```
|
|
|
|
Key knobs:
|
|
|
|
- `INTERVAL` (crash-logger) — sample period in seconds; `0.1` = 10 Hz.
|
|
- `DEVIATION_MV` (watchdog) — deviation from learned baseline that triggers an alert.
|
|
- `MITIGATE_CMD` (watchdog) — path to mitigation hook; empty = alert only.
|
|
- `GPU_LIMIT_SAFE` (mitigate) — wattage to clamp GPUs to during mitigation.
|
|
- `HOLD_SECONDS` (mitigate) — how long to hold the safe state.
|
|
|
|
## Outputs
|
|
|
|
- `~/apricot-crash.log` — per-sample telemetry.
|
|
- `~/apricot-rail-alerts.log` — watchdog alerts + baselines.
|
|
- `journalctl --user -u apricot-rail-watchdog` — live alerts (WARNING priority).
|
|
- `journalctl -u apricot-cstate-tune` — one-shot C-state tune result at boot.
|
|
- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup) — decoded MCEs.
|
|
|
|
## Post-mortem flow when a crash happens
|
|
|
|
1. `ssh apricot` (after it comes back — BIOS "AC Back: Power On" auto-restarts).
|
|
2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5` — find the new session boundary.
|
|
3. Everything between the previous session's last line and the new session marker is the last ~N seconds before death.
|
|
4. `tail ~/apricot-rail-alerts.log` — did the watchdog see rail deviation before the event?
|
|
5. `journalctl -b -1 --no-pager | tail -40` — kernel's last words (often normal; hard-off gives no panic).
|
|
6. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe` — should increment by 1.
|
|
|
|
## Diagnosis so far
|
|
|
|
See [`docs/DIAGNOSIS.md`](docs/DIAGNOSIS.md).
|