session-tools/docs/lan-power-ctrl.md

134 lines
5.5 KiB
Markdown
Raw Normal View History

# lan-power-ctrl — LAN power-cycle for wedged hosts
When a host's userspace freezes (D-state, OOM, disk hang), the kernel still
answers ICMP and accepts TCP, but no daemon completes a request. SSH banner
never arrives → unreachable. The only recovery is a hardware power-cycle.
This doc covers the hardware to buy, the network topology, and the two
scripts (`host-probe`, `power-cycle`) that turn "apricot froze" into a
one-liner from any machine on the LAN/VPN.
## Diagnosis vs recovery
```
layer healthy host frozen userspace unreachable host
───── ──────────── ──────────────── ────────────────
ICMP reply <50ms reply <50ms no reply
TCP accept :22 succeeds succeeds connection refused / timeout
SSH banner arrives <100ms never arrives
host-probe up wedged down
power-cycle n/a needed power's the cure regardless
```
A `wedged` classification is the signature that a remote shell can't help —
every login path needs the same userspace that's stuck. Cut power.
## Shopping list
### Per host to be remotely recoverable
| Device | Where | Purpose | Link |
|---|---|---|---|
| **Shelly Plus Plug US** (Gen2) **or Shelly Plug US Gen4** | Between the host's PSU and the wall | Local HTTP RPC, no cloud | [Plus (Gen2)](https://us.shelly.com/products/shelly-plus-plug-us) · [Gen4](https://us.shelly.com/products/shelly-plug-us-gen4-white) |
Either generation works — Gen4 adds power monitoring and Matter, both share
the same Gen2-compatible `/rpc/Switch.Set` API the `power-cycle` script uses.
Buy whichever is in stock and cheapest at order time. ~$25.
### For the modem (separate problem — can't rescue itself over its own WiFi)
| Device | Where | Purpose | Link |
|---|---|---|---|
| **SwitchBot Plug Mini** | Modem | BLE-controllable plug | [product](https://us.switch-bot.com/products/switchbot-plug-mini) |
| **USB Bluetooth dongle** | Plugged into **black** (X399 boards have no integrated BT — see [[reference-host-hardware]]) | Lets black drive BLE to the modem plug | Search "USB Bluetooth 5 adapter CSR8510 or RTL8761" (~$10, plug-and-play under bluez) |
Why black and not apricot: using apricot to rescue the modem couples failure
modes — apricot is *also* a recoverable host. Black is the independent
watchdog.
Why not put the SwitchBot Plug Mini on a *host* (so WiFi-controlled)? It
*can* be — but for the modem you can't, because the modem dying takes WiFi
down with it. BLE from a LAN-attached watchdog works regardless of WAN
state, assuming **separate modem and router** (see Caveat).
### Caveat: combined modem-router vs separate
If your ISP gave you a combined modem-router (one box does both), modem-dead
also means LAN-dead. Nothing on the LAN can rescue it. Options narrow to:
phone/laptop in BLE range, or an auto-rebooter plug (watchdog built into the
plug itself — pings a target, power-cycles on failure).
If modem and router are separate (typical home-server setup), LAN stays
alive when WAN dies, and the black-as-watchdog plan works as designed.
## One-time setup after plugs arrive
1. **Set the Shelly's own restore behavior** (so it remembers its on/off
state across a *wall* outage):
```sh
curl 'http://<plug>/rpc/Switch.SetConfig?id=0&config={"initial_state":"restore_last"}'
```
2. **Set the host's BIOS to "Restore on AC power loss"** (so when the Shelly
restores power, the host actually boots up rather than sitting dark).
3. **Add the plug to the config:**
```
# ~/.config/power-cycle/plugs.conf
apricot http://<plug-ip>
```
Get `<plug-ip>` from the router DHCP table or `arp -a | grep -i shelly`.
## Scripts
### `host-probe` — classify reachability
```sh
host-probe apricot.lan # → up | wedged | down (one-shot)
host-probe --watch apricot.lan # loop; emits one line per state change
HOST_PROBE_INTERVAL=10 HOST_PROBE_TIMEOUT=2 host-probe --watch apricot.lan 22
```
Three independent probes: ICMP, TCP accept, SSH banner. The banner timing is
what distinguishes a wedged host from a healthy one — a real sshd flushes
its `SSH-2.0-…` banner within milliseconds of connect.
Composes with the harness `Monitor` tool: each state-change line becomes one
notification.
### `power-cycle` — Shelly Gen2 RPC wrapper
```sh
power-cycle apricot # off → POWER_CYCLE_OFF_SECS (default 5) → on
power-cycle off|on|status apricot
power-cycle list # show configured host → plug mappings
```
Config: `~/.config/power-cycle/plugs.conf`, one line per host:
```
<host> <plug-base-url>
```
Errors covered: missing config, unknown host, plug unreachable (with hint to
fall back to BLE for modem outages), unrecognized Shelly response.
## Future: modem watchdog daemon on black
Once the USB BT dongle and SwitchBot plug arrive:
- `bleak`-based Python daemon, runs as `systemd --user` unit on black
- Pings WAN (e.g. `1.1.1.1`) every 30s
- After N consecutive failures, BLE-toggles the SwitchBot plug off → 10s → on
- Logs to journal; emits a notification via the existing apricot
speech-synthesis path (see [[reference-rvoice]]) when it acts
Not implemented yet — waiting on hardware.
## Files
| Path | Role |
|---|---|
| `bin/host-probe` | Three-layer reachability probe |
| `bin/power-cycle` | Shelly Gen2 RPC wrapper |
| `~/.config/power-cycle/plugs.conf` | Per-host plug URL mapping (user-created) |