Friday, April 18, 2014

A dead man's switch for Linux, in bash

Last night, I was working on a (very) remote physical machine, over ssh.
After modifying Debian's /etc/network/interfaces to include the definition of a new bridge, br0, and playing around with its settings, I was about to bring it down and then back up again for a final check. Instead, this is what I actually typed:

# ifdown eth0

This brought down eth0 instead of br0. I had taken down the physical Ethernet interface, the very one I was using to SSH into the machine. So now what? Left with an unreachable machine, the only option was to call a friend and ask him to drive over and press the magical Reset button.

How can one defend against this or similar cases, which are the equivalent of sawing off the branch you're sitting on? With something like a dead man's switch: if you don't do something within a specified time limit, the machine assumes your connection is gone, and reboots.

Ferm has an interactive mode for exactly this purpose: After applying new firewall rules, it waits for a few seconds for the user to confirm that everything is OK. If nothing happens, it assumes the new firewall rules have caused the user to lock themselves out, so it reverts them. Similarly, Windows or GNOME will wait for confirmation after changes to a monitor's resolution and revert to the previous settings after a few seconds.
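The same confirm-or-revert idea can be sketched in bash. This is only an illustration of the pattern, not how ferm or GNOME implement it; the function name confirm_or_revert and its three arguments are made up for this example.

```bash
# confirm_or_revert TIMEOUT APPLY_CMD REVERT_CMD   (hypothetical helper)
# Runs APPLY_CMD, then waits TIMEOUT seconds for the user to type "yes".
# Without that confirmation, REVERT_CMD is run.
confirm_or_revert() {
    local timeout=$1 apply_cmd=$2 revert_cmd=$3 answer=""

    # Apply the risky change; give up if it fails outright.
    eval "$apply_cmd" || return 1

    printf 'Type "yes" within %s seconds to keep the change: ' "$timeout"
    # read -t fails on timeout or end of input; treat both as "no answer".
    read -r -t "$timeout" answer || answer=""

    if [ "$answer" = "yes" ]; then
        echo "Change kept."
    else
        echo "No confirmation, reverting."
        eval "$revert_cmd"
    fi
}
```

For the mistake above, something like confirm_or_revert 30 'ifdown eth0' 'ifup eth0' would have brought eth0 back up after 30 seconds of silence. Running it inside screen or tmux is probably a sensible precaution, so that the shell is not killed by a hangup before it gets to revert.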

Here's an approach in bash: Make sure a reboot is always scheduled within the next 60 minutes. If something goes wrong, e.g., your connection is gone and you can no longer get it back, the system will reboot. As long as you are still connected, you can cancel the pending shutdown and schedule a new one.

Add this to /root/.bashrc:

alias dead_man_switch="shutdown -c; shutdown -r +60 & disown"

Using disown ensures you cannot accidentally bring the shutdown process into the foreground and kill it. Once you run dead_man_switch, you have to re-run it at least once every 60 minutes to keep the system from rebooting itself.
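If a fixed 60 minutes is too long or too short, the alias can be turned into a function that takes the timeout as an argument, with a companion to disarm the switch when you are done. This is a sketch along the same lines as the alias; the name dead_man_disarm and the SHUTDOWN_CMD variable are my own additions, the latter only so the functions can be tried out without touching the real shutdown.

```bash
# SHUTDOWN_CMD is a hypothetical override hook for dry runs; defaults to shutdown.
SHUTDOWN_CMD=${SHUTDOWN_CMD:-shutdown}

# Arm (or re-arm) the switch: reboot in N minutes (default 60) unless re-run.
dead_man_switch() {
    local minutes=${1:-60}
    # Accept only positive integers without leading zeros, so a typo
    # can never turn into "+0", which would reboot immediately.
    case $minutes in
        ''|*[!0-9]*|0*)
            echo "usage: dead_man_switch [minutes]" >&2
            return 1
            ;;
    esac
    # Cancel any pending shutdown; ignore the error if none is scheduled.
    "$SHUTDOWN_CMD" -c >/dev/null 2>&1 || true
    # Schedule the reboot in the background and detach it from job control.
    "$SHUTDOWN_CMD" -r "+$minutes" &
    disown "$!" 2>/dev/null || true
}

# Disarm the switch once the risky work is finished.
dead_man_disarm() {
    "$SHUTDOWN_CMD" -c
}
```

Run dead_man_switch 10 before touching the network configuration, and dead_man_disarm once you have confirmed you can still log in.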

Hope someone finds this useful!