I learned an important lesson about how to build reliable systems, more or less by accident, when I was a PhD student. Our department IT folks maintained two x86 clusters, one with Intel chips and one with AMD. Both were supplied by the lowest bidder, so the CPU fans would periodically die. The AMD chips had no thermal throttling (AMD added it a few iterations later), so a failed fan meant a cooked chip and a dead node. The Intel chips did throttle, so a failed fan meant that the node dropped from 2-3 GHz right down to 100 MHz and kept working, just really slowly. If a node died, the cluster software would simply respawn its work on another node, but if a node got really slow, the whole system ground to a halt waiting for its result.

The important lesson: You don’t build resilient systems by making sure components don’t fail; you build them by assuming that components will fail and ensuring that component failure doesn’t kill the whole system.

The corollary: Sometimes, component failure is very difficult to detect. It can be better to make things crash in obvious ways than to try to recover from faults.
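One way to picture the corollary is a scheduler that refuses to wait for a slow node and treats a missed deadline exactly like a crash. The sketch below is a toy model, not the cluster software from the story: the `Node` type, the `schedule` function, the deadline parameter, and all the numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    speed_mhz: float  # 0 models a dead node (the cooked chip)


def run_time(node: Node, work_mcycles: float) -> Optional[float]:
    """Seconds the node needs for the work, or None if the node is dead."""
    if node.speed_mhz <= 0:
        return None
    return work_mcycles / node.speed_mhz


def schedule(nodes: list[Node], work_mcycles: float, deadline_s: float) -> tuple[str, float]:
    """Try nodes in order and return (finishing node, total seconds elapsed).

    A node that would miss the deadline is declared failed, and the work is
    respawned on the next node, just as it would be after a crash.
    """
    elapsed = 0.0
    for node in nodes:
        needed = run_time(node, work_mcycles)
        if needed is None:
            # Dead node: assumed to be noticed immediately, so it costs no time.
            continue
        if needed <= deadline_s:
            return node.name, elapsed + needed
        # Throttled node: we only find out by waiting for the full deadline.
        elapsed += deadline_s
    raise RuntimeError("no node finished the work within the deadline")


if __name__ == "__main__":
    nodes = [Node("throttled", 100.0), Node("healthy", 2000.0)]
    # With a 30 s deadline the slow node is abandoned and the work respawned.
    print(schedule(nodes, 20000.0, 30.0))
    # With no effective deadline the scheduler waits on the slow node.
    print(schedule(nodes, 20000.0, float("inf")))
```

In this model the throttled node is more expensive than a dead one: a dead node is skipped at once, while a slow node costs a full deadline before anyone notices. Picking the deadline is the hard part in practice, which is why crashing obviously is often the easier failure to handle.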

Joe Armstrong learned these lessons some years before I did, which is how Erlang/OTP can achieve such impressive uptimes.

david_chisnall on Nostr: I learned an important lesson about how to build reliable systems, more or less by ...