Greg Black comments that he took a look at Joe Armstrong's thesis I linked to below. Just in case his description makes it sound intimidating: the error-handling philosophy discussion --- let it crash --- is one section of chapter 4 (i.e. about 3 pages). Of course, handling errors is only one step towards a reliable system.
In fact, the chapters of the thesis are largely approachable independently of each other. Chapters 2 and 4 (Architecture and Programming Principles) are particularly good in this regard.
In the meantime, for those feeling too lazy to read the actual PDF, an executive summary:
- We don't know how to write bug-free programs
- So every substantial program will have bugs
- Even if we are lucky enough to miss our bugs, unexpected interactions with the outside world (including the hardware we are running on) will cause periodic failures in any long-running process
- So make sure any faults that do occur can't interfere with the execution of the rest of your program
- Faults are nasty, subtle, vicious creatures with thousands of non-deterministic side-effects to compensate for
- So the only safe way to handle a bug is to terminate the buggy process
- So don't program defensively: just let it crash, and make sure that
  - your runtime provides adequate logging/tracing/hot-upgrade support to detect, debug, and repair the running system
  - you run multiple levels of supervisor/watchdog, from supervisor trees all the way up to automatic, hot-failover hardware clusters
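The supervision idea in the last two points can be sketched in a few lines. Below is a minimal, hypothetical Python analogue of a one-for-one supervisor (names like `supervise` and `max_restarts` are my own, not from the thesis): the worker does no defensive error handling, any exception simply kills it, and the supervisor logs the fault and restarts it, escalating once a restart limit is exceeded.

```python
import traceback

def supervise(worker, max_restarts=3):
    """Run `worker`, restarting it on any crash, up to `max_restarts` times.

    The worker itself does no defensive error handling: a fault simply
    terminates it, and recovery is entirely the supervisor's job.
    """
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            # Log the fault for later debugging rather than patching around it.
            print(f"worker crashed (restart {restarts}/{max_restarts}):")
            traceback.print_exc()
            if restarts > max_restarts:
                # Escalate: give up and let a higher-level supervisor decide.
                raise

# A worker that fails transiently on its first two runs, then succeeds.
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient fault")
    return "ok"

result = supervise(flaky)  # restarts twice, then returns "ok"
```

In a real Erlang/OTP system the supervisor is a separate process and the restart strategies are richer, but the division of labour is the same: workers crash freely, supervisors decide what recovery means.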