The Importance of Failure Analysis
As part of my job as a Senior Software Architect at CodeValue, I architect, design and lead various software projects for different clients. Recently, one of these clients asked me to perform a failure analysis on a system we had previously built for them, to make sure that the system behaves well under various “weird” edge-case scenarios that might just happen. The client wanted me to formally define what could go wrong during the system’s operation, define the expected behavior of the system when these conditions occur and, finally, check whether the system does indeed behave as expected.
Things Will Always Go Wrong!
Performing this failure analysis reminded me of our deep tendency as software developers to avoid thinking about the bad things that might happen to our software: the service might not be available, the DB might be down, or, god forbid, the disk might be full! All of these issues, and much more complicated ones, often get overlooked when designing and developing a system, because we focus on our “happy” flow. As a result, developers might catch exceptions locally, which prevents the application from crashing but may also introduce a global inconsistency that was never planned for.
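To make the last point concrete, here is a minimal Python sketch of how a locally caught exception can keep an application alive while leaving its state inconsistent. Everything in it is hypothetical and invented for illustration (the `OrderStore` class, the `DiskFullError` exception, the two `place_order_*` functions); it is not taken from the client's system.

```python
class DiskFullError(Exception):
    """Hypothetical failure raised by the storage layer."""


class OrderStore:
    """Hypothetical store that can be told to fail, simulating a full disk."""

    def __init__(self, fail=False):
        self.fail = fail
        self.saved = []

    def save(self, order_id):
        if self.fail:
            raise DiskFullError("no space left on device")
        self.saved.append(order_id)


def place_order_swallowing(store, charged, order_id):
    """Happy-flow thinking: catch the error locally and carry on."""
    charged.append(order_id)  # step 1: the customer is charged
    try:
        store.save(order_id)  # step 2: the order is persisted
    except DiskFullError:
        pass  # no crash, but the customer is now charged for an order that does not exist
    return True


def place_order_consistent(store, charged, order_id):
    """Same flow, with an explicitly defined behavior for the failure."""
    charged.append(order_id)
    try:
        store.save(order_id)
    except DiskFullError:
        charged.remove(order_id)  # compensate: undo the charge
        return False  # and report the failure to the caller
    return True


def demo():
    store = OrderStore(fail=True)  # simulate the full disk
    charged_a, charged_b = [], []
    ok_a = place_order_swallowing(store, charged_a, "A1")
    ok_b = place_order_consistent(store, charged_b, "A1")
    return ok_a, charged_a, ok_b, charged_b, store.saved
```

In the first variant the call reports success and the charge stays recorded although nothing was saved. In the second, the failure behavior was decided in advance: the charge is undone and the caller is told that the order failed.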
Failure Analysis – Defining The Expected Behavior
The thing that amazed me the most in the analysis process was that while the code itself handled most of these scenarios quite well, it was often not clear, even to the customer, what the expected behavior of the system under these failure conditions should be. Defining the expected behavior was ultimately the hardest part of the analysis. Verifying that the system follows these requirements was considerably easier…
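One possible way to capture such an analysis, sketched here in Python under my own assumptions, is a small table of failure scenarios in which each entry states the agreed expected behavior next to a probe that reports what the system actually does. The `FailureScenario` type, the `analyze` function and the scenario texts are all hypothetical, and the probes are stubs standing in for real fault-injection tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class FailureScenario:
    name: str                   # what goes wrong
    expected: str               # behavior agreed on with the customer
    observe: Callable[[], str]  # probe that reports the actual behavior


def analyze(scenarios: List[FailureScenario]) -> List[Tuple[str, str, str]]:
    """Return (name, expected, actual) for every scenario that deviates."""
    deviations = []
    for scenario in scenarios:
        actual = scenario.observe()
        if actual != scenario.expected:
            deviations.append((scenario.name, scenario.expected, actual))
    return deviations


# Hypothetical scenarios; the lambdas are stubs for real fault-injection tests.
SCENARIOS = [
    FailureScenario(
        name="database down",
        expected="queue the request and retry",
        observe=lambda: "queue the request and retry",
    ),
    FailureScenario(
        name="disk full",
        expected="reject the write and alert the operator",
        observe=lambda: "log the error and continue",
    ),
]

deviations = analyze(SCENARIOS)
```

With these stubs, only the "disk full" scenario is reported as a deviation. Writing down the `expected` column is the hard part; comparing it with the observed behavior is mostly mechanical.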
When did you last explicitly define what your system should do in case of failure? You really should.