Keep the BIT - check system liveliness
Moving to architectures like SOA that increase the number of “moving parts” or components in the system means that reliability goes down. It is simple math really: if you have 10 components, each with a reliability of 0.99, and all of them have to work, then the total reliability is 0.99^10, or about 0.904, and that’s before we take into account messages traveling over the wire and the network’s reliability (or lack thereof). This leaves us trying to build reliable systems from a (growing) bunch of unreliable components. I know, I know, there’s nothing new here. We’ve been using techniques like redundancy, statelessness etc. to help mitigate this since the beginning of time. With these techniques we decrease the “Mean Time Between Failure” (MTBF), since there are more parts that can fail, but increase the “Mean Time Between Critical Failure” (MTBCF), i.e. the system’s overall MTBF.
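To make the arithmetic concrete, here is a minimal Python sketch. It assumes components fail independently, that every component in a chain must work, and that a redundant group works as long as one replica does. The function names are mine, for illustration only.

```python
import math


def series_reliability(component_reliabilities):
    """Reliability of a chain in which every component must work."""
    return math.prod(component_reliabilities)


def parallel_reliability(replica_reliability, replicas):
    """Reliability of a redundant group that works if at least one replica works."""
    return 1 - (1 - replica_reliability) ** replicas


# Ten components at 0.99 each, all required.
total = series_reliability([0.99] * 10)
print(f"{total:.3f}")  # 0.904

# The same chain, but every component is duplicated (assumed independent replicas).
redundant = series_reliability([parallel_reliability(0.99, 2)] * 10)
print(f"{redundant:.3f}")  # 0.999
```

Doubling every component means twice as many parts that can break, yet the chance that the chain as a whole works goes up, which is the MTBF versus MTBCF tradeoff in miniature.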
Another aspect of reliability (and reliability calculations) is MTTR, or “Mean Time To Repair”, which in software mainly has to do with how long it takes before we know something is wrong. The usual approach to that is monitoring, which I’ve written about in the past (e.g. the blogjecting watchdog pattern). In this post I want to expand a little on another approach, which, while not common in IT systems, can be useful at times.
Enter the BIT – which is short for “Built In Tests”. BIT is a technique I picked up when I worked on multi-disciplinary systems that also included embedded systems. Each and every one of the embedded systems we developed (or integrated into the solution) supported BIT. Actually, they usually supported several types of BIT, at least PBIT, CBIT and IBIT:
- PBIT – Power-On Built In Test – usually a short test the system runs to make sure all of its components are ready to go. You’ve seen this one many times, since this is what motherboards do as you turn them on (all the blips and lights etc.)
- CBIT – Continuous Built In Test – makes sure the system is functioning, even when it isn’t really busy, so we’ll know about problems before we actually try to use the system
- IBIT – Initiated Built In Test – provides a way to find out exactly what’s wrong when one of the other test types fails
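One way to picture the three test types in software is a small registry that groups checks by kind. This is only a sketch to illustrate the idea. `BitKind`, `BuiltInTests` and the check names are hypothetical and do not come from any embedded framework.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class BitKind(Enum):
    PBIT = "power-on"    # run once at startup
    CBIT = "continuous"  # run periodically while the system is up
    IBIT = "initiated"   # run on demand to diagnose a failure


class BuiltInTests:
    """Hypothetical registry that groups named checks by BIT kind."""

    def __init__(self) -> None:
        self._checks: Dict[BitKind, List[Tuple[str, Callable[[], bool]]]] = {
            kind: [] for kind in BitKind
        }

    def register(self, kind: BitKind, name: str, check: Callable[[], bool]) -> None:
        self._checks[kind].append((name, check))

    def run(self, kind: BitKind) -> Dict[str, bool]:
        """Run every check of one kind; a check that raises counts as failed."""
        results: Dict[str, bool] = {}
        for name, check in self._checks[kind]:
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False
        return results


bit = BuiltInTests()
bit.register(BitKind.PBIT, "config loaded", lambda: True)
bit.register(BitKind.CBIT, "end-to-end scenario", lambda: True)
bit.register(BitKind.IBIT, "queue depth details", lambda: True)
print(bit.run(BitKind.PBIT))  # {'config loaded': True}
```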
BIT is very understandable for embedded systems; after all, these are closed boxes with limited access to their innards and inner workings. But isn’t that also true for SOAs? After all, we are building a bunch of black boxes that interact to provide some business benefit. How can we be sure that everything is working fine, esp. when we don’t fully control some of the parts?
As mentioned above, a system, especially a distributed one, is built from relatively unreliable components. A continuous test helps us make sure things are working as expected. What we are doing is taking some of the code we wrote for integration and acceptance tests (which runs a scenario end-to-end), deploying it into the system as a service we call the “liveliness check”, and having it run periodically. Every time the liveliness check runs it sends a notification (a Twitter message) so we know the test itself works. If it fails it sends more notifications (Twitter, email, SMS etc.) to an administrator.
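The flow just described can be sketched roughly like this in Python. It is not our actual service: `run_liveliness_check`, `run_periodically` and the notifier callables are hypothetical stand-ins, and real notifiers would wrap whatever Twitter, email or SMS client you use.

```python
import time
from typing import Callable, Iterable, List

Notifier = Callable[[str], None]


def run_liveliness_check(
    scenario: Callable[[], None],
    heartbeat: Notifier,
    alerts: Iterable[Notifier],
) -> bool:
    """Run one end-to-end scenario; always send a heartbeat, alert only on failure."""
    try:
        scenario()
        ok, detail = True, "passed"
    except Exception as exc:
        ok, detail = False, f"FAILED: {exc}"

    # The heartbeat goes out on every run so we know the test itself is alive.
    heartbeat(f"liveliness check ran: {detail}")

    if not ok:
        # On failure, fan out to the extra channels (email, SMS, ...).
        for notify in alerts:
            notify(f"liveliness check {detail}")
    return ok


def run_periodically(check: Callable[[], bool], interval_seconds: float, runs: int) -> List[bool]:
    """Call check() a fixed number of times, sleeping between runs."""
    results = []
    for i in range(runs):
        results.append(check())
        if i < runs - 1:
            time.sleep(interval_seconds)
    return results
```

In a real deployment the loop would run forever under a scheduler or as its own service. The fixed `runs` count is only there to keep the sketch easy to try out.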
This liveliness check, or CBIT, serves as an early warning system. Since the end result is known in advance, we can have a pretty good idea whether something went wrong. For example, we know how long it should take to process a test ID, we know what the result for that image should be, etc. The fact that it runs even when the system is under low utilization means we can find out about problems and deal with them before they hit end users. That’s a big plus for us.
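Because both the expected result and the expected duration are known up front, the pass/fail decision can be very simple. A minimal sketch, assuming the scenario returns a comparable value. `Expectation` and `evaluate` are hypothetical names, and the `clock` parameter exists only so the timing can be faked in a test.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Expectation:
    expected_result: Any  # the result we know the test input should produce
    max_seconds: float    # how long the scenario is allowed to take


def evaluate(
    run_scenario: Callable[[], Any],
    expectation: Expectation,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run the scenario once and return a list of problems (empty means healthy)."""
    started = clock()
    actual = run_scenario()
    elapsed = clock() - started

    problems = []
    if actual != expectation.expected_result:
        problems.append(f"unexpected result: {actual!r}")
    if elapsed > expectation.max_seconds:
        problems.append(f"too slow: {elapsed:.2f}s > {expectation.max_seconds:.2f}s")
    return problems
```

Treating "too slow" as a failure in its own right is what makes this an early warning: a scenario that still returns the right answer but takes three times as long is usually worth a look before users notice.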
The advantage over regular monitoring solutions (this is not an either/or; monitoring is also needed) is that you know the specific business scenarios are working properly, which gives you higher confidence that things are OK than just knowing that a specific server or service is running.
On the flip side, the downside of adding a periodic liveliness check is the added complexity in the system. In our case, we have to add a process to clean up the traffic data added by the test messages. Also, while we try to make the system behave as usual as much as possible, certain parts of the system have to know about the test messages and handle them differently. Again, in our case the reporting has to know to disregard test messages and not count them. This is even more problematic in other types of systems. For instance, if you simulate an order, you don’t want the purchase order to actually go out to a supplier.
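One common way to handle this is to carry an explicit test marker on each message and let the few components that care check it. The sketch below assumes such a flag. `Message`, `is_test` and the helper functions are hypothetical and only illustrate the three cases mentioned above (reporting, cleanup, and side effects such as supplier orders).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Message:
    message_id: str
    payload: str
    is_test: bool = False  # set only by the liveliness check


def count_for_report(messages: Iterable[Message]) -> int:
    """Reporting disregards test messages."""
    return sum(1 for m in messages if not m.is_test)


def purge_test_traffic(store: Iterable[Message]) -> List[Message]:
    """Cleanup process: keep only real traffic."""
    return [m for m in store if not m.is_test]


def place_order(message: Message, send_to_supplier: Callable[[Message], None]) -> str:
    """Run the normal path, but stop short of the external side effect for tests."""
    if message.is_test:
        return "skipped (test)"
    send_to_supplier(message)
    return "sent"
```

The fewer places that look at the flag, the closer the test path stays to what real traffic goes through, so it is worth keeping these special cases at the edges of the system.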
To sum this up, adding a liveliness check as part of the system to create a continuous built-in test can increase your confidence that things are working as they should. It can also help you identify problems earlier. Like everything in life, it doesn’t come without tradeoffs, and you should weigh the benefits against the costs before using it in your systems.