Testing large distributed systems is hard. There is so much more you need to test beyond exercising the code with unit tests. Performance testing, load testing, and error testing must all be undertaken with realistic usage patterns and extreme loads. But exhaustive tests are often slow to run and hard to maintain, and over-fitting the tests to the code can make it difficult to refactor the code. So many problems! In this article, I discuss testing complex systems using a layered approach that is both manageable and delivers comprehensive coverage.
Big distributed systems can't be fully tested on one developer machine. There are several levels of testing stretching over a range of speeds, resources, and fidelity to a production system. For example, a typical large system might consist of thousands of servers of various kinds: front-end Web applications, REST API servers, internal services, caching systems, and various databases. Such a system might process several terabytes of data every day, with storage measured in petabytes. It is constantly hit by countless clients and users. It is kind of difficult to replicate all this on a developer's notebook.
The key to tackling testing on complex systems is layered testing in special environments. The basic layer is composed purely of unit tests using mocks. This practice is well-defined and I will not spend much time talking about it (except for the limitations and dangers of excessive mocking). Unit tests can be run by individual developers on individual machines and should run quickly. How often the tests are run is up to the developer, but at the very minimum, all the tests should be run once before check-in. It is also common to run, very frequently (every few minutes), a subset of tests that target the specific code the developer is working on.
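As a minimal sketch of what this basic layer looks like, the following test replaces a dependency with a mock from Python's standard unittest.mock module. The function get_display_name and the user_store object with its fetch method are hypothetical names invented for illustration; they do not come from any particular system.

```python
import unittest
from unittest.mock import Mock


def get_display_name(user_id, user_store):
    """Return a display name, falling back to a placeholder for unknown users.

    user_store is a hypothetical dependency; in production its fetch()
    method would make a network call to another service.
    """
    record = user_store.fetch(user_id)
    if record is None:
        return "unknown user"
    return record["name"].title()


class GetDisplayNameTest(unittest.TestCase):
    def test_known_user(self):
        store = Mock()
        store.fetch.return_value = {"name": "ada lovelace"}  # canned response
        self.assertEqual(get_display_name(42, store), "Ada Lovelace")
        store.fetch.assert_called_once_with(42)

    def test_unknown_user(self):
        store = Mock()
        store.fetch.return_value = None
        self.assertEqual(get_display_name(7, store), "unknown user")
```

Saved in a file, tests like these can be run with python -m unittest. Because no real service is involved they finish in milliseconds, which is what makes running them every few minutes practical.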
The next layer, which can still be run on a developer machine, is integration testing using virtual machines (VMs) and/or other isolation techniques, such as Linux containers, to simulate a production system or some aspects of it. For example, you can deploy a database on one VM and an API server on another, then have them talk to each other in the same way they do in production.
This layer builds confidence that the code does not incorrectly assume that various components of the system run on the same machine. It also allows full workflows to be exercised in a configured, multiple (virtual) machine setting. A developer will run these tests at least once before checking in any code, and more frequently if working on code that is not exercised fully by unit tests.
The next layer is integration and system tests running in a shared environment. This often involves the continuous integration/auto-build environment. The main purpose of this layer is to make sure that recent code checked in by a developer doesn't break anything obvious that will prevent other developers from moving forward. There may also be a separate development environment where developers can run various system-level experiments. These tests should be run automatically whenever a developer checks in any code, as there are sometimes complex issues related to source control practices, batched changes, and handling merges.
The next layer is composed of performance tests, load tests, disaster recovery tests, backup-restore tests, rolling update tests, and other system-wide tests. These tests are typically performed in a staging environment, which should be as close as possible to the production environment. The frequency of these tests is determined on a case-by-case basis. They could be run daily or weekly. At the very least, they should be run before preparing a release candidate, and the results examined carefully and compared to previous runs.
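Comparing results to previous runs is easier when the comparison itself is automated. Here is a minimal sketch, assuming each run produces a dictionary that maps a metric name to a latency in milliseconds (lower is better). The function name find_regressions, the metric names, and the 10% tolerance are all assumptions made for illustration.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return the metrics whose current value is worse than the baseline.

    A metric counts as a regression when its current value exceeds the
    baseline value by more than the given fractional tolerance.
    """
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None:
            continue  # metric was not measured in the current run
        if new > old * (1 + tolerance):
            regressions[metric] = (old, new)
    return regressions


# Hypothetical numbers from two staging runs.
previous_run = {"p50_ms": 20.0, "p99_ms": 100.0}
latest_run = {"p50_ms": 21.0, "p99_ms": 130.0}
print(find_regressions(previous_run, latest_run))
```

With these made-up numbers, only p99_ms is reported, because 21.0 is within 10% of 20.0 while 130.0 is well beyond 10% above 100.0. A check like this flags candidates for the careful examination described above; it does not replace it.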
The final layer consists of production tests. These tests are typically performed on large-scale systems that already have mechanisms for dealing with partial outages, and are believed to be robust to localized or partial failures. A stellar example of production-level tests can be found in the Netflix Simian Army. It is a collection of programs that randomly cause various failures of different magnitudes, so that it's possible to make sure the system under test can handle them. Not many companies reach this level of testing.
The purpose of an "environment" is to support isolated and repeatable deployment of a known configuration of the system. This includes a particular version of each component, third-party library, configuration file, set of users, permissions, etc. This is particularly important between staging- and production-level testing.
So, how do you set up and maintain an environment? The best practice for creating and managing environments is to use a configuration and orchestration tool. In the past, this was traditionally done with some combination of shell scripts, in-house systems, a lot of manually babysat servers, praying, and cursing. In recent years, as deployments have increased significantly in size, this approach has stopped scaling. A slew of tools like Chef, Puppet, SaltStack, and Ansible have arisen to address the needs of current software builds. These tools enable developers to more reasonably manage the entire "environment" infrastructure of machines, configurations, and code deployments.
I use Ansible, which is implemented in Python but uses a set of declarative YAML files to specify the state a particular machine should be in. This design allows repeatable provisioning and configuration of multiple machines. Specifically, Ansible uses playbooks and inventory files. The playbooks specify what needs to be done on each machine, and the inventory lists the machines and groups of machines to which the playbooks and roles are applied.
Local environments can be created and destroyed at will. Shared environments, which are used by multiple developers and often comprise multiple physical and/or virtual machines, must be managed more carefully. Tests on shared environments should either be serialized (one test runs at a time and the environment is restored to its initial state after each test), or the tests should be separated (for example, isolating all artifacts and changes from a particular test to a timestamp-based directory or namespace). The serialized approach is safer and easier to implement, but if integration and system tests take a long time, it can delay development. The separated approach is harder to implement correctly and not as safe, because side-effects from one test may affect another test. This is almost unavoidable when tests need to use a common scarce resource (like a database) that can't be replicated for each test.
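The separated approach can be sketched as a small helper that gives each test run its own timestamp-based directory. This is only an illustration under stated assumptions: the helper name isolated_namespace and the test name upload_test are hypothetical, and the random suffix is my own addition to guard against two runs starting within the same second.

```python
import shutil
import tempfile
import time
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_namespace(root, test_name):
    """Create a unique, timestamp-based directory for one test run.

    The directory is removed when the block exits, even if the test fails.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    # The random suffix keeps names unique when runs start in the same second.
    namespace = Path(root) / f"{test_name}-{stamp}-{uuid.uuid4().hex[:8]}"
    namespace.mkdir(parents=True)
    try:
        yield namespace
    finally:
        shutil.rmtree(namespace, ignore_errors=True)


# Usage: every artifact this test writes stays inside its own directory.
with isolated_namespace(tempfile.gettempdir(), "upload_test") as ns:
    (ns / "artifact.txt").write_text("written by this test only")
```

This isolates files only. It does nothing for a shared scarce resource such as a database, which is exactly the case where the separated approach becomes hard to get right.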
Note that when multiple developers use a shared environment to run tests, each developer needs to run the entire suite of tests.
There are no easy off-the-shelf answers here. You will have to tailor your process to your situation, and walk the fine line between more rigorous but slow tests and faster but less rigorous tests.
Mock objects are great for testing complex systems. They typically stand in for some dependency that your code calls. The mock object usually has some canned response that mimics the original dependency in a given situation. But when testing distributed systems, you often want to test against a certain input stream of requests, messages, events, or user actions. The concept is similar: You want to replace the real actor with something you control. In unit tests, when you test a single method, you simply provide the input arguments as part of the test. When testing a distributed system over time and multiple transactions, you need something else. In such cases, I have found it useful to write a Simulator class that generates the message and data streams that can then percolate through the system.
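To make the idea concrete, here is a minimal sketch of what such a class might look like. It is not the implementation used in any real system: the Message fields, the constructor parameters, and the choice of exponentially distributed gaps between messages are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    timestamp: float  # simulated seconds since the start of the run
    user_id: int
    action: str


class Simulator:
    """Generate a reproducible stream of user messages for a system under test."""

    def __init__(self, user_count, actions, mean_interval=1.0, seed=0):
        self.user_count = user_count
        self.actions = list(actions)
        self.mean_interval = mean_interval
        self.rng = random.Random(seed)  # a fixed seed makes runs repeatable

    def messages(self, count):
        """Yield count messages with exponentially distributed gaps between them."""
        clock = 0.0
        for _ in range(count):
            clock += self.rng.expovariate(1.0 / self.mean_interval)
            yield Message(
                timestamp=clock,
                user_id=self.rng.randrange(self.user_count),
                action=self.rng.choice(self.actions),
            )


# Usage: in a real test, each message would be sent to the system under test.
sim = Simulator(user_count=100, actions=["login", "search", "purchase"], seed=42)
for message in sim.messages(5):
    print(message)
```

Seeding the random number generator is the important design choice here. Two simulators built with the same arguments produce the same stream, so a failure seen in one run can be replayed exactly in the next.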