Because RabbitMQ was a new third-party piece of software being adopted as a critical component of our system, I wanted to test its integration thoroughly. That meant running multiple tests against a local cluster of three nodes (all on my development machine), as well as the same tests against a remote RabbitMQ cluster. The tests involved tearing down, recreating, and configuring the cluster in different ways, and then stress-testing it. Setting up and configuring a remote RabbitMQ cluster involves multiple steps, each of which normally takes less than a second; on occasion, though, a single step can take up to 30 seconds. Here is a typical list of the steps needed to configure a remote RabbitMQ cluster (a rough sketch of the corresponding commands follows the list):
- Shut down every node in the cluster
- Reset the persistent metadata of every node
- Launch every node in isolated mode
- Cluster the nodes together
- Start the application on each node
- Configure virtual hosts, exchanges, queues, and bindings
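For readers who haven't administered RabbitMQ, the sketch below shows roughly how these steps map onto rabbitmqctl invocations. Treat it strictly as an illustration: the Run() helper is hypothetical, the clustering sub-command differs between RabbitMQ versions (cluster in the 2.x line, join_cluster later), and exchanges, queues, and bindings are normally declared through an AMQP client rather than rabbitmqctl.

// Illustrative only: Run(node, command) is a hypothetical helper that executes
// a shell command on the given node; the real tooling (Elmer, described below)
// drives these commands over Fabric.
void ConfigureCluster(string[] nodes, string firstNode)
{
    // Reset every node back to a blank, standalone state.
    foreach (var node in nodes)
    {
        Run(node, "rabbitmqctl stop_app");   // stop the broker application
        Run(node, "rabbitmqctl reset");      // wipe the node's persistent metadata
    }

    // Bring the first node up on its own, then join the others to it
    // (join_cluster in RabbitMQ 3.x+, cluster in the 2.x line).
    Run(firstNode, "rabbitmqctl start_app");
    foreach (var node in nodes)
    {
        if (node == firstNode) continue;
        Run(node, "rabbitmqctl join_cluster rabbit@" + firstNode);
        Run(node, "rabbitmqctl start_app");
    }

    // Virtual hosts can be added with rabbitmqctl; exchanges, queues, and
    // bindings are typically declared through an AMQP client afterwards.
    Run(firstNode, "rabbitmqctl add_vhost test");
}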
I created a Python program called Elmer that uses Fabric to interact with the cluster remotely. Due to the way RabbitMQ manages metadata across the cluster, you have to wait for each step to complete on every node before you can execute the next step, and checking the result of each step requires parsing the console output of shell commands (yuck!). Couple that with node-specific issues and network hiccups, and you get a process whose timing varies widely. In my tests, in addition to gracefully shutting down and restarting the whole cluster, I often want to violently kill or restart a node.
From an operations point of view, this is not a problem. Launching a cluster or replacing a node is a rare event, and it's fine if it takes a few seconds. It is quite a different story for a developer who wants to run a few dozen cluster tests after each change. Another complication is that some use cases require testing unresponsive nodes, which runs into a version of the halting problem (is the node truly unresponsive, or just slow?). After suffering through multiple test runs in which each test blocked for a long time waiting for the remote cluster, I ended up with the following approach:
- Elmer (the Python/Fabric cluster remote-control program) exposes every step of the process
- A C# class called Runner can launch Python scripts and Fabric commands and capture their output (a sketch of such a class follows this list)
- A C# class called RabbitMQ uses the Runner class to control the cluster
- A C# class called Wait can dynamically wait for an arbitrary operation to complete
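Runner is not listed here, but at its core it just launches a process and captures the console output; a minimal sketch (the Run() name and signature are illustrative, the Process pattern is standard .NET) looks like this:

using System.Diagnostics;

// Minimal sketch of a Runner-style class. The real Runner's interface isn't
// shown in this article; the method and parameter names are assumptions.
public class Runner
{
    public string Run(string fileName, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,        // required for output redirection
            RedirectStandardOutput = true,  // capture whatever the command prints
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            var output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}

// Hypothetical usage: run an Elmer/Fabric command and get its output as text.
// var runner = new Runner();
// var output = runner.Run("python", "elmer.py cluster_status");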
The key was the Wait class. It has a static method called Wait.For() that lets you wait for an arbitrary operation to complete, up to a certain timeout. If the operation completes quickly, you don't pay for the full timeout: Wait.For() bails out as soon as the operation succeeds. If the operation doesn't complete in time, Wait.For() returns after the timeout expires. Wait.For() accepts a duration (either a TimeSpan or a number of milliseconds) and a function that returns bool. The class also has a static Nap field that defaults to 50 milliseconds. When you call Wait.For(), it calls your function in a loop (napping between calls) until the function returns true or the duration expires. If the function returns true, Wait.For() returns true; if the duration expires first, it returns false.
public class Wait
{
    // Time to sleep between successive calls to the supplied function.
    public static TimeSpan Nap = TimeSpan.FromMilliseconds(50);

    public static bool For(TimeSpan duration, Func<bool> func)
    {
        var end = DateTime.Now + duration;
        if (end <= DateTime.Now)
        {
            return false;
        }

        // Keep polling until the function succeeds or the deadline passes.
        while (DateTime.Now < end)
        {
            if (func.Invoke())
            {
                return true;
            }
            Thread.Sleep(Nap);
        }
        return false;
    }

    // Convenience overload that takes the duration in milliseconds.
    public static bool For(int duration, Func<bool> func)
    {
        return For(TimeSpan.FromMilliseconds(duration), func);
    }
}
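As a quick illustration (not from the real test suite; the Task here just simulates background work), the millisecond overload and the Nap knob can be used like this:

// Illustration only: wait up to two seconds for a background task to finish,
// polling every 100 milliseconds instead of the default 50.
var task = Task.Run(() => Thread.Sleep(500));
Wait.Nap = TimeSpan.FromMilliseconds(100);
bool done = Wait.For(2000, () => task.IsCompleted);
Console.WriteLine(done ? "task completed" : "timed out");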
Now, you can efficiently wait for processes that may take highly variable times to complete. Here is how I use Wait.For() to check whether a RabbitMQ node is stopped:
private bool IsRabbitStopped()
{
    // The node counts as stopped once neither mnesia nor rabbit appears
    // in the status output.
    var ok = Wait.For(TimeSpan.FromSeconds(10), () =>
    {
        var s = rmq("status", displayOutput: false);
        return !s.Contains("{mnesia,") && !s.Contains("{rabbit,");
    });
    return ok;
}
I call Wait.For() with a duration of 10 seconds, which I wouldn't want to block on every time I check whether a node is down (since that happens all the time). The anonymous function I pass in calls the rmq() method with the status command. The rmq() method runs the status command on the remote cluster and returns the command-line output as text (a simplified sketch of rmq() appears at the end of this discussion). Here is the output when the Rabbit is running:
Status of node [email protected] ...
[{pid,8420},
 {running_applications,
  [{rabbitmq_management,"RabbitMQ Management Console","2.8.2"},
   {xmerl,"XML parser","1.3"},
   {rabbitmq_management_agent,"RabbitMQ Management Agent","2.8.2"},
   {amqp_client,"RabbitMQ AMQP Client","2.8.2"},
   {rabbit,"RabbitMQ","2.8.2"},
   {os_mon,"CPO CXC 138 46","2.2.8"},
   {sasl,"SASL CXC 138 11","2.2"},
   {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
   {webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
   {mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
   {inets,"INETS CXC 138 49","5.8"},
   {mnesia,"MNESIA CXC 138 12","4.6"},
   {stdlib,"ERTS CXC 138 10","1.18"},
   {kernel,"ERTS CXC 138 10","2.15"}]},
 {os,{win32,nt}},
 {erlang_version,"Erlang R15B (erts-5.9) [smp:8:8] [async-threads:30]\n"},
 {memory,
  [{total,19703792},
   {processes,6181847},
   {processes_used,6181832},
   {system,13521945},
   {atom,495069},
   {atom_used,485064},
   {binary,81216},
   {code,9611946},
   {ets,628852}]},
 {vm_memory_high_watermark,0.10147532588839969},
 {vm_memory_limit,858993459},
 {disk_free_limit,8465047552},
 {disk_free,15061905408},
 {file_descriptors,
  [{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,2}]},
 {processes,[{limit,1048576},{used,181}]},
 {run_queue,0},
 {uptime,62072}]
...done.
The function makes sure that the mnesia and rabbit components don't show up in the output. Note that if the node is still up, the function returns false and Wait.For() keeps calling it. Wait.For() decreases the sensitivity of my tests to occasional spikes in response time (I can Wait.For() longer without slowing down the test in the common case), and it has reduced the runtime of the whole test suite from minutes to seconds.
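For completeness, rmq() is essentially a thin wrapper around the Runner class described earlier; a simplified sketch (the elmer.py command line and the runner field are assumptions for illustration) might look like this:

// Simplified sketch of rmq(): run a rabbitmqctl command against the remote
// cluster and hand back the console output as text. How the command reaches
// the remote node (here, through a hypothetical elmer.py entry point) is an
// assumption; only the signature matches the call in IsRabbitStopped().
private string rmq(string command, bool displayOutput = true)
{
    var output = runner.Run("python", "elmer.py rabbitmqctl " + command);
    if (displayOutput)
    {
        Console.WriteLine(output);
    }
    return output;
}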
Conclusion
Over the course of this series of articles, I have presented a variety of design principles and testing techniques for dealing with hard-to-test systems. Nontrivial code will always contain bugs, but deep testing is guaranteed to reduce the number of undiscovered issues.
Gigi Sayfan specializes in cross-platform object-oriented programming in C/C++/C#/Python/Java with emphasis on large-scale distributed systems, and is a long-time contributor to Dr. Dobb's.