Instant Replay Debugging and Crash Diagnostics

March 23, 2008

An article by Michael Fitzgerald in the 3/23/08 Sunday New York Times (Business Section, page 5) told the fascinating story of a company called Replay Solutions that has developed a tool for diagnosing system crashes. The gist I got was that the original tool was developed for Xbox and game development, but that they have a new version out for Windows.

I Googled Replay Solutions, which led to their home page and specification sheets. They indeed have a brand new version of their tool for J2EE app servers. According to the online spec sheet, the solution works by placing an agent in the J2EE app server that monitors all code actions and responses like mouse movement, keystrokes, network traffic and the like. On replay, they claim to execute the actual captured code up to the point of the crash.
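To make the record-and-replay idea concrete, here is a minimal sketch in Python. It is only my illustration of the general technique, not Replay Solutions' implementation: the names Recorder, Replayer and process are made up, and the "crash" is a contrived exception. The point is that if every nondeterministic input is logged as it arrives, feeding the log back in the same order should drive the code down the same path to the same failure.

```python
import json
import random


class Recorder:
    """Wraps a source of nondeterministic input and logs every value it returns."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def next_input(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds previously logged inputs back in their original order."""

    def __init__(self, log):
        self.pending = list(log)

    def next_input(self):
        return self.pending.pop(0)


def process(inputs, steps):
    """Hypothetical application logic: fails whenever it reads the value 7."""
    total = 0
    for _ in range(steps):
        value = inputs.next_input()
        if value == 7:
            raise RuntimeError(f"crash after total={total}")
        total += value
    return total


def run(inputs, steps):
    """Runs the logic and reports either the result or the crash message."""
    try:
        return ("ok", process(inputs, steps))
    except RuntimeError as exc:
        return ("crash", str(exc))


# Live run: inputs are random, so the failure point differs from run to run.
rng = random.Random()
recorder = Recorder(lambda: rng.randint(0, 9))
live_result = run(recorder, 50)

# Persist the log, then replay it: the same inputs produce the same outcome.
saved = json.dumps(recorder.log)
replay_result = run(Replayer(json.loads(saved)), 50)
print(live_result, replay_result)
```

A real tool has to capture far more than one input stream (thread scheduling, clocks, network traffic), which is presumably where the hard engineering lies.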

Naturally, the approach has value for debugging and testing in general.

Personally, I think the approach is fascinating and worth keeping an eye on. Mike Fitzgerald's article plus the information from the Replay Solutions site provide enough detail to understand the approach. It will be interesting to see how effective the approach is for J2EE apps, considering the J2EE support is brand new.

The one thing that was eye-popping was the price and possible licensing approach: $50K a project, which would immediately suggest rolling one's own. I wouldn't, since they are also seeking a patent. For large applications, the cost may be worth it.

I can't help wondering how effective it will be. First off, it is limited, being confined to J2EE for now. I would certainly love a version for IE and Firefox to diagnose those frozen pages. Second, I am not sure how effective it will be for a system crash when it runs in the same environment as the system that is crashing. Ideally, I want to see all activity up to the point of the crash. But when the system is "heading for the white light," we all know that system behavior becomes unreliable, so unless they have some magic, I am not sure we can see everything up to the point of the crash.

Now, having said that, there may be a way to get the data up to the point of the crash - by running the application in a VMware environment and monitoring from the physical environment.
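The same principle can be shown on a much smaller scale than a virtual machine. In the rough Python sketch below, a monitor in one process watches a child process that dies abruptly; because the monitor lives outside the failing process, it still holds everything the child reported before it went down. The child script and the monitor function are hypothetical stand-ins of my own, and watching a process is of course a far cry from watching a whole VM from the host.

```python
import subprocess
import sys

# Hypothetical application: reports three steps, then dies without any cleanup.
CHILD = r'''
import os
for step in range(3):
    print(f"step {step}", flush=True)
os._exit(3)
'''


def monitor(code):
    """Runs the code in a separate process and collects what it reports."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        text=True,
    )
    seen = []
    for line in proc.stdout:  # ends when the child exits and the pipe closes
        seen.append(line.strip())
    return seen, proc.wait()


events, exit_code = monitor(CHILD)
print(events, exit_code)
```

The monitor's own state is untouched by the child's failure, which is the property I would want from monitoring at the physical level.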

I also wonder if their approach would work for diagnosing security penetrations. Seems like it would.

In any case, interesting stuff with potential worth keeping an eye on.


