It occurs to me that as web applications have taken over the world, we still have a heck of a lot to learn about managing them. Complexity in the software industry has skyrocketed, and applications that once were versioned and released every six months are now being upgraded with new production code every couple of weeks. eBay is a prime example of this model. It is widely known that online auctioneer eBay has become so adept at addressing issues and deploying changes to its servers that every two weeks, a whole new version of eBay is up and running.
Now, eBay is not a simple application, and downtime at eBay can cost millions of dollars, not to mention generate an angry mob of users who are not only desperately trying to buy the latest iPhone 3G, but some of whom are also trying to make a living. This is serious stuff. Downtime at eBay is front-page news.
It's almost redundant to point out the dangers of our new Software-as-a-Service (SaaS) paradigm. However, the one-to-many relationship between application server and end users does present pitfalls that did not exist in Bill Gates's world. When Microsoft Outlook crashes, you may be upset, say something not very nice, then restart it. When your application servers crash, several thousand of your users may all collectively say something not very nice about you.
So why aren't more companies able to follow the eBay model? What does eBay know that they don't?
I've spent time thinking about this, and the answer may lie not just in how they deploy new changes, but how they resolve issues when they occur. Is this truly one of the great remaining challenges in the realm of software? Some would say so, and they have good reasons to back that up.
Multitier applications represent some of the greatest levels of complexity ever seen in the software industry. With pieces of your application running on many heterogeneous, physically dispersed servers and environments, understanding what went wrong in these environments can be next to impossible. When issues occur, most often the only hope a team has is to attempt to reproduce the same conditions that caused the error, and hope it happens again. This means that to understand the root cause of issues, recreating the environment, repopulating the database, and generating the required load on the servers is the only solution. Frequently, the pain of going through this effort is too great, and the issues lie dormant...until the next time something bad happens!
What the software industry has been screaming out for is the ability to quickly capture, reproduce, and isolate issues as they occur. What we need is something like "TiVo for Software."
When I thought about starting a company around the concept of recording and replaying software execution, I did not initially think about all the mechanisms replay technology could eventually replace. In 2004, we started Replay Solutions with a technology to record not only an application's execution but, just as importantly, the complex environment in which the application ran. With this ability, teams can dispense with a mass of inefficient workflows that have traditionally been manual, iterative, and error-prone.
Imagine this scenario: Your newly outsourced team in India is handling QA for your complex, multitier application. They're doing a great job and have found over 100 issues with your application. You've got the problem reports, log files, and the very large database datasets that your application was using when the bad things happened. Next comes the fun part. Now it's your turn to bring up the same environment that your outsourced team was running. I hope you're using virtual servers! Finally, let's take a shot at generating the same load on the application that existed when the problem occurred. Hopefully, the moons have aligned, and your fingers are crossed...
Now let's fast-forward. Your outsourced team in India is using your recording system. You arrive in the morning, log on to your defect tracking system, load the recording of an issue they found, and press 'play.' This time, every event that affected your application in that complex environment, including output from your authentication, LDAP, caching, and e-commerce servers, has all been recorded and stored. Even the database and its dataset are no longer required. Most importantly, the end-user traffic that ultimately triggered the problem has been recorded as well. All of these elements are perfectly reproduced, allowing you to focus on the most important thing: What went wrong.
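The core idea behind that 'play' button can be sketched in a few lines. The following is a minimal, hypothetical illustration of record/replay, not Replay Solutions' actual implementation: in record mode, every call to an external service (auth, LDAP, database) is passed through and logged to a "tape"; in replay mode, the same calls are answered from the tape, so the application sees an identical environment with no servers running at all. All names here (`Recorder`, `Replayer`, the stand-in auth server) are illustrative assumptions.

```python
import json


class Recorder:
    """Record mode: pass each call through to the live service and
    log the (request, response) pair so the run can be replayed."""

    def __init__(self, live_service):
        self.live = live_service
        self.tape = []  # ordered log of external interactions

    def call(self, request):
        response = self.live(request)
        self.tape.append({"request": request, "response": response})
        return response

    def save(self):
        return json.dumps(self.tape)


class Replayer:
    """Replay mode: answer each call from the tape instead of the
    network, so the bug can be reproduced on any machine."""

    def __init__(self, tape_json):
        self.tape = json.loads(tape_json)
        self.pos = 0

    def call(self, request):
        entry = self.tape[self.pos]
        self.pos += 1
        if entry["request"] != request:
            # The app issued a different call than it did when recorded.
            raise RuntimeError("replay divergence")
        return entry["response"]


# --- usage ---
def auth_server(req):  # stand-in for a real authentication/LDAP server
    return {"user": req["user"], "token": "abc123"}


rec = Recorder(auth_server)
rec.call({"user": "alice"})  # live call, logged to the tape
tape = rec.save()            # ship this alongside the bug report

rep = Replayer(tape)
replayed = rep.call({"user": "alice"})  # served from the tape, no server needed
```

Real record/replay systems work at a much lower level (network, file, and timing nondeterminism rather than a single service wrapper), but the contract is the same: capture every external input once, then feed it back verbatim on demand.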
Anyone who has been involved in software development can relate to the age-old conundrum of trying to reproduce an issue that simply doesn't appear to exist, at least not on your machine. Too many sleepless nights have been wasted chasing down phantom bugs. It's time for the madness to stop. The problems we're facing are only getting more complex as new technologies are brought to market. This new software paradigm is here to stay. Luckily, I believe new technologies such as record and replay will help control the chaos.