
Robert Dewar

Dr. Dobb's Bloggers

Wall Street and the Mismanagement of Software

August 08, 2012

Last week, an error in some automated high-frequency trading software from Knight Capital Group caused the program to go seriously amok, and when the cyberdust cleared, the company was left barely alive, holding the bill for almost a half-billion dollars to cover the erroneous trades. Much of the ensuing uproar has cited the incident as rationale for additional regulation and/or putting humans more directly in the decision loop. However, that argument is implicitly based on the assumption that software, or at least automated trading software, is intrinsically unreliable and cannot be trusted. Such an assumption is faulty. Reliable software is indeed possible, and people's lives and well-being depend on it every day. But it requires an appropriate combination of technology, process, and culture.


In this specific case, the Knight software was an update intended to accommodate a new NYSE system, the Retail Liquidity Program, which went live on August 1. Other trading companies' systems were able to cope with the new NYSE program; Knight was not so fortunate, and, in what was perhaps the most astounding part of the whole episode, it took the company 30 minutes to shut down the errant program. By then, the expensive damage had been done.

It's clear that Knight's software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor. They did not choose wisely.

With a disaster of this magnitude—Knight's stock has nosedived since the incident—there is, of course, a lot of post-mortem analysis: what went wrong, and how can it be prevented in the future?

The first question can only be answered in detail by the Knight software developers themselves, but several general observations may be made. First, the company's verification processes were clearly insufficient. This is sometimes phrased as "not enough testing," but there is more to verification than testing; for example, source code analysis by humans or by automated tools can detect potential errors and vulnerabilities. Second, the process known in other domains as hazard analysis or safety analysis was not followed. Such an analysis involves planning for "what if..." scenarios: if the software fails—whether from bad code or bad data—what is the worst that can happen? Answering such questions could have resulted in code to perform limit checks or carry out "fail soft" procedures. That would at least have shut down the program with minimal damage, rather than letting it rumble on like a software version of the sorcerer's apprentice.
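To make the idea of limit checks and "fail soft" behavior concrete, here is a minimal sketch in Python. It is purely illustrative: the class names, thresholds, and policy are assumptions made for the example, not Knight's actual design. Every order passes through a guard that tracks exposure and order rate, and any breach trips a kill switch instead of letting the engine keep trading:

# Hypothetical sketch of a pre-trade "fail soft" guard: limit checks plus a
# kill switch. Thresholds and class names are illustrative, not Knight's design.
import time

class KillSwitchTripped(Exception):
    """Raised when the guard decides trading must stop."""

class PreTradeGuard:
    def __init__(self, max_gross_exposure, max_orders_per_sec):
        self.max_gross_exposure = max_gross_exposure
        self.max_orders_per_sec = max_orders_per_sec
        self.gross_exposure = 0.0
        self.order_times = []        # timestamps of recently submitted orders
        self.halted = False

    def check(self, symbol, quantity, price):
        """Reject the order, or halt all trading, if limits are breached."""
        if self.halted:
            raise KillSwitchTripped("trading already halted")

        now = time.monotonic()
        # Keep only the last second of order timestamps for the rate check.
        self.order_times = [t for t in self.order_times if now - t < 1.0]
        if len(self.order_times) >= self.max_orders_per_sec:
            self.halt("order rate exceeded on " + symbol)

        notional = abs(quantity) * price
        if self.gross_exposure + notional > self.max_gross_exposure:
            self.halt("gross exposure limit exceeded on " + symbol)

        # Order passes: record it.
        self.order_times.append(now)
        self.gross_exposure += notional

    def halt(self, reason):
        """Fail soft: stop submitting orders instead of letting errors compound."""
        self.halted = True
        # In a real system this would also cancel working orders and page a human.
        raise KillSwitchTripped(reason)

The specific limits matter less than the structure: with a guard of this kind, a runaway algorithm halts itself within seconds and does minimal damage, rather than rumbling on for half an hour.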

The question of how to prevent such incidents in the future is more interesting. Some commentators have claimed that the underlying application in high-frequency trading (calculating trades within microseconds to take advantage of fraction-of-a-cent price differentials) is simply a bad idea that frightens investors and should be banned or heavily regulated. There are arguments on both sides of that issue, and we will leave that discussion to others. However, if such trading is permitted, then how are its risks to be mitigated?

To put things in perspective, in spite of the attention the incident has attracted, the overall system—the trading infrastructure—basically worked. Certainly Knight itself was affected, but the problem was localized: we didn't have another "flash crash." We don't know yet whether this is because we got lucky or because the "circuit breakers" in the NYSE system were sufficient, but it's clear that such an error has the potential to cause much larger problems.

What is needed is a change in the way that such critical software is developed and deployed. Safety-critical domains such as commercial avionics, where software failure could directly cause or contribute to the loss of human life, have known about this for decades. These industries have produced standards for software certification that heavily emphasize appropriate "life cycle" processes for software development, verification, and quality assurance. A "safety culture" has infused the entire industry, with hazard/safety analysis a key part of the overall process. Until the software has been certified as compliant with the standard, the plane does not fly. The result is an impressive record in practice: no human fatality on a commercial aircraft has been attributed to a software error.

High-frequency automated trading is not avionics flight control, but the aviation industry has demonstrated that safe, reliable real-time software is possible, practical, and necessary. It requires appropriate development technology and processes as well as a culture that thinks in terms of safety (or reliability) first. That is the real lesson to be learned from last week's incident. It doesn't come for free, but it certainly costs less than $440M.

Robert Dewar is the president and CEO of AdaCore. He is the principal author of GNAT, the free software Ada compiler, and earlier of the Realia COBOL compiler.

Comments:

Comment posted 2012-08-15 17:39:09

Given the difficult decisions and incomplete information, success often depends on luck. This makes it difficult to understand why so little thought is given by management as to ways a company can improve its luck -- despite the availability of a plethora of centuries-old luck-management strategies all too often derided as "mere superstitions."


Comment posted 2012-08-15 14:25:03

Great post, Robert, but I just don't agree with the comparison. I think you compared apples with bananas. Everything in aviation is very specific to avionics. You don't see Boeings running Windows; that's why they don't fail! (Sorry, I couldn't resist.) But seriously, avionics systems are not designed merely to fly; by their nature they are designed with failure in mind. Yet even with failure in mind, designers cannot predict every single situation that could bring a plane down. The best test is the real environment. You won't see any manufacturer admitting its own failure, because it would go under afterward; imagine the consequences of a Boeing (or Airbus) bankruptcy. So it's easier to claim the pilots are guilty (human failure). I'll mention two accidents here in Brazil with many questionable pre-assumptions: TAM 402 and Air France, both involving Airbus, what a coincidence! In both cases the pilots were blamed for not following procedures that went against any logical way of thinking. In the TAM accident, the pilots did not follow one rather absurd procedure, which ended up letting the aircraft computer think that one side of the plane was taking off while the other was landing. In the Air France case, the sensors froze and the computer could not tell the crew that the plane was descending or where it was heading. You can find other examples around the world. Our IT environment is not designed with failure in mind, and even if it were, there is no test that could certify any software as 100% fail-proof. I'm not saying Knight is not responsible, but I am saying that we need more people and more work; computers won't replace humans, now or ever. What we have to realize is that people cannot be replaced as easily as managers think, and that is their main failure. It's not a panic situation, but it can happen again, not the same way obviously, but in the next few minutes another failure can occur or another plane can crash, and the question we need to ask is: who is really responsible for this?


Comment posted 2012-08-15 14:00:43

^^^
This is probably one of the most well-articulated commentaries I've ever read concerning the need for restraint when it comes to pushing out software into production.

I've never understood this willingness to trade quality for speed to market. There is absolutely a balance that needs to be struck when it comes to setting deadlines, and while it's true that "senior management" is often best positioned to understand the nuances that determine when product changes should be deployed, it's also true that the best managers know when and when not to listen to the "boots on the ground". Managing software projects is really an art: it requires finely tuned intuition to know when to override the inevitable fine-tuning and quest for perfection that can creep into the engineering process, and when to heed warnings from the engineers that it's simply too early to hit go and more time needs to be invested in R&D, testing, etc. for a given system. When it comes down to it, everything in this industry revolves around trade-offs, e.g. space versus time. Trading gradual, deliberate progress on a project for agility in the marketplace is something that should never be done lightly or without a healthy review of the risk factors.

To make an analogy with a different industry, think about what happens when you bring your vehicle in for service. You tell the mechanic there is a shuddering sensation coming from the front of the car. Try telling the mechanic that, oh, by the way, I need you to diagnose and fix the problem in the next hour and a half, and see what they say. They will either politely explain why that requirement isn't reasonable, or give you back your keys and wish you good luck driving with the shuddering.


Comment posted 2012-08-12 06:37:24

I understand, and most of the time even accept, the need to ensure that the schedule concerns of management do not override the risk concerns of technical management when it comes to accelerating software development schedules.

But as you well know, it can never be a hard-and-fast rule that senior management always defer to technical management when it comes to these things.

(As a practical matter, you and I both know that the type of change you're proposing will never happen, so we are dealing here in theoreticals only.)

First, there are often time-to-market concerns that are the focus of senior management, which are simply invisible to technical management, and which drive the acceleration of product schedules.

Second, there are many instances when technical managers expend time and effort in over-engineering a solution to a problem, leading to slipped schedules.

Third, it's hard to reliably assess risk.

To my mind, this third point is the most significant.

Had the risk management team (which ideally would be composed of a wide spectrum of individuals from all functional departments) at Knight really gamed out the scenarios under which their product releases could expose the company? Is there even a risk management function at Knight? If so, does it consider software risk to be a legitimate area for its focus?


Comment posted 2012-08-11 03:25:22

It is very easy to point out that there are ways to avoid these sorts of things through proper process. And you are correct, there are good processes that drastically reduce the likelihood of serious failures. However, there is one overarching factor that absolutely has to be dealt with before such processes can even be implemented:

Senior management cannot have the capability to accelerate software schedules. It's that simple. It doesn't matter if their bonus is riding on it, or the entire company's survival depends on it; they absolutely must not be able, through any means, to alter development schedules either directly or indirectly. If they do have the capability to influence such schedules, no matter how indirect it may be, it is only a matter of 'when', not 'if', they are the sole cause of a catastrophic system failure.

Human beings cannot be driven, through any means, to accelerate their intellectual capabilities. They do not improve when given shorter time periods in which to accomplish even simple intellectual tasks, much less complex ones. And everyone would do well to remember that software is the most complex invention mankind has ever produced. It operates at a level of abstraction unparalleled in any other field except those that include software as a component. It orchestrates the behavior of billions of components operating at nanosecond speeds, and often a single mistake in the flipping of one logic gate is enough to cause complete failure. It's brain surgery with chopsticks, where your patient is a different species and you are blind, guided by a foreigner whose instructions are being fed through automatic translation software.

All the things that need to be done, like investing in a testing environment that duplicates the live environment as faithfully as is scientifically (not economically) feasible, allocating adequate time for testing, and setting schedules that ensure software deploys when it is ready, not when it is needed, can only pay off if management cannot demand faster turnaround, plain and simple. Since that is not possible, we will be left with catastrophes.


Comment posted 2012-08-09 21:20:13

With all due respect, you don't seem to understand what happened. They lost $440 million. They ate their losses, and the firm almost went under. My understanding is this:

The error is that they applied the change to ~50 stocks, not just the 5 that were being piloted in the Retail Liquidity Program. The engine thought it was making a shiat ton of money. That said, why the position management software didn't kick in and reduce their risk by trimming the amount of stock they held is strange. Also, why they didn't detect the problem and just kill -9 the process immediately is beyond me as well.


Comment posted 2012-08-09 21:11:56

In systems like Knight's, not running the system against a simulated market data stream is not just incompetent, it's plain stupid. IMO, Knight should be responsible for paying back the losses of EVERYONE who suffered due to their incompetence. I used to write risk analysis software for the options traders and market makers at the CBOE, which would allow their systems to automatically hedge their positions in order to minimize their risk (sub-60ms reaction times). To test this, I would run the software against the market data flows vs. their current positions, and extrapolate the results to determine whether it was working as designed/desired. This testing was extensive and exhaustive before we would deploy it to the field. Liability was the least of our worries (though not an insignificant one).

So, if these companies were made liable for the losses of others due to their software's manipulation of the markets, I would expect we'd see a lot less of this sort of "flash crash" event happening.
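As a side note, the replay-style testing described in the comment above can be sketched very compactly. The fragment below is a hypothetical illustration in Python (the engine, file format, and limits are invented for the example, not the commenter's CBOE system): it drives the same decision logic that would run in production with a recorded market data stream and reports any position-limit violations before the code ever reaches the market:

# Hypothetical sketch of replay testing: drive the production order logic with
# recorded market data and verify the resulting positions against limits.
# All names (TradingEngine, ticks.csv, the limits) are illustrative.
import csv
from collections import defaultdict

class TradingEngine:
    """Stand-in for the production decision logic under test."""
    def __init__(self):
        self.positions = defaultdict(int)   # symbol -> signed share count

    def on_tick(self, symbol, bid, ask):
        # Toy strategy: buy one lot when the spread is unusually wide.
        if ask - bid > 0.05:
            self.positions[symbol] += 100

def replay(engine, tick_file, max_position=10_000):
    """Replay recorded ticks through the engine; record any limit breaches."""
    violations = []
    with open(tick_file, newline="") as f:
        for row in csv.DictReader(f):       # expected columns: symbol,bid,ask
            engine.on_tick(row["symbol"], float(row["bid"]), float(row["ask"]))
            for symbol, pos in engine.positions.items():
                if abs(pos) > max_position:
                    violations.append((symbol, pos))
    return violations

if __name__ == "__main__":
    engine = TradingEngine()
    # "ticks.csv" is a placeholder for a captured market data session.
    problems = replay(engine, "ticks.csv")
    if problems:
        print("limit violations found during replay:", problems[:5])
    else:
        print("replay completed with positions inside limits")

Running a harness like this against captured sessions, including deliberately pathological ones, is the sort of pre-deployment verification the article argues was missing.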

