The Dismal Science of Code Metrics

Over the last decade, as systems have become powerful enough to perform deep analysis of source code as data, there has been a surge of interest in code metrics and other forms of quantitative analysis. Software engineering was once limited to a few crude metrics: cyclomatic complexity (CCM), lines of code (LOC), and a few forms of defect counts. The Halstead metrics, devised in 1977 (an era when programs frequently consisted of a single file), were also used, although primarily by researchers, as most IT shops had little interest in the metrics and no interest in burning compute cycles to obtain them.

Plentiful, inexpensive CPU power has rekindled interest in quantitative measures, and some organizations now run metrics as a standard part of their build cycle. (The well-respected Sonar open-source project provides a metrics dashboard for code bases.) The metrics are generally combed through carefully at first, but as time passes, they are ignored unless the code suddenly violates a preset threshold. Even then, it's not clear that much happens at many organizations. In sum, metrics frequently do not occupy an important place in the management of the software process.

To my eye, this is because of general ignorance about how to use metrics and an innate distrust of the numbers themselves. I believe this distrust is warranted.

The central problem in using metrics is that they're only useful if you know a priori what a desirable range of values is. As I discussed last week, even in matters as straightforward as test coverage, the numbers to aim for depend significantly on the nature of the codebase. And so, they will vary tremendously from one project to the next.

The problem of knowing what numbers to aim for is made worse by the fact that many of today's recommendations are based on analysis of open-source projects or government projects. Researchers invariably need large code bases and access to SCM systems as well as defect trackers to obtain the data on the three elements they focus on: code, activity, defects. In many (not to say most) cases, they use established, high-profile OSS projects or government-sponsored projects. In my experience, these codebases are much cleaner than in-house code. The act of making code public tends to have a chilling effect on taking nasty shortcuts, not documenting work, or engaging in other unrecommended practices. Consequently, using these code bases to establish desirable ranges skews the targets.

A second problem is that different tools measure metrics differently, as several studies have revealed. Only on the crudest of measures do the numbers come up roughly the same. The more complex the metrics, the wider the gaps. Consequently, comparisons with metrics derived by third parties from external codebases are an iffy proposition.

Moreover, some stats, even something as simple as LOC, have different meanings to different users. When someone quotes LOCs, it's necessary to ascertain what kind of LOCs. They could be source LOCs (SLOCs) or logical LOCs (LLOCs). The latter is the number of executable statements, rather than actual lines. So a single line consisting of x++; y++; would be 1 SLOC, but 2 LLOCs. Note that SLOC, despite its name, includes comments and blank lines. Actually, SLOC includes blank lines only if they represent less than 25% of the code file. Should I keep going?
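To make the gap between these counts concrete, here is a minimal Python sketch that tallies one snippet three ways. The function name count_lines and the counting rules are assumptions made for illustration: statements are approximated by counting semicolons, only // comments are recognized, and for-loop headers and string literals containing semicolons are not handled. Real tools apply their own, more careful rules, which is precisely why their numbers disagree.

```python
def count_lines(source: str) -> dict:
    """Count one C-like snippet three different ways (illustrative rules only)."""
    lines = source.splitlines()
    blank = sum(1 for line in lines if not line.strip())
    comment_only = sum(1 for line in lines if line.strip().startswith("//"))
    # Crude logical count: one statement per semicolon outside a // comment.
    logical = sum(line.split("//")[0].count(";") for line in lines)
    return {
        "physical_lines": len(lines),                        # every line in the file
        "code_lines": len(lines) - blank - comment_only,     # lines carrying code
        "logical_statements": logical,                       # executable statements
    }


SAMPLE = "// bump both counters\nx++; y++;\n\nreturn x;\n"

if __name__ == "__main__":
    print(count_lines("x++; y++;"))
    print(count_lines(SAMPLE))
```

The same four-line snippet yields three different "LOC" figures depending on the rule applied, so a quoted count means little until you know which one was used.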

The other fundamental measure, CCM, has limitations as well. CCM, for all intents, measures the number of different paths that exist through the code in a single method or function. (It is occasionally used to measure the complexity of an entire class or a source file. This is invariably an error and produces an essentially meaningless result.) A function consisting of several executable commands will have a CCM of 1. If there is an if-statement in the function, the CCM rises to 2. And so on. This is an acceptable coarse measure, but it is coarse. CCM, for example, counts every branch of a switch-statement as a separate path. So a switch-statement with 80 cases will have a CCM of 80, as will a hierarchy of nested if/else pairs 80 levels deep. Surely, the complexity of the two items is not close to the same.
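The counting described above can be sketched in a few lines of Python using the standard ast module. This is a simplified rule (one plus the number of decision points), not the algorithm of any particular tool; the function name cyclomatic_complexity and the choice of which node types count as decisions are assumptions, and real analyzers differ on exactly these choices. The two sample functions stand in for the switch-statement and the nested if/else hierarchy on a small scale.

```python
import ast

# Node types treated as decision points in this sketch (an assumption).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points found in the source."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds one short-circuit branch.
            decisions += len(node.values) - 1
    return decisions + 1


STRAIGHT = "def f(a):\n    b = a + 1\n    return b\n"
ONE_IF = "def f(a):\n    if a > 0:\n        a -= 1\n    return a\n"

FLAT = """
def flat(x):
    if x == 1:
        return "one"
    elif x == 2:
        return "two"
    elif x == 3:
        return "three"
    return "other"
"""

NESTED = """
def nested(a, b, c):
    if a:
        if b:
            if c:
                return "deep"
    return "shallow"
"""

if __name__ == "__main__":
    for name, src in [("straight", STRAIGHT), ("one_if", ONE_IF),
                      ("flat", FLAT), ("nested", NESTED)]:
        print(name, cyclomatic_complexity(src))
```

Under this rule the flat chain and the nested version receive the same score, even though most readers would find the nested one harder to follow.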

Moreover, it's not at all clear that CCM's choice to give all paths an equal value is correct. Certainly, new paths in deeply nested code represent a greater complexity than a single if-statement in a three-line function.
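One way to picture a measure that does not weigh all paths equally is to charge each branch more the deeper it sits. The Python sketch below does this with a made-up rule: every if, for, or while costs 1 plus its nesting depth. The function names and the weighting are hypothetical, chosen only to illustrate the point; this is not a standard metric, and the sketch ignores with and try blocks and cannot tell an elif from an else that contains a lone if.

```python
import ast


def _score_block(stmts, depth):
    return sum(_score_stmt(stmt, depth) for stmt in stmts)


def _score_stmt(stmt, depth):
    if isinstance(stmt, ast.If):
        total = 1 + depth + _score_block(stmt.body, depth + 1)
        # An elif is parsed as a lone If inside orelse: keep it at the same depth.
        if len(stmt.orelse) == 1 and isinstance(stmt.orelse[0], ast.If):
            return total + _score_stmt(stmt.orelse[0], depth)
        return total + _score_block(stmt.orelse, depth + 1)
    if isinstance(stmt, (ast.For, ast.While)):
        return (1 + depth
                + _score_block(stmt.body, depth + 1)
                + _score_block(stmt.orelse, depth + 1))
    if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
        return _score_block(stmt.body, depth)
    return 0  # other statements (including with/try) are ignored in this sketch


def nesting_weighted_score(source: str) -> int:
    """Hypothetical score: each branch costs 1 plus its nesting depth."""
    return _score_block(ast.parse(source).body, 0)


FLAT = """
def flat(x):
    if x == 1:
        return "one"
    elif x == 2:
        return "two"
    elif x == 3:
        return "three"
    return "other"
"""

NESTED = """
def nested(a, b, c):
    if a:
        if b:
            if c:
                return "deep"
    return "shallow"
"""

if __name__ == "__main__":
    print("flat", nesting_weighted_score(FLAT))
    print("nested", nesting_weighted_score(NESTED))
```

With this weighting, three conditions in a flat chain score lower than three conditions nested inside one another, which is closer to the intuition that deep nesting is the greater burden. Whether these particular weights are the right ones is, of course, an open question.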

The final consideration is that there is an emerging perception that most defect-rate predictions, based on conventional metrics, correlate too closely with file size. These projected defect rates — regardless of computation — rise in a predictable way as file size increases. On the surface, this would seem to indicate that my contention in "In Praise of Small Classes" was correct: Small classes deliver substantial quality benefits. But another way to view it is that the defect-rate metrics are simply proxies for class size. They just measure its effects through different prisms. In other words, they provide no particular insight.

I don't mean to denigrate code metrics. To manage a project, you do need quantitative information about the code. But rather than invest in complex models, you should choose metrics that are straightforward, easy to understand, and easy to track. Through regular use, you will be able to determine the thresholds that work for your organization. Only then are you in a position to use metrics beneficially. By this I mean that predictive metrics, metrics based on measures you do not fundamentally understand, and thresholds derived from OSS or government contracts should all be dispensed with. Measure what you know and use your own numbers as your guide.

— Andrew Binstock,
Editor in Chief
Twitter: platypusguy
