# How to Reveal Implicit Structure in Spreadsheets

For a conference,
I once drew a cartoon about spreadsheets. It showed a City gent staring despairingly
at an Excel formula while behind him, Excel developers in Canary
Wharf fretted. Like this:

In the cartoon,
I'd assumed it was his own spreadsheet, but it could have been
someone else's "legacy"
spreadsheet that he was trying to decipher — and that's
what I'm going to write about. My mind has been running on
the topic since a colleague asked me to explain my work
on what I call "structure discovery":
making explicit the things an author knows about
his spreadsheet but didn't write down.

To demonstrate, I've written a tiny cash-flow
spreadsheet, pictured below. Its formulae are realistic, but I've
reduced the number of rows and simplified the currency and date formats
so that it doesn't take up too much space on the screen:

The first thing one needs is a listing of the spreadsheet's formulae. Oddly, Excel doesn't provide this; but below, I do so by reading the spreadsheet into a reverse-engineering tool I've developed. This parses the Excel file, extracts the formulae, associates them with their cells, and lists them. That's something that will be useful again later on, when I go on to transform the formulae to make them easier to read. But to start with, here are the formulae without any processing:

Cells Sheet1!A1:F12 where: Sheet1!B1 = "Simple cash-flow spreadsheet" Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software" Sheet1!F4 = "Opening cash" Sheet1!F5 = 1000 Sheet1!B7 = "Month" Sheet1!C7 = "Income" Sheet1!D7 = "Expenses" Sheet1!E7 = "Net profit" Sheet1!F7 = "Cash at start of period" Sheet1!B8 = 1 Sheet1!C8 = 120 Sheet1!D8 = 27 Sheet1!E8 = Sheet1!C8-Sheet1!D8 Sheet1!F8 = Sheet1!F5 Sheet1!B9 = 2 Sheet1!C9 = 120 Sheet1!D9 = 41 Sheet1!E9 = Sheet1!C9-Sheet1!D9 Sheet1!F9 = Sheet1!F8+Sheet1!E8 Sheet1!B10 = 3 Sheet1!C10 = 150 Sheet1!D10 = 64 Sheet1!E10 = Sheet1!C10-Sheet1!D10 Sheet1!F10 = Sheet1!F9+Sheet1!E9 Sheet1!A12 = "Totals" Sheet1!C12 = SUM(Sheet1!C8:Sheet1!C10) Sheet1!D12 = SUM(Sheet1!D8:Sheet1!D10) Sheet1!E12 = SUM(Sheet1!E8:Sheet1!E10)

Having the formulae in a single listing I can print is nice, but it would be even better if the cell addresses were replaced by meaningful names. So suppose again that the cash-flow spreadsheet is someone else's that I'm trying to decipher, and suppose also that from its layout, cell colouring, and labels, I've guessed which cells belong together. For example, suppose I've guessed that cells B8:B10 are successive months, and that C8:C10 are successive incomes. I could summarise my guesses like this:

'Sheet1'!F5 is opening_cash. 'Sheet1'!B8:B10 is months. 'Sheet1'!C8:C10 is income. 'Sheet1'!D8:D10 is expenses. 'Sheet1'!E8:E10 is net_profit. 'Sheet1'!F8:F10 is cash_at_start_of_period. 'Sheet1'!D12 is total_expenses. 'Sheet1'!C12:C12 is total_income. 'Sheet1'!E12:E12 iso total_profit.

Equipped with such a mapping from addresses to names, one
could then regard each group of cells as an array. For
example, B8:B10 would consist of the elements
`months[1]`

, `months[2]`

and `months[3]`

. Similarly, C8:C10 would consist of
`income[1]`

, `income[2]`

, and
`income[3]`

. It's then fairly simple to
rewrite each formula, replacing cell addresses in it
by named array elements. I've implemented such
transformations in my reverse-engineering software, and
below, apply them to the cash-flow spreadsheet. Notice how much
more intelligible it becomes:

Cells total_profit[1] total_expenses[1] total_income[1] Sheet1!A1:F12 cash_at_start_of_period[1:3] net_profit[1:3] expenses[1:3] income[1:3] months[1:3] opening_cash[1] where: Sheet1!B1 = "Simple cash-flow spreadsheet" Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software" Sheet1!F4 = "Opening cash" Sheet1!B7 = "Month" Sheet1!C7 = "Income" Sheet1!D7 = "Expenses" Sheet1!E7 = "Net profit" Sheet1!F7 = "Cash at start of period" Sheet1!A12 = "Totals" cash_at_start_of_period[1] = opening_cash[1] cash_at_start_of_period[2] = cash_at_start_of_period[1]+net_profit[1] cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2] expenses[1] = 27 expenses[2] = 41 expenses[3] = 64 income[1] = 120 income[2] = 120 income[3] = 150 months[1] = 1 months[2] = 2 months[3] = 3 net_profit[1] = income[1]-expenses[1] net_profit[2] = income[2]-expenses[2] net_profit[3] = income[3]-expenses[3] opening_cash[1] = 1000 total_expenses[1] = SUM(expenses[1]:expenses[3]) total_income[1] = SUM(income[1]:income[3]) total_profit[1] = SUM(net_profit[1]:net_profit[3])

Although such a listing is more intelligible than the
original formulae, it can become tedious and very long if the
spreadsheet has lots of repeated formulae such as those for
`net_profit`

.
To overcome this, I've implemented
a "run detector". This compares each formula with the
ones next to it, looking to see whether it's either
identical or differs by 1 in a cell address or constant. When
it finds a run of two or more formulae that are thus related,
it summarises them by factoring out
the common parts and replacing these by a variable,
written here as `V0`

. Here's output
from the run-detector applied to the cash-flow spreadsheet:

total_profit[1] = SUM(net_profit[1]:net_profit[3]). total_expenses[1] = SUM(expenses[1]:expenses[3]). total_income[1] = SUM(income[1]:income[3]). Sheet1!A12 = "Totals". Sheet1!B1 = "Simple cash-flow spreadsheet". Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software". Sheet1!B7 = "Month". Sheet1!C7 = "Income". Sheet1!D7 = "Expenses". Sheet1!E7 = "Net profit". Sheet1!F4 = "Opening cash". Sheet1!F7 = "Cash at start of period". cash_at_start_of_period[1] = opening_cash[1]. cash_at_start_of_period[V0>1] = cash_at_start_of_period[V0-1]+net_profit[V0-1]. net_profit[V0] = income[V0]-expenses[V0]. expenses[1] = 27. expenses[2] = 41. expenses[3] = 64. income[V0<3] = 120. income[3] = 150. months[V0] = V0. opening_cash[1] = 1000.

In case this isn't clear, the equation `months[V0] = V0`

means that for all `V0`

within the bounds of
the `months`

table, the `V0`

month is `V0`

.
That is, the first element of the `months`

table is 1, the second
is 2, and
so on.
Similarly, the equations for
`cash_at_start_of_period`

say that the first element of `cash_at_start_of_period`

is equal to `opening_cash`

, and that each subsequent
element is the value for the previous month plus that
month's net profit. This, by the way, is a
common pattern in spreadsheets:
an initialisation formula followed by a run of formulae that
pass information from one row to the next. In effect, it's
recursion, but over space rather than time.

The run-detector also works on "raw" spreadsheet listings: those without name substitutions. Here it is, applied again to the cash-flow spreadsheet:

Sheet1!A12 = "Totals". Sheet1!B1 = "Simple cash-flow spreadsheet". Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software". Sheet1!B7 = "Month". Sheet1!B[V1 in 8:10] = V1-7. Sheet1!C7 = "Income". Sheet1!C[8:9] = 120. Sheet1!C10 = 150. Sheet1![V0 in C:E]12 = SUM(Sheet1![V0]8:Sheet1![V0]10). Sheet1!D7 = "Expenses". Sheet1!D8 = 27. Sheet1!D9 = 41. Sheet1!D10 = 64. Sheet1!E7 = "Net profit". Sheet1!E[V1 in 8:10] = Sheet1!C[V1]-Sheet1!D[V1]. Sheet1!F4 = "Opening cash". Sheet1!F5 = 1000. Sheet1!F7 = "Cash at start of period". Sheet1!F8 = Sheet1!F5. Sheet1!F[V1 in 9:10] = Sheet1!F[V1-1]+Sheet1!E[V1-1].

I had to invent a
notation for variables that range over row and column numbers.
This is illustrated by the line `Sheet1!B[V1 in 8:10] = V1-7`

.
This says that `Sheet1!B8 = 1`

, and that
`Sheet1!B9 = 2`

, and so on. Similarly, the
left-hand side `Sheet1![V0 in C:E]12`

says
that `V0`

ranges over column addresses
C to E. And on its right-hand side, the
subexpression `Sheet1![V0]8`

denotes
`Sheet1!C8`

or `Sheet1!D8`

or `Sheet1!E8`

.

Such a listing is useful because it helps discover runs of related formulae, and hence determine which cells are part of the same group. For example, it refers repeatedly to rows 8:10 and 9:10. If we didn't already know that the cells in these rows of each column were related, the listing would be a good hint. That isn't actually necessary in my cash-flow example, where the grouping is obvious by eyeballing the spreadsheet, but it can be very useful. I once worked on a 10,000-cell 200-sheet monster of a housing-finance spreadsheet which contained some 60 20×40 interrelated tables. The run-detector was invaluable in showing which formulae were related. 60 × 20×40 is 48,000 cells; I was able to reduce these to about 120 equations. I say 120, because each table tended to be defined by two sets of formulae: one set to initialise its first row, and one to progress down subsequent rows, in a similar way to the recursion mentioned above. The formulae within each set differed only by column address, and hence the run-detector could summarise them into one equation.

Finally, here is a spreadsheet that I've seeded with errors:

This spreadsheet looks like the one I showed earlier, but Excel is not
happy with it, which it signals with the little green triangles
in cells E9 and D12. It displays these when it thinks a
formula is out of step with those around it. To find out what's
wrong, I listed the formulae with name substitutions again; and
three errors become apparent. The value for
`cash_at_start_of_period[2]`

is a hard-wired constant rather than a formula.
The formula for `net_profit[2]`

subtracts from `income[1]`

and not `income[2]`

, an error hard to see by inspecting
the spreadsheet because both cells have the same value. And
the SUM making up `total_expenses[1]`

has run off the end
of its range:

Cells total_profit[1] total_expenses[1] total_income[1] Sheet1!A1:F12 cash_at_start_of_period[1:3] net_profit[1:3] expenses[1:3] income[1:3] months[1:3] opening_cash[1] where: Sheet1!B1 = "Simple cash-flow spreadsheet" Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software" Sheet1!F4 = "Opening cash" Sheet1!B7 = "Month" Sheet1!C7 = "Income" Sheet1!D7 = "Expenses" Sheet1!E7 = "Net profit" Sheet1!F7 = "Cash at start of period" Sheet1!A12 = "Totals" cash_at_start_of_period[1] = opening_cash[1] cash_at_start_of_period[2] = 1093 Error cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2] expenses[1] = 27 expenses[2] = 41 expenses[3] = 64 income[1] = 120 income[2] = 120 income[3] = 150 months[1] = 1 months[2] = 2 months[3] = 3 net_profit[1] = income[1]-expenses[1] net_profit[2] = income[1]-expenses[2] Error net_profit[3] = income[3]-expenses[3] opening_cash[1] = 1000 total_expenses[1] = SUM(expenses[1]:Sheet1!D11) Error total_income[1] = SUM(income[1]:income[3]) total_profit[1] = SUM(net_profit[1]:net_profit[3])

In bigger spreadsheets, the "out-of-step" errors would also
become evident when I list by runs. The spreadsheet here doesn't
really have enough related formulae for this to happen. However,
the following extract from a by-runs listing does show that something
is amiss with `net_profit`

:

net_profit[V0<3] = income[1]-expenses[V0]. net_profit[3] = income[3]-expenses[3].

All three kinds of error, by the way, have occurred in real spreadsheets. I've got repeated formulae wrong in my own spreadsheets, though I've always caught the error before the spreadsheets went live. I seem to remember horror stories of naïve spreadsheeters overtyping a formula with the number it has calculated, to "save the computer work"; and other anecdotes of the same being done to defraud. And errors in SUM ranges are notorious. SF author and critic David Langford once received only half the royalties he was due because a SUM in his publisher's royalties spreadsheet missed some cells. And the European Spreadsheet Risks Interest Group has a horror story (search for "$38K spreadsheet error") of a hospital's Medicare bad-debts spreadsheet whose SUM ranges were misprogrammed. Because of this, the hospital claimed $38,240 more to compensate it for bad debts than it should have.

Making existing spreadsheets easier to understand is one important
aspect of spreadsheet safety, and
is the main point of this posting. But it's also important to
avoid bugs
when writing new spreadsheets. I've worked on this too, and I'll
finish with a summary of the work. It's based on a language I've
developed called Excelsior, which can compile
listings like those above *into* spreadsheets. This means that you can
write meaningful identifiers
rather than cell addresses. Separate layout statements in the
program specify how these identifiers map to cells.

Excelsior also lets you write reuseable modules that can be shared between different spreadsheets. In other words, code reuse. That's something lacking from Excel, where to share code between spreadsheets, you have to copy cells from one spreadsheet and paste into another. And then, of course, you end up having to maintain several copies of your formulae.

The following links give more information on this work. In How to Document a Spreadsheet: an Exercise with Cash-Flow and Loans, I show how to develop, in Excelsior, a spreadsheet that represents several loans, and for each month, decides which is best to borrow from. The algorithm is fairly simple, but it's subtle, and it would be tricky to get right if coding directly in Excel. Having the formulae all available in front of you, with sensible cell names, makes it much easier.

A much longer exercise of the same kind is Gliders, Hasslers, and the Toadsucker: Writing and Explaining a Structured Excel Life Game. In it, I show how to develop a spreadsheet that implements a well-known cellular automaton, John Conway's game of Life. I wrote my article for maths teachers who might want to introduce secondary-school students to Life. Having the primary source code be an Excelsior listing, rather than a spreadsheet, makes it much easier to show students the equations that update the cellular automaton.

Rather different is the science-fiction generator at Ireson-Paine Spreadsheet Parts Repository Excel science-fiction generator. I based this on a jokey diagram in an old SF anthology, Gahan Wilson's The Science Fiction Horror Movie Pocket Computer from Harry Harrison and Brian Aldiss's The Year's Best Science Fiction No. 5. It's a familiar idea: a network of linked boxes where each box contains a plot event such as "Earth falls into the Sun" or "And so everybody dies". The links between boxes indicate which events can follow which. My spreadsheet implements this as a transition graph, which it recurses over to generate a story. I wrote the spreadsheet entirely in Excelsior; the recursion would have been very tricky to program if using only Excel.

Those examples use the idea that one can avoid bugs by generating spreadsheets from easily readable programs. Another way to avoid bugs is to build spreadsheets from prefabricated components. That's demonstrated in Spreadsheet Components, Google Spreadsheets, and Code Reuse, where I apply it to Google's online spreadsheet, enabling users to load library formulae into it in the same way they can load charts. The use of modules is an important part of spreadsheet safety, but much neglected.