July 14, 2008
As low man on the totem pole at Tufts, it was my honor to teach all
the database courses while I was there. (Don't get me wrong, I also
got to teach some totally cool classes of my own invention. Tufts was
wonderful to be at. I highly recommend it.) I inherited the book used
by the previous guy (who was very happy to leave it). It was a very
fine book in many ways, but I realized that it was so focused on the
details of how to make a database engine run fast, that it missed the
main point about why we want a database in the first place.
We want a database so we can persist information that we need to run
our business or lab or whatever. Anything that does not contribute to
that goal is worthless. No surprise there.
Classic Database courses (like the one my book was written for) focus
almost exclusively on issues of how to make relational database
queries run quckly. And so the major topics are query translation,
indexing, sorting, concurrency optimizations, ACID, etc.
These are all good things and should be understood by anyone getting a
degree in Computer Science. My complaint with this approach is that
the fundamental reason for having a database gets lost and we end up
with systems that store lots of data very efficently, but don't
actually do what we want them to.
So that was my big push in my classes--what do we want?
The answer to that question (as I'm sure you're aware, but that other
guy isn't, so I'm explaining for him) is "We want to persist some of
the data used in our applications." And all that stuff about
correctness, efficency, ACID, etc., is just detail subsumed by our
We want to persist data.
The data we want to persist is the data that our applications
need. And the form of that data is whatever form works best for the
Much of the complication in database design comes from having multiple
applications that all want to share the same data, but they don't use
it in the same way.
You have a customer database that needs to record things like shipping
addresses and invoices. Great.
You have an employee database that needs to record things like health
care plan and home address. Also great.
Now one of your employees starts buying stuff from your
The day comes when you want to send a mailing to all customers and all
employees. No need to send two mailing to the same person. How do you
know if customer Jane and also employee Jane?
Are there two separate tables? If not, do you include all attributes
for all people, even if most of those attributes won't be used? If you
have two tables, how do you decide on which address or phone number to
use? If a request to change address comes in, do you change both
addresses? And these are the easy questions.
These are the issues that keep DBAs up at night.
Our book noted this issue, described the two options, then left us
without guidance as to how to actually make them work. To this day I
only have a vague idea how I would deal with this in the real world.
As part of my class, I assigned the students to "go forth and
interview a real DBA." Too much book learning ain't good for a
man. Somebody famous said that.
I have no qualms about asking students to distinguish between B-trees
and hashtables, nor to calculate precise performance figures for
both. But I don't expect to keep these details in their heads for the
rest of their lives. I expect them to understand what's important from
the application's point of view, and have a vague idea of the details,
secure in the knowlege that if they ever needed it again, they know
how figure it out.