Tim is an independent programmer and freelance writer. He can be contacted at [email protected].
In many applications, database access can be a serious performance problem. For example, a typical servlet-based shopping application can require a half-dozen or more database queries for each page. In this article, I'll survey a number of techniques for optimizing the performance of database-centric applications. Although my examples will utilize Java JDBC, the techniques are applicable to many other environments as well.
Database errors happen, and it's essential that your code handle them gracefully. This is especially critical in long-lived programs such as Java servlets.
Example 1 is a skeleton for a full JDBC request. There are a number of important details you should note about this example. The finally clause ensures that the database resources will be correctly cleaned up regardless of whether or not an error occurs. This type of code usually appears in a low-level function that should throw an exception, which is why the catch block ends by rethrowing the exception. Finally, I'm simply ignoring exceptions that occur during cleanup, since there's really no practical recovery possible.
Typically, the first draft of a database application opens and closes a connection for each query. The first optimization is to avoid the overhead of repeatedly opening and closing connections by maintaining a connection pool. A connection pool contains some number of already-open database connections. To access the database, you ask the pool for a connection, submit the query, read the results, and then return the connection back to the pool. The use of a finally block in Example 1 is essential to ensure that your connection is properly returned to the pool.
There are several good, free connection pools floating around. If you try writing your own, you'll find that a good connection pool is a short but tricky piece of coding. Among other considerations, a good connection pool should:
- Be completely thread safe.
- Have robust error handling throughout.
- Accurately track which connections are in use.
- Open new connections as necessary, but limit the total number of connections to avoid overloading the server.
- Ensure that no single connection remains open too long.
- Provide some help in debugging your application code.
I wrote a serviceable connection pool (available electronically; see "Resource Center," page 5) that is initialized with the various JDBC parameters to be used for new connections. You retrieve connections from the pool by calling the getConnection() method and return the connection with the release() method. If you set the DEBUG variable to True and recompile, the pool stores a Throwable object for each connection it returns; the Throwable can be used to print a stack trace for any connections that are not released, which is useful in debugging.
Reducing the Number of Queries
Each database query involves a certain amount of overhead; the JDBC driver must compose the query and relay it to the database engine, which must, in turn, parse the query and develop a query plan before it actually accesses any data. One of the simplest ways to speed up your application is to aggregate your queries. For example, in most database servers, Example 2(b) is much faster than Example 2(a). Sometimes, this type of optimization can require overhauling your application logic. Rather than making repeated calls to a function that looks up a single item, you might instead have a function that returns a list or array of all matching items.
Of course, the best way to speed up a database query is not to make it at all. With some thought, you can often arrange for frequently used data to be kept in memory where it can be quickly accessed without going back to the database. This requires some care, however.
Use the Factory Pattern
A factory is a class or object (usually a Singleton) responsible for creating and returning instances of another class. For example, if you want a User object, you might invoke UserFactory.getUser(xxx) to obtain the required object. This pattern lets you hide the source of the object. The application that is using the factory does not need to know whether the factory read the data directly from the database or returned an object that it kept in memory.
Having a separate factory class for each major data object allows you to fine tune your caching approach for each type of data. Initially, you can implement the factory very simply: Just do a direct database read for every request. Later, you can measure how often different data objects are requested and implement in-memory caching for the most often-used objects, all without altering any of the code that uses the factory.
Caching can break your application if you're not careful. In particular, there is no easy way to be notified when a piece of data in the database has changed. As such, you can easily end up with stale data in memory.
The easiest way to limit such problems is to store an expiration timestamp in each cached object and simply discard objects that are too old.
Of course, it's better to discard cached data as soon as the data is changed in the database. This is especially important if you have multiple redundant servers; if one server stores an update, you want the other servers to pick up that update quickly. This generally requires building in some form of communication between your servers. For example, in Java servlet designs, I like to have a special URL that invokes a flush() method in each factory class.
Make sure your tables have appropriate indexes, but avoid having too many indexes. Databases use a variety of statistics to decide whether it's faster to use an index. The PostgreSQL database, for example, only updates those statistics when you request a VACUUM ANALYZE. If your database works like this, you should ensure that this occurs regularly so that the database is always working with up-to-date information about your tables. Occasionally look at the output of EXPLAIN SELECT to see if your most common queries are being handled in the way you think they are.
Be careful about large data items. Depending on the system, it can be much faster to store the raw data in a file on disk and just put the filename in the database. This obviously depends heavily on the specific database being used, so you should spend some time experimenting with different approaches.
Partitioning the Database
After optimizing the way you access the data and the way the database stores the data, you may still find that performance is not what you hoped. You should, of course, first test to see whether your database or your application is the bottleneck. If you have both running on the same server, separate them; if you have everything running on one overloaded machine, it can be difficult to tell which part is causing the problem.
If the database is too slow, spread your data across multiple database servers. One way to do this is to split your data and put each part on a separate machine. This works best if you can find natural ways to divide your data so that your application never needs to join tables stored on separate servers.
Replicating the Database
Another approach is to replicate all or part of your database, placing exact copies on multiple machines. This is easiest for read-only data. For example, a typical e-business web site might be able to separate its catalog data (which is never updated by the web site) from the shopping cart and ordering information. To replicate read-only data, you can just set up a daily process that copies the current data to each database server. This process can be as simple as a small Perl script that queries one database and inserts the results into another.
Replicating data that is being read and written by multiple applications is more complex. Many databases, including the popular MySQL and PostgreSQL, have facilities for automatically propagating updates from a single master database that stores the definitive version of your data to several slave databases that hold copies of that data. Design your application to always write to the master database and always read from the slave databases. In this way, the master database only handles the relatively infrequent data updates while the heavy read load is spread across multiple slave databases.
This kind of master/slave arrangement does more than just boost performance. It makes it simpler to set up geographically separated servers. It also provides additional safety, since each slave database contains a complete, up-to-date copy of your data.
The only drawback of this master/slave arrangement is the small delay that gets introduced. A new piece of data will only get reflected in the live site after it gets propagated from the master database out to each slave. This delay is perfectly acceptable for many applications, including forums, newsgroups, and many other content-driven sites, but can be inappropriate for applications that demand immediate feedback.
Optimizing database access requires a lot of attention, but the results can be quite dramatic. Simple connection pooling can often triple the speed of your application; carefully designed caching can accelerate your systems by orders of magnitude.