In spite of the ubiquity of computers, data is often processed in batches. Historically, batch processing was a way to optimize the use of rare and precious machine resources. People prepared data offline, and when they did their batch processing, they were assured of consuming 100 percent of the available computing resources. As the price of computing fell and the number of computers grew, we stopped optimizing machine time. For evidence, consider the number of processor cycles spent running screen savers.
Some transactions are very complex, leading us to create hybrid applications. These have an interactive front end that works as quickly as necessary to make the human users productive. Additionally, they have a batch-oriented back end to process transactions slowly, when no one is tapping their foot waiting for results.
Another kind of batch operation is an analytical process where you may group data into bands. Sometimes you only want the bottom 10 customers or the vendors with nearest delivery dates. In this case, we don't want to see every vendor delivery or every customerwe only want to see the few that we can take action to help.
In both cases, we are selecting the top (or bottom) N rows from a database. This is a common SQL specialization that is not a part of the standard feature set of SQL. There are two definitions for the ordinary SELECT statement: single row and multiple rows (meaning all rows). There are various vendor-specific extensions to SQL to facilitate getting only a limited number of rows. However, there are a number of problems with these constructs.
For small tables, there isn't a terribly big problem here. The time required to table scan a few thousand rows is minimal. If you only want the first 100, you can't really avoid loading many of the remaining 900 into cache because of read-ahead strategies. If the table is reasonably well-used, the entire thing may be lurking in cache anyway, making the processing time negligible.
In very large processing contexts, such as financial institutions or utilities, or in a data-warehousing context, the number of rows may stretch into the hundreds of thousands, making a table scan far too expensive. In this article, we focus on large tables, with over 100,000 rows of data.
The problem we examine is how to pick out just N rows from a very large table as efficiently as possible. In this case, efficiency will be the elapsed time to return the entire set of rows. Our aim is to minimize the kinds of overheads that creep into this kind of problem when we take too many details as given parts of the technology, not as choices we make in creating a solution.
Order By and ROWNUM
Many RDBMS products can fudge in a row number as part of the query results. In Oracle, the ROWNUM column provides a number for each row returned. In DB2 and MySQL, there is a LIMIT clause that can be used to return only selected rows that comprise a batch.
The canonical example in Oracle is something like Example 1. This has the unfortunate side effect of sorting the entire table prior to locating the top N rows. Further, the temporary storage used for sorting can be a scarce resource. If multiple client processes are sorting concurrently, resources can be exhausted, leading to individual application crashes as well as making the system unresponsive.
Because Oracle assigns the row numbers before sorting, we have to use the inline view technique in Example 1. The data is sorted by the view, then row numbers are assigned for picking off a batch of rows. A large sort is done before any rows are returned. For systems like MySQL and DB2 with LIMIT clauses, the syntax is slightly simpler (see Example 2), but the performance is no better.
Some vendors offer a RANK analytic function that can be used with the OVER clause to provide ranking values in complex queries. As with the previous use of ROWNUM or LIMIT clauses, this will query, sort, and process the entire table before returning a useful result set.
One of the most common methods for selecting the top N rows from a result set is to write a short piece of application code. The idea is that the application program does not fetch rows beyond the N required rows. Unfortunately, the sort still gets done. Example 3 is handy for measuring the magnitude of the problem. The delay due to sorting is seen by measuring the time to process each row. The first call to next reflects the time to do the sort and fetch the first batch of rows into cache. The remaining calls to next run extremely quickly.
One Scan Only
To avoid sorting all of the rows in the table, you need to focus your sorting on just a subset of those rows. Example 4 exemplifies this approach. This seeds a TreeMap with keys and data elements for the first N rows. The first key is the minimal key in the set. As rows are fetched, they are compared against the first key. If the new row's key is less than or equal to this first of N keys, the new row is ignored. If the new row's key is greater than this first of N keys, then the previous minimum key is discarded and this new row is inserted.
The sorting is reduced to tree insertion for a subset of the rows. If the rows are in a random order, then approximately half will be ignored. If the rows are in descending order, then all but the first N will be ignored. If the rows are ascending, then the N-row subset will have each of the table rows sorted in and then removed.
Where a full sort of R rows is O(Rlog(R)), this sort is somewhat smaller, and is O(Rlog(N)). When the table is very large and the batch size is very small, this difference can be profound. When we are trying to find a 100 row batch from a 100,000 row table, the processing is cut in almost 1/3 in the worst possible case.
Point #1 in Example 4 is the TreeMap into which we'll accumulate, at most, keep rows of keys and data values. Since the map isn't full at Point #2, we load in key values and data values. In this case, we only use a single column that we presume is a primary key. If necessary, we could construct and insert a more complex object.
Once the map is full at Point #3, we compare each new row against the smallest of the keys in topRows. If the new row is larger, it should be in the collection; we drop the lowest value from the collection. Since this is a balanced red-black tree, the tree will be reordered and balanced as necessary after this operation.
Fork and Split
As with many such problems, the real problem is not to get the top N rows. The real problem is to break up a stream of incoming transactions into batches. The current architecture puts transactions into a large table, then repeatedly picks out batches for processing. This is, of course, slow, and the initial narrow focus on a specific performance issue leads to this investigation.
The table and the consequent table scans are not an essential feature of the problem. The table is merely persistence for transactions until they have been fully processed. It is little more than a recovery mechanism in case the transaction processor fails. From this point of view, it is a slow and expensive version of a reliable message queue, and any number of vendors provide reliable message delivery without using a large, expensive relational database.
Realizing this, there is a canonical UNIX solution that works without any overhead at all:
- A transaction enters the system. It is inserted into the database for reliability purposes. It is also written to a temporary file for processing.
- When the temporary file has a batch of N records in it, the system closes the file and then forks a subprocess to execute the batch of transactions in that file.
- The subprocess is given a file of transactions. Each transaction in the file can also be found on the database. When the subprocess has finished processing the transaction, it deletes the transaction from the database, clearing out the reliability information.
In the event that the transaction processors crash, the unprocessed transactions in the database can be queried and split up into transaction files, and transaction subprocesses created. The only difference from normal operations is the origin of the transactions.
It is often difficult to fix the real problem. In this case, the overall architecture was an effort at inventing a reliable message queue.
However, since the application is working in production, it's difficult to replace it with a simpler, more focused, and reliable message queue product. Instead, we're stuck with Home-Brewed Reliable Message Queuing (HBR-MQ).
Rather than fix the performance problems, we can mask them by using a high-performance application that fetches batches of transactions without the overhead of table sorts. The additional complexity raises the cost of maintenance, and makes it more difficult to diagnose system problems. However, it can run considerably faster.
In any application with large data volumes, sorting should be avoided. In some cases, the entire DBMS should be avoided. Where the database can't be avoided, the performance of each processing step must be considered to assure that solutions are reliable and scalable.