Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Application-Level Data Caching


Dec03: Application-Level Data Caching

Performance is the key

Paul is a senior consultant and programmer analyst at Express Scripts Inc. He can be contacted at [email protected].


Thanks to new partitioning schemes and smarter caching functionality, database performance is improving by leaps and bounds. Nevertheless, it is still common for database access operations to be the primary bottleneck in data-transformation applications. In many cases, in fact, the best way to improve application performance is to remove the database from the picture. Although not always possible, some types of applications can be designed so that continuous database access isn't necessary. In this article, I present an application-level data-caching library that does just this. In one real-world implementation of this technique, a database-centric lookup of about 1 million records was reduced from four hours to just four minutes.

The first thing to consider when developing an application-level caching library is to decide which features are needed to support client applications. The application-level caching library I present here (available electronically; see "Resource Center," page 5) is encapsulated by a single class called "Lookup" (and internal helper class called "LookupSQL" that generates SQL statements). In general, Lookup caches are query/response pairs (key/value pairs) where the query (key) component is always unique. The client application can ask: "What's the response value associated with these query values?" Then, the Lookup object (depending on options) might search its cache, query the database, assign a new value for that query, or tell you that what you requested can't be found.

Other requirements include: storing dictionary (query/response) type pairs for lookup operations; allowing runtime definition of what set of values needs to be cached; providing standard methods for accessing cached values; maintaining records of how many requests succeed or fail; providing standard error handling/messaging; supporting requests and caches in various combinations; prefetching memory cache; prefetching disk cache; supporting dynamic memory cache, database lookup, and assignment; single database tables, and freeform SQL queries.

The Lookup class supports all of these requirements and can be extended to support other requirements applications might specify.

Listings One through Seven show how these requirements are met by Lookup using MySQL, the mysql++ data-access classes, and some STL templates. The examples show how a client application can use Lookup for database lookup caching and describe what happens within Lookup to support those operations.

Prefetch Into Memory

With prefetch caches, the assumption is that the set of all possible query/response pairs is relatively small and can be held in memory. For instance, suppose you have a database table that stores demographic information about customers. The table contains GENDER, AGE, and DEMO_KEY (an artificial surrogate key). In this case, there is a relatively small number of possible GENDER/AGE combinations. Suppose valid values for GENDER are "M/F/U" and AGE is an integer between 0 and 150. So, there are only 453 possible combinations. The data file that the application must process contains both GENDER and AGE, but to insert the final records into the database, you need to convert that GENDER/AGE combination to a DEMO_KEY using a database lookup. Remember that DEMO_KEY is an artificial surrogate key, so there's no order or reason to which DEMO_KEY gets assigned to which GENDER/AGE combination. Listing One shows how to initialize a Lookup object and use it to query for values of DEMO_KEY.

To facilitate prefetch memory cache, the Lookup class queries the database for all possible pairs and stores those responses in a searchable memory-based structure. In this implementation, the C++ STL's map<> template creates an associative array of <string, string> pairs. Even though the upfront initialization time is longer than it would for the program to query the table for an individual input record, the improved response time during each call to Find() outweighs the initial upfront cost if a large number of records is being processed.

Prefetch Onto Disk

Memory-based lookup caching works well with relatively small tables like the DEMOGRAPHICS table. Since there are fewer than 1000 records, it's easy to cache that information in memory. However, a lookup might need to be done against a database table with thousands or millions of records. Take the example of a table of CUSTOMERS. A given company may have tens of thousands, hundreds of thousands, or millions of customers. It may not be practical to cache 80 million customers in a memory-based structure. For these applications, Lookup also supports disk-based caching. Listing Two is a disk-based cache being used to retrieve CUST_KEY from the CUSTOMERS table using SSN.

In this example, information from the CUSTOMERS table is extracted and written to disk instead of to memory. Internally, the Lookup library creates a temporary file with an ordered array of all possible values that match the criteria specified in the Lookup: SSN and CUSTOMER_KEY. The information in the file is ordered by the search fields (SSN, for example), then a system call (mmap()) maps the file into the process's address space for convenient access. There are a few special features of Lookup being taken advantage of in this example.

The Lookup() constructor takes a final parameter that specifies whether the new object should be initialized right away. Initialization is the step during which prefetch queries are executed and data is either stored in memory or written to a scratch file. In this case, the initialization is being delayed so that the maximum query length and the maximum response length can be manually set. Since the internal structure of a disk-based cache is really just a fixed-width array (where the width of each record equals the maximum query length plus the maximum response length), it makes sense to have these parameters configurable. Adjusting these values to represent the true maximum query length and maximum response length helps control how large the temporary file becomes.

The downside to disk-based caching, of course, is that the upfront initialization period may be long, depending on the amount of data being retrieved and general I/O characteristics of the system.

Dynamic Lookups

Another option is to revert to the old-fashioned method of performing individual queries for each and every row being processed. Within Lookup, these types of queries are called "dynamic lookups" and require minimal upfront initialization time, but do not prefetch any data from the table being queried. This means that each subsequent request queries the database and result in generally equal performance. This is practical when the set of records being accessed is extremely large.

However, if the expectation is that each value seen during a given program execution is likely to be seen more than once, it is advantageous to have Lookup keep the answer around for any values it has already encountered during this execution. The DynamicCache option tells Lookup that, once it has retrieved the response for a given query, the resulting pair should be stored in memory for later access. This means that all queries will search the dynamically built memory cache, then revert to asking the database. You can see from the diagnostic output from the program in Listing Three that, despite being asked for the same DEMO_KEY 100 times, the database is actually queried only once. The output of this program is a list of statistics about what happened internally. The key lines are:

Dynamic Cache Hit Count 99

Dynamic Query Hit Count 1

As the message indicates, the database was actually queried only once. After being queried, all subsequent requests for that same value were able to find a hit in the dynamic cache that was updated after the first query.

Dynamic Assignment

So far, my examples have been trying to lookup some piece of reference data. The need to create new reference records is certainly not uncommon, though. For instance, a program might be processing a denormalized customer record that includes address information—characterized the fields City and State—describing the metropolitan area where the customer lives. During the process, one task is to lookup the METRO_AREA_KEY associated with the given CITY and STATE. It is possible, though, that new metro areas might be encountered at any time. Instead of failing when a METRO_AREA_KEY cannot be found for the given CITY and STATE, the requirements are that a new metro area record should be created and assigned a new METRO_AREA_KEY. Lookup includes DynamicAssign, an option to support the dynamic assignment of new keys (see Listing Four).

The implementation of how to assign a new METRO_AREA_KEY is database specific, but in this MySQL implementation, new keys are assigned using AUTO_INCREMENT columns. So there is a requirement that any key dynamically assigned must be an AUTO_INCREMENT field. Listing Five is the DDL for a sample METRO_AREA table.

When the queries to the dynamic cache and the database both fail, DynamicAssign lets a new value be inserted into the database. The SQL generated by the library for the insert looks like this:

insert into METRO_AREA (CITY, STATE) values ('New City','ZZ');

Lookup uses the last insert id functionality within MySQL to retrieve the new METRO_AREA_KEY that was assigned to this CITY, STATE combination.

Using SQL Templates

To this point, my examples have had two main restrictions—only one table is being queried, so there are no joins to allow for the retrieval of data from multiple tables, and the return values have all been single fields.

The UseTemplate example shows how the Lookup constructor can accept a full SQL statement, instead of just a table name, and return more than just one field. The only limitation introduced by the SQL template option is that it cannot be used with the DynamicAssign option on the same Lookup object. This makes sense because if you are planning to pull data using a query that joins several tables, it might be difficult to define where new records should be inserted. Listing Six shows how you can use UseTemplate in a dynamic lookup.

The table parameter is now replaced by a full SQL statement with some place holders to show Lookup how the SQL is structured, and the return expression parameter that has, until now, simply been a field name is replaced by a more complex SQL expression. The SQL statement being used in place of the table parameter must have these characteristics:

  • The SELECT clause must be precisely this string: "select &Q_FIELDS, &R_EXPRESSION".

  • Any number of tables can be included in the FROM clause.

  • If table aliases are used in the FROM clause, they should also be used in the q_fields parameter and r_expression parameter.

  • There must be a WHERE clause that includes, at a minimum, the placeholder &Q_CONDITIONS. The WHERE clause can also include table joins or filters as needed.

Joining multiple tables together and returning more than just one field are not the only ways that UseTemplate enhances the usability of Lookup. The SQL template that is passed into the Lookup() constructor can contain any number of table joins and selection criteria within the WHERE clause. For a dynamic or prefetch cache, this is a wonderful way to filter the set of data being cached. For instance, if data files arrive already divided by region, it makes sense to use the UseTemplate and SQL shown in Listing Seven to create a Lookup object that can retrieve the CUST_KEY by SSN only for those customers in the REGION given on the command line. The flexibility of UseTemplate opens a variety of opportunities for performance improvement.

Test Case Performance

Raw performance tests might look unrealistic, but it shouldn't be too surprising that the overhead involved in making separate individual database queries would yield significantly worse performance than accessing a presorted memory structure. Table 1 shows some of the raw performance numbers based on the lookup of 10 million customers from an indexed customer table with 100,000 rows.

Application-level caching (either to memory or disk) shows a dramatic improvement for overall runtime. A similar real-world implementation showed comparable results. A program built to perform four distinct queries, against tables ranging from 1-4 million records each, improved from a runtime of four hours to 14 minutes when shifted to a strategy that uses the kind of application-level caching in Lookup.

The timings in Table 1 for the memory-based lookup and the file-based one are essentially equivalent. Depending upon the host's performance characteristics, this ratio may fluctuate. In one production system where this type of caching is implemented, this ratio is more like 10 minutes (using memory) to 60 minutes (using disk). Improving the map<> implementation might help improve the end-to-end performance for the prefetch in memory lookup.

Enhancements

There are many ways to improve on this implementation of an application-level data cache, including:

  • Shared file cache for disk-based lookups. If you have multiple copies of the same executable running, it might be nice (if a disk-based cache is still around) to utilize a single temp file for all of those copies of the same program. As implemented, if 30 copies of the program run, then 30 processes will run the exact same query and write the exact same data to disk. Why do that 30 times?

  • Multiple return values. It may be desirable to receive responses where more than one single row might be returned. The library could be enhanced to return an array of responses.

  • Dynamic cache garbage collection. This would require building a mechanism for aging entries in the dynamic cache. In this implementation, the dynamic cache merely grows until process memory is full. If an aging strategy were implemented, it would be possible to define rules such as "remove from cache when not seen in 100 records." This would help keep down the size of the dynamic cache and improve overall performance under some circumstances.

Of course, there are numerous other possible enhancements such as replacing the map<> structure for better performance and wrapping the class in a daemon that constantly maintains and shares various caches between processes.

Conclusion

Even though enterprise databases are designed to store data and return answers quickly, that may not always be quick enough. There are occasions where it makes sense for the application to take on the responsibility of data-access performance and start doing some caching on its own. The Lookup class presented here is just one implementation of such a solution.

DDJ

Listing One

typedef vector<string> StringList;

StringList q_fields = StringList(2);    // Holds list of fields to query on
StringList q_types  = StringList(2);    // Holds the data type of those fields
string *demo_key    = new string();     // String buffer for the demo_key response
int ret;                                // Return value

q_fields[0] = "GENDER";
q_fields[1] = "AGE";
q_types[0]  = "char";
q_types[1]  = "int";

Lookup *demo = new Lookup (
        "user:pass@dn:host",    // Database connect information
        "DEMOGRAPHICS",         // Table name
        q_fields,               // Query field names
        q_types,                // Query field types
        "DEMO_KEY",             // The expression to be returned
        Lookup::PrefetchQuery,  // Options
        True);                  // True so that we do initialization right away

StringList values = StringList(2);      // Contains [0]=GENDER and [1]=AGE

ret = demo->Find(values, demo_key);

cerr << "Return code(" << ret <<"): When looking for ( 
     << values[0] << ,
     << values[1] << ") I found " << *s << endl;

Back to Article

Listing Two

typedef vector<string> StringList;

StringList q_fields = StringList(1);    // Holds list of fields to query on
StringList q_types  = StringList(1);    // Holds the data type of those fields
string *customer_key= new string();     // String buffer for the demo_key response
int ret;                                // Return value

q_fields[0] = "SSN";
q_types[0]  = "char";

Lookup *cust = new Lookup (
        "user:pass@dn:host",    // Database connect information
        "CUSTOMERS",            // Table name
        q_fields,               // Query field names
        q_types,                // Query field types
        "CUSTOMER_KEY",         // The expression to be returned
        Lookup::PrefetchQuery |
        Lookup::UseDiskCache,   // Options are logically OR'ed together
        False);                 // False so that we can change field lengths

// Set the maximum length of the query fields and the maximum length of the response.
// Setting this to the minimum possible value will help save space in the cache file
// and improve overall performance.
cust->SetQueryLength(10);
cust->SetResponseLength(10);

// Now that we've set the min query and response lengths, initialize and
// load the disk cache.
cust->Init();

// For our search, we'll loop through all 9-digit integers and find out how 
// many are a valid customer's SSN.
StringList values = StringList(1);
char ssn[10];
long found=0;

for (n=0; n<999999999; n++)
{
        sprintf (ssn, "%09d", n);
        values[0] = ssn;
        ret = demo->Find(values, customer_key);
        found += (ret==0?0:1);
}

cerr << "Found " << found << " valid SSN" << endl;

Back to Article

Listing Three

typedef vector<string> StringList;

StringList q_fields = StringList(2);    // Holds list of fields to query on
StringList q_types  = StringList(2);    // Holds the data type of those fields
string *demo_key    = new string();     // String buffer for the demo_key response
int ret;                                // Return value

q_fields[0] = "GENDER";
q_fields[1] = "AGE";
q_types[0]  = "char";
q_types[1]  = "int";

Lookup *demo = new Lookup (
        "user:pass@dn:host",    // Database connect information
        "DEMOGRAPHICS",         // Table name
        q_fields,               // Query field names
        q_types,                // Query field types
        "DEMO_KEY",             // The expression to be returned
        Lookup::DynamicQuery |
        Lookup::DynamicCache,   // Options
        True);                  // True so that we do initialization right away

StringList values = StringList(2);      // Contains [0]=GENDER and [1]=AGE

for (int i=0; i<100; i++)
{
        ret = demo->Find(values, demo_key);
}

demo->PrintStats();

Back to Article

Listing Four

typedef vector<string> StringList;

StringList q_fields = StringList(2);    // Holds list of fields to query on
StringList q_types  = StringList(2);    // Holds the data type of those fields
string *ma_key = new string();  // String buffer for the response
int ret;                                // Return value

q_fields[0] = "CITY";
q_fields[1] = "STATE";
q_types[0]  = "char";
q_types[1]  = "char";

Lookup *demo = new Lookup (
        "user:pass@dn:host",    // Database connect information
        "METRO_AREA",           // Table name
        q_fields,               // Query field names
        q_types,                // Query field types
        "METRO_AREA_KEY",      // The expression to be returned
        Lookup::DynamicQuery |
        Lookup::DynamicCache |
        Lookup::DynamicAssign,  // Options
        True);                  // True so that we do initialization right away

StringList values = StringList(2);      // Contains [0]=CITY and [1]=STATE
values[0] = "New City";
values[1] = "ZZ";

ret = demo->Find(values, ma_key);

demo->PrintStats();

Back to Article

Listing Five

create table METRO_AREA (
        METRO_AREA_KEY  int          not null auto_increment,
        CITY            varchar(30)  not null,
        STATE           char(2)      not null,
        REGION          varchar(20)      null,
        primary key (METRO_AREA_KEY)
);

Back to Article

Listing Six

typedef vector<string> StringList;

StringList q_fields = StringList(2);    // Holds list of fields to query on
StringList q_types  = StringList(2);    // Holds the data type of those fields
string *address = new string(); // String buffer for the response
int ret;                                // Return value

q_fields[0] = "c.SSN";
q_fields[1] = "a.ADDR_TYPE";
q_types[0]  = "char";
q_types[1]  = "char";

string sql_template = "
select &Q_FIELDS, &R_EXPRESSION 
from CUSTOMERS c, ADDRESSES a
where c.CUST_KEY = a.CUST_KEY
and &Q_CONDITIONS";

string r_expression = "
ADDR_LINE1 || '|' ||
ADDR_LINE2 || '|' ||
CITY       || '|' ||
STATE      || '|' ||
ZIP        || '|' ||
COUNTRY"

Lookup *demo = new Lookup (
        "user:pass@dn:host",    // Database connect information
        sql_template,           // SQL Template
        q_fields,               // Query field names
        q_types,                // Query field types
        r_expression,           // The expression to be returned
        Lookup::DynamicQuery |
        Lookup::UseTemplate,// Options
        True);                  // True so that we do initialization right away

StringList values = StringList(2);      // Contains [0]=SSN and [1]=ADDR_TYPE
values[0] = "999115555";
values[1] = "HOME";

ret = demo->Find(values, address);

cerr << "Return code(" << ret <<"): When looking for ( 
     << values[0] << ,
     << values[1] << ) I found " << *address << endl;

// Then you can go off and parse the pipe-delimited address string 
// into it's components.

Back to Article

Listing Seven

/// Get the region from the command line
string region = argv[0]; 

// And use that region to generate the SQL template for the Lookup
string sql_template = "
select &Q_FIELDS, &R_EXPRESSION
from CUSTOMERS
where REGION='" + region + "' and &Q_CONDITIONS";




Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.