Dr. Dobb's | Dynamic Caching & ADO DataSets

Dynamic Caching & ADO DataSets

Our authors present a technique that uses incremental queries to fetch data dynamically.

December 01, 2004
URL:http://www.drdobbs.com/database/dynamic-caching-ado-datasets/184405936

December, 2004: Dynamic Caching & ADO DataSets

Building a smart cache

John and Hong are senior software engineers. John works at Sabre Airline Solutions and can be contacted at [email protected].

Caching enough data for an application's normal use while effectively re-querying a database for additional data is a challenging task. For example, suppose you need to create a user interface that lists all of a company's employees, but displays only a hundred or so rows at a time. What you need to do first is create a SQL query string such as: "select name, department... from employees where... order by..." You then query the database to get the data, store it in a local data structure, and use only a small portion of it at a time.

Depending on the application, a large part of the retrieved data may be left untouched. What you really need is a smart cache that supplies the correct amount of data at any time. Of course, many applications have been implemented that manage an internal cache, but this involves a lot of work. Furthermore, it is also difficult to create a generalized solution. Different applications use different data structures, and it is hard to come up with a solution unrelated to the data structure. However, .NET's DataSet class and its related components from ADO.NET can help in addressing this problem.

DataSet Features and Limitations

As a part of the .NET Framework, ADO.NET provides the DataSet class whose primary goal is to represent an in-memory data cache. Objects that can be cached include tables, views, and constraints. DataSet also provides an interface for accessing the cached records. With this interface, OLTP-related operations—read, insert, update, and delete—are straightforward. ADO.NET includes another, closely related component called DbDataAdapter, whose role is to keep a DataSet in sync with database servers. DbDataAdapter provides a "fill" operation that retrieves data records and puts them into an in-memory table in the DataSet, and an "update" operation to update the database server with modified rows.

The key to a dataset cache that retrieves data dynamically is with the adapter. According to the DbDataAdapter documentation, one version of the fill function accepts arguments that specify the starting point of the fill and the amount to fill. Since this is exactly what we were looking for, we wrote a test program to see how fill behaves. To our disappointment, it always queries and transmits an entire result from the first record to the last specified record, but only fills the dataset with records in the specified range. Anything before the range gets lost and anything after it is skipped. In short, the function doesn't help much at all.

Moreover, if you call this fill function multiple times (each targeting at a different data range), the total amount of data processed is more than is necessary. The worst case happens when the end of the range in each fill is close to the end of the entire query. In this instance, the total amount of data processed is many times the amount that is necessary. While disappointing, it is not a surprise when you think about what is going on behind the scenes. Since ADO technology sits on top of SQL, the fill function depends on a SQL query string to get the data. When using a SQL query, you cannot randomly access retrieved records, and you shouldn't expect the fill function to do things differently. Data retrieval has to start from the beginning. As soon as a query starts producing data, however, nothing prevents you from stopping it. This is how the fill function can ignore the rest of the data. To find a solution to the cache problem, you need to go one step deeper into the SQL level—below the DataSet and ADO stage.

Incremental Queries

At the SQL level, a single SQL query string attached to the fill function isn't going to help. So how then can you break a single select statement into multiples? Assume that you use a different query string every time to repeat the fill, depending on the already fetched data. This new query string must know how to avoid getting the records you already have. For example, suppose you are querying all of a company's employee records, ordering the result by firstname/lastname. The query string would look like this:

select fname, lname, department... from employees where... order by fname, lname

Suppose you use this query string once, and stop the data flow somewhere after getting the record of interest. Also suppose that the last record you get is for "Joe Smith." When you need more data, you can simply query those records that come after Joe Smith. Since this query is ordered by firstname/lastname, you can easily modify the original statement to make a new query string, such as:

select fname, lname, department... from employees where... and fname>'Joe' and lname>'Smith' order by fname, lname

where and fname>'Joe' and lname>'Smith' is the newly inserted piece. As before, with this modified query string, you do not have to get the complete result. You can simply stop after getting what you want. You can repeat this process until all the records have been brought into memory, a process we call "incremental querying." In this incremental query, we assume that the original query string uses an order by clause. Without this assumption, the sequence of the result set cannot be guaranteed; hence, the relationship between each incremental query result cannot be established. In actual work, this assumption is true in many cases. If an order by clause is not used, it should be easy enough to add.

Incremental Queries and Limited Data

Depending on the application and platform, the total size of a query's results may be small enough to be placed in physical memory. We call this the "limited data scenario." In this scenario, the caching function emphasizes saving network traffic and increasing the speed of data retrieval. In the worst case, it is assumed that the entire query result may be loaded into memory without problem. Listing One is an implementation of the incremental query method for this limited data scenario. The implementation hides many tasks, such as database connection management, rebuilding the incremental query string and getting the total number of rows in the query. All users need to do is specify the original query string, a comparison string, and ask for a result. The comparison string is used to build the incremental query string. It's replaceable part is replaced by the data from the last row of the last query at runtime. Then, it's inserted into the SQL as an additional where clause. The comparison string is closely related to the order by clause. For example, for the previous query, the comparison string would be:

fname>'{0}' and lname>'{1}'

where 0 and 1 represent the index numbers of the corresponding columns in the selected result. When asking for a result, you use the GetRow function with a single argument—a row index. This is an index into the complete query range. The return is an instance of ADO DataRow, which is generic enough to hold any single row of queried data regardless of the column number and the column type. Figure 1 shows how the data is queried and filled in a dataset.

Remember that the first fill uses the original query string. It starts from the first record and fills to the record of interest. If you chose not to fill the leading segment, you will waste results. In this limited data scenario, assume the memory is not a problem. Accordingly, the implementation chooses to fill any piece of retrieved data into the dataset.

For the second and subsequent queries, an incremental query string is used. The incremental queries start immediately after the last record of the previous result. Depending on the relative distance from the previous record to the next expected record, each incremental query fills a different amount of data. The worst case occurs when the first query is asked to retrieve the last record. In this case, the whole result is filled on the first attempt and there is no need for an incremental query. The best case occurs when the records are consumed sequentially from the beginning. If an expected record is not in memory, the original query is executed first. Then, if any further expected record is not in memory, an incremental query is executed. For efficiency, each query should retrieve some extra records beyond what is required. This reduces the number of queries and avoids querying too often.

Incremental Queries and Unlimited Data

When a query result becomes too large, it may be impossible to hold it in memory. We call this the "unlimited data scenario." As memory becomes cheaper, newer computers are available with much larger memory sizes. This makes the unlimited data scenario less common. If large data returns are the problem, however, the first thing to try is narrowing the query result. This can be done by putting more restrictions on the query string. In other words, it is helpful to break down a large query and work with smaller result sets. A side effect of this breakdown is that the results from different queries may become discontinuous. When this is not acceptable, you can extend the solution presented earlier to handle unlimited data.

The key to this extension is to handle data drops. Every now and then, when the quantity of data within a dataset exceeds a limit, you have to drop some to keep the in-memory data size manageable. The question becomes how to drop data intelligently. In Figure 1, the data fill starts from row 1 and each incremental query continuously fills data up to the last row. When the available memory cannot cover the entire dataset, you can imagine the usable memory as a segment window that moves up and down on the dataset as the interested row index jumps. Depending on the relative location of the previous and new cache windows, there are four situations to consider when filling data; see Figures 2 through 5.

In Figure 2, the new cache window is separated from and after the old cache window. In this situation, the old cache has to be removed, and part of the newly queried data has to be skipped as well. Skipping can easily be done by the fill function we discussed earlier.

In Figure 3, the new cache window overlaps with the old cache window. In this situation, skipping is not necessary, but a part of the old cache window needs to be removed. In Figure 4, the new cache window is separated from before the old cache window. In this situation, the old cache also has to be removed and part of the newly queried data has to be skipped. For situation 4, the original query string needs to be used.

For the situation depicted by Figure 5, the handling is similar to Figure 4. The difference is that you need to pay special attention to the overlapped part. You can skip it, or you can refill it if you use a typed dataset and have a defined key for the filling table.

Recall, we mentioned that a dataset can hold database schema; and in the schema, you can define key constraints. These are exactly what are needed to refill the overlapped section. Choosing whether to skip or refill depends on the requirements of the application. Skipping helps to reduce network traffic and refilling can keep a dataset closely synchronized with its database server.

Conclusion

An in-memory cache for large database queries is often necessary. While ADO.NET—and especially DataSet—provides good functionality for managing cached data, filling the cache is still your responsibility. In this article, we've presented a way of using incremental queries to fetch data dynamically. But remember that there is no best solution for all situations—the method we present works best when the data records are sequentially accessed from top to bottom. In other situations, the method may require large data retrieval in a single query and produce some data loss when memory is limited. The method also requires multiple query strings to be compiled on the database server. In no situation, however, does the presented method cause extra network traffic.

DDJ

Listing One

using System;
using System.Data;
using System.Data.OleDb;
using System.Text;
using System.Collections;

namespace RDView
{
  public interface IRDView
  {
    bool Open(string oleDBDriver, string dataSource, string user, 
              string password, bool keepConnection, int cacheSize);
    bool Close();
    DataRow GetRow(int rowIndex);
    string GetRowAsString(int rowIndex);
    int RowCount();
    string GetLastError();
  }
  public class RDView : IRDView
  {
    OleDbConnection conn;
    string sqlStr;
    string sqlDiffStrTemplate;
    DataSet ds = new DataSet();
    string error;
    int blockSize = 100;
    int totalRowCount = -1;
    const string TblName = "Results";

    public RDView()
    {
    }
    public bool Open(string oleDBDriver, string dataSource, string user, 
                     string password, bool keepConnection, 
   int cacheSize)
    {
      blockSize = cacheSize;
      if (blockSize > 0)
        blockSize = blockSize > 10 ? blockSize : 10;
      StringBuilder sqlb = new StringBuilder();
      sqlb.Append("provider=");
      sqlb.Append(oleDBDriver);
      sqlb.Append(";data source=");
      sqlb.Append(dataSource);
      sqlb.Append(";user id=");
      sqlb.Append(user);
      sqlb.Append(";password=");
      sqlb.Append(password);
            
      try
      {
        conn = new OleDbConnection(sqlb.ToString());
        if (keepConnection)
          conn.Open();
        return true;
      }
      catch (Exception e)
      {
        error = e.Message;
      }
      return false;
    }
    public bool Close()
    {
      if (conn.State != ConnectionState.Closed)
        conn.Close();
      return true;
    }
    /// <summary>
    /// 
    /// </summary>
    /// <param name="sql">The SQL string</param>
    /// <param name="comparingString">The comparing string that differenciate 
    /// current row with the previous row. For example, if you have "order by 
    /// emp_id" in your SQL string. The comparing string should be: 
    /// emp_id>'{i}', where i is the selection position, and {i} will be 
    /// replaced by the ith selection of the last row.
    /// </param>
    /// <param name="cacheSize">give -1 for no cache</param>
    public void SetSelect(string sql, string comparingString)
    {
      sqlStr = sql;
      if (comparingString == null || comparingString.Length == 0)
        sqlDiffStrTemplate = "";
      else
      {
        string tmp = sqlStr.ToUpper();
        int i = tmp.IndexOf("ORDER");
        if (i > 0)
        {
          int j = tmp.IndexOf("WHERE");
          if (j < 0)
            sqlDiffStrTemplate = sqlStr.Substring(0, i) + "WHERE " + 
                             comparingString + " " + sqlStr.Substring(i);
          else
            sqlDiffStrTemplate = sqlStr.Substring(0, i) + "AND " + 
                             comparingString + " " + sqlStr.Substring(i);
        }
        else
          sqlDiffStrTemplate = "";
      }
    }
    /// <summary>
    /// Build a SQL string base on a template, and replace the placeholder 
    /// with data from the comparing row
    /// </summary>
    /// <param name="sqlTemplate">SQL string template</param>
    /// <param name="compareRow">data row to use to fill the 
    /// template</param>
    /// <returns></returns>
    private string BuildSQLDifferencialString(string sqlTemplate, 
                                                         DataRow compareRow)
    {
      string sqlString = "";
      int i=0, j=-1;
      do 
      {
        i = sqlTemplate.IndexOf("{", i);
        if (i>0) 
        {
          sqlString += sqlTemplate.Substring(j+1, i-j-1);
          j = sqlTemplate.IndexOf("}", i);
          if (j > 0)
          {
            string posToken = sqlTemplate.Substring(i+1, j-i-1);
            int pos = Convert.ToInt32(posToken);

            sqlString += compareRow.ItemArray[pos];
          }
          i++;
        }
      }while ( i > 0 && j > 0);
      if (j>0)
        sqlString += sqlTemplate.Substring(j+1);
      return sqlString;
    }
    /// <summary>
    /// Query a block of data start from start, to end (inclusive)
    /// </summary>
    /// <param name="start"></param>
    /// <param name="end"></param>
    /// <returns></returns>
    private bool QueryBlock(int start, int end)
    {
      bool incrementalQuery = false;
      string sql;
      if (start == 0)
        sql = sqlStr;
      else if (sqlDiffStrTemplate.Length == 0)
        sql = sqlStr;
      else
      {
        sql = BuildSQLDifferencialString(sqlDiffStrTemplate, 
                                  ds.Tables[TblName].Rows[start-1]);
        incrementalQuery = true;
      }
      try
      {
        OleDbCommand cmd = conn.CreateCommand();
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = sql;
        OleDbDataAdapter da = new OleDbDataAdapter();
        da.SelectCommand = cmd;

        int rowCount = end - start;
        int fillCount, startRecord;
        if (incrementalQuery)
          startRecord = 0;
          startRecord = start;
        fillCount = da.Fill(ds, startRecord, rowCount, TblName);
        return true;
      }
      catch(Exception e)
      {
        error = e.Message;
        return false;
      }
    }
    public DataRow GetRow(int rowIndex)
    {
      if (totalRowCount < 0)
        totalRowCount = RowCount();
      if (totalRowCount < 0)
        return null;
      if (rowIndex >= totalRowCount)
      {
        error = "Row index out of range";
        return null;
      }
      int lastRetrievedIndex = -1;
      if (ds.Tables[TblName] != null) 
        lastRetrievedIndex = ds.Tables[TblName].Rows.Count - 1;
      if (rowIndex > lastRetrievedIndex)
        QueryBlock(lastRetrievedIndex+1, rowIndex + blockSize);
      DataRow row = ds.Tables[TblName].Rows[rowIndex];
      return row;
    }
    public string GetRowAsString(int rowIndex)
    {
      DataRow row = GetRow(rowIndex);
      int columnCount = row.ItemArray.Length;
      StringBuilder resultCat = new StringBuilder();
      for (int j=0; j<columnCount; j++)
      {
        resultCat.Append(row.ItemArray[j].ToString());
        resultCat.Append('\0');
      }
      return resultCat.ToString();  
    }
    /// <summary>
    /// Return the total number rows from the give query. The return -1
    ///  in case of error.
    /// </summary>
    /// <returns></returns>
    public int RowCount()
    {
      if (totalRowCount >= 0)
        return totalRowCount;
      string countSql = sqlStr;
      countSql = countSql.ToUpper();
      int i = countSql.IndexOf("FROM");
      if (i < 0)
        return totalRowCount = -1;
      bool iOpenedConn = false;
      try
      {
        if (conn.State != ConnectionState.Open)
        {
          iOpenedConn = true;
          conn.Open();
        }
        countSql = "SELECT COUNT(*) " + countSql.Substring(i);
        OleDbCommand cmd = conn.CreateCommand();
        cmd.CommandText = countSql;
        totalRowCount = Convert.ToInt32(cmd.ExecuteScalar().ToString());
        if (iOpenedConn)
          conn.Close();
      }
      catch (Exception e)
      {
        error = e.Message;
        totalRowCount = -1;
      }
      return totalRowCount;
    }
    public string GetLastError()
    {
      return error;
    }
  }
}

Back to article

December, 2004: Dynamic Caching & ADO DataSets