Have you ever found yourself writing a program where a simple B-Tree library didn't seem enough, but a large RDBMS was overkill? What you need in these situations is a library that lets your program handle its data all by itself, but do a little more than just store and fetch. Perhaps you need to write a throw-away script to do some text or log-file processing, but you have a huge mountain of data to slog through and complex calculations to go with it. Or your program needs to work equally well on Windows, UNIX/BSD/Linux, Mac, and perhaps a couple of embedded platforms.
What you need in these situations is SQLite, an open-source embedded relational database system packed into a small C library. It is ACID compliant; supports a large subset of SQL92 along with indexes, transactions, views, triggers, and in-memory databases; and is accessible through a wide variety of interfaces. Currently, there are SQLite interfaces for ODBC, .NET, Perl, Python, Java, Tcl, Ruby, Delphi, Objective C, PHP, Visual Basic, and languages you may have never heard of. SQLite is compact, implemented in a single library of fewer than 25,000 lines of ANSI C, and its source code is not copyrighted and is free to use for any purpose. It is also fast, portable, and scalable; it runs on Windows, Linux, BSD, Solaris, and OS X, and has been ported to both embedded systems and mainframes. Its database format is binary compatible between machines with different byte orders and scales up to 2 terabytes (2^41 bytes) in size. The source code, precompiled binaries, documentation, and other SQLite information are at http://www.sqlite.org/.
SQLite is ideal for managing and processing data within standalone applications that either don't need or cannot connect to multiuser databases. It improves upon simple B-Tree databases such as gdbm by adding relational capabilities, while offering the same functionality if you need it. Since it is the perfect companion to a capable scripting language, whipping up a Perl or Python script to aggregate, slice, and dice data can be done with minimal time and effort.
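As a rough illustration of the scripting use case, here is a minimal sketch that loads a handful of log records into an in-memory database and aggregates them. It assumes Python's standard sqlite3 module, which wraps SQLite 3 rather than the version discussed in this article; the log fields and the name bytes_per_host are made up for the example.

```python
import sqlite3

# Hypothetical log records: (host, HTTP status, bytes sent).
LOG_ROWS = [
    ("web1", 200, 512),
    ("web1", 500, 0),
    ("web2", 200, 2048),
    ("web2", 200, 1024),
]


def bytes_per_host(rows):
    """Load rows into an in-memory database and total the bytes of
    successful requests for each host."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (host TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT host, COUNT(*), SUM(bytes) FROM log "
        "WHERE status = 200 GROUP BY host ORDER BY host"
    ).fetchall()
    conn.close()
    return result
```

Calling bytes_per_host(LOG_ROWS) returns one (host, request count, byte total) tuple per host.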
SQLite was originally developed by D. Richard Hipp to replace the need for large commercial database servers in standalone applications. Hipp wanted to produce a self-contained program that could run anywhere, regardless of what other software was (or wasn't) installed on the host system. SQLite 1.0 used GNU's gdbm B-Tree library as its storage manager. For licensing and performance reasons, Hipp then replaced gdbm with his own B-Tree implementation that supported transactions and stored records in key order, which allowed for optimizations such as logarithmic-time MIN() and MAX() functions, and indexed queries with inequality constraints. Since then, SQLite has grown considerably in both features and users. As the author of the Python extension to SQLite (PySQLite) along with Gerhard Haring, I have been amazed to see more than 7000 downloads of our extension on SourceForge. SQLite is currently the highest rated database on Freshmeat.net.
SQLite consists of eight layers that work together to take a query and produce a useful result from a database, in the form of either an alteration or a materialized result. The first four layers (interface, parser, tokenizer, and code generator) take the query and turn it into a mini-program. This program is written in a kind of assembly language, which is passed to the next layer, called the "virtual database engine" (VDBE). The VDBE, SQLite's virtual machine, is designed specifically for database operations and is the sole means through which all queries are processed. Everything that can be done in SQLite can be expressed as a series of the VDBE's 128 op codes, from opening files, reading indexes, and processing records to firing triggers, manipulating schema, and committing transactions. One by one, the VDBE executes each step in the mini-program, eventually fulfilling the query's request. If you had the patience and inclination, you could actually write your own VDBE program to fulfill your requests rather than use SQL.
Listing 1 is the VDBE program generated to execute a select statement. For every SQL statement, you can obtain the generated VDBE program that fulfills it by prefacing it with EXPLAIN. The VDBE mini-program orchestrates all layers below it, which consist of the storage system, page cache, and OS abstraction layer. The storage system is an efficient B-Tree implementation based on that described by Donald Knuth. The page cache is an adjustable region of memory that SQLite uses to store frequently used pages in order to minimize disk seeks. At the bottom is the OS abstraction layer, which serves to group all OS idiosyncrasies in one place and fit SQLite to the various architectures and operating systems it supports.
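To see a generated program for yourself, something along these lines should work. This is only a sketch that assumes Python's standard sqlite3 module, which wraps SQLite 3; the opcode set and the column layout of the EXPLAIN output differ from the version described here, and the employee table and the vdbe_opcodes helper are hypothetical.

```python
import sqlite3


def vdbe_opcodes(sql):
    """Return the opcode names of the VDBE program generated for sql."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
    # Each row returned by EXPLAIN is one VDBE instruction. In SQLite 3,
    # column 0 is the instruction address and column 1 the opcode name.
    rows = conn.execute("EXPLAIN " + sql).fetchall()
    conn.close()
    return [row[1] for row in rows]


opcodes = vdbe_opcodes("SELECT name FROM employee WHERE id = 1")
for address, opcode in enumerate(opcodes):
    print(address, opcode)
```

The exact instructions printed depend on the SQLite version linked into Python, so treat the output as something to explore rather than rely on.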
Programming with SQLite
There are several ways you can program with SQLite in C/C++. One approach is to use the ODBC interface developed by Christian Werner. However, SQLite includes a C API that requires only about three to five functions to do everything.
Over the years, three variations have taken form within SQLite's C interface. The original interface used a callback function. When you executed a query, you first registered this function, which would be called for every row fetched in the query. On top of this was a wrapper that hid the callback function and felt more like most other client APIs. Later versions of SQLite brought an improved API that is a happy medium between the previous two, but more flexible and intuitive. This API is now considered to be the standard, and is the one I cover in this article.
There are five steps to working with SQLite:
- Connect to a database.
- Prepare a query.
- Process the results.
- Finalize the query.
- Disconnect from the database.
Listing 2 is a complete example that illustrates these steps. SQLite databases are maintained in a single file; that is, all objects associated with a particular database (indexes, tables, schema, triggers, and so on) are packaged together in a single operating system file. This is the file you connect to when you call sqlite_open.
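The same prepare, process, and finalize cycle, bracketed by opening and closing the database file, can be sketched in a scripting language. The example below assumes Python's standard sqlite3 module (a wrapper over SQLite 3, not the C API covered in this article); the employee table and the fetch_names function are hypothetical.

```python
import sqlite3


def fetch_names(path=":memory:"):
    """Open a database, run one query, and return the fetched names."""
    conn = sqlite3.connect(path)  # open the single database file
    conn.execute("CREATE TABLE IF NOT EXISTS employee (name TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?)", [("ann",), ("bob",)])

    cur = conn.cursor()
    cur.execute("SELECT name FROM employee ORDER BY name")  # prepare the query
    names = [row[0] for row in cur]  # process the results row by row
    cur.close()                      # finalize the query

    conn.close()                     # close the database
    return names
```

Here the cursor plays roughly the role that the compiled statement plays in the C interface.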
SQLite has built-in functions, such as avg(), sum(), min(), max(), and count(), to name a few. All of these are implemented using an API that you can use to extend SQLite, creating your own custom functions and aggregates, which can be called from within its SQL. For example, you could add support for obtaining the system time and do things such as select CURRENT_TIME(). Listing 3 illustrates extending SQLite's statistical functions to include computing the area under a Gaussian distribution for a given mean and variance. Before the extension function can be used, it must first be registered in the database, as in Listing 4.
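For a feel of what registering such a function involves, here is a minimal sketch using Python's standard sqlite3 module rather than the C extension API. The SQL name normal_cdf, its argument order, and the helper open_with_normal_cdf are assumptions made for this example; they are not taken from Listing 3 or Listing 4.

```python
import math
import sqlite3


def normal_cdf(x, mean, variance):
    """Area under a Gaussian with the given mean and variance,
    integrated from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def open_with_normal_cdf(path=":memory:"):
    """Open a database and register normal_cdf as a 3-argument SQL function."""
    conn = sqlite3.connect(path)
    conn.create_function("normal_cdf", 3, normal_cdf)
    return conn


conn = open_with_normal_cdf()
half = conn.execute("SELECT normal_cdf(0.0, 0.0, 1.0)").fetchone()[0]
conn.close()
```

As with the C interface, the registration applies only to the connection on which it was made, so it has to be repeated each time the database is opened.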
You can also create your own aggregates in SQLite using a similar approach. In this case, you register two functions: one is called for each record returned in the set, and the other to perform the final computation, which is called at the end of the set. Examples of implementing aggregates can be found in the SQLite source file func.c.
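The step-and-finalize pattern looks much the same from a scripting language. The sketch below assumes Python's standard sqlite3 module, where the two callbacks are methods of a class; the Variance aggregate and the sample table are hypothetical.

```python
import sqlite3


class Variance:
    """Population variance, accumulated one row at a time."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def step(self, value):
        # Called once for each row in the set; NULLs are skipped.
        if value is None:
            return
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def finalize(self):
        # Called once at the end of the set to produce the result.
        if self.n == 0:
            return None
        mean = self.total / self.n
        return self.total_sq / self.n - mean * mean


def sample_variance(values):
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("variance", 1, Variance)
    conn.execute("CREATE TABLE sample (x REAL)")
    conn.executemany("INSERT INTO sample VALUES (?)", [(v,) for v in values])
    result = conn.execute("SELECT variance(x) FROM sample").fetchone()[0]
    conn.close()
    return result
```

A fresh Variance object is created for each aggregate evaluation, so state does not leak between queries.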
Like most databases, SQLite supports auto-increment columns, retrieval of the generated values, and binary data (BLOBs). In terms of auto-increment columns, SQLite is similar to MySQL. If you declare a column with type INTEGER PRIMARY KEY and no value is specified for it on INSERT, SQLite automatically assigns a value one greater than the largest value currently in that column. The value used in this case can then be obtained with the function sqlite_last_insert_rowid(). Binary data is handled with two functions: sqlite_encode_binary() and sqlite_decode_binary().
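A short sketch of both features follows, assuming Python's standard sqlite3 module. Note the assumption loudly: that module wraps SQLite 3, which stores binary values directly, so the encode and decode helpers named above are not used here, and the generated key is read from the cursor's lastrowid attribute. The image table and store_blobs function are hypothetical.

```python
import sqlite3


def store_blobs():
    """Insert two binary values and report the keys SQLite assigned."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, data BLOB)")

    # No id is supplied, so SQLite picks the next value itself.
    first = conn.execute(
        "INSERT INTO image (data) VALUES (?)", (b"\x00\x01\xff",)
    ).lastrowid
    second = conn.execute(
        "INSERT INTO image (data) VALUES (?)", (b"\x89PNG",)
    ).lastrowid

    stored = conn.execute(
        "SELECT data FROM image WHERE id = ?", (first,)
    ).fetchone()[0]
    conn.close()
    return first, second, stored
```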
SQLite supports transactions (although currently not nested transactions) through the use of a journal file. As records are modified, the database pages containing the original values are swapped to the journal file. In the event of a rollback, SQLite copies the original pages back into the main database file. This approach also allows for automatic recovery in the event of system crashes. Each time a client connects to a database, SQLite first looks for an associated journal file. If one is found, SQLite assumes a crash has taken place and proceeds to restore the contents of the journal to the database. Once completed, the client is then allowed to work with the database.
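From the client's point of view, a rollback simply makes the changes disappear. Here is a minimal sketch, assuming Python's standard sqlite3 module with isolation_level=None so that BEGIN and ROLLBACK are issued explicitly; the account table is hypothetical, and because the database is in memory no journal file appears on disk.

```python
import sqlite3


def balance_after_rollback():
    """Change a row inside a transaction, roll back, and read it again."""
    # isolation_level=None stops the module from opening transactions
    # on its own, so the BEGIN and ROLLBACK below are the only ones.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE account (name TEXT, balance INTEGER)")
    conn.execute("INSERT INTO account VALUES ('ann', 100)")

    conn.execute("BEGIN")
    conn.execute("UPDATE account SET balance = balance - 250 WHERE name = 'ann'")
    inside = conn.execute("SELECT balance FROM account").fetchone()[0]
    conn.execute("ROLLBACK")  # the original value is restored

    after = conn.execute("SELECT balance FROM account").fetchone()[0]
    conn.close()
    return inside, after
```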
SQLite includes a nonstandard feature (clause) to arbitrate conflict resolution. A conflict in this sense occurs in the event of a constraint violation. The default behavior (ABORT) is to undo all changes made by the statement and then proceed with the transaction. You can replace the default ABORT with REPLACE, IGNORE, FAIL, or ROLLBACK (listed in order of severity). From the documentation, REPLACE works as follows:
When a UNIQUE constraint violation occurs, the preexisting rows that are causing the constraint violation are removed prior to inserting or updating the current row. Thus the insert or update always occurs. The command continues executing normally. No error is returned. If a NOT NULL constraint violation occurs, the NULL value is replaced by the default value for that column. If the column has no default value, then the ABORT algorithm is used.
IGNORE causes the conflicting operation to simply be skipped, and the operations in the SQL statement continue. For example, if the 100th record modified in an update statement encounters a constraint violation, then it proceeds to record 101 and keeps on going without a peep. FAIL halts the statement but preserves the previous 99 updates. ABORT is like FAIL but restores all previous 99 updates to their original values. Finally, ROLLBACK halts the statement and aborts the entire transaction.
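The differences are easiest to see on a single statement that touches several rows. The sketch below assumes Python's standard sqlite3 module and a SQLite 3 build recent enough to accept multi-row VALUES lists; the emp table and the names_after helper are hypothetical. Each call inserts 'bob', then a duplicate 'alice', then 'carol' under the given policy.

```python
import sqlite3


def names_after(policy):
    """Return the table contents after a three-row insert whose second
    row violates a UNIQUE constraint, using the given conflict policy."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE emp (name TEXT UNIQUE)")
    conn.execute("INSERT INTO emp VALUES ('alice')")
    try:
        conn.execute(
            f"INSERT OR {policy} INTO emp VALUES ('bob'), ('alice'), ('carol')"
        )
    except sqlite3.IntegrityError:
        pass  # ABORT and FAIL report the violation as an error
    rows = [r[0] for r in conn.execute("SELECT name FROM emp ORDER BY name")]
    conn.close()
    return rows


for policy in ("IGNORE", "FAIL", "ABORT", "REPLACE"):
    print(policy, names_after(policy))
```

IGNORE skips the duplicate and carries on to 'carol', FAIL stops at the duplicate but keeps 'bob', and ABORT undoes the whole statement.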
Conflict resolution can be defined in three scopes: object creation (tables and indexes), transaction, and statement. If a statement has no resolution defined, SQLite looks to the transaction. If nothing is defined there, it looks to the object. If undefined still, it defaults to ABORT. In Listing 5, I create an employee table with a unique name field and set its conflict resolution to ROLLBACK. I populate it with three employees, then try to insert the third again. Not only does SQLite forbid this, it also aborts the transaction, as ROLLBACK should do. Next, I override the table's conflict resolution by setting a different resolution at transaction level. This time it works: SQLite deletes the impeding record and replaces it with the contents of the INSERT statement. Finally, I do this again, but play around with setting resolution at statement level.
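Two of these scopes can be sketched as follows, assuming Python's standard sqlite3 module. An important caveat: that module wraps SQLite 3, which to my knowledge keeps only the object-level and statement-level clauses, so the transaction-level setting described above is not shown. The employee table here is a hypothetical stand-in for the one in Listing 5.

```python
import sqlite3


def scope_demo():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    # Object scope: violations of this constraint roll back the transaction.
    conn.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT UNIQUE ON CONFLICT ROLLBACK)"
    )

    conn.execute("BEGIN")
    conn.execute("INSERT INTO employee (name) VALUES ('ann')")
    try:
        conn.execute("INSERT INTO employee (name) VALUES ('ann')")
    except sqlite3.IntegrityError:
        pass
    still_open = conn.in_transaction  # the transaction is gone
    after_rollback = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

    # Statement scope: OR REPLACE overrides the table's ROLLBACK policy.
    conn.execute("INSERT INTO employee (name) VALUES ('ann')")
    conn.execute("INSERT OR REPLACE INTO employee (name) VALUES ('ann')")
    after_replace = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

    conn.close()
    return still_open, after_rollback, after_replace
```

The first duplicate takes the earlier insert down with it, while the second replaces the existing row and leaves exactly one 'ann' in the table.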
Triggers can be written for INSERT, DELETE, and UPDATE operations, including the update of specific columns. Listing 6 illustrates the use of SQLite triggers. Conflict resolution is also applicable here, although it might throw you. Resolution can be specified in the trigger statements; however, resolution in the calling statement, if defined, will take precedence. From there resolution proceeds up the chain as explained earlier. While SQLite does not support materialized views, triggers may be defined on views so that they appear as modifiable. In this case, the modification is defined solely by the logic set forth in the trigger. That is, no modification to the base tables is performed other than what the trigger is programmed to do. Listing 7 illustrates creating a view and an update trigger on top of it.
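A compact sketch of both kinds of trigger is given below, assuming Python's standard sqlite3 module. The schema and trigger names are hypothetical and are not those of Listing 6 or Listing 7. One trigger audits salary changes on a table; the other makes a view appear updatable by doing the base-table update itself.

```python
import sqlite3

SCHEMA = """
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
CREATE TABLE audit (employee_id INTEGER, old_salary INTEGER, new_salary INTEGER);

-- Fires only when the salary column is the one being updated.
CREATE TRIGGER log_raise AFTER UPDATE OF salary ON employee
BEGIN
    INSERT INTO audit VALUES (old.id, old.salary, new.salary);
END;

CREATE VIEW employee_names AS SELECT id, name FROM employee;

-- The view is changed only in the way this trigger spells out.
CREATE TRIGGER rename_employee INSTEAD OF UPDATE OF name ON employee_names
BEGIN
    UPDATE employee SET name = new.name WHERE id = old.id;
END;
"""


def run_trigger_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO employee VALUES (1, 'ann', 100)")
    conn.execute("UPDATE employee SET salary = 120 WHERE id = 1")
    conn.execute("UPDATE employee_names SET name = 'anne' WHERE id = 1")
    audit = conn.execute("SELECT * FROM audit").fetchall()
    name = conn.execute("SELECT name FROM employee WHERE id = 1").fetchone()[0]
    conn.close()
    return audit, name
```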
SQLite comes with an array of pragmas that can be used to set various aspects of runtime behavior. Setting a pragma is done by executing it as SQL, such as PRAGMA vdbe_trace=ON. There are pragmas for performance tuning, such as cache_size and synchronous. cache_size controls how much memory SQLite allocates for its page cache. The larger the cache, the more pages are kept in RAM, which helps reduce disk seeks and therefore increase overall performance. synchronous controls whether or not SQLite flushes data to disk at critical moments such as on transaction commits. Turning it off reduces disk writes (increasing speed) but does so at the risk of corrupting the database in the event of a system crash or power loss.
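Setting and reading back the two tuning pragmas might look like the sketch below. It assumes Python's standard sqlite3 module and SQLite 3, where a pragma issued without a value reports the current setting and synchronous is reported as a number (0 for OFF); the tune function and the values chosen are only illustrative.

```python
import os
import sqlite3
import tempfile


def tune(path):
    """Apply two performance pragmas and read their values back."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA cache_size = 4000")  # number of pages to cache
    conn.execute("PRAGMA synchronous = OFF")  # faster, but unsafe on power loss
    cache = conn.execute("PRAGMA cache_size").fetchone()[0]
    sync = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()
    return cache, sync


with tempfile.TemporaryDirectory() as tmp:
    settings = tune(os.path.join(tmp, "tuned.db"))
```

Both settings apply to the connection that issued them, so a program that depends on them should set them right after connecting.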
There are pragmas to control what kind of information is returned to the C API client functions, such as full_column_names, which qualifies column names with their table names, and show_datatypes, which returns column type information with fetched records. There are pragmas for maintenance, debugging, and other tasks. One final thing to note about pragmas is that they differ in scope. Some affect only the current session; others affect the database and all subsequent sessions. Often a given setting has two pragmas, one for session scope and one for database scope. For example, default_synchronous affects the entire database and all sessions that connect to it, while synchronous affects only the current session.
While SQLite is scalable with respect to database size (up to 2 terabytes), it is not designed for high concurrency. It has coarse-grained locking at the database level that allows either a single writer or multiple readers. Thus, while you could get high concurrency with read-only applications, such is not the case for writes. It is possible for multiple clients to be connected to a single database, but each writer will block so long as another writer has the database locked for writing. According to a poll on the SQLite list, most users found SQLite to be fast enough that write blocking is negligible, and the preference was not to complicate the code by adding finer-grained concurrency.
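The single-writer rule can be observed with two connections to the same file. The sketch below assumes Python's standard sqlite3 module and SQLite 3's default journal mode; timeout=0 makes a blocked connection fail at once instead of waiting, and BEGIN IMMEDIATE is SQLite 3 syntax for claiming the write lock up front. The function name and table are hypothetical.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        writer = sqlite3.connect(path, timeout=0, isolation_level=None)
        other = sqlite3.connect(path, timeout=0, isolation_level=None)
        writer.execute("CREATE TABLE t (x INTEGER)")

        writer.execute("BEGIN IMMEDIATE")  # first writer takes the lock
        writer.execute("INSERT INTO t VALUES (1)")
        try:
            other.execute("BEGIN IMMEDIATE")  # second writer must wait
            blocked = False
        except sqlite3.OperationalError:
            blocked = True
        writer.execute("COMMIT")  # lock released

        other.execute("BEGIN IMMEDIATE")  # now it goes through
        other.execute("INSERT INTO t VALUES (2)")
        other.execute("COMMIT")
        count = other.execute("SELECT COUNT(*) FROM t").fetchone()[0]

        writer.close()
        other.close()
        return blocked, count
```

In a real program you would normally leave the timeout at a nonzero value so that short write locks are simply waited out.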
SQLite does not enforce data types in any way. While there has been some healthy debate on this topic, it is in the final analysis considered to be a feature. There are some instances (such as sorting) in which SQLite does distinguish between text and numbers and, within numbers, between integers and floating-point values. This is done automatically based on the values present in the column, not on the type declared in the schema. You might ask then what the types declared in the schema actually do. The short answer is nothing. However, the type names you declare there are passed to your program through sqlite_step, as in Listing 2, where you can cast the data to whatever type you like. Still, it is important to remember that there is no type checking, so it is up to you to ensure that the string value for a column declared as float is in fact capable of being converted to a float. SQLite lets you declare that column as type donkey, if you wish. As far as it is concerned, what you do with the text representation of a donkey type is your business.
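You can try the donkey experiment yourself. The sketch below assumes Python's standard sqlite3 module, and one difference should be stated plainly: that module wraps SQLite 3, which stores values with their own types and applies a loose "affinity" derived from the declared type name, so it is not identical to the behavior described above. It still accepts any type name and any value. The stable table is hypothetical.

```python
import sqlite3


def donkey_types():
    """Store mixed values in a column of a made-up type and inspect them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stable (resident donkey)")
    conn.executemany(
        "INSERT INTO stable VALUES (?)", [("hee-haw",), (42,), (3.5,)]
    )
    # typeof() reports the type of each stored value, not the declared type.
    rows = conn.execute(
        "SELECT resident, typeof(resident) FROM stable ORDER BY resident"
    ).fetchall()
    conn.close()
    return rows
```

The ordering of the result shows the sorting distinction mentioned earlier: numbers are compared numerically and sort ahead of text.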
Another limitation of SQLite is that it does not have the sophisticated planners and optimizers that you might find in large multiuser databases. Thus, you must take a more active role in tuning large and/or complex queries that perform many joins. Summing up, all of these limitations should not be too surprising, as SQLite is an embedded database meant to serve programs, not users. Thus, while it is faster than most other relational databases for many operations, it's not realistic to say that SQLite is therefore a suitable replacement for those databases. It simply depends on what you are trying to accomplish.
Given its speed, portability, small footprint, easy-to-use APIs, powerful features, language support, and liberal license, SQLite is a tool all programmers should have in their arsenals. It is a unique open-source project that has done much to address the need for simple storage and data management for applications of every stripe, big and small, working in many different environments.
Michael Owens is a chemical engineer turned programmer and coauthor of PySQLite, the Python extension to SQLite. He can be contacted at [email protected].