Relational Algebra & Metakit

Metakit is an embedded database that encapsulates relational algebra and hierarchical data structures.


December 01, 2004
URL:http://www.drdobbs.com/database/relational-algebra-metakit/184405933

December, 2004: Relational Algebra & Metakit

Relational algebra, calculus, & databases

Brian has been working on biology and chemistry analyses for several years and has developed open-source tools to facilitate work at biotech and academic laboratories. He can be contacted at bkelley@biology .columbia.edu.


Many processing tasks require the use of an algebra to ensure correctness of operations performed on disjoint pieces of data. Relational algebra is a formalized system consisting of sets of data and operations on sets of data. Relational algebra and Relational calculus are the foundation of the Structured Query Language (SQL). In many cases, however, SQL is too removed from the underlying data and hampers the ability for programmers to gain a thorough understanding of their data. Metakit, from Equi4 Software (http://equi4.com/), is an efficient, lightweight, embedded database that encapsulates a combination of relational algebra and hierarchical data structures. Additionally, Metakit comes complete with interfaces to popular scripting languages, such as Python, which combined with Metakit's ability to restructure internal databases on-the-fly, makes a powerful combination for both learning relational algebra and prototyping database applications. Metakit also has a C++ API and Tcl bindings. Because its data files are portable, the library has been used on UNIX, Windows, Macintosh, VMS, and other platforms.

Metakit Backgrounder

Metakit databases are called "storages" and they can either be on disk or completely in-memory. On-disk storages are persistent storages that support transactions. In-memory storages, while not supporting transactions, are useful for both debugging and learning how to use Metakit. One of the niceties about Metakit is that you can use on-disk and in-memory storages concurrently. In fact, Metakit supports operations between multiple on-disk and in-memory storages. Table 1 lists the different types of storages.

Tables are called "views" in Metakit. A view is simply a collection of columns and rows that can be manipulated. A view may be the entire table, a portion of a table, or the result of operations performed on tables. Most views in Metakit are treated equally with respect to set operations, but new rows may only be appended to the base view. The wording "a collection of columns and rows" is used instead of the more natural "rows and columns" because Metakit is based on a column-oriented data model, while most databases favor a row-oriented data model. This has some far-reaching consequences when using and optimizing Metakit queries.

Basically, a view can be considered an N×M matrix where N is the number of rows and M are the number of columns. All entries in the same column have the same type (see Table 2), and the combination of a column name and a column type is called a "property." Columns are always accessed by a name (or property) and rows are accessed by an index. This maps wonderfully into Python's dynamic attribute lookup. For example, to access the property "my_index" for the 10th row of a given view, the Python code is "view[10].my_index" that can be used for both accessing and setting database values.

Relational algebra begins with a base relation, which is simply the initial view, sometimes called a "table" in relational databases. Views are initially created through the storage object. Each table must have a unique name and all of its columns must also have unique names. Metakit ignores case for column and view names so "MyTable" is equivalent to "mytable." As a mild gotcha when defining a storage, the first occurrence of a column name is kept and the duplicate column names—even if they have different types—are ignored.

A typical table creation statement looks like st.getas("mytable[first:S,last:S,age:I]"). Table 2 lists the various data types for a column. After the table has been created, it can be retrieved with either the same getas statement or with the view operator, such as in st.getas("mytable"). Of particular interest is that the getas statement is also how tables are modified; for instance, if the age column becomes redundant, it can be dropped: st.getas("mytable[first:S,last:S]"). New columns can also be added in this fashion. When the storage is committed to disk, the table is altered. Finally, getas is also used to drop or remove tables; simply call getas with the table name: st.getas("mytable"). After the commit, the table is dropped.

Data can be added to views by appending Python dictionaries that define the column names and values, or by adding sequences. Sequence items are added in the same order as the columns defined in the getas statement. Listing One presents examples of creating base views and adding data. Finally, all Metakit views are iterable containers; for example:

for row in view:
print row.first_name, row.last_name

One difference from SQL is that Metakit does not use null values. There are various camps concerning the use of null values in relational algebra. Instead, unsupplied data are supplied with default zero values. In most cases, this is adequate, but if 0 is an acceptable value for an integer column it can be confusing in these cases. Having undefined values is useful but can sometimes lead to problems.

Python API

Metakit uses two methods to specify columns for relational operations. For finding data in a view, Metakit uses method variable names such as view.find(name="Brian"). This finds all rows where the column name equals Brian. The second method is used by specifying a column property, such as when joining two views on a particular column: vw.join(vw2, vw.name). Here, Python is using dynamic introspection of the view vw to figure out that vw actually has a column "name" and returns the appropriate property.

An equivalent method to vw.join(vw2, vw2.name) is vw.join(vw2, metakit.property("name", "S")). For some operations, the methods can take a variable number of arguments. view.structure() returns a list of all properties in a view in the order they were defined in the getas statement. view.properties() returns a dictionary properties keyed on the property name.

There are some issues involved when the column names in Metakit are not valid Python attribute names. One solution to this problem is not to make column names invalid attribute names. Listing Two shows some workarounds to these issues.

SQL and Relational Algebra

Volumes have been written about relational algebra and SQL. Relational algebra and calculus are the foundation of modern relational databases. In "Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions," Journal of the ACM (Volume 29, 1982), Anthony Klug showed that relational algebra and relational calculus are dual in that there is a mapping between them. These foundations form a provable set of maxims that detail how data can be managed. SQL itself is just a loose system operating on relational calculus. Any system that implements a complete relational algebra, therefore, can be used to map to SQL.

Recall that everything begins with a base relation or table. These base relations are extended and combined by computing operations such as projections and joins. In essence, a view is just a collection of sets (rows) of key—value pairs (column and associated data in the row). Views, then, are a group of data that can be formed starting from the base relation and manipulated by any combination of relational operations.

A formal definition of relational algebra requires that all sets (or rows) are unique within the view. Neither SQL nor Metakit completely enforces this; they rely on the fact that the index into the view is a kind of hidden key that enforces uniqueness. However, for some operations such as intersection and difference, Metakit requires sets where each row is unique.

Basic Metakit Set Operations

Set operations in Metakit can be confusing at first. Some operations such as intersect require that all rows in a set are unique. Otherwise, the operation is undefined. Any Metakit view can be made unique by using the unique method as in view.unique(). Also, during most set operations, all columns not in common between sets are populated with the default empty values.

Union returns a view with all rows from both views. The columns chosen are taken from the first view such that a.union(b) contains a view with all columns from a. If view b does not have a column from a, the result column will be populated with default values. Intersection returns all rows in common between two views and requires all rows in the initial views to be unique. Difference returns all rows not in common between two rows. Listing Three shows these basic set operations.

Columns can also be renamed. As long as column types are the same, set operations can be performed on any column with the same name. For example, if there is a column named "id" in view1, and one named "card_id" in view2, an intersection between the two columns can be performed as view1.rename("id","card_id").intersect(view2).

The Cartesian product is one of the most important and confusing relational operations. The Cartesian product forms all permutations of columns between two sets. This is essential when performing complicated queries such as SELECT * from VIEW1,VIEW2 where view1.a=view2.b and view1.c > view2.d". One way to think about a Cartesian product (see Table 3) is a nested for loop as in:

for row1 in view1:
for row2 in view2:
if row1.a == row2.b and row1.c > row1.d:
# found a match so
# do something with row1 or row2

Listing Four (available electronically; see "Resource Center," page 5) shows how to use Cartesian products in combination with view filtering to perform this query.

Relational Operations

Projection takes a view and forms a subset of specific columns associated with the view. A SQL version of restriction looks like SELECT a,b from TABLE. Metakit uses the view's project method function result=table.project(table.a, table.b).

Restriction is similar to projection except that it forms a subset of rows based on some criteria. A SQL version of restriction is SELECT * from TABLE where a=1. Metakit would perform this as result=table.select(a=1). Projection and restriction are normally combined; for example, SELECT a,b from TABLE where a=1 and result=table.select(a=1).project(table.a, table.b).

In many cases, it is desirable that results are sorted in some fashion. SELECT a,b from TABLE order by a is equivalent to result=table.select(a=1).sort(table.a). Both commands return column a from TABLE sorted by column a. Metakit has a slightly odd API for sorting in reverse; see Listing Five.

Perhaps the most ubiquitous relational operator is the join. Data is most often logically organized into different tables. For example, consumable items might exist in a base relation named "consumable" that defines all possible consumable items, and locations might exist in a separate relation. Each consumable and location has a unique identifier (this is a prerequisite for relational algebra as each row or set in a relation must be unique). These base relations are combined in a view "stock" that records that a specific consumable is in a specific location.

To extract the actual consumable item and the location, these base relations need to be joined together. To find all locations of a specific item, the consumable view and the stock view must be joined. SQL would perform this as SELECT item, location FROM Consumable, Stock, Location where Consumable.stock_id = Stock.item_id and Stock.location_id = Location.location_id. Metakit would do this by chaining operations:

inventory = consumable.join(stock, stock.stock_id)
locations = stock.join(location, location.location_id)
result = view.project(view.item, view.location)

This example highlights the chaining of relational operations in Metakit. One of the nice features of this approach is that each of the results from the aforementioned operation can immediately be used in a different operation, which allows for excellent query optimization.

Outer joins are an extension to a normal join where all of the information contained in one of the tables is returned in addition to the result of the query or condition. This is useful in determining which elements from a table do not have a certain criterion. A SQL specification for an outer join looks like SELECT * FROM A LEFT OUTER JOIN B ON A.name=B.name. In SQL, outer joins are specified by either LEFT or RIGHT, where a left outer join would return the result of the query plus the table to the left of the outer join statement. A right join would return the table to the right of the statement. To specify the left outer join specified above for metakit, simply add an outer=1 flag to the join method as in A.join(B, B.name, outer=1). The right outer join would look like B.join(A, A.name, outer=1).

Outer joins potentially return more columns of data than exist in either of the tables being joined. For the rows that are added to the results of the join, these columns will be populated with the default empty values; see Listing Six.

Relational division locates all records in one view that are related to all records in another view. For example, if there are 50 baseball cards in a collection, you might want to find all people who have all of the baseball cards. Metakit supplies a groupby method to help with relation division and related problems.

If you have a table "cards" that contains all of the baseball cards in the collection and a table "cardholder" that has both the name of a person and the card that he holds, you can first group the cardholder table on the person's name: groups = cardholder.groupby(cardholder.name, "cards"). You supply a string value for this operation; namely, "cards." This is because we are creating a new column for our groups table named "cards." This column contains a subview of every row from cardholder, omitting the properties that were grouped on. In this case, the subview holds all the cards owned by each person. To find out who has all the cards, you iterate through the rows in the grouped view and compare them to the cards view:

cardholders = people.groupby(people.name, "cards")
for person in cardholders:
# we use the minus set operator to see if the person has all the cards...
if len(cards.minus(person.cards)) == 0:
print person.name, "has all the cards"

Listing Seven shows a slightly more realistic example using card_ids to keep track of card information. Remember that Metakit views can contain subviews as column properties. This can be an incredibly useful optimization in many cases. Defining a subview can take the place of expensive joins. Listing Eight (available electronically) shows more examples of using subviews.

Finally, there's the relatively tricky SQL statement select * from A where KEY not in (select KEY from B). The goal to this is first to form the set of KEY, where KEY is in A and not in B. There are a few ways of doing this, but a good solution is to take all the KEY values from A and subtract the KEY values from B: akeyNotInB = A.project(A.KEY).minus(B.project(B.KEY)).

Now it is a simple matter to join all these keys with the A table: result = A.join(akeyNotInB, A.KEY). This last example, which can be seen in Listing Nine, shows the power of relational operators and gives a taste of how they can be mapped to SQL statements. In fact, there is already a decent SQL wrapper for Metakit, mksql, that uses many of the techniques described here.

On NULL Values

One big difference between SQL and Metakit is that, similar to strict definitions of relational algebra, Metakit has no notion of a NULL entry for a column. In practice, this becomes an issue in some cases. The problem arises when using Metakit's default values, such as 0 for integers, in a join. SQL does not let NULL values be equivalent to other NULL values, Metakit's default empty values will match other default empty values. When using Metakit's default values as NULL, the view should first be filtered to remove the defaults before using the column in a join.

On Large Datasets

Normal Metakit views do not scale well. To make large views, you can create so-called "blocked" views, which are really collections of many smaller views. All of the base Metakit operations can be used transparently on blocked views except for the creation of hash keys. Please see the documentation available electronically for more information.

Acknowledgments

Metakit was created by Jean-Claude Wippler who graciously released it through an open-source license. The Python interface to Metakit was originally written by Gordon McMillan and has gone through several iterations as the underlying Metakit database has evolved. Currently, there is a group of developers maintaining the source both through a series of testing and code revisions.

DDJ



Listing One

import metakit
# create an in-memory storage
st = metakit.storage()
# create a simple base view
view = st.getas("test[first:S,last:S,age:I]")
# add some data
view.append(("Joe", "Schmoe", 32))
view.append({"first":"Moe", "last":"Schmoe", "age":26})
class Loader:
    def __init__(self, first, last, age):
        self.first = first
        self.last = last
        self.age = age
row = Loader("Zoe", "Schmoe", 22)
view.append(row)
# show the view
metakit.dump(view)
Back to article


Listing Two
import metakit
# create an in-memory storage
st = metakit.storage()
# views can be created with with column names that are not valid Python 
# attribute names. Includes anything that has spaces or starts with a letter
view = st.getas("bad_attributes[bad name1:S,1fake:S]")
view.append(("a", "b"))
view.append(("c", "d"))
view.append(("e", "f"))

metakit.dump(view)

# these attributes cannot be accessed through attributes as in view.1fake
print
print "1fake column that cannot be retrieved as a python attribute"
# for columns that cannot be accessed directly, use the getattr method
for row in view:
    value = getattr(row, "1fake")
    print "1fake", value
# to get all the attributes in a view, use the view.structure() method
print
print "Column Types"
columns = view.structure()
for column in columns:
    print "%s:%s"%(column.name, column.type)
Back to article


Listing Three
import metakit
st = metakit.storage()
vw = st.getas("A[key:I,name:S]")
vw2 = st.getas("B[key:I,org:S]")
vw.append((1, "Dave"))
vw.append((2, "Steve"))
vw.append((3, "Brian"))
vw2.append((1, "Royal Society"))
vw2.append((2, "American Chemical Society"))
vw2.append((4, "PNAS"))
# first form the subview where vw.key == vw2.key only use the key value
keys = vw.join(vw2, vw.key).project(vw.key)
# get all rows from vw that are not in the subview
result = vw.minus(keys.join(vw, vw.key))

print metakit.dump(result)
metakit.dump(vw.join(vw2, vw2.key))
metakit.dump(vw.join(vw2, vw2.key, outer=1))
metakit.dump(vw2.join(vw, vw.key))
metakit.dump(vw2.join(vw, vw.key, outer=1))

# how to find all entries in societies that are not used
metakit.dump(vw2.join(vw, vw.key, outer=1).minus(vw2.join(vw, vw.key)))
Back to article


Listing Four
(available electronically; see "Resource Center," page 5)
Back to article


Listing Five
import metakit

# create an in-memory storage
st = metakit.storage()
view = st.getas("test[item:S,count:I,location:S]")
# add some data
view.append({"item":"Pencil #2 Lead", "count":2, "location":"Cubby 1"})
view.append({"item":"Pencil #2 Lead", "count":1, "location":"Desk 3"})
view.append({"item":"Eraser", "count":1, "location":"Desk 2"})
view.append({"item":"Eraser", "count":2, "location":"Cubby 5"})
# sort the view first ascending by item and descending by count to break ties,
# i.e. we want to have the items in order, but show the highest counts first.

sortedView = view.sortrev([view.item,view.count], [view.count])
metakit.dump(sortedView)
Back to article


Listing Six
"""This example shows various uses of metakit joins"""

import metakit
st = metakit.storage()

# simple join between two tables
# select a,b,c from vw1,vw2 where vw1.a=vw2.a

vw = st.getas("test1[a:I,b:S]")
vw.append((1, "view1"))
vw2 = st.getas("test2[a:I,c:S]")
vw2.append((1, "view2"))
vw2.append((2, "view2"))
vw2.append((2, "view1"))

print "vw"
metakit.dump(vw)
print
print "vw2"
metakit.dump(vw2)

# perform a simple join on the a column
print
print "select * from vw1, vw2 where vw1.a=vw2.a"
result = vw.join(vw2, vw.a)
metakit.dump(result)

# result should be:
#a  b      c    
# -  -----  -----
# 1  view1  view2
# -  -----  -----
# Total: 1 rows
# more complicated join, essentially select vw1.a, vw1.b from vw1,vw2 
# where vw1.b=vw2.c in this case we need to rename vw2's c column
#  to "b" so we can use the join
print
print "select vw1.a, vw1.b from vw1,vw2 where vw1.b=vw2.c"
result = vw.join(vw2.rename("c", "b"), vw.b)
metakit.dump(result)

# result should be
#a  b    
# -  -----
# 1  view1
# -  -----
# Total: 1 rows

print
print "select * from vw right outer join vw2 where vw.a=vw2.a"
print " this keeps all data from all rows in vw2"
print " note the last two rows have no corresponding match in vw"
print " but they are kept anyway"
result = vw2.join(vw, vw2.a, outer=1)
metakit.dump(result)
Back to article


Listing Seven
import metakit
st = metakit.storage()
cards = st.getas("card[card_id:I,team,player]")
people = st.getas("person[name,card_id:I]")
cards.append((1, "Red Sox", "Pedro Martinez"))
cards.append((2, "Red Sox", "Trot Nixon"))
# Steve is a fanatic, he has multiple cards
people.append(("Steve", 1))
people.append(("Steve", 2))
people.append(("Steve", 1))
people.append(("Brian", 2))
cardholders = people.groupby(people.name, "cards")
for person in cardholders:
    if len(cards.project(cards.card_id).minus(person.cards)) == 0:
        print person.name
Back to article


Listing Eight
(available electronically; see "Resource Center," page 5)
Back to article


Listing Nine
import metakit

# we are forming two tables, A and B. A holds the KEY and the people 
# associated with KEY. B holds the KEY and the organization associated with 
# KEY. We want to find all the people in A who do not have an entry in B, i.e.
# select * from A where A.KEY not in (select KEY from B)

st = metakit.storage()
A = st.getas("A[key:I,name:S]")
B = st.getas("B[key:I,org:S]")

A.append((1, "Dave"))
A.append((2, "Steve"))
A.append((3, "Brian"))

B.append((1, "Royal Society"))
B.append((2, "American Chemical Society"))
B.append((4, "PNAS"))

akeyNotInB = A.project(A.KEY).minus(B.project(B.KEY))
metakit.dump(akeyNotInB)

# now join keys back to A so we can get the whole rows back. These relational 
# operators add a lot more flexibility than the SQL statements above.
result = A.join(akeyNotInB, A.KEY)
metakit.dump(result)
Back to article

December, 2004: Relational Algebra & Metakit

Python Code Description
st = metakit.storage() Open an in-memory storage.
st = metakit.storage("mydatabase", 1) Open a disk-based storage for modification. It will be created if it doesn't exist.
st = metakit.storage("mydatabase", 0) Open a disk-based storage for read-only operations.

Table 1: Types of storages.

December, 2004: Relational Algebra & Metakit

Data Type Description
I Adaptive Integer (becomes Python Integer).
L 64-bit integer (becomes Python Long Integer).
F C Float (becomes Python Float).
D C double (is a Python Float).
S C Null terminated string (becomes Python String).
B C array of bytes (becomes Python String).
V Subview. This can be any valid getas statement. All subviews in the same column must have the same structure.

Table 2: Properties.

December, 2004: Relational Algebra & Metakit


  View1    View2       Product

  A  C     B  D        A  C    B  D
  ---------------------------------
  1  2     3  4        1  2    3  4
  5  6     0  2        1  2    0  2
                       5  6    3  4
                       5  6    0  2

Table 3: Cartesian product.

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.