
Theorem Proving and Database Querying


Dr. Dobb's Journal August 1998: Algorithm Alley

Sergei is a graduate student at McGill University, School of Computer Science. He can be reached at [email protected].


Historically, the major objections to AI techniques have been performance related, but increasingly powerful hardware is slowly erasing that concern. Today, developers are more concerned that software be modular, flexible, and easily updated. One approach is to use general-purpose inference techniques that let your program derive a conclusion from a list of facts. This lets you simply edit a fact database when conditions change, without having to manually alter custom logic. As Sergei points out this month, this approach can also isolate your user interface from the technical details of the underlying implementation.

-- Tim Kientzle

The trend toward smarter software is most obvious in game-playing programs such as IBM's Deep Blue chess computer. However, users are increasingly insisting that all software be flexible, which makes artificial-intelligence techniques more important to all programmers. The typical objection to such techniques -- that they are resource intensive -- is less of an issue with today's more powerful hardware.

Automated theorem proving is one such technique that has been heavily used in research labs. Essentially, an automatic theorem prover takes a certain set of facts and builds a "proof" -- a series of steps for obtaining or validating a desired result or theorem.

Theorem provers can be used to efficiently query relational databases. The initial facts specify the layout of various databases, and the desired result is a particular query. The proof explains how to use the available databases to answer the query.

Database querying relies on a particular mathematical framework called "relational algebra," which I'll discuss in this article. First, however, I'll examine two other general-purpose mathematical frameworks and show how automated theorem provers work for those systems.

Proving Theorems in Propositional Logic

There are different ways to write logical statements, each with its own peculiar strengths and weaknesses. One of the simplest is propositional logic, which uses atoms -- usually represented by single letters -- to represent facts or propositions. Each proposition can be either true or false. Multiple propositions can be combined using logic functions. For example, "if it is raining then the ground is wet" can be expressed as a-b, where a stands for "it is raining," b for "the ground is wet," and - represents "consequence" or "if...then." I'll also use ~ for "not," & for "and," | for "or," and = for "equivalence" ("if and only if").

To prove a theorem in propositional logic, you start with axioms you assume are true, and use a set of inference rules to derive the result you want. For example, if the statements "it is raining" (a) and "if it is raining then the ground is wet" (a-b) are axioms, you can easily prove the theorem "the ground is wet" (b).

A common approach to theorem proving is to rewrite all statements as groups of clauses combined with and. A clause is a group of atoms or negated atoms combined with or. This corresponds to conjunctive normal form, and any propositional statement can be written in this fashion. After rewriting everything in conjunctive normal form, you can replace each statement a&b with two separate statements a and b. So every statement can be replaced with a set of clauses.

Listing One presents a set of standard transformations that can reduce any statement to this form, while Listing Two shows how these transformations reduce (~(a-b))|c to the two clauses (a|c) and (~b|c).

Once everything is in clause form, you can use a rule called "binary chaining" -- if you have clauses of the form ~a|b and ~b|c, you can add a new clause ~a|c. Intuitively, this resolution rule is saying that if b follows from a and c follows from b, you can conclude that c follows from a. Similarly, if you have clauses b and ~b, you've found a contradiction, and your search can stop.
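
To make this concrete, here is a minimal C++ sketch of the chaining step (an illustration of my own, not code from the sample prover prop.cc). It assumes a clause is stored as a set of signed integers, with +n encoding atom n and -n encoding its negation:

#include <optional>
#include <set>

// A clause is a set of literals: +n encodes atom n, -n its negation.
using Clause = std::set<int>;

// Try to resolve two clauses on one complementary pair of literals.
// Returns the resolvent, or nothing if no complementary pair exists.
std::optional<Clause> resolve(const Clause& a, const Clause& b)
{
    for (int lit : a) {
        if (b.count(-lit)) {                    // found p in a and ~p in b
            Clause r;
            for (int x : a) if (x != lit)  r.insert(x);
            for (int x : b) if (x != -lit) r.insert(x);
            return r;                           // empty clause = contradiction
        }
    }
    return std::nullopt;
}

Resolving {-1,2} (that is, ~a|b) with {-2,3} (~b|c) yields {-1,3} (~a|c), and resolving {2} with {-2} yields the empty clause that stops the search.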

Generally, it is easier to look for a proof by contradiction; that is, to negate the theorem and look for a resulting contradiction with the axiom set. If one is located, then the original theorem must be true. You may sometimes find a contradiction without using the clauses of the negated theorem. This signals that the axiom set is inconsistent and contains internal contradictions.

The general strategy, then, is to take the original axioms and the negated theorem, convert them into clause form, and resolve them until you find a contradiction (an empty clause). Although conceptually simple, this process can generate an exponentially growing number of clauses as the search proceeds. Instead of a breadth-first search where you search one level at a time, many provers use depth-first or iteratively deepening depth-first searches. In the first case, you keep reusing the clause you just found until nothing new can be created or the maximal preset depth has been reached. When you reach a dead end, backtrack and try another direction. Although this doesn't require a lot of space, it can take a long time to find easy proofs that happen to use clauses at the end of your list. Iteratively deepening depth-first search avoids this by doing a depth-first search with maximum depth one, then repeating with successively larger maximum depths. Consecutive iterations repeat a lot of work, but the space requirements are minimal and you can avoid fruitless deep searches.
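
The iterative-deepening strategy is easy to express on top of the resolve() function sketched earlier (again, an illustrative sketch rather than the actual prover):

#include <algorithm>
#include <vector>

// Depth-limited search: try every resolvable pair, add the resolvent,
// recurse, and backtrack. Returns true once the empty clause appears.
bool proveToDepth(std::vector<Clause>& clauses, int depth)
{
    if (depth == 0) return false;
    for (size_t i = 0; i < clauses.size(); ++i)
        for (size_t j = i + 1; j < clauses.size(); ++j) {
            auto r = resolve(clauses[i], clauses[j]);
            if (!r) continue;
            if (r->empty()) return true;        // contradiction found
            if (std::find(clauses.begin(), clauses.end(), *r)
                    != clauses.end()) continue; // nothing new
            clauses.push_back(*r);              // explore this branch...
            if (proveToDepth(clauses, depth - 1)) return true;
            clauses.pop_back();                 // ...then backtrack
        }
    return false;
}

// Iterative deepening: repeat the depth-first search with growing limits.
// Early passes are repeated, but memory use stays small and short proofs
// are found before the search commits to a fruitless deep branch.
bool prove(std::vector<Clause> clauses, int maxDepth)
{
    for (int d = 1; d <= maxDepth; ++d)
        if (proveToDepth(clauses, d))
            return true;
    return false;                               // no proof within maxDepth
}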

Listing Three is the output of a sample prover that first transforms the statements into clause form and then searches for a proof. This simplified prover demands that the input statements be fully bracketed; the last logic statement represents the negated theorem. The program that generated this output is prop.cc (available electronically; see "Resource Center," page 3).

One way to narrow the search is to identify and delete redundant clauses. For example, if you have a clause h and a clause h|g, the second clause can be deleted, since the first one will always contribute to a shorter proof. (If h is true, h|g must also be true whatever g is.) A clause b|b can be replaced with just b, since the two are logically equivalent. A clause b|~b (for instance, "it is raining or it is not raining") is always true and can be safely discarded.
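
With the same set-of-literals representation as before, these deletion rules take only a few lines (duplicate literals such as b|b collapse automatically, because a clause is a set):

#include <algorithm>

// A clause subsumes another if its literals are a subset of the other's:
// keeping h makes h|g redundant. Sets iterate in sorted order, so
// std::includes can test the subset relation directly.
bool subsumes(const Clause& sub, const Clause& sup)
{
    return std::includes(sup.begin(), sup.end(),
                         sub.begin(), sub.end());
}

// A clause containing both p and ~p (for example, b|~b) is always true
// and can be discarded without affecting the search.
bool isTautology(const Clause& c)
{
    for (int lit : c)
        if (c.count(-lit)) return true;
    return false;
}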

Proving Theorems in Predicate Logic

With just propositional logic, you can't say "For all odd numbers x, x plus one is even." Predicate logic extends propositional logic with quantifiers ("for all" and "there exists"), quantification variables (x in the example just given), predicates ("x is even"), and functions ("x plus one"). For example, the earlier statement can be written (for all x) { o(x)-e(p(x)) }. Here o(x) means "x is odd," e(x) means "x is even," and p(x) adds 1 to x.

An interesting detail is that predicate logic is usually written without constants. Instead, you can use functions that have no arguments, which are essentially the same. This reduces the number of different kinds of entities needed to handle predicate logic.

The basic theorem-proving approach used for propositional logic also works here. To reduce everything to conjunctive normal form, besides performing the same transformations as for propositional formulas, you also need to distribute nots over quantifiers:

  • ~(for all x) p(x) is substituted by (there exists x) ~p(x).
  • ~(there exists x) p(x) is substituted by (for all x) ~p(x).

The next step is to eliminate the quantifiers. "For all" can simply be dropped, since any statement p(x) is assumed to be true for all values of x. To remove "there exists," you add new constants: saying "there exists an x such that p(x)" is the same as saying p(a) is true for some new constant a. (This process is known as "Skolemization." Strictly speaking, if the "there exists" appears inside the scope of a "for all," the new symbol must be a function of the enclosing universal variables rather than a plain constant; the simplified prover presented here leaves that case to the user.)

Once everything is reduced to clauses, including the negated theorem, you can use binary chaining with a few twists. For instance, the clauses a(v())|b(c()) and ~a(x)|c(x) can be resolved to b(c())|c(v()) if you first substitute v() for x. You have to consider possible variable substitutions when you check to see if two terms have something in common.

Unfortunately, binary chaining isn't sufficient. You also need to consider factoring. In propositional logic, factoring b|b to just b is easy. But in predicate logic, you again have to consider variable substitutions: a(x)|a(b()) can be factored to a(b()).
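
The machinery behind these substitutions is unification. Here is a minimal C++ sketch; the Term representation and the names are mine, not the sample prover's. As in the article's notation, constants are treated as functions with no arguments:

#include <map>
#include <string>
#include <vector>

// A term is either a variable or a function symbol applied to arguments.
struct Term {
    bool isVar;
    std::string name;
    std::vector<Term> args;         // empty for variables and constants
};

using Subst = std::map<std::string, Term>;

// Follow variable bindings until reaching an unbound variable or a function.
const Term& walk(const Term& t, const Subst& s)
{
    if (t.isVar) {
        auto it = s.find(t.name);
        if (it != s.end()) return walk(it->second, s);
    }
    return t;
}

// Does variable v occur inside term t? Binding v to such a term would
// create an infinite term, so unification must fail (the "occurs check").
bool occurs(const std::string& v, const Term& t, const Subst& s)
{
    const Term& u = walk(t, s);
    if (u.isVar) return u.name == v;
    for (const Term& a : u.args)
        if (occurs(v, a, s)) return true;
    return false;
}

// Extend substitution s so that terms a and b become equal, if possible.
bool unify(const Term& a, const Term& b, Subst& s)
{
    const Term& x = walk(a, s);
    const Term& y = walk(b, s);
    if (x.isVar && y.isVar && x.name == y.name) return true;
    if (x.isVar) {
        if (occurs(x.name, y, s)) return false;
        s[x.name] = y;
        return true;
    }
    if (y.isVar) return unify(y, x, s);
    if (x.name != y.name || x.args.size() != y.args.size()) return false;
    for (size_t i = 0; i < x.args.size(); ++i)
        if (!unify(x.args[i], y.args[i], s)) return false;
    return true;
}

For the earlier example, unifying the variable x with the constant v() extends the substitution with x := v(), after which the two a(...) literals match and the clauses can be chained.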

Listing Four presents the output of another sample prover, this one using predicate logic. The axioms in this example are: "For all x that are even, x+1 is odd" and "For all x that are odd, x+1 is even." The theorem to be proved, represented by the last (negated) statement, can be rephrased as: "Is it true that for any even a, a+2 is also even?" For simplicity, every variable in the statements is assumed to be universally quantified ("for all"), whereas existentially quantified variables must be turned into constants by hand.

Predicate logic does introduce theoretical problems. In propositional logic, you can always prove or disprove a theorem, since the search space is always finite. With predicate logic, the search space can be infinite, so you can never be sure whether the theorem is false or whether you simply haven't searched far enough.

[Editor's note: A potentially infinite search space is not the only problem with automatic proofs in predicate logic. If it were, you could modify your program to carry out two proofs in parallel. One search tries to prove the theorem is true, the other tries to prove the theorem is false. You would eventually succeed. Unfortunately, even this isn't guaranteed to work. A subtle mathematical result called "Gödel's Incompleteness Theorem" says that, in any reasonable system of mathematics, there are theorems which are true but which cannot be proved. Douglas R. Hofstadter's book Gödel, Escher, Bach (Vintage Books, 1989) is an entertaining and accessible discussion of this important fact and its consequences.]

Relational Algebra for Databases

A relational database is a set of two-dimensional tables called "relations." Every relation contains a number of tuples (also called "records" or "rows"), each associating some values with the fields ("columns" or "attributes") of the relation. The only constraint on the relation is that no two tuples may be exactly the same. For the purposes of this article, I will represent a relation as a sequence of its tuples. For instance, A = c e : 1 1 : 1 2 : 2 2 : 3 3 :, where A is the name of the relation, c and e are the names of its fields, and each colon-separated group gives the values of c and e in one tuple.

These relations can be manipulated using relational algebra, which is a set of operators that take relations as arguments and produce relations as their result. Structured Query Language (SQL) is based on this type of algebra. Although powerful, relational algebra is often not used directly because it requires that you know the internal structure of the database -- the relations present, the field structure of each relation, and the meanings of the fields -- when you formulate the query.

A similar problem occurs with object-oriented databases (OODBs). Queries are often more efficient with OODBs, but the underlying abstraction appears to be more complex than relational algebra and may require even more intimate familiarity with the internal structure of the database to build an efficient query expression.

An automated theorem prover can help address this problem. Users simply ask their questions, and the database uses its knowledge of the internal structure to arrive at an appropriate answer. To do this, I need to first develop a notation for relational algebra.

A database may contain many different relations describing various aspects of some domain. Operations in relational algebra create new relations from old ones. For example, the selection operator creates a new relation containing only the specified tuples. For instance, Example 1(a) finds the tuples of A (the relation just shown) where the value of the field c is bigger than 1.

In this sample syntax, the select operator is represented by round brackets with the argument relation on the right. An intersection join is another relational algebra operator that combines two relations based on the contents of particular join fields. When the join fields agree, the relevant tuples are combined into a new tuple and placed into the result. For instance, Example 1(b) joins the two relations A and B along fields c and x. The join operator is represented here by the triangular brackets with the square brackets inside denoting the join fields. Also, the x field disappears in the result since it is the same as c and thus is redundant. Of course, the power of an algebra comes from the ability to combine simple expressions into more complex ones; for instance, you can combine Examples 1(a) and 1(b) to select only part of the intersection join result in Example 1(c).
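
To make the algebra concrete, here is a small C++ sketch of the two operators under a Relation representation of my own (a real database engine would use indexes rather than nested loops):

#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// A relation: field names plus rows of values, as in A = c e : 1 1 : ...
struct Relation {
    std::vector<std::string> fields;
    std::vector<std::vector<std::string>> rows;
};

size_t fieldIndex(const Relation& r, const std::string& f)
{
    for (size_t i = 0; i < r.fields.size(); ++i)
        if (r.fields[i] == f) return i;
    throw std::runtime_error("no such field: " + f);
}

// Selection: keep only the rows satisfying a predicate, e.g. (c>'1')A.
Relation select(const Relation& r,
                const std::function<bool(const std::vector<std::string>&)>& pred)
{
    Relation out{r.fields, {}};
    for (const auto& row : r.rows)
        if (pred(row)) out.rows.push_back(row);
    return out;
}

// Intersection join: combine rows whose join fields agree, dropping the
// right-hand copy of the join field as redundant.
Relation join(const Relation& l, const std::string& lf,
              const Relation& r, const std::string& rf)
{
    size_t li = fieldIndex(l, lf), ri = fieldIndex(r, rf);
    Relation out{l.fields, {}};
    for (size_t i = 0; i < r.fields.size(); ++i)
        if (i != ri) out.fields.push_back(r.fields[i]);
    for (const auto& lrow : l.rows)
        for (const auto& rrow : r.rows)
            if (lrow[li] == rrow[ri]) {
                auto row = lrow;
                for (size_t i = 0; i < rrow.size(); ++i)
                    if (i != ri) row.push_back(rrow[i]);
                out.rows.push_back(row);
            }
    return out;
}

Composing the two, as in Example 1(c), then amounts to wrapping a join result in a select call: select(join(A, "c", B, "x"), pred) for the appropriate predicate.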

An Automated Prover for Relational Algebra

The inherent disadvantage of using relational algebra expressions is that they must precisely describe how to obtain the answer. You have to know the structure of the database -- which relations and fields are present -- before you can write a query expression. Often, the internal structure of the database shouldn't be exposed to end users, but this prevents them from using relational algebra to query the database.

Automated provers offer a general way to ask questions and find answers. The idea is to store the data in a database, and let a prover that knows the database structure construct the proper relational algebra query expressions from the user's questions.

In Listing Five, the relations A and B each have just one field (x and y, respectively). Assume that the first relation lists the parts of some product A and the second relation lists parts used for another product B. Symbolically, you can write a(x) to indicate that x is a component of a, and a(.x)@A to indicate that every part listed in relation A is a part of a. Similarly, b(.x)@B means that b(x) is true (x is a part of b) for each part listed in B. Here, the dot in front of x or y signifies that these are constants stored in the fields of a relation whose name follows the "@" sign.

As an example, consider asking the database which components are present in the product a. This question can be rephrased more formally: for which x is a(x) true? This is your theorem; you negate it and consider it together with your axioms. In particular, ~a(x) and a(.x)@A produce a contradiction. Although I haven't yet shown how relational expressions are combined when clauses are resolved, it's clear that in this case the relational expression for the contradiction clause will be just A, which represents the answer to the question. Indeed, the components for the product a can be fetched from the relation A.

For a more complex example, suppose you want to find all the common parts. I'll use l(x) to indicate that x is common to both A and B. The axiom {{a(x)&b(x)}-l(x)} then indicates that any part x that is used in both a and b is a common part. You now need a proof of l(y); that is, a sequence of steps that shows how the known facts produce l(y). To solve this automatically, you negate your theorem and give it to an automated prover, which will attempt to find a sequence of steps leading to a contradiction. As it does so, it uses relational algebra to explain how each step is produced from the database. At the end, you have a contradiction, together with a relational expression describing how to fetch the solution from the database. Of course, this requires that you maintain an associated relational expression for each clause you derive.

Suppose you have a clause {a('1')|c()} with an empty relational part, where '1' is a constant. You can resolve this with {~a(.x)|d()}@R by restricting the field x of the relation R to the value '1'. To record this restriction, the resolvent should be {c()|d()}@(x='1')R, where (x='1') is a selection operator that restricts the field x of R to be equal to '1'.

Similarly, you can combine {a(.x)|d()}@R and {~a(.y)|c()}@L. Obviously, these can only be combined where the values of x and y agree. Put differently, the resolution applies only to members of the intersection join of R and L, which gives the symbolic result {d()|c()}@<R[xy]L>.

Although other cases are possible, these examples demonstrate the most important situations. These rules let the prover track the relational expressions as it manipulates the logical ones. The user, on the other hand, only needs to be aware of the concepts described by the database.
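
The bookkeeping itself can be as simple as carrying a textual relational expression along with every clause. The following hypothetical helpers mirror the two cases just described:

#include <string>

// Each derived clause carries the relational expression describing how
// to fetch its supporting tuples (a string here, purely for illustration;
// a real prover would build an expression tree).
struct TrackedClause {
    std::string literals;           // e.g. "c()|d()"
    std::string relExpr;            // e.g. "(x='1')R"; empty if none
};

// Resolving against a constant, as in {a('1')|c()} with {~a(.x)|d()}@R:
// the field is restricted to the constant with a selection operator.
std::string selectExpr(const std::string& field, const std::string& value,
                       const std::string& rel)
{
    return "(" + field + "='" + value + "')" + rel;
}

// Resolving two database literals, as in {a(.x)|d()}@R with {~a(.y)|c()}@L:
// the two relations are joined on the fields that must agree.
std::string joinExpr(const std::string& l, const std::string& lf,
                     const std::string& r, const std::string& rf)
{
    return "<" + l + "[" + lf + rf + "]" + r + ">";
}

Here selectExpr("x", "1", "R") yields (x='1')R and joinExpr("R", "x", "L", "y") yields <R[xy]L>, matching the notation used in the listings.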

Listing Five illustrates the solution to the common-parts question posed earlier. The prover uses the axioms to contradict the negated theorem. The relational expression corresponding to this proof joins the relations A and B to produce the list of components common to both products.

Listing Six demonstrates a more complex situation. The relation A lists product components that can be used in different environments. The first axiom says that each e is an environment and the corresponding c is a component. The relation B describes three systems: a, b, and c. Each of these contains one or more components in the p field, but only "top" components (those whose t field holds the value 't') affect the environment in which the product can be used. This knowledge is represented by the second and third axioms. Listing Six shows how a prover would answer the question, "Which systems can serve in environment 3?"

Conclusion

These examples demonstrate how a database application can be made smarter. One advantage is that the internal structure of the database can be hidden from the user, and can even change without affecting the users. The concepts and examples I've presented only demonstrate the basic methods, ignoring many subtle details encountered in a practical implementation. One such issue is familiar to programmers of Prolog (which is basically a theorem prover based on a subset of predicate logic called "Horn clauses"): sometimes it suffices to find the first empty clause (that is, the first contradiction); other times it may be necessary to identify all empty clauses. For example, add an axiom to Listing Five that lets you find any manufacturing part ({{a(x)|b(x)}-n(x)}) and observe the reasoning performed by the sample prover.

References

Chang, Chin-Liang and Richard Char-Tung Lee. Symbolic Logic and Mechanical Theorem Proving. San Diego, CA: Academic Press, 1973.

Jackson, Peter. Introduction to Expert Systems. Reading, MA: Addison-Wesley, 1990.

Dos Reis, Anthony J. "Theorem Proving Using Semantic Resolution." Dr. Dobb's Journal, April 1988.

DDJ

Listing One

Step 1: Remove - and =
    g-h is substituted by ~g|h
    g=h is substituted by (~g|h)&(g|~h)

Step 2: Distribute negation so only atoms are negated
    ~(g|h) is substituted by ~g&~h
    ~(g&h) is substituted by ~g|~h

Step 3: Distribute ors over ands
    g|(h&f) is substituted by (g|h)&(g|f)
    (h&f)|g is substituted by (h|g)&(f|g)


Listing Two

(~(a-b))|c        (Eliminating "-");
(~(~a|b))|c       (Distributing not);
(a&~b)|c          (Distributing | over &);
(a|c)&(~b|c)      (Obtaining clauses).


Listing Three

Statement:              CNF:                     Clauses:
---------------------------------------------------------------------
k                       k                        (0) k
(~a-b)                  (a|b)                    (1) a|b
(k-l)                   (~k|l)                   (2) ~k|l
((b&l)-a)               ((~b|~l)|a)              (3) ~b|~l|a
~a                      ~a                       (4) ~a


Proof:
---------------------------------------------------------------------
(0) k
(1) a|b
(2) ~k|l
(3) ~b|~l|a
(4) ~a
(5) ~b|~l from (4) and (3)
(6) ~b|~k from (5) and (2)
(7) ~k|a from (6) and (1)
(8) ~k from (7) and (4)
(9) contradiction from (8) and (0)



Listing Four

Statement:              CNF:                     Clauses:
---------------------------------------------------------------------
{e(x)-o(p(x))}          {~e(x)|o(p(x))}          (0) ~e(x)|o(p(x))
{o(x)-e(p(x))}          {~o(x)|e(p(x))}          (1) ~o(x)|e(p(x))
~{e(a())-e(p(p(a())))}  {e(a())&~e(p(p(a())))}   (2) e(a())
                                                 (3) ~e(p(p(a())))
Proof:
---------------------------------------------------------------------
(0) ~e(x)|o(p(x))
(1) ~o(x)|e(p(x))
(2) e(a())
(3) ~e(p(p(a())))
(4) ~o(p(a())) from (3) and (1)
(5) ~e(a()) from (4) and (0)
(6) contradiction from (5) and (2)



Listing Five

Relations:
---------------------------------------------------------------------
A = x : 1 : 2 : 3 : 4 : 5 :
B = y : 1 : 5 : 7 :


Statement:              CNF:                     Clauses:
---------------------------------------------------------------------
a(.x)@A                 a(.x)@A                  (0) a(.x)@A
b(.y)@B                 b(.y)@B                  (1) b(.y)@B
{{a(x)&b(x)}-l(x)}      {{~a(x)|~b(x)}|l(x)}     (2) ~a(x)|~b(x)|l(x)
~l(y)                   ~l(y)                    (3) ~l(y)


Result:
---------------------------------------------------------------------
A= x : 1 : 2 : 3 : 4 : 5 :
B= y : 1 : 5 : 7 :
(0) a(.x) @ A
(1) b(.y) @ B
(2) ~a(x)|~b(x)|l(x)
(3) ~l(y)
(4) ~a(x)|~b(x) from (3) and (2)
(5) ~a(.y) from (4) and (1) @ B
(6) contradiction from (5) and (0) @ <B[yx]A>
<B[yx]A> = y : 1 : 5 :



Listing Six

Relations:
---------------------------------------------------------------------
A= c e : 1 1 : 1 2 : 2 2 : 3 3 :
B= p s t : 1 a p : 2 a t : 2 b t : 2 c p : 3 c t :


Statement:              CNF:                     Clauses:
---------------------------------------------------------------------
{e(.e)-c(.c)}@A         {~e(.e)|c(.c)}           (0) ~e(.e)|c(.c)@A
t('t')                  t('t')                   (1) t('t')
{{c(.p)&t(.t)}-s(.s)}@B {{~c(.p)|~t(.t)}|s(.s)}  (2) ~c(.p)|~t(.t)|s(.s)@B
~{e('3')-s(x)}          {e('3')&~s(x)}           (3) e('3')
                                                 (4) ~s(x)
Result:
---------------------------------------------------------------------
A= c e : 1 1 : 1 2 : 2 2 : 3 3 :
B= p s t : 1 a p : 2 a t : 2 b t : 2 c p : 3 c t :
(0) ~e(.e)|c(.c) @ A
(1) t('t')
(2) ~c(.p)|~t(.t)|s(.s) @ B
(3) e('3')
(4) ~s(x)
(5) ~c(.p)|~t(.t) from (4) and (2) @ B
(6) ~c(.p) from (5) and (1) @ (t='t')B
(7) ~e(.e) from (6) and (0) @ <(t='t')B[pc]A>
(8) contradiction from (7) and (3) @ (e='3')<(t='t')B[pc]A>
(e='3')<(t='t')B[pc]A> = e p s t : 3 3 c t :



Copyright © 1998, Dr. Dobb's Journal
