Database

SQL, Data Mining, & Genetic Programming

By Brian Conolly, April 01, 2004

Evolutionary algorithms solve problems by mimicking the process of natural evolution.

Apr04: SQL, Data Mining, & Genetic Programming

The practical side of evolutionary algorithms

Brian is an independent consultant who has designed databases for health care companies, international financial trading systems, and investment banking applications. He can be contacted at brian@ ideajungle.com.

Genetic programming is one of the more versatile techniques available to developers. It has been used to solve a wide range of complex problems, in areas as diverse as optimization, process control, symbolic integration, and emergent behavior. Genetic programming is one of a class of techniques called "evolutionary algorithms." These algorithms solve problems by mimicking the process of natural evolution. Evolutionary problem solvers work by first generating a large number of random problem solvers, each of which is given the same problem to solve. Then they repeatedly:

Choose the best problem solvers, by applying a scoring function to each.
Breed them by mixing their constituent parts to form the next generation of problem solvers, allow superior solvers to have more descendants, and occasionally mix in mutations, which are random changes in a child.
Repeat the process, until a fixed number of generations have evolved, or until a suitably good problem solver has been developed.

What I've just described is known as a "genetic algorithm." In genetic programming (GP), the problem solvers that are bred are executable programs such as a LISP function that solves a mathematical optimization problem or a C function that implements financial trading strategies. While GP is usually implemented in languages such as LISP or (sometimes) C/C++, in this article I implement it in SQL to solve a data-mining problem. The program that is bred is a SQL WHERE clause that can be used, for instance, as the basis for an efficient direct marketing campaign. The implementation is done in Microsoft SQL Server 2000 Transact-SQL.

Specifically, the data-mining problem I tackle is from the COiL Challenge 2000, a data-mining contest organized by the Computational Intelligence and Learning Cluster (http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html and http://www.dcs.napier.ac.uk/coil/challenge/). The basic goal of the contest was to use computational intelligence and learning technology to solve real-world problems. Data miners were given a data set on 5822 people, with 85 columns of demographic, financial, and credit data on each. Each person also had a column named "Caravan," which indicated whether a mobile-home insurance policy had been purchased. Of the 5822 people, 348 people (6 percent) purchased such a policy.

The contest objective was to use this training data to develop an effective means of identifying insurance purchasers in a different set of data. Presumably, the 5822 people represent a sample of a much larger population. The contest and the sample data represent a real-world business problem: How to target a new marketing campaign to likely purchasers? Here, I use GP to develop a SQL WHERE clause that selects as many insurance purchasers as possible, and as few nonpurchasers. This query could be used against the full population to select good candidates for targeted marketing.

The data from COiL 2000 is found in the t_mine_relations table (available electronically; see "Resource Center," page 5). The GP data-mining algorithm I've developed is general purpose and can be applied to any data set, with arbitrary columns in a relation named "t_mine_relations."

Why SQL?

There were a number of reasons why I chose a SQL implementation for the general data-mining solver, including:

Data mining requires checking large amounts of data to discover patterns. GP in SQL lets this be done inside the database, rather than incurring the overhead of shipping the data to an external server.
According to John Koza, GP can be implemented in any programming language that lets programs be treated as data. Transact-SQL provides the EXEC function to execute a dynamically constructed query (see Genetic Programming: On the Programming of Computers by Means of Natural Selection; ISBN 0262111705).
In "A Genetic Programming Framework for Two Data Mining Tasks: Classification and Generalized Rule Induction," Alex A. Freitas pointed out that the SQL GROUP BY operator can be extremely efficient in evaluating large numbers of generated WHERE clauses (see Genetic Programming 1997: Proceedings of the Second Annual Conference; Morgan Kaufmann, 1997).
There are good techniques available for representing data selection profiles in tables. This representation lets the next generation be bred and mutated using simple SQL operations.

Still, SQL does have disadvantages, the most important being that it isn't very good at performing random operations. There is no standard way of selecting a random subset of rows from a table, and much of the complexity in the implementation follows from the need to work around this limitation.

The data-mining implementation in this article is rudimentary because it chooses WHERE clause conditions from a small set of potential operators. For this reason, the solution is not competitive with state-of-the-art data-mining techniques. However, the article does demonstrate that SQL offers all the capabilities needed for a competitive data-mining solution

Design Objectives

The implementation I present here is designed to solve the problem: Given an arbitrary set of data in the relation t_mine_relations (which could be a view of several tables), and a particular target column and target value, find a SQL WHERE clause that is good at selecting t_mine_relation rows where the target column has the target value.

In the COiL 2000 example, the t_mine_relations table contains 85 columns of personal data. The target column is named "Caravan" and the target value is "1," meaning the person has purchased a mobile-home insurance policy.

In addition, I wanted the design to save a history of the evolution process, so that SQL queries can summarize progress toward a solution and trace the lineage of the best current WHERE clause.

Implementation Overview

Figure 1 illustrates the design of the SQL GP Data Miner. A GP data-mining problem is solved in three phases:

Problem definition, which contains a set of stored procedures and views that use the t_mine_relations metadata, and generates a profile representation function and profile evaluation for it.
Execution, which uses stored procedures to execute the GP algorithm, breeding new WHERE clause profiles, and using the generated evaluation function to measure them against t_mine_relation data rows.
Analysis, which saves the profiles for each generation in a set of archive tables for later analysis. There are stored procedures to summarize progress toward a good solution and to examine the lineage of the best problem solver.

The implementation allows for WHERE clauses of the form expression1 AND expression2...AND expressionN. Each Exp expression is of the form C op V, where C is a t_mine_relation column name, op either "==" or "!=", and V one of the values found in any t_mine_relation row for that column. While the implementation is limited to AND operations, the general design can be easily extended to include both AND and OR operations.

As Example 1(a) shows, a SQL WHERE clause is represented as a profile. A WHERE profile has multiple rows, each of which represents a c_name op c_value condition. All of the potential WHERE clauses within the current GP generation can be evaluated with a single SQL statement; see Example 1(b). This is standard SQL, except for its WHERE clause. To use standard SQL, I need to convert the t_profile_column_values table into a table with the same columns as t_time_relations. Dynamic SQL is used to generate a SQL user function—uf_generated_profile_table—that does this. This function returns a table identical in structure to t_mine_relations (with the exception of the target column). A column contains NULL if there is no t_profile_column_values entry for it. If there is a t_profile_column_values entry, uf_generated_profile_table contains ==:V or !=:V. The generation evaluator in Example 1 can then be implemented in the standard SQL in Example 2. Since the column names are dependent on the column names of t_mine_relations, the SQL evaluator must be generated by a dynamic SQL procedure that works off the t_mine_relations metadata.

Schema.SQL creates all the tables and views used in the GP setup, execution, and analysis procedures. Some of the views it defines use SQL Server 2000 meta-data on the t_mine_relations table. A GP problem solver for a particular instance of t_mine_relations is set up by running Definition-Procedures.Sql (available electronically; see "Resource Center, page 5). There are two key procedures in this step. The first, up_generate_profile_table_function (Listing One), creates the function that transforms the profiles found in t_profile_column_values into a row-per-profile table with the same columns as t_mine_relations.

Listing Two (available electronically) presents up_generate_profile_Evaluator. The v_columns view returns a list of t_mine_relations columns selected from the sysobjects table. up_generate_profile_Evaluator uses these columns to generate a stored procedure named up_generated_profile_Evaluator, which performs a complex SELECT...WHERE...GROUP query that evaluates all profiles. Listing Three (available electronically) shows the structure of this generated stored procedure.

SQL procedures are generated and compiled at definition time—the execution phase doesn't use dynamic SQL. The COiL 2000 Challenge data can be loaded by running t_mine_relations.sql to create the data table. The data rows can be loaded from t_mine_relations.csv. When Definition-Procedures.sql is run after this data is loaded, the generated evaluation procedures are shown in Listings Two and Three (both available electronically).

GP Data-Mining Execution

The following parameters control the behavior of the procedure up_evolutionary_data_mining (see Listing Five, available electronically):

Profile_count. The number of profiles that are generated initially, and the number that is maintained through execution.
Number_of_generations. Execution terminates when this number of generations has been run.
Target_value. The value in the target column you are trying to breed selectors for.
Population_fill_proportion should be set to a value greater than 0.0 and less than 1. It controls the number of descendants of each pair of parent profiles. In general, lower values promote diversity; higher values allow a few "champions" to dominate the next generation. If set to 0.5, the best pair of profiles breeds children filling half of the available (up to profile_count) population slots; the next best pair will use half of the remaining slots after the best pairs are finished breeding. However, all profile pairs that select any t_mine_relation rows will have at least one child.
Column_portion controls the maximum number of t_mine-relation columns that appear in any profile of the initial population. A value of 0.5 means that no profile uses more than half the columns.
Equality_percent, the probability that a new profile clause uses the "=" operator.
Mutation_probability, the likelihood that a mutation appears in any breeding operation.
Mutation_add_column_probability, the likelihood that a mutation will add a new column clause to a profile. If set to 0.8, then 80 percent of the mutations adds new columns; the rest of the mutations remove a column.

Listing Six (available electronically) shows the first step taken by the evolutionary algorithm: Creating the initial generation of random profiles. To do this, the procedure up_create_initial_population needs to select random columns and values from t_mine_relations. Selecting random rows from a table is not a natural SQL operation, and there's the additional problem that the Transact-SQL Rand() function is only evaluated once per query. To work around these problems, the procedure loops through the table one row at a time, assigning Rand() values once per iteration. Once this is done, you can do a SELECT with an ORDER BY to retrieve rows in random order. Up_create_initial_population does this to select random column names. It invokes up_insert_profile_column_row to insert a profile row for the random column. That procedure applies the same technique to select a random value for the profile row.

In each generation, an evolutionary algorithm ranks the current set of profiles, and selects the best ones for breeding. Listing Seven (available electronically) is the function that does this. uf_profile_Score takes a table that contains the number of "right" and "wrong" t_mine_relation rows selected by each profile, where a right row is a row with a target column ("Caravan" in the COiL 2000 case) value of "1." The score of a profile is the percentage of possible right rows it selects minus the percentage of possible wrong rows. For example, in the COiL 2000 data there are 348 rows with a Caravan value of "1," and 5474 rows with a Caravan value of "0." So a profile that selected 150 "1" rows and 2500 "0" rows would have a score of (150/348)-(1977/5474), or 0.3611. A better score means a query is more likely to select insurance purchasers than a profile with a lesser score.

Listing Eight (available electronically) shows the procedure used to breed children given a pair of parents. In every generation, the best profiles are selected and sorted by score. Pairs of the best profiles are passed to up_breed_children to create new profiles. The breeding process works by choosing a profile row from one parent or the other, depending on a random number. Care is taken to ensure that the same column name does not appear in two-child profile rows. Since the profile rows are ANDed together, this would likely yield a "dead" profile—one that selects no rows from t_mine_relations. Depending on the mutation parameter setting, either add a new profile row with a new column to the child, or delete an arbitrary profile row that was inherited from a parent.

GP Data-Mining Execution

One of the reasons I implemented the GP Data-Mining solution in SQL was because SQL offers powerful capabilities for tracing the history of an evolving population of candidate solutions. up_evolutionary_data_mining saves a record of every profile in two tables in Listing Nine (available electronically). These tables contain the profile ID, generation, parents, and score of every profile. Analysis-procedures.sql contains several procedures (see Listing Ten, available electronically) that can be used to analyze a run. The first, up_run_summary, displays a summary of the key data for each generation. The second, up_lineage_of_best_profile, displays the WHERE clause of the best profiles and its ancestors a few generations back. The power of SQL in analyzing a run is evident in the simplicity of these two procedures. While the code to execute GP is no simpler in SQL than other languages, the code to analyze results is far simpler.

I configured the GP system to work from COiL 2000 data, and ran the test in Example 3. While it was running, I ran the first analysis procedure, up_run_summary, which produced the data in Figures 2 and 3. Figure 2 (with a logarithmic scale) shows a large initial drop in the number of wrong rows and a large increase in right rows. These are average figures—they stabilize fairly early on and don't change much in the course of a run.

Figure 3 shows the average profile scores and the best profile scores by generation. There's a quick increase in both— up to about generation 40, followed by a slow but steady increase thereafter.

When examining the results of executing up_lineage_of_best_profile after generation 166, the procedure only reports back five generations because the number of ancestors doubles with each generation (see Table 1). Because in every generation, there can be several identical profiles, any duplicates are reflected in the count field.

How good is the best profile? The best profile at generation 166 selected 244 insurance policy purchasers and 1398 others. So 17 percent of the rows it selects are insurance purchasers, while a random subset of t_mine_relations would be expected to have 6 percent policy purchasers. So this query is three times as efficient.

Limitations and Enhancements

One limitation to this technique is the restrictive nature of the WHERE clause. Only "=" and "!=" comparisons are allowed, and there are currently no OR expressions. The implementation can be extended to add these operations. The profile table implementation was based on just a subset of Joe Celko's profile design, which allows for range comparisons and regular expression matching (see "Using Expressions as Parameters for SQL Server Stored Procedures" in the microsoft.public.sqlserver.programming newsgroup). The Test field in t_profile_column_values is reserved for his idea of including OR operations. A set of profile column rows can have different Test numbers. The profile rows with the same Test number represent AND operations, and all of the different Tests are ORed together. Only a tiny change is needed to the evaluation function: Just substitute SELECT DISTINCT for SELECT in up_generated_profile_Evaluator and it correctly evaluates the profile AND and OR expressions.

A second limitation concerns the length of the WHERE clauses the evolutionary algorithm finds—they tend be very long, ANDing together dozens of comparisons. You can correct this by modifying the scoring function uf_profile_Score and adjusting the score to favor profile rows with those containing fewer columns.

Performance is also a concern. It takes approximately 1.5 minutes per generation of 200 profiles on a 1.66-GHz Pentium IV processor, and 90 percent of this time is spent evaluating profiles. It is comparing several hundred profiles against several thousand mine relation rows, so the processing is substantial. I looked at the execution plan for up_generated_profile_Evaluator and saw that it was doing a table scan loop for each of the 85 possible profile columns. One of my objectives in the design was to limit the use of dynamic SQL because I felt that generating dynamic SQL during execution would cause too much overhead. In hindsight, this was a mistake. Even the complex profiles that the algorithm generates use only a few dozen of the 85 COiL 2000 data columns. The algorithm could be enhanced to dynamically create a custom up_generated_profile_Evaluator version for each new generation. The algorithm could first identify all the distinct columns that are used in any active profile, and then generate a new up_generated_profile_Evaluator that has a WHERE clause for just those columns. This could substantially reduce the number of loops needed to execute the query, and thus dramatically reduce execution time.

Conclusion

One of the aspects of GP that attracted me was the promise of harnessing some of the creativity of natural evolution to solve complex problems. I expected (somewhat naively) to implement the algorithm, point it at some data, and be astonished at the solution it found. However, since randomness is the basis of the algorithm, I frequently ran it and found a pleasing rate of progress, only to run it again with the same parameters and watch it slog through generation after generation with only marginal improvement.

I had a tendency to see successes as clear evidence of the power of the algorithm, and failures as just random failure or poorly tuned parameters. I was not unlike the psychic researcher, who, convinced of the reality of the psychic ability, reports only successful experiments and attributes all failures to bad experimental conditions. For an interesting paper on this tendency, see "Magical Thinking in Data Mining: Lessons from the COiL Challenge 2000," by Charles Elkan (http://www-cse.ucsd.edu/users/elkan/kddcoil.pdf).

DDJ

Listing One

/* To evaluate all the profiles in t_profile_column_values, we need to 
restructure them as rows of the form id, col1, col2,...coln, where colN 
is either: NUll - column doesn't matter; "==:N" - mine relation column 
values = N; or "!=:N" - mine relation column values != N. We generate the 
create function statement that performs this transformation. */

drop procedure up_generate_profile_table_function
go
create procedure up_generate_profile_table_function
as
declare
   @sql_initial varchar(8000),
   @sql varchar(8000),
   @next varchar(30),
   @columns_done int
begin
select *
into #t1
from v_columns
order by colorder
--select * from #t1

select @sql = 'drop function uf_generated_profile_table'
exec(@sql)
select @sql = ''
select @sql = @sql + 
'create function uf_generated_profile_table() ' + char(13) + char(10)
select @sql = @sql + 'returns table ' + char(13) + char(10)
select @sql = @sql + 'as return (' + char(13) + char(10)
select @sql = @sql + 'select p.id,' + char(13) + char(10)

select @columns_done = 0
while ( exists (select * from  #t1))
begin
    select top 1 @next = name
    from #t1
    select  @sql = @sql + @next +  
' = (SELECT operator + '':'' + c_value FROM t_profile_column_values ' +   'WHERE c_name=''' + @next + ''' and id = p.id)'
    delete from #t1 where name = @next
    if exists (select * from  #t1)
        select  @sql = @sql + ',' + char(13) +char(10)
    else
        select  @sql = @sql + char(13) +char(10)
select @columns_done = @columns_done + 1
if ((@columns_done % 50) = 0)
   begin
     select @sql_initial = @sql
     select @sql = ''
   end
end
select @sql = @sql + 
' from t_profile_column_values p' + char(13) +char(10) 
select @sql = @sql + ' group by id )' + char(13) +char(10) 
--select @sql_initial, @sql
exec(@sql_initial + @sql)
end
go
/* Create the function above. */
exec up_generate_profile_table_function
go

Back to Article

1 2 3 4 5 6 7 8 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Database

SQL, Data Mining, & Genetic Programming