Channels ▼


Machine Learning & Agent-Based Computing

Source Code Accompanies This Article. Download It Now.

Nov99: Machine Learning & Agent-Based Computing

Zhimin is the founder of the Learning Machines Technology Group and can be contacted at [email protected] Li is a software design engineer at Hewlett-Packard. She can be reached at [email protected]

As the Internet becomes ubiquitous, software is becoming increasingly exposed to complex and dynamic environments. Consequently, the concept of agent-based computing has emerged as a useful model for software that has to deal with an uncertain and changing environment that consists of many distributed software systems.

Agents are software entities that not only gather and present information, but also act intelligently to accomplish one or more assigned goals. Agents can often be described as finite state machines whose state transition dynamics drives them to achieve the goal.

A state-machine model whose state transition dynamics has to be entirely hardwired and manually maintained offers little value to programmers. In this article, We'll examine the application of machine-learning technology to control an agent. Based on Markov process, this technology enables an agent to automatically identify the statistical structure of its environment and synthesize optimal feedback strategies at run time. As the result, a software agent powered by this technology will be able to adapt to its environment and continuously learn and improve its ability to achieve the goal. To this end, we present MLEngine, a general-purpose AI engine with real-time learning capability. MLEngine can be used to control simulated game characters or software agents through interface with (simulated) sensors and actuators. MLEngine can synthesize goal-oriented behavior at run time and therefore make the characters under its control learn. MLEngine, available from both DDJ (see "Resource Center," page 5) and at http://www, includes Win32 run-time libraries, API documentation, sample source code, and demo programs.

Agent Dynamics

Again, the dynamics of an agent can often be conveniently described as a state machine. The basic elements of a state machine are:

  • At any instance, the dynamic state of an agent can be uniquely described by a set of variables called "state variables." These variables can be analog or discrete.
  • An agent can receive input stimuli from its environment, and the state variables will undergo changes. The new state is a function of both the environment input and the previous state.

  • An agent can pass stimuli back to its environment. The output is uniquely determined by the state variables. The environment could include other agents.

If the states are discrete and finite, this mechanism is called a "finite state machine."

An intuitive way to illustrate the idea is to use a game character such as that in Figure 1. Suppose that as this game character moves around in space, you could use its position in space to describe its dynamic state. Additionally, a character usually has an orientation. An orientation indicator can serve as another state variable. Imagine that this character can also sense whether it is in touch with an obstacle. Touch-sensing input can also become a state variable. So to completely describe the character, you need to specify its position, orientation, and whether or not it is in touch with an object.

The state variables undergo changes over time. For example, the character can move in space and flip. The way it moves can be expressed as transitional probability between states. Obviously, the agent state transition will be influenced by inputs from its environment under certain conditions, so that it will not become a completely self-indulging entity. For example, the character's position change depends on its environment, such as whether or not an obstacle is present.

The character's state should also be able to undergo changes without any environment stimuli under certain conditions (so that the new state depends only on its previous state). This is important for implementing any oscillatory or self-driven sequencing behavior such as walking in a legged character.

Assume that you are interested in an agent whose behavior could evolve as it adapts to its environment. A sensible approach toward that goal is to separate constraints from voluntary control in the agent design. A suitable analogy is the human model. Human behavior is determined partly by the brain and partly by the constraints imposed by the physical body (for example, you couldn't jump 100 feet in the air even if you wanted to).

In Figure 2, you divide an agent into a physics block and a control block. The physics part of the agent is a state machine. The controller has access to state information and issues a set of actions. Along with the other inputs, the action input will influence state transitions. So now the state transition is determined not only by external input and the previous state, but also the action input from the controller.

The character in Figure 1, for example, has five movement actions -- left, right, up, down, and flip. The probability of the character falling into any state depends not only on its previous state, but also on the environment input as well as the voluntary action input from the controller.

Setting Up the Goal

Before learning can take place, it is necessary to come up with a formal definition of a goal. After all, the purpose of learning is for the agent to achieve the goal.

A convenient way to define a goal is to choose one or more states and assign them as the goal. The fact that the goal is expressed as one or more states means that the goal has to be a part of the state space that can be sensed by the controller. Intuitively, this makes sense because no matter how intelligent an agent is, it would never be able to learn and improve if it cannot sense whether or not it has achieved the goal.

To make learning happen, you have a few obvious and tempting choices:

  • You simply observe and see which actions cause the state transitions that lead to the goal. You then enhance the probability of taking these actions. This is essentially a conditioning based on reward and penalty. You reward the actions that bring favorable results and penalize those that do not.
  • The problem with this approach is that learned behaviors tend to be too simplistic. The character can only learn simple reflexes that seek immediate rewards. It will not learn to plan ahead and carry out sequential behaviors.

  • You try different combinations of features in the feedback strategy and see which one yields the best overall result. If we try long enough, a good combination of features in the controller can be found. And the character could become very smart.

  • Experience suggests that combinatorial approaches only work well for simple problems. As the state space grows big, the amount of time needed to perform trial-and-error quickly becomes intractable, even if it is done in a simulation environment.

Mathematical Analysis of Learning

For convenient analysis, you can encode the states in the state space into a linear array 1,2,3...n, assuming there are a total of n mutually exclusive states.

Once the states are encoded into a linear array, the state transition can be described by an n×n matrix. Since the state transitional probability depends on the action input from the controller, to completely describe the physics of a character you need m such n×n matrices, assuming there are m actions. In Example 1, for instance,

represents the transitional probabilities from state i and state j under action k. If the state transition is fully deterministic, then aij is either 1 or 0; otherwise, it is some positive value in between.

In Example 2, the controller can be described by an n×m matrix, where xij is the probability of taking action j when the character is in state i.

The combined system that includes both the physics and the controller can be described by a transitional probability matrix T (see Example 3), where

Suppose that on average, the state probability can be described by P={p1,p2,}, where pi is the longterm average probability for the system to be in state i.

According to Markov process theory, under a certain condition (the ergotic condition), the state probability will converge to a stable value that satisfies P=PT.

Suppose that the goal is state o, the purpose of learning is to enhance Po by manipulating control variables. Obviously, you cannot directly manipulate P, which is the consequence rather than the cause of the system behavior. You can, however, manipulate transition probabilities tij through xik. This will, in turn, influence P in the long term, based on the relationship P=PT. In other words, the goal of optimization is to select the appropriate xik so that po is at its maximum.

There is one missing piece in this analysis. The learning algorithm depends on knowledge about state transition probabilities

So, before learning could occur, you need to fill in the state transition matrices.

As mentioned earlier, state transition probability depends on the character's physics, as well as its interaction with its environment. The process of trying to find out state transition information from knowledge about the character's physics is called "system identification."

Our approach here is to come up with a run-time algorithm that calculates state transitional probabilities by observing the character's physics in action.

The advantage of a run-time approach is that the controller is adaptive. If the physics of the character and its environment undergo changes, the controller will reorganize and adapt the changes. The end result is that the character can maintain and improve its capability to achieve its goals in response to environment changes.

Here is how system identification works. The controller issues actions and observes changes in state. If action k is issued and the state undergoes a change from i to j, you will make adjustments to

in Example 4, where is the learning rate. It is usually a small positive number less than 1. For all the other

with i'i, you make adjustments according to the amount in Example 5. If you keep doing that, over a period of time, matrices

converge to best represent the state transition probabilities.

Putting It Together

Based on this analysis, you have put together an agent controller such as MLE, which can be integrated into an agent and make it capable of learning. The learning is carried out in stages.

In the explore phase, MLE drives the agent to interact freely with its environment. MLE passes out quasirandom action commands and monitors the state input. As a result, the agent will start to explore its environment. Internally, MLE associates sensory and action information and builds an internal representation of the agent and its environment. This is essentially a nonmodel-based system identification process.

A parameter called "association entropy" can be obtained from MLE by calling GetAssociationEntropy. This value (which ranges from 0 to 1) should go down as the association phase is being carried out. The smaller this value is, the more deterministic MLE has become while it builds an internal representation of the system under its control. It represents the confidence level of the agent in terms of its understanding of the environment.

In the optimization phase, MLE will synthesize a strategy, which drives the agent to achieve the goal defined by SetGoal. The strategy is essentially a sensory feedback controller that determines how the system should react to sensory inputs and interact with its environment.

A parameter called "optimization entropy" can be returned from the MLE by calling GetOptimizationEntropy. This value should go down as the optimization phase is being carried out. The smaller this value is, the more deterministic the synthesized strategy is. The estimated probability to achieve the goal can be obtained from GetGoalProbability.

The three phases can be switched back and forth by monitoring association entropy, optimization entropy, and the goal probability. For example, during the optimization phase, the application can keep statistics of the average probability of achieving the goal and compare it with the estimated goal probability returned by GetGoalProbability. If the two are sufficiently different, the application can start to explore and associate again, during which the association entropy is monitored. Once the association entropy goes down to a stable value, an optimization phase can be started. The optimization entropy should go down while the goal probability will go up; see Figure 3.

We've included two demos (available electronically) of agent learning. The first is a boxer trying to hit its opponent. At first, the opponent does not really respond well to the boxer's moves. The boxer learns to carry out coordinated arm movement to hit the opponent repeatedly at the highest frequency. Later, we change the opponent's strategy by making it more responsive. The opponent moves his fist to block the boxer's punch. After some adjustment, the boxer learns to fake fist movements and trick the opponent to move its fist to the wrong location, and then hits from another direction.

The second demo is Picker, a cartoon character whose goal is to eat the fruit at various locations. It learns to associate its visual information with its movement and approach the fruit efficiently. You could cause Picker to mutate so that its mouth and its eye will be located on a different side of the body. The picker learns to flip around so that it can see where the fruit is and approach it with its eye facing the fruit. Once there, it turns around again and eats it. For a Pentium-class PC, it usually takes one or so minutes to learn a new strategy if the size of the state space is less than 50.

We have also tried MLE with (analog) simulations that faithfully emulate physics. For example, we coupled MLE with a quadruped simulator that simulates human body mechanics. we set the goal as the model making forward motion while maintaining an upright body position. The model learned to walk and run realistically like a human (see Figure 4).


Based on the aforementioned method, We've designed a general-purpose AI engine called MLEngine (MLE); its interface functions are presented in Listing One. To prepare the character for learning, you need to follow these steps:

1. Design the character's physics.

2. Encode the state information into a linear array ranging from 0 to dim_state -1. For example, if an agent has three state variables, each represents three possibilities. It is sufficient to represent the dynamics of the agent into 27 discrete, mutually exclusive states (or less since some combinations may not be possible).

3. Create an instance of MLEngine. Specify the dimension of the state space and the action space.

4. Select one or more states as the goal and call SetGoal to set them up. In the boxer example, the goal is the state when the arm is extended and the fist is landed on the opponent's body.

5. Connect the controller to the character physics by calling Perform, which accepts state information as the input and pass out actions.

The learning process is carried out in three phases (see Figure 3). Phase control is accomplished by calling SetPhase.

Comparison with Other Approaches

At the beginning of AI, intelligence was thought of as the ability to manipulate symbols and abstract concepts. An intelligent system based on this assumption consists of a reasoning engine and database that stores common sense knowledge (an expert system, for example). To make it useful, real-world information has to be mapped into abstract concepts or symbols before machine reasoning can take place and the expert system database can be updated. The task of recognizing real-world patterns and mapping them into concepts often turns out to be prohibitively difficult.

Symbol-based automatic reasoning is becoming less popular these days because it never produced interesting results. Recently, a school of thought has emerged that believes logical reasoning is more an afterthought that justifies actions that are already taken, than the cause of actions. For example, when you face a situation and make a decision, it is often after you have made that decision that you come up with the thought: "I did that because of such and such reasons..." We all have a tendency to consciously justify what has been decided unconsciously.

The literature on neural networks is vast. The most popular models focus on pattern association and classification. For example, back-propagation models can map a set of input patterns into some output patterns according to a set of examples. A back-propagation neural net can be used as a controller that maps sensory input patterns into actuator outputs provided one supplies examples about what to do under all situations. The controller can only generalize from these examples.

A second class of neural net models can learn to categorize analog patterns into discrete clusters without supervision. The purpose of learning is to reduce encoding error. The best known example is the Kohonen neural net model.

There is yet another class of neural net models that are able to generate sequences of spontaneous patterns. The output of these neural net models can be used to control motor actions. Most of these neural net models have recurrent topology and dynamics (Hopfield neural networks and Bolzman machine, for instance). Markov process theory has often been used to mathematically describe the dynamics of recurrent structures. Our learning algorithm is similar to some aspects of these models, but only at the mathematical level.

Learning automata are a set of mathematical models that address learning directly without worrying about the underlying brain substrate. The heart of this approach is conditioning based on reward-and-penalty. The main problem with this class of methods is that the results are usually too simplistic. For example, a typical learning automaton can only learn to perform actions that will directly lead to reward. It will not learn to plan ahead and accomplish the goal through indirect action sequences.

More sophisticated learning automata are being developed today. Most of them try to give credit to actions that indirectly contribute to achieving the goal. Some of these models borrow ideas from neural network models.

It is said that if you give a monkey a typewriter and allow it to type randomly, you will eventually see beautiful poems coming out of that typewriter. Many tough problems can indeed be solved by random searches in the solution space if we do not worry about the time it takes. Simulated annealing and genetic algorithms (GA) are two such techniques. Combinatorial optimization techniques such as GA and simulated annealing search through the control parameter space to optimize the performance measured by a predefined index or criterion. The search is usually guided by some sort of heuristic. For example, simulated annealing uses gradient information to guide the search. GA recombine simulated gene segments to come up with control parameters for new trials.

The main problem with this class of approaches is the computational requirement. For example, if you use GA to optimize the performance of a game character, the algorithm has to simulate the game character through many generations. In each generation, a large population of individual characters has to be simulated also. The algorithm will come up with many sets of control parameters for those individual characters. Each of these characters has to perform long enough to come up with a fitness measurement. A character will not adapt in its lifetime. Combinatorial algorithms are fundamentally difficult to implement in real time.


There is not a one-size-fits-all solution for machine learning. A practical solution has to inherit the best of what is available. Learning as an optimization algorithm may take many forms. A successful one converges to an optimal result not because it is a neural net, learning automata, or whatever else, but because its underlying mathematics is correct. It is therefore best to deal with learning at the mathematics level.


Listing One

typedef state int;
typefef action int;

class MLEngine {
  MLEngine(int dim_sate, int dim_action, int dim_order);

  void SetGoal(state);	     //one of the states is assigned as the goal
  void SetGoal(state*, int); //when multiple states are assigned as the goal
  action Perform(state);     //0--explore, 1--associate, 2--optimize and perform
    void SetPhase(int);

  void Associate();
  double GetAssociateEntropy();

  void Optimize();
  double GetOptimizeEntropy();
  double GetGoalProbability();

  void SetAssociationRate(double r);      //0<r<1
  void SetOptimizationRate(double r);     //0<r<1

Back to Article

Copyright © 1999, Dr. Dobb's Journal

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.