Dr. Dobb's | Use Embots to Implement Autonomic Computing

Use Embots to Implement Autonomic Computing

Recently, Autonomic Computing initiatives clearly identified the need for embedded, intelligent system management. Autonomic Controller Engine (ACE), with its embot concepts, delivers. Here's how.

August 25, 2006
URL:http://www.drdobbs.com/architecture-and-design/use-embots-to-implement-autonomic-comput/192300135

Servers are increasingly difficult to manage, and their total cost of ownership continues to rise. The availability of enterprise IT servers does not rival that of telecommunications switching equipment. When such problems persist despite the best efforts of industry, the answer is often architectural. The concept of Autonomic Computing was recently introduced, with the Autonomic Element proposed as its main architectural component and a vision of self-managing, self-healing and autonomous IT servers.

Total cost of ownership of servers continues to rise despite improvements in hardware and software. Effective manageability remains a problem for a number of reasons. First, the management infrastructure deployed in the enterprise relies on traditional client-server architectures. Second, high levels of human interaction result in reduced availability while servers wait for operators to diagnose and fix problems. Finally, the deployed management solutions are in-band, with software agents operating on servers communicating with centralized management platforms. This implies that server management is only possible when the operating system is functioning, which is often not the case when management is required. Clearly, change is necessary.

Delegation of responsibility is widely acknowledged as a way of getting things done in an industrial setting. Providing workers with the authority to make decisions speeds things up, making an enterprise more efficient. Translating this observation to the server management problem, the solution is clear: empower management software to make decisions regarding change or reconfiguration. Empowering software to make decisions requires it to have a number of characteristics.

First, the software must be capable of autonomous decision making. In other words, the software should be an intelligent agent. This implies that the software should separate its understanding (or knowledge) of what is to be managed from the ways in which problems are diagnosed. Second, the intelligent agent cannot be part of the managed system in terms of the resources that it consumes; e.g. CPU and disk. This requires some explanation. Imagine a scenario where a run-away process is consuming almost all of the CPU. It is difficult to see how an agent would be able to control a server in these circumstances. Consider another scenario in which critically low levels of disk space are detected. An agent sharing resources on the host would be unable to save information potentially critically important to the resolution of the problem. Finally, let's consider the scenario in which the operating system is hung; the agent can no longer communicate with external parties.

The scenarios described in the previous paragraph lead to the inevitable conclusion that the agents tasked with delegated system management should reside on a separate management plane; that is, a platform with separate computing and disk resources. Furthermore, the design of the computing platform should support the principles of Autonomic Computing, an area of computing recently proposed by IBM. AMD has also recently embraced autonomic computing principles in its efforts to improve the manageability of servers.

Autonomic Computing is a relatively recent field of study that focuses on the ability of computers to self-manage. Autonomic Computing is promoted as the means by which greater dependability will be achieved in systems. This incorporates self-diagnosis, self-healing, self-configuration and other independent behaviors, both reactive and proactive. Ideally, a system will adapt and learn normal levels of resource usage and predict likely points of failure in the system. Certain benefits of computers that are capable of adapting to their usage environments and recovering from failures without human interaction are relatively obvious; specifically the total cost of ownership of a device is reduced and levels of system availability are increased. Repetitive work performed by human administrators is reduced, knowledge of the system's performance over time is retained (assuming that the machine records or publishes information about the problems it detects and the solutions it applies), and events of significance are detected and handled with more consistency and speed than a human could likely provide.

The remainder of this article describes the essential requirements of an autonomic element for servers and client computers.

Figure 1. An Autonomic Element

Figure 1 provides a view of an autonomic element as proposed in the Autonomic Computing literature. In this figure, the Managed Element is the server or a client workstation, which includes the hardware, operating system and hosted applications.

The autonomic manager is responsible for real-time management of the host hardware, operating system and hosted applications. It runs customizable, policy-based server/OS/application management software, thereby automating IT service management. It performs preventive maintenance; detects, isolates, reports and recovers from host events and faults; and records root-cause forensics and other operating data of interest to the user. An autonomic manager achieves its goals by monitoring the host. The measurements are analyzed for scenarios of interest; e.g. disk full or impending failure. A plan to resolve the scenario is then generated; e.g. notification of the need for a higher-capacity drive and running of disk cleaning utilities as a temporary measure. Finally, the autonomic manager schedules (executes) the planned activity to resolve the scenario. All four processes (monitoring, analysis, planning and execution) depend upon knowledge built into the system--knowledge that has been encoded in policies and information that has been gathered from the Managed Element.
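The four processes can be illustrated with a minimal Java sketch of a single pass through the loop. This is not ACE code: the class `ControlLoopSketch`, the `DiskReading` type, the 90% threshold and the action strings are all assumptions made for illustration, and "executing" an action here only records it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one monitor-analyze-plan-execute pass; not ACE code.
final class ControlLoopSketch {

    // Knowledge: a policy threshold (assumed value, for illustration only).
    static final double DISK_FULL_THRESHOLD = 0.90;

    // Monitor: a measurement taken from the managed element.
    static final class DiskReading {
        final String volume;
        final double usedFraction;

        DiskReading(String volume, double usedFraction) {
            this.volume = volume;
            this.usedFraction = usedFraction;
        }
    }

    // Analyze: does the measurement match the "disk full" scenario?
    static boolean isDiskNearlyFull(DiskReading reading) {
        return reading.usedFraction >= DISK_FULL_THRESHOLD;
    }

    // Plan: list the actions that address the scenario, in order.
    static List<String> plan(DiskReading reading) {
        List<String> actions = new ArrayList<>();
        if (isDiskNearlyFull(reading)) {
            actions.add("notify: " + reading.volume + " needs a higher-capacity drive");
            actions.add("run disk cleanup on " + reading.volume);
        }
        return actions;
    }

    // Execute: carry out the plan; this sketch only records each action.
    static List<String> execute(List<String> actions) {
        List<String> log = new ArrayList<>();
        for (String action : actions) {
            log.add("executed: " + action);
        }
        return log;
    }

    // One pass of the loop for a single reading.
    static List<String> runOnce(DiskReading reading) {
        return execute(plan(reading));
    }

    public static void main(String[] args) {
        for (String line : runOnce(new DiskReading("C:", 0.95))) {
            System.out.println(line);
        }
    }
}
```

Keeping the threshold apart from the analysis and planning code mirrors the separation of knowledge from mechanism that the article asks of an intelligent agent. A real manager would run this pass continuously rather than once.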

The autonomic manager is an embedded computer application that runs continuously, responding to changes in a system being managed and acting on behalf of a user. Autonomic Managers are examples of embots. An autonomic manager has the following characteristics:

operates on the management plane
lightweight
autonomous; can operate without human involvement
social; it communicates with other autonomic managers
mobile; can move when resources move

Autonomic managers--embots--are trusted applications, designed to manage specific aspects of an operating system (e.g. services, processes or files in the file system) or its hosted applications (e.g. Microsoft Exchange). Embots run on top of an Embot Application Framework (EAF), which executes within the management plane (see Figure 2). Embots implement one or more policies designed to manage some aspect of an application, operating system, device driver or hardware. Embots are lightweight and achieve their expressive power through properties of self-organization and communication with other embots. Embots reason about changes observed within the managed element and can act autonomously; i.e. they can take actions without system administrator involvement.

In some cases, embots may ask a system administrator to confirm that an action should be taken; establishing the right level of trust is a straightforward matter of corporate policy. Finally, embots are mobile, which is necessary in order to support migrating virtual machines. If a virtual machine migrates from one physical machine to another, the embot collective managing it does too! So, what is a management plane?

A management plane is a computing environment in which management functionality, such as embots, runs. A management plane can be physical or virtual. An example of a physical management plane would be an Open Platform Management Architecture (OPMA) card running a framework hosting embots. An example of a virtual management plane would be a privileged virtual machine such as Xen's domain 0. The service plane represents the applications, operating system and hardware being managed. In Figure 1, the service plane is represented by the Managed Element.

Figure 2. Management and Service Planes

The management lifecycle
Figure 3 provides an overview of the lifecycle of an autonomic manager. The environment supporting the lifecycle consists of several important components: the Autonomic Controller Engine (ACE), the Management Console (MC), the Module Development Environment (MDE) and Management Modules (MM). ACE is an example of an autonomic manager shown in Figure 1 and consists of the Embot Application Framework and Embot Execution Environment shown in Figure 2.

The MC is a secure web portal through which all deployed ACEs are centrally managed. The MC centralizes notification and escalation of problems that the ACE cannot solve (e.g. hardware failures). The MC also acts as the integration point for enterprise management consoles such as Microsoft Operations Manager (MOM) and HP OpenView. The MC is also the point through which group deployment of software to ACEs is coordinated.

Figure 3. The Autonomic Management Lifecycle

The MDE is used to create and edit management modules. A management module is the unit of deployment of autonomic management. As such, it consists of embots that assist in managing the server and its applications along with utilities to support them. Management modules are deployed to one or more ACEs via the MC. One or more management modules can be deployed to an ACE. A module is instantiated in a module archive, similar in structure and intent to a web archive used by application servers. Simply put, a module archive is a directory structure of a standard format that contains managed element models, classes and resources that encode a management scenario of interest. A module archive may also contain dynamic link libraries that may be required in order to augment the low level instrumentation on the host and HTML documents that allow a user to interact with the run time version of the module for purposes of configuration.

From an autonomic manager's perspective, a module consists of a set of scenarios related on a conceptual level--for example, there might be one module designed to manage printers, another to audit host performance in order to establish normal levels of resource consumption, and a third to enforce security.

A scenario encompasses data and host information to be monitored, as well as the processing of this information: conditions, filters and thresholds to be satisfied, and actions to be taken, for instance events to be logged and alarms to be raised. Modules are completely pluggable, meaning that they can be installed, updated or reconfigured at runtime, and require no modifications to the engine framework.

ACE
In order for autonomic systems to be effective, IBM stresses the need for the adoption of open standards. There is little hope for the seamless integration of applications across large heterogeneous systems if each relies heavily on proprietary protocols and platform-dependent technologies. Open standards provide the benefits of both extensibility and flexibility--and they are likely based on the input of many knowledgeable designers. As such, the widely used standards tend to come with all of the other benefits of a well thought-out design. Java has many advantages as the implementation language for ACE, including its widespread industry use, platform independence, object model, strong security model, support for network computing and the multitude of open-source technologies and development tools available for the language.

WS-Management protocols are used for external communications and possess a natural synergy with the object model maintained by the autonomic manager. The Common Base Event (CBE) format is used for event information that flows between an MC and an ACE. Tools exist to create adapters from CBE-formatted events to a wide range of other formats.

Figure 4. ACE Architecture

Figure 4 shows that ACE has a service-oriented architecture: it is built from services that can be plugged and unplugged dynamically; i.e. software hot swapping is supported. Services can be arranged in bundles, with bundle lifecycle management handled according to the Open Services Gateway initiative (OSGi) standard.

OSGi is an effort to standardize the way in which managed services can be delivered to networked devices. It is being developed through contributions by experts from many companies in a wide variety of fields (such as manufacturers of Bluetooth devices, smart appliances, and home energy/security systems). An open specification is provided for a service platform so that custom services can be developed (in Java), deployed, and managed remotely.
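The idea of plugging and unplugging services at run time can be sketched in plain Java. The example below deliberately does not use the OSGi API, and it is not how ACE is implemented; `HotSwapRegistrySketch` and the `"notify"` service name are hypothetical. It shows only the core of hot swapping: callers look a service up by name on every call, so a replacement takes effect without a restart.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of hot-swappable services in plain Java; NOT the OSGi API.
final class HotSwapRegistrySketch {

    private final Map<String, Function<String, String>> services = new ConcurrentHashMap<>();

    // Plug in a service, replacing any version already registered under the name.
    void register(String name, Function<String, String> service) {
        services.put(name, service);
    }

    // Unplug a service while the rest of the system keeps running.
    void unregister(String name) {
        services.remove(name);
    }

    // Resolve the service on every call so that a swap takes effect immediately.
    Optional<String> call(String name, String input) {
        Function<String, String> service = services.get(name);
        if (service == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(service.apply(input));
    }

    public static void main(String[] args) {
        HotSwapRegistrySketch registry = new HotSwapRegistrySketch();
        registry.register("notify", message -> "v1 sent: " + message);
        System.out.println(registry.call("notify", "disk full"));
        registry.register("notify", message -> "v2 sent: " + message);
        System.out.println(registry.call("notify", "disk full"));
        registry.unregister("notify");
        System.out.println(registry.call("notify", "disk full").isPresent());
    }
}
```

Returning an empty `Optional` for a missing service forces callers to handle the unplugged case explicitly, which matters when services can disappear at any time.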

Figure 5. Combining Policy

As shown in Figure 5, policies implemented within embots are designed to be combined. Combination of policies allows for the creation of control loops--an architectural requirement in Autonomic Computing. In the example shown in Figure 5, a resource escalation policy can be instantiated for disks, and that, combined with a chronic scenario policy can be used to drive capacity planning activities. Combination of policies is achieved through having one policy listen for changes in another. Reuse of policy--of general-purpose knowledge--is a key goal in Autonomic Computing. For example, the knowledge included in a disk cleanup policy should be applicable to Windows and Linux; however, the instrumentation required to implement it must be different.
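Combining policies by having one listen to another can be sketched as follows. This is a minimal illustration rather than the EAF API: the class names, the 90% threshold and the limit of three escalations are assumptions. A resource escalation policy reacts to usage changes, and a chronic scenario policy listens to its escalations and flags the resource for capacity planning once they recur.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of combining policies by listening; not the EAF API.
final class PolicyCombinationSketch {

    // Raises an escalation each time a resource's usage reaches the threshold.
    static final class ResourceEscalationPolicy {
        private final double threshold;
        private final List<Consumer<String>> listeners = new ArrayList<>();

        ResourceEscalationPolicy(double threshold) {
            this.threshold = threshold;
        }

        void addListener(Consumer<String> listener) {
            listeners.add(listener);
        }

        // Called when the monitored attribute of a managed object changes.
        void onUsageChanged(String resource, double usedFraction) {
            if (usedFraction >= threshold) {
                for (Consumer<String> listener : listeners) {
                    listener.accept(resource);
                }
            }
        }
    }

    // Listens to escalations and treats repeated ones as a chronic problem.
    static final class ChronicScenarioPolicy implements Consumer<String> {
        private final int limit;
        private final Map<String, Integer> escalations = new HashMap<>();

        ChronicScenarioPolicy(int limit) {
            this.limit = limit;
        }

        @Override
        public void accept(String resource) {
            escalations.merge(resource, 1, Integer::sum);
        }

        boolean capacityPlanningNeeded(String resource) {
            return escalations.getOrDefault(resource, 0) >= limit;
        }
    }

    public static void main(String[] args) {
        ResourceEscalationPolicy escalation = new ResourceEscalationPolicy(0.90);
        ChronicScenarioPolicy chronic = new ChronicScenarioPolicy(3);
        escalation.addListener(chronic);
        double[] samples = {0.95, 0.50, 0.92, 0.99};
        for (double sample : samples) {
            escalation.onUsageChanged("disk0", sample);
        }
        System.out.println(chronic.capacityPlanningNeeded("disk0"));
    }
}
```

Neither policy knows anything about disks in particular, so the same pair could be instantiated for another resource such as memory.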

EAF supports this kind of policy reuse by separating the sensor and effector layers into two distinct facets and by modeling managed objects. This is shown in Figure 6, where an upward-facing component provides sensor and effector feedback information in a standardized format regardless of the element being managed. It is the managed object (e.g. a disk) that listens to the upward-facing sensor. Policies, in turn, listen for changes in the value of specific managed object attributes.

The downward-facing component of the layer is implementation-based and deals with the platform-specific details of instrumentation. As a result, a policy can be reused across the two environments; only the sensor and effector implementations need to be written separately for each operating system.
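The split between the two facets can be sketched in Java. All names here are hypothetical and the two sensor classes are stand-ins that take their values from constructor arguments; real instrumentation would query the operating system. For brevity the managed object polls its sensor rather than listening to it. The point is that the policy method is written once and works with either sensor.

```java
// Hypothetical sketch of the two sensor facets; not the EAF API.
final class SensorFacetSketch {

    // Upward-facing facet: the same normalized reading on every platform.
    interface DiskUsageSensor {
        double usedFraction();
    }

    // Downward-facing stand-in: pretends the platform reports byte counts.
    static final class ByteCountSensor implements DiskUsageSensor {
        private final long usedBytes;
        private final long totalBytes;

        ByteCountSensor(long usedBytes, long totalBytes) {
            this.usedBytes = usedBytes;
            this.totalBytes = totalBytes;
        }

        @Override
        public double usedFraction() {
            return (double) usedBytes / totalBytes;
        }
    }

    // Downward-facing stand-in: pretends the platform reports a percentage.
    static final class PercentSensor implements DiskUsageSensor {
        private final int usedPercent;

        PercentSensor(int usedPercent) {
            this.usedPercent = usedPercent;
        }

        @Override
        public double usedFraction() {
            return usedPercent / 100.0;
        }
    }

    // Managed object: exposes an attribute fed by whichever sensor is plugged in.
    static final class ManagedDisk {
        private final DiskUsageSensor sensor;

        ManagedDisk(DiskUsageSensor sensor) {
            this.sensor = sensor;
        }

        double usedFraction() {
            return sensor.usedFraction();
        }
    }

    // Policy: platform-independent knowledge, written once (threshold assumed).
    static boolean cleanupNeeded(ManagedDisk disk) {
        return disk.usedFraction() >= 0.90;
    }

    public static void main(String[] args) {
        ManagedDisk first = new ManagedDisk(new ByteCountSensor(950L, 1000L));
        ManagedDisk second = new ManagedDisk(new PercentSensor(40));
        System.out.println(cleanupNeeded(first));
        System.out.println(cleanupNeeded(second));
    }
}
```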

Figure 6. Sensor and Effector Layers

Summary
Server total cost of ownership continues to rise while server reliability and availability do not. Conventional client-server system management solutions have failed to deliver the availability normally expected of telecommunications products. The need for a new approach to systems management is therefore clear.

Recently, IBM's Autonomic Computing initiatives clearly identified the need for embedded, intelligent system management. ACE, with its embot concepts, delivers Autonomic Computing for client and server computers.

About the Authors: Jay Litkey is the CEO of Embotics. Before Embotics, Jay founded Symbium, a venture capital funded company focused on Autonomic Computing and the automated management of IT infrastructure, where he managed business development and product management. Prior to that, Jay was the founding CEO of BlackholeTV, an Internet public video content aggregation and technology company. Jay is a frequent conference speaker. He started his career at Nortel Networks in the design and development of high availability telecommunications management subsystems, and holds an Honors degree in Computer Science from Queen's University. He can be reached at: [email protected]

Tony White is the CTO of Embotics. Prior to Embotics, Tony served as CTO at Symbium, setting the technology vision and serving as principal architect. At Texar, as Chief Scientist, he built and managed an R&D group of 35 engineers. While at Nortel Networks he was the principal architect for the Expert Advisor, a multi-agent diagnostic system for X.25 networks. He has published over 70 papers on subjects that include Network and System Management, Multi-agent systems, Swarm Intelligence, and Autonomic Computing. He has been awarded 7 patents with several others pending. Tony can be reached at: [email protected]