Distributed Computing Component Lifecycles

A fundamental problem in all distributed systems is that different components are maintained on different schedules - and things inevitably break. Cliff presents a framework designed to synchronize components in distributed systems.


September 01, 2001
URL:http://www.drdobbs.com/jvm/distributed-computing-component-lifecycl/184404765


Cliff is chief scientist for Digital Focus and author of Advanced Java 2 Development for Enterprise Applications (Prentice Hall, 2000). He can be contacted at [email protected].


Since interoperating components in collaborative hub-and-spoke and peer-to-peer (P2P) environments are provisioned and maintained by multiple parties, there is usually no single authority coordinating maintenance and upgrades of all components. Consequently, some participants will likely always be out of sync. This is a fundamental lifecycle problem in distributed systems. Different components are maintained on different schedules and things inevitably break. Loosely bound message-oriented protocols do not protect you from this — they merely defer the failure point to run time. For that matter, even tightly bound systems are vulnerable to run-time errors resulting from deployment of new versions of services at unexpected times. To address problems such as this, I present in this article a framework designed to synchronize components in distributed systems. A Java-based partial implementation of this model is available electronically (see "Resource Center," page 5). This implementation uses a modified RMI as the framework distribution protocol.

Figure 1 illustrates my approach for updating components for distributed systems. This model consists of a set of application types called "services" that are instantiated across the network as applications (also referred to as "application components," "service objects," or "services"). Services execute in run-time containers that I refer to as a service's "run time." Services may or may not communicate with other services. Furthermore, each service is deployed with the help of a remote service provider — a remote factory for making one or more types of service objects that can be deployed to point-of-use run times over the wire.
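To make the moving parts concrete, the following sketch shows the kinds of Java interfaces such a model implies. The names Service, ServiceRuntime, and ServiceProvider (and their methods) are illustrative assumptions for this discussion, not part of the implementation provided electronically. Later sketches in this article reuse these names.

import java.rmi.Remote;
import java.rmi.RemoteException;

// An application component ("service") as seen by its run time.
interface Service {
    void start();
    void stop();
    String getVersion();
}

// The point-of-use container ("run time") in which services execute.
interface ServiceRuntime {
    // Install a service object delivered by a remote service provider.
    void install(String serviceName, Service service);

    // Look up an installed service by name.
    Service lookup(String serviceName);

    // Initiate an in-place update sequence for an installed service.
    void update(String serviceName);
}

// A remote factory that manufactures service objects for deployment
// to point-of-use run times over the wire (for example, via RMI).
interface ServiceProvider extends Remote {
    // The returned object (or its code and resource set) is transported
    // to the requesting run time.
    Service createService(String serviceName, String requestedVersion)
            throws RemoteException;
}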

A fundamental assumption of the model in Figure 1 is that services can use other services (either locally or remotely) and reference each other (for instance, via memory references, if they execute within the same process). Another assumption is that each run time has a locking mechanism (such as Java object monitors) for serializing access to the service objects that are installed on it. I further require that version incompatibility between services be detected and responded to by initiating an update sequence. The update sequence can then be performed in place and, in most cases, does not require the requesting service to be restarted.

All services have a locally installed footprint on the systems that use them. However, the business logic of an installed service may or may not run locally on the requestor's host. In the latter case, the locally installed service is merely a proxy for the actual service and probably contains stubs for remote communications as part of its code base. In any event, from the point of view of the requestor, the locally installed service behaves as if it runs locally. It may perform remote calls as part of its function, but this is not required to be exposed in the service's interface. No distinction is required between local and remote services.

If a service attempts to use another service, it must have access to a schema or API interface for that service (such as a Java interface or XML schema object). The interface or schema accessed by users of a service must be fixed because it represents the contract assumed by existing users of the service (other services, for instance) for a particular version of the service. Therefore, it is best if it is maintained statically in the requesting service's namespace (this would be the case for a Java interface) or in a location known to be strictly maintained by version. Once the correct interface or schema is obtained, the lookup service must return an actual implementation of the requested service based on the version currently installed. If, however, the currently installed version is out of date, exceptions can be generated by the run time.
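For example, here is a hypothetical requesting service that compiles against a fixed contract (OrderService) and obtains an implementation through its run time. OrderService and ServiceVersionException are illustrative names layered on the ServiceRuntime sketch above, not actual framework APIs.

// Fixed contract, maintained statically in the requestor's namespace.
interface OrderService {
    void placeOrder(String itemId, int quantity);
}

// Unchecked exception the run time might raise for an out-of-date version.
class ServiceVersionException extends RuntimeException {
    ServiceVersionException(String message) { super(message); }
}

class OrderClient {
    private final ServiceRuntime runtime;

    OrderClient(ServiceRuntime runtime) { this.runtime = runtime; }

    void buySomething() {
        try {
            // The run time returns whichever implementation of the
            // contract is currently installed.
            OrderService orders = (OrderService) runtime.lookup("OrderService");
            orders.placeOrder("SKU-123", 2);
        } catch (ServiceVersionException e) {
            // If the installed version is out of date, the run time can
            // generate an exception and an update sequence can be initiated.
            runtime.update("OrderService");
        }
    }
}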

The goal of my framework is to delegate to the run times and service providers the responsibility of keeping things up to date. The functions a framework such as this should perform include:

From the perspective of other services within a service run-time environment, services do not exist (or have states) until they are installed. Once installed, a service may be in one of three states (see Figure 2):

Detecting Service Incompatibility

Services can detect that they are incompatible with other services in several ways, including:

The dynamic proxy mechanism in the Java SDK 1.3 Reflection API can be used to build compatibility checking into service calls as a side effect. This mechanism lets you interpose code between callers and called objects without affecting the caller. As long as the service run time provides the factory by which a caller obtains a service object, the run time can create a proxy object that is called first and then (if desired) delegates to the actual object — even if it does not know the object's type ahead of time. The run time can therefore create proxies to handle all service calls, and those proxies can perform version checking as a side effect of each call. For remote calls, a proxy could insert version information into the stream of the remote request, which the remote service could check for compatibility. For local calls, the proxy could internally catch version incompatibility errors and respond according to the rules of the run time.
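Below is a minimal sketch of this idea using java.lang.reflect.Proxy. Only the dynamic proxy API is standard; the version bookkeeping (checkCompatibility, expectedVersion) is an assumption about how a run time might flesh it out.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Interposes a version check between a caller and a service object.
class VersionCheckingHandler implements InvocationHandler {
    private final Object target;           // the actual service object
    private final String expectedVersion;  // version the caller was built against

    VersionCheckingHandler(Object target, String expectedVersion) {
        this.target = target;
        this.expectedVersion = expectedVersion;
    }

    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        checkCompatibility();   // version check as a side effect of every call
        try {
            return method.invoke(target, args);   // delegate to the actual object
        } catch (InvocationTargetException e) {
            throw e.getTargetException();         // rethrow the service's own exception
        }
    }

    private void checkCompatibility() {
        // Placeholder: compare expectedVersion with the installed version and,
        // on a mismatch, throw ServiceVersionException or trigger an update.
    }

    // The run time's factory wraps every service object it hands out;
    // the caller casts the result to the service interface it expects.
    static Object wrap(Object service, Class serviceInterface, String expectedVersion) {
        return Proxy.newProxyInstance(
                serviceInterface.getClassLoader(),
                new Class[] { serviceInterface },
                new VersionCheckingHandler(service, expectedVersion));
    }
}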

Service Object Elaboration

If a service provider determines that it needs to create new service instances for deployment to service users, it must create them, then perform any required server-side initialization or configuration of each instance. Creation of the service object (or of its code and resource set) may be done with a factory unique to the service. Initialization is not so straightforward. Generally, some aspects of a service can be initialized on the service-provider side, while others can only be initialized on the point-of-use side (where the object will be used). The service provider must take care of its end of the service initialization, which I refer to as "elaboration."

For object-oriented services, service elaboration logically consists of fully constructing the required service instance and filling in all the pieces that service users will need. The object graph should be complete; otherwise, the full set of code and resources will not be transported up front, and you cannot ensure that the classes will be present when they are needed, especially if the service is used in a disconnected mode. But the question arises: What constitutes completion? Since objects reference other objects — potentially other service objects — this question must be answered with precision.

The most practical approach is to make a distinction between primary and nonprimary object references. A primary reference is directional and conveys the semantic quality of ownership: If the owning object is removed, then the owned object should be removed as well. If the cardinality between the two is one-to-many from the owner to the owned object, this constitutes a composite collection. In contrast, any object may hold nonprimary references to other objects. These have no framework-defined meaning.

The elaboration phase, therefore, consists of ensuring that all primary references are non-null; only then is the object ready to be returned to the requestor. The object may be packaged in a container such as a JAR file for transport, or its components may be transported individually.
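A sketch of provider-side elaboration might look like the following; CatalogService, PriceList, and ProductIndex are hypothetical names chosen only to show primary references being filled in and checked before the object graph is returned.

class PriceList {}
class ProductIndex {}

class CatalogService {
    private PriceList priceList;        // primary (owned) reference
    private ProductIndex productIndex;  // primary (owned) reference

    void setPriceList(PriceList p)       { this.priceList = p; }
    void setProductIndex(ProductIndex i) { this.productIndex = i; }
    PriceList getPriceList()             { return priceList; }
    ProductIndex getProductIndex()       { return productIndex; }
}

// Factory unique to the service, running on the service-provider side.
class CatalogServiceFactory {
    CatalogService elaborate() {
        CatalogService service = new CatalogService();

        // Provider-side initialization of the primary references.
        service.setPriceList(new PriceList());
        service.setProductIndex(new ProductIndex());

        // Elaboration is complete only when no primary reference is null;
        // only then is the object graph ready to package (in a JAR, say)
        // and return to the requestor.
        if (service.getPriceList() == null || service.getProductIndex() == null) {
            throw new IllegalStateException("primary references not elaborated");
        }
        return service;
    }
}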

Referential Integrity

Services will use other services. To do so, they need to hold references to each other. In addition, a run time needs to hold references to the services that are installed in it. When a service is replaced, all these references must be found and disconnected, then reconnected when the new service is installed. This is a potentially complex process unless you bound it by establishing rules for how services interconnect, how they manage references to each other and to their environment, and how the environment manages references to them.

Therefore, my first proposed rule for this is that when a service is stopped, it releases (nullifies and stops using) run-time references it holds to other services and any resources external to it. That ensures that resources are not in use when a service is replaced.
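As a sketch (reusing the hypothetical Service interface from earlier), a service's stop method might release its references and external resources like this; InventoryService and its fields are illustrative assumptions.

class InventoryService implements Service {
    private Object orderServiceRef;        // a reference it holds to another service
    private java.sql.Connection database;  // an external resource

    public void start() {
        // Obtain references and resources here (omitted).
    }

    public void stop() {
        // Nullify and stop using references so that nothing is in use
        // when this service (or one it depends on) is replaced.
        orderServiceRef = null;
        if (database != null) {
            try { database.close(); } catch (java.sql.SQLException ignored) {}
            database = null;
        }
    }

    public String getVersion() { return "1.0"; }
}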

The next rule is that no application (a service or the run time) should hold a run-time reference to a service for longer than it takes to perform a short-duration operation. Further, no such operation may block for an indeterminate or long period of time. Instead of holding a reference to a service, other services (and the run time) may hold indirect reference objects (local references), which are managed by the run time and can be used to obtain an actual service reference when required. This lets the run time manage access to references to services.

My final rule is that whenever a service (or the run time) obtains an actual run-time reference to a service (via a local reference), it synchronizes on the corresponding local reference object for the period of time it uses the service reference. This enforces serialization with regard to access to services. To make this possible, the run time must ensure that a service has one and only one local reference object for it.

In addition, when an actual reference is obtained for a service through its local reference, callers may supply a listener for synchronous notification when the actual service reference is being reclaimed or when it is set to a new value after having been reclaimed. This lets clients of the service take appropriate action if a service they are using is being updated.

Sometimes a service may need to expose references to elements that are internal. In this case, clients holding those internal references need to clear them whenever the service is updated. If the references need to be held for long periods, such clients should employ listeners to clear the references in response to notification by the run time. If, on the other hand, the references are only held for short periods and can be released when not in use, the client should synchronize on the local reference of the service being used, for as long as the internal service references are needed.
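A minimal sketch of the local-reference idea follows; the LocalReference class and listener interface are assumptions about how a run time could manage one (and only one) indirect reference object per installed service.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Synchronous notification to clients of a service being updated.
interface ServiceReferenceListener {
    void referenceReclaimed();              // the actual reference is being reclaimed
    void referenceReplaced(Object newRef);  // it has been reset after an update
}

// Indirect reference object, managed by the run time; exactly one per service.
class LocalReference {
    private Object actual;                   // the actual run-time reference
    private final List listeners = new ArrayList();

    // Callers synchronize on the LocalReference for the (short) period
    // during which they use the actual service reference.
    synchronized Object get() {
        return actual;
    }

    synchronized void addListener(ServiceReferenceListener l) {
        listeners.add(l);
    }

    // Called by the run time when the service is about to be replaced.
    synchronized void reclaim() {
        for (Iterator i = listeners.iterator(); i.hasNext(); ) {
            ((ServiceReferenceListener) i.next()).referenceReclaimed();
        }
        actual = null;
    }

    // Called by the run time once the new service version is installed.
    synchronized void replace(Object newService) {
        actual = newService;
        for (Iterator i = listeners.iterator(); i.hasNext(); ) {
            ((ServiceReferenceListener) i.next()).referenceReplaced(newService);
        }
    }
}

// Usage by a client (holding the LocalReference, not the service itself):
//   synchronized (localRef) {
//       OrderService orders = (OrderService) localRef.get();
//       if (orders != null) orders.placeOrder("SKU-123", 2);
//   }   // release promptly; never block for long inside this section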

Security

Requirements related to security include defining a run-time security context for a newly installed service version, sharing resources across services, and authentication/authorization for access by users and code sources.

Authorization is a complex problem in a system in which components originate from different sources that do not completely trust each other, yet the components must interoperate and possibly share data or resources.

Not only must users be given authorization, but interoperating services from different sources must give access to each other — based on roles and privileges that are unique to each service. The process of installing a service therefore needs to be a secure process and should include the specification of these roles and privileges.

The run time needs to have a privilege-based or policy-based security manager that acts as a resource gate for all installed services. It also needs to provide an authentication mechanism, and a delegation mechanism that services can use to confer roles and privileges onto other principals, thereby establishing their security context.
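The sketch below illustrates one shape such a resource gate and delegation mechanism could take; ServiceSecurityManager and its methods are purely hypothetical, and a real run time would more likely build on the Java 2 security model (SecurityManager, Permission, and policy files).

import java.security.Principal;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ServiceSecurityManager {
    // principal name -> set of privileges granted at install time
    private final Map grants = new HashMap();

    // Installation is the secure step at which roles/privileges are specified.
    void grant(Principal p, String privilege) {
        Set privs = (Set) grants.get(p.getName());
        if (privs == null) {
            privs = new HashSet();
            grants.put(p.getName(), privs);
        }
        privs.add(privilege);
    }

    // Every access by an installed service to a shared resource passes
    // through this gate.
    void checkAccess(Principal p, String privilege) {
        Set privs = (Set) grants.get(p.getName());
        if (privs == null || !privs.contains(privilege)) {
            throw new SecurityException(p.getName() + " lacks " + privilege);
        }
    }

    // Delegation: one principal confers a privilege it holds onto another,
    // establishing that principal's security context.
    void delegate(Principal from, Principal to, String privilege) {
        checkAccess(from, privilege);
        grant(to, privilege);
    }
}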

Interdependent Services

There exists the possibility of deadlock if two services that access each other are updated at the same time.

A lazy solution to deadlock is to provide a timeout mechanism and use a retry loop around any operation that might deadlock. Such a timeout is usually an indication that a service is in use and cannot stop what it is doing, and so the update should be tried again at a later time. A more proactive solution is to analyze the requirements of each request before it is attempted. In an update framework in which all resources and update requests are managed by a run-time system, proactive deadlock detection is practical because update requests are few and far between.
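A sketch of the lazy approach, layered on the earlier ServiceRuntime sketch, might look like this; UpdateTimeoutException and the retry parameters are assumptions.

// Unchecked exception the run time might raise when it cannot quiesce a service.
class UpdateTimeoutException extends RuntimeException {}

class UpdateRetrier {
    private static final int MAX_ATTEMPTS = 5;
    private static final long RETRY_DELAY_MILLIS = 30000;  // back off between attempts

    boolean updateWithRetry(ServiceRuntime runtime, String serviceName) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                runtime.update(serviceName);  // may time out if the service is busy
                return true;
            } catch (UpdateTimeoutException e) {
                // The service could not stop what it was doing; retry later.
                try {
                    Thread.sleep(RETRY_DELAY_MILLIS);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        // Give up for now; a proactive scheme would instead analyze the
        // request's resource requirements before attempting the update.
        return false;
    }
}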

Another concern with interdependent services is that updating one may require updating the other. This is true in cases where the behavior of one service would change in such a way that it would be different than the behavior expected by another dependent service. In that case, the interface might remain the same, but behavior changes might necessitate updating both services if either is updated. A framework therefore requires a means of identifying such interdependencies and incorporating that information into the update process.

Interdependencies across services are hard to accurately identify because combinations of components and configurations may affect dependencies in unanticipated ways. However, you can define which versions of a service are backward compatible with earlier versions.

The Open Software Description (OSD) Standard is a specification developed by Microsoft and Marimba in an effort to define a language in which application developers could explicitly specify cross-dependencies among application components. An OSD-enabled deployer can use this information to obtain all the necessary pieces, then find and execute installation scripts to install them. OSD is an XML-based standard, and is used extensively by Microsoft for its software. It is also used by some peer-to-peer frameworks, such as Groove.net.

Uninstalling

The fundamental problem in uninstalling software is identifying what is still in use by other applications and, by implication, what can be reclaimed. For this process to be reliable, the framework must take responsibility for organizing and managing a service's resources. That is the only way it can hope to answer the question of what resources are still in use.

If each service were completely standalone, the solution would be trivial: When a service needs to be uninstalled, simply delete everything in the directory containing its classes and resources. The complexity arises when services share persistent resources. For this reason, it is important to define the granularity at which sharing may occur. In my framework, I have set it to be the boundary between services: A service may use another service via that service's public interface, but it may not retain pieces of the service — it must always ask a service for a piece, and there are strict rules regarding how long it may hold a piece of a service.

This avoids the need to track the pieces of each service. It does not eliminate the need to define dependencies between services: A framework could still benefit from knowing that if it updates Service A, it should also update Service B, rather than discovering it as a result of a failure — hence the need for an explicit expression of cross-dependencies such as that provided by the OSD.

Resuming After an Update

When a running service is about to be updated, the service must be put into a known state, so that after the service has been updated, it can be restored to that state. In other words, it must be placed in a restorable state. A trivial and degenerate form of a restorable state is an inactive state that has no detailed state (no internal data) associated with it; for example, an initialization state. A more general form of restorable state is any logically consistent state from which the service could be resumed after it has been updated.

The database world has a mature methodology for this — the concept of a transaction. If an application's operations can be separated into discrete atomic transactions, then it should be possible to update the application and replay the transaction that failed with a version incompatibility error.

An important distinction to make in determining whether transactions in progress can be replayed is whether the new version of the service is backward compatible. If it is, the functional behavior and signatures of the methods that existed in the old version's interface are unchanged, and you should be able to resume or replay transactions that were in progress, as long as you can restore the application to the state it was in after its last completed transaction.
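A sketch of that replay pattern, reusing the hypothetical ServiceRuntime and ServiceVersionException from earlier, follows; the Transaction interface is likewise illustrative.

// A discrete, atomic unit of work issued against a service.
interface Transaction {
    void execute() throws Exception;
}

class ReplayingExecutor {
    private final ServiceRuntime runtime;

    ReplayingExecutor(ServiceRuntime runtime) { this.runtime = runtime; }

    void run(String serviceName, Transaction tx) throws Exception {
        try {
            tx.execute();
        } catch (ServiceVersionException e) {
            // The transaction failed with a version incompatibility error:
            // update the service in place, then replay the transaction.
            // (This assumes the new version is backward compatible and the
            // application has been restored to its last completed-transaction
            // state.)
            runtime.update(serviceName);
            tx.execute();
        }
    }
}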

Update Coordination and Synchronization

In some situations, it may be critical to ensure that all instances of a service across a network are upgraded simultaneously. More precisely, it may be necessary to ensure that as of a certain time, all transactions issued by services are made with the latest version of the service instead of an earlier version. This might be required if, for instance, the service contains embedded business rules and new rules must take effect at a certain time.

It is also necessary to ensure that new versions of a service are deployed to their service provider in a transactional manner. Thus, if a service is updated, it doesn't get partly new and partly old content. A straightforward way to do this is by simply shutting down the service provider when new service versions are copied to it.
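One way to sketch that, assuming the provider's content lives in a directory that can be swapped while the provider is shut down, is a staged copy followed by a rename; the class and directory names are assumptions, and renameTo's atomicity depends on the file system.

import java.io.File;

// Promote a completely staged new version so the provider never serves
// partly new and partly old content.
class ProviderDeployer {
    void deploy(File stagedVersionDir, File liveDir, File retiredDir) {
        // The new version has already been copied in full to stagedVersionDir,
        // outside the provider's live area, and the provider is shut down.
        if (liveDir.exists() && !liveDir.renameTo(retiredDir)) {
            throw new IllegalStateException("could not retire the current version");
        }
        if (!stagedVersionDir.renameTo(liveDir)) {
            retiredDir.renameTo(liveDir);   // roll back to the retired version
            throw new IllegalStateException("could not promote the staged version");
        }
    }
}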

Conclusion

Mature, centrally administered update technologies from the client-server world tend to make the assumption that there is a single system administrator responsible for coordinating updates to all applications within that environment. With the Internet, we are moving to a period that will be dominated by multiple centrally operated services, each with many cooperating providers. The model in which a single dominant player sets all the rules is inflexible and will slow the growth of services. What is needed is an infrastructure that provides support for lifecycle management and sets some rules for how applications must behave to plug into that infrastructure.

DDJ


Figure 1: Distributed service scenario.


Figure 2: The states of an installed service.
