Dr. Dobb's | Grid-Enabling Resource-Intensive Applications

Our authors take an application designed to run on a single PC and distribute its processing over a network grid.

In this article, we discuss how we took an application designed to run on a single PC and created a framework that enabled it to distribute its processing over a network grid. The original application is called "CodeMatch" and the grid version we created is "CodeGrid." CodeMatch detects software plagiarism by comparing two directories of possibly thousands of source-code files, then producing a report showing all matching patterns found (www.ddj.com/dept/architect/184405734). CodeMatch is used in court cases to help prove or disprove copyright infringement. Because its algorithms are very rigorous, they are also very time consuming, and CodeMatch can take a long time to compare large numbers of files. For this reason, we were seeking a way to speed it up and came up with the idea of a network grid for resource sharing.

There are many different ways to share resources between multiple computers over a network, and in our research, we compared some of these by listing the respective advantages, disadvantages, and the trade-offs of choosing one way over another (see Table 1). The main advantages of distributed computing are increased performance by sharing the load of resource-intensive applications, improved scalability, and fault tolerance. These advantages come at a cost however of added complexity, potential security problems, and increased manageability issues. To develop a successful distributed system, careful planning is necessary as remote calls can be 1000 times slower than a local call. There has to be a balance between the added overhead of the network management and the size of the job itself.

Architecture Comparison
Architecture	Advantages	Disadvantages
Client-Server	Many tools available, especially in Java. Basis for other architectures.	Overhead of managing socket connections can be tedious. If server fails, clients are useless.
Peer-to-Peer	No need for central server.Robust (no single point of failure).	Harder to manage froma central location.Generally refers to sharing of files, but not hardware resources.
Distributed Objects	Very scalable. Increased flexibility.	Requires middleware or developer must manage synchronization and client-server overhead.
Clustering	High Performance. Excellent load balancing.	Needs homogeneous hardware. High Administrative overhead for processes and network.
Grid Computing	Can use heterogeneous hardware. Can be used across large networks such as the internet. No central administration.	Compared to a normal cluster, a grid has more middleware issues to deal with.

Table 1: Architecture comparison.

In the search for the best approach to this problem, we explored several possible solutions, using different programming methodologies (such as procedural and object-oriented programming) and different frameworks (such as .NET and Java frameworks). We looked at the various software tools available, and different communications methods such as Remote Method Invocation (RMI) and Remote Procedure Calls (RPC). We also looked at different algorithms for distributing the work such as the master-slave paradigm and the push-pull model. While our design choices are by no means the only ones, they represent what we felt were the best choices in this particular project.

Architectures

Distributed programming has several basic common architectures: client-server, peer-to-peer, distributed objects, clustering, and grid computing. These architectures are not all mutually exclusive, and particular implementations may be combinations of these architectures.

Communication Strategies

There are three common strategies for implementing a communications scheme in client-server architectures (Table 2). These are presented in the order of oldest and most standard to newest and most modular. Berkeley sockets (or just "sockets") are the earliest version we discuss, followed by RPC, and then RMI. The latter two strategies are both based upon sockets yet remove some of the management overhead normally required by sockets.

Brief Comparison of Communication Strategies
Strategy	Advantages	Disadvantages
Sockets	The "Standard" for network application communication.	Data is sent as bytes and must be reconstructed into something useful at the receiving end of the connection.
Remote Procedure Calls (RPC)	Sends complete data structures instead of just bytes. Offers reliability in the form of an acknowledgment message that the data has been successfully transferred. Eliminates socket management overhead.	Complex Interface Description Language needed to allow various platforms to call the RPC.
Remote Method Invocation (RMI)	Passes objects instead of straight bytes or data structures. Eliminates socket management overhead. Access to rich Java library tools.	Java only (unless JNI is used). Need JVM 1.5 or newer to use latest features.

Table 2: Communication strategies.

Sockets are defined as endpoints for communication, with one socket at each endpoint of the communications channel. A socket works with Internet Protocol (IP) and either Transmission Control Protocol (TCP) or UDP (User Datagram Protocol). The difference between TCP and UDP is that TCP provides error checking and is bidirectional. UDP is unidirectional and provides no error checking. In this project, we needed TCP because it provides feedback that a connection has actually been established. With sockets the client has to manage the socket connection by opening it, closing it, and establishing an input stream to read from the socket. The server employs a "listener" to monitor a specific port. When a client sends a request for a connection, the server will accept the request and the connection will be established. With sockets each connection is unique. The client needs to know the address of the server, but the server does not need to know the address of the client prior to the connection being established. Once a connection is established, both sides can send and receive information.

The main disadvantage of plain sockets is the overhead of creating and managing the connections. There are newer communications strategies such as RPC and RMI that simplify these connection maintenance details.

When a process calls a procedure on a remote application in a distributed system, it is known as an RPC. With RPCs, an ordinary data structure is used in passing data to a remote procedure. The essential concept of RPC is hiding all the network code within "stub" procedures. The goal of RPC is to simplify writing distributed applications by making the networking code transparent. With RPC, a daemon or "stub" runs on a port of a remote system and listens for messages addressed to it. These stubs locate the appropriate port on the client or server and package the parameters into a form that can be transmitted over the network. This is known as "marshaling." The main drawbacks of RPC are that in passing parameters, true pass-by-reference is not permitted, it is difficult to send and make sense of complex data types such as user-defined types, exception handling is more difficult, and there can be issues with data representation (www.wikipedia.org/wiki/Remote_procedure_call). There is a selection of standardized RPC systems available such as Microsoft's .NET Remoting and Java's Remote Method Invocation.

For reasons of portability and tool availability, we decided to use RMI. With RMI, objects are passed using remote method calls known as "callbacks" instead of data structures. RMI simplifies the design of the client because all it does is get a proxy for the remote object from entries in the RMI registry. It then simply calls the method in the same way it would call a local method. Since RMI is Java specific, it doesn't provide a direct connection between objects implemented in different languages. Using Java's Native Interface (JNI) API, however, it is possible to wrap existing C or C++ code such as the CodeMatch DLL with a Java interface, then export this interface remotely through RMI (see Figure 1). The JNI lets Java code that runs inside a Java Virtual Machine (JVM) interoperate with applications and libraries written in other programming languages.

Distribution Strategies

There are two different basic ways to distribute the jobs to the worker machines in this type of consumer-producer scenario (Table 3). These are known as the "push model" and the "pull model." The way we will describe these is to have a "Master" computer, which is the controller, and many "Worker" computers that merely share their resources and perform computations. This is commonly known as the Master-Worker Paradigm (see Figure 2). The Master initiates the computation by creating a set of tasks or jobs. With the push model, the Master divides up the jobs by some predefined criteria and sends them to the Workers. With the pull model, the Master puts the jobs in some shared container and then waits for the tasks to be picked up and completed by the Workers (see Figure 3).

Brief Comparison of Distribution Strategies
Strategy	Advantages	Disadvantages
Master divides up jobs and send a chunk to each worker (Push model)	No need to set up a shared queue.	More difficult to establish any sort of load balancing. More work up-front to decide how to distribute the work. Greater initial network traffic.
Shared queue (Pull model)	Work load balances itself.	Extra work in creating the queue and setting up its availability to workers.

Table 3: Distribution strategies.

One of the advantages of using the pull model is that the algorithm automatically balances the load. This is due to the simple fact that the set of work is shared, and the workers can pull work from the set at their own pace until there is no more work to be done. The pull model provides excellent load balancing regardless of Worker speeds and network variations. This algorithm also has good scalability.

Frameworks

The three frameworks (see Table 4) we considered in the development of this software were:

Brief Comparison of Frameworks
Framework	Advantages	Disadvantages
Microsoft .NET	Class library functions available from Microsoft (with C#). Microsoft environment.	Need to know C#. Known security issues. Limited to Windows.
Java Native Interface (JNI)	Class library functions available. Allows cross-platform development.	Complex API. No garbage collection. Platform dependent.
Enterprise Java Beans (EJB)	Class library functions available. Hides implementation complexity of RMI. Uses latest advances in efficient distributed computing development. Platform independent.	Designed mainly with web applications in mind. Intended for larger systems than CodeGrid (has additional overhead that we don’t need). API is difficult to learn. Complex XML descriptors.

Table 4: Frameworks.

In consideration of timeliness and saving costs associated with writing and debugging code, it is often advantageous to utilize as much existing code as possible in creating an application. Open source and vendor supplied code snippets can perform many common functions such as standard I/O socket management. This is one area where Java really excels and there are many of these snippets (known as "JavaBeans") downloadable from both Sun and associated vendor's websites.

.NET is similar to the Java solution in that it provides libraries of coded solutions to common program requirements and manages the execution of programs written specifically for the framework. The ideal language to use for this project with .NET would be C#, which is based on Java and C++.

Java Native Interface (JNI) is a programming framework that lets Java code running in the Java VM call and be called by native applications and libraries written in other languages, such as C, C++, and assembly. The JNI can be used to wrap an existing application (such as CodeMatch), written in another programming language (in this case, C), and enable its functions to be accessible to Java applications. The main advantage of this tool is that legacy application code can be integrated with new Java code and not have to be rewritten in Java. The main drawbacks of using JNI are:

JavaBeans are simply software components written in Java that have a standard format that makes them ideal for reuse in many different projects. JavaBeans are also independent objects that can exist outside of a program. JavaBeans are intended to handle common functions leaving you free to concentrate on the particular program at hand. Enterprise Java Beans (EJBs) let highly scalable, platform independent, complex systems be built quickly and cost effectively. The main drawback of using EJBs is that the API is very complicated with many interfaces to implement, and many complex XML deployment descriptors.

Conclusion

Most of the architectures and communication strategies we have discussed here have similarities and overlaps because they have all basically evolved from the same concepts, or from each other. This makes the comparisons a bit blurred. Also, there are many different programming languages available for use, and any one of them could probably have been used to develop this software. We chose Java mainly because we felt it had the most to offer as far as applicable library routines, and would be the least restrictive as far as future portability. Also we felt development time would be shorter because of Java's lack of pointer issues. Our other decisions were of course heavily influenced by our choice to use Java.

From our research we decided that RMI was the most advanced technology to use for the communications between client and server. From what we read we knew that JNI would be difficult to use but that a JNI wrapper would be needed for the function calls to the CodeMatch DLL. As it turned out, we ran across SWIG (www.ddj.com/184410484),, which can automatically generate JNI wrappers for programs written in a multitude of languages. While SWIG had a learning curve of its own, the automation of wrapper generation probably saved us a couple of weeks on this project.

The architecture we chose is a distributed objects model. The workers all remotely call methods on the master and vice-versa. The program uses the "pull" model with the workers requesting new jobs as they complete jobs. Each worker sends back the results of its work and then the master sorts and formats all results into a database file from which various types of end-user reports can be generated.

The resulting performance from this project has been excellent with the run times being cut almost by the number of worker machines. While small jobs (fewer than 40 files or so) are still quicker with the standalone CodeMatch due to the extra overhead of the network initialization, overall we can now run large jobs in a fraction of the time it once took.