In this article, we discuss how we took an application designed to run on a single PC and created a framework that enabled it to distribute its processing over a network grid. The original application is called "CodeMatch" and the grid version we created is "CodeGrid." CodeMatch detects software plagiarism by comparing two directories of possibly thousands of source-code files, then producing a report showing all matching patterns found (www.ddj.com/dept/architect/184405734). CodeMatch is used in court cases to help prove or disprove copyright infringement. Because its algorithms are very rigorous, they are also very time-consuming, and CodeMatch can take a long time to compare large numbers of files. For this reason, we were seeking a way to speed it up and came up with the idea of a network grid for resource sharing.
The problem we were trying to solve can be defined as follows:
- Distribute and execute a set of x tasks on y nodes of a network.
- Collect the results generated by the nodes and integrate them into a single report.
- Implement load balancing.
- Automatically scale the solution to the number of available nodes in the network.
- Integrate the existing CodeMatch software with minimal modification to the existing code base.
- Run on Windows.
- Allow fault-tolerance in case a node has problems during execution.
- Develop in a short amount of time.
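The requirements above suggest a simple task abstraction: work units that nodes can execute independently, partitioned across however many nodes are available. As a minimal sketch (the names `GridTask` and `TaskPartitioner` are illustrative, not from CodeGrid itself), the core idea looks like this in Java:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task abstraction implied by the requirements list.
interface GridTask<R> {
    R execute();  // run one unit of comparison work on a node
}

class TaskPartitioner {
    // Split the task list as evenly as possible across the available nodes.
    // This is simple static balancing; a real grid would also weigh task
    // cost (e.g., file sizes) and rebalance when nodes join or fail.
    static <T> List<List<T>> partition(List<T> tasks, int nodes) {
        List<List<T>> buckets = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            buckets.get(i % nodes).add(tasks.get(i));
        }
        return buckets;
    }
}
```

Automatic scaling then falls out of the design: the partitioner is simply called with whatever node count is discovered at run time.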
There are many different ways to share resources between multiple computers over a network, and in our research, we compared some of these by listing the respective advantages, disadvantages, and trade-offs of choosing one way over another (see Table 1). The main advantages of distributed computing are increased performance by sharing the load of resource-intensive applications, improved scalability, and fault tolerance. These advantages come at a cost, however: added complexity, potential security problems, and increased manageability issues. To develop a successful distributed system, careful planning is necessary, as a remote call can be 1000 times slower than a local one. There has to be a balance between the added overhead of network management and the size of the job itself.
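That balance can be stated as a back-of-envelope inequality: distributing pays off only when the per-node share of the work plus the fixed remote-call overhead is still cheaper than running the whole job locally. A hedged sketch (the method name and cost figures are illustrative, not measurements from CodeGrid):

```java
class GridCostModel {
    // Rough model: ignores result-collection time and assumes the work
    // divides evenly. distributed cost = local/nodes + fixed overhead.
    static boolean worthDistributing(double localMillis,
                                     double overheadMillis,
                                     int nodes) {
        double distributed = localMillis / nodes + overheadMillis;
        return distributed < localMillis;
    }
}
```

For example, a 10-second comparison split over 4 nodes with 100 ms of call overhead per node is clearly worth distributing, while a 1 ms job is not; the overhead swamps the work.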
| Architecture | Advantages | Disadvantages |
|---|---|---|
| Client-Server | Many tools available, especially in Java. Basis for other architectures. | Overhead of managing socket connections can be tedious. If the server fails, clients are useless. |
| Peer-to-Peer | No need for a central server. Robust (no single point of failure). | Harder to manage from a central location. Generally refers to sharing of files, but not hardware resources. |
| Distributed Objects | Very scalable. Increased flexibility. | Requires middleware, or the developer must manage synchronization and client-server overhead. |
| Clustering | High performance. Excellent load balancing. | Needs homogeneous hardware. High administrative overhead for processes and network. |
| Grid Computing | Can use heterogeneous hardware. Can be used across large networks such as the Internet. No central administration. | Compared to a normal cluster, a grid has more middleware issues to deal with. |
In the search for the best approach to this problem, we explored several possible solutions, using different programming methodologies (such as procedural and object-oriented programming) and different frameworks (such as .NET and Java frameworks). We looked at the various software tools available, and different communications methods such as Remote Method Invocation (RMI) and Remote Procedure Calls (RPC). We also looked at different algorithms for distributing the work such as the master-slave paradigm and the push-pull model. While our design choices are by no means the only ones, they represent what we felt were the best choices in this particular project.
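To illustrate one of the distribution algorithms mentioned, here is a minimal pull-model sketch (not CodeGrid's actual code; the class and method names are invented for this example). In a pull model, each worker repeatedly takes the next task from a shared queue, so faster nodes naturally take on more tasks and load balancing falls out of the design:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class PullModelDemo {
    // Run all tasks on a fixed number of worker threads that pull from a
    // shared queue until it is empty, then return the summed results.
    static int runAll(List<Callable<Integer>> tasks, int workers)
            throws InterruptedException {
        BlockingQueue<Callable<Integer>> queue =
                new LinkedBlockingQueue<>(tasks);
        AtomicInteger total = new AtomicInteger();
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                Callable<Integer> task;
                while ((task = queue.poll()) != null) {  // pull next task
                    try {
                        total.addAndGet(task.call());
                    } catch (Exception e) {
                        // A real grid would re-queue the task on another
                        // node here, which is the fault-tolerance case.
                    }
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            t.join();  // wait for all workers, then integrate results
        }
        return total.get();
    }
}
```

The threads here stand in for network nodes; in a push model, by contrast, the master would decide up front which node gets which task.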