CxC (pronounced "C by C") is a parallel language and application development and execution system for creating parallel applications that run on high-performance computers and clusters. CxC was developed at Engineered Intelligence (our company; you can download Linux and Windows versions at http://www.engineeredintelligence.com/) and influenced by and contains elements of C (see Table 1), C* ("C star"), Java, Modula-P, and Fortran. CxC lets you use as many parallel processors as you want to solve problems by defining your own network topology on these processors and a complete multiple-program/multiple-data (MPMD) model. The communication operations are one-sided, which makes parallel programming easy and efficient. If you find parallel programming a challenge and MPI too complex, CxC offers a different approach that lets you define a virtual parallel computer to fit your problem, instead of trying to fit your problem to a particular hardware configuration. With CxC, you can develop parallel algorithms and applications on your desktop system; prototype, test and compile the code; and run the executable unchanged on supercomputers or clusters (see Figures 1 and 2).
The basic concept of CxC is straightforward: You create a virtual parallel computer with as many processors as you need to solve a problem. These processors are connected via a topology that you define. Each of the virtual parallel processors run programs exactly like real physical CPUs, so you can easily reuse existing source code you can even cut-and-paste C/C++ programs for use in CxC. Unlike other approaches to parallel programming, you don't need to know about the physical hardware on which the program executes CxC's runtime environment maps your application to the physical hardware and runs it, taking care of all the data distribution and communication.
The virtualization allows you to develop complex parallel applications on PCs with one processor, under Windows or Linux, prototype and test the code, then compile them to create executables (.run file in CxC). You then transfer these executables to high-performance systems or clusters having many processors with the CxC runtime environment loaded, and CxC runs the code efficiently.
As Figure 3 illustrates, application development and execution with CxC is a three-step process. For instance, say you want to develop an image-processing application that displays 10,000 pixels in a 100×100 grid. In CxC, you specify (with one command) an array of 100×100 virtual parallel processors, define how the processors are connected so they can share information, then create the program(s) they will run (such as edge-detection, filtering, processing, and so on). Once you've developed the program on a PC, you can run it on, say, a 20 CPU Linux compute cluster. CxC's runtime environment automatically maps the 10,000 virtual processors to the 20 physical processors, distributes the program and data, and orchestrates its execution in parallel.
CxC is compatible with Fortran, C, and C++. Programs written in these languages can be integrated with little or no change; you essentially do cut-and-paste for code written. CxC can access all the built-in libraries of C, C++, and Fortran, and all the tools used by these languages can also be accessed. Therefore, you have all the strengths of the existing languages, plus the parallel capabilities of CxC.
CxC Program Structure
Every CxC program consists of three main steps:
- Specifying a virtual parallel machine with n processors, done using a controller declaring the number of virtual processors.
- Defining the communication topology between the processors, required only if the processors communicate.
- Implementing the parallel programming language on the processors, including the actual logic (what each processor will be doing).
A typical CxC program has one or more array controllers (which declares the number and dimensions of parallel processing units used), topology (to define the communication between the processors), and the implementation of programs. The CxC language is an intuitive parallel-development system that lets you prototype, test, and run parallel algorithms with just basic parallel programming knowledge. For instance, Listing 1 is a parallel version of the "hello world" program in Kernighan and Ritchie's classic The C Programming Language. This simple CxC program creates 30 processors that all run the same program. The parallel machine will be executed 10 times. The result is the following output 300 times (30 processors × 10 times executed).
hello parallel world! hello parallel world! ... hello parallel world!
The good news is that you can run this on a laptop in a couple of milliseconds. Imagine, though, if you increase the number of virtual parallel processors from 30 to 30 million you won't be able to run it on your laptop anymore. But on a supercomputer or cluster with 300 nodes, it could run in a few seconds. A new world record: 30 million "hello worlds" printed in the least amount of time! Now let's look at some real-world examples where CxC parallel programming makes sense.
Writing CxC Programs
The specific programs that are executed on each parallel processor are similar to conventional C or Java. The design goal was to take the best practices and concepts of C, Java, and Fortran and create a parallel language that is simple to learn and use.
However, the key concepts that make CxC different lie not in the programming, but in the way you structure and create the virtual parallel machine and its orchestration.
Step 1. Specifying a virtual parallel computer. First, think about the problem, not about the computer. Assume you have an unlimited number of processors available at your disposal. Think about how many parallel processors you need to solve the problem and how they would have to communicate.
To make this more interesting, Figure 4 is a program that does a numerical integration and calculates an approximation for the well-known number pi; this is achieved by an algorithm that computes the integral of a function over the interval 0 to 1.
The interval [0,1] on the x-axis is divided into many smaller intervals. The task of a parallel processing unit is to compute the area for an interval using the rectangle formula as approximation. The partial area sizes are summed up and deliver, as a result, the approximated value of pi. In this algorithm, the function value in the middle of an interval is multiplied by the interval width to get the size of the rectangle area. Assume that the accuracy of 1000 subareas is enough.
The first step to create a CxC program is to specify a virtual parallel computer that consists of multiple array controllers and 1000 parallel processing units. Let's define two important terms:
- Parallel Processing Units (PPUs) are processing elements, each of which is able to execute a program. They are the processors and basic building blocks for the virtual array computer (see Figure 6) and are controlled by the array controller. A PPU is a processing element with the capabilities of a full processor (not just an arithmetic-logical unit). Each PPU has local memory where it stores its data. It represents a dedicated task, which is an arbitrarily defined piece of work done by the program. PPUs are declared within the respective controller declaration at the beginning of the program.
- Array Controllers (AC) group PPUs together and contain the program instructions PPUs can execute. An array controller provides the logical structure for parallel processing units that execute the same program (Figure 5). The AC contains the one program for all attached PPUs in a global shared-memory block. This means that all PPUs of a given Array Controller execute the same program that is remaining at the Array Controller. To solve this problem, you create an array of 1000 processors, where each processor computes the area of a rectangle representing a part of the interval. You also need a single processor that consolidates all the results and sums them up to get to the number pi.
Listing 2 defines the two array controllers that build the basis for the sample parallel problem to compute pi. These are definitions for two-array controllers and their respective PPUs. The first array controller, ZeroToOne, contains 1000 PPUs that computes the 1000 subintervals, based on the earlier-introduced rectangle formula. The second, AC ComputePI, serves as a monitor that collects all partial values from the 1000 IntervalUnit PPUs and sums them up to compute pi. You also see the definition of the variable of the data type double.
Since the processors work independently, the control flow is different for this PPU, and you are creating a Same-Program/Multiple-Data (SPMD) computer for the thousand and a Multiple-Instruction/Multiple-Data (MIMD) computer as an overall architecture.
CxC allows the creation of an unlimited number of array controllers with an unlimited number of associated PPUs. A virtual parallel computer can consist of multiple virtual array computers. While each array controller can have many parallel processing units (PPU), a virtual parallel computer consists of many array controller/PPU combinations, all connected to a sequencer.
The responsibility of the sequencer is to orchestrate the execution of all array controllers, while the array controllers orchestrate the execution of instructions on the PPUs. All PPUs of any virtual array computer are connected via the virtual network to communicate with each other and exchange data. CxC allows you to create SPMD, MIMD, or any hybrid parallel architecture.
The key objectives for the virtual parallel architecture are: execution performance; flexibility to support new parallel algorithms, functions and models; and hardware independence yet still supporting different platforms, operating systems, hardware accelerators, and parallel computer architectures.
Step 2. Communication network topology. The second step is to create a communication network between the PPUs that allows data communication and exchange during execution. A topology definition serves as a logical container for one or more topology declarations that can be used in communication functions to access variables and arrays in other connected units. The CxC syntax allows the creation of any kind of topology including grids, hypercubes, or trees.
The topology defines the communication possibilities between the PPUs. Listing 3, the topology for the example problem that computes pi, is a simple one-to-many declaration that connects the single PPU PIUnit of controller 1 with each of the 1000 units of type IntervalUnit. It creates a link between the result PPU and 1000 compute PPUs, each of which computes a partial value of the integral.
There can be multiple topology definitions, each of which can have multiple topology declarations. A topology declaration specifies links between two groups of parallel processing units. The links generated between those groups are based on an evaluation of arithmetic expressions in the index of the source and destination PPUs. The arithmetic expression allows addition, subtraction, division, multiplication, and modulo as possible operations.
Step 3. Parallel program implementation. The third and last step is implementing the programs stored in the array controller's code segment and executed on each PPU.
In a section called main, a program is specified for each array controller, which is executed on the respective PPUs associated with this controller. The implementation of the example problem pi for the integral function running on 1000 compute nodes looks like Listing 4. Each PPU uses its unique ID to find the partial interval, computes the function value of the formula, and computes the rectangle size for its subinterval. The barrier statement is a global synchronization instruction (Figure 7) that aligns programs across array controllers. Each PPU runs into the barrier statement and waits until all PPUs run into it. Then all PPUs continue to execute their programs.
In the implementation of the result unit, it waits until all compute units have performed their computations by running into the barrier statement at the beginning of the program; see Listing 5. The result unit performs a gather operation to collect the results, followed by a vector summation to finally print the computed value. Listing 6 is the complete CxC program. q
Matt Oberdorfer is president and Jim Gutowski is vice president at Engineered Intelligence. They can be contacted at [email protected] and [email protected], respectively.