Basic Data Flow
Most modern computers work in a similar way when executing large simulation problems. The basic HPC server today consists of one or more CPUs, which perform the arithmetic, and RAM, which holds the application's instructions as well as the data needed to run the application. When the application needs to write out results, the system accesses the I/O subsystem, and in certain cases it uses the network connection to communicate with other machines.
There are many ways to design a computer system, but an important goal is that the different sub-systems within a computer are balanced. If the CPU is very fast but can't be fed data from the memory system quickly enough, the CPU has to wait. If the CPU is slow compared to the memory system, overall performance also drops, because the CPU can't process the data fast enough.
As a user starts up an application, the computer instructions that define the application are loaded from disk into RAM. As the application begins to execute on the CPU, it needs to read its data from disk. At this point, I/O capability and speed are very important. I/O speed is typically measured in bytes of data per second, and modern systems can read or write in the gigabyte-per-second range. Data sets read from storage at application start-up can range into the hundreds of gigabytes (a gigabyte is 10^9 bytes). Since the application typically can't start until a certain amount of data is in memory and quickly accessible by the CPU, it is important to have an efficient storage sub-system and architecture.
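To put these figures in perspective, a quick back-of-the-envelope calculation shows how storage bandwidth bounds start-up time. The data-set size and bandwidths below are illustrative assumptions, not measurements:

```python
def load_time_seconds(dataset_gb: float, bandwidth_gb_per_s: float) -> float:
    """Seconds to stream a data set from storage at a sustained bandwidth."""
    return dataset_gb / bandwidth_gb_per_s

# A hypothetical 200 GB input at two assumed sustained storage bandwidths.
for bw in (1.0, 4.0):
    print(f"{bw:.0f} GB/s -> {load_time_seconds(200.0, bw):.0f} s to load 200 GB")
```

Even at a few gigabytes per second, loading hundreds of gigabytes takes minutes, which is why the storage sub-system matters so much at start-up.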
The Breakdown: Using Nodes, RAM vs. CPU, and Memory Configuration
Compute nodes for HPC today are typically referred to as "fat" or "thin" nodes. Although there is no hard cutoff between fat and thin, many in the industry define a fat node as a computer (a single enclosure) with more than four sockets, and a thin node as one with four or fewer sockets. A socket can be thought of as the place where a chip physically plugs into the motherboard.
Why do we refer to sockets and not CPUs here? As chip design has moved from constantly increasing the clock rate (around 4 GHz at the top today) to adding more computational elements (cores) per chip, it can be confusing to describe the compute power of a given chip. Sun, AMD, and Intel are all shipping CPUs with two or more individual cores. Intel also has a four-core version, and AMD will be shipping a four-core processor this year. Sun has already shipped sockets with eight cores, although those systems are not currently aimed at HPC environments.
An important factor in application performance is how fast data can move from RAM to the CPU. Over time, this transfer rate has not kept up with increases in CPU speed, and with the mainstreaming of multi-core CPUs the situation is even worse. Since two or more cores will be running different threads of an application, each demanding data from RAM, overall performance in HPC-type applications can suffer.
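A simple way to see this pressure: each additional core divides a socket's fixed memory bandwidth among more threads. The 10.7 GB/s per-socket figure below is an assumption for illustration only, not a measured number:

```python
def bandwidth_per_core(socket_bw_gb_s: float, cores: int) -> float:
    """Each core's share of a socket's fixed memory bandwidth."""
    return socket_bw_gb_s / cores

# Assumed 10.7 GB/s per socket, split across 1, 2, and 4 active cores.
for cores in (1, 2, 4):
    print(f"{cores} core(s): {bandwidth_per_core(10.7, cores):.2f} GB/s each")
```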
Most modern HPC applications have been re-written in the past 15 years to take advantage of multiple CPUs working together. Since the demands of the HPC community have outstripped the ability of single CPUs to deliver the desired performance, software developers have re-written their applications to use multiple CPUs in a single machine. In addition, to scale performance beyond a single enclosure, applications have been further enhanced to run across a number of nodes. This is typically referred to as "horizontal" scaling, since the computing environment is a number of thin nodes. Applications that have been re-written to take advantage of a number of cores have different requirements for CPUs or RAM for maximum performance.
If the amount of memory (RAM) is not sufficient to hold an application's data, the operating system moves some of that data to temporary files on disk; when the data is needed again, it has to be brought back into main memory--"swapping," in other words. Since reading from and writing to the hard disk is orders of magnitude slower than accessing main memory, swapping is to be avoided at all costs.
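The cost of swapping is easy to see with a rough average-access-time model. The latencies here are assumptions chosen only to illustrate the gap (RAM on the order of 100 ns, a disk access on the order of 10 ms):

```python
RAM_NS = 100.0           # assumed RAM access latency, in nanoseconds
DISK_NS = 10_000_000.0   # assumed disk access latency (10 ms), in nanoseconds

def avg_access_ns(swap_fraction: float) -> float:
    """Average access time when a fraction of accesses must hit swap on disk."""
    return (1.0 - swap_fraction) * RAM_NS + swap_fraction * DISK_NS

# Even 0.1% of accesses going to disk dominates the average access time.
for f in (0.0, 0.001, 0.01):
    print(f"{f:.1%} swapped -> {avg_access_ns(f):,.0f} ns average")
```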
The cost of a thin node is based on a number of factors. When trying to get the most performance per dollar (the "price/performance ratio"), it is important to investigate whether paying more for RAM is more beneficial than paying more for a faster CPU (assuming all choices are dual core or beyond). After the vendor's actual cost to build the machine is determined by adding up the individual parts, a margin is applied to arrive at the final price. Looking at Sun's Sun Fire X2200 M2, a two-socket, dual-core HPC server based on the AMD Opteron processor, RAM contributes 16-60 percent of the final cost to build the machine, while the CPUs contribute 20-35 percent of the final cost to Sun.
For example, a low-end AMD Opteron dual-core processor (model 2210, running at 1.8 GHz), if purchased separately, has a normalized cost of 1.0. Moving to a faster processor, the model 2214 running at 2.2 GHz, costs 2.36 times as much, for a raw speed gain of 1.22X. Moving to the fastest CPU choice, the model 2218 running at 2.6 GHz, costs 4.28 times as much, with a raw performance gain of 1.44X. Note that this is for the processor only, not for the whole system; the overall increase in system price is less than these values.
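Putting the normalized processor costs and speed gains cited above together gives a quick price/performance comparison (processor cost only, as in the text):

```python
# (normalized processor cost, raw speed gain vs. the 2210), from the text
models = {
    "2210 (1.8 GHz)": (1.00, 1.00),
    "2214 (2.2 GHz)": (2.36, 1.22),
    "2218 (2.6 GHz)": (4.28, 1.44),
}

for name, (cost, speed) in models.items():
    # Higher is better: speed gained per unit of normalized cost.
    print(f"{name}: {speed / cost:.2f} performance per cost unit")
```

By this measure, the slowest part delivers the most performance per processor dollar, which is one reason to look at memory before a faster CPU.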
Looking at memory configurations, if a base configuration of 2 GB costs the customer a normalized 1.0, a 4 GB kit currently costs 2.22X--almost linear. The 4 GB kit uses higher-density memory, which allows larger data sets to be run. This is very important, since, as stated earlier, swapping due to insufficient memory is critical to avoid. Thus, customers should spend their capital dollars on making sure the system has enough memory before looking at faster CPU speeds.
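The memory scaling above can be checked the same way, using the normalized costs from the text:

```python
# GB of RAM -> normalized cost, from the text
configs = {2: 1.00, 4: 2.22}

for gb, cost in configs.items():
    print(f"{gb} GB: {cost / gb:.3f} normalized cost per GB")
```

Cost per gigabyte rises only slightly (0.500 to 0.555), confirming the near-linear scaling claim.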