Over the past year or so, we've heard a lot about multithreading, deadlocks, cache invalidation, and superlinear scalability. I, for one, have introduced multithreading for the CPU-intensive parts of my applications. During this time, however, I've come to realize that the biggest bottleneck often involves loading data from files. Consequently, I've built a small test suite that performs the following actions:
- Read a single file sequentially
- Read multiple files sequentially
- Read from a single file at random positions
- Read from multiple files at random positions
- Write a single file sequentially
- Write multiple files sequentially
- Write to a single file at random positions
- Write to multiple files at random positions
I repeated each of these tests with 1, 2, 4, 8, 16, and 32 threads. From each test, I recorded the total throughput in MB/s. The source code consists of a single C++ source file and is available here.
I've run all of these tests on:
- A dual quad-core Xeon with 10 GB RAM and an internal RAID 5 system consisting of four 10K SAS disks, running Windows XP x64
- A dual Xeon with 4 GB RAM and 10K U360 SCSI disks, running Windows XP
- A Core Duo laptop with 1.5 GB RAM and a 5400 RPM SATA disk, running Windows XP
For single-file I/O, I used 200 MB files; all threads in a run accessed the same file. For multiple-file I/O, each file was 20 MB, and each thread used its own file. All accesses were made in blocks of 512 bytes.
I repeated the entire test suite three times. The values I present here are the averages of the three runs; in most cases the standard deviation did not exceed 10-20%. All tests were also run three times with reboots after every run, so that no file was served from the cache. These runs were slower, of course, but the influence of the number of threads was similar to the results shown here (with the exception of random reads from multiple files; see below). To suppress caching effects within a test, I used a different set of files for each subtest with 1, 2, ..., 32 threads.
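The averaging across the three runs can be computed as follows (a small sketch; the function names are mine):

```cpp
#include <cmath>
#include <vector>

// Mean of the throughput values from the repeated runs.
double mean(const std::vector<double>& runs) {
    double sum = 0;
    for (double r : runs) sum += r;
    return sum / runs.size();
}

// Population standard deviation as a percentage of the mean --
// the "10-20%" figure quoted above.
double relativeStdDevPercent(const std::vector<double>& runs) {
    double m = mean(runs);
    double var = 0;
    for (double r : runs) var += (r - m) * (r - m);
    var /= runs.size();
    return std::sqrt(var) / m * 100.0;
}
```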
Before discussing the results, it is worth considering a few important I/O issues.
Read caching. All modern operating systems use all available memory to cache the most recently used files -- or, more specifically, the most recently read or written blocks. The RAID machine with 10 GB RAM has a clear advantage here, as it can cache all the files used (some 6 GB in total). The other machines cannot reuse any cached data from one of the three runs to the next, because once more file space has been read and written than fits in RAM, the first files used must be evicted from the cache.
Write-back caches. Operating systems and, more specifically, RAID systems offer write caches that buffer written data before it is actually stored on the hard disks. The write command returns even though the data is only in the cache. In my first tests, I encountered a slowdown on the RAID system in the middle of the write tests. The reason was that the write cache was full, so each test was influenced by the tests run before it. I therefore introduced a pause of a few seconds after each subtest before starting with a different number of threads. Read and write caches explain some of the huge throughput of the RAID system used.
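The pause between subtests might look like this (a sketch under my own assumptions; the function names are hypothetical, and the commented-out subtest call stands in for the real benchmark code):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Give the OS / RAID write-back cache time to drain before the next
// subtest starts, so one subtest's buffered writes don't slow down
// the measurements of the next.
void pauseBetweenSubtests(int seconds) {
    std::this_thread::sleep_for(std::chrono::seconds(seconds));
}

// Push user-space buffers to the OS. A real benchmark would also ask
// the OS to commit them to disk before timing stops -- fsync() on
// POSIX, FlushFileBuffers() on Windows.
void flushWrites(std::FILE* f) {
    std::fflush(f);
}
```

In use, the subtest loop would call `pauseBetweenSubtests` between the 1-, 2-, ..., 32-thread runs; a few seconds sufficed on the RAID system described above.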
Native command queuing. Most current drives (even my three-year-old laptop's SATA drive) support native command queuing. This technology enables asynchronous execution of read and write commands on a hard disk: the commands complete in the order that best fits the drive's rotational position. For instance, if four read requests (A, B, C, D) are sent to the disk, they might return in the order D, C, B, A, because the drive's heads happened to be close to the position of the data requested by D when the commands were received; the next-nearest position was that of C, and so on.
Now consider a single-threaded application waiting to receive first A, then B, then C, and so on, compared to an application with four threads requesting A, B, C, and D simultaneously. By the time the multithreaded application has received all of A, B, C, and D, the single-threaded one might have received only A. (For more information, see Serial ATA: Native Command Queuing.) While there are many more theoretical aspects (operating system, drivers, disk fragmentation, sector and block sizes, to mention a few), going deeper is beyond the scope of this article.
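The latency argument can be sketched with a toy model -- simulated with sleeps rather than real disk reads, so the timings and names are purely illustrative:

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

using namespace std::chrono;

// Simulated disk request: pretend servicing takes `ms` milliseconds.
int serviceRequest(int ms) {
    std::this_thread::sleep_for(milliseconds(ms));
    return ms;
}

// One thread waiting for A, then B, then C, then D:
// total time ~= the sum of the service times.
double sequentialTime(const std::vector<int>& reqs) {
    auto t0 = steady_clock::now();
    for (int r : reqs) serviceRequest(r);
    return duration<double, std::milli>(steady_clock::now() - t0).count();
}

// Four threads issuing A, B, C, and D at once: the device can overlap
// and reorder them, so total time ~= the longest single request.
double concurrentTime(const std::vector<int>& reqs) {
    auto t0 = steady_clock::now();
    std::vector<std::future<int>> fs;
    for (int r : reqs)
        fs.push_back(std::async(std::launch::async, serviceRequest, r));
    for (auto& f : fs) f.get();
    return duration<double, std::milli>(steady_clock::now() - t0).count();
}
```

With four equal requests, the concurrent version finishes in roughly a quarter of the sequential time -- which is exactly the gap between the single-threaded and four-threaded applications described above.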