In this article, we compare Linux, Solaris (for Intel), FreeBSD, and Windows 2000 to determine which operating system (OS) runs high-performance network applications the fastest. We present our OS benchmarks with both simulated and real-world tests, then evaluate the results.
August 23, 2001
URL:http://www.drdobbs.com/which-os-is-fastest-for-high-performance/199101229
Jeffrey B. Rothman and John Buckman
Figure 1 | Figure 2 | Figure 3 | Sidebar 1 | Sidebar 2
In this article, we compare Linux, Solaris (for Intel), FreeBSD, and Windows 2000 to determine which operating system (OS) runs high-performance network applications the fastest. We will describe which software designs to look for from your network software vendor, explaining how each design yields different performance characteristics, and determine which OS platform is best suited for each common network programming design. We present our OS benchmarks with both simulated and real-world tests, then evaluate the results.
We found that the software applications architecture determines speed results much more than the operating system on which it runs. Our benchmarks demonstrate a 12x performance difference between process-based and asynchronous task architectures. Significantly, we found up to a 75% overall performance difference between OSes when using the most efficient asynchronous architecture. We found Linux to be the best performing operating system based on our metrics, performing 35% better than Solaris, which came in second, followed by Windows, and finally, FreeBSD.
Background
At Lyris Technologies, we write high-performance, cross-platform, email-based server applications. Better application performance is a competitive advantage, so we spend a great deal of time tuning all aspects of an applications performance profile (software, hardware, and operating system). Our customers frequently ask us which operating system is best for running our software. Or, if they have already chosen an OS, they ask how to make their system run our applications faster. Additionally, we run a hosting (outsourcing) division and want to reduce our hardware cost while providing the best performance for our hosting customers.
Most Internet applications follow these steps:
In general, the performance issues for network applications are to:
The most effective way to maximize network application performance is for the applications software designer to choose an architecture that addresses the three performance criteria above. The two significant variables are the task architecture and the TCP/IP call architecture.
Task Architecture
In the area of task architecture, there are three main techniques:
inetd
,
Sendmail) or processes are re-used (e.g., Apache). This architecture yields
good performance at low loads. Medium loads can also be handled, if the process
image is small (e.g., qmail), if application-specific efficiency improvements
are implemented, or if the application genre does not create too many simultaneous
tasks. Multiple CPUs are efficiently used if process caching is used, and
if the total number of processes is kept low (i.e., low-to-medium load). This
technique works on all operating systems; however, UNIX is significantly more
efficient than Windows at implementing it. (Windows lacks the fork()
system call, and this method is so slow that few Windows applications use
this technique.)
TCP/IP Call Architecture
The second major performance variable is the TCP/IP call architecture. On an operating system level, there are multiple ways to accomplish the same network operation. A tradeoff exists between TCP/IP speed versus programming effort (faster techniques are more work for the programmer). Additionally, some faster techniques are not available on all platforms; higher performance requirements may limit the platform choice.
Blocking TCP/IP Call
A blocking TCP/IP call waits for the requested operation to complete, then
acts immediately on the result. With small numbers of tasks, this results in
immediate reaction to events as they occur. With large numbers of tasks, the
operating system incurs significant context-switching overhead, and overall
efficiency is poor. Blocking (synchronous) TCP/IP calls yield very short latencies
under low loads, and are ideal for an application such as a low-load Web server,
where page-response time should be very fast, and the load is never very high.
However, if a process-oriented architecture is used and a new process is created
for each new connection (e.g., inetd
), then the latency
improvements from blocking TCP/IP are negated by the significant overhead of
running a new process.
Non-Blocking TCP/IP Call
A non-blocking (asynchronous) TCP/IP call initiates an operation, then continues with other activities. When the operation completes or an event occurs, the application is notified and then reacts. There is more programming work involved with this two-step process and sometimes a small amount of time is needed to react to the new event (increasing latency). This non-blocking technique yields much better performance under medium-to-high loads, and can survive abusively high loads, but latency may be slightly longer than with blocking TCP/IP calls.
Each of the three task-handling architectures matches up with a particular TCP/IP system call model. Process-oriented and multi-threaded programs tend to use blocking TCP/IP calls, as this is the simplest way to program, and handles the low loads that are most common case. However, an application that uses the asynchronous task architecture must use non-blocking TCP/IP operations to handle multiple tasks: blocking TCP/IP is not an option. Therefore, if you find a network application that uses the highly scalable asynchronous task architecture, you also benefit from that application using the most scalable TCP/IP call architecture (non-blocking).
Real-World Test
To evaluate the performance of various operating systems and network applications, we created three different tests: real-world, disk I/O, and task architecture comparison. The operating systems we examined were Linux (Red Hat 7.0, kernel 2.2.16-22), Solaris 2.8 for Intel, FreeBSD 4.2, and Windows 2000 Server. The operating systems were the latest version available from a commercial distribution and were not recompiled (i.e., everything was tested right out of the box). We installed all operating systems on identical 4-GB SCSI-3 drives (IBM model DCAS-34330), and ran the tests on the same machine (ASUS P3B motherboard, Intel Pentium III 550-MHz processor, 384-MB SDRAM, Adaptec 2940UW SCSI controller, ATI Rage Pro 3D video card, Intel EtherExpress Pro 10/100 Ethernet card).
As a real-world test, we measured how quickly email could be sent using our
MailEngine software. MailEngine is an email delivery server, ships on all the
tested platforms (plus on Solaris for Sparc), and uses an asynchronous architecture
(with non-blocking TCP/IP using the poll ()
system call).
So that email was not actually delivered to our 200,000-member test list, we
ran MailEngine in test mode. In this mode, MailEngine performs all the steps
of sending mail, but sends the RSET
command instead of the
DATA
command at the last moment. The SMTP connection is
then QUIT
, and no email is delivered to the recipient. Our
workload consisted of a single message being delivered to 200,000 distinct email
addresses spread across 9113 domains. Because the same message was queued in
memory for every recipient, disk I/O was not a significant factor. We slowly
raised the number of simultaneous connections to see how the increased load
altered performance.
Figure 1 ("Operating system comparison") shows the test-mode email delivery speed for MailEngine over a range of simultaneous connections for each OS. Linux is the clear speed winner, roughly 35% faster than Solaris, the runner-up. Overall performance increased as connections were added, showing marginal additional speed with more than 1500. FreeBSD performance decreased somewhat when more than 1500 connections were added.
On the UNIX-style operating systems, it was necessary to tweak the kernel slightly to allow the use of so many connections in one process. Despite kernel tweaking, FreeBSD gave us resource-shortage warnings and failed to run when loaded with more than 2500 connections.
File System Test
Many network applications also require the ability to queue information on disk for later processing (i.e., Sendmails mail queue) or to handle overflow situations. To measure file system efficiency mimicking typical situations, we wrote a C++ program that creates, writes, and reads back 10,000 files in a single directory, one file at a time. To measure file system efficiency of various kinds of files, the file size was increased from 4 KB to 128 KB.
Figure 2 ("Time to Create, Write and Read 10,000 Files") displays the file system test results. Linux and Windows speeds were almost identical, significantly faster than the other two: 6x faster than FreeBSD, 10x faster than Solaris. The file system for each OS was: Linux - EXT2, Solaris - UFS, Windows 2000 - NTFS, FreeBSD - UFS. Other file systems would undoubtedly yield different performance results. If your software application depends heavily of disk I/O, we recommend using Linux or Windows, or else investigating alternative file systems on FreeBSD or Solaris.
Application Architecture Test
Finally, we evaluated how different network application architectures performed on each operating system. We wrote a simple C++ server program that responded to incoming connections with the message "450 too busy", using one of three architectures to handle sending the response message. The three architectures our program tested were: (1) a process-based architecture, with a new process executed to handle each connection; (2) a multi-threaded architecture, with a thread assigned to each connection; and (3) an asynchronous architecture, with all connections answered using non-blocking TCP/IP. A separate C++ program, running on a different machine (on Linux), attempts to connect to our simple server program as quickly as possible, slowly increasing the simultaneous connection load, and counting successfully received response messages. The multiple charts (12 test runs) were too much to present in this article, so we instead charted the average for each task architecture to show general performance differences.
Figure 3 ("Average Throughput per Network Architecture") shows the performance of each type of task architecture, averaged across all OS platforms. Although there was a significant amount of variation in performance between platforms, the variation was not nearly as significant as the architecture choice. The slowest network application architecture is the process-based architecture, which can handle only about 5% of the connections of the asynchronous method. The asynchronous method can handle about 35% more load than the thread-based method at 1000 simultaneous connections. The trend lines show that the multi-threaded versus asynchronous performance gap widens as load increases.
Kernel Tweaks for High Performance
In their default configurations, the UNIX-style operating systems we tested
do not support the large numbers of simultaneous TCP/IP connections that multi-threaded
and asynchronous applications require. This limitation drastically restricts
applications performance, and can incorrectly dissuade a systems administrator
from using these kinds of high-performance architectures. Fortunately, these
limitations are easily overcome with a few kernel tweaks. On UNIX, each TCP/IP
connection uses a file descriptor, so you must increase the total number of
descriptors available to the operating system, and also increase the maximum
number of descriptors each process is allowed to use. All UNIX-style operating
systems have a ulimit
shell command (sh
and bash
), which can allow more open file descriptors to
commands started in that shell once the appropriate kernel tweak has been made.
We suggest ulimit -n 8192
. Here are our recommended kernel
tweaks:
On Linux: echo 65536 > /proc/sys/fs/file-max
changes the number of system-wide file descriptors.
On FreeBSD: Append to /etc/sysctl
(or you can use sysctl
-w
to add these):
kern.maxfiles=65536
kern.maxfilesperproc=32768On Solaris: Add the following to
/etc/system
and reboot:
set rlim_fd_max=0x8000
set rlim_fd_cur=0x8000Summary
Our real-world test observed a 75% performance gap between the best and worst performing operating systems, with Linux enjoying a 35% lead over runner-up Solaris. Of more significance, asynchronous applications were on average 12x faster than process-based applications, and 35% faster than multi-threaded applications. If disk I/O occupies a significant run-time portion of your application, your disk I/O tasks will run up to 10x faster on Linux and Windows 2000, when compared to Solaris, or 6x faster than FreeBSD.
If you are evaluating a network software application and final performance is important to you, software architecture should be a vital evaluation criterion (i.e., you should show a preference for multi-threaded or asynchronous architectures).
Jeffrey Rothman is the Manager of Technical Support and head System Administrator at Lyris, and holds a Ph.D. in Computer Science from U.C. Berkeley on the topic of high-performance memory architectures for multiprocessor systems. John Buckman is the CEO/Founder of Lyris, and the original software programmer behind their three products: ListManager, MailShield, and MailEngine.
Figure 1 Operating systems comparison
Figure 1 | Figure 2 | Figure 3 | Sidebar 1 | Sidebar 2 | Article
Figure 2 Time to create, write and read 10,000 files
Figure 1 | Figure 2 | Figure 3 | Sidebar 1 | Sidebar 2 | Article
Figure 3 Average throughput per network architecture
Figure 1 | Figure 2 | Figure 3 | Sidebar 1 | Sidebar 2 | Article
Figure 1 | Figure 2 | Figure 3 | Sidebar 1 | Sidebar 2 | Article
Most program documentation remains vague about which network programming architecture is used. Since the architecture is the single strongest determinant of performance under load, you need a way to find out. Below are techniques you can use snoop on an application, even if you do not have access to the source code. First, you must set up a load-creating situation, with at least 200 simultaneous tasks. Then, examine how the process is running with various tools:
Linux
top
Solaris
ps -efL
. The NLWP column indicates
how many lightweight processes are running inside each process,
and lists each LWP in a separate row.
ps -efL
reports multiple
copies of your application, but each has a different process ID, then the
application uses the process-per-task approach.
ps -efL
for your application. If all have the same process ID, and the number of copies
changes over time, then the application uses the one-thread-per-task (multi-threaded)
approach.
ps - efL
, all running copies of
your application have the same process ID, and the number of copies is less
than 100.
Windows 2000 and NT
perfmon.exe
to view the number of threads in the running task.
perfmon.exe
shows approximately
one thread per running task.
perfmon.exe
shows less than 100 threads.
Alternative Snooping Technique: Trace the System Calls
strace
(on Linux) or truss
(on Solaris and FreeBSD) to display the applications system calls.
grep
for calls to select or poll,
which usually indicate asynchronous operations. Poll()
is more efficient than select()
with large numbers of
connections.
grep
for calls that start with aio
(which stands for Asynchronous I/O) or lio
(List I/O).
These calls are a likely sign of a sophisticated asynchronous architecture.
Examples: aio_write()
, aio_read(), lio_listio()
.
grep
for calls that start with pthread
or thr_ (on Solaris), which indicates some amount of multi-threading.
If a new task always causes a call the pthread_create()
,
then the one-thread-per-task design is being used.