The Challenges of Developing Multithreaded Processing Pipelines

By Ryan Bloom, December 26, 2008

Developing applications for multi-core processors requires you to introduce threading into your software to allow more instructions to be executed simultaneously.

Recently, there have been articles by prominent members of the development community discussing the need for threading in applications. The problem is that we are reaching the limits of what hardware engineers can do to increase processor speed. Years ago, when processors were relatively slow and disks were small, developers spent time optimizing their applications for both speed and space. However, now processors are fast and disks are large, so most developers rely on the hardware to improve performance and worry little about disk space.

Unfortunately, the way hardware advances is changing. Processors used to simply get faster with each generation, but that isn't happening anymore. Because hardware vendors can no longer rely on their standard techniques for improving base processor performance, clock speeds are essentially as fast as they are going to get until there is a major breakthrough in technology. Instead, hardware vendors are adding more cores to their CPUs or adding multiple CPUs to a single machine. Developers need to go back to optimizing their software to take advantage of these new hardware concepts. That means introducing threading into their software to allow more instructions to be executed at the same time.

Do All Developers Need to Understand Threading?

I was first introduced to threading at IBM when the development team was moving from the Lotus Domino Go web server to the Apache web server. At the time, the Lotus Domino web server was threaded, but Apache was on Version 1.3, which was a strictly process-based server on Unix (Apache on Windows had threading, but it was bolted on, not designed in). One of the goals of moving to Apache 2.0 was to fully integrate threading. While this was a goal of the entire Apache development team, it was very important to IBM, because AIX (IBM's variant of Unix) wasn't very good at context switching between processes. This was before open source was well understood by commercial software vendors. One of the developers told me something along the lines of: "Wait until the open source developers need to figure out threading, they'll never get it." The reality is that open source developers are just as capable as commercial developers. Writing threaded code is hard to do properly and most developers don't "get it" at first. Like many things worth learning, writing threaded code requires developers to take the time up front to determine the best way to structure their code. This remains the case even when a tool like Threading Building Blocks (TBB) is applied: the code that TBB parallelizes must be thread-safe, and writing thread-safe code requires a proper structuring of the code and data.

My experience is that most developers do not understand threading, and that most don't believe they need to. There are good reasons for this. These days, most developers are working on web applications, and when you are writing a web application in .NET or Java, the platform takes care of the threading for you. The application server creates threads and hands each request off to a single thread. The developer is responsible for writing the code that a single thread will execute. Therefore, instead of writing code that utilizes multiple threads, the developer creates thread-safe code. Writing thread-safe code is far easier than understanding what an application is doing and writing efficient code that takes advantage of the full power of the computer. If developers become accustomed to writing thread-safe code, then their code will be ready for operation in a multithreaded environment.

While many developers function by writing code that is executed by one thread at a time, in my opinion, that is a naive view of what software developers need to do to be successful. Even if all developers don't need to fully understand multithreading, it will be highly beneficial going forward if all developers adopt the practice of writing solidly thread-safe code in all situations (server side, client side, business logic, algorithmic computations, etc.).
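
To make "thread-safe" concrete, here is a minimal sketch in Python. The HitCounter class and the handle_requests function are hypothetical names invented for this illustration; they stand in for shared state in a web application and for the per-request code that an application server would run on each of its threads.

```python
import threading


class HitCounter:
    """A counter that several request-handling threads can update safely."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def increment(self):
        # "+= 1" is a read-modify-write sequence, so it is guarded by a lock.
        with self._lock:
            self._count += 1

    @property
    def value(self):
        with self._lock:
            return self._count


def handle_requests(counter, n):
    # Stand-in for the code an application server runs on one thread.
    for _ in range(n):
        counter.increment()


counter = HitCounter()
threads = [
    threading.Thread(target=handle_requests, args=(counter, 10_000))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 40000
```

The developer of handle_requests never creates a thread, yet the code still has to be correct when four copies of it run at once. That is the habit the paragraph above argues for.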

Pipeline Processing Example

My own development experience is instructive in this regard. After working on Apache, I moved to web application development, and have been involved in a couple of large-scale web applications that need to satisfy hundreds of thousands of users. One thing that I have learned is that in large web applications, the website is always just one part of a much larger ecosystem of software that works together to solve the overall problem. These other applications can perform a number of different tasks depending on the needs of the application. Examples I have seen include:

Data exchange with partners or clients.
Back-end processing for sending scheduled emails.
Data archiving.
Scheduled tasks to pre-compute values for performance improvements.

It may seem that these tasks can be single threaded safely, but I want to investigate one of these tasks more closely to show why that may not be the best approach. Specifically, let's look at a process that sends regular emails, either nightly or weekly. This process is going to need to do the following:

Find all users who need to get an email.
Find the correct email to send.
Replace template parameters with values from the user.
Send the email.
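
The four steps above can be sketched as a single-threaded loop. This is only an illustration: the USERS and TEMPLATES tables, the function names, and the outbox list that stands in for a mail server are all assumptions made for the example, not part of any real system.

```python
from string import Template

# Hypothetical in-memory stand-ins for the database tables.
USERS = [
    {"name": "Ada", "email": "ada@example.com", "template": "weekly"},
    {"name": "Lin", "email": "lin@example.com", "template": "nightly"},
]
TEMPLATES = {
    "weekly": Template("Hi $name, here is your weekly summary."),
    "nightly": Template("Hi $name, here is your nightly report."),
}


def find_users():
    # Step 1: find all users who need to get an email.
    return list(USERS)


def find_template(user):
    # Step 2: find the correct email to send.
    return TEMPLATES[user["template"]]


def render(template, user):
    # Step 3: replace template parameters with values from the user.
    return template.substitute(name=user["name"])


def send(address, body, outbox):
    # Step 4: send the email (appending to a list stands in for a mail server).
    outbox.append((address, body))


def run_single_threaded():
    outbox = []
    for user in find_users():
        send(user["email"], render(find_template(user), user), outbox)
    return outbox


outbox = run_single_threaded()
print(len(outbox), "emails sent")
```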

This is a relatively simple process, so why not do this with a single thread? For one user, this is simple, but now assume you have to do this for 100,000 users. If each user takes one second and the program uses a single thread, it would take nearly 28 hours (100,000 seconds) to process all of the users. Obviously, there are easy ways to reduce this, like spreading the emails over several days if this is a weekly email. But sometimes that isn't possible because the business requirements won't allow it. You probably also want this to run in as short a time as possible, because while this process is running it puts load on your production database, which can hurt the performance of the website. The question then becomes: is there another solution? One obvious choice is to create multiple threads so that you can handle multiple users at the same time.
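
The arithmetic behind that estimate is easy to check. The sketch below assumes the ideal case in which the work divides evenly across threads with no contention; in practice the shared database and mail server will keep the speedup below this bound. The function name hours_to_process is made up for the example.

```python
def hours_to_process(users, seconds_per_user, threads=1):
    """Ideal wall-clock hours, assuming work divides evenly across threads."""
    return users * seconds_per_user / threads / 3600


single = hours_to_process(100_000, 1.0)
eight = hours_to_process(100_000, 1.0, threads=8)
print(f"1 thread: {single:.1f} h, 8 threads: {eight:.1f} h")
```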

Experimenting with Threading

From the example above, you can see that even simple applications can benefit from threading, and indeed require multithreading when large volumes of data must be processed within a short time window; however, adding threads to an application needs to be done with some forethought. As is often the case with software development, you need to understand why a solution makes sense in order to achieve the best results. This section is intended to provide some insight into how to add threads to your applications. First I will present some of the theory, and then I will show how the theory can be applied in the real world. Finally, I'll explain how Threading Building Blocks can be applied to abstract the threading details and multithread the processing using a high-level, task-centric approach.

In an ideal world, you will achieve the best performance if the number of threads in an application is equal to the number of processors or cores that you have available. However, we all know that programs don't run in an ideal world. In the real world, performance of applications often has less to do with the performance of the CPU and more to do with how often the application needs to wait for I/O. With this in mind, let's look at how we can organize the email sending application with threads.
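
As a rough illustration of the difference between the ideal and the real world, the sketch below sizes a thread pool from the core count and then times a simulated I/O-bound job. The pool_size helper, its io_multiplier default of 4, and the 50 ms sleep that stands in for a database or mail server round trip are all arbitrary assumptions for the example, not recommendations.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def pool_size(io_bound, cores=None, io_multiplier=4):
    """One thread per core for CPU-bound work; more when threads mostly wait."""
    cores = cores or os.cpu_count() or 1
    return cores * io_multiplier if io_bound else cores


def fake_io(_):
    time.sleep(0.05)  # stands in for a database or mail-server round trip


def timed_run(workers, tasks=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_io, range(tasks)))
    return time.perf_counter() - start


print("suggested pool size for I/O-bound work:", pool_size(io_bound=True))
print(f"1 worker:  {timed_run(1):.2f} s")
print(f"8 workers: {timed_run(8):.2f} s")
```

Because the simulated tasks spend all of their time waiting, adding threads beyond the core count still shortens the run. With CPU-bound tasks the same experiment would stop improving once the cores are busy.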

First, some assumptions and requirements for our sample application. The application has a number of email templates and each user may get one email for each type of template. For simplicity, we will assume that there is one database that contains a series of email templates and user accounts. We will also assume that there is one query to retrieve a list of email templates and another to retrieve a list of users that should receive an email given a specific template type. How many different ways could we organize this work?

Before starting, we need to determine if the tasks can be executed in parallel. If each task must be done in order, and the data items themselves must be processed in a specific order, then threading is not an option.

In this case, the order in which the emails for individual users are processed doesn't really matter. Given the nature of the task, we can definitely parallelize creating the emails. We will discuss whether we can parallelize sending the emails later. Now we need to determine how best to organize our blocks of work. Based on the requirements, because each user receives one email per template type, it makes the most sense to group the work by template type instead of trying to separate the users into arbitrary groups.

Now that we know we will break the work into components around email templates, how should we organize the work? There are at least two reasonable solutions that we need to investigate. The first option (see Figure 1) is to get a list of email templates and then create some threads. Each thread is given a single template type to operate on. Once created, each thread gets a list of users who should receive an email for that template type. Then, for each user an email is created and sent until all users have been processed. We could either create one thread per template type or we could create fewer threads and have each thread handle multiple template types. This is a relatively simple solution; plus, it will be faster than doing this same work without threads.

Figure 1: First design, with four threads and five templates. Each thread is given a list of templates and does all work associated with those templates.
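
A minimal sketch of this first design follows, using Python's standard thread pool with four threads and five template types as in Figure 1. The template names, the USERS_BY_TEMPLATE table, and the sent list that stands in for the mail server are hypothetical and exist only to make the example self-contained.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from string import Template

# Hypothetical data: five template types and the users subscribed to each.
TEMPLATES = {
    t: Template(f"Hi $name, this is your {t} email.")
    for t in ("welcome", "digest", "billing", "reminder", "promo")
}
ADA = {"name": "Ada", "email": "ada@example.com"}
LIN = {"name": "Lin", "email": "lin@example.com"}
USERS_BY_TEMPLATE = {
    "welcome": [ADA],
    "digest": [ADA, LIN],
    "billing": [LIN],
    "reminder": [],
    "promo": [ADA],
}

sent = []                      # stands in for the mail server
sent_lock = threading.Lock()   # protects the shared list


def send_email(address, body):
    with sent_lock:
        sent.append((address, body))


def process_template(template_type):
    """Do all of the work for one template type on the calling thread."""
    template = TEMPLATES[template_type]
    count = 0
    for user in USERS_BY_TEMPLATE[template_type]:  # the per-template query
        send_email(user["email"], template.substitute(name=user["name"]))
        count += 1
    return template_type, count


def run_first_design(threads=4):
    # Five template types are shared among four threads by the pool.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(process_template, TEMPLATES))


results = run_first_design()
print(results)
```

Each thread does its own database reads and its own sending, so any thread that is waiting on I/O is simply idle. That simplicity is the appeal of this design.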

The I/O Problem

However, there is another way to organize the work that will most likely perform even better (see Figure 2). Remember that most of the wasted time in an application is spent waiting for I/O to complete. If we can isolate the I/O work to a single thread, or to a minimal number of threads, then the application should perform more efficiently, assuming that we can use asynchronous I/O so that one thread can saturate the network.

Here is one option that would perform best on multi-CPU machines. The application gets a list of template types and creates one additional thread for each CPU beyond the first, which gives us one thread per CPU. The original thread is our I/O thread: its job is to retrieve data from the database and give it to the worker threads. It then accepts email messages from the worker threads and sends them to the email server. The idea behind this design is to keep all of the worker threads busy creating emails without having to wait for the mail server to be ready or for the database to send back data.

There are some challenges with this design. First, there will be contention between retrieving data from the database and sending emails to the server. This can be resolved by adding another thread for sending the emails. Second, the data retrieval thread must always have the next set of users ready to give to a worker thread so that the worker thread never needs to wait. This can force a relatively complex implementation. Finally, a single email thread may not make sense. If your email server can handle multiple connections at the same time (all commercial servers can), then you may be better off having multiple threads sending emails simultaneously.

Figure 2: Second design, also with four threads and five templates. Each thread is given a template and a list of users. The threads create emails and give them back to the master thread to be sent.
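
One way to sketch this second design is with blocking queues between the threads. Note what is swapped in here: the sketch uses Python's queue.Queue rather than asynchronous I/O, and it already includes the separate sender thread suggested as the fix for the first challenge, so it runs three workers plus a sender alongside the main data retrieval thread. The function names, the queue size of 100, and the sample data are assumptions for illustration.

```python
import queue
import threading
from string import Template

SENTINEL = None  # tells a thread that no more items are coming


def worker(work_q, out_q):
    # Worker threads only create emails; they never touch the database or mail server.
    while True:
        item = work_q.get()
        if item is SENTINEL:
            break
        template, user = item
        out_q.put((user["email"], template.substitute(name=user["name"])))


def sender(out_q, delivered):
    # The only thread that talks to the (simulated) mail server.
    while True:
        message = out_q.get()
        if message is SENTINEL:
            break
        delivered.append(message)


def run_second_design(batches, worker_count=3):
    work_q = queue.Queue(maxsize=100)  # bounded, so retrieval cannot race far ahead
    out_q = queue.Queue()
    delivered = []
    workers = [
        threading.Thread(target=worker, args=(work_q, out_q))
        for _ in range(worker_count)
    ]
    send_thread = threading.Thread(target=sender, args=(out_q, delivered))
    for t in workers:
        t.start()
    send_thread.start()

    # The calling thread plays the data retrieval role.
    for template, users in batches:
        for user in users:
            work_q.put((template, user))
    for _ in workers:
        work_q.put(SENTINEL)  # one stop signal per worker
    for t in workers:
        t.join()
    out_q.put(SENTINEL)       # all emails are queued, so the sender can finish
    send_thread.join()
    return delivered


batches = [
    (Template("Welcome, $name!"), [{"name": "Ada", "email": "ada@example.com"}]),
    (Template("Your digest, $name."), [
        {"name": "Ada", "email": "ada@example.com"},
        {"name": "Lin", "email": "lin@example.com"},
    ]),
]
delivered = run_second_design(batches)
print(len(delivered), "emails delivered")
```

Even in this small form, the extra moving parts (two queues, stop signals, and an ordering of joins) hint at why this design is harder to develop and maintain than the first.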

Given these two possible designs, which is better? To be honest, I don't know. The answer is really dictated by your specific situation. How many emails are you sending each day? Is most of your time spent creating the emails, or is it spent retrieving the data and sending each email after it has been created? If creating the email is where you spend your time, because the templates are large or the replacements are complex, then the second design is probably a better match. If the majority of your time is spent retrieving data from the database or communicating with the email server, then improving the performance of creating the emails will not help your application overall. To determine the best design for your situation, you need to understand the environment in which your application will run and then do your own performance testing. You also need to understand the maintenance needs of each design. The first design in this article is easy to understand and maintain, while the second may provide significant performance benefits but is much more difficult to develop and maintain.

Conclusion

This article has reviewed some of the basic concepts behind adding threads to your application. There are times that adding threads won't work, but those times are getting rarer. The biggest challenge to adding threads to an existing application is ensuring that any third-party libraries you use are also thread-safe. If you believe that you don't need to understand threading, I hope that this article has convinced you that there are valid reasons for all developers to learn how to best add threads to their applications.

A Look Ahead

The development of new techniques for multithreaded programming is surging, in response to the emergence of multicore processors. Technologies such as Threading Building Blocks can abstract the details of thread management, but even so, development of the application still requires understanding how to write thread-safe code.

One new advantage Threading Building Blocks brings to the table is application scalability, which has historically been difficult to achieve. Multithreaded applications have traditionally been designed and developed for a specific set of hardware. The number of threads allocated to each task was typically selected empirically, after testing the application in a specific hardware environment and finding out which set of thread assignments produced the maximal performance.

TBB's Cilk-like task-stealing mechanism means that, with a properly designed application, you can rely on your application itself to automatically detect idle processors/cores and assign them tasks from the queue, hence automatically maximizing the use of the available processors (even if the application is installed onto new hardware with a different number of processing cores). This type of portable scalability has always been difficult to achieve in applications constructed using native threads.

Ryan Bloom is the director of Native Development for Peopleclick Inc. in Raleigh, NC. He has been in software development and management for nearly 10 years and is a member of the Apache Software Foundation.
