Channels ▼
RSS

Parallel

Boosting Performance with Atomic Operations in .NET 4


When you write concurrent code that has to make changes to shared variables, you might think that a mutual-exclusion lock is necessary to perform each update operation. In some cases, you can replace a mutual-exclusion lock with a more efficient atomic operation and you can boost both your application's performance and scalability.

When you want to achieve the best performance for a parallelized algorithm, you can follow James Reinders' 8 Rules for Parallel Programming for Multicore. Here I focus my attention on the importance of Rule #5 that suggests the usage of atomic operations instead of locks, whenever possible.

I know it is difficult to write lock-free code. In fact, writing lock-free code is error prone. The key message you should get from this simple example is that locks aren't always the best option when you want to maximize both performance and scalability with multicore hardware. These simple examples in C# and Visual Basic with .NET Framework 4 let you understand and measure the performance improvements achieved when replacing locks with atomic operations. You need Visual Studio 2010 in order to run the examples on your multicore hardware and test the results by yourself. However, you have to understand that an atomic operation replaces a lock and an increment operation in this example, but it doesn't mean that you should replace all your locks with atomic operations. You have to analyze each scenario to determine whether an atomic operation is the best alternative.

The next code doesn't represent a best practice. There are simpler and more efficient ways of coding this application. For example, you can use PLINQ to achieve the same goal. However, this example is focused on the comparison of locks and atomic operations. The following C# code snippet is a console application that generates 150 million words with a serial execution and then it runs a parallelized loop to count the number of words that contain the "a" and the "e" characters, in two independent counters.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
// Added
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

namespace CSharpTest
{
    class Program
    {
        // 150 million words
        private const int NUM_WORDS = 150000000;
        private static int _counterA;
        private static int _counterE;
        private static object _syncA = new Object();
        private static object _syncE = new Object();
        private static List<string> _wordsList;

        private static void GenerateData()
        {
            string[] words = { 
                "visual", 
                "studio", 
                "2010", 
                "adds",
                "parallel",
                "extensions",
                "but",
                "provides",
                "backwards",
                "compatibility",
                "with",
                "the",
                "previous",
                "threading",
                "model" };

            _wordsList = new List<string>(NUM_WORDS);
            for (int i = 0; i < NUM_WORDS; i++)
            {
                _wordsList.Add(words[i % words.Length]);
            }
        }

        static void Main(string[] args)
        {

            // Generate data (serial execution)
            GenerateData();

            // Measure the parallel execution time
            var sw = Stopwatch.StartNew();

            Parallel.For(0, NUM_WORDS, (i) =>
                {
                    // Locks version
                    if (_wordsList[i].Contains("a"))
                    {
                        lock (_syncA)
                        {
                            _counterA++;
                        }
                    }
                    if (_wordsList[i].Contains("e"))
                    {
                        lock (_syncE)
                        {
                            _counterE++;
                        }
                    }
                });

            Console.WriteLine("A Counter value: {0}", _counterA);
            Console.WriteLine("E Counter value: {0}", _counterE);
            Console.WriteLine(sw.Elapsed.TotalMilliseconds);
            Console.ReadLine();
        }
    }
}

The Main method calls the GenerateData method. This method is very simple and it just adds 150 million words (NUM_WORDS) to the List<string> _wordsList. The method generates the same list of words each time it runs in order to make it simple to compare execution times for the same algorithm with the same input data. The code runs with a serial execution. It is possible to parallelize the code that generates the list of words but the main focus isn't on this part of the code.

Then, the Main method starts a Stopwatch to measure the execution time for the section of the code that is going to run with a parallelized execution. A parallelized for loop analyzes each string in _wordsList using a Parallel.For method. Parallel.For launches the necessary number of tasks in parallel to take advantage of the available logical cores. Each parallelized iteration runs a very simple piece of code that checks whether one of the strings in _wordsList contains the "a" and the "e" characters.

If the string contains "a", the code uses the lock keyword to acquire a mutual-exclusion lock on the _syncA synchronization object, increments _counterA by 1 and then releases the lock. This increment operation is the only line of code that runs while _syncA is locked, as in the following code snippet:


if (_wordsList[i].Contains("a"))
{
    lock (_syncA)
    {
        _counterA++;
    }
}

If the string contains "e", the code uses the lock keyword to acquire a mutual-exclusion lock on the _syncE synchronization object, increments _counterE by 1, then releases the lock, as in the following code snippet:


if (_wordsList[i].Contains("e"))
{
    lock (_syncE)
    {
        _counterE++;
    }
}

As happened when the string contained an "a", this increment operation is the only line of code that runs while _syncE is locked. Some words can have both an "a" and an "e", and therefore, the code doesn't use an else statement and evaluates both characters with an independent if statement.

Again, you can use PLINQ to query _wordsList and you will avoid these locks. However, I want to focus on the locks and the increment operations. Locks are expensive, and you have a scenario with too many locks. Parallel.For runs the iterations in parallel and you will have many tasks acquiring and releasing a mutual-exclusive lock on _counterA and _counterE. Sometimes, tasks will block until they can acquire the mutual-exclusive lock because there is one task that acquired the lock and is running the increment operation. The code that runs with a serialized execution is really simple, because it just increments the value of the shared variable but the necessary code to acquire and release the lock decreases the overall performance for this simple algorithm and reduces scalability. The code with the locks guarantees correctness. The following are the final values for the counters displayed on the console:

A Counter value: 60000000
E Counter value: 70000000

I ran this example many times in a computer with a quad-core microprocessor without turbo modes and with a fixed clock speed. The parallelized for loop required an average time of 34,000 milliseconds to run. Windows Task manager shows all the cores with a 100% CPU usage while the parallelized loop runs. However, a sustained high value for the CPU usage doesn't mean that the computer is running efficient parallelized code. If you can achieve the same results in less time and the code still guarantees correctness, you will have an improved algorithm. In a previous post, I explained some techniques to avoid problems when measuring speedup on microprocessors with Turbo Boost. Be sure to disable Turbo Boost when you run the examples explained in this post to avoid drawing wrong conclusions.

The following C# code snippet shows a new version of the Main method that provides the same results but it uses atomic operations instead of locks to update the shared counters.


static void Main(string[] args)
{
    // Generate data (serial execution)
    GenerateData();

    // Measure the parallel execution time
    var sw = Stopwatch.StartNew();

    Parallel.For(0, NUM_WORDS, (i) =>
        {
            // Atomic operations version
            if (_wordsList[i].Contains("a"))
                Interlocked.Increment(ref _counterA);
            if (_wordsList[i].Contains("e"))
                Interlocked.Increment(ref _counterE);
        });

In this new version, if the string contains "a", the code calls the Interlocked.Increment method and passes _counterA by reference. System.Threading.Interlocked.Increment increments the value for the variable as an atomic operation. This method is thread-safe, and therefore, it is also task-safe. Remember that tasks are assigned to threads. This increment operation was the only line of code that ran while the synchronization object was locked and now there is no need to acquire a mutual-exclusion lock. The following code snippet shows the code that increments _counterA as an atomic operation:


Interlocked.Increment(ref _counterA);

If the string contains "e", the code calls the Interlocked.Increment method and passes _counterE by reference, as in the next line:


Interlocked.Increment(ref _counterE);

There is no need to acquire mutual-exclusion locks because the increment operation was the only line that needed to run with a serialized execution. The code with the atomic operations also guarantees correctness. The following are the final values for the counters displayed on the console:

A Counter value: 60000000
E Counter value: 70000000

I ran this example many times in a computer with a quad-core microprocessor without turbo modes and with a fixed clock speed. The parallelized for loop required an average time of 10,000 milliseconds to run. Windows Task manager shows all the cores with a 100% CPU usage while the parallelized loop runs. However, this new version requires less time to run than the version that uses locks. The speedup achieved is 34,000 / 10,000 = 3.40x over the version that used locks. The System.Threading.Interlocked class provides methods that perform atomic operations on shared variables. Remember that it is possible to replace a lock and an update on a shared variable with an atomic operation when the update operation is the only statement that runs in the critical section. However, don't replace locks and update statements with atomic operations if you aren't sure of what you're doing.

The following Visual Basic code snippet is the equivalent for the previously explained console application that used locks written in C#. In Visual Basic, instead of the C# lock keyword, the code uses the SyncLock equivalent to acquire the mutual-exclusion lock. I ran this example many times in a computer with a quad-core microprocessor without turbo modes and with a fixed clock speed. The parallelized for loop required an average time of 35,000 milliseconds to run. The average time required to run the application is almost the same than its C# equivalent.


Imports System.Diagnostics
Imports System.Threading
Imports System.Threading.Tasks

Module Module1

    ' 150 million words
    Const NUM_WORDS As Integer = 150000000
    Private _counterA As Integer = 0
    Private _counterE As Integer = 0
    Private _syncA As Object = New Object()
    Private _syncE As Object = New Object()
    Private _wordsList As List(Of String)

    Private Sub GenerateData()
        Dim words As String() = New String() {
            "visual",
            "studio",
            "2010",
            "adds",
            "parallel",
            "extensions",
            "but",
            "provides",
            "backwards",
            "compatibility",
            "with",
            "the",
            "previous",
            "threading",
            "model"}
        _wordsList = New List(Of String)(NUM_WORDS)
        For i As Integer = 0 To (NUM_WORDS - 1)
            _wordsList.Add(words(i Mod words.Length))
        Next
    End Sub

    Sub Main()
        ' Generate data (serial execution)
        GenerateData()

        ' Measure the parallel execution time
        Dim sw = Stopwatch.StartNew()

        Parallel.For(
            0,
            NUM_WORDS,
            Sub(i)
                ' Locks version
                If (_wordsList(i).Contains("a")) Then
                    SyncLock (_syncA)
                        _counterA += 1
                    End SyncLock
                End If
                If (_wordsList(i).Contains("e")) Then
                    SyncLock (_syncE)
                        _counterE += 1
                    End SyncLock
                End If
            End Sub)

        Console.WriteLine(
            "A Counter value: {0}",
            _counterA)
        Console.WriteLine(
            "E Counter value: {0}",
            _counterE)
        Console.WriteLine(
            sw.Elapsed.TotalMilliseconds)
        Console.ReadLine()

    End Sub
End Module

The following Visual Basic code snippet shows a new version of the Main procedure that provides the same results but it uses atomic operations instead of locks to update the shared counters.


    Sub Main()
        ' Generate data (serial execution)
        GenerateData()

        ' Measure the parallel execution time
        Dim sw = Stopwatch.StartNew()

        Parallel.For(
            0,
            NUM_WORDS,
            Sub(i)
                ' Atomic operations version
                If (_wordsList(i).Contains("a")) Then
                    Interlocked.Increment(_counterA)
                End If
                If (_wordsList(i).Contains("e")) Then
                     Interlocked.Increment(_counterE)
                End If
            End Sub)

        Console.WriteLine(
            "A Counter value: {0}",
            _counterA)
        Console.WriteLine(
            "E Counter value: {0}",
            _counterE)
        Console.WriteLine(
            sw.Elapsed.TotalMilliseconds)
        Console.ReadLine()

    End Sub
End Module

I ran this example many times in a computer with a quad-core microprocessor without turbo modes and with a fixed clock speed. The parallelized for loop required an average time of 10,400 milliseconds to run. Windows Task manager shows all the cores with a 100% CPU usage while the parallelized loop runs. However, as happened with the C# counterpart, this new version requires less time to run than the Visual Basic version that uses locks. The speedup achieved is 35,000 / 10,400 = 3.37x over the version that used locks. The speedup achieved is almost the same than the speedup for the C# version.

This simple example demonstrates the importance of considering atomic operations when you want to achieve the best performance for a parallelized algorithm. You can use profiling tools to check the differences between the lock version and the atomic operations counterpart. System.Threading.Interlocked is an old .NET Framework member and you should take it into account in your parallelized code with .NET Framework 4.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video