Concurrency and Python

Stackless Python, Erlang, and greenlets are interesting approaches to concurrency


February 03, 2008
URL:http://www.drdobbs.com/open-source/concurrency-and-python/206103078

Shannon -jj Behrens works for Metaweb Technologies. He can be contacted at [email protected].


Imagine you wrote a new Web server that didn't support concurrency and couldn't do multiple things at the same time. Consequently, each new Web request would run serially. Sure, it would be simple to implement, but a single, long-running request could block the entire Web server! A request involving a query against the database that takes two minutes to complete would mean that no one else could access the Web server during those two minutes.

When thinking of concurrency, it's useful to distinguish between I/O throughput and CPU throughput. Does the problem involve lots of concurrent requests, or does it involve lots of data to crunch? This distinction is important because some techniques work better for one than the other. However, sometimes this distinction is blurred. Threads try to address both of these problems. In this article, I focus on I/O throughput.

Approaches to Multitasking

Let's start with some basics. The kernel clearly needs to support concurrency in order to run multiple applications at the same time. Operating systems use time slicing to share one CPU among many programs. If each program gets a small slice of the CPU's time, the question is, how do they take turns?

In non-preemptive schedulers, each program must explicitly "yield" the CPU to the other programs. This is called "cooperative multitasking" because the programs have to cooperate with each other by yielding the CPU. Macintosh OS 9 used a non-preemptive scheduler. These days, most GUI libraries and JavaScript use this same technique to share the CPU among many "things" competing for it. One downside to this technique is that if one program gets wedged in an infinite loop, the whole system goes down. On the other hand, because the programs yield to each other in predictable ways, it's easier to reason about cooperative multitasking; that is, it's more deterministic.

Kernels that use preemptive schedulers simply interrupt the running program (I'm mostly using the English meaning of the term "interrupt", not the computer science meaning) when it's time for it to be replaced by another. The interruption is triggered by a hardware timer. By and large, the program can be completely unaware that it was even interrupted. UNIX and Windows NT use this technique. Hence, so do Linux, OS X, and Windows XP. Unlike non-preemptive schedulers, preemptive schedulers can interrupt a program even if it's wedged in an infinite loop. Consequently, a wedged program can't bring down the whole system.

Locking

Imagine some ancient banking software with multiple programs trying to write data to a single file. Example 1 is what I like to call the canonical "race" condition.

Process A sees that John's balance is $25
Process B sees that John's balance is $25
Process A deducts $25 and sets balance to $0
Process B deducts $25 and sets balance to $0
John gets a free $25
  .... which he promptly loses on a horse race ;)
Example 1: The Canonical "Race" Condition
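
To make this concrete, here's a minimal sketch of the same race using Python threads and a shared global (the names and amounts are made up; without a lock, both threads can read the old balance before either one writes the new one):

import threading

balance = 25
successful = []

def withdraw(amt):
    global balance
    current = balance                 # read the shared balance
    if current >= amt:
        # Another thread may run right here and see the same old balance.
        balance = current - amt       # write it back
        successful.append(amt)

threads = [threading.Thread(target=withdraw, args=(25,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# If the race occurs, both withdrawals "succeed" against a $25 balance.
print "Withdrawn:", sum(successful), "remaining:", balance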

Clearly, when you have multiple programs accessing the same data, some additional cooperation is required. This is done via locking. In the pre-Python 2.5 days, using a lock required a try/finally statement to acquire and release the lock. These days, Python 2.5's with syntax makes this task really convenient; see Example 2.

with lock:
    if balance > amt:
        deduct(amt)
    else:
        raise ValueError('Insufficient funds')
Example 2: Locking with with in Python 2.5.
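
For comparison, the pre-2.5 equivalent needs an explicit try/finally. Here's a rough sketch using the same hypothetical lock, balance, amt, and deduct names as Example 2:

lock.acquire()
try:
    if balance > amt:
        deduct(amt)
    else:
        raise ValueError('Insufficient funds')
finally:
    lock.release()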

By the way, it's easy to wrap this technique in a function decorator called synchronized to write Java-esque code; see Example 3.

@synchronized
def update_account(...):
    ...
Example 3: A synchronized Function Decorator
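
The decorator itself isn't shown here, but a minimal sketch is only a few lines. This version assumes a single shared module-level lock, which is the coarse-grained approach:

import threading

_lock = threading.RLock()  # one shared lock guards every synchronized function

def synchronized(f):
    """Run f while holding the shared lock ('with _lock:' also works in 2.5)."""
    def wrapper(*args, **kwargs):
        _lock.acquire()
        try:
            return f(*args, **kwargs)
        finally:
            _lock.release()
    return wrapper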

Once you introduce locks to protect shared data, the question becomes: What should you protect, and how many locks do you need? If you use a single lock that wraps everything, then no concurrency is possible. Clearly, it can get better than that.

Using a single lock to protect all the critical sections is called "coarse-grained locking." FreeBSD 4 and Python both use this technique. In FreeBSD 4, the lock was called "giant", and in Python it's called the "global interpreter lock" (the "GIL").

Using multiple locks that each protect a specific resource during critical sections is called "fine-grained locking." FreeBSD 5 and up use fine-grained locking. BeOS was famous for how fine-grained its locking was. Thanks to its careful fine-grained locking, BeOS was well suited to high concurrency. It was far better than other operating systems of its time at doing things such as playing multiple videos concurrently. While fine-grained locking has some real benefits, it's harder to implement. Furthermore, it can be more expensive to constantly acquire and release a bunch of small locks than to simply acquire and release a single large lock.

By the way, there has been research done on "lockless data structures". Usually, if you wish to update a hash atomically, you need to use a lock, or else a race condition can lead to malformed data. However, there are hash implementations that can be updated atomically without the use of a lock. These techniques may become increasingly important in the future.

Threads Versus Processes

When UNIX first came out, it only supported processes. Each copy of a program running had its own address space. Hence, it didn't share any global data. If multiple running processes wanted to communicate, they had to pass data among themselves. UNIX domain sockets, TCP/IP, pipes, files, and databases are all ways for processes to communicate. It may seem strange to include databases in this list, but modern Web applications still use databases as a way of sharing data between processes running on separate machines. When using TCP/IP and the like, you still have to decide what sort of protocol to use to communicate. You can either take a low-level, simple approach to passing data, or you can take a high-level approach using a Remote Procedure Call (RPC) system.

Unlike processes, multiple threads within a single process do share the same address space. Hence, they can access the same global variables. Technically speaking, threads have shared heaps, but separate stacks.

Returning to the Python Web server, which makes the most sense? Apache 1 didn't use threads. Rather, it had pools of worker processes. Apache 2, on the other hand, lets you mix-and-match processes and threads. For instance, you can configure Apache 2 to use M processes each containing N threads.

Processes are heavy. Each new Python process requires a relatively large amount of memory that can't be shared among the multiple running copies of Python. If the goal is to support 10,000 concurrent requests, clearly processes are not a good solution. (By the way, there's a great paper on the "quest" to support 10,000 simultaneous Web requests called The C10K Problem.)

As an aside, it's interesting to note that when you fork in Linux (but not in Windows), Linux works hard to use copy-on-write for the individual memory pages. Hence, forking a large process doesn't need to consume a large amount of RAM. However, this works out a lot better in C than in Python because of the way Python reference counts objects: merely touching a Python object updates its reference count, which writes to the page holding the object and forces the kernel to copy it. Hence, the two forked Python processes soon diverge, sharing less and less memory.

Threads aren't cheap either. On most operating systems, each thread requires its own pre-allocated, contiguous stack. The words "pre-allocated" and "contiguous" compound each other to cause a real problem if you want to support 100,000 threads. Fortunately, Stackless Python doesn't require this for its lightweight threads. In fact, Erlang supports lightweight processes that require a smaller memory allocation than most kernel thread libraries.

In thinking about threads versus processes, remember that you only need locking when you are sharing mutable data. Immutable data doesn't need locking. That's great for functional languages like Erlang that generally don't permit mutable data. It's also great for separate processes since they don't have a shared address space.

On the flip side, if you don't share anything, then you don't need locks. The need for locking will be greatly reduced if each request uses a thread that doesn't share any data with any other thread.

Native Threads Versus Green Threads

Not all threads are created equal. There are two main ways of implementing threads. Native threads are implemented by the kernel. Green threads are implemented at the interpreter or virtual machine level. Native threads are heavier because context switching at the kernel level is comparatively expensive. However, green threads can't take advantage of multiple CPUs. That is because from the kernel's perspective, the whole VM is running as a single native thread.

For a while, Sun's JVM supported both types of threads via a compiler option. For subtle reasons involving blocking I/O libraries, this was a real problem if you didn't know which you were getting. These days, the JVM always uses native threads.

Python supported native threads earlier than many of the other interpreters of its day. However, there's a catch -- the global interpreter lock (the "GIL"). Supporting native threads in a thread-safe way is hard, and the problem is compounded by the need to interface with non-thread-safe C extensions. The GIL is a lock that is used to protect all the critical sections in the Python interpreter. Hence, even if you have multiple CPUs, only one thread may be doing "pythony" things at a time.

This sounds worse than it actually is. Many computationally intensive C extensions know to release the GIL before diving into heavy computation. Furthermore, all the I/O libraries know to release the GIL before making blocking I/O calls. Hence, it's possible to have multiple Python threads that are each doing blocking I/O, which makes Python threads a viable option for a Python Web server.
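
For instance, here's a minimal sketch of a few threads overlapping blocking downloads; it works even with the GIL in place because urllib releases the GIL while waiting on the network (the URLs are just placeholders):

import threading
import urllib

def fetch(url):
    # urlopen blocks on the socket, but the GIL is released while it waits.
    data = urllib.urlopen(url).read()
    print "%s: %d bytes" % (url, len(data))

urls = ['http://example.com/', 'http://example.org/']   # placeholder URLs
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()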

However, Python threads are terrible for multi-CPU concurrency if you are CPU-bound on Python code. If you have four CPUs, and you're trying to do some heavy data crunching using Python code, Python threads won't help. Three of the four CPUs will spend most of their time blocked, waiting to acquire the GIL. In this situation, pools of Python processes are a better approach. To learn more about the GIL, see Threading the Global Interpreter Lock.
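
In newer versions of Python (2.6 and up), the multiprocessing module makes process pools easy. Here's a rough sketch of spreading a CPU-bound calculation across four worker processes, each with its own interpreter and hence its own GIL:

from multiprocessing import Pool

def crunch(n):
    # CPU-bound Python code; each call runs in a separate worker process.
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    pool = Pool(processes=4)              # roughly one worker per CPU
    print pool.map(crunch, [1000000] * 8)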

Asynchronous Programming

What happens if you have a lot of sockets that are waiting to read or write data? Asynchronous programming lets you write code that basically says, "Call my callback when you actually have something for me." Although this approach is used all the time in C, it's even nicer in Python because Python has first-class functions.

These days, there are many servers written asynchronously. nginx is a "simplified version" of Apache that is both very fast and highly concurrent. Squid, the popular open source Web proxy, is also written asynchronously. This makes a lot of sense if you think about what a Web proxy does. It spends all of its time managing a ton of sockets, funneling data between clients and servers.

Asynchronous programming starts with operating system APIs such as select, poll, kqueue, aio, and epoll. These APIs let you write code that basically says, "These are the file descriptors I'm working with. Which of them is ready for me to do some reading or writing?" In Python, libraries like the built-in asyncore module and the popular Twisted framework take these low-level APIs and orchestrate callback systems on top of them.
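
Here's roughly what this looks like at the asyncore level: a minimal echo server that sits in a select loop and gets its handle_accept and handle_read methods called back as sockets become ready (a sketch; the port number is arbitrary). The Twisted examples below operate at a higher level.

import asyncore
import socket

class EchoHandler(asyncore.dispatcher_with_send):
    def handle_read(self):
        data = self.recv(8192)        # called only when the socket is readable
        if data:
            self.send(data)

class EchoServer(asyncore.dispatcher):
    def __init__(self, host, port):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.set_reuse_addr()
        self.bind((host, port))
        self.listen(5)

    def handle_accept(self):
        pair = self.accept()          # called only when a client is waiting
        if pair is not None:
            sock, addr = pair
            EchoHandler(sock)

server = EchoServer('localhost', 8080)
asyncore.loop()                       # select() under the hood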

Let's look at an example of asynchronous code. First, the linear (non-asynchronous) code in Example 4.

def handle_request(request):
    data = talk_to_database()
    print "Processing request with data from database."
Example 4: Non-asynchronous Code

Re-written asynchronously, you end up with something like Example 5. (You can move use_data into a new top-level function after handle_request, but it's convenient to do it this way to maintain access to request via a closure.)

def handle_request(request):
    def use_data(data):
        print "Processing request with data from database."
    deferred = talk_to_database()
    deferred.addCallback(use_data)
Example 5: Asynchronous Code.

Notice that the talk_to_database function no longer returns a value directly. Rather, it returns a deferred object to which you can attach callbacks.

This is called "continuation passing style". Rather than waiting for a function to simply return, you must pass a callback detailing how to continue once the data is obtained. Because you must use continuation passing style anytime you call a function that might block, it soon permeates your codebase. This can be painful and prevents you from using any library that does blocking I/O unless it's written using continuation passing style.

On the other hand, living in the asynchronous ghetto has its benefits. Aside from the clear concurrency benefits, the Twisted codebase is widely regarded as well-written code, and it provides implementations for most popular protocols.

Subroutines Versus Coroutines

In the beginning, there was the GOTO. It didn't take any parameters, and it was a one-way trip.

A coroutine is like a subroutine, except it doesn't necessarily return. With subroutines, you can do things like:

f -> g -> h (return to g, return to f)

With coroutines, you can do things like:

f -> g -> h -> f

Coroutines can be used for simple cooperative multitasking. The Python Cookbook has a great recipe for coroutines based on generators. Example 6 is a simple version of it.

import itertools
def my_coro(name):
    count = 0
    while True:
        count += 1
        print "%s %s" % (name, count)
        yield
coros = [my_coro('coro1'), my_coro('coro2')]
for coro in itertools.cycle(coros):  # A round-robin scheduler :)
    coro.next()
# Produces:
#
# coro1 1
# coro2 1
# coro1 2
# coro2 2
# ...
Example 6: Generator-based Coroutines

Using generators to implement coroutines is definitely a cute hack. By the way, this same trick can be used in Twisted to alleviate some of the need to use callbacks everywhere.
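
For instance, recent versions of Twisted provide an inlineCallbacks decorator that builds on this generator trick. Assuming talk_to_database returns a deferred as in Example 5, the handler might be sketched like this:

from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def handle_request(request):
    # Looks like blocking code, but each yield returns control to the
    # reactor until the deferred from talk_to_database fires.
    data = yield talk_to_database()
    print "Processing request with data from database."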

On the other hand, there are some limitations to this technique. Specifically, you can only call yield in the generator. What happens if my_coro calls some function f and f wants to yield? There are some workarounds, but the limitation is actually pretty core to Python. (Because Python isn't stackless, it can't support true continuations in the same way that Scheme can.) I've written about this topic in detail on my blog.

Stackless

All of the approaches I've shown so far have their own strengths and weaknesses. It would be nice to mix-and-match some of those strengths. Suppose you want the code to look like threaded code because continuation passing style is painful. Now suppose you'd like it to behave like asynchronous code when it comes to memory footprint and concurrency. Last of all, you'd like the locking to be simpler like it is with more deterministic, non-preemptive schedulers. Suppose you'd like to support 10,000 concurrent requests, each with its own thread. That's the promise of Stackless Python.

If you'd like to have your cake and eat it too, you're going to need a more customized threading library. Most operating systems can't support 10,000 native threads, so you'll need some sort of lightweight green threads. To get the benefits of asynchronous code, you'll have to use asynchronous APIs under the covers. Furthermore, it'd be great if the scheduler automatically context switched to whichever thread was waiting on a socket that was ready to do I/O. In other words, it would be nice to have an I/O-based scheduler. Remember, I'm addressing I/O concurrency in this context, not CPU concurrency.

Stackless Python with coroutines is a technique that was originally used by eGroups (which has now become Yahoo Groups). These days IronPort (recently purchased by Cisco) and Slide continue to use this same technique. Each has a custom version written by separate engineers who left eGroups. It's interesting to note that even the open source version of Stackless Python can be traced back to eGroups through IronPort.

What does it mean to be stackless, and what does it have to do with custom threading libraries? An interpreter for a language is said to be "stackless" if function calls in that language do not use the C stack. In effect, the entire interpreter has to run as a giant loop. Furthermore, for this to work, the C extensions generally have to be "in the know".

Some interpreters are written to be stackless and some aren't. Erlang and Scheme are. Python isn't. Way down, deep in the interpreter, there's a C function called PyEval_EvalFrameEx that gets called recursively, hence making use of the C stack.

The problem with using the C stack is that it gets in the way when you want to implement your own custom threading library. If you have a bunch of frames on the stack, you have to get them out of the way before you can context switch to a new thread. Remember, there's a distinction between the stack used to implement the interpreter and the notion of a stack portrayed by the language to implement subroutines. Python interleaves these two stacks.

However, there are various workarounds for interpreters that are not stackless. One is to copy parts of the stack to the heap whenever you want to context switch. Another is to maintain separate stacks in the heap and then change the stack pointer. This second approach is closer to what the kernel normally does. The first approach is nice in that it doesn't require as much pre-allocated, contiguous space for stacks.

The open source version of Stackless Python involves a rather deep modification to the Python interpreter. However, the greenlets package is a relatively simple C extension that does all the hard work of stack copying for you. Companies like Slide have used the greenlets package as a basis for their own custom thread libraries. With the stack copying already implemented, it's just a matter of implementing an I/O-based scheduler. At PyCon 2007, I saw that the open source version of Stackless Python had done this same thing. By the way, it's interesting to note that part of what makes Erlang so great for concurrency is that it uses many of the same tricks. See the Stackless Python project's Web site for more.
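
To give a feel for the greenlets API, here's a minimal ping-pong between two greenlets; control passes only at the explicit switch calls. (Depending on how the package is installed, the import may be "from greenlet import greenlet" or "from py.magic import greenlet".)

from greenlet import greenlet

def ping():
    for i in range(3):
        print "ping", i
        gr_pong.switch()      # explicitly hand control to the other greenlet

def pong():
    for i in range(3):
        print "pong", i
        gr_ping.switch()

gr_ping = greenlet(ping)
gr_pong = greenlet(pong)
gr_ping.switch()              # prints ping 0, pong 0, ping 1, pong 1, ...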

Actors

There's one last thing I'd like to cover in my overview of concurrency in Python. "Actors" are a topic that is becoming increasingly popular these days in languages like Erlang, Scala, and Groovy. An actor is an object that runs on its own thread and can be communicated with asynchronously using its own message queue. I like to think of them as "living" objects since they each run independently on their own thread.
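
There's no actor class in the Python standard library, but a minimal sketch of the idea only takes a thread plus a queue for the mailbox:

import threading
import Queue
import time

class Actor(object):
    """A 'living' object: its own thread, its own mailbox."""
    def __init__(self):
        self.mailbox = Queue.Queue()
        thread = threading.Thread(target=self._run)
        thread.setDaemon(True)
        thread.start()

    def send(self, message):
        self.mailbox.put(message)     # asynchronous; the caller never blocks

    def _run(self):
        while True:
            message = self.mailbox.get()   # blocks until a message arrives
            self.receive(message)

    def receive(self, message):
        print "got:", message

actor = Actor()
actor.send('hello')                   # returns immediately
time.sleep(1)                         # give the daemon thread a chance to run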

There's an old "I Love Lucy" episode where Lucy and Ricky go to a hotel and the hotel clerk changes hats every time he switches to a new task. It's pretty comical to watch one man change his hat so often. However, that's exactly how object-oriented programming normally works. There's one thread that's being used to execute the methods of several different objects. Actors are like having one hotel employee per hat.

It's interesting to note that when Alan Kay, the creator of Smalltalk, first coined the term "object-oriented programming", he had in mind something closer to actors. In fact, in Dreaming in Code he said that the OOP you see today is a bastardization of his original ideas (page 289). He even goes on to say, "I made up the term object-oriented...and I can tell you, I did not have C++ in mind."

It'll be interesting to see if actors become a standard feature in more programming languages in the future.

Conclusion

While I haven't actually implemented a new Web server, I've provided you with enough of an overview of concurrency in Python so that you'll know which solutions make the most sense for your particular problem. I hope I've also enticed you to look at some more unusual approaches to concurrency such as greenlets, Stackless Python, and Erlang.

Acknowledgements

Thanks to Libor Michalek, Alec Flett, "Aahz", and Drew Perttula for reviewing this article. I'd also like to thank Sam Rushing for bringing Stackless Python to IronPort and teaching me most of what I know about it, as well as Christian Tismer for all of his hard work on the open source version of Stackless Python.
