All of the approaches I've shown so far have their own strengths and weaknesses. It would be nice to mix and match some of those strengths. Suppose you want the code to look like threaded code, because continuation-passing style is painful, but to behave like asynchronous code when it comes to memory footprint and concurrency. On top of that, you'd like locking to be simpler, as it is with more deterministic, non-preemptive schedulers, and you'd like to support 10,000 concurrent requests, each with its own thread. That's the promise of Stackless Python.
If you'd like to have your cake and eat it too, you're going to need a more customized threading library. Most operating systems can't support 10,000 native threads, so you'll need some sort of lightweight green threads. To get the benefits of asynchronous code, you'll have to use asynchronous APIs under the covers. Furthermore, it'd be great if the scheduler automatically context switched to whichever thread was waiting on a socket that was ready to do I/O; in other words, you want an I/O-based scheduler. Remember, I'm addressing I/O concurrency in this context, not CPU concurrency.
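To get a feel for why green threads are so cheap, here's a rough sketch that stands in a generator for each "thread". This isn't Stackless Python's API (Stackless tasklets can block anywhere in the call stack, which generators can't); it's just an illustration of the underlying idea that a green thread is a small heap object rather than a native thread with its own stack.

```python
def worker(n):
    # Pretend to handle a request in several steps, yielding control
    # back to the scheduler between steps (cooperative multitasking).
    for step in range(3):
        yield (n, step)

def run_round_robin(threads):
    # A trivial scheduler: cycle through the runnable green threads
    # until all of them have finished.
    completed = 0
    while threads:
        thread = threads.pop(0)
        try:
            next(thread)
            threads.append(thread)  # still runnable; requeue it
        except StopIteration:
            completed += 1
    return completed

if __name__ == '__main__':
    # 10,000 concurrent "threads" -- far more than most operating
    # systems allow for native threads -- with only modest memory use.
    threads = [worker(n) for n in range(10000)]
    print(run_round_robin(threads))  # prints 10000
```

Spawning 10,000 native threads would exhaust most systems; spawning 10,000 generators is no big deal.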
Stackless Python with coroutines is a technique that was originally used by eGroups (which has now become Yahoo Groups). These days IronPort (recently purchased by Cisco) and Slide continue to use this same technique. Each has a custom version written by separate engineers who left eGroups. It's interesting to note that even the open source version of Stackless Python can be traced back to eGroups through IronPort.
What does it mean to be stackless and what does it have to do with custom threading libraries? An interpreter for a language is said to be "stackless" if function calls in that language do not use the C stack. In effect, the entire interpreter has to run as a giant loop. Furthermore, for this to work, the C extensions generally have to be "in the know".
Some interpreters are written to be stackless and some aren't. Erlang and Scheme are. Python isn't. Way down, deep in the interpreter, there's a C function called PyEval_EvalFrameEx that gets called recursively, hence making use of the C stack.
The problem with using the C stack is that it gets in the way when you want to implement your own custom threading library. If you have a bunch of frames on the stack, you have to get them out of the way before you can context switch to a new thread. Remember, there's a distinction between the stack used to implement the interpreter and the notion of a stack portrayed by the language to implement subroutines. Python interleaves these two stacks.
However, there are various workarounds for interpreters that are not stackless. One is to copy parts of the stack to the heap whenever you want to context switch. Another is to maintain separate stacks in the heap and then change the stack pointer. This second approach is closer to what the kernel normally does. The first approach is nice in that it doesn't require as much pre-allocated, contiguous space for stacks.
The open source version of Stackless Python involves a rather deep modification to the Python interpreter. However, the greenlets package is a relatively simple C extension that does all the hard work of stack copying for you. Companies like Slide have used the greenlets package as a basis for their own custom thread libraries. With the stack copying already implemented, it's just a matter of implementing an I/O-based scheduler. At PyCon 2007, I saw that the open source version of Stackless Python had done this same thing. By the way, it's interesting to note that part of what makes Erlang so great for concurrency is that it uses many of the same tricks. See the Stackless Python project for more.
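To show what "just a matter of implementing an I/O-based scheduler" means, here's a sketch using only the standard library. A real greenlet-based library switches entire C stacks; in this simplified version, each green thread is a generator that yields the socket it's blocked on, and the scheduler uses select() to resume whichever thread's socket is ready. The echo_client and echo_server functions are made-up examples, not any library's API.

```python
import select
import socket

def echo_client(sock, message, results):
    sock.sendall(message)
    yield ('read', sock)            # block until sock is readable
    results.append(sock.recv(1024))

def echo_server(sock):
    yield ('read', sock)            # block until a request arrives
    data = sock.recv(1024)
    sock.sendall(data.upper())

def run(threads):
    waiting = {}                    # socket -> green thread blocked on it
    runnable = list(threads)
    while runnable or waiting:
        while runnable:
            thread = runnable.pop()
            try:
                _, sock = next(thread)
                waiting[sock] = thread
            except StopIteration:
                pass                # this green thread is finished
        if waiting:
            # Sleep until at least one blocked socket is ready, then
            # move its green thread back onto the runnable list.
            ready, _, _ = select.select(list(waiting), [], [])
            for sock in ready:
                runnable.append(waiting.pop(sock))

if __name__ == '__main__':
    a, b = socket.socketpair()
    results = []
    run([echo_client(a, b'hello', results), echo_server(b)])
    print(results)  # [b'HELLO']
```

The key property is that the scheduler only ever runs a green thread whose socket is actually ready, so no thread busy-waits on I/O.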
There's one last thing I'd like to cover in my overview of concurrency in Python. "Actors" are a topic that is becoming increasingly popular these days in languages like Erlang, Scala, and Groovy. An actor is an object that runs on its own thread and can be communicated with asynchronously using its own message queue. I like to think of them as "living" objects since they each run independently on their own thread.
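A minimal actor can be built with nothing but the standard library's threading and queue modules. This is a sketch of the pattern, not any particular actor library's API: the actor runs on its own thread, and you communicate with it asynchronously by putting messages in its mailbox.

```python
import queue
import threading

class Actor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def send(self, message):
        # Asynchronous: the caller never blocks waiting for the actor.
        self.mailbox.put(message)

    def _run(self):
        # The actor's private thread: pull messages off the mailbox
        # one at a time and handle each one.
        while True:
            message = self.mailbox.get()
            if message is None:     # sentinel value: shut down
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError   # subclasses define the behavior

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()

class Counter(Actor):
    # A "living" object: it tallies messages on its own thread.
    def __init__(self):
        self.count = 0
        super().__init__()

    def receive(self, message):
        self.count += message

if __name__ == '__main__':
    counter = Counter()
    for _ in range(100):
        counter.send(1)
    counter.stop()
    print(counter.count)  # prints 100
```

Because only the actor's own thread ever touches its state, callers don't need to take any locks; the message queue serializes access for free.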
There's an old "I Love Lucy" episode where Lucy and Ricky go to a hotel and the hotel clerk changes hats every time he switches to a new task. It's pretty comical to watch one man change his hat so often. However, that's exactly how object-oriented programming normally works. There's one thread that's being used to execute the methods of several different objects. Actors are like having one hotel employee per hat.
It's interesting to note that when Alan Kay, the creator of Smalltalk, first coined the term "object-oriented programming", he had in mind something closer to actors. In fact, in Dreaming in Code he said that the OOP you see today is a bastardization of his original ideas (page 289). He even goes on to say, "I made up the term object-oriented...and I can tell you, I did not have C++ in mind."
It'll be interesting to see if actors become a standard feature in more programming languages in the future.
While I haven't actually implemented a new Web server, I've provided you with enough of an overview of concurrency in Python so that you'll know which solutions make the most sense for your particular problem. I hope I've also enticed you to look at some more unusual approaches to concurrency such as greenlets, Stackless Python, and Erlang.
Thanks to Libor Michalek, Alec Flett, "Aahz", and Drew Perttula for reviewing this article. I'd also like to thank Sam Rushing for bringing Stackless Python to IronPort and teaching me most of what I know about it, as well as Christian Tismer for all of his hard work on the open source version of Stackless Python.