Threads Versus Processes
When UNIX first came out, it supported only processes. Each running copy of a program had its own address space, so it shared no global data with other programs. If multiple running processes wanted to communicate, they had to pass data among themselves. UNIX domain sockets, TCP/IP, pipes, files, and databases are all ways for processes to communicate. It may seem strange to include databases in this list, but modern Web applications still use databases to share data between processes running on separate machines. When using TCP/IP and the like, you still have to decide on a communication protocol. You can either take a low-level, simple approach to passing data, or you can take a high-level approach using a Remote Procedure Call (RPC) system.
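As a minimal sketch of one of the mechanisms listed above, here is inter-process communication over a pipe using Python's standard multiprocessing module (the worker function and messages are illustrative):

```python
from multiprocessing import Process, Pipe

def worker(conn):
    # The child process receives a request over the pipe and sends back a reply.
    request = conn.recv()
    conn.send(request.upper())
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send("hello")
    print(parent_conn.recv())  # prints "HELLO"
    p.join()
```

Note that the two processes share nothing by default; every byte that crosses between them goes through the pipe explicitly.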
Unlike processes, multiple threads within a single process do share the same address space. Hence, they can access the same global variables. Technically speaking, threads have shared heaps, but separate stacks.
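A minimal illustration of that shared address space (the names here are illustrative): several threads appending to the same global list, with no copying or message passing involved.

```python
import threading

results = []  # a global list, visible to every thread in this process

def record(name):
    # Each thread appends to the very same list object.
    results.append(name)

threads = [threading.Thread(target=record, args=(f"thread-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # prints ['thread-0', 'thread-1', 'thread-2']
```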
Returning to the Python Web server, which makes the most sense? Apache 1 didn't use threads. Rather, it had pools of worker processes. Apache 2, on the other hand, lets you mix and match processes and threads. For instance, you can configure Apache 2 to use M processes each containing N threads.
Processes are heavy. Each new Python process requires a relatively large amount of memory that can't be shared among the multiple running copies of Python. If the goal is to support 10,000 concurrent requests, clearly processes are not a good solution. (By the way, there's a great paper on the "quest" to support 10,000 simultaneous Web requests called The C10K Problem.)
As an aside, it's interesting to note that when you fork in Linux (but not in Windows), the kernel works hard to use copy-on-write for the individual memory pages. Hence, forking a large process doesn't immediately consume a large amount of RAM. However, this works out a lot better in C than in Python: because Python reference counts objects, merely reading an object writes to its reference count and thus dirties the page it lives on, forcing the kernel to copy it. As a result, the two forked processes soon diverge, sharing less and less memory.
Threads aren't cheap either. On most operating systems, each thread requires its own pre-allocated, contiguous stack. The words "pre-allocated" and "contiguous" compound each other to cause a real problem if you want to support 100,000 threads. Fortunately, Stackless Python doesn't require this for its lightweight threads. In fact, Erlang supports lightweight processes that require a smaller memory allocation than most kernel thread libraries.
In thinking about threads versus processes, remember that you only need locking when you are sharing mutable data. Immutable data doesn't need locking. That's great for functional languages like Erlang that generally don't permit mutable data. It's also great for separate processes since they don't have a shared address space.
On the flip side, if you don't share anything, then you don't need locks. The need for locking will be greatly reduced if each request uses a thread that doesn't share any data with any other thread.
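A minimal sketch of the locking the preceding paragraphs describe (the names are illustrative): a `threading.Lock` guards the one piece of mutable shared state, while each thread's private computation needs no lock at all.

```python
import threading

total = 0                      # shared, mutable -- needs a lock
total_lock = threading.Lock()

def work(values):
    # Purely thread-local computation: no lock required here.
    subtotal = sum(values)
    # Only the update to the shared variable is serialized.
    global total
    with total_lock:
        total += subtotal

threads = [threading.Thread(target=work, args=([i, i + 1],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # (0+1) + (1+2) + (2+3) + (3+4) = 16
```

Note how little of the function sits inside the `with` block; the less you share, the less you lock.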
Native Threads Versus Green Threads
Not all threads are created equal. There are two main ways of implementing threads. Native threads are implemented by the kernel. Green threads are implemented at the interpreter or virtual machine level. Native threads are heavier because context switching at the kernel level is comparatively expensive. However, green threads can't take advantage of multiple CPUs. That is because from the kernel's perspective, the whole VM is running as a single native thread.
For a while, Sun's JVM supported both types of threads via a compiler option. For subtle reasons involving blocking I/O libraries, this was a real problem if you didn't know which you were getting. These days, the JVM always uses native threads.
Python supported native threads earlier than many other interpreters of its era. However, there's a catch -- the global interpreter lock (the "GIL"). Supporting native threads in a thread-safe way is hard, a problem compounded by the need to interface with non-thread-safe C extensions. The GIL is a single lock that protects all the critical sections in Python. Hence, even if you have multiple CPUs, only one thread may be doing "pythony" things at a time.
This sounds worse than it actually is. Many computationally intensive C extensions know to release the GIL before diving into heavy computation, and all the I/O libraries release it before making blocking I/O calls. Hence, it's possible to have multiple Python threads that are each doing blocking I/O, which makes Python threads a viable option for a Python Web server.
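A minimal demonstration of this, using `time.sleep` as a stand-in for a blocking I/O call (like blocking socket reads, it releases the GIL): ten threads that each block for 0.2 seconds finish in roughly 0.2 seconds of wall-clock time, not 2 seconds.

```python
import threading
import time

def blocking_call():
    # time.sleep releases the GIL, just as blocking socket or file I/O does,
    # so other threads run while this one waits.
    time.sleep(0.2)

start = time.time()
threads = [threading.Thread(target=blocking_call) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"{elapsed:.2f}s")  # far closer to 0.2s than to 2s
```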
However, Python threads are terrible for multi-CPU concurrency if you are CPU-bound on Python code. If you have four CPUs, and you're trying to do some heavy data crunching using Python code, Python threads won't help. Three of the four CPUs will spend most of their time blocked, waiting to acquire the GIL. In this situation, pools of Python processes are a better approach. To learn more about the GIL, see Threading the Global Interpreter Lock.
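A minimal sketch of that process-pool approach using the standard library's multiprocessing.Pool (the crunching function and inputs here are illustrative): each worker is a separate process with its own interpreter and its own GIL, so CPU-bound Python code can run on all four CPUs at once.

```python
from multiprocessing import Pool

def crunch(n):
    # CPU-bound, pure-Python work; each pool worker runs it under its own GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(crunch, [10, 100, 1000])
    print(results)  # prints [285, 328350, 332833500]
```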