Tasks Should Avoid Waiting
Next, let's consider a key implementation question: How many threads should a thread pool contain? The general idea is that we want to "rightsize" the thread pool to perfectly fit the hardware, so that we provide exactly as much work as the hardware can actually physically execute simultaneously at any given time. If we provide too little work, the machine is "undersubscribed" because we're not fully utilizing the hardware; cores are going to waste. If we provide too much work, however, the machine becomes "oversubscribed" and we end up incurring contention for CPU and memory resources, for example in the form of needless extra context switching and cache eviction. 
So our first answer might be:
Answer 1 (flawed): "A thread pool should have one pool thread for each hardware core."
That's a good first cut, and it's on the right track. Unfortunately, it doesn't take into account that not all threads are always ready to run. At any given time, some threads may be temporarily blocked because they are waiting for I/O, locks, messages, events, or other things to happen, and so they don't contribute to the computational workload.
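To make Answer 1 concrete, here is a minimal sketch in Python using only the standard library. It sizes a pool to the hardware core count and runs purely CPU-style tasks; the variable names are illustrative, not from any particular pool implementation:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Answer 1 in code: one pool thread per hardware core.
cores = os.cpu_count() or 1  # cpu_count() can return None on some platforms
pool = ThreadPoolExecutor(max_workers=cores)

# With one thread per core, the machine is neither under- nor
# oversubscribed -- but only as long as no task ever blocks.
results = list(pool.map(lambda n: n * n, range(8)))
pool.shutdown()
```

The "as long as no task ever blocks" comment is exactly the caveat that the rest of this section explores.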
Figure 2 illustrates why tasks that block don't play nice with thread pools, because they idle their pool thread. While the task is blocked, the pool thread has nothing to do and probably won't even be scheduled by the operating system. The result is that, during the time the task is blocked, we are actually providing less parallel work than the hardware could run; the machine is undersubscribed. Once the task resumes, the full load is restored, but we've wasted the opportunity to get more work done on otherwise-idle hardware.
At this point, someone is bound to ask the natural question: "But couldn't the pool just reuse the idled thread that's just sitting blocked anyway, and use it to run another task in the meantime?" The answer is "no," that's unsafe in general. A pool thread must run only one task at a time, and must run it to completion. Consider: The thread pool cannot reclaim a blocked thread and use it to run another task, thus interleaving tasks, because it cannot know whether the two tasks will interact badly. For example, the pool cannot know whether the blocked task was using any thread-specific state, such as using thread-local storage or currently holding a lock. It could be disastrous if the interleaved task suddenly found itself running under a lock held by the original task; that would be a fancy way to inject potential deadlocks by creating new lock acquisition orders the programmer never knew about. Similarly, the original task could easily break if it returned from a blocking call only to discover unexpected changes to its thread-local state made by an interloping interleaver. Some languages and platforms do provide special coroutine-like facilities (e.g., POSIX swapcontext, Windows fibers) that can let the programmer write tasks that intentionally interleave on the same thread, but these should be thought of as "manual scheduling" with "multiple stacks per thread" rather than the usual one stack per thread, and the programmer has to agree up front to participate and has to opt into a restrictive programming model to do it safely. [4,5] A general-purpose pool can't safely just interleave arbitrary tasks on the same thread without the programmer's active participation and consent.
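The opt-in, "manual scheduling" flavor of safe interleaving can be sketched with Python generators, which play a role loosely analogous to fibers or swapcontext: each task explicitly marks the points where it is safe to suspend (holding no locks, relying on no thread-local state), and a hand-written scheduler round-robins the tasks on a single thread. The scheduler below is illustrative only, not a real library API:

```python
from collections import deque

def task(name, steps, log):
    # Each task agrees up front to be interleaved, and yields only at
    # points where suspension is safe.
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield                      # an explicit, safe suspension point

def run_interleaved(tasks):
    # Manual scheduling: interleave all tasks on the current thread.
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)                # run the task until its next yield
            ready.append(t)
        except StopIteration:
            pass                   # the task ran to completion

log = []
run_interleaved([task("A", 2, log), task("B", 2, log)])
# log is now ["A:0", "B:0", "A:1", "B:1"]: the two tasks shared one
# thread, but only because they explicitly opted in and marked their
# suspension points.
```

Note how the safety comes entirely from the tasks' cooperation; a general-purpose pool gets no such guarantee from arbitrary tasks, which is exactly why it cannot do this behind the programmer's back.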
So we actually want to match, not just any threads, but specifically threads that are ready to run, with the hardware on this machine that can run them. Therefore a better answer is:
Answer 2 (better): "A thread pool should have one ready pool thread for each hardware core."
Figure 3 illustrates how a thread pool can deal with tasks that block and idle their pool thread. First, a thread pool has to be able to detect when one of its threads goes idle even though it still has a task assigned; this can require integration with the operating system. Second, once idling has been detected, the pool has to create or activate an additional pool thread to take the place of the blocked one, and start assigning work to it. From the time the original task blocks and idles its thread to the time the new thread is active and begins executing work, the system has more cores than work that is ready to run, and is undersubscribed.
But what happens when the original task unblocks and resumes? Then we enter a period where the system is oversubscribed, with more work than there are cores to run the work. This is undesirable because the operating system will schedule some tasks on the same core, and those tasks will contend against each other for CPU and cache. Now we do the dance in reverse: The thread pool has to be able to detect when it has more ready threads than cores, and retire one of the threads as soon as there's an opportunity. Once that has been done, the pool is "rightsized" again to match the hardware.
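The two-way dance described above — activate a replacement thread when a task blocks, retire a thread when the original unblocks — can be sketched as bookkeeping. Since portable OS-level detection of blocked threads is beyond a short example, this sketch uses cooperative notification instead: the task tells the pool about its blocking call by wrapping it in a context manager. The class and method names are hypothetical, not a real pool's API:

```python
import threading
from contextlib import contextmanager

class RightsizedPool:
    """Bookkeeping sketch: keep the count of ready threads at the target."""

    def __init__(self, target_ready):
        self.target = target_ready    # e.g., the hardware core count
        self.ready = target_ready     # pool threads currently ready to run
        self.extra = 0                # replacement threads currently active
        self.lock = threading.Lock()

    @contextmanager
    def blocking(self):               # a task wraps its blocking calls in this
        with self.lock:
            self.ready -= 1           # this thread will idle while blocked
            self.extra += 1           # activate a replacement thread...
            self.ready += 1           # ...which is ready to take on work
        try:
            yield                     # the actual blocking call happens here
        finally:
            with self.lock:
                self.ready += 1       # original thread ready again: oversubscribed
                if self.ready > self.target and self.extra > 0:
                    self.extra -= 1   # retire a thread at the next opportunity
                    self.ready -= 1   # rightsized again

pool = RightsizedPool(target_ready=4)
with pool.blocking():
    pass                              # e.g., wait on I/O, a lock, or an event
# After the dance completes, the pool is rightsized again.
assert pool.ready == 4 and pool.extra == 0
```

A real pool would of course start and stop actual threads where this sketch only adjusts counters, and the windows of under- and oversubscription described in the text occur in the gaps between a task blocking or unblocking and the pool reacting.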
The more often tasks block, the more they interfere with the thread pool's ability to match ready work with available hardware that can execute it.