Bit by the Cache
While profiling the parallel list operators for Heron, I kept running into some really unsettling results: only a 25-40% increase in speed for even the most trivially parallelizable computations on my dual-core machine. The most likely culprit: the cache.
Optimizing multi-threaded code can be a bit tricky if you aren't a low-level assembler junkie. After profiling the heck out of my applications and doing tons of experiments, I still couldn't squeeze the maximum theoretical performance out of my dual-core machine, even for the most trivial cases.
It turns out that I was doing a really bad job of managing my cache. For example, I had a 32-megabyte array that I was processing simultaneously in two chunks. One core would work from the beginning, and the other core would start in the middle.
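To make the setup above concrete, here is a minimal sketch of that two-chunk split, written in C++ rather than Heron. It is not the actual Heron implementation: the names `transform`, `process_range`, and `process_in_two_chunks` are hypothetical, and the per-element operation is just a stand-in for whatever the list operator really does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-element operation (an assumption for illustration only).
inline std::uint32_t transform(std::uint32_t x) { return x * 2u + 1u; }

// Apply transform in place to every element in data[begin, end).
void process_range(std::vector<std::uint32_t>& data,
                   std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= data.size());
    for (std::size_t i = begin; i < end; ++i) {
        data[i] = transform(data[i]);
    }
}

// Split the array into two contiguous halves and give each half its own
// thread: one starts at the beginning, the other starts in the middle.
void process_in_two_chunks(std::vector<std::uint32_t>& data) {
    const std::size_t mid = data.size() / 2;
    std::thread first([&data, mid] { process_range(data, 0, mid); });
    std::thread second([&data, mid] { process_range(data, mid, data.size()); });
    first.join();
    second.join();
}
```

Calling `process_in_two_chunks` on a vector of 8 * 1024 * 1024 `std::uint32_t` values gives roughly the 32-megabyte working set mentioned above. On paper, each thread touches a disjoint half of the array, which is why the computation looks trivially parallelizable.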

