Analyzing What Went Wrong
To figure out where the problem lies, perhaps the first thing you might try is to run this code while watching a CPU monitor to see which cores are busy and for how long. If we run the example code with P set to 1, which makes it run sequentially, then we would expect to see one hardware core light up for the duration of the execution. Figure 2 shows that this is in fact exactly what happened on my test system.
Now let's run Example 1 again with P = 14, which will use 14 hardware cores if available, and watch the CPU monitor to see which cores are busy and for how long. I picked 14 just because in Figure 1 the execution time for P = 14 was about the same as for P = 1, so we can compare CPU utilization for about the same execution time as well as the same workload; remember that regardless of the value of P every execution is doing exactly the same amount of counting work on the identical data set, just carving the work up P different ways. Figure 3 shows the result on my system.
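To make the discussion concrete, here is a minimal sketch of the kind of code Example 1 describes. The function name, the row-striping scheme, and the use of `std::thread` are my assumptions for illustration; the essential feature matches the article: P workers, each counting into its own adjacent slot of a shared `result` array via `++result[p]`.

```cpp
#include <thread>
#include <vector>

// Hypothetical reconstruction of Example 1 (names and partitioning are
// assumptions): P workers each scan an identically sized share of the
// matrix's rows, counting odd values. Each worker increments only its own
// counter, result[p] -- but the counters are adjacent ints in one array.
int count_odds_parallel(const std::vector<std::vector<int>>& matrix, int P) {
    std::vector<int> result(P, 0);       // one counter per worker, side by side in memory
    std::vector<std::thread> workers;
    int rows = static_cast<int>(matrix.size());
    for (int p = 0; p < P; ++p) {
        workers.emplace_back([&, p] {
            for (int r = p; r < rows; r += P)   // worker p takes every P-th row
                for (int v : matrix[r])
                    if (v % 2 != 0)
                        ++result[p];            // the line the profiler will flag
        });
    }
    for (auto& t : workers) t.join();
    int total = 0;
    for (int c : result) total += c;     // combine the per-worker counts
    return total;
}
```

With P = 1 this degenerates to the sequential run of Figure 2; any larger P divides exactly the same counting work P ways, as the text notes.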
What does Figure 3 tell us? First, it confirms that the parallel version is indeed lighting up 14 different cores, so we didn't make a mistake and accidentally run sequentially on a single core. So far, so good. Second, we see that some of the cores stay busy for as long as the single core in Figure 2 did, which is why the total execution time is about the same. Not great, to be sure, but at least it's consistent with our previous data.
Third, if you add up the total CPU busy time in Figure 3, we clearly used much more total CPU horsepower than in Figure 2. Yet both executions performed exactly the same number of traversals and additions; the work was merely divided differently. Now that's just crazy; it means the CPU monitor must somehow be lying to us, because it claims the cores are pegged and busily computing when that can't be true: we know full well that the program performs exactly the same total number of matrix cell visits and counter additions as the execution in Figure 2. Hmm. Well, for now, let's just add this to our list of mysteries to be solved and press on.
Fourth, we find a new clue: We can see some of the cores didn't stay busy as long as others. Now that's a puzzling thing, because each of the 14 workers was given an identically sized chunk of work to do and so should have taken the same length of time to do it. Yet some workers apparently ran faster (or slower) than others. Hmm; add that to the mystery list, too.
That's about as far as we can go with a CPU monitor. Time for other tools.
If you run this code under a performance analyzer that lets you examine the cycles per instruction (CPI) for each line of source code, you'll probably discover that the CPI for one line in particular is amazingly high: The offending line is ++result[p]. Now, that ought to strike us as strange, because ++result[p] isn't some heavyweight function call; it's just computing an offset into an array and then incrementing an integer, and so should have a very low CPI.
Next, if you run the code under a performance analyzer that can measure cache misses, or better still cache line contention, and attribute the cost to particular lines of code, you'll get even more targeted information about the line ++result[p]: It's a hot spot that's spending its time waiting for memory, specifically for cache line ownership.
Put the CPI and cache miss information together, and there we have it: A classic case of false sharing. Each worker is incrementing its own distinct counter, but the counter values are adjacent in the result array. To increment its counter, a worker must have exclusive ownership of the cache line containing the counter, which means that the other workers trying to update their counters elsewhere in that same cache line must stall and wait until they can in turn get exclusive access to the line containing their chunk of result. The ownership of the cache line ping-pongs madly from core to core, and only one core can be running at a time, silently throttling the life out of the program.
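One common remedy for this kind of contention (sketched here as an assumption, not as the article's own fix) is to pad and align each counter so it occupies a cache line by itself, so that no two workers' counters ever share ownership of a line. The 64-byte figure is an assumption; real line sizes vary, and C++17's `std::hardware_destructive_interference_size` reports a suitable value where implemented.

```cpp
// One common remedy for false sharing: give each worker's counter its own
// cache line, so increments by different workers never contend for line
// ownership. 64 bytes is an assumed line size; on C++17 implementations
// that provide it, std::hardware_destructive_interference_size is the
// portable way to obtain this figure.
struct alignas(64) PaddedCounter {
    int value = 0;
    // alignas(64) forces sizeof(PaddedCounter) up to 64, so an array of
    // PaddedCounter places each value on its own line.
};
```

A `std::vector<PaddedCounter>` used in place of the `result` array keeps the code shape identical while eliminating the ping-pong, at the cost of some wasted memory per counter.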
For reasons that are invisible in the code, we ended up writing, not a truly parallel algorithm, but just a complicated sequential algorithm.
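To show the contrast with a truly parallel version, here is one way the sharing could be avoided (a sketch under my own assumptions, not necessarily the fix the article goes on to present): each worker accumulates into a local variable, which the compiler can keep in a register or on the worker's own stack, and touches the shared array exactly once at the end.

```cpp
#include <thread>
#include <vector>

// A sketch of one fix for the false sharing (function name and structure
// are assumptions): each worker counts into a private local variable and
// writes its slot of `result` a single time when it finishes, so the
// cache line holding `result` is contended at most P times in total.
int count_odds_local(const std::vector<std::vector<int>>& matrix, int P) {
    std::vector<int> result(P, 0);
    std::vector<std::thread> workers;
    int rows = static_cast<int>(matrix.size());
    for (int p = 0; p < P; ++p) {
        workers.emplace_back([&, p] {
            int local = 0;                      // private to this worker
            for (int r = p; r < rows; r += P)
                for (int v : matrix[r])
                    if (v % 2 != 0)
                        ++local;                // no cache-line contention here
            result[p] = local;                  // one shared write at the end
        });
    }
    for (auto& t : workers) t.join();
    int total = 0;
    for (int c : result) total += c;
    return total;
}
```

The counting work is unchanged; only the memory traffic pattern differs, which is exactly the kind of distinction that is invisible in the source but decisive on the hardware.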