Common Sources Of Problems
The general cause of event-processing bottlenecks is executing unbounded or high-latency operations on the GUI thread. More often than not, the offending code is user-authored code that responds to UI events. Such code might perform a simple CPU-intensive activity whose algorithmic complexity depends on some variable, say the size of the input data. When you test your program in-house, it might perform acceptably with small sample data sizes. However, when users decide to operate on more data than you anticipated, it might consume far more compute time, leading to precisely the situation just described.
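To make the input-size problem concrete, here is a minimal sketch of a handler body with quadratic cost. The function names and the duplicate-counting task are hypothetical, chosen only for illustration; the point is that doubling the input roughly quadruples the time the GUI thread would spend inside the handler.

```python
import time


def count_duplicate_pairs(items):
    """Hypothetical event-handler body: O(n^2) pairwise comparison."""
    count = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                count += 1
    return count


def time_handler(n):
    """Return the seconds spent in the handler for an input of size n."""
    data = [k % 10 for k in range(n)]
    start = time.perf_counter()
    count_duplicate_pairs(data)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Small in-house samples look fine; larger user data does not.
    for n in (100, 1000, 2000):
        print(f"n={n:>5}: {time_handler(n) * 1000:.1f} ms")
```

The absolute numbers depend on the machine, but the growth pattern is what matters: a handler that feels instantaneous on test data can stall the message loop once real users supply larger inputs.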
That said, most programs aren't CPU-bound. The total amount of work your code performs is a complex equation, depending on variables such as the CPU, memory, peripherals (such as disk), and the network. Clearly CPU clock speed plays a major role, but it often isn't the dominant factor. The CPU's internal parallelism and the degree to which it can achieve superscalar execution on a particular piece of code depend heavily on the number of branch prediction misses and instruction stream fences. These are generally of more interest to compiler writers than application developers, but use of locks, volatile reads/writes, interlocked operations, and memory barriers can also affect them. Memory-intensive operations can also lead to variable delays, especially on multiprocessor machines in which some portions of the cache hierarchy are more expensive to access than others. High cache miss rates can cripple an application, bloating the cost of memory-intensive work by an order of magnitude. Too much thread-level parallelism, in your application or among all the machine's running processes, can place a higher burden on the OS thread scheduler, leading to context-switching overhead. Many of these factors are never dealt with directly in your code, but they should be part of a rigorous stress-testing process to catch problems before software is released. If you encounter issues like this during testing, a good profiler can measure them and track down the source.
Device I/O clearly incurs a higher cost than most CPU- and memory-based activities, usually by several orders of magnitude. Most programs today are more and more connected, and as a result must send and receive larger quantities of data via the disk and network. Doing this type of work on the UI thread is almost always a mistake. Network I/O consists of many steps, and the cost of each varies greatly depending on the state of the network peripheral, the network, the destination node, and hops in between. Your program generally has no control over these factors, so doing such things on the UI thread is a disaster waiting to happen.
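One common way to keep such I/O off the UI thread is to start it on a worker and post the completion back to the UI thread's queue. The sketch below simulates this without a real GUI toolkit: `ui_queue` stands in for the GUI message queue, `slow_fetch` is a hypothetical blocking call whose sleep stands in for network latency, and `run_ui_loop` is a toy message pump. None of these names come from a real framework.

```python
import queue
import threading
import time

ui_queue = queue.Queue()  # stands in for the GUI thread's message queue


def slow_fetch(resource):
    """Hypothetical blocking I/O call; the sleep simulates network latency."""
    time.sleep(0.2)
    return f"contents of {resource}"


def on_click(resource, on_done):
    """Event handler: start the I/O on a worker thread and return at once."""
    def work():
        result = slow_fetch(resource)
        # Post the completion so on_done runs on the UI thread, not the worker.
        ui_queue.put(lambda: on_done(result))

    threading.Thread(target=work, daemon=True).start()


def run_ui_loop(until):
    """Toy message pump: run posted callbacks until until() returns True."""
    while not until():
        try:
            callback = ui_queue.get(timeout=0.05)
        except queue.Empty:
            continue  # a real loop would paint and handle input here
        callback()


def demo():
    results = []
    start = time.perf_counter()
    on_click("report.dat", results.append)
    handler_seconds = time.perf_counter() - start  # time spent in the handler
    run_ui_loop(lambda: bool(results))
    return results, handler_seconds
```

In this arrangement the handler returns in a small fraction of the fetch time, so the loop stays free to process other messages while the I/O is in flight. A real toolkit supplies its own mechanism for marshaling the callback back to the UI thread; the queue here only imitates that behavior.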
GUIs often perform dramatically worse under system-wide memory pressure. This is because ordinary memory operations can suddenly turn into disk I/O due to page faulting. This dramatically changes the performance characteristics of simple memory accesses once your program has to compete with others for scarce machine resources. If simple memory accesses can be seen as "variable latency operations," you're probably wondering if there's anything you should do on the GUI thread. Generally speaking, any data- or compute-intensive tasks should be done on a separate thread, even if that incurs overhead for worker synchronization.
Finally, synchronization resulting from access to shared data structures is another variable latency operation. If a lock is contended, the thread attempting acquisition typically ends up blocking. It is awakened only after the thread that owns the lock finishes its work and releases it, and after any other waiting threads that are granted the lock first have done the same. Although the thread that owns the lock might not be a GUI thread, it might be performing I/O or waiting for an event. To the user, this looks as if the GUI thread itself were performing those operations, because it has to wait the same amount of time. Deadlocks are the worst type of lock contention, especially on the GUI thread.
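The effect of a contended lock can be measured with a small simulation. In this sketch a background thread holds a shared lock while it sleeps (standing in for I/O), and a simulated GUI handler then tries to take the same lock. The names are hypothetical. The bounded `acquire(timeout=...)` variant is not something the text above prescribes; it is shown only as one possible way to avoid an open-ended wait.

```python
import threading
import time

shared_lock = threading.Lock()


def background_job(hold_seconds, started):
    """Worker that holds the lock while doing slow work."""
    with shared_lock:
        started.set()
        time.sleep(hold_seconds)  # stands in for I/O done under the lock


def gui_handler_blocking():
    """Unbounded wait: the handler stalls for as long as the owner holds on."""
    start = time.perf_counter()
    with shared_lock:
        pass  # touch the shared data here
    return time.perf_counter() - start


def gui_handler_try(timeout=0.01):
    """Bounded wait: give up quickly so the caller can retry later."""
    acquired = shared_lock.acquire(timeout=timeout)
    if acquired:
        shared_lock.release()
    return acquired


def demo():
    started = threading.Event()
    worker = threading.Thread(target=background_job, args=(0.3, started))
    worker.start()
    started.wait()  # make sure the worker owns the lock first
    got_lock = gui_handler_try()
    stalled_for = gui_handler_blocking()
    worker.join()
    return got_lock, stalled_for
```

With the worker holding the lock for about 0.3 seconds, the bounded attempt fails to acquire it, and the blocking handler stalls for most of the remaining hold time even though the handler itself does no slow work.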