Performance and Data Access Part 2: APIs, Benchmarks, Cloud Computing
Distributed computing divides work among multiple computers and can provide scalability, but it's not a silver bullet. Compute-intensive jobs and those operating on large data sets or databases can exhibit unacceptable execution times. System architects can respond with parallel processing, throwing lots of hardware at the performance problem; this class of application is a prime candidate for grid and cloud computing.
The good news about the cloud is that it provides a virtual computing resource, with the possibility of using massive data storage and arrays of servers. Cloud computing is well-suited to processing massive data sets, particularly using a grid engine and job scheduler such as Hadoop Map/Reduce. The bad news is that a public or private cloud can contain data sets and computing resources that are thousands of miles apart. For software hosted in the cloud, bandwidth and network latency issues can increase response times and user dissatisfaction. Cloud computing does not provide immunity from performance inhibitors such as network latency and poorly optimized queries.
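The Map/Reduce pattern behind engines such as Hadoop can be sketched in a few lines. The following is a minimal, in-process illustration of the map, shuffle and reduce phases of a word count; it is not Hadoop's actual API, which distributes these phases across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cloud scales", "the grid scales"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'the': 2, 'cloud': 1, 'scales': 2, 'grid': 1}
```

In a real deployment, the framework runs many mappers and reducers in parallel and moves intermediate data between them; the shuffle step is where network latency, discussed below, exacts its toll.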
Experience with distributed data and distributed objects taught us that marshaling data across network connections is a first-class performance bottleneck. Jeff Dean of Google reported data about network latency in a recent keynote address. A network packet round trip between California and the Netherlands takes about 150,000,000 nanoseconds, whereas a round trip within a data center takes about 500,000 nanoseconds and a main memory reference takes 100 nanoseconds. The axiom about putting code close to data to reduce network latency is especially important with today's distributed computing platforms, including grids and clouds.
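The arithmetic behind those figures makes the cost of distance concrete. Using the latency numbers cited above:

```python
# Latency figures cited above, in nanoseconds.
MAIN_MEMORY_REF = 100
DATACENTER_ROUND_TRIP = 500_000
CA_TO_NL_ROUND_TRIP = 150_000_000

# One in-datacenter round trip costs as much as 5,000 memory references.
print(DATACENTER_ROUND_TRIP // MAIN_MEMORY_REF)      # 5000

# One California-to-Netherlands round trip costs 1.5 million references...
print(CA_TO_NL_ROUND_TRIP // MAIN_MEMORY_REF)        # 1500000

# ...or 300 in-datacenter round trips.
print(CA_TO_NL_ROUND_TRIP // DATACENTER_ROUND_TRIP)  # 300
```

Every chatty protocol exchange that crosses a continent instead of a rack multiplies its cost by two or three orders of magnitude.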
Because servers and data can be widely distributed in the cloud, monitoring of data services can provide important insights about data set and database placement. Profiling software that records data access patterns provides a tool for partitioning data, sharding and establishing zones, such as Amazon EC2's availability zones. It's an enabling technology for intelligent schedulers that are aware of data access patterns in virtualized environments, grids and especially clouds. In an era of terabyte-sized (and larger) data sets, the goal is to reduce network latency by localizing data access as much as possible. The Google File System, for example, maintains metadata about the storage location of data; this enables Google Map/Reduce to schedule batch jobs on nodes based on data locality.
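A locality-aware scheduler of this kind can be sketched simply. The block names, node names and `schedule` function below are hypothetical, intended only to illustrate the idea of consulting block-location metadata (as the Google File System maintains) before placing a job:

```python
# Hypothetical metadata: which nodes hold replicas of each data block.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
}

def schedule(block_id, idle_nodes):
    """Prefer an idle node that already holds the block, so the job
    reads local storage instead of pulling data across the network;
    fall back to any idle node if no replica holder is available."""
    local = [n for n in block_locations.get(block_id, []) if n in idle_nodes]
    return local[0] if local else next(iter(idle_nodes))

# block-2 is replicated on node-b and node-c; node-c is idle, so the
# job runs where the data already lives.
print(schedule("block-2", {"node-c", "node-d"}))  # node-c
```

Real schedulers weigh additional factors (rack locality, load, replica health), but the principle is the same: move the computation, not the terabytes.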
Performance of SQL applications depends on multiple factors, including the design of the database and the execution plans for queries. System architects and developers looking to improve performance inevitably focus on tuning the database and optimizing queries. Often overlooked is how much time an application spends in middleware and network communications when it executes database queries.
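That overlooked time can be made visible with simple instrumentation. The sketch below times the execute and fetch steps of a query separately, using Python's DB-API with an in-memory SQLite database for illustration; against a networked database, the gap between this client-observed total and the server's reported execution time is what the driver, middleware and network consumed:

```python
import sqlite3
import time

# Illustrative setup: an in-memory table with 1,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# Time statement execution and result-set fetching separately.
start = time.perf_counter()
cursor = conn.execute("SELECT n FROM t")
execute_time = time.perf_counter() - start

start = time.perf_counter()
rows = cursor.fetchall()
fetch_time = time.perf_counter() - start

print(len(rows))  # 1000
```

With SQLite in-process there is no network, so both numbers are tiny; the point is the technique, which applies unchanged to JDBC, ODBC or ADO.NET code paths.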
There are numerous books about tuning databases but precious few about data access and tuning middleware. That's why system architects and developers will benefit from reading The Data Access Handbook by John Goodson and Robert Steward. It's a book that's packed with pearls of wisdom for those working with databases. The Data Access Handbook uncovers valuable information for people creating applications and services for Linux, Windows, .NET and Java environments. It explains, for example, how to improve database query execution time by optimizing performance of middleware, a critical component of client-server database architectures.
Goodson and Steward describe the performance impact of database driver architectures, protocol handling, virtual machines (Java and .NET), garbage collection, connection pooling, statement pooling, client and server endianness, and network communication tuning. There is a particularly informative discussion of result set size, network packet size and database protocol packet size, private data networks, VPNs, subnets and packet fragmentation. The authors also discuss data services and data access for a service-oriented architecture (SOA).
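Connection pooling, one of the techniques the authors cover, amortizes the cost of connection setup (network handshake, authentication) across many requests. The class below is a deliberately minimal sketch of the idea, not any particular driver's pool implementation; the counting factory stands in for a real driver's connect call:

```python
class ConnectionPool:
    """Minimal connection-pool sketch: reuse open connections rather
    than paying connection-setup cost on every request."""

    def __init__(self, factory, size):
        self._factory = factory                        # opens a new connection
        self._idle = [factory() for _ in range(size)]  # pre-opened connections

    def acquire(self):
        # Reuse the most recently released connection if one is idle;
        # otherwise open a new one.
        return self._idle.pop() if self._idle else self._factory()

    def release(self, conn):
        self._idle.append(conn)

# Demo with a counting factory standing in for a real connect().
opened = []
pool = ConnectionPool(lambda: opened.append(1) or len(opened), size=2)

c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()   # reuses c1 instead of opening a third connection

print(len(opened))    # 2  (only the two pre-opened connections exist)
print(c1 == c2)       # True
```

Statement pooling applies the same reuse idea one level up, caching prepared statements so the server does not re-parse and re-plan recurring SQL.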
The book includes chapters that provide performance-oriented coding and query practices for applications using today's predominant data access APIs: ADO.NET, JDBC and ODBC. The JDBC and ODBC APIs make metadata available that describes SQL and database capabilities using methods and functions that support run-time introspection. The authors note that .NET programming differs because it requires a developer to have more understanding of the database to which an application or service connects. This comparison illustrates a difference between this book and most other texts about SQL database programming. Many books cover programming with a single API for a single database, such as Oracle. By contrast, The Data Access Handbook is platform- and database-neutral, and it provides an in-depth treatment of performance techniques for the major data access APIs. This is particularly beneficial for programming shops with SQL databases that are accessed by programs and scripts written in a variety of languages.
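Run-time introspection of the kind JDBC exposes through DatabaseMetaData (and ODBC through its catalog functions) has a rough analogue in Python's DB-API, which serves here as an illustration of the concept rather than a substitute for those APIs:

```python
import sqlite3

# Introspection sketch: discover column names and tables at run time,
# without hard-coded knowledge of the schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

# Result-set metadata for an arbitrary query, via cursor.description.
cursor = conn.execute("SELECT * FROM orders")
print([col[0] for col in cursor.description])  # ['id', 'total']

# Catalog introspection: enumerate tables from SQLite's system catalog.
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['orders']
```

Code written this way adapts to schema changes and ports across databases, which is precisely why the metadata-rich JDBC and ODBC APIs demand less database-specific knowledge from the developer than the book says .NET does.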
Because SQL databases can present a performance challenge, benchmarks can be an important tool. The Transaction Processing Performance Council (TPC) publishes a variety of benchmarks that provide a basis for comparing DBMS products on specific hardware configurations. Benchmarks can also provide a basis for comparing different middleware and data access solutions. "Performance Testing ODBC and Native SQL APIs" (Dr. Dobb's, February 1996) explains this class of benchmark.
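A micro-benchmark of this class can be small. The sketch below compares two fetch strategies against the same data set, again using in-memory SQLite purely for illustration; a realistic benchmark would run against the production database, network and data volumes:

```python
import sqlite3
import timeit

# Illustrative data set: 10,000 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

def fetch_one_at_a_time():
    cur = conn.execute("SELECT n FROM t")
    while cur.fetchone() is not None:
        pass

def fetch_all_at_once():
    conn.execute("SELECT n FROM t").fetchall()

# timeit reports total seconds for the given number of runs.
for fn in (fetch_one_at_a_time, fetch_all_at_once):
    print(fn.__name__, timeit.timeit(fn, number=20))
```

The absolute numbers matter less than the comparison: the same harness, pointed at candidate drivers or access patterns, yields the apples-to-apples measurements the authors recommend collecting before deployment.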
Benchmarking is a helpful tool for tuning an application or service. In The Data Access Handbook, Goodson and Steward provide guidelines and tips for this type of benchmark. They recommend running benchmarks before putting applications into production, rather than waiting to experience unexpected performance problems. The benchmarks should be as realistic as possible, matching the data, queries and network configuration of the application or service to be put into production.