
Ken North

Dr. Dobb's Bloggers

Performance and Data Access Part 2: APIs, Benchmarks, Cloud Computing

November 03, 2009

Distributed computing divides work among multiple computers and can provide scalability, but it's not a silver bullet. Compute-intensive jobs and those operating on large data sets or databases can exhibit unacceptable execution times. System architects can respond by applying parallel processing and throwing lots of hardware at the performance problem; this class of application is a prime candidate for grid and cloud computing.

The good news about the cloud is that it provides a virtual computing resource, with the possibility of using massive data storage and arrays of servers. Cloud computing is well-suited to processing massive data sets, particularly using a grid engine and job scheduler such as Hadoop Map/Reduce. The bad news is that a public or private cloud can contain data sets and computing resources that are thousands of miles apart. For software hosted in the cloud, bandwidth and network latency issues can increase response time and user dissatisfaction. Cloud computing does not provide immunity from performance inhibitors, such as network latency and poorly optimized queries.

Experience with distributed data and distributed objects taught us that marshaling data across network connections is a first-class performance bottleneck. Jeff Dean of Google reported data about network latency in a recent keynote address. A network packet required 150,000,000 nanoseconds to make a round trip between California and the Netherlands, whereas a round trip within a data center takes about 500,000 nanoseconds and a main memory reference takes 100 nanoseconds. The axiom about putting code close to data to reduce network latency is especially important with today's distributed computing platforms, including grids and clouds.
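To put those figures in perspective, here is a back-of-the-envelope sketch in Python. The latency constants are the ones quoted above; the workload of 1,000 sequential round trips is a made-up assumption for illustration, and the calculation ignores server processing time and bandwidth limits.

```python
# Latency figures quoted in the text, in nanoseconds.
MAIN_MEMORY_REF_NS = 100
DATACENTER_ROUND_TRIP_NS = 500_000
CA_NETHERLANDS_ROUND_TRIP_NS = 150_000_000


def network_wait_seconds(round_trips: int, round_trip_ns: int) -> float:
    """Time spent waiting on the network alone, ignoring server-side work."""
    return round_trips * round_trip_ns / 1_000_000_000


# Hypothetical workload: an application that issues 1,000 sequential round trips.
ROUND_TRIPS = 1_000

local_wait = network_wait_seconds(ROUND_TRIPS, DATACENTER_ROUND_TRIP_NS)
remote_wait = network_wait_seconds(ROUND_TRIPS, CA_NETHERLANDS_ROUND_TRIP_NS)
slowdown = CA_NETHERLANDS_ROUND_TRIP_NS // DATACENTER_ROUND_TRIP_NS

print(f"same data center: {local_wait:.1f} s")    # 0.5 s
print(f"intercontinental: {remote_wait:.1f} s")   # 150.0 s
print(f"slowdown factor:  {slowdown}x")           # 300x
```

Under these assumptions, the same chatty workload goes from half a second of network wait to two and a half minutes, which is why reducing the number of round trips matters as much as reducing distance.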

Because servers and data can be widely distributed in the cloud, monitoring of data services can provide important insights about data set and database placement. Profiling software that records data access patterns provides a tool for partitioning data, sharding and establishing zones, such as Amazon EC2's availability zones. It's an enabling technology for intelligent schedulers that are aware of data access patterns in virtualized environments, grids and especially clouds. In an era of terabyte-sized (and larger) data sets, the goal is to reduce network latency by localizing data access as much as possible. The Google File System, for example, maintains metadata about the storage location of data; this enables Google Map/Reduce scheduling of batch jobs on nodes based on data locality.
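The idea of scheduling by data locality can be sketched in a few lines of Python. This is a toy illustration, not the Google or Hadoop scheduler: the `pick_node` function, the block identifiers, and the node names are all hypothetical, and a real scheduler would also weigh rack topology, load, and replica health.

```python
from typing import Dict, List, Set


def pick_node(block_id: str,
              block_locations: Dict[str, Set[str]],
              free_nodes: List[str]) -> str:
    """Prefer a free node that already stores the block; else use any free node."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    replicas = block_locations.get(block_id, set())
    for node in free_nodes:
        if node in replicas:
            return node          # data-local: no network transfer of the block
    return free_nodes[0]         # fallback: the block must be read remotely


# Hypothetical metadata: which nodes hold a replica of each block.
block_locations = {
    "block-1": {"node-a", "node-c"},
    "block-2": {"node-b"},
}
free_nodes = ["node-b", "node-c"]

print(pick_node("block-1", block_locations, free_nodes))  # node-c
print(pick_node("block-2", block_locations, free_nodes))  # node-b
print(pick_node("block-3", block_locations, free_nodes))  # node-b (no replica known)
```

The location metadata is what makes the first branch possible; without it, every task would take the fallback path and pull its data across the network.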

SQL processing

Performance of SQL applications depends on multiple factors, including the design of the database and execution plan for queries. System architects and developers looking to improve performance inevitably focus on tuning the database and optimizing queries. Often overlooked is how much time is spent in middleware and network communications when an application executes database queries.

There are numerous books about tuning databases, but precious few about data access and tuning middleware. That's why system architects and developers will benefit from reading The Data Access Handbook by John Goodson and Robert Steward. It's a book that's packed with pearls of wisdom for those working with databases. The Data Access Handbook uncovers valuable information for people creating applications and services for Linux, Windows, .NET and Java environments. It explains, for example, how to improve database query execution time by optimizing the performance of middleware, a critical component of client-server database architectures.

Book cover for The Data Access Handbook

Goodson and Steward describe the performance impact of database driver architectures, protocol handling, virtual machines (Java and .NET), garbage collection, connection pooling, statement pooling, client and server endianness, and network communication tuning. There is a particularly informative discussion of result set size, network packet size and database protocol packet size, private data networks, VPNs, subnets and packet fragmentation. The authors also discuss data services and data access for a service-oriented architecture (SOA).
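The relationship between result set size and protocol packet size lends itself to simple arithmetic. The sketch below is a simplified model of my own, not code from the book: it ignores packet headers and fragmentation at the network layer, and the 1,000,000-byte result set and the two packet sizes are hypothetical values chosen for illustration.

```python
def packets_needed(result_set_bytes: int, packet_payload_bytes: int) -> int:
    """Number of protocol packets required to carry a result set (ceiling division)."""
    if packet_payload_bytes <= 0:
        raise ValueError("packet payload size must be positive")
    if result_set_bytes < 0:
        raise ValueError("result set size cannot be negative")
    return (result_set_bytes + packet_payload_bytes - 1) // packet_payload_bytes


# Hypothetical result set of 1,000,000 bytes.
RESULT_SET_BYTES = 1_000_000

small_packets = packets_needed(RESULT_SET_BYTES, 4_096)    # 4 KB payload
large_packets = packets_needed(RESULT_SET_BYTES, 32_768)   # 32 KB payload

print(small_packets)  # 245
print(large_packets)  # 31
```

In this model, raising the payload size from 4 KB to 32 KB cuts the packet count roughly eightfold. Whether that translates into a real gain depends on the driver, the database protocol, and the network path, which is the kind of detail the book explores.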

The book includes chapters that provide performance-oriented coding and query practices for applications using today's predominant data access APIs: ADO.NET, JDBC and ODBC. The JDBC and ODBC APIs make metadata available that describes SQL and database capabilities, using methods and functions that support run-time introspection. The authors note that .NET programming differs because it requires a developer to have more understanding of the database to which an application or service connects. This comparison illustrates a difference between this book and most other texts about SQL database programming. Many books cover programming with a single API for a single database, such as Oracle. By contrast, The Data Access Handbook is platform- and database-neutral, and it provides an in-depth treatment of performance techniques for the major data access APIs. This is particularly beneficial for programming shops with SQL databases that are accessed by programs and scripts written in a variety of languages.
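To give a feel for run-time introspection, here is a small analogy using Python's standard `sqlite3` module rather than JDBC or ODBC. It is only a stand-in: the `orders` table is hypothetical, and the catalog query is specific to SQLite, whereas JDBC and ODBC expose metadata through their own API calls.

```python
import sqlite3

# In-memory database with a hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)

# Discover the columns of a query result at run time instead of hard-coding them.
cursor = conn.execute("SELECT * FROM orders")
column_names = [column[0] for column in cursor.description]
print(column_names)  # ['id', 'customer', 'total']

# Discover which tables exist by querying SQLite's catalog.
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(table_names)  # ['orders']

conn.close()
```

The program learns the shape of the data from the database itself, so the same code can work against schemas it has never seen.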

Because SQL databases can present a performance challenge, benchmarks can be an important tool. The Transaction Processing Performance Council (TPC) publishes a variety of benchmarks that provide a basis for comparing DBMS products with specific hardware configurations. Benchmarks can also provide a basis for comparing different middleware and data access solutions. "Performance Testing ODBC and Native SQL APIs" (Dr. Dobb's, February 1996) explains this class of benchmark.

Benchmarking can be a helpful tool for tuning an application or service. In The Data Access Handbook, Goodson and Steward provide guidelines and tips for this type of benchmark. They recommend running benchmarks before putting applications into production, rather than discovering unexpected performance problems afterward. The benchmarks should be as realistic as possible, matching the data, queries and network configuration of the application or service to be put into production.
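A minimal timing harness might look like the following Python sketch. It is my own illustration, not the authors' methodology: the `benchmark` function, the table `t`, and the row count are hypothetical, and it uses an in-memory SQLite database, so it measures no network or driver overhead at all. A realistic benchmark would swap in the production driver, data and network path.

```python
import sqlite3
import statistics
import time


def benchmark(conn: sqlite3.Connection, sql: str,
              runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock seconds to execute sql and fetch every row."""
    timings = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()      # include fetch time, not just execution
        elapsed = time.perf_counter() - start
        if i >= warmup:                   # discard warm-up runs (cold caches)
            timings.append(elapsed)
    return statistics.median(timings)


# Hypothetical test data: 10,000 rows with a 100-byte payload each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO t (payload) VALUES (?)",
    [("x" * 100,) for _ in range(10_000)],
)
conn.commit()

median_seconds = benchmark(conn, "SELECT id, payload FROM t")
print(f"median: {median_seconds * 1000:.2f} ms")
```

Timing the fetch as well as the execute matters because much of the middleware and network cost appears while rows are being retrieved. Reporting the median of several runs dampens the effect of one-off outliers.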



