Dr. Dobb's Journal May 1998
The Future of HDF
By Mike Folk
Mike is the HDF Project Manager at NCSA. He can be reached at [email protected].
Scientists typically use one computer to generate results and another to further analyze and visualize the data. Furthermore, they frequently share data with colleagues. The need to use a mix of computers, and to transport large amounts of data among them, was an early data-management problem for many scientists at the University of Illinois National Center for Supercomputing Applications (NCSA).
In response, NCSA developed the Hierarchical Data Format (HDF) in 1988. NCSA HDF is a portable, self-describing data format for moving and sharing scientific data in networked, heterogeneous computing environments. HDF can store several different kinds of data objects, including multidimensional arrays, raster images, color palettes, and tables. It allows individual scientists to mix and group different kinds of data in one file, according to their needs. NCSA provides a library of APIs for reading and writing HDF as well as workstation tools for visualizing data stored in HDF files.
Although HDF has evolved to meet new requirements, support new kinds of scientific data and applications, and operate effectively in new computing environments, some important new requirements seriously test the original design of HDF. Examples of these new requirements include:
- The need to store very large objects (the current HDF limit is two gigabytes).
- The need to store large numbers of objects (the current limit is 20,000 objects).
- More general, flexible data models.
- Performance improvements.
- Compatibility with object-oriented databases and distributed-object technologies.
To address these new needs, the NCSA HDF project is working on a prototype for the next generation of HDF, codenamed "HDF 5." Current plans call for three fundamental changes in HDF 5:
Unified data model. The proposed data model will support only one object type: a multidimensional array of atomic elements. The new object will have two required attributes: dimensionality (the number and sizes of dimensions) and a datatype (a definition of the array element type). More datatypes will be supported, including record structures. Objects will include optional user-defined attributes of the form "parameter = value." Users will be able to specify optional physical storage schemes for the data, such as compressed storage and possibly an indexed structure. For backward compatibility, the new HDF object is designed so that all current objects can be defined as subtypes of this basic object type.
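The unified object described above — an array with required dimensionality and datatype, plus optional name/value attributes — can be sketched as a small data structure. This is an illustrative model only; the class and field names are assumptions, not the planned HDF 5 API.

```python
# Illustrative sketch of the proposed unified HDF object (not the real API).
# Required attributes: dimensionality and a datatype. Optional user-defined
# "parameter = value" attributes ride along in a dictionary.

class HDFObject:
    def __init__(self, dims, datatype, attributes=None):
        self.dims = tuple(dims)                   # number and sizes of dimensions
        self.datatype = datatype                  # definition of the element type
        self.attributes = dict(attributes or {})  # optional "parameter = value" pairs

    @property
    def rank(self):
        # Dimensionality: the number of dimensions.
        return len(self.dims)

# An 8-bit raster image expressed as a subtype of the one basic array type:
image = HDFObject(dims=(480, 640), datatype="uint8",
                  attributes={"palette": "default", "units": "counts"})
```

The point of the sketch is that a raster image, a table, or a palette all reduce to the same array-plus-attributes object, which is how backward compatibility with current HDF objects would be preserved.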
New file structure. The new file structure will support files and objects of any size and any number of objects. The internal structure for describing objects is simpler than the current structure and should provide faster, easier access to objects.
New I/O library. In planning the next-generation HDF library, NCSA developers hope to exploit similarities between HDF and other popular scientific data formats by building a system that understands a variety of different data models and formats. APIs at the top level let programs view data according to a variety of data models. These APIs communicate with a middle arbitration layer that interprets their requests in terms of a common model. Below that, a service layer consists of file-format drivers, each of which reads from or writes to one file format. Each driver has a well-documented interface for transferring objects and lists of objects up to the arbitration layer. Possible drivers in the first implementation include HDF, BigHDF, netCDF, and FITS.
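The layered design just described — top-level APIs, an arbitration layer speaking a common model, and per-format drivers behind one interface — might be structured as follows. The driver and method names here are hypothetical illustrations, not the library's actual design.

```python
# Hypothetical sketch of the layered I/O design: an arbitration layer
# dispatches requests to per-format drivers through one shared interface.

class FormatDriver:
    """Interface every file-format driver implements (hypothetical)."""
    def read_object(self, name):
        raise NotImplementedError
    def list_objects(self):
        raise NotImplementedError

class HDFDriver(FormatDriver):
    """Toy driver; a real one would parse an HDF file."""
    def __init__(self):
        self._objects = {"temperature": [280.5, 281.1]}  # stand-in for file contents
    def read_object(self, name):
        return self._objects[name]
    def list_objects(self):
        return sorted(self._objects)

class ArbitrationLayer:
    """Middle layer: interprets API requests in terms of a common model."""
    def __init__(self):
        self._drivers = {}
    def register(self, fmt, driver):
        self._drivers[fmt] = driver
    def read(self, fmt, name):
        return self._drivers[fmt].read_object(name)

# A netCDF or FITS driver would register the same way, so top-level APIs
# never need to know which on-disk format holds the data.
io = ArbitrationLayer()
io.register("hdf", HDFDriver())
data = io.read("hdf", "temperature")
```

The design choice being illustrated is that adding a new format means writing one driver against the documented interface, with no change to the top-level APIs.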
DDJ
Copyright © 1998, Dr. Dobb's Journal