HDF: The Hierarchical Data Format

The Hierarchical Data Format, developed at the National Center for Supercomputing Applications, is a portable, self-describing data format for moving and sharing scientific data in networked, heterogeneous computing environments.



May 01, 1998
URL:http://www.drdobbs.com/database/hdf-the-hierarchical-data-format/184410553


Brand is the chairman of Fortner Software LLC and coauthor of Number by Colors (with Ted Meyer) and The Data Handbook, both from Springer-Verlag. He can be contacted at [email protected].


Sidebar: The Future of HDF

Although there is no widespread standard for sharing complex technical and scientific data, the Hierarchical Data Format (HDF), developed at the National Center for Supercomputing Applications (NCSA), is a leading candidate for such a standard. NASA's Mission to Planet Earth Project, an $8 billion effort to monitor the earth's environment, for example, uses HDF as its baseline data standard for its million-gigabyte data archive. The HDF software is available in source and precompiled form for Windows 95/NT, Macintosh, and UNIX, and can be downloaded at no charge from http://hdf.ncsa.uiuc.edu/.

The HDF subroutine library is designed to be easy to use. To write a C character array that represents an eight-bit color image to an HDF file, for example, the only HDF call that's required is ret=DFR8addimage("myfile.hdf",image1,rows,cols,0);, which creates and initializes the file myfile.hdf, adds an HDF directory entry to the file, and stores the image as an HDF object. Although written in C, HDF includes a set of wrapper functions that make it accessible from Fortran.

The NCSA HDF Library

The HDF library has three layers: low-level interface, single-file applications, and multifile applications. Most HDF developers work with the last two layers, which hide the low-level details.

Each layer contains a number of interfaces, each of which contains functions that encapsulate a particular operation or data structure. All functions in a single interface begin with the same letters. Table 1 lists the HDF interfaces, grouped by layer.

Two HDF interfaces support multidimensional arrays, called "scientific datasets" (SDS) in the HDF documentation. HDF 3.3's SD interface, which has largely replaced the older DFSD interface, supports simultaneous access to more than one file.

The SD interface supports multidimensional arrays containing 8-, 16-, or 32-bit signed or unsigned integers, or 32- or 64-bit floating-point numbers. It lets you specify the coordinate system used by the data, and the scale and units associated with each axis. You can also specify how values should be formatted for display, attach a label to each variable, and store the maximum/minimum values for reference.
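As a minimal sketch of these annotation calls, the following fragment attaches descriptive strings and a valid range to an already-created SDS. The dataset id sdid, the strings, and the range values are illustrative, not part of the article's listings:

#include "mfhdf.h"

/* Attach a label, units, display format, coordinate system, and a
   valid max/min range to an open SDS. All values are made up. */
void annotate_sds(int32 sdid)
{
    float64 maxval = 100.0, minval = 0.0;

    /* label, units, C-style display format, coordinate system */
    SDsetdatastrs(sdid, "Plant density", "plants/m^2", "6.2f", "cartesian");
    /* store the maximum/minimum for readers to consult */
    SDsetrange(sdid, (VOIDP)&maxval, (VOIDP)&minval);
}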

The Vdata model (VS, VSQ, and VF interfaces) makes it easy to store tables of data in HDF files. Each table consists of a series of records, each of which contains a series of fields. Each field can support its own number type. Valid field types include 8-, 16-, 32-bit signed and unsigned integers, 32- and 64-bit floating-point numbers, and ASCII characters.

Vdata tables use three kinds of identifying information: a name, a class, and field names.

Example 1 is a Vdata table with three fields and four records.

The Vdata VS interface provides routines for reading/writing tables, the VSQ interface routines for querying Vdatas, and the VF interface routines for manipulating individual Vdata fields.
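To sketch how these routines fit together, the following creates a two-field Vdata and writes four packed records. The file name, field names, and values are illustrative; it assumes a file opened with Hopen:

#include <string.h>
#include "hdf.h"

#define NRECORDS 4

/* Create a Vdata with an int16 field and a float32 field, then
   write four records packed field-by-field. */
int write_vdata(void)
{
    int32 fid, vdata_id, i;
    int16 temp[NRECORDS] = { 10, 20, 30, 40 };
    float32 height[NRECORDS] = { 1.5f, 2.5f, 3.5f, 4.5f };
    uint8 buf[NRECORDS * (sizeof(int16) + sizeof(float32))], *p = buf;

    fid = Hopen("vdata.hdf", DFACC_CREATE, 0);
    Vstart(fid);

    vdata_id = VSattach(fid, -1, "w");      /* -1 creates a new Vdata */
    VSsetname(vdata_id, "Sample Table");
    VSfdefine(vdata_id, "Temp", DFNT_INT16, 1);
    VSfdefine(vdata_id, "Height", DFNT_FLOAT32, 1);
    VSsetfields(vdata_id, "Temp,Height");   /* field order in each record */

    /* Pack the records with no padding, as FULL_INTERLACE expects */
    for (i = 0; i < NRECORDS; i++) {
        memcpy(p, &temp[i], sizeof(int16));     p += sizeof(int16);
        memcpy(p, &height[i], sizeof(float32)); p += sizeof(float32);
    }
    VSwrite(vdata_id, buf, NRECORDS, FULL_INTERLACE);

    VSdetach(vdata_id);
    Vend(fid);
    return Hclose(fid);
}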

The Vgroup interface provides routines for reading/writing groupings of HDF data objects in a particular HDF file. Each Vgroup may contain any number of other HDF data objects -- even other Vgroups. In addition to its members, a Vgroup may be given a name and a class. Every Vgroup function name begins with the prefix V. The VH interface contains routines for simple creation of Vgroups and Vdatas.
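A minimal sketch of grouping two existing objects under a named Vgroup follows; the tag/ref arguments stand in for objects already written to the file, and the name and class strings are illustrative:

#include "hdf.h"

/* Collect two existing data objects (identified by Tag/Ref pairs)
   under a new, named Vgroup. */
void group_objects(int32 fid, int32 tag1, int32 ref1,
                   int32 tag2, int32 ref2)
{
    int32 vgroup_id;

    Vstart(fid);
    vgroup_id = Vattach(fid, -1, "w");    /* -1 creates a new Vgroup */
    Vsetname(vgroup_id, "My Collection");
    Vsetclass(vgroup_id, "Example");

    Vaddtagref(vgroup_id, tag1, ref1);    /* membership is a Tag/Ref list */
    Vaddtagref(vgroup_id, tag2, ref2);

    Vdetach(vgroup_id);
    Vend(fid);
}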

HDF Examples

Listing One takes an eight-bit raster image (stored as a two-dimensional array of eight-bit values), creates a linear grayscale palette, and stores both in an HDF file.

Scientific data is often stored in multidimensional arrays. For example, a biologist might measure the plant density in each square of a large grid, or an astronomer might have a computer simulation that provides the density at each point in a large gas cloud.

Listing Two illustrates how to store a three-dimensional array of data (including some related comments) in an HDF file using the SD interface. Listing Three retrieves that same data; it doesn't need to know in advance the size or dimensionality of the data -- it can read all of that information from the file.

The HDF Disk Format

An HDF data file consists of a magic number (the control characters ^N^C^S^A), a directory that points to data elements, and the data elements themselves.

The HDF directory entries -- known as "Data Descriptors" (DD) -- are organized into a series of DD blocks. The HDF directory (or DD list) consists of a series of these blocks. Each DD block contains two bytes indicating the number of DDs in that block, a four-byte integer with the location of the next DD block (or 0 if this is the last one), and then the actual directory entries. Figure 1 shows the beginning of a simple HDF file with one DD block containing two DDs.

The location of the next DD block entry (like all locations within an HDF file) is given in bytes from the beginning of the HDF file. Since all locations are given in 32-bit signed integers, the maximum size of a self-contained HDF file or of a single data element is about two gigabytes.

Every data element (image, array, annotation, and so on) in the HDF file has an associated Data Descriptor in the DD list. Every DD is 12 bytes long and has four fields; see Figure 2.
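In C terms, the structures just described might be pictured as follows. This is only an illustration, not the library's own declarations: the on-disk values are big-endian and packed, so a real reader cannot simply overlay these structs on the file.

/* Illustrative sketch of the 12-byte Data Descriptor and the
   DD block header. */
struct dd {
    uint16 tag;     /* data element type, e.g., 306 for DFTAG_RIG    */
    uint16 ref;     /* reference number distinguishing like-tagged objects */
    int32  offset;  /* byte offset of the element from start of file */
    int32  length;  /* length of the element in bytes                */
};

struct dd_block_header {
    uint16 ndds;        /* number of DDs in this block            */
    int32  next_block;  /* offset of the next DD block, 0 if last */
};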

The Tag field, which defines the data element type, is defined as a 16-bit unsigned integer, which means there are 65,535 possible types of data elements. (0 is not a legal tag number.) The possible tag values are divided into three ranges: Those below 32,768 are assigned by NCSA, those above 64,999 are reserved, and the rest can be user defined. Table 2 shows some of the data element tag types.

The HDF library assigns a reference number (Ref), which identifies data elements with the same Tag, to each data object as it is written to the file. The library keeps track of the reference numbers that have been used in the file and guarantees that no two data objects will have the same tag and reference number in the same file. A Tag/Ref pair can then be used as a unique ID (known in HDF terms as a "data identifier") to unambiguously identify particular data elements.

It is allowable for two objects to carry the same tag (such as our two raster images), or for two objects with different tags to carry the same reference number (such as a raster image and an SDS). However, you shouldn't assume that objects with the same reference number have any special relationship. Instead, HDF uses Groups to define relationships.

Relationships between objects are handled with one of three group objects. The Raster Image Group Tag (DFTAG_RIG, tag 306) groups objects associated with a particular raster image. The Numeric Data Group Tag (DFTAG_NDG, tag 720) groups data objects associated with a particular scientific dataset. The Vgroup tag (DFTAG_VG, tag 1965) supports a generalized grouping of all types of data objects, even other Vgroups.

Group objects contain a list of Tag/Ref pairs for each object in that group. From that, a program can locate the DD entry for the object, which gives the offset and size of the object in the file. (An Offset field gives the location of the data element in bytes from the start of the file, while the Length field gives the length of the data element in bytes.)

Both Raster Image Groups (RIG) and Numeric Data Groups (NDG) consist of nothing but a list of data identifiers, each four bytes long. Each is restricted to contain only certain types of tags.

Vgroups do not have this limitation: They can contain any collection of Data Identifiers, including a Vgroup Data Identifier. This lets you construct tree-structured organizations similar to disk directories.

Example 2 shows an HDF file with Vgroups. (I have cheated a bit in this table, because the structure of Vgroup data elements is a bit more complicated than what is shown.)

This file has three objects, all listed in the DD list. The first Vgroup refers to the second Vgroup, which refers to the image size object. As with pointers in C, it's possible to create circular links; it is the responsibility of the data producer to ensure that the structure makes sense for the application.

HDF Scientific Datasets

The most important data objects in HDF files are scientific datasets, which are multidimensional arrays of numbers. The array can have any dimensionality up to 32,767, and consist of floating-point or integer values.

The DFSD Interface. An SDS, whether created by the DFSD or SD interface, consists of several associated data objects: attributes, numerical scales, the actual data, and so on. In the DFSD interface, all the DDs of the data objects that are associated with a particular SDS are listed in a numeric data group (DFTAG_NDG). Some older HDF files use the roughly equivalent scientific data group (DFTAG_SDG) to collect the members of an SDS. In either case, only scientific dataset tags and a few utility tags are allowed in a numeric data group.

Example 3 shows the data identifiers stored in the NDG for an example SDS, along with the information they point to. I ignore the actual binary format of the data identifiers and concentrate instead on the information they reference. The Ref, Offset, and Length fields for each DD are not shown.

The DFTAG_SDD tag says that the data set consists of two dimensions, sized four and ten elements, respectively, where the data and both scales consist of floating-point numbers. The numerical scales are set by DFTAG_SDS, with a four-element array for the first dimension and a ten-element array for the second dimension.

The SD Interface. The basic ideas behind the SD version of the SDS are the same as those in the DFSD SDS, but some changes were needed to achieve the major design goal: compatibility with netCDF. This goal gave rise to three distinct differences between the two implementations.

First, rather than using a specialized tag such as DFTAG_NDG to group its elements, the SD interface uses a general-purpose Vgroup. Using a Vgroup allows for more flexibility in the organization of the SDS without a complete redesign. Second, the handling of dimensions has changed considerably, in that they can now have their own attributes. Third, the SD interface allows the use of user-defined attributes, not just the ones that are predefined.

The SD routines also allow several HDF files to remain open simultaneously, something not possible with the earlier libraries. The interface can also assign arbitrary attributes to an SDS. In the documentation, SD-created datasets are referred to as "New Style" SDS, or (confusingly) just SDS. The multifile interface recognizes all older forms of the SDS.

HDF Vdata

Vdata tables require the use of two tags: DFTAG_VH for naming the columns of values, and DFTAG_VS for pointing to the actual Vdata itself. No explicit grouping is needed for Vdata; related DFTAG_VS and DFTAG_VH records simply carry the same reference number. This is the only case where the HDF libraries use reference numbers explicitly to associate two data elements.

Example 4 shows the Vdata tags needed for Example 1. Again, we focus on the information pointed to by the tags and ignore the actual binary formats. In particular, we have assigned names to the various fields defined in the data record pointed to by DFTAG_VH. These field names are also used in the HDF documentation.

The DFTAG_VH data element defines a Vdata with three fields (columns) and four Vdata records (rows). The values in these fields are defined as short (two-byte) integer, four-byte floating point, and five-character ASCII text, respectively. In addition, each field has an ASCII text name. Note how the entire Vdata definition can be named (<name>). This is unusual; most HDF data objects require a separate tag (DFTAG_DIL) to get a name.

The data pointed to by the DFTAG_VS tag is stored packed. In this example, each record requires 2+4+5=11 bytes, and the complete table takes 11×4=44 bytes.
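The packing matters because the same three fields declared as a native C struct would usually occupy more than 11 bytes, thanks to alignment padding. A small sanity check (illustrative, not from the article's listings):

#include <stdio.h>

/* A packed Vdata record is 2 + 4 + 5 = 11 bytes; the equivalent
   C struct is typically larger because of alignment padding. */
struct unpacked { short idx; float val; char name[5]; };

int main(void)
{
    size_t packed = sizeof(short) + sizeof(float) + 5;   /* 11 */
    printf("packed record: %lu bytes, table of 4: %lu bytes\n",
           (unsigned long)packed, (unsigned long)(packed * 4));
    printf("same fields as a C struct: %lu bytes\n",
           (unsigned long)sizeof(struct unpacked));
    return 0;
}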

HDF Extended Tags

The organizational features of HDF are so powerful that it is usually possible to store a complete data product granule in a single HDF file. There are, however, a couple of problems with very large single-file data products.

The first problem is that HDF files are limited to two gigabytes in size. The second problem is that HDF data elements were not initially designed to be appendable. Normally, data elements are required to be in contiguous storage and they are packed right next to each other. Therefore, to make a data element bigger, it needs to be copied to the end of the file where there is room to grow, leaving a gap in the file.

To get around these limitations, NCSA has defined "extended tags," which allow a data element to be spread among multiple locations in the HDF file, or even stored in a completely separate file. A data element corresponding to any NCSA-defined tag value can (in principle) be converted to an extended tag, although this functionality is currently supplied only for SDS and Vdata.

An extended tag DD does not point directly to the data, as do normal tags. It instead points to a data object defining where the data is and how it is stored. This data object may point to the beginning of a linked list of data blocks that contain the entire data record. This way, a data record can be lengthened just by adding an additional data block; the entire HDF file does not need to be rewritten.
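Conceptually, the chain resembles a linked list of blocks. The struct below is only a sketch of that idea, not the actual on-disk layout of HDF linked-block elements:

/* Conceptual illustration: an element stored as a chain of blocks
   can grow by appending another block, leaving the rest of the
   file untouched. Field names and the block size are invented. */
struct data_block {
    int32 next_offset;   /* file offset of the next block, 0 if last */
    int32 used;          /* bytes of data actually stored here       */
    uint8 data[4096];    /* block payload (size is illustrative)     */
};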

Alternatively, the extended tag record can define the data element as being stored in an external element in another disk file. This feature (also found in CDF) lets users get around the two-gigabyte total file size limitation. The reference to the external file is explicit: That file must exist exactly as specified by directory and name, which makes such files awkward to move between disparate computers.
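With the SD interface, external storage is requested through SDsetexternalfile. A minimal sketch, with an illustrative filename and offset:

#include "mfhdf.h"

/* Route an SDS's data to a separate file: subsequent writes to sdid
   are stored in "external.dat" starting at byte offset 0. */
void move_data_external(int32 sdid)
{
    SDsetexternalfile(sdid, "external.dat", 0);
}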

DDJ

Listing One

#include "hdf.h"

#define WIDTH     5
#define HEIGHT    6

main(int argc, char *argv[])
{
    uint8 palette_data[768];
    intn i;
    /*  Initialize the image array  */
    static uint8 raster_data[HEIGHT][WIDTH] =
                { 1,  2,  3,  4,  5,
                  6,  7,  8,  9, 10,
                 11, 12, 13, 14, 15,
                 16, 17, 18, 19, 20,
                 21, 22, 23, 24, 25,
                 26, 27, 28, 29, 30 };

    /* Initialize the palette to standard linear grayscale */
    for (i = 0; i < 256; i++) {
        palette_data[i*3]   = i;
        palette_data[i*3+1] = i;
        palette_data[i*3+2] = i;
    }
    /* Associate the palette with the image */
    DFR8setpalette(palette_data);
    /* Write the 8-bit raster image to the file */
    DFR8addimage("example.hdf", raster_data, WIDTH, HEIGHT, 0);
}


Listing Two

#include "hdf.h"
#include "mfhdf.h"

#define LENGTH  3
#define HEIGHT  2
#define WIDTH   5

main(int argc, char *argv[])
{
    /*  Initialize the data array  */
    static float64 scien_data[LENGTH][HEIGHT][WIDTH] =
                {  1.,  2.,  3.,  4.,  5.,
                   6.,  7.,  8.,  9., 10.,
                  11., 12., 13., 14., 15.,
                  16., 17., 18., 19., 20.,
                  21., 22., 23., 24., 25.,
                  26., 27., 28., 29., 30. };
    int32 dims[3] = {LENGTH, HEIGHT, WIDTH};
    int16 scale0[LENGTH] = {2, 4, 6};
    int32 scale1[HEIGHT] = {1234567, 2345678};
    float32 scale2[WIDTH] = {2.2, 4.4, 6.6, 8.8, 11.0};
    float64 avg = 15.0;
    int32 start[3] = {0, 0, 0};
    int32 fid, sdid, dimid0, dimid1, dimid2;

    /* Open file and initialize SD interface */
    fid = SDstart("example.hdf", DFACC_CREATE);
    /* Create named data set */
    sdid = SDcreate(fid, "Sample Data Set", DFNT_FLOAT64, 3, dims);

    /* Set up dimension zero */
    dimid0 = SDgetdimid(sdid, 0);
    SDsetdimname(dimid0, "Dimension 0");
    SDsetdimstrs(dimid0, "The zeroth dimension", "mm", "2d");
    SDsetdimscale(dimid0, LENGTH, DFNT_INT16, (VOIDP)scale0);

    /* Set up dimension one */
    dimid1 = SDgetdimid(sdid, 1);
    SDsetdimname(dimid1, "Dimension 1");
    SDsetdimstrs(dimid1, "The first dimension", "cm", "8d");
    SDsetdimscale(dimid1, HEIGHT, DFNT_INT32, (VOIDP)scale1);

    /* Set up dimension two */
    dimid2 = SDgetdimid(sdid, 2);
    SDsetdimname(dimid2, "Dimension 2");
    SDsetdimstrs(dimid2, "The second dimension", "m", "4.1f");
    SDsetdimscale(dimid2, WIDTH, DFNT_FLOAT32, (VOIDP)scale2);

    /* Write the data array to the data set */
    SDwritedata(sdid, start, NULL, dims, (char *)scien_data);
    /* Add one local attribute */
    SDsetattr(sdid, "Average", DFNT_FLOAT64, 1, (char *)&avg);
    /* Add one global attribute */
    SDsetattr(fid, "Date", DFNT_CHAR8, 9, "10/29/93");

    /* Close the data set, file, and interface */
    SDendaccess(sdid);
    SDend(fid);
}


Listing Three

#include <string.h>

#include "hdf.h"
#include "mfhdf.h"

#define MAXRANK  3
#define LENGTH   3
#define HEIGHT   2
#define WIDTH    5
#define DATESIZE 9

main(int argc, char *argv[])
{
    float64 scien_data[LENGTH][HEIGHT][WIDTH];
    int32 dims[MAXRANK];
    int16 scale0[LENGTH];
    int32 scale1[HEIGHT];
    float32 scale2[WIDTH];
    float64 avg;
    int32 start[MAXRANK] = {0, 0, 0};
    int32 fid, sdid, dimid;
    int32 index, rank, nattrs, ndatasets, nglobals;
    int32 nt, count, size, status;
    char name[80], date[80];

    /* Open file and initialize SD interface */
    fid = SDstart("example.hdf", DFACC_RDONLY);
    status = SDfileinfo(fid, &ndatasets, &nglobals);

    /* Read global attribute */
    if (nglobals == 1) {
        status = SDattrinfo(fid, 0, name, &nt, &size);
        if ((strcmp(name, "Date") == 0) && (nt == DFNT_CHAR8)
                                        && (size == DATESIZE))
            SDreadattr(fid, 0, date);
    }
    /* Open first data set */
    index = SDnametoindex(fid, "Sample Data Set");
    sdid = SDselect(fid, index);
    SDgetinfo(sdid, name, &rank, dims, &nt, &nattrs);
    /* Read in data if everything looks okay */
    if ((rank == MAXRANK) && (dims[0] == LENGTH) && (dims[1] == HEIGHT)
                          && (dims[2] == WIDTH) && (nt == DFNT_FLOAT64))
        SDreaddata(sdid, start, NULL, dims, scien_data);
    /* Read local attribute */
    status = SDattrinfo(sdid, 0, name, &nt, &size);
    if ((strcmp(name, "Average") == 0) && (nt == DFNT_FLOAT64) && (size == 1))
        SDreadattr(sdid, 0, &avg);
    /* Read dimensions */
    dimid = SDgetdimid(sdid, 0);
    SDdiminfo(dimid, name, &count, &nt, &nattrs);
    if ((nt == DFNT_INT16) && (count == LENGTH))
        SDgetdimscale(dimid, scale0);
    dimid = SDgetdimid(sdid, 1);
    SDdiminfo(dimid, name, &count, &nt, &nattrs);
    if ((nt == DFNT_INT32) && (count == HEIGHT))
        SDgetdimscale(dimid, scale1);
    dimid = SDgetdimid(sdid, 2);
    SDdiminfo(dimid, name, &count, &nt, &nattrs);
    if ((nt == DFNT_FLOAT32) && (count == WIDTH))
        SDgetdimscale(dimid, scale2);
    /* Close the data set, file, and interface */
    SDendaccess(sdid);
    SDend(fid);
}



Example 1: Example Vdata table.


Example 2: Organization of an HDF file with two Vgroups.


Example 3: Data objects for an example SDS.


Example 4: Vdata tags for Example 1.


Figure 1: Beginning of a simple HDF file.


Figure 2: HDF Data Descriptor layout.


The Future of HDF

By Mike Folk

Mike is the HDF Project Manager at NCSA. He can be reached at [email protected].


Scientists typically generate results on one computer and use another to further analyze and visualize the data. Furthermore, they frequently share data with colleagues. The need to use a mix of computers and transport large amounts of data among many different machines was an early data management problem for many scientists at the University of Illinois National Center for Supercomputing Applications (NCSA).

In response, NCSA developed the Hierarchical Data Format (HDF) in 1988. NCSA HDF is a portable, self-describing data format for moving and sharing scientific data in networked, heterogeneous computing environments. HDF can store several different kinds of data objects, including multidimensional arrays, raster images, color palettes, and tables. It allows individual scientists to mix and group different kinds of data in one file, according to their needs. NCSA provides a library of APIs for reading and writing HDF as well as workstation tools for visualizing data stored in HDF files.

Although HDF has evolved to meet new requirements, support new kinds of scientific data and applications, and operate effectively in new computing environments, some important new requirements seriously test the original design of HDF. Examples include support for much larger files and far more objects per file, richer datatypes such as record structures, and better interoperability with other scientific data formats.

To address these new needs, the NCSA HDF project is working on a prototype for the next generation of HDF, codenamed "HDF 5." Current plans call for three fundamental changes in HDF 5:

Unified data model. The proposed data model will support only one datatype: a multidimensional array of atomic elements. The new object will have two required attributes: dimensionality (the number and sizes of dimensions) and a data type (a definition of the type of the array elements). More data types will be supported, including record structures. Objects will include optional user-defined attributes of the form "parameter = value." Users will specify optional physical storage schemes for the data, such as compressed storage and possibly an indexed structure. For backward compatibility, the new HDF object is designed so that all current objects can be defined as subtypes of this basic object type.

New file structure. The new file structure will support files and objects of any size and any number of objects. The internal structure for describing objects is simpler than the current structure and should provide faster, easier access to objects.

New I/O library. In planning the next-generation HDF library, NCSA developers hope to exploit similarities between HDF and other popular scientific data formats by building a system that understands a variety of different data models and formats. APIs at the top level allow programs to view data according to a variety of different data models. These APIs communicate with the middle layer that interprets their requests in terms of a common model. The service layer consists of different file-format drivers, each of which reads from or writes to one file format. Each driver has a well-documented interface for transferring objects and lists of objects to the higher arbitration layer. Possible drivers in the first implementation include HDF, BigHDF, netCDF, and FITS.
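One way to picture the driver idea is as a table of operations that each format module exports to the arbitration layer. This is purely an illustration of the architecture described above, not NCSA's actual interface; every name in it is invented:

/* Hypothetical sketch of a file-format driver's operation table. */
typedef struct format_driver {
    const char *name;                             /* e.g., "netCDF" */
    int (*open)(const char *path, void **handle);
    int (*read_object)(void *handle, const char *objname,
                       void *buf, long bufsize);
    int (*write_object)(void *handle, const char *objname,
                        const void *buf, long bufsize);
    int (*close)(void *handle);
} format_driver;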

DDJ


Table 1: HDF interfaces by layer: (a) low-level layer interfaces; (b) single-file application layer interfaces; (c) multifile application layer interfaces.


Table 2: A selection of NCSA-defined HDF tag types: (a) utility tags; (b) raster image tags; (c) scientific dataset tags; (d) Vgroup/Vdata tags.


Copyright © 1998, Dr. Dobb's Journal
