Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

HDF: The Hierarchical Data Format


HDF: The Hierarchical Data Format

Brand is the chairman of Fortner Software LLC and coauthor of Number by Colors (with Ted Meyer) and The Data Handbook, both from Springer-Verlag. He can be contacted at [email protected].


Sidebar: The Future of HDF

Although there is no widespread standard for sharing complex technical and scientific data, the Hierarchical Data Format (HDF), developed at the National Center for Supercomputing Applications (NCSA), is a leading candidate for such a standard. NASA's Mission to Planet Earth Project, an $8 billion effort to monitor the earth's environment, for example, uses HDF as its baseline data standard for its million-gigabyte data archive. The HDF software is available in source and precompiled form for Windows 95/NT, Macintosh, and UNIX, and can be downloaded at no charge from http://hdf.ncsa.uiuc.edu/.

The HDF subroutine library is designed to be easy to use. To write a C character array that represents an eight-bit color image to an HDF file, for example, the only HDF call that's required is ret=DFR8addimage("myfile.hdf",image1,rows,cols,0);, which creates and initializes the file myfile.hdf, adds an HDF directory entry to the file, and stores the image as an HDF object. Although written in C, HDF includes a set of wrapper functions that make it accessible from Fortran.

The NCSA HDF Library

The HDF library has three layers: low-level interface, single-file applications, and multifile applications. Most HDF developers work with the last two layers, which hide the low-level details.

Each layer contains a number of interfaces, each of which contains functions that encapsulate a particular operation or data structure. All functions in a single interface begin with the same letters. Table 1 lists the HDF interfaces, grouped by layer.

Two HDF interfaces support multidimensional arrays, called "scientific datasets" (SDS) in the HDF documentation. HDF 3.3's SDS interface, which has largely replaced the older DFSD interface, supports simultaneous access to more than one file at a time.

The SDS interface supports multidimensional arrays containing 8-, 16-, or 32-bit signed or unsigned integers, or 32- or 64-bit floating-point numbers. This interface lets you specify the coordinate system used by the data, and the scale and units associated with each axis. You can specify how values should be formatted for display, a label for each variable, and store the maximum/minimum values for reference.

The Vdata model (VS, VSQ, and VF interfaces) makes it easy to store tables of data in HDF files. Each table consists of a series of records, each of which contains a series of fields. Each field can support its own number type. Valid field types include 8-, 16-, 32-bit signed and unsigned integers, 32- and 64-bit floating-point numbers, and ASCII characters.

Vdata tables use three kinds of identifying information:

  • A name describes the origin and contents of a table.
  • A class identifies the meaning of data.
  • Field names identify the individual fields that make up a record.

Example 1 is a Vdata table with three fields and four records.

The Vdata VS interface provides routines for reading/writing tables, the VSQ interface routines for querying Vdatas, and the VF interface routines for manipulating individual Vdata fields.

The Vgroup interface provides routines for reading/writing groupings of HDF data objects in a particular HDF file. Each Vgroup may contain any number of other HDF data objects -- even other Vgroups. In addition to its members, a Vgroup may also be given a Vgroup name and Vgroup class. Every function of Vgroup begins with the prefix V. The VH interface contains routines for simple creation of Vgroups and Vdatas.

HDF Examples

Listing One takes an eight-bit raster image (stored as a two-dimensional array of eight-bit values), creates a linear grayscale palette, and stores this to an HDF file.

Scientific data is often stored in multidimensional arrays. For example, a biologist might measure the plant density in each square of a large grid, or an astronomer might have a computer simulation that provides the density at each point in a large gas cloud.

Listing Two illustrates how to store a three-dimensional array of data (including some related comments) in an HDF file using the SD interface. Listing Three retrieves that same data. Listing Three doesn't need to know in advance the size or dimensionality of the data -- it can read all of that information from the file.

The HDF Disk Format

An HDF data file consists of a magic number (the control characters ^N^C^S^A), a directory that points to data elements, and the data elements themselves.

The HDF directory entries -- known as "Data Descriptors" (DD) -- are organized into a series of DD blocks. The HDF directory (or DD list) consists of a series of these blocks. Each DD block contains two bytes indicating the number of DDs in that block, a four-byte integer with the location of the next DD block (or 0 if this is the last one), and then the actual directory entries. Figure 1 shows the beginning of a simple HDF file with one DD block containing two DDs.

The location of the next DD block entry (like all locations within an HDF file) is given in bytes from the beginning of the HDF file. Since all locations are given in 32-bit signed integers, the maximum size of a self-contained HDF file or of a single data element is about two gigabytes.

Every data element (image, array, annotation, and so on) in the HDF file has an associated Data Descriptor in the DD list. Every DD is 12 bytes long and has four fields; see Figure 2.

The Tag field, which defines the data element type, is defined as a 16-bit unsigned integer, which means there are 65,535 possible types of data elements. (0 is not a legal tag number.) The possible tag values are divided into three ranges: Those below 32,768 are assigned by NCSA, those above 64,999 are reserved, and the rest can be user defined. Table 2 shows some of the data element tag types.

The HDF library assigns a reference number (Ref), which identifies data elements with the same Tag, to each data object as it is written to the file. The library keeps track of the reference numbers that have been used in the file and guarantees that no two data objects will have the same tag and reference number in the same file. A Tag/Ref pair can then be used as a unique ID (known in HDF terms as a "data identifier") to unambiguously identify particular data elements.

It is allowable for two objects to carry the same tag (such as our two raster images), or for two objects with different tags to carry the same reference number (such as a raster image and an SDS). However, you shouldn't assume that objects with the same reference number have any special relationship. Instead, HDF uses Groups to define relationships.

Relationships between objects are handled with one of three group objects. The Raster Image Group Tag (DFTAG_RIG, tag 306) groups objects associated with a particular raster image. The Numeric Data Group Tag (DFTAG_NDG, tag 720) groups data objects associated with a particular scientific dataset. The Vgroup tag (DFTAG_VG, tag 1965) supports a generalized grouping of all types of data objects, even other Vgroups.

Group objects contain a list of Tag/Ref pairs for each object in that group. From that, a program can locate the DD entry for the object, which gives the offset and size of the object in the file. (An Offset field gives the location of the data element in bytes from the start of the file, while the Length field gives the length of the data element in bytes.)

Both Raster Image Groups (RIG) and Numeric Data Groups (NDG) consist of nothing but a list of data identifiers, each four bytes long. Each is restricted to contain only certain types of tags.

Vgroups do not have this limitation: They can contain any collection of Data Identifiers, including a Vgroup Data Identifier. This lets you construct tree-structured organizations similar to disk directories.

Example 2 shows an HDF file with Vgroups. (I have cheated a bit in this table, because the structure of Vgroup data elements is a bit more complicated than what is shown.)

This file has three objects, all listed in the DD list. The first VGroup refers to the second VGroup, which refers to the image size object. As with pointers in C, it's possible to create circular links; it is the responsibility of the data producer to ensure that the structure makes sense for the application.

HDF Scientific Datasets

The most important data objects in HDF files are scientific datasets, which are multidimensional arrays of numbers. The array can have any dimensionality up to 32,767, and consist of floating-point or integer values.

The DFSD Interface. An SDS, whether created by the DFSD or SD interface, consists of several data objects that are associated with the SDS. These data objects include attributes, numerical scales, the actual data, and so on. In the DFSD interface, all the DDs of the data objects that are associated with a particular SDS are listed in a numeric data group (DFTAG_NDG). Some older HDF files use the roughly equivalent scientific data group (DFTAG_SDG) to collect its members. In either case, only scientific data set tags and a few utility tags are allowed in a numeric data group.

The data identifiers stored in the NDG for this SDS are in Example 3, along with the information pointed to by the data identifiers. I ignore in this list the actual format of the data identifiers, and concentrate instead on the information that they point to. The Ref, Offset, and Length fields for each DD are not shown.

The DFTAG_SDD tag says that the data set consists of two dimensions, sized four and ten elements, respectively, where the data and both scales consist of floating-point numbers. The numerical scales are set by DFTAG_SDS, with a four-element array for the first dimension and a ten-element array for the second dimension.

The SD Interface. The basic ideas behind the SD version of the SDS are the same as those in the DFSD SDS, but some changes were needed in order to achieve the major design goal: compatibility with netCDF. This driver gave rise to three distinct differences between the two implementations.

First, rather than using a specialized tag such as DFTAG_NDG to group its elements, the SD interface uses a general-purpose Vgroup. Using a Vgroup allows for more flexibility in the organization of the SDS without a complete redesign. Second, the handling of dimensions has changed considerably, in that they can now have their own attributes. Third, the SD interface allows the use of user-defined attributes, not just the ones that are predefined. What follows is a list of the different styles of SDS, in order of their adoption into the HDF library.

  • Scientific Data Groups (SDG), an obsolete form of SDS that only supports 32-bit floating-point arrays.
  • Numeric Data Groups (NDG), the disk format for NDG groups (tag value = DFTAG_NDG).
  • Multifile SDS, the SD interface (after the prefix of the subroutine calls) uses Vgroups, Vdatas, and many of the parts of the NDG to create structures that are compatible with netCDF data structures.

These subroutines allow several HDF files to remain open simultaneously, something not possible with the earlier libraries. This interface can also assign arbitrary attributes to SDS. In the documentation they are also referred to as "New Style" SDS, or (confusingly) just SDS. The multifile interface will recognize all older forms of the SDS.

HDF Vdata

Vdata tables requires the use of two tags: DFTAG_VH for naming the columns of values; and DFTAG_VS for pointing to the actual Vdata data itself. There is no explicit grouping needed for Vdata; related DFTAG_VS and DFTAG_VH records have the same reference number. This is the only case where the HDF libraries use the reference numbers explicitly to associate two data elements.

Example 4 shows the Vdata tags needed for Example 1. Again, we focus on the information pointed to by the tags and ignore the actual binary formats. In particular, we have assigned names to the various fields defined in the data record pointed to by DFTAG_VH. These field names are also used in the HDF documentation.

The DFTAG_VH data element defines a Vdata with three fields (columns) and four Vdata records (rows). The values in these fields are defined as short (two-byte) integer, four-byte floating point, and five-character ASCII text, respectively. In addition, each field has an ASCII text name. Note how the entire Vdata definition can be named (<name>). This is unusual; most HDF data objects require a separate tag (DFTAG_DIL) to get a name.

The data pointed to by the DFTAG_VS tag is stored packed. In this example, each record requires 2+4+5=11 bytes, and the complete table takes 11×4=44 bytes.

HDF Extended Tags

The organizational features of HDF are so powerful that it is usually possible to store a complete data product granule in a single HDF file. There are, however, a couple of problems with very large single-file data products.

The first problem is that HDF files are limited to two gigabytes in size. The second problem is that HDF data elements were not initially designed to be appendable. Normally, data elements are required to be in contiguous storage and they are packed right next to each other. Therefore, to make a data element bigger, it needs to be copied to the end of the file where there is room to grow, leaving a gap in the file.

To get around these limitations, NCSA has defined "extended tags," which allow data element to be spread among multiple locations in the HDF file or even in a completely separate file. A data element corresponding to any NCSA defined tag value can (in principle) be converted to an extended tag, although this functionality is currently only supplied for SDS and Vdata.

An extended tag DD does not point directly to the data, as do normal tags. It instead points to a data object defining where the data is and how it is stored. This data object may point to the beginning of a linked list of data blocks that contain the entire data record. This way, a data record can be lengthened just by adding an additional data block; the entire HDF file does not need to be rewritten.

Alternatively, the extended tag record could define the data element as being stored in an external element in another disk file. This feature (also found in CDF), lets users get around the two-gigabyte total file size limitation. The reference to the external file is explicit: That file must exist exactly as specified by directory and name. This is not compatible with moving files extensively between disparate computers.

DDJ

Listing One

#include "hdf.h"#define WIDTH    5
#define HEIGHT    6
main(int argc, char *argv[])
{
    uint8 palette_data[768];
    intn i;
    /*  Initialize the image array  */
    static uint8 raster_data[HEIGHT][WIDTH] =
                { 1,  2,  3,  4,  5,
                  6,  7,  8,  9, 10,
                 11, 12, 13, 14, 15,
                 16, 17, 18, 19, 20,
                 21, 22, 23, 24, 25,
                 26, 27, 28, 29, 30 };


</p>
    /*  Initialize the palette to standard linear grayscale  */
    for (i=0; i<256; i++) {
    palette_data[i*3] = i;
    palette_data[i*3+1] = i;
    palette_data[i*3+2] = i;
    }
    /*  Associate the palette with the image  */
    DFR8setpalette(palette_data);
    /*  Write the 8-bit raster image to the file  */
    DFR8addimage("example.hdf", raster_data, WIDTH, HEIGHT, 0);
}

Back to Article

Listing Two

#include "hdf.h"#include "mfhdf.h"
#define LENGTH  3
#define HEIGHT  2
#define WIDTH   5


</p>
 main(int argc, char *argv[])
{
    /*  Initialize the image array  */
    static float64 scien_data[LENGTH][HEIGHT][WIDTH] =
                { 1.,  2.,  3.,  4.,  5.,
                  6.,  7.,  8.,  9., 10.,
                 11., 12., 13., 14., 15.,
                 16., 17., 18., 19., 20.,
                 21., 22., 23., 24., 25.,
                 26., 27., 28., 29., 30. };
    int32 dims[3]         = {LENGTH, HEIGHT, WIDTH};
    int16 scale0[LENGTH]  = {2, 4, 6};
    int32 scale1[HEIGHT]  = {1234567, 2345678};
    float32 scale2[WIDTH] = {2.2, 4.4, 6.6, 8.8, 11.0};
    float64 avg           = 15.0;
    int32 start[3]        = {0, 0, 0};
    int32 fid, sdid, dimid0, dimid1, dimid2;
    /*  Open file and initialize SD interface  */
    fid = SDstart("example.hdf", DFACC_CREATE);
    /*  Create named data set  */
    sdid = SDcreate(fid, "Sample Data Set", DFNT_FLOAT64, 3, dims);
    /*  Set up dimension zero  */
    dimid0 = SDgetdimid(sdid, 0);
    SDsetdimname(dimid0, "Dimension 0");
    SDsetdimstrs(dimid0, "The zeroth dimension", "mm", "2d");
    SDsetdimscale(dimid0, LENGTH, DFNT_INT16, (VOIDP)scale0);
    /*  Set up dimension one  */
    dimid1 = SDgetdimid(sdid, 1);
    SDsetdimname(dimid1, "Dimension 1");
    SDsetdimstrs(dimid1, "The first dimension", "cm", "8d");
    SDsetdimscale(dimid1, HEIGHT, DFNT_INT32, (VOIDP)scale1);
    /*  Set up dimension two  */
    dimid2 = SDgetdimid(sdid, 2);
    SDsetdimname(dimid2, "Dimension 2");
    SDsetdimstrs(dimid2, "The second dimension", "m", "4.1f");
    SDsetdimscale(dimid2, WIDTH, DFNT_FLOAT32, (VOIDP)scale2);
    /*  Write the data array to the data set  */
    SDwritedata(sdid, start, NULL, dims, (char *)scien_data);
    /*  Add one local attribute  */
   SDsetattr(sdid, "Average", DFNT_FLOAT64, 1, (char *)&avg);
    /*  Add one global attribute  */
    SDsetattr(fid, "Date", DFNT_CHAR8, 9, "10/29/93");
    /*  Close the data set, file, and interface  */
    SDendaccess(sdid);
    SDend(fid);
}

Back to Article

Listing Three

#include <string.h>#include "hdf.h"
#include "mfhdf.h"


</p>
#define MAXRANK   3
#define LENGTH    3
#define HEIGHT    2
#define WIDTH     5
#define DATESIZE  9


</p>
main(int argc, char *argv[])
{
    float64 scien_data[LENGTH][HEIGHT][WIDTH];
    int32 dims[MAXRANK];
    int16 scale0[LENGTH];
    int32 scale1[HEIGHT];
    float32 scale2[WIDTH];
    float64 avg;
    int32 start[MAXRANK] = {0, 0, 0};
    int32 fid, sdid, dimid;
    int32 i, index, rank, nattrs, ndatasets, nglobals;
    int32 nt, count, status;
    intn size;
    char name[80], date[80];


</p>
    /*  Open file and initialize SD interface  */
    fid = SDstart("example.hdf", DFACC_RDONLY);
    status = SDfileinfo(fid, &ndatasets, &nglobals);


</p>
    /*  Read global attribute  */
    if (nglobals == 1) {
        status = SDattrinfo(fid, 0, name, &nt, &size);
        if ((strcmp(name, "Date")) && (nt == DFNT_CHAR8) &&
            (size == DATESIZE))
            SDreadattr(fid, 0, date);
    }
    /*  Open first data set  */
    index = SDnametoindex(fid, "Sample Data Set");
    sdid = SDselect(fid, index);
    SDgetinfo(sdid, name, &rank, dims, &nt, &nattrs);
    /*  Read in data if everything looks okay  */
    if ((rank == MAXRANK) && (dims[0] == LENGTH) && (dims[1] == HEIGHT)
        && (dims[2] == WIDTH) && (nt == DFNT_FLOAT64))
        SDreaddata(sdid, start, NULL, dims, scien_data);
    /*  Read local attribute  */
    status = SDattrinfo(sdid, 0, name, &nt, &size);
    if ((strcmp(name, "Average")) && (nt == DFNT_FLOAT64) &&
        (size == 1))
        SDreadattr(sdid, 0, &avg);
    /*  Read dimensions  */
    dimid = SDgetdimid(sdid, 0);
    SDdiminfo(dimid, name, &count, &nt, &nattrs);
    if ((nt == DFNT_INT16) && (count == LENGTH))
        SDgetdimscale(dimid, scale0);
    dimid = SDgetdimid(sdid, 1);
    SDdiminfo(dimid, name, &count, &nt, &nattrs);
    if ((nt == DFNT_INT32) && (count == HEIGHT))
        SDgetdimscale(dimid, scale1);
    dimid = SDgetdimid(sdid, 2);
    SDdiminfo(dimid, name, &count, &nt, &nattrs);
    if ((nt == DFNT_FLOAT32) && (count == WIDTH))
        SDgetdimscale(dimid, scale2);
    /*  Close the data set, file, and interface  */
    SDendaccess(sdid);
    SDend(fid);
}

Back to Article


Copyright © 1998, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.