
HDF5

HDF5 is a technology suite that enables the management of extremely large and complex data collections. It uses a versatile data model that can represent very complex data objects and a wide variety of metadata, stored in a portable file format, and it provides interfaces for C, C++, Fortran 90, Java, and Python.

HDF5 is available as a module on Apocrita. The package provides a library with APIs suitable for several languages and various command line utilities.

Using HDF5 on Apocrita

The HDF5 modules on Apocrita come in two forms: serial and parallel. The available HDF5 modules on Apocrita can be seen by:

module avail hdf5

To load the default version of the serial HDF5 module use:

module load hdf5

and to load the default version of the parallel HDF5 module:

module load hdf5-parallel

The configuration of the HDF5 package provided by a module can be examined using an option to the compilation wrappers:

  • h5cc -showconfig (for a serial package)
  • h5pcc -showconfig (for a parallel package)

For example, the module hdf5/1.10.2 offers a serial HDF5 package which is built with GCC 4.8.5. It provides C, C++, and Fortran interfaces, but no Java interface.

The module hdf5-parallel/1.8.18 offers a parallel HDF5 package which is built with Intel 17.0.1. It provides C and Fortran interfaces but not C++ and Java interfaces.

Parallel HDF5 packages rely on a corresponding MPI implementation, so a parallel HDF5 module may load a dependency module. For example, the module hdf5-parallel/1.8.18 depends on Intel MPI 17.0.1, and loading it first loads the module intelmpi/17.0.1.

HDF5 data file concepts

The following is a summary of the official introduction to HDF5 concepts. HDF5 files are containers for storing a variety of scientific data and are composed of two primary types of objects: groups and datasets. Datatypes describe individual data elements in a dataset. Dataspaces describe the layout of elements within a dataset.

Groups and datasets

  • HDF5 groups are the structures that contain HDF5 objects, along with any supporting metadata. Every HDF5 file contains a root group that can contain other groups or be linked to objects in other files. Working with groups and group members is similar in many ways to working with directories and files in UNIX. As with UNIX directories and files, objects in an HDF5 file are often described by giving their full (or absolute) path names.

  • HDF5 datasets organise and contain the “raw” data values. A dataset consists of metadata that describes the data, in addition to the data itself. Datatypes, dataspaces, properties and (optional) attributes are HDF5 objects that describe a dataset. The datatype describes the individual data elements.

Datatypes, Dataspaces, Properties and Attributes

Datatypes describe the individual data elements in a dataset. They provide complete information for data conversion to or from that datatype. Datatypes can be grouped into two categories.

  • Pre-defined Datatypes are created by HDF5. They are actually opened (and closed) by HDF5 and can have different values from one HDF5 session to the next. Standard datatypes are the same on all platforms and are what you see in an HDF5 file. Their names are of the form H5T_ARCH_BASE where ARCH is an architecture name and BASE is a programming type name. For example, H5T_IEEE_F32BE indicates a standard Big Endian floating point type. Native datatypes are used to simplify memory operations (reading, writing) and are NOT the same on different platforms. For example, H5T_NATIVE_INT indicates a C int.

Table: Examples of HDF5 predefined datatypes

Datatype        Description
H5T_STD_I32LE   Four-byte, little-endian, signed, two's complement integer
H5T_STD_U16BE   Two-byte, big-endian, unsigned integer
H5T_IEEE_F32BE  Four-byte, big-endian, IEEE floating point
H5T_IEEE_F64LE  Eight-byte, little-endian, IEEE floating point
H5T_C_S1        One-byte, null-terminated string of eight-bit characters

  • Derived Datatypes are created or derived from the pre-defined datatypes. An example of a commonly used derived datatype is a string of more than one character. Compound datatypes are also derived types. A compound datatype can be used to create a simple table, and can also be nested, in which case it includes one or more other compound datatypes.

Table: Examples of HDF5 native datatypes

Native Datatype     Language  Description
H5T_NATIVE_INT      C         int
H5T_NATIVE_FLOAT    C         float
H5T_NATIVE_INTEGER  Fortran   integer
H5T_NATIVE_REAL     Fortran   real

Dataspaces describe the data elements' layout in a dataset. They can consist of no elements (NULL), a single element (scalar), or be a simple array. Their dimensions can be either fixed (unchanging) or unlimited, which means they can grow in size (that is, they are extendable).

There are two roles of a dataspace:

  • It contains the spatial information (logical layout) of a dataset stored in a file. This includes the rank and dimensions of a dataset, which are a permanent part of the dataset definition.

  • It describes an application’s data buffers and data elements participating in I/O. In other words, it can be used to select a portion or subset of a dataset.

Properties are characteristics or features of an HDF5 object. There are default properties which handle the most common needs. These default properties can be modified using the HDF5 Property List API to take advantage of more powerful or unusual features of HDF5 objects.

For example, the data storage layout property of a dataset is contiguous by default. For better performance, the layout can be modified to be chunked or chunked and compressed.

Attributes can optionally be associated with HDF5 objects. They have two parts: a name and a value. Attributes are accessed by opening the object that they are attached to, so they are not independent objects. Typically an attribute is small in size and contains user metadata about the object to which it is attached.

Attributes look similar to HDF5 datasets in that they have a datatype and dataspace. However, they do not support partial I/O operations, and they cannot be compressed or extended.

Compiling HDF5 applications

Applications using the HDF5 library written in C, C++ or Fortran can be compiled using helper scripts. These scripts use the current compilation environment and add flags for finding the HDF5 headers and libraries.

For each language and package type, the table below shows the corresponding compiler wrapper.

Language        C      C++    Fortran
Serial HDF5     h5cc   h5c++  h5fc
Parallel HDF5   h5pcc  n/a    h5pfc

For example, to compile a C program using a serial HDF5 package, use h5cc instead of, say, icc or gcc. For a Fortran program using parallel HDF5, use h5pfc instead of, say, mpifort.

Programming considerations

Applications require language-specific files. For example, the C and C++ header file is accessed by #include "hdf5.h", and the Fortran module by use hdf5.

Generally, when working with HDF5 objects, we:

  • Open an object
  • Access the object
  • Close the object

The library imposes an order on the operations through argument dependencies. For example, a file must be opened before a dataset because the dataset open call requires a file handle as an argument. Objects can be closed in any order, but once an object is closed it can no longer be accessed.

Additionally, the routine names differ between languages. For example, all C routines in the HDF5 library begin with a prefix of the form H5*, where * is one or two uppercase letters indicating the type of object on which the function operates. Fortran routines are similar; they begin with h5* and end with _f:

  • File Interface: H5Fopen in C and h5fopen_f in Fortran
  • Dataset Interface: H5Dopen in C and h5dopen_f in Fortran
  • Dataspace interface: H5Sclose in C and h5sclose_f in Fortran

In Fortran, it is mandatory to initialise and finalise the HDF5 Fortran interface by calling h5open_f before any other HDF5 routine, and h5close_f after the last one. Omitting these calls may lead to difficult-to-debug issues.

The HDF5 library provides its own defined types for portability. Common types include:

  • hid_t, used for object handles
  • hsize_t, used for dimensions
  • herr_t, used for many return values

HDF5 APIs and Libraries

The HDF5 library provides several interfaces, or APIs, with routines for creating, accessing, and manipulating HDF5 files and objects.

There are APIs for each type of object in HDF5.

  • H5A refers to the Attribute Interface.
  • H5D refers to the Dataset Interface.
  • H5F refers to the File Interface.

Higher level libraries

The HDF5 High Level libraries simplify many of the steps required to create and access objects, as well as providing templates for storing objects.

  • HDF5 Lite (H5LT) – High-level functions that wrap multiple low-level APIs to perform common operations
  • HDF5 Images (H5IM) – Creating and manipulating HDF5 datasets intended to be interpreted as images
  • HDF5 Tables (H5TB) – Creating and manipulating HDF5 datasets intended to be interpreted as tables
  • Packet Tables (H5PT) – Creating and manipulating HDF5 datasets to support append- and read-only operations on table data
  • HDF5 Dimension Scales (H5DS) – Creating and manipulating HDF5 datasets that are associated with the dimension of another HDF5 dataset
  • HDF5 Optimizations (H5DO) – Bypassing default HDF5 behaviour in order to optimize for specific use cases
  • HDF5 Extensions (H5LR, H5LT) – Working with region references, hyperslab selections, and bit-fields

HDF5 Lite

The HDF5 Lite API consists of higher-level functions which do more operations per call than the basic HDF5 interface. The purpose is to wrap intuitive functions around certain sets of features in the existing APIs. This version of the API has two sets of functions: dataset and attribute related functions.

To use any of the functions or subroutines present in the HDF5 Lite, you must first include the relevant header file or module in your program. In C programs, the following line enables the use of H5LT, the HDF5 Lite package:

#include "hdf5_hl.h"

The H5LT module is available in Fortran programs through:

use h5lt

Using the HDF5 tools/utilities

The HDF5 suite includes many command-line tools. Two of the most useful are:

  • h5dump
  • h5diff

A quick overview of the h5dump and h5diff command-line tools is given below.

The contents of an HDF5 file can be examined with the command-line utility h5dump, which prints the contents of the file to standard output (the terminal) in human-readable form. Documentation about the utility is available on the web, or by loading the module and running h5dump --help.

Useful options for h5dump include -H, which prints only the header (without any data), and -p or --properties, which prints information about dataset properties: filters, storage layout, fill value, and allocation time.

h5dump [OPTIONS] file.h5

Two different HDF5 files can be compared using the h5diff tool, which reports the differences between them. Optionally, the tool can compare two objects within these files. If only one object is specified, the tool compares that object across the two files. In parallel environments, use the parallel ph5diff tool instead.

h5diff [OPTIONS] file1.h5 file2.h5 [object1 [object2] ]
ph5diff [OPTIONS] file1.h5 file2.h5 [object1 [object2] ]

Useful options for h5diff include -v or --verbose, which prints difference information, a list of objects and warnings, and -v1 or --verbose=1, which additionally includes a one-line attribute status summary. Documentation is available on the web, or by running h5diff --help (or ph5diff if applicable). The exit status also indicates whether differences were found (1), no differences were found (0), or errors occurred (>1).

HDFView, a visual tool for browsing and editing HDF5 (and HDF4) files, is not installed on Apocrita, but users can install it on their local machines to assist with code development and data analysis.

Examples of HDF5 on Apocrita

After loading an HDF5 module on Apocrita, the next task is to create a new, empty file. Following the instructions, we created a file named fileexample.f90. To compile it we use the HDF5 wrappers: h5fc for Fortran, or the C/C++ equivalents h5cc and h5c++.

h5fc fileexample.f90

Running ./a.out will create an empty file named filef.h5 (file.h5 for C). The file contents can be examined by using the HDF5 dumper tool, h5dump.

$ h5dump filef.h5

HDF5 "filef.h5" {
GROUP "/" {
}
}

This output is expressed in a simplified form of the Data Description Language (DDL); the complete and rigorous definition can be found in the DDL in BNF for HDF5 section of the HDF5 User's Guide.

To create a simple dataset, the program must specify the location at which to create the dataset, the dataset name, the datatype and dataspace of the data array, and the property lists.

Compiling and running the following examples on Apocrita, for C, C++, and Fortran, creates three files: dset.h5, h5tutr_dset.h5, and dsetf.h5 respectively.

To examine the files we run $ h5dump <filename>, and the resulting output is:

$ h5dump dset.h5

HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 0, 0, 0, 0, 0, 0,
      (1,0): 0, 0, 0, 0, 0, 0,
      (2,0): 0, 0, 0, 0, 0, 0,
      (3,0): 0, 0, 0, 0, 0, 0
      }
   }
}
}

for C (h5dump h5tutr_dset.h5 for C++), and

HDF5 "dsetf.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 6, 4 ) / ( 6, 4 ) }
      DATA {
      (0,0): 0, 0, 0, 0,
      (1,0): 0, 0, 0, 0,
      (2,0): 0, 0, 0, 0,
      (3,0): 0, 0, 0, 0,
      (4,0): 0, 0, 0, 0,
      (5,0): 0, 0, 0, 0
      }
   }
}
}

for the Fortran version. Notice how the datatypes differ between the Fortran version and the C/C++ versions, and how the dimensions appear transposed: Fortran stores arrays in column-major order while C uses row-major order. However, as long as the data structures are the same, the created files are interchangeable between C and Fortran. To see how h5diff identifies these differences we will use the --verbose=1 option:

$ h5diff -v1 dsetf.h5 dset.h5

file1     file2
---------------------------------------
    x      x    /
    x      x    /dset

group  : </> and </>
0 differences found
Attributes status:  0 common, 0 only in obj1, 0 only in obj2

dataset: </dset> and </dset>
Not comparable: </dset> or </dset> is an empty dataset
Warning: different storage datatype
</dset> has file datatype H5T_STD_I32LE
</dset> has file datatype H5T_STD_I32BE
Not comparable: </dset> has rank 2, dimensions [6x4], max dimensions [6x4]
and </dset> has rank 2, dimensions [4x6], max dimensions [4x6]
0 differences found
Attributes status:  0 common, 0 only in obj1, 0 only in obj2
--------------------------------
Some objects are not comparable
--------------------------------
Use -c for a list of objects without details of differences.

Other HDF5 language bindings

Mathematica, Java, Julia, Matlab, Python and R are available on Apocrita. These languages offer bindings for HDF5, but these are not provided by the HDF5 modules described above; consult the documentation for each language for instructions on using its HDF5 interface.

In Python, for example, HDF5 files can be read and written with the h5py package, which presents datasets as NumPy arrays and is loaded by import h5py.

Documentation

HDF5 C/Fortran Reference Manual

Source code download

HDF5 Documentation

Introduction to Parallel HDF5

HDFView

References

The HDF Group