HDF5¶
HDF5 is a technology suite for managing extremely large and complex data collections. It uses a versatile data model to represent very complex data objects and a wide variety of metadata, stored in a portable file format. Interfaces are available for C, C++, Fortran 90, Java, and Python.
HDF5 is available as a module on Apocrita. The package provides a library with APIs suitable for several languages and various command line utilities.
Using HDF5 on Apocrita¶
The HDF5 modules on Apocrita come in two forms: serial and parallel. The available HDF5 modules on Apocrita can be seen by:
module avail hdf5
To load the default version of the serial HDF5 module use:
module load hdf5
and to load the default version of the parallel HDF5 module:
module load hdf5-parallel
The configuration of the HDF5 package provided by a module can be examined using an option to the compilation wrappers. For a serial package:
h5cc -showconfig
For a parallel package:
h5pcc -showconfig
For example, the module hdf5/1.10.2 offers a serial HDF5 package which is built with GCC 4.8.5. It provides C, C++, and Fortran interfaces, but no Java interface.
The module hdf5-parallel/1.8.18 offers a parallel HDF5 package which is built with Intel 17.0.1. It provides C and Fortran interfaces, but no C++ or Java interfaces.
Parallel HDF5 packages rely on a corresponding MPI implementation, so loading such a module may also load a dependency module. For example, the module hdf5-parallel/1.8.18 depends on Intel MPI 17.0.1, and loading it first loads the module intelmpi/17.0.1.
HDF5 data file concepts¶
The following is a summary of the official introduction to HDF5 concepts. HDF5 files are containers for storing a variety of scientific data and are composed of two primary types of objects: groups and datasets. Datatypes describe individual data elements in a dataset. Dataspaces describe the layout of elements within a dataset.
Groups and datasets¶
- HDF5 groups are the structures that contain HDF5 objects, along with any supporting metadata. Every HDF5 file contains a root group that can contain other groups or be linked to objects in other files. Working with groups and group members is similar in many ways to working with directories and files in UNIX. As with UNIX directories and files, objects in an HDF5 file are often described by giving their full (or absolute) path names.
- HDF5 datasets organise and contain the “raw” data values. A dataset consists of metadata that describes the data, in addition to the data itself. Datatypes, dataspaces, properties and (optional) attributes are HDF5 objects that describe a dataset. The datatype describes the individual data elements.
Datatypes, Dataspaces, Properties and Attributes¶
Datatypes describe the individual data elements in a dataset. They provide complete information for data conversion to or from that datatype. Datatypes can be grouped into two categories.
- Pre-defined Datatypes are created by HDF5. They are actually opened (and closed) by HDF5 and can have different values from one HDF5 session to the next. There are two kinds. Standard datatypes are the same on all platforms and are what you see in an HDF5 file. Their names are of the form H5T_ARCH_BASE, where ARCH is an architecture name and BASE is a programming type name. For example, H5T_IEEE_F32BE indicates a standard big-endian floating point type. Native datatypes are used to simplify memory operations (reading, writing) and are not the same on different platforms. For example, H5T_NATIVE_INT indicates a C int.
Table: Examples of HDF5 predefined datatypes
Datatype | Description |
---|---|
H5T_STD_I32LE | Four-byte, little-endian, signed, two's complement integer |
H5T_STD_U16BE | Two-byte, big-endian, unsigned integer |
H5T_IEEE_F32BE | Four-byte, big-endian, IEEE floating point |
H5T_IEEE_F64LE | Eight-byte, little-endian, IEEE floating point |
H5T_C_S1 | One-byte, null-terminated string of eight-bit characters |
- Derived Datatypes are created or derived from the pre-defined datatypes. An example of a commonly used derived datatype is a string of more than one character. Compound datatypes are also derived types. A compound datatype can be used to create a simple table, and can also be nested, in which case it includes one or more other compound datatypes.
Table: Examples of HDF5 native datatypes
Native Datatype | Language | Description |
---|---|---|
H5T_NATIVE_INT | C | int |
H5T_NATIVE_FLOAT | C | float |
H5T_NATIVE_INTEGER | Fortran | integer |
H5T_NATIVE_REAL | Fortran | real |
Dataspaces describe the data elements' layout in a dataset. They can consist of no elements (NULL), a single element (scalar), or be a simple array. Their dimensions can be either fixed (unchanging) or unlimited, which means they can grow in size (that is, they are extendable).
There are two roles of a dataspace:
- It contains the spatial information (logical layout) of a dataset stored in a file. This includes the rank and dimensions of a dataset, which are a permanent part of the dataset definition.
- It describes an application’s data buffers and data elements participating in I/O. In other words, it can be used to select a portion or subset of a dataset.
Properties are characteristics or features of an HDF5 object. There are default properties which handle the most common needs. These default properties can be modified using the HDF5 Property List API to take advantage of more powerful or unusual features of HDF5 objects.
For example, the data storage layout property of a dataset is contiguous by default. For better performance, the layout can be modified to be chunked or chunked and compressed.
Attributes can optionally be associated with HDF5 objects. They have two parts: a name and a value. Attributes are accessed by opening the object that they are attached to, so they are not independent objects. Typically an attribute is small in size and contains user metadata about the object that it is attached to.
Attributes look similar to HDF5 datasets in that they have a datatype and dataspace. However, they do not support partial I/O operations, and they cannot be compressed or extended.
Compiling HDF5 applications¶
Applications using the HDF5 library written in C, C++ or Fortran can be compiled by helper scripts. These scripts are based on the current compilation environment and add flags for finding the HDF5 headers and libraries.
For each language and package type, the table below shows the corresponding compiler wrapper.
Language | C | C++ | Fortran |
---|---|---|---|
Serial HDF5 | h5cc | h5c++ | h5fc |
Parallel HDF5 | h5pcc | n/a | h5pfc |
For example, to compile a C program using a serial HDF5 package, use h5cc instead of, say, icc or gcc. For a Fortran program using parallel HDF5, use h5pfc instead of, say, mpifort.
Programming considerations¶
Applications require language-specific files. For example, the C header file is included with #include "hdf5.h" (programs using the C++ API include H5Cpp.h instead), and the Fortran module is accessed with use hdf5.
Generally, when working with HDF5 objects, we:
- Open an object
- Access the object
- Close the object
The library imposes an order on the operations through argument dependencies. For example, a file must be opened before a dataset because the dataset open call requires a file handle as an argument. Objects can be closed in any order. However, once an object is closed it can no longer be accessed.
Additionally, routine names differ between languages. All C routines in the HDF5 library begin with a prefix of the form H5*, where * is one or two uppercase letters indicating the type of object on which the function operates. Fortran routines are similar; they begin with h5* and end with _f:
- File Interface: H5Fopen in C and h5fopen_f in Fortran
- Dataset Interface: H5Dopen in C and h5dopen_f in Fortran
- Dataspace Interface: H5Sclose in C and h5sclose_f in Fortran
In Fortran, it is mandatory to initialise and finalise the HDF5 Fortran interface by calling h5open_f before any other HDF5 functions, and h5close_f after the last HDF5 function. Omitting these calls may lead to issues that are difficult to debug.
The HDF5 library provides its own defined types for portability. Commonly used types include:
- hid_t, used for object handles
- hsize_t, used for dimensions
- herr_t, used for many return values
HDF5 APIs and Libraries¶
The HDF5 library provides several interfaces, or APIs. These interfaces provide routines for creating, accessing, and manipulating HDF5 files and objects.
There are APIs for each type of object in HDF5. For example:
- H5A refers to the Attribute Interface.
- H5D refers to the Dataset Interface.
- H5F refers to the File Interface.
Higher level libraries¶
The HDF5 High Level libraries simplify many of the steps required to create and access objects, as well as providing templates for storing objects.
- HDF5 Lite (H5LT) – High-level functions that wrap multiple low-level APIs to perform common operations
- HDF5 Images (H5IM) – Creating and manipulating HDF5 datasets intended to be interpreted as images
- HDF5 Tables (H5TB) – Creating and manipulating HDF5 datasets intended to be interpreted as tables
- Packet Tables (H5PT) – Creating and manipulating HDF5 datasets to support append- and read-only operations on table data
- HDF5 Dimension Scales (H5DS) – Creating and manipulating HDF5 datasets that are associated with the dimension of another HDF5 dataset
- HDF5 Optimizations (H5DO) – Bypassing default HDF5 behaviour in order to optimise for specific use cases
- HDF5 Extensions (H5LR, H5LT) – Working with region references, hyperslab selections, and bit-fields
HDF5 Lite¶
The HDF5 Lite API consists of higher-level functions which do more operations per call than the basic HDF5 interface. The purpose is to wrap intuitive functions around certain sets of features in the existing APIs. This version of the API has two sets of functions: dataset and attribute related functions.
To use any of the functions or subroutines present in HDF5 Lite, you must first include the relevant header file or module in your program. In C programs, the following line enables the use of H5LT, the HDF5 Lite package:
#include "hdf5_hl.h"
The H5LT module is available in Fortran programs through:
use h5lt
Using the HDF5 tools/utilities¶
HDF5 provides many tools. Two commonly used ones are:
- h5dump
- h5diff
A quick overview of the h5dump and h5diff command-line tools is given below.
The contents of an HDF5 file can be examined with the command-line utility h5dump. The utility dumps the contents of the file to standard output (the terminal) in human-readable form. Documentation about the utility is available on the web, or by loading the module and running h5dump --help.
Useful options for h5dump include -H, which prints only the header (without any data), and -p or --properties, which prints information regarding dataset properties, filters, storage layout, fill value, and allocation time.
h5dump [OPTIONS] file.h5
Two HDF5 files can be compared by using the h5diff tool, which reports the differences between them. Optionally, the tool can compare two objects within these files. If only one object is specified, the tool compares that object across the two files. In parallel environments, the parallel ph5diff tool can be used instead.
h5diff [OPTIONS] file1.h5 file2.h5 [object1 [object2] ]
ph5diff [OPTIONS] file1.h5 file2.h5 [object1 [object2] ]
Useful options for h5diff include -v or --verbose and -v1 or --verbose=1, which print difference information, a list of objects, and warnings. Additionally, -v1 includes a one-line attribute status summary. Documentation is available on the web, or by running h5diff --help (or ph5diff --help, if applicable). The exit status also indicates whether differences were found (1) or not (0), or whether an error occurred (>1).
HDFView, which is not installed on Apocrita, is a visual tool for browsing and editing HDF5 (and HDF4) files. Users can install the tool on local machines to assist with code development and data analysis.
Examples of HDF5 on Apocrita¶
After loading an HDF5 module on Apocrita, the next task is to create a new empty file. Following the instructions, we created a source file named fileexample.f90. To compile it we use the HDF5 wrappers: h5fc for Fortran, or the C and C++ equivalents h5cc and h5c++.
h5fc fileexample.f90
Running ./a.out will create an empty file named filef.h5 (file.h5 for C).
The file contents can be examined by using the HDF5 dumper tool, h5dump.
$ h5dump filef.h5
HDF5 "filef.h5" {
GROUP "/" {
}
}
The file definition shown above is written in a simplified form of the data description language (DDL). The complete and rigorous version can be found in DDL in BNF for HDF5, a section of the HDF5 user guide.
To create a simple dataset, the program must specify the location at which to create the dataset, the dataset name, the datatype and dataspace of the data array, and the property lists.
Compiling and running the following examples on Apocrita, for C, C++, and Fortran, creates three files: dset.h5, h5tutr_dset.h5, and dsetf.h5 respectively.
To examine the files we run h5dump <filename>, and the resulting output is:
$ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
DATA {
(0,0): 0, 0, 0, 0, 0, 0,
(1,0): 0, 0, 0, 0, 0, 0,
(2,0): 0, 0, 0, 0, 0, 0,
(3,0): 0, 0, 0, 0, 0, 0
}
}
}
}
for C (h5dump h5tutr_dset.h5 for C++), and
HDF5 "dsetf.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 6, 4 ) / ( 6, 4 ) }
DATA {
(0,0): 0, 0, 0, 0,
(1,0): 0, 0, 0, 0,
(2,0): 0, 0, 0, 0,
(3,0): 0, 0, 0, 0,
(4,0): 0, 0, 0, 0,
(5,0): 0, 0, 0, 0
}
}
}
}
for the Fortran version. Notice how the datatypes and the order of the dimensions differ between the Fortran version and the C/C++ versions. However, as long as the data structures are the same, the created files will be interchangeable between C and Fortran. To see how h5diff identifies these differences we will use the --verbose=1 option:
$ h5diff -v1 dsetf.h5 dset.h5
file1 file2
---------------------------------------
x x /
x x /dset
group : </> and </>
0 differences found
Attributes status: 0 common, 0 only in obj1, 0 only in obj2
dataset: </dset> and </dset>
Not comparable: </dset> or </dset> is an empty dataset
Warning: different storage datatype
</dset> has file datatype H5T_STD_I32LE
</dset> has file datatype H5T_STD_I32BE
Not comparable: </dset> has rank 2, dimensions [6x4], max dimensions [6x4]
and </dset> has rank 2, dimensions [4x6], max dimensions [4x6]
0 differences found
Attributes status: 0 common, 0 only in obj1, 0 only in obj2
--------------------------------
Some objects are not comparable
--------------------------------
Use -c for a list of objects without details of differences.
Other HDF5 language bindings¶
Mathematica, Java, Julia, Matlab, Python and R are available on Apocrita. These languages offer bindings for HDF5 but these are not provided by the HDF5 modules described above.
In Python, for example, HDF5 files can be accessed through the h5py package, which stores and retrieves data as NumPy arrays, by using import h5py.