Wrapping vs Native Implementation for Cross-Language Interoperability

Case study of the rhdf5 and Rarr Bioconductor packages

Hugo Gruson

February 25, 2026

Intro to formats

Zarr and HDF5 are competing formats for storing multi-dimensional arrays (e.g., images, cell-by-gene matrices) on disk
Particularly important for large datasets that don’t fit in memory
Basis of interoperability between languages (e.g., R, Python, Julia)

flowchart LR
    A["Step 1 (Python)"] -->|Pass data as Zarr or HDF5| B["Step 2 (R)"]

Zarr vs HDF5: technical

Zarr is “cloud-native”, and easily parallelizable.

Both Zarr and HDF5 allow reading individual chunks of an array instead of the whole array.

But if the file is stored remotely, an HDF5 file generally needs to be downloaded in full first, while Zarr can fetch only the relevant chunks.

/tmp/RtmpB5vgwq/file21d16b32b552.zarr
├── c
│   ├── 0
│   │   ├── 0
│   │   ├── 1
│   │   └── 2
│   ├── 1
│   │   ├── 0
│   │   ├── 1
│   │   └── 2
│   └── 2
│       ├── 0
│       ├── 1
│       └── 2
└── zarr.json
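
The layout above hints at how a reader finds the data it needs: each chunk is a separate object whose name can be derived from the element indices alone. A minimal sketch, assuming the Zarr v3 default chunk key encoding with `/` as separator (as in the tree above); the helper name `chunk_key` and the chunk shape used in the note below are hypothetical, not taken from Rarr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of Rarr or any Zarr library): writes the
 * key of the chunk holding the zero-based element `index` into `out`,
 * e.g. "c/1/2". Assumes the Zarr v3 default chunk key encoding with "/"
 * as separator. Returns the key length, or -1 if `out` is too small. */
int chunk_key(const size_t *index, const size_t *chunk_shape, size_t ndim,
              char *out, size_t out_size)
{
    assert(ndim > 0 && out_size > 0);

    /* Every key starts with the "c" prefix */
    int n = snprintf(out, out_size, "c");
    if (n < 0 || (size_t)n >= out_size) return -1;
    size_t used = (size_t)n;

    for (size_t d = 0; d < ndim; d++) {
        assert(chunk_shape[d] > 0);
        /* Integer division gives the chunk coordinate along dimension d */
        size_t chunk_coord = index[d] / chunk_shape[d];
        n = snprintf(out + used, out_size - used, "/%zu", chunk_coord);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

With an assumed chunk shape of 10 × 10, the zero-based element (15, 25) resolves to `c/1/2`, so only that one object would have to be fetched from remote storage.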

Zarr vs HDF5: governance

Zarr is community-driven:

much younger than HDF5
less normative(?) (at least for now): e.g., sparse arrays
relies more on “subformats”(?): e.g., OME-Zarr

Definitions

Wrapping: using a high-level language (e.g., R) to call functions from an underlying library written in another language (e.g., C, C++, or Python) that implements the spec.
Vendoring: including a copy of the underlying library in the package source code, and wrapping it in the high-level language of choice.
Native implementation: implementing the spec from scratch, without including any third-party code in the package source code.

rhdf5 overview

Rarr overview

Vendoring: rhdf5 case

Wrapping and vendoring the HDF5 C library.

Lots of C code, but also lots of thin wrappers handling R memory management and R/C data type conversions.

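SEXP _H5Freopen( SEXP _file_id ) {
  /* Convert the R character vector into an HDF5 identifier */
  hid_t file_id = STRSXP_2_HID( _file_id );
  /* Call the vendored HDF5 C library */
  hid_t hid = H5Freopen( file_id );
  addHandle(file_id);

  /* Convert the new identifier back to an R value, protected from the
     garbage collector while it is being built */
  SEXP Rval;
  PROTECT(Rval = HID_2_STRSXP(hid));
  UNPROTECT(1);
  return Rval;
}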

Native implementation: Rarr case

Native implementation of the Zarr spec in R.

Most preparation steps and housekeeping are done in R. Only performance-critical steps are written in C.

These steps should eventually also run in parallel or on a GPU.
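
To give a feel for what a performance-critical step might look like, here is a minimal sketch of copying one decoded chunk into the output matrix. This is an assumption chosen for illustration, not Rarr's actual code: the function name `place_chunk`, the 2-D restriction, the `double` element type, and the column-major chunk layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not Rarr's actual code: copy a decoded 2-D chunk
 * (column-major, chunk_rows x chunk_cols) into a column-major output
 * matrix of out_rows x out_cols elements, with the chunk's top-left
 * corner placed at (row0, col0). Parts of the chunk that fall beyond
 * the edge of the output matrix are skipped. */
void place_chunk(const double *chunk, size_t chunk_rows, size_t chunk_cols,
                 double *out, size_t out_rows, size_t out_cols,
                 size_t row0, size_t col0)
{
    assert(row0 < out_rows && col0 < out_cols);

    /* Clip the chunk to the part that fits inside the output matrix */
    size_t nrow = chunk_rows < out_rows - row0 ? chunk_rows : out_rows - row0;
    size_t ncol = chunk_cols < out_cols - col0 ? chunk_cols : out_cols - col0;

    /* Columns are contiguous in column-major order: one memcpy per column */
    for (size_t j = 0; j < ncol; j++) {
        memcpy(out + (col0 + j) * out_rows + row0,
               chunk + j * chunk_rows,
               nrow * sizeof(double));
    }
}
```

Each chunk writes to a disjoint region of the output, which is why a loop over chunks like this is a natural candidate for parallel execution.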

Aside / caveat

The Zarr specification is still “Python-biased” and includes many “NumPy-isms”.

This has improved since version 3.

Some compression libraries are bundled (vendored) in Rarr.