Zarr (data format)

Zarr
Filename extension	.zarr
Latest release	3
Type of format	Multidimensional array
Open format?	Yes
Free format?	Yes
Website	zarr.dev

Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready": data is divided into subsets referred to as chunks, which allows random access to parts of an array.^[1]^[2] Zarr can be used from many programming languages, including Python, Java, JavaScript, C++, Rust and Julia.^[3] It has been used by organizations such as Google and Microsoft to publish large datasets.^[4]^[5] The earliest versions of Zarr were released in 2015 by Alistair Miles.^[6]^[7]

Zarr is designed to support high-throughput distributed I/O on different storage systems, which is a common requirement in cloud computing. Multiple read operations on a Zarr array can proceed efficiently in parallel, as can multiple write operations.^[8]
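The idea behind parallel access can be illustrated with a minimal sketch in plain Python. It does not use any Zarr library: the in-memory `store` dictionary, the `chunk-N` key names and the `write_chunk` and `read_chunk` helpers are all hypothetical stand-ins, chosen only to show that workers touching different chunks never need to coordinate.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a directory or cloud object store:
# one key per chunk, one compressed blob per key.
store = {}

def write_chunk(chunk_id):
    # Each worker compresses and writes only its own chunk under its own key,
    # so no two workers ever modify the same entry.
    payload = bytes([chunk_id % 256]) * 1024
    store[f"chunk-{chunk_id}"] = zlib.compress(payload)

def read_chunk(chunk_id):
    # Reading one chunk requires fetching and decompressing only that key.
    return zlib.decompress(store[f"chunk-{chunk_id}"])

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(write_chunk, range(8)))           # parallel writes
    results = list(pool.map(read_chunk, range(8)))  # parallel reads
```

Because every chunk lives under a separate key, the same pattern could in principle be spread across processes or machines rather than threads.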

Format description

The main data format in Zarr is the multidimensional array. To allow parallel access, each array is stored and accessed as a grid of "chunks". The actual data format on disk depends on the compressor and storage plugins selected by the user.^[8]
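The arithmetic of a regular chunk grid can be sketched as follows. This is an illustration in plain Python rather than code from any Zarr implementation; the function names `chunk_grid_shape` and `locate`, and the example array and chunk sizes, are assumptions made for the example.

```python
import math

def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

def locate(index, chunks):
    """Map an element index to (chunk coordinates, offset inside that chunk)."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunks))
    offset = tuple(i % c for i, c in zip(index, chunks))
    return chunk_coords, offset

shape = (10000, 10000)   # illustrative array shape
chunks = (1000, 1000)    # illustrative chunk shape

grid = chunk_grid_shape(shape, chunks)
coords, offset = locate((2500, 7321), chunks)
print(grid, coords, offset)  # (10, 10) (2, 7) (500, 321)
```

Reading the single element at index (2500, 7321) would therefore need only the chunk at grid position (2, 7), not the other 99 chunks.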

An illustration of Zarr's chunking data format.

Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array.^[8]
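A hierarchy with attached metadata can be pictured as a flat key-value store whose keys look like paths. The sketch below loosely follows the key naming of Zarr version 2 (`.zgroup`, `.zarray` and `.zattrs` JSON documents); version 3 organises metadata differently. The array metadata is abbreviated, and the helper functions, group name and attribute values are hypothetical.

```python
import json

store = {}  # hypothetical flat key-value store (keys resemble file paths)

def create_group(path):
    # A group is marked by a small JSON document stored under its path.
    store[f"{path}/.zgroup"] = json.dumps({"zarr_format": 2})

def create_array(path, shape, chunks, dtype, attrs):
    # Abbreviated array metadata; real metadata carries further fields
    # such as the compressor and fill value.
    store[f"{path}/.zarray"] = json.dumps(
        {"zarr_format": 2, "shape": shape, "chunks": chunks, "dtype": dtype}
    )
    # User-defined key-value annotations stored alongside the array.
    store[f"{path}/.zattrs"] = json.dumps(attrs)

def children(path):
    # Names of the nodes directly below a group, derived from key prefixes.
    prefix = path + "/"
    rest = (k[len(prefix):] for k in store if k.startswith(prefix))
    return sorted({r.split("/")[0] for r in rest if "/" in r})

create_group("climate")
create_array("climate/temperature", [365, 180, 360], [30, 90, 90], "<f4",
             {"units": "K", "long_name": "air temperature"})

print(children("climate"))  # ['temperature']
print(json.loads(store["climate/temperature/.zattrs"])["units"])  # K
```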

Applications

For bioimaging such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions.^[9] Similarly, Zarr is being used to publish weather and satellite data,^[10] and energy data,^[11] among others.