A to Zarr
Davis Bennett
Zarr Summit, 2025
Outline
Goals
About me
Where I am now
Previously
Zarr vs File formats
We often compare Zarr to file formats (tif, hdf5, …)
But Zarr is more like a protocol or API than a conventional file format.
Zarr vs File formats
What is a file format?
A file format assigns meaning to sequences of bytes in a file.
| byte range  | meaning |
|-------------|---------|
| 0 → n       | header  |
| n + 1 → end | body    |
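As a rough illustration of the table above, here is a minimal sketch of a toy file format. The fixed 8-byte header size and the function names are assumptions made for this example; they do not describe any real format.

```python
# Hypothetical toy format: the first HEADER_SIZE bytes are the header,
# everything after that is the body.
HEADER_SIZE = 8  # assumed fixed header length for this sketch


def split_file(blob: bytes) -> tuple[bytes, bytes]:
    """Assign meaning to byte ranges: header first, then body."""
    header = blob[:HEADER_SIZE]
    body = blob[HEADER_SIZE:]
    return header, body


blob = b"TOYFMT01" + bytes([1, 2, 3, 4])
header, body = split_file(blob)
```

The meaning of each byte depends only on its position in the file, which is the property the following slides contrast with Zarr.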
Zarr vs File formats
Zarr assigns meaning to keys in a storage backend.
| key         | meaning                 |
|-------------|-------------------------|
| "zarr.json" | header (array metadata) |
| "c/0"       | body (chunk data)       |
The way data is actually stored is an implementation detail!
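To make the key-based view concrete, here is a minimal sketch that uses a plain Python dict as the storage backend. The metadata document is abbreviated for the example and would not be a complete Zarr array; the `describe` helper is hypothetical.

```python
import json

# A dict stands in for any key/value storage backend.
store: dict[str, bytes] = {}

# Abbreviated metadata, for illustration only (not a complete array document).
metadata = {"zarr_format": 3, "node_type": "array", "shape": [4]}
store["zarr.json"] = json.dumps(metadata).encode("utf-8")
store["c/0"] = bytes([1, 2, 3, 4])


def describe(key: str) -> str:
    """Hypothetical helper: the meaning of a value comes from its key."""
    if key == "zarr.json" or key.endswith("/zarr.json"):
        return "metadata"
    return "chunk data"


roles = {key: describe(key) for key in store}
```

Nothing here touches a file system: the same keys could be backed by files, object storage, or a database.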
Zarr vs File formats
File formats define the state of a file or stored object.
Zarr defines the behavior of a key / value storage backend.
Zarr vs File formats
# normal
meta = create_array_metadata(...)
store = LocalStore("foo")
write_metadata(meta, store) # writes "zarr.json"
# weird, but not wrong!
store = ReversedLocalStore("foo") # reverses file names
write_metadata(meta, store) # writes "rraz.json"
Zarr does not strictly define the stored representation of data.
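The pseudocode above can be turned into a small runnable sketch. `DictStore`, `ReversedDictStore`, `write_metadata`, and `read_metadata` are hypothetical names invented for this example, with a dict standing in for the local file system; they are not the API of any Zarr library.

```python
import json
import posixpath


def reverse_name(key: str) -> str:
    """Reverse the stem of the last path segment: 'zarr.json' -> 'rraz.json'."""
    head, name = posixpath.split(key)
    stem, ext = posixpath.splitext(name)
    return posixpath.join(head, stem[::-1] + ext)


class DictStore:
    """Hypothetical store: keys are used exactly as given."""

    def __init__(self) -> None:
        self.data: dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value

    def get(self, key: str) -> bytes:
        return self.data[key]


class ReversedDictStore(DictStore):
    """Hypothetical store: same interface, but names are reversed at rest."""

    def set(self, key: str, value: bytes) -> None:
        self.data[reverse_name(key)] = value

    def get(self, key: str) -> bytes:
        return self.data[reverse_name(key)]


def write_metadata(meta: dict, store: DictStore) -> None:
    store.set("zarr.json", json.dumps(meta).encode("utf-8"))


def read_metadata(store: DictStore) -> dict:
    return json.loads(store.get("zarr.json"))


meta = {"zarr_format": 3, "node_type": "group"}
normal, weird = DictStore(), ReversedDictStore()
write_metadata(meta, normal)
write_metadata(meta, weird)
```

Both stores answer a request for "zarr.json" with the same document, even though the stored names differ. The behavior at the key/value interface is the same, which is the point of the slide.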
Zarr vs File formats
File formats usually have a well-defined entry point.
Zarr forms trees that can be accessed at any level.
''                        <-------- We can enter at the root
|-- zarr.json
|-- sub-group             <-------- We can also enter sub-group
|   |-- zarr.json
|   |-- sub-sub-group     <-------- Or sub-sub-group
|       |-- zarr.json
|-- array                 <-------- Or the array
    |-- zarr.json
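The tree above can be sketched with a flat dict of keys. The `open_node` helper is hypothetical and the metadata documents are abbreviated. Opening a node only requires the path to that node, not a traversal from the root.

```python
import json

group = json.dumps({"zarr_format": 3, "node_type": "group"}).encode("utf-8")
# Abbreviated array metadata, for illustration only.
array = json.dumps({"zarr_format": 3, "node_type": "array"}).encode("utf-8")

store = {
    "zarr.json": group,
    "sub-group/zarr.json": group,
    "sub-group/sub-sub-group/zarr.json": group,
    "array/zarr.json": array,
}


def open_node(store: dict[str, bytes], path: str) -> dict:
    """Hypothetical helper: read the metadata document stored under `path`."""
    key = f"{path}/zarr.json" if path else "zarr.json"
    return json.loads(store[key])


root = open_node(store, "")
nested = open_node(store, "sub-group/sub-sub-group")
leaf = open_node(store, "array")
```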
Zarr vs File formats
Maybe Zarr is a "files format"
...but "data format", "protocol", API, or just "format" are probably better terms 😅
Aspirationally, with the right storage logic, Zarr should be able to wrap most array file formats.
How Zarr works
Zarr V2
Zarr V3
How Zarr V2 Works
Two elements
Zarr V2 Group metadata
// .zgroup
{
"zarr_format": 2
}
// .zattrs
{
"my_group_attributes": 2
}
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The shape of the array
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The data type of the array
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The memory order of the array (C or F)
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The shape of the chunks of the array
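As a small worked example, the number of chunks along each axis follows from `shape` and `chunks` by ceiling division. The function name is an assumption made for this sketch.

```python
def chunk_grid_shape(shape: tuple[int, ...], chunks: tuple[int, ...]) -> tuple[int, ...]:
    """Number of chunks along each axis (ceiling division, integers only)."""
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


grid = chunk_grid_shape((10, 10), (5, 5))
```

With the metadata above, a 10 x 10 array with 5 x 5 chunks is covered by a 2 x 2 grid. When the chunk shape does not evenly divide the array shape, the grid rounds up and the edge chunks extend past the array bounds.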
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The delimiter string used for the names of the chunks:
“0.0” vs “0/0”
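A minimal sketch of how a V2 chunk key could be built from chunk grid coordinates and the dimension separator. The function name is hypothetical.

```python
def chunk_key_v2(chunk_coords: tuple[int, ...], dimension_separator: str = ".") -> str:
    """Join chunk grid coordinates with the separator declared in .zarray."""
    return dimension_separator.join(str(i) for i in chunk_coords)


dotted = chunk_key_v2((0, 0))
nested = chunk_key_v2((0, 0), dimension_separator="/")
```

On a file system, the "/" separator turns each leading coordinate into a directory, while "." keeps all chunks of an array in a single flat directory.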
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The value assigned to missing chunks
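A minimal sketch of the fill value behavior, assuming a dict as the store and, for simplicity, chunks stored as raw int8 bytes with no filters or compressor. `read_chunk` is a hypothetical helper.

```python
from array import array


def read_chunk(store: dict[str, bytes], key: str, chunk_size: int, fill_value: int) -> list[int]:
    """Return the chunk's values, or the fill value if the key is absent."""
    raw = store.get(key)
    if raw is None:
        # Missing chunk: nothing is stored, so every element is the fill value.
        return [fill_value] * chunk_size
    # Assumption for this sketch: raw, uncompressed int8 ("|i1") data.
    return array("b", raw).tolist()


store = {"0/0": bytes(range(25))}
present = read_chunk(store, "0/0", chunk_size=25, fill_value=0)
missing = read_chunk(store, "1/1", chunk_size=25, fill_value=0)
```

A chunk that is never written costs no storage, which is how sparse arrays stay small.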
Zarr V2 Array Metadata
// .zarray
{
"zarr_format": 2,
"shape": [10, 10],
"dtype": "|i1",
"order": "C",
"chunks": [5, 5],
"dimension_separator": "/",
"fill_value": 0,
"filters": [{"id": "delta", "astype": "|i1"}],
"compressor": {"id": "gzip", "level": 1}
}
The functions used to encode and decode chunks
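The encode and decode path can be sketched with the standard library alone. This is a toy stand-in for the "delta" filter and "gzip" compressor named in the metadata, not the implementation used by any Zarr library, and it assumes every value and every difference fits in an int8.

```python
import gzip
from array import array


def delta_encode(values: list[int]) -> list[int]:
    """Keep the first value, then store differences between neighbors."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: list[int]) -> list[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def encode_chunk(values: list[int]) -> bytes:
    # Writing: apply the filter first, then the compressor.
    raw = array("b", delta_encode(values)).tobytes()
    return gzip.compress(raw, compresslevel=1)


def decode_chunk(data: bytes) -> list[int]:
    # Reading: undo the compressor first, then the filter.
    raw = gzip.decompress(data)
    return delta_decode(array("b", raw).tolist())


chunk = [1, 2, 3, 5, 8, 13]
roundtrip = decode_chunk(encode_chunk(chunk))
```

Decoding applies the same steps in reverse order, so the order of "filters" and "compressor" in the metadata matters.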
// .zattrs
{
"my_group_attributes": 2
}
How Zarr V3 works
Two elements
Zarr V3 Group Metadata
// zarr.json
{
"zarr_format": 3,
"node_type": "group",
"attributes": {
"spam": "ham",
"eggs": 42
}
}
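Comparing this document with the V2 group shown earlier: V3 folds what V2 split across `.zgroup` and `.zattrs` into a single `zarr.json`. A minimal sketch of that mapping follows; the function name is hypothetical.

```python
def v2_group_to_v3(zgroup: dict, zattrs: dict) -> dict:
    """Combine V2 .zgroup and .zattrs content into one V3 group document."""
    if zgroup.get("zarr_format") != 2:
        raise ValueError("expected a Zarr V2 group document")
    return {
        "zarr_format": 3,
        "node_type": "group",
        "attributes": dict(zattrs),
    }


v3_doc = v2_group_to_v3({"zarr_format": 2}, {"my_group_attributes": 2})
```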
Evolution of Zarr V3 Group Metadata
Zarr V3 Array Metadata
"shape": [10000, 1000]
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The shape of the array
Zarr V3 Array Metadata
"data_type": "float64"
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The data type of the array
Zarr V3 Array Metadata
"dimension_names": ["x", "y"]
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The names of the axes of the array
Zarr V3 Array Metadata
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
}
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The shape of each chunk in the array
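With a regular chunk grid, locating an element is integer arithmetic. A minimal sketch, with a hypothetical function name:

```python
def locate(index: tuple[int, ...], chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Map an array index to (chunk grid coordinates, offset within that chunk)."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk_coords, offset


coords, offset = locate((2500, 250), (1000, 100))
```

For the metadata above, element (2500, 250) lives in chunk (2, 2) at offset (500, 50).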
Zarr V3 Array Metadata
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
}
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The name of each chunk in the array
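A minimal sketch of the "default" chunk key encoding shown above, which prefixes the chunk grid coordinates with "c" and joins them with the configured separator. The function name is an assumption for this example.

```python
def chunk_key_v3_default(chunk_coords: tuple[int, ...], separator: str = "/") -> str:
    """Build a V3 'default' chunk key: 'c', then the coordinates, joined by the separator."""
    return separator.join(["c", *(str(i) for i in chunk_coords)])


key = chunk_key_v3_default((2, 2))
```

The key is relative to the array, so chunk (2, 2) of an array stored at "array" ends up under "array/c/2/2".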
Zarr V3 Array Metadata
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}]
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The functions used to convert each chunk from an array to bytes, and back
This is where sharding is configured
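As a rough illustration of an array-to-bytes codec, here is a sketch of what the "bytes" codec with little-endian configuration does to float64 values. It uses only the standard library `struct` module and is not the implementation used by any Zarr library.

```python
import struct


def bytes_encode(values: list[float], endian: str = "little") -> bytes:
    """Serialize float64 values with the requested byte order."""
    prefix = "<" if endian == "little" else ">"
    return struct.pack(f"{prefix}{len(values)}d", *values)


def bytes_decode(data: bytes, endian: str = "little") -> list[float]:
    """Invert bytes_encode; each float64 occupies 8 bytes."""
    prefix = "<" if endian == "little" else ">"
    count = len(data) // 8
    return list(struct.unpack(f"{prefix}{count}d", data))


encoded = bytes_encode([1.0, -2.5])
decoded = bytes_decode(encoded)
```

Any codecs listed after this one (compression, for example) would then operate on the resulting bytes.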
Zarr V3 Array Metadata
"fill_value": "NaN"
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The value assigned to missing chunks
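JSON has no literal for NaN or infinity, which is why the fill value above is the string "NaN". A minimal sketch of decoding such a value for a float data type follows. The function name is hypothetical, and the sketch handles only plain numbers and the three special strings.

```python
import math


def parse_float_fill_value(value) -> float:
    """Decode a JSON fill value for a float data type (partial sketch)."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    if isinstance(value, str):
        if value not in special:
            raise ValueError(f"unsupported fill value string: {value!r}")
        return special[value]
    return float(value)


fill = parse_float_fill_value("NaN")
```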
Zarr V3 Array Metadata
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
// zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": "float64",
"dimension_names": ["x", "y"],
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "little"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
Arbitrary attributes
Evolution of Zarr V3 Group Metadata
What’s next for Zarr V3
Acknowledgements
Zarr V3 spec editors:
Special thanks to Tom Nicholas (@tomnicholas) for the conversation in github.com/zarr-developers/zarr-developers.github.io/pull/131