1 of 34

A to Zarr

Davis Bennett

Zarr Summit, 2025

2 of 34

Outline

  1. About me
  2. What Zarr is
  3. How Zarr V2 works
  4. How Zarr V3 works

Goals

  • Impart basic knowledge of the Zarr data model
  • Trace the evolution of Zarr

3 of 34

About me

Where I am now

  • Independent software developer
  • Würzburg, Germany
  • Zarr Python

Previously

  • Data engineer
  • Janelia Research Campus
  • Logistics for large microscopy data (www.openorganelle.org)

4 of 34

Zarr vs File formats

We often compare Zarr to file formats (TIFF, HDF5, …).

But Zarr is more like a protocol or API than a conventional file format.

5 of 34

Zarr vs File formats

What is a file format?

A file format assigns meaning to sequences of bytes in a file.

| byte range  | meaning
------------------------
| 0 → n       | header
| n + 1 → end | body

6 of 34

Zarr vs File formats

Zarr assigns meaning to keys in a storage backend.

| key         | meaning
--------------------------------------
| "zarr.json" | header (array metadata)
| "c/0"       | body (chunk data)

The way data is actually stored is an implementation detail!
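A minimal sketch of the key-to-meaning idea, assuming a plain Python dict stands in for the storage backend. The metadata document is abbreviated and the chunk bytes are made up for illustration.

```python
import json

# A plain dict stands in for any key/value storage backend (assumption).
store = {}

# "Header": array metadata lives under the key "zarr.json" (abbreviated here).
metadata = {"zarr_format": 3, "node_type": "array", "shape": [4]}
store["zarr.json"] = json.dumps(metadata).encode("utf-8")

# "Body": the bytes of the first chunk live under the key "c/0".
store["c/0"] = bytes([1, 2, 3, 4])

# A reader only needs the keys, not how the backend persists the values.
loaded = json.loads(store["zarr.json"])
print(loaded["shape"], list(store["c/0"]))
```

Swapping the dict for a directory, an object store bucket, or a database would not change what the keys mean.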

7 of 34

Zarr vs File formats

File formats define the state of a file or stored object.

Zarr defines the behavior of a key / value storage backend.

8 of 34

Zarr vs File formats

# normal
meta = create_array_metadata(...)
store = LocalStore("foo")
write_metadata(meta, store)  # writes "zarr.json"

# weird, but not wrong!
store = ReversedLocalStore("foo")  # reverses file names
write_metadata(meta, store)  # writes "rraz.json"

Zarr does not strictly define the stored representation of data.
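A runnable toy version of the same idea, as a sketch only. `DictStore` and `ReversedDictStore` are hypothetical classes written for this example, not part of any Zarr library, and this toy reverses the whole key rather than just the file stem.

```python
class DictStore:
    """Toy key/value store backed by a dict (hypothetical, for illustration)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]


class ReversedDictStore(DictStore):
    """Same interface, but every key is reversed before it is persisted."""

    def set(self, key, value):
        super().set(key[::-1], value)

    def get(self, key):
        return super().get(key[::-1])


normal, weird = DictStore(), ReversedDictStore()
for s in (normal, weird):
    s.set("zarr.json", b"{}")

# Both stores give the same answer for the same key...
same = normal.get("zarr.json") == weird.get("zarr.json")
# ...even though the names they persist differ.
print(same, list(normal.data), list(weird.data))
```

Both stores behave identically through `get` and `set`; only the stored representation differs.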

9 of 34

Zarr vs File formats

File formats usually have a well-defined entry point.

Zarr forms trees that can be accessed at any level:

''                         <-------- We can enter at the root
|-- zarr.json
|-- sub-group              <-------- We can also enter sub-group
|   |-- zarr.json
|   |-- sub-sub-group      <-------- Or sub-sub-group
|       |-- zarr.json
|-- array                  <-------- Or the array
    |-- zarr.json
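A small sketch of entering the tree at any level, assuming a dict-backed store whose keys are the paths in the diagram. `put` and `open_node` are hypothetical helpers, and the metadata documents are abbreviated.

```python
import json


def put(store, key, doc):
    """Write a JSON document under a key."""
    store[key] = json.dumps(doc).encode("utf-8")


def open_node(store, path=""):
    """Read the metadata document of the node at `path` ('' is the root)."""
    key = f"{path}/zarr.json" if path else "zarr.json"
    return json.loads(store[key])


group = {"zarr_format": 3, "node_type": "group"}
store = {}
put(store, "zarr.json", group)
put(store, "sub-group/zarr.json", group)
put(store, "sub-group/sub-sub-group/zarr.json", group)
put(store, "array/zarr.json", {"zarr_format": 3, "node_type": "array"})

# Every node can be opened directly, without walking down from the root.
for path in ["", "sub-group", "sub-group/sub-sub-group", "array"]:
    print(repr(path), "->", open_node(store, path)["node_type"])
```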

10 of 34

Zarr vs File formats

Maybe Zarr is a "files format"

...but "data format", "protocol", "API", or just "format" are probably better terms 😅

Aspirational: with the right storage logic, Zarr should be able to wrap most array file formats.

11 of 34

How Zarr works

Zarr V2

  • Small team
  • Started 2016, finalized ca. 2017

Zarr V3

  • Bigger team
  • Started 2020, finalized ca. 2023

12 of 34

How Zarr V2 Works

Two elements

  • Groups
    • Groups contain arrays and groups
    • Groups have attributes
  • Arrays
    • Arrays contain chunks
    • Arrays have attributes

13 of 34

Zarr V2 Group metadata

// .zgroup
{
    "zarr_format": 2
}

// .zattrs
{
    "my_group_attributes": 2
}

14 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The shape of the array

15 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The data type of the array

16 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The memory order of the array (C or F)
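A sketch of what the two orders mean for where an element sits in the flattened bytes of a chunk. `flat_index` is a hypothetical helper, and the 2 × 3 chunk shape is chosen only to keep the numbers small.

```python
def flat_index(index, shape, order="C"):
    """Position of `index` in the flattened buffer of a chunk of `shape`."""
    pairs = list(zip(index, shape))
    if order == "F":
        # Fortran order: the first axis varies fastest, so walk axes in reverse.
        pairs.reverse()
    elif order != "C":
        raise ValueError(f"unknown order: {order!r}")
    offset = 0
    for i, n in pairs:
        offset = offset * n + i
    return offset


shape = (2, 3)
# C order: rows are contiguous, so element (1, 0) comes after the whole first row.
print(flat_index((1, 0), shape, "C"))
# F order: columns are contiguous, so element (1, 0) is the second element.
print(flat_index((1, 0), shape, "F"))
```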

17 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The shape of the chunks of the array
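As a quick illustration, the number of chunks along each axis follows from `shape` and `chunks` by ceiling division. `chunk_grid_shape` is a hypothetical helper name.

```python
import math


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each axis; edge chunks may be only partly used."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


# The array from the slide: a 10 x 10 array in 5 x 5 chunks.
grid = chunk_grid_shape((10, 10), (5, 5))
print(grid, math.prod(grid))

# When the chunk shape does not divide the array shape, round up.
print(chunk_grid_shape((10, 10), (4, 4)))
```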

18 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The delimiter string used for the names of the chunks:

“0.0” vs “0/0”
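A one-function sketch of how a V2 chunk name is built from a chunk's grid position. `v2_chunk_key` is a hypothetical helper name; "." is the V2 default separator.

```python
def v2_chunk_key(chunk_index, separator="."):
    """Join the grid coordinates of a chunk with the dimension separator."""
    return separator.join(str(i) for i in chunk_index)


print(v2_chunk_key((0, 0)))        # default "." separator
print(v2_chunk_key((0, 0), "/"))   # "/" separator, as in the metadata above
```

With "/" as the separator, chunk names look like nested paths, which maps naturally onto directories in a file system store.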

19 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The value assigned to missing chunks
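A sketch of how a reader might treat a missing chunk, assuming a dict-backed store and uncompressed int8 chunks. `read_chunk` is a hypothetical helper, not a library function.

```python
from array import array


def read_chunk(store, key, n_elements, fill_value):
    """Return the chunk's values, or `fill_value` repeated if the key is absent."""
    raw = store.get(key)
    if raw is None:
        # Missing chunk: nothing was ever written, so synthesize it.
        return [fill_value] * n_elements
    # Stored chunk: interpret each byte as one signed 8-bit integer ("|i1").
    return array("b", raw).tolist()


store = {"0/0": bytes([1, 2, 3, 4])}  # only one chunk has been written
print(read_chunk(store, "0/0", 4, fill_value=0))
print(read_chunk(store, "0/1", 4, fill_value=0))
```

This is why a freshly created array can be read before any chunk exists: every read resolves to the fill value.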

20 of 34

Zarr V2 Array Metadata

// .zarray
{
    "zarr_format": 2,
    "shape": [10, 10],
    "dtype": "|i1",
    "order": "C",
    "chunks": [5, 5],
    "dimension_separator": "/",
    "fill_value": 0,
    "filters": [{"id": "delta", "astype": "|i1"}],
    "compressor": {"id": "gzip", "level": 1}
}

The functions used to encode and decode chunks

// .zattrs
{
    "my_array_attributes": 2
}
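A standard-library sketch of the encode and decode pipeline: filters run first, then the compressor, and decoding reverses that order. The hand-written delta functions only approximate what a real delta filter does (no overflow handling, non-empty input assumed), and `encode_chunk` / `decode_chunk` are hypothetical names.

```python
import gzip


def delta_encode(values):
    """Keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Undo delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def encode_chunk(values):
    filtered = delta_encode(values)            # filter step
    raw = bytes(v & 0xFF for v in filtered)    # int8 as two's-complement bytes
    return gzip.compress(raw, compresslevel=1) # compressor step


def decode_chunk(blob):
    raw = gzip.decompress(blob)                         # undo compressor
    signed = [b - 256 if b > 127 else b for b in raw]   # bytes back to int8
    return delta_decode(signed)                         # undo filter


chunk = [10, 11, 13, 13, 12]
roundtrip = decode_chunk(encode_chunk(chunk))
print(roundtrip == chunk)
```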

21 of 34

How Zarr V3 works

Two elements

  • Groups
    • Groups contain arrays and groups
    • Groups have attributes
  • Arrays
    • Arrays contain chunks
    • Arrays have attributes

22 of 34

Zarr V3 Group Metadata

// zarr.json
{
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {
        "spam": "ham",
        "eggs": 42
    }
}

23 of 34

Evolution of Zarr V3 Group Metadata

  • .zgroup -> zarr.json
  • .zattrs -> zarr.json
  • node_type added
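The three changes above can be sketched as a small conversion function. `v2_group_to_v3` is a hypothetical helper written for this example, not part of any Zarr library.

```python
import json


def v2_group_to_v3(zgroup, zattrs=None):
    """Merge V2 `.zgroup` and `.zattrs` documents into one V3 `zarr.json`."""
    if zgroup.get("zarr_format") != 2:
        raise ValueError("expected a Zarr V2 group document")
    return {
        "zarr_format": 3,
        "node_type": "group",              # new in V3
        "attributes": dict(zattrs or {}),  # formerly a separate .zattrs document
    }


v3 = v2_group_to_v3({"zarr_format": 2}, {"my_group_attributes": 2})
print(json.dumps(v3, indent=4))
```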

24 of 34

Zarr V3 Array Metadata

"shape": [10000, 1000]

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The shape of the array

25 of 34

Zarr V3 Array Metadata

"data_type": "float64"

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The data type of the array

26 of 34

Zarr V3 Array Metadata

"dimension_names": ["x", "y"]

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The names of the axes of the array

27 of 34

Zarr V3 Array Metadata

"chunk_grid": {
    "name": "regular",
    "configuration": {
        "chunk_shape": [1000, 100]
    }
}

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The shape of each chunk in the array
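With a regular chunk grid, finding the chunk that holds a given element is integer division per axis. A minimal sketch, where `locate` is a hypothetical helper name:

```python
def locate(index, chunk_shape):
    """Split an array index into (chunk grid position, offset inside that chunk)."""
    chunk_index = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk_index, offset


# Element (1234, 567) of the 10000 x 1000 array with 1000 x 100 chunks.
chunk_index, offset = locate((1234, 567), (1000, 100))
print(chunk_index, offset)
```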

28 of 34

Zarr V3 Array Metadata

"chunk_key_encoding": {
    "name": "default",
    "configuration": {
        "separator": "/"
    }
}

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The name of each chunk in the array
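A sketch of the "default" chunk key encoding: the prefix "c", followed by the chunk's grid coordinates, joined by the separator. `default_chunk_key` is a hypothetical helper name.

```python
def default_chunk_key(chunk_index, separator="/"):
    """Build a V3 'default' chunk key: prefix "c", then one part per dimension."""
    return separator.join(["c", *map(str, chunk_index)])


print(default_chunk_key((0, 1)))        # two-dimensional chunk position
print(default_chunk_key((0, 1), "."))   # same position, "." separator
print(default_chunk_key(()))            # zero-dimensional array: just the prefix
```

The "c" prefix keeps chunk keys from colliding with "zarr.json" and with child nodes stored under the same prefix.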

29 of 34

Zarr V3 Array Metadata

"codecs": [{
    "name": "bytes",
    "configuration": {
        "endian": "little"
    }
}]

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The functions used to convert each chunk from an array to bytes, and back

This is where sharding is configured
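A rough standard-library sketch of what the "bytes" codec with "endian": "little" does for float64 data: it turns array values into bytes and back. `bytes_encode` and `bytes_decode` are hypothetical names, and the chunk is flattened to a plain list of floats for simplicity.

```python
import struct


def _prefix(endian):
    """Map the codec's endian setting to a struct byte-order character."""
    return {"little": "<", "big": ">"}[endian]


def bytes_encode(values, endian="little"):
    """Array to bytes: pack each float64 with the requested byte order."""
    return struct.pack(f"{_prefix(endian)}{len(values)}d", *values)


def bytes_decode(blob, endian="little"):
    """Bytes to array: unpack 8 bytes per float64 element."""
    count = len(blob) // 8
    return list(struct.unpack(f"{_prefix(endian)}{count}d", blob))


encoded = bytes_encode([1.0, 2.5])
print(len(encoded), bytes_decode(encoded))
```

Further codecs in the list, such as compressors, would then operate on these bytes.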

30 of 34

Zarr V3 Array Metadata

"fill_value": "NaN"

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The value assigned to missing chunks
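JSON has no literal for NaN or infinity, which is why the fill value appears as the string "NaN". A sketch of turning the JSON value into a Python float follows; `parse_float_fill_value` is a hypothetical helper, and other string encodings of floats are not handled here.

```python
import math

# Strings that stand in for float values JSON cannot express directly.
SPECIAL = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}


def parse_float_fill_value(value):
    """Convert a JSON fill_value for a float data type into a Python float."""
    if isinstance(value, str):
        return SPECIAL[value]  # raises KeyError for strings this sketch ignores
    return float(value)


print(parse_float_fill_value("NaN"))
print(parse_float_fill_value(0))
```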

31 of 34

Zarr V3 Array Metadata

"attributes": {
    "foo": 42,
    "bar": "apples",
    "baz": [1, 2, 3, 4]
}

// zarr.json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "data_type": "float64",
    "dimension_names": ["x", "y"],
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

Arbitrary attributes

32 of 34

Evolution of Zarr V3 Array Metadata

  • .zarray -> zarr.json
  • .zattrs -> zarr.json
  • node_type added
  • more extension points
    • chunk_grid
    • chunk_key_encoding
    • data_type

33 of 34

What’s next for Zarr V3

  • Rectilinear (irregular) chunk grid
  • Formalized “virtual” arrays
  • Formalized attributes via “Zarr Convention Metadata”

34 of 34

Acknowledgements

Zarr V3 spec editors:

  • Alistair Miles (@alimanfoo)
  • Jonathan Striebel (@jstriebel)
  • Norman Rzepka (@normanrz)
  • Jeremy Maitin-Shepard (@jbms)
  • Josh Moore (@joshmoore)

Special thanks to Tom Nicholas (@tomnicholas) for the conversation in github.com/zarr-developers/zarr-developers.github.io/pull/131