1 of 16

Lance: A New Columnar Data Format

https://github.com/eto-ai/lance

For computer vision and deep learning

2 of 16

Motivation

  • Data tooling for unstructured datasets is sorely lacking
  • Data management: folder hierarchies on S3
  • Production readiness: most tooling is designed for data that lives on a local laptop
  • Reproducibility: more important due to labeling and harder due to large binaries
  • Need to have your cake and eat it too
    • Parquet is good for OLAP but bad for point queries
    • Raw S3 is good for point queries but terrible for scans
    • TFRecords is good for training but nothing else

3 of 16

Introducing Lance, a new OSS columnar format

  • Optimized for ML/AI (images, video, sensor, embeddings)
  • Designed for the cloud to be production-ready
  • Super fast performance
    1. 300x better scan performance than S3 image / annotation files
    2. 20x faster point query performance than Parquet
  • Integrated with Apache Arrow for ecosystem access
    • Semantic types for bounding boxes, LIDAR points, segmentation
    • Convenience functions for transformations like resize, shift/skew, sampling, etc.
  • Starting with a C++ implementation with Python bindings

4 of 16

Data layout

5 of 16

Logical layout

  • Choose encodings that support sub-linear point access.
    • Plain
    • Fixed size binary
    • Var-length binary + offsets array
    • Dictionary (RLE or Plain)
    • RLE + offsets array
  • Nested list and struct
    • Offset array points to its children arrays
    • ML datasets often come with list-of-struct fields as annotations so efficiency here is important
  • Support nullables

6 of 16

Physical layout

  • Choose the chunk size aligned with the optimal I/O size over S3
  • Do not read more metadata than necessary
  • Scan Tensors (BCHW or BCWH) without transposing.
  • Align index and metadata sequentially, read them in batch and use SIMD when possible.
  • Cache pages and reduce the number of small reads.
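The chunk-alignment idea can be sketched as follows. All names and sizes here are hypothetical, not Lance's actual parameters: the point is that with fixed-size chunks matched to the optimal S3 request size, a point read maps to exactly one ranged GET.

```python
# Hypothetical sketch: map a row id to the single chunk-aligned byte range
# that must be fetched from S3. Sizes are illustrative, not Lance's.
RECORD_SIZE = 128              # bytes per fixed-size encoded value
CHUNK_SIZE = 8 * 1024 * 1024   # chunk aligned with the optimal S3 I/O size


def byte_range_for_row(row: int) -> tuple[int, int]:
    """Return the (start, end) byte range of the chunk holding `row`."""
    records_per_chunk = CHUNK_SIZE // RECORD_SIZE
    chunk = row // records_per_chunk
    start = chunk * CHUNK_SIZE
    return start, start + CHUNK_SIZE


# One ranged read fetches the whole chunk; neighbors come from the page cache.
start, end = byte_range_for_row(100_000)
```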

7 of 16

Sample comparison with Parquet

  (Side-by-side diagrams comparing the Lance and Parquet file layouts)

8 of 16

Performance benchmarks

9 of 16

Experiments Setup

  • AWS EC2
    • M6i.8xlarge
    • Ubuntu 22.04, Python 3.10, PyArrow 8.0.0, DuckDB 0.4
  • Datasets on S3
    • Oxford Pet
    • MS COCO

10 of 16

Point query

  • Ran two sets of experiments using the Oxford Pets dataset, stored on local SSD and on S3
  • Compared single-row read latency of Lance vs. Parquet vs. raw images with JSON/XML annotations (Pascal XML & MS COCO)
  • Lance’s point query performance is an order of magnitude better than Parquet’s

11 of 16

Batch scan

  • Target workload: ML training and evaluation
  • Swept batch sizes from 8 to 128
  • Compared Lance, Parquet (with embedded images), and XML/JSON annotations with image links on S3
  • Lance achieves performance comparable to Parquet and orders of magnitude better than JSON/XML

12 of 16

Larger-than-Memory Analytics via DuckDB

  • Lance produces a “lazy” Apache Arrow Dataset.
  • DuckDB and PyArrow handle projection (column selection) and predicate (filter) pushdown

import duckdb
import lance

uri = "s3://bucket/path/to/oxford_pet.lance"
pets = lance.dataset(uri)

duckdb.query(
    "SELECT label, count(1) FROM pets GROUP BY label"
).to_arrow_table()

13 of 16

Roadmap

  • Indices and predicate push down
  • Better query support for list-of-struct columns (upstream)
  • Vector/embeddings support and indices
  • Versioning & Schema Evolution (e.g., adding embeddings or different model results)
  • WAL for quick updates (e.g., for annotations)

14 of 16

Thank you!

contact@eto.ai

Open-sourcing late July

15 of 16

Appendix

16 of 16

Set the baseline

  • The experiment setup gives us the following S3 latency curve.
  • The curve guides the file layout design.
  • It represents essentially the “best we can do” over S3.