1 of 16

Lance: A New Columnar Data Format

https://github.com/eto-ai/lance

For computer vision and deep learning

2 of 16

Motivation

  • Data tooling for unstructured datasets is sorely lacking
  • Data management: folder hierarchies on S3
  • Production readiness: most tooling is designed for data that lives on a local laptop
  • Reproducibility: more important due to labeling and harder due to large binaries
  • Need to have your cake and eat it too
    • Parquet is good for OLAP but bad for point queries
    • Raw S3 is good for point queries but terrible for scans
    • TFRecords is good for training but nothing else

3 of 16

Introducing Lance, a new OSS columnar format

  • Optimized for ML/AI (images, video, sensor, embeddings)
  • Designed for the cloud to be production-ready
  • Super fast performance
    1. 300x better scan performance than S3 image / annotation files
    2. 20x faster point query performance than Parquet
  • Integrated with Apache Arrow for ecosystem access
    • Semantic types for bounding boxes, LIDAR points, segmentation
    • Convenience functions for transformations like resize, shift/skew, sampling, etc.
  • Starting with a C++ implementation with Python bindings

4 of 16

Data layout

5 of 16

Logical layout

  • Choose encodings that support sub-linear point access.
    • Plain
    • Fixed size binary
    • Var-length binary + offsets array
    • Dictionary (RLE or Plain)
    • RLE + offsets array
  • Nested list and struct
    • Offset array points to its children arrays
    • ML datasets often come with list-of-struct fields as annotations so efficiency here is important
  • Support nullables

6 of 16

Physical layout

  • Choose the chunk size aligned with the optimal I/O size over S3
  • Do not read more metadata than necessary
  • Scan Tensors (BCHW or BCWH) without transposing.
  • Align index and metadata sequentially, read them in batch and use SIMD when possible.
  • Cache pages and reduce the number of small reads.
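The chunk-alignment idea can be sketched as follows. All names and sizes here are hypothetical, not Lance's actual parameters: the point is that with fixed-size chunks matched to the optimal S3 request size, a point read maps to exactly one ranged GET.

```python
# Hypothetical sketch: map a row id to the single chunk-aligned byte range
# that must be fetched from S3. Sizes are illustrative, not Lance's.
RECORD_SIZE = 128              # bytes per fixed-size encoded value
CHUNK_SIZE = 8 * 1024 * 1024   # chunk aligned with the optimal S3 I/O size


def byte_range_for_row(row: int) -> tuple[int, int]:
    """Return the (start, end) byte range of the chunk holding `row`."""
    records_per_chunk = CHUNK_SIZE // RECORD_SIZE
    chunk = row // records_per_chunk
    start = chunk * CHUNK_SIZE
    return start, start + CHUNK_SIZE


# One ranged read fetches the whole chunk; neighbors come from the page cache.
start, end = byte_range_for_row(100_000)
```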

7 of 16

Sample comparison with Parquet

  (Side-by-side diagrams comparing the Lance and Parquet file layouts)

8 of 16

Performance benchmarks

9 of 16

Experiments Setup

  • AWS EC2
    • M6i.8xlarge
    • Ubuntu 22.04, Python 3.10, PyArrow 8.0.0, DuckDB 0.4
  • Datasets on S3
    • Oxford Pet
    • MS COCO

10 of 16

Point query

  • Ran two sets of experiments using the Oxford Pets dataset, stored on local SSD and on S3
  • Compared single-row read latency of Lance vs. Parquet vs. raw images with JSON/XML annotations (Pascal XML & MS COCO)
  • Lance’s point query performance is an order of magnitude better than Parquet’s

11 of 16

Batch scan

  • Target workload: ML training and evaluation
  • Swept batch sizes from 8 to 128
  • Compared Lance, Parquet (with embedded images), and XML/JSON annotations with image links on S3
  • Lance achieves performance comparable to Parquet and orders of magnitude better than JSON/XML

12 of 16

Larger-than-Memory Analytics via DuckDB

  • Lance produces a “lazy” Apache Arrow Dataset.
  • DuckDB and PyArrow handle projection (column selection) and predicate (filter) pushdown

import duckdb
import lance

uri = "s3://bucket/path/to/oxford_pet.lance"
pets = lance.dataset(uri)

duckdb.query(
    "SELECT label, count(1) FROM pets GROUP BY label"
).to_arrow_table()

13 of 16

Roadmap

  • Indices and predicate push down
  • Better query support for list-of-struct columns (upstream)
  • Vector/embeddings support and indices
  • Versioning & Schema Evolution (e.g., adding embeddings or different model results)
  • WAL for quick updates (e.g., for annotations)

14 of 16

Thank you!

contact@eto.ai

Open-sourcing late July

15 of 16

Appendix

16 of 16

Set the baseline

  • The experiment setup gives us the following S3 latency curve.
  • The curve guides the file layout design.
  • It represents essentially the “best we can do” over S3.