1 of 26

LAION-5B and beyond: datasets, models, and…?

Robert Kaczmarczyk, LAION, TUM

2 of 26

Datasets

Models

???

Environment

Education

3 of 26

LAION - a history of successful community collaborations

Datasets

  • Release of LAION-400M, LAION-5B, LAION-aesthetics, …

Models

4 of 26

5 of 26

LAION-5B and beyond: datasets, models, and…

6 of 26

I wonder… 😉

7 of 26

8 of 26

9 of 26

10 of 26

Datasets

Models

Tools

Environment

Education

???

11 of 26

LAION-5B and beyond: datasets, models, and… tools!

Tools to…

  • interact
  • work
  • improve
  • search

(with) the dataset!

WHY?

12 of 26

13 of 26

Current LAION tools

Fundamental part of our previous efforts

  • Tools for efficiently downloading and searching the dataset
    • CLIP-retrieval
    • img2dataset
    • embedding-reader

14 of 26

Opt-out feature in img2dataset

  • Respecting artists’ privacy / rights!
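
A minimal sketch of how such an opt-out check could work, assuming the opt-out is signalled through `X-Robots-Tag` HTTP response headers carrying directives such as `noai` or `noimageai` (my understanding of what img2dataset honours by default). The function name `is_opted_out` and the header representation are hypothetical, chosen for illustration; this is not img2dataset's actual code.

```python
# Directives assumed to signal an opt-out (assumption, see lead-in).
DISALLOWED_DIRECTIVES = {"noai", "noimageai", "noindex", "noimageindex"}


def is_opted_out(headers, user_agent_token=None, disallowed=DISALLOWED_DIRECTIVES):
    """Return True if any X-Robots-Tag header carries a disallowed directive.

    `headers` is a list of (name, value) tuples from an HTTP response.
    Directives may be global ("noai") or scoped to a crawler
    ("mybot: noai"); scoped ones only count if they name `user_agent_token`.
    """
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        for directive in (part.strip().lower() for part in value.split(",")):
            if directive in disallowed:
                return True
            if user_agent_token and ":" in directive:
                agent, _, rule = directive.partition(":")
                if agent.strip() == user_agent_token.lower() and rule.strip() in disallowed:
                    return True
    return False


# An image served with this header would be skipped during download.
print(is_opted_out([("X-Robots-Tag", "noai")]))
```

A downloader would call such a check on each response and drop the image when it returns `True`, so that opted-out content never reaches the dataset.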

15 of 26

16 of 26

BIAS?

17 of 26

Dataset → Tools → (improved) models

Efficient downloading (img2dataset, …)

Subset creation (CLIP retrieval, …)

Understanding (kNN, modified tSNE, …)
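
To illustrate the kNN idea behind subset creation and dataset understanding, here is a small brute-force sketch using cosine similarity over toy embeddings. The file names and vectors are made up; at LAION scale, real tools would rely on an approximate nearest-neighbour index instead of a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query, embeddings, k=2):
    """Return the names of the k embeddings most similar to `query`."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings (hypothetical values, for illustration only).
embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}
neighbours = knn([1.0, 0.0, 0.0], embeddings, k=2)
print(neighbours)
```

Querying with a text or image embedding and keeping the top-k hits is one way to carve a themed subset out of a large dataset.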

18 of 26

What else?

19 of 26

Outlook

Tools in development

  • CC2imgcap
  • video2dataset
  • General Inference Framework*
  • Streamable versions of img2dataset*

*HF already provides both to some degree for HF datasets

20 of 26

CC2imgcap "Creating a dataset" pipeline

  • Easily convert Common Crawl into an image-caption set using PySpark. Common Crawl has 5M WAT files, which provide the links of the web.
  • This simple tool processes one WAT file in about 50 s and extracts each image link along with its alt text.
  • Deduplicates on URL + text to save output space and speed up the process.
  • This makes it possible to do the first step of building a dataset like LAION-5B in 70k CPU core hours (about $2.8k on AWS EC2 at $0.04 per core hour).
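
The cost estimate follows directly from the figures quoted above; a quick check, assuming every one of the 5M files takes about 50 s on a single core:

```python
# Figures quoted on the slide.
num_files = 5_000_000        # files in Common Crawl to process
seconds_per_file = 50        # approximate processing time per file on one core
usd_per_core_hour = 0.04     # quoted AWS EC2 price

core_hours = num_files * seconds_per_file / 3600
cost_usd = core_hours * usd_per_core_hour

print(f"{core_hours:,.0f} core hours")  # 69,444 core hours, i.e. roughly 70k
print(f"${cost_usd:,.0f}")              # $2,778, i.e. roughly $2.8k
```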

https://github.com/rom1504/cc2imgcap
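
A rough sketch of the extraction and deduplication steps on a single record, using only the standard library. It assumes the WAT JSON layout as I understand it (links listed under `Envelope` → `Payload-Metadata` → `HTTP-Response-Metadata` → `HTML-Metadata` → `Links`, with image links marked by `"path": "IMG@/src"`). The function names and the sample record are hypothetical, and this is not the tool's actual PySpark code.

```python
import json


def extract_image_captions(wat_record_json):
    """Yield (url, alt_text) pairs for image links in one WAT record."""
    record = json.loads(wat_record_json)
    links = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    for link in links:
        # Keep only <img src=...> links that come with non-empty alt text.
        if link.get("path") == "IMG@/src" and link.get("url") and link.get("alt"):
            yield link["url"], link["alt"].strip()


def deduplicate(pairs):
    """Drop repeated (url, text) pairs while preserving first-seen order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


# Synthetic record: one duplicated image, one hyperlink, one image without alt text.
sample = json.dumps({"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
    "HTML-Metadata": {"Links": [
        {"path": "IMG@/src", "url": "https://example.com/cat.jpg", "alt": "A cat on a sofa"},
        {"path": "IMG@/src", "url": "https://example.com/cat.jpg", "alt": "A cat on a sofa"},
        {"path": "A@/href", "url": "https://example.com/about"},
        {"path": "IMG@/src", "url": "https://example.com/logo.png"},
    ]}}}}})

pairs = deduplicate(extract_image_captions(sample))
print(pairs)
```

In a distributed setting the same per-record function would be mapped over all files and the deduplication done as a global distinct step; relative URLs would also need resolving against the page URL, which this sketch omits.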

21 of 26

General Inference Pipeline

  1. Pick a dataset you like (e.g., LAION-5B)
  2. Pick one or multiple models you want to run
  3. Select the configuration for your machine (slurm jobs supported!)

→ Output of the dataset including embeddings / additional columns as specified
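
The three steps above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the framework's API: `shard`, `run_inference`, the stand-in "models" (plain functions), and the `worker` column are all hypothetical names, and the workers run sequentially here instead of on real GPUs.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Model = Callable[[Row], object]


def shard(rows: List[Row], num_workers: int) -> List[List[Row]]:
    """Split rows round-robin so each worker gets a similar share."""
    return [rows[i::num_workers] for i in range(num_workers)]


def run_inference(rows: List[Row], models: Dict[str, Model], num_workers: int = 3) -> List[Row]:
    """Apply every model to every row and add one output column per model."""
    output = []
    for worker_id, part in enumerate(shard(rows, num_workers)):
        for row in part:
            enriched = dict(row)  # keep the original columns
            for column, model in models.items():
                enriched[column] = model(row)
            enriched["worker"] = f"GPU{worker_id}"  # record which worker handled the row
            output.append(enriched)
    return output


# 1. Pick a dataset (toy rows), 2. pick models (stand-in functions), 3. pick a configuration.
rows = [
    {"url": "a.jpg", "caption": "a cat"},
    {"url": "b.jpg", "caption": "a very long caption"},
    {"url": "c.jpg", "caption": "dog"},
    {"url": "d.jpg", "caption": "two birds"},
]
models = {
    "caption_length": lambda row: len(row["caption"]),
    "is_short": lambda row: len(row["caption"]) < 10,
}
result = run_inference(rows, models, num_workers=3)
print(result[0])
```

The output keeps every input column and appends the columns named in `models`, mirroring the "additional columns as specified" idea.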

Any dataset (.tar, .parquet, .npy, …) + any inference model(s)

→ Start general inference pipeline

→ GPU0, GPU1, GPU2 (shown twice in the diagram)

→ Files get automatically uploaded to output directory, e.g., AWS S3

22 of 26

23 of 26

Improve understanding of our datasets…

24 of 26

25 of 26

Conclusion

  • Tools are a fundamental part of datasets & dataset-model interactions
  • Tools should be made openly available, in the same way as the models and datasets
  • Engineering tools to interact with datasets are necessary to help datasets reach a broader audience
  • Tools can aid the general understanding of datasets and improve their privacy and quality (e.g., bias)

26 of 26

LAION is just the starting…

Join the community!

Large

Artificial

Intelligence

Open

Network