1 of 29

Igor Mandrichenko, FNAL

vCHEP 2021

MetaCat

metadata catalog for data management systems

2 of 29

What is MetaCat?

Objective:

  • Provide a metadata catalog component for data management systems in which Rucio may serve as the replica manager
  • DUNE is the primary target experiment, but MetaCat is not a DUNE-centric project

MetaCat = Metadata Catalog


3 of 29

MetaCat Functions

  • Keep metadata associated with objects (files) and object collections (datasets)

  • Provide efficient query mechanism to select “interesting” files

  • Provide flexible, efficient, integrated access to external metadata sources


4 of 29

Files or Objects

Units of operation: object or file

Abstract entity with the following properties

  • Unique text ID (assigned by user or auto generated)
    • Immutable
  • Unique name within a namespace (Rucio term: scope)
    • Can be renamed
  • Creator user, timestamp, etc.
  • Metadata
  • Provenance


5 of 29

MetaCat Data Model


6 of 29

Metadata

File or dataset metadata is any JSON dictionary

Values can be

  • Scalars (integer, floating point, string, boolean)
  • Lists
  • Dictionaries, recursively
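As an illustration, a file's metadata could look like the dictionary below. The parameter names and values are made up for this sketch, and `is_json_value` is a hypothetical helper (not part of MetaCat) that checks that a value uses only the types listed above.

```python
import json

def is_json_value(value):
    """Return True if value uses only scalars, lists and dictionaries."""
    # Scalars listed on the slide: integer, floating point, string, boolean.
    # JSON null is not on the slide's list, so it is rejected here (assumption).
    if isinstance(value, (bool, int, float, str)):
        return True
    if isinstance(value, list):
        return all(is_json_value(item) for item in value)
    if isinstance(value, dict):
        # JSON dictionary keys must be strings
        return all(isinstance(k, str) and is_json_value(v)
                   for k, v in value.items())
    return False

# Hypothetical metadata for one file
metadata = {
    "DUNE.reco_version": "v1.2",
    "core.file_type": "root",
    "runs": [1234, 1235],
    "bit_mask": [0, 1, 1],
    "config": {"version": "2.4", "limits": {"voltage": 1.7}},
    "is_mc": False,
}

assert is_json_value(metadata)
# The dictionary survives a JSON round trip unchanged
assert json.loads(json.dumps(metadata)) == metadata
```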

7 of 29

Datasets

Dataset has a name within a namespace

Files are combined into datasets

  • Many-to-many

Dataset may have subsets, recursively

Dataset types

  • Frozen - files cannot be added or removed
  • Monotonic - files can only be added

Dataset metadata - JSON dictionary
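The two dataset types can be sketched as flags that guard the add and remove operations. The `Dataset` class below is a toy, in-memory illustration with hypothetical names; it is not MetaCat's implementation.

```python
class Dataset:
    """Toy dataset: a named set of file IDs with optional restrictions."""

    def __init__(self, namespace, name, frozen=False, monotonic=False):
        self.namespace = namespace
        self.name = name
        self.frozen = frozen          # no files can be added or removed
        self.monotonic = monotonic    # files can only be added
        self.file_ids = set()

    def add_file(self, file_id):
        if self.frozen:
            raise PermissionError("frozen dataset: files cannot be added")
        self.file_ids.add(file_id)

    def remove_file(self, file_id):
        if self.frozen or self.monotonic:
            raise PermissionError("files cannot be removed from this dataset")
        self.file_ids.discard(file_id)


ds = Dataset("dune", "raw_2019", monotonic=True)
ds.add_file("f1")
try:
    ds.remove_file("f1")
except PermissionError as error:
    print(error)
```

A file can belong to several such datasets at once, which is the many-to-many relation mentioned above.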


8 of 29

Parameter Categories

Optionally, restrict areas of the metadata namespace

  • Categories can be nested recursively
  • Control parameter and subcategory names
  • Category defines constraints on parameters
    • Types
      • int, str, float, boolean
      • list of int, float, str, boolean
      • dictionary
    • Values
    • Presence
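A category definition of this kind could be checked roughly as follows. The layout of `CATEGORY` and the `validate` function are assumptions made for this sketch, not MetaCat's actual format; nesting of subcategories is left out to keep it short.

```python
# Hypothetical definition of the "core" category
CATEGORY = {
    "name": "core",
    "parameters": {
        # type, allowed values and presence constraints
        "file_type": {"type": str, "values": {"root", "raw"}, "required": True},
        "events": {"type": list, "item_type": int, "required": False},
    },
}

def validate(metadata, category):
    """Return a list of constraint violations (empty if metadata is valid)."""
    errors = []
    prefix = category["name"] + "."
    params = category["parameters"]
    for short_name, rule in params.items():
        name = prefix + short_name
        if name not in metadata:
            if rule.get("required"):
                errors.append(f"{name}: missing")
            continue
        value = metadata[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: wrong type")
            continue
        if "item_type" in rule and not all(
                isinstance(item, rule["item_type"]) for item in value):
            errors.append(f"{name}: wrong item type")
        if "values" in rule and value not in rule["values"]:
            errors.append(f"{name}: value not allowed")
    # Control parameter names: reject unknown names inside the category
    for name in metadata:
        if name.startswith(prefix) and name[len(prefix):] not in params:
            errors.append(f"{name}: unknown parameter")
    return errors

print(validate({"core.file_type": "root", "core.events": [1, 2]}, CATEGORY))
print(validate({"core.events": [1, "x"]}, CATEGORY))
```

Parameters outside the category (for example `detector`) are not touched by the check, in line with categories restricting only selected areas of the metadata namespace.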

9 of 29

Queries

  • Metadata querying is the fundamental function of MetaCat
    • Find all files or datasets matching a set of criteria expressed in terms of their metadata
  • Result: File Set (not a Dataset!) - unordered list of file IDs with metadata
  • Written in Metadata Query Language (MQL)
  • A query can be named, saved and reused as is or inside another query
  • For efficiency, MQL query is translated to SQL
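To give a feel for the MQL-to-SQL step, the sketch below turns a simple files query into a parametrized PostgreSQL statement over a JSONB metadata column. The table and column names (`files`, `files_datasets`, `metadata`) are assumptions for illustration; the talk does not describe MetaCat's schema or its translator.

```python
import json

ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}

def condition_to_sql(name, op, value):
    """Translate `name op value` into a SQL fragment and its parameters."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    sql_op = "<>" if op == "!=" else op
    # metadata -> key yields a jsonb value, compared with a jsonb literal
    fragment = f"(f.metadata -> %s) {sql_op} %s::jsonb"
    return fragment, [name, json.dumps(value)]

def files_query_to_sql(namespace, dataset, conditions, limit=None):
    """Build (sql, params) for: files from namespace:dataset where ... limit N."""
    where = ["fd.dataset_namespace = %s", "fd.dataset_name = %s"]
    params = [namespace, dataset]
    for name, op, value in conditions:
        fragment, fragment_params = condition_to_sql(name, op, value)
        where.append(fragment)
        params.extend(fragment_params)
    sql = ("SELECT f.id FROM files f "
           "JOIN files_datasets fd ON fd.file_id = f.id "
           "WHERE " + " AND ".join(where))
    if limit is not None:
        sql += " LIMIT %s"
        params.append(limit)
    return sql, params

sql, params = files_query_to_sql(
    "dune", "raw_2019", [("DUNE.reco_version", "=", "v1.2")], limit=1000)
print(sql)
print(params)
```

The resulting pair is in the form accepted by a psycopg2 cursor, `cursor.execute(sql, params)`, so values are never pasted into the SQL text.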


10 of 29

Datasets vs Queries

Dataset - collection of files

  • Recorded in the database
  • Files added/removed explicitly
  • Has a name within a namespace

Query - instructions for how to select files

  • Recalculated every time
  • Results can change at any time
  • Can be saved under a name within a namespace and reused by name

Bridge:

  • Query results can be saved as a new dataset or added to an existing dataset

11 of 29

Metadata Query Language (MQL)

Basic query:

    files from dune:raw_2019
    where DUNE.reco_version = "v1.2"
    limit 1000

  • Keyword - files query
  • Dataset to select files from
    • namespace:name
  • Metadata filtering
  • Limit results to first 1000
  • Whitespace is ignored

    datasets dune:"raw_%"
    having DUNE.type = "mc"

  • Keyword - datasets query
  • Name pattern
  • Metadata filtering

12 of 29

Metadata Query Language (MQL)

Example:

    union (
        files from dune:raw_2019
            where DUNE.reco_version = "v1.2"
        ,
        files from dune:"raw_2020%"
            where detector = "near" and DUNE.reco_version >= "v1.3"
    )
    where file_type != "root"
    limit 1000

  • Selection from multiple datasets
    • ‘%’ is wildcard
  • Queries can be combined using “union”, “join”, “-” (subtraction)
  • Metadata filters can be applied on any level
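Since a query result is an unordered file set, the three combinators can be pictured as plain set operations. The sketch below uses Python sets of made-up file IDs; reading "join" as set intersection is an assumption, since the slide does not define it.

```python
def union(*file_sets):
    """Files present in at least one input file set."""
    result = set()
    for file_set in file_sets:
        result |= file_set
    return result

def join(*file_sets):
    """Files present in every input file set (assumed meaning of 'join')."""
    result = set(file_sets[0])
    for file_set in file_sets[1:]:
        result &= file_set
    return result

def subtract(left, right):
    """Files in `left` that are not in `right` (the '-' operator)."""
    return left - right

raw_2019 = {"f1", "f2", "f3"}
raw_2020 = {"f3", "f4"}

print(sorted(union(raw_2019, raw_2020)))     # ['f1', 'f2', 'f3', 'f4']
print(sorted(join(raw_2019, raw_2020)))      # ['f3']
print(sorted(subtract(raw_2019, raw_2020)))  # ['f1', 'f2']
```

A metadata filter applied to the outer query then simply filters the combined set.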

13 of 29

Arrays and Dictionaries

    bit_mask[2] = 1              Array element by index
    config["version"] > "2.3"    Dictionary access by key
    runs[any] = 1234             Any array element
    runs[all] < 1234             All elements
    config[any] != "raw"         Any dictionary element
    len(core.events) > 10        Array length
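The intended meaning of these expressions can be mimicked on an ordinary Python dictionary. This is only a sketch of the semantics as the slide describes them; the helper names and the sample metadata are made up, and MetaCat itself evaluates such expressions in the database.

```python
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def elements(container):
    """Elements of a list, or values of a dictionary."""
    if isinstance(container, dict):
        return list(container.values())
    return list(container)

def match_any(container, op, value):
    """name[any] op value: true if at least one element matches."""
    return any(OPS[op](x, value) for x in elements(container))

def match_all(container, op, value):
    """name[all] op value: true if every element matches."""
    return all(OPS[op](x, value) for x in elements(container))

meta = {
    "bit_mask": [0, 0, 1],
    "config": {"version": "2.4", "mode": "raw"},
    "runs": [1200, 1234],
    "core.events": list(range(12)),
}

assert meta["bit_mask"][2] == 1                 # bit_mask[2] = 1
assert meta["config"]["version"] > "2.3"        # config["version"] > "2.3"
assert match_any(meta["runs"], "=", 1234)       # runs[any] = 1234
assert not match_all(meta["runs"], "<", 1234)   # runs[all] < 1234
assert match_any(meta["config"], "!=", "raw")   # config[any] != "raw"
assert len(meta["core.events"]) > 10            # len(core.events) > 10
```

Note that in this sketch `match_all` is true for an empty array, following Python's `all`; whether MQL behaves the same way is not stated in the talk.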

14 of 29

Value Ranges and Enumerations

    runs[any] in 1234:1332                 Range of values
    runs[all] in (1234, 1235, 2345)        Enumerated set of values
    run_type in ("calibration", "test")    Enumerated values can be strings too
    file_type not in ("mc", "test")
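In Python terms, the range and enumeration tests amount to the checks below. Treating the range `1234:1332` as inclusive on both ends is an assumption of this sketch; the slide does not say whether the bounds are included.

```python
def in_range(value, low, high):
    """value in low:high, with both bounds assumed inclusive."""
    return low <= value <= high

runs = [1234, 1235]
run_type = "calibration"
file_type = "root"

# runs[any] in 1234:1332
assert any(in_range(run, 1234, 1332) for run in runs)
# runs[all] in (1234, 1235, 2345)
assert all(run in (1234, 1235, 2345) for run in runs)
# run_type in ("calibration", "test")
assert run_type in ("calibration", "test")
# file_type not in ("mc", "test")
assert file_type not in ("mc", "test")
```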

15 of 29

Named Queries

A complex query can be created, debugged, and saved for reuse

    query DUNE:supernova_production_latest_version
    where len(core.events) > 10
    limit 100

    join (
        query DUNE:supernova_production_latest_version,
        query joe:my_favorite_files
    )

16 of 29

External Metadata Sources

  • Use case:
    • Run conditions are stored in the runs database
    • Files need to be selected based on run condition values
    • We do not want to replicate run conditions as file metadata


17 of 29

External Filter - External Data Access Mechanism

  • Accepts the results of one or more intermediate queries (file sets) with file metadata
  • Filters the results based on a user-defined algorithm
    • Can access data from any source
  • Returns the filtered file set
    • Optionally with new metadata values, copied from the external source
  • User-provided Python function with a standard interface
    • Parametrized at run time
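A user-provided filter of this kind could look like the sketch below. The function signature, the layout of the file records, and the in-memory `runs_db` dictionary standing in for the runs database are all assumptions made for illustration; the talk does not spell out MetaCat's actual filter interface.

```python
def run_type_filter(file_sets, params, runs_db):
    """Keep files that have a run of the requested type.

    file_sets: list of file sets, each a list of {"fid", "metadata"} records
    params:    run-time parameters from the query, here (run_type,)
    runs_db:   external source, run number -> run conditions (hypothetical)
    """
    (wanted_type,) = params
    for file_set in file_sets:
        for record in file_set:
            for run in record["metadata"].get("runs", []):
                conditions = runs_db.get(run)
                if conditions and conditions["run_type"] == wanted_type:
                    # Copy the metadata and inject a value from the external source
                    metadata = dict(record["metadata"])
                    metadata["runs_db.voltage"] = conditions["voltage"]
                    yield {"fid": record["fid"], "metadata": metadata}
                    break  # emit each file at most once

runs_db = {
    1: {"run_type": "calibration", "voltage": 120.0},
    2: {"run_type": "physics", "voltage": 100.0},
}
files = [
    {"fid": "a", "metadata": {"runs": [1]}},
    {"fid": "b", "metadata": {"runs": [2]}},
]

result = list(run_type_filter([files], ("calibration",), runs_db))
print(result)
```

Because the filter adds `runs_db.voltage` to the metadata it returns, an outer `where` clause can then select on that value without it ever being stored as file metadata.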


18 of 29

External Filter in MQL


    filter run_type("calibration") (
        files from dune:raw_2019
        where file_type = "root"
    ) where runs_db.voltage > 110.0

Filter name - user defined

Intermediate query

Runs database can be used to filter query results

  • and inject new metadata

A filter does not always have to access external data. In this example, the filter mixes two file sets into one according to target ratios:

    filter random_mix(0.4, 0.6) (
        files from dune:raw_2019
            where reco_version = "v1.2"
        ,
        files from dune:"raw_2020%"
            where detector = "near" and reco_version >= "v1.3"
    )
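One possible reading of the random_mix example is sketched below: draw files at random from each input so that the output follows the target ratios and is as large as the inputs allow. The sizing rule, the `seed` argument, and the function signature are assumptions of this sketch, not a description of the filter used in the talk.

```python
import random
from fractions import Fraction

def random_mix(file_sets, ratios, seed=None):
    """Mix several file sets into one list according to target ratios."""
    if len(file_sets) != len(ratios):
        raise ValueError("need exactly one ratio per file set")
    rng = random.Random(seed)
    # Exact fractions avoid floating point surprises such as 0.4 * 100
    fractions = [Fraction(str(ratio)) for ratio in ratios]
    # Largest output size for which every input can supply its share
    total = min(Fraction(len(file_set)) / fraction
                for file_set, fraction in zip(file_sets, fractions)
                if fraction > 0)
    mixed = []
    for file_set, fraction in zip(file_sets, fractions):
        mixed.extend(rng.sample(list(file_set), int(total * fraction)))
    rng.shuffle(mixed)
    return mixed

set_a = [f"a{i}" for i in range(40)]
set_b = [f"b{i}" for i in range(100)]
mixed = random_mix([set_a, set_b], (0.4, 0.6), seed=1)
print(len(mixed), sum(name.startswith("a") for name in mixed))
```

With 40 and 100 input files and ratios 0.4 and 0.6, the output holds 100 files, 40 from the first set and 60 from the second.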

19 of 29

Architecture


Software Stack

  • PostgreSQL v12 - the database
  • Python3
  • psycopg2 - Python/PostgreSQL adapter

20 of 29

Project History and Status

  • Started in Fall 2019
  • Primary target: DUNE, but meant to be a generic tool
  • Presented and demonstrated at the DUNE DB Workshop - December 2019
  • DUNE requirements collected - September 2020
  • DUNE deployment plan:
    • https://tinyurl.com/metacat-dune-deploy
    • Database conversion tool SAM->MetaCat ready
  • GitHub: https://github.com/ivmfnal/metacat
  • Documentation: https://tinyurl.com/metacat-docs
  • ProtoDUNE MetaCat GUI: https://metacat.fnal.gov:9443/dune_meta_demo/app/gui/index


21 of 29

Backup

Igor Mandrichenko, MetaCat Project Status


9/22/2020

22 of 29

Metadata Query Language (MQL)

Metadata expressions:

    files from dune:raw_2019
    where (
            DUNE.reco_version = "v1.2" or
            DUNE.reco_version = "v1.3"
        )
        and core.file_type = "root"
        and run.config.voltage > 1.5

  • Parameter category
  • Enumerated values
  • Expressions can be combined
    • and, or, not

23 of 29

Datasets and Subsets in MQL

    files from dune:raw_2019 with children recursively
    where created_timestamp > '2019-05-01'
        and reco_version = "v1.2"

Include files from the top dataset

  • and its subsets
  • recursively
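The effect of "with children" and "recursively" can be pictured as a walk over the dataset hierarchy. In the sketch below the hierarchy and file lists are plain dictionaries with made-up names, and "with children" without "recursively" is assumed to mean direct subsets only.

```python
def collect_files(top, subsets, files, recursively=True):
    """Files of `top` and of its subsets.

    subsets: dataset name -> list of direct subset names
    files:   dataset name -> set of file IDs
    """
    result = set(files.get(top, ()))
    seen = {top}
    stack = list(subsets.get(top, ()))
    while stack:
        dataset = stack.pop()
        if dataset in seen:
            continue  # a subset can be reachable through several parents
        seen.add(dataset)
        result |= files.get(dataset, set())
        if recursively:
            stack.extend(subsets.get(dataset, ()))
    return result

subsets = {"dune:raw_2019": ["dune:raw_2019_a"],
           "dune:raw_2019_a": ["dune:raw_2019_a1"]}
files = {"dune:raw_2019": {"f1"},
         "dune:raw_2019_a": {"f2"},
         "dune:raw_2019_a1": {"f3"}}

print(sorted(collect_files("dune:raw_2019", subsets, files)))
print(sorted(collect_files("dune:raw_2019", subsets, files, recursively=False)))
```

The `where` clause of the query would then be applied to the collected files.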

24 of 29

External Filter with Multiple Inputs

A filter can take multiple file sets as input and combine them into a single output file set

25 of 29

Rucio Compatibility

    MetaCat             Rucio
    File                File
    Namespace           Scope
    Dataset             Dataset
    Dataset (subset)    Container

26 of 29

File Provenance

  • Which files were used to create which files
  • File may have zero or more parents
    • merging
  • File may have zero or more children
    • splitting


27 of 29

User Authentication


  • Implemented:
    • Password, digest, LDAP
    • Token
    • X.509 including proxies
  • Planned:
    • SSO

28 of 29

File Provenance in MQL

    parents (
        files from dune:raw_2019
        where reco_version = "v1.2"
    )

Unprocessed files:

    files from dune:raw - parents(
        files from dune:processed
    )

All the files without any children:

    files from dune:raw - parents(
        children(
            files from dune:raw
        )
    )
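The provenance queries on this slide can be traced on a small in-memory example. Here provenance is kept as a hypothetical `parent_map` from each child file to the set of its parents, and `parents` and `children` are plain set functions written for this sketch.

```python
def parents(file_ids, parent_map):
    """All files that are a parent of at least one of the given files."""
    return {parent for fid in file_ids for parent in parent_map.get(fid, ())}

def children(file_ids, parent_map):
    """All files that have at least one of the given files as a parent."""
    return {child for child, file_parents in parent_map.items()
            if any(parent in file_ids for parent in file_parents)}

raw = {"r1", "r2", "r3"}
processed = {"p1"}
# p1 was made from r1; x1 was made from r2 but is not in the processed dataset
parent_map = {"p1": {"r1"}, "x1": {"r2"}}

# files from dune:raw - parents(files from dune:processed)
unprocessed = raw - parents(processed, parent_map)
# files from dune:raw - parents(children(files from dune:raw))
childless = raw - parents(children(raw, parent_map), parent_map)

print(sorted(unprocessed))  # ['r2', 'r3']
print(sorted(childless))    # ['r3']
```

The two results differ for r2: it has a child, so it is not childless, but that child is not in the processed dataset, so r2 still counts as unprocessed.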

29 of 29

DUNE SAM Conversion

  • DUNE SAM DB to MetaCat conversion procedure developed
    • Shell + SQL (via psql)
  • 5.8 million files
  • 134 million name=value metadata pairs
    • Includes Dimensions and “core” file attributes
    • ProtoDUNE specifics
      • Events
      • Rucio file names
  • Takes about 30 minutes
