Igor Mandrichenko, FNAL
vCHEP 2021
MetaCat
metadata catalog for data management systems
What is MetaCat?
Objective:
MetaCat = Metadata Catalog
MetaCat Functions
Files or Objects
Units of operation: object or file
Abstract entity with the following properties
MetaCat Data Model
Metadata
File or dataset metadata is any JSON dictionary
Values can be
Datasets
Dataset has a name within a namespace
Files are combined into datasets
Dataset may have subsets, recursively
Dataset types
Dataset metadata - JSON dictionary
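The dataset model above (files grouped into datasets, datasets nested recursively, a JSON dictionary of metadata per dataset) can be illustrated with a small in-memory sketch. The `Dataset` class, its field names and the `collect_files` method are hypothetical and only model the idea. They are not the MetaCat schema or API, and the sketch assumes the subsets form a tree without cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical in-memory model of a dataset; not the MetaCat schema."""
    namespace: str
    name: str
    files: set = field(default_factory=set)        # file names held directly
    subsets: list = field(default_factory=list)    # child Dataset objects
    metadata: dict = field(default_factory=dict)   # any JSON-like dictionary

    def collect_files(self, with_children=False, recursively=False):
        """Return this dataset's files, optionally adding files of subsets."""
        found = set(self.files)
        if with_children:
            for child in self.subsets:
                # Descend below the direct children only for recursive listings.
                found |= child.collect_files(with_children=recursively,
                                             recursively=recursively)
        return found


# Illustrative names only.
grandchild = Dataset("dune", "raw_2019_run1_part1", files={"f3"})
child = Dataset("dune", "raw_2019_run1", files={"f2"}, subsets=[grandchild])
top = Dataset("dune", "raw_2019", files={"f1"}, subsets=[child],
              metadata={"DUNE.type": "raw"})

print(sorted(top.collect_files()))
print(sorted(top.collect_files(with_children=True)))
print(sorted(top.collect_files(with_children=True, recursively=True)))
```

The two keyword arguments are named after the "with children" and "recursively" modifiers used in the MQL examples in this talk. Whether MetaCat resolves them exactly this way is an assumption.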
Parameter Categories
Optionally, restrict areas of the metadata namespace
Queries
Datasets vs Queries
Dataset - collection of files
Query - instructions for selecting files
Bridge:
Metadata Query Language (MQL)
Basic query:
files from dune:raw_2019
where DUNE.reco_version = "v1.2"
limit 1000
datasets dune:"raw_%"
having DUNE.type = "mc"
Metadata Query Language (MQL)
Example:
union (
files from dune:raw_2019
where DUNE.reco_version = "v1.2"
,
files from dune:"raw_2020%"
where detector = "near" and
DUNE.reco_version >= "v1.3"
)
where file_type != "root"
limit 1000
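To make the semantics of the union example concrete, here is a minimal Python sketch that evaluates the same selection over a small in-memory list. Everything in it is illustrative: the `CATALOG` contents, the `files_from` and `union` helpers, the translation of the `%` wildcard to a glob, and the choice to compare versions as plain strings are assumptions, not MetaCat behavior.

```python
from fnmatch import fnmatchcase

# Hypothetical catalog contents: (file name, dataset, metadata dictionary).
CATALOG = [
    ("f1", "dune:raw_2019", {"DUNE.reco_version": "v1.2", "file_type": "root"}),
    ("f2", "dune:raw_2019", {"DUNE.reco_version": "v1.2", "file_type": "hdf5"}),
    ("f3", "dune:raw_2019", {"DUNE.reco_version": "v1.1", "file_type": "hdf5"}),
    ("f4", "dune:raw_2020_a", {"DUNE.reco_version": "v1.3", "detector": "near",
                               "file_type": "hdf5"}),
    ("f5", "dune:raw_2020_b", {"DUNE.reco_version": "v1.4", "detector": "far",
                               "file_type": "hdf5"}),
]


def files_from(pattern, where):
    """Select files whose dataset matches the pattern and whose metadata passes."""
    glob = pattern.replace("%", "*")  # assumption: '%' behaves like SQL LIKE
    return [(name, meta) for name, dataset, meta in CATALOG
            if fnmatchcase(dataset, glob) and where(meta)]


def union(*file_sets):
    """Merge file sets, keeping each file once, in first-seen order."""
    merged = {}
    for file_set in file_sets:
        for name, meta in file_set:
            merged.setdefault(name, meta)
    return list(merged.items())


selected = union(
    files_from("dune:raw_2019",
               lambda m: m.get("DUNE.reco_version") == "v1.2"),
    files_from("dune:raw_2020%",
               lambda m: m.get("detector") == "near"
               and m.get("DUNE.reco_version", "") >= "v1.3"),
)
# Outer "where file_type != root" and "limit 1000" apply to the merged set.
result = [name for name, meta in selected
          if meta.get("file_type") != "root"][:1000]
print(result)
```

With this sample data the two branches contribute f1, f2 and f4, and the outer condition then removes the ROOT file f1.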
Arrays and Dictionaries
Array element by index: bit_mask[2] = 1
Dictionary access by key: config["version"] > "2.3"
Any array element: runs[any] = 1234
All elements: runs[all] < 1234
Any dictionary element: config[any] != "raw"
Array length: len(core.events) > 10
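The array and dictionary expressions on this slide map naturally onto Python's indexing, `any()`, `all()` and `len()`. The sketch below evaluates each of them against one hypothetical metadata dictionary. The sample values and the `elements` helper are made up for illustration, and the reading of `[any]` on a dictionary as "any value" is an assumption.

```python
# Hypothetical metadata of a single file.
meta = {
    "bit_mask": [0, 0, 1],
    "config": {"version": "2.4", "mode": "raw"},
    "runs": [1230, 1234],
    "core.events": list(range(12)),
}


def elements(value):
    """Items an [any]/[all] subscript ranges over: list items or dict values."""
    return list(value.values()) if isinstance(value, dict) else list(value)


checks = {
    'bit_mask[2] = 1': meta["bit_mask"][2] == 1,
    'config["version"] > "2.3"': meta["config"]["version"] > "2.3",
    'runs[any] = 1234': any(r == 1234 for r in elements(meta["runs"])),
    'runs[all] < 1234': all(r < 1234 for r in elements(meta["runs"])),
    'config[any] != "raw"': any(v != "raw" for v in elements(meta["config"])),
    'len(core.events) > 10': len(meta["core.events"]) > 10,
}

for expression, outcome in checks.items():
    print(f"{expression:28} {outcome}")
```

Only `runs[all] < 1234` fails here, because one of the two runs equals 1234.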
Value Ranges and Enumerations
Range of values:
runs[any] in 1234:1332
Enumerated set of values:
runs[all] in (1234,1235,2345)
run_type in ("calibration","test")
file_type not in ("mc","test")
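Range and enumeration tests can be read as ordinary membership checks. The short sketch below assumes that a range `low:high` includes both endpoints, which the slide does not state. The sample values and the `in_range` helper are hypothetical.

```python
def in_range(value, low, high):
    """Range test; assumes both bounds are inclusive."""
    return low <= value <= high


runs = [1200, 1300]
run_type = "calibration"
file_type = "root"

range_any = any(in_range(r, 1234, 1332) for r in runs)   # runs[any] in 1234:1332
enum_all = all(r in (1234, 1235, 2345) for r in runs)    # runs[all] in (...)
type_in = run_type in ("calibration", "test")            # run_type in (...)
type_not_in = file_type not in ("mc", "test")            # file_type not in (...)

print(range_any, enum_all, type_in, type_not_in)
```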
Named Queries
A complex query can be created, debugged, and saved for reuse
query DUNE:supernova_production_latest_version
where len(core.events) > 10
limit 100
join (
query DUNE:supernova_production_latest_version,
query joe:my_favorite_files
)
External Metadata Sources
External Filter - External Data Access Mechanism
External Filter in MQL
filter run_type("calibration") (
files from dune:raw_2019
where file_type="root"
) where runs_db.voltage > 110.0
Filter name - user defined
Intermediate query
Runs database can be used to filter query results
filter random_mix(0.4, 0.6)
(
files from dune:raw_2019
where reco_version = "v1.2"
,
files from dune:"raw_2020%"
where detector = "near" and
reco_version >= "v1.3"
)
A filter does not have to access external data.
This example mixes two file sets into one according to target ratios.
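As an illustration of what a filter such as random_mix(0.4, 0.6) might compute, the following Python sketch mixes two file lists so that the output respects the target ratios. The function signature, the `seed` parameter and the sizing rule (the largest mix that every input can supply at its share) are assumptions made for this example. The talk does not specify how the real filter is implemented or how filters are registered with MetaCat.

```python
import random


def random_mix(fractions, file_sets, seed=0):
    """Hypothetical filter: sample from each input so the output follows the
    requested fractions. Inputs must be lists of file names."""
    if len(fractions) != len(file_sets):
        raise ValueError("one fraction is required per input file set")
    total = sum(fractions)
    weights = [f / total for f in fractions]
    # Largest output size that no input runs out of files for.
    size = min(len(fs) / w for fs, w in zip(file_sets, weights) if w > 0)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mixed = []
    for fs, w in zip(file_sets, weights):
        count = min(len(fs), round(size * w))
        mixed.extend(rng.sample(fs, count))
    rng.shuffle(mixed)
    return mixed


# Stand-ins for the results of the two intermediate queries.
raw_2019 = [f"a{i}" for i in range(10)]
raw_2020 = [f"b{i}" for i in range(30)]

mixed = random_mix([0.4, 0.6], [raw_2019, raw_2020])
print(len(mixed), sum(name.startswith("a") for name in mixed))
```

With 10 and 30 input files, the first set limits the mix to 25 files, 10 from the first input and 15 from the second.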
Architecture
Software Stack
Project History and Status
Backup
Igor Mandrichenko, MetaCat Project Status
9/22/2020
Metadata Query Language (MQL)
Metadata expressions:
files
from dune:raw_2019
where
( DUNE.reco_version = "v1.2" or
DUNE.reco_version = "v1.3"
)
and core.file_type = "root"
and run.config.voltage > 1.5
Datasets and Subsets in MQL
files
from dune:raw_2019
with children
recursively
where created_timestamp > '2019-05-01'
and reco_version = "v1.2"
Include files from the top dataset
External Filter with Multiple Inputs
A filter can take multiple file sets as input and combine them into a single output file set
Rucio Compatibility
MetaCat
Rucio
File Provenance
User Authentication
File Provenance in MQL
parents (
files from dune:raw_2019
where reco_version = "v1.2"
)
Unprocessed files:
files from dune:raw - parents(
files from dune:processed
)
All the files without any children:
files from dune:raw -
parents(
children(
files from dune:raw
)
)
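The provenance queries on this slide are set algebra over parent/child links, which the following Python sketch reproduces on a toy provenance graph. The `PARENTS` mapping, the file names and the helper functions are hypothetical and only mirror the meaning of `parents(...)`, `children(...)` and the `-` (difference) operator as used above.

```python
# Hypothetical provenance links: child file -> set of its parent files.
PARENTS = {
    "reco1": {"raw1"},
    "reco2": {"raw2"},
    "calib1": {"raw2"},
}

RAW = {"raw1", "raw2", "raw3"}   # stand-in for: files from dune:raw
PROCESSED = {"reco1"}            # stand-in for: files from dune:processed


def parents(files):
    """All files that are a parent of at least one file in the input set."""
    return set().union(*(PARENTS.get(f, set()) for f in files))


def children(files):
    """All files that have at least one parent in the input set."""
    wanted = set(files)
    return {child for child, its_parents in PARENTS.items()
            if its_parents & wanted}


# files from dune:raw - parents(files from dune:processed)
unprocessed = RAW - parents(PROCESSED)

# files from dune:raw - parents(children(files from dune:raw))
childless = RAW - parents(children(RAW))

print(sorted(unprocessed), sorted(childless))
```

Here raw2 has children, but none of them is in the processed dataset, so it counts as unprocessed without being childless.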
DUNE SAM Conversion