18.10-24.10

CERN instance:

Chicago ES access restored (new hosts)
missed metadata collected
local issues with Python modules (urllib3, request, …): seems to work fine now

RFBR: 1st year report (2-page version) (in progress): https://docs.google.com/document/d/1gG9uR38o_gJ-UAVQFFb0nDnfUspaQR34eGK8USvJVxc/edit?ts=5db0882e
API server app: keywords search method (in progress)
NEC paper (in progress):

skeleton: https://docs.google.com/document/d/1IBxAOCttZ7O79yLdwIzVAuljBqnoxzG1v9tNDmcbc5Q/

DKB weekly

24/10/2019

ToDos (not updated)

data4es: updates (#253):

Stage 019: handle ‘_incomplete’ marker (#264) (todo: address review comments)
‘update’ operation in data4es run script (#253) (ready for review)
update stages according to the discussion

ES scheme revision and update:

pt.2 (nested scheme): es-scheme-nested (todo: commit latest changes)

Re-reviewed: #282 (vaulov, stage 095 update) (todo: check comments, re-review)
Issues with Chicago ES: disabled stage 025, so no CPU usage info for now
PRs:

#282 (vaulov, Stage 095 improvements) -- re-review
#188 (vaulov, Stage 055 docs) -- re-review
#264 (mgolosova, Stage 019: ‘_incomplete’ marker) -- address comments
es-scheme-nested -- commit/push latest version of test utils
#277 (mgolosova, pyDKB.storages) -- keep on working

API: implement methods for queries currently in use

ES mapping

PR: #285

Field type characteristics:

text:

usage: wildcard search, match search (individual tokens)

keyword:

usage (queries): wildcard search, exact match, grouping (aggregation ‘terms’)
case-sensitive
may be “normalized” (e.g. to lower/upper case) to avoid case sensitivity, but then the “terms” aggregation returns the normalized values (not good for CamelCaseValues)

numeric/date:

usage: exact match, range search, grouping (aggregation ‘terms’)
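The “keyword” normalization trade-off can be illustrated with a minimal sketch. It is not the mapping from PR #285: the normalizer name `lowercase_normalizer` and the field names `campaign` / `campaign_normalized` are hypothetical, and the helper only emulates in plain Python what bucket keys a “terms” aggregation would report. The `normalizer` mapping parameter and custom normalizers with the `lowercase` filter are standard Elasticsearch features.

```python
from collections import Counter

# Assumption: a custom normalizer with a hypothetical name "lowercase_normalizer".
analysis_settings = {
    "analysis": {
        "normalizer": {
            "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
        }
    }
}

# Hypothetical field names; not taken from the DKB mapping.
field_mappings = {
    # case-sensitive exact match, original values in 'terms' buckets
    "campaign": {"type": "keyword"},
    # case-insensitive exact match, but lower-cased values in 'terms' buckets
    "campaign_normalized": {"type": "keyword", "normalizer": "lowercase_normalizer"},
}


def terms_bucket_keys(values, normalized=False):
    """Emulate the bucket keys (with counts) of a 'terms' aggregation."""
    return Counter(v.lower() if normalized else v for v in values)


values = ["CamelCaseValue", "camelcasevalue", "CamelCaseValue"]
print(terms_bucket_keys(values))                   # original spelling kept
print(terms_bucket_keys(values, normalized=True))  # CamelCase is lost
```

With the normalized field all three values fall into one bucket, but the bucket key no longer shows the original CamelCase spelling.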

ES mapping

dataset/task name fields mapping now:
{
  "type": "text",
  "analyzer": "task_dataset_analyzer",  /* allows search by individual name fields */
  "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}  /* allows exact match search */
}
“match” query with “minimum_should_match: 100%”:

query: "mc12_8TeV.229602.MadGraphPythia_AUET2B_CTEQ6L1_pMSSM_QCD_496706509.merge.e3481_a220_a263"
result:
"mc12_8TeV..._a263_a264_r4540_p1328"
"mc12_8TeV..._a263_a264_r4540_t87"
"mc12_8TeV..._a263_a264_r4540"
"mc12_8TeV..._a263"

Is it any better than “keyword” storage + “term” query?
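For comparison, the two query styles can be written down side by side. This is a sketch under assumptions: the field name `taskname` is hypothetical, and the helpers only build request bodies (no ES client is involved). A plausible reading of the result above is that the “match” query, even with “minimum_should_match: 100%”, only requires all query tokens to be present, so longer names that contain extra tokens match as well, while a “term” query on the “keyword” subfield returns the exact name only.

```python
import json


def match_all_tokens(field, value):
    """'match' query: every analyzed token of `value` must be present."""
    return {
        "query": {
            "match": {
                field: {"query": value, "minimum_should_match": "100%"}
            }
        }
    }


def exact_term(field, value):
    """'term' query against the un-analyzed 'keyword' subfield."""
    return {"query": {"term": {field + ".keyword": value}}}


name = ("mc12_8TeV.229602.MadGraphPythia_AUET2B_CTEQ6L1_pMSSM_QCD_496706509"
        ".merge.e3481_a220_a263")

# "taskname" is a hypothetical field name used for illustration only.
print(json.dumps(match_all_tokens("taskname", name), indent=2))
print(json.dumps(exact_term("taskname", name), indent=2))
```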

DKB tasks overview

Chicago ES now requires authorisation (half-solved):

temporary: adjust data4es process at aiatlas171 (done)
pyDKB.storages submodule (#277) (postponed)
Stage 025: use some config instead of hardcoded Chicago ES access parameters

Stage 009 (Oracle Connector): regular interval reprocessing (trello) (mgolosova)

ATLAS data sample statistics:

specification (mgolosova, mborodin)
...

AMI-related stage improvement (trello) (vaulov)

duplicate output_dataset records in ES with wrong _parent (trello) (akaida)

ES: data update (1)

Q1: should service fields stored in ES start with ‘_’, just like dataflow service fields?

Q2: an update operation may erase existing values if some field in the update document is set to NULL or to a default value.

Possible NULL/default values:

Stage 017:

chain_data/chain_id: default value, derived from taskid

rely on Oracle data stability: once chain info has been found, it is not expected to disappear;
do not set at all, if not defined (remove from message);

category: default value ‘Uncategorized’

leave as is: the value is derived from taskname and hashtag_list, and if they have changed so that we can not define category, it should be set to ‘Uncategorized’;

hashtag_list/output_formats: NULL, if Oracle returns empty values

leave as is: if values were removed from Oracle, they should be removed from ES@DKB as well for consistency.

NOTE: if we decide not to set some field at all, we must make sure that all ES requests use constructions like “{must: {exists: {field: FIELD_NAME}}}” wherever it is required!
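The construction from the note can be wrapped in a small helper so that no request forgets it. A minimal sketch, assuming requests are built as Python dicts; the helper name `require_fields` and the example field names are hypothetical, while `bool`/`must`/`exists` are standard Elasticsearch query DSL.

```python
def require_fields(clause, *fields):
    """Combine `clause` with 'exists' checks for every field in `fields`."""
    must = [clause] + [{"exists": {"field": name}} for name in fields]
    return {"query": {"bool": {"must": must}}}


# Hypothetical usage: only consider documents where 'chain_id' is actually set.
request = require_fields({"term": {"category": "Uncategorized"}}, "chain_id")
print(request)
```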

ES: data update (2)

Stage 025:

hs06: NULL if not found in Chicago ES

do not set at all;

toths06*(toths06/toths06_failed/toths06_finished): default value 0

do not set at all, if not defined or 0;

Stage 091:

[primary_input_]events: default value NULL (if not found or ‘null’ in Rucio)

do not set at all;

[input_]bytes: default value -1

do not set at all -- to keep values that might have been stored earlier;

[primary_input_]deleted: default value TRUE (if not found or bytes == -1)

do not set at all; in requests treat “not set” as TRUE;

Stage 093:

data_format: default value empty array;

leave as is: the value is derived from the dataset name and is not going to change;

Stage 095:

dataset physics parameters from AMI: not set unless some value is provided by AMI

leave as is.
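The “do not set at all” proposals above amount to dropping NULL/default values from the update document before it is sent to ES, so that earlier stored values are not erased. The following is a sketch under stated assumptions, not the stage code: the table `UNSET_VALUES`, the helper names, and the expansion of the bracketed field names (e.g. `[input_]bytes` to `bytes` and `input_bytes`) are illustrative. The second helper shows one way to treat “not set” as TRUE in a request.

```python
# Assumption: per-field values that mean "nothing to store" (from the proposals above).
UNSET_VALUES = {
    "hs06": (None,),
    "toths06": (None, 0),
    "toths06_failed": (None, 0),
    "toths06_finished": (None, 0),
    "events": (None,),
    "primary_input_events": (None,),
    "bytes": (None, -1),
    "input_bytes": (None, -1),
}


def drop_unset(doc, unset_values=UNSET_VALUES):
    """Return a copy of `doc` without fields holding a NULL/default value.

    Fields not listed in `unset_values` are left as is (even if None).
    """
    return {
        key: value
        for key, value in doc.items()
        if value not in unset_values.get(key, ())
    }


def missing_as_true(field):
    """Query clause matching documents where `field` is true or not set."""
    return {
        "bool": {
            "should": [
                {"term": {field: True}},
                {"bool": {"must_not": {"exists": {"field": field}}}},
            ],
            "minimum_should_match": 1,
        }
    }


doc = {"taskid": 1, "hs06": None, "toths06": 0, "bytes": -1,
       "hashtag_list": None, "events": 100}
print(drop_unset(doc))
print(missing_as_true("deleted"))
```

Note that `hashtag_list` is kept even when it is NULL, matching the “leave as is” decision for that field.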

Plans/priority outline

General:

BigData&AI paper
NEC slides
NEC paper

ATLAS:

Add “_incomplete” functionality
Fix AMI issue (single-record mode)
Sample statistics
Batch mode
ES schema update
REST API

NRC KI:

Join Slurm DBs in ES index

data4es: update scenario

[Diagram: two runs of the data4es pipeline]

run/data4es-start --skip=XXX,ZZZ
stages shown: Stage 009 (Oracle), Stage XXX (--skip), Stage YYY, Stage 019 (--update), Stage 069 (load to ES), Stage ZZZ (--skip)

run/data4es-start
stages shown: Stage 009 (Oracle), Stage XXX, Stage YYY, Stage 019, Stage 069 (load to ES), Stage ZZZ

If stage logic operation is not fully accomplished (due to --skip or some failure): mark output message as “incomplete” and push forward:
{"taskid": …, "_incomplete": true}

If message is incomplete (and not for UPDATE), add marker data field:
{"taskid": …, "update_required": true}

Treat all messages as “for update”
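The marker handling described above can be sketched as two small functions. This is an illustration only: the function names are hypothetical, and removing the `_incomplete` service field before loading to ES is an assumption (the slides do not say whether it is stored).

```python
def stage_output(msg, fully_processed):
    """Mark the output message as incomplete if the stage logic was skipped or failed."""
    out = dict(msg)
    if not fully_processed:
        out["_incomplete"] = True
    return out


def prepare_for_es(msg, for_update=False):
    """Translate the '_incomplete' service marker into a data field.

    Assumption: the service field itself is not written to ES.
    """
    out = {key: value for key, value in msg.items() if key != "_incomplete"}
    if msg.get("_incomplete") and not for_update:
        out["update_required"] = True
    return out


skipped = stage_output({"taskid": 42}, fully_processed=False)
print(skipped)
print(prepare_for_es(skipped))
print(prepare_for_es(skipped, for_update=True))
```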

AMI issue

Too many unoptimised requests during archive data reload have overloaded AMI
The AMI team gave us some recommendations on how to optimise the requests (see https://trello.com/c/x0hvd4aQ)
It also turned out that there are some mistakes => we do not have all the data we need
Batch processing gets higher priority

DKB&ProdSys: plans

Near-term (May 2019):

“safe” update for the archived data (in progress)

“update” scenario (read from ES before writing);

consistency control tools; (in progress)
task chain statistics:

resource usage for the whole sample production;

further development of the REST API (testing and improvements, proper documentation, new methods, ...).

Medium-term (July 2019):

improve ES storage scheme (make things work faster):

involves the integration process reformation;

“safe” update for the archived data:

medium (local) storages for integration nodes;

address new use-cases.

Long-term (Dec 2019):

additional tools to help in the new ATLAS workflows validation;
detection of possible inconsistency and unexpected behaviour.
