2 of 10

22.08-29.08

Current status:

data4es: updates (#253):

Stage 019: handle ‘_incomplete’ marker (#264) (awaits review)
discussion on default values / error handling in stages

paper for BigData&AI Conference: (submitted)

#278 (akaida, DF README update) (changes required)

#282 (vaulov, Stage 095 improvements) (reviewed)

SSL errors

ToDo:

Review PRs:

#278 (akaida, DF README update)
#282 (vaulov, Stage 095 improvements)

DKB weekly

29/08/2019

3 of 10

DKB tasks overview

Chicago ES now requires authorisation (half-solved):

temporary: adjust data4es process at aiatlas171 (done)
pyDKB.storages submodule (#277) (postponed)
Stage 025: use some config instead of hardcoded Chicago ES access parameters
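
A minimal sketch of what reading the Chicago ES access parameters from a config (instead of hardcoding them in Stage 025) could look like. The section name, option names and values below are hypothetical placeholders, not the actual DKB configuration format.

```python
import configparser

# Hypothetical config contents; in practice this would be a file kept
# outside the stage code, so credentials are not hardcoded.
SAMPLE_CONFIG = """
[chicago_es]
host = es.example.org
port = 9200
user = dkb_reader
password = change-me
"""


def load_es_params(text, section="chicago_es"):
    """Parse ES access parameters from INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cfg = parser[section]
    return {
        "host": cfg.get("host"),
        "port": cfg.getint("port", fallback=9200),
        "user": cfg.get("user"),
        "password": cfg.get("password"),
    }


params = load_es_params(SAMPLE_CONFIG)
```

With a real file, `parser.read(path)` would replace `parser.read_string(text)`.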

Stage 009 (Oracle Connector): regular interval reprocessing (trello) (mgolosova)

ATLAS data sample statistics:

specification (mgolosova, mborodin)
...

AMI-related stage improvement (trello) (vaulov)

duplicate output_dataset records in ES with wrong _parent (trello) (akaida)

5 of 10

ES: data update (1)

Q1: should service fields stored in ES start with ‘_’, just like dataflow service fields?

Q2: an update operation may erase existing values if some field in the update document is set to NULL or to some default value.

Possible NULL/default values:

Stage 017:

chain_data/chain_id: default value, derived from taskid

rely on Oracle data stability: once chain info has been found, it is not going to disappear;
do not set at all, if not defined (remove from message);

category: default value ‘Uncategorized’

leave as is: the value is derived from taskname and hashtag_list, and if they have changed so that we cannot define the category, it should be set to ‘Uncategorized’;

hashtag_list/output_formats: NULL, if Oracle returns empty values

leave as is: if values were removed from Oracle, they should be removed from ES@DKB as well for consistency.

NOTE: if we decide not to set some field at all, we must make sure that all ES requests use constructions like “{must: {exists: {field: FIELD_NAME}}}” wherever it is required!
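
The “exists” construction from the note can be sketched as a full ES query body. This is only an illustration: the helper name and the example field are assumptions, not part of the DKB code.

```python
import json


def require_field(field_name, extra_must=None):
    """Build an ES query body that matches only documents
    in which the given field is set (not missing and not null)."""
    must = [{"exists": {"field": field_name}}]
    if extra_must:
        must.extend(extra_must)
    return {"query": {"bool": {"must": must}}}


# Example: only documents that actually have 'hs06' stored.
query = require_field("hs06")
print(json.dumps(query, indent=2))
```

Requests that aggregate or filter on a field that may be left unset would need such a clause, otherwise documents without the field are silently included or excluded depending on the query type.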

6 of 10

ES: data update (2)

Stage 025:

hs06: NULL if not found in Chicago ES

do not set at all;

toths06* (toths06/toths06_failed/toths06_finished): default value 0

do not set at all, if not defined or 0;

Stage 091:

[primary_input_]events: default value NULL (if not found or ‘null’ in Rucio)

do not set at all;

[input_]bytes: default value -1

do not set at all, to keep values that might have been stored earlier;

[primary_input_]deleted: default value TRUE (if not found or bytes == -1)

do not set at all; in requests treat “not set” as TRUE;

Stage 093:

data_format: default value empty array;

leave as is: the value is derived from the dataset name and is not going to change;

Stage 095:

dataset physics parameters from AMI: not set unless some value is provided by AMI

leave as is.
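
A minimal sketch of the “do not set at all” proposals for Stages 025 and 091: placeholder values are removed from the update document, so that an ES update does not overwrite values stored earlier. The function names and the rule table are assumptions made for illustration; the field names are taken from the lists above.

```python
def _is_none(value):
    return value is None


def _is_none_or_zero(value):
    return value is None or value == 0


def _is_none_or_minus_one(value):
    return value is None or value == -1


# Field -> predicate telling whether the value is a "no data" placeholder.
PLACEHOLDER_RULES = {
    "hs06": _is_none,
    "toths06": _is_none_or_zero,
    "toths06_failed": _is_none_or_zero,
    "toths06_finished": _is_none_or_zero,
    "events": _is_none,
    "primary_input_events": _is_none,
    "bytes": _is_none_or_minus_one,
    "input_bytes": _is_none_or_minus_one,
}


def strip_placeholders(doc):
    """Return a copy of the update document without placeholder values."""
    return {
        key: value
        for key, value in doc.items()
        if not (key in PLACEHOLDER_RULES and PLACEHOLDER_RULES[key](value))
    }


def is_deleted(es_doc, field="deleted"):
    """Reader-side rule: treat a missing 'deleted' field as TRUE."""
    return es_doc.get(field, True)


update = strip_placeholders(
    {"taskid": 1, "hs06": None, "toths06": 0,
     "toths06_finished": 12, "bytes": -1, "events": 100}
)
```

Fields that are “left as is” (category, hashtag_list, data_format) are intentionally absent from the rule table and pass through unchanged.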

7 of 10

Plans/priority outline

General:

BigData&AI paper
NEC slides
NEC paper

ATLAS:

Add “_incomplete” functionality
Fix AMI issue (single-record mode)
Sample statistics
Batch mode
ES schema update
REST API

NRC KI:

Join Slurm DBs in ES index

8 of 10

data4es: update scenario

run/data4es-start --skip=XXX,ZZZ:

Stage 009 (Oracle)
Stage XXX --skip
Stage YYY
Stage 019 --update
Stage 069 (load to ES)
Stage ZZZ --skip

run/data4es-start:

Stage 009 (Oracle)
Stage XXX
Stage YYY
Stage 019
Stage 069 (load to ES)
Stage ZZZ

If the stage logic is not fully accomplished (due to --skip or some failure): mark the output message as “incomplete” and push it forward: {“taskid”: …, “_incomplete”: true}

If a message is incomplete (and not for UPDATE), add a marker data field: {“taskid”: …, “update_required”: true}

Treat all messages as “for update”
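
The marker handling described above can be sketched as follows. The function names and the `update_mode` flag are hypothetical; whether “_incomplete” is stripped before loading to ES is not specified here, so the sketch leaves it in place.

```python
def finalize_stage_output(msg, completed):
    """Mark the output message as incomplete if the stage logic
    was skipped or failed, then pass it on."""
    out = dict(msg)
    if not completed:
        out["_incomplete"] = True
    return out


def mark_update_required(msg, update_mode):
    """For an incomplete message that is not processed 'for update',
    add the 'update_required' marker data field."""
    out = dict(msg)
    if out.get("_incomplete") and not update_mode:
        out["update_required"] = True
    return out


skipped = finalize_stage_output({"taskid": 42}, completed=False)
normal_run = mark_update_required(skipped, update_mode=False)
update_run = mark_update_required(skipped, update_mode=True)
```

In the update run every message is treated as “for update”, so no extra marker is added.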

9 of 10

AMI issue

Too many unoptimised requests during archive data reload have overloaded AMI
The AMI team gave us some recommendations on how to optimise them (see https://trello.com/c/x0hvd4aQ)
It also turned out that there are some mistakes => we do not have all the data we need
Batch processing gets more priority

10 of 10

DKB&ProdSys: plans

Near-term (May 2019):

“safe” update for the archived data (in progress)

“update” scenario (read from ES before writing);

consistency control tools; (in progress)
task chain statistics:

resource usage for the whole sample production;

further development of the REST API (testing and improvements, proper documentation, new methods, ...).

Medium-term (July 2019):

improve ES storage scheme (make things work faster):

involves the integration process reformation;

“safe” update for the archived data:

medium (local) storages for integration nodes;

address new use-cases.

Long-term (Dec 2019):

additional tools to help in the new ATLAS workflows validation;
detection of possible inconsistency and unexpected behaviour.
