
DKB weekly

DKB meeting, 24/10/2019

Marina Golosova


18.10-24.10


ToDos (not updated)

  • data4es: updates (#253):
    • Stage 019: handle ‘_incomplete’ marker (#264) (todo: address review comments)
    • ‘update’ operation in data4es run script (#253) (ready for review)
    • update stages according to the discussion
  • ES scheme revision and update:
    • pt.2 (nested scheme): es-scheme-nested (todo: commit latest changes)
  • Re-reviewed: #282 (vaulov, stage 095 update) (todo: check comments, re-review)
  • Issues with Chicago ES: disabled stage 025, so no CPU usage info for now
  • PRs:
    • #282 (vaulov, Stage 095 improvements) -- re-review
    • #188 (vaulov, Stage 055 docs) -- re-review
    • #264 (mgolosova, Stage 019: ‘_incomplete’ marker) -- address comments
    • es-scheme-nested -- commit/push latest version of test utils
    • #277 (mgolosova, pyDKB.storages) -- keep on working
  • API: implement methods for queries currently in use


BACKUP


ES mapping

PR: #285

Field type characteristics:

  • text:
    • usage: wildcard search, match search (individual tokens)
  • keyword:
    • usage (queries): wildcard search, exact match, grouping (aggregation ‘terms’)
    • case-sensitive
    • may be “normalized” (e.g. to lower/upper case) to avoid case sensitivity, but then the “terms” aggregation returns the normalized values (not good for CamelCaseValues)
  • numeric/date:
    • usage: exact match, range search, grouping (aggregation ‘terms’)
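
The normalization caveat for keyword fields can be illustrated with a minimal sketch. The settings and field fragments follow the Elasticsearch custom normalizer syntax, but the names `lowercase_normalizer` and `hashtag` are hypothetical, and how the fragments are embedded into an index body depends on the ES version. The `terms_buckets` helper only imitates locally which keys a ‘terms’ aggregation would group on; it does not talk to ES.

```python
from collections import Counter

# Analysis settings defining a lower-casing normalizer (hypothetical name).
ANALYSIS_SETTINGS = {
    "analysis": {
        "normalizer": {
            "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
        }
    }
}

# Field mapping fragment for a hypothetical keyword field "hashtag".
FIELD_MAPPING = {
    "hashtag": {"type": "keyword", "normalizer": "lowercase_normalizer"}
}


def terms_buckets(values, normalize=None):
    """Imitate locally the bucket keys of a 'terms' aggregation."""
    if normalize is not None:
        values = [normalize(v) for v in values]
    return dict(Counter(values))


raw = ["MadGraphPythia", "madgraphpythia", "MadGraphPythia"]

# Without normalization: case-sensitive buckets, original spelling kept.
plain_buckets = terms_buckets(raw)

# With normalization: one bucket, but the CamelCase spelling is lost.
normalized_buckets = terms_buckets(raw, normalize=str.lower)
```

With normalization all three values fall into a single bucket, but its key is the lower-cased form, which is the drawback noted for CamelCaseValues.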


ES mapping

  • dataset/task name fields mapping now:
      {
        "type": "text",
        "analyzer": "task_dataset_analyzer", /* allows search by individual name fields */
        "fields": {"keyword": { "type": "keyword", "ignore_above": 256 }} /* allows exact match search */
      }
  • “match” query with “minimum_should_match: 100%”:
    • query: "mc12_8TeV.229602.MadGraphPythia_AUET2B_CTEQ6L1_pMSSM_QCD_496706509.merge.e3481_a220_a263"
    • result:
        "mc12_8TeV..._a263_a264_r4540_p1328"
        "mc12_8TeV..._a263_a264_r4540_t87"
        "mc12_8TeV..._a263_a264_r4540"
        "mc12_8TeV..._a263"

  • is it any better than “keyword” storage + “term” query?..
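
To make the comparison concrete, here is a minimal local sketch of why the “match” query returns longer names as well. It assumes that `task_dataset_analyzer` splits names on ‘.’ and ‘_’ (an assumption, the analyzer definition is not shown here); the names, the field name `taskname` and both helper functions are hypothetical.

```python
import re


def tokenize(name):
    """Split a name on '.' and '_' (assumed analyzer behaviour)."""
    return [t.lower() for t in re.split(r"[._]", name) if t]


def match_all_tokens(query, name):
    """Imitate 'match' with minimum_should_match=100%:
    every query token must occur in the document, in any position."""
    return set(tokenize(query)) <= set(tokenize(name))


# Hypothetical names, shaped like the example above.
QUERY = "proj.123.Gen_Tune.merge.e1_a2_a3"
NAMES = [
    "proj.123.Gen_Tune.merge.e1_a2_a3",
    "proj.123.Gen_Tune.merge.e1_a2_a3_a4_r5",
    "proj.123.Gen_Tune.merge.e1_a2_a3_a4_r5_p6",
    "proj.123.Gen_Tune.merge.e1_a2",
]

match_hits = [n for n in NAMES if match_all_tokens(QUERY, n)]
term_hits = [n for n in NAMES if n == QUERY]  # exact match on the keyword subfield

# Query bodies the two approaches would correspond to.
MATCH_QUERY = {
    "query": {
        "match": {"taskname": {"query": QUERY, "minimum_should_match": "100%"}}
    }
}
TERM_QUERY = {"query": {"term": {"taskname.keyword": QUERY}}}
```

Under these assumptions the “match” variant finds every name that contains all query tokens (the first three), while the “term” variant finds only the exact name.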


DKB tasks overview

  • Chicago ES now requires authorisation (half-solved):
    • temporary: adjust data4es process at aiatlas171 (done)
    • pyDKB.storages submodule (#277) (postponed)
    • Stage 025: use some config instead of hardcoded Chicago ES access parameters

  • Stage 009 (Oracle Connector): regular interval reprocessing (trello) (mgolosova)

  • ATLAS data sample statistics:
    • specification (mgolosova, mborodin)
    • ...

  • AMI-related stage improvement (trello) (vaulov)

  • duplicate output_dataset records in ES with wrong _parent (trello) (akaida)


ES: data update (1)

Q1: should service fields stored in ES start with ‘_’, just like dataflow service fields?

Q2: the update operation may erase existing values if some field in the update document is set to NULL or to a default value.

Possible NULL/default values:

  • Stage 017:
    • chain_data/chain_id: default value, derived from taskid
      • rely on Oracle data stability: if chain info could be found once, it will not disappear later;
      • do not set at all, if not defined (remove from message);
    • category: default value ‘Uncategorized’
      • leave as is: the value is derived from taskname and hashtag_list, and if they have changed so that we can not define category, it should be set to ‘Uncategorized’;
    • hashtag_list/output_formats: NULL, if Oracle returns empty values
      • leave as is: if values were removed from Oracle, they should be removed from ES@DKB as well for consistency.

NOTE: if we decide not to set some field at all, we must make sure that all ES requests use constructions like “{must: {exists: {field: FIELD_NAME}}}” wherever it is required!
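
A minimal sketch of the “do not set at all” approach and of the accompanying `exists` filter. All function names are hypothetical, `None` stands in for “not defined”, and `partial_update` only imitates at the top level how a partial document update overwrites stored fields; it is not the DKB or ES implementation.

```python
def drop_unset(message, defaults=None):
    """Remove fields that are None or equal to their 'no data' default."""
    defaults = defaults or {}
    result = {}
    for key, value in message.items():
        if value is None:
            continue
        if key in defaults and value == defaults[key]:
            continue
        result[key] = value
    return result


def partial_update(stored, update_doc):
    """Imitate a partial update: fields present in update_doc overwrite stored ones."""
    merged = dict(stored)
    merged.update(update_doc)
    return merged


def exists_filter(field_name):
    """Filter for documents in which the field is actually set."""
    return {"bool": {"must": {"exists": {"field": field_name}}}}


stored = {"taskid": 1, "chain_id": 42}
update = {"taskid": 1, "chain_id": None}  # chain info not found this time

naive = partial_update(stored, update)             # erases chain_id
safe = partial_update(stored, drop_unset(update))  # keeps the stored value
```

Once a field may be absent, every request that relies on it has to be combined with `exists_filter(...)`.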


ES: data update (2)

  • Stage 025:
    • hs06: NULL if not found in Chicago ES
      • do not set at all;
    • toths06*(toths06/toths06_failed/toths06_finished): default value 0
      • do not set at all, if not defined or 0;
  • Stage 091:
    • [primary_input_]events: default value NULL (if not found or ‘null’ in Rucio)
      • do not set at all;
    • [input_]bytes: default value -1
      • do not set at all -- to keep values that might be stored earlier;
    • [primary_input_]deleted: default value TRUE (if not found or bytes == -1)
      • do not set at all; in requests treat “not set” as TRUE;
  • Stage 093:
    • data_format: default value empty array;
      • leave as is: the value is derived from the dataset name and is not going to change;
  • Stage 095:
    • dataset physics parameters from AMI: not set unless some value provided by AMI
      • leave as is.
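
The rule “in requests treat ‘not set’ as TRUE” for the `deleted` fields could look like the sketch below. The function names are hypothetical; the query uses the standard `bool`/`term`/`exists` constructions, and `is_deleted` is a local equivalent for documents that have already been fetched.

```python
def deleted_filter(field="deleted"):
    """Match documents where `field` is true OR not set at all."""
    return {
        "bool": {
            "should": [
                {"term": {field: True}},
                {"bool": {"must_not": {"exists": {"field": field}}}},
            ],
            "minimum_should_match": 1,
        }
    }


def is_deleted(doc, field="deleted"):
    """Local equivalent: a missing (or null) field counts as True."""
    value = doc.get(field)
    return value is None or value is True
```

The same filter can be reused for `primary_input_deleted` by passing the field name.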


Plans/priority outline

General:

  • BigData&AI paper
  • NEC slides
  • NEC paper

ATLAS:

  • Add “_incomplete” functionality
  • Fix AMI issue (single-record mode)
  • Sample statistics
  • Batch mode
  • ES schema update
  • REST API

NRC KI:

  • Join Slurm DBs in ES index


data4es: update scenario

[Diagram: two data4es runs, stages listed as shown]

run/data4es-start --skip=XXX,ZZZ:
  • Stage 009 (Oracle)
  • Stage XXX (--skip)
  • Stage YYY
  • Stage 019 (--update)
  • Stage 069 (load to ES)
  • Stage ZZZ (--skip)

run/data4es-start:
  • Stage 009 (Oracle)
  • Stage XXX
  • Stage YYY
  • Stage 019
  • Stage 069 (load to ES)
  • Stage ZZZ

Notes:
  • If stage logic operation is not fully accomplished (due to --skip or some failure): mark output message as “incomplete” and push forward: {“taskid”: …, “_incomplete”: true}
  • If message is incomplete (and not for UPDATE), add marker data field: {“taskid”: …, “update_required”: true}
  • Treat all messages as “for update”
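
The marker rules above can be sketched as follows. This is only an illustration under assumptions: the function names and the `update_mode` flag are hypothetical, and what Stage 019 does with the `_incomplete` service field afterwards is not specified here, so the sketch leaves it untouched.

```python
def finish_stage(message, completed):
    """Stage output: flag the message if the stage logic did not fully run
    (stage skipped with --skip, or some failure)."""
    out = dict(message)
    if not completed:
        out["_incomplete"] = True
    return out


def stage_019(message, update_mode=False):
    """Add the 'update_required' data field to incomplete messages,
    unless the run treats all messages as 'for update' (--update)."""
    out = dict(message)
    if out.get("_incomplete") and not update_mode:
        out["update_required"] = True
    return out


skipped = finish_stage({"taskid": 1}, completed=False)
regular_run = stage_019(skipped)                   # marker added
update_run = stage_019(skipped, update_mode=True)  # no marker in --update mode
complete = stage_019(finish_stage({"taskid": 2}, completed=True))
```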


AMI issue

  • Too many unoptimised requests during the archive data reload overloaded AMI
  • The AMI team gave us some recommendations on how to optimise them (see https://trello.com/c/x0hvd4aQ)
  • It also turned out that there are some mistakes => we do not have all the data we need
  • Batch processing gets higher priority


DKB&ProdSys: plans

Near-term (May 2019):

  • “safe” update for the archived data (in progress)
    • “update” scenario (read from ES before writing);
  • consistency control tools; (in progress)
  • task chain statistics:
    • resource usage for the whole sample production;
  • further development of the REST API (testing and improvements, proper documentation, new methods, ...).

Medium-term (July 2019):

  • improve ES storage scheme (make things work faster):
    • involves the integration process reformation;
  • “safe” update for the archived data:
    • medium (local) storages for integration nodes;
  • address new use-cases.

Long-term (Dec 2019):

  • additional tools to help in the new ATLAS workflows validation;
  • detection of possible inconsistency and unexpected behaviour.
