1 of 18

Data Knowledge Base

Torre Wenaus, BNL

PanDA Workshop

Apr 21-22 2016

CERN

2 of 18

DKB - what for?

  • The Physics Coordinator said to Simone and me this week that bigpanda (speaking of both its panda and prodsys monitoring aspects) “is fantastic”
  • This was in the context of a conversation on whether we can/should work to capture and present the whole process from physicist idea ➡ production intent ➡ production request ➡ production status ➡ completion of the full processing chain ➡ available data
    • With ability to drill down within that chain for processing status, data availability, configuration and so on, drawing on content in prodsys, AMI, Rucio etc.
  • To which the PC’s response was that this “would be fantastic”
    • ...and “but this would be a lot of effort”

This is what we have been calling the “data knowledge base” or “data product catalog” for a little over a year (Feb 2015 ADC weekly presentation by TW), with “a lot of effort” indeed being the sticking point in making much progress

But there has been effort and even some progress


3 of 18

A concrete example of the need

Question posed this week: There are MC15c dijet samples with the 2015 mu profile (r7773)... However, I cannot find the request for these in the many Google Docs... so I cannot tell who requested this sample, and thus I don’t know where to look for the JIRA to get the link to the BigPanda table with the status... Would someone tell me the procedure?

Answered thus: There is no procedure. It needs to be done by hand. I either look through all the CP spreadsheets for it (on google), or I get the taskID from the dataset and then use a bunch of panda pages to find the request.

And "It is not possible to back link from a given dataset to the MC request that make it."

The actual procedure laid out...


4 of 18

Should we be asking physicists to do this?

1) rucio list-dids mc15_13TeV:mc15_13TeV.361025.Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ5W.merge.AOD.e3668_s2576_s2132_r7773_r7676

2) Write down one of the dataset's tids. I took tid07978260

3) Go to https://prodtask-dev.cern.ch/prodtask/task_table/#/?time_period=all&task_type=production

4) Type 7978260 into "Task ID" box and hit "Update table"

5) See the request is 6416

6) Go to https://prodtask-dev.cern.ch/prodtask/inputlist_with_request/6416/

7) See that it was the CP FTAG group; the sample is at slice 24 and 100% done.
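
This lookup chain is mechanical enough to script. A minimal sketch in Python, assuming only the ‘_tidNNNNNNNN’ naming convention visible in step 2; the URLs are the ones from steps 3 and 6, and no prodtask json API is assumed:

import re

def task_id_from_tid(name):
    """Pull the numeric task ID out of a '_tidNNNNNNNN' dataset name (step 2)."""
    m = re.search(r"_tid0*(\d+)", name)
    if m is None:
        raise ValueError("no _tid suffix in %r" % name)
    return int(m.group(1))

# A tid-dataset name of the shape returned by step 1 (the '_tid..._00'
# suffix is illustrative):
name = ("mc15_13TeV.361025.Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ5W"
        ".merge.AOD.e3668_s2576_s2132_r7773_r7676_tid07978260_00")
task_id = task_id_from_tid(name)   # -> 7978260
print("Task ID:", task_id)
print("Paste it into the Task ID box at https://prodtask-dev.cern.ch/"
      "prodtask/task_table/#/?time_period=all&task_type=production")
# With the request number read off that table (6416 in this example):
print("https://prodtask-dev.cern.ch/prodtask/inputlist_with_request/6416/")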


5 of 18

MC Production Campaign Monitoring


6 of 18

Twiki-based production summaries (autogenerated)


7 of 18

Manually maintained spreadsheets


8 of 18

ATLAS S&C should improve on this:


“Want to know the status of your samples? Ask us.”

“A standard analysis needs ~300-500 samples for their analysis.

We will now follow your recommendations and advise our group to contact atlas-phys-mcprod-team@cern.ch for each sample where they want to get information about the production status.

Maybe you are lucky and won't get any email, maybe you will get 500 mails per day with inquiries - hard to predict ;-)”

9 of 18

DKB Inputs & Elements

  • Data sources
    • PanDA/JEDI, DEFT, Rucio, AMI, AGIS, JIRA, googledocs, ...
  • Data source APIs
    • Direct, rather than cached, access to back end sources
  • Back end data caches
    • DKB internal caches synced from data sources (see the sketch after this list)
  • Back end original data repositories
    • Data and user input originating with the DKB
  • Front end web UI, app
  • Programmatic APIs, CLIs
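
To make the direct vs. cached distinction above concrete, a minimal sketch (Python, all names hypothetical, not an actual DKB interface): a source can be queried directly, or through a DKB-internal cache synced from it.

import time

class DataSource:
    """Direct access to a back end source (PanDA/JEDI, Rucio, AMI, ...)."""
    def __init__(self, name, fetch):
        self.name = name
        self.fetch = fetch   # callable: key -> record, hits the real source

class CachedSource:
    """A DKB internal cache synced from a DataSource."""
    def __init__(self, source, ttl=3600):
        self.source = source
        self.ttl = ttl       # seconds a synced record is considered fresh
        self._cache = {}

    def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and time.time() - hit[0] < self.ttl:
            return hit[1]                    # serve the synced copy
        record = self.source.fetch(key)      # fall through to direct access
        self._cache[key] = (time.time(), record)
        return record

# Usage: rucio = CachedSource(DataSource("rucio", my_rucio_lookup)),
# where my_rucio_lookup is whatever callable wraps the Rucio client.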


10 of 18

Current work

  • DKB R&D Maria Grigorieva and Maksim Gubin, NRC KI & TPU
    • Currently R&D project, proceed to realistic use cases after a May technical/planning meeting at CERN
    • C.f. Maksim’s talk
  • KB prototype, TW
    • ‘General purpose’ KB emphasising the common elements of
      • Diverse back end data sources
        • MySQL, Oracle, Redis (in-memory cache), AWS DynamoDB, AWS S3, flat server-side json files, CouchDB/PouchDB (server/client data flow/sync), OAuth2 (authentication via Google, Github, Facebook etc)
      • Entity-relation model for the KB’s internal organization of knowledge (illustrated after this list)
      • Rich, intelligent front end at the client (browser based javascript app) emphasising
        • Ease of use and data navigation for users
        • Ease of data entry for contributors
        • Non-intrusive infrastructure: easy authentication, intelligent/transparent data flow between back end and smart client for low latency and client-local data manipulation
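
As an illustration of the entity-relation idea named above (shapes assumed here, not the prototype’s actual schema): knowledge lives as typed entities plus typed, directed relations, so the dataset ➡ task ➡ request back-link that slide 3 says is impossible today becomes a two-hop walk.

from collections import namedtuple

Entity = namedtuple("Entity", "kind key attrs")     # e.g. kind = 'dataset'
Relation = namedtuple("Relation", "subj verb obj")  # a directed, typed edge

entities = {
    "ds1": Entity("dataset", "mc15_13TeV.361025...r7773_r7676", {}),
    "t1":  Entity("task", 7978260, {"status": "done"}),
    "r1":  Entity("request", 6416, {"group": "CP FTAG"}),
}
relations = [
    Relation("ds1", "produced_by", "t1"),
    Relation("t1", "belongs_to", "r1"),
]

def follow(start, verb):
    """One hop along a relation type from an entity id."""
    return [r.obj for r in relations if r.subj == start and r.verb == verb]

task = follow("ds1", "produced_by")[0]
request = follow(task, "belongs_to")[0]
print(entities[request])   # -> the request entity, group and all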


11 of 18

KB Prototype Applications

  • One is in production and has nothing to do with the DKB discussed here (except all the commonalities of a KB): hepsoftware.org
    • Low data volume (MySQL resident), all human-entered, can get a feel for the client side app
  • And an actual DKB one, but it hasn’t seen the light of day
    • Loads json from pandamon (mainly), json from AGIS, and panda data from Oracle
    • Some caching and preprocessing server side, but it sends a lot of data to the client
    • Client holds local data and can locally impose and change selections a la the attribute summaries in pandamon, with dynamic histogram and pie chart plotting
    • But it is hung up on the server-to-client data flow and on having the client cope… it crashes Chrome all the time
    • Hence it hasn’t seen the light of day
  • Anyway… progress needs to be in the hands of people with time for it, not a hobby


12 of 18

Proposed DKB objectives, deliverables

  • Identify and aim for the low-hanging fruit most useful to ATLAS analyzers
  • There is a fuzzy boundary between some of the bigpanda monitoring requests today and DKB, sitting somewhere in the territory of the Jose monitor
    • Jose monitor: mine the panda/prodsys DBs to build and cache high level summaries server side and deliver json to a presentation client (sketched after this list)
  • Proposal: put the integration of the Jose monitor onto the DKB side of the boundary as deliverable 1, as long as it can be done without adding substantial delay
  • Beyond that, let the wish lists of PWGs, production, ops, PCs etc. drive the list; there is enough there to fill a busy and practically directed program
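
The Jose monitor pattern is simple enough to state in a few lines (a sketch with hypothetical names, not the actual monitor code): a server-side job rolls the DB up into a summary cached as json, and the presentation client only ever fetches that json.

import json

def build_summary(rows):
    """Roll (campaign, status, count) rows mined from the panda/prodsys DBs
    up into the per-campaign counts a presentation client would show."""
    summary = {}
    for campaign, status, n in rows:
        summary.setdefault(campaign, {})[status] = n
    return summary

def refresh_cache(rows, path="summary.json"):
    """Run periodically server side; clients GET the json, never the DB."""
    with open(path, "w") as f:
        json.dump(build_summary(rows), f)

refresh_cache([("MC15c", "done", 812), ("MC15c", "running", 47)])  # toy rows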


13 of 18

DKB deliverable ideas

  • Full gestation-through-completion view of a production campaign and/or a specific analysis
    • Today, the production system becomes aware of a production objective when a request is entered
    • This happens relatively late, so the system is not aware of the full scope of a campaign (what fraction of the campaign is complete? We don’t know)
    • Introduce a pre-request stage to declare intent to produce before releasing it as a request
    • Users enter pre-requests… how? Googledoc spreadsheets programmatically read by the DKB? A DKB web interface? (A sketch of the record itself follows this list)
    • Provide high-level views with drill-down showing campaign progress
    • Full bi-directional cross-referencing between (pre)request, production progress, produced samples
    • Correlated also with an analysis, so “show me the 500 samples needed by my analysis and the status of producing them” is an answerable question
    • Provide drill-down: marshal from original sources and present information associated with an analysis or campaign
  • Utilize BigQuery for any of it?
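
A sketch of what a machine-readable pre-request might carry (field names hypothetical); however it is entered, parsed googledoc or DKB web form, this is the record that makes “what fraction of the campaign is complete?” answerable:

pre_request = {
    "title": "MC15c dijet samples, 2015 mu profile",  # declared intent
    "group": "CP FTAG",
    "campaign": "MC15c",
    "samples": ["mc15_13TeV.361025.Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ5W"],
    "request_id": None,   # filled in once released as a prodsys request
}

def campaign_completion(pre_requests, done_fraction):
    """Fraction complete across everything declared for a campaign;
    done_fraction(sample) would be mined from prodsys/PanDA."""
    samples = [s for p in pre_requests for s in p["samples"]]
    return sum(done_fraction(s) for s in samples) / len(samples)

print(campaign_completion([pre_request], lambda s: 1.0))   # -> 1.0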


14 of 18

The long view:

DKB and the Event Streaming Service


Once we have a DKB, it can serve as the information gathering point and hub behind intelligent, efficient data delivery through the Event Streaming Service

15 of 18

Supplementary


16 of 18

The Event Service 2016


The 2015 Event Service is missing its dataflow component, the Event Streaming Service

17 of 18

The Event Streaming Service (ESS)

  • The ES streams in the processing assignments (from PanDA/JEDI) and streams out the outputs (to an object store) with fine granularity, but today it does conventional WAN data access from the workers to reach the input data itself
  • Objective of the ESS is to do for input dataflow what the ES does for the processing: feed the client a fine-grained, efficient, dynamic stream of exactly what’s needed for the task
    • Stream exactly/only the data needed by clients, managed at the client asynchronously from the processing, avoiding WAN data delivery waits (a client-side sketch follows this list)
  • The ESS needs to be ‘data aware’ in order to deliver exactly/only what’s needed: respond to requests for ‘science data objects’ by intelligently marshaling and sending the data needed
  • It needs to be ‘intelligent’ in the same way the ES is informed by PanDA/JEDI’s knowledge of the processing landscape
  • ESS encompasses
    • CDN-like optimization of data sourcing ‘close’ to the client
    • Intelligent cache management/exploitation at the client
    • Knowledge of the data itself sufficient to intelligently skim/slim during marshaling
    • Eventually? Servicing some requests via processing on demand rather than serving pre-existing data (replacing storage with cheaper CPU cycles)
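
A client-side sketch of the dataflow idea (Python; everything here is hypothetical, since the ESS does not yet exist): event ranges are prefetched into a small bounded buffer asynchronously from the processing, so the processing loop never waits on the WAN.

import queue, threading

def prefetch(event_ranges, fetch, buf):
    """Producer: stream exactly/only the requested ranges from the ESS."""
    for er in event_ranges:
        buf.put(fetch(er))      # WAN delivery happens ahead of processing
    buf.put(None)               # end-of-stream marker

def process_all(event_ranges, fetch, process, depth=4):
    buf = queue.Queue(maxsize=depth)    # bounded: a cache, not a replica
    threading.Thread(target=prefetch, args=(event_ranges, fetch, buf),
                     daemon=True).start()
    while True:
        data = buf.get()
        if data is None:
            return
        process(data)           # consumer never blocks on the network

# Usage: process_all(ranges, ess_fetch, run_payload)  (names hypothetical)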


18 of 18

Building the ESS

Two primary components. The first is a Data Streaming Service:

  • CDN-like intelligence in efficient data delivery, with minimal replication
  • Data marshaling
  • Smart local caching

Informed by the second component, the Data Knowledge Base, which provides the intelligence on

  • Dynamic resource landscape
  • Science data object (SDO) knowledge
  • Analysis processes & priorities
  • ML-derived predictive knowledge
