1 of 18

Data Knowledge Base

Torre Wenaus, BNL

PanDA Workshop

Apr 21-22 2016

CERN

2 of 18

DKB - what for?

  • The Physics Coordinator said to Simone and me this week that bigpanda (speaking of both its panda and prodsys monitoring aspects) “is fantastic”
  • This was in the context of a conversation on whether we can/should work to capture and present the whole process from physicist idea ➡ production intent ➡ production request ➡ production status ➡ completion of the full processing chain ➡ available data
    • With ability to drill down within that chain for processing status, data availability, configuration and so on, drawing on content in prodsys, AMI, Rucio etc.
  • To which the PC’s response was that this “would be fantastic”
    • ...and “but this would be a lot of effort”

This is what we have been calling the “data knowledge base” or “data product catalog” for a little over a year (Feb 2015 ADC weekly presentation by TW), with “a lot of effort” indeed being the sticking point in making much progress

But there has been effort and even some progress


3 of 18

A concrete example of the need

Question posed this week: There are MC15c dijet samples with the 2015 mu profile (r7773)... However, I cannot find the request for these in the many Google Docs... so I cannot tell who requested this sample, and thus I don’t know where to look for the JIRA to get the link to the BigPanda table with the status... Would someone tell me the procedure?

Answered thus: There is no procedure. It needs to be done by hand. I either look through all the CP spreadsheets for it (on google), or I get the taskID from the dataset and then use a bunch of panda pages to find the request.

And "It is not possible to back link from a given dataset to the MC request that make it."

The actual procedure laid out...


4 of 18

Should we be asking physicists to do this?

1) rucio list-dids mc15_13TeV:mc15_13TeV.361025.Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ5W.merge.AOD.e3668_s2576_s2132_r7773_r7676

2) Write down one of the dataset's tids. I took tid07978260

3) Go to https://prodtask-dev.cern.ch/prodtask/task_table/#/?time_period=all&task_type=production

4) Type 7978260 into "Task ID" box and hit "Update table"

5) See the request is 6416

6) Go to https://prodtask-dev.cern.ch/prodtask/inputlist_with_request/6416/

7) See that it was the CP FTAG group; the sample is at slice 24 and 100% done.
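
This lookup chain is mechanical enough to script. A minimal sketch in Python, assuming only the ‘_tidNNNNNNNN’ naming convention visible in step 2; the URLs are the ones from steps 3 and 6, and no prodtask json API is assumed:

import re

def task_id_from_tid(name):
    """Pull the numeric task ID out of a '_tidNNNNNNNN' dataset name (step 2)."""
    m = re.search(r"_tid0*(\d+)", name)
    if m is None:
        raise ValueError("no _tid suffix in %r" % name)
    return int(m.group(1))

# A tid-dataset name of the shape returned by step 1 (the '_tid..._00'
# suffix is illustrative):
name = ("mc15_13TeV.361025.Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ5W"
        ".merge.AOD.e3668_s2576_s2132_r7773_r7676_tid07978260_00")
task_id = task_id_from_tid(name)   # -> 7978260
print("Task ID:", task_id)
print("Paste it into the Task ID box at https://prodtask-dev.cern.ch/"
      "prodtask/task_table/#/?time_period=all&task_type=production")
# With the request number read off that table (6416 in this example):
print("https://prodtask-dev.cern.ch/prodtask/inputlist_with_request/6416/")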


5 of 18

MC Production Campaign Monitoring


6 of 18

Twiki-based production summaries (autogenerated)


7 of 18

Manually maintained spreadsheets


8 of 18

ATLAS S&C should improve on this:


“Want to know the status of your samples? Ask us.”

“A standard analysis needs ~300-500 samples for their analysis.

We will now follow your recommendations and advise our group to contact atlas-phys-mcprod-team@cern.ch for each sample where they want to get information about the production status.

Maybe you are lucky and won't get any email, maybe you will get 500 mails per day with inquiries - hard to predict ;-)”

9 of 18

DKB Inputs & Elements

  • Data sources
    • PanDA/JEDI, DEFT, Rucio, AMI, AGIS, JIRA, googledocs, ...
  • Data source APIs
    • Direct, rather than cached, access to back end sources
  • Back end data caches
    • DKB internal caches synced from data sources (see the sketch after this list)
  • Back end original data repositories
    • Data and user input originating with the DKB
  • Front end web UI, app
  • Programmatic APIs, CLIs
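
To make the direct vs. cached distinction above concrete, a minimal sketch (Python, all names hypothetical, not an actual DKB interface): a source can be queried directly, or through a DKB-internal cache synced from it.

import time

class DataSource:
    """Direct access to a back end source (PanDA/JEDI, Rucio, AMI, ...)."""
    def __init__(self, name, fetch):
        self.name = name
        self.fetch = fetch   # callable: key -> record, hits the real source

class CachedSource:
    """A DKB internal cache synced from a DataSource."""
    def __init__(self, source, ttl=3600):
        self.source = source
        self.ttl = ttl       # seconds a synced record is considered fresh
        self._cache = {}

    def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and time.time() - hit[0] < self.ttl:
            return hit[1]                    # serve the synced copy
        record = self.source.fetch(key)      # fall through to direct access
        self._cache[key] = (time.time(), record)
        return record

# Usage: rucio = CachedSource(DataSource("rucio", my_rucio_lookup)),
# where my_rucio_lookup is whatever callable wraps the Rucio client.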


10 of 18

Current work

  • DKB R&D Maria Grigorieva and Maksim Gubin, NRC KI & TPU
    • Currently R&D project, proceed to realistic use cases after a May technical/planning meeting at CERN
    • C.f. Maksim’s talk
  • KB prototype, TW
    • ‘General purpose’ KB emphasising the common elements of
      • Diverse back end data sources
        • MySQL, Oracle, Redis (in-memory cache), AWS DynamoDB, AWS S3, flat server-side json files, CouchDB/PouchDB (server/client data flow/sync), OAuth2 (authentication via Google, Github, Facebook etc)
      • Entity-relation model for the KB’s internal organization of knowledge (illustrated after this list)
      • Rich, intelligent front end at the client (browser based javascript app) emphasising
        • Ease of use and data navigation for users
        • Ease of data entry for contributors
        • Non-intrusive infrastructure: easy authentication, intelligent/transparent data flow between back end and smart client for low latency and client-local data manipulation
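
As an illustration of the entity-relation idea named above (shapes assumed here, not the prototype’s actual schema): knowledge lives as typed entities plus typed, directed relations, so the dataset ➡ task ➡ request back-link that slide 3 says is impossible today becomes a two-hop walk.

from collections import namedtuple

Entity = namedtuple("Entity", "kind key attrs")     # e.g. kind = 'dataset'
Relation = namedtuple("Relation", "subj verb obj")  # a directed, typed edge

entities = {
    "ds1": Entity("dataset", "mc15_13TeV.361025...r7773_r7676", {}),
    "t1":  Entity("task", 7978260, {"status": "done"}),
    "r1":  Entity("request", 6416, {"group": "CP FTAG"}),
}
relations = [
    Relation("ds1", "produced_by", "t1"),
    Relation("t1", "belongs_to", "r1"),
]

def follow(start, verb):
    """One hop along a relation type from an entity id."""
    return [r.obj for r in relations if r.subj == start and r.verb == verb]

task = follow("ds1", "produced_by")[0]
request = follow(task, "belongs_to")[0]
print(entities[request])   # -> the request entity, group and all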


11 of 18

KB Prototype Applications

  • One is in production and has nothing to do with the DKB discussed here (except all the commonalities of a KB): hepsoftware.org
    • Low data volume (MySQL resident), all human-entered, can get a feel for the client side app
  • And an actual DKB one, but it hasn’t seen the light of day
    • Loads json from pandamon (mainly), json from AGIS, and panda data from Oracle
    • Some caching and preprocessing server side, but it sends a lot of data to the client
    • Client holds local data and can locally impose and change selections a la the attribute summaries in pandamon, with dynamic histogram and pie chart plotting
    • But it is hung up on the server-to-client data flow and on having the client cope… it crashes Chrome all the time
    • Hence it hasn’t seen the light of day
  • Anyway… progress needs to be in the hands of people with time for it, not a hobby


12 of 18

Proposed DKB objectives, deliverables

  • Identify and aim for the low-hanging fruit most useful to ATLAS analyzers
  • There is a fuzzy boundary between some of the bigpanda monitoring requests today and DKB, sitting somewhere in the territory of the Jose monitor
    • Jose monitor: mine the panda/prodsys DBs to build and cache high level summaries server side and deliver json to a presentation client (sketched after this list)
  • Proposal: put the integration of the Jose monitor onto the DKB side of the boundary as deliverable 1, as long as it can be done without adding substantial delay
  • Beyond that, let the wish lists of PWGs, production, ops, PCs etc. drive the list; there is enough there to fill a busy and practically directed program
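
The Jose monitor pattern is simple enough to state in a few lines (a sketch with hypothetical names, not the actual monitor code): a server-side job rolls the DB up into a summary cached as json, and the presentation client only ever fetches that json.

import json

def build_summary(rows):
    """Roll (campaign, status, count) rows mined from the panda/prodsys DBs
    up into the per-campaign counts a presentation client would show."""
    summary = {}
    for campaign, status, n in rows:
        summary.setdefault(campaign, {})[status] = n
    return summary

def refresh_cache(rows, path="summary.json"):
    """Run periodically server side; clients GET the json, never the DB."""
    with open(path, "w") as f:
        json.dump(build_summary(rows), f)

refresh_cache([("MC15c", "done", 812), ("MC15c", "running", 47)])  # toy rows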


13 of 18

DKB deliverable ideas

  • Full gestation-through-completion view of a production campaign and/or a specific analysis
    • Today, the production system becomes aware of a production objective when a request is entered
    • This happens relatively late, so the system is not aware of the full scope of a campaign (what fraction of the campaign is complete? We don’t know)
    • Introduce a pre-request stage to declare intent to produce before releasing it as a request
    • Users enter pre-requests… how? Googledoc spreadsheets programmatically read by the DKB? A DKB web interface? (A sketch of the record itself follows this list)
    • Provide high-level views with drill-down showing campaign progress
    • Full bi-directional cross-referencing between (pre)request, production progress, produced samples
    • Correlated also with an analysis, so “show me the 500 samples needed by my analysis and the status of producing them” is an answerable question
    • Provide drill-down: marshal from original sources and present information associated with an analysis or campaign
  • Utilize BigQuery for any of it?
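
A sketch of what a machine-readable pre-request might carry (field names hypothetical); however it is entered, parsed googledoc or DKB web form, this is the record that makes “what fraction of the campaign is complete?” answerable:

pre_request = {
    "title": "MC15c dijet samples, 2015 mu profile",  # declared intent
    "group": "CP FTAG",
    "campaign": "MC15c",
    "samples": ["mc15_13TeV.361025.Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ5W"],
    "request_id": None,   # filled in once released as a prodsys request
}

def campaign_completion(pre_requests, done_fraction):
    """Fraction complete across everything declared for a campaign;
    done_fraction(sample) would be mined from prodsys/PanDA."""
    samples = [s for p in pre_requests for s in p["samples"]]
    return sum(done_fraction(s) for s in samples) / len(samples)

print(campaign_completion([pre_request], lambda s: 1.0))   # -> 1.0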


14 of 18

The long view:

DKB and the Event Streaming Service


Once we have a DKB, it can serve as the information gathering point and hub behind intelligent, efficient data delivery through the Event Streaming Service

15 of 18

Supplementary


16 of 18

The Event Service 2016


The 2015 Event Service is missing its dataflow component, the Event Streaming Service

17 of 18

The Event Streaming Service (ESS)

  • The ES streams in the processing assignments (from PanDA/JEDI) and streams out the outputs (to an object store) with fine granularity, but today it does conventional WAN data access from the workers to reach the input data itself
  • Objective of the ESS is to do for input dataflow what the ES does for the processing: feed the client a fine-grained, efficient, dynamic stream of exactly what’s needed for the task
    • Stream exactly/only the data needed by clients, managed at the client asynchronously from the processing, avoiding WAN data delivery waits (a client-side sketch follows this list)
  • The ESS needs to be ‘data aware’ in order to deliver exactly/only what’s needed: respond to requests for ‘science data objects’ by intelligently marshaling and sending the data needed
  • It needs to be ‘intelligent’ in the same way the ES is informed by PanDA/JEDI’s knowledge of the processing landscape
  • ESS encompasses
    • CDN-like optimization of data sourcing ‘close’ to the client
    • Intelligent cache management/exploitation at the client
    • Knowledge of the data itself sufficient to intelligently skim/slim during marshaling
    • Eventually? Servicing some requests via processing on demand rather than serving pre-existing data (replacing storage with cheaper CPU cycles)
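
A client-side sketch of the dataflow idea (Python; everything here is hypothetical, since the ESS does not yet exist): event ranges are prefetched into a small bounded buffer asynchronously from the processing, so the processing loop never waits on the WAN.

import queue, threading

def prefetch(event_ranges, fetch, buf):
    """Producer: stream exactly/only the requested ranges from the ESS."""
    for er in event_ranges:
        buf.put(fetch(er))      # WAN delivery happens ahead of processing
    buf.put(None)               # end-of-stream marker

def process_all(event_ranges, fetch, process, depth=4):
    buf = queue.Queue(maxsize=depth)    # bounded: a cache, not a replica
    threading.Thread(target=prefetch, args=(event_ranges, fetch, buf),
                     daemon=True).start()
    while True:
        data = buf.get()
        if data is None:
            return
        process(data)           # consumer never blocks on the network

# Usage: process_all(ranges, ess_fetch, run_payload)  (names hypothetical)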


18 of 18

Building the ESS

Two primary components. The first is a Data Streaming Service:

  • CDN-like intelligence in efficient data delivery, with minimal replication
  • Data marshaling
  • Smart local caching

Informed by the second component, the Data Knowledge Base, which provides the intelligence on

  • Dynamic resource landscape
  • Science data object (SDO) knowledge
  • Analysis processes & priorities
  • ML-derived predictive knowledge
