Analytics

What can we do today for sites

Ilija Vukotic, Rob Gardner, Lincoln Bryant

University of Chicago

ATLAS Sites Jamboree, January 29, 2016

Analytics Platform


The analytics platform continuously collects a lot of disparate data and makes it easy to inspect, visualize, and analyze. The platform itself is a much larger topic; this talk is limited to the parts most useful to ATLAS sites.

We won't cover: clusters, data transports, HDFS, Pig, indexing, ...

We will cover: the ES cluster, the data sources indexed, Kibana, already-created dashboards, and how to customize them and make your own

so you can do something like this:

Analytics control center

[Screenshot: a control-center view combining dashboards for analysis errors, PerfSONAR, overflow jobs, FTS transfers, the inter-site network, DDM space tokens and biggest users, an analytics cluster overview, and job efficiencies, users, and resource utilization]

For future reference


ElasticSearch is a Lucene-based search engine that indexes the data and provides very fast searching, filtering, and aggregations. Programmatic access from CERN is possible at cl-analytics.mwt2.org:9200. Please ask if you need it.
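A minimal sketch of such programmatic access, assuming Python 3 on a machine that can reach the endpoint. The jobs_archive_* index and the computingsite field appear later in these slides; the lowercase pattern is an assumption about the field mapping (wildcard queries are not analyzed), so adjust it if you get zero hits.

```python
import json
import urllib.request

# Endpoint from the slide; only reachable from CERN (ask for access first).
ES_URL = "http://cl-analytics.mwt2.org:9200"

def build_count_query(site_pattern):
    """Match documents whose computingsite matches a wildcard pattern.
    Field name taken from the PandaJobs slides."""
    return {"query": {"wildcard": {"computingsite": site_pattern}}}

def count_jobs(site_pattern, index="jobs_archive_*"):
    """POST the query to the _count endpoint and return the hit count."""
    body = json.dumps(build_count_query(site_pattern)).encode()
    req = urllib.request.Request(
        "%s/%s/_count" % (ES_URL, index), data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

# Example (needs network access to the cluster):
# print(count_jobs("analy_mwt2*"))
```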

Kibana is a web-based visualization tool on top of ElasticSearch, accessible to everyone at cl-analytics.mwt2.org:5601. For best performance use Chrome.

The cluster is at Clemson University, provided courtesy of the CloudLab project.

Data sources

The most important for sites:

  • PandaJobs
  • Rucio DDM data
  • PerfSONAR
  • FTS
  • FAX
  • Rucio popularity

Other data:

  • xAOD traces
  • Rucio service data


The following slides will give links to all the dashboards shown.

All the dashboards shown are as general as possible, so you can simply filter on them for the specific information you are interested in.

Create site/cloud overview dashboard

Example:

MWT2 (site), US (cloud)


PandaJobs

Practically all the info on all the grid jobs: 106 variables describing site, host, user, task, priorities, input and output data, timings, status, errors, parameters, memory, etc.

Period: from July 2014 up to the last 30 minutes.

Template: jobs_archive_*

Dashboards:


PandaJobs errors


Filter only your site by typing e.g. computingsite:ANALY_MWT2* in the search field.

Or simply drill down by clicking on the bar/slice/line on the plot.

You can save a filtered dashboard. Always name site-specific dashboards starting with your site name.
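The same site filter also works programmatically: a query_string query accepts exactly the string you would type into Kibana, and a terms aggregation breaks the matching jobs down by an error field. Note that transexitcode is an assumed field name used for illustration; inspect the jobs_archive_* mapping for the actual error fields.

```python
import json

def build_error_breakdown(site_filter="computingsite:ANALY_MWT2*",
                          error_field="transexitcode", size=10):
    """Kibana-style filter plus a terms aggregation over an error field.
    NOTE: `transexitcode` is an assumed field name; check the
    jobs_archive_* mapping for the real error fields."""
    return {
        "size": 0,  # only the aggregation is wanted, not individual jobs
        "query": {"query_string": {"query": site_filter}},
        "aggs": {"top_errors": {"terms": {"field": error_field,
                                          "size": size}}},
    }

print(json.dumps(build_error_breakdown(), indent=2))
```

The resulting JSON can be POSTed to the _search endpoint with curl or urllib, as on the earlier slide.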

PandaJobs Analysis


PandaJobs Input data


PandaJobs efficiencies


Use this dashboard to watch for batches of inefficient jobs.

Investigate reasons for low CPU efficiency.

Compare the efficiency of the same task on other queues.

After drilling down, switch to "Discover" mode to go down to the single-job level.

DDM

Uses daily Rucio dumps to index all the datasets at all the RSEs.

For each dataset it gives: name, size, tags, data type, times (creation, last access, time since last access), and RSE.

Template: ddm-*. Kept for 1 month.

The same data, summed up per scope, owner, and RSE, is available in template ddm_aggregated*. Kept indefinitely.
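For a quick look at who is using how much space, the same summing can be done on the ES side with a nested aggregation. This is a sketch: the rse and size field names are assumptions based on the field list above, so verify them against the actual ddm-* mapping.

```python
def build_space_per_rse(rse_pattern="*LOCALGROUPDISK", top_n=20):
    """Sum dataset sizes per RSE in the daily ddm-* dump.
    NOTE: `rse` and `size` are assumed field names -- check the mapping."""
    return {
        "size": 0,  # aggregation only, no per-dataset hits
        "query": {"wildcard": {"rse": rse_pattern}},
        "aggs": {
            "per_rse": {
                "terms": {"field": "rse", "size": top_n},
                "aggs": {"total_bytes": {"sum": {"field": "size"}}},
            }
        },
    }
```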

Dashboards:


DDM US LOCALGROUPDISK


PerfSONAR

Gets data from ~250 sites. Three types of data: throughput, one-way delay, and packet-loss rate.

Contains: VO, site, hostname, and production status of both source and destination; mean, median, and standard deviation of delay; etc.

Template: network_weather_2-*. Kept indefinitely.

Dashboards:


PerfSONAR all measurements

Select your site with srcSite:MWT2 or destSite:MWT2; restrict to ATLAS destinations with destVO:ATLAS.
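The same selections can be made in a programmatic query; srcSite and destVO are the field names shown on the slide, while timestamp is an assumed name for the measurement-time field.

```python
def build_packet_loss_query(src_site="MWT2", dest_vo="ATLAS", hours=24):
    """Recent measurements from one site to all ATLAS destinations.
    srcSite/destVO come from the slide; `timestamp` is an assumed
    field name -- check the network_weather_2-* mapping."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"srcSite": src_site}},
                    {"term": {"destVO": dest_vo}},
                    {"range": {"timestamp": {"gte": "now-%dh" % hours}}},
                ]
            }
        }
    }
```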


PerfSONAR single link


PerfSONAR inter-site links


Compare your site to others


FTS transfers

All the FTS transfers in real time.

Contains: source and destination site and RSE, activity, bytes, and times (created, started, ended).

Template: [fts-]YYYY.MM.DD

Dashboards:


FTS global


FTS site specific


FAX cost

Collects xrdcp rates between the largest analysis queues and all FAX storages.

Contains: source, destination, rate

Template: faxcost-*

Dashboards:


FAX cost


Ale’s requested visualizations


Ale’s dashboard


Reporting from ES

One can use ES to periodically check for irregular situations (high packet loss, low DDM token space, low-efficiency jobs, ...) and generate reports (send e-mails, SMS, ...). There is an ES plugin that makes this trivial, but it is not free.

Still, it is easy to make a cron job that does the same: run a query against ES and send mail based on the returned result. The simplest way is to use curl; for more complicated things (checking several conditions and generating one mail with a full report) it can be easier to use Python. Simple examples can be found HERE.

Two things to keep in mind:

  • don't run it every 5 seconds; this is a shared resource
  • do all the filtering and aggregation on the ES side; you don't want to transport back a lot of data
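Both points above can be sketched as a standard-library-only cron script: the aggregation runs inside ES so only a small JSON summary comes back, and mail is sent only when something crosses a threshold. The packet_loss and timestamp field names, the threshold, and the e-mail addresses are illustrative assumptions.

```python
import json
import smtplib
import urllib.request
from email.mime.text import MIMEText

ES_URL = "http://cl-analytics.mwt2.org:9200"  # from the earlier slide

def build_loss_check(site="MWT2", hours=6):
    """Average packet loss per destination site over the last few hours.
    NOTE: `packet_loss` and `timestamp` are assumed field names."""
    return {
        "size": 0,  # summary only -- don't transport raw measurements
        "query": {
            "bool": {
                "must": [
                    {"term": {"srcSite": site}},
                    {"range": {"timestamp": {"gte": "now-%dh" % hours}}},
                ]
            }
        },
        "aggs": {
            "per_dest": {
                "terms": {"field": "destSite", "size": 50},
                "aggs": {"avg_loss": {"avg": {"field": "packet_loss"}}},
            }
        },
    }

def format_report(buckets, threshold=0.02):
    """Keep only links whose average loss exceeds the threshold."""
    lines = []
    for b in buckets:
        loss = b["avg_loss"]["value"]
        if loss is not None and loss > threshold:
            lines.append("%s: average packet loss %.3f" % (b["key"], loss))
    return lines

def run_and_mail():
    """The part the cron job actually executes (needs network access)."""
    body = json.dumps(build_loss_check()).encode()
    req = urllib.request.Request(
        ES_URL + "/network_weather_2-*/_search", data=body,
        headers={"Content-Type": "application/json"})
    result = json.load(urllib.request.urlopen(req))
    lines = format_report(result["aggregations"]["per_dest"]["buckets"])
    if lines:
        msg = MIMEText("\n".join(lines))
        msg["Subject"] = "Packet loss report for MWT2"
        msg["From"] = "analytics@example.org"   # placeholder address
        msg["To"] = "site-admins@example.org"   # placeholder address
        smtplib.SMTP("localhost").send_message(msg)
```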


Advanced topic

We deployed Timelion, a Kibana plugin that allows easy time-series analysis: derivatives, moving averages, arithmetic operations, ...

It is not yet documented, so only for the brave ones.

