Linux Clusters Institute: Monitoring

J.D. Maloney | Sr. HPC Storage Engineer

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu

May 1-5, 2023

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

Purposes Behind Monitoring


External Sources of Monitoring Deliverables

  • These reasons for monitoring come from “outside” the group/engineers that manage the infrastructure
  • Use cases can be:
    • Reports to external funding agencies (NSF, DOE, NIH, etc.)
    • Reports to campus/parent-organization management
    • Reports and metrics for users/consumers of the infrastructure you provide/operate
    • Peers in our field of HPC; monitoring provides key data that can underpin papers/posters/talks/etc.
  • Can be a big driver behind keeping long-term metrics and having good retention policies to track long-term trends


Internal Sources of Monitoring Deliverables

  • These reasons for monitoring come from you/your team: things that are handy to see when operating the infrastructure
  • Use cases can be:
    • Service degradation/down reporting…know before your users tell you something is broken
    • Trend analysis for capacity planning, helps you give good budget allocation request feedback; prevent future headaches
    • Correlate events/utilization to help you help users with their work, at the end of the day that’s what it is all about
    • Single pane of glass, no islands for every vendor
  • Needs will be derived over time and are always growing and evolving, work on monitoring is never complete


High Level “Layers” to the Monitoring Stack

What to Collect (Metrics)

  • So many things we can collect metrics from that are useful, too many to talk about in such a short time
    • Thankfully many things here are common regardless of environment size
    • Mostly dependent on technologies and hardware that are used
  • When collecting metrics make sure that you make clear certain key things:
    • Units -- when gathering counters you need to know what the unit is, otherwise it’s useless
    • How often to ingest metric data, what is needed, what can you practically support
    • From where/what tool captures each metric, consolidate metrics capture as much as you can
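As a sketch of what making units and provenance explicit can look like in practice, here is a minimal metric-sample structure; the field names are illustrative, not from any particular tool:

```python
from dataclasses import dataclass
import time

@dataclass
class MetricSample:
    """One metric observation with its unit and provenance made explicit."""
    name: str        # e.g. "node.cpu.utilization"
    value: float
    unit: str        # e.g. "percent", "bytes", "bytes/s" -- a bare counter is useless
    source: str      # which tool/exporter captured this metric
    interval_s: int  # intended collection cadence, in seconds
    timestamp: float

sample = MetricSample(
    name="node.cpu.utilization",
    value=87.5,
    unit="percent",
    source="telegraf",
    interval_s=60,
    timestamp=time.time(),
)
```

Carrying the unit, source, and cadence alongside every stream makes consolidation and later debugging far easier than reconstructing them from tribal knowledge.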


What to Collect (Metrics)

  • An abbreviated list of metrics you can collect


Compute Infrastructure

  • CPU Utilization (Aggregate/per-core)
  • CPU performance counters (flops)
  • Memory Utilization (and BW utilization)
  • NIC performance & error counters
  • Accelerator performance counters and error counters
  • Inlet/Exhaust/Internal temps, power draw, fan speeds, voltages
  • Health of critical services (scheduler daemons, file system mounts, time sync, firewalls, etc.)
  • Software: module loads, etc.

Network Infrastructure

  • Bandwidth utilization, esp. up/down-links
  • Port error counters

Network Infrastructure (cont.)

  • Hardware health (power supplies, fans, temperatures, etc.)
  • Fabric Manager health (if applicable)

Storage Infrastructure

  • File system performance (bandwidth, metadata operations, per client/user/job performance trends)
    • File System Utilization (overall, by projects, users, teams) -- e.g. quota data
  • Health of infrastructure (bad drives, links, cables, controllers, etc.)
  • Health of services that underpin the file system
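Many of the compute-side items above can be read straight from `/proc` on a Linux node. A small sketch of parsing `/proc/meminfo`-style output into a metrics dict (a canned sample string stands in for the real file here):

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style output into {field: kilobytes}."""
    metrics = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            metrics[key.strip()] = int(parts[0])  # values are reported in kB
    return metrics

# In production you would read open("/proc/meminfo").read(); sample data is used here.
SAMPLE = """MemTotal:       263856388 kB
MemFree:        201234567 kB
MemAvailable:   240123456 kB"""

mem = parse_meminfo(SAMPLE)
used_pct = 100 * (mem["MemTotal"] - mem["MemAvailable"]) / mem["MemTotal"]
```

In practice a collector like Telegraf or node_exporter does this for you; the point is only that the raw sources are simple text files.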


Scheduler/Job Related

  • Utilization of each node, partition
  • Usage by user/group allocation on the system
  • Scheduler health (iteration time, backfill sizes, etc.)
  • Job counts and sizes (running/pending by user/group/project)
  • Efficiency of resource utilization (actual used vs. requested)
  • Job wait time information, latency for getting a job running

Security Related

  • Logs of all logins/sudo escalations/etc.
  • Pretty much all logs and some key metrics

Common Infrastructure

  • Power draw at the PDU/rack/transformer level
  • Balance of power draw across PDUs/phases/banks
  • Water temperatures at various stages on water cooled systems
  • Humidity and Ambient air temperatures
  • Air particulate matter (especially in wildfire regions)
  • Weather data for your location(s)
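The scheduler-related items, such as efficiency of requested vs. used resources and job wait time, can be derived from accounting records. A sketch using hypothetical job records (in practice these would come from something like Slurm's `sacct` output):

```python
# Hypothetical job records; field names are invented for illustration.
jobs = [
    {"user": "alice", "cpu_hours_used": 96.0, "cpu_hours_alloc": 128.0,
     "submit": 1000, "start": 1300},
    {"user": "bob", "cpu_hours_used": 60.0, "cpu_hours_alloc": 64.0,
     "submit": 2000, "start": 2060},
]

def efficiency(job):
    """Actual CPU-hours consumed vs. CPU-hours allocated to the job."""
    return job["cpu_hours_used"] / job["cpu_hours_alloc"]

def wait_seconds(job):
    """Queue latency: how long the job sat pending before starting."""
    return job["start"] - job["submit"]

avg_eff = sum(efficiency(j) for j in jobs) / len(jobs)
avg_wait = sum(wait_seconds(j) for j in jobs) / len(jobs)
```

Tracking these two numbers per user/group over time is often enough to spot chronic over-requesting and queue pressure.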

Collection Tools

  • Tons of options out there to gather metrics
  • When choosing, some things to consider:
    • Familiarity, if your team prefers a tool, using it will help motivate people to enhance it and keep it up to date
    • Use as few tools to gather what you need as possible, if using multiple tools track where each metric stream comes from/why
    • Test out a few of your favorite options and see what you like
  • Some common options:
    • Telegraf
    • Prometheus
    • Nagios
    • Logstash
    • Zabbix
    • Cacti
    • Syslog


Collection Intervals

  • Frequent enough that you can “see” what’s going on with your systems
  • Infrequent enough that you’re not burdening your systems collecting the metrics
  • Also need to think about retention of the metrics and database size/queries
    • Changing from one sample per 60s to one per 10s is a 6x increase in data volume (may seem obvious…it’s math, but it can sneak up on you, especially at scale)
    • Too much resolution can make queries and plotting slower as well which can be hard on users/engineers/support folks
  • You don’t have to collect everything on the same interval
    • Eg. Performance data can be once per minute, but quota metrics once every 10 minutes
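The data-volume math above is easy to sanity-check with a few lines (node and metric counts are made up for illustration):

```python
def daily_points(num_hosts, metrics_per_host, interval_s):
    """Data points per day for a fleet at a given collection interval."""
    return num_hosts * metrics_per_host * (86_400 // interval_s)

# 500 nodes, 200 metrics each
at_60s = daily_points(500, 200, 60)   # one sample per minute
at_10s = daily_points(500, 200, 10)   # one sample per 10 seconds
growth = at_10s / at_60s              # the 6x increase from the slide
```

Running a calculation like this before changing an interval fleet-wide is a cheap way to catch database sizing surprises early.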


Metric Analysis and Storage

  • Most metrics parsing and prep is done on the collection side before it gets to the storage layer
    • Sometimes you may need to “roll up” or create new data streams within the storage/query layer
  • Handy to have a solution that can down-sample metrics after a period of time to be more space efficient
  • Retain metrics for as long as possible
    • Especially key metrics that help forecast demand
    • Those that help with reports due to external entities
  • Some common metrics and log storage options:

    • InfluxDB
    • Prometheus/Mimir
    • Victoria Metrics
    • ElasticSearch
    • MariaDB/MySQL
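The down-sampling idea, trading old high-resolution data for space, can be sketched in a few lines; real time-series databases do this via retention/rollup policies, but the core operation is just bucketed averaging:

```python
from statistics import mean

def downsample(samples, factor):
    """Roll up consecutive (timestamp, value) samples into averaged buckets.

    Keeps the first timestamp of each bucket; trades resolution for space.
    """
    out = []
    for i in range(0, len(samples), factor):
        bucket = samples[i:i + factor]
        out.append((bucket[0][0], mean(v for _, v in bucket)))
    return out

# Twelve 1-minute samples rolled up into 10-minute buckets
minute_data = [(t * 60, float(v)) for t, v in enumerate(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])]
ten_minute = downsample(minute_data, 10)
```

Averaging loses the peaks, so some sites roll up min/max/mean together; which aggregate to keep depends on what questions the old data needs to answer.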

Visualization/Reporting

  • Lots of tools out there for this
    • You can even write your own! Many sites have/do
  • Some more “legacy” tools are things like:
    • Ganglia
    • CheckMK
  • Some newer tools are things like:
    • Grafana
    • Kibana
    • Kapacitor
    • Zabbix
  • This is definitely a layer where you don’t *have* to choose just one tool; you can run multiple tools against the same backing databases


Notifications

  • Lots of solutions out there for alerts/notifications
  • Often these are tied to the tools that you are using for collection/storage/visualization
    • Key is to choose things that tie in well with what you are doing in other parts of the stack
  • A very critical piece of the stack since it’s what tells you what’s happening and helps you prevent a crisis or start mitigation as fast as possible
  • Lots of “soft” work at this layer also:
    • Alert definitions aren’t always black/white
    • Who gets alerted and when, etc.
    • We’ll talk about this here for a few more slides


Notification

  • What do we alert on? What constitutes a service interrupt?
  • Seems like simple questions to answer…often isn’t
    • Does compute node #273 going down at 1am warrant a wake up call…likely not, but the scheduler/file system…likely so
    • Ok, so one compute node dropping out at night == no wake up call, what about 10 nodes, or 50? Where do you draw the line?
  • Often can tie back to SLAs in place or organizational policy, or maybe even user expectation
    • Do the hard work of laying out definitions and communicating them thoroughly to all parties involved
  • Alerts can go to different places (more on this soon), but not everyone has/should see everything


Notification

  • How do “we” get notified? Will vary by party involved
    • Admins/engineers may get alarm emails, or maybe Slack messages (or $IM_platform_of_choice)
    • Things can be escalated by humans or software platforms (like Pager Duty)
    • Users usually get emails about issues, maybe you also have a status page/dashboard(s) for them to look at to check status
  • Communication to “users” should usually *not* be automatic (there are exceptions to this)
    • False-positives are a thing
    • Critical issues can occur only briefly and have very little lasting impact; that warrants a different tone to the users


Notification

  • How often do we alert, when can they be ignored, how are alerts escalated?
    • Some alerts are once-and-done (rely on you to see alert & check in)
    • Some alerts need to repeat on a given frequency while still “tripped” to ensure someone gets eyes on them quickly
  • Escalation of critical alarms can happen via methods such as a NOC/Operations Center, tools like PagerDuty, or even just an on-call engineer
  • Alert frequency should correspond to severity, some non-critical alerts can even be aggregated and delivered in batch
  • You want to avoid “boy-who-cried-wolf” situations, try not to desensitize people to alarms
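As an illustration of frequency-by-severity, here is a toy alert policy; the thresholds are invented for the example, not a recommendation:

```python
def alert_actions(severity, minutes_active, ack=False):
    """Decide notification actions for an active alert.

    Illustrative policy: critical alerts repeat every 5 minutes and
    escalate to on-call after 15 minutes unacknowledged; warnings are
    batched into an hourly digest. Acknowledged alerts go quiet.
    """
    if ack:
        return []
    actions = []
    if severity == "critical":
        if minutes_active % 5 == 0:
            actions.append("notify-admins")
        if minutes_active >= 15:
            actions.append("page-on-call")
    elif severity == "warning":
        if minutes_active % 60 == 0:
            actions.append("batch-digest")
    return actions
```

Note the acknowledgment check comes first: silencing acknowledged alerts is one of the simplest defenses against alarm fatigue.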


Notification

  • Security related alerts and alarms can carry a different protocol than others depending on site
    • Certain compliance requirements can dictate time-to-response or time-to-remediation
    • Make sure you include security in your work on monitoring and alerting to ensure deliverables to them are being met
  • These types of situations can occur/be related to things like:
    • Failures to sudo (privilege escalation failures)
    • Repeated login failures/login attempts outside of certain geographies
    • DDOS attempts against services
    • Unexpected service behaviors
  • Prioritize as appropriate; failures to meet compliance mandates can carry large consequences


Log Management

  • Logs can contain crucial and invaluable information for debugging or troubleshooting
  • Also logs can be a source of metrics
    • We create metric telemetry streams from data derived from logs
    • Things like failed login attempts over time, number of file system client expels, file transfer statistics, etc.
  • A big gain can be had from centralizing your log storage from various systems so that patterns can be seen and analyzed
    • Log messages often contain references to other machines/services that the machine doing the logging interacts with
    • Correlating matching logs from client/server relationships can prove critical for timely debugging and analysis
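As a sketch of deriving a metric stream from logs, here is a toy parser that turns sshd "Failed password" lines into a per-minute counter (the log lines are sample data):

```python
import re
from collections import Counter

# Sample sshd lines; real input would be your centralized syslog stream.
LOG = """May  2 10:15:01 login01 sshd[912]: Failed password for invalid user admin from 203.0.113.9 port 40122 ssh2
May  2 10:15:07 login01 sshd[915]: Failed password for root from 203.0.113.9 port 40130 ssh2
May  2 10:16:30 login01 sshd[920]: Accepted publickey for alice from 198.51.100.4 port 50100 ssh2"""

# Capture the timestamp truncated to the minute for bucketing
FAILED = re.compile(r"^(\w+\s+\d+ \d+:\d+):\d+ \S+ sshd\[\d+\]: Failed password")

def failed_logins_per_minute(text):
    """Turn raw auth log lines into a per-minute failed-login counter."""
    return Counter(m.group(1) for line in text.splitlines()
                   if (m := FAILED.match(line)))

counts = failed_logins_per_minute(LOG)
```

Tools like Logstash or Telegraf's log parsers do this kind of extraction continuously and ship the resulting counters to your metrics store.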


Log Management

  • Many log tools out there too, just like there are a lot of tools out there for use with metrics
  • Some common log related tools are:
    • ELK stack (Elasticsearch, Logstash, and Kibana)
    • Graylog
    • Loki/Grafana
    • Splunk
  • Pick a tool that works well for you and your organization
    • Best to centralize so you can correlate as many logs as possible
  • Unless dictated by policy, log retention can be much less than metrics retention
    • Especially if you’re pulling out the long-term metrics data


A Handful of Monitoring Examples


Example: Grafana


Example: Kibana


Example: Zabbix


The Future Side of Monitoring

  • Trend analysis using machine learning
    • Noise reduction to filter out false-positives and more clearly see actual issues, not just “busy-ness”
    • Anomaly detection which helps identify when something is outside “normal” even if it’s inside “normal” limits
    • Correlation of events automatically given things like timestamps/matching patterns/traces

  • Some examples
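As a toy illustration of the anomaly-detection idea, a z-score check against a rolling baseline flags values that are statistically unusual even when they sit inside hard limits:

```python
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    """Flag indices where a value is more than `threshold` standard
    deviations from the mean of the preceding `window` samples --
    a toy detector for "outside normal, even if inside the limits"."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Steady load around 50 with one spike at index 10
load = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 95, 50, 51]
```

Production systems use far more robust models, but the principle is the same: learn what "normal" looks like from recent history rather than hand-picking static thresholds.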


Questions?
