Linux Clusters Institute: Monitoring

J.D. Maloney | Sr. HPC Storage Engineer

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu

May 1-5, 2023

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

Purposes Behind Monitoring


External Sources of Monitoring Deliverables

  • These reasons for monitoring come from “outside” the group/engineers that manage the infrastructure
  • Use cases can be:
    • Reports to external funding agencies (NSF, DOE, NIH, etc.)
    • Reports to campus/parent-organization management
    • Reports and metrics for users/consumers of the infrastructure you provide/operate
    • Peers in our field of HPC; monitoring provides key data that can underpin papers/posters/talks/etc.
  • Can be a big driver behind keeping long-term metrics and having good retention policies to track long-term trends


Internal Sources of Monitoring Deliverables

  • These reasons for monitoring come from you/your team: things that are handy to see when operating the infrastructure
  • Use cases can be:
    • Service degradation/down reporting…know before your users tell you something is broken
    • Trend analysis for capacity planning, helps you give good budget allocation request feedback; prevent future headaches
    • Correlate events/utilization to help you help users with their work, at the end of the day that’s what it is all about
    • Single pane of glass, no islands for every vendor
  • Needs will be derived over time and are always growing and evolving, work on monitoring is never complete


High Level “Layers” to the Monitoring Stack

What to Collect (Metrics)

  • So many things we can collect metrics from that are useful, too many to talk about in such a short time
    • Thankfully many things here are common regardless of environment size
    • Mostly dependent on technologies and hardware that are used
  • When collecting metrics make sure that you make clear certain key things:
    • Units -- when gathering counters you need to know what the unit is, otherwise it’s useless
    • How often to ingest metric data, what is needed, what can you practically support
    • From where/what tool captures each metric, consolidate metrics capture as much as you can
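As a sketch of what making units and provenance explicit can look like in practice, here is a minimal metric-sample structure; the field names are illustrative, not from any particular tool:

```python
from dataclasses import dataclass
import time

@dataclass
class MetricSample:
    """One metric observation with its unit and provenance made explicit."""
    name: str        # e.g. "node.cpu.utilization"
    value: float
    unit: str        # e.g. "percent", "bytes", "bytes/s" -- a bare counter is useless
    source: str      # which tool/exporter captured this metric
    interval_s: int  # intended collection cadence, in seconds
    timestamp: float

sample = MetricSample(
    name="node.cpu.utilization",
    value=87.5,
    unit="percent",
    source="telegraf",
    interval_s=60,
    timestamp=time.time(),
)
```

Carrying the unit, source, and cadence alongside every stream makes consolidation and later debugging far easier than reconstructing them from tribal knowledge.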


What to Collect (Metrics)

  • An abbreviated list of metrics you can collect


Compute Infrastructure

  • CPU Utilization (Aggregate/per-core)
  • CPU performance counters (flops)
  • Memory Utilization (and BW utilization)
  • NIC performance & error counters
  • Accelerator performance counters and error counters
  • Inlet/Exhaust/Internal temps, power draw, fan speeds, voltages
  • Health of critical services (scheduler daemons, file system mounts, time sync, firewalls, etc.)
  • Software: module loads, etc.

Network Infrastructure

  • Bandwidth utilization, esp. up/down-links
  • Port error counters

Network Infrastructure (cont.)

  • Hardware health (power supplies, fans, temperatures, etc.)
  • Fabric Manager health (if applicable)

Storage Infrastructure

  • File system performance (bandwidth, metadata operations, per client/user/job performance trends)
    • File System Utilization (overall, by projects, users, teams) -- e.g. quota data
  • Health of infrastructure (bad drives, links, cables, controllers, etc.)
  • Health of services that underpin the file system
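Many of the compute-side items above can be read straight from `/proc` on a Linux node. A small sketch of parsing `/proc/meminfo`-style output into a metrics dict (a canned sample string stands in for the real file here):

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style output into {field: kilobytes}."""
    metrics = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            metrics[key.strip()] = int(parts[0])  # values are reported in kB
    return metrics

# In production you would read open("/proc/meminfo").read(); sample data is used here.
SAMPLE = """MemTotal:       263856388 kB
MemFree:        201234567 kB
MemAvailable:   240123456 kB"""

mem = parse_meminfo(SAMPLE)
used_pct = 100 * (mem["MemTotal"] - mem["MemAvailable"]) / mem["MemTotal"]
```

In practice a collector like Telegraf or node_exporter does this for you; the point is only that the raw sources are simple text files.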


Scheduler/Job Related

  • Utilization of each node, partition
  • Usage by user/group allocation on the system
  • Scheduler health (iteration time, backfill sizes, etc.)
  • Job counts and sizes (running/pending by user/group/project)
  • Efficiency of resource utilization (actual used vs. requested)
  • Job wait time information, latency for getting a job running

Security Related

  • Logs of all logins/sudo escalations/etc.
  • Pretty much all logs and some key metrics

Common Infrastructure

  • Power draw at the PDU/rack/transformer level
  • Balance of power draw across PDUs/phases/banks
  • Water temperatures at various stages on water cooled systems
  • Humidity and Ambient air temperatures
  • Air particulate matter (especially in wildfire regions)
  • Weather data for your location(s)
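The scheduler-related items, such as efficiency of requested vs. used resources and job wait time, can be derived from accounting records. A sketch using hypothetical job records (in practice these would come from something like Slurm's `sacct` output):

```python
# Hypothetical job records; field names are invented for illustration.
jobs = [
    {"user": "alice", "cpu_hours_used": 96.0, "cpu_hours_alloc": 128.0,
     "submit": 1000, "start": 1300},
    {"user": "bob", "cpu_hours_used": 60.0, "cpu_hours_alloc": 64.0,
     "submit": 2000, "start": 2060},
]

def efficiency(job):
    """Actual CPU-hours consumed vs. CPU-hours allocated to the job."""
    return job["cpu_hours_used"] / job["cpu_hours_alloc"]

def wait_seconds(job):
    """Queue latency: how long the job sat pending before starting."""
    return job["start"] - job["submit"]

avg_eff = sum(efficiency(j) for j in jobs) / len(jobs)
avg_wait = sum(wait_seconds(j) for j in jobs) / len(jobs)
```

Tracking these two numbers per user/group over time is often enough to spot chronic over-requesting and queue pressure.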

Collection Tools

  • Tons of options out there to gather metrics
  • When choosing, some things to consider:
    • Familiarity, if your team prefers a tool, using it will help motivate people to enhance it and keep it up to date
    • Use as few tools to gather what you need as possible, if using multiple tools track where each metric stream comes from/why
    • Test out a few of your favorite options and see what you like
  • Some common options:
    • Telegraf
    • Prometheus
    • Nagios
    • Logstash
    • Zabbix
    • Cacti
    • Syslog


Collection Intervals

  • Frequent enough that you can “see” what’s going on with your systems
  • Infrequent enough that you’re not burdening your systems collecting the metrics
  • Also need to think about retention of the metrics and database size/queries
    • Changing from one sample per 60s to one per 10s is a 6x increase in data volume (may seem obvious…it’s math, but it can sneak up on you, especially at scale)
    • Too much resolution can make queries and plotting slower as well which can be hard on users/engineers/support folks
  • You don’t have to collect everything on the same interval
    • Eg. Performance data can be once per minute, but quota metrics once every 10 minutes
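The data-volume math above is easy to sanity-check with a few lines (node and metric counts are made up for illustration):

```python
def daily_points(num_hosts, metrics_per_host, interval_s):
    """Data points per day for a fleet at a given collection interval."""
    return num_hosts * metrics_per_host * (86_400 // interval_s)

# 500 nodes, 200 metrics each
at_60s = daily_points(500, 200, 60)   # one sample per minute
at_10s = daily_points(500, 200, 10)   # one sample per 10 seconds
growth = at_10s / at_60s              # the 6x increase from the slide
```

Running a calculation like this before changing an interval fleet-wide is a cheap way to catch database sizing surprises early.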


Metric Analysis and Storage

  • Most metrics parsing and prep is done on the collection side before it gets to the storage layer
    • Sometimes you may need to “roll up” or create new data streams within the storage/query layer
  • Handy to have a solution that can down-sample metrics after a period of time to be more space efficient
  • Retain metrics for as long as possible
    • Especially key metrics that help forecast demand
    • Those that help with reports due to external entities
  • Some common metrics and log storage options:

    • InfluxDB
    • Prometheus/Mimir
    • Victoria Metrics
    • ElasticSearch
    • MariaDB/MySQL
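The down-sampling idea, trading old high-resolution data for space, can be sketched in a few lines; real time-series databases do this via retention/rollup policies, but the core operation is just bucketed averaging:

```python
from statistics import mean

def downsample(samples, factor):
    """Roll up consecutive (timestamp, value) samples into averaged buckets.

    Keeps the first timestamp of each bucket; trades resolution for space.
    """
    out = []
    for i in range(0, len(samples), factor):
        bucket = samples[i:i + factor]
        out.append((bucket[0][0], mean(v for _, v in bucket)))
    return out

# Twelve 1-minute samples rolled up into 10-minute buckets
minute_data = [(t * 60, float(v)) for t, v in enumerate(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])]
ten_minute = downsample(minute_data, 10)
```

Averaging loses the peaks, so some sites roll up min/max/mean together; which aggregate to keep depends on what questions the old data needs to answer.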

Visualization/Reporting

  • Lots of tools out there for this
    • You can even write your own! Many sites have/do
  • Some more “legacy” tools are things like:
    • Ganglia
    • CheckMK
  • Some newer tools are things like:
    • Grafana
    • Kibana
    • Kapacitor
    • Zabbix
  • This is definitely a layer where you don’t *have* to choose just one tool; you can run multiple tools against the same backing databases


Notifications

  • Lots of solutions out there for alerts/notifications
  • Often these are tied to the tools that you are using for collection/storage/visualization
    • Key is to choose things that tie in well with what you are doing in other parts of the stack
  • A very critical piece of the stack since it’s what tells you what’s happening and helps you prevent a crisis or start mitigation as fast as possible
  • Lots of “soft” work at this layer also:
    • Alert definitions aren’t always black/white
    • Who gets alerted and when, etc.
    • We’ll talk about this here for a few more slides


Notification

  • What do we alert on? What constitutes a service interrupt?
  • Seems like simple questions to answer…often isn’t
    • Does compute node #273 going down at 1am warrant a wake up call…likely not, but the scheduler/file system…likely so
    • Ok, so one compute node dropping out at night == no wake up call, what about 10 nodes, or 50? Where do you draw the line?
  • Often can tie back to SLAs in place or organizational policy, or maybe even user expectation
    • Do the hard work of laying out definitions and communicating them thoroughly to all parties involved
  • Alerts can go to different places (more on this soon), but not everyone has/should see everything


Notification

  • How do “we” get notified? Will vary by party involved
    • Admins/engineers may get alarm emails, or maybe Slack messages (or $IM_platform_of_choice)
    • Things can be escalated by humans or software platforms (like Pager Duty)
    • Users usually get emails about issues, maybe you also have a status page/dashboard(s) for them to look at to check status
  • Communication to “users” should usually *not* be automatic (there are exceptions to this)
    • False-positives are a thing
    • Critical issues can occur only briefly and have very little lasting impact; that warrants a different tone to the users


Notification

  • How often do we alert, when can they be ignored, how are alerts escalated?
    • Some alerts are once-and-done (rely on you to see alert & check in)
    • Some alerts need to repeat on a given frequency while still “tripped” to ensure someone gets eyes on them quickly
  • Escalation of critical alarms can happen via methods such as a NOC/Operations Center, tools like PagerDuty, or even just an on-call engineer
  • Alert frequency should correspond to severity, some non-critical alerts can even be aggregated and delivered in batch
  • You want to avoid “boy-who-cried-wolf” situations, try not to desensitize people to alarms
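As an illustration of frequency-by-severity, here is a toy alert policy; the thresholds are invented for the example, not a recommendation:

```python
def alert_actions(severity, minutes_active, ack=False):
    """Decide notification actions for an active alert.

    Illustrative policy: critical alerts repeat every 5 minutes and
    escalate to on-call after 15 minutes unacknowledged; warnings are
    batched into an hourly digest. Acknowledged alerts go quiet.
    """
    if ack:
        return []
    actions = []
    if severity == "critical":
        if minutes_active % 5 == 0:
            actions.append("notify-admins")
        if minutes_active >= 15:
            actions.append("page-on-call")
    elif severity == "warning":
        if minutes_active % 60 == 0:
            actions.append("batch-digest")
    return actions
```

Note the acknowledgment check comes first: silencing acknowledged alerts is one of the simplest defenses against alarm fatigue.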


Notification

  • Security related alerts and alarms can carry a different protocol than others depending on site
    • Certain compliance requirements can dictate time-to-response or time-to-remediation
    • Make sure you include security in your work on monitoring and alerting to ensure deliverables to them are being met
  • These types of situations can occur/be related to things like:
    • Failures to sudo (privilege escalation failures)
    • Repeated login failures/login attempts outside of certain geographies
    • DDOS attempts against services
    • Unexpected service behaviors
  • Prioritize as appropriate; failures to meet compliance mandates can carry large consequences


Log Management

  • Logs can contain crucial and invaluable information for debugging or troubleshooting
  • Also logs can be a source of metrics
    • We create metric telemetry streams from data derived from logs
    • Things like failed login attempts over time, number of file system client expels, file transfer statistics, etc.
  • A big gain can be had from centralizing your log storage from various systems so that patterns can be seen and analyzed
    • Log messages often contain references to other machines/services that the machine doing the logging interacts with
    • Correlating matching logs from client/server relationships can prove critical for timely debugging and analysis
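As a sketch of deriving a metric stream from logs, here is a toy parser that turns sshd "Failed password" lines into a per-minute counter (the log lines are sample data):

```python
import re
from collections import Counter

# Sample sshd lines; real input would be your centralized syslog stream.
LOG = """May  2 10:15:01 login01 sshd[912]: Failed password for invalid user admin from 203.0.113.9 port 40122 ssh2
May  2 10:15:07 login01 sshd[915]: Failed password for root from 203.0.113.9 port 40130 ssh2
May  2 10:16:30 login01 sshd[920]: Accepted publickey for alice from 198.51.100.4 port 50100 ssh2"""

# Capture the timestamp truncated to the minute for bucketing
FAILED = re.compile(r"^(\w+\s+\d+ \d+:\d+):\d+ \S+ sshd\[\d+\]: Failed password")

def failed_logins_per_minute(text):
    """Turn raw auth log lines into a per-minute failed-login counter."""
    return Counter(m.group(1) for line in text.splitlines()
                   if (m := FAILED.match(line)))

counts = failed_logins_per_minute(LOG)
```

Tools like Logstash or Telegraf's log parsers do this kind of extraction continuously and ship the resulting counters to your metrics store.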


Log Management

  • Many log tools out there too, just like there are a lot of tools out there for use with metrics
  • Some common log related tools are:
    • ELK stack (Elasticsearch, Logstash, and Kibana)
    • Graylog
    • Loki/Grafana
    • Splunk
  • Pick a tool that works well for you and your organization
    • Best to centralize so you can correlate as many logs as possible
  • Unless dictated by policy, log retention can be much less than metrics retention
    • Especially if you’re pulling out the long-term metrics data


A Handful of Monitoring Examples


Example: Grafana


Example: Kibana


Example: Zabbix


The Future Side of Monitoring

  • Trend analysis using machine learning
    • Noise reduction to filter out false-positives and more clearly see actual issues, not just “busy-ness”
    • Anomaly detection which helps identify when something is outside “normal” even if it’s inside “normal” limits
    • Correlation of events automatically given things like timestamps/matching patterns/traces

  • Some examples
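As a toy illustration of the anomaly-detection idea, a z-score check against a rolling baseline flags values that are statistically unusual even when they sit inside hard limits:

```python
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    """Flag indices where a value is more than `threshold` standard
    deviations from the mean of the preceding `window` samples --
    a toy detector for "outside normal, even if inside the limits"."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Steady load around 50 with one spike at index 10
load = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 95, 50, 51]
```

Production systems use far more robust models, but the principle is the same: learn what "normal" looks like from recent history rather than hand-picking static thresholds.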


Questions?
