1 of 26

An intro to metrics collection with Prometheus and visualization with Grafana at the Colorado School of Mines

MINES.EDU

2 of 26

About Us

Colorado School of Mines and the CIARC group


3 of 26

Mines numbers

  • 6,268 total students (about 5,000 undergrad and 1,200 graduate students)
  • 17 undergraduate majors
  • 35 graduate programs
  • $84 million in research awards in FY2019

4 of 26

CIARC

  • CYBERINFRASTRUCTURE AND ADVANCED RESEARCH COMPUTING
  • A GROUP WITHIN THE CENTRAL IT ORGANIZATION
  • 5 FULL TIME STAFF
  • CURRENTLY RUNNING 3 HPC CLUSTERS AVAILABLE TO RESEARCHERS ACROSS CAMPUS
  • Over 200 active HPC users

5 of 26

Resources

Wendian

78 Intel Skylake CPU nodes + 5 Intel with GPU nodes + 4 Power nodes

874TB BeeGFS filesystem

AuN

144 Intel Sandy Bridge CPU nodes

350TB GPFS filesystem

Mio

202 Intel nodes – several generations

2 Intel nodes with GPU

2 Power8 nodes with GPU

176TB GPFS filesystem

Orebits

800TB ZFS – exported to campus with SMB or NFS

6 of 26

Motivation

Can anything replace Ganglia?


7 of 26

Why replace Ganglia?

  • Pros
    • Ganglia has served us well for many years
    • Ganglia is easy to deploy
    • Good overall cluster views

  • Cons
    • Adding metrics isn’t as easy as I’d like
    • Customizing the dashboards isn’t easy
    • Adding infrastructure metrics doesn’t make sense
    • Zooming in and comparing metrics is awkward
    • Overall, the interface and graph manipulation were feeling dated


8 of 26

A side by side comparison

Ganglia

Grafana


9 of 26

Introduction

Architecture overview and some definitions of terms


10 of 26

Architecture

Credit: https://prometheus.io/assets/architecture.png


11 of 26

Data gathering

Exporters - Run as a service and expose a collection of metrics over HTTP, in the Prometheus text exposition format, when queried

Metric - Any time series data. Types are counter, gauge, histogram, and summary.

Label - Optional key-value pairs that are added to metrics to give dimensionality to data

Pushgateway - A service that allows for pushing metrics from ephemeral and batch jobs. It is not meant to turn Prometheus into a push-based system.
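
To make the exporter, metric, and label definitions concrete, here is a minimal sketch in Python using only the standard library. It is not how node_exporter or any real exporter is implemented. The metric name `demo_jobs_running`, its value, and the helper names are hypothetical. It renders one gauge with two labels as plain text and serves it at /metrics.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_gauge(name, help_text, samples):
    """Render one gauge as exposition text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_gauge(
            "demo_jobs_running",  # hypothetical metric name
            "Number of running jobs (made-up demo value).",
            [({"cluster": "wendian", "role": "compute"}, 3)],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def scrape_once():
    """Start the demo exporter on a free local port and fetch /metrics once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `scrape_once()` plays the role of Prometheus for a single request: it pulls the current values from the exporter rather than having the exporter push them.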


12 of 26

  • Database
    • MySQL
    • PostgreSQL
    • Elasticsearch
  • Hardware
    • Apcupsd
    • Dell OMSA
    • IPMI exporter
    • NVIDIA GPU
  • HTTP
    • Apache
    • HAProxy
    • NGINX
    • Passenger
    • Squid

  • Storage
    • Ceph
    • Ceph RADOSGW
    • Gluster
    • Hadoop HDFS
    • Lustre
  • Other monitoring systems
    • Collectd
    • Graphite
    • InfluxDB
    • Munin
    • Nagios
    • SNMP
    • Slurm


13 of 26

Data Collection

Prometheus – The server that collects metrics from exporters and stores them in the TSDB

Service discovery – A way to dynamically discover services to be scraped. Most useful in dynamic cloud environments.

TSDB – Time Series Database. Prometheus's custom on-disk storage for metric streams

Scraping – The retrieval of streams of metrics from targets, each of which is usually an exporter service running on a remote host.
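
As a rough illustration of what a scrape yields, the sketch below parses exporter text output into (name, labels, value) samples. It is a simplified stand-in, not Prometheus's actual parser: it ignores timestamps and escaped quotes. `parse_samples` is a hypothetical helper name, and the values in the usage example are made up.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


example = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for name, labels, value in parse_samples(example):
    print(name, labels, value)
```

Each parsed sample corresponds to one point in a time series; the server adds the scrape timestamp and any target labels before storing it.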


14 of 26

Alerting

  • We don’t use this yet.
  • We use check_mk for major up/down events.
  • Grafana also includes alerting
    • We have not done a comparison of the two alerting features.


15 of 26

Data viewing

  • Prometheus web UI
    • Built into Prometheus server
    • Useful for debugging or building new panels
  • Grafana
    • Data source - A database of metrics that Grafana can query
    • Panel - The basic visualization building block. Based on one or more queries of a data source.
    • Dashboard - A set of one or more panels arranged into one or more rows.


16 of 26


17 of 26

Installation

Exporters, Prometheus Server, and Grafana


18 of 26

node_exporter

  • Static binary from tarball
  • Init script examples are in the GitHub repository
  • Default listen port 9100/tcp
  • Many command line options to customize what it tries to collect. Use --help to see all the options
  • Test with curl
    • curl http://localhost:9100/metrics


19 of 26

Prometheus Server

  • Static binary plus a few assets from tarball
  • Provide your own init script
  • Command line options point to prometheus.yml and the data directory
  • Default listen port is 9090/tcp
  • Test by browsing to the built-in WebUI
    • http://host.domain.com:9090/
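
Besides browsing to the web UI, the server can be checked from a script through its HTTP API. The sketch below queries the built-in `up` metric via `/api/v1/query` and lists targets whose last scrape failed. The helper names are hypothetical, and the base URL is the placeholder used above.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def down_instances(response):
    """Return instances whose `up` value is 0 in a decoded query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def check_targets(base_url):
    """Query a live server; needs network access to base_url."""
    url = build_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_instances(json.load(resp))
```

For example, `check_targets("http://host.domain.com:9090")` would return an empty list when every configured target is being scraped successfully.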

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
# Change this line if you installed Prometheus under a different path or user
ExecStart=/home/prometheus/prometheus/prometheus \
    --config.file=/home/prometheus/prometheus/prometheus.yml \
    --storage.tsdb.path=/home/prometheus/prometheus/data

[Install]
WantedBy=multi-user.target


20 of 26

Grafana server

  • RPM or Debian package install
  • Enterprise and Cloud hosted versions also available with commercial support
  • systemctl start grafana-server
  • Listens on port 3000/tcp by default
  • Advanced configuration
    • /etc/grafana/grafana.ini
      • Paths, ports, certificates, database, etc.
      • Authentication – Auth Proxy (no role mapping), LDAP, OAuth (multiple providers), SAML (Enterprise only)


21 of 26

Configuration and Customization


22 of 26

Typical workflow overview

  • Add exporters to nodes or other resources with metrics that can be gathered
  • Add exporter targets to Prometheus for scraping
  • Add data source to Grafana
  • Add or Import dashboard
  • Optionally set or customize dashboard variables
  • Add or customize panels


23 of 26

Add exporter targets to Prometheus

prometheus.yml (excerpt)

scrape_configs:
  - job_name: 'compute'
    file_sd_configs:
      - files:
          - /etc/prometheus/nodes/*.yml

/etc/prometheus/nodes/c001.yml

[
  {
    "targets": [ "c001:9100" ],
    "labels": {
      "cluster": "wendian",
      "host": "c001",
      "role": "compute"
    }
  }
]
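
With one target file per node, the files are easy to generate from a host list. The following is a minimal sketch, not the tooling used at Mines: the function names and the default label values are assumptions. It writes JSON into `.yml` files, which relies on the YAML parser accepting JSON documents. If in doubt, write `.json` files instead and adjust the glob in prometheus.yml.

```python
import json
from pathlib import Path


def target_group(host, cluster, role, port=9100):
    """Build the file_sd structure for one node (a list with one target group)."""
    return [
        {
            "targets": [f"{host}:{port}"],
            "labels": {"cluster": cluster, "host": host, "role": role},
        }
    ]


def render_target_file(host, cluster, role, port=9100):
    """Serialize one node's target group as indented JSON."""
    return json.dumps(target_group(host, cluster, role, port), indent=2) + "\n"


def write_target_files(hosts, cluster, role, directory):
    """Write one <host>.yml file per node into the given directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for host in hosts:
        path = directory / f"{host}.yml"
        path.write_text(render_target_file(host, cluster, role))
```

A call such as `write_target_files(["c001", "c002"], "wendian", "compute", "/etc/prometheus/nodes")` would produce files matching the glob in the excerpt above. Prometheus re-reads file_sd files when they change, so adding a node does not require restarting the server.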


24 of 26

Add data source to Grafana
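
The data source can also be added without the web form, through Grafana's HTTP API. The sketch below only builds the request for `POST /api/datasources` and leaves sending it to the reader. The Grafana hostname, the token, and the helper names are placeholders, and the payload is the minimal set of fields assumed to be needed for a Prometheus data source.

```python
import json
import urllib.request


def datasource_payload(name, prometheus_url, is_default=True):
    """Describe a Prometheus data source for Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # the Grafana server, not the browser, runs the queries
        "isDefault": is_default,
    }


def build_request(grafana_url, api_token, payload):
    """Build (but do not send) the HTTP request that creates the data source."""
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


request = build_request(
    "http://grafana.example.com:3000",  # placeholder Grafana address
    "REPLACE_WITH_TOKEN",               # placeholder API token
    datasource_payload("Prometheus", "http://host.domain.com:9090"),
)
# urllib.request.urlopen(request) would send it to a live Grafana server.
```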


25 of 26

Import a dashboard

  • Import
    • Search the Grafana dashboard site - https://grafana.com/grafana/dashboards
    • If there are many to choose from look at downloads and reviews
    • There is a button to copy the dashboard ID to clipboard. That can be pasted into the import interface.
    • Set a name and choose a default data source
    • Start editing
  • Common problems
    • If graphs have no data, check the dashboard variables and any hard-coded labels in the panel queries
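
One way to hunt for hard-coded labels in an imported dashboard is to scan its exported JSON. The sketch below assumes the usual dashboard layout, in which each panel has a `targets` list whose entries carry a PromQL `expr`. It reports label matchers whose value is a literal rather than a `$variable`. The function names are hypothetical and the regular expression is deliberately simple.

```python
import re

# label, operator, quoted value, e.g. job="node" or instance=~"$node"
MATCHER_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*(=~|!~|!=|=)\s*"([^"]*)"')


def iter_panels(panels):
    """Yield panels, including panels nested inside row panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def hard_coded_matchers(dashboard):
    """Return (panel_title, label, value) for matchers without a variable."""
    findings = []
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            for label, _op, value in MATCHER_RE.findall(expr):
                if "$" not in value:
                    findings.append((panel.get("title", ""), label, value))
    return findings


dashboard = {
    "panels": [
        {
            "title": "Load",
            "targets": [{"expr": 'node_load1{job="node",instance=~"$node"}'}],
        }
    ]
}
print(hard_coded_matchers(dashboard))
```

In this made-up dashboard the `job="node"` matcher is reported. If the local scrape job is named `compute`, as in the earlier excerpt, that matcher would explain an empty graph.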


26 of 26

Let’s see a demo…