1 of 95

Prometheus Workshop

Adam Chen, Owen Wu, Zz Chen

2 of 95

Outline

Installation time
Prometheus Overview & Details
Service Discovery (Kubernetes)
Getting Familiar with Metrics, Grafana, and Pushgateway
Alert Manager with Practical Cases

3 of 95

Chapter 0: Setup Cloud9 and EKS

The following installation-guide screenshots come from: https://github.com/pahud/amazon-eks-workshop

4 of 95

5 of 95

6 of 95

7 of 95

8 of 95

7. Execute 'aws configure' to configure the credentials for your IAM user. Make sure this IAM user has AdministratorAccess, then run 'aws sts get-caller-identity'. You should see JSON output like this.

9 of 95

Create an IAM key if you don't have one (1)

10 of 95

Create an IAM key if you don't have one (2)

11 of 95

Create an IAM key if you don't have one (3)

12 of 95

Create an IAM key if you don't have one (4)

13 of 95

Run command in Cloud9

$ git clone https://github.com/Taipei-HUG/Prometheus-workshop

$ cd CH_0

$ ./step1.sh # download all binaries

$ ./step2.sh # set up the EKS cluster

$ ./step3.sh # download and set up Helm

$ ./step4.sh # install kube-prometheus and Pushgateway with load balancers

$ ./get_links.sh # show links to all components

14 of 95

Prometheus Overview

15 of 95

Chapter I : Prometheus Overview

Introduction
Architecture and components

With Config

Service discovery

Metric types (counter, gauge, histogram, summary) (5 mins hands-on)

Push Gateway
Data Store

Local storage
Remote storage

PromQL overview

WebUI and official site, functions (5 mins hands-on)

16 of 95

17 of 95

Prometheus

Pull-based monitoring system
Service Discovery
Time series database
Alertmanager
Plenty of exporters
Hierarchical architecture
Supports remote storage

18 of 95

Basic Prometheus Config

global:
  [ scrape_interval: <duration> | default = 1m ]
  [ scrape_timeout: <duration> | default = 10s ]
  [ evaluation_interval: <duration> | default = 1m ]

rule_files: # Trigger alert
  [ - <filepath_glob> ... ]

scrape_configs: # Find Target
  [ - <scrape_config> ... ]

alerting:
  alertmanagers:
    [ - <alertmanager_config> ... ]

remote_write:
  [ - <remote_write> ... ]

remote_read:
  [ - <remote_read> ... ]

19 of 95

Service Discovery

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config

20 of 95

POD

21 of 95

Exporters

https://github.com/prometheus/prometheus/wiki/Default-port-allocations

22 of 95

Check EKS Cluster

Time for step 3 & 4

23 of 95

Observability: Metrics

<metric name>{label1="value1", label2="value2", ...}

24 of 95

Data type

Counter: monotonically increasing
Gauge: can go up or down
Histogram (quantiles computed on the Prometheus server)
Summary (quantiles provided directly by the client)

25 of 95

Data type (Histogram)

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1000"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="10000"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100000"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+06"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+07"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+08"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="+Inf"} 374

alertmanager_http_response_size_bytes_sum{handler="/alerts",method="post"} 7480

alertmanager_http_response_size_bytes_count{handler="/alerts",method="post"} 374
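
As a rough sketch of what "computed on the Prometheus server" means for a histogram, the Python below estimates a quantile from the cumulative buckets on this slide by linear interpolation. It is a simplified take on the idea behind PromQL's histogram_quantile(), not Prometheus code; the function name and the (upper bound, count) list layout are assumptions made for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]            # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound     # nothing to interpolate inside this bucket
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Bucket values copied from the slide: every bucket already holds all 374 samples.
buckets = [(100.0, 374), (1000.0, 374), (10000.0, 374), (100000.0, 374),
           (1e6, 374), (1e7, 374), (1e8, 374), (math.inf, 374)]

median = histogram_quantile(0.5, buckets)   # 50.0
mean = 7480 / 374                           # _sum / _count = 20.0
print(median, mean)
```

The estimated median (50 bytes) is well above the true mean (20 bytes) because all observations fall into the first bucket and the interpolation can only assume they are spread evenly across it. Bucket boundaries therefore decide how accurate the quantiles are.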

26 of 95

Data type (Summary)

alertmanager_nflog_gc_duration_seconds{quantile="0.5"} 3.273e-06

alertmanager_nflog_gc_duration_seconds{quantile="0.9"} 3.273e-06

alertmanager_nflog_gc_duration_seconds{quantile="0.99"} 3.273e-06

alertmanager_nflog_gc_duration_seconds_sum 1.9283e-05

alertmanager_nflog_gc_duration_seconds_count 6

27 of 95

Practice #01

Get metrics from the Prometheus server

28 of 95

Get metrics

Get metrics from alertmanager
Browse http://Prometheus_ELB_DNS_NAME:9090/metrics

# HELP alertmanager_alerts How many alerts by state.

# TYPE alertmanager_alerts gauge

alertmanager_alerts{state="active"} 12

alertmanager_alerts{state="suppressed"} 0

# HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid.

# TYPE alertmanager_alerts_invalid_total counter

alertmanager_alerts_invalid_total 0
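
To make the text format above concrete, here is a minimal Python sketch that parses it into metric types and samples. It is an illustration under simplifying assumptions (no escaped quotes in label values, no timestamps), and the function name parse_exposition is hypothetical; real clients should use an official parser.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_exposition(text):
    """Return ({metric: type}, [(name, labels, value), ...]) from exposition text."""
    types, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            _, _, name, metric_type = line.split(None, 3)
            types[name] = metric_type
        elif line.startswith("#"):
            continue                  # HELP lines and other comments
        else:
            match = SAMPLE_RE.match(line)
            if match:
                labels = dict(LABEL_RE.findall(match.group("labels") or ""))
                samples.append((match.group("name"), labels, float(match.group("value"))))
    return types, samples

text = '''
# HELP alertmanager_alerts How many alerts by state.
# TYPE alertmanager_alerts gauge
alertmanager_alerts{state="active"} 12
alertmanager_alerts{state="suppressed"} 0
# TYPE alertmanager_alerts_invalid_total counter
alertmanager_alerts_invalid_total 0
'''
types, samples = parse_exposition(text)
print(types)
print(samples)
```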

29 of 95

Remote Storage and sidecar (Thanos)

https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage

30 of 95

Remote Storage and sidecar (Thanos)

31 of 95

Remote Storage and sidecar (Thanos)

32 of 95

PromQL with functions

sum, avg (aggregation operators)
bool (comparison modifier)
rate (function)
...

https://prometheus.io/docs/prometheus/latest/querying/functions/

33 of 95

Practice #02

Find out basic functions of PromQL

34 of 95

PromQL with functions

Browse http://Prometheus_ELB:9090
pushgateway_http_requests_total
pushgateway_http_requests_total[2m]
increase(pushgateway_http_requests_total[2m])
increase(pushgateway_http_requests_total[2m]) / 120
rate(pushgateway_http_requests_total[2m])
pushgateway_http_requests_total{code="200"}
sum(pushgateway_http_requests_total{code="200"}) by (instance)
sum(http_requests_total{code="200"}) by (instance)
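
The queries above show that increase(...[2m]) / 120 and rate(...[2m]) describe the same thing. A minimal Python sketch of that relationship, assuming a list of (timestamp in seconds, counter value) samples, is given here. It is deliberately simplified: Prometheus also extrapolates to the window boundaries, which this version skips, and the sample values are made up.

```python
def increase(samples):
    """Total growth of a counter over (timestamp, value) samples, reset-aware."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value.
        total += (curr - prev) if curr >= prev else curr
    return total

def rate(samples):
    """Average per-second growth between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

# Hypothetical 2-minute window with a counter reset between t=60 and t=90.
samples = [(0, 100), (30, 110), (60, 125), (90, 5), (120, 20)]
print(increase(samples))   # 45.0
print(rate(samples))       # 0.375
```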

35 of 95

Alertmanager

36 of 95

Break Time

37 of 95

Chapter II : Service Discovery & Kubernetes

Introducing service discovery for Prometheus

Kubernetes

Kubernetes introduction
Where to run Prometheus?
Prometheus Operator: Prometheus in Kubernetes
ServiceMonitor: elegant service discovery
Exporter

38 of 95

Service Discovery

"Service discovery fully deserves to be called the soul of microservice architecture." (安德魯大大)
In a cloud-native environment, many instances (VMs or pods) may start or shut down at any time.
For Prometheus, service discovery is the key mechanism for finding which targets to scrape and where they are.
Kubernetes and Prometheus are a perfect match.

39 of 95

Prometheus with file-based service discovery or ...

40 of 95

Service Discovery Configs

We want to integrate Prometheus with the SD that's already there in your infrastructure, not invent yet more ways to do service discovery.
The general principle with SD is to extract all the potentially useful information we can out of the SD, and let the user choose what they need of it using relabelling. This information is generally termed metadata.
Ref

41 of 95

Resources & Service Discovery in Kubernetes

Pod
Label
Selector
Service

42 of 95

Pods

43 of 95

Pods

Logical Application

One or more containers and volumes
Shared namespaces
One IP per pod

Pod (IP 10.10.1.100), containers: nginx, monolith

A pod is the unit of scheduling in Kubernetes. It is a resource envelope in which one or more containers run. Containers that are part of the same pod are guaranteed to be scheduled together onto the same machine, and can share state via local volumes.

Kubernetes is able to give every pod and service its own IP address. This removes the infrastructure complexity of managing ports, and allows developers to choose any ports they want rather than requiring their software to adapt to the ones chosen by the infrastructure. The latter point is crucial for making it easy to run off-the-shelf open-source applications on Kubernetes--pods can be treated much like VMs or physical hosts, with access to the full port space, oblivious to the fact that they may be sharing the same physical machine with other pods.

44 of 95

Labels

45 of 95

Labels

Arbitrary metadata attached to Kubernetes objects

Pod hello, labels: version: v1, track: stable

Pod hello, labels: version: v1, track: test

46 of 95

Labels

selector: “version=v1”

Pod hello, labels: version: v1, track: stable

Pod hello, labels: version: v1, track: test

47 of 95

Labels

selector: “track=stable”

Pod hello, labels: version: v1, track: stable

Pod hello, labels: version: v1, track: test
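
The selector slides can be sketched in a few lines of Python. This is only an illustration of equality-based selectors, not how Kubernetes implements them; the pod names hello-stable and hello-test are made up so the two pods can be told apart.

```python
def matches(labels, selector):
    """True when every key=value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())

def select(pods, selector):
    """Names of the pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if matches(pod["labels"], selector)]

pods = [
    {"name": "hello-stable", "labels": {"version": "v1", "track": "stable"}},
    {"name": "hello-test", "labels": {"version": "v1", "track": "test"}},
]
print(select(pods, {"version": "v1"}))     # ['hello-stable', 'hello-test']
print(select(pods, {"track": "stable"}))   # ['hello-stable']
```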

48 of 95

Services

49 of 95

Kubernetes Service

Client

Service (selector: app=app2)

Service (selector: app=myApp)

50 of 95

Practice #03

Try to understand label selector with kubectl

51 of 95

Demo Selector with kubectl

$ kubectl get pod -n kube-system --show-labels
$ kubectl get pod -n kube-system --show-labels -l k8s-app=kube-dns
$ kubectl get service -n kube-system
$ kubectl describe service kube-dns -n kube-system
$ kubectl get pod -n kube-system --show-labels -l k8s-app=kube-dns -o wide

52 of 95

Flow with service discovery

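`relabel_configs`

For service discovery: applied to discovered targets before they are scraped

`metric_relabel_configs`

For metrics: applied to scraped samples before they are stored

`keep` vs `drop`
Most rules use the default action, `replace`
Ref: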

https://blog.freshtracks.io/prometheus-relabel-rules-and-the-action-parameter-39c71959354a
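
A minimal Python sketch of how keep, drop, and replace act on one target's labels follows. It mimics the documented behavior (source labels joined with ";", fully anchored regex, $1 style replacement) but is an assumption-laden illustration, not Prometheus code; the relabel function and the example pod names are hypothetical.

```python
import re

def relabel(labels, rules):
    """Apply relabel rules to one target's label set; return None if dropped."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep" and not match:
            return None
        if action == "drop" and match:
            return None
        if action == "replace" and match:
            # Translate Prometheus-style $1 into Python's \g<1> group syntax.
            template = re.sub(r"\$(\d+)", r"\\g<\1>", rule.get("replacement", "$1"))
            labels[rule["target_label"]] = match.expand(template)
    return labels

targets = [
    {"__meta_kubernetes_pod_label_k8s_app": "kube-dns",
     "__meta_kubernetes_pod_name": "coredns-abc"},
    {"__meta_kubernetes_pod_label_k8s_app": "kube-proxy",
     "__meta_kubernetes_pod_name": "kube-proxy-xyz"},
]
rules = [
    {"source_labels": ["__meta_kubernetes_pod_label_k8s_app"],
     "regex": "kube-dns", "action": "keep"},
    {"source_labels": ["__meta_kubernetes_pod_name"], "target_label": "pod"},
]
kept = [t for t in (relabel(target, rules) for target in targets) if t is not None]
print(kept)
```

Only the kube-dns target survives the keep rule, and the second rule copies its pod name into a plain pod label.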

53 of 95

Practice #04

Using kubernetes_sd_configs to discover CoreDNS

54 of 95

Kubernetes Service Discovery - CoreDNS

$ cd CH_2/coredns_scrape_configs
$ sh generate_yaml.sh
$ kubectl apply -f manifests/
$ sh restart_prometheus.sh

Ref

https://github.com/prometheus/prometheus/blob/release-2.8/documentation/examples/prometheus-kubernetes.yml

55 of 95

Where to run Prometheus?

In Kubernetes?
On a dedicated machine or VM?
Inside Kubernetes, Prometheus can reach the pods easily; from outside the cluster this is much harder.
But Kubernetes prefers to treat applications as stateless, so restarting or upgrading Prometheus needs more care.
On a dedicated machine, would it be easier to manage?

56 of 95

Why use Kubernetes?

57 of 95

Why use Kubernetes?

A universal platform for managing applications

Easy to scale
Unified, powerful interface for operations

Cost

Share infrastructure with other applications
Use Spot Instances: https://eksworkshop.com/spotworkers/

A lot of resources & support

58 of 95

Prometheus Operator

59 of 95

Operators

A Kubernetes Operator helps extend the types of applications that can run on Kubernetes by allowing developers to provide additional knowledge to applications that need to maintain state.

An Operator mostly focuses on automation and on encoding the application-specific operational know-how, which simplifies deployment and maintenance.

60 of 95

Prometheus Operator Architecture

61 of 95

Prometheus Operator Objects (CRDs)

Prometheus

which defines a desired Prometheus deployment.

PrometheusRule

which defines Prometheus alerting and recording rules that a Prometheus instance can load.

Alertmanager

which defines a desired Alertmanager deployment.

ServiceMonitor

62 of 95

Service Monitor

Declaratively define how a dynamic set of services should be monitored.
Which services are selected to be monitored with the desired configuration is defined using label selections.
The ServiceMonitor object introduced by the Prometheus Operator in turn discovers those Endpoints objects and configures Prometheus to monitor those Pods.
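
As a hedged sketch of what such an object looks like, the Python below builds a ServiceMonitor manifest as a plain dictionary and prints it as JSON. The helper name, the object name coredns, the label k8s-app=kube-dns, the port name metrics, and the 15s interval are illustrative assumptions; the workshop's own manifests in CH_2 may differ.

```python
import json

def service_monitor(name, namespace, match_labels, port, target_namespace):
    """Build a ServiceMonitor that selects Services by label in one namespace."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Which Services to monitor, chosen by label.
            "selector": {"matchLabels": match_labels},
            # Where to look for those Services.
            "namespaceSelector": {"matchNames": [target_namespace]},
            # Which named Service port to scrape, and how often.
            "endpoints": [{"port": port, "interval": "15s"}],
        },
    }

manifest = service_monitor(
    name="coredns",
    namespace="monitoring",
    match_labels={"k8s-app": "kube-dns"},
    port="metrics",
    target_namespace="kube-system",
)
print(json.dumps(manifest, indent=2))
```

Since kubectl also accepts JSON, the printed output could be piped into kubectl apply -f - to try it out.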

63 of 95

Practice #05

Using Service Monitor to discover coredns

64 of 95

Service Monitor Demo

$ cd CH_2/coredns_service_monitor
$ diff ../coredns_scrape_configs/manifests/prometheus-prometheus.yaml manifests/prometheus-prometheus.yaml
$ kubectl apply -f manifests/
$ sh reload_prometheus.sh

65 of 95

For resources not in Kubernetes

66 of 95

How to supply metrics to Prometheus

67 of 95

Natively or Exporter

Prometheus is a pull-based monitoring system.
Prometheus best practice is to instrument services natively.
But for non-natively-instrumented services (such as Memcached, Postgres, etc.) it is possible to use an exporter.
An exporter is a process that runs alongside your service and translates metrics from the service into the format Prometheus understands.
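
To illustrate the exporter idea, here is a minimal sketch using only the Python standard library. It assumes a hypothetical read_service_stats() that stands in for querying the real service, made-up metric names with a demo_ prefix, and an arbitrary port 9999. Real exporters normally use an official Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_service_stats():
    """Hypothetical stand-in for asking the real service for its statistics."""
    return {"current_connections": 7, "requests_served": 1234}

def render_metrics(stats):
    """Translate service statistics into the Prometheus text format."""
    lines = [
        "# HELP demo_current_connections Open connections reported by the service.",
        "# TYPE demo_current_connections gauge",
        f"demo_current_connections {stats['current_connections']}",
        "# HELP demo_requests_total Requests served since the service started.",
        "# TYPE demo_requests_total counter",
        f"demo_requests_total {stats['requests_served']}",
    ]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_service_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=9999):
    """Serve /metrics until interrupted, so Prometheus can scrape it."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the exporter; Prometheus would then scrape http://<host>:9999/metrics like any other target.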

68 of 95

Exporter

69 of 95

More Exporter

70 of 95

Chapter III

Grafana & Pushgateway

71 of 95

Grafana

An analytics platform

to query, visualize, and alert on metrics

72 of 95

73 of 95

74 of 95

Access Dashboards

./CH_0/get_links.sh

75 of 95

Preloaded dashboards from kube-prometheus

76 of 95

Practice: Monitor CoreDNS status

Create a new dashboard
Add a new query to get the status of CoreDNS

77 of 95

kube_pod_status_ready{pod=~"coredns-(.*)", condition="false"}

78 of 95

Pushgateway

Allows ephemeral and batch jobs

to expose metrics

79 of 95

Pushgateway

https://github.com/prometheus/pushgateway
For ephemeral workloads and batch jobs

They do not live long enough to be scraped

Pushgateway keeps the metrics for these kinds of jobs

Prometheus then pulls the metrics from the Pushgateway's /metrics endpoint

80 of 95

81 of 95

Play with pushgateway

$ echo "some_metric 3.14" | curl --data-binary @- http://{URL_OF_PUSHGATEWAY}:9091/metrics/job/some_job

$ echo "progress 12" | curl --data-binary @- http://{URL_OF_PUSHGATEWAY}:9091/metrics/job/playing
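
The same push can be sketched in Python with only the standard library, for batch jobs that are not shell scripts. The host pushgateway.example is a placeholder assumption, and the two helper names are made up for this illustration; the URL path and body format mirror the curl commands above.

```python
import urllib.request

def build_push_request(base_url, job, metrics):
    """Prepare a POST of 'name value' lines to /metrics/job/<job>."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")

def push(base_url, job, metrics):
    """Send the metrics and return the HTTP status code."""
    request = build_push_request(base_url, job, metrics)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

# Placeholder URL: replace it with the Pushgateway link from get_links.sh.
request = build_push_request("http://pushgateway.example:9091", "some_job", {"some_metric": 3.14})
print(request.full_url)
print(request.data)
```

A batch job would call push(...) once just before it exits.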

82 of 95

Check the metrics

Metrics: http://<pushgateway-host>:9091/metrics

UI: http://<pushgateway-host>:9091/

83 of 95

Check in the Prometheus console

84 of 95

Break Time

85 of 95

Chapter IV : Alerting & Practical Cases

Let’s pull the trigger.

86 of 95

Alertmanager

87 of 95

Alertmanager

Global setting
Template
Route
Receiver
Inhibit rules

global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ slack_api_url: <secret> ]
  [ http_config: <http_config> ]

templates:
  [ - <filepath> ... ]

route: <route>

receivers:
  - <receiver> ...

inhibit_rules:
  [ - <inhibit_rule> ... ]

88 of 95

Alertmanager - Route

"route":
  "group_by":
  - "job"
  "group_interval": "1m"
  "group_wait": "30s"
  "receiver": "slack_alert1"
  "repeat_interval": "3m"
  "routes":
  - "match":
      "alertname": "Watchdog"
    "receiver": "slack_alert2"

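How the route tree on this slide picks a receiver can be sketched in Python. This is a simplified illustration that handles only exact "match" rules and first-match-wins; it ignores grouping, timing, regex matchers, and the continue option. The function name pick_receiver and the alert name HighLatency are assumptions.

```python
def pick_receiver(route, alert_labels):
    """Walk the route tree and return the receiver for one alert."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(key) == value for key, value in wanted.items()):
            # Child routes inherit the parent's receiver unless they set one.
            inherited = {"receiver": route["receiver"], **child}
            return pick_receiver(inherited, alert_labels)
    return route["receiver"]

route = {
    "receiver": "slack_alert1",
    "routes": [
        {"match": {"alertname": "Watchdog"}, "receiver": "slack_alert2"},
    ],
}
print(pick_receiver(route, {"alertname": "Watchdog"}))      # slack_alert2
print(pick_receiver(route, {"alertname": "HighLatency"}))   # slack_alert1
```
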
89 of 95

Setup Slack

Sign in to your Slack workspace
Create an Incoming Webhook URL
Create two channels of your own

90 of 95

Setup Alertmanager & Nginx

$ vi CH_4/alertmanager.yaml

$ CH_4/apply_change.sh

"receivers":
- "name": "slack_alert1"
  "slack_configs":
  - "api_url": "https://hooks.slack.com/services/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"
    "channel": "#alert_1"
- "name": "slack_alert2"
  "slack_configs":
  - "api_url": "https://hooks.slack.com/services/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"
    "channel": "#alert_2"

$ CH_4/helm_nginx_install.sh

91 of 95

Install rule 1

$ cat alert_rule_1.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-example-rules
  namespace: monitoring
spec:
  groups:
  - name: yourname.rules
    rules:
    - alert: YournameAlert
      expr: vector(1)

$ kubectl apply -f alert_rule_1.yaml

92 of 95

Install rule 2

$ cat alert_rule_2.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-example-rules2
  namespace: monitoring
spec:
  groups:
  - name: nginx_rule
    rules:
    - alert: NGINXAlert
      expr: nginx_ingress_controller_nginx_process_requests_total > 1000

$ kubectl apply -f alert_rule_2.yaml

$ CH_4/trigger_nginx_alert.sh
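
Why rule 1 always fires and rule 2 fires only under load can be sketched in Python: an alerting rule produces one alert for every element its expression returns. The code is a simplified illustration, not Prometheus internals; the controller_pod label and the sample values are made up.

```python
def greater_than(vector, threshold):
    """Mimic `metric > threshold`: keep only the series above the threshold."""
    return [(labels, value) for labels, value in vector if value > threshold]

def firing(alert_name, result_vector):
    """One alert per element of the expression's result vector."""
    return [{"alertname": alert_name, **labels} for labels, _value in result_vector]

# vector(1) always returns a single element, so YournameAlert always fires.
always_one = [({}, 1.0)]
print(firing("YournameAlert", always_one))

# Hypothetical request counters for two ingress controller pods.
requests = [({"controller_pod": "nginx-a"}, 1500.0), ({"controller_pod": "nginx-b"}, 200.0)]
print(firing("NGINXAlert", greater_than(requests, 1000)))
```

Only nginx-a crosses the threshold, so only that series turns into an NGINXAlert.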

93 of 95

Check Slack for alerts

94 of 95

Remove all resources in AWS

Remember to run the script below to remove all AWS resources and avoid unnecessary cost.

$ cd CH_0

$ ./uninstall.sh

95 of 95

THANKS