1 of 95

Prometheus Workshop

Adam Chen, Owen Wu, Zz Chen

2 of 95

Outline

  • Installation time
  • Prometheus Overview & Details
  • Service Discovery (Kubernetes)
  • Getting familiar with Metrics, Grafana, and Pushgateway
  • Alert Manager with Practical Cases

3 of 95

Chapter 0: Setup Cloud9 and EKS

The following installation guide screenshots come from: https://github.com/pahud/amazon-eks-workshop

4 of 95

5 of 95

6 of 95

7 of 95

8 of 95

7. Run 'aws configure' to set up the credentials for your IAM user. Make sure this IAM user has AdministratorAccess, then run 'aws sts get-caller-identity'. You should see JSON output like this.

9 of 95

Create an IAM key if you don't have one (1)

10 of 95

Create an IAM key if you don't have one (2)

11 of 95

Create an IAM key if you don't have one (3)

12 of 95

Create an IAM key if you don't have one (4)

13 of 95

Run commands in Cloud9

$ git clone https://github.com/Taipei-HUG/Prometheus-workshop

$ cd Prometheus-workshop/CH_0

$ ./step1.sh # get all binary

$ ./step2.sh # setup eks cluster

$ ./step3.sh # get and setup helm

$ ./step4.sh # install kube-prometheus and Pushgateway with load balancers (LBs)

$ ./get_links.sh # show all components link

14 of 95

Prometheus Overview

15 of 95

Chapter I : Prometheus Overview

  • Introduction
  • Arch and components
    • With Config
  • Service discovery
    • Metrics type (counter, gauge, histogram, summary) (5 mins hands-on)
  • Push Gateway
  • Data Store
    • Local storage
    • Remote storage
  • PromQL overview
    • WebUI and official site, functions (5 mins hands-on)

16 of 95

17 of 95

Prometheus

  • Pull based monitoring system
  • Service Discovery
  • Time series database
  • Alertmanager
  • Plenty of exporters
  • Hierarchical architecture
  • Support remote storage

18 of 95

Basic Prometheus Config

global:
  [ scrape_interval: <duration> | default = 1m ]
  [ scrape_timeout: <duration> | default = 10s ]
  [ evaluation_interval: <duration> | default = 1m ]

rule_files: # Trigger alert
  [ - <filepath_glob> ... ]

scrape_configs: # Find Target
  [ - <scrape_config> ... ]

alerting:
  alertmanagers:
    [ - <alertmanager_config> ... ]

remote_write:
  [ - <remote_write> ... ]

remote_read:
  [ - <remote_read> ... ]

19 of 95

Service Discovery

20 of 95

POD

21 of 95

Exporters

22 of 95

Check EKS Cluster

Time for step 3 & 4

23 of 95

Observability: metrics

<metric name>{label1="value1", label2="value2", ...}

24 of 95

Data type

  • Counter: monotonically increasing
  • Gauge: can go up and down
  • Histogram (quantiles calculated on the Prometheus server)
  • Summary (quantiles provided directly by the client)

25 of 95

Data type (Histogram)

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1000"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="10000"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100000"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+06"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+07"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+08"} 374

alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="+Inf"} 374

alertmanager_http_response_size_bytes_sum{handler="/alerts",method="post"} 7480

alertmanager_http_response_size_bytes_count{handler="/alerts",method="post"} 374
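
A rough Python sketch of how a quantile can be estimated from cumulative buckets like the ones above. It is a simplified version of the interpolation idea behind histogram_quantile, not the server's actual code: it assumes sorted buckets that end with +Inf and evenly spread observations inside each bucket. The bucket values are copied from this slide.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified sketch: buckets must be sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0]
            if count == lower_count:
                return upper_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Values from the slide: all 374 observations landed in the le="100" bucket.
buckets = [(100.0, 374), (1000.0, 374), (10000.0, 374), (100000.0, 374),
           (1e6, 374), (1e7, 374), (1e8, 374), (math.inf, 374)]

median_estimate = histogram_quantile(0.5, buckets)
mean_size = 7480 / 374  # _sum divided by _count

print(median_estimate, mean_size)
```

The estimated median sits in the middle of the first bucket while the true mean is much lower, which shows why bucket boundaries matter for histogram accuracy.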

26 of 95

Data type (Summary)

alertmanager_nflog_gc_duration_seconds{quantile="0.5"} 3.273e-06

alertmanager_nflog_gc_duration_seconds{quantile="0.9"} 3.273e-06

alertmanager_nflog_gc_duration_seconds{quantile="0.99"} 3.273e-06

alertmanager_nflog_gc_duration_seconds_sum 1.9283e-05

alertmanager_nflog_gc_duration_seconds_count 6

27 of 95

Practice #01

Get metrics from prometheus server

28 of 95

Get metrics

  • Get metrics from alertmanager
  • Browse http://Prometheus_ELB_DNS_NAME:9090/metrics

# HELP alertmanager_alerts How many alerts by state.

# TYPE alertmanager_alerts gauge

alertmanager_alerts{state="active"} 12

alertmanager_alerts{state="suppressed"} 0

# HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid.

# TYPE alertmanager_alerts_invalid_total counter

alertmanager_alerts_invalid_total 0
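
A minimal Python sketch for reading this text format, using only the lines shown on this slide. It is an illustration, not a full parser: it skips comment lines and ignores timestamps and escaped quotes in label values.

```python
import re

# One sample line: metric name, optional {labels}, then a value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # unsupported line shape in this sketch
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


EXPOSITION = '''
# HELP alertmanager_alerts How many alerts by state.
# TYPE alertmanager_alerts gauge
alertmanager_alerts{state="active"} 12
alertmanager_alerts{state="suppressed"} 0
# HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid.
# TYPE alertmanager_alerts_invalid_total counter
alertmanager_alerts_invalid_total 0
'''

samples = parse_metrics(EXPOSITION)
for name, labels, value in samples:
    print(name, labels, value)
```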

29 of 95

Remote Storage and sidecar (Thanos)

30 of 95

Remote Storage and sidecar (Thanos)

31 of 95

Remote Storage and sidecar (Thanos)

32 of 95

PromQL with functions

  • sum
  • avg
  • bool
  • rate
  • ...

33 of 95

Practice #02

Find out basic functions of PromQL

34 of 95

PromQL with functions

  • Browse http://Prometheus_ELB:9090
  • pushgateway_http_requests_total
  • pushgateway_http_requests_total[2m]
  • increase(pushgateway_http_requests_total[2m])
  • increase(pushgateway_http_requests_total[2m]) / 120
  • rate(pushgateway_http_requests_total[2m])
  • pushgateway_http_requests_total{code="200"}
  • sum(pushgateway_http_requests_total{code="200"}) by (instance)
  • sum(http_requests_total{code="200"}) by (instance)
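
Why increase(...[2m]) / 120 and rate(...[2m]) give the same number can be sketched in Python. This is a simplified model under stated assumptions: the sample values are made up, and the real functions also extrapolate to the edges of the window, which this sketch leaves out.

```python
def increase(samples):
    """Total counter increase across (timestamp, value) samples, allowing resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value in full.
        total += current - previous if current >= previous else current
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical samples (seconds, counter value) with one counter reset at t=90.
window = [(0, 100), (30, 130), (60, 160), (90, 10), (120, 40)]

total_increase = increase(window)
per_second = rate(window)

print(total_increase, per_second)
```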

35 of 95

Alertmanager

36 of 95

Break Time

37 of 95

Chapter II : Service Discovery & Kubernetes

  • Introduce the Service Discovery for Prometheus
    • kubernetes
  • Kubernetes Introduction
  • Where to run Prometheus ?
  • Prometheus Operator : Prometheus in Kubernetes
  • Service Monitor - Elegant Service Discovery
  • Exporter

38 of 95

Service Discovery

  • "Service Discovery truly deserves to be called the soul of microservice architecture." (by 安德魯大大)
  • In a cloud-native environment, many instances (VMs / pods) may start or shut down at any time
  • For Prometheus, Service Discovery is the key way to find which targets to scrape metrics from and where they are.
  • Kubernetes & Prometheus are a perfect match

39 of 95

Prometheus with file-based Service Discovery or ...

40 of 95

Service Discovery Configs

41 of 95

Resources & Service Discovery in Kubernetes

  • Pod
  • Label
  • Selector
  • Service

42 of 95

Pods

43 of 95

Pods

Logical Application

  • One or more containers and volumes
  • Shared namespaces
  • One IP per pod

Pod

nginx

monolith

10.10.1.100

44 of 95

Labels

45 of 95

Labels

Arbitrary metadata attached to Kubernetes objects

Pod "hello"
  labels:
    version: v1
    track: stable

Pod "hello"
  labels:
    version: v1
    track: test

46 of 95

Labels

selector: “version=v1”

Pod "hello"
  labels:
    version: v1
    track: stable

Pod "hello"
  labels:
    version: v1
    track: test

47 of 95

Labels

selector: “track=stable”

Pod "hello"
  labels:
    version: v1
    track: stable

Pod "hello"
  labels:
    version: v1
    track: test
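
The selector behaviour on these Labels slides can be sketched in Python. This covers only equality-based selectors ("key=value", comma separated). The pod names hello-stable and hello-test are made up here to tell the two "hello" pods apart.

```python
def parse_selector(selector):
    """Turn "k1=v1,k2=v2" into a dict of required label values."""
    terms = (term.strip() for term in selector.split(",") if term.strip())
    return dict(term.split("=", 1) for term in terms)


def matches(labels, selector):
    """True when every key=value term of the selector is present in labels."""
    required = parse_selector(selector)
    return all(labels.get(key) == value for key, value in required.items())


def select(pods, selector):
    """Names of the pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if matches(pod["labels"], selector)]


# The two pods from the slides; the names are hypothetical.
pods = [
    {"name": "hello-stable", "labels": {"version": "v1", "track": "stable"}},
    {"name": "hello-test", "labels": {"version": "v1", "track": "test"}},
]

print(select(pods, "version=v1"))
print(select(pods, "track=stable"))
```

A Service uses the same kind of matching to decide which pods receive its traffic.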

48 of 95

Services

49 of 95

Kubernetes Service


Client

Service (selector: app=app2)

Service (selector: app=myApp)

50 of 95

Practice #03

Try to understand label selector with kubectl

51 of 95

Demo Selector with kubectl

  • $ kubectl get pod -n kube-system --show-labels
  • $ kubectl get pod -n kube-system --show-labels -l k8s-app=kube-dns
  • $ kubectl get service -n kube-system
  • $ kubectl describe service kube-dns -n kube-system
  • $ kubectl get pod -n kube-system --show-labels -l k8s-app=kube-dns -o wide

52 of 95

Flow with service discovery

  • `relabel_configs`
    • For Service Discovery (rewrites target labels before scraping)
  • `metric_relabel_configs`
    • For metrics (rewrites sample labels after scraping)
  • `keep` vs `drop`
  • The most common action is `replace`
  • Ref:
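
A Python sketch of the keep / drop / replace actions listed above. It models only the core semantics (source labels joined with ";", a fully anchored regex, "$1" style replacement) and is not the Prometheus implementation. The example targets and rules are made up for illustration.

```python
import re


def apply_relabel(labels, rule):
    """Apply one relabel rule; return the new labels, or None if dropped."""
    value = ";".join(labels.get(name, "") for name in rule["source_labels"])
    match = re.fullmatch(rule.get("regex", "(.*)"), value)
    action = rule.get("action", "replace")
    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace":
        if match is None:
            return labels  # regex did not match: leave labels untouched
        # Convert "$1" / "${1}" references into Python's "\g<1>" form.
        template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", rule.get("replacement", "$1"))
        return {**labels, rule["target_label"]: match.expand(template)}
    raise ValueError(f"unsupported action in this sketch: {action}")


# Hypothetical discovered targets carrying Kubernetes meta labels.
coredns = {
    "__meta_kubernetes_namespace": "kube-system",
    "__meta_kubernetes_pod_label_k8s_app": "kube-dns",
}
other = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_pod_label_k8s_app": "web",
}

keep_rule = {
    "action": "keep",
    "source_labels": ["__meta_kubernetes_pod_label_k8s_app"],
    "regex": "kube-dns",
}
replace_rule = {
    "source_labels": ["__meta_kubernetes_namespace"],
    "target_label": "namespace",
}

print(apply_relabel(coredns, keep_rule))
print(apply_relabel(other, keep_rule))
print(apply_relabel(coredns, replace_rule))
```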

53 of 95

Practice #04

Using kubernetes_sd_configs to discover CoreDNS

54 of 95

Kubernetes Service Discovery - CoreDNS

  • $ cd CH_2/coredns_scrape_configs
  • $ sh generate_yaml.sh
  • $ kubectl apply -f manifests/
  • $ sh restart_prometheus.sh

55 of 95

Where to run Prometheus?

  • In Kubernetes?
  • On a dedicated machine/VM?
  • Inside Kubernetes, Prometheus can reach Pods easily; from outside, access becomes much harder.
  • But Kubernetes prefers to treat applications as stateless, so restarts and upgrades need more care.
  • On a dedicated machine, would it be easier to manage?

56 of 95

Why use Kubernetes ?

57 of 95

Why use Kubernetes ?

  • A universal platform for managing applications
    • Easy to scale
    • Unified, powerful interface for operations
  • Cost
    • Share infrastructure with other applications
    • Use Spot Instances https://eksworkshop.com/spotworkers/
  • A lot of resources & support

58 of 95

Prometheus Operator

59 of 95

Operators


A Kubernetes Operator helps extend the types of applications that can run on Kubernetes by allowing developers to provide additional knowledge to applications that need to maintain state.

It mostly focuses on automation and on encoding application-specific operational know-how, which simplifies deployment and maintenance.

60 of 95

Prometheus Operator Architecture

61 of 95

Prometheus Operator Objects (CRDs)

  • Prometheus
    • which defines a desired Prometheus deployment.
  • PrometheusRule
    • which can be loaded by a Prometheus instance containing Prometheus alerting and recording rules.
  • Alertmanager
    • which defines a desired Alertmanager deployment.
  • ServiceMonitor

62 of 95

Service Monitor

  • Declaratively define how a dynamic set of services should be monitored.
  • Label selectors define which services are monitored with the desired configuration.
  • The ServiceMonitor object introduced by the Prometheus Operator in turn discovers those Endpoints objects and configures Prometheus to monitor those Pods.

63 of 95

Practice #05

Using Service Monitor to discover coredns

64 of 95

Service Monitor Demo

  • $ cd CH_2/coredns_service_monitor
  • $ diff ../coredns_scrape_configs/manifests/prometheus-prometheus.yaml manifests/prometheus-prometheus.yaml
  • $ kubectl apply -f manifests/
  • $ sh reload_prometheus.sh

65 of 95

For resources not in Kubernetes

66 of 95

How to supply metrics for Prometheus

67 of 95

Natively or Exporter

  • Prometheus is a pull-based monitoring system
  • The Prometheus best practice is to instrument services natively.
  • But for non-natively-instrumented services (such as Memcached, Postgres, etc.) it is possible to use an exporter.
  • An exporter is a process that runs alongside your service and translates metrics from the service into the format Prometheus understands.
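
The "translate" step of an exporter can be sketched in Python as a function that renders values into the text format Prometheus scrapes. This is an illustration only: the metric name demo_cache_items and its label are made up, and real exporters normally use an official client library and serve the result over HTTP at /metrics.

```python
def render_metrics(metrics):
    """Render [(name, type, help, [(labels, value), ...])] as exposition text."""
    lines = []
    for name, metric_type, help_text, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            label_text = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            suffix = "{" + label_text + "}" if label_text else ""
            lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical value read from the monitored service.
metrics = [
    (
        "demo_cache_items",
        "gauge",
        "Items currently stored in the cache.",
        [({"pool": "default"}, 42)],
    ),
]

output = render_metrics(metrics)
print(output, end="")
```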

68 of 95

Exporter

69 of 95

More Exporter

70 of 95

Chapter III

Grafana & Pushgateway

71 of 95

Grafana

An analytics platform

to query, visualize, and alert

72 of 95

73 of 95

74 of 95

Access Dashboards

./CH_0/get_links.sh

75 of 95

Preloaded dashboards from kube-prometheus

76 of 95

Practice: Monitor CoreDNS status

  1. Create a new dashboard
  2. Add a new query to get the status of CoreDNS

77 of 95

kube_pod_status_ready{pod=~"coredns-(.*)", condition="false"}

78 of 95

Pushgateway

Allows ephemeral and batch jobs

to expose metrics

79 of 95

Pushgateway

  • https://github.com/prometheus/pushgateway
  • For ephemeral workloads and batch jobs
    • They do not exist long enough to be scraped
  • Pushgateway keeps metrics for these kinds of jobs
    • Prometheus can pull metrics from /metrics of pushgateway

80 of 95

81 of 95

Play with pushgateway

$ echo "some_metric 3.14" | curl --data-binary @- http://{URL_OF_PUSHGATEWAY}:9091/metrics/job/some_job

$ echo "progress 12" | curl --data-binary @- http://{URL_OF_PUSHGATEWAY}:9091/metrics/job/playing
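
The same push can be sketched in Python with the standard library. The host pushgateway.example is a placeholder, and the request is only built here, not sent. Note that the body ends with a newline, just as echo produces it.

```python
import urllib.request


def build_push_request(base_url, job, metric_lines):
    """Build a POST request that pushes text-format metrics for one job."""
    body = "".join(line + "\n" for line in metric_lines).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/metrics/job/{job}",
        data=body,
        method="POST",
    )


# Placeholder host: replace with the Pushgateway URL from get_links.sh.
request = build_push_request(
    "http://pushgateway.example:9091", "some_job", ["some_metric 3.14"]
)

print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```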

82 of 95

Check the metrics

Metrics: http://<pushgateway-host>:9091/metrics

UI: http://<pushgateway-host>:9091/

83 of 95

Check at prometheus console

84 of 95

Break Time

85 of 95

Chapter IV : Alerting & Practical Cases

Let’s pull the trigger.

86 of 95

Alertmanager

87 of 95

Alertmanager

  • Global setting
  • Template
  • Route
  • Receiver
  • Inhibit rules

global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ slack_api_url: <secret> ]
  [ http_config: <http_config> ]

templates:
  [ - <filepath> ... ]

route: <route>

receivers:
  - <receiver> ...

inhibit_rules:
  [ - <inhibit_rule> ... ]

88 of 95

Alertmanager - Route

"route":
  "group_by":
  - "job"
  "group_interval": "1m"
  "group_wait": "30s"
  "receiver": "slack_alert1"
  "repeat_interval": "3m"
  "routes":
  - "match":
      "alertname": "Watchdog"
    "receiver": "slack_alert2"
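
How this route picks a receiver can be sketched in Python. It models only first-match routing on exact label matches and leaves out grouping, timing, regex matchers, and the continue option. The alert label sets are made up.

```python
# The routing tree from the slide, reduced to the fields that pick a receiver.
ROUTE = {
    "receiver": "slack_alert1",
    "routes": [
        {"match": {"alertname": "Watchdog"}, "receiver": "slack_alert2"},
    ],
}


def pick_receiver(route, alert_labels, inherited=None):
    """Walk the tree and return the receiver of the deepest matching route."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(child, alert_labels, receiver)
    # No child matched: the alert stays with this route's receiver.
    return receiver


watchdog_receiver = pick_receiver(ROUTE, {"alertname": "Watchdog", "job": "prometheus"})
default_receiver = pick_receiver(ROUTE, {"alertname": "NGINXAlert", "job": "nginx"})

print(watchdog_receiver, default_receiver)
```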

89 of 95

Setup Slack

  • Sign in to the Slack workspace
  • Create an Incoming Webhook URL
  • Create two channels of your own

90 of 95

Setup Alertmanager & Nginx

$ vi CH_4/alertmanager.yaml

$ CH_4/apply_change.sh

"receivers":
- "name": "slack_alert1"
  "slack_configs":
  - "api_url": "https://hooks.slack.com/services/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"
    "channel": "#alert_1"
- "name": "slack_alert2"
  "slack_configs":
  - "api_url": "https://hooks.slack.com/services/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"
    "channel": "#alert_2"

$ CH_4/helm_nginx_install.sh

91 of 95

Install rule 1

$ cat alert_rule_1.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-example-rules
  namespace: monitoring
spec:
  groups:
  - name: yourname.rules
    rules:
    - alert: YournameAlert
      expr: vector(1)

$ kubectl apply -f alert_rule_1.yaml

92 of 95

Install rule 2

$ cat alert_rule_2.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-example-rules2
  namespace: monitoring
spec:
  groups:
  - name: nginx_rule
    rules:
    - alert: NGINXAlert
      expr: nginx_ingress_controller_nginx_process_requests_total > 1000

$ kubectl apply -f alert_rule_2.yaml

$ CH_4/trigger_nginx_alert.sh
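
What the "> 1000" expression does can be sketched in Python: a comparison keeps only the series above the threshold, and each remaining series becomes one alert. The series labels and values here are made up.

```python
def firing_alerts(alert_name, series, threshold):
    """Return one alert per series whose value exceeds the threshold."""
    return [
        {"alertname": alert_name, **labels}
        for labels, value in series
        if value > threshold
    ]


# Hypothetical request counters for two ingress controller pods.
series = [
    ({"controller_pod": "nginx-a"}, 1500.0),
    ({"controller_pod": "nginx-b"}, 200.0),
]

alerts = firing_alerts("NGINXAlert", series, 1000)
print(alerts)
```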

93 of 95

Check Slack for alerts

94 of 95

Remove all resources in AWS

Remember to run the script below to remove all AWS resources and avoid unnecessary cost.

$ cd CH_0

$ ./uninstall.sh

95 of 95

THANKS