Prometheus Workshop
Adam Chen, Owen Wu, Zz Chen
Outline
Chapter 0: Setup Cloud9 and EKS
The installation screenshots below come from: https://github.com/pahud/amazon-eks-workshop
7. Run 'aws configure' to set up the credentials for your IAM user. Make sure this IAM user has AdministratorAccess, then run 'aws sts get-caller-identity'; it should return JSON output describing your identity.
Create an IAM access key if you don't have one (1)
Create an IAM access key if you don't have one (2)
Create an IAM access key if you don't have one (3)
Create an IAM access key if you don't have one (4)
Run command in Cloud9
$ git clone https://github.com/Taipei-HUG/Prometheus-workshop
$ cd Prometheus-workshop/CH_0
$ ./step1.sh # download all binaries
$ ./step2.sh # set up the EKS cluster
$ ./step3.sh # download and set up Helm
$ ./step4.sh # install kube-prometheus and Pushgateway with load balancers
$ ./get_links.sh # show links to all components
Prometheus Overview
Chapter I : Prometheus Overview
Prometheus
Basic Prometheus Config
global:
  [ scrape_interval: <duration> | default = 1m ]
  [ scrape_timeout: <duration> | default = 10s ]
  [ evaluation_interval: <duration> | default = 1m ]
rule_files: # Trigger alerts
  [ - <filepath_glob> ... ]
scrape_configs: # Find targets
  [ - <scrape_config> ... ]
alerting:
  alertmanagers:
    [ - <alertmanager_config> ... ]
remote_write:
  [ - <remote_write> ... ]
remote_read:
  [ - <remote_read> ... ]
Service Discovery
POD
Exporters
Check EKS Cluster
Time for step 3 & 4
Observability : metrics
<metric name>{label1="value1", label2="value2", ...}
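To make the notation concrete, here is a minimal Python sketch that splits one sample line into its metric name, label set, and value. The regular expressions are a simplification (they assume label values contain no escaped quotes), and the function name parse_sample is made up for this example.

```python
import re

# Simplified pattern for one sample line: name, optional {labels}, value.
# Assumption: label values contain no escaped quotes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


name, labels, value = parse_sample('alertmanager_alerts{state="active"} 12')
print(name, labels, value)
```

A time series is identified by the metric name together with its full label set, so two lines that differ in a single label value are two different series.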
Data type
Data type (Histogram)
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1000"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="10000"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100000"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+06"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+07"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+08"} 374
alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="+Inf"} 374
alertmanager_http_response_size_bytes_sum{handler="/alerts",method="post"} 7480
alertmanager_http_response_size_bytes_count{handler="/alerts",method="post"} 374
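The bucket counts above are cumulative: each le bucket counts every observation less than or equal to its bound. The Python sketch below uses the numbers from this sample to recover per-bucket counts, the mean, and an estimated quantile. The quantile function is a simplified take on linear interpolation inside a bucket (it assumes the lowest bucket starts at 0) and is not a full reimplementation of PromQL's histogram_quantile.

```python
import math

# Cumulative bucket counts copied from the sample above: (upper bound "le", count).
buckets = [
    (100.0, 374), (1000.0, 374), (10000.0, 374), (100000.0, 374),
    (1e6, 374), (1e7, 374), (1e8, 374), (math.inf, 374),
]
total_sum, total_count = 7480.0, 374


def per_bucket_counts(cumulative):
    """Turn cumulative counts into the number of observations per bucket."""
    counts, previous = [], 0
    for upper, count in cumulative:
        counts.append((upper, count - previous))
        previous = count
    return counts


def estimate_quantile(q, cumulative):
    """Estimate a quantile by linear interpolation inside the target bucket.

    Simplified: assumes the lowest bucket starts at 0.
    """
    rank = q * cumulative[-1][1]
    lower, previous = 0.0, 0
    for upper, count in cumulative:
        if count >= rank:
            if math.isinf(upper) or count == previous:
                return lower  # nothing to interpolate in this bucket
            return lower + (upper - lower) * (rank - previous) / (count - previous)
        lower, previous = upper, count
    return lower


print("mean size:", total_sum / total_count)
print("per bucket:", per_bucket_counts(buckets))
print("estimated median:", estimate_quantile(0.5, buckets))
```

In this sample every response fell into the first bucket (at most 100 bytes), so the estimated median is only as precise as that bucket's width, while the mean from _sum and _count is exact.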
Data type (Summary)
alertmanager_nflog_gc_duration_seconds{quantile="0.5"} 3.273e-06
alertmanager_nflog_gc_duration_seconds{quantile="0.9"} 3.273e-06
alertmanager_nflog_gc_duration_seconds{quantile="0.99"} 3.273e-06
alertmanager_nflog_gc_duration_seconds_sum 1.9283e-05
alertmanager_nflog_gc_duration_seconds_count 6
Practice #01
Get metrics from prometheus server
Get metrics
# HELP alertmanager_alerts How many alerts by state.
# TYPE alertmanager_alerts gauge
alertmanager_alerts{state="active"} 12
alertmanager_alerts{state="suppressed"} 0
# HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid.
# TYPE alertmanager_alerts_invalid_total counter
alertmanager_alerts_invalid_total 0
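The "# TYPE" comment lines tell you which data type each metric has. As a small illustration, the Python sketch below reads those lines from the output shown above; metric_types is a hypothetical helper name, and the parsing assumes the simple whitespace-separated layout seen here.

```python
def metric_types(exposition_text):
    """Map each metric name to the type declared in its '# TYPE' line."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type line looks like: # TYPE <metric name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


sample = """\
# HELP alertmanager_alerts How many alerts by state.
# TYPE alertmanager_alerts gauge
alertmanager_alerts{state="active"} 12
alertmanager_alerts{state="suppressed"} 0
# HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid.
# TYPE alertmanager_alerts_invalid_total counter
alertmanager_alerts_invalid_total 0
"""

print(metric_types(sample))
```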
Remote Storage and sidecar (Thanos)
PromQL with functions
Practice #02
Find out basic functions of PromQL
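One of the most used PromQL functions is rate(), which turns a counter into a per-second increase. The Python sketch below shows the core idea on a list of (timestamp, value) samples. It is a simplification: it handles counter resets but, unlike the real rate(), does not extrapolate to the edges of the time window. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count the new value from zero.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# Invented samples scraped every 15 seconds, with a reset before t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_rate(samples))
```

Here the counter grows by 30 + 30 + 10 + 30 = 100 over 60 seconds, even though its raw value ends lower than it started.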
PromQL with functions
Alertmanager
Break Time
Chapter II : Service Discovery & Kubernetes
Service Discovery
Prometheus with Service Discovery on file or …….
Service Discovery Configs
Resources & Service Discovery in Kubernetes
Pods
Figure: a Pod is a logical application; this one groups the containers nginx and monolith behind one IP address, 10.10.1.100
Labels
Arbitrary metadata attached to Kubernetes objects
Figure: two Pods named hello, one with labels version: v1, track: stable and the other with labels version: v1, track: test
selector: "version=v1" matches both Pods
selector: "track=stable" matches only the Pod labeled track: stable
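The selector examples can be sketched in a few lines of Python. This covers only equality-based selectors (key=value terms joined by commas); the pod names hello-stable and hello-test are invented so the two hello Pods can be told apart.

```python
def parse_selector(selector):
    """Parse an equality-based selector such as 'version=v1,track=stable'."""
    return dict(term.split("=", 1) for term in selector.split(",") if term)


def matches(selector, labels):
    """True when every key=value term of the selector is present in labels."""
    wanted = parse_selector(selector)
    return all(labels.get(key) == value for key, value in wanted.items())


# Hypothetical pod names; the labels are the ones from the slides.
pods = {
    "hello-stable": {"version": "v1", "track": "stable"},
    "hello-test": {"version": "v1", "track": "test"},
}

for selector in ("version=v1", "track=stable"):
    selected = [name for name, labels in pods.items() if matches(selector, labels)]
    print(selector, "->", selected)
```

With kubectl, the same kind of filter is written as kubectl get pods -l track=stable.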
Services
Kubernetes Service
Figure: a Client reaches Pods through Services; one Service uses selector: app=app2 and another uses selector: app=myApp
Practice #03
Try to understand label selector with kubectl
Demo Selector with kubectl
Flow with service discovery
Practice #04
Use kubernetes_sd_configs to discover CoreDNS
Kubernetes Service Discovery - CoreDNS
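Kubernetes service discovery returns every pod or endpoint it can see, each with __meta_kubernetes_* labels, and a relabel rule with action keep then narrows the list down. The Python sketch below imitates that filtering step. It assumes the CoreDNS pods carry the label k8s-app=kube-dns and serve metrics on port 9153; check both in your cluster. The addresses are invented.

```python
import re


def keep(targets, source_label, regex):
    """Sketch of a relabel rule with action 'keep'.

    Prometheus anchors the regex on both ends, which re.fullmatch imitates.
    """
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Invented discovery result; the label k8s-app becomes ..._label_k8s_app
# because characters not allowed in label names are replaced by underscores.
discovered = [
    {"__address__": "10.0.1.12:9153", "__meta_kubernetes_pod_label_k8s_app": "kube-dns"},
    {"__address__": "10.0.2.7:9153", "__meta_kubernetes_pod_label_k8s_app": "kube-dns"},
    {"__address__": "10.0.3.40:8080", "__meta_kubernetes_pod_label_app": "hello"},
]

coredns = keep(discovered, "__meta_kubernetes_pod_label_k8s_app", "kube-dns")
print([t["__address__"] for t in coredns])
```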
Where to run Prometheus?
Why use Kubernetes ?
Prometheus Operator
Operators
A Kubernetes Operator extends the types of applications that can run on Kubernetes by letting developers encode additional operational knowledge for applications that need to maintain state.
It mostly focuses on automation and on capturing the application's specific know-how, which simplifies deployment and maintenance.
Prometheus Operator Architecture
Prometheus Operator Object(CRD)
Service Monitor
Practice #05
Use a ServiceMonitor to discover CoreDNS
Service Monitor Demo
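For reference, a ServiceMonitor for CoreDNS could look roughly like the manifest built by the Python sketch below, printed as JSON (kubectl apply -f also accepts JSON files). Everything specific here is an assumption to verify in your cluster: the resource name coredns-example, the Service label k8s-app=kube-dns, the kube-system namespace, the port name metrics, and the 15s interval. Your Prometheus object must also be configured to select this ServiceMonitor.

```python
import json


def coredns_service_monitor():
    """Build a ServiceMonitor manifest as a plain dict (all names are assumptions)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": "coredns-example",       # hypothetical name
            "namespace": "monitoring",
            "labels": {"k8s-app": "coredns"},
        },
        "spec": {
            # Which Services to watch, chosen by label.
            "selector": {"matchLabels": {"k8s-app": "kube-dns"}},
            # Which namespaces to look in for those Services.
            "namespaceSelector": {"matchNames": ["kube-system"]},
            # Which named Service port to scrape, and how often.
            "endpoints": [{"port": "metrics", "interval": "15s"}],
        },
    }


print(json.dumps(coredns_service_monitor(), indent=2))
```

The selector works like the label selectors from the earlier practice: the ServiceMonitor picks Services by label, and the operator turns that choice into scrape configuration for Prometheus.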
For resources outside Kubernetes
How to supply metrics to Prometheus
Natively or Exporter
Exporter
More Exporter
Chapter III
Grafana & Pushgateway
Grafana
An analytics platform
to query, visualize, and alert
Access Dashboards
./CH_0/get_links.sh
Preloaded dashboards from kube-prometheus
Practice: Monitor CoreDNS status
kube_pod_status_ready{pod=~"coredns-(.*)", condition="false"}
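The =~ matcher in this query is a regular expression match that must cover the whole label value, not just part of it. A quick Python sketch of that behavior, using re.fullmatch as a stand-in for the anchored match; the pod names are invented.

```python
import re


def label_matches(pattern, value):
    """Imitate PromQL's =~ matcher, which is anchored at both ends."""
    return re.fullmatch(pattern, value) is not None


# Invented pod names for illustration.
pods = ["coredns-5c98db65d4-abcde", "my-coredns-proxy", "kube-proxy-x7k2p"]
print([pod for pod in pods if label_matches("coredns-(.*)", pod)])
```

Only the first name matches: my-coredns-proxy contains "coredns-" but does not start with it.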
Pushgateway
Allows ephemeral and batch jobs
to expose metrics
Pushgateway
Play with pushgateway
$ echo "some_metric 3.14" | curl --data-binary @- http://{URL_OF_PUSHGATEWAY}:9091/metrics/job/some_job
$ echo "progress 12" | curl --data-binary @- http://{URL_OF_PUSHGATEWAY}:9091/metrics/job/playing
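A batch job can make the same push from code. The Python sketch below builds the equivalent HTTP request with the standard library; the host pushgateway.example is a placeholder and build_push_request is a made-up helper name. The request is only built here, not sent.

```python
import urllib.request


def build_push_request(base_url, job, metrics):
    """Build a POST to <base_url>/metrics/job/<job> with one metric per line."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


# Placeholder host; use the Pushgateway link printed by get_links.sh instead.
request = build_push_request(
    "http://pushgateway.example:9091", "some_job", {"some_metric": 3.14}
)
print(request.get_method(), request.full_url)
print(request.data)

# To actually send it: urllib.request.urlopen(request)
```

Each line of the body must end with a newline, which echo adds for you in the curl version.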
Check the metrics
Metrics: http://<pushgateway-host>:9091/metrics
UI: http://<pushgateway-host>:9091/
Check in the Prometheus console
Break Time
Chapter IV : Alerting & Practical Cases
Let’s pull the trigger.
Alertmanager
global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ slack_api_url: <secret> ]
  [ http_config: <http_config> ]
templates:
  [ - <filepath> ... ]
route: <route>
receivers:
  - <receiver> ...
inhibit_rules:
  [ - <inhibit_rule> ... ]
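The route entry is the root of a routing tree: an alert descends into the first child route whose match labels all fit, and otherwise stays with the parent's receiver. The Python sketch below shows that lookup using the receiver names from this workshop. It is a simplification that assumes only equality match blocks; it ignores match_re, continue, and grouping settings.

```python
def pick_receiver(route, labels, inherited=None):
    """Return the receiver for an alert with the given labels."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(labels.get(key) == value for key, value in match.items()):
            # First matching child wins; it may have children of its own.
            return pick_receiver(child, labels, receiver)
    return receiver


# Routing tree modeled on the workshop's Alertmanager route configuration.
route = {
    "receiver": "slack_alert1",
    "routes": [
        {"match": {"alertname": "Watchdog"}, "receiver": "slack_alert2"},
    ],
}

print(pick_receiver(route, {"alertname": "Watchdog", "job": "prometheus"}))
print(pick_receiver(route, {"alertname": "NGINXAlert", "job": "nginx"}))
```

So Watchdog alerts go to slack_alert2 and everything else falls back to the top-level receiver, slack_alert1.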
Alertmanager - Route
"route":
  "group_by":
  - "job"
  "group_interval": "1m"
  "group_wait": "30s"
  "receiver": "slack_alert1"
  "repeat_interval": "3m"
  "routes":
  - "match":
      "alertname": "Watchdog"
    "receiver": "slack_alert2"
Setup Slack
Setup Alertmanager & Nginx
$ vi CH_4/alertmanager.yaml
$ CH_4/apply_change.sh
"receivers":
- "name": "slack_alert1"
  "slack_configs":
  - "api_url": "https://hooks.slack.com/services/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"
    "channel": "#alert_1"
- "name": "slack_alert2"
  "slack_configs":
  - "api_url": "https://hooks.slack.com/services/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"
    "channel": "#alert_2"
$ CH_4/helm_nginx_install.sh
Install rule 1
$ cat alert_rule_1.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-example-rules
  namespace: monitoring
spec:
  groups:
  - name: yourname.rules
    rules:
    - alert: YournameAlert
      expr: vector(1)
$ kubectl apply -f alert_rule_1.yaml
Install rule 2
$ cat alert_rule_2.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-example-rules2
  namespace: monitoring
spec:
  groups:
  - name: nginx_rule
    rules:
    - alert: NGINXAlert
      expr: nginx_ingress_controller_nginx_process_requests_total > 1000
$ kubectl apply -f alert_rule_2.yaml
$ CH_4/trigger_nginx_alert.sh
Check Slack for alerts
Remove all AWS resources
Remember to run the script below to remove all AWS resources and avoid unnecessary cost.
$ cd CH_0
$ ./uninstall.sh
THANKS