How Maersk is navigating the seas of Observability with the LGTM Stack
Roshith Radhakrishnan
Director - Platform Engineering
Henry Kühl
Senior Engineering Manager for Observability Platforms
Where is your favorite place to go or thing to do on the weekend?
Kerala, India
Agenda
Problem Statement
Maersk Observability Platform Overview
Platform Capabilities: Ingestion, Query, RUM
Reliability
What next?
Disclaimer: WE ARE JUST THE MANAGERS! (the team did all the great work)
Improving life for all by integrating the world
A team of 110,000+ employees, operating in more than 130 countries
Maersk Air Cargo with own controlled capacity and a global network of scheduled flights
4.5m FFE intermodal volumes handled
700+ container vessels deployed, 12m FFE transported
59 terminals across 31 countries
7,104k SQM warehousing capacity worldwide in 452 sites
Facilitate and impact | |
Customers worldwide, large and small | 100,000+ |
Containers moved in the world by the Ocean fleet | ~16% |
Countries on all continents where we call on 500+ ports | 130+ |
Net zero GHG emissions across our business | 2040 |
Green methanol-enabled vessels on order | 19 |
Digitising global logistics
We’ve brought technology into the heart of A.P. Moller - Maersk, spearheading our industry’s
digital transition to support our customers’ evolving needs and future growth.
Our customers benefit from more agility, predictability and reliability.
Challenges
Lack of standardization
Multiple Vendor Tools
No comprehensive coverage
Increasing Cost
A.P. Moller - Maersk
Centralized platform for all observability capabilities.
Self Service
Open source
Democratized data
Purpose built for Maersk
Maersk Observability Platform (MOP)
Platform Overview
MOP breaks down into a growing number of capabilities.
We run the stock LGTM images.
Open-source FTW!
Grafana
Grafana Loki
Grafana Faro
Grafana Mimir
Grafana Tempo
Welcome to the team!
Here is your first task.
Just send it!™
How can I find my data?
Labels.
Photo by Brett Jordan on Unsplash
Retention
Usage tracking
Cardinality tracking
Traffic routing
But my app runs on pre-ci-prod-pen-iteration-12!
We have seen this…
preproduction
preprod
pre-prod
pre-production
pp
p-production
Photo by Brett Jordan on Unsplash
Guard
Photo by Kristijan Arsov on Unsplash
Guard as our write proxy enforces standards.
It requires certain labels and even specific label values.
So what happens if I violate the standards?
We drop your data!
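The kind of check a write proxy like Guard performs can be sketched roughly as follows. This is a minimal illustration only, not Guard's actual code: the label names (`service`, `environment`), the allowed values, and the function names are all hypothetical.

```python
# Hypothetical policy: label names and allowed values are illustrative,
# not Maersk's actual standards.
REQUIRED_LABELS = {"service", "environment"}
ALLOWED_VALUES = {"environment": {"dev", "preprod", "prod"}}


def violations(labels: dict[str, str]) -> list[str]:
    """Return human-readable policy violations for one label set."""
    problems = []
    # Every required label must be present.
    for name in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"missing required label: {name}")
    # Labels with a restricted vocabulary must use one of the allowed values.
    for name, allowed in ALLOWED_VALUES.items():
        value = labels.get(name)
        if value is not None and value not in allowed:
            problems.append(f"label {name} has disallowed value: {value}")
    return problems


def accept(labels: dict[str, str]) -> bool:
    """A write proxy would forward the data only when this returns True."""
    return not violations(labels)
```

Under this sketch, a stream labelled `environment="pre-prod"` is rejected while `environment="preprod"` passes, which is how a single spelling ends up being enforced.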
Ok, I get it. Standards matter. But now let me deep dive.
Wizard
Photo by izzombol on Unsplash
Wizard as our read proxy enforces standards.
It requires you to provide certain labels and blocks queries that would search more than 200 GB of log volume.
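A read-side check of this kind might look roughly like the sketch below. It is not Wizard's actual code: the required label names and the function name are hypothetical, the query is assumed to be already parsed into label matchers, and the volume estimate is assumed to come from elsewhere. Only the 200 GB limit is taken from the slide, read here as decimal gigabytes.

```python
# Hypothetical policy values; only the 200 GB limit comes from the talk.
REQUIRED_QUERY_LABELS = {"service", "environment"}
MAX_SCANNED_BYTES = 200 * 10**9  # 200 GB, interpreted as decimal gigabytes


def check_query(matchers: dict[str, str], estimated_bytes: int) -> tuple[bool, str]:
    """Decide whether a log query may be forwarded to the backend.

    matchers: label matchers already extracted from the query.
    estimated_bytes: estimated log volume the query would have to search.
    """
    missing = sorted(REQUIRED_QUERY_LABELS - set(matchers))
    if missing:
        return False, "missing label matchers: " + ", ".join(missing)
    if estimated_bytes > MAX_SCANNED_BYTES:
        return False, "query would search more than 200 GB of logs"
    return True, "ok"
```

Rejecting the query before it reaches the backend is what keeps one very broad search from degrading the shared platform for everyone else.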
LGTM + Guard + Wizard = AWESOME.
Big picture.
How about Open Source going both ways?
Project name | Link to pull request | Contribution description |
------------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------ |
Grafana Tempo | https://github.com/grafana/tempo/pull/2882 | Upgrade the Azure SDK of Grafana Tempo |
Grafana Tempo | https://github.com/grafana/tempo/pull/2911 | Fix flaky test |
Grafana Loki | https://github.com/grafana/loki/pull/8293 | Reduce distributor code duplication |
Grafana Loki | https://github.com/grafana/loki/pull/8247 | Update clusterLabel usage in dashboards |
Grafana Loki | https://github.com/grafana/loki/pull/8138 | Update alert rule to prevent false alerts |
Grafana Tempo | https://github.com/grafana/tempo/pull/2000 | Fix last Jsonnet alerts for Tempo. |
Memcached Mixin | https://github.com/grafana/jsonnet-libs/pull/901 | Add custom label selector to dashboard |
Grafana Tempo | https://github.com/grafana/tempo/pull/1936 | Add zone aware replication for ingesters |
Grafana Mimir | https://github.com/grafana/mimir/pull/3452 | Add runbook URLs to mimir alerts |
Grafana Loki | https://github.com/grafana/loki/pull/6662 | [FIX] Fix Memberlist when using a stateful ruler. |
grafana/agent | https://github.com/grafana/agent/pull/1711 | Allow proxy_url on oauth2 for metrics and logs |
Grafana Mimir | https://github.com/grafana/mimir/pull/1651 | [ENHANCEMENT] Added the option to use a custom cluster label for the mimir dashboards |
Front end Observability with
Architecture
Web Vitals
Exceptions & Errors
User Analytics
XHR
Fetch
Resource Load
Instrumentations & Customizations
Beacon
Performance Mark
Session Sampling
Capture User/Country Region
Also ...
Replace ZoneContextManager with StackContextManager in Faro Web Tracing v1.0.4.
Implement custom error handling for Offline Mode.
Mobile SDK
Architecture
Mobile Vitals
Errors
Mobile Analytics
Current Offerings
Reliability
We could barely keep up with growth and adoption.
Volumes per data stream also reflect the growth.
A compute footprint that is also still growing.
Azure – AKS
Spot Nodes
Single tenant
Total Core Count
Total Memory (GB)
How do you keep this platform reliable? We started simple.
Cloud Setup: Version 1
region-1
  kube-cluster-1
    nodepool-1: zone-A, zone-B, zone-C
  storage-account-1
And then reality hit us.
===== RESPONSE ERROR (ServiceCode=ServerBusy) ===== Description=Egress is over the account limit.
An improved cluster setup helped us keep our 99% promise.
Cloud Setup: Version 2
region-1
  kube-cluster-1
    nodepool-generic: zone-A, zone-B, zone-C
    nodepool-loki: zone-A, zone-B, zone-C
    nodepool-mimir: zone-A, zone-B, zone-C
    nodepool-tempo: zone-A, zone-B, zone-C
  storage-account-loki
  storage-account-tempo
  storage-account-mimir
And this one will get us to 99.5% or higher.
Cloud Setup: Version 3
region-1
  kube-cluster-generic
  kube-cluster-loki
    nodepool-1: zone-A, zone-B, zone-C
  kube-cluster-tempo
    nodepool-1: zone-A, zone-B, zone-C
  kube-cluster-mimir
    nodepool-1: zone-A, zone-B, zone-C
  storage-account-loki
  storage-account-tempo
  storage-account-mimir
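To put the step from the 99% promise to a 99.5% target into perspective, the availability figures can be converted into allowed downtime. The helper below is a small illustrative calculation; the 30-day month and the function name are assumptions, not something stated in the talk.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 30 * 24) -> float:
    """Hours of downtime permitted in a period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.5):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows {hours:.1f} h of downtime per 30 days")
```

Over a 30-day month, 99% leaves roughly 7.2 hours of downtime, while 99.5% leaves roughly 3.6 hours, so the target halves the error budget.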
We have many ideas to improve our setup even further.
Multi Region.
Multi Cloud.
Data Buffer.
ARM instead of x86.*
* Currently blocked due to a bug in the Go compiler.
What next?
700+ vessels
60+ terminals
450+ warehouse facilities
Observability for our edge endpoints.
Provide Consistent Observability Experience
Intelligent Observability - Detection, Alerting, Remediation & RCA
Open Sourcing Wizard, Guard and other tools
Thank you