1 of 51

How Maersk is navigating the seas of Observability with the LGTM Stack

Roshith Radhakrishnan

Director - Platform Engineering

Henry Kühl

Senior Engineering Manager for Observability Platforms

2 of 51

Where are you from?

Kerala, India

Roshith Radhakrishnan

Director - Platform Engineering

3 of 51

Where is your favorite place to go or thing to do on the weekend?

Henry Kühl

Senior Engineering Manager for Observability Platforms

4 of 51

Agenda

Problem Statement

Maersk Observability Platform Overview

Platform Capabilities: Ingestion, Query, RUM

Reliability

What next?

5 of 51

Disclaimer: WE ARE JUST THE MANAGERS! (the team did all the great work)

6 of 51

Improving life for all by integrating the world

A team of 110,000+ employees, operating in more than 130 countries

Maersk Air Cargo with own controlled capacity and a global network of scheduled flights

4.5m FFE intermodal volumes handled

700+ container vessels deployed, 12m FFE transported

59 terminals across 31 countries

7,104k SQM warehousing capacity worldwide in 452 sites

Facilitate and impact

Customers worldwide, large and small: 100,000+

Containers moved in the world by the Ocean fleet: ~16%

Countries on all continents where we call on 500+ ports: 130+

Net zero GHG emissions across our business: 2040

Green methanol-enabled vessels on order: 19

7 of 51

Digitising global logistics

We’ve brought technology into the heart of A.P. Moller - Maersk, spearheading our industry’s digital transition to support our customers’ evolving needs and future growth.

Our customers benefit from more agility, predictability and reliability.

8 of 51

Challenges

Lack of standardization

Multiple Vendor Tools

No comprehensive coverage

Increasing Cost

9 of 51

A.P. Moller - Maersk

Centralized platform for all observability capabilities.

Self Service

Open source

Democratized data

Purpose built for Maersk

Maersk Observability Platform (MOP)

10 of 51

Platform Overview

11 of 51

MOP breaks down into a growing number of capabilities.

12 of 51

We run the stock LGTM images.

Open-source FTW!

Grafana

Grafana Loki

Grafana Faro

Grafana Mimir

Grafana Tempo

13 of 51

Welcome to the team!

Here is your first task.

Just send it!™

14 of 51

How can I find my data?

Labels.

Photo by Brett Jordan on Unsplash

Retention

Usage tracking

Cardinality tracking

Traffic routing

15 of 51

But my app runs on pre-ci-prod-pen-iteration-12!

We have seen this…

preproduction

preprod

pre-prod

pre-production

pp

p-production

16 of 51

Guard

17 of 51

Guard as our write proxy enforces standards.

Required labels and even label values.

18 of 51

So what happens if I violate the standards?

We drop your data!
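
The slides do not show Guard's code, so the following is only a minimal sketch of the idea: a write-path check that requires certain labels, restricts some of them to a fixed set of values, and drops anything that does not comply. The label names (`env`, `service`), the allowed values, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical policy, for illustration only. A set lists the allowed values;
# None means any non-empty value is accepted.
REQUIRED_LABELS = {
    "env": {"dev", "test", "preprod", "prod"},
    "service": None,
}


@dataclass
class Verdict:
    accepted: bool
    reasons: list[str] = field(default_factory=list)


def check_labels(labels: dict[str, str]) -> Verdict:
    """Check one label set against the required labels and allowed values."""
    reasons = []
    for name, allowed in REQUIRED_LABELS.items():
        value = labels.get(name, "")
        if not value:
            reasons.append(f"missing required label {name!r}")
        elif allowed is not None and value not in allowed:
            reasons.append(f"label {name!r} has non-standard value {value!r}")
    return Verdict(accepted=not reasons, reasons=reasons)


def filter_batch(batch: list[dict[str, str]]) -> tuple[list, list]:
    """Split a batch into (kept, dropped); dropped entries are never forwarded."""
    kept, dropped = [], []
    for labels in batch:
        (kept if check_labels(labels).accepted else dropped).append(labels)
    return kept, dropped
```

With this policy, a series labelled `env="preprod"` passes, while `env="pre-prod"` or `env="pp"` is rejected with a reason that can be reported back to the sender.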

19 of 51

Ok, I get it. Standards matter. But now let me deep dive.

20 of 51

Wizard

Photo by izzombol on Unsplash 

21 of 51

Wizard as our read proxy enforces standards.

It requires queries to include certain labels and blocks queries that would search more than 200 GB of log volume.
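
How Wizard inspects queries or estimates search volume is not described on the slides. As a rough sketch under those unknowns, a read-path check could look like this: the required label names, the regex-based selector parsing, and the `estimated_bytes` input are assumptions made for illustration, and 200 GB is taken as a decimal value.

```python
import re

REQUIRED_QUERY_LABELS = {"env", "service"}  # hypothetical label names
MAX_SEARCH_BYTES = 200 * 10**9  # the 200 GB limit from the slide

# Matches `name="`, `name=~"`, `name!="` and `name!~"` inside a selector.
_MATCHER = re.compile(r'(\w+)\s*(=~|!~|!=|=)\s*"')


def selector_labels(query: str) -> set[str]:
    """Return the label names used in the first {...} selector of a query.

    Simplification: label values containing '}' are not handled.
    """
    match = re.search(r"\{([^}]*)\}", query)
    if not match:
        return set()
    return {name for name, _op in _MATCHER.findall(match.group(1))}


def admit_query(query: str, estimated_bytes: int) -> tuple[bool, str]:
    """Decide whether a query may be forwarded to the backend."""
    missing = REQUIRED_QUERY_LABELS - selector_labels(query)
    if missing:
        return False, "missing required labels: " + ", ".join(sorted(missing))
    if estimated_bytes > MAX_SEARCH_BYTES:
        return False, "query would search more than 200 GB of logs"
    return True, "ok"
```

A query such as `{env="prod", service="booking"} |= "error"` with a small estimate is admitted; dropping the `service` matcher or exceeding the volume limit blocks it with a reason the user can act on.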

22 of 51

LGTM + Guard + Wizard = AWESOME.

Big picture.

23 of 51

LGTM + Guard + Wizard = AWESOME.

How about Open Source going both ways?

| Project name | Link to pull request | Contribution description |
| --- | --- | --- |
| Grafana Tempo | https://github.com/grafana/tempo/pull/2882 | Upgrade the Azure SDK of Grafana Tempo |
| Grafana Tempo | https://github.com/grafana/tempo/pull/2911 | Fix flaky test |
| Grafana Loki | https://github.com/grafana/loki/pull/8293 | Reduce distributor code duplication |
| Grafana Loki | https://github.com/grafana/loki/pull/8247 | Update clusterLabel usage in dashboards |
| Grafana Loki | https://github.com/grafana/loki/pull/8138 | Update alert rule to prevent false alerts |
| Grafana Tempo | https://github.com/grafana/tempo/pull/2000 | Fix last Jsonnet alerts for Tempo. |
| Memcached Mixin | https://github.com/grafana/jsonnet-libs/pull/901 | Add custom label selector to dashboard |
| Grafana Tempo | https://github.com/grafana/tempo/pull/1936 | Add zone aware replication for ingesters |
| Grafana Mimir | https://github.com/grafana/mimir/pull/3452 | Add runbook URLs to mimir alerts |
| Grafana Loki | https://github.com/grafana/loki/pull/6662 | [FIX] Fix Memberlist when using a stateful ruler. |
| grafana/agent | https://github.com/grafana/agent/pull/1711 | Allow proxy_url on oauth2 for metrics and logs |
| Grafana Mimir | https://github.com/grafana/mimir/pull/1651 | [ENHANCEMENT] Added the option to use a custom cluster label for the mimir dashboards |

24 of 51

Front end Observability with

25 of 51

Architecture

26 of 51

Web Vitals

27 of 51

Exception & Errors

28 of 51

User Analytics

29 of 51

XHR

Fetch

Resource Load

Instrumentations & Customizations

30 of 51

Beacon

31 of 51

Performance Mark

Session Sampling
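
The slide names session sampling without further detail. The general idea is to decide once per session whether its telemetry is sent, so that a sampled session stays complete. The sketch below illustrates that idea in Python; it is not the Faro SDK's implementation, and the function name and hashing scheme are assumptions.

```python
import hashlib


def session_is_sampled(session_id: str, sampling_rate: float) -> bool:
    """Deterministically decide whether a session's telemetry is kept.

    sampling_rate is the fraction of sessions to keep, between 0.0 and 1.0.
    The same session_id always gets the same answer.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sampling_rate * 2**64)
```

Hashing the session ID instead of drawing a random number per event means every event in a session shares the same decision.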

32 of 51

Capture User/Country Region

33 of 51

Also ...

Replace ZoneContextManager with StackContextManager in Faro Web Tracing v1.0.4.

Implement custom error handling for Offline Mode.

34 of 51

Mobile SDK

35 of 51

Architecture

36 of 51

Mobile Vitals

37 of 51

Errors

38 of 51

Mobile Analytics

39 of 51

Current Offerings

40 of 51

Reliability

41 of 51

We could barely keep up with growth and adoption.

42 of 51

Volumes per data stream also reflect the growth.

43 of 51

A compute footprint that is also still growing.

Azure – AKS

Spot Nodes

Single tenant

Total Core Count

Total Memory (GB)

44 of 51

How do you keep this platform reliable? We started simple.

Cloud Setup: Version 1

region-1
  kube-cluster-1
    nodepool-1 (zone-A, zone-B, zone-C)
  storage-account-1

45 of 51

And then reality hit us.

===== RESPONSE ERROR (ServiceCode=ServerBusy) ===== Description=Egress is over the account limit.

46 of 51

An improved cluster setup helped us keep our 99% promise.

Cloud Setup: Version 2

region-1
  kube-cluster-1
    nodepool-generic (zone-A, zone-B, zone-C)
    nodepool-loki (zone-A, zone-B, zone-C)
    nodepool-tempo (zone-A, zone-B, zone-C)
    nodepool-mimir (zone-A, zone-B, zone-C)
  storage-account-loki
  storage-account-tempo
  storage-account-mimir

47 of 51

And this one will get us to 99.5% or higher.

Cloud Setup: Version 3

region-1
  kube-cluster-generic
  kube-cluster-loki
    nodepool-1 (zone-A, zone-B, zone-C)
  kube-cluster-tempo
    nodepool-1 (zone-A, zone-B, zone-C)
  kube-cluster-mimir
    nodepool-1 (zone-A, zone-B, zone-C)
  storage-account-loki
  storage-account-tempo
  storage-account-mimir
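
To put the 99% and 99.5% availability targets in perspective, the short calculation below converts an availability percentage into an allowed-downtime budget. The 30-day period is an assumption chosen for illustration; the slides do not state the measurement window.

```python
def downtime_budget_hours(availability_percent: float, period_days: int = 30) -> float:
    """Return the hours of downtime allowed over the period at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1 - availability_percent / 100) * period_days * 24


for target in (99.0, 99.5):
    hours = downtime_budget_hours(target)
    print(f"{target}% over 30 days allows about {hours:.1f} hours of downtime")
```

Under that assumption, moving from 99% to 99.5% halves the budget from about 7.2 hours to about 3.6 hours per 30 days.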

48 of 51

We have many ideas to improve our setup even further.

Multi Region.

Multi Cloud.

Data Buffer.

ARM instead of x86.*

* Currently blocked due to a bug in the Go compiler.

49 of 51

What next?

50 of 51

What next …

700+ vessels

60+ terminals

450+ warehouse facilities

Observability for our edge endpoints.

Provide Consistent Observability Experience

Intelligent Observability - Detection, Alerting, Remediation & RCA

Open Sourcing Wizard, Guard and other tools

51 of 51

Thank you