1 of 44

2 of 44

Daniel Dyla

Michael Beemer

Observable Feature Rollouts

With OpenTelemetry and OpenFeature

3 of 44

Who are we?

Michael Beemer

Senior Product Manager, Dynatrace

  • Experience as a Consultant, DevOps Engineer, Software Developer, and Product Manager
  • Active open source contributor
  • Co-founder of OpenFeature and member of Governance Committee

Daniel Dyla

Senior Product Architect, Dynatrace

  • OpenTelemetry Governance Committee
  • Maintainer of OpenTelemetry JS
  • Contributor to OpenFeature

4 of 44

Standardizing Feature Flagging

An open specification that provides a vendor-agnostic, community-driven API for feature flagging that works with your favorite feature flag management tool or in-house solution.

5 of 44

What’s a feature flag?

Feature flags are a software development technique that allows teams to enable, disable or change the behavior of certain features or code paths in a product or service, without modifying the source code.
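As a minimal sketch of the idea, assuming a hypothetical in-memory flag store (`FLAGS`) and a made-up flag name (`new-checkout`), neither of which comes from the talk: both code paths ship in the same build, and the flag value picks one at runtime.

```python
# Hypothetical in-memory flag store; in practice this would be backed by a
# flag management tool or config service whose values can change at runtime.
FLAGS = {"new-checkout": False}


def is_enabled(flag_key: str, default: bool = False) -> bool:
    """Return the flag's current value, falling back to a safe default."""
    return FLAGS.get(flag_key, default)


def checkout() -> str:
    # The flag selects the code path; no source change is needed to switch.
    if is_enabled("new-checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Flipping `FLAGS["new-checkout"]` to `True` switches `checkout()` to the new path without redeploying.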

6 of 44

Feature flags have many benefits

Coordinate

Reduce risk

Experiment

7 of 44

… but introduce challenges

My environment is so complex…

8 of 44

OpenTelemetry to the rescue

9 of 44

What is OpenTelemetry?

A collection of APIs and SDKs used to collect telemetry data in a vendor-agnostic way.

10 of 44

Basic Telemetry Types

Traces

Events (Logs)

Metrics

11 of 44

Events (Logs)

  • A point in time without duration
  • Allow for arbitrary data and data types
  • Powerful data analytics capabilities
  • No processing required at collection time
  • Logs are a particular type of event

12 of 44

Traces

  • Collection of related spans
    • Operation with a duration and timestamp
  • Arbitrary data stored as attributes
  • Spans linked together in a tree
  • Requires propagation of span context
  • Essentially two events: start and end
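The span model above can be sketched in a few lines. This is an illustration only, not the OpenTelemetry API: the `Span` class, `start_span` helper, span names, and the attribute value are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str              # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]   # None marks the root of the tree
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def finish(self) -> None:
        self.end = time.time()


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    # A child reuses the parent's trace ID and records the parent's span ID.
    # These two IDs are the span context that must be propagated.
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


root = start_span("GET /sneakers")
child = start_span("query sneakers table", parent=root)
child.attributes["db.operation"] = "SELECT"  # arbitrary data as attributes
child.finish()
root.finish()
```

Following `parent_id` links from any span leads back to the root, which is how a backend rebuilds the tree.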

[Diagram: spans plotted along a time axis]

13 of 44

Metrics

  • Numeric data aggregated from a series of events
  • Usually original events are dropped
  • Usually attributes are more restricted
  • Requires keeping state on the client
  • Usually requires strict control of cardinality
  • Possible to generate later from events or traces
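A rough sketch of generating a metric from events, assuming hypothetical event records and attribute names: the counter is the state kept on the client, and only low-cardinality attributes are kept as dimensions.

```python
from collections import Counter

# Hypothetical raw events, each carrying arbitrary attributes.
events = [
    {"route": "/sneakers", "status": 200, "session_id": "a1"},
    {"route": "/sneakers", "status": 500, "session_id": "b2"},
    {"route": "/cart", "status": 200, "session_id": "a1"},
    {"route": "/sneakers", "status": 200, "session_id": "c3"},
]

# Only low-cardinality attributes become metric dimensions. session_id is
# dropped because every distinct value would create a new time series.
METRIC_ATTRIBUTES = ("route", "status")

request_count: Counter = Counter()  # aggregation state kept on the client
for event in events:
    key = tuple(event[name] for name in METRIC_ATTRIBUTES)
    request_count[key] += 1
```

After aggregation the individual events can be discarded, which is also why questions about a single session can no longer be answered from the metric alone.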

14 of 44

Telemetry Type Tradeoffs

Spectrum: Events, Traces, Metrics (from unprocessed to aggregate)

Unprocessed Data

  • Data stored in raw form and processed later
  • More storage, transport, and server processing cost
  • More available analysis options

Aggregate Data

  • Data aggregated before transit and storage
  • More efficient storage and transit
  • More client processing cost
  • Less flexible analysis options

15 of 44

Choosing Appropriate Signals

  • How much data am I collecting?
  • What types of analysis do I need to do later?
  • Is my data structured, unstructured, or unknown?
  • Am I collecting numeric data or some other type?

16 of 44

Instrumenting Your Application

Resource identifies your application

Exporters send telemetry to your backend

Instrumentations gather data from common libraries

17 of 44

Instrumenting Your Application

18 of 44

How about an example?

19 of 44

Sneaker shop architecture

20 of 44

Scenario

Response times are reasonable

21 of 44

Scenario

As load increases…

response times become worse.

22 of 44

Drilling into a trace

The database is the culprit.

23 of 44

The DB is the bottleneck

24 of 44

Let’s add read replicas

25 of 44

Here’s the plan

  1. Put access to the read replica behind a feature flag
  2. Enable the read replica for a small number of users
  3. Analyze the impact
  4. Enable the read replica for everyone
  5. Remove the feature flag

26 of 44

Adding the feature flag

Identifies the feature flag

Controls the database connection

Evaluation context
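The three parts named on this slide (flag key, database choice, evaluation context) might fit together roughly as follows. This sketch does not use the real OpenFeature SDK: `FlagProvider`, `InMemoryProvider`, `FlagClient`, and the flag key `use-read-replica` are hypothetical stand-ins chosen for illustration.

```python
from typing import Protocol


class FlagProvider(Protocol):
    """Hypothetical stand-in for a vendor-specific flag backend."""

    def resolve_boolean(self, flag_key: str, default: bool, context: dict) -> bool: ...


class InMemoryProvider:
    def __init__(self, flags: dict):
        self._flags = flags

    def resolve_boolean(self, flag_key: str, default: bool, context: dict) -> bool:
        return self._flags.get(flag_key, default)


class FlagClient:
    """Application code depends on this client, not on a specific vendor."""

    def __init__(self, provider: FlagProvider):
        self._provider = provider

    def get_boolean_value(self, flag_key: str, default: bool, context: dict) -> bool:
        try:
            return self._provider.resolve_boolean(flag_key, default, context)
        except Exception:
            # A failing flag backend should fall back to the default value.
            return default


def choose_database(client: FlagClient, session_id: str) -> str:
    context = {"targetingKey": session_id}  # evaluation context
    if client.get_boolean_value("use-read-replica", False, context):
        return "read-replica"
    return "primary"
```

Swapping `InMemoryProvider` for another provider changes where flag values come from without touching `choose_database`.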

27 of 44

Collect telemetry

Monitors flag evaluations with OpenTelemetry events
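One way to picture this: every flag evaluation emits an event carrying the flag key and the resolved variant. The sketch below is an assumption-laden stand-in, not OpenTelemetry or OpenFeature code. `recorded_events` replaces a real exporter, the function names are hypothetical, and the attribute names are modeled on the OpenTelemetry feature flag semantic conventions, which may differ between versions.

```python
import time

recorded_events: list = []  # stand-in for an OpenTelemetry event exporter


def record_flag_evaluation(flag_key: str, variant: str, provider_name: str) -> None:
    # Attribute names are modeled on the feature flag semantic conventions;
    # treat the exact names as an assumption.
    recorded_events.append({
        "name": "feature_flag",
        "timestamp": time.time(),
        "attributes": {
            "feature_flag.key": flag_key,
            "feature_flag.variant": variant,
            "feature_flag.provider_name": provider_name,
        },
    })


def evaluate_with_telemetry(flags: dict, flag_key: str, default: str) -> str:
    variant = flags.get(flag_key, default)
    # Record after resolution so the event reflects what the caller received.
    record_flag_evaluation(flag_key, variant, provider_name="in-memory")
    return variant
```

Because the variant is recorded alongside other request telemetry, later analysis can group results by variant.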

28 of 44

Enabling feature

Matching flag key

Enabling for 25% of sessions
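A common way to enable a flag for a fixed share of sessions is deterministic hashing into buckets. The sketch below assumes that approach; the talk does not say how the flag tool implements its percentage rule, and `in_rollout` and the flag key are hypothetical.

```python
import hashlib


def in_rollout(flag_key: str, session_id: str, percentage: int) -> bool:
    """Deterministically place a session in one of 100 buckets."""
    digest = hashlib.sha256(f"{flag_key}:{session_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets 0..percentage-1 are enabled, so raising the percentage only
    # ever adds sessions; nobody who was enabled gets switched back off.
    return bucket < percentage


sessions = [f"session-{i}" for i in range(10_000)]
enabled = sum(in_rollout("use-read-replica", s, 25) for s in sessions)
```

The same session always gets the same answer, so a user does not flip between the primary and the replica from one request to the next.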

29 of 44

Starting the rollout

Uh oh! That shouldn’t happen.

30 of 44

Abort!

31 of 44

That was close…

Back to normal.

32 of 44

Let’s see what went wrong

33 of 44

Failure rate by flag variant

It only failed when the read replica was enabled.
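A breakdown like this can be computed once each request record carries the flag variant. The records below are invented for illustration, and `failure_rate_by_variant` is a hypothetical helper, not a query from the talk.

```python
from collections import defaultdict

# Hypothetical request records, each tagged with the evaluated flag variant.
requests = [
    {"variant": "on", "error": True},
    {"variant": "on", "error": False},
    {"variant": "off", "error": False},
    {"variant": "off", "error": False},
    {"variant": "on", "error": True},
    {"variant": "off", "error": False},
]


def failure_rate_by_variant(records: list) -> dict:
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record["variant"]] += 1
        if record["error"]:
            failures[record["variant"]] += 1
    # Rate per variant = failed requests / all requests for that variant.
    return {variant: failures[variant] / totals[variant] for variant in totals}


rates = failure_rate_by_variant(requests)
```

A nonzero rate for one variant and zero for the other points at the flagged code path rather than at the service as a whole.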

34 of 44

Aggregate error messages

Something is wrong with node 3

We can drill into a representative trace to confirm.

35 of 44

Confirming the diagnosis

It’s definitely node 3

36 of 44

We know what happened

37 of 44

Let’s try that again…

38 of 44

Phase: Read replica disabled

39 of 44

Phase: Read replica enabled for 25%

40 of 44

Phase: Read replica enabled for 50%

41 of 44

Phase: Read replica enabled for 75%

42 of 44

Phase: Read replica enabled
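The phases above, together with the earlier abort, suggest a simple loop: raise the percentage step by step and fall back to 0% if failures climb. This is a sketch under assumptions. `run_rollout`, its callbacks, and the 1% threshold are hypothetical, and a real rollout would wait and observe between phases.

```python
from typing import Callable

PHASES = [0, 25, 50, 75, 100]  # percentages taken from the rollout phases


def run_rollout(
    set_percentage: Callable[[int], None],
    failure_rate: Callable[[], float],
    max_failure_rate: float = 0.01,  # hypothetical abort threshold
) -> int:
    """Advance through the phases; abort to 0% if failures exceed the limit."""
    for percentage in PHASES:
        set_percentage(percentage)
        if failure_rate() > max_failure_rate:
            set_percentage(0)  # abort: disable the flag for everyone
            return 0
    return PHASES[-1]
```

`set_percentage` would update the flag configuration, and `failure_rate` would query telemetry for the enabled variant.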

43 of 44

Recap

  1. Rolled out an important performance fix
  2. Controlled the impact of unforeseen problems
  3. Validated assumptions
  4. Rolled the new feature out for all users
  5. Continuously monitor impact

44 of 44

Questions?

Feedback Appreciated