Daniel Dyla
Michael Beemer
Observable Feature Rollouts
With OpenTelemetry and OpenFeature
Who are we?
Michael Beemer
Senior Product Manager, Dynatrace
Daniel Dyla
Senior Product Architect, Dynatrace
Standardizing Feature Flagging
An open specification that provides a vendor-agnostic, community-driven API for feature flagging that works with your favorite feature flag management tool or in-house solution.
What’s a feature flag?
Feature flags are a software development technique that allows teams to enable, disable or change the behavior of certain features or code paths in a product or service, without modifying the source code.
Feature flags have many benefits
Coordinate
Reduce risk
Experiment
… but introduce challenges
My environment is so complex…
OpenTelemetry to the rescue
What is OpenTelemetry?
A collection of APIs and SDKs used to collect telemetry data in a vendor-agnostic way.
Basic Telemetry Types
Traces
Events (Logs)
Metrics
[Diagram: events (logs), traces (made up of spans), and metrics shown over time]
Telemetry Type Tradeoffs
[Diagram: a spectrum from unprocessed data to aggregate data, with events, traces, and metrics placed along it]
Choosing Appropriate Signals
Instrumenting Your Application
Resource identifies your application
Exporters send telemetry to your backend
Instrumentations gather data from common libraries
Instrumenting Your Application
How about an example?
Sneaker shop architecture
Scenario
Response times are reasonable
Scenario
As load increases, response times get worse.
Drilling into a trace
The database is the culprit.
The DB is the bottleneck
Let’s add read replicas
Here’s the plan
1. Put access to the read replica behind a feature flag
2. Enable the read replica for a small number of users
3. Analyze the impact
4. Enable the read replica for everyone
5. Remove the feature flag
Adding the feature flag
Identifies the feature flag
Controls the database connection
Evaluation context
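The callouts above name three pieces: a key that identifies the flag, a flag value that controls the database connection, and an evaluation context. A minimal, library-free sketch of how they might fit together follows. The flag key `use-read-replica`, the `FLAGS` store, the helper `get_boolean_value`, and the connection strings are all hypothetical stand-ins; this is not the OpenFeature SDK API or the code shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationContext:
    """Per-request data a flag system could use for targeting (assumed shape)."""
    targeting_key: str
    attributes: dict = field(default_factory=dict)


# Hypothetical in-memory flag store; a real setup would ask a flag provider.
FLAGS = {"use-read-replica": False}

# Hypothetical connection strings for the primary database and the read replica.
PRIMARY_DSN = "postgres://primary.example/shop"
REPLICA_DSN = "postgres://replica.example/shop"


def get_boolean_value(flag_key: str, default: bool, context: EvaluationContext) -> bool:
    """Return the flag's value, falling back to the default for unknown flags.

    The context is accepted but unused here; a real provider could use it
    to target specific users or sessions.
    """
    return FLAGS.get(flag_key, default)


def pick_database(context: EvaluationContext) -> str:
    """Choose the connection string based on the feature flag."""
    use_replica = get_boolean_value("use-read-replica", False, context)
    return REPLICA_DSN if use_replica else PRIMARY_DSN
```

Because the default is `False`, a missing or unreachable flag keeps reads on the primary database, which is the safe behavior during a rollout.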
Collect telemetry
Monitors flag evaluations with OpenTelemetry events
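To make the idea of recording flag evaluations as events concrete, here is a small sketch that uses only the standard library. The `InMemoryEventLog` class and `evaluate_flag` function are hypothetical; the event name and the `feature_flag.*` attribute names are modeled on the OpenTelemetry semantic conventions for feature flags as an assumption, and the exact names may differ between versions of the conventions.

```python
import time


class InMemoryEventLog:
    """Hypothetical stand-in for a span or log pipeline; it keeps events in a list."""

    def __init__(self):
        self.events = []

    def add_event(self, name, attributes):
        self.events.append(
            {"name": name, "timestamp": time.time(), "attributes": dict(attributes)}
        )


def evaluate_flag(flags, flag_key, default_variant, log, provider_name="in-memory"):
    """Look up the variant for flag_key and record the evaluation as an event."""
    variant = flags.get(flag_key, default_variant)
    log.add_event(
        "feature_flag",
        {
            # Attribute names are assumptions based on the semantic conventions.
            "feature_flag.key": flag_key,
            "feature_flag.provider_name": provider_name,
            "feature_flag.variant": variant,
        },
    )
    return variant
```

Attaching the variant to every evaluation is what later makes it possible to split failure rates and traces by flag variant.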
Enabling feature
Matching flag key
Enabling for 25% of sessions
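One common way to enable a flag for a fixed share of sessions is to hash the session ID into a bucket from 0 to 99 and compare it with the rollout percentage. The sketch below illustrates that general technique; it is an assumption about how such targeting can work, not the rule engine of any particular flag management tool, and the names `bucket` and `is_enabled` are hypothetical.

```python
import hashlib


def bucket(flag_key: str, session_id: str) -> int:
    """Map a (flag, session) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{session_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, session_id: str, rollout_percent: int) -> bool:
    """Enable the flag for sessions whose bucket falls below the rollout percentage."""
    return bucket(flag_key, session_id) < rollout_percent
```

Because the bucket depends only on the flag key and the session ID, a session keeps the same answer across requests, and a session enabled at 25% stays enabled when the percentage is raised.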
Starting the rollout
Uh oh! That shouldn’t happen.
Abort!
That was close…
Back to normal.
Let’s see what went wrong
Failure rate by flag variant
It only failed when the read replica was enabled.
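Splitting the failure rate by flag variant amounts to grouping requests by the variant recorded at evaluation time. A minimal sketch, assuming each request is represented as a hypothetical `(variant, failed)` pair:

```python
from collections import defaultdict


def failure_rate_by_variant(requests):
    """Return {variant: failed_requests / total_requests} for (variant, failed) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for variant, failed in requests:
        totals[variant] += 1
        if failed:
            failures[variant] += 1
    return {variant: failures[variant] / totals[variant] for variant in totals}


# Made-up sample data for illustration only; not results from the talk.
sample = [("off", False), ("off", False), ("on", True), ("on", False)]
rates = failure_rate_by_variant(sample)
```

In an observability backend the same grouping would typically be a query over telemetry that carries the flag variant as an attribute.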
Aggregate error messages
Something is wrong with node 3
We can drill into a representative trace to confirm.
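Aggregating error messages can be as simple as counting identical messages per database node. The sketch below assumes each error is a hypothetical `(message, node)` pair; the sample values are invented for illustration and are not taken from the talk.

```python
from collections import Counter


def aggregate_errors(errors):
    """Count occurrences of each (message, node) pair."""
    return Counter(errors)


# Made-up sample data for illustration only.
sample_errors = [
    ("connection refused", "node-3"),
    ("connection refused", "node-3"),
    ("query timeout", "node-1"),
    ("connection refused", "node-3"),
]
counts = aggregate_errors(sample_errors)
top_error, top_count = counts.most_common(1)[0]
```

The most frequent pair points at where to look first; a representative trace then confirms or rejects the hypothesis.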
Confirming the diagnosis
It’s definitely node 3
We know what happened
Let’s try that again…
Phase: Read replica disabled
Phase: Read replica enabled for 25%
Phase: Read replica enabled for 50%
Phase: Read replica enabled for 75%
Phase: Read replica enabled
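The phases above can be read as a loop: raise the percentage, check the failure rate, and fall back to 0% if it looks wrong. The following sketch shows that control flow under stated assumptions. The function `run_rollout`, the `failure_rate_at` callback, and the 1% threshold are hypothetical choices for illustration, not values from the talk.

```python
PHASES = [0, 25, 50, 75, 100]


def run_rollout(phases, failure_rate_at, max_failure_rate=0.01):
    """Step through rollout percentages, aborting if a phase looks unhealthy.

    failure_rate_at(percent) should return the failure rate observed while the
    flag is enabled for that percentage. Returns (final_percent, history).
    """
    history = []
    for percent in phases:
        rate = failure_rate_at(percent)
        history.append((percent, rate))
        if rate > max_failure_rate:
            # Abort: disable the flag for everyone and stop advancing.
            return 0, history
    return phases[-1], history
```

With a healthy system the loop ends at 100%; if any phase exceeds the threshold, the rollout returns to 0% and the recorded history shows where it stopped.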
Recap
1. Rolled out an important performance fix
2. Controlled the impact of unforeseen problems
3. Validated assumptions
4. Rolled the new feature out for all users
5. Continuously monitor impact
Questions?
Feedback Appreciated