
Reliability Scoring

February 2019



The Reliability Score

A quantifiable measure to determine whether to promote code from test -> stage -> prod.

A deployment, application, or code tier is deemed more reliable the less it:

* Introduces new errors.

* Makes existing errors happen at a higher rate (i.e. a rate increase).

* Introduces slowdowns into the environment.


The Reliability Scorecard provides an overview of all anomalies detected within a target deployment, application, or infrastructure tier (e.g. AWS, SQL, ..) per day within the selected environment(s), assigning each a weighted reliability score.

Anomalies include:

* New errors.

* Increasing errors.

* Slowdowns.

Each anomaly is assigned a severity and can be drilled into to reveal its root cause.


Scoring Process

1. Detect all errors and slowdowns, and correlate each event to its exact origin in the application codebase to enable trend analysis.

2. Classify each event according to its code location, tier, version, frequency, and handling by the code (logged, swallowed, slowdown, etc.).

3. Prioritize the events whose severity predicts the greatest impact on reliability.
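
As a rough illustration of step 2, here is a minimal sketch of the classified event record such a pipeline might produce. All field and type names are hypothetical and do not reflect the actual OverOps data model:

// A sketch of the classified event record step 2 might produce. Field and
// type names are illustrative only, not the OverOps data model.
public class ClassifiedEvent {
    enum Handling { LOGGED, SWALLOWED, UNCAUGHT, SLOWDOWN }

    String codeLocation;  // e.g. "com.acme.payment.Processor.charge"
    String tier;          // e.g. "SQL", "AWS"
    String version;       // deployment in which the event was observed
    long volume;          // occurrences within the selected timeframe
    double rate;          // occurrences relative to calls into the location
    Handling handling;    // how the code handled the event
}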


Prioritizing Anomalies

Full control over how anomalies are prioritized, without manual threshold configuration.

Assign higher priority to critical applications or code tiers.

All switches, denoted as <switch-name>, that are used to prioritize and score anomalies are available in the Settings screen.
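
For reference, here are the deck's switches and their defaults gathered in one place as a sketch. The constant names and grouping are mine; the <switch-name> identifiers and default values come from the slides that follow:

// Defaults for the scoring switches described in the following slides.
// Constant names are illustrative; the <switch-name> identifiers and their
// default values are taken from this deck.
public final class ScoringDefaults {
    // New / increasing errors
    static final long   ERROR_MIN_VOLUME_THRESHOLD       = 50;    // occurrences
    static final double ERROR_MIN_RATE_THRESHOLD         = 0.10;  // 10%
    static final double ERROR_REGRESSION_DELTA           = 0.50;  // 50%
    static final double ERROR_CRITICAL_REGRESSION_DELTA  = 1.00;  // 100%

    // Baselines
    static final int MIN_BASELINE_TIMESPAN_DAYS = 14;
    static final int BASELINE_TIMESPAN_FACTOR   = 4;

    // Slowdowns
    static final long   ACTIVE_INVOCATIONS_THRESHOLD   = 50;
    static final long   BASELINE_INVOCATIONS_THRESHOLD = 50;
    static final double STD_DEV_FACTOR                 = 1.5;
    static final double OVER_AVG_SLOWING_PERCENTAGE    = 0.30;  // 30%
    static final double OVER_AVG_CRITICAL_PERCENTAGE   = 0.60;  // 60%
    static final double MIN_DELTA_THRESHOLD_MS         = 5.0;

    // Score weights
    static final int    NEW_EVENT_SCORE           = 1;
    static final int    SEVERE_NEW_EVENT_SCORE    = 2;
    static final int    REGRESSION_SCORE          = 1;
    static final int    CRITICAL_REGRESSION_SCORE = 2;
    static final double SCORE_WEIGHT     = 2.5;
    static final double KEY_SCORE_WEIGHT = 5.0;

    private ScoringDefaults() {}
}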


Prioritizing New Errors

An error is considered new (p2) if that type of error (e.g. NPE, log error, ..) at that code location did not occur before the selected timeframe (e.g. last 24h, 7d), or was introduced by the deployment being inspected.

A new error is considered severe (p1) if any of the following holds:

  1. It is uncaught, OR
  2. Its type is defined in the <critical_exception_types> list (e.g. NullPointerException, AssertionError, ..), OR
  3. Its volume exceeds <error_min_volume_threshold> (default 50) AND its rate exceeds <error_min_rate_threshold> (default 10%).
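
A minimal sketch of this p1 check. Method and parameter names are hypothetical; the thresholds are the deck defaults:

import java.util.Set;

// A sketch of the severe-new-error (p1) check above. Names are hypothetical;
// the thresholds are the defaults (<error_min_volume_threshold> = 50,
// <error_min_rate_threshold> = 10%).
public class NewErrorSeverity {
    static boolean isSevere(boolean uncaught, String type,
                            Set<String> criticalTypes, long volume, double rate) {
        // Any one of the three conditions promotes a new error from p2 to p1.
        return uncaught
                || criticalTypes.contains(type)
                || (volume > 50 && rate > 0.10);
    }

    public static void main(String[] args) {
        Set<String> critical = Set.of("NullPointerException", "AssertionError");
        // A caught NPE at low volume and rate: still severe because of its type.
        System.out.println(isSevere(false, "NullPointerException", critical, 3, 0.01)); // true
    }
}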


Increasing Errors

An error is considered increasing (or regressed) if its rate has increased within the selected timeframe (e.g. last 24h, 7d), or the lifetime of the deployment, when compared to its dynamic baseline.

* Example: an increasing error in the last 24h compared to a 14d baseline. Left graph: rate (%). Right graph: volume.


Determining Baselines

The baseline of an error / slowdown is the greater of:

* <min_baseline_timespan> (default 14d, a medium-length baseline).

* <baseline_timespan_factor> (default 4) * the selected time window, or the lifetime of the deployment being scored (for when the selected time frame is too long for a 14d baseline to be meaningful).

For example, when inspecting:

A “payment” app scored over the last 1d, baseline = 14d

The “payment” app over the last 7d, baseline = 28d (4 * 7d)

Deployment 1.1 introduced yesterday, baseline = 14d (4 * 1d is below the 14d minimum)

Deployment 2.0 introduced 10d ago, baseline = 40d (4 * 10d)
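
In code, the baseline rule reduces to a single max, as in this sketch (names are illustrative):

// A sketch of the rule above: the baseline is the greater of
// <min_baseline_timespan> (14d) and <baseline_timespan_factor> (4) times the
// selected window or deployment lifetime.
public class Baseline {
    static int baselineDays(int selectedWindowDays) {
        return Math.max(14, 4 * selectedWindowDays);
    }

    public static void main(String[] args) {
        System.out.println(baselineDays(1));  // 14 ("payment" app over the last 1d)
        System.out.println(baselineDays(7));  // 28 ("payment" app over the last 7d)
        System.out.println(baselineDays(10)); // 40 (deployment 2.0, introduced 10d ago)
    }
}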


Prioritizing Increasing Errors

An error is considered increasing (p2) if:

1. Its volume exceeds <error_min_volume_threshold> (default 50) AND its relative rate exceeds <error_min_rate_threshold> (default 10%).

2. Its rate increased by more than <error_regression_delta> (default 50%) vs. its baseline. If it increased by more than <error_critical_regression_delta> (default 100%), it is severe (p1).

3. A similar volume was not previously observed in the baseline (i.e. "seasonal" behaviour is excluded).

* Example (14d baseline vs. last 1d): even though the average rate in the last 1d is above the 14d baseline, similar volumes appear in the baseline, so the behaviour is still expected or "seasonal" and is not flagged.
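
A rough sketch of these gates. Names are hypothetical; thresholds are the deck defaults; whether a similar volume was seen in the baseline (the "seasonal" exclusion) is assumed to be computed elsewhere and passed in as a boolean:

// A sketch of the increasing-error gates above.
public class IncreasingErrorCheck {
    enum Severity { NONE, P2, P1 }

    static Severity classify(long volume, double rate, double baselineRate,
                             boolean similarVolumeInBaseline) {
        if (volume <= 50 || rate <= 0.10) return Severity.NONE;  // gate 1
        if (similarVolumeInBaseline) return Severity.NONE;       // gate 3: seasonal
        double increase = (rate - baselineRate) / baselineRate;  // relative rate increase
        if (increase > 1.00) return Severity.P1;                 // > +100% vs. baseline
        if (increase > 0.50) return Severity.P2;                 // > +50% vs. baseline
        return Severity.NONE;
    }

    public static void main(String[] args) {
        // Rate more than doubled vs. baseline (0.35 vs. 0.15), non-seasonal: severe.
        System.out.println(classify(120, 0.35, 0.15, false)); // P1
    }
}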


Slowdowns

All code entry points within the application (e.g. transaction handlers) are automatically identified, and their response time is continuously tracked.

Like regressions, each transaction is compared against its own baseline.
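
One plausible mechanic for that comparison, sketched under the assumption that a call counts as slow when it exceeds the baseline mean plus <std_dev_factor> standard deviations (the "1.5 std dev" rule on the next slide); all names here are mine:

// A sketch of deriving a per-transaction slowness threshold from its own
// baseline: mean response time plus <std_dev_factor> standard deviations.
public class SlownessThreshold {
    static double thresholdMs(double[] baselineResponseTimesMs, double stdDevFactor) {
        double mean = 0;
        for (double t : baselineResponseTimesMs) mean += t;
        mean /= baselineResponseTimesMs.length;

        double variance = 0;
        for (double t : baselineResponseTimesMs) variance += (t - mean) * (t - mean);
        variance /= baselineResponseTimesMs.length;

        return mean + stdDevFactor * Math.sqrt(variance);
    }

    public static void main(String[] args) {
        double[] baseline = {40, 42, 38, 41, 39, 60, 40};  // baseline samples (ms)
        System.out.printf("slow if > %.1f ms%n", thresholdMs(baseline, 1.5));
    }
}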


Prioritizing Slowdowns

A transaction is considered slowing down (p2) if all of the following hold:

1. It was called more than <active_invocations_threshold> (default 50) times in the selected timeframe, AND more than <baseline_invocations_threshold> (default 50) times within the baseline period.

2. The percentage of calls with a response time slower than <std_dev_factor> (default 1.5) standard deviations above the baseline average (i.e. slower than the slowest ~9% of baseline calls) is greater than <over_avg_slowing_percentage> (default 30%).

3. The delta between the average response time and the baseline average is greater than <min_delta_threshold> (default 5ms).

If the percentage of slow calls is greater than <over_avg_critical_percentage> (default 60%), the slowdown is severe (p1).
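
A minimal sketch of these gates. Names are hypothetical; thresholds are the deck defaults; slowCallFraction is the fraction of calls in the selected timeframe slower than the baseline's mean + 1.5 std dev threshold:

// A sketch of the slowdown gates above.
public class SlowdownCheck {
    enum Severity { NONE, P2, P1 }

    static Severity classify(long calls, long baselineCalls,
                             double slowCallFraction,
                             double avgMs, double baselineAvgMs) {
        boolean enoughData  = calls > 50 && baselineCalls > 50;   // gate 1
        boolean enoughDelta = (avgMs - baselineAvgMs) > 5.0;      // gate 3
        if (!enoughData || !enoughDelta) return Severity.NONE;
        if (slowCallFraction > 0.60) return Severity.P1;          // severe slowdown
        if (slowCallFraction > 0.30) return Severity.P2;          // slowdown
        return Severity.NONE;
    }

    public static void main(String[] args) {
        // The "payment" example on the next slide: 50ms avg vs. a 40ms baseline
        // average, 35% slow calls (the call counts here are made up).
        System.out.println(classify(200, 1200, 0.35, 50, 40)); // P2
    }
}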


Slowdowns: Example

The transaction "payment" over the last 24h (baseline = 14d) is considered a slowdown if:

1. The transaction was called more than 50 times in the last 24h, AND

2. The transaction was called more than 50 times in the 14d baseline, AND

3. More than 30% of calls had a response time longer than the slowest 9% (= 1.5 std dev) of calls in the baseline period, AND

4. The difference in average response time between the last 24h and the 14d baseline was greater than 5ms (e.g. 50ms in the last 24h vs. 40ms in the previous 14d).

The purpose of these gates is to avoid false positives.


Reliability Scores

For each new error, <new_event_score> (default 1) is deducted from the score. For severe new errors, <severe_new_event_score> (default 2) is used.

For each slowdown or regression, <regression_score> (default 1) is deducted. For severe slowdowns or regressions, <critical_regression_score> (default 2) is used.

The sum of deductions is multiplied by <score_weight> (default 2.5). For apps / tiers marked as "key", <key_score_weight> (default 5) is used.

Each score is normalized (divided) over the selected time period for an app / tier (e.g. 2d), or, in the case of deployments, over the deployment's lifetime.

Scoring is done for a target app, deployment or tier.
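
A sketch of this arithmetic: sum the per-anomaly deductions, weight them, normalize by the number of days scored, and subtract from a starting score. The 100-point base and the exact order of operations are assumptions; the deck describes the deductions, weight, and normalization but not the base:

// A sketch of the score arithmetic above. The 100-point starting value
// is an assumption, not stated in the deck.
public class ReliabilityScore {
    static double score(int newErrors, int severeNewErrors,
                        int regressions, int severeRegressions, // incl. slowdowns
                        boolean keyApp, double daysScored) {
        double deductions = newErrors * 1 + severeNewErrors * 2
                          + regressions * 1 + severeRegressions * 2;
        double weight = keyApp ? 5.0 : 2.5;
        return 100.0 - (deductions * weight) / daysScored;
    }

    public static void main(String[] args) {
        // An app scored over 2d with 2 new errors, 1 severe new error,
        // and 1 slowdown: 100 - (5 * 2.5) / 2 = 93.75.
        System.out.printf("%.2f%n", score(2, 1, 1, 0, false, 2)); // 93.75
    }
}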


Deployment Scoring: Example

Each deployment is scored according to its individual lifetime.


App Scoring: Example

Scoring applications over the last 2d.


Learn more

Learn more about additional OverOps Reliability dashboards:


Copyright © 2018 OverOps. All rights reserved.