1 of 41

Software Engineering

for Machine Learning Systems

Week five: Reliability and monitoring

Imperial DoC, Spring 2024

Andrew Eland

a.eland@imperial.ac.uk

CC BY-SA 4.0 (photos covered separately)

2 of 41

Reliability and monitoring

3 of 41

4 of 41

5 of 41

6 of 41

7 of 41

“As this was an inversion of the usual regulations, I inquired very minutely into the authority on which it rested.

Being satisfied of this point, I desired him to order out my train immediately.”

8 of 41

“He returned with news that the fireman had neglected his duty, but that the engine would be ready in less than a quarter of an hour. The officer took pains to assure me that there was no danger whichever line we might travel, as there could be no engine but our own until 5 o’clock in the evening.”

9 of 41

“While we were conversing together, my ear, which had become particularly sensitive to the distant sound of an engine, told me that one was approaching.

“I mentioned it to the railway official. He did not hear it, and said, ‘Sir, it is impossible.’”

10 of 41

“Knowing that it would stop at the engine house, I ran as fast as I could to that spot. I found a single engine, from which Brunel, covered with smoke and black, had just descended.

“Brunel told me that he had posted from Bristol to meet the only train, but had missed it. ‘Fortunately,’ he said, ‘I found this engine with its fire up, so I ordered it up and have driven it the whole way at the rate of 50 miles per hour’.”

Letter from Charles Babbage, quoted from “Red for Danger” by L. T. C. Rolt.

11 of 41

[The Design Squiggle by Damien Newman: the path from noise, uncertainty, and patterns to insights, clarity, and focus]

12 of 41

Reliability

What does reliable even mean?

For a basic static website, reliable could simply mean that the people who need to access the website can, in fact, access it.

In more complex cases, for example our AKI detection system, defining reliability can be more involved. Ideally, it involves a conversation with the people commissioning, or depending on, the system we’re building, but it’s often a conversation they’re unprepared for.


13 of 41

Service level indicators and objectives

If we believe reliability is an important quality for the system we’re building, we need to define quantitative proxy metrics to allow design tradeoffs, and to ensure a running system maintains the desired level of reliability.

A service level indicator, or SLI, is a quantified measure of some aspect of the service provided, for example its availability or latency.

A service level objective, or SLO, is the target value for an SLI. They form the basis for principled discussions about the desired reliability of a system.
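As a concrete sketch of how an SLI and SLO fit together (the percentile, threshold, and latency values below are purely illustrative, and numpy is assumed to be available):

import numpy as np

def latency_sli(latencies_ms: list[float], percentile: float = 99.0) -> float:
    # SLI: the latency below which the given percentage of requests complete.
    return float(np.percentile(latencies_ms, percentile))

SLO_MS = 3000.0  # SLO: 99th percentile latency stays under three seconds

observed = [120.0, 180.0, 95.0, 2400.0, 310.0]  # latencies from one measurement window
print("within SLO" if latency_sli(observed) <= SLO_MS else "SLO breached")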


14 of 41

Service level agreements

A service level agreement, or SLA, is an agreement between the people building a system, and those commissioning and using it, on the acceptable value of an SLI.

While an SLO is set by the people building a system, an SLA is agreed with the people commissioning and using it. Failing to meet an SLA metric will likely cause financial or legal penalties for the organisation operating a system (or cause you to lose marks in coursework 6).


15 of 41

Availability

The fraction of time during which a system can be used, or availability, is a common, informal, high level metric. Defining it formally can be difficult. It’s often described as a “number of nines”.

Availability    Weekly downtime (approximate)
90%             17 hours
99%             2 hours
99.9%           10 minutes
99.99%          1 minute
99.999%         6 seconds
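The downtime figures follow directly from the availability fraction; a quick sketch of the arithmetic:

WEEK_SECONDS = 7 * 24 * 60 * 60  # 604,800 seconds in a week

for availability in (0.90, 0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - availability) * WEEK_SECONDS / 60
    print(f"{availability:.3%} available: ~{downtime_minutes:,.1f} minutes of downtime per week")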


16 of 41

Availability

The fraction of time during which a system can be used, or availability, is a common, informal, high level metric. Defining it formally can be difficult. It’s often described as a “number of nines”.

90%    99%    99.9%    99.99%    99.999%

For comparison on this scale: Gmail achieves 99.984%; a single-region Kubernetes cluster on GCP offers 99.5%, or about 50 minutes of downtime per week.


17 of 41

Availability

The fraction of time during which a system can be used, or availability, is a common, informal, high level metric. Defining it formally can be difficult. It’s often described as a “number of nines”.

90%    99%    99.9%    99.99%    99.999%

Each additional nine demands more: first a reliable design, then a reliable team, then a reliable organisation; the last is basically not possible.


18 of 41

Metrics for reliability

If we’re designing our system around an SLI, we obviously need to be able to measure it for our running system, so we know whether we’re meeting it or not.

Monitoring the SLI for a running system may involve a significant amount of code and infrastructure.

Though we’ll ultimately measure ourselves against the SLI, it may not be the most useful leading-edge metric to detect and resolve problems. Metrics designed to spot specific failure modes (such as a socket failing to connect, or a pod crashing) will be more useful.


19 of 41

Metrics for machine learning systems

When a system incorporates machine learning, the data fed to a model, and the predictions made by it, are key aspects of the system’s reliability. We need to define metrics that will allow us to tell whether the model is performing as we expect.

We can design metrics for the model output (for example, positive prediction rate), or for the model input (for example, the median value of a feature over the last hour). We’d typically include both to help diagnose a problem.
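A minimal sketch of what such metrics might look like, with in-memory lists standing in for however predictions and feature values are actually collected:

from statistics import median

def positive_prediction_rate(predictions: list[int]) -> float:
    # Output metric: fraction of positive (AKI) predictions in the window.
    return sum(predictions) / len(predictions) if predictions else 0.0

def median_feature(values: list[float]) -> float:
    # Input metric: median value of a feature over the reporting window.
    return median(values) if values else float("nan")

print(positive_prediction_rate([0, 1, 0, 0, 1]))  # 0.4
print(median_feature([72.0, 80.5, 110.0, 65.2]))  # 76.25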


20 of 41

Metrics influence design

One of the goals of defining reliability metrics is to help make reasoned design tradeoffs.

If we care about 90th percentile latency, for example, it would make more sense to pick the next request to serve using a stack, rather than a queue.
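To make the tradeoff concrete, here is a sketch of the two ways of picking the next request from a backlog; under overload, LIFO keeps most requests fast at the cost of a few very slow ones, which can help a 90th percentile target:

from collections import deque

pending: deque[str] = deque()

def enqueue(request: str) -> None:
    pending.append(request)

def next_request_fifo() -> str:
    # Queue: oldest first, so a backlog delays every request a little.
    return pending.popleft()

def next_request_lifo() -> str:
    # Stack: newest first, so most requests stay fast and a few old
    # ones become very slow, which can improve the 90th percentile.
    return pending.pop()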


21 of 41

Exporting metrics

Metrics can be pulled: for example, your system can expose them via a simple web server that another process queries, or through a remote procedure call.

Metrics can also be pushed: for example, by adding them to a request already being made as part of the system’s normal operation, or by writing them to a file in a specific location.
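A minimal sketch of the pull approach using only the standard library; the port and the counter names are illustrative:

from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"messages_received_total": 0, "pages_sent_total": 0}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose the current counter values as plain text for a scraper to pull.
        body = "\n".join(f"{name} {value}" for name, value in METRICS.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()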


22 of 41

Failure domains

A failure domain is the subsection of a software system that is negatively impacted when a component fails.

A failure domain can be defined by software: for example, consider which components of our AKI detection system would fail if the PAS stopped sending us ADT messages.

Failure domains are also defined by hardware: for example, the set of services that will fail if a given data centre catches fire.

Designing for reliability requires reducing the size of failure domains, and actively mitigating those that remain, often through replication.


23 of 41

Failing safe

Ensuring the behaviour of a system is safe in the presence of a given error.

We might ensure, for example, that our AKI detection system doesn’t acknowledge incoming HL7 messages until they’ve been successfully processed or persisted. If our code fails, we’d receive the same message again when we restart, as we wouldn’t have acknowledged it.

This design would reduce the size of the failure domain for message processing.

If it’s the message itself that caused our code to fail, and this happens reliably, we have a “query of death” that will cause our code to fail on every restart. This itself could lead to cascading failures.
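A minimal sketch of the acknowledge-after-persist idea; the connection object and its acknowledge method are a hypothetical interface, not the coursework API, and /state is assumed to be a persistent directory:

import json
from pathlib import Path

STATE_FILE = Path("/state/messages.jsonl")

def persist(message: str) -> None:
    # Append the raw message to persistent storage before acknowledging it.
    with STATE_FILE.open("a") as f:
        f.write(json.dumps({"raw": message}) + "\n")

def handle_message(connection, message: str) -> None:
    persist(message)                 # may raise; the message stays unacknowledged
    connection.acknowledge(message)  # only acknowledge once safely persisted

Note that a message which makes persist fail every time would be redelivered forever: that is the query of death, and quarantining such messages after a few attempts is one way to contain it.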


24 of 41

Failing open

Ensuring an action isn’t unnecessarily blocked in the presence of an error.

We might decide, for example, that our AKI detection system continues to send alerts even if it can’t persist its state. That would make eventual recovery more complex, but we’d stay within our SLO until the system was restarted.

This approach would reduce the size of the failure domain for our persistence system.
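A minimal sketch of failing open; save_state and send_page are illustrative stand-ins for the real persistence and paging code:

import logging

def save_state(state: dict) -> None:
    raise OSError("disk unavailable")  # stand-in for a failing persistence layer

def send_page(mrn: str) -> None:
    print(f"page sent for {mrn}")      # stand-in for the real paging request

def alert(mrn: str, state: dict) -> None:
    try:
        save_state(state)
    except OSError:
        # Failing open: record the persistence failure but still deliver the alert.
        logging.exception("could not persist state; continuing anyway")
    send_page(mrn)

alert("478237423", {"creatinine_history": [80.1, 150.3]})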


25 of 41

Designing for recovery

Making recovery from a failed state an explicit, and automated, part of the design.

Rather than ignore blood test results for patients with unknown demographics, our AKI detection system could persist the results anyway, and generate predictions if the demographics later become available.

This design would allow automated recovery from a failed PAS, and therefore reduce the PAS’s failure domain.
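A minimal sketch of that recovery path; the dictionaries and the predict stub are illustrative rather than the coursework’s actual structure:

pending_results: dict[str, list[float]] = {}  # MRN -> creatinine results seen so far
demographics: dict[str, dict] = {}            # MRN -> age, sex, ...

def predict(mrn: str, creatinine: float) -> None:
    ...  # run the model and page if an AKI is detected

def on_result(mrn: str, creatinine: float) -> None:
    if mrn in demographics:
        predict(mrn, creatinine)
    else:
        pending_results.setdefault(mrn, []).append(creatinine)  # keep it, don't drop it

def on_demographics(mrn: str, record: dict) -> None:
    demographics[mrn] = record
    for creatinine in pending_results.pop(mrn, []):
        predict(mrn, creatinine)  # automated recovery once the PAS catches up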


26 of 41

Postmortems

A postmortem is a process that leads to a written record of an incident: its root cause, the actions taken to resolve it, the engineering work that would prevent it from occurring again, and the process changes that would reduce the time to resolution.

Postmortems see failure as an opportunity to strengthen a system. They’re blameless and transparent.


27 of 41

Postmortems in the engineering process

We can define an error budget as 1 - SLO. This is a quantifiable amount of unreliability we can spend on risky changes, or on choosing not to improve software components that are no longer fit for purpose.

When an error budget comes close to being exhausted, the organisation can respond by reducing risky changes, or prioritising more of the engineering work identified in postmortems.

A mechanism to prioritise work from postmortems closes a loop in the reliability system, and brings us closer to the reliable teams and organisations that are a requirement for building high availability systems.
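A small worked example of the arithmetic, with the SLO value chosen purely for illustration:

WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a week

slo = 0.999                 # illustrative SLO
error_budget = 1 - slo      # 0.001
print(f"{error_budget:.1%} of a week is about {error_budget * WEEK_MINUTES:.0f} minutes")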


28 of 41

Coursework one feedback:

Inference

29 of 41

Coursework one feedback

Done well:
Automated tests that exit on failure
Useful docstrings
Uses Python typing to help explain code
Explicitly tests inference
Code in well named functions that provide encapsulation
Explicitly implements the NHS algorithm for comparison

Could be improved:
Long functions or files that are hard to understand
Comments that duplicate code
No obvious testing strategy
Confusing error handling, e.g. misusing exceptions
Confusing algorithm design, e.g. no explanation for pre-processing
Unused or commented-out code
Needless inefficiency, e.g. O(n²) when O(n) is trivial


30 of 41

Coursework one feedback

# find if higher ratio above the threshold of 1.5

ratio = (C1 / min(RV1, RV2))

is_above = ratio >= 1.5


31 of 41

Coursework one feedback

# find if higher ratio above the threshold of 1.5

ratio = (C1 / min(RV1, RV2))


32 of 41

Coursework five and six:

Running in the live environment

33 of 41

Timeline

Lecture: reliability (you are here)

Lab

Lecture: ethics and society (guest lecturer!)

Lab

Lecture: design

Lab

No lecture

Possibly prizes???

Coursework 5

Coursework 6


34 of 41

Timeline

Coursework 5: Do whatever you need to make your system run reliably. I suggest making sure you handle state correctly, and adding basic metrics. Mark weighting: 0 weeks.

Coursework 6: Meet the expected SLA. The testing period starts after Monday’s lecture, and ends at the start of Friday’s lab. Mark weighting: 4 weeks.


35 of 41

Timeline

Coursework 5: Test environment available around 11am on Friday. Messages will be sent in real time. It won’t replay messages when reconnecting. We can rewind it for you if you ask.

Coursework 6: Live environment will start at 1pm on Monday. Messages will be sent in real time. It will not be possible to replay messages.


36 of 41

SLA for coursework 6

Maintain a better F3 score than the NHS model.

Alerts delivered within 3 seconds of having the necessary data.

Your mark will be the fraction of time during the testing period that your service meets the SLA.
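For reference, an F3 score is the F-beta score with beta = 3, weighting recall much more heavily than precision. A sketch using scikit-learn, with made-up labels:

from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth: AKI or not
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # the alerts your system raised

print(fbeta_score(y_true, y_pred, beta=3))  # 0.75 for this example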


37 of 41

API change

To page the hospital’s clinical response team, make an HTTP POST request to /page, with the MRN of the relevant patient as the body of the request, and optionally the timestamp of the blood test result triggering the alert. Without a timestamp, we assume the time of the last test result sent for that patient.

POST /page HTTP/1.0

Content-Type: text/plain

478237423,202401221000
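A minimal sketch of sending such a request from Python with the standard library; the host and port are placeholders for wherever the pager endpoint actually runs:

from urllib.request import Request, urlopen

def page(mrn: str, timestamp: str | None = None, host: str = "localhost:8441") -> None:
    body = mrn if timestamp is None else f"{mrn},{timestamp}"
    request = Request(
        f"http://{host}/page",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urlopen(request) as response:
        assert response.status == 200

page("478237423", "202401221000")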

38 of 41

Late results and postmortems

Sending a page with a historical blood test result will allow you to correct for an earlier system failure, bringing your F3 score back within the SLA.

We will ignore latency SLA breaches that occur between the time an incident is declared and the time an acceptable postmortem is published.

Incidents should be declared by posting an edstem thread, and closed by following up on that thread with a postmortem document.


39 of 41

Metrics

I strongly suggest you use coursework 5 to implement basic metrics. If you export metrics for messages received and pages sent, we will monitor them for you and alert you by email. Prometheus is a widely used open source monitoring system whose client library may make this easier.

GET /metrics HTTP/1.0

HTTP/1.0 200 OK

Content-Type: text/plain

messages_received_total 1402

pages_sent_total 23
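If you use the prometheus_client package, a sketch along these lines would expose counters matching the names above; the port is arbitrary, and the library adds the _total suffix when exposing counters:

from prometheus_client import Counter, start_http_server

MESSAGES_RECEIVED = Counter("messages_received", "HL7 messages received")
PAGES_SENT = Counter("pages_sent", "Pages sent to the clinical response team")

start_http_server(8000)  # serves GET /metrics on port 8000 in a background thread

# In the message-handling loop:
MESSAGES_RECEIVED.inc()
# ...and whenever a page is sent:
PAGES_SENT.inc()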


40 of 41

State

If you are using the example Kubernetes configuration, remember that /state is the only directory that will survive restarts. Changes written elsewhere in your Docker container’s filesystem will not be persisted.

Remember that the test and live environments will not replay messages. You may need to restructure your code to handle this.
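A minimal sketch of keeping state under /state so it survives a pod restart; the filename and the shape of the state are illustrative:

import json
from pathlib import Path

STATE_PATH = Path("/state/history.json")

def load_state() -> dict:
    # On startup, reload whatever was persisted before the last restart.
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {}

def save_state(state: dict) -> None:
    # Write to a temporary file, then rename, so a crash mid-write
    # cannot leave a corrupted state file behind.
    tmp = STATE_PATH.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(STATE_PATH)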


41 of 41

Good luck. See you Friday.

a.eland@imperial.ac.uk

andreweland.org/swemls