
Principles of Software Construction: Objects, Design, and Concurrency

DevOps

Jeremy Lacomis Christian Kaestner


17-214/514


Almost there…

Subtype Polymorphism ✓

Information Hiding, Contracts ✓

Immutability ✓

Types ✓

Unit Testing ✓

Domain Analysis ✓

Inheritance & Del. ✓

Responsibility Assignment, Design Patterns, Antipatterns ✓

Promises / Reactive P. ✓

Static Analysis ✓

GUI vs Core ✓

Frameworks and Libraries ✓, APIs ✓

Distributed systems,�microservices ✓

Testing for Robustness ✓

CI ✓, DevOps

Design for: understanding, change/ext., reuse, robustness, ...

Small scale: one/few objects

Mid scale: many objects

Large scale: subsystems


Testing in Production


Which design is better for signups?


Which design is better for sales?


Bing’s $100M/year experiment


Bing’s $100M/year experiment

  • Small change
  • Low priority, queued for months
  • +12% ad revenue
  • No reduction in other engagement metrics


How often to release a new version?



Enabling Frequent Releases & Experimentation: DevOps


Programming Reality


Today’s Topics

From CI to CD

Containers

Configuration management

Monitoring

Feature flags, testing in production


Recall: Continuous Integration



Continuous Integration

  • Automation
  • Ensures absence of obvious build issues and configuration issues (e.g., dependencies all checked in)
  • Ensures tests are executed
  • May encourage more tests
  • Can run checks on different platforms


Continuous Integration

  • Automation
  • Ensures absence of obvious build issues and configuration issues (e.g., dependencies all checked in)
  • Ensures tests are executed
  • May encourage more tests
  • Can run checks on different platforms

  • What else can be automated?


Any repetitive QA work remaining?


Releasing Software


Semantic Versioning for Releases

  • Given a version number MAJOR.MINOR.PATCH, increment the:
    • MAJOR version when you make incompatible API changes,
    • MINOR version when you add functionality in a backwards-compatible manner, and
    • PATCH version when you make backwards-compatible bug fixes.
  • Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

http://semver.org/
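The bump rules above can be sketched as a small helper. `bump` is a hypothetical function name for illustration, not part of the semver specification:

```python
# Sketch of the MAJOR.MINOR.PATCH bump rules; "bump" is a
# hypothetical helper, not part of the semver specification.
def bump(version: str, change: str) -> str:
    major, minor, patch = (int(x) for x in version.split("."))
    if change == "incompatible":   # breaking API change
        return f"{major + 1}.0.0"
    if change == "feature":        # backwards-compatible functionality
        return f"{major}.{minor + 1}.0"
    if change == "fix":            # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change}")
```

Note that MINOR and PATCH reset to zero when a more significant component is incremented.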


Versioning entire projects


Release management with branches


Release cycle of Facebook’s apps


Release Challenges for Mobile Apps

  • Large downloads
  • Download time at user discretion
  • Different versions in production
  • Pull support for old releases?

Any alternatives?


Release Challenges for Mobile Apps

  • Large downloads
  • Download time at user discretion
  • Different versions in production
  • Pull support for old releases?

Server-side releases are silent, quick, and consistent
→ App as container, most content + layout delivered from the server


From Release Date to Continuous Release

  • Traditional View: Boxed Software
    • Working toward fixed release date, QA heavy before release
    • Release and move on
    • Fix post-release defects in next release or through expensive patches


From Release Date to Continuous Release

  • Traditional View: Boxed Software
    • Working toward fixed release date, QA heavy before release
    • Release and move on
    • Fix post-release defects in next release or through expensive patches
  • Frequent releases
    • Incremental updates delivered frequently (weeks, days, …), e.g., browsers
    • Automated updates (“patch culture”; “updater done? ship it”)


Efficiency of release pipeline

https://www.slideshare.net/jmcgarr/continuous-delivery-at-netflix-and-beyond


From Release Date to Continuous Release

  • Traditional View: Boxed Software
    • Working toward fixed release date, QA heavy before release
    • Release and move on
    • Fix post-release defects in next release or through expensive patches
  • Frequent releases
    • Incremental updates delivered frequently (weeks, days, …), e.g., browsers
    • Automated updates (“patch culture”; “updater done? ship it”)
  • Hosted software
    • Frequent incremental releases, hot patches, different versions for different customers, customer may not even notice update


CC BY-SA 4.0, G. Détrez


The Shifting Development-Operations Barrier


Common Release Problems?


Common Release Problems (Examples)

  • Missing dependencies
  • Different compiler versions or library versions
  • Different local utilities (e.g., Unix grep vs. macOS grep)
  • Database problems
  • OS differences
  • Too slow in real settings
  • Difficult to roll back changes
  • Source from many different repositories
  • Obscure hardware? Cloud? Enough memory?


The Dev-Ops Divide

  • Coding
  • Testing, static analysis, reviews
  • Continuous integration
  • Bug tracking
  • Running local tests and scalability experiments
  • Allocating hardware resources
  • Managing OS updates
  • Monitoring performance
  • Monitoring crashes
  • Managing load spikes, …
  • Tuning database performance
  • Running distributed at scale
  • Rolling back releases

QA responsibilities in both roles

QA Does not Stop in Dev


QA Does not Stop in Dev

  • Ensuring product builds correctly (e.g., reproducible builds)
  • Ensuring scalability under real-world loads
  • Supporting environment constraints from real systems (hardware, software, OS)
  • Efficiency with given infrastructure
  • Monitoring (server, database, Dr. Watson, etc.)
  • Bottlenecks, crash-prone components, … (possibly thousands of crash reports per day/minute)


DevOps


Key Ideas and Principles

Better coordinate between developers and operations (collaborative)

Key goal: Reduce friction bringing changes from development into production

Considering the entire tool chain through to production (holistic)

Documentation and versioning of all dependencies and configurations ("configuration as code")

Heavy automation, e.g., continuous delivery, monitoring

Small iterations, incremental and continuous releases

Buzzword!


Common Practices

All configurations in version control

Test and deploy in containers

Automated testing, testing, testing, ...

Monitoring, orchestration, and automated actions in practice

Microservice architectures

Release frequently


Heavy Tooling and Automation


Heavy tooling and automation -- Examples

Infrastructure as code — Ansible, Terraform, Puppet, Chef

CI/CD — Jenkins, TeamCity, GitLab, Shippable, Bamboo, Azure DevOps

Test automation — Selenium, Cucumber, Apache JMeter

Containerization — Docker, Rocket, Unik

Orchestration — Kubernetes, Swarm, Mesos

Software deployment — Elastic Beanstalk, Octopus, Vamp

Measurement — Datadog, DynaTrace, Kibana, NewRelic, ServiceNow


DevOps: Tooling Overview


DevOps Tools

  • Containers and virtual machines (Docker, …)
  • Orchestration and configuration (Ansible, Puppet, Chef, Kubernetes, …)

  • Sophisticated (custom) pipelines


Tooling for Building

  • Let’s talk about Docker


  • Like a virtual machine, but:
  • Lightweight virtualization
  • Sub-second boot time
  • Shareable virtual images with full setup incl. configuration settings
  • Used in development and deployment
  • Separate docker images for separate services (web server, business logic, database, …)


  • Why might DevOps programmers like this?
  • How do you automate infrastructure?


Configuration management, Infrastructure as Code

  • Scripts to change system configurations (configuration files, install packages, versions, …); declarative vs imperative
  • Usually put under version control

$nameservers = ['10.0.2.3']
file { '/etc/resolv.conf':
  ensure  => file,
  owner   => 'root',
  group   => 'root',
  mode    => '0644',
  content => template('resolver/r.conf'),
}

(Puppet)

- hosts: all
  sudo: yes
  tasks:
    - apt: name={{ item }}
      with_items:
        - ldap-auth-client
        - nscd
    - shell: auth-client-config -t nss -p lac_ldap
    - copy: src=ldap/my_mkhomedir dest=/…
    - copy: src=ldap/ldap.conf dest=/etc/ldap.conf
    - shell: pam-auth-update --package
    - shell: /etc/init.d/nscd restart

(ansible)


Tooling for Execution

Containers drastically simplify managing ops


Container Orchestration with Kubernetes

  • Manages which container to deploy to which machine
  • Launches and kills containers depending on load
  • Manages updates and routing
  • Automated restart, replacement, replication, scaling
  • Kubernetes master controls many nodes


CC BY-SA 4.0, Khtan66


Tooling for Execution

We’ll talk about Cloud next week

How about monitoring?


Monitoring

  • Monitor server health
  • Monitor service health
  • Collect and analyze measures or log files
  • Dashboards and triggering automated decisions
    • Many tools, e.g., Grafana as dashboard, Prometheus for metrics, Loki + Elasticsearch for logs
    • Push and pull models
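The collect-and-trigger idea behind automated monitoring decisions can be illustrated with a toy sketch. Function names and the alert threshold are invented for illustration, not from any specific tool:

```python
# Toy monitor: compute an error rate from collected log lines and
# decide whether an automated alert/action should fire.
def error_rate(log_lines):
    errors = sum(1 for line in log_lines if " ERROR " in line)
    return errors / len(log_lines) if log_lines else 0.0

def should_alert(log_lines, threshold=0.05):
    # threshold is a made-up example value (5% error rate)
    return error_rate(log_lines) > threshold
```

Real systems like Prometheus express such rules declaratively over time-series metrics rather than raw logs, but the decision structure is the same.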


Grafana


Testing in Production


Testing in Production


Chaos Experiments


Crash Telemetry


What If

... we had plenty of subjects for experiments

... we could randomly assign subjects to treatment and control group without them knowing

... we could analyze small individual changes and keep everything else constant

▶ Ideal conditions for controlled experiments


Experiment Size

With enough subjects (users), we can run many, many experiments

Even very small experiments become feasible

Toward causal inference


A/B Testing


Implementing A/B Testing

Implement alternative versions of the system

  • Using feature flags (decisions in implementation)
  • Separate deployments (decision in router/load balancer)

Map users to treatment group

  • Randomly from distribution
  • Static user-group mapping
  • Online service (e.g., launchdarkly, split)

Monitor outcomes per group

  • Telemetry, sales, time on site, server load, crash rate
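The three steps above (alternative versions, stable user-to-group mapping, per-group outcome monitoring) can be sketched minimally. All names here are hypothetical, not from any A/B-testing service:

```python
import hashlib
from collections import defaultdict

def assign_group(user_id: str, experiment: str) -> str:
    # Stable assignment: hashing (experiment, user) means the same
    # user always lands in the same group for a given experiment.
    h = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if h % 100 < 50 else "control"  # 50/50 split

# Outcome telemetry collected per group (e.g., time on site, sales)
outcomes = defaultdict(list)

def record_outcome(user_id: str, experiment: str, value: float) -> None:
    outcomes[assign_group(user_id, experiment)].append(value)
```

Hashing per experiment (rather than per user alone) keeps group assignments independent across concurrent experiments.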


Feature Flags

Boolean options

Good practices: tracked explicitly, documented, keep them localized and independent

External mapping of flags to customers

  • who should see what configuration
  • e.g., 1% of users sees one_click_checkout, but always the same users; or 50% of beta-users and 90% of developers and 0.1% of all users

if (features.enabled(userId, "one_click_checkout")) {
  // new one-click checkout functionality
} else {
  // old checkout functionality
}

def isEnabled(user): Boolean = (hash(user.id) % 100) < 10
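The hash-based rollout sketched on this slide can be made concrete; this is an illustrative Python version, with the 10% cutoff mirroring the slide and the hash choice (CRC32) picked only for stability across runs:

```python
import zlib

def is_enabled(user_id: str, flag: str, percent: int = 10) -> bool:
    # Stable bucket in [0, 100): the same user always gets the same
    # answer for the same flag, as the external flag mapping requires.
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < percent

if is_enabled("user-42", "one_click_checkout"):
    pass  # new one-click checkout functionality
else:
    pass  # old checkout functionality
```

Python's built-in `hash()` is deliberately avoided here: it is randomized per process, so it would reshuffle users on every restart.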


Comparing Outcomes

Group A: base game
2158 users, average 18:13 min time on site

Group B: game with extra god cards
10 users, average 20:24 min time on site
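With only 10 users in Group B, the difference above may be pure noise; a significance check is needed before drawing conclusions. A hand-rolled Welch's t statistic (sketch only; the sample lists are invented stand-ins, not the slide's real data) illustrates the shape of such a check:

```python
import statistics as st

def welch_t(a, b):
    # Welch's t statistic for two samples with unequal variances:
    # (mean(a) - mean(b)) / sqrt(var(a)/n_a + var(b)/n_b)
    va, vb = st.variance(a), st.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (st.mean(a) - st.mean(b)) / se

# Hypothetical minutes-on-site samples, NOT the slide's actual data:
group_a = [17, 18, 19, 18, 18, 19, 17, 18]
group_b = [20, 21, 20]
t = welch_t(group_b, group_a)
```

In practice the t statistic is compared against a t distribution (e.g., via scipy) to get a p-value; tiny treatment groups yield wide confidence intervals.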


Canary Releases


Canary Releases

  • Testing releases in production
  • Incrementally deploy a new release to users, not all at once
  • Monitor difference in outcomes (e.g., crash rates, performance, user engagement)
  • Automatically roll back bad releases
  • Technically similar to A/B testing
  • Telemetry essential
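The monitor-and-roll-back loop described above might be sketched like this; the threshold and function names are invented for illustration, not from any real deployment tool:

```python
def canary_decision(baseline_crash_rate: float,
                    canary_crash_rate: float,
                    max_relative_increase: float = 0.10) -> str:
    # Roll back if the canary's crash rate exceeds the baseline
    # by more than the allowed relative increase (here: +10%).
    limit = baseline_crash_rate * (1 + max_relative_increase)
    return "rollback" if canary_crash_rate > limit else "promote"
```

A real pipeline would compare several metrics (crashes, latency, engagement) over a monitoring window and widen the rollout in stages on each "promote".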


Canary Releases


Canary Releases at Facebook

Phase 0: Automated unit tests

Phase 1: Release to Facebook employees

Phase 2: Release to subset of production machines

Phase 3: Release to full cluster

Phase 4: Commit to master, rollout everywhere

Monitored metrics: server load, crashes, click-through rate

Further readings: Tang, Chunqiang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic configuration management at Facebook. In Proceedings of the 25th Symposium on Operating Systems Principles, pp. 328-343. ACM, 2015. And Rossi, Chuck, Elisa Shibley, Shi Su, Kent Beck, Tony Savor, and Michael Stumm. Continuous deployment of mobile software at Facebook (showcase). In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 12-23. ACM, 2016.


Real DevOps Pipelines are Complex

  • Incremental rollout, reconfiguring routers
  • Canary testing
  • Automatically rolling back bad changes


Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic Configuration Management at Facebook. Proc. of SOSP: 328--343 (2015).


Summary

Increasing automation of tests and deployments

Containers and configuration management tools help with automation, deployment, and rollbacks

Monitoring becomes important

Many new opportunities for testing in production (feature flags are common)
