
Principles of Software Construction: Objects, Design, and Concurrency

DevOps

Jeremy Lacomis Christian Kaestner


17-214/514


Almost there…

Subtype Polymorphism ✓

Information Hiding, Contracts ✓

Immutability ✓

Types ✓

Unit Testing ✓

Domain Analysis ✓

Inheritance & Del. ✓

Responsibility Assignment, Design Patterns, Antipatterns ✓

Promises / Reactive P. ✓

Static Analysis ✓

GUI vs Core ✓

Frameworks and Libraries ✓, APIs ✓

Distributed systems,�microservices ✓

Testing for Robustness ✓

CI ✓, DevOps

Design for: understanding, change/ext., reuse, robustness, ...

Small scale: one/few objects

Mid scale: many objects

Large scale: subsystems


Testing in Production


Which design is better for signups?


Which design is better for sales?


Bing’s $100M/year experiment


Bing’s $100M/year experiment

  • Small change
  • Low priority, queued for months
  • +12% ad revenue
  • No reduction in other engagement metrics


How often to release a new version?



Enabling Frequent Releases & Experimentation: DevOps


Programming Reality


Today’s Topics

From CI to CD

Containers

Configuration management

Monitoring

Feature flags, testing in production


Recall: Continuous Integration



Continuous Integration

  • Automation
  • Ensures absence of obvious build issues and configuration issues (e.g., dependencies all checked in)
  • Ensures tests are executed
  • May encourage more tests
  • Can run checks on different platforms


Continuous Integration

  • Automation
  • Ensures absence of obvious build issues and configuration issues (e.g., dependencies all checked in)
  • Ensures tests are executed
  • May encourage more tests
  • Can run checks on different platforms

  • What else can be automated?


Any repetitive QA work remaining?


Releasing Software


Semantic Versioning for Releases

  • Given a version number MAJOR.MINOR.PATCH, increment the:
    • MAJOR version when you make incompatible API changes,
    • MINOR version when you add functionality in a backwards-compatible manner, and
    • PATCH version when you make backwards-compatible bug fixes.
  • Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

http://semver.org/
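The bump rules above can be sketched as a small helper. `bump` is a hypothetical function name for illustration, not part of the semver specification:

```python
# Sketch of the MAJOR.MINOR.PATCH bump rules; "bump" is a
# hypothetical helper, not part of the semver specification.
def bump(version: str, change: str) -> str:
    major, minor, patch = (int(x) for x in version.split("."))
    if change == "incompatible":   # breaking API change
        return f"{major + 1}.0.0"
    if change == "feature":        # backwards-compatible functionality
        return f"{major}.{minor + 1}.0"
    if change == "fix":            # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change}")
```

Note that MINOR and PATCH reset to zero when a more significant component is incremented.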


Versioning entire projects


Release management with branches


Release cycle of Facebook’s apps


Release Challenges for Mobile Apps

  • Large downloads
  • Download time at user discretion
  • Different versions in production
  • Pull support for old releases?

Any alternatives?


Release Challenges for Mobile Apps

  • Large downloads
  • Download time at user discretion
  • Different versions in production
  • Pull support for old releases?

Server-side releases are silent, quick, and consistent
→ App as container, most content + layout delivered from the server


From Release Date to Continuous Release

  • Traditional View: Boxed Software
    • Working toward fixed release date, QA heavy before release
    • Release and move on
    • Fix post-release defects in next release or through expensive patches


From Release Date to Continuous Release

  • Traditional View: Boxed Software
    • Working toward fixed release date, QA heavy before release
    • Release and move on
    • Fix post-release defects in next release or through expensive patches
  • Frequent releases
    • Incremental updates delivered frequently (weeks, days, …), e.g., browsers
    • Automated updates (“patch culture”; “updater done? ship it”)


Efficiency of release pipeline

https://www.slideshare.net/jmcgarr/continuous-delivery-at-netflix-and-beyond


From Release Date to Continuous Release

  • Traditional View: Boxed Software
    • Working toward fixed release date, QA heavy before release
    • Release and move on
    • Fix post-release defects in next release or through expensive patches
  • Frequent releases
    • Incremental updates delivered frequently (weeks, days, …), e.g., browsers
    • Automated updates (“patch culture”; “updater done? ship it”)
  • Hosted software
    • Frequent incremental releases, hot patches, different versions for different customers, customer may not even notice update


CC BY-SA 4.0, G. Détrez


The Shifting Development-Operations Barrier


Common Release Problems?


Common Release Problems (Examples)

  • Missing dependencies
  • Different compiler versions or library versions
  • Different local utilities (e.g., Unix grep vs. macOS grep)
  • Database problems
  • OS differences
  • Too slow in real settings
  • Difficult to roll back changes
  • Source from many different repositories
  • Obscure hardware? Cloud? Enough memory?


The Dev-Ops Divide

  • Coding
  • Testing, static analysis, reviews
  • Continuous integration
  • Bug tracking
  • Running local tests and scalability experiments
  • Allocating hardware resources
  • Managing OS updates
  • Monitoring performance
  • Monitoring crashes
  • Managing load spikes, …
  • Tuning database performance
  • Running distributed at scale
  • Rolling back releases

QA responsibilities in both roles

QA Does not Stop in Dev


QA Does not Stop in Dev

  • Ensuring product builds correctly (e.g., reproducible builds)
  • Ensuring scalability under real-world loads
  • Supporting environment constraints from real systems (hardware, software, OS)
  • Efficiency with given infrastructure
  • Monitoring (server, database, Dr. Watson, etc.)
  • Bottlenecks, crash-prone components, … (possibly thousands of crash reports per day/minute)


DevOps


Key Ideas and Principles

Better coordinate between developers and operations (collaborative)

Key goal: Reduce friction bringing changes from development into production

Considering the entire tool chain through to production (holistic)

Documentation and versioning of all dependencies and configurations ("configuration as code")

Heavy automation, e.g., continuous delivery, monitoring

Small iterations, incremental and continuous releases

Buzzword!


Common Practices

All configurations in version control

Test and deploy in containers

Automated testing, testing, testing, ...

Monitoring, orchestration, and automated actions in practice

Microservice architectures

Release frequently


Heavy Tooling and Automation


Heavy tooling and automation -- Examples

Infrastructure as code — Ansible, Terraform, Puppet, Chef

CI/CD — Jenkins, TeamCity, GitLab, Shippable, Bamboo, Azure DevOps

Test automation — Selenium, Cucumber, Apache JMeter

Containerization — Docker, Rocket, Unik

Orchestration — Kubernetes, Swarm, Mesos

Software deployment — Elastic Beanstalk, Octopus, Vamp

Measurement — Datadog, DynaTrace, Kibana, NewRelic, ServiceNow


DevOps: Tooling Overview


DevOps Tools

  • Containers and virtual machines (Docker, …)
  • Orchestration and configuration (Ansible, Puppet, Chef, Kubernetes, …)

  • Sophisticated (custom) pipelines


Tooling for Building

  • Let’s talk about Docker


  • Like a virtual machine, but:
  • Lightweight virtualization
  • Sub-second boot time
  • Shareable virtual images with full setup incl. configuration settings
  • Used in development and deployment
  • Separate docker images for separate services (web server, business logic, database, …)


  • Why might DevOps programmers like this?
  • How do you automate infrastructure?


Configuration management, Infrastructure as Code

  • Scripts to change system configurations (configuration files, install packages, versions, …); declarative vs imperative
  • Usually put under version control

$nameservers = ['10.0.2.3']
file { '/etc/resolv.conf':
  ensure  => file,
  owner   => 'root',
  group   => 'root',
  mode    => '0644',
  content => template('resolver/r.conf'),
}

(Puppet)

- hosts: all
  sudo: yes
  tasks:
    - apt: name={{ item }}
      with_items:
        - ldap-auth-client
        - nscd
    - shell: auth-client-config -t nss -p lac_ldap
    - copy: src=ldap/my_mkhomedir dest=/…
    - copy: src=ldap/ldap.conf dest=/etc/ldap.conf
    - shell: pam-auth-update --package
    - shell: /etc/init.d/nscd restart

(ansible)


Tooling for Execution

Containers drastically simplify managing ops


Container Orchestration with Kubernetes

  • Manages which container to deploy to which machine
  • Launches and kills containers depending on load
  • Manages updates and routing
  • Automated restart, replacement, replication, scaling
  • Kubernetes master controls many nodes


CC BY-SA 4.0, Khtan66


Tooling for Execution

We’ll talk about Cloud next week

How about monitoring?


Monitoring

  • Monitor server health
  • Monitor service health
  • Collect and analyze measures or log files
  • Dashboards and triggering automated decisions
    • Many tools, e.g., Grafana as dashboard, Prometheus for metrics, Loki + Elasticsearch for logs
    • Push and pull models
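The collect-and-trigger idea behind automated monitoring decisions can be illustrated with a toy sketch. Function names and the alert threshold are invented for illustration, not from any specific tool:

```python
# Toy monitor: compute an error rate from collected log lines and
# decide whether an automated alert/action should fire.
def error_rate(log_lines):
    errors = sum(1 for line in log_lines if " ERROR " in line)
    return errors / len(log_lines) if log_lines else 0.0

def should_alert(log_lines, threshold=0.05):
    # threshold is a made-up example value (5% error rate)
    return error_rate(log_lines) > threshold
```

Real systems like Prometheus express such rules declaratively over time-series metrics rather than raw logs, but the decision structure is the same.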


Grafana


Testing in Production


Testing in Production


Chaos Experiments


Crash Telemetry


What If

... we had plenty of subjects for experiments

... we could randomly assign subjects to treatment and control group without them knowing

... we could analyze small individual changes and keep everything else constant

▶ Ideal conditions for controlled experiments


Experiment Size

With enough subjects (users), we can run many, many experiments

Even very small experiments become feasible

Toward causal inference


A/B Testing


Implementing A/B Testing

Implement alternative versions of the system

  • Using feature flags (decisions in implementation)
  • Separate deployments (decision in router/load balancer)

Map users to treatment group

  • Randomly from distribution
  • Static user-group mapping
  • Online service (e.g., launchdarkly, split)

Monitor outcomes per group

  • Telemetry, sales, time on site, server load, crash rate
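The three steps above (alternative versions, stable user-to-group mapping, per-group outcome monitoring) can be sketched minimally. All names here are hypothetical, not from any A/B-testing service:

```python
import hashlib
from collections import defaultdict

def assign_group(user_id: str, experiment: str) -> str:
    # Stable assignment: hashing (experiment, user) means the same
    # user always lands in the same group for a given experiment.
    h = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if h % 100 < 50 else "control"  # 50/50 split

# Outcome telemetry collected per group (e.g., time on site, sales)
outcomes = defaultdict(list)

def record_outcome(user_id: str, experiment: str, value: float) -> None:
    outcomes[assign_group(user_id, experiment)].append(value)
```

Hashing per experiment (rather than per user alone) keeps group assignments independent across concurrent experiments.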


Feature Flags

Boolean options

Good practices: tracked explicitly, documented, keep them localized and independent

External mapping of flags to customers

  • who should see what configuration
  • e.g., 1% of users sees one_click_checkout, but always the same users; or 50% of beta-users and 90% of developers and 0.1% of all users

if (features.enabled(userId, "one_click_checkout")) {
  // new one-click checkout functionality
} else {
  // old checkout functionality
}

def isEnabled(user): Boolean = (hash(user.id) % 100) < 10
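The hash-based rollout sketched on this slide can be made concrete; this is an illustrative Python version, with the 10% cutoff mirroring the slide and the hash choice (CRC32) picked only for stability across runs:

```python
import zlib

def is_enabled(user_id: str, flag: str, percent: int = 10) -> bool:
    # Stable bucket in [0, 100): the same user always gets the same
    # answer for the same flag, as the external flag mapping requires.
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < percent

if is_enabled("user-42", "one_click_checkout"):
    pass  # new one-click checkout functionality
else:
    pass  # old checkout functionality
```

Python's built-in `hash()` is deliberately avoided here: it is randomized per process, so it would reshuffle users on every restart.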


Comparing Outcomes

Group A: base game
2158 users, average 18:13 min time on site

Group B: game with extra god cards
10 users, average 20:24 min time on site
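With only 10 users in Group B, the difference above may be pure noise; a significance check is needed before drawing conclusions. A hand-rolled Welch's t statistic (sketch only; the sample lists are invented stand-ins, not the slide's real data) illustrates the shape of such a check:

```python
import statistics as st

def welch_t(a, b):
    # Welch's t statistic for two samples with unequal variances:
    # (mean(a) - mean(b)) / sqrt(var(a)/n_a + var(b)/n_b)
    va, vb = st.variance(a), st.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (st.mean(a) - st.mean(b)) / se

# Hypothetical minutes-on-site samples, NOT the slide's actual data:
group_a = [17, 18, 19, 18, 18, 19, 17, 18]
group_b = [20, 21, 20]
t = welch_t(group_b, group_a)
```

In practice the t statistic is compared against a t distribution (e.g., via scipy) to get a p-value; tiny treatment groups yield wide confidence intervals.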


Canary Releases


Canary Releases

  • Testing releases in production
  • Incrementally deploy a new release to users, not all at once
  • Monitor difference in outcomes (e.g., crash rates, performance, user engagement)
  • Automatically roll back bad releases
  • Technically similar to A/B testing
  • Telemetry essential
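The monitor-and-roll-back loop described above might be sketched like this; the threshold and function names are invented for illustration, not from any real deployment tool:

```python
def canary_decision(baseline_crash_rate: float,
                    canary_crash_rate: float,
                    max_relative_increase: float = 0.10) -> str:
    # Roll back if the canary's crash rate exceeds the baseline
    # by more than the allowed relative increase (here: +10%).
    limit = baseline_crash_rate * (1 + max_relative_increase)
    return "rollback" if canary_crash_rate > limit else "promote"
```

A real pipeline would compare several metrics (crashes, latency, engagement) over a monitoring window and widen the rollout in stages on each "promote".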


Canary Releases


Canary Releases at Facebook

Phase 0: Automated unit tests

Phase 1: Release to Facebook employees

Phase 2: Release to subset of production machines

Phase 3: Release to full cluster

Phase 4: Commit to master, rollout everywhere

Monitored metrics: server load, crashes, click-through rate

Further readings: Tang, Chunqiang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic configuration management at Facebook. In Proceedings of the 25th Symposium on Operating Systems Principles, pp. 328-343. ACM, 2015. And Rossi, Chuck, Elisa Shibley, Shi Su, Kent Beck, Tony Savor, and Michael Stumm. Continuous deployment of mobile software at Facebook (showcase). In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 12-23. ACM, 2016.


Real DevOps Pipelines are Complex

  • Incremental rollout, reconfiguring routers
  • Canary testing
  • Automatically rolling back bad changes


Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic Configuration Management at Facebook. Proc. of SOSP: 328--343 (2015).


Summary

Increasing automation of tests and deployments

Containers and configuration management tools help with automation, deployment, and rollbacks

Monitoring becomes important

Many new opportunities for testing in production (feature flags are common)
