From Inception to Production

A Continuous Delivery Story

  • Ian Randall

Pushpay

https://www.nzx.com/companies/PAY/announcements/285705

CONTEXT aka Setting the scene:

Pushpay context is:

Challenge the listener to apply things from this talk to their own context

Tools

People & Practices

Just culture & blameless postmortems

How we continuously deliver code to prod is interesting (tools, practices, etc)

But stress that it would not be possible without the underlying Just Culture.

Our journey begins...

Somebody - somewhere - has an idea

*So we have a discussion…*

Why?

Shared vision over the value to the business

PROTIP for talking to nerds: Don’t tell us what you want us to do

Tell us *why* we need to do something. We’re engineers! If you tell us what the problem is that you’re trying to solve, we’ll come up with a pretty good solution - that’s What We Do!

Examples of reasons:

Who?

Product

QA

Dev

QA *must* be involved in the scoping discussion (and is probably the most important person there)

because...

You can’t test the quality in at the end

...

Building a feature

Dev

QA

“How will I Build this thing?”

“How will I break this thing?”

Happens in parallel.

QA writes out testing notes with scenarios for what users will do.

Exposes dev to idea that users *will* do that. Yes, really!

Loop back and share thoughts *before* code is written (maybe you did a spike)

TDD FTW! We now have a bunch of scenarios we can use to create a suite of unit tests and use these to flesh out the implementation of the feature! \o/

* by the way: QA in our org stands for Quality Assistance - developers are responsible for shipping production quality code onto our servers, not testers. Devs are rubbish at this, so we need Assistance from the Quality specialists.

Building a larger feature

Long-lived feature branches

Feature switches

Before we investigate which option to choose, there is a piece of terminology to define…

The delta

The delta is the difference between what’s currently running in Production, and what’s currently sitting at the HEAD of the master branch.

We keep it small to *Minimise risk*

**Keeping a small delta is critical to every other part of what we do**
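One way to put a number on the delta is to count the commits on master that are newer than the commit running in Production. The sketch below assumes you already know the deployed SHA and have master's history newest-first; the function name and the SHAs are hypothetical, not Pushpay's tooling.

```python
def delta_size(master_history, deployed_sha):
    """Count commits on master that are newer than the deployed commit.

    master_history: commit SHAs on master, newest first (HEAD at index 0).
    deployed_sha:   the commit currently running in Production.
    """
    try:
        # Everything before the deployed commit in a newest-first list
        # has not shipped yet, so its index is the size of the delta.
        return master_history.index(deployed_sha)
    except ValueError:
        raise ValueError(f"{deployed_sha} is not on master") from None


history = ["d4", "c3", "b2", "a1"]  # hypothetical SHAs, HEAD first
print(delta_size(history, "b2"))    # two commits are waiting to ship
```

With a real repository, `git rev-list --count <deployed-sha>..master` gives the same count directly.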

Building a larger feature

Long-lived feature branches

  • Delta gets too big
  • No feedback
  • DRY code

Feature switches

  • Small deltas
  • Regular feedback
  • Technical debt

We choose to incur the technical debt of code duplication, because of all the benefits of feature switches, and the tech debt is *short lived*

Short lived? Yes, seriously… We have a bot that runs a report of feature switches that have been ON in prod for > 30 days, and leaves messages in the appropriate Slack channel for a dev to follow up and clean out the dead switches.
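The report such a bot runs could look roughly like this. It is a minimal sketch, assuming the bot can see the date each switch was turned on in prod; the data shape, switch names and message wording are all hypothetical.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)


def stale_switches(switches, today):
    """Return names of switches that have been ON in prod for more than 30 days.

    switches: mapping of switch name -> date it was turned on in prod,
              or None if it is still off there.
    """
    return sorted(
        name
        for name, turned_on in switches.items()
        if turned_on is not None and today - turned_on > STALE_AFTER
    )


def reminder(name):
    """Build the nag message a bot might post to the owning team's channel."""
    return f"Feature switch '{name}' has been ON in prod for over 30 days. Time to clean it out?"


switches = {  # hypothetical switches
    "new-checkout": date(2016, 1, 4),
    "self-report": date(2016, 3, 1),
    "follow-up-email": None,
}
for name in stale_switches(switches, today=date(2016, 3, 10)):
    print(reminder(name))
```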

Feature switches

Configuration per environment

Feature switches

URL manipulation to toggle switches on/off in QA (DO NOT do this in Production)

Why not in PROD?

Because the state of production environments must be immutable! Flicking switches on and off makes trouble-shooting a waking nightmare.
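To make the rule concrete, here is a minimal sketch of a switch lookup that reads per-environment configuration and honours URL overrides everywhere except prod. The switch names, the `switch_on` / `switch_off` query parameters and the URLs are assumptions for illustration, not Pushpay's actual scheme.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical per-environment configuration.
SWITCHES = {
    "qa": {"new-checkout": True, "self-report": True},
    "prod": {"new-checkout": False, "self-report": True},
}


def is_enabled(switch, environment, url=""):
    """Resolve a switch from config, allowing URL overrides outside prod."""
    enabled = SWITCHES[environment].get(switch, False)
    if environment == "prod":
        # Prod state comes from configuration only, never from the request.
        return enabled
    overrides = parse_qs(urlparse(url).query)
    if switch in overrides.get("switch_on", []):
        return True
    if switch in overrides.get("switch_off", []):
        return False
    return enabled


qa_url = "https://qa.example.test/pay?switch_off=new-checkout"
prod_url = "https://example.test/pay?switch_on=new-checkout"
print(is_enabled("new-checkout", "qa", qa_url))      # override applies in QA
print(is_enabled("new-checkout", "prod", prod_url))  # override ignored in prod
```

Because prod ignores the request entirely, what is running there can always be read straight from configuration when troubleshooting.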

DevOps means *I* (the developer) have to do that support!

Feature switches

  • Deliver daily increments of (non-running) code
  • Light up a slice of feature
  • Measure
  • Re-think road-map to complete feature

Map out a large feature, then…

WOMM

Traditionally, WOMM (“Works On My Machine”) means ‘the developer doesn’t care about your bug’

At Pushpay, we use pair-programming (or pair-testing, if you prefer) to make sure it *really* WOMMs before you land it in QA.

Code review

  • Every line of code gets reviewed
  • Code must be reviewed and WOMMed before merging.
  • “Roll Forwards To Victory”

Define: Roll Forward To Victory (assess the risk of landing an incorrect, or more likely incomplete, feature - and if the risk is low, then land it and subsequently land another PR to fix/complete)

Michael Lopp (@rands), in “Managing Humans”, talks about Incrementalists vs Completionists.

We’re Incrementalists.

Code review

Do

  • Validate approach
  • Performance, Security, Operability
  • Cohesion, Coupling and Connascence
  • Be honest and positive.

Don’t

  • Be rude.
  • Seriously, don’t be rude.
  • Sweat the small stuff, like bracing, spaces

As well as performance, security, operability, we also review for unit test coverage, short (and well-named) classes and methods, etc.

No one has “architect” in their job title - but architecture is a key component to being a developer at Pushpay.

Coupling, Cohesion and Connascence: http://codemania.io/2015/josh_robb.html

Talk about “Dude, that’s gross” → Lazy review. Offer alternatives, and let the engineer know that the Person is not their Code. A great engineer can write a bad bit of code for a myriad reasons, and still be a great engineer.

Cross-pollination

  • Someone else does it all again!
  • Pollinator is not (necessarily) involved with feature

Pollinator

4 Continuouses

The last part of the journey involves 4 continuouses...

Fairly sure that’s a word.

  • Continuous Integration
  • Source control
  • Build & test

CI - Source control

  • PR-based workflow.
  • Review happens in the PR
  • Everything is public.

First Continuous

----------------------

Create the PR from a feature branch. Do it early, so the PR is open for discussion.

Use a label to mark the PR as ‘ready for review’; we have bots that will ping a specific team Slack channel to let the team know.

Other engineers are actively encouraged to stick their noses in other people’s business (x-pollination)
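The label-to-Slack hop can be sketched as a pure function that turns a "PR labelled" event into a channel and a message. The event shape, the repository-to-channel mapping and every name below are hypothetical; a real bot would adapt this to the payload its source-control host sends and to Slack's API.

```python
READY_LABEL = "ready for review"

# Hypothetical mapping from repository to the owning team's channel.
TEAM_CHANNELS = {"payments-web": "#team-payments"}


def review_ping(event):
    """Turn a 'PR labelled' event into a (channel, message) pair, or None."""
    if event["label"] != READY_LABEL:
        return None  # other labels are not our business
    channel = TEAM_CHANNELS.get(event["repo"])
    if channel is None:
        return None  # no team channel registered for this repository
    message = f"PR #{event['number']} \"{event['title']}\" is ready for review: {event['url']}"
    return channel, message


event = {  # hypothetical event
    "label": "ready for review",
    "repo": "payments-web",
    "number": 123,
    "title": "Add self-report form",
    "url": "https://git.example.test/payments-web/pull/123",
}
print(review_ping(event))
```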

CI - Build & test

PR Branch: Build, unit and integration tests

Merge into master: Build, unit, integration, acceptance and visual diff tests

Acceptance: Running business-critical workflows through Selenium-based acceptance tests. These are the workflows that we want to regression-test on every build.

Visual diff: We use Applitools. https://applitools.com/
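The split between the two pipelines can be written down as data. This is only an illustration of the rule on the slide, with hypothetical suite names; it is not TeamCity configuration.

```python
# Suites that run for every PR branch build.
PR_SUITES = ["build", "unit", "integration"]

# Merging into master adds the slower, broader suites on top.
MASTER_SUITES = PR_SUITES + ["acceptance", "visual-diff"]


def suites_for(branch):
    """Return the suites to run for a given branch."""
    return MASTER_SUITES if branch == "master" else PR_SUITES


print(suites_for("feature/self-report"))
print(suites_for("master"))
```

Keeping the PR pipeline a strict subset of the master pipeline means a PR gets fast feedback, while nothing lands in QA without the full regression run.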

(2) Continuous deployment

  • Automatic build, package and deploy to QA
  • Manually promote package to PROD

We use TeamCity for build, and Octopus for package and deploy - you might prefer Jenkins, or TFS or whatever makes you happy.

The acceptance and visual-diff tests (from the previous slide) run on QA

We retain the manual deploy to PROD step, as we still have work to do before we can make dev -> prod a 1-click experience.

(3) Continuous delivery

  • Operability
  • Value

CD - Operability

  • Exception logging
  • App logging (log4net)
  • App metrics (StatsD)
  • Incident alerting

Landing a feature in production is only the beginning of the journey

“How is your feature performing?”

There are tools available to help you answer these questions:

Exception logging: Raygun* / Airbrake, Crashlytics, etc.

Logging: Sumo Logic* / Splunk, Logstash

Metrics: Librato* / Datadog, New Relic

Alerting: PagerDuty*

* Pushpay uses this one.
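As an example of how cheap app metrics are to emit, here is a minimal sketch of the StatsD plain-text format (`name:value|type`) sent over UDP. The metric names are hypothetical, and a real app would normally use an existing StatsD client library rather than hand-rolling this.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric in the StatsD plain-text format: name:value|type."""
    return f"{name}:{value}|{kind}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP (8125 is StatsD's default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metrics for a newly landed feature.
print(statsd_line("self_report.submitted", 1, "c"))       # counter
print(statsd_line("self_report.render_time", 42, "ms"))   # timer
```

Passing either line to `send_metric` ships it to a local StatsD daemon. UDP is fire-and-forget, so instrumenting a feature does not slow it down or break it if the metrics server is unavailable.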

CD - Value

  • Delivering incremental bits of value to the business
  • Measuring the effectiveness
  • Constantly iterating on the product

Refer back to Incrementalists vs Completionists - We are incrementalists!

We implement a tiny slice of a feature and measure uptake / usage.

“Don’t boil the ocean” - don’t need to do ALL THE THINGS in one go.

E.g. Shipped the front end to a user self-reporting feature *before* we’d finished the dev work on the ‘follow-up email‘ piece - because on its own, the front-end feature added value to the users, and there was no reason for us to wait and launch the whole feature in one go.

(4) Continuous Improvement

Actively seeking out opportunities to improve

  • Code (fix broken windows)
  • Process (automate where you can)
  • All The Things.

We call it “Fix the broken windows”

Because the model I’ve talked about today isn’t a model, it’s a snapshot at a point in time, and continuous evolution is key!

Continuous improvement (and rapid scaling) means we will 100% guaranteed be doing things differently in 6 months.

ChatOps: we :sparkling_heart: Slack

  • Shipbot, Beebot, Salesbot
  • Many, many, more…

and...

  • @c3pr

Shipbot is everywhere, e.g. we talked about how a bot pings a channel when a PR is ready for review - Shipbot does that.

Beebot coordinates the x-pollination - Beebot doesn’t care about your status. If you’re in the pollination group, then the most junior dev can pollinate the PR of the most senior principal engineers.
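The "doesn’t care about your status" rule is easy to express in code: seniority simply is not an input. The sketch below assumes a uniformly random pick from the pollination group, excluding the PR's author; the talk does not say how Beebot actually chooses, and the names are hypothetical.

```python
import random


def pick_pollinator(group, author, rng=random):
    """Pick any member of the pollination group except the PR's author.

    Job title and seniority are deliberately not parameters.
    """
    candidates = [member for member in group if member != author]
    if not candidates:
        raise ValueError("no one available to pollinate")
    return rng.choice(candidates)


group = ["junior-dev", "senior-dev", "principal-engineer"]  # hypothetical
print(pick_pollinator(group, author="principal-engineer"))
```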

Salesbot pings the #sales channel to celebrate new sales

@c3pr bot coordinates the “train”

@c3pr in action

Talk through what is happening here: Stress that it is a regular channel, so people can interact in it, as well as the bots doing their thing…

Just culture

Sidney Dekker: A professor at Griffith University in Australia

http://sidneydekker.com/just-culture/

Just: Morally Right and Fair

Just culture

Retributive - clarity around acceptable vs unacceptable behaviour

Restorative - “safe-to-fail”

Dekker’s two definitions for a Just (fair) culture. The first one may involve retribution for unacceptable behaviour, but it is *fair*.

Fear of breaking things will paralyze your organization.

If, as an organization, you are *afraid* to break something, then you are not going to push changes to your production servers.

If I fear I may suffer a negative consequence (miss a promotion -- lose my job)

→ then of course I’m not going to shine a light on the things I did wrong. In fact, I’m going to find inventive and interesting ways to *not* actually do anything. (Not adding features to production? You’re not adding value)

Toyota’s five whys

The 5 Whys is a *fabulous* and very successful tool in a myriad of companies, for their context,

BUT

By building a culture of blame, you are encouraging your people to hide their mistakes, and setting yourself up for some MAJOR FAILS.

Blameless Postmortems

So let’s talk about documenting the mistakes…

Shamelessly stolen from (and fully attributed to) Etsy: https://codeascraft.com/2012/05/22/blameless-postmortems/

When?

  • Opportunity to learn
  • Something that impacted production
  • Near-miss

Stress: Doesn’t have to be a production outage, e.g. bringing down QA will most times result in a Blameless PM

How?

  • Asynchronously in a wiki
  • Coordinated in Slack channel #morgue
  • Co-ordinated by person(s) closest to the incident

If you run a PM in a room, the loudest person will talk the loudest.

But they (9 times out of 10) won’t have the most insight into the incident. The quiet person was the one making decisions, and knows the most about it.

What?

  • Scenario and impact
  • Timeline
  • Discussion
  • Mitigations

And to finish: an inspirational quote...

This stuff be hard, yo.

  • Ian Randall, 2016

@kiwipom

For us: Senior engineering leads worked incredibly hard on building a Just culture all the way up to C-level. THIS IS HARD.

THANK YOU FOR LISTENING