From Inception to Production
A Continuous Delivery Story
CONTEXT aka Setting the scene:
Pushpay context is:
Challenge the listener to apply things from this talk to their own context
People & Practices
Just culture & blameless postmortems
How we continuously deliver code to prod is interesting (tools, practices, etc)
But stress that it would not be possible without the underlying Just Culture.
Our journey begins...
Somebody - somewhere - has an idea
* So we have a discussion…*
Shared vision over the value to the business
PROTIP for talking to nerds: Don’t tell us what you want us to do
Tell us *why* we need to do something. We’re engineers! If you tell us what the problem is that you’re trying to solve, we’ll come up with a pretty good solution - that’s What We Do!
Examples of reasons:
QA *must* be involved (and is probably the most important person) in the scoping discussion
You can’t test the quality in at the end
Building a feature
“How will I Build this thing?”
“How will I break this thing?”
Happens in parallel.
QA writes out testing notes with scenarios for what users will do.
Exposes dev to idea that users *will* do that. Yes, really!
Loop back and share thoughts *before* code is written (maybe you did a spike)
TDD FTW! We now have a bunch of scenarios we can use to create a suite of unit tests and use these to flesh out the implementation of the feature! \o/
* by the way: QA in our org stands for Quality Assistance - developers are responsible for shipping production quality code onto our servers, not testers. Devs are rubbish at this, so we need Assistance from the Quality specialists.
Building a larger feature
Long-lived feature branches
Before we investigate which option - there is a piece of terminology to define…
The delta is the difference between what’s currently running in Production, and what’s currently sitting at the HEAD of the master branch.
We keep it small to *Minimise risk*
** Keeping a small delta is critical to every other part of what we do **
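To make the delta concrete, here is a minimal sketch. It assumes master's history is available as a simple list of commit ids (oldest first) and that we know which commit is deployed to Production; the function name and the commit ids are made up for illustration.

```python
from typing import List


def delta(master_commits: List[str], deployed_sha: str) -> List[str]:
    """Return the commits on master that are not yet running in Production.

    master_commits is ordered oldest to newest, so the last entry is HEAD.
    """
    if deployed_sha not in master_commits:
        raise ValueError(f"{deployed_sha} is not on master")
    deployed_index = master_commits.index(deployed_sha)
    # Everything after the deployed commit, up to and including HEAD.
    return master_commits[deployed_index + 1:]


master = ["a1", "b2", "c3", "d4"]  # hypothetical commit ids, HEAD is "d4"
undeployed = delta(master, deployed_sha="b2")
print(len(undeployed), undeployed)  # prints: 2 ['c3', 'd4']
```

With a real repository, `git rev-list --count <deployed-sha>..master` should give the same count. The smaller that number, the less there is to reason about when something goes wrong.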
Building a larger feature
Long-lived feature branches
We choose to incur the technical debt of code duplication, because of all the benefits of feature switches, and the tech debt is *short lived*
Short lived? Yes, seriously… We have a bot that runs a report of feature switches that have been ON in prod for > 30 days, and leaves messages in the appropriate Slack channel for a dev to follow up and clean out the dead switches.
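A rough sketch of what such a report could look like. This is not the actual bot: the switch names, the dates, and the message wording are all assumptions, and the real thing would read switch state from wherever it is stored and post to Slack rather than print.

```python
from datetime import date, timedelta
from typing import Dict, List

MAX_AGE = timedelta(days=30)


def stale_switches(switched_on: Dict[str, date], today: date) -> List[str]:
    """Names of switches that have been ON in prod for more than 30 days."""
    return sorted(
        name for name, on_since in switched_on.items()
        if today - on_since > MAX_AGE
    )


def reminder(name: str) -> str:
    """Text of the nudge a bot might leave in the team channel."""
    return f"Feature switch '{name}' has been ON in prod for over 30 days. Time to clean it out?"


# Hypothetical data: switch name -> date it was turned ON in prod.
switched_on = {
    "new-checkout": date(2024, 1, 1),
    "dark-mode": date(2024, 2, 20),
}

for name in stale_switches(switched_on, today=date(2024, 3, 1)):
    print(reminder(name))
```

Only `new-checkout` gets a reminder here, because `dark-mode` has been ON for ten days.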
Configuration per environment
URL manipulation to toggle switches on/off in QA (DO NOT do this in Production)
Why not in PROD?
Because the state of production environments must be immutable! Flicking switches on and off makes trouble-shooting a waking nightmare.
DevOps means *I* (the developer) have to do that support!
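One way to sketch the rule "URL toggles in QA, never in Production". The switch name, the environment names, and the `switch.<name>=on|off` query parameter are assumptions for illustration, not the real Pushpay scheme.

```python
from typing import Dict, Mapping

# Hypothetical per-environment configuration.
SWITCHES: Dict[str, Dict[str, bool]] = {
    "qa": {"new-receipts": True},
    "prod": {"new-receipts": False},
}


def is_enabled(name: str, environment: str, query: Mapping[str, str]) -> bool:
    """Resolve a feature switch for one request."""
    configured = SWITCHES.get(environment, {}).get(name, False)
    if environment == "prod":
        # Production state comes from configuration only; the URL is ignored.
        return configured
    override = query.get(f"switch.{name}")
    if override in ("on", "off"):
        return override == "on"
    return configured


# QA: the tester flips the switch off from the URL.
print(is_enabled("new-receipts", "qa", {"switch.new-receipts": "off"}))   # False
# Production: the same trick has no effect.
print(is_enabled("new-receipts", "prod", {"switch.new-receipts": "on"}))  # False
```

Putting the production check first means nobody can reintroduce URL overrides in prod by accident when adding a new override mechanism later.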
Map out a large feature, then…
Traditionally this means ‘the developer doesn’t care about your bug’
At Pushpay, we use pair-programming (or pair-testing, if you prefer) to make sure it *really* WOMMs before you land it in QA.
Define: Roll Forward To Victory (assess the risk of landing an incorrect, or more likely incomplete, feature - and if the risk is low, then land it and subsequently land another PR to fix/complete)
Michael Lopp (@rands) “Managing Humans” talks about Incrementalist vs Completionist?
Performance, Security, Operability
Cohesion, Coupling and Connascence
Be honest and positive.
Seriously, don’t be rude.
Sweat the small stuff, like bracing, spaces
As well as performance, security, operability, we also review for unit test coverage, short (and well-named) classes and methods, etc.
No one has “architect” in their job title - but architecture is a key component to being a developer at Pushpay.
Coupling, Cohesion and Connascence: http://codemania.io/2015/josh_robb.html
Talk about “Dude, that’s gross” → Lazy review. Offer alternatives, and let the engineer know that the Person is not their Code. A great engineer can write a bad bit of code for a myriad reasons, and still be a great engineer.
The last part of the journey involves 4 continuouses...
Fairly sure that’s a word.
CI - Source control
Create the PR from a feature branch. Do it early, so the PR is open for discussion.
Use a label to mark the PR as ‘ready for review’ and we have bots that will ping a specific team slack channel to let the team know.
Other engineers are actively encouraged to stick their noses in other people’s business (x-pollination)
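A sketch of the decision the bot makes when a PR is labelled. The label text, the team-to-channel mapping, and the example URL are invented for illustration; a real bot would receive the PR from a source-control webhook and post the result through the Slack API.

```python
from typing import Dict, List, Optional

READY_LABEL = "ready for review"

# Hypothetical mapping from team name to that team's Slack channel.
TEAM_CHANNELS: Dict[str, str] = {"payments": "#team-payments"}


def review_ping(title: str, url: str, team: str, labels: List[str]) -> Optional[Dict[str, str]]:
    """Build the message a bot would post, or None if there is nothing to say."""
    if READY_LABEL not in labels:
        return None  # PR is open for discussion but not ready for review yet
    channel = TEAM_CHANNELS.get(team)
    if channel is None:
        return None  # unknown team, so no channel to ping
    return {"channel": channel, "text": f"Ready for review: {title} ({url})"}


message = review_ping(
    title="Add receipts export",
    url="https://example.com/pr/1",
    team="payments",
    labels=["ready for review"],
)
print(message)
```

An early PR without the label returns `None`, so opening a PR for discussion never spams the channel.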
CI - Build & test
PR Branch: Build, unit and integration tests
Merge into master: Build, unit, integration, acceptance and visual diff tests
Acceptance: Running business-critical workflows through selenium-based acceptance tests. Workflows that we want to regression-test on every build.
Visual diff: We use Applitools. https://applitools.com/
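The split between PR builds and master builds can be written down as data. This is only a sketch: the stage names are taken from the notes above, but how they map onto real build-server configuration is not shown here.

```python
from typing import List

PR_STAGES = ["build", "unit", "integration"]
# Master runs everything a PR branch runs, plus the slower suites.
MASTER_STAGES = PR_STAGES + ["acceptance", "visual-diff"]


def stages_for(branch: str) -> List[str]:
    """Stages to run for a branch: the full set on master, the fast set elsewhere."""
    return list(MASTER_STAGES if branch == "master" else PR_STAGES)


print(stages_for("feature/receipts"))  # ['build', 'unit', 'integration']
print(stages_for("master"))
# ['build', 'unit', 'integration', 'acceptance', 'visual-diff']
```

Keeping the slow acceptance and visual-diff suites off PR branches keeps feedback on a PR fast, while every merge to master still gets the full regression run.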
(2) Continuous deployment
We use TeamCity for build, and Octopus for package and deploy - you might prefer Jenkins, or TFS or whatever makes you happy.
The acceptance and visual-diff tests (from the previous slide) run on QA
We retain the manual deploy to PROD step, as we still have work to do before we can make dev -> prod a 1-click experience.
(3) Continuous delivery
CD - Operability
Landing a feature in production is only the beginning of the journey
“How is your feature performing”
There are tools available to help you answer these questions:
Exception logging: Raygun* / Airbrake, Crashlytics, etc.
Logging: Sumo Logic* / Splunk, Logstash
Metrics: Librato* / Datadog, New Relic
* Pushpay uses this one.
CD - Value
Refer back to Incrementalists vs Completionists - We are incrementalists!
We implement a tiny slice of a feature and measure uptake / usage.
“Don’t boil the ocean” - don’t need to do ALL THE THINGS in one go.
E.g. Shipped the front end to a user self-reporting feature *before* we’d finished the dev work on the ‘follow-up email‘ piece - because on its own, the front-end feature added value to the users, and there was no reason for us to wait and launch the whole feature in one go.
(4) Continuous Improvement
Actively seeking out opportunities to improve
We call it “Fix the broken windows”
Because the model I’ve talked about today isn’t a model, it’s a snapshot at a point in time, and continuous evolution is key!
Continuous improvement (and rapid scaling) means we will 100% guaranteed be doing things differently in 6 months.
ChatOps: We :sparkling_heart: Slack
Shipbot - everywhere, e.g. We talked about how a bot pings a channel when a PR is ready for review, etc. - shipbot does that.
Beebot coordinates the x-pollination - Beebot doesn’t care about your status. If you’re in the pollination group, then the most junior dev can pollinate the PR of the most senior principal engineers.
Salesbot pings the #sales channel to celebrate new sales
@c3pr bot coordinates the “train”
@c3pr in action
Talk through what is happening here: Stress that it is a regular channel, so people can interact in it, as well as the bots doing their thing…
Sidney Dekker: A professor at Griffith University in Australia
Just: Morally Right and Fair
Retributive - clarity around acceptable vs unacceptable behaviour
Restorative - “safe-to-fail”
Dekker’s two definitions for a Just (fair) culture. The first one may involve retribution for unacceptable behaviour, but it is *fair*.
Fear of breaking things will paralyze your organization.
If, as an organization, you are *afraid* to break something, then you are not going to push changes to your production servers.
If I fear I may suffer a negative consequence (miss a promotion -- lose my job)
→ then of course I’m not going to shine a light on the things I did wrong. In fact, I’m going to find inventive and interesting ways to *not* actually do anything. (Not adding features to production? You’re not adding value)
Toyota’s five whys
The five whys is a *fabulous* and very successful tool at a myriad of companies, in their context,
By building a culture of blame, you are encouraging your people to hide their mistakes, and setting yourself up for some MAJOR FAILS.
So let’s talk about documenting the mistakes…
Shamelessly stolen from (and fully attributed to) Etsy: https://codeascraft.com/2012/05/22/blameless-postmortems/
Stress: Doesn’t have to be a production outage, e.g. bringing down QA will most times result in a Blameless PM
If you run a PM in a room, the loudest person will talk the loudest.
But they (9 times out of 10) won’t have the most insight into the incident. The quiet person was the one making decisions, and knows the most about it.
And to finish: an inspirational quote...
This stuff be hard, yo.
For us: Senior engineering leads worked incredibly hard on building a Just culture all the way up to C-level. THIS IS HARD.
THANK YOU FOR LISTENING