1 of 14

Engineering for Data Scientists

A fairy tale by Rich Winslow

Stuff you probably want to know for that shiny new job

2 of 14

Your goal at work every single day is to write reusable, maintainable code

3 of 14

Features of bad code / Spaghetti is for eating

Poor or no documentation

Does not use functions or classes (if the language supports them)

Abstract, unclear variable and function names

Inconsistent coding style

Switching between snake_case and CamelCase

Wrong whitespace character for the language (e.g. tabs where the convention is spaces)
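To make the naming and style points above concrete, here is a small before-and-after sketch. Both function names and the example logic are made up for illustration; they are not from any real codebase.

```python
# Hard to maintain: vague names, mixed naming styles, no documentation.
def Calc(x, Y):
    tmpVal = [i for i in x if i > Y]
    return sum(tmpVal) / len(tmpVal)


# Easier to maintain: descriptive snake_case names and a docstring.
def mean_above_threshold(values, threshold):
    """Return the mean of the values strictly greater than threshold."""
    values_above = [value for value in values if value > threshold]
    return sum(values_above) / len(values_above)
```

Both functions do the same thing. Only the second one tells the next reader what that thing is.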

4 of 14

Style matters / Hey everyone, come see how good I look!

Python has PEP 8, and you should follow it unless your team specifies something different.

Go has gofmt, which you should run every time you save a file.

Scala has a style guide in its official documentation.

R has several style guides, so pick your favorite and stick with it.

Et cetera, et cetera...

5 of 14

Documentation / Me write good words

You will not be the only person maintaining a codebase. Others need to be able to understand what your code does without spending a full work day trying to visualize your crazy data structure or what three nested loops will do.

You’re also guaranteed to forget project idiosyncrasies when you move on to your next assignment unless you document them.

When someone else has to add to your code, chances are they won’t know WTF you intended unless you document it.

6 of 14

Documentation / Me write good words

Write comments for each logical code block.

In Python, use docstrings to describe what each module, class, and function does.

Assume no one else can understand your code in any reasonable amount of time.

Create a Google or Word doc as needed, and add images if they will make things clearer.
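As a rough sketch of the docstring and block-comment advice above, here is one way a small function might be documented. The function `normalize` and its behavior for identical scores are assumptions made up for this example.

```python
def normalize(scores):
    """Scale a list of numeric scores to the range [0, 1].

    Args:
        scores: A non-empty list of numbers.

    Returns:
        A list of floats in the same order as the input. If every score
        is identical, all outputs are 0.0.
    """
    # Find the range once so it is not recomputed for every element.
    lowest = min(scores)
    highest = max(scores)
    spread = highest - lowest

    # Guard against division by zero when all scores are the same.
    if spread == 0:
        return [0.0 for _ in scores]

    # Shift and scale each score into [0, 1].
    return [(score - lowest) / spread for score in scores]
```

The docstring says what the function does and promises. The comments say why each block exists.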

7 of 14

Productionization / Don’t bring the house down

If you’re at a small to medium-sized company, chances are good you’ll be writing production code. If you write code with a bad bug you haven’t caught, it could take down a model, or worse, the whole company.

Testing is the single most important thing you can do before releasing code to production. Unit testing checks the smallest parts of your code (e.g. a function). Integration testing makes sure those parts work together as expected (e.g. a scheduled job that runs end to end).
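Here is a minimal sketch of the difference between the two kinds of test, using Python's built-in unittest module. The functions `clean_ages`, `average`, and `run_daily_job` are hypothetical stand-ins for real project code.

```python
import unittest


def clean_ages(ages):
    """Drop missing or impossible ages."""
    return [age for age in ages if age is not None and 0 <= age <= 120]


def average(values):
    """Return the arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def run_daily_job(raw_ages):
    """Hypothetical scheduled job: clean the input, then summarize it."""
    return {"average_age": average(clean_ages(raw_ages))}


class UnitTests(unittest.TestCase):
    # Each unit test exercises one function in isolation.
    def test_clean_ages_drops_bad_values(self):
        self.assertEqual(clean_ages([25, None, -3, 40, 300]), [25, 40])

    def test_average(self):
        self.assertEqual(average([10, 20, 30]), 20)


class IntegrationTests(unittest.TestCase):
    # The integration test runs the whole job end to end.
    def test_daily_job(self):
        self.assertEqual(run_daily_job([25, None, 35]), {"average_age": 30})
```

Saved in a file, tests like these can be run with `python -m unittest` followed by the module name.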

8 of 14

Productionization / Don’t bring the house down

Always write tests to make sure your code works the way you want. Run them every time you build and before you deploy to production. Make sure they check for both common and uncommon errors.
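As one possible illustration of checking both common and uncommon cases, the sketch below tests a made-up function, `conversion_rate`. Treating zero visits as a 0% rate is an assumption chosen for the example; your team might prefer to raise an error instead.

```python
import unittest


def conversion_rate(conversions, visits):
    """Return conversions / visits, rejecting inputs that make no sense."""
    if visits < 0 or conversions < 0:
        raise ValueError("counts cannot be negative")
    if conversions > visits:
        raise ValueError("conversions cannot exceed visits")
    if visits == 0:
        return 0.0  # Assumption: no traffic counts as a 0% rate.
    return conversions / visits


class ConversionRateTests(unittest.TestCase):
    # Common case: ordinary, well-formed input.
    def test_typical_input(self):
        self.assertEqual(conversion_rate(5, 20), 0.25)

    # Uncommon cases: the inputs that tend to show up at 3 a.m.
    def test_zero_visits(self):
        self.assertEqual(conversion_rate(0, 0), 0.0)

    def test_negative_counts_rejected(self):
        with self.assertRaises(ValueError):
            conversion_rate(-1, 10)

    def test_more_conversions_than_visits_rejected(self):
        with self.assertRaises(ValueError):
            conversion_rate(11, 10)
```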

If your team doesn’t have a testing environment, you need to build one. Period. And everyone on the team needs to use it.

9 of 14

Version control / You can never go home again...

Version control exists so that teams can build complex systems while ensuring mistakes can be rolled back. Learn it and git gud (pun intended).

Strive to make smaller rather than larger commits.

Use whichever VCS (version control system) is used by your org.

Don’t ever commit directly to the master branch. Ever.

10 of 14

Compartmentalize / No one builds monoliths anymore

Don’t ever repeat yourself. If you find yourself writing a similar code block more than once, it belongs in its own function.
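A small sketch of what that looks like in practice. The data and the `clean_text` helper are invented for the example.

```python
# Before: the same cleanup logic is written out twice.
raw_names = ["  Alice ", "BOB"]
raw_cities = [" Tokyo", "osaka  "]
names_repeated = [text.strip().lower() for text in raw_names]
cities_repeated = [text.strip().lower() for text in raw_cities]


# After: the shared logic lives in one function.
def clean_text(values):
    """Strip surrounding whitespace and lowercase each string."""
    return [text.strip().lower() for text in values]


names = clean_text(raw_names)
cities = clean_text(raw_cities)
```

If the cleanup rules change later, there is now exactly one place to change them.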

Functions and classes are the de facto ways to compartmentalize logic. Use them often. Try to make functions short and simple. Classes can be more complex and composed of many functions (called methods).

Someone else in your org should be able to use each part of your code without running the whole application.

11 of 14

Compartmentalize / No one builds monoliths anymore

Common examples for functions: Equations, filters, validation

Common examples for classes: Data models, systems

If an application needs to “remember” things or be given some “settings” during a run, it will probably need to use classes.
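A rough sketch of that split: a validation check as a plain function, and a class for something that takes a setting and remembers values between calls. Both `is_valid_email` (a deliberately crude check) and `RunningMean` are hypothetical examples.

```python
def is_valid_email(address):
    """Validation as a plain function: no state to remember.

    This is a deliberately rough check, not a full email validator.
    """
    local, _, domain = address.partition("@")
    return bool(local) and "." in domain


class RunningMean:
    """A class fits here: it takes a setting and remembers past values."""

    def __init__(self, window):
        self.window = window  # "setting" supplied when the object is created
        self.values = []      # state "remembered" between calls

    def add(self, value):
        """Record a value and return the mean of the most recent window."""
        self.values.append(value)
        recent = self.values[-self.window:]
        return sum(recent) / len(recent)
```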

12 of 14

Refactor / We can rebuild him, make him stronger...

The first time you write an application, you’re going to learn a lot about the project specs, what’s expected, and the right way to solve the problem. As you add features, your code will probably become less maintainable. There will come a time when you have to refactor, or rewrite, large sections of code (or even all of it). If you do not, you will accrue tech debt. That’s bad and will slow everyone down.

Refactor often as you learn more about a project.

13 of 14

Design patterns / We passed this rock once before…

Software engineering has evolved to follow certain coding patterns to solve particular types of problems. Knowing the details of each pattern matters less than knowing that they exist and how to apply one when you need it. Wikipedia does a great job of indexing and discussing them. Links for the lazy:

https://en.wikipedia.org/wiki/Software_design_pattern

https://en.wikipedia.org/wiki/Architectural_pattern
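For a taste of what using one looks like, here is a minimal sketch of the strategy pattern, one of the patterns on the first list. The `Cleaner` class and the two fill functions are invented for this example.

```python
def fill_with_zero(values):
    """Strategy 1: replace missing values with 0."""
    return [0 if value is None else value for value in values]


def fill_with_mean(values):
    """Strategy 2: replace missing values with the mean of known values."""
    known = [value for value in values if value is not None]
    mean = sum(known) / len(known)
    return [mean if value is None else value for value in values]


class Cleaner:
    """Context object: it delegates to whichever strategy it was given."""

    def __init__(self, fill_strategy):
        self.fill_strategy = fill_strategy

    def clean(self, values):
        return self.fill_strategy(values)
```

Swapping the behavior means passing a different function, e.g. `Cleaner(fill_with_mean)` instead of `Cleaner(fill_with_zero)`, without touching the `Cleaner` class itself.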

14 of 14

Architecture case study / Whiteboard like a BAMF

You are a data scientist embedded in the marketing team at your company. The team is focused on Asia, the first market your company is breaking into. You've been tasked with creating a system that will perform the following daily activities:

  • Measure active users
  • Measure market penetration
  • Predict daily active users and market penetration two weeks out, using historical data

This will be a major addition to the codebase. Calculations and statistics here will be used by other parts of the company.

How will you architect this solution?
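One possible starting point, offered as a sketch rather than the answer: keep each calculation in its own small function so other teams can reuse it, and let a thin daily job tie them together. Every name here is hypothetical, the event format and a known `market_size` are assumptions, and the "forecast" is a placeholder moving average, not a real two-week model.

```python
from dataclasses import dataclass


def count_active_users(events):
    """Count distinct user IDs seen in one day's events."""
    return len({event["user_id"] for event in events})


def market_penetration(active_users, market_size):
    """Fraction of the addressable market that is active."""
    return active_users / market_size


def forecast_mean(history, window=7):
    """Placeholder forecast: mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


@dataclass
class DailyReport:
    active_users: int
    penetration: float
    forecast_active_users: float
    forecast_penetration: float


def build_daily_report(events, market_size, active_user_history):
    """Tie the reusable pieces together for the scheduled daily run."""
    active = count_active_users(events)
    history = active_user_history + [active]
    predicted_active = forecast_mean(history)
    return DailyReport(
        active_users=active,
        penetration=market_penetration(active, market_size),
        forecast_active_users=predicted_active,
        forecast_penetration=market_penetration(predicted_active, market_size),
    )
```

Because the metric functions do not depend on the job, they can be unit tested on their own and imported elsewhere in the company, and the placeholder forecast can be replaced without changing anything else.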