
Byte-Sized

Computer Science for Data People

Part 1: Clean Code

Copyright © 2020 Nick Lind. All rights reserved.


Topics We’ll Cover

1. Clean Code — writing performant code that others will be excited to reuse
2. System Design — building systems and products that scale
3. Collaboration — working productively with other people



Principles of Clean Code

Style: writing developer-friendly code that encourages others to build on top of and reuse your work

Speed: the ability to handle larger volumes of data without slowing down

Space: the ability to handle larger volumes of data without breaking



Deep-Dive: Style

Big Ideas

  • Don’t repeat yourself: write short*, simple functions
  • Tell a story using descriptive names, not comments
  • Pass objects, collections, and functions as parameters

Related CS Concepts

  • Evils of Duplication
  • Software Entropy (‘broken windows’)
  • Functional Programming

Benefits

  • Errors are easier to spot and only have to be fixed in one place
  • Reusing patterns means less code to write and read
  • New team members are easier to onboard and get up to speed

From: hastily writing code that’s easier for a future developer to abandon than to fix
To: writing clean code once and having future users thank you forever

*Clean Code recommends keeping functions to <4 lines, <3 parameters
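These three big ideas can be sketched in a few lines of pandas. This is a hypothetical illustration, not an example from the deck: the `clean_columns` helper and the sample columns are invented for the sketch. It replaces copy-pasted cleaning logic with one short, descriptively named function that takes the transform itself as a parameter.

```python
import pandas as pd

# Before: the same cleaning logic copy-pasted for every column
#   df["price"] = df["price"].str.strip().str.lower()
#   df["region"] = df["region"].str.strip().str.lower()

# After: one short, descriptively named function; the transform is
# passed in as a parameter, so the pattern stays reusable
def clean_columns(df, columns, transform):
    for col in columns:
        df[col] = transform(df[col])
    return df

df = pd.DataFrame({"price": [" 10 ", " 20"], "region": [" East", "West "]})
df = clean_columns(df, ["price", "region"], lambda s: s.str.strip().str.lower())
```

A future reader sees one named pattern instead of n copies of the same line, and a bug in the cleaning logic only has to be fixed in one place.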



Deep-Dive: Speed

Big Ideas

  • Pick the right data structure (don’t distribute the small stuff)
  • Use arrays, not for-loops
  • Double-check if a built-in exists before writing your own function

Related CS Concepts

  • Serialization
  • Vectorization & Linear Algebra
  • Big O Notation

Thinking Exercise

Your modeling pipeline was very fast when running on a small sample of data, but started to hang when you tried the same code on a larger dataframe. What should you do first?

From: running nested for-loops on a Spark DataFrame
To: using a built-in function on a filtered pandas DataFrame instead
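The “use arrays, not for-loops” and “check for a built-in first” ideas can be sketched with NumPy. This is a minimal hypothetical example (the prices/quantities data is invented), contrasting Python-level iteration with a vectorized call:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 3, 4])

# For-loop version: Python-level iteration, slow on large arrays
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized version: one array multiply, delegated to optimized C code
totals_vec = prices * quantities

# A built-in often replaces a hand-rolled function entirely: the grand
# total is a dot product, not a loop with a running accumulator
grand_total = np.dot(prices, quantities)
```

Both versions compute the same totals, but the vectorized one does the per-element work inside NumPy instead of the Python interpreter, which is what keeps a pipeline fast as the dataframe grows.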



Deep-Dive: Space

Big Ideas

  • Filter first, especially before joining
  • Drop columns, downcast, and use in-place operations and sparse data structures to compress your data
  • Index, chunk, and distribute the big stuff

Related CS Concepts

  • Lossless / Lossy Compression
  • Parallelism & Concurrency
  • Distributed Computing

Thinking Exercise

Your modeling pipeline was working fine when you filtered on one region, but suddenly throws an out-of-memory error when you include all regions in your dataframe. What steps would you take to solve this problem?

From: wasting money on expensive storage and compute clusters
To: profiling your code and compressing your data so you don’t have to
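The “filter first” and “downcast/compress” ideas can be sketched in pandas. This is a hypothetical sketch with invented region/sales data, not an example from the deck:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west"] * 500,
    "sales": np.random.default_rng(0).random(1000) * 100.0,
})

# Filter first: cut rows before any joins or heavy computation
east = df[df["region"] == "east"].copy()

# Downcast: float64 -> float32 halves this column's memory footprint
east["sales"] = east["sales"].astype("float32")

# Low-cardinality string columns compress well as categoricals
east["region"] = east["region"].astype("category")

before = df.memory_usage(deep=True).sum()
after = east.memory_usage(deep=True).sum()
```

Comparing `before` and `after` with `memory_usage(deep=True)` is a cheap way to profile whether a compression step actually paid off before reaching for a bigger cluster.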



Real-World Examples


Book Recommendations

1. Clean Code
2. System Design
3. Collaboration
