
Byte-Sized

Computer Science for Data People

Part 1: Clean Code

Copyright © 2020 Nick Lind. All rights reserved.


Topics We’ll Cover

1. Clean Code — writing performant code that others will be excited to reuse
2. System Design — building systems and products that scale
3. Collaboration — working productively with other people



Principles of Clean Code

Style: writing developer-friendly code that encourages others to build on top of and reuse your work

Speed: the ability to handle larger volumes of data without slowing down

Space: the ability to handle larger volumes of data without breaking



Deep-Dive: Style

Big Ideas

  • Don’t repeat yourself: write short*, simple functions
  • Tell a story using descriptive names, not comments
  • Pass objects, collections, and functions as parameters

Related CS Concepts

  • Evils of Duplication
  • Software Entropy (‘broken windows’)
  • Functional Programming

Benefits

  • Errors are easier to spot and only have to be fixed in one place
  • Reusing patterns means less code to write and read
  • New team members are easier to onboard and get up to speed

From: hastily writing code that’s easier for a future developer to abandon than to fix
To: writing clean code once and having future users thank you forever

*Clean Code recommends keeping functions to <4 lines, <3 parameters
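These three big ideas can be sketched in a few lines of pandas. This is a hypothetical illustration, not an example from the deck: the `clean_columns` helper and the sample columns are invented for the sketch. It replaces copy-pasted cleaning logic with one short, descriptively named function that takes the transform itself as a parameter.

```python
import pandas as pd

# Before: the same cleaning logic copy-pasted for every column
#   df["price"] = df["price"].str.strip().str.lower()
#   df["region"] = df["region"].str.strip().str.lower()

# After: one short, descriptively named function; the transform is
# passed in as a parameter, so the pattern stays reusable
def clean_columns(df, columns, transform):
    for col in columns:
        df[col] = transform(df[col])
    return df

df = pd.DataFrame({"price": [" 10 ", " 20"], "region": [" East", "West "]})
df = clean_columns(df, ["price", "region"], lambda s: s.str.strip().str.lower())
```

A future reader sees one named pattern instead of n copies of the same line, and a bug in the cleaning logic only has to be fixed in one place.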



Deep-Dive: Speed

Big Ideas

  • Pick the right data structure (don’t distribute the small stuff)
  • Use arrays, not for-loops
  • Double-check if a built-in exists before writing your own function

Related CS Concepts

  • Serialization
  • Vectorization & Linear Algebra
  • Big O Notation

Thinking Exercise

Your modeling pipeline was very fast when running on a small sample of data, but started to hang when you tried the same code on a larger dataframe. What should you do first?

From: running nested for-loops on a Spark DataFrame
To: using a built-in function on a filtered pandas DataFrame instead
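The “use arrays, not for-loops” and “check for a built-in first” ideas can be sketched with NumPy. This is a minimal hypothetical example (the prices/quantities data is invented), contrasting Python-level iteration with a vectorized call:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 3, 4])

# For-loop version: Python-level iteration, slow on large arrays
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized version: one array multiply, delegated to optimized C code
totals_vec = prices * quantities

# A built-in often replaces a hand-rolled function entirely: the grand
# total is a dot product, not a loop with a running accumulator
grand_total = np.dot(prices, quantities)
```

Both versions compute the same totals, but the vectorized one does the per-element work inside NumPy instead of the Python interpreter, which is what keeps a pipeline fast as the dataframe grows.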



Deep-Dive: Space

Big Ideas

  • Filter first, especially before joining
  • Drop columns, downcast, and use in-place operations and sparse data structures to compress your data
  • Index, chunk, and distribute the big stuff

Related CS Concepts

  • Lossless / Lossy Compression
  • Parallelism & Concurrency
  • Distributed Computing

Thinking Exercise

Your modeling pipeline was working fine when you filtered on one region, but suddenly throws an out-of-memory error when you include all regions in your dataframe. What steps would you take to solve this problem?

From: wasting money on expensive storage and compute clusters
To: profiling your code and compressing your data so you don’t have to
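The “filter first” and “downcast/compress” ideas can be sketched in pandas. This is a hypothetical sketch with invented region/sales data, not an example from the deck:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west"] * 500,
    "sales": np.random.default_rng(0).random(1000) * 100.0,
})

# Filter first: cut rows before any joins or heavy computation
east = df[df["region"] == "east"].copy()

# Downcast: float64 -> float32 halves this column's memory footprint
east["sales"] = east["sales"].astype("float32")

# Low-cardinality string columns compress well as categoricals
east["region"] = east["region"].astype("category")

before = df.memory_usage(deep=True).sum()
after = east.memory_usage(deep=True).sum()
```

Comparing `before` and `after` with `memory_usage(deep=True)` is a cheap way to profile whether a compression step actually paid off before reaching for a bigger cluster.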



Real-World Examples


Book Recommendations

1. Clean Code
2. System Design
3. Collaboration
