how becoming not a data scientist made me a better data scientist

Joel Grus

@joelgrus

Research Engineer, Allen Institute for AI

about me

so let me tell you who I am

I'm a research engineer at the Allen Institute for Artificial Intelligence in Seattle.

I build deep learning tools for NLP researchers.

wrote a book

second edition coming soon!

I wrote a book about data science

second edition coming soon

Python 3

deep learning

etc

co-host of the Adversarial Learning podcast

me

Andrew Musselman

I co-host a podcast with Andrew Musselman

it's nominally about data science

someday we'll make a new episode

THE "I don't like notebooks" GUY

I'm the "I don't like notebooks" guy

you may have seen my slides

THE "Fizz Buzz in Tensorflow" GUY

I'm also the "Fizz Buzz in Tensorflow" guy

you may have seen my blog

and I make livecoding videos

you may have seen those too

why is this on my mind?

ok, so why is this on my mind?

for one thing, I agreed to give a talk and didn't have anything to talk about.

I could have given you my notebooks talk, but you can just watch that one on YouTube

EDITOR'S NOTES

AMAZON REVIEWS

OLDNESS + BALDNESS + WISDOM

2nd EDITION

AMAZON REVIEWS

but another reason is that right now I'm revising the second edition of my data science book

so I have some editor's notes

and then I have a bunch of amazon reviews

and then I have oldness and baldness and wisdom

which tells me to ignore the amazon reviews

Artificial Intelligence

Software Engineering

Being Good at Twitter

but technically I haven't really been a data scientist since I finished the first edition

I work at the intersection of artificial intelligence and software engineering

which has allowed me to find many faults with the book (and in particular with the way I was doing data science)

and so eventually I'm going to talk about those faults

part 1

how I became a data scientist

but if I'm going to talk about how I became not a data scientist, I should probably first talk about how I became a data scientist

my background is originally math and economics

I went to grad school, but I dropped out

who here's a grad school dropout?

coming out of school I had no useful skills, other than lebesgue theory

but thanks to the magic of cronyism, I got a job doing quantitative finance

where I became very, very good at Excel

which makes it sort of a shame that I don't use it anymore

so then eventually I worked at a hedge fund

where I wrote just enough SQL (like 3 queries) to put it on my resume

and then very soon after that I didn't work at a hedge fund

so then I lay on my couch for a few weeks watching Law & Order reruns

you can see this is back in the early days when we used to underline links and use awful fonts

data science was totally not a thing

2007

thanks to a combination of cronyism, having SQL on my resume, and being immediately available, I got hired at Farecast as an analyst

doing BI stuff:

how good are our predictions

how much money are we saving customers

do customers who get BUY predictions behave differently

etc

I built a lot of spreadsheets

and I learned SQL for real and then I wrote a lot of SQL queries

(which is another thing that I got really good at and now I almost never use)

and then I needed to write some simple scripts, so I started writing some really bad Python

but I wasn't really doing any machine learning, that stuff was done by serious PhD types

2010

data science was still not a thing

so Farecast got acquired by Microsoft and became "Bing Travel"

and we stopped underlining many of our links

but again, data science was not a thing yet

and in particular (this was 2010) there was no job at Microsoft for someone with my proto-data-science skills

which made it hard for them to stack rank me, which made it hard for me to have much of a career at Microsoft

2011

data science was starting to be a thing

and I wanted in on it

so then I joined decide.com, looks familiar

but I didn't want to keep being an analyst, this was 2011, I wanted to be a data scientist

2011

and back in 2011 you could just ask for a data science job and they'd give it to you

2013

so thanks to the magic of cronyism (and a whole lot of bullshitting) I got hired as a "data scientist" at VoloMetrix.

and then I talked a good game and convinced them to promote me to chief scientist

where I managed a couple of junior data scientists and I was able to give lame quotes

2014

so this was kind of the peak of my career as a data scientist

this was 2014

for those of you who weren't born then, it was a time that was "all about that bass"

part 2

how I became not a data scientist

ok, so how did I become not a data scientist

WROTE PRODUCTION CODE*

* BUT I WASN'T VERY GOOD AT IT

and at volometrix the data science was the product

so I ended up writing a lot of production code

and I liked it

but I wasn't very good at it

and also I had a vision that in the future standalone data science would not be a thing

I was very very wrong

and I started participating in coding contests

which it turned out were secretly recruiting events

which is how I got a call from a Google recruiter

CRACKING THE CODING INTERVIEW

it turns out that Google doesn't care whether you have a CS degree

just as long as you know all the things that someone with a CS degree would know

and so I crammed like I'd never crammed before

and somehow I managed to pass the whiteboard coding interview

2014

and then suddenly I was a software engineer

this was still 2014

and it was still all about that bass

and so for two years I immersed myself in the Google way and wrote a ton of C++

but the joke about Google is that they have 50k really talented engineers but they don't have 50k really interesting engineering jobs

and I missed doing machine learning type things

so I left, and let's not talk about that part

but I ended up at AI2 as a research engineer

I build deep learning tools for NLP researchers

I get to do my two favorite things: python library design + machine learning

(and sometimes my third favorite thing: writing reactjs)

and now suddenly I have to worry about good science and reproducibility

part 3

how that made me a better data scientist

right, so here's the part where I talk about how that made me a better data scientist

even though I'm not really a data scientist anymore

readability

of all the things I took away from my Google experience, this is #1.

make your code readable

this is important if you're writing a book

but it's just as important if you're doing work

case study: code from my book

I hate to pick on people for writing bad code, but it's more ok if the person I'm picking on is me

code from the first edition

what is this minimize_stochastic function that takes 6 arguments?

why did I use partial not once but twice?!

what is 0.001 here?

so this is me being way too clever

it's not the ugliest code in the world, but also it's not the most readable

proposed replacement

sweet, beautiful type hints

explicit iteration

explicit gradient computation

explicit parameter update

roughly the same amount of code, but way more readable

explicit is better than implicit
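The slide code is not reproduced in these notes, so here is a minimal sketch of what "explicit" can look like. It is not the book's code: the names `predict`, `gradient`, and `fit`, the toy data, and the default hyperparameters are all assumptions made for illustration. The point is that the learning rate has a name and each step of the loop is visible.

```python
import random
from typing import List, Tuple

Vector = List[float]


def predict(x: float, theta: Vector) -> float:
    """Linear model: slope * x + intercept."""
    slope, intercept = theta
    return slope * x + intercept


def gradient(x: float, y: float, theta: Vector) -> Vector:
    """Gradient of the squared error (predict(x, theta) - y) ** 2 w.r.t. theta."""
    error = predict(x, theta) - y
    return [2 * error * x, 2 * error]


def fit(data: List[Tuple[float, float]],
        learning_rate: float = 0.1,   # a named parameter, not a bare magic number
        num_epochs: int = 200,
        seed: int = 0) -> Vector:
    """Per-example gradient descent (no shuffling, to keep the sketch short)."""
    rng = random.Random(seed)
    theta = [rng.random(), rng.random()]
    for _ in range(num_epochs):                    # explicit iteration
        for x, y in data:
            grad = gradient(x, y, theta)           # explicit gradient computation
            theta = [t - learning_rate * g         # explicit parameter update
                     for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data generated from y = 2x + 1, with x in [-1, 1].
data = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]
slope, intercept = fit(data)
```

Nothing here needs `partial`: a reader can follow the update rule top to bottom without knowing what a helper does with its six arguments.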

code review

when I was a data scientist I never did code reviews

but GOOG has a very strong code review culture

now I code review everything

so, you know, this is me trying to get a PR merged

unit tests

data scientists don't appreciate unit tests

what is a unit test?

In computer programming, unit testing is a software testing method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine whether they are fit for use.

a unit test is an automated check that a small part of your code works correctly

what's a unit test?

which tool should you use?

what does this have to do with data science?

code from the first edition

putting the expected answer in the comments is good for the reader, but less good for making sure the code is doing what it's supposed to

proposed replacement

putting the expected answer in an assert forces the code to get the right answer

(as long as you run the code)
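A minimal sketch of the difference, using a stand-in `dot` function (an assumption; it may not be the example that was on the slide):

```python
from typing import List

Vector = List[float]


def dot(v: Vector, w: Vector) -> float:
    """Sum of the componentwise products of v and w."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


# Comment style: helpful to the reader, but nothing checks it.
dot([1, 2, 3], [4, 5, 6])  # 32

# Assert style: running this file fails loudly if the answer ever changes.
assert dot([1, 2, 3], [4, 5, 6]) == 32
```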

in "real" code you probably want your tests separate

  • run them as much as you want without running the code itself
  • e.g. make sure your model can predict correctly without having to train it
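A rough sketch of separate tests, assuming a hypothetical `LinearModel` whose parameters can be set by hand. Because the test builds the model with known weights, it checks prediction without any training. Everything sits in one block here only so the sketch is self-contained.

```python
from typing import List


class LinearModel:
    """Hypothetical model whose parameters can be set by hand."""

    def __init__(self, weights: List[float], bias: float) -> None:
        self.weights = weights
        self.bias = bias

    def predict(self, features: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


# In a real project the functions below would live in their own file,
# for example test_model.py, and import LinearModel from the package.
def test_predict_with_known_parameters() -> None:
    model = LinearModel(weights=[2.0, -1.0], bias=0.5)
    # 2 * 3 + (-1) * 4 + 0.5 = 2.5
    assert abs(model.predict([3.0, 4.0]) - 2.5) < 1e-9


def test_predict_with_zero_weights_returns_bias() -> None:
    model = LinearModel(weights=[0.0, 0.0], bias=1.25)
    assert model.predict([10.0, -10.0]) == 1.25
```

A runner such as pytest collects functions named `test_*` from files named `test_*.py`, so tests written this way can be run on their own, as often as you like.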

this is from some livecoding videos I made about building an autograd library

these are some simple tests I wrote to make sure that tensor addition works
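The slide code is not included in these notes. As a sketch of what such a test can look like, here is a tiny list-backed `Tensor` that supports only addition. The class and its method names are assumptions for illustration, not the library from the videos.

```python
from typing import List, Tuple


class Tensor:
    """Hypothetical list-backed tensor that supports only addition."""

    def __init__(self, data: List[float],
                 parents: Tuple["Tensor", ...] = ()) -> None:
        self.data = list(data)
        self.grad = [0.0] * len(self.data)
        self.parents = parents

    def __add__(self, other: "Tensor") -> "Tensor":
        assert len(self.data) == len(other.data), "shapes must match"
        summed = [a + b for a, b in zip(self.data, other.data)]
        return Tensor(summed, parents=(self, other))

    def backward(self, grad: List[float]) -> None:
        # Accumulate the incoming gradient, then pass it on unchanged,
        # because d(a + b)/da = d(a + b)/db = 1.
        self.grad = [g0 + g for g0, g in zip(self.grad, grad)]
        for parent in self.parents:
            parent.backward(grad)


def test_simple_add() -> None:
    t1 = Tensor([1.0, 2.0, 3.0])
    t2 = Tensor([4.0, 5.0, 6.0])
    t3 = t1 + t2
    assert t3.data == [5.0, 7.0, 9.0]

    t3.backward([-1.0, -2.0, -3.0])
    assert t1.grad == [-1.0, -2.0, -3.0]
    assert t2.grad == [-1.0, -2.0, -3.0]
```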

make sure model can train

make sure it produces "expected" results

this is a test I wrote last week for a deep learning model I implemented
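That test is not reproduced here. A much smaller sketch of the same two checks, using a hypothetical one-weight model and a three-example fixture (both assumptions), might look like this:

```python
from typing import List, Tuple

Example = Tuple[float, float]

# Tiny fixture generated from y = 3x, small enough to train on in a test.
FIXTURE: List[Example] = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]


def loss(weight: float, data: List[Example]) -> float:
    """Mean squared error of the model y = weight * x."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


def train(data: List[Example],
          weight: float = 0.0,
          learning_rate: float = 0.05,
          num_steps: int = 50) -> float:
    """Full-batch gradient descent on the mean squared error."""
    for _ in range(num_steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * grad
    return weight


def test_model_can_train() -> None:
    trained = train(FIXTURE)
    assert loss(trained, FIXTURE) < loss(0.0, FIXTURE)


def test_model_produces_expected_results() -> None:
    trained = train(FIXTURE)
    # The fixture follows y = 3x, so the prediction for x = 4 should be near 12.
    assert abs(trained * 4.0 - 12.0) < 0.01
```

The fixture is deliberately tiny: the goal is to catch a model that cannot learn at all, not to measure accuracy.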

static types

the most controversial of my points

again, from the first edition

dicts as data structures

when I was a data scientist, I thought dicts as data structures were just fine

whereas in the second edition

sweet, sweet NamedTuple

add methods

typed fields

access as properties

unit tests too!
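A sketch of the contrast, with an illustrative `StockPrice` record. The field names, the `is_high_tech` method, and the values are assumptions for the example, not necessarily what was on the slides.

```python
import datetime
from typing import NamedTuple

# Dict as data structure: a misspelled key such as "cosing_price"
# is only discovered when that line actually runs.
price_dict = {
    "symbol": "MSFT",
    "date": datetime.date(2018, 12, 14),
    "closing_price": 106.03,
}


class StockPrice(NamedTuple):
    symbol: str                  # typed fields
    date: datetime.date
    closing_price: float

    def is_high_tech(self) -> bool:   # methods live next to the data
        """Illustrative rule only: membership in a hard-coded list."""
        return self.symbol in ["MSFT", "GOOG", "FB", "AMZN", "AAPL"]


price = StockPrice("MSFT", datetime.date(2018, 12, 14), 106.03)

# Fields are accessed as properties, and the checks double as unit tests.
assert price.symbol == "MSFT"
assert price.closing_price == 106.03
assert price.is_high_tech()
```

With the class version, a type checker or editor can flag `price.cosing_price` before the code runs, which the dict version cannot offer.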

we'll see if my second edition readers like this or not

and your IDE becomes a million times more helpful

and of course if you do this your text editor becomes way more helpful

version control

you want to be able to break your code and fix it

the usual suite of data science tools doesn't lend itself to collaboration

reproducible workflows

and finally, this is what I've really learned working with researchers

I'm just going to crib a bit from my notebooks talk, which touched on many of these issues

pytorch 0.3? 0.4? 0.4.1?

I suppose I could guess based on the commit date

model definition

model instantiation (with hard-coded parameters)

more parameters

hard-coded paths

more parameters

training loop

  • This code is pretty much impossible to reproduce
  • First, you have to look at the notebooks, manually install all the dependencies, and hope you got the right version
  • Then you have to hope all your data still exists at the exact same paths
  • It's pretty much guaranteed that if you came back to this notebook in 6 months you would not easily be able to run it
  • In fact, it's likely that if you gave this notebook to your co-worker today she would not be able to easily run it
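One way to attack the problems listed above, sketched under assumptions: keep parameters and paths in a config instead of hard-coding them, and save a record of the library versions next to the results. The `ExperimentConfig` fields, the example path, and the helper names are hypothetical. `importlib.metadata` requires Python 3.8 or later.

```python
import json
import platform
from dataclasses import asdict, dataclass
from importlib import metadata
from typing import Dict, List


@dataclass(frozen=True)
class ExperimentConfig:
    """Everything that was hard-coded in the notebook, in one place."""
    train_path: str
    hidden_dim: int
    learning_rate: float
    num_epochs: int
    seed: int


def load_config(text: str) -> ExperimentConfig:
    """Parse a JSON config; a missing or unexpected key raises TypeError."""
    return ExperimentConfig(**json.loads(text))


def environment_snapshot(packages: List[str]) -> Dict[str, str]:
    """Record the Python version and the installed version of each package."""
    snapshot = {"python": platform.python_version()}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot


# In practice this text would come from a file kept under version control.
config_text = """
{"train_path": "data/train.jsonl", "hidden_dim": 128,
 "learning_rate": 0.001, "num_epochs": 10, "seed": 13}
"""
config = load_config(config_text)

# Store this record alongside the trained model and its metrics.
record = {
    "config": asdict(config),
    "environment": environment_snapshot(["torch"]),
}
print(json.dumps(record, indent=2))
```

With a record like this, "pytorch 0.3? 0.4? 0.4.1?" becomes a lookup instead of a guess based on the commit date.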

what can you do as a data scientist?

so what can you do if you're a data scientist

work at google as a software engineer for a few years

read other people's code

go look at your old code and see if you can understand it

see if you can replicate your work

and write tests!

luckily, you don't need to become not a data scientist to become a better data scientist; you can become a better data scientist while still being a data scientist!

thanks!
