1 of 67

Demystifying Data Science

Joel Grus

@joelgrus

2 of 67

who am I?

3 of 67

research engineer at Ai2

4 of 67

previously

  • Software Engineer at Google
  • Chief Scientist at VoloMetrix
  • other data science roles at Decide and Farecast / Microsoft
  • hedge fund jerk

5 of 67

wrote a book!

6 of 67

co-host a podcast

please listen to it!

7 of 67

write a blog

8 of 67

How did I become a data scientist?

9 of 67

NOW I'M A DATA SCIENTIST

TRY TO GET SPARK TO WORK

TRY TO GET MYSQL TO WORK

TRY TO GET R TO WORK

TRY TO GET AWK TO WORK

TRY TO GET SCIKIT TO WORK

TRY TO GET TENSORFLOW TO WORK

TRY TO GET HADOOP TO WORK

TRY TO GET MATPLOTLIB TO WORK

TRY TO GET DOCKER TO WORK

TRY TO GET D3 TO WORK

SAY "BIG DATA"

TWEET

ATTEND STRATA

SCRAPE AMAZON, GET BANNED

10 of 67

how did data science become me?

11 of 67

Grad School

12 of 67

13 of 67

14 of 67

15 of 67

16 of 67

17 of 67

2010

18 of 67

19 of 67

2011

20 of 67

21 of 67

22 of 67

23 of 67

what do I do on a daily basis?

24 of 67

25 of 67

26 of 67

27 of 67

28 of 67

but what do I DO on a daily basis?

29 of 67

Ai2 ("Research engineer")

  • implement Scala bindings for a C++ neural net library
  • experiment with probabilistic programming
  • build NLP models to answer science questions
  • steal from the d3 gallery
  • crowdsource datasets
  • build web scrapers
  • perform stand-up comedy at the holiday party

30 of 67

Google ("Software engineer")

  • built back-end services in C++
  • figured out complicated data structures problems
  • wrote huge MapReduce jobs
  • ate way too much frogurt
  • tried to convince my co-workers to say "frogurt" instead of "froyo"
  • was not successful

31 of 67

VoloMetrix ("Chief scientist")

  • designed algorithms and implemented them in F#
  • designed visualizations and implemented them in JavaScript
  • built scikit-learn models for customers
  • talked to customers to get product ideas/requirements
  • managed junior data scientists
  • built ugly ETL pipelines in C# (don't do this!)
  • tier 1 customer support (don't do this!)
  • pretty much everything else

32 of 67

Decide ("Analyst")

  • used regexes to pull product specs out of scraped data
  • crowdsourced "model histories" of electronics products
  • built (ML) models to predict new (consumer electronics) models
  • let's not talk anymore about this one

33 of 67

Farecast ("analyst / fareologist")

  • wrote a ton of SQL queries
  • built pivot tables
  • analyzed user behavior
  • did data-driven PR
  • wrote embarrassingly hacky Python scripts
  • got quoted in the New York Times
  • inherited a bunch of Perl dashboards and kept them running

34 of 67

hedge fund ("Senior analyst")

  • priced FX options
  • built really complicated spreadsheets
  • learned first baby steps of SQL
  • lost a lot of money

35 of 67

grad school ("grad student")

  • hid from my advisor
  • learned stats
  • learned a little bit of Python
  • secretly took a creative writing class

36 of 67

what is data science like in practice?

37 of 67

heuristic/joke:

two types of data scientists

38 of 67

type A - the analyst

what's a unit test?

39 of 67

type B - the builder

"what's a train-test split?"

40 of 67

funny, but doesn't give you the full picture

41 of 67

type C - the conformist

I read on hacker news that everyone is using keras, maybe we should too

42 of 67

type D - the deep learner

keras or gtfo

43 of 67

type E - the educator

you're using keras all wrong, give me the keyboard

44 of 67

type F - the failure

I can't get keras to work

45 of 67

type G - the go-getter

I signed up for the keras MOOC and like 10 other MOOCs too

46 of 67

type H - the hater

I hate keras

47 of 67

type I - the inventor

here's my new library, I call it keras

48 of 67

type J - the jerk

maybe I should just put a keras joke on every slide

49 of 67

type K - the kaggler

my first attempt only got me to 61%, but then I stayed up all night for a week renting GPU instances on Amazon, and now I'm getting close to breaking into the top 100. that will get me a job, right?

50 of 67

type L - the lifer

I've been doing data science since before data science was even a thing!

51 of 67

type M - the moocher

hey, can I use your Spark cluster?

52 of 67

type N - the nerd

did you see the new paper on dynamic adversarial generalized deep recurrent reinforcement memory networks on arxiv?

53 of 67

type O - the overqualified

I printed out that pie chart you wanted, it's over there next to my Physics PhD

54 of 67

type P - the p-hacker

sure, the first 19 results weren't significant at the 5% level, but...

55 of 67

type Q - the questioner

why would you categorize data scientists when "data science" is supposed to be an umbrella term?

56 of 67

type R - the R User

set_nas <- function(x) ifelse(is.na(x) | !str_detect(x, "SP"), NA, x)

57 of 67

type S - the self-promoter

hey, have you bought my book and listened to my podcast and read my blog and followed me on twitter?

58 of 67

type T - the thought leader

there are 26 types of data scientists, as you can see in this combination venn diagram / gartner hype cycle

59 of 67

type U - the unicorn

the most important skill for a data scientist is empathy!

60 of 67

type V - the venn-diagram sharer

61 of 67

type W - the worrier

oh god, what if everyone understands deep learning except me?

62 of 67

type X - the xenophobe

nothing against scikit-learn, I just feel more comfortable using my own implementations

63 of 67

type Y - the Yeller

DATA SCIENCE IS THE SEXIEST JOB OF THE 21ST CENTURY! #Data #DataScience #Analytics #BigData #Innovation

64 of 67

type Z - the zookeeper

I'm proficient in Hadoop, Pig, Python, Pandas, Anaconda, Hive, Ant, Giraph, Oozie, Capybara, Orangutan, Coelacanth, ...

65 of 67

in practice, data science is like a lot of different things!

66 of 67

should we categorize data scientists?

think of it as "clustering" instead

67 of 67

type S - the self-promoter

hey, have you bought my book and listened to my podcast and read my blog and followed my twitter?

THANKS!

joelgrus.com

@joelgrus