1 of 78

400 bags of grocery receipts

+

Neo4j

An exploration of Instacart data using Neo4j

2 of 78

3 of 78

4 of 78

5 of 78

6 of 78

Data and Data Modeling

Mechanics of getting data into Neo4j

Dumpster Diving

Production Concerns (Sizing, import strategy)

7 of 78

8 of 78

Data and Data Modeling

Mechanics of getting data into Neo4j

Dumpster Diving

Production Concerns (Sizing, import strategy)

9 of 78

But what is Neo4j?

10 of 78

11 of 78

CSV Model

12 of 78

Graph Model

Aisle

Department

Product

Order

User

:ORDERED

:IN_ORDER

:ON

:IN

13 of 78

Data and Data Modeling

Mechanics of getting data into Neo4j

Dumpster Diving

Production Concerns (Sizing, import strategy)

14 of 78

IMPORT FROM CSV WITH HEADERS
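The slide title points at Cypher's LOAD CSV WITH HEADERS clause, which exposes each CSV row as a map keyed by the header names. The sketch below uses Python's csv.DictReader as a rough analogy and keeps a Cypher statement alongside as a plain string. The sample rows, the file URL, and the property names (aisleId, name) are assumptions for illustration, not taken from the slides.

```python
import csv
import io

# Hypothetical sample in the shape of the dataset's aisles.csv
# (header: aisle_id,aisle); the two rows are illustrative.
AISLES_CSV = """aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
"""

# A Cypher statement of the kind the slide title refers to. The file URL
# and the property names (aisleId, name) are assumptions.
LOAD_AISLES = """
LOAD CSV WITH HEADERS FROM 'file:///aisles.csv' AS row
MERGE (a:Aisle {aisleId: toInteger(row.aisle_id)})
SET a.name = row.aisle
"""


def read_rows(text):
    """Return each CSV row as a dict keyed by header, like `row` above."""
    return list(csv.DictReader(io.StringIO(text)))


def to_aisle_node(row):
    """Build node properties; CSV values arrive as strings, so cast the id."""
    return {"aisleId": int(row["aisle_id"]), "name": row["aisle"]}


aisle_nodes = [to_aisle_node(r) for r in read_rows(AISLES_CSV)]
```

The explicit cast mirrors toInteger in the Cypher string: both readers hand back every field as text unless told otherwise.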

15 of 78

Works with relationships, too

16 of 78

17 of 78

18 of 78

19 of 78

Data and Data Modeling

Mechanics of getting data into Neo4j

Dumpster Diving

Production Concerns (Sizing, import strategy)

20 of 78

Dietary Restrictions

21 of 78

Get Departments

22 of 78

Department-based Vegetarian

23 of 78

Original Beef Jerky

24 of 78

Check out our aisles

25 of 78

Aisles with Meat or Seafood in the name

26 of 78

Aisles to Avoid
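A keyword match on aisle names is a quick first pass, but it can both over-match and under-match, which is one reason to curate an explicit list of aisles to avoid. The sketch below shows the idea in plain Python; the aisle names, the keyword list, and the two hand-curated sets are illustrative assumptions, not the lists used in the talk.

```python
# Illustrative aisle names; treat them as sample data, not a dataset dump.
AISLES = [
    "meat counter",
    "packaged seafood",
    "fresh vegetables",
    "tofu meat alternatives",
    "poultry counter",
]

KEYWORDS = ("meat", "seafood")


def aisles_matching(aisles, keywords=KEYWORDS):
    """Return aisles whose name contains any keyword (case-insensitive)."""
    return [a for a in aisles if any(k in a.lower() for k in keywords)]


candidates = aisles_matching(AISLES)

# Hypothetical manual corrections: the keyword pass flags a vegetarian
# aisle and misses one that contains no keyword at all.
FALSE_POSITIVES = {"tofu meat alternatives"}
MISSED = {"poultry counter"}

aisles_to_avoid = sorted((set(candidates) - FALSE_POSITIVES) | MISSED)
```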

27 of 78

What do "vegetarians" buy?

28 of 78

29 of 78

How Many Vegetarians?

30 of 78

Evolve the model

31 of 78

Evolve the model

32 of 78

Add a label, et voilà

33 of 78

Products “Vegans” buy

34 of 78

How Many “Vegans”?

35 of 78

Vegans like me

36 of 78

37 of 78

Vegans like me - Round 2

38 of 78

39 of 78

Instadate

hungry for love?

40 of 78

41 of 78

42 of 78

43 of 78

44 of 78

Cups and ….

45 of 78

Cups and Alcohol

46 of 78

What else could you want?

47 of 78

Recommendations

48 of 78

Problem w/ recommendations

49 of 78

Problem w/ recommendations

50 of 78

Product Adjacency

Product

Product

Product

:NEXT

{ count: 30 }

:NEXT

{ count: 2 }
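One way to read the diagram: a :NEXT relationship links two products that were added to a cart one after the other, and its count property records how often that happened. That reading is an assumption, as are the sample carts below. A minimal in-memory sketch of the counting:

```python
from collections import Counter


def next_counts(orders):
    """Count consecutive product pairs.

    `orders` is an iterable of product-name lists, each already sorted
    in add-to-cart order (an assumption about how :NEXT is derived).
    """
    counts = Counter()
    for cart in orders:
        for first, second in zip(cart, cart[1:]):
            counts[(first, second)] += 1
    return counts


def bought_after(counts, product, limit=5):
    """Return the products most often added right after `product`."""
    followers = Counter(
        {second: n for (first, second), n in counts.items() if first == product}
    )
    return followers.most_common(limit)


# Hypothetical carts for illustration only.
orders = [
    ["Chocolate Cookies", "Milk", "Bananas"],
    ["Chocolate Cookies", "Milk"],
    ["Bananas", "Chocolate Cookies", "Ice Cream"],
]
counts = next_counts(orders)
```

Precomputing the counts once, rather than walking every order per query, is what makes "what comes next" lookups cheap.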

51 of 78

52 of 78

What to buy after Chocolate Cookies?

53 of 78

What do people buy around Chocolate Cookies?

54 of 78

What do people buy around Chocolate Cookies?

55 of 78

What do people buy around Chocolate Cookies?

56 of 78

What Else?

57 of 78

What do people like to eat/drink when they’re sick?

Match people who just ordered pasta and sauce with someone who just ordered a bottle of red wine

Match a person ordering frozen meals with someone ordering the ingredients in those meals

Find out if it’s a single person, a couple, or a family

Can you find out if people have “unhealthy” habits, e.g., ordering flu medicine and a bottle of whiskey?

58 of 78

Data and Data Modeling

Mechanics of getting data into Neo4j

Dumpster Diving

Production Concerns (Sizing, import strategy)

59 of 78

Production Concerns

  1. Hardware Sizing
  2. Initial Data Load Strategy

60 of 78

Make sure you have the memory!

Data size on disk for FS cache + 8-16 GB JVM heap + 1 GB for OS and misc stuff

61 of 78

Total Memory Ballpark

  1. Load a representative subset of the data
  2. Extrapolate for full data size on disk

62 of 78

34,000 Orders (~1%)

Total Nodes: 412k

Total Rels: 1.5m

Total Props: 1.2m

Total Size on Disk: 333 MB

63 of 78

Extrapolated full set

Total Nodes: 41m

Total Rels: 150m

Total Props: 120m

Total Size on Disk: 33 GB
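The extrapolation is a straight linear scale-up: the subset holds about 1% of the orders, so every measured total is multiplied by 100. A small sketch using the subset figures from the previous slide (the dictionary keys and function name are made up for illustration):

```python
# Totals measured on the ~1% subset (34,000 orders).
SUBSET = {
    "nodes": 412_000,
    "rels": 1_500_000,
    "props": 1_200_000,
    "disk_mb": 333,
}

# The subset is roughly 1% of the orders, so scale by 100.
SCALE = 100


def extrapolate(sample, scale):
    """Linearly scale every measured total to the full data set."""
    return {name: value * scale for name, value in sample.items()}


full = extrapolate(SUBSET, SCALE)
full_disk_gb = full["disk_mb"] / 1000  # MB -> GB, decimal units
```

This gives 41.2 million nodes and 33.3 GB, which the slide rounds to 41m and 33 GB. It is only a ballpark, since stores do not always grow linearly with the number of orders.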

64 of 78

Total Memory Ballpark

Data size on disk for FS cache + 8-16 GB Java heap + 1 GB for OS and misc stuff

33 GB + 16 GB + 1 GB = 50 GB

m4.4xlarge with 64 GB of memory for $0.80/hour
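The rule of thumb above is simple addition, sketched here so the numbers can be swapped out for another data set. The function and parameter names are made up for illustration; the defaults take the upper end of the 8-16 GB heap range.

```python
def memory_ballpark_gb(store_gb, heap_gb=16, os_gb=1):
    """Store size (for the file system cache) + JVM heap + OS overhead."""
    return store_gb + heap_gb + os_gb


needed_gb = memory_ballpark_gb(33)  # 33 GB extrapolated store size
instance_gb = 64                    # memory on the instance named above
fits = needed_gb <= instance_gb
```

With a 33 GB store this comes to 50 GB, which leaves some headroom on a 64 GB machine.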

65 of 78

Extrapolated full set

Total Nodes: 41m -> 40.7m

Total Rels: 150m -> 71m

Total Props: 120m -> 128m

Total Size on Disk: 33 GB -> 5.18 GB

66 of 78

Importing a larger dataset

LOAD CSV

67 of 78

Importing a larger dataset

LOAD CSV

neo4j-import

68 of 78

Importing a larger dataset

LOAD CSV

neo4j-import

neo4j-admin import

69 of 78

Importing a larger dataset

LOAD CSV

neo4j-import

neo4j-admin import???

70 of 78

71 of 78

72 of 78

Thanks!

73 of 78

Questions?

74 of 78

Slides and GitHub

GitHub: github.com/Spantree/instacart-neo4j

Or: bit.ly/instacart-neo4j

75 of 78

Resources

76 of 78

Data Citation

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on May 7th, 2017

77 of 78

PageRank Unweighted

78 of 78

Carnivore Coefficient