400 bags of grocery receipts + Neo4j
An exploration of Instacart data using Neo4j
Data and Data Modeling
Mechanics of getting data into Neo4j
Dumpster Diving
Production Concerns (Sizing, import strategy)
Data and Data Modeling
But what is Neo4j?
CSV Model
Graph Model
Nodes: Aisle, Department, Product, Order, User
Relationships: :ORDERED, :IN_ORDER, :ON, :IN
Mechanics of getting data into Neo4j
LOAD CSV WITH HEADERS
Works with relationships, too
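A minimal sketch of what such an import could look like, with the Cypher kept in Python strings. Cypher's CSV clause is LOAD CSV WITH HEADERS. The file name and column names follow the public Instacart CSVs; the property names (id, name), the direction of the :IN relationship, and the run_import helper are assumptions made for illustration. Department nodes are assumed to have been loaded already.

```python
# Cypher kept as plain strings. Property names and the direction of :IN
# are assumptions; the slides only name the labels and relationship types.
LOAD_PRODUCTS = """
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row
MERGE (p:Product {id: toInteger(row.product_id)})
SET p.name = row.product_name
"""

# The same clause can create relationships between nodes that already exist.
LOAD_PRODUCT_DEPARTMENTS = """
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row
MATCH (p:Product {id: toInteger(row.product_id)})
MATCH (d:Department {id: toInteger(row.department_id)})
MERGE (p)-[:IN]->(d)
"""


def run_import(session):
    """Run the statements in order and return how many were sent.

    `session` is any object with a run(query) method, for example a
    session from the official neo4j Python driver (hypothetical here).
    """
    statements = [LOAD_PRODUCTS, LOAD_PRODUCT_DEPARTMENTS]
    for statement in statements:
        session.run(statement)
    return len(statements)
```

Nodes go in first so that the relationship statement has something to MATCH against.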
Dumpster Diving
Dietary Restrictions
Get Departments
Department-based Vegetarian
Original Beef Jerky
Check out our aisles
Aisles with Meat or Seafood in the name
Aisles to Avoid
What do "vegetarians" buy?
How Many Vegetarians?
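The filtering idea on the slides above can be sketched in plain Python: first exclude by department, then also by aisle, because a product like Original Beef Jerky may sit outside the meat department. All data below is a hypothetical toy sample, and the deck itself runs this as graph queries, not Python.

```python
# Toy data: product placement and user names are hypothetical.
AVOID_DEPARTMENTS = {"meat seafood"}
AVOID_AISLES = {"popcorn jerky"}

products = {
    "Chicken Breast": {"department": "meat seafood", "aisle": "poultry counter"},
    "Original Beef Jerky": {"department": "snacks", "aisle": "popcorn jerky"},
    "Bananas": {"department": "produce", "aisle": "fresh fruits"},
}

purchases = {
    "ann": {"Bananas"},
    "bob": {"Bananas", "Original Beef Jerky"},
    "cy": {"Chicken Breast"},
}


def is_meat(product, check_aisles=True):
    """True if the product sits in an avoided department or (optionally) aisle."""
    info = products[product]
    if info["department"] in AVOID_DEPARTMENTS:
        return True
    return check_aisles and info["aisle"] in AVOID_AISLES


def vegetarians(purchases, check_aisles=True):
    """Users who never bought anything flagged by is_meat."""
    return sorted(
        user
        for user, items in purchases.items()
        if not any(is_meat(p, check_aisles) for p in items)
    )


# Department-only filtering lets the jerky buyer through; adding aisles does not.
by_department = vegetarians(purchases, check_aisles=False)
by_department_and_aisle = vegetarians(purchases)
```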
Evolve the model
Add a label, et voilà
Products “Vegans” buy
How Many “Vegans”?
Vegans like me
Vegans like me - Round 2
Instadate
hungry for love?
Cups and …
Cups and Alcohol
What else could you want?
Recommendations
Problem w/ recommendations
Product Adjacency
[Diagram: three Product nodes joined by :NEXT relationships, each carrying a count property, e.g. { count: 30 } and { count: 2 }]
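One way the count on a :NEXT relationship could be derived, sketched in plain Python: walk each order in the sequence its items were added and count consecutive product pairs. The sample orders are hypothetical, and treating "next" as "added to the cart immediately after" is an assumption about the model.

```python
from collections import Counter


def next_counts(orders):
    """Count how often one product directly follows another within an order.

    `orders` is a list of product-name lists, each already sorted by the
    position at which the item was added to the cart (an assumption).
    """
    counts = Counter()
    for products in orders:
        for current, following in zip(products, products[1:]):
            counts[(current, following)] += 1
    return counts


# Hypothetical sample orders.
orders = [
    ["Chocolate Cookies", "Milk", "Bananas"],
    ["Chocolate Cookies", "Milk"],
    ["Bananas", "Chocolate Cookies", "Ice Cream"],
]

counts = next_counts(orders)

# Products bought right after Chocolate Cookies, with their counts.
after_cookies = {
    following: n
    for (current, following), n in counts.items()
    if current == "Chocolate Cookies"
}
```

Each (current, following) pair maps to one :NEXT relationship, and its tally becomes the count property.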
What to buy after Chocolate Cookies?
What people buy around Chocolate Cookies?
What Else?
What do people like to eat/drink when they’re sick?
Match people who just ordered pasta and sauce with someone who just ordered a bottle of red wine
Match a person ordering frozen meals with someone ordering the ingredients in those meals
Find out if it’s a single person, a couple, or a family
Can you find out if people have “unhealthy” habits? e.g., ordering flu medicine and a bottle of whiskey
Production Concerns
Make sure you have the memory!
Data size on disk for FS cache + 8-16 GB JVM heap + 1 GB for OS and misc stuff
Total Memory Ballpark
34,000 Orders (~1%)
Total Nodes: 412k
Total Rels: 1.5m
Total Props: 1.2m
Total Size on Disk: 333 MB
Extrapolated full set
Total Nodes: 41m
Total Rels: 150m
Total Props: 120m
Total Size on Disk: 33 GB
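The extrapolation is a straight linear scale-up of the ~1% sample, sketched below. Linear scaling is the assumption here: it holds only if every node and relationship type grows in proportion to the number of orders. The slide rounds the results (41.2m nodes shown as 41m, 33.3 GB as 33 GB).

```python
# Figures from the 34,000-order sample (~1% of the full data set).
SAMPLE = {
    "nodes": 412_000,
    "rels": 1_500_000,
    "props": 1_200_000,
    "disk_mb": 333,
}

# ~1% sample, so the full set is roughly 100 times larger.
SCALE = 100


def extrapolate(sample, scale):
    """Scale every sample figure linearly by `scale`."""
    return {name: value * scale for name, value in sample.items()}


full = extrapolate(SAMPLE, SCALE)
```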
Total Memory Ballpark
Data size on disk for FS cache + 8-16 GB Java heap + 1 GB for OS and misc stuff
33 GB + 16 GB + 1 GB = 50 GB
m4.4xl with 64 GB of memory for $0.80/hour
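The ballpark above as a small helper, using the slide's own terms. The function name and defaults are illustrative; the 16 GB default is the upper end of the 8-16 GB heap range.

```python
def memory_ballpark_gb(data_on_disk_gb, heap_gb=16, os_gb=1):
    """Rule of thumb: file-system cache sized to the data on disk,
    plus Java heap, plus headroom for the OS."""
    return data_on_disk_gb + heap_gb + os_gb


# Extrapolated full data set: 33 GB on disk.
needed_gb = memory_ballpark_gb(33)

# Check against the 64 GB instance named on the slide.
fits_in_64_gb = needed_gb <= 64
```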
Extrapolated full set (estimate -> actual)
Total Nodes: 41m -> 40.7m
Total Rels: 150m -> 71m
Total Props: 120m -> 128m
Total Size on Disk: 33 GB -> 5.18 GB
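A quick comparison of the two columns, reading the arrows as "estimate -> actual" (an interpretation of the slide). It also reruns the memory ballpark with the actual size on disk, assuming the same 16 GB heap and 1 GB for the OS.

```python
# Left-hand and right-hand figures from the slide (millions, GB).
estimated = {"nodes_m": 41, "rels_m": 150, "props_m": 120, "disk_gb": 33}
actual = {"nodes_m": 40.7, "rels_m": 71, "props_m": 128, "disk_gb": 5.18}

# Ratio of actual to estimate; 1.0 would mean the estimate was exact.
ratio = {name: round(actual[name] / estimated[name], 2) for name in estimated}

# Memory ballpark with the actual size on disk: data + heap + OS.
revised_memory_gb = actual["disk_gb"] + 16 + 1
```

Node and property counts land close to the estimate, while relationships and size on disk come in far lower.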
Importing a larger dataset
LOAD CSV
neo4j-import
neo4j-admin import???
Thanks!
Questions?
Slides and GitHub
GitHub: github.com/Spantree/instacart-neo4j
Or: bit.ly/instacart-neo4j
Resources
Instacart Blog Post
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2
Data Set
https://www.instacart.com/datasets/grocery-shopping-2017
Neo4j Hardware Sizing
https://neo4j.com/news/video-hardware-sizing-for-neo4j/
Data Citation
“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on May 7th, 2017
PageRank Unweighted
Carnivore Coefficient