1 of 50

Data Engineering��Querying Semi-structured Data

2 of 50

Document Store: MongoDB

Short for humongous
First version in 2009!
Still very popular

IPO in 2017
Now worth >7B in market capital (as of 2021)

Motivation: Internet & social media boom led to a demand for

Rapid data model evolution: "a move fast and break things" mentality to system dev

E.g., adding a new attrib to a Facebook profile
Contrary to DBMS wisdom of declaring schema upfront and changing rarely (costly!)

Early version centered around storing and querying json documents quickly; several trade-offs

No joins 🡺 now support left outer joins
Limited query opt 🡺 still limited, but many improvements

3 of 50

MongoDB Data Model: JSON with Names

MongoDB	DBMS
Database	Database
Collection	Relation
Document	Row/Record
Field	Column

In MongoDB, you operate on collections, each of which is a document, of object type, consisting of field value pairs.

Recall an object has field value pairs enclosed by { }

Document = {…, field: value, …}

Where value can be

A primitive
An object
An array of values

{ qty : 1, status : "D", size : {h : 14, w : 21}, tags : ["a", "b"] },

4 of 50

MongoDB Data Model 2

MongoDB	DBMS
Database	Database
Collection	Relation
Document	Row/Record
Field	Column

Document = {…, field: value, …}

Special field in each document: _id

Primary key
Will also be indexed by default
If it is not present during ingest, it will be added
Will be first attribute of each doc.
This field requires special treatment during projections as we will see later

5 of 50

MongoDB Query Language (MQL)

Input = collections, output = collections

Three main types of queries in the query language

Retrieval: Restricted SELECT-WHERE-ORDER BY-LIMIT type queries
Aggregation: A bit of a misnomer; a general pipeline of operators

Can capture Retrieval as a special case
But worth understanding Retrieval queries first…

Updates

All queries are invoked as

db.collection.operation1(…).operation2(…)… // collection: name of collection
Like dataframes, but unlike SQL which lists tables in a FROM clause, manipulates a single collection

Pretty messy for two reasons

JSON is messy: you sacrifice simplicity when you give up flatness
Like dataframes, MQL has some weird peculiarities – evolved quickly to meet a need
We’ll give you a taste, but it will be hard for me to answer questions (on the fly at least)

6 of 50

Principles : Dot (.) Notation for Traversing Trees

"." is used to drill deeper into nested objects/arrays

Recall that a value could be primitive, a nested object, an array of primitives/objects
“<expr>.Y”, where Y is a string only applies when <expr> evaluates to a nested object or an array of nested objects

If <expr> is a nested object, Y is a field within that object
If <expr> is an array, Y is a field within objects in that array

“<expr>.n”, where n is a number references the n+1th value in the array corresponding to <expr>

Examples

“person.address" 🡺 address field within array of person objects
“person.1" 🡺 second value within the person array
“person.1.name" 🡺 name field within the object that is listed second in the person array

Note: such dot expressions need to be in quotes

person

Mary

name

address

name

address

street

no

city

Maple

345

SF

John

Thai

phone

23456

0

1

7 of 50

Principles : Dollar ($) Notation

$ indicates that the string is a special keyword

E.g., $gt, $lte, $add, $elemMatch, …

Used as the "field" part of a "field : value" expression
If it is a binary operator, usually done as:

{LOperand : { $keyword : ROperand}}
e.g., {qty : {$gt : 30}}

If it is a multi-arg operator, usually employs arrays:

{$keyword : [argument list]}
e.g., {$add : [1, 2]}

Exception: $fieldName, used to refer to a previously defined field on the value side

Purpose: disambiguation
Only relevant for aggregation pipelines
Let’s not worry about this for now.

8 of 50

Retrieval Queries Template

db.collection.find(<predicate>, optional <projection>)

returns documents that match <predicate>

keep fields as specified in <projection>

both <predicate> and <projection> expressed as objects

in fact, most things are objects!

db.inventory.find( { } )

returns all documents

9 of 50

Retrieval Queries: Basic Queries

db.collection.find(<predicate>, optional <projection>)

find( { status : "D"} )

all documents with status D 🡺 paper, planner

find ( { qty : {$gte : 50} } )

all documents with qty >= 50 🡺 notebook, paper, planner

find ( { status : "D", qty : {$gte : 50} } )

all documents that satisfy both 🡺 paper, planner

find( { $or: [ { status : "D" }, { qty : { $lt : 30 } } ] } )

all documents that satisfy either 🡺 journal, paper, planner�

10 of 50

Retrieval Queries: Nested Documents

db.collection.find(<predicate>, optional <projection>)

find( { size: { h: 14, w: 21, uom: "cm" } } )

exact match of nested object, including ordering of fields! 🡺 journal

find ( { "size.uom" : "cm", "size.h" : {$gt : 14 } )

querying a nested field 🡺 planner
Note: when using . notation for sub-fields, expression must be in quotes

11 of 50

Retrieval Queries: Arrays of Primitives

Slightly different example dataset for Array Examples

db.collection.find(<predicate>, optional <projection>)

find( { tags: ["red", "blank"] } )

Exact match of array 🡺 notebook

find( { tags: "red" } )

If one of the elements matches red 🡺 journal, notebook, paper, planner

find( { tags: "red", tags: "plain" } )

If one matches red, one matches plain 🡺 paper

find( { dim: { $gt: 15, $lt: 20 } } )

If one element is >15 and another is <20 🡺 journal, notebook, paper, postcard

find( { dim: {$elemMatch: { $gt: 15, $lt: 20 } } } )

If a single element is >15 and <20 🡺 postcard

find( { "dim.1": { $gt: 25 } } )

If second item > 25 🡺 planner
Notice again that we use quotes to when using . notation

12 of 50

Retrieval Queries: Arrays of Objects

db.collection.find(<predicate>, optional <projection>)

find( { instock: { loc: "A", qty: 5 } } )

Exact match of document [like nested obj/primitive array case] 🡺 journal

find( { "instock.qty": { $gte : 20 } } )

One nested obj has >= 20 🡺 paper, planner, postcard

find( { "instock.0.qty": { $gte : 20 } } )

First nested obj has >= 20 🡺 paper, planner

find( { "instock": { $elemMatch: { qty: { $gt: 10, $lte: 20 } } } } )

One obj has 20 >= qty >10🡺 paper, journal, postcard

find( { "instock.qty": { $gt: 10, $lte: 20 } } )

One obj has 20 >= qty, another has qty>10 🡺 paper, journal, postcard, planner

13 of 50

Retrieval Queries Template: Projection

db.collection.find(<predicate>, optional <projection>)

Use 1s to indicate fields that you want

Exception: _id is always present unless explicitly excluded

OR Use 0s to indicate fields you don’t want
Mixing 0s and 1s is not allowed for non _id fields

find( { }, {item: 1})

{ "_id" : ObjectId("5fb59ab9f50b800678c0e196"), "item" : "journal" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e197"), "item" : "notebook" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e198"), "item" : "paper" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e199"), "item" : "planner" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e19a"), "item" : "postcard" }

find( { }, {item: 1, _id : 0})

{ "item" : "journal" }

{ "item" : "notebook" }

{ "item" : "paper" }

{ "item" : "planner" }

{ "item" : "postcard" }

find({},{item : 1, tags: 0, _id : 0})

Error: error: {

"ok" : 0,

"errmsg" : "Cannot do exclusion on field tags in inclusion projection",

"code" : 31254,

"codeName" : "Location31254" }

find({},{item : 1, "instock.loc": 1, _id : 0})

{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] }

{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] }

{ "item" : "paper", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] }

{ "item" : "planner", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] }

{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] }

14 of 50

Retrieval Queries : Addendum

Two additional operations that are useful for retrieval:

Limit (k) like LIMIT in SQL

e.g., db.inventory.find( { } ).limit(1)

Sort ({ }) like ORDER BY in SQL

List of fields, -1 indicates decreasing 1 indicates ascending
e.g., db.inventory.find( { }, {_id : 0, instock : 0} ).sort( { "dim.0": -1, item: 1 } )

{ "item" : "planner", "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }

{ "item" : "journal", "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }

{ "item" : "notebook", "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }

{ "item" : "postcard", "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] }

15 of 50

Retrieval Queries: Summary

find() = SELECT <projection>

FROM Collection

WHERE <predicate>

limit() = LIMIT

sort() = ORDER BY

db.inventory.find(

{ tags : red },

{_id : 0, instock : 0} )

.sort ( { "dim.0": -1, item: 1 } )

.limit (2)

WHERE

SELECT

ORDER BY

LIMIT

FROM

16 of 50

What did I not cover?

The use of regexes for matching
$all : all entries in an array satisfy a condition
$in : checking if a value is present in an array of atomic values
The presence or absence of fields

Can use special “null” values
{field : null} checks if a field is null or missing
$exists : checking the presence/absence of a field

17 of 50

MongoDB Query Language (MQL)

Input = collections, output = collections

Three main types of queries in the query language

Retrieval: Restricted SELECT-WHERE-ORDER BY-LIMIT type queries
Aggregation: A bit of a misnomer; a general pipeline of operators

Can capture Retrieval as a special case
But worth understanding Retrieval queries first…

Updates

All queries are invoked as

db.collection.operation1(…).operation2(…)… // collection: name of collection
Like dataframes, but unlike SQL which lists tables in a FROM clause, manipulates a single collection

Pretty messy for two reasons

JSON is messy: you sacrifice simplicity when you give up flatness
Like dataframes, MQL has some weird peculiarities – evolved quickly to meet a need
We’ll give you a taste, but it will be hard for me to answer questions (on the fly at least)

18 of 50

Aggregation Pipelines

Composed of a linear pipeline of stages
Each stage corresponds to one of:

match // first arg of find ( )
project // second arg of find ( ) but more expressiveness
sort/limit // same
group
unwind
lookup
… lots more!!

Each stage manipulates the existing collection in some way

match

group

lookup

project

Syntax:

db.collection.aggregate ( [

{ $stage1Op: { } },

{ $stage2Op: { } },

…

{ $stageNOp: { } }

] )

19 of 50

Next Set of Examples

One document per zipcode: 29353 zipcodes

20 of 50

Grouping (with match/sort) Simple Example

Find states with population > 15M, sort by decending order

db.zips.aggregate( [

{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },

{ $match: { totalPop: { $gte: 15000000 } } },

{ $sort : { totalPop : -1 } }

] )

{ "_id" : "CA", "totalPop" : 29754890 }

{ "_id" : "NY", "totalPop" : 17990402 }

{ "_id" : "TX", "totalPop" : 16984601 }

…

Q: what would the SQL query for this be?

match

group

sort

SELECT state AS id, SUM(pop) AS totalPop

FROM zips

GROUP BY state

HAVING totalPop >= 15000000

ORDER BY totalPop DESCENDING

GROUP BY

AGGS.

match after group = HAVING

21 of 50

Grouping Syntax

$group : {

_id: <expression>, // Group By Expression

<field1>: { <aggfunc1> : <expression1> },

... }

Returns one document per unique group, indexed by _id

Agg.func. can be standard ops like $sum, $avg, $max

Also MQL specific ones:

$first : return the first expression value per group

makes sense only if docs are in a specific order [usually done after sort]

$push : create an array of expression values per group

didn’t make sense in a relational context because values are atomic

$addToSet : like $push, but eliminates duplicates

22 of 50

Multiple Attrib. Grouping Example

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }

] )

Q: Guesses on what this might be doing?

A: Find average city population per state

{ "_id" : "GA", "avgCityPop" : 11547.62210338681 }

{ "_id" : "WI", "avgCityPop" : 7323.00748502994 }

{ "_id" : "FL", "avgCityPop" : 27400.958963282937 }

{ "_id" : "OR", "avgCityPop" : 8262.561046511628 }

{ "_id" : "SD", "avgCityPop" : 1839.6746031746031 }

{ "_id" : "NM", "avgCityPop" : 5872.360465116279 }

{ "_id" : "MD", "avgCityPop" : 12615.775725593667 }

…

group

Group by previously def. id.state

Notice use of $ to refer to previously defined fields

Group by 2 attribs, giving nested id

23 of 50

Multiple Agg. Example

Find, for every state, the biggest city and its population

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $sort: { pop: -1 } },

{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },

{ $sort : {bigPop : -1} }

] )

Approach:

Group by pair of city and state, and compute population per city
Order by population descending
Group by state, and find first city and population per group (i.e., the highest population city)
Order by population descending

{ ”_id" : "IL", "bigCity" : "CHICAGO", "bigPop" : 2452177 }

{ "_id" : "NY", "bigCity" : "BROOKLYN", "bigPop" : 2300504 }

{ "_id" : "CA", "bigCity" : "LOS ANGELES", "bigPop" : 2102295 }

{ "_id" : "TX", "bigCity" : "HOUSTON", "bigPop" : 2095918 }

{ "_id" : "PA", "bigCity" : "PHILADELPHIA", "bigPop" : 1610956 }

{ "_id" : "MI", "bigCity" : "DETROIT", "bigPop" : 963243 }

…

group

sort

group

sort

Can list multiple aggregations after grouping id

24 of 50

Multiple Agg. with Vanilla Projection Example

If we only want to keep the state and city …

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $sort: { pop: -1 } },

{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },

{ $sort : {bigPop : -1} }

{ $project : {bigPop : 0} }

] )

{ "_id" : "IL", "bigCity" : "CHICAGO" }

{ "_id" : "NY", "bigCity" : "BROOKLYN" }

{ "_id" : "CA", "bigCity" : "LOS ANGELES" }

{ "_id" : "TX", "bigCity" : "HOUSTON" }

{ "_id" : "PA", "bigCity" : "PHILADELPHIA" }

…

group

sort

group

sort

project

25 of 50

Multiple Agg. with Adv. Projection Example

If we wanted to nest the name of the city and population into a nested doc

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $sort: { pop: -1 } },

{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },

{ $sort : {bigPop : -1} },

{ $project : { _id : 0, state : "$_id", bigCityDeets: { name: "$bigCity", pop: "$bigPop" } } }

] )

{ "state" : "IL", "bigCityDeets" : { "name" : "CHICAGO", "pop" : 2452177 } }

{ "state" : "NY", "bigCityDeets" : { "name" : "BROOKLYN", "pop" : 2300504 } }

{ "state" : "CA", "bigCityDeets" : { "name" : "LOS ANGELES", "pop" : 2102295 } }

{ "state" : "TX", "bigCityDeets" : { "name" : "HOUSTON", "pop" : 2095918 } }

{ "state" : "PA", "bigCityDeets" : { "name" : "PHILADELPHIA", "pop" : 1610956 } }

…

Can construct new nested documents in output, unlike vanilla projection

26 of 50

Advanced Projection vs. Vanilla Projection

In addition to excluding/including fields like in projection during retrieval (find), projection in the aggregation pipeline allows you to:

Rename fields
Redefine new fields using complex expressions on old fields
Reorganize fields into nestings or unnestings
Reorganize fields into arrays or break down arrays

Try them at home!

27 of 50

Aggregation Pipelines

Composed of a linear pipeline of stages
Each stage corresponds to one of:

match // first arg of find ( )
project // second arg of find ( ) but more expressiveness
sort/limit // same
group
unwind
lookup
… lots more!!

Each stage manipulates the existing collection in some way

Syntax:

db.collection.aggregate ( [

{ $stage1Op: { } },

{ $stage2Op: { } },

…

{ $stageNOp: { } }

] }

match

group

lookup

project

28 of 50

Unwinding Arrays

Unwind expands an array by constructing documents one per element of the array

Going back to our old example with an array of tags

Notice no relational analog here: no arrays so no unwinding

[in fact, some RDBMSs do support arrays, but not in the rel. model]

aggregate( [

{ $unwind : "$tags" },

{ $project : {_id : 0, instock: 0}}

] )

{ "item" : "journal", "tags" : "blank", "dim" : [ 14, 21 ] }

{ "item" : "journal", "tags" : "red", "dim" : [ 14, 21 ] }

{ "item" : "notebook", "tags" : "red", "dim" : [ 14, 21 ] }

{ "item" : "notebook", "tags" : "blank", "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : "red", "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : "blank", "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : "plain", "dim" : [ 14, 21 ] }

{ "item" : "planner", "tags" : "blank", "dim" : [ 22.85, 30 ] }

{ "item" : "planner", "tags" : "red", "dim" : [ 22.85, 30 ] }

{ "item" : "postcard", "tags" : "blue", "dim" : [ 10, 15.25 ] }

29 of 50

Unwind: A Common Template

Q: Imagine if we want to find sum of qtys across items. How would we do this?

A common recipe in MQL queries is to unwind and then group by

aggregate( [

{ $unwind : "$instock" },

{ $group : {_id : "$item", totalqty : {$sum : "$instock.qty"}}}

] )

{ "_id" : "notebook", "totalqty" : 5 }

{ "_id" : "postcard", "totalqty" : 50 }

{ "_id" : "journal", "totalqty" : 20 }

{ "_id" : "planner", "totalqty" : 45 }

{ "_id" : "paper", "totalqty" : 75 }

30 of 50

Demo: Aggregation Queries!

> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}])

> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}, {$match : {awardyears: {$lt: 100}}}])

> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}, {$match : {awardyears: {$lt: 100}}}, {$project : {_id: 0, awardyears: 1}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: "$category", awards: {$sum : 1}}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}, {$sort : {awards: -1}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}, {$group: {_id:"$_id.category", avgawards: {$avg : "$awards"}}}])

31 of 50

Looking Up Other Collections

{ $lookup: {

from: <collection to join>,

localField: <referencing field>,

foreignField: <referenced field>,

as: <output array field>

} }

Conceptually, for each document

find documents in other coll that join (equijoin)

local field must match foreign field

place each of them in an array

Thus, a left outer equi-join, with the join results stored in an array

Straightforward, but kinda gross. Let’s see…

Say, for each item, I want to find other items located in the same location = self-join

db.inventory.aggregate( [

{ $lookup : {from : "inventory", localField: "instock.loc", foreignField: "instock.loc", as:"otheritems"}},

{ $project : {_id : 0, tags : 0, dim : 0}}

] )

{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [

{ "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] },

{ "_id" : ObjectId("5fb6f9605f0594e0227d3c25"), "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] },

{ "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] },

…

] }

And many other records!

aggregate( [ { $lookup : {from:"inventory", localField:"instock.loc", foreignField:"instock.loc", as:"otheritems"}}, {$project : {_id : 0, tags :0, dim :0}} ] )

{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [ { "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c25"), "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c27"), "item" : "planner", "instock" : [ { "loc" : "A", "qty" : 40 }, { "loc" : "B", "qty" : 5 } ], "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c28"), "item" : "postcard", "instock" : [ { "loc" : "B", "qty" : 15 }, { "loc" : "C", "qty" : 35 } ], "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] } ] }

{ "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "otheritems" : [ { "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c25"), "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c28"), "item" : "postcard", "instock" : [ { "loc" : "B", "qty" : 15 }, { "loc" : "C", "qty" : 35 } ], "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] } ] }

{ "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "otheritems" : [ { "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c27"), "item" : "planner", "instock" : [ { "loc" : "A", "qty" : 40 }, { "loc" : "B", "qty" : 5 } ], "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c28"), "item" : "postcard", "instock" : [ { "loc" : "B", "qty" : 15 }, { "loc" : "C", "qty" : 35 } ], "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] } ] }

{ "item" : "planner", "instock" : [ { "loc" : "A", "qty" : 40 }, { "loc" : "B", "qty" : 5 } ], "otheritems" : [ { "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c27"), "item" : "planner", "instock" : [ { "loc" : "A", "qty" : 40 }, { "loc" : "B", "qty" : 5 } ], "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c28"), "item" : "postcard", "instock" : [ { "loc" : "B", "qty" : 15 }, { "loc" : "C", "qty" : 35 } ], "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] } ] }

{ "item" : "postcard", "instock" : [ { "loc" : "B", "qty" : 15 }, { "loc" : "C", "qty" : 35 } ], "otheritems" : [ { "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c25"), "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c27"), "item" : "planner", "instock" : [ { "loc" : "A", "qty" : 40 }, { "loc" : "B", "qty" : 5 } ], "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }, { "_id" : ObjectId("5fb6f9605f0594e0227d3c28"), "item" : "postcard", "instock" : [ { "loc" : "B", "qty" : 15 }, { "loc" : "C", "qty" : 35 } ], "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] } ] }

32 of 50

Lookup… after some more projection

db.inventory.aggregate( [

{ $lookup : {from:"inventory", localField:"instock.loc", foreignField:"instock.loc", as:"otheritems"}},

{$project : {_id : 0, tags :0, dim :0, "otheritems._id":0, "otheritems.tags":0, "otheritems.dim":0, "otheritems.instock.qty":0}} ] )

{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [

{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] },

{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] },

{ "item" : "paper", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] },

{ "item" : "planner", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] },

{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] } ] }

{ "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "otheritems" : [

{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] },

{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] },

{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] } ] }

…

33 of 50

Some Rules of Thumb when Writing Queries

$project is helpful if you want to construct or deconstruct nestings (in addition to removing fields or creating new ones)
$group is helpful to construct arrays (using $push or $addToSet)
$unwind is helpful for unwinding arrays
$lookup is your only hope for joins. Be prepared for a mess. Lots of $project needed

34 of 50

MongoDB Query Language (MQL)

Input = collections, output = collections

Three main types of queries in the query language

Retrieval: Restricted SELECT-WHERE-ORDER BY-LIMIT type queries
Aggregation: A bit of a misnomer; a general pipeline of operators

Can capture Retrieval as a special case
But worth understanding Retrieval queries first…

Updates

All queries are invoked as

db.collection.operation1(…).operation2(…)… // collection: name of collection
Like dataframes, but unlike SQL which lists tables in a FROM clause, manipulates a single collection

Pretty messy for two reasons

JSON is messy: you sacrifice simplicity when you give up flatness
Like dataframes, MQL has some weird peculiarities – evolved quickly to meet a need
We’ll give you a taste, but it will be hard for me to answer questions (on the fly at least)

35 of 50

Update Queries: InsertMany

[Insert/Delete/Update] [One/Many]

Many is more general, so we’ll discuss that instead

db.inventory.insertMany( [

{ item: "journal", instock: [ { loc: "A", qty: 5 }, { loc: "C", qty: 15 } ], tags: ["blank", "red"], dim: [ 14, 21 ] },

{ item: "notebook", instock: [ { loc: "C", qty: 5 } ], tags: ["red", "blank"] , dim: [ 14, 21 ]},

{ item: "paper", instock: [ { loc: "A", qty: 60 }, { loc: "B", qty: 15 } ], tags: ["red", "blank", "plain"] , dim: [ 14, 21 ]},

{ item: "planner", instock: [ { loc: "A", qty: 40 }, { loc: "B", qty: 5 } ], tags: ["blank", "red"], dim: [ 22.85, 30 ] },

{ item: "postcard", instock: [ {loc: "B", qty: 15 }, { loc: "C", qty: 35 } ], tags: ["blue"] , dim: [ 10, 15.25 ] }

] );

Several actions will be taken as part of this insert:

Will create inventory collection if absent [No schema specification/DDL needed!]
Will add the _id attrib to each document added (since it isn’t there)
_id will be the first field for each document by default

36 of 50

Update Queries: UpdateMany

Syntax: updateMany ( {<condition>}, {<change>})

db.inventory.updateMany (

{"dim.0": { $lt: 15 } },

{ $set: { "dim.0": 15, status: "InvalidWidth"} }

) // if any width <15, set it to 15 and set status to InvalidWidth.

Analogous to: UPDATE R SET <change> WHERE <condition>

37 of 50

Update Queries: UpdateMany 2

Syntax: updateMany ( {<condition>}, {<change>})

db.inventory.updateMany (

{"dim.0": { $lt: 15 } },

{ $inc: { "dim.0": 5},

$set: {status: "InvalidWidth"} } )

// if any width <15, increment by 5 and set status to InvalidWidth.

Analogous to: UPDATE R SET <change> WHERE <condition

38 of 50

MongoDB Internals

Optimization heuristics

Will use indexes for $match if early in the pipeline [user can explicitly declare]
$match will be merged with other $match if possible

Selection fusion

$match will be moved early in the pipeline sometimes

Selection pushdown
But: not done always (e.g., not pushed before $lookup)

No cost-based optimization as far as one can tell

Weird constraint: intermediate results of aggregations must not be too large (100MB)

Else will end up spilling to disk
Not clear if they perform any pipelining across aggregation operators

39 of 50

MongoDB: Summary

Bottomline:

MongoDB has now evolved into a mature data system with some different design decisions, and relearning many of the canonical relational database lessons

MongoDB has a flexible data model and a powerful (if confusing) query language.

Many of the internal design decisions as well as the query & data model can be understood when compared with what we know

Database systems provide a good "gold standard" to compare against.

40 of 50

41 of 50

MongoDB: Connecting

Running example: nobel prize dataset

Will reveal aspects of it as needed

Connecting to MongoDB is just like in RDBMS: you create a connection, and then you query

41

42 of 50

MongoDB: Retrieval

Retrieval of documents is done via the find() operation

find_one() is find but with a LIMIT 1

Input: documents & Output: documents

42

43 of 50

MongoDB: Retrieval with Predicates

Retrieval of documents is done via the find()/find_one() operation
Can add predicates for fields within json
Essentially a WHERE clause

43

Notice the nested object!

Convenient to capture all awardees for a given category in a given year.

44 of 50

MongoDB: Retrieval Cursors

Just like in RDBMSs, you use cursors to walk through all result documents

44

45 of 50

MongoDB: Retrieval with Selection & Projection

MongoDB’s find operation allows you to project out certain attributes you don’t want, or keep attributes you want

Either indicate what you want with 1’s or don’t want with 0’s
Can’t use both!

Essentially a WHERE with a SELECT

45

46 of 50

MongoDB: Retrieval with Selection/Projection II

46

47 of 50

MongoDB: Simple Aggregation with Counts

Simple counting operation via count_documents
Notice that this doesn’t return a set of documents

47

48 of 50

MongoDB: Advanced Aggregation

Specifies:

Grouping attributes ($group)
Aggregates (e.g., $sum)

Simple example with just counts of number of documents per category

48

49 of 50

MongoDB: Even more Advanced Aggregation

49

50 of 50

JSON in Relational Systems

Relational database systems have also started supporting JSON as a data type

Declare a column to contain either json or jsonb (binary)

CREATE TABLE nobel_prizes (nobel json)

In our previous example, we will have a row per document

i.e., a row corresponding to prizes for a given year and field

Allowing mixing of relational and semi-structured syntax

SELECT * FROM nobel_prizes

WHERE nobel @> ‘{“year”: “1990”}’;

50