1 of 50

Data Engineering��Querying Semi-structured Data

2 of 50

Document Store: MongoDB

  • Short for humongous
  • First version in 2009!
  • Still very popular
    • IPO in 2017
    • Now worth >7B in market capital (as of 2021)
  • Motivation: Internet & social media boom led to a demand for
    • Rapid data model evolution: "a move fast and break things" mentality to system dev
      • E.g., adding a new attrib to a Facebook profile
      • Contrary to DBMS wisdom of declaring schema upfront and changing rarely (costly!)
  • Early version centered around storing and querying json documents quickly; several trade-offs
    • No joins 🡺 now support left outer joins
    • Limited query opt 🡺 still limited, but many improvements

3 of 50

MongoDB Data Model: JSON with Names

MongoDB

DBMS

Database

Database

Collection

Relation

Document

Row/Record

Field

Column

In MongoDB, you operate on collections, each of which is a document, of object type, consisting of field value pairs.

Recall an object has field value pairs enclosed by { }

Document = {…, field: value, …}

Where value can be

  • A primitive
  • An object
  • An array of values

{ qty : 1, status : "D", size : {h : 14, w : 21}, tags : ["a", "b"] },

4 of 50

MongoDB Data Model 2

MongoDB

DBMS

Database

Database

Collection

Relation

Document

Row/Record

Field

Column

Document = {…, field: value, …}

Special field in each document: _id

  • Primary key
  • Will also be indexed by default
  • If it is not present during ingest, it will be added
  • Will be first attribute of each doc.
  • This field requires special treatment during projections as we will see later

5 of 50

MongoDB Query Language (MQL)

  • Input = collections, output = collections

  • Three main types of queries in the query language
    • Retrieval: Restricted SELECT-WHERE-ORDER BY-LIMIT type queries
    • Aggregation: A bit of a misnomer; a general pipeline of operators
      • Can capture Retrieval as a special case
      • But worth understanding Retrieval queries first…
    • Updates
  • All queries are invoked as
    • db.collection.operation1(…).operation2(…)… // collection: name of collection
    • Like dataframes, but unlike SQL which lists tables in a FROM clause, manipulates a single collection
  • Pretty messy for two reasons
    • JSON is messy: you sacrifice simplicity when you give up flatness
    • Like dataframes, MQL has some weird peculiarities – evolved quickly to meet a need
    • We’ll give you a taste, but it will be hard for me to answer questions (on the fly at least)

6 of 50

Principles : Dot (.) Notation for Traversing Trees

  • "." is used to drill deeper into nested objects/arrays
    • Recall that a value could be primitive, a nested object, an array of primitives/objects
    • “<expr>.Y”, where Y is a string only applies when <expr> evaluates to a nested object or an array of nested objects
      • If <expr> is a nested object, Y is a field within that object
      • If <expr> is an array, Y is a field within objects in that array
    • “<expr>.n”, where n is a number references the n+1th value in the array corresponding to <expr>
  • Examples
    • person.address" 🡺 address field within array of person objects
    • person.1" 🡺 second value within the person array
    • person.1.name" 🡺 name field within the object that is listed second in the person array
  • Note: such dot expressions need to be in quotes

person

Mary

name

address

name

address

street

no

city

Maple

345

SF

John

Thai

phone

23456

0​

1

7 of 50

Principles : Dollar ($) Notation

  • $ indicates that the string is a special keyword
    • E.g., $gt, $lte, $add, $elemMatch, …
  • Used as the "field" part of a "field : value" expression
  • If it is a binary operator, usually done as:
    • {LOperand : { $keyword : ROperand}}
    • e.g., {qty : {$gt : 30}}
  • If it is a multi-arg operator, usually employs arrays:
    • {$keyword : [argument list]}
    • e.g., {$add : [1, 2]}

  • Exception: $fieldName, used to refer to a previously defined field on the value side
    • Purpose: disambiguation
    • Only relevant for aggregation pipelines
    • Let’s not worry about this for now.

8 of 50

Retrieval Queries Template

db.collection.find(<predicate>, optional <projection>)

returns documents that match <predicate>

keep fields as specified in <projection>

both <predicate> and <projection> expressed as objects

in fact, most things are objects!

db.inventory.find( { } )

returns all documents

9 of 50

Retrieval Queries: Basic Queries

db.collection.find(<predicate>, optional <projection>)

  • find( { status : "D"} )
    • all documents with status D 🡺 paper, planner
  • find ( { qty : {$gte : 50} } )
    • all documents with qty >= 50 🡺 notebook, paper, planner
  • find ( { status : "D", qty : {$gte : 50} } )
    • all documents that satisfy both 🡺 paper, planner
  • find( { $or: [ { status : "D" }, { qty : { $lt : 30 } } ] } )
    • all documents that satisfy either 🡺 journal, paper, planner�

10 of 50

Retrieval Queries: Nested Documents

db.collection.find(<predicate>, optional <projection>)

  • find( { size: { h: 14, w: 21, uom: "cm" } } )
    • exact match of nested object, including ordering of fields! 🡺 journal
  • find ( { "size.uom" : "cm", "size.h" : {$gt : 14 } )
    • querying a nested field 🡺 planner
    • Note: when using . notation for sub-fields, expression must be in quotes

11 of 50

Retrieval Queries: Arrays of Primitives

Slightly different example dataset for Array Examples

db.collection.find(<predicate>, optional <projection>)

  • find( { tags: ["red", "blank"] } )
    • Exact match of array 🡺 notebook
  • find( { tags: "red" } )
    • If one of the elements matches red 🡺 journal, notebook, paper, planner
  • find( { tags: "red", tags: "plain" } )
    • If one matches red, one matches plain 🡺 paper
  • find( { dim: { $gt: 15, $lt: 20 } } )
    • If one element is >15 and another is <20 🡺 journal, notebook, paper, postcard
  • find( { dim: {$elemMatch: { $gt: 15, $lt: 20 } } } )
    • If a single element is >15 and <20 🡺 postcard
  • find( { "dim.1": { $gt: 25 } } )
    • If second item > 25 🡺 planner
    • Notice again that we use quotes to when using . notation

12 of 50

Retrieval Queries: Arrays of Objects

db.collection.find(<predicate>, optional <projection>)

  • find( { instock: { loc: "A", qty: 5 } } )
    • Exact match of document [like nested obj/primitive array case] 🡺 journal
  • find( { "instock.qty": { $gte : 20 } } )
    • One nested obj has >= 20 🡺 paper, planner, postcard
  • find( { "instock.0.qty": { $gte : 20 } } )
    • First nested obj has >= 20 🡺 paper, planner
  • find( { "instock": { $elemMatch: { qty: { $gt: 10, $lte: 20 } } } } )
    • One obj has 20 >= qty >10🡺 paper, journal, postcard
  • find( { "instock.qty": { $gt: 10,  $lte: 20 } } )
    • One obj has 20 >= qty, another has qty>10 🡺 paper, journal, postcard, planner

13 of 50

Retrieval Queries Template: Projection

db.collection.find(<predicate>, optional <projection>)

  • Use 1s to indicate fields that you want
    • Exception: _id is always present unless explicitly excluded
  • OR Use 0s to indicate fields you don’t want
  • Mixing 0s and 1s is not allowed for non _id fields

  • find( { }, {item: 1})

{ "_id" : ObjectId("5fb59ab9f50b800678c0e196"), "item" : "journal" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e197"), "item" : "notebook" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e198"), "item" : "paper" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e199"), "item" : "planner" }

{ "_id" : ObjectId("5fb59ab9f50b800678c0e19a"), "item" : "postcard" }

  • find( { }, {item: 1, _id : 0})

{ "item" : "journal" }

{ "item" : "notebook" }

{ "item" : "paper" }

{ "item" : "planner" }

{ "item" : "postcard" }

  • find({},{item : 1, tags: 0, _id : 0})

Error: error: {

"ok" : 0,

"errmsg" : "Cannot do exclusion on field tags in inclusion projection",

"code" : 31254,

"codeName" : "Location31254" }

  • find({},{item : 1, "instock.loc": 1, _id : 0})

{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] }

{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] }

{ "item" : "paper", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] }

{ "item" : "planner", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] }

{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] }

14 of 50

Retrieval Queries : Addendum

Two additional operations that are useful for retrieval:

  • Limit (k) like LIMIT in SQL
    • e.g., db.inventory.find( { } ).limit(1)
  • Sort ({ }) like ORDER BY in SQL
    • List of fields, -1 indicates decreasing 1 indicates ascending
    • e.g., db.inventory.find( { }, {_id : 0, instock : 0} ).sort( { "dim.0": -1, item: 1 } )

{ "item" : "planner", "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }

{ "item" : "journal", "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }

{ "item" : "notebook", "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }

{ "item" : "postcard", "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] }

15 of 50

Retrieval Queries: Summary

find() = SELECT <projection>

FROM Collection

WHERE <predicate>

limit() = LIMIT

sort() = ORDER BY

db.inventory.find(

{ tags : red },

{_id : 0, instock : 0} )

.sort ( { "dim.0": -1, item: 1 } )

.limit (2)

WHERE

SELECT

ORDER BY

LIMIT

FROM

16 of 50

What did I not cover?

  • The use of regexes for matching
  • $all : all entries in an array satisfy a condition
  • $in : checking if a value is present in an array of atomic values
  • The presence or absence of fields
    • Can use special “null” values
    • {field : null} checks if a field is null or missing
    • $exists : checking the presence/absence of a field

17 of 50

MongoDB Query Language (MQL)

  • Input = collections, output = collections

  • Three main types of queries in the query language
    • Retrieval: Restricted SELECT-WHERE-ORDER BY-LIMIT type queries
    • Aggregation: A bit of a misnomer; a general pipeline of operators
      • Can capture Retrieval as a special case
      • But worth understanding Retrieval queries first…
    • Updates
  • All queries are invoked as
    • db.collection.operation1(…).operation2(…)… // collection: name of collection
    • Like dataframes, but unlike SQL which lists tables in a FROM clause, manipulates a single collection
  • Pretty messy for two reasons
    • JSON is messy: you sacrifice simplicity when you give up flatness
    • Like dataframes, MQL has some weird peculiarities – evolved quickly to meet a need
    • We’ll give you a taste, but it will be hard for me to answer questions (on the fly at least)

18 of 50

Aggregation Pipelines

  • Composed of a linear pipeline of stages
  • Each stage corresponds to one of:
    • match // first arg of find ( )
    • project // second arg of find ( ) but more expressiveness
    • sort/limit // same
    • group
    • unwind
    • lookup
    • … lots more!!
  • Each stage manipulates the existing collection in some way

match

match

group

lookup

project

  • Syntax:

db.collection.aggregate ( [

{ $stage1Op: { } },

{ $stage2Op: { } },

{ $stageNOp: { } }

] )

19 of 50

Next Set of Examples

One document per zipcode: 29353 zipcodes

20 of 50

Grouping (with match/sort) Simple Example

Find states with population > 15M, sort by decending order

db.zips.aggregate( [   

{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },   

{ $match: { totalPop: { $gte: 15000000 } } },

{ $sort : { totalPop : -1 } }

] )

{ "_id" : "CA", "totalPop" : 29754890 }

{ "_id" : "NY", "totalPop" : 17990402 }

{ "_id" : "TX", "totalPop" : 16984601 }

Q: what would the SQL query for this be?

match

group

sort

SELECT state AS id, SUM(pop) AS totalPop

FROM zips

GROUP BY state

HAVING totalPop >= 15000000

ORDER BY totalPop DESCENDING

GROUP BY

AGGS.

match after group = HAVING

21 of 50

Grouping Syntax

$group : {

_id: <expression>, // Group By Expression

<field1>: { <aggfunc1> : <expression1> },

... }

Returns one document per unique group, indexed by _id

Agg.func. can be standard ops like $sum, $avg, $max

Also MQL specific ones:

  • $first : return the first expression value per group
    • makes sense only if docs are in a specific order [usually done after sort]
  • $push : create an array of expression values per group
    • didn’t make sense in a relational context because values are atomic
  • $addToSet : like $push, but eliminates duplicates

22 of 50

Multiple Attrib. Grouping Example

aggregate( [   

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },   

{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }

] )

Q: Guesses on what this might be doing?

A: Find average city population per state

{ "_id" : "GA", "avgCityPop" : 11547.62210338681 }

{ "_id" : "WI", "avgCityPop" : 7323.00748502994 }

{ "_id" : "FL", "avgCityPop" : 27400.958963282937 }

{ "_id" : "OR", "avgCityPop" : 8262.561046511628 }

{ "_id" : "SD", "avgCityPop" : 1839.6746031746031 }

{ "_id" : "NM", "avgCityPop" : 5872.360465116279 }

{ "_id" : "MD", "avgCityPop" : 12615.775725593667 }

group

group

Group by previously def. id.state

Notice use of $ to refer to previously defined fields

Group by 2 attribs, giving nested id

23 of 50

Multiple Agg. Example

Find, for every state, the biggest city and its population

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $sort: { pop: -1 } },

{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },

{ $sort : {bigPop : -1} }

] )

Approach:

  • Group by pair of city and state, and compute population per city
  • Order by population descending
  • Group by state, and find first city and population per group (i.e., the highest population city)
  • Order by population descending

{ ”_id" : "IL", "bigCity" : "CHICAGO", "bigPop" : 2452177 }

{ "_id" : "NY", "bigCity" : "BROOKLYN", "bigPop" : 2300504 }

{ "_id" : "CA", "bigCity" : "LOS ANGELES", "bigPop" : 2102295 }

{ "_id" : "TX", "bigCity" : "HOUSTON", "bigPop" : 2095918 }

{ "_id" : "PA", "bigCity" : "PHILADELPHIA", "bigPop" : 1610956 }

{ "_id" : "MI", "bigCity" : "DETROIT", "bigPop" : 963243 }

group

sort

group

sort

Can list multiple aggregations after grouping id

24 of 50

Multiple Agg. with Vanilla Projection Example

If we only want to keep the state and city …

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $sort: { pop: -1 } },

{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },

{ $sort : {bigPop : -1} }

{ $project : {bigPop : 0} }

] )

{ "_id" : "IL", "bigCity" : "CHICAGO" }

{ "_id" : "NY", "bigCity" : "BROOKLYN" }

{ "_id" : "CA", "bigCity" : "LOS ANGELES" }

{ "_id" : "TX", "bigCity" : "HOUSTON" }

{ "_id" : "PA", "bigCity" : "PHILADELPHIA" }

group

sort

group

sort

project

25 of 50

Multiple Agg. with Adv. Projection Example

If we wanted to nest the name of the city and population into a nested doc

aggregate( [

{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },

{ $sort: { pop: -1 } },

{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },

{ $sort : {bigPop : -1} },

{ $project : { _id : 0, state : "$_id", bigCityDeets: { name: "$bigCity", pop: "$bigPop" } } }

] )

{ "state" : "IL", "bigCityDeets" : { "name" : "CHICAGO", "pop" : 2452177 } }

{ "state" : "NY", "bigCityDeets" : { "name" : "BROOKLYN", "pop" : 2300504 } }

{ "state" : "CA", "bigCityDeets" : { "name" : "LOS ANGELES", "pop" : 2102295 } }

{ "state" : "TX", "bigCityDeets" : { "name" : "HOUSTON", "pop" : 2095918 } }

{ "state" : "PA", "bigCityDeets" : { "name" : "PHILADELPHIA", "pop" : 1610956 } }

Can construct new nested documents in output, unlike vanilla projection

26 of 50

Advanced Projection vs. Vanilla Projection

  • In addition to excluding/including fields like in projection during retrieval (find), projection in the aggregation pipeline allows you to:
    • Rename fields
    • Redefine new fields using complex expressions on old fields
    • Reorganize fields into nestings or unnestings
    • Reorganize fields into arrays or break down arrays
  • Try them at home!

27 of 50

Aggregation Pipelines

  • Composed of a linear pipeline of stages
  • Each stage corresponds to one of:
    • match // first arg of find ( )
    • project // second arg of find ( ) but more expressiveness
    • sort/limit // same
    • group
    • unwind
    • lookup
    • … lots more!!
  • Each stage manipulates the existing collection in some way
  • Syntax:

db.collection.aggregate ( [

{ $stage1Op: { } },

{ $stage2Op: { } },

{ $stageNOp: { } }

] }

match

match

group

lookup

project

28 of 50

Unwinding Arrays

Unwind expands an array by constructing documents one per element of the array

Going back to our old example with an array of tags

Notice no relational analog here: no arrays so no unwinding

[in fact, some RDBMSs do support arrays, but not in the rel. model]

aggregate( [

{ $unwind : "$tags" },

{ $project : {_id : 0, instock: 0}}

] )

{ "item" : "journal", "tags" : "blank", "dim" : [ 14, 21 ] }

{ "item" : "journal", "tags" : "red", "dim" : [ 14, 21 ] }

{ "item" : "notebook", "tags" : "red", "dim" : [ 14, 21 ] }

{ "item" : "notebook", "tags" : "blank", "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : "red", "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : "blank", "dim" : [ 14, 21 ] }

{ "item" : "paper", "tags" : "plain", "dim" : [ 14, 21 ] }

{ "item" : "planner", "tags" : "blank", "dim" : [ 22.85, 30 ] }

{ "item" : "planner", "tags" : "red", "dim" : [ 22.85, 30 ] }

{ "item" : "postcard", "tags" : "blue", "dim" : [ 10, 15.25 ] }

29 of 50

Unwind: A Common Template

Q: Imagine if we want to find sum of qtys across items. How would we do this?

A common recipe in MQL queries is to unwind and then group by

aggregate( [

{ $unwind : "$instock" }, 

{ $group : {_id : "$item", totalqty : {$sum : "$instock.qty"}}}

] )

{ "_id" : "notebook", "totalqty" : 5 }

{ "_id" : "postcard", "totalqty" : 50 }

{ "_id" : "journal", "totalqty" : 20 }

{ "_id" : "planner", "totalqty" : 45 }

{ "_id" : "paper", "totalqty" : 75 }

30 of 50

Demo: Aggregation Queries!

> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}])

> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}, {$match : {awardyears: {$lt: 100}}}])

> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}, {$match : {awardyears: {$lt: 100}}}, {$project : {_id: 0, awardyears: 1}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: "$category", awards: {$sum : 1}}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}, {$sort : {awards: -1}}])

> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}, {$group: {_id:"$_id.category", avgawards: {$avg : "$awards"}}}])

31 of 50

Looking Up Other Collections

{ $lookup: {

from: <collection to join>,

localField: <referencing field>,

foreignField: <referenced field>,

as: <output array field>

} }

Conceptually, for each document

  • find documents in other coll that join (equijoin)
    • local field must match foreign field
  • place each of them in an array

Thus, a left outer equi-join, with the join results stored in an array

Straightforward, but kinda gross. Let’s see…

Say, for each item, I want to find other items located in the same location = self-join

db.inventory.aggregate( [

{ $lookup : {from : "inventory", localField: "instock.loc", foreignField: "instock.loc", as:"otheritems"}}, 

{ $project : {_id : 0, tags : 0, dim : 0}}

] )

{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [

{ "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] },

{ "_id" : ObjectId("5fb6f9605f0594e0227d3c25"), "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] },

{ "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] },

] }

And many other records!

32 of 50

Lookup… after some more projection

db.inventory.aggregate( [

{ $lookup : {from:"inventory", localField:"instock.loc", foreignField:"instock.loc", as:"otheritems"}}, 

{$project : {_id : 0, tags :0, dim :0, "otheritems._id":0, "otheritems.tags":0, "otheritems.dim":0, "otheritems.instock.qty":0}} ] )

{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [

{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] },

{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] },

{ "item" : "paper", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] },

{ "item" : "planner", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] },

{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] } ] }

{ "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "otheritems" : [

{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] },

{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] },

{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] } ] }

33 of 50

Some Rules of Thumb when Writing Queries

  • $project is helpful if you want to construct or deconstruct nestings (in addition to removing fields or creating new ones)
  • $group is helpful to construct arrays (using $push or $addToSet)
  • $unwind is helpful for unwinding arrays
  • $lookup is your only hope for joins. Be prepared for a mess. Lots of $project needed

34 of 50

MongoDB Query Language (MQL)

  • Input = collections, output = collections

  • Three main types of queries in the query language
    • Retrieval: Restricted SELECT-WHERE-ORDER BY-LIMIT type queries
    • Aggregation: A bit of a misnomer; a general pipeline of operators
      • Can capture Retrieval as a special case
      • But worth understanding Retrieval queries first…
    • Updates
  • All queries are invoked as
    • db.collection.operation1(…).operation2(…)… // collection: name of collection
    • Like dataframes, but unlike SQL which lists tables in a FROM clause, manipulates a single collection
  • Pretty messy for two reasons
    • JSON is messy: you sacrifice simplicity when you give up flatness
    • Like dataframes, MQL has some weird peculiarities – evolved quickly to meet a need
    • We’ll give you a taste, but it will be hard for me to answer questions (on the fly at least)

35 of 50

Update Queries: InsertMany

[Insert/Delete/Update] [One/Many]

    • Many is more general, so we’ll discuss that instead

db.inventory.insertMany( [   

{ item: "journal", instock: [ { loc: "A", qty: 5 }, { loc: "C", qty: 15 } ], tags: ["blank", "red"], dim: [ 14, 21 ] },   

{ item: "notebook", instock: [ { loc: "C", qty: 5 } ], tags: ["red", "blank"] , dim: [ 14, 21 ]},   

{ item: "paper", instock: [ { loc: "A", qty: 60 }, { loc: "B", qty: 15 } ], tags: ["red", "blank", "plain"] , dim: [ 14, 21 ]},   

{ item: "planner", instock: [ { loc: "A", qty: 40 }, { loc: "B", qty: 5 } ], tags: ["blank", "red"], dim: [ 22.85, 30 ] },   

{ item: "postcard", instock: [ {loc: "B", qty: 15 }, { loc: "C", qty: 35 } ], tags: ["blue"] , dim: [ 10, 15.25 ] }

] );

Several actions will be taken as part of this insert:

  • Will create inventory collection if absent [No schema specification/DDL needed!]
  • Will add the _id attrib to each document added (since it isn’t there)
  • _id will be the first field for each document by default

36 of 50

Update Queries: UpdateMany

Syntax: updateMany ( {<condition>}, {<change>})

db.inventory.updateMany (

{"dim.0": { $lt: 15 } },

{ $set: { "dim.0": 15, status: "InvalidWidth"} }

) // if any width <15, set it to 15 and set status to InvalidWidth.

Analogous to: UPDATE R SET <change> WHERE <condition>

37 of 50

Update Queries: UpdateMany 2

Syntax: updateMany ( {<condition>}, {<change>})

db.inventory.updateMany (

{"dim.0": { $lt: 15 } }, 

{ $inc: { "dim.0":  5},

$set: {status: "InvalidWidth"} } )

// if any width <15, increment by 5 and set status to InvalidWidth.

Analogous to: UPDATE R SET <change> WHERE <condition

38 of 50

MongoDB Internals

  • Optimization heuristics
    • Will use indexes for $match if early in the pipeline [user can explicitly declare]
    • $match will be merged with other $match if possible
      • Selection fusion
    • $match will be moved early in the pipeline sometimes
      • Selection pushdown
      • But: not done always (e.g., not pushed before $lookup)
    • No cost-based optimization as far as one can tell
  • Weird constraint: intermediate results of aggregations must not be too large (100MB)
    • Else will end up spilling to disk
    • Not clear if they perform any pipelining across aggregation operators

39 of 50

MongoDB: Summary

Bottomline:

MongoDB has now evolved into a mature data system with some different design decisions, and relearning many of the canonical relational database lessons

MongoDB has a flexible data model and a powerful (if confusing) query language.

Many of the internal design decisions as well as the query & data model can be understood when compared with what we know

  • Database systems provide a good "gold standard" to compare against.

40 of 50

41 of 50

MongoDB: Connecting

  • Running example: nobel prize dataset
    • Will reveal aspects of it as needed
  • Connecting to MongoDB is just like in RDBMS: you create a connection, and then you query

41

42 of 50

MongoDB: Retrieval

  • Retrieval of documents is done via the find() operation
    • find_one() is find but with a LIMIT 1
  • Input: documents & Output: documents

42

43 of 50

MongoDB: Retrieval with Predicates

  • Retrieval of documents is done via the find()/find_one() operation
  • Can add predicates for fields within json
  • Essentially a WHERE clause

43

Notice the nested object!

Convenient to capture all awardees for a given category in a given year.

44 of 50

MongoDB: Retrieval Cursors

  • Just like in RDBMSs, you use cursors to walk through all result documents

44

45 of 50

MongoDB: Retrieval with Selection & Projection

  • MongoDB’s find operation allows you to project out certain attributes you don’t want, or keep attributes you want
    • Either indicate what you want with 1’s or don’t want with 0’s
    • Can’t use both!
  • Essentially a WHERE with a SELECT

45

46 of 50

MongoDB: Retrieval with Selection/Projection II

46

47 of 50

MongoDB: Simple Aggregation with Counts

  • Simple counting operation via count_documents
  • Notice that this doesn’t return a set of documents

47

48 of 50

MongoDB: Advanced Aggregation

  • Specifies:
    • Grouping attributes ($group)
    • Aggregates (e.g., $sum)
  • Simple example with just counts of number of documents per category

48

49 of 50

MongoDB: Even more Advanced Aggregation

49

50 of 50

JSON in Relational Systems

  • Relational database systems have also started supporting JSON as a data type
    • Declare a column to contain either json or jsonb (binary)
      • CREATE TABLE nobel_prizes (nobel json)
    • In our previous example, we will have a row per document
      • i.e., a row corresponding to prizes for a given year and field
  • Allowing mixing of relational and semi-structured syntax

SELECT * FROM nobel_prizes

WHERE nobel @> ‘{“year”: “1990”}’;

50