Data Engineering��Querying Semi-structured Data
Document Store: MongoDB
MongoDB Data Model: JSON with Names
MongoDB | DBMS |
Database | Database |
Collection | Relation |
Document | Row/Record |
Field | Column |
In MongoDB, you operate on collections, each of which is a document, of object type, consisting of field value pairs.
Recall an object has field value pairs enclosed by { }
Document = {…, field: value, …}
Where value can be
{ qty : 1, status : "D", size : {h : 14, w : 21}, tags : ["a", "b"] },
MongoDB Data Model 2
MongoDB | DBMS |
Database | Database |
Collection | Relation |
Document | Row/Record |
Field | Column |
Document = {…, field: value, …}
Special field in each document: _id
MongoDB Query Language (MQL)
Principles : Dot (.) Notation for Traversing Trees
person
Mary
name
address
name
address
street
no
city
Maple
345
SF
John
Thai
phone
23456
0
1
Principles : Dollar ($) Notation
Retrieval Queries Template
db.collection.find(<predicate>, optional <projection>)
returns documents that match <predicate>
keep fields as specified in <projection>
both <predicate> and <projection> expressed as objects
in fact, most things are objects!
db.inventory.find( { } )
returns all documents
Retrieval Queries: Basic Queries
db.collection.find(<predicate>, optional <projection>)
Retrieval Queries: Nested Documents
db.collection.find(<predicate>, optional <projection>)
Retrieval Queries: Arrays of Primitives
Slightly different example dataset for Array Examples
db.collection.find(<predicate>, optional <projection>)
Retrieval Queries: Arrays of Objects
db.collection.find(<predicate>, optional <projection>)
Retrieval Queries Template: Projection
db.collection.find(<predicate>, optional <projection>)
{ "_id" : ObjectId("5fb59ab9f50b800678c0e196"), "item" : "journal" }
{ "_id" : ObjectId("5fb59ab9f50b800678c0e197"), "item" : "notebook" }
{ "_id" : ObjectId("5fb59ab9f50b800678c0e198"), "item" : "paper" }
{ "_id" : ObjectId("5fb59ab9f50b800678c0e199"), "item" : "planner" }
{ "_id" : ObjectId("5fb59ab9f50b800678c0e19a"), "item" : "postcard" }
{ "item" : "journal" }
{ "item" : "notebook" }
{ "item" : "paper" }
{ "item" : "planner" }
{ "item" : "postcard" }
Error: error: {
"ok" : 0,
"errmsg" : "Cannot do exclusion on field tags in inclusion projection",
"code" : 31254,
"codeName" : "Location31254" }
{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] }
{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] }
{ "item" : "paper", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] }
{ "item" : "planner", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] }
{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] }
Retrieval Queries : Addendum
Two additional operations that are useful for retrieval:
{ "item" : "planner", "tags" : [ "blank", "red" ], "dim" : [ 22.85, 30 ] }
{ "item" : "journal", "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] }
{ "item" : "notebook", "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] }
{ "item" : "paper", "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] }
{ "item" : "postcard", "tags" : [ "blue" ], "dim" : [ 10, 15.25 ] }
Retrieval Queries: Summary
find() = SELECT <projection>
FROM Collection
WHERE <predicate>
limit() = LIMIT
sort() = ORDER BY
db.inventory.find(
{ tags : red },
{_id : 0, instock : 0} )
.sort ( { "dim.0": -1, item: 1 } )
.limit (2)
WHERE
SELECT
ORDER BY
LIMIT
FROM
What did I not cover?
MongoDB Query Language (MQL)
Aggregation Pipelines
match
match
group
lookup
project
db.collection.aggregate ( [
{ $stage1Op: { } },
{ $stage2Op: { } },
…
{ $stageNOp: { } }
] )
Next Set of Examples
One document per zipcode: 29353 zipcodes
Grouping (with match/sort) Simple Example
Find states with population > 15M, sort by decending order
db.zips.aggregate( [
{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 15000000 } } },
{ $sort : { totalPop : -1 } }
] )
{ "_id" : "CA", "totalPop" : 29754890 }
{ "_id" : "NY", "totalPop" : 17990402 }
{ "_id" : "TX", "totalPop" : 16984601 }
…
Q: what would the SQL query for this be?
match
group
sort
SELECT state AS id, SUM(pop) AS totalPop
FROM zips
GROUP BY state
HAVING totalPop >= 15000000
ORDER BY totalPop DESCENDING
GROUP BY
AGGS.
match after group = HAVING
Grouping Syntax
$group : {
_id: <expression>, // Group By Expression
<field1>: { <aggfunc1> : <expression1> },
... }
Returns one document per unique group, indexed by _id
Agg.func. can be standard ops like $sum, $avg, $max
Also MQL specific ones:
Multiple Attrib. Grouping Example
aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
] )
Q: Guesses on what this might be doing?
A: Find average city population per state
{ "_id" : "GA", "avgCityPop" : 11547.62210338681 }
{ "_id" : "WI", "avgCityPop" : 7323.00748502994 }
{ "_id" : "FL", "avgCityPop" : 27400.958963282937 }
{ "_id" : "OR", "avgCityPop" : 8262.561046511628 }
{ "_id" : "SD", "avgCityPop" : 1839.6746031746031 }
{ "_id" : "NM", "avgCityPop" : 5872.360465116279 }
{ "_id" : "MD", "avgCityPop" : 12615.775725593667 }
…
group
group
Group by previously def. id.state
Notice use of $ to refer to previously defined fields
Group by 2 attribs, giving nested id
Multiple Agg. Example
Find, for every state, the biggest city and its population
aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $sort: { pop: -1 } },
{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },
{ $sort : {bigPop : -1} }
] )
Approach:
{ ”_id" : "IL", "bigCity" : "CHICAGO", "bigPop" : 2452177 }
{ "_id" : "NY", "bigCity" : "BROOKLYN", "bigPop" : 2300504 }
{ "_id" : "CA", "bigCity" : "LOS ANGELES", "bigPop" : 2102295 }
{ "_id" : "TX", "bigCity" : "HOUSTON", "bigPop" : 2095918 }
{ "_id" : "PA", "bigCity" : "PHILADELPHIA", "bigPop" : 1610956 }
{ "_id" : "MI", "bigCity" : "DETROIT", "bigPop" : 963243 }
…
group
sort
group
sort
Can list multiple aggregations after grouping id
Multiple Agg. with Vanilla Projection Example
If we only want to keep the state and city …
aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $sort: { pop: -1 } },
{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },
{ $sort : {bigPop : -1} }
{ $project : {bigPop : 0} }
] )
{ "_id" : "IL", "bigCity" : "CHICAGO" }
{ "_id" : "NY", "bigCity" : "BROOKLYN" }
{ "_id" : "CA", "bigCity" : "LOS ANGELES" }
{ "_id" : "TX", "bigCity" : "HOUSTON" }
{ "_id" : "PA", "bigCity" : "PHILADELPHIA" }
…
group
sort
group
sort
project
Multiple Agg. with Adv. Projection Example
If we wanted to nest the name of the city and population into a nested doc
aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $sort: { pop: -1 } },
{ $group: { _id : "$_id.state", bigCity: { $first: "$_id.city" }, bigPop: { $first: "$pop" } } },
{ $sort : {bigPop : -1} },
{ $project : { _id : 0, state : "$_id", bigCityDeets: { name: "$bigCity", pop: "$bigPop" } } }
] )
{ "state" : "IL", "bigCityDeets" : { "name" : "CHICAGO", "pop" : 2452177 } }
{ "state" : "NY", "bigCityDeets" : { "name" : "BROOKLYN", "pop" : 2300504 } }
{ "state" : "CA", "bigCityDeets" : { "name" : "LOS ANGELES", "pop" : 2102295 } }
{ "state" : "TX", "bigCityDeets" : { "name" : "HOUSTON", "pop" : 2095918 } }
{ "state" : "PA", "bigCityDeets" : { "name" : "PHILADELPHIA", "pop" : 1610956 } }
…
Can construct new nested documents in output, unlike vanilla projection
Advanced Projection vs. Vanilla Projection
Aggregation Pipelines
db.collection.aggregate ( [
{ $stage1Op: { } },
{ $stage2Op: { } },
…
{ $stageNOp: { } }
] }
match
match
group
lookup
project
Unwinding Arrays
Unwind expands an array by constructing documents one per element of the array
Going back to our old example with an array of tags
Notice no relational analog here: no arrays so no unwinding
[in fact, some RDBMSs do support arrays, but not in the rel. model]
aggregate( [
{ $unwind : "$tags" },
{ $project : {_id : 0, instock: 0}}
] )
{ "item" : "journal", "tags" : "blank", "dim" : [ 14, 21 ] }
{ "item" : "journal", "tags" : "red", "dim" : [ 14, 21 ] }
{ "item" : "notebook", "tags" : "red", "dim" : [ 14, 21 ] }
{ "item" : "notebook", "tags" : "blank", "dim" : [ 14, 21 ] }
{ "item" : "paper", "tags" : "red", "dim" : [ 14, 21 ] }
{ "item" : "paper", "tags" : "blank", "dim" : [ 14, 21 ] }
{ "item" : "paper", "tags" : "plain", "dim" : [ 14, 21 ] }
{ "item" : "planner", "tags" : "blank", "dim" : [ 22.85, 30 ] }
{ "item" : "planner", "tags" : "red", "dim" : [ 22.85, 30 ] }
{ "item" : "postcard", "tags" : "blue", "dim" : [ 10, 15.25 ] }
Unwind: A Common Template
Q: Imagine if we want to find sum of qtys across items. How would we do this?
A common recipe in MQL queries is to unwind and then group by
aggregate( [
{ $unwind : "$instock" },
{ $group : {_id : "$item", totalqty : {$sum : "$instock.qty"}}}
] )
{ "_id" : "notebook", "totalqty" : 5 }
{ "_id" : "postcard", "totalqty" : 50 }
{ "_id" : "journal", "totalqty" : 20 }
{ "_id" : "planner", "totalqty" : 45 }
{ "_id" : "paper", "totalqty" : 75 }
Demo: Aggregation Queries!
> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}])
> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}, {$match : {awardyears: {$lt: 100}}}])
> db.prizes.aggregate([{$group: {_id: "$category", awardyears: {$sum : 1}}}, {$match : {awardyears: {$lt: 100}}}, {$project : {_id: 0, awardyears: 1}}])
> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: "$category", awards: {$sum : 1}}}])
> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}])
> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}, {$sort : {awards: -1}}])
> db.prizes.aggregate([{$unwind: "$laureates"}, {$group: {_id: {category: "$category", year: "$year"}, awards: {$sum : 1}}}, {$group: {_id:"$_id.category", avgawards: {$avg : "$awards"}}}])
Looking Up Other Collections
{ $lookup: {
from: <collection to join>,
localField: <referencing field>,
foreignField: <referenced field>,
as: <output array field>
} }
Conceptually, for each document
Thus, a left outer equi-join, with the join results stored in an array
Straightforward, but kinda gross. Let’s see…
Say, for each item, I want to find other items located in the same location = self-join
db.inventory.aggregate( [
{ $lookup : {from : "inventory", localField: "instock.loc", foreignField: "instock.loc", as:"otheritems"}},
{ $project : {_id : 0, tags : 0, dim : 0}}
] )
{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [
{ "_id" : ObjectId("5fb6f9605f0594e0227d3c24"), "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "tags" : [ "blank", "red" ], "dim" : [ 14, 21 ] },
{ "_id" : ObjectId("5fb6f9605f0594e0227d3c25"), "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "tags" : [ "red", "blank" ], "dim" : [ 14, 21 ] },
{ "_id" : ObjectId("5fb6f9605f0594e0227d3c26"), "item" : "paper", "instock" : [ { "loc" : "A", "qty" : 60 }, { "loc" : "B", "qty" : 15 } ], "tags" : [ "red", "blank", "plain" ], "dim" : [ 14, 21 ] },
…
] }
And many other records!
Lookup… after some more projection
db.inventory.aggregate( [
{ $lookup : {from:"inventory", localField:"instock.loc", foreignField:"instock.loc", as:"otheritems"}},
{$project : {_id : 0, tags :0, dim :0, "otheritems._id":0, "otheritems.tags":0, "otheritems.dim":0, "otheritems.instock.qty":0}} ] )
{ "item" : "journal", "instock" : [ { "loc" : "A", "qty" : 5 }, { "loc" : "C", "qty" : 15 } ], "otheritems" : [
{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] },
{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] },
{ "item" : "paper", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] },
{ "item" : "planner", "instock" : [ { "loc" : "A" }, { "loc" : "B" } ] },
{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] } ] }
{ "item" : "notebook", "instock" : [ { "loc" : "C", "qty" : 5 } ], "otheritems" : [
{ "item" : "journal", "instock" : [ { "loc" : "A" }, { "loc" : "C" } ] },
{ "item" : "notebook", "instock" : [ { "loc" : "C" } ] },
{ "item" : "postcard", "instock" : [ { "loc" : "B" }, { "loc" : "C" } ] } ] }
…
Some Rules of Thumb when Writing Queries
MongoDB Query Language (MQL)
Update Queries: InsertMany
[Insert/Delete/Update] [One/Many]
db.inventory.insertMany( [
{ item: "journal", instock: [ { loc: "A", qty: 5 }, { loc: "C", qty: 15 } ], tags: ["blank", "red"], dim: [ 14, 21 ] },
{ item: "notebook", instock: [ { loc: "C", qty: 5 } ], tags: ["red", "blank"] , dim: [ 14, 21 ]},
{ item: "paper", instock: [ { loc: "A", qty: 60 }, { loc: "B", qty: 15 } ], tags: ["red", "blank", "plain"] , dim: [ 14, 21 ]},
{ item: "planner", instock: [ { loc: "A", qty: 40 }, { loc: "B", qty: 5 } ], tags: ["blank", "red"], dim: [ 22.85, 30 ] },
{ item: "postcard", instock: [ {loc: "B", qty: 15 }, { loc: "C", qty: 35 } ], tags: ["blue"] , dim: [ 10, 15.25 ] }
] );
Several actions will be taken as part of this insert:
Update Queries: UpdateMany
Syntax: updateMany ( {<condition>}, {<change>})
db.inventory.updateMany (
{"dim.0": { $lt: 15 } },
{ $set: { "dim.0": 15, status: "InvalidWidth"} }
) // if any width <15, set it to 15 and set status to InvalidWidth.
Analogous to: UPDATE R SET <change> WHERE <condition>
Update Queries: UpdateMany 2
Syntax: updateMany ( {<condition>}, {<change>})
db.inventory.updateMany (
{"dim.0": { $lt: 15 } },
{ $inc: { "dim.0": 5},
$set: {status: "InvalidWidth"} } )
// if any width <15, increment by 5 and set status to InvalidWidth.
Analogous to: UPDATE R SET <change> WHERE <condition
MongoDB Internals
MongoDB: Summary
Bottomline:
MongoDB has now evolved into a mature data system with some different design decisions, and relearning many of the canonical relational database lessons
MongoDB has a flexible data model and a powerful (if confusing) query language.
Many of the internal design decisions as well as the query & data model can be understood when compared with what we know
MongoDB: Connecting
41
MongoDB: Retrieval
42
MongoDB: Retrieval with Predicates
43
Notice the nested object!
Convenient to capture all awardees for a given category in a given year.
MongoDB: Retrieval Cursors
44
MongoDB: Retrieval with Selection & Projection
45
MongoDB: Retrieval with Selection/Projection II
46
MongoDB: Simple Aggregation with Counts
47
MongoDB: Advanced Aggregation
48
MongoDB: Even more Advanced Aggregation
49
JSON in Relational Systems
SELECT * FROM nobel_prizes
WHERE nobel @> ‘{“year”: “1990”}’;
50