Aggregation Framework
Senior Solutions Architect, MongoDB
Rick Houlihan
MongoDB World
Agenda
• What is theAggregation Framework?
• The Aggregation Pipeline
• Usage and Limitations
• Aggregation and Sharding
• Summary
What is the Aggregation
Framework?
Aggregation Framework
Aggregation in Nutshell
• We're storing our data in
MongoDB
• Our applications need ad-hoc
queries
• We must have a way to reshape
data easily
• You can use Aggregation Framework for
this!
• Extremely versatile, powerful
• Overkill for simple aggregation
tasks
• Averages
• Summation
• Grouping
• Reshaping
MapReduce is great, but…
• High level of complexity
• Difficult to program and debug
Aggregation Framework
• Plays nice with sharding
• Executes in native code
– Written in C++
– JSON parameters
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions
Aggregation Pipeline
What is an Aggregation Pipeline?
• ASeries of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a document, cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
$match $project $group $sort
Pipeline Operators
• $sort
• Order documents
• $limit / $skip
• Paginate documents
• $redact
• Restrict documents
• $geoNear
• Proximity sort
documents
• $let, $map
• Subexpression variables
• $match
• Filter documents
• $project
• Reshape documents
• $group
• Summarize documents
• $unwind
• Expand documents
{
_id: 375,
title: "The Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
Our Example Data
$match
• Filter documents
– Uses existing query syntax
– Can facilitate shard exclusion
– No $where (server side Javascript)
Matching Field Values
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
{
title: "The Great Gatsby",
pages: 218,
language: "English"
}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{ $match: {
language: "Russian"
}}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
Matching with Query Operators
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
{
title: "The Great Gatsby",
pages: 218,
language: "English"
}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{ $match: {
pages: {$gt:100}
}}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{
title: ”Atlas Shrugged",
pages: 1088,
language: “English"
}
$project
• Reshape Documents
– Include, exclude or rename
fields
– Inject computed fields
– Create sub-document fields
Including and Excluding Fields
{
_id: 375,
title: "Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
{ $project: {
_id: 0,
title: 1,
language: 1
}}
{
title: "Great Gatsby",
language: "English"
}
Renaming and Computing Fields
{
_id: 375,
title: "Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
{ $project: {
avgChapterLength: {
$divide: ["$pages",
"$chapters"]
},
lang: "$language"
}}
{
_id: 375,
avgChapterLength: 24.2222,
lang: "English"
}
Creating Sub-Document Fields
{
_id: 375,
title: "Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
{ $project: {
title: 1,
stats: {
pages: "$pages",
language: "$language",
}
}}
{
_id: 375,
title: "Great Gatsby",
stats: {
pages: 218,
language: "English"
}
}
$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by
default
Calculating An Average
{
title: "The Great Gatsby",
pages: 218,
language: "English"
}
{ $group: {
_id: "$language",
avgPages: { $avg:
"$pages" }
}}
{
_id: "Russian",
avgPages: 1440
}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
{
_id: "English",
avgPages: 653
}
Summing Fields and Counting
{
title: "The Great Gatsby",
pages: 218,
language: "English"
}
{ $group: {
_id: "$language",
pages: { $sum: "$pages" },
books: { $sum: 1 }
}}
{
_id: "Russian",
pages: 1440,
books: 1
}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
{
_id: "English",
pages: 1316,
books: 2
}
Collecting Distinct Values
{
title: "The Great Gatsby",
pages: 218,
language: "English"
}
{ $group: {
_id: "$language",
titles: { $addToSet: "$title" }
}}
{
_id: "Russian",
titles: [“War and Peace”]
}
{
title: "War and Peace",
pages: 1440,
language: "Russian"
}
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
{
_id: "English",
titles: [
"Atlas Shrugged",
"The Great Gatsby” ]
}
$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate
Collecting Distinct Values
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: [
"Long Island",
"New York",
"1920s"
]
}
{ title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "Long Island” }
{ title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "New York” }
{ title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "1920s” }
{ $unwind: "$subjects" }
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
Sort All the Documents in the
Pipeline
{ title: “Animal Farm” }
{ $sort: {title: 1} }
{ title: “Brave New World” }
{ title: “Great Gatsby” }
{ title: “Grapes of Wrath, The” }
{ title: “Lord of the Flies” }
{ title: “Great Gatsby, The” }
{ title: “Brave New World” }
{ title: “Grapes of Wrath” }
{ title: “Animal Farm” }
{ title: “Lord of the Flies” }
Limit Documents Through the
Pipeline
{ title: “Great Gatsby, The” }
{ $limit: 5 }
{ title: “Brave New World” }
{ title: “Grapes of Wrath” }
{ title: “Animal Farm” }
{ title: “Lord of the Flies” }
{ title: “Great Gatsby, The” }
{ title: “Brave New World” }
{ title: “Grapes of Wrath” }
{ title: “Animal Farm” }
{ title: “Lord of the Flies” }
{ title: “Fathers and Sons” }
{ title: “Invisible Man” }
Skip Documents in the Pipeline
{ title: “Animal Farm” }
{ $skip: 3 }
{ title: “Lord of the Flies” }
{ title: “Fathers and Sons” }
{ title: “Invisible Man” }
{ title: “Great Gatsby, The” }
{ title: “Brave New World” }
{ title: “Grapes of Wrath” }
{ title: “Animal Farm” }
{ title: “Lord of the Flies” }
{ title: “Fathers and Sons” }
{ title: “Invisible Man” }
$redact
• Restrict access to Documents
– Use document fields to define privileges
– Apply conditional queries to validate users
• Field LevelAccess Control
– $$DESCEND, $$PRUNE, $$KEEP
– Applies to root and subdocument fields
{
_id: 375,
item: "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
Manufacturer: "Sony",
security: 0,
quantity: 12,
list: 4999,
pricing: {
security: 1,
sale: 2698,
wholesale: {
security: 2,
amount: 2300 }
}
}
$redact Example Data
Query by Security Level
security =
0
db.catalog.aggregate([
{
$match: {item: /^.*XBR55X900A*/}
},
{
$redact: {
$cond: {
if: { $lte: [ "$security", ?? ] },
then: "$$DESCEND",
else: "$$PRUNE"
}
}
}])
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
"Manufacturer" : "Sony”,
"security" : 0,
"quantity" : 12,
"list" : 4999
}
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition
TV",
"Manufacturer" : "Sony",
"security" : 0,
"quantity" : 12,
"list" : 4999,
"pricing" : {
"security" : 1,
"sale" : 2698,
"wholesale" : {
"security" : 2,
"amount" : 2300
}
}
}
security =
2
$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
{
"_id" : 10021,
"city" : “NEW YORK”,
"loc" : [
-73.958805,
40.768476
],
"pop" : 106564,
"state" : ”NY”
}
$geonear Example Data
Query by Proximity
db.catalog.aggregate([
{
$geoNear : {
near: [ -86.000, 33.000 ],
distanceField: "dist",
maxDistance: .050,
spherical: true,
num: 3
}
}])
{
"_id" : "35089",
"city" : "KELLYTON",
"loc" : [ -86.048397, 32.979068 ],
"pop" : 1584,
"state" : "AL",
"dist" : 0.0007971432165364155
},
{
"_id" : "35010",
"city" : "NEW SITE",
"loc" : [ -85.951086, 32.941445 ],
"pop" : 19942,
"state" : "AL",
"dist" : 0.0012479615347306806
},
{
"_id" : "35072",
"city" : "GOODWATER",
"loc" : [ -86.078149, 33.074642 ],
"pop" : 3813,
"state" : "AL",
"dist" : 0.0017333719627032555
}
$let / $map
• Bind variables to subexpressions
– Apply conditional logic
– Define complex calculations
– Operate on array field values
{
"_id" : 1,
”price" : 10,
”tax" : 0.50,
”discount" : true
}
$let Example Data
Subexpression Calculations
db.sales.aggregate( [
{
$project: {
finalPrice: {
$let: {
vars: {
total: { $cond: {
if: '$applyDiscount',
then: { $multiply: [0.9, '$price’] },
else: '$price'
}
}
},
in: { $add: [ "$$total", '$tax'] }
}}}}])
{ "_id" : 1, "finalPrice" : 9.5 }
{ "_id" : 2, "finalPrice" : 10.25 }
{
"_id" : 1,
”price" : 10,
”tax" : 0.50,
”discount" : true,
”units" : [ 1, 0, 3, 4, 0, 0, 10, 12, 6, 5 ]
}
$map Example Data
Subexpressions on Arrays
db.sales.aggregate( [ {
$project: {
finalPrice: {
$map: {
input: "$units",
as: "unit",
in: {
$multiply: [ “$$unit”, {
$cond: {
if: '$applyDiscount', then: {
$add : [
{ $multiply: [ 0.9, '$price'] }, '$tax’ ] },
else: { $add: [ '$price', '$tax’ ] }
} } ] } } } } } ] )
{
"_id" : 1,
"finalPrice" :
[ 9.5, 0, 28.5, 38, 0, 0, 95, 114, 57, 47.5 ]
}
{
"_id" : 2,
"finalPrice" :
[ 51.25, 30.75, 20.5, 51.25, 0, 0, 0, 30.75, 41, 71.75 ]
}
Aggregation and Sharding
Sharding
Result
mongos
Shard 1
(Primary)
$match,
$project, $group
Shard 2
$match,
$project, $group
Shard 3
excluded
Shard 4
$match,
$project, $group
• Workload split between shards
– Shards execute pipeline up to a point
– Primary shard merges cursorsand
continues processing*
– Use explain to analyze pipeline split
– Early $match may excuse shards
– Potential CPU and memory implications
for primary shard host
* Priortov2.6secondstagepipelineprocessingwasdonebymongos
Usage and Limitations
Usage
• collection.aggregate([…], {<options>})
– Returns a cursor
– Takes an optional document to specify aggregation options
• allowDiskUse, explain
– Use $out to send results to a Collection
• db.runCommand({aggregate:<collection>, pipeline:[…]})
– Returns a document, limited to 16 MB
Collection
db.books.aggregate([
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1 }}}
])
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }
Database Command
db.runCommand({
aggregate: "books",
pipeline: [
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1
}}}
]
})
{
result : [
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }
],
“ok” : 1
}
Limitations
• Pipeline operator memory limits
– Stages limited to 100 MB
– “allowDiskUse” for larger data sets
• Some BSON types unsupported
– Symbol, MinKey, MaxKey, DBRef, Code, and
CodeWScope
Summary
Aggregation Use Cases
Ad-hoc reporting
Real-timeAnalytics
Transforming Data
Enabling Developers and DBA’s
• Do more with MongoDB and do it
faster
• Eliminate MapReduce
– Replace pages of JavaScript
– More efficient data processing
• Not just a nice feature
– Enabler for real time big data analytics
Thank You