MongoDB's aggregation framework
As part of my preparations for my MongoDB workshop at PHP Benelux, I ran into a nice use case for MongoDB's aggregation framework. As I have already promised to write about it, this seems to be a good time to actually write the article.
The dataset that I am using for the workshop contains of restaurants and other points of interest, extracted from OpenStreetMap data. As an example, one of the documents that I am storing is:
{ "_id" : "n558797601", "type" : NumberLong(1), "loc" : [ 4.4577708, 51.1611465 ], "tags" : [ "addr:city=Edegem", "addr:country=BE", "addr:housenumber=398", "addr:postcode=2650", "addr:street=Mechelsesteenweg", "amenity=restaurant", "cuisine=regional", "name=La Rosa", ] }
I wanted to find out which cuisines are used in all of the documents of my dataset. Described in a different way: I want all the different cuisine=β¦
tags as used in my dataset. Traditionally you would write a really complexβ’ Map/Reduce job, but since MongoDB 2.2, there is a new feature called the aggregation framework. The aggregation framework is meant to be an easy way to do fairly complex aggregation jobs.
The idea behind it is that each aggregation job is defined by a pipeline of operators. Each operator does a specific task. There is for example $match, which allows you to restrict which documents pass through. If the document matches the predicate contained in the $match
operator, then it is allowed through, and otherwise it is dropped from the pipeline.
On the MongoDB shell, you use the aggregate() shell helper to execute a pipeline of operators, and with PHP you use the MongoCollection::aggregate() method.
In order to let pass all the documents that have the tag amenity=restaurant
, you would run on the shell:
db.poi.aggregate( { $match: { tags: "amenity=restaurant" } } );
And in PHP you would use the following script:
<?php $m = new MongoClient; $c = $m->demo->poi; $result = $c->aggregate( array( array( '$match' => array( 'tags' => 'amenity=restaurant' ) ) ) ); var_dump( $result['result'] ); ?>
If everything goes well, the return value of the aggregate()
helper method is an array with two elements: ok with a value of double(1)
as well as a result element containing an array of all the documents that made it through the whole pipeline. Because the aggregation framework returns all of its results as one document over the network, the full result is limited to 16MB. There are also memory limits internally, so it is always wise to restrict the data coming through the pipeline with an operator as soon as you can.
Let's try to construct a pipeline to get a list of all the different cuisine=β¦
tags including how often they appear.
$match
The first thing to do is to match all documents that have the cuisine=β¦
tag in the first place. We use the $match
operator for that. If a $match
operator is the first operator in a pipeline than it can make use of indexes. It is therefore important that you have indexes in place. The $match
operator definition that we need is:
$allWithCuisine = array( '$match' => array( 'tags' => new MongoRegex( '/^cuisine=/' ) ) );
To fit that into our script, we change it to:
<?php $m = new MongoClient; $c = $m->demo->poi; $allWithCuisine = array( '$match' => array( 'tags' => new MongoRegex( '/^cuisine=/' ) ) ); $result = $c->aggregate( array( $allWithCuisine ) ); var_dump( $result['result'] ); ?>
In my case, this returns an array of 117 documents in the result
element.
For each pipeline step, we create a variable such as $allWithCuisine
to define the operator, and then add those to the array that is passed to the aggregate()
method.
$project
To reduce the amount of data going through the pipeline, our next step is to remove all the fields from the documents that we are not interested in. In fact, we are actually only interested in the tags
field. In order to "re-shape" a document into a different structure, we use the $project operator. In its most basic form, it works the same as the $fields argument to MongoCollection::find(). It is a lot more powerful that that, as it supports changing the whole structure of a document, as well as computed fields. Have a look at the $project documentation for some more inspiration.
As we are only interested in the tags
field of the documents, we just put that in the projection:
$justTheTags = array( '$project' => array( 'tags' => 1 ) );
and modify the aggregate()
call:
$result = $c->aggregate( array( $allWithCuisine, $justTheTags ) );
$unwind
In order to be able to do some work on individual tags, we need to split up the tags
array. The $unwind operator does just that. It is a rather tricky operator to explain, so I will try with an example. Take for example this document:
{ _id: "n478547159", related_ids: [ "n516583937", "n401309937" ] }
Using the $unwind
operator on related_ids
removes each document from the pipeline and introduces two new ones. One for each of the related_ids
elements. At the same time, it replaces the related_ids
array with one of the values. Running { $unwind: '$related_ids' }
turns the above document into the following two:
{ _id: "n478547159", related_ids: "n516583937" } { _id: "n478547159", related_ids: "n401309937" }
In our case, we want a document for each of the elements in the tags
array so that we can group on this field later. We introduce our $unwind
operator:
$unwindTags = array( '$unwind' => '$tags' );
and add it to our list of pipeline operators:
$result = $c->aggregate( array( $allWithCuisine, $justTheTags, $unwindTags ) );
When we run the script now, we get 554 documents in the following form:
β¦ array ( '_id' => 'n470071537', 'tags' => 'amenity=fast_food', ), array ( '_id' => 'n470071537', 'tags' => 'cuisine=burger', ), array ( '_id' => 'n470071537', 'tags' => 'name=C&Ms', ), β¦
Because we are only interested in the cuisine=β¦
tag, we use our previously defined $match
operator to filter out all the documents that don't have this tag:
$result = $c->aggregate( array( $allWithCuisine, $justTheTags, $unwindTags, $allWithCuisine ) );
Which leaves us with 117 documents again.
$group
Now that we have extracted and massaged our data, we are ready to group the documents by their cuisine=β¦
key. The $group operator groups all documents in the pipeline by a key, and allows for computed fields. In our case we want to group by the tags
field:
$groupByTags = array( '$group' => array( '_id' => '$tags' ) );
Then we add it to our list of pipeline operators:
$result = $c->aggregate( array( $allWithCuisine, $justTheTags, $unwindTags, $allWithCuisine, $groupByTags, ) );
Our results includes one document for each distinct $tags
value. A small excerpt:
β¦ array ( '_id' => 'cuisine=kebab;turkish', ), array ( '_id' => 'cuisine=pizza', ), array ( '_id' => 'cuisine=fine_dining', ), β¦
In order to also have a count for each of the distinct values in an extra count
field, we need to modify the $group
operator in the pipeline. I have already mentioned that you can have computed fields, and that's what we need here. A computed field attaches an expression to a field name. In this case, we want the count
field to increment by 1
each time we find a document with this fieldβfor this we use the $sum operator:
$groupByTags = array( '$group' => array( '_id' => '$tags', 'count' => array( '$sum' => 1 ) ) );
Each document that now comes out of the pipeline looks like:
β¦ array ( '_id' => 'cuisine=turkish', 'count' => 2, ), array ( '_id' => 'cuisine=japanese', 'count' => 7, ), array ( '_id' => 'cuisine=italian', 'count' => 10, ), β¦
Other computed fields are also possible. If we want for example to also record which original _id
field had a cuisine=β¦
tag, we modify the group operator to add this field as well:
$groupByTags = array( '$group' => array( '_id' => '$tags', 'count' => array( '$sum' => 1 ), 'ids' => array( '$addToSet' => '$_id' ) ) );
The $addToSet operator adds the original _id
value as a new value to the ids
array for each grouped cuisine=β¦
tag. When we run the full script with the modified $group
operator, we now get documents in the form:
array ( '_id' => 'cuisine=friture', 'count' => 3, 'ids' => array ( 0 => 'n2040116467', 1 => 'n1701471939', 2 => 'n1701465430', ), ),
Because our requirement didn't really want this ids
array, I have removed it from future examples.
$sort
The only thing left to do is now sort our cuisine=β¦
tags with the most used tags first. For this we use the $sort operator. The sort operator works in the same way as the MongoCursor::sort() method and accepts the same arguments. In order to sort by the count
field in descending order, we create the pipeline operator as follows:
$sort = array( '$sort' => array( 'count' => -1 ) );
And add it to our pipeline:
$result = $c->aggregate( array( $allWithCuisine, $justTheTags, $unwindTags, $allWithCuisine, $groupByTags, $sort, ) );
When running our script now, we get a list of all distinct cuisine=β¦
tags ordered by their occurrence:
array ( '_id' => 'cuisine=regional', 'count' => 19, ), array ( '_id' => 'cuisine=burger', 'count' => 15, ), array ( '_id' => 'cuisine=chinese', 'count' => 14, ), β¦
Conclusion
With this I conclude my introduction to the aggregation framework. You can find the final script here. The documentation is extensive so I would suggest to give it a good read.
I'm going back to preparing my PHP Benelux workshop now!
Shortlink
This article has a short URL available: https://drck.me/mdbaggr-a2a