Natural Language Sorting with MongoDB
Note: I have updated this article for the MongoDB 3.4 release in Natural Language Sorting with MongoDB 3.4.
Arranging English words in order is simple—most of the time. You simply arrange them in alphabetical order. Sorting a set of German words, or French words with all their accents, or Chinese with their different characters is a lot harder than it looks. Sorting rules are specified through "locales", which determine how accents are sorted, in which order the letters are in and how to do case-insensitive sorts. There is a good set of those sorting rules available through CLDR, and there is a neat example to play with all kinds of sorting at ICU's demo site. If you really want to know how the algorithms work, have a look at the Unicode Consortium's report on the Unicode Collation Algorithm.
Right now, MongoDB does not support indexes or sorting on anything but Unicode Code Points. Basically, that means, that it can't sort anything but English. There is a long standing issue, SERVER-1920, that is at the top of the priority list, but is not scheduled to be added to a future release. I expect this to be addressed at a point in the near future. However, with some tricks there is a way to solve the sorting problem manually.
Many languages have their own implementation of the Unicode Collation Algorithm, often implemented through ICU. PHP has an ICU based implementation as part of the intl extension, in the form of the Collator class.
The Collator class encapsulates the Collation Algorithm to allow you to sort an array of text yourself. More importantly for this article, it also allows you extract the "sort key". By storing this generated sort key in a separate field in MongoDB, we can sort by locale—and even multiple locales.
Take for example the following array of words:
$words = [
'bailey', 'boffey', 'böhm', 'brown', 'серге́й', 'сергій', 'swag',
'svere'
];
Which we can turn into sort keys with a simple PHP script like:
$collator = new Collator( 'en' );
foreach ( $words as $word )
{
$sortKey = $collator->getSortKey( $word );
echo $word, ': ', bin2hex( $sortKey ), "\n";
}
We create a collator object for the en locale, which is generic English. When running the script, the output is (after formatting):
bailey: 2927373d2f57010a010a boffey: 294331312f57010a010a böhm: 2943353f01859d060109 brown: 294943534101090109 серге́й: 5cba34b41a346601828d05010b сергій: 5cba34b41a6066010a010a swag: 4b53273301080108 svere: 4b512f492f01090109
Those sort keys can be used to then sort the array of names. In PHP, that would be:
$collator->sort( $words ); print_r( $words );
Which returns the following list:
[0] => bailey [1] => boffey [2] => böhm [3] => brown [4] => svere [5] => swag [6] => серге́й [7] => сергій
We can extend this script to use multiple collations, and import each word including its sort keys for each different locale into MongoDB.
Below, we define the words we want to sort on, and the collations we want to compare. They are in order: English, German with phone book sorting, Norwegian, Russian and two forms of Swedish: "default" and "standard":
<?php
$words = [
'bailey', 'boffey', 'böhm', 'brown', 'серге́й', 'сергій',
'swag', 'svere'
];
$collations = [
'en', 'de_DE@collation=phonebook', 'no', 'ru',
'sv', 'sv@collation=standard',
];
Make the connection to MongoDB and clean out the collection:
$m = new MongoClient; $d = $m->demo; $c = $d->collate; $c->drop();
Create the Collator objects for each of our collations:
$collators = [];
foreach ( $collations as $collation )
{
$c->createIndex( [ $collation => 1 ] );
$collators[$collation] = new Collator( $collation );
}
Loop over all the words, and for each collation we have define, use the created Collator object to generate the sort key. We encode the sort key with bin2hex() because sort keys are binary data, and MongoDB requires UTF-8 for strings. My original plan of using MongoDB's BinData type did not work, as it sorts first according to the length of the data. Encoding with base64_encode() also does not work, as it's encoding scheme does not keep the original order. Encoding with utf8_encode() does work, but as it creates some binary (but valid-for-MongoDB-UTF-8) data, it's not good to use as an example.
foreach ( $words as $word )
{
$doc = [ 'word' => $word ];
foreach ( $collations as $collation )
{
$sortKey = $collators[$collation]->getSortKey( $word );
$doc[$collation] = bin2hex( $sortKey );
}
$c->insert( $doc );
}
When we run the script, and see what's in the database, we find something like the following for böhm:
> db.collate.find( { word: 'böhm' }).pretty();
{
"_id" : ObjectId("53fc721844670a35498b4569"),
"word" : "böhm",
"en" : "2943353f01859d060109",
"de_DE@collation=phonebook" : "29432f353f0186870701848f06",
"no" : "295aa105353f018687060108",
"ru" : "2b45374101859d060109",
"sv@collation=standard" : "295aa106353f01080108",
"sv@collation=default" : "295aa106353f01080108"
}
To see the sorting for the words in all the locales, I've added the following to the end of the script:
foreach ( $collations as $collation )
{
echo $collation, ":\n";
$r = $c->find()->sort( [ $collation => 1 ] );
foreach ( $r as $res )
{
echo $res['word'], ' ';
}
echo "\n\n";
}
As you can see, we call sort() and specify which field to sort on. The $collation variable contains the name of the collation. In each stored document, the field with the name of the collation, stores the sort key for that collation as you saw in the previous MongoDB shell output.
Running with this part of the code added, we get:
en: bailey boffey böhm brown svere swag серге́й сергій de_DE@collation=phonebook: bailey böhm boffey brown svere swag серге́й сергій no: bailey boffey brown böhm svere swag серге́й сергій ru: серге́й сергій bailey boffey böhm brown svere swag sv@collation=standard: bailey boffey brown böhm swag svere серге́й сергій sv@collation=default: bailey boffey brown böhm svere swag серге́й сергій
-
In English, the
öinböhmsorts as ano. -
In Germany's phone book collation, the
öinböhmsorts like anoe. -
In Norwegian, the
öinböhmsorts as an extra letter afterz. -
In Russian, the Cyrillic letters sort before Latin letters.
-
In Sweden's "standard" collation, the
vandware considered equivalent letters.
By generating a sort key for your data, you get to chose with which locale MongoDB will do the sorting, but with the overhead of having to maintain an index yourself. ICU, the library that lies underneath PHP's intl extension supports a lot more customisations for collators, and even allows you to define your own custom rules. In the future, we will likely see some of this functionality make it into MongoDB as well. Until this implemented, generating your own sort-key field for each document like this article shows, is your best MongoDB-only approach. If you find collation sorting in MongoDB important, feel free to vote on the SERVER-1920 issue in Jira.
Comments
Nice tip. I just wanted to mention that even for english you have to use a technique like this to sort properly: résumé is sorted after rope if you use a binary sorting method. Sorting people's names is another common requirement in english-only software where you need UCA collation to sort it properly.
Life Line
I've finished reading This Way Up. It's about maps, that went wrong.
It's a good read, but htyerr were several chapters that were written in a novel way (as a video transcript, a series of letters), and I found distracting from the a tail content. It'll have worked better in a produced video.
No mention of @openstreetmap though :-(
Updated a bench
Created a tree; Updated 3 humps and a waste_basket
The Early Cormorant Catches the Eel
Sorry, not the best photo! But I caught this Cormorant catching this large eel when looking for Bank Swallows, right next to Eel Pie Island in the Thames.
#Birds #BirdPhotography #BirdsOfMastodon #Photography #London
Updated an estate_agent office
I went to my nieces' birthday party yesterday.
The theme was pink, and that included all the food, mostly died with beet root.
Shock and horror this morning when doing number two. Not only was my turd dark red, it was also glittering at me. Apparently the carrot cake had edible glitter...
So now I know what's worse than glitter.
😂 ✨ 💩 🟣Long-Tailed Tit on a Branch with Lichen
I've been spending some time in random London local nature reserves.
Sitting and listening, and in fifteen minutes you spot countless species.
This one was in Ham Lands Local Nature Reserve near Teddington.
#london #BirdPhotogaphy #BirdsOfMastodon #Birds #LichenSubscribe
A Colourful Mandarin
In The Long Water in Kensington Palace Gardens, London.
Created 7 benches
Created 2 benches
Created a bench
I walked 7.3km in 2h28m39s
Added a note about a duplicate Papersmiths
I walked 4.1km in 49m02s
Fixed website
fix typo
Updated a bench
I walked 1.6km in 20m26s
I walked 1.1km in 11m49s
The Yellow Eye
A blue heron's head, with its very yellow stare-y eye.
#BirdPhotography #Photography #BirdsOfFediverse #BirdsOfMastodon #London
My little Lego box is telling me it really is quite warm outside.
Created a bicycle_parking and a crossing
I walked 3.3km in 41m56s








Shortlink
This article has a short URL available: https://drck.me/mdbcoll-b2p