Good Bye PHP 5

A few days ago I merged a patch into Xdebug that removes support for PHP 5 in Xdebug's master branch on GitHub. Maintaining PHP 5 and PHP 7 support in one code base is not particularly easy, and even more complicated for something like Xdebug, with its deep interactions with PHP's internals.

As active support for PHP 5.6 ended on December 31st, I also felt that it was no longer necessary to support PHP 5 in Xdebug. Dropping it saves more than 5000 lines of code:

Many people were quite positive about that:

Others were less keen:

Removing PHP 5 support from Xdebug's master branch does not mean that Xdebug suddenly stops working for PHP 5 installations. Xdebug 2.5, which was recently released, supports PHP 5.5 and 5.6, and is not going to go away.

Right now, Xdebug will no longer receive new features in the branch that also supports PHP 5. New features will only go into master (to become Xdebug 2.6). However, Xdebug 2.5 continues to receive bug fixes until Xdebug 2.6 comes out.

Once Xdebug 2.6 comes out, the Xdebug 2.5 branch will no longer get bug fixes, and hence support for PHP 5 goes away. That still does not mean that you can no longer use Xdebug with PHP 5. The releases of the 2.5 branch will still be available.

On the positive side, not having to implement lots of code twice also means that new features can be added faster, as less work is required. Xdebug 2.6 already has some new features lined up.

Shortlink

This article has a short URL available: https://drck.me/byephp5-d4v


Natural Language Sorting with MongoDB 3.4

Arranging English words in order is simple, most of the time. You simply arrange them in alphabetical order. Sorting a set of German words, or French words with all of their accents, or Chinese with its different characters, is a lot harder than it looks. Sorting rules are specified through locales, which determine how accents are sorted, in which order the characters come, and how to do case-insensitive sorting. There is a good set of those sorting rules available through CLDR, and there is a neat example to play with all kinds of sorting at ICU's demo site. If you want to know how the algorithms work, have a look at the Unicode Consortium's report on the Unicode Collation Algorithm.

Years ago I wrote about collation and MongoDB. There is an old issue in MongoDB's JIRA tracker, SERVER-1920, to implement collation so that sorting and indexing could work depending on the different sorting orders as described for each language (locale).

Support for these collations has finally landed in MongoDB 3.4, and in this article we are going to have a look at how they work.

How Unicode Collation Works

Many computer languages have their own implementation of the Unicode Collation Algorithm, often implemented through ICU. PHP has an ICU based implementation as part of the intl extension, in the form of the Collator class.

The Collator class encapsulates the Unicode Collation Algorithm to allow you to sort an array of text yourself. It also allows you to visualise the "sort key" to see how the algorithm works:

Take for example the following array of words:

$dictionary = [
    'boffey', 'bøhm', 'brown',
];

Which we can turn into sort keys, and sort using the en locale (English):

$collator = new Collator( 'en' );
$dictionaryWithKey = [];
foreach ( $dictionary as $word )
{
    $sortKey = $collator->getSortKey( $word );
    $dictionaryWithKey[ bin2hex( $sortKey ) ] = $word;
}

ksort( $dictionaryWithKey );
print_r( $dictionaryWithKey );

Which outputs:

Array
(
    [2b4533333159010a010a] => boffey
    [2b453741014496060109] => bøhm
    [2b4b45554301090109] => brown
)

If we did this with the nb (Norwegian) locale instead, the output would have brown and bøhm reversed:

Array
(
    [2b4533333159010a010a] => boffey
    [2b4b45554301090109] => brown
    [2b5c6703374101080108] => bøhm
)

The sort key for bøhm has now changed, so that its numerical value now makes it sort after brown instead of before brown. In Norwegian, the ø is a distinct letter that sorts after z.
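If you do not need to see the sort keys, the same comparison can be made by letting the Collator sort the array directly. This is a minimal sketch that assumes the intl extension is loaded; sortWithLocale() is a hypothetical helper name used only for this illustration:

```php
<?php
// Sketch: sort the same word list under two locales with ext/intl's Collator.
// sortWithLocale() is a hypothetical helper, not something from the article.
function sortWithLocale( array $words, string $locale ): array
{
    $collator = new Collator( $locale );
    // Collator::sort() sorts the array in place, using the locale's rules
    $collator->sort( $words );
    return $words;
}

$dictionary = [ 'brown', 'bøhm', 'boffey' ];

echo implode( ' ', sortWithLocale( $dictionary, 'en' ) ), "\n";
echo implode( ' ', sortWithLocale( $dictionary, 'nb' ) ), "\n";
```

With the sort keys shown above, the en locale should place bøhm before brown, and the nb locale should place it after.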

MongoDB 3.4

Before the release of MongoDB 3.4, it was not possible to do a locale-based search. As case-insensitivity is just another property of a locale, that was not supported either. Many users worked around this by storing a lower-case version of the value in a separate field just to do a case-insensitive search. But this has now changed with the implementation of SERVER-1920.

In MongoDB 3.4 you may attach a default locale to a collection:

db.createCollection( 'dictionary', { collation: { locale: 'nb' } } );

The default locale is used for any query that does not specify a different locale itself. Compare the default (nb) locale:

> db.dictionary.find().sort( { word: 1 } );
{ "_id" : ObjectId("5846d65210d52027a50725f0"), "word" : "boffey" }
{ "_id" : ObjectId("5846d65210d52027a50725f1"), "word" : "brown" }
{ "_id" : ObjectId("5846d65210d52027a50725f2"), "word" : "bøhm" }

With the English (en) locale:

> db.dictionary.find().collation( { locale: 'en'} ).sort( { word: 1 } );
{ "_id" : ObjectId("5846d65210d52027a50725f0"), "word" : "boffey" }
{ "_id" : ObjectId("5846d65210d52027a50725f2"), "word" : "bøhm" }
{ "_id" : ObjectId("5846d65210d52027a50725f1"), "word" : "brown" }

The default locale of a collection is also inherited by an index when you create one:

db.dictionary.createIndex( { word: 1 } );

db.dictionary.getIndexes();
[
    …
    {
        "v" : 2,
        "key" : { "word" : 1 },
        "name" : "word_1",
        "ns" : "demo.dictionary",
        "collation" : {
            "locale" : "nb",
            "caseLevel" : false,
            "caseFirst" : "off",
            "strength" : 3,
            "numericOrdering" : false,
            "alternate" : "non-ignorable",
            "maxVariable" : "punct",
            "normalization" : false,
            "backwards" : false,
            "version" : "57.1"
        }
    }
]


From PHP

All the examples below are using the PHP driver for MongoDB (1.2.0) and the accompanying library (1.1.0). These are the minimum versions to work with locales.

To use the MongoDB PHP Library, you need to use Composer to install it, and include the Composer-generated autoloader to make the library available to the script. In short, that is:

composer require mongodb/mongodb=^1.1.0

And at the start of your script:

<?php
require 'vendor/autoload.php';

In this first example, we are going to drop the collection dictionary from the demo database, and create a collection with the default collation en. We also create an index on the word field and insert a couple of words.

First the set-up and assigning of the database handle ($demo):

$client = new \MongoDB\Client();
$demo = $client->demo;

Then we drop the dictionary collection:

$demo->dropCollection( 'dictionary' );

We create a new collection dictionary and set the default collation for this collection to the en locale:

$demo->createCollection(
    'dictionary',
    [
        'collation' => [ 'locale' => 'en' ],
    ]
);
$dictionary = $demo->dictionary;

We create the index, and we also give the index the name dictionary_en. MongoDB supports multiple indexes with the same field pattern, as long as they have a different name and have different collations (e.g. locale, or locale options):

$dictionary->createIndex(
    [ 'word' => 1 ],
    [ 'name' => 'dictionary_en' ]
);

And then we insert some words:

$dictionary->insertMany( [
    [ 'word' => 'beer' ],
    [ 'word' => 'Beer' ],
    [ 'word' => 'côte' ],
    [ 'word' => 'coté' ],
    [ 'word' => 'høme' ],
    [ 'word' => 'id_12' ],
    [ 'word' => 'id_4' ],
    [ 'word' => 'Home' ],
] );

When doing a query, you can specify the locale for that operation. Only one locale can be used for a single operation, which means that MongoDB uses the same locale for the find and the sort parts of a query. We do intend to provide more granular support for using collations on different parts of an operation. This is tracked in SERVER-25954.

Using the Default Locale

Let's do a query while sorting with the en locale. Because this is the default locale for this collection, we don't have to specify it. We also define a helper function to show the result of this query, and further queries:

function showResults( string $name, \MongoDB\Driver\Cursor $results )
{
    echo $name, ":\n";
    foreach( $results as $result )
    {
        echo $result->word, " ";
    }
    echo "\n\n";
}

showResults(
    "Sort with default locale",
    $dictionary->find( [], [ 'sort' => [ 'word' => 1 ] ] )
);

This outputs:

Sort with default locale:
beer Beer coté côte Home høme id_12 id_4


Only the Base Character

There are many variants of locales. The strength option defines the number of levels that are used to perform a comparison of characters. At strength=1, only base characters are compared. This means that with the en locale: beer == Beer, coté == côte, and Home == høme.
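The effect of strength can also be tried outside of MongoDB, with PHP's Collator class. The following is only a sketch, assuming the intl extension is available; matchesAtStrength() is a hypothetical helper for this illustration, and Collator::PRIMARY is ICU's name for strength 1:

```php
<?php
// Sketch: test whether two strings are equal at a given collation strength.
// matchesAtStrength() is a hypothetical helper for illustration only.
function matchesAtStrength( string $a, string $b, int $strength, string $locale = 'en' ): bool
{
    $collator = new Collator( $locale );
    $collator->setStrength( $strength );
    // compare() returns 0 when both strings are equal at this strength
    return $collator->compare( $a, $b ) === 0;
}

var_dump( matchesAtStrength( 'beer', 'Beer', Collator::PRIMARY ) );
var_dump( matchesAtStrength( 'coté', 'côte', Collator::PRIMARY ) );
var_dump( matchesAtStrength( 'Home', 'høme', Collator::PRIMARY ) );
```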

You can specify the strength with each query. First we use the en locale and strength 1. This is equivalent to a case-insensitive match:

showResults(
    "Match on base character only",
    $dictionary->find(
        [ 'word' => 'beer' ],
        [ 'collation' => [ 'locale' => 'en', 'strength' => 1 ] ]
    )
);

Which outputs:

Match on base character only:
beer Beer

Strength 1 also ignores accents on characters, such as in:

showResults(
    "Match on base character only, ignoring accents",
    $dictionary->find(
        [ 'word' => 'home' ],
        [ 'collation' => [ 'locale' => 'en', 'strength' => 1 ] ]
    )
);

Which outputs:

Match on base character only, ignoring accents:
høme Home

Strength, like the other options we will see later, changes the sort key for a string. Because of this, it is important to realise that MongoDB will only use an index if it was created with the exact same locale options as the query.
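To see why the options have to match, it helps to look at the ICU sort keys again. The sketch below assumes the intl extension; sortKeyHex() is a hypothetical helper, and whether MongoDB stores exactly these bytes in its indexes is not something this article states:

```php
<?php
// Sketch: show that the sort key of a string depends on the strength.
// sortKeyHex() is a hypothetical helper for illustration only.
function sortKeyHex( string $word, int $strength ): string
{
    $collator = new Collator( 'en' );
    $collator->setStrength( $strength );
    return bin2hex( $collator->getSortKey( $word ) );
}

foreach ( [ 'beer', 'Beer' ] as $word )
{
    printf(
        "%-5s strength 1: %s  strength 3: %s\n",
        $word,
        sortKeyHex( $word, Collator::PRIMARY ),
        sortKeyHex( $word, Collator::TERTIARY )
    );
}
```

At strength 1 both words produce the same key, while at strength 3 the keys differ, so keys generated with one strength cannot answer a query that uses the other.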

Because we only have an index on word with the default en locale, all other examples do not make use of an index while matching or sorting. If you want to make an indexed lookup for the en/strength=1 example, you need to create an index with:

$dictionary->createIndex(
    [ 'word' => 1 ],
    [
        'name' => 'word_en_strength1',
        'collation' => [
            'locale' => 'en',
            'strength' => 1
        ],
    ]
);

Different Locales, Different Letters

Not every language considers an accented character a variant of the original base character. If we run the last example with the Norwegian Bokmål (nb) locale we get a different result:

showResults(
    "Match on base character only (nb locale)",
    $dictionary->find(
        [ 'word' => 'home' ],
        [ 'collation' => [ 'locale' => 'nb', 'strength' => 1 ] ]
    )
);

Which outputs:

Match on base character only (nb locale):
Home

In Norwegian, the ø sorts as a distinct letter after z, where the alphabet ends with: y z æ ø å.

Sorting Accents

Strength 2 takes into account accents on letters while matching and sorting. If we run the match on home in the English locale with strength 2, we get:

showResults(
    "Match on base character with accents",
    $dictionary->find(
        [ 'word' => 'home' ],
        [ 'collation' => [ 'locale' => 'en', 'strength' => 2 ] ]
    )
);

Which outputs:

Match on base character with accents:
Home

The word høme is no longer included. However, the case of characters is still not considered:

showResults(
    "Match on base character with accents (and not case sensitive)",
    $dictionary->find(
        [ 'word' => 'beer' ],
        [ 'collation' => [ 'locale' => 'en', 'strength' => 2 ] ]
    )
);

Which outputs:

Match on base character with accents (and not case sensitive):
beer Beer

Again, more fun can be had while sorting with accents, because languages do things differently. If we take the words côte and coté, we see a difference in sorting between the fr (French) and fr_CA (Canadian French) locales:

showResults(
    "Sorting accents in French (France)",
    $dictionary->find(
        [ 'word' => new \MongoDB\BSON\Regex( '^c' ) ],
        [
            'collation' => [ 'locale' => 'fr', 'strength' => 2 ],
            'sort' => [ 'word' => 1 ],
        ]
    )
);

showResults(
    "Sorting accents in Canadian French",
    $dictionary->find(
        [ 'word' => new \MongoDB\BSON\Regex( '^c' ) ],
        [
            'collation' => [ 'locale' => 'fr_CA', 'strength' => 2 ],
            'sort' => [ 'word' => 1 ],
        ]
    )
);

Which outputs:

Sorting accents in French (France):
coté côte

Sorting accents in Canadian French:
côte coté

In Canadian French, the accents sort from back to front. This is called Backward Secondary Sorting, and is an option you can set on any locale-based query. Some language locales have different default values for options. To make Canadian French sort the "wrong" way, we can explicitly set the backwards option to false:

showResults(
    "Sorting accents in Canadian French, the 'wrong' way",
    $dictionary->find(
        [ 'word' => new \MongoDB\BSON\Regex( '^c' ) ],
        [
            'collation' => [ 'locale' => 'fr_CA', 'strength' => 2, 'backwards' => false ],
            'sort' => [ 'word' => 1 ],
        ]
    )
);

Which outputs:

Sorting accents in Canadian French, the 'wrong' way:
coté côte
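The same option exists on PHP's Collator class, where ICU calls it "French collation". As a rough sketch, assuming the intl extension is loaded and using a hypothetical sortAccents() helper, it can be switched on for the en locale to compare both orders:

```php
<?php
// Sketch: toggle backward secondary sorting on an English collator.
// sortAccents() is a hypothetical helper for illustration only.
function sortAccents( array $words, bool $backwards ): array
{
    $collator = new Collator( 'en' );
    // FRENCH_COLLATION makes accents compare from the end of the string
    $collator->setAttribute(
        Collator::FRENCH_COLLATION,
        $backwards ? Collator::ON : Collator::OFF
    );
    $collator->sort( $words );
    return $words;
}

echo implode( ' ', sortAccents( [ 'côte', 'coté' ], false ) ), "\n";
echo implode( ' ', sortAccents( [ 'côte', 'coté' ], true ) ), "\n";
```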

Interesting Locales

There are a few other interesting sorting and matching methods in different locales.

  • In Germany's phone book collation, the ö in böhm sorts like an oe.

  • In Russian, the Cyrillic letters sort before Latin letters.

  • In Sweden's "standard" collation, the v and w are considered equivalent letters.

As an example:

$demo->dropCollection( 'dictionary' );

$dictionary->insertMany( [
    [ 'word' => 'swag' ],
    [ 'word' => 'Boden' ],
    [ 'word' => 'böse' ],
    [ 'word' => 'Bogen' ],
    [ 'word' => 'sverre' ],
    [ 'word' => 'Валенти́на' ],
    [ 'word' => 'Ю́рий' ],
] );

$locales = [
    'de',
    'de@collation=phonebook',
    'ru',
    'sv@collation=standard',
];

foreach( $locales as $locale )
{
    showResults(
        "Sorting with the '$locale' locale",
        $dictionary->find(
            [],
            [
                'collation' => [ 'locale' => $locale, 'strength' => 2 ],
                'sort' => [ 'word' => 1 ]
            ]
        )
    );
}

Which outputs:

Sorting with the 'de' locale:
Boden Bogen böse sverre swag Валенти́на Ю́рий

Sorting with the 'de@collation=phonebook' locale:
Boden böse Bogen sverre swag Валенти́на Ю́рий

Sorting with the 'ru' locale:
Валенти́на Ю́рий Boden Bogen böse sverre swag

Sorting with the 'sv@collation=standard' locale:
Boden Bogen böse swag sverre Валенти́на Ю́рий

Please also note that I had to set strength to 2 here, as Germans like capitalizing their nouns as well as names!

Other Options

The default strength is 3, which besides base character and accents, also takes the case into account. A search for beer will no longer find Beer (☹).

But there are a few other things you can configure with locales. If you paid attention, you saw that my word list includes id_4 and id_12. If you sort this in the normal default order, you will see the following:

showResults(
    "Sorting with numbers in strings",
    $dictionary->find(
        [ 'word' => new \MongoDB\BSON\Regex( '^id_' ) ],
        [ 'sort' => [ 'word' => 1 ] ]
    )
);

Which outputs:

Sorting with numbers in strings:
id_12 id_4

In order to fix that, you can set the numericOrdering option on the locale, as is done here:

showResults(
    "Sorting with numbers in strings, properly",
    $dictionary->find(
        [ 'word' => new \MongoDB\BSON\Regex( '^id_' ) ],
        [
            'collation' => [ 'locale' => 'en', 'numericOrdering' => true ],
            'sort' => [ 'word' => 1 ],
        ]
    )
);

Which then outputs:

Sorting with numbers in strings, properly:
id_4 id_12
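PHP's Collator class has a matching attribute, Collator::NUMERIC_COLLATION, which makes it possible to try this without a database. The following is a sketch only, assuming the intl extension; sortNumeric() is a hypothetical helper name:

```php
<?php
// Sketch: sort strings containing numbers with and without numeric ordering.
// sortNumeric() is a hypothetical helper for illustration only.
function sortNumeric( array $words, bool $numericOrdering ): array
{
    $collator = new Collator( 'en' );
    // With NUMERIC_COLLATION on, runs of digits compare by numeric value
    $collator->setAttribute(
        Collator::NUMERIC_COLLATION,
        $numericOrdering ? Collator::ON : Collator::OFF
    );
    $collator->sort( $words );
    return $words;
}

echo implode( ' ', sortNumeric( [ 'id_4', 'id_12' ], false ) ), "\n";
echo implode( ' ', sortNumeric( [ 'id_12', 'id_4' ], true ) ), "\n";
```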

Other options are also available, and are documented in the Collation section of the MongoDB manual.

Conclusion

Languages and language sorting are complex. In the examples above I have only shown collations with Western Latin and Cyrillic characters. Asian languages make searching and sorting even more complicated. With Japanese and Chinese characters, for example, there are different ways of determining the sort order. But getting the sorting of strings and the matching of search phrases right is very important for the usability of applications. And because of that, the implementation of SERVER-1920 is a very welcome addition to MongoDB. The implementation in MongoDB supports every locale and variant that ICU supports. A list of these locales with their identifier can be found in the documentation.

Further work on collation support is also expected. To track issues and vote for them, please refer to the list on JIRA.

Shortlink

This article has a short URL available: https://drck.me/mdbcoll34-cqh

Comments

This is very insightful, thanks for taking the time to write this!

Thanks for the insightful info presented here. I have learnt something new today because of this blog article. Thanks again. Will be delighted to read more.

Not Finding the Symbols

Yesterday we released the new version of the MongoDB Driver for PHP, to coincide with the release of MongoDB 3.4. Not long after that, we received an issue through GitHub titled "Undefined Symbol php_json_serializable_ce in Unknown on Line 0".

TL;DR: Load the JSON extension before the MongoDB extension.

The newly released version of the driver has support for PHP's json_encode() through the JsonSerializable interface, to convert some of our internal BSON types (think MongoDB\BSON\Binary and MongoDB\BSON\UTCDateTime) directly to JSON. For this it uses functionality in PHP's JSON extension, and with that the php_json_serializable_ce symbol that this extension defines.

We run our test suite on many different distributions, but (nearly) always with our own compiled PHP binaries as we need to support so many versions of PHP (5.4-5.6, 7.0, and now 7.1), in various configurations (ZTS, or not; 32-bit or 64-bit). It hence came as quite a surprise that a self-compiled extension would not load for one of our users.

When compiling PHP from its source, by default the JSON extension becomes part of the binary. This means that the JSON extension, and the symbols it implements, are always available. Linux distributions often split out each extension into their own package or shared object. Debian has php5-json (on which php5-cli depends), while Fedora has php-json. In order to make use of the JSON extension, you therefore need to install a separate package that provides the shared object (json.so) and a configuration file. Fedora installs the 20-json.ini file in /etc/php.d/. Debian installs the 20-json.ini file in /etc/php5/mods-available with a symlink to /etc/php5/cli/conf.d/20-json.ini. In both cases, they include the required extension=json.so line that instructs PHP to load the shared object and make its symbols (and PHP functions) available.

A normal PHP binary uses the dlopen function to load a shared object, with the RTLD_LAZY flag. This flag means that symbols (such as php_json_serializable_ce) are only resolved lazily, when they are first used. This is important, because PHP extensions, and the shared objects they live in, can depend on each other. The MongoDB extension depends on date, spl and json. After PHP has loaded all the shared extensions, it registers the classes and functions contained in them, in an order that satisfies this dependency graph. PHP makes sure that the classes and functions in the JSON extension are registered before the MongoDB extension, so that when the latter uses the php_json_serializable_ce symbol to declare that the MongoDB\BSON\UTCDateTime class implements the JsonSerializable interface, the symbol is already available.

Distributions often want to harden their provided packages with additional security features. For that, they compile binaries with additional features and flags.

Debian patches PHP to replace the RTLD_LAZY flag with RTLD_NOW. Instead of resolving symbols when they are first used, this tells dlopen to resolve the symbols when the shared object is loaded. This means that if the MongoDB extension is loaded before the JSON extension, the symbols are not available yet, and the linker throws the "Undefined Symbol php_json_serializable_ce in Unknown on Line 0" error from our bug report. This is not a problem that is only related to PHP; TCL has similar issues for example.

With Fedora, the same issue is present, but shows through slightly different means. Instead of patching PHP to replace RTLD_LAZY with RTLD_NOW, it uses linker flags ("-Wl,-z,relro,-z,now") to force binaries to resolve symbols as soon as they are loaded process wide. This Built with BIND_NOW security feature goes hand in hand with Built with RELRO. The explanation on why these features are enabled on Fedora is well described on their wiki. Previously, this did expose an issue with an internal PHP API regarding creating a DateTime object.

But where does this leave us? The solution is fairly simple: You need to make sure that the JSON extension's shared object is loaded before the MongoDB extension's shared object. PECL's pecl install suggests adding the extension=mongodb.so line to the end of php.ini. Instead, on Debian, it would be much better to put the extension=mongodb.so line in a separate mongodb.ini file under /etc/php5/mods-available with a priority of 99, which php5enmod then symlinks as /etc/php5/cli/conf.d/99-mongodb.ini and /etc/php5/apache2/conf.d/99-mongodb.ini:

cat << EOF > /etc/php5/mods-available/mongodb.ini
; priority=99
extension=mongodb.so
EOF
php5enmod mongodb

On Fedora, you should add the extension=mongodb.so line to the new file /etc/php.d/50-mongodb.ini:

echo "extension=mongodb.so" > /etc/php.d/50-mongodb.ini
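To check whether the configuration files end up in the right order, you could use a small script such as the following. It is only a sketch: loadsBefore() is a hypothetical helper, and it assumes that PHP parses the files in its scan directory in alphabetical order, which is what the numeric prefixes on these files rely on. It does not look at extension lines in php.ini itself.

```php
<?php
// Sketch: check whether one extension's ini file is parsed before another's.
// loadsBefore() is a hypothetical helper, not part of PHP or the driver.
function loadsBefore( array $iniFiles, string $first, string $second ): bool
{
    // Assumption: files in the scan directory are parsed in alphabetical order
    $names = array_map( 'basename', $iniFiles );
    sort( $names, SORT_STRING );

    $positions = [];
    foreach ( $names as $position => $name )
    {
        foreach ( [ $first, $second ] as $extension )
        {
            // Matches both "json.ini" and prefixed names such as "20-json.ini"
            $pattern = '/(^|-)' . preg_quote( $extension, '/' ) . '\.ini$/';
            if ( !isset( $positions[$extension] ) && preg_match( $pattern, $name ) )
            {
                $positions[$extension] = $position;
            }
        }
    }

    return isset( $positions[$first], $positions[$second] )
        && $positions[$first] < $positions[$second];
}

// php_ini_scanned_files() returns a comma-separated list, or false if there are none
$scanned = php_ini_scanned_files();
$files = $scanned ? array_map( 'trim', explode( ',', $scanned ) ) : [];
var_dump( loadsBefore( $files, 'json', 'mongodb' ) );
```

If this prints false, either one of the two files is missing from the scan directory, or the MongoDB file sorts before the JSON one.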

Alternatively, you can install the distribution's package for the MongoDB extension. Fedora currently has the updated 1.2.0 release for Rawhide (Fedora 26). Debian, however, does not yet provide a package for the latest release, although an older version (1.1.7) is available in Debian unstable. At the time of this writing, Ubuntu only provides older versions for Xenial and Yakkety.

Shortlink

This article has a short URL available: https://drck.me/undefsym-cpv

Comments

Adding "extension=mongodb.so" at the end of php.ini didn't solve the issue but 99-mongodb.ini did! Thanks! BTW $ uname -a Linux titanic 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

Made new file mongodb.ini

added 2 lines ; priority=99 extension=mongodb.so

& run command phpenmod mongodb

This worked for me. I had to remove extension=mongodb.so from php.ini file

On Ubuntu 14 & php 7 ( Working on Vagrant )

Thanks for the info, I tried just sticking extension=json.so one line above in the php.ini and that worked for me

I was using a standard AWS image. And the following worked for me: # Don't add extension to /etc/php-5.6.ini echo "extension=mongodb.so" > /etc/php.d/50-mongodb.ini echo "extension=mongo.so" > /etc/php.d/50-mongo.ini

Walking the Capital Ring - Section 15

Section 15

In this last section, we walked around London City airport. Well, sorta. Starting off at Beckton Park, and its Jake Russell Walk, we started the last section of the Capital Ring. After Beckton Park, we walked through New Beckton Park as well, before coming to Cyprus DLR station.

The Capital Ring as mapped on OpenStreetMap diverged from the route on the TFL site, and from the signage on the ground. Instead of following the Royal Albert Way, it now goes through the University of East London and along the Royal Albert Dock for a while. I can imagine this being a little nicer than a dual carriageway.

At the end of the path along the dock, we had a little stint through an industrial estate, with a radar mast in sight at the end. The radar mast was straight on the Thames, and that point ended up being the furthest East on the Capital Ring: Galleons Point. The tide on the Thames was low, exposing loads of shopping trolleys and other rubbish.

From there, we crossed two sets of locks. A small one, and the much larger King George the 5th lock. Galleons Point is another new development. Now along the Thames, we passed by the Royal Victoria Gardens and ended up at North Woolwich.

The old station at North Woolwich at one time housed a museum, but that is now closed. Close behind it are Crossrail works, where a tunnel under the Thames for the new railway starts.

At this point, we could have taken the ferry across to Woolwich, and the end of the Capital Ring. Instead, we decided to walk through the Woolwich Foot Tunnel. This and the Greenwich Foot Tunnel are the only two pedestrian-only tunnels under the Thames, built in the early 1900s. Apparently, photography is forbidden, but I only found that out while writing up this blog post!

At the other end of the tunnel, we walked the last 50 meters to the end of the section, and with that, the Capital Ring!

Route (with GPX): Waymarked Trails
Time: 1h 12m 14s
Distance: 5.88 km
Average Heart Rate: 105 bpm
Calories Burned: 659 cal

For the full photo series, see my Flickr set.

Shortlink

This article has a short URL available: https://drck.me/cr15-cms


Walking the Capital Ring - Section 14

Section 14

Upon setting off on the penultimate section, we quickly looped around the Olympic Stadium, now the London Stadium, home of West Ham. There is a lot of new landscaping around here that's not quite finished yet. We left the Lee soon enough, and walked towards Stratford High Street. There was Crossrail work near Pudding Mill Lane station, where the Ring was diverted along an industrial estate. This was badly signposted. I've updated OpenStreetMap with the current route, but it will have to be redone once the works are over. Once they are, it should be easy enough to navigate from Victoria Walk to the Greenway.

In our case, we had to spend a little bit of time to find the Greenway, but once we got there it was a very easy route. In fact, for 80% of this section, you walk in a straight line over the Greenway.

Near the start of the Greenway, the walk goes past the Abbey Mills Pumping Station, which has been pumping sewage around since the 1860s. As a matter of fact, the Greenway that we were walking on, is actually a long footpath and cycleway on top of the Northern Outfall Sewer. This then also explained the nice fragrances along this stretch of the route...

It was a long and monotonous walk along the Greenway, and we were pleased once we left it and could walk the last bit of this section through Beckton District Park. When coming out of the park, we had concluded this section. One more to go!

Route (with GPX): Waymarked Trails
Time: 1h 37m 38s
Distance: 8.24 km
Average Heart Rate: 108 bpm
Calories Burned: 836 cal

For the full photo series, see my Flickr set.

Shortlink

This article has a short URL available: https://drck.me/cr14-cms

