Natural Language Sorting with MongoDB

Arranging English words in order is simple (well, most of the time): you arrange them alphabetically. However, sorting a set of German words, or French words with all their accents, or Chinese with its different characters, is a lot harder than it looks. Sorting rules are specified through "locales", which determine how accents are sorted, what order the letters come in, and how to do case-insensitive sorts. There is a good set of those sorting rules available through CLDR, and there is a neat example to play with all kinds of sorting at ICU's demo site. If you really want to know how the algorithms work, have a look at the Unicode Consortium's report on the Unicode Collation Algorithm.

Right now, MongoDB does not support indexes or sorting on anything but Unicode code points. Basically, that means that it can't sort anything but English. There is a long-standing issue, SERVER-1920, that is at the top of the priority list, but has not been scheduled for a future release. I expect this to be addressed at some point in the near future. However, with some tricks there is a way to solve the sorting problem manually.
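To see what sorting by code point means in practice, here is a minimal PHP sketch. It does not talk to MongoDB at all; it only assumes that the words are UTF-8 encoded, in which case a byte-wise string sort gives the same order as a sort on code points:

```php
<?php
// Byte-wise sort of UTF-8 strings; for UTF-8 this matches code point order.
$words = ['böhm', 'brown', 'boffey', 'bailey'];
sort($words, SORT_STRING);

echo implode(' ', $words), "\n"; // bailey boffey brown böhm
```

böhm ends up after brown because ö (U+00F6) has a higher code point than any unaccented Latin letter, which is not what an English or German reader would expect.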

Many languages have their own implementation of the Unicode Collation Algorithm, often implemented through ICU. PHP has an ICU-based implementation as part of the intl extension; the class to use is the Collator class.

The Collator class encapsulates the Collation Algorithm to allow you to sort an array of text yourself, but it also allows you to extract the "sort key". By storing this generated sort key in a separate field in MongoDB, we can sort by locale, and even by multiple locales.

Take for example the following array of words:

$words = [
        'bailey', 'boffey', 'böhm', 'brown', 'серге́й', 'сергій', 'swag',
        'svere'
];

Which we can turn into sort keys with a simple PHP script like:

$collator = new Collator( 'en' );
foreach ( $words as $word )
{
        $sortKey = $collator->getSortKey( $word );
        echo $word, ': ', bin2hex( $sortKey ), "\n";
}

We create a collator object for the en locale, which is generic English. When running the script, the output is (after formatting):

bailey: 2927373d2f57010a010a
boffey: 294331312f57010a010a
böhm:   2943353f01859d060109
brown:  294943534101090109
серге́й: 5cba34b41a346601828d05010b
сергій: 5cba34b41a6066010a010a
swag:   4b53273301080108
svere:  4b512f492f01090109

Those sort keys can be used to then sort the array of names. In PHP, that would be:

$collator->sort( $words );
print_r( $words );

Which returns the following list:

[0] => bailey
[1] => boffey
[2] => böhm
[3] => brown
[4] => svere
[5] => swag
[6] => серге́й
[7] => сергій
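The same ordering can also be produced from the sort keys alone, which is the property the MongoDB approach relies on. The following is a minimal sketch, assuming the intl extension is loaded; sortBySortKey() is a hypothetical helper name, not part of intl:

```php
<?php
if (!extension_loaded('intl')) {
    exit("This sketch needs the intl extension.\n");
}

// Hypothetical helper: order words by comparing their binary sort keys.
function sortBySortKey(Collator $collator, array $words)
{
    usort($words, function ($a, $b) use ($collator) {
        // Sort keys are binary strings meant to be compared byte by byte.
        return strcmp($collator->getSortKey($a), $collator->getSortKey($b));
    });
    return $words;
}

$collator = new Collator('en');
$sorted = sortBySortKey($collator, ['brown', 'böhm', 'boffey', 'bailey']);
print_r($sorted);
```

For the en locale this should give the same relative order as the Collator::sort() call above, because comparing two sort keys byte by byte is meant to be equivalent to comparing the original strings with the collator.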

We can extend this script to use multiple collations, and import each word, including its sort keys, into MongoDB.

Below, we define the words we want to sort on, and the collations we want to compare. They are, in order: English, German with phone book sorting, Norwegian, Russian and two forms of Swedish, "standard" and "default":

<?php
$words = [
        'bailey', 'boffey', 'böhm', 'brown', 'серге́й', 'сергій',
        'swag', 'svere'
];
$collations = [
        'en', 'de_DE@collation=phonebook', 'no', 'ru',
        'sv@collation=standard', 'sv@collation=default',
];

Make the connection to MongoDB and clean out the collection:

$m = new MongoClient;
$d = $m->demo;
$c = $d->collate;
$c->drop();

Create an index and a Collator object for each of our collations:

$collators = [];

foreach ( $collations as $collation )
{
        $c->createIndex( [ $collation => 1 ] );
        $collators[$collation] = new Collator( $collation );
}

Loop over all the words and, for each collation we have defined, use the created Collator object to generate the sort key. We encode the sort key with bin2hex() because sort keys are binary data, and MongoDB requires UTF-8 for strings. My original plan of using MongoDB's BinData type did not work, as it sorts first according to the length of the data. Encoding with base64_encode() also does not work, as its encoding scheme does not keep the original order. Encoding with utf8_encode() does work, but as it creates some binary (but valid-for-MongoDB-UTF-8) data, it's not good to use as an example.

foreach ( $words as $word )
{
        $doc = [ 'word' => $word ];
        foreach ( $collations as $collation )
        {
                $sortKey = $collators[$collation]->getSortKey( $word );
                $doc[$collation] = bin2hex( $sortKey );
        }
        $c->insert( $doc );
}
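The choice of bin2hex() over base64_encode() comes down to whether the encoded strings still compare in the same order as the raw bytes. A small sketch with two made-up one-byte "keys" (they are not real ICU sort keys) shows the difference:

```php
<?php
// Two made-up one-byte keys; $low sorts before $high as raw bytes.
$low  = "\x00";
$high = "\xD0";

$rawOrder = strcmp($low, $high) < 0;
// Hex: "00" vs "d0", the order is kept.
$hexOrder = strcmp(bin2hex($low), bin2hex($high)) < 0;
// Base64: "AA==" vs "0A==", and "0" sorts before "A", so the order flips.
$b64Order = strcmp(base64_encode($low), base64_encode($high)) < 0;

var_dump($rawOrder, $hexOrder, $b64Order); // true, true, false
```

Hex digits map every byte to two characters whose ASCII order follows the byte value, whereas the base64 alphabet (A-Z, a-z, 0-9, +, /) is not in ASCII order.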

When we run the script, and see what's in the database, we find something like the following for böhm:

> db.collate.find( { word: 'böhm' }).pretty();
{
        "_id" : ObjectId("53fc721844670a35498b4569"),
        "word" : "böhm",
        "en" : "2943353f01859d060109",
        "de_DE@collation=phonebook" : "29432f353f0186870701848f06",
        "no" : "295aa105353f018687060108",
        "ru" : "2b45374101859d060109",
        "sv@collation=standard" : "295aa106353f01080108",
        "sv@collation=default" : "295aa106353f01080108"
}

To see the sorting for the words in all the locales, I've added the following to the end of the script:

foreach ( $collations as $collation )
{
        echo $collation, ":\n";

        $r = $c->find()->sort( [ $collation => 1 ] );
        foreach ( $r as $res )
        {
                echo $res['word'], ' ';
        }

        echo "\n\n";
}

As you can see, we call sort() and specify which field to sort on. The $collation variable contains the name of the collation. In each stored document, the field with the name of the collation stores the sort key for that collation, as you saw in the previous MongoDB shell output.

Running with this part of the code added, we get:

en:
bailey boffey böhm brown svere swag серге́й сергій

de_DE@collation=phonebook:
bailey böhm boffey brown svere swag серге́й сергій

no:
bailey boffey brown böhm svere swag серге́й сергій

ru:
серге́й сергій bailey boffey böhm brown svere swag

sv@collation=standard:
bailey boffey brown böhm swag svere серге́й сергій

sv@collation=default:
bailey boffey brown böhm svere swag серге́й сергій

  • In English, the ö in böhm sorts as an o.

  • In Germany's phone book collation, the ö in böhm sorts like an oe.

  • In Norwegian, the ö in böhm sorts as an extra letter after z.

  • In Russian, the Cyrillic letters sort before Latin letters.

  • In Sweden's "standard" collation, the v and w are considered equivalent letters.

By generating a sort key for your data, you get to choose with which locale MongoDB will do the sorting, but with the overhead of having to maintain an index yourself. ICU, the library that lies underneath PHP's intl extension, supports a lot more customisations for collators, and even allows you to define your own custom rules. In the future, we will likely see some of this functionality make it into MongoDB as well. Until this is implemented, generating your own sort-key field for each document, as this article shows, is your best MongoDB-only approach. If you find collation sorting in MongoDB important, feel free to vote on the SERVER-1920 issue in Jira.

Shortlink

This article has a short URL available: http://drck.me/mdbcoll-b2p

Comments

Nice tip. I just wanted to mention that even for English you have to use a technique like this to sort properly: résumé is sorted after rope if you use a binary sorting method. Sorting people's names is another common requirement in English-only software where you need UCA collation to sort properly.

On Backwards Compatibility and not Being Evil

This is a repost of an email I sent to PHP internals as a reply to:

And since you're targetting[sic] the next major release, BC isn't an issue.

This sort of blanket statement that "Backwards Compatibility is not an issue" with a new major version is extremely unwarranted. Extreme care should be taken when deciding to break Backwards Compatibility. It should not be "oh we have a major new version so we can break all the things"™.

There are two main types of breaking Backwards Compatibility:

  1. The obvious case where running things through php -l instantly tells you your code no longer works. Bugs like the two default cases fall in this category. I have no problem with this, as it's very easy to spot the difference (in the case of allowing multiple "default" cases, it's a fricking bug fix too).

  2. Subtle changes in how PHP behaves. There is a large number of those things currently under discussion. There is the nearly undetectable change of the "Uniform Variable Syntax", that I already wrote about, the current discussion on "Binary String Comparison", and now changing the behaviour of << and >> in a subtle way. These changes are not okay, because they are nearly impossible to test for.

    Changes that are so difficult to detect mean that our users need to re-audit and check their whole code base. It makes people not want to upgrade to a newer version, as there would be more overhead than gains. Heck, even changing the $ in front of variables to £ is a better change, as it's immediately apparent that stuff changed. And you can't get away with "But Symfony and ZendFramework don't use this" either, as there is so much code out there.

As I said, the first type isn't much of a problem, as it's easy to find what causes such a Backwards Compatibility break, but the second type is what causes our users an enormous amount of frustration. That then results in a much slower adoption rate, if they bother upgrading at all. Computer Science "purity" reasons to "make things better" have little to no meaning for PHP, as it's clearly not designed in the first place.

Can I please urge people to not take Backwards Compatibility issues so lightly. Please think really carefully when you suggest breaking Backwards Compatibility; it should only be considered if there is a real and important reason to do so. Changing binary comparison is not one of those, changing behaviour for everybody regarding << and >> is not one of those, and subtle changes to what syntax means is certainly not one of them.

Don't be Evil

Shortlink

This article has a short URL available: http://drck.me/onbc-b2g

Comments

Also mind that breaking BC sends messages to two kinds of users.

Existing users are being told "migration is work, better stay on the old one, or maybe migrate to a different platform"

For users evaluating the platform it tells "you can't rely on your stuff working for the next years, better go somewhere else to make sure your investment isn't in a dead end"

As C++ founder Bjarne Stroustrup says "Compatibility is a feature"

On the other hand, comfort in programming matters too. These days web developers tend to have experience on multiple platforms, and they compare them against each other. If PHP due to its lack of design and weirdness is less comfortable to use than other platforms, this will over time cause a brain drain, especially at the most senior level where the people who make the frameworks and libraries are situated.

That's why you have to be willing to make subtle changes if they increase developer comfort, even if they cause a brief amount of developer pain to migrate existing code. The fact that PHP wasn't designed should not prevent it from evolving towards a design. Purity matters if it reduces grief during programming.

As in all things, it is a balance.

" It should not be "oh we have a major new version so we can break all the things"™"

Yes it should, absolutely and completely.

Why? Because PHP must grow up. It is full of inconsistencies that hinder development and cause the weirdest of bugs. Speed is not the only reason why HACK is gaining popularity, the strict typing alone makes it a hundred times better than PHP.

I would love it if PHP7 would have a consistent syntax, with consistent error handling, predictable argument order, etc. And if that means breaking BC completely, then so be it. Porting to PHP7 would be a breeze if all errors are reported by exceptions instead of "notice" messages in a log somewhere, that you enable or disable, preferably at runtime, based on the hostname of the server...

Vincent, true, PHP has flaws, in the language design as well as in the standard library, and it would be nice if those were fixed. But fixing those destroys the whole ecosystem. All libraries, IDEs and tools have to be changed. And not only that: all developers also have to relearn things, as you end up with a different language. See Perl 6 or Python 3 for how such attempts fail. We can only do that in small steps. See the removal of register_globals: that was obviously a thing we had to fix to have people write more secure code, and a feature for which work-arounds to migrate code were available, but it took us 10 years (well, 9 years and 10 months), from providing users better tools ($_GET/POST) and advising them to migrate, until we could finally remove that option. We're no longer living in the days where Rasmus could ssh into every server and upgrade code on a BC break; we are dealing with massive investments by thousands of developers.

Yes, please break PHP! I'd love to have better arguments to sell a migration from PHP to a real programming language to my client and there wouldn't be a better argument than seeing PHP killing itself.

I agree 100% with Derick and disagree 100% with everybody else (except Johannes).

Breaking BC should ONLY be done when it is absolutely necessary, such as to eliminate security issues or to fix bugs, and NEVER on a whim such as "to make it more consistent" or "to conform to other languages" or, the worst of all, “because I think it ought to be done this way”.

There is absolutely no reason why code which was originally written for PHP4 over a decade ago should not run today in the latest version of PHP5. While it is permissible to enhance the language with new features, all the existing features should remain untouched and carry on working as before. In this way all the thousands of existing ISPs and millions of websites with their millions of developers will have (or should have) complete confidence that they can upgrade their copy of PHP without the fear that their previously working code will suddenly stop working. If you change the language so drastically that existing applications fail to work then you will have in fact created a different language, just like Perl 6 was different from Perl 5 and VB.NET was different from VB6.

As PHP is an open source language there is nothing stopping you from forking the language into something different, such as “ProperPHP” so that you can rewrite the whole syntax to conform to your ideas of “purity” and “perfection”. You could take this fork into any direction you desire, but when you ask the millions of existing developers to follow you, I can guarantee that their answer will be along the lines of “Fork Off!”

To those of you who say that PHP is a crap language that should be rewritten to “proper” standards I have this to say: If PHP is such crap then why do millions of people choose it as their development language? It is not forced on them, so they are free to choose whatever language they want. The fact that millions of people use PHP as their language of choice, and millions of websites have been written in PHP proves that it is not as crappy as you think. I have been programming for over 35 years and I have used a wide variety of 2nd, 3rd and 4th generation languages, and I can say quite categorically that every language has its share of good points as well as bad points, its fans and its detractors. If you don’t like PHP then all I can say is shut up and use a different language.

Sure, there are lots of examples of crap programs written in PHP, but that is down to the shortcomings of the developers, not the language itself. You cannot point to a bad PHP program and say "That PHP program is crap, therefore all PHP programs are crap" just as you cannot say "That rock is chalk, therefore all rocks are chalk". A crappy developer will always write crappy code irrespective of the language. There is nothing in PHP that prevents a competent developer from writing efficient and effective software, so in that respect it is an excellent language. If YOU cannot write effective programs in PHP then perhaps it is YOUR programming abilities which should be questioned.

@ Joeri Sebrechts – you said "you have to be willing to make subtle changes if they increase developer comfort, even if they cause a brief amount of developer pain to migrate existing code." I’m afraid that you couldn't be more wrong. Changing the language to please new developers at the expense of the millions of existing developers is simply not acceptable. If any new developer is uncomfortable with PHP then he/she should move on to a different language. Alienating thousands of existing developers just to please a single new developer would be one way of killing the language completely.

@ Vincent: you said "PHP must grow up. It is full of inconsistencies that hinder development and cause the weirdest of bugs." The inconsistencies have not hindered millions of programmers from developing millions of programs, and 99.99% of the bugs out there are caused by bad developers and not bugs in the language itself.

You also said "porting to PHP7 would be a breeze if all errors are reported by exceptions". Where do you get such ridiculous ideas? There is nothing wrong with PHP’s existing error handling mechanism, and forcing everyone to switch to exceptions will make you more enemies than friends. You are aware, of course, that exceptions are just one way of dealing with errors and not the only way?

@ Thomas Koch: you said "I'd love to have better arguments to sell a migration from PHP to a real programming language" which means that your definition of "programming language" is seriously flawed. PHP most definitely IS a programming language, just like COBOL, Perl, Java, Python and Ruby.

You also said: "there wouldn't be a better argument than seeing PHP killing itself." This is the only intelligent thing you said as it is quite obvious that making unnecessary changes to the language which causes existing code to break will have the effect of killing the language instead of improving it.

I won't waste Derick's time moderating a reply to what is essentially just a rant, but I do want to say this:

" and 99.99% of the bugs out there are caused by bad developers and not bugs in the language itself."

A language that requires a good programmer to make it work is itself worthless.

""You also said "porting to PHP7 would be a breeze if all errors are reported by exceptions". Where do you get such ridiculous ideas?"

There is plenty wrong with PHP's error handling, and the fact that there is more than one way to report an error is an example of that. In languages such as Java, C++ and Python I can put one big try/catch around any piece of code and I can be sure that I will catch any errors that the code generates. In PHP I can use a try/catch, but I also have to set error_reporting to maximum and write some errorhandler that can actually turn an error into a fatal error if I want to catch undefined etc. If I'm really lucky, I'll load a library that is written by some "expert" who sets error_reporting back to zero because he thought that was a neat idea, or who writes his own errorhandler so mine is skipped altogether.

Been there, cursed at that, blamed PHP for making it possible.

@ Vincent: "A language that requires a good programmer to make it work is itself worthless."

The language "works" as it is. It is not so bad that even competent programmers find it impossible to write effective software. By your definition, if a programmer can write a program in language X and that program has errors in it, then it is the fault of the language! Really?

I have been writing software with PHP since 2002, and I have absolutely NO difficulty in writing effective programs, and if I can do it then anyone can. If YOU can't, then perhaps the problem lies with you and not the language.

"There is plenty wrong with PHP's error handling, and the fact that there are more than one way to report an error is an example of that."

I repeat, I have found nothing wrong with PHP's error handling abilities as I have been using the set_error_handler() function to trap and deal with all errors, warnings and notices since 2002. The fact that PHP doesn't throw exceptions for everything by default is irrelevant. If you are incapable of dealing with errors which are not thrown as exceptions then it signifies a lack of flexibility on your part. Personally I don't like exceptions and avoid them like the plague simply because they are WORSE than PHP's native error handling abilities.

"I have been writing software with PHP since 2002"

Hmm... I've been doing it since 1999, and in other languages since somewhere in the late 80's, so what does that mean exactly?

" and I have absolutely NO difficulty in writing effective programs, and if I can do it then anyone can. If YOU can't, then perhaps the problem lies with you and not the language."

Personal attacks indicate a lack of proper arguments.

The point of programming languages is not just that you can write programs; they make writing programs easier and faster, and the end result reliable and predictable. PHP's loose typing, inconsistent arguments and error reporting just make it more difficult. Sure, you can work around all that, but that seems a waste of time to me. For example: if I write a method that expects an integer, then I should not have to waste my time writing code that checks the content of the parameter to be an integer, especially if the language does allow that for classes.

"Personally I don't like exceptions and avoid them like the plague simply because they are WORSE than PHP's native error handling abilities."

I don't know a polite way to express how silly that statement is, so I'll just leave it alone, I'm not even going to ask you to explain :-)

Walking the London LOOP - part 16

Two months, getting married and a honeymoon later, it was finally time to continue walking the loop. Section 16 is the longest at 11 miles so we didn't double it up as we have done before.

The section starts at Elstree. Getting there would normally have meant two stops on the Overground to West Hampstead and then a Thameslink train, but it being Sunday, that didn't quite work. Instead we had to catch a Rail Replacement Bus to West Hampstead, which luckily turned up quickly.

After getting to Elstree and turning on the GPS, we set off. The first stretch was along a residential road that became steeper and steeper the further we got along it.

At the top we turned left and proceeded beside busy Barnet Lane, passing two odd-looking structures, which later turned out to be air vents for the Midland Main Line that we had just taken to get to Elstree.

loop16-d36_8831.jpg

After the hill, we continued and followed a path into a forest, Scratch Wood. Together with Moat Mount Open Space it forms a local nature reserve. However, to get to Moat Mount Open Space, we had to make a nasty detour past the A1/Barnet Way.

Moat Mount Open Space is also the start of the Dollis Valley Greenwalk, which you can follow all the way to Hampstead Heath. Wikipedia tells me that this is meant to be a link between the LOOP and the Capital Ring, one of our likely future walking projects.

loop16-d36_8855.jpg

In any case, the Greenwalk passes by some open spaces before crossing Hendon Wood Lane. Just after the lane we came upon a non-public path that is open all year round, except for February 28th (not quite sure what the reason for that is!). We met a friendly horse, and soon we started following Dollis Brook, first through farm land, but then soon through parks. At the end, just before hitting the Northern Line, we went north and stopped at the Old Red Lion for a refreshing pint.

After the pint we continued past the High Barnet tube depot before climbing up the hill through King George's Fields up to Hadley Common. This climb was a bit longer than we liked, but the weather was good, a pint had been drunk and the view was great. At the top we crossed the street on to Hadley Green. Here the houses look a lot larger and pricier than the houses down in Barnet.

loop16-d36_8875.jpg

The east side of the green features several mansions, and is also home to the Wilbraham Almshouses. After passing St. Mary the Virgin we crossed into Monken Hadley Common, also called Hadley Woods, for the remainder of the walk. After a quick detour to have a look at Jack's Lake we made it to Cockfosters, the end of the section.

The weather was mostly sunny with a few clouds at the start, and a few more at the finish. It was warm at 26°C and not nearly as humid as we feared. We took four hours and a bit for this section's 19.0km.

The photos that I took on this section, as well as the photos of the other sections of the LOOP, are available as a Flickr set.

Shortlink

This article has a short URL available: http://drck.me/loop16-b1x

Comments

No comments yet

No to a Uniform Variable Syntax

As you might have heard, PHP developers voted on an RFC called "Uniform Variable Syntax". This RFC "proposes the introduction of an internally consistent and complete variable syntax". In general, this RFC argues for making PHP's parser more complete for all sorts of variable dereferences. For example:

$foo()['bar']()
[$obj1, $obj2][0]->prop
$foo->bar()()
$foo::$bar::$baz

Thirty people voted for, and one against: me.

Does that mean that I am against a unified variable syntax? No, I am not. I am actually quite a fan of having a consistent language, but we need to be careful when this hits existing users.

The already accepted RFC also has some negative aspects, in the form of backwards compatibility (BC) breaks. For example (quoted from the RFC):

// syntax               // old meaning            // new meaning
$$foo['bar']['baz']     ${$foo['bar']['baz']}     ($$foo)['bar']['baz']
$foo->$bar['baz']       $foo->{$bar['baz']}       ($foo->$bar)['baz']
$foo->$bar['baz']()     $foo->{$bar['baz']}()     ($foo->$bar)['baz']()
Foo::$bar['baz']()      Foo::{$bar['baz']}()      (Foo::$bar)['baz']()

This basically says that the RFC author knows there are BC breaks, but chooses to ignore how this might annoy users.

Unlike keyword additions, or functions and/or settings being removed, this change in semantics is probably one of the worst BC breaks you can imagine. You can't really write a scanner for it, as the code could already have been converted. A tiny change like this, however, can create very hard-to-debug issues within existing code. And this is exactly why people whine that PHP breaks BC and does not care about its users. In many cases, breaking BC happens by accident, and I'm no stranger to breaking BC due to some oversight. Accidents like this are certainly annoying, but somewhat unavoidable, as we do not have test cases for everything.
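As an illustration of why this is hard to spot, here is a small sketch of how the two readings of $foo->$bar['baz'] from the table can be spelled out so that they mean the same thing before and after the change. The variable names and values are made up for this example and do not come from the RFC:

```php
<?php
// Made-up data for illustration only.
$foo = new stdClass;
$foo->name = 'old reading';
$foo->list = ['baz' => 'new reading'];

// Old meaning: look up $bar['baz'] first, then use it as the property name.
// The braces make that explicit, so it does not depend on the parser version.
$bar = ['baz' => 'name'];
$old = $foo->{$bar['baz']};

// New meaning: fetch the property named by $key first, then index into it.
// A temporary variable spells this out without relying on new syntax.
$key = 'list';
$tmp = $foo->$key;
$new = $tmp['baz'];

echo $old, "\n", $new, "\n";
```

Code written without the braces or the temporary variable looks identical under both interpretations, which is why a simple scanner cannot tell whether a given line has already been adapted.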

However, when you know for certain that you are going to break BC, there is no excuse. With such a marginal new "feature" as is outlined in this RFC, antagonising our users is not a good thing.

Shortlink

This article has a short URL available: http://drck.me/nobc-b11

Comments

Thank you.

I voted in favor of this RFC knowing full well that it represented a BC break. I was okay with this for one major reason:

The RFC was explicitly offered for inclusion in the next major (PHP6, PHP7, whatever people want to call it).

This is precisely what major versions are for: improvements that might break backwards compatibility in exchange for a better language. Is BC critically important for the success of any language? Absolutely. Is this break justifiable for a major version? In my opinion, absolutely.

If this RFC were offered for any minor version I would have been strenuously against it. Thankfully, that was not the case. Any user upgrading to a new major version with the expectation that all existing code will "just work" is destined for problems regardless. Such users must not be held out as justification for holding back language progress.

Just my two cents on the subject :)

Daniel,

Even for a new major version being able to break BC doesn't mean it's good. Changing BC adds costs to everybody in the environment. Developers having to review the code, IDE developers, tool developers, ...

This case might actually be bad - you have to be very careful in a review to find this and figure out whether the old or new way is expected.

As Bjarne Stroustrup says: "Compatibility is a feature." That holds for existing users, who can stay up to date easily, as well as for new users, who can be assured that their investment is safe. Randomly changing the language doesn't do that. Randomly changing the language tells users "be prepared to continuously review your code for breakage" instead of helping them solve their actual issues. Staying up to date should not be an issue for any user.

Well, Daniel, the thing is, the more BC breaks there are, the more legacy you will have afterwards. We have now had PHP5 for 10 years, and still there are a few PHP4 users left, because their stuff is not running on PHP5.

That's a bad thing.

And i'm sorry, but this is not "lets break some minor thing to get a huge benefit", and after just reading this RFC i have to disagree on the "low practical impact" of the BC break.

I agree with you, of course. Historical compatibility is one of PHP's best attributes and must not be discarded lightly. Any BC break must be weighed against the potential havoc it may create in existing codebases. Whether or not a change justifies potential breakage is in the purview of the voting process. The hard part is striking a balance between progress and compatibility. The voting process (in theory) brings the wisdom (hopefully) of the group to bear on this determination.

I don't think anyone wants to see PHP in a Python 3 situation, and for that, Derick's opinion here is 100% valid. Caution and patience are the best antidotes to introducing new WTFs going forward.

Sorry Derick, but I have to disagree with your vote.

I'm no expert in the core concepts of PHP; I'm speaking as a user, and as a user who in the past one or two years saw technologies like Hack, a fork of PHP, being applauded just for the fact that they did what PHP was afraid to do: break BC.

As Daniel Lowrey said, and I quote, "This is precisely what major versions are for". PHP should evolve, and the fact that some old users with legacy code don't have the courage to grow up and learn how the language has evolved in the last 10 years does not change the need for evolution that drives the more active community.

In fact I think this is less than expected, but it is already a start. I think many more BC breaks should and must be made in order to bring PHP into the new era, or we will miss the train (again).

And you don't need to look far to see that hunger for change: since these changes do not come from the language itself, they grow widely in the community. The examples are vast: Composer, Hack, HHVM, FIG and many others are all examples of changes being made on the margins of the language.

Please, as a user I ask: bring these changes and this thirst for change to the core of the language and stop being so conservative. Let's grow up all together.

Thanks

@Paulo do not confuse new technologies being talked about on the interwebs with their adoption rate. Do you know many people running their WordPress on HHVM today?

I was a proponent of the "go-php5" initiative, and I remember how it took a concerted effort of all the major players at the time to get adoption of version 5 kickstarted (which was not, mind you, the abandoning of version 4).

PHP is much bigger today, and BC breaks in syntax cannot happen unless there is agreement and adoption from:

  • frameworks

  • major apps (cms, etc)

  • operating systems bundling the new version

@gggeek yes, I do know many people running, or at least testing, WordPress applications, Zend Framework apps, Symfony apps, Laravel apps and many more on HHVM. In fact we did a hangout talking about just that. It is in Portuguese, but you can see it here: https://www.youtube.com/watch?v=3tGiK4hXDag

And the communities around the most widely used frameworks and applications are exactly the ones pushing forward to see new features and changes in PHP, and operating systems will just use the most recent version available on the launch date.

I know that any change is hard to adopt, but these changes are needed, and postponing them is not a solution: it just passes the problem along. So I would rather see these changes made now and at once than in installments.

Hi Derick!

Thank you for writing this post. I think you are right that BC is very important and we should pay attention and try to avoid breaking it as much as possible. But there are things in this post that I disagree with. I think the main thing I disagree with is that this is a marginal feature. I think the time has come in PHP's life when it's mature enough to try and fix some of the mistakes of its more youthful and careless years. If you look at the meanings table you quote, hide everything but the leftmost column, forget all BC and just ask yourself what you would naively expect this to mean, and then open the other two columns again, I think you will discover that in most cases the naive expectation matches the "new" column. Yes, for various reasons, both good and not so good, it was not so in PHP. But we can make it so, and a major version is a good opportunity to do it. A major version is expected to break some things. That does not mean it should, but it is an acceptable price if necessary.

I think it is a big service to our users, most current and future, to try and make their expectations match the reality. Yes, we will have to pay for it, with BC breakage and some temporary pain while moving to PHP.Next. We can ease this pain with tools, etc. but we know we will not be able to avoid it. But once the debt is paid, we'd have a much cleaner and more consistent language, and I think it is a big win. PHP.Next, which is being started now, is a major opportunity to get such fixes, an opportunity the like of which we probably won't have for another decade or so in PHP. Thus, in my opinion, using this opportunity to fix some of PHP's less appealing parts is not marginal. Compatibility is a feature, yes, but so are cleanness and ease of use. I think if we have to sacrifice a little of the former to gain a sizeable deal of the latter, it is a worthy investment.

PHP has always been a large collection of inconsistent behavior. Even after 14 years of the stuff I still find myself looking up the order of arguments in the manual.

Please please PLEASE vote for every single suggestion that makes it more consistent. Yes that will break BC, but PHP programmers waste a truckload of time trying to deal with the current inconsistencies. It is a lot easier to debug code that has been broken by a newer, more consistent behavior, than by the current inconsistencies.

There is no question that breaking BC, major version or not, can lose you users, and it is always a balancing act. At the same time, not breaking BC can also lose you users.

After more than 10 years of writing PHP and building my entire professional career on it, I left the community because inconsistencies like this made it impossible to write the kind of functional code that I wanted to write. Frankly, it didn't make sense to me that the language would behave in such an unexpected way, and admittedly I didn't know until recently that the cause was way more fundamental to the internals than I expected.

In any case, this RFC addressed what was easily my biggest complaint about PHP, and I think it is awesome.

Derick, in no way am I upset that you voted against this proposal, but I do hope you consider the benefits of a BC break as well as the hardships. As I said before, it's always a balancing act, but sometimes breaking compatibility is good for the greater community.

Hi Derick,

I really get your point here, and as an Obj-C developer I have been through a few ugly shifts.

One shift was going from manual reference counting to Automatic Reference Counting (ARC). Objective-C was sometimes pretty good at converting code, but other times not so hot, and you had to go fix it.

  1. I do think it would be possible to write a scanner to fix it, but the scanner would need to comment the fix so it does not fix it twice. I don't know what the comment would look like, but it would mark the function somehow.

  2. With Obj-C we always had a fallback: we could tell the compiler NOT to compile a list of files as ARC and keep using the old syntax for those files.

For PHP to remain backward compatible I would suggest that a file, namespace or directory could be marked as non-Uniform Variable Syntax, so it would be possible to continue to use the massive libraries that exist today while the app code uses the new syntax.

John.

Hi Derick, PHP dev here.

I completely agree with you that introducing BC-breaking changes can cause people problems with existing code.

But think of it another way.

PHP5 was released 10 years ago, but there are still people who write code or libraries in PHP4. Or take PHP5.3, which was released 5 years ago, but only recently have hosting providers started upgrading to newer PHP versions.

The point is: upgrading a version is always painful and not a one-week/month/year thing. People for whom the major version upgrade causes trouble will either stick to the current version or spend some time fixing the BC problems it caused.

And it's not only about PHP. There are plenty of libraries (no matter what language they're written in) that introduce changes that are not BC. We always have to either maintain our code to conform to them or leave the codebase at some snapshot that satisfies customer needs.

Just my 2 cents.

Hey Derick

I don't have my own 2 cents to throw in; I just want to say thanks for publicly explaining your opinion. Seeing so many for and one against on the vote made me very curious to know the reasoning.

Thanks

I really understand your point of view. The fact that BC breaks are taken lightly makes me a little bit sad.

In my last few companies, we were always years and years behind the cutting edge. Once your codebase grows to hundreds of thousands of lines of code you are stuck. Any breaking change is a disaster to deal with.

We have just finished migrating the rest of our platforms to PHP 5.3 this year and it was a gigantic effort. Doing it all over again with breaking changes? I am not sure if I even want to think of it.

Personally, I could not care less whether I can $foo()['bar']() or not. It has never been an issue for the past 10 years, and believe me, I was pretty busy writing code. If anything, I am a bit worried that people may start abusing it, making code an absolute mess.

Thanks for a nice post.

I absolutely can't agree with Derick. This BC break is really not so relevant. As a pretty long-time PHP developer I have never (attn: NEVER) used such ugly constructions in my code. I guess no major frameworks or quality projects will even notice this change.

One of PHP's problems is that it has too much unpredictable "magic". This RFC makes some part of it consistent and logical. If some shit-code relies on tricky behaviour and breaks from such a change, this shouldn't stop us from making the language better for use in real projects.

As I said, I never used such constructions and some examples from "old meaning" look absolutely illogical to me. If they really work as described right now, this SHOULD be changed.

For example: just looking at the code "Foo::$bar['baz']" I can guess this means "get class Foo, find static field $bar in it and get the value with index baz from that array". This is the way it works in other languages, and I have no idea why this should by default mean "get the value from $bar['baz'] and the static field with that name from class Foo". If someone wants that or some other behaviour, they can use brackets to group identifiers as they want.

Dead Code

Frequently I've been asked why Xdebug sees "dead code" in places where people don't expect it. Most often this is related to PHPUnit's Code Coverage in the following situations:

1:  <?php
2:  function foo()
3:  {
4:      if ( false )
5:      {
6:          throw new Exception();
7:      } /* line with dead code */
8:
9:      return 42;
10: } /* line with dead code */
11: ?>

The explanation for this is rather simple. Xdebug checks code coverage by adding hooks into certain opcodes. Opcodes are the building blocks of oparrays. PHP converts each element in your script (main body, method, function) to an oparray when it parses it. The PHP engine then executes those oparrays by running some code for each opcode. Opcodes are generated, but they are not optimised, which means that opcodes that can never be executed are not removed.

With vld we can see which opcodes are generated. For the above script, there are two elements. The main body of the script, and the foo function. I used vld to show their opcodes, and after some trimming the main script body looks like:

line     #* I O op               ext  return  operands
--------------------------------------------------------
   2     0  >   EXT_STMT
         1      NOP
  12     2      EXT_STMT
         3    > RETURN                        1

We'll ignore this one mostly, as there is nothing much in it, but do notice the RETURN opcode, which represents a return statement in a PHP script. We did not add a return statement, but PHP's parser always puts a RETURN opcode at the end of each oparray.
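The effect of these implicit RETURN opcodes can be observed from plain PHP. The following is a minimal sketch; the function name and the temporary file are made up for the demonstration. A function without a return statement yields null, and including a file without one yields 1, which matches the RETURN operands that vld shows.

```php
<?php
// Hypothetical function with no return statement of its own.
function noExplicitReturn()
{
}

// The implicit RETURN at the end of the function's oparray yields null.
$result = noExplicitReturn();
var_dump($result === null); // true

// A throw-away script without a return statement, used only for this demo.
$file = tempnam(sys_get_temp_dir(), 'ret');
file_put_contents($file, "<?php\n\$x = 42;\n");

// The implicit RETURN at the end of the main script body yields 1.
$included = include $file;
var_dump($included === 1); // true

unlink($file);
```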

The foo function's oparray looks like:

line     #* I O op               ext  return  operands
--------------------------------------------------------
   2     0  >   EXT_NOP
   5     1      EXT_STMT
         2    > JMPZ                          false, ->11
   6     3  >   EXT_STMT
         4      FETCH_CLASS        4  :0      'Exception'
         5      EXT_FCALL_BEGIN
         6      NEW                   $1      :0
         7      DO_FCALL_BY_NAME   0
         8      EXT_FCALL_END
         9    > THROW              0          $1
   7    10*     JMP                           ->11
   9    11  >   EXT_STMT
        12    > RETURN                        42
  10    13*     EXT_STMT
        14*   > RETURN                        null

Xdebug's code coverage marks lines 7 and 10 as "dead code". When we look at the vld output above, we see that line 10 has an EXT_STMT and a RETURN statement. But they can never be reached as there is no path through the code that does not hit the RETURN on line 9 first. vld marks dead code with a *. The > in the I and O columns indicate points in the oparray that are the end point of a jump instruction (i.e., the start of a branch) and a location from where a jump is initiated (i.e., the exit point out of a branch) respectively.

vld actually tells you which branches and paths are found:

branch: #  0; line: 2- 5; sop:  0; eop:  2; out1:   3; out2:  11
branch: #  3; line: 6- 6; sop:  3; eop:  9
branch: # 11; line: 9-10; sop: 11; eop: 14
path #1: 0, 3,
path #2: 0, 11,

Each branch is "named" by its starting opcode entry. For each of the branches, Xdebug, and vld, check whether there is a premature unconditional exit. Conditional exits and jumps are already checked when the oparray is split into branches.

From the three branch definitions you can already see that opcode 10 is not part of any branch as it sits between an exit point and an entry point. Hence it's marked as dead code on line 7. This line contains the closing brace (}) of the if statement.

In the branch covering opcodes 3 to 9 the THROW in opcode 9 is the exit point. For if statements, PHP's code generator always generates an extra JMP at the end. This opcode would simply jump to the next opcode (the jump target is shown as ->11). However, if the branch is exited prematurely (due to the THROW in this case), it's not hit. Because it's the only opcode on line 7, the whole line gets marked as "dead code".

In the branch covering opcodes 11 to 14, the RETURN statement in opcode 12 on line 9 is the exit point of the branch, and hence opcodes 13-14 are marked as dead code.
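The same conclusion can be reproduced with a toy model. The sketch below is not how Xdebug or vld are implemented; it is a simplified reachability walk over a hand-copied version of foo()'s oparray, and the array layout and the function names findReachable() and deadLines() are invented for illustration. An opcode is treated as dead when no walk from opcode 0 reaches it, and a line as dead when all of its opcodes are dead.

```php
<?php
// Hand-copied model of foo()'s oparray: index => [line, opcode, jump target].
$oparray = [
    0  => [2,  'EXT_NOP',          null],
    1  => [5,  'EXT_STMT',         null],
    2  => [5,  'JMPZ',             11],
    3  => [6,  'EXT_STMT',         null],
    4  => [6,  'FETCH_CLASS',      null],
    5  => [6,  'EXT_FCALL_BEGIN',  null],
    6  => [6,  'NEW',              null],
    7  => [6,  'DO_FCALL_BY_NAME', null],
    8  => [6,  'EXT_FCALL_END',    null],
    9  => [6,  'THROW',            null],
    10 => [7,  'JMP',              11],
    11 => [9,  'EXT_STMT',         null],
    12 => [9,  'RETURN',           null],
    13 => [10, 'EXT_STMT',         null],
    14 => [10, 'RETURN',           null],
];

// Walk every path from opcode 0 and record which opcodes can be reached.
function findReachable(array $oparray): array
{
    $reachable = [];
    $todo = [0];
    while ($todo) {
        $i = array_pop($todo);
        while (isset($oparray[$i]) && !isset($reachable[$i])) {
            $reachable[$i] = true;
            [, $op, $target] = $oparray[$i];
            if ($op === 'RETURN' || $op === 'THROW') {
                break;              // unconditional exit: nothing after it runs
            }
            if ($op === 'JMP') {
                $i = $target;       // unconditional jump: follow the target only
                continue;
            }
            if ($op === 'JMPZ') {
                $todo[] = $target;  // conditional jump: visit the target later
            }
            $i++;                   // fall through to the next opcode
        }
    }
    return $reachable;
}

// A line is dead when none of its opcodes can be reached.
function deadLines(array $oparray): array
{
    $reachable = findReachable($oparray);
    $all = [];
    $live = [];
    foreach ($oparray as $i => [$line]) {
        $all[$line] = true;
        if (isset($reachable[$i])) {
            $live[$line] = true;
        }
    }
    return array_keys(array_diff_key($all, $live));
}

echo implode(', ', deadLines($oparray)), "\n"; // 7, 10
```

Opcodes 10, 13 and 14 are never reached in this model, which corresponds to the entries vld marks with a *.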

Hopefully this explains why lines which seem to have code are sometimes marked as dead code. And this is in the cases where PHP gets the line numbers for opcodes right... which isn't always the case either.
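To look at these markers without going through PHPUnit, Xdebug's own coverage functions can be called directly. This is a rough sketch under a few assumptions: Xdebug is loaded with code coverage available, foo.php is a hypothetical file holding the foo() function from above, and describeStatus() and coverageFor() are helper names made up here. The per-line values are 1 for executed, -1 for not executed and -2 for dead code. What gets reported for each line depends on the PHP and Xdebug versions in use.

```php
<?php
// Turn Xdebug's per-line value into a readable label.
function describeStatus(int $status): string
{
    $labels = [1 => 'executed', -1 => 'not executed', -2 => 'dead code'];
    return $labels[$status] ?? 'unknown';
}

// Collect line coverage for $file while $run executes (requires Xdebug).
function coverageFor(string $file, callable $run): array
{
    // Both flags are needed for Xdebug to report -1 and -2 values.
    xdebug_start_code_coverage(XDEBUG_CC_UNUSED | XDEBUG_CC_DEAD_CODE);
    require_once $file;
    $run();
    $coverage = xdebug_get_code_coverage();
    xdebug_stop_code_coverage();

    // Results are keyed by file name; look ours up by its resolved path.
    return $coverage[realpath($file)] ?? [];
}

// Hypothetical usage: only runs when Xdebug and foo.php are both present.
if (function_exists('xdebug_start_code_coverage') && is_file('foo.php')) {
    $lines = coverageFor('foo.php', function () {
        foo();
    });
    foreach ($lines as $line => $status) {
        printf("line %2d: %s\n", $line, describeStatus($status));
    }
}
```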

For Xdebug, I am improving code coverage to also include path and branch coverage, which should come in Xdebug 2.3.

Shortlink

This article has a short URL available: http://drck.me/deadcode-azx
