Missing Characters
I released a new version of Xdebug on Sunday, which fixes a few bugs. One of them is titled "emoji character become diamond question marks". This bug turned out to be the same as "var_dump does not output some Cyrillic characters", which had been reported a few days earlier but hadn't come with a decent enough reproducible case.
At first I dismissed this, as it's not uncommon for people to get their character sets wrong or mixed up.
But when I tested it, the following script really did not show the right result:
<?php
$str = 'hello 👍';
var_dump($str);
?>
Instead of the expected:
<pre class='xdebug-var-dump' dir='ltr'> <small>Standard input code:3:</small><small>string</small> <font color='#cc0000'>'hello 👍'</font> <i>(length=10)</i> </pre>
It showed:
<pre class='xdebug-var-dump' dir='ltr'> <small>Standard input code:3:</small><small>string</small> <font color='#cc0000'>'hello ���'</font> <i>(length=10)</i> </pre>
The four bytes that should have made up the 👍 had turned into three.
Xdebug uses a function, xdebug_xmlize, to escape XML and XHTML-special characters such as ", &, and < when it outputs strings of data.
Its algorithm first calculates how much memory the resulting string would use by looping over the source characters, and adding the lengths of the escaped characters together. It uses a 256-entry table for this.
The first row shows that byte 0's escaped length will be 4 (for &#0;) and the LF character's escaped length will be 5 (for &#10;).
The replacement strings are recorded in the table that follows. It only has room for 64 elements, as none of the bytes from 64 upwards need to be escaped. You can see that because the xml_encode_count table only contains 1s after the fourth 16-element row.
Then in a second iteration it loops over all the source characters again to construct the resulting output.
In this iteration, it checks if the destination length is 1, in which case it just copies the character over. If the destination length is not 1, it copies the replacement string instead, which is that many characters long.
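The two passes described above can be sketched roughly as follows. This is a minimal illustration and not Xdebug's actual source: the names escaped_len, replacement, init_tables and escape_sketch are made up for this example, the length table is filled at run time to keep the listing short, and only a handful of characters are escaped.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical tables for illustration only, not Xdebug's real ones. */
static unsigned char escaped_len[256];  /* output length for each input byte */
static const char *replacement[64];     /* replacement text for bytes 0..63 */

static void init_tables(void)
{
    /* Default: every byte is copied as-is, so its output length is 1. */
    memset(escaped_len, 1, sizeof escaped_len);

    replacement[0]    = "&#0;";
    replacement['\n'] = "&#10;";
    replacement['"']  = "&quot;";
    replacement['&']  = "&amp;";
    replacement['<']  = "&lt;";

    for (int i = 0; i < 64; i++) {
        if (replacement[i] != NULL) {
            escaped_len[i] = (unsigned char) strlen(replacement[i]);
        }
    }
}

/* Returns a newly allocated, NUL-terminated escaped copy of src.
 * The caller must free() the result. */
char *escape_sketch(const char *src, size_t len, size_t *out_len)
{
    size_t total = 0, w = 0;

    init_tables();

    /* Pass 1: add up the escaped length of every source byte. */
    for (size_t i = 0; i < len; i++) {
        total += escaped_len[(unsigned char) src[i]];
    }

    char *dst = malloc(total + 1);
    if (dst == NULL) {
        return NULL;
    }

    /* Pass 2: copy the byte itself, or its replacement string. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char) src[i];
        size_t n = escaped_len[c];

        if (n == 1) {
            dst[w++] = (char) c;
        } else {
            memcpy(dst + w, replacement[c], n);
            w += n;
        }
    }
    dst[w] = '\0';

    if (out_len != NULL) {
        *out_len = w;
    }
    return dst;
}
```

With these assumed tables, escaping the three bytes a<b yields a&lt;b, and a string without special characters comes back unchanged. Note that both passes trust the length table completely, so a wrong entry there changes the output.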
The bug here was that the xml_encode_count table, although it was defined as having 256 entries, only had 240 entries filled in. I had forgotten to add the 16th line, so there were only 15 lines of 16 elements.
And in C, that means that these missing elements were all set to 0. So if the source string contained a byte with a value greater than or equal to hexadecimal 0xF0 (decimal: 240), the algorithm thought its replacement length was 0. As a result, these bytes were simply ignored, and not copied over into the destination string.
For the 👍 character (hex: 0xF0 0x9F 0x91 0x8D) that meant that its first byte (0xF0) was not copied into the destination string. And that meant a broken UTF-8 character. Oops! 💩
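The effect is easy to reproduce in isolation. The sketch below is not the Xdebug table itself: short_table and counted_length are hypothetical names, and every listed entry is simply 1. It only shows that a 256-element array with 15 rows of initialisers gets zeros for the last 16 elements, and what that does to the length calculation.

```c
#include <stddef.h>

/* Declared with 256 entries, but only 15 rows of 16 are given.
 * C sets the remaining 16 entries (0xF0..0xFF) to 0. */
static const unsigned char short_table[256] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x00 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x10 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x20 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x30 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x40 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x50 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x60 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x70 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x80 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x90 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0xA0 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0xB0 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0xC0 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0xD0 */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1  /* 0xE0 */
    /* the row for 0xF0..0xFF is missing */
};

/* Adds up the table's length for every source byte, like the first pass. */
size_t counted_length(const unsigned char *src, size_t len)
{
    size_t total = 0;

    for (size_t i = 0; i < len; i++) {
        total += short_table[src[i]];
    }
    return total;
}
```

For the ten bytes of 'hello 👍', counted_length returns 9 rather than 10, because the 0xF0 byte contributes nothing. The compiler accepts the short initialiser list without complaint, which is why the missing row is easy to overlook.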
In Xdebug 3.4.2 this is now fixed, as I have added the 16th line to the table, with 16 more elements containing 1.
What I did find curious is that it took nearly five years for someone to report this issue, and then two people did so in the same week!
Shortlink
This article has a short URL available: https://drck.me/missing-chars-jdk