Pretty Printing BSON
In Wireshark and MongoDB 3.6, I explained that Wireshark is amazing for debugging actual network communications. But sometimes it is necessary to debug things before they get sent out onto the wire. The majority of the driver's communication with the server is through BSON documents with minimal overhead of wire protocol messages. BSON documents are represented in the C Driver by bson_t data structures. The bson_t structure wraps all of the different data types from the BSON Specification. It is analogous to PHP's zval structure, although its implementation is a little more complicated.
A bson_t structure can be allocated on the stack or heap, just like a zval structure. A zval structure represents a single data type and single value. A bson_t structure represents a buffer of bytes constituting one or more values in the form of a BSON document. This buffer is exactly what the MongoDB server expects to be transmitted over a network connection. As many BSON documents are small, the bson_t structure can function in two modes, determined by a flag: inline, or allocated. In inline mode it only has space for 120 bytes of BSON data, but no memory has to be allocated on the heap. This mode can significantly speed up its creation, especially if it is allocated on the stack (by using bson_t value, instead of bson_t *value
= bson_new()). It makes sense to have this mode, as many common interactions with the server fall under this 120-byte limit.
For PHP's zval, the PHP developers have developed a helper function, printzv, that can be loaded into the GDB debugger. This helper function unpacks all the intricacies of the zval structure (e.g. arrays, objects) and displays them on the GDB console. When working on some code for the MongoDB Driver for PHP, I was looking for something similar for the bson_t structure only to find that no such thing existed yet. With the bson_t structure being more complicated (two modes, data as a binary stream of data), it would be just as useful as PHP's printzv GDB helper. You can guess already that, of course, I felt the need to just write one myself.
GDB supports extensions written in Python, but that functionality is sometimes disabled. It also has its own scripting language that you can use on its command line, or by loading your own files with the source command. You can define functions in the language, but the functions can't return values. There are also no classes or scoping, which means all variables are global. With the data stored in the bson_t struct as a stream of binary data, I ended up writing a GDB implementation of a streamed BSON decoder, with a lot of handicaps.
The new printbson function accepts a bson_t * value, and then determines whether its mode is inline or allocated. Depending on the allocation type, printbson then delegates to a "private" __printbson function with the right parameters describing where the binary stream is stored.
__printbson prints the length of the top-level BSON document and then calls the _printelements function. This function reads data from the stream until all key/value pairs have been consumed, advancing its internal read pointer as it goes. It can detect that all elements have been read, as each BSON document ends with a null byte character (\0).
If a value contains a nested BSON document, such as the document or array types, it recursively calls __printelements, and also does some housekeeping to make sure the following output is nicely indented.
Each element begins with a single byte indicating the field type, followed by the field name as a null-terminated string, and then a value. After the type and name are consumed, __printelements defers to a specialised print function for each type. As an example, for an ObjectID field, it has:
if $type == 0x07
__printObjectID $data
end
The __printObjectID function is then responsible for reading and displaying the value of the ObjectID. In this case, the value is 12 bytes, which we'd like to display as a hexadecimal string:
define __printObjectID
set $value = ((uint8_t*) $arg0)
set $i = 0
printf "ObjectID(\""
while $i < 12
printf "%02X", $value[$i]
set $i = $i + 1
end
printf "\")"
set $data = $data + 12
end
It first assigns a value of a correctly cast type (uint8_t*) to the $value variable, and initialises the loop variable $i. It then uses a while loop to iterate over the 12 bytes; GDB does not have a for construct. At the end of each display function, the $data pointer is advanced by the number of bytes that the value reader consumed.
For types that use a null-terminated C-string, an additional loop advances $data until a \0 character is found. For example, the Regex data type is represented by two C-strings:
define __printRegex
printf "Regex(\"%s\", \"", (char*) $data
# skip through C String
while $data[0] != '\0'
set $data = $data + 1
end
set $data = $data + 1
printf "%s\")", (char*) $data
# skip through C String
while $data[0] != '\0'
set $data = $data + 1
end
set $data = $data + 1
end
We start by printing the type name prefix and first string (pattern) using printf and then advance our data pointer with a while loop. Then, the second string (modifiers) is printed with printf and we advance again, leaving the $data pointer at the next key/value pair (or our document's trailing null byte if the regex type was the last element).
After implementing all the different data types, I made a PR against the MongoDB C driver, where the BSON library resides. It has now been merged. In order to make use of the .gdbinit file, you can include it in your GDB session with source /path/to/.gdbinit.
With the file loaded, and bson_doc being bson_t * variable in the local scope, you can run printbson bson_doc, and receive something like the following semi-JSON formatted output:
(gdb) printbson bson_doc
ALLOC [0x555556cd7310 + 0] (len=475)
{
'bool' : true,
'int32' : NumberInt("42"),
'int64' : NumberLong("3000000042"),
'string' : "Stŕìñg",
'objectId' : ObjectID("5A1442F3122D331C3C6757E1"),
'utcDateTime' : UTCDateTime(1511277299031),
'arrayOfInts' : [
'0' : NumberInt("1"),
'1' : NumberInt("2"),
'2' : NumberInt("3"),
'3' : NumberInt("5"),
'4' : NumberInt("8"),
'5' : NumberInt("13"),
'6' : NumberInt("21"),
'7' : NumberInt("34")
],
'embeddedDocument' : {
'arrayOfStrings' : [
'0' : "one",
'1' : "two",
'2' : "three"
],
'double' : 2.718280,
'notherDoc' : {
'true' : NumberInt("1"),
'false' : false
}
},
'binary' : Binary("02", "3031343532333637"),
'regex' : Regex("@[a-z]+@", "im"),
'null' : null,
'js' : JavaScript("print foo"),
'jsws' : JavaScript("print foo") with scope: {
'f' : NumberInt("42"),
'a' : [
'0' : 3.141593,
'1' : 2.718282
]
},
'timestamp' : Timestamp(4294967295, 4294967295),
'double' : 3.141593
}
In the future, I might add information about the length of strings, or the convert the predefined types of the Binary data-type to their common name. Happy hacking!
Life Line
I've finished reading Children of Memory, the third book in the series.
Another interesting take on forms of intelligent life.
A fourth one is going to get released later this year.
Updated a post_box, a beauty shop, and a restaurant; Confirmed 2 clothes shops, 2 pet shops, and a restaurant
I walked 5.9km in 1h40m39s
Updated a bicycle_parking
Updated 2 waste_baskets
I walked 7.9km in 1h37m12s
Created 3 waste_baskets; Updated 3 bus_stops, 2 benches, and 2 waste_baskets
I walked 8.1km in 1h25m53s
I walked 1.2km in 9m31s
I walked 9.4km in 1h39m05s
Merge branch 'xdebug_3_5'
Merged pull request #1071
Fixed issue #2411: Native Path Mapping is not applied to the initial …
Created 2 waste_baskets; Updated 3 waste_baskets, 2 benches, and 2 other objects; Deleted a waste_basket
I walked 7.9km in 1h45m36s
RE: https://phpc.social/@phpc_tv/116274041642323081
Now that phpc.tv and phpc.social are part of the same umbrella, I've upped my yearly contributions to their Open Collective: https://opencollective.com/phpcommunity/projects/phpc-social
Merge branch 'xdebug_3_5'
Merged pull request #1070
I walked 7.2km in 1h10m26s
Fixed issue #2405: Handle minimum path in .xdebug directory discovery
I've published a new blog post: "Human Creations", on the difference in content generation by LLMs, and the creation of text, art and code by humans.
You can find it at https://derickrethans.nl/human-creations.html or at @blog
I walked 7.8km in 1h38m32s
RE: https://phpc.social/@afilina/116274024588235234
It's good to see that more and more people are realising that the Web can be for-good, without all the enshittification.
That's why I'm happy to see endeavours like phpc.tv springing up, and helping out where I can.
Taking back the control of how the Web is for people, by people, without big tech making it all shit.
Created a waste_basket; Updated 5 crossings and a bicycle_parking
I walked 10.7km in 2h35m10s


Shortlink
This article has a short URL available: https://drck.me/gdb-bson-dy0