Pretty Printing BSON
In Wireshark and MongoDB 3.6, I explained that Wireshark is amazing for debugging actual network communications. But sometimes it is necessary to debug things before they get sent out onto the wire. The majority of the driver's communication with the server is through BSON documents with minimal overhead of wire protocol messages. BSON documents are represented in the C Driver by bson_t data structures. The bson_t structure wraps all of the different data types from the BSON Specification. It is analogous to PHP's zval structure, although its implementation is a little more complicated.
A bson_t structure can be allocated on the stack or heap, just like a zval structure. A zval structure represents a single data type and single value. A bson_t structure represents a buffer of bytes constituting one or more values in the form of a BSON document. This buffer is exactly what the MongoDB server expects to be transmitted over a network connection. As many BSON documents are small, the bson_t structure can function in two modes, determined by a flag: inline, or allocated. In inline mode it only has space for 120 bytes of BSON data, but no memory has to be allocated on the heap. This mode can significantly speed up its creation, especially if it is allocated on the stack (by using bson_t value, instead of bson_t *value = bson_new()). It makes sense to have this mode, as many common interactions with the server fall under this 120-byte limit.
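To make the "buffer of bytes" a bit more concrete, here is a minimal Python sketch that encodes the document {"hello": "world"} by hand and compares its size against the 120-byte inline capacity mentioned above. Python is used purely for illustration. The names encode_string_doc and INLINE_CAPACITY are made up for this example and are not part of libbson, and treating the limit as inclusive is an assumption.

```python
import struct

INLINE_CAPACITY = 120  # inline-mode size quoted in the text (inclusive: assumption)


def encode_string_doc(key: str, value: str) -> bytes:
    """Encode a one-field BSON document holding a UTF-8 string (type 0x02)."""
    val = value.encode("utf-8") + b"\x00"
    element = (
        b"\x02"                          # type byte: UTF-8 string
        + key.encode("utf-8") + b"\x00"  # field name as a C-string
        + struct.pack("<i", len(val))    # string length, including its NUL
        + val
    )
    body = element + b"\x00"             # document terminator
    # The leading little-endian int32 counts the whole document, itself included.
    return struct.pack("<i", 4 + len(body)) + body


doc = encode_string_doc("hello", "world")
fits_inline = len(doc) <= INLINE_CAPACITY
print(len(doc), doc.hex(), fits_inline)
```

A document like this is only 22 bytes, which gives a feel for how much fits within the inline limit.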
For PHP's zval, the PHP developers have developed a helper function, printzv, that can be loaded into the GDB debugger. This helper function unpacks all the intricacies of the zval structure (e.g. arrays, objects) and displays them on the GDB console. When working on some code for the MongoDB Driver for PHP, I was looking for something similar for the bson_t structure, only to find that no such thing existed yet. With the bson_t structure being more complicated (two modes, data stored as a binary stream), such a helper would be just as useful as PHP's printzv GDB helper. You can guess already that, of course, I felt the need to just write one myself.
GDB supports extensions written in Python, but that functionality is sometimes disabled. It also has its own scripting language that you can use on its command line, or by loading your own files with the source command. You can define functions in the language, but the functions can't return values. There are also no classes or scoping, which means all variables are global. With the data stored in the bson_t struct as a stream of binary data, I ended up writing a GDB implementation of a streamed BSON decoder, with a lot of handicaps.
The new printbson function accepts a bson_t * value, and then determines whether its mode is inline or allocated. Depending on the allocation type, printbson then delegates to a "private" __printbson function with the right parameters describing where the binary stream is stored.
__printbson prints the length of the top-level BSON document and then calls the __printelements function. This function reads data from the stream until all key/value pairs have been consumed, advancing its internal read pointer as it goes. It can detect that all elements have been read, as each BSON document ends with a null byte character (\0).
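The same walk can be sketched in Python, which is easier to experiment with than GDB's command language. This is only an illustration of the loop described above, not a translation of the actual GDB code: the function names are hypothetical, and only three element types (int32, string, boolean) are handled.

```python
import struct


def read_cstring(buf: bytes, pos: int):
    """Return (text, position just past the terminating NUL)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def walk_elements(buf: bytes):
    """Return (type, name, value) tuples for a flat BSON document."""
    (length,) = struct.unpack_from("<i", buf, 0)
    pos = 4                                  # skip the length prefix
    found = []
    while buf[pos] != 0x00:                  # a 0x00 type byte ends the document
        elem_type = buf[pos]
        name, pos = read_cstring(buf, pos + 1)
        if elem_type == 0x10:                # int32
            (value,) = struct.unpack_from("<i", buf, pos)
            pos += 4
        elif elem_type == 0x02:              # UTF-8 string: int32 size, bytes, NUL
            (size,) = struct.unpack_from("<i", buf, pos)
            value = buf[pos + 4 : pos + 4 + size - 1].decode("utf-8")
            pos += 4 + size
        elif elem_type == 0x08:              # boolean
            value = buf[pos] == 1
            pos += 1
        else:
            raise ValueError(f"type 0x{elem_type:02x} not handled in this sketch")
        found.append((elem_type, name, value))
    if pos + 1 != length:                    # terminator is the last counted byte
        raise ValueError("length prefix does not match the data")
    return found


# Hand-built sample: {"int32": 42, "bool": true}
body = (
    bytes([0x10]) + b"int32\x00" + struct.pack("<i", 42)
    + bytes([0x08]) + b"bool\x00" + bytes([0x01])
    + b"\x00"
)
sample = struct.pack("<i", 4 + len(body)) + body
elements = walk_elements(sample)
print(elements)
```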
If a value contains a nested BSON document, such as the document or array types, it recursively calls __printelements, and also does some housekeeping to make sure the following output is nicely indented.
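A rough Python sketch of that recursion follows. It assumes the indentation "housekeeping" can be modelled as a depth counter passed down with each call; the GDB script has to track this in global variables instead. All function names are hypothetical, only int32, document and array elements are handled, and the separating commas are left out to keep it short.

```python
import struct


def read_cstring(buf, pos):
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def format_elements(buf, pos, depth, lines):
    """Append one line per element; recurse for documents (0x03) and arrays (0x04).

    `pos` must point at the first element, just past a length prefix.
    Returns the position just past the document's terminating NUL.
    """
    pad = "    " * depth
    while buf[pos] != 0x00:
        elem_type = buf[pos]
        name, pos = read_cstring(buf, pos + 1)
        if elem_type == 0x10:                      # int32
            (value,) = struct.unpack_from("<i", buf, pos)
            lines.append(f"{pad}'{name}' : NumberInt(\"{value}\")")
            pos += 4
        elif elem_type in (0x03, 0x04):            # embedded document / array
            opener, closer = ("{", "}") if elem_type == 0x03 else ("[", "]")
            lines.append(f"{pad}'{name}' : {opener}")
            # Skip the nested length prefix and indent one level deeper.
            pos = format_elements(buf, pos + 4, depth + 1, lines)
            lines.append(f"{pad}{closer}")
        else:
            raise ValueError(f"type 0x{elem_type:02x} not handled in this sketch")
    return pos + 1


def doc(*elements):
    """Sample-data helper: add the length prefix and terminator."""
    body = b"".join(elements) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body


def int32(name, value):
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)


inner = doc(int32("0", 1), int32("1", 2))
sample = doc(int32("a", 42), b"\x04" + b"ints\x00" + inner)
lines = ["{"]
end = format_elements(sample, 4, 1, lines)
lines.append("}")
print("\n".join(lines))
```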
Each element begins with a single byte indicating the field type, followed by the field name as a null-terminated string, and then a value. After the type and name are consumed, __printelements defers to a specialised print function for each type. As an example, for an ObjectID field, it has:
    if $type == 0x07
        __printObjectID $data
    end
The __printObjectID function is then responsible for reading and displaying the value of the ObjectID. In this case, the value is 12 bytes, which we'd like to display as a hexadecimal string:
    define __printObjectID
        set $value = ((uint8_t*) $arg0)
        set $i = 0
        printf "ObjectID(\""
        while $i < 12
            printf "%02X", $value[$i]
            set $i = $i + 1
        end
        printf "\")"
        set $data = $data + 12
    end
It first assigns a value of a correctly cast type (uint8_t*) to the $value variable, and initialises the loop variable $i. It then uses a while loop to iterate over the 12 bytes; GDB does not have a for construct. At the end of each display function, the $data pointer is advanced by the number of bytes that the value reader consumed.
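For comparison, here is roughly what the same reader looks like in a language where functions can return values. This is a Python sketch with a hypothetical function name: instead of mutating a global $data pointer, it hands the advanced position back to the caller.

```python
def print_object_id(buf: bytes, pos: int):
    """Return (rendered text, new position) for the 12-byte ObjectID at `pos`."""
    raw = buf[pos : pos + 12]
    if len(raw) != 12:
        raise ValueError("truncated ObjectID")
    # Same formatting as printf "%02X" in the GDB loop: two upper-case hex digits.
    text = 'ObjectID("' + "".join(f"{byte:02X}" for byte in raw) + '")'
    return text, pos + 12


# The ObjectID bytes, followed by a document terminator.
data = bytes.fromhex("5A1442F3122D331C3C6757E1") + b"\x00"
text, pos = print_object_id(data, 0)
print(text, pos)
```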
For types that use a null-terminated C-string, an additional loop advances $data until a \0 character is found. For example, the Regex data type is represented by two C-strings:
    define __printRegex
        printf "Regex(\"%s\", \"", (char*) $data

        # skip through C String
        while $data[0] != '\0'
            set $data = $data + 1
        end
        set $data = $data + 1

        printf "%s\")", (char*) $data

        # skip through C String
        while $data[0] != '\0'
            set $data = $data + 1
        end
        set $data = $data + 1
    end
We start by printing the type name prefix and first string (pattern) using printf and then advance our data pointer with a while loop. Then, the second string (modifiers) is printed with printf and we advance again, leaving the $data pointer at the next key/value pair (or our document's trailing null byte if the regex type was the last element).
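The duplicated "skip through C String" loop is a consequence of GDB commands not being able to return a value. As a sketch of what it would look like otherwise, here is the same logic in Python with the scan factored into a helper (both function names are hypothetical):

```python
def skip_cstring(buf: bytes, pos: int):
    """Return (decoded string, position just past its NUL terminator)."""
    end = pos
    while buf[end] != 0x00:      # same byte-by-byte scan as the GDB while loop
        end += 1
    return buf[pos:end].decode("utf-8"), end + 1


def print_regex(buf: bytes, pos: int):
    """Read the pattern and modifiers C-strings and render them."""
    pattern, pos = skip_cstring(buf, pos)
    modifiers, pos = skip_cstring(buf, pos)
    return f'Regex("{pattern}", "{modifiers}")', pos


# Regex value (pattern, modifiers), followed by the document terminator.
data = b"@[a-z]+@\x00im\x00" + b"\x00"
text, pos = print_regex(data, 0)
print(text, pos, data[pos])
```

After the call, the position points at the trailing null byte of the document, matching the case where the regex is the last element.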
After implementing all the different data types, I made a PR against the MongoDB C driver, where the BSON library resides. It has now been merged. In order to make use of the .gdbinit file, you can include it in your GDB session with source /path/to/.gdbinit.
With the file loaded, and bson_doc being a bson_t * variable in the local scope, you can run printbson bson_doc, and receive something like the following semi-JSON formatted output:
    (gdb) printbson bson_doc
    ALLOC [0x555556cd7310 + 0] (len=475)
    {
        'bool' : true,
        'int32' : NumberInt("42"),
        'int64' : NumberLong("3000000042"),
        'string' : "Stŕìñg",
        'objectId' : ObjectID("5A1442F3122D331C3C6757E1"),
        'utcDateTime' : UTCDateTime(1511277299031),
        'arrayOfInts' : [
            '0' : NumberInt("1"),
            '1' : NumberInt("2"),
            '2' : NumberInt("3"),
            '3' : NumberInt("5"),
            '4' : NumberInt("8"),
            '5' : NumberInt("13"),
            '6' : NumberInt("21"),
            '7' : NumberInt("34")
        ],
        'embeddedDocument' : {
            'arrayOfStrings' : [
                '0' : "one",
                '1' : "two",
                '2' : "three"
            ],
            'double' : 2.718280,
            'notherDoc' : {
                'true' : NumberInt("1"),
                'false' : false
            }
        },
        'binary' : Binary("02", "3031343532333637"),
        'regex' : Regex("@[a-z]+@", "im"),
        'null' : null,
        'js' : JavaScript("print foo"),
        'jsws' : JavaScript("print foo") with scope: {
            'f' : NumberInt("42"),
            'a' : [
                '0' : 3.141593,
                '1' : 2.718282
            ]
        },
        'timestamp' : Timestamp(4294967295, 4294967295),
        'double' : 3.141593
    }
In the future, I might add information about the length of strings, or convert the predefined subtypes of the Binary data type to their common names. Happy hacking!