Pretty Printing BSON

In Wireshark and MongoDB 3.6, I explained that Wireshark is amazing for debugging actual network communications. But sometimes it is necessary to debug things before they get sent out onto the wire. The majority of the driver's communication with the server is through BSON documents with minimal overhead of wire protocol messages. BSON documents are represented in the C Driver by bson_t data structures. The bson_t structure wraps all of the different data types from the BSON Specification. It is analogous to PHP's zval structure, although its implementation is a little more complicated.

A bson_t structure can be allocated on the stack or heap, just like a zval structure. A zval structure represents a single data type and single value. A bson_t structure represents a buffer of bytes constituting one or more values in the form of a BSON document. This buffer is exactly what the MongoDB server expects to be transmitted over a network connection. As many BSON documents are small, the bson_t structure can function in two modes, determined by a flag: inline, or allocated. In inline mode it only has space for 120 bytes of BSON data, but no memory has to be allocated on the heap. This mode can significantly speed up its creation, especially if it is allocated on the stack (by using bson_t value, instead of bson_t *value = bson_new()). It makes sense to have this mode, as many common interactions with the server fall under this 120-byte limit.

For PHP's zval, the PHP developers have developed a helper function, printzv, that can be loaded into the GDB debugger. This helper function unpacks all the intricacies of the zval structure (e.g. arrays, objects) and displays them on the GDB console. When working on some code for the MongoDB Driver for PHP, I was looking for something similar for the bson_t structure only to find that no such thing existed yet. With the bson_t structure being more complicated (two modes, data as a binary stream of data), it would be just as useful as PHP's printzv GDB helper. You can guess already that, of course, I felt the need to just write one myself.

GDB supports extensions written in Python, but that functionality is sometimes disabled. It also has its own scripting language that you can use on its command line, or by loading your own files with the source command. You can define functions in the language, but the functions can't return values. There are also no classes or scoping, which means all variables are global. With the data stored in the bson_t struct as a stream of binary data, I ended up writing a GDB implementation of a streamed BSON decoder, with a lot of handicaps.

The new printbson function accepts a bson_t * value, and then determines whether its mode is inline or allocated. Depending on the allocation type, printbson then delegates to a "private" __printbson function with the right parameters describing where the binary stream is stored.

__printbson prints the length of the top-level BSON document and then calls the _printelements function. This function reads data from the stream until all key/value pairs have been consumed, advancing its internal read pointer as it goes. It can detect that all elements have been read, as each BSON document ends with a null byte character (\0).

If a value contains a nested BSON document, such as the document or array types, it recursively calls __printelements, and also does some housekeeping to make sure the following output is nicely indented.

Each element begins with a single byte indicating the field type, followed by the field name as a null-terminated string, and then a value. After the type and name are consumed, __printelements defers to a specialised print function for each type. As an example, for an ObjectID field, it has:

if $type == 0x07
    __printObjectID $data
end

The __printObjectID function is then responsible for reading and displaying the value of the ObjectID. In this case, the value is 12 bytes, which we'd like to display as a hexadecimal string:

define __printObjectID
    set $value = ((uint8_t*) $arg0)
    set $i = 0
    printf "ObjectID(\""
    while $i < 12
        printf "%02X", $value[$i]
        set $i = $i + 1
    end
    printf "\")"
    set $data = $data + 12
end

It first assigns a value of a correctly cast type (uint8_t*) to the $value variable, and initialises the loop variable $i. It then uses a while loop to iterate over the 12 bytes; GDB does not have a for construct. At the end of each display function, the $data pointer is advanced by the number of bytes that the value reader consumed.

For types that use a null-terminated C-string, an additional loop advances $data until a \0 character is found. For example, the Regex data type is represented by two C-strings:

define __printRegex
    printf "Regex(\"%s\", \"", (char*) $data

    # skip through C String
    while $data[0] != '\0'
        set $data = $data + 1
    end
    set $data = $data + 1

    printf "%s\")", (char*) $data

    # skip through C String
    while $data[0] != '\0'
        set $data = $data + 1
    end
    set $data = $data + 1
end

We start by printing the type name prefix and first string (pattern) using printf and then advance our data pointer with a while loop. Then, the second string (modifiers) is printed with printf and we advance again, leaving the $data pointer at the next key/value pair (or our document's trailing null byte if the regex type was the last element).

After implementing all the different data types, I made a PR against the MongoDB C driver, where the BSON library resides. It has now been merged. In order to make use of the .gdbinit file, you can include it in your GDB session with source /path/to/.gdbinit.

With the file loaded, and bson_doc being bson_t * variable in the local scope, you can run printbson bson_doc, and receive something like the following semi-JSON formatted output:

(gdb) printbson bson_doc
ALLOC [0x555556cd7310 + 0] (len=475)
{
    'bool' : true,
    'int32' : NumberInt("42"),
    'int64' : NumberLong("3000000042"),
    'string' : "Stŕìñg",
    'objectId' : ObjectID("5A1442F3122D331C3C6757E1"),
    'utcDateTime' : UTCDateTime(1511277299031),
    'arrayOfInts' : [
        '0' : NumberInt("1"),
        '1' : NumberInt("2"),
        '2' : NumberInt("3"),
        '3' : NumberInt("5"),
        '4' : NumberInt("8"),
        '5' : NumberInt("13"),
        '6' : NumberInt("21"),
        '7' : NumberInt("34")
    ],
    'embeddedDocument' : {
        'arrayOfStrings' : [
            '0' : "one",
            '1' : "two",
            '2' : "three"
        ],
        'double' : 2.718280,
        'notherDoc' : {
            'true' : NumberInt("1"),
            'false' : false
        }
    },
    'binary' : Binary("02", "3031343532333637"),
    'regex' : Regex("@[a-z]+@", "im"),
    'null' : null,
    'js' : JavaScript("print foo"),
    'jsws' : JavaScript("print foo") with scope: {
        'f' : NumberInt("42"),
        'a' : [
            '0' : 3.141593,
            '1' : 2.718282
        ]
    },
    'timestamp' : Timestamp(4294967295, 4294967295),
    'double' : 3.141593
}

In the future, I might add information about the length of strings, or the convert the predefined types of the Binary data-type to their common name. Happy hacking!

Comments

No comments yet

Add Comment

Name:
Email:

Will not be posted. Please leave empty instead of filling in garbage though!
Comment:

Please follow the reStructured Text format. Do not use the comment form to report issues in software, use the relevant issue tracker. I will not answer them here.


All comments are moderated

Life Line