Using the Right Debugging Tools

A while ago, we updated the MongoDB PHP driver's embedded C library to a new version. The C library handles most of the connection management and other low level tasks that the PHP driver needs to successfully talk to MongoDB deployments, especially in replicated environments where servers might disappear for maintenance or hardware failures.

After upgrading the C library from 1.12 to 1.13 we noticed that one of the PHP driver's tests was failing:

derick@singlemalt:~/dev/php/derickr-mongo-php-driver $ make test TESTS=tests/atlas.phpt
…
====================================================================
FAILED TEST SUMMARY
---------------------------------------------------------------------
Atlas Connectivity Tests [tests/atlas.phpt]
=====================================================================

When running the test manually, we get:

derick@singlemalt:~/dev/php/mongodb-mongo-php-driver $ php tests/atlas.phpt
--TEST--
Atlas Connectivity Tests
--SKIPIF--
skip Atlas tests not wanted
--FILE--
PASS
mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1852 mongoc_cluster_fetch_stream_single():
        precondition failed: scanner_node && !scanner_node->retired
Aborted

That was not good news.

The atlas.phpt test tests whether the PHP driver (through the C driver) can connect to a set of different deployments of Atlas, MongoDB's Database as a Service platform. The test makes sure we can talk to an Atlas replica set, a sharded cluster, a free tier replica set, as well as TLS 1.1 and TLS 1.2 specific deployments. The test started failing when connecting to the second provided URI (the sharded cluster).

At first I thought this was caused by the upgrade from version 1.12 to 1.13 of the C driver, but that didn't end up being the case. Let's see how we got to finding and fixing this bug.

First of all, I started GDB to see where it was failing:

derick@singlemalt:~/dev/php/mongodb-mongo-php-driver $ gdb --args php tests/atlas.phpt
GNU gdb (Debian 8.1-4+b1) 8.1
…
Reading symbols from php...done.
(gdb) run
Starting program: /usr/local/php/7.3dev/bin/php tests/atlas.phpt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--TEST--
Atlas Connectivity Tests
--FILE--
PASS
mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1852 mongoc_cluster_fetch_stream_single():
        precondition failed: scanner_node && !scanner_node->retired

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

This shows that the C driver bailed out due to a specific assertion on line 1852 of mongoc-cluster.c. This assertion reads:

BSON_ASSERT (scanner_node && !scanner_node->retired);

Which didn't really say a lot. The next thing to try is then to make a backtrace in GDB with the bt command. This revealed:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff41472f1 in __GI_abort () at abort.c:79
#2  0x00007ffff32cdcf1 in mongoc_cluster_fetch_stream_single (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1852
#3  0x00007ffff32cd9df in _mongoc_cluster_stream_for_server (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, cs=0x0, reply=0x7fffffff9ee0, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1762
#4  0x00007ffff32cdbe6 in mongoc_cluster_stream_for_server (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, cs=0x0, reply=0x7fffffff9ee0, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1826
#5  0x00007ffff32ddb0a in _mongoc_cursor_fetch_stream (cursor=0x555556b95700)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor.c:647
#6  0x00007ffff32e12a0 in _prime (cursor=0x555556b95700)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor-find.c:40
#7  0x00007ffff32df22c in _call_transition (cursor=0x555556b95700)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor.c:1146
#8  0x00007ffff32df54b in mongoc_cursor_next (cursor=0x555556b95700, bson=0x7fffffffa038)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor.c:1214
#9  0x00007ffff332511c in phongo_cursor_advance_and_check_for_error (cursor=0x555556b95700)
        at mongo-php-driver/php_phongo.c:742
#10 0x00007ffff33253d9 in phongo_execute_query (
        client=0x555556ba8e60, namespace=0x7fffeaeb5d58 "test.test", zquery=0x7ffff38201c0, options=0x0,
        server_id=1, return_value=0x7ffff38200f0, return_value_used=1)
        at mongo-php-driver/php_phongo.c:810
#11 0x00007ffff3342cc3 in zim_Manager_executeQuery (execute_data=0x7ffff3820160, return_value=0x7ffff38200f0)
        at mongo-php-driver/src/MongoDB/Manager.c:492
#12 0x0000555555e41d16 in execute_internal (execute_data=0x7ffff3820160, return_value=0x7ffff38200f0)
        at /home/derick/dev/php/php-src.git/Zend/zend_execute.c:2328
…

At first glance, I couldn't really see anything wrong with this backtrace, and was still puzzled as to why it would abort. I decided to go for a lunch time walk and have a look at it again. I always find that walks are good for clearing my mind.

After the walk, and a cuppa tea, I looked at the backtrace again, and noticed the following curiosity:

#4  0x00007ffff32cdbe6 in mongoc_cluster_stream_for_server (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, cs=0x0, reply=0x7fffffff9ee0, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1826

vs:

#10 0x00007ffff33253d9 in phongo_execute_query (
        client=0x555556ba8e60, namespace=0x7fffeaeb5d58 "test.test", zquery=0x7ffff38201c0, options=0x0,
        server_id=1, return_value=0x7ffff38200f0, return_value_used=1)
        at mongo-php-driver/php_phongo.c:810

In frame #10 the server_id variable is 1, whereas in frame #4 later on, the server_id variable is 2. These should have been the same.

The server ID is determined by the C driver when selecting a server to send a read or write operation to, and refers to a specific server's connection ID. The PHP driver uses this server ID when executing the query through the phongo_execute_query function, which calls the C driver's mongoc_collection_find_with_opts. The latter accepts as its 3rd argument a bson_t value with options to use while executing a query. These options include the pre-selected server ID, so that the C driver does not attempt to reselect a server:

cursor = mongoc_collection_find_with_opts(collection, query->filter, query->opts,
        phongo_read_preference_from_zval(zreadPreference TSRMLS_CC));

I decided to investigate which options the PHP driver was sending to mongoc_collection_find_with_opts. A while ago I developed a GDB helper function, about which I wrote in pretty-printing BSON. I sourced this helper within my GDB instance, and switched to frame #10 to inspect the value of the query options:

(gdb) source ~/dev/php/mongodb-mongo-php-driver/src/libmongoc/.gdbinit
(gdb) frame 10

The function call uses the options from the query struct query->opts, so I used the printbson helper function to display its contents:

(gdb) printbson query->opts

Which showed:

$11 = "!\000\000\000\020serverId\000\002\000\000\000\020serverId\000\001", '\000' <repeats 90 times>
INLINE (len=33)
{
        'serverId' : NumberInt("2"),
        'serverId' : NumberInt("1")
}

There are not supposed to be two conflicting serverId elements. Unlike PHP's arrays, bson_t values can have the same key appear multiple times. Although the C driver had selected server ID 1 for this query, server 2 was used because it was the first serverId element in the options struct. But why were there two values in the first place?

If you look at the PHP test, you see the following:

<?php
$urls = explode("\n", file_get_contents('.travis.scripts/atlas-uris.txt'));

…
$query = new \MongoDB\Driver\Query([]);

foreach ($urls as $url) {
        …

        try {
                $m = new \MongoDB\Driver\Manager($url);
                …
                iterator_to_array($m->executeQuery('test.test', $query));
                …
        } catch(Exception $e) {
                …
        }
}
?>

From this follows that we create the Query object, assign it to $query, and then use the same variable for each iteration. Somehow, we were not resetting the query options back to default before we used them, resulting in a duplicate serverId field. Once we figured out the problem, creating the fix was easy enough: Make sure we use a clean query->opts struct before we pass it to the mongoc_collection_find_with_opts function.

Debugging this issue was made a lot easier by having the right debugging tools, and this case shows that spending time on writing the GDB helper function printbson earlier in the year paid off. With this bug fixed, we could release a new patch version of the MongoDB Driver for PHP.

Happy hacking!

Shortlink

This article has a short URL available: https://drck.me/dbgtools-e7g

Comments

No comments yet

Add Comment

Name:
Email:

Will not be posted. Please leave empty instead of filling in garbage though!
Comment:

Please follow the reStructured Text format. Do not use the comment form to report issues in software, use the relevant issue tracker. I will not answer them here.


All comments are moderated

Life Line