The Confused C-Driver
Over the past few months I have been adding the MongoDB driver for PHP to Evergreen, MongoDB's CI system. Although we already test on Travis and AppVeyor, adding the driver to Evergreen also allows us to test on more esoteric platforms, such as ARM64, Power8, and zSeries.
While adding the zSeries architecture to our test matrix, we noticed that one specific group of tests was failing. In these tests we would do an insert with a write concern of {w: 0} followed by an insert with {w: 1}.
A write concern is used to enforce specific guarantees on how many MongoDB servers a write needs to be replicated to, before the primary responds to the client that the write has been acknowledged. A write concern of {w: 0} means that the client does not care about the result of the write operation, and henceforth does not need a reply from the server. These are also referred to as unacknowledged writes.
The test fails in such a way that the result for the second insert (with {w: 1}) seemed empty.
Figuring out what was going wrong was not particularly easy in this case. Single-stepping through a lot of connection and socket handling code with a fairly complicated protocol takes time. After several hours it was still unclear what was going wrong.
I tried to find out whether the data was sent correctly over the wire. As we only have a single shared zSeries development server without a GUI, I could not use Wireshark. Moreover, I could not use tcpdump either, as I didn't have any sudo rights:
$ tcpdump -i lo -nnXSs 0 'port 27017' tcpdump: lo: You don't have permission to capture on that device (socket: Operation not permitted)
After some time I figured out that you can also use strace for finding out what goes over the wire, albeit in a slightly annoying format. I used the following invocation:
strace -f -e trace=network -s 10000 -p 22506
The -f makes strace follow forked processes, the -e trace=network shows the system calls of all network related operations, the -s 10000 makes sure we see 10 000 bytes of data in strings, and the 22506 value for the -p argument is the process ID.
On the zSeries platform, the strace dump looks like the following. The first group is the insert with {w: 0}, and the second group the insert with {w: 1}}:
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\252\0\0\0\2\0\0\0\0\0\0\0\335\7\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\2\0h\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3writeConcern\0\f\0\0\0\20w\0\0\0\0\0\0\0\1,\0\0\0documents\0\36\0\0\0\20wc\0\0\0\0\0\7_id\0\\\31\32\27\r\0200G%yF\322\0", 154}], msg_controllen=0, msg_flags=0}, 0) = 154
sendmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"&\0\0\0v\0\0\0\2\0\0\0\335\7\0\0\0\0\0\0\0\21\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 38}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 38
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\316\0\0\0\4\0\0\0\0\0\0\0\335\7\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\214\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3lsid\0\36\0\0\0\5id\0\20\0\0\0\4\337{QQi\312A\245\213\270\210\376&+\247\260\0\3writeConcern\0\f\0\0\0\20w\0\1\0\0\0\0\0\1,\0\0\0documents\0\36\0\0\0\20wc\0\1\0\0\0\7_id\0\\\31\32\27\r\0200G%yF\323\0", 190}], msg_controllen=0, msg_flags=0}, 0) = 190
sendmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"-\0\0\0x\0\0\0\4\0\0\0\335\7\0\0\0\0\0\0\0\30\0\0\0\20n\0\1\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 45}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 45
The first line is the header, with the last four bytes (\335\7\0\0) being the opcode 2013 (OP_MSG). After the opcode, on the second line follow the flagBits: \0\0\0\2. The flagBits field is \0\0\0\0 for the second insert.
On my local Linux machine, the strace dump looks a little more complicated as the data is split into multiple IOV packets, but similar data is present. The trace is also made with the client and server reversed, so sendmsg and recvmsg are also swapped:
sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\252\0\0\0", iov_len=4}, {iov_base="\2\0\0\0", iov_len=4}, {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\335\7\0\0", iov_len=4},
{iov_base="\2\0\0\0", iov_len=4}, {iov_base="\0", iov_len=1}, {iov_base="h\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3writeConcern\0\f\0\0\0\20w\0\0\0\0\0\0\0", iov_len=104}, {iov_base="\1", iov_len=1}, {iov_base=",\0\0\0", iov_len=4}, {iov_base="documents\0", iov_len=10}, {iov_base="\36\0\0\0\20wc\0\0\0\0\0\7_id\0\\\31\32=\343\231,\25\301\\\203r\0", iov_len=30}], msg_iovlen=11, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 170
sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\316\0\0\0", iov_len=4}, {iov_base="\4\0\0\0", iov_len=4}, {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\335\7\0\0", iov_len=4},
{iov_base="\0\0\0\0", iov_len=4}, {iov_base="\0", iov_len=1}, {iov_base="\214\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3lsid\0\36\0\0\0\5id\0\20\0\0\0\4 \314rM\25\362L\7\203\36O\2157\344\201V\0\3writeConcern\0\f\0\0\0\20w\0\1\0\0\0\0\0", iov_len=140}, {iov_base="\1", iov_len=1}, {iov_base=",\0\0\0", iov_len=4}, {iov_base="documents\0", iov_len=10}, {iov_base="\36\0\0\0\20wc\0\1\0\0\0\7_id\0\\\31\32=\343\231,\25\301\\\203s\0", iov_len=30}], msg_iovlen=11, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 206
recvfrom(4, "-\0\0\0", 4, 0, NULL, NULL) = 4
recvfrom(4, "i\0\0\0\4\0\0\0\335\7\0\0\0\0\0\0\0\30\0\0\0\20n\0\1\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 41, 0, NULL, NULL) = 41
The first big difference is that locally, there is no recvfrom for the first sendmsg, but there is one for the second insert. The other difference is that the flagBits (the second line in each sendmsg) is \2\0\0\0 for the first insert. This is different in the zSeries trace, where the value is \0\0\0\2.
The latest version of the MongoDB wire protocol uses OP_MSG for all operations, which is a typical request/response API. This means that generally every request is expected to generate a reply from the server. For unacknowledged writes, the client does not need a reply from the server, and the way to tell the server that is by setting the moreToCome flag in the OP_MSG packet. It's bit 1 in the 32 bit wide bit field.
The wire protocol requires the use of the little-endian byte order for numbers. libmongoc, which implements the connection aspects of the PHP driver, accomplishes this by running a "swap" in case the driver is run on a big-endian system such as zSeries.
It turned out that although this swap happened for all normal integers (such as the OP_MSG opcode 2013), it did not happen for the flagBits. This meant that instead of setting the moreToCome flag (bit 1), we set bit 25, which does absolutely nothing. Because of that, the server helpfully sent a reply (the first sendmsg) in the zSeries dump. Because the driver did not expect such a packet, it did not read it from the socket either.
This meant that when the driver read information from the socket in response to the second insert (the one with a write concern of {w: 1}) it read the response of the first insert. And then it got confused.
In the end, the fix was as easy as making sure that the flagBits field was also correctly swapped between big- and little-endian.
Confusion solved!
Life Line
Created a bench and a crossing
Created 2 benches and a crossing; Updated 5 cushions, a post_box, and a crossing
Created a crossing; Updated a cushion
If you were wondering whether the www.php.net & downloads.php.net services weren't responding very well in the last 6 hours — thousands of requests/sec to https://www.php.net/ 's root.
The server's load was 720, didn't die, but CDN connections to it timed out.
Now there is a caching strategy in place for a selected set of resources.
Updated a bench
Created 3 benches; Updated 10 benches
Updated a bench
Updated a bus_stop
Created a bench and a waste_basket; Updated 6 bus_stops and a crossing
Created 2 waste_baskets and a recycling; Updated 2 bicycle_parkings and a recycling
Updated a fast_food, a funeral_directors shop, and 2 other objects; Confirmed a fast_food and a hairdresser shop
Created an information; Updated 3 benches and 2 waste_baskets
Updated 2 benches and a waste_basket
Updated a bench
Created a waste_basket and an information
Created a waste_basket
I hiked 18.0km in 4h1m52s
I walked 1.4km in 17m19s
I walked 4.5km in 1h21m49s
I just made and ate, a bowl full of bacon fried Brussels Sprouts. Not under duress, and out of my own free will.
Added new residential building
Created a hairdresser shop; Confirmed a convenience shop and a dry_cleaning shop
Created a building_materials shop, a vacant shop, and 4 other objects; Confirmed a hairdresser shop, a cafe, and 2 other objects
I walked 8.3km in 1h33m44s



Shortlink
This article has a short URL available: https://drck.me/c-endian-eab