The Confused C-Driver
Over the past few months I have been adding the MongoDB driver for PHP to Evergreen, MongoDB's CI system. Although we already test on Travis and AppVeyor, adding the driver to Evergreen also allows us to test on more esoteric platforms, such as ARM64, Power8, and zSeries.
While adding the zSeries architecture to our test matrix, we noticed that one specific group of tests was failing. In these tests we would do an insert with a write concern of {w: 0} followed by an insert with {w: 1}.
A write concern is used to enforce specific guarantees on how many MongoDB servers a write needs to be replicated to, before the primary responds to the client that the write has been acknowledged. A write concern of {w: 0} means that the client does not care about the result of the write operation, and henceforth does not need a reply from the server. These are also referred to as unacknowledged writes.
The test fails in such a way that the result for the second insert (with {w: 1}) seemed empty.
Figuring out what was going wrong was not particularly easy in this case. Single-stepping through a lot of connection and socket handling code with a fairly complicated protocol takes time. After several hours it was still unclear what was going wrong.
I tried to find out whether the data was sent correctly over the wire. As we only have a single shared zSeries development server without a GUI, I could not use Wireshark. Moreover, I could not use tcpdump either, as I didn't have any sudo rights:
$ tcpdump -i lo -nnXSs 0 'port 27017' tcpdump: lo: You don't have permission to capture on that device (socket: Operation not permitted)
After some time I figured out that you can also use strace for finding out what goes over the wire, albeit in a slightly annoying format. I used the following invocation:
strace -f -e trace=network -s 10000 -p 22506
The -f makes strace follow forked processes, the -e trace=network shows the system calls of all network related operations, the -s 10000 makes sure we see 10 000 bytes of data in strings, and the 22506 value for the -p argument is the process ID.
On the zSeries platform, the strace dump looks like the following. The first group is the insert with {w: 0}, and the second group the insert with {w: 1}}:
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\252\0\0\0\2\0\0\0\0\0\0\0\335\7\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\2\0h\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3writeConcern\0\f\0\0\0\20w\0\0\0\0\0\0\0\1,\0\0\0documents\0\36\0\0\0\20wc\0\0\0\0\0\7_id\0\\\31\32\27\r\0200G%yF\322\0", 154}], msg_controllen=0, msg_flags=0}, 0) = 154
sendmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"&\0\0\0v\0\0\0\2\0\0\0\335\7\0\0\0\0\0\0\0\21\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 38}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 38
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\316\0\0\0\4\0\0\0\0\0\0\0\335\7\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\214\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3lsid\0\36\0\0\0\5id\0\20\0\0\0\4\337{QQi\312A\245\213\270\210\376&+\247\260\0\3writeConcern\0\f\0\0\0\20w\0\1\0\0\0\0\0\1,\0\0\0documents\0\36\0\0\0\20wc\0\1\0\0\0\7_id\0\\\31\32\27\r\0200G%yF\323\0", 190}], msg_controllen=0, msg_flags=0}, 0) = 190
sendmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"-\0\0\0x\0\0\0\4\0\0\0\335\7\0\0\0\0\0\0\0\30\0\0\0\20n\0\1\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 45}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 45
The first line is the header, with the last four bytes (\335\7\0\0) being the opcode 2013 (OP_MSG). After the opcode, on the second line follow the flagBits: \0\0\0\2. The flagBits field is \0\0\0\0 for the second insert.
On my local Linux machine, the strace dump looks a little more complicated as the data is split into multiple IOV packets, but similar data is present. The trace is also made with the client and server reversed, so sendmsg and recvmsg are also swapped:
sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\252\0\0\0", iov_len=4}, {iov_base="\2\0\0\0", iov_len=4}, {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\335\7\0\0", iov_len=4},
{iov_base="\2\0\0\0", iov_len=4}, {iov_base="\0", iov_len=1}, {iov_base="h\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3writeConcern\0\f\0\0\0\20w\0\0\0\0\0\0\0", iov_len=104}, {iov_base="\1", iov_len=1}, {iov_base=",\0\0\0", iov_len=4}, {iov_base="documents\0", iov_len=10}, {iov_base="\36\0\0\0\20wc\0\0\0\0\0\7_id\0\\\31\32=\343\231,\25\301\\\203r\0", iov_len=30}], msg_iovlen=11, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 170
sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\316\0\0\0", iov_len=4}, {iov_base="\4\0\0\0", iov_len=4}, {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\335\7\0\0", iov_len=4},
{iov_base="\0\0\0\0", iov_len=4}, {iov_base="\0", iov_len=1}, {iov_base="\214\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3lsid\0\36\0\0\0\5id\0\20\0\0\0\4 \314rM\25\362L\7\203\36O\2157\344\201V\0\3writeConcern\0\f\0\0\0\20w\0\1\0\0\0\0\0", iov_len=140}, {iov_base="\1", iov_len=1}, {iov_base=",\0\0\0", iov_len=4}, {iov_base="documents\0", iov_len=10}, {iov_base="\36\0\0\0\20wc\0\1\0\0\0\7_id\0\\\31\32=\343\231,\25\301\\\203s\0", iov_len=30}], msg_iovlen=11, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 206
recvfrom(4, "-\0\0\0", 4, 0, NULL, NULL) = 4
recvfrom(4, "i\0\0\0\4\0\0\0\335\7\0\0\0\0\0\0\0\30\0\0\0\20n\0\1\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 41, 0, NULL, NULL) = 41
The first big difference is that locally, there is no recvfrom for the first sendmsg, but there is one for the second insert. The other difference is that the flagBits (the second line in each sendmsg) is \2\0\0\0 for the first insert. This is different in the zSeries trace, where the value is \0\0\0\2.
The latest version of the MongoDB wire protocol uses OP_MSG for all operations, which is a typical request/response API. This means that generally every request is expected to generate a reply from the server. For unacknowledged writes, the client does not need a reply from the server, and the way to tell the server that is by setting the moreToCome flag in the OP_MSG packet. It's bit 1 in the 32 bit wide bit field.
The wire protocol requires the use of the little-endian byte order for numbers. libmongoc, which implements the connection aspects of the PHP driver, accomplishes this by running a "swap" in case the driver is run on a big-endian system such as zSeries.
It turned out that although this swap happened for all normal integers (such as the OP_MSG opcode 2013), it did not happen for the flagBits. This meant that instead of setting the moreToCome flag (bit 1), we set bit 25, which does absolutely nothing. Because of that, the server helpfully sent a reply (the first sendmsg) in the zSeries dump. Because the driver did not expect such a packet, it did not read it from the socket either.
This meant that when the driver read information from the socket in response to the second insert (the one with a write concern of {w: 1}) it read the response of the first insert. And then it got confused.
In the end, the fix was as easy as making sure that the flagBits field was also correctly swapped between big- and little-endian.
Confusion solved!
Life Line
Updated a waste_basket
Created 11 benches, 2 life_rings, and 3 other objects; Updated 8 benches and a waste_basket; Deleted a bench and a log; Confirmed a cafe
I walked 7.8km in 1h52m27s
Tomorrow we have elections in the UK!
Lots of local authorities, all London Councils, the Welsh Senedd, and the Scottish Parliament.
Don't forget to vote if you have the right.
I get to vote for myself again 😎.
Benches, and corrections for the QE II Gardens
Addresses on College Road
Created 2 main entrances and an entrance; Updated an entrance, a residential building, and a house building
Created an apartments building and a main entrance
I walked 9.5km in 2h21m10s
Created a waste_basket
I walked 6.6km in 1h13m23s
On my walk from Aylesbury to Princes Risborough I spotted a few new bird species. I didn't get all the best photos though!
A Common Buzzard, a Yellow Wagtail, a Greater White throat, and a Green Woodpecker.
#photography #Birds #BirdPhotogaphy #BirdsOfMastodon #nature #Buckinghamshire
Updated an alcohol shop
I walked 9.4km in 1h58m25s
Updated 2 benches
Created a bench; Updated a bench
I hiked 19.0km in 4h35m50s
I hiked 19.0km in 4h35m50s
I walked 6.8km in 1h15m36s
Updated an estate_agent office
I walked 4.1km in 55m33s
I walked 1.1km in 10m05s
My First Lapwing!
I went to the London Wetland Centre yesterday, for a day out in nature.
While hiding in a hide, this chap and a friend showed up starting to forage for grubs.
#BirdPhotography #BirdsOfMastodon #Photography #Birds #London #Nature
Created a waste_basket; Updated a cafe and a restaurant; Confirmed an estate_agent office
I walked 6.6km in 1h8m53s



Shortlink
This article has a short URL available: https://drck.me/c-endian-eab