Memory Malfeasance

A while ago I started getting weird crashes on my desktop machine, Gargleblaster. Once in a while, PHP or Node would crash. Browser tabs kept turning blank, and occasionally Firefox crashed altogether.

At first I thought there was some memory corruption in a system library, but neither Valgrind nor GDB showed any issues, if the problem could be reproduced at all. The crashes were very random, although they went away for a short while after a reboot.

I suspected the worst: Broken memory.

In the past I had used tools like memtest86 and memtest86+, both available as packages on my Debian system. There are some complications with both of these on newer UEFI systems: when I tried them, the system would not even boot. A new version of memtest86+ was supposed to fix this, but that did not work for me either.

I decided to live with it for a while, but after another total loss of tabs (oh dear!), I stumbled upon a different tool: PCMemTest. This did boot, but its documentation page says "The UHCI USB controller is not yet supported", and that controller is needed for USB keyboards.

I was happily surprised that Debian's APT repository also included a package for this memory testing tool. After I installed it, I rebooted my machine to see what it would say. The result:

[Image: PCMemTest showing broken memory]

PCMemTest can generate a line for the GRUB configuration that tells the Linux kernel, while booting, to exclude certain parts of physical memory from use. However, without a working USB keyboard, I could not navigate to that feature.

Then I read that the kernel itself also has a memory test tool built in: the memtest kernel parameter.

To include the memory test when the system boots, add memtest=4 to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub. With the other options I already had, mine now reads:

GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off memtest=4"

Then run update-grub (as root) to regenerate the GRUB configuration.

Now when the system starts, the kernel runs a memory test and automatically excludes any memory that it finds to be faulty. The value 4 is the number of test passes, and each pass uses a different bit pattern.
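After the reboot it is worth confirming that the option actually reached the kernel. The following is a minimal sketch, not part of my original setup: the function names are made up for illustration, and it assumes the value is written as a plain decimal number, as in the line above.

```python
import re
from typing import Optional


def memtest_passes(cmdline: str) -> Optional[int]:
    """Return the number of memtest passes requested on a kernel
    command line, or None if no memtest=N option is present."""
    match = re.search(r"(?:^|\s)memtest=(\d+)(?=\s|$)", cmdline)
    return int(match.group(1)) if match else None


def running_kernel_memtest_passes(path: str = "/proc/cmdline") -> Optional[int]:
    """Check the command line of the running kernel (Linux only).
    /proc/cmdline holds the options the kernel was booted with."""
    with open(path) as handle:
        return memtest_passes(handle.read())
```

Calling running_kernel_memtest_passes() on the rebooted machine should return 4 with the configuration above, and None if update-grub was not run or the option was mistyped.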

On my system, the memory test shows up in the dmesg output like this:

[    0.000000] early_memtest: # of tests: 4
[    0.000000]   0x0000000000100000 - 0x0000000001000000 pattern aaaaaaaaaaaaaaaa
[    0.000000]   0x0000000001020000 - 0x0000000004000000 pattern aaaaaaaaaaaaaaaa
[    0.000000]   0x000000000401e000 - 0x0000000009df0000 pattern aaaaaaaaaaaaaaaa
[    0.000000]   0x0000000100000000 - 0x0000000180000000 pattern 5555555555555555
[    0.000000] ------------[ cut here ]------------
[    0.000000] Bad RAM detected. Use memtest86+ to perform a thorough test
                           and the memmap= parameter to reserve the bad areas.
[    0.000000]   5555555555555555 bad mem addr 0x000000016dbc8450 - 0x000000016dbc8458 reserved
[    0.000000]   0x000000016dbc8458 - 0x0000000180000000 pattern 5555555555555555
[    0.000000]   0x0000000180410000 - 0x0000000727200000 pattern 5555555555555555
[    0.000000]   0x000000072980d000 - 0x000000107f300000 pattern 5555555555555555
[    0.000000]   0x0000000000100000 - 0x0000000001000000 pattern ffffffffffffffff
[    0.000000]   0x000000072980d000 - 0x000000107f300000 pattern 0000000000000000

The line bad mem addr 0x000000016dbc8450 - 0x000000016dbc8458 reserved says that the kernel reserved that 8-byte range, excluding it from use, because the test found it to be broken.
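The kernel message above also hints at the memmap= parameter for reserving bad areas permanently. As a rough sketch of how that could be derived from the log, the code below pulls the bad ranges out of dmesg-style text and rounds each one out to whole pages. The helper names are hypothetical, and the 4 KiB page size is an assumption that holds on typical x86-64 systems. I have not used the resulting parameter on my machine.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

BAD_MEM_RE = re.compile(
    r"bad mem addr (0x[0-9a-fA-F]+) - (0x[0-9a-fA-F]+) reserved"
)


def bad_ranges(dmesg_text):
    """Return (start, end) address pairs for every range the early
    memtest reported as bad in the given dmesg text."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16))
        for m in BAD_MEM_RE.finditer(dmesg_text)
    ]


def memmap_parameter(start, end, page_size=PAGE_SIZE):
    """Build a memmap=SIZE$ADDRESS string covering [start, end),
    rounded outwards to page boundaries."""
    page_start = start - (start % page_size)   # round down
    page_end = -(-end // page_size) * page_size  # round up
    size_kib = (page_end - page_start) // 1024
    return f"memmap={size_kib}K${page_start:#x}"


sample = (
    "[    0.000000]   5555555555555555 bad mem addr "
    "0x000000016dbc8450 - 0x000000016dbc8458 reserved"
)
for start, end in bad_ranges(sample):
    print(end - start, "bytes bad:", memmap_parameter(start, end))
```

For the range in my log this suggests reserving the single 4 KiB page that contains the 8 bad bytes. If you put such a parameter in /etc/default/grub, check how the $ sign needs to be quoted there, as it is likely to be interpreted before the kernel ever sees it.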

Since I rebooted my system with this setting 16 days ago, I have not seen any unexplained crashes. Yay!

At some point I will need to replace this memory, if I find out which of the four memory modules it is. That is a job for some other time.
