My default stance on any computer product is “broken until proven working.” I never assume anything will work as advertised until I’ve tested it myself. But sometimes even this level of paranoia is not enough, as illustrated by this past weekend.
tl;dr version: in the past few days, the following basic things have failed on me:
- Intel’s Gigabit ethernet driver
- The Linux boot loader
- Brand-new hard drives
- C library string functions
I should not have to deal with problems with these basic building blocks. This is 2011, not 1989.
I’m going to accelerate the transition of my whole computing infrastructure to the cloud. I’m perfectly happy to pay Amazon staff to handle all these nit-picky problems for me.
Detailed account of what happened, for posterity:
– I wanted to try setting up a VPN with OpenVPN on my Linux server, but I hadn’t compiled the necessary “tun” module into the kernel. No problem, I’ll just recompile it, and might as well upgrade to the latest kernel version at the same time.
– Oops, now my render nodes won’t connect to the network. It turns out the Intel Gigabit Ethernet driver included with the new kernel acts flaky on my hardware. Tried forward-porting the old driver to the new kernel, but there were too many API changes. Gave up and wrote a script that checks the Ethernet connection every 10 minutes and resets it if it’s down.
– Oh, and now the server complains that the kernel image is getting too big for LILO. Well, I guess I might as well join the modern era and upgrade to the new GRUB bootloader.
– Oops. GRUB won’t even install, complaining of some device error, apparently because of changes to how udev exposes devices in the latest kernel. So I guess I can’t use GRUB now, because the version Debian ships isn’t compatible with newer kernels. Gave up, and now I just pray LILO doesn’t fail in the future.
– Got a batch of four new 2TB Western Digital “Green” drives to replace the 250GB drives in my file server. After sliding them in, discovered that read/write performance drops below 10MB/sec and that they time out frequently. (Yes, I checked Google and found lots of reports about the 4KB sector size causing problems, but that’s not my issue – I am SURE my partitions are aligned correctly). No way I’m going to rely on these drives for my server. Ordered new Hitachi drives.
– Brought my Debian packages up to date, including updating glibc from 2.5 to 2.11.
– Oops. Now any C program I compile segfaults immediately upon the first call to any string function (what???). Eventually discovered that the new glibc plays tricks with linker symbols in a way that my older binutils can’t handle. Very disappointed that there is no error message for this – things compile fine, then just refuse to run. Silent failure is a Cardinal Sin of software. Upgraded binutils and all is well again.
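For the curious, the Ethernet watchdog mentioned above looks roughly like the sketch below. This is not the script I actually run, just a minimal Python illustration of the idea: ping something, and if that fails, bounce the interface. The interface name `eth0`, the address `192.168.1.1`, and all the function names are placeholders I made up for this example, and it assumes a Linux box with the `ping` and `ip` commands available and enough privileges to bring a link down and up.

```python
import subprocess
import time

IFACE = "eth0"           # placeholder interface name (assumption)
GATEWAY = "192.168.1.1"  # placeholder address to ping (assumption)
INTERVAL = 600           # seconds between checks (10 minutes)


def link_is_up(host=GATEWAY, run=subprocess.run):
    """Return True if a single ping to host succeeds."""
    result = run(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def reset_link(iface=IFACE, run=subprocess.run, pause=2.0):
    """Take the interface down, wait briefly, and bring it back up."""
    run(["ip", "link", "set", "dev", iface, "down"])
    time.sleep(pause)
    run(["ip", "link", "set", "dev", iface, "up"])


def check_once(run=subprocess.run, pause=2.0):
    """Check connectivity once; return True if a reset was needed."""
    if link_is_up(run=run):
        return False
    reset_link(run=run, pause=pause)
    return True


def run_forever(interval=INTERVAL):
    """Check the link every `interval` seconds, forever."""
    while True:
        check_once()
        time.sleep(interval)
```

To use it, either call `run_forever()` at the bottom of the file and leave it running as root, or call `check_once()` instead and let cron fire the script every 10 minutes. The `run` parameter is only there so the logic can be tried out with a fake command runner before it gets anywhere near a real network interface.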