Mystery downtime

Posted by Ceri Davies Wed, 22 Mar 2006 20:01:00 GMT

At around 2am this morning, shrike died.

shrike is the entire backbone of my setup here. It’s the network router and external firewall, and it runs the database server, the web server, the web proxy server, the CVS server, and the SMTP (in- and out-bound) and IMAP servers.

Since this happened in the early hours, I didn’t notice until Stef called me in work to complain that she couldn’t reach the Internet. Attempts to connect to anything on shrike just timed out, so I told her to power-cycle it.

A couple of minutes later, I attempted to SSH in and see what was up: Connection refused. Waited another five minutes, and attempts to connect just timed out. Five minutes later, connection refused again. Moments later, time-outs. WTF?

Once I got home, I connected up a monitor and a USB keyboard to take a look.

The machine had panic()'d with ffs_valloc: dup alloc, complaining about /var. I booted single-user, fsck'd /var and mounted it. It was 99% full, and there were 28 crash dumps in /var!

It looks like the machine had been rebooting, saving the crash dump, then panic()ing again on the same error, over and over. The underlying problem turned out to be a whole bunch of duplicated blocks on /var. Interesting how panic() and background fsck can act together to completely fill /var!
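For anyone wanting to spot this kind of loop before the disk fills, here is a minimal sketch in Python that counts saved dumps and reports how full a filesystem is. It assumes the dumps are named vmcore.N, as savecore(8) names them on FreeBSD; the function name and the paths in the usage note are my own choices for illustration, not anything shrike was actually running.

```python
import shutil
from pathlib import Path


def crash_dump_report(crash_dir, fs_root):
    """Return (number of saved dumps, percent of the filesystem in use).

    Assumption: dumps are files named vmcore.<number>. Anything else in
    the directory (info.N files, a vmcore.last symlink) is ignored.
    """
    dumps = [
        p for p in Path(crash_dir).glob("vmcore.*")
        if p.name.split(".", 1)[1].isdigit()
    ]
    usage = shutil.disk_usage(fs_root)
    # Guard against filesystems that report a total size of zero.
    percent_used = 100 * usage.used / usage.total if usage.total else 0.0
    return len(dumps), percent_used
```

Calling crash_dump_report("/var/crash", "/var") from cron and mailing yourself when the count climbs would be one way to use it, assuming the dumps land in /var/crash on your system.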

I still have no idea what caused the initial fault, which bugs me but not too much.

Anyway, the point of this is that, since I had a monitor attached for once, I got to try out the new kbdmux(4) stuff and it's superb (read: it does what you'd expect — you plug in a keyboard and it works). Should solve a lot of problems for a bunch of folks.
