July 6, 2004

Server Crash and Site Rebuild

It has been a very, very, very long week.

A week ago today, the server which hosts this web site crashed. The disk drive failed, refusing to ever boot again. And as I sat staring at the screen, wondering if I needed to fly back to San Francisco (from London), my worry turned to horror as I realized that I hadn’t backed up in a very very long time. Since…ouch…the end of October. Yes, eight long months ago.

Now before you start chastising me – I’ve done quite enough of that myself this past week, thank you very much – stop what you’re doing right now, and back up whatever it is you haven’t been backing up. Buy a big fat hard disk, connect it to your computer, and make a backup. I went and bought a 250GB disk from Western Digital. I partitioned it into 4 x 60GB partitions, and now I back up every night before I go to bed (well, most every night…). If you’re using a Mac, get the free SilverKeeper backup program from LaCie – it’s pretty good. And if you’re using Windows, the built-in “Backup” program is free too (Start -> Programs -> Accessories -> System Tools -> Backup).
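A nightly copy like this can also be scripted. The following is a minimal Python sketch, not the setup described above: the function name `nightly_backup`, the `backup-YYYY-MM-DD` folder naming, and the default of four retained copies are all assumptions made for illustration.

```python
import shutil
from datetime import date
from pathlib import Path


def nightly_backup(source: Path, backup_root: Path, keep: int = 4) -> Path:
    """Copy `source` into a dated folder under `backup_root`.

    Only the newest `keep` dated folders are retained (keep must be >= 1).
    The folder naming and retention count are illustrative assumptions.
    """
    target = backup_root / f"backup-{date.today().isoformat()}"
    # dirs_exist_ok (Python 3.8+) lets a second run on the same day
    # refresh that day's copy instead of raising an error.
    shutil.copytree(source, target, dirs_exist_ok=True)

    # ISO dates sort chronologically, so a plain sort puts the oldest first.
    snapshots = sorted(p for p in backup_root.glob("backup-*") if p.is_dir())
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

Run once a night from a scheduler, this keeps a handful of recent copies on the external disk, so one bad backup does not overwrite the only good one.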

I own the hardware that runs this site (and yes, there are times when I wonder why I own my own hardware – times like this actually). The server is parked in a colo in San Jose, in the cage of a friend who runs a medium-sized web site. I have two machines, a 1U IBM eServer 330, and a 4U no-name box that’s mostly full of air (and is a heck of a lot slower than the IBM).

I called the colo to have them reboot the 1U with a monitor attached. It wouldn’t get past the BIOS, failing with a 19990301, an error code that shows up exactly once on the IBM web site with this informative piece of information: “19990301 - Hard disk boot failed. Run Setup Utility. Hard disk drive.” Gee thanks, that’s helpful.

After a couple of hours back and forth with the nice folks at the colo, after my fifth cup of worry coffee, and after moaning some more to Rachel, I wondered if I could at least get myself up and limping again using the 4U box. Hmmm. I’d really only used it to run secondary DNS, but it did have an old copy of my mail server program, and it had a web server…maybe…just maybe, I could go from 100% disaster, to only 98% disaster. Heck, it was worth a try.

I fired up Terminal Client and connected to the 4U box. DNS was out (looks like I’d misconfigured something, it shouldn’t have been out, but I’ll worry about that later), so no one on the net could find my web sites; of course that wasn’t really a problem since there were no longer any web sites to find. My first piece of luck was that the secondary DNS had stored dns files for all of my domains, so I fired up the DNS server user interface, switched all the domains from secondary to primary, and with that the breath of internet life had resurrected my domains.

Next I re-started the mail server. It was a version from about 18 months earlier, but it still worked. I added a couple of domains and accounts that were missing. And then I went back to the DNS server and updated all the mail server entries to point to the ip address of the 4U box.

6 hours later and I had DNS and mail working again. For the first time I thought that I might not have to fly back to San Francisco after all…

The thing I was most worried about was that I didn’t have a backup of the writing I’d done on my A Year In Cornwall weblog. I had all the pictures, but very little of the writing. I remembered reading about someone else re-creating their web site by using the Google cache, so I fired up my web browser and typed in “A Year In Cornwall”. Sure enough, there was the home page. Then I typed in “A Year In Cornwall archive” and there was one of the monthly archive pages. And so one month at a time, I downloaded the text of all of the entries in my weblog. Whew – got it all – thank you Google!

Now on to the web sites. That was going to be a bit harder, mostly because so much had changed since October. I started by opening the October 31 backup. The backup had most of the web sites, and though they’d all been updated since the backup, the folder structure was pretty much the same. I copied them all to the 4U box, and started creating web sites, one at a time. I’d forgotten I had so many sites – 18 in all if you include things like the database admin site, the stats site, and the blogging admin site. As I was doing that I also upgraded PHP and MySQL to the latest versions. Then I copied the database files from the October backup and got the database up and running. I even installed phpBB for my support bulletin board. And then I tried to create the ftp sites I needed. And that’s when I got stuck.

Creating ftp sites. Can there be a more complicated, convoluted and poorly-documented feature in all of Microsoft’s IIS web/ftp server? If there is, I’d like to know what it is, because it took me 5 hours – 5 HOURS – to get half a dozen ftp accounts set up. First the user accounts, then the permissions, then the ftp sites. And while it all seems to work now, there’s a big red “ERROR” icon next to each ftp site in the ftp console window. Why? Damned if I know – I just hope it doesn’t mean something like “your server is now open for attack because you’ve misconfigured everything”.

So there it was Thursday, two days later, and things were looking a lot better than they had on Tuesday.

In the meantime there’s the issue of what to do about the hardware. I scrambled around trying to find a way to get the dead server up and running again, without flying all the way back to San Francisco. I called some of my old workmates at Wired – they’d all been laid off – and the two I was able to contact were too busy to make the trek to San Jose. A bunch of phone calls and emails later and I realized how few friends I have that could actually do the work. I thought about calling my brother Chris, trying to lead him through hooking up the monitor and rebooting the machine, but then I envisioned him with his tool belt on – he’s a general contractor – and, after a particularly frustrating moment, taking one of his hammers to the equipment. Noooooo, maybe I’d better find another way. Luckily, it turns out that the colo I use, AboveNet, has a Level 2 tech-support program, which means that if I can get them new disks, they can do the reinstall.

So I jumped on eBay, and started looking for disks. Unfortunately the disks I needed were quite specialized – they’re hot swap disks – and it was important that I get the right drive, right tray and right type of connector. More unfortunately, few people put all the pertinent serial numbers on their items, so searching for disks was very hit and miss. That and the fact that I needed to “Buy It Now”, not wait for an auction to complete. After many hours of surfing eBay I finally found two disks in Minnesota that could be overnighted to San Jose (and crossed my fingers that they were the right disks). Then I ordered Norton SystemWorks from PCConnection on the off-chance that it could recover the data from the disk. And by Thursday night all the hardware and software I needed was on its way to San Jose.

Back to the software setup. Web stats was next. I use Analog and ReportMagic for the reporting, along with QuickDNS for doing ip-to-host lookups, and Stats Automator for IIS to make it a one click process. The hardest thing here is getting all the various config files talking to each other. Analog has to output data in the right format for ReportMagic. QuickDNS has to run before Analog and cache the dns lookups in the right place. And the Automator has to be customized to put the stats reports where I want them. Another 4 hours.
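The ordering constraint here (lookups first, then analysis, then report generation) is the kind of thing a small wrapper can enforce. Below is a hedged Python sketch of that idea only. It does not reproduce what Stats Automator does, and the script names in the usage comment are hypothetical placeholders, not real command lines for Analog, ReportMagic, or QuickDNS.

```python
import subprocess
from typing import Sequence


def run_in_order(steps: Sequence[Sequence[str]]) -> int:
    """Run each command in turn and stop at the first failure.

    Returns the number of steps that completed successfully, so the
    caller can tell which stage broke the chain.
    """
    done = 0
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            break  # a later step would only process stale or missing data
        done += 1
    return done


# Hypothetical usage; these script names are placeholders for illustration:
# run_in_order([
#     ["dns-lookups.bat"],     # cache ip-to-host lookups first
#     ["analyze-logs.bat"],    # then crunch the logs
#     ["build-reports.bat"],   # then format the reports
# ])
```

Stopping at the first failure matters because each stage reads what the previous one wrote. If the lookup step fails, running the analysis anyway would quietly produce reports from an old cache.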

Come Friday, the disks arrived in San Jose via morning FedEx. But the software didn’t show up until mid-afternoon. Rudy, the tech who was going to do the work, had to leave early and wouldn’t be back until Tuesday morning (oh yeah, I forgot, it’s Fourth of July weekend in the States). I’d talked with him on the phone and got the feeling that he really knew what he was doing, so after thinking about it for a bit I said, ok, I want you to do the work, so let’s wait until Tuesday to try to recover the disk. I figured at that point it didn’t really matter whether things came up Friday or Tuesday. If the disk was recoverable it was worth recovering. And if it wasn’t recoverable, then I might as well keep doing what I was doing with the 4U box since it was going to be next week before all the web content was restored anyway.

Finally Tuesday arrived, and Rudy tried to recover the disk. Norton SystemWorks booted the server, but it couldn’t see the disk. “Ok” I said, “let’s just reinstall the OS on the new disks.” An hour later Rudy called back and said everything was installed and connected. I logged on with Terminal Server, updated to the latest patches, rebooted several times, and set up disk mirroring using the two new disks. Hopefully disk mirroring will make it much less likely that what happened will happen again. With disk mirroring, if one disk fails the other will keep running. And because they’re hot swap drives, it’s possible to buy another drive, stick it in, and have the mirroring continue on the new disk automatically. Personally I’m hoping that I never have to test that scenario – that like rain in London, where the fact that you’re carrying an umbrella means that it’s much less likely to rain, having mirrored disks means it’s much less likely to ever have a failure.
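For anyone who hasn’t met disk mirroring before, the behavior described above can be modeled in a few lines of Python. This is a toy sketch only: the class `MirroredPair` and its methods are invented for illustration, and dictionaries stand in for the disks. A real mirror lives in the operating system or the disk controller.

```python
class MirroredPair:
    """Toy model of a two-disk mirror: every write goes to both disks."""

    def __init__(self) -> None:
        # Each "disk" maps block number -> data; None marks a failed disk.
        self.disks = [{}, {}]

    def write(self, block: int, data: bytes) -> None:
        # Write to every disk that is still alive.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block: int) -> bytes:
        # Any surviving disk can serve the read.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise IOError("both disks have failed")

    def fail(self, index: int) -> None:
        self.disks[index] = None

    def hot_swap(self, index: int) -> None:
        # The replacement disk is rebuilt from the surviving one.
        survivor = self.disks[1 - index]
        if survivor is None:
            raise IOError("no surviving disk to rebuild from")
        self.disks[index] = dict(survivor)


pair = MirroredPair()
pair.write(0, b"weblog entry")
pair.fail(0)                 # one disk dies...
data = pair.read(0)          # ...and reads still succeed from the other
pair.hot_swap(0)             # the new disk is rebuilt from the survivor
```

One thing the model makes plain: a mirror copies everything, including deletions and mistakes, to both disks. It protects against a dead drive, not against a missing backup.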

So there it is – seven very long days to get my servers up and running again. But what I think is most amazing about it is that it could be done at all. 10,000 miles away. All it took was eBay, FedEx, PCConnection, Amazon.com, a broadband connection, a web browser, email, a telephone and a credit card.

“But what about the web sites?” I can hear you saying. “Why are they still not up? And why do they look different?” Ahhh, that’s for the next entry.

Posted by: Frank @ 11:15 pm

1 Comment »

  1. Hey Frank, thanks for the vote of confidence in my ability to help you out in your time of need… you forgot to mention that my hairy knuckles drag on the ground too!!! Seems to me I’ve made several of your homes look pretty nice, and that took a hell of a lot more patience than a couple of hours in a cage in San Jose… :) see you in a few weeks on the Cape…

    Comment by Chris Leahy — July 20, 2004 @ 5:11 am
