Recovery Journal

Notes on the ongoing recovery from the 14-Jul-2005 meltdown.

Organized in reverse chronological order—if you don't know what this is all about, scroll down to the bottom and read up!

23-Sep-2005 18:35

I hereby declare success!

Not everything apparently lost has been recovered. Not everything once served from here is still served from here. However, I'm through my to-do list of getting things back up, so I hereby declare the "recovery" complete.

This recovery log will be terminated; any future sysadmin reports, whether related to recovering from the crash, or just normal keeping up with maintenance, will go in the server bulletins.

5-Sep-2005 23:15

John's memorial site, and a call for assistance

I got my father's memorial site back up (and changed the appearance and function of the retirement letters page into the bargain).

A number of sites I'm hosting for others aren't restored yet. I suspect that there's been a lack of communication on what needs to be done; for that I apologize. More usefully, I'll now try to communicate what needs to be done.

If you've got a site here that I haven't turned back on yet, please poke me when you're ready to have it turned on. If you can check it for obvious missing image files (especially JPEGs) first and replace them, that will probably help it look better when it does come up. If not, so it goes; I'll turn it on and you can check it for missing files when it's live.

I'm getting fairly close to declaring the "recovery" period over.

26-Aug-2005 14:10

Real photos, outside work, and such

Got my real photos section up. Worked on an outside paid project that came up, and delayed this work some. Cleaned up many little details for me and others.

The remaining web sites for other people can go live whenever you say. Because of the high chance of missing images, I do want to coordinate with you when they go live, though; I want you to be available to check them out and fix whatever is wrong fairly quickly, at least to know that there's something needing fixing.

I will soon be declaring the recovery finished. This doesn't mean I think absolutely everything will be fixed yet, you understand. Just that the worklist is becoming somewhat indefinite, and update intervals longer, and it's less and less deserving of its own special weblog.

19-Aug-2005 01:19

Snapshots 2004 and 2005, and some real photos

The 2004 and 2005 snapshots have been recovered, with no real difficulty except the amount of fiddly little stuff to be paid attention to. I've also restored the orphans into my Lighthunters photo gallery.

18-Aug-2005 15:33

My photo galleries, and BOINC

Not that my photo galleries are fully restored, far from it. But the software is now working, updated to function with mod_perl 2 and Apache 2. I've gone through all my orphan files and put back into place everything I can (though the gallery may not show them yet). Now I have to track down the ones that are still missing and replace them from my desktop system where all the photos really live.

BOINC is a distributed computing platform used for various scientific projects, started by the people who do SETI-at-home. I was starting to feel bad about all those CPU cycles just going to waste! (I'd been running SETI on the previous incarnation of the server.)

17-Aug-2005 00:24

External DVD burner, and mod_perl

After five days mostly mucking about with getting an external DVD burner working, I've got one, which will make moving data around and some new types of backup procedure possible. I've had three different burners and two different external cases installed at various times, and of course the thing that actually works, finally, is much the most expensive of the choices.

And I've started work on getting my photo galleries working, and have cracked the first big problem (due to the software changes: I'm upgrading from Perl 5.6 to Perl 5.8, and from mod_perl 1 to mod_perl 2). I can now run simple Perl scripts under mod_perl 2, and have found where they hid the utility functions.

13-Aug-2005 01:24

Einblatt mailing list

Put the Minn-StF monthly calendar mailing list back together, with the members Scott had saved (plus one request made since then).

11-Aug-2005 23:32

Dragaera, Natter, and Concom email lists on

And, perhaps even more important, the basic process of making virtual mailing lists is working.

There are three others I'm consulting on, and something like three more I should do something about soon.

10-Aug-2005 23:47

First Email List Turned On—Minn-StF's "natter" list

Virtual mailing lists in Mailman are a crock, a hack, a bag on the side of the damned thing. However, I think I have my first one working.

5-Aug-2005 19:23

Dragaera.info booksearch, and dbpromo.dd-b.net

Knocked off the booksearch problem, and another whole web site (reinstalled from CVS and built, no problems found).

5-Aug-2005 17:06

Dragaera.info back except for mailing lists

Oh, and booksearch results aren't showing up, but I expect to fix that shortly.

Mailing lists will be next, and I'm not sure if that will be fairly simple, or if it will turn out to be a pain. Maybe this weekend, if it's simple.

On the way, I updated ns2 to Debian 3.1, which means I'm now running the same software version on both servers. Also, I have reliable access to the mp3 collection I serve from there, which means I can now listen to music while working on this.

3-Aug-2005 17:43

Dragaera.info coming back

But slowly. Considerable portions still go to the recovery page.

Also, I updated the software on ns2 to match gw, and got a backup procedure working (still manually triggered).

A number of users have continued to contribute to the recovery through the PayPal link on the recovery page. Thank you very much!

30-Jul-2005 02:05

richard.dyer-bennet.net restored

Another web site back up—and a bit expanded (I've had pending images from Paul to add for over a year, sigh).

29-Jul-2005 16:31

Many more orphans identified

A bunch of orphans were under the old "httpd" user, meaning they were created dynamically by Apache. So most of the lost gallery directories were there, and lots of Chaz's knotware sessions were there. They've been assigned to the right owner now, and a few renamed as well.

29-Jul-2005 15:15

Orphans identified

The orphan files that are images have been identified. So in /home/restore/orphans, the file that (for example) used to be named "#1458446" was determined to be a jpeg, so it's now named "#1458446.JPEG". This makes it possible to easily look at it in an image viewer, for example.

Use "ls -l" to show the owners of the files; the file ownership was preserved across the disaster, mostly. Most of the loose files in /home/restore/orphans turn out to be jpegs belonging to Oleg.
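For anyone curious how that kind of identification can work, here is a minimal sketch. It is not the procedure actually used (the journal doesn't record that); it assumes image types can be recognised from the well-known signature bytes at the start of the file, and the function names `image_extension` and `tag_orphans` are made up for illustration.

```python
import os

# Leading "magic" bytes for a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": ".JPEG",
    b"\x89PNG\r\n\x1a\n": ".PNG",
    b"GIF87a": ".GIF",
    b"GIF89a": ".GIF",
}

def image_extension(path):
    """Return an extension if the file starts with a known image signature, else None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, ext in SIGNATURES.items():
        if head.startswith(magic):
            return ext
    return None

def tag_orphans(directory):
    """Append an extension to recognised images in `directory`; return {old name: new name}."""
    renamed = {}
    for name in sorted(os.listdir(directory)):
        old = os.path.join(directory, name)
        if not os.path.isfile(old) or os.path.splitext(name)[1]:
            continue  # skip subdirectories and files that already have an extension
        ext = image_extension(old)
        if ext:
            os.rename(old, old + ext)  # a rename leaves owner and modification time alone
            renamed[name] = name + ext
    return renamed
```

Calling `tag_orphans("/home/restore/orphans")` would turn a JPEG named "#1458446" into "#1458446.JPEG" and leave unrecognised files untouched, so ownership as shown by "ls -l" stays meaningful.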

Also, a couple more websites are back up, including ctein.com and photo-repair.com. And new passwords have been installed for the Minn-StF web team.

28-Jul-2005 20:39

First web sites back up

I've made several web sites public again, after fixing various things, including photo-repair.com and noguns-nomoney.com (both are simple, one strictly static html, one with server-side includes but no CGI execs, making them good places to start).

A reminder to people hosting sites here—the ones I'm not responsible for maintaining, I'm not going to do all the recovery work either. I don't know how they work, I don't even necessarily know how they should look—and I don't have the latest copy of the site neatly sitting on my computer the way some of you do for your sites.

28-Jul-2005 00:26

Bad backup files

I went through my orphan files, and found a lot of the missing snapshot album pages. The photos all exist on my desktop system, but finding the right ones and moving them over could be excessively time-consuming, so I'm always glad when more of the files turn up already organized right for the album pages.

I got samba working so I can more easily move files from my Windows box to the server. Then I dug out several additional old backups, and found that most of them were unreadable or incomplete. This only contributes more to my great love and respect for NTI Backup Now!. I also figured out some more URL rewriting techniques, so that I can display the being-fixed version of the page internally but remap the same URL to the recovery notice for external accesses.
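The real remapping lives in the web server's rewrite rules, which aren't reproduced here. As a rough illustration of the decision only, the sketch below serves the requested page to internal clients and the recovery notice to everyone else. The network range, the notice path, and the function name `resolve` are all assumptions for the example, not values from this server.

```python
import ipaddress

# Hypothetical values: which network counts as "internal" and where the
# recovery notice lives are assumptions made for this sketch.
INTERNAL_NET = ipaddress.ip_network("192.168.0.0/16")
RECOVERY_PAGE = "/recovery.html"

def resolve(client_ip, requested_path):
    """Return the path to serve for a request coming from `client_ip`."""
    if ipaddress.ip_address(client_ip) in INTERNAL_NET:
        return requested_path  # internal: show the being-fixed page
    return RECOVERY_PAGE       # external: same URL, recovery notice instead
```

The point of doing it this way is that the public URL never changes; only the answer depends on who is asking, so nothing needs to be renamed when a site goes live again.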

And I got my personal homepage working—just the first page, you understand, possibly the most boring page on the server.

Not much will happen on this tomorrow, at least before the evening. I've got better things to do.

26-Jul-2005 18:43

Still more email trouble

I'm clearly not as comfortable manipulating the new virtual mailbox domains, or any other aspect of the email system, as I was with the old setup. And in fact I think it's actually more difficult. (I switched from qmail to postfix, for those who don't have a copy of the program handy.) However, I've just unraveled yet one more little virtual mailbox email problem.

26-Jul-2005 14:07

Web service enabled

No web content yet, other than these two pages.

26-Jul-2005 00:22

It's starting to show

Web service is just starting to come back — this page being the first example. Email is flowing okay, and user accounts are starting to get reactivated. I feel like I'm making progress, even if I did spend all day yesterday on POP mail pickup issues that were probably caused by anti-virus POP proxies.

23-Jul-2005 06:07 pm

Server recovery — Ready to let shell users back on

Those of you with shell accounts on gw.dd-b.net should please give me a call, or drop me email if you have a non-gw email account, so we can get your new password established. I'm ready to let users log in, and email is working again (though I keep interrupting it to "improve" various parts of the configuration).

I haven't turned web service back on yet, but that should follow fairly soon. Even if you only care about your web pages, you might as well get your new password set up so you'll be ready when it becomes possible to start fixing your web pages.

Also those of you with virtual email boxes in the ellegon.com, dreamcafe.com, or dyer-bennet.net domains (email you can get to *only* via POP3, no shell account)

20-Jul-2005 02:14 am

Progress!

Now all the dependencies are starting to bite. I have local and local-aliased email working, and I'm running the *huge* task that compares and merges the 5 different sets of copies of files that I'm working with into one set of the most recent dated versions (but preferring the older copies of the same date).
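The selection rule in that merge can be sketched roughly as follows. This is an illustration rather than the actual script: it assumes each set of copies is a directory tree, that "date" means the file's modification time in whole seconds, and that the trees are listed oldest set first. The name `pick_versions` is hypothetical.

```python
import os

def pick_versions(roots):
    """Map each relative path to the root that holds its newest copy.

    `roots` is ordered oldest copy set first. On equal modification
    times the earlier (older) set wins, because only a strictly newer
    timestamp replaces the current choice.
    """
    chosen = {}  # relative path -> (mtime, root)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                mtime = int(os.stat(full).st_mtime)  # whole seconds
                if rel not in chosen or mtime > chosen[rel][0]:
                    chosen[rel] = (mtime, root)
    return {rel: root for rel, (_mtime, root) in chosen.items()}
```

The result only says which tree to take each file from; the copying itself would be a separate pass (for example with `shutil.copy2`, which keeps timestamps), which makes it possible to review the plan before anything is written.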

Virtual mailbox service, and any kind of pop or imap service, are another story, for a later date, however.

Once I have all the files together, some kind of web service can begin, too.

So, basically, tomorrow is going to be the first day with much externally-visible progress. (Always assuming of course that there are no further disasters.)

The procedure is going to involve people with accounts on gw emailing me (if they have an outside email address) or calling to get their passwords. I'll announce here and on www.dd-b.net (which will be available again by the time this can happen) when that can start.

19-Jul-2005 01:56 am

Server recovery — disk controller problem confirmed and fixed

There really definitely *was* a disk controller problem. Cost me most of the day futzing around, and also leaves me wondering if maybe one of the disks I thought was dead is actually possibly okay, maybe. But the disk controller on the motherboard appears to be really truly bad. Luckily I've taken enough disks out that the add-in controller has enough channels, and it has a BIOS, and disabling the motherboard controller lets the add-in take over. And everything is going much better now.

So I just have to do the software installation over again, and make the manual changes over again (but I've got a record and even the files of those, so that's not so bad).

And the bad disk controller is very likely the overall cause behind the disaster, of course.

17-Jul-2005 03:12 pm

Progress is pretty good

For those following the server reconstruction. Just slow. Sometimes it's slow when you have to copy 60 gigabytes of data. Sometimes it's slow when you have to think about things, too. I'm checking off steps as they're accomplished in the original post.

Biggest problem so far is that the drive I expected to use for the new boot drive appears to have failed on the shelf. The replacement I'm using gives some of the same hardware error indications that the original boot drive did, so it's not a long-term solution. I've turned on S.M.A.R.T. monitoring to see how much stuff is going on with that drive. The problem might be a controller issue rather than a drive issue. Either way, though, I've now got some indications of hardware problems in the disk subsystem.

LOTS of stuff will be changing when the system starts to come back. Apache 2.0 instead of Apache 1.3 for the web server, for example. Probably SUExec, to run CGI scripts as the user owning that site, but that puts new limitations on file protections and path references (no upward path references, meaning no "../"). I'm also seriously considering installing Postfix instead of qmail and Mailman instead of ezmlm, mostly because of issues of ongoing development and support, and also because Debian makes that much easier to do.

16-Jul-2005 13:35:00

Server recovery plan

My web-sites, Joel's ellegon.com, Minn-StF and Minicon, and Chaz's sites are down until the server comes back, as are all associated email addresses (including mine!)

I've got a rough plan for server recovery. I have to balance a bunch of things, including preserving the chance of recovering data from damaged disks, getting some software upgrades done, giving status reports, and getting things working again. There's also often a tradeoff available between hack workarounds that can be implemented fairly quickly, and real fixes that will last but take longer to put in place.

First step was to preserve the data that was "clearly available" — files still obviously on the disks in the right places. That's essentially done.

Now...

  1. Install a new hard drive for /home [done]
  2. Block-copy the existing /home to a same-size partition on the new drive [done]
  3. Block-copy the existing /home to a same-size partition on the external backup drive (the original /home has not been mounted writable since the problem was discovered, and is being saved for possible drastic measures by one of the users) [failed, but may not need to retry]
  4. Replace the drive in the spare swappable tray with a bigger one (from spares) [done; controller was bad, replaced that too]
  5. Install Debian 3.1 on the spare swappable tray [done]
  6. Get firewall working [done]
  7. Get DNS working [done] (and put in hazelchaz's change for lasfs.org and loscon.org)
  8. Use fsck and e2undel and perhaps fsdebug to see what else might be recoverable from the /home copies [first pass done]
  9. Destroy /home copy on new drive and start creating replacement /home structure from recovered data [last piece running]
  10. Get minimal web service going, for static pages and very simple CGI
  11. Get email working for local users [done]
  12. Get email working for virtual users
  13. Get mailing lists working
  14. Get more serious web scripting working
  15. Get backup system upgraded to Debian 3.1
  16. Get secondary DNS working after upgrade
  17. Get backup MX handling working on secondary system
  18. Clean up various little problems (for about the next year)

There are, of course, a bazillion sysadmin issues left after this, but I can describe the end of the above list as having "recovered" the server to the extent possible. Further work isn't part of the "recovery".

This will not, obviously, be finished this weekend. Besides, I'm going over to the Minn-StF picnic in an hour or so (I guess because it's too hot here in the computer room? No, wait, that can't be it).

15-Jul-2005 14:00

Big Time!

My server melted down bigtime the night of 14-Jul-2005. I found some troubles with the primary IDE controller, and some apparent strangeness with two of the drives (though that may have been mostly because of the controller; those drives aren't in the current configuration, though). I can't rule out a breakin, either; since the recovery involves reinstalling from scratch from known-good media, it doesn't matter terribly.

Whatever the cause, lots of disk files were lost from both web and software directories.

I'm bringing it back up. Slowly, that being the only way I can.

Those of you interested primarily in the latest changes should see the recovery journal, where I add notes as the mood takes me. On this page, I'm trying to give an overall sense of what happened.

A surprising number of sites beyond my own are hosted here, including but not limited to allamagoosa.com, anime-la.org, animelosangeles.com, animelosangeles.org, boston-baden.com, bostonbaden.com, bottag.com, chaparraltree.com, clue-server.net, ctein.com, daedala.net, dd-b.net, demesne.com, dragaera.info, dreamcafe.com, dyer-bennet.net, e-lie.com, ellegon.com, fireopal.org, gamesaddiction.com, jinzougen.net, jinzougen.org, lacon4.net, lasfs.net, lasfs.org, lehnet.org, lensman.net, lighthunters.net, loscon.org, maria-rodriguez.com, minicon.org, mnstf.org, mpls.cx, nhs72.org, noguns-nomoney.com, pegkerr.com, photo-repair.com, sciencefictionleague.com, sciencefictionleague.net, sciencefictionleague.org, scifiinc.com, scifiinc.net, tigersandbeer.com, ytilaer.com, and Pamela Dean's, Lydia Nickerson's, and my own sites on subdomains of dd-b.net. They're all down or mostly down or in degraded mode while the recovery proceeds, including email in those domains. (Some of them have been permanently or temporarily redirected to other hosting locations at this point.)

I've spent more time on reconstruction than on forensics, but I noted no signs of a breakin, and the pattern of behavior seemed strange for a breakin (certainly they weren't just trying to zombify me!). That doesn't mean it wasn't a breakin; it's hard to prove a negative. However, there were definite problems with one of the disk controllers and two of the drives, too. So my current theory is that this was a hardware problem.

Much of what was on the disk has been recovered in one form or another, and restored. There were backups of text files, and a lot was recovered from the disks.

The state of the backups was not all one could hope, partly because I was waiting to get a software upgrade accomplished to have access to an external firewire disk drive to make full-size backups on.

My apologies to all who have been inconvenienced by this (and I know there are a lot of you!). This is what I meant when I warned people hosting sites here that I was not offering commercial-quality service :-(.

This server began as a co-located server I got to set up at gofast.net when they were still in the basement, in return for serving as their news admin. That was before it was simple and easy for anybody to buy web hosting for $4/month, and it seemed only right to share this unusual new capability. I gradually accumulated quite a collection of users. Then, eventually, I lost the gofast co-location (after they were acquired), and I moved the server home on my DSL line (I got a 768k symmetrical connection; that's slow inbound today, but it's still fast outbound, and it's still about what I'm running, though specification creep has it up to more like 900k now). But due to established habit and friendships, I continued to allow friends to host their sites here. It was fun to be part of a group of people who would toss up a web site on short notice for some amusing small project.

This meltdown and recovery has been quite costly in time and actual money—new disk controllers, firewire card, several disk drives, for example. Several people have come forward with big donations, and another is pending in the "ask for something you need" form. The people hosting sites here have already been helpful and generous, which is greatly appreciated. But while I was writing this, I thought I'd take the opportunity to mention that any users of the sites on this system could also contribute to the recovery, and to keeping it running, if they wished.

Donations from users have made it possible to improve the backup situation considerably, and the complete software reinstall has probably contributed to security and reliability, and has definitely allowed some new features to be added. I'd still kinda like to have my two weeks back, though.

15-Jul-2005 11:36 am

The fucking fucker's fucking fucked

Specifically, my server, which handles web and email for me and far too many other people. There's been some kind of disk subsystem meltdown, and I currently consider it unlikely that either the configuration information or the user data can be recovered.

So I'm going to lunch, and think about it, before I do anything rash that might constrain future recovery efforts.


David Dyer-Bennet <dd-b@dd-b.net>
Last modified: Fri Sep 23 18:37:45 CDT 2005