Notes on the ongoing recovery from the 14-Jul-2005 meltdown.
Organized in reverse chronological order—if you don't know
what this is all about, scroll down to the bottom and read up!
5-Sep-2005 23:15
John's memorial site, and a call for assistance
I got my father's memorial site back up (and changed the appearance
and function of the retirement letters page into the bargain).
A number of sites I'm hosting for others aren't restored yet. I
suspect that there's been a lack of communication on what needs to be
done; for that I apologize. More usefully, I'll now try to
communicate what needs to be done.
If you've got a site here that I haven't turned back on yet, please
poke me when you're ready to have it turned on. If you can check it
for obvious missing image files (especially JPEGs) first and replace
them, that will probably help it look better when it does come up. If
not, so it goes; I'll turn it on and you can check it for missing
files when it's live.
I'm getting fairly close to declaring the "recovery" period
over.
26-Aug-2005 14:10
Real photos, outside work, and such
Got my real photos section up. Worked on an outside paid project
that came up, and delayed this work some. Cleaned up many little
details for me and others.
The remaining web sites for other people can go live whenever you
say. Because of the high chance of missing images, I do want to
coordinate with you when they go live, though; I want you to be
available to check them out and fix whatever is wrong fairly quickly,
at least to know that there's something needing fixing.
I will soon be declaring the recovery finished. This doesn't mean
I think absolutely everything will be fixed yet, you understand. Just
that the worklist is becoming somewhat indefinite, and update intervals
longer, and it's less and less deserving of its own special
weblog.
17-Aug-2005 00:24
External DVD burner, and mod_perl
After five days mostly mucking about with getting an
external DVD burner working, I've got one, which will make moving data
around and some new types of backup procedure possible. I've had
three different burners and two different external cases installed at
various times, and of course the thing that actually works, finally,
is much the most expensive of the choices.
And I've started work on getting my photo galleries
working, and have cracked the first big problem (caused by the software
changes: I'm upgrading from Perl 5.6 to Perl 5.8, and from mod_perl 1
to mod_perl 2). I can now run simple Perl scripts under mod_perl 2,
and have found where they hid the utility functions.
5-Aug-2005 17:06
Dragaera.info back except for mailing lists
Oh, and booksearch results aren't showing up, but I expect to fix
that shortly.
Mailing lists will be next, and I'm not sure if that will be fairly
simple, or if it will turn out to be a pain. Maybe this weekend, if
it's simple.
On the way, I updated ns2 to Debian 3.1, which means I'm now
running the same software version on both servers. Also, I have
reliable access to the mp3 collection I serve from there, which means
I can now listen to music while working on this.
3-Aug-2005 17:43
Dragaera.info coming back
But slowly. Considerable portions still go to the recovery page.
Also, I updated the software on ns2 to match gw, and got a backup
procedure working (still manually triggered).
A number of users have continued to contribute to the recovery
through the PayPal link on the recovery page. Thank
you very much!
30-Jul-2005 02:05
richard.dyer-bennet.net restored
Another web site back up—and a bit expanded (I've had pending
images from Paul to add for over a year, sigh).
29-Jul-2005 16:31
Many more orphans identified
A bunch of orphans were under the old "httpd" user, meaning they
were created dynamically by Apache. So most of the lost gallery
directories were there, and lots of Chaz's knotware sessions were
there. They've been assigned to the right owner now, and a few
renamed as well.
29-Jul-2005 15:15
Orphans identified
The orphan files that are images have been identified. So in
/home/restore/orphans, the file that (for example) used to be named
"#1458446" was determined to be a jpeg, so it's now named
"#1458446.JPEG". This makes it possible to easily look at it in an
image viewer, for example.
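
A minimal sketch of how that kind of identification could be done, assuming the type is read from the first few bytes of each file (the well-known JPEG, PNG, and GIF signatures). The function names are hypothetical, and this is not necessarily how the files here were actually processed:

```python
import os

# Leading "magic" bytes for a few common image formats.
MAGIC = [
    (b"\xff\xd8\xff", ".JPEG"),
    (b"\x89PNG\r\n\x1a\n", ".PNG"),
    (b"GIF87a", ".GIF"),
    (b"GIF89a", ".GIF"),
]


def image_suffix(path):
    """Return a suffix such as '.JPEG' if the file starts with a known
    image signature, otherwise None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, suffix in MAGIC:
        if head.startswith(magic):
            return suffix
    return None


def tag_orphans(directory):
    """Rename extensionless image files in `directory`,
    e.g. '#1458446' becomes '#1458446.JPEG'. Returns the new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files that already have an extension.
        if not os.path.isfile(path) or os.path.splitext(name)[1]:
            continue
        suffix = image_suffix(path)
        if suffix:
            os.rename(path, path + suffix)
            renamed.append(name + suffix)
    return renamed
```

Calling tag_orphans("/home/restore/orphans") would rename in place. A rename does not change a file's owner, so the ownership information is kept.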
Use "ls -l" to show the owners of the files; the file ownership was
preserved across the disaster, mostly. Most of the loose files in
/home/restore/orphans turn out to be jpegs belonging to Oleg.
Also, a couple more websites are back up, including ctein.com and
photo-repair.com. And new passwords have been installed for the
Minn-StF web team.
28-Jul-2005 20:39
First web sites back up
I've made several web sites public again, after fixing various
things, including photo-repair.com and noguns-nomoney.com (both are
simple, one strictly static html, one with server-side includes but no
CGI execs, making them good places to start).
A reminder to people hosting sites here—the ones I'm not
responsible for maintaining, I'm not going to do all the recovery work
either. I don't know how they work, I don't even necessarily know how
they should look—and I don't have the latest copy of the site
neatly sitting on my computer the way some of you do for your
sites.
28-Jul-2005 00:26
Bad backup files
I went through my orphan files, and found a lot of the missing
snapshot album pages. The photos all exist on my desktop system, but
finding the right ones and moving them over could be excessively
time-consuming, so I'm always glad when more of the files turn up
already organized right for the album pages.
I got samba working so I can more easily move files from my
Windows box to the server. Then I dug out several additional old
backups, and found that most of them were unreadable or incomplete.
This only contributes more to my great love and respect for NTI Backup
Now!. I also figured out some more URL rewriting techniques, so that
I can display the being-fixed version of the page internally but remap
the same URL to the recovery notice for external accesses.
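
The rewriting itself lives in the web server configuration, which isn't reproduced here. As a rough model of the decision those rules make, here is a small Python sketch; the internal network range and the path of the recovery notice are assumptions for illustration only:

```python
import ipaddress

# Assumed values, not taken from the real configuration.
INTERNAL_NET = ipaddress.ip_network("192.168.0.0/16")
RECOVERY_NOTICE = "/recovery.html"


def page_to_serve(url_path, client_ip):
    """Return the path that should actually be served for a request."""
    if url_path == RECOVERY_NOTICE:
        # The notice itself is always served as-is (avoids a rewrite loop).
        return url_path
    if ipaddress.ip_address(client_ip) in INTERNAL_NET:
        # Internal clients see the being-fixed version of the page.
        return url_path
    # Everyone else gets the recovery notice for the same URL.
    return RECOVERY_NOTICE
```

So page_to_serve("/photos/", "192.168.1.10") yields "/photos/", while the same path requested from an outside address maps to the recovery notice.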
And I got my personal homepage working—just the first page, you
understand, possibly the most boring page on the server.
Not much will happen on this tomorrow, at least before the evening.
I've got better things to do.
26-Jul-2005 18:43
Still more email trouble
I'm clearly not as comfortable manipulating the new virtual mailbox
domains, or any other aspect of the email system, as I was with the
old setup. And in fact I think it's actually more difficult. (I
switched from qmail to postfix, for those who don't have a copy of the
program handy.) However, I've just unraveled yet one more little
virtual mailbox email problem.
26-Jul-2005 14:07
Web service enabled
No web content yet, other than these two pages.
26-Jul-2005 00:22
It's starting to show
Web service is just starting to come back — this page being the
first example. Email is flowing okay, and user accounts are starting
to get reactivated. I feel like I'm making progress, even if I
did spend all day yesterday on POP mail pickup issues that
were probably caused by anti-virus POP proxies.
23-Jul-2005 06:07 pm
Server recovery: ready to let shell users back on
Those of you with shell accounts on gw.dd-b.net should please give
me a call, or drop me email if you have a non-gw email account, so we
can get your new password established. I'm ready to let users log in,
and email is working again (though I keep interrupting it to "improve"
various parts of the configuration).
I haven't turned web service back on yet, but that should follow
fairly soon. Even if you only care about your web pages, you might as
well get your new password set up so you'll be ready when it becomes
possible to start fixing your web pages.
Also those of you with virtual email boxes in the ellegon.com,
dreamcafe.com, or dyer-bennet.net domains (email you can get to *only*
via POP3, no shell account)
20-Jul-2005 02:14 am
Progress!
Now all the dependencies are starting to bite. I have local and
local-aliased email working, and I'm running the *huge* task that
compares and merges the 5 different sets of copies of files that I'm
working with into one set of the most recent dated versions (but
preferring the older copies of the same date).
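
A minimal sketch of that kind of merge, not the actual script. It assumes "most recent" means the newest modification time, and reads "preferring the older copies of the same date" as: when two copies have the same timestamp, keep the one from the older set. The function names and the oldest-first ordering of the sets are hypothetical:

```python
import os
import shutil


def pick_versions(copy_roots):
    """Map each relative path to the copy that should be kept.

    `copy_roots` is ordered oldest set first. The newest modification
    time wins; a later set replaces the current choice only if it is
    strictly newer, so ties go to the older set.
    """
    chosen = {}  # relative path -> (mtime, absolute source path)
    for root in copy_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                src = os.path.join(dirpath, name)
                rel = os.path.relpath(src, root)
                mtime = int(os.path.getmtime(src))  # whole seconds
                if rel not in chosen or mtime > chosen[rel][0]:
                    chosen[rel] = (mtime, src)
    return {rel: src for rel, (_mtime, src) in chosen.items()}


def merge(copy_roots, dest):
    """Copy the chosen version of every file into `dest`."""
    for rel, src in pick_versions(copy_roots).items():
        target = os.path.join(dest, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(src, target)  # copy2 also copies timestamps
```

Note that shutil.copy2 keeps timestamps and permission bits but not file ownership, so owners would have to be restored in a separate step.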
Virtual mailbox service, and any kind of pop or imap service, are
another story, for a later date, however.
Once I have all the files together, some kind of web service can
begin, too.
So, basically, tomorrow is going to be the first day with much
externally-visible progress. (Always assuming of course that there are
no further disasters.)
The procedure is going to involve people with accounts on gw
emailing me (if they have an outside email address) or calling to get
their passwords. I'll announce here and on www.dd-b.net (which will be
available again by the time this can happen) when that can start.
19-Jul-2005 01:56 am
Server recovery — disk controller problem confirmed and fixed
There really definitely *was* a disk controller problem. Cost me
most of the day futzing around, and also leaves me wondering if maybe
one of the disks I thought was dead is actually possibly okay,
maybe. But the disk controller on the motherboard appears to be really
truly bad. Luckily I've taken enough disks out that the add-in
controller has enough channels, and it has a BIOS, and disabling the
motherboard controller lets the add-in take over. And everything is
going much better now.
So I just have to do the software installation over again, and make
the manual changes over again (but I've got a record and even the
files of those, so that's not so bad).
And the bad disk controller is very likely the overall cause behind
the disaster, of course.
17-Jul-2005 03:12 pm
Progress is pretty good
For those following the server reconstruction. Just slow. Sometimes it's slow when you have to copy 60 gigabytes of data. Sometimes it's slow when you have to think about things, too. I'm checking off steps as they're accomplished in the original post.
Biggest problem so far is that the drive I expected to use for the new boot drive appears to have failed on the shelf. The replacement I'm using gives some of the same hardware error indications that the original boot drive did, so it's not a long-term solution. I've turned on S.M.A.R.T. monitoring to see how much stuff is going on with that drive. The problem might be a controller issue rather than a drive issue. Either way, though, I've now got some indications of hardware problems in the disk subsystem.
LOTS of stuff will be changing when the system starts to come back. Apache 2.0 instead of Apache 1.3 for the web server, for example. Probably suEXEC, to run CGI scripts as the user owning that site, though that puts new limitations on file protections and path references (no upward path references, meaning no "../"). I'm also seriously considering installing Postfix instead of qmail and Mailman instead of ezmlm, mostly because of issues of ongoing development and support, and also because Debian makes that much easier to do.
16-Jul-2005 13:35:00
Server recovery plan
My web-sites, Joel's ellegon.com, Minn-StF and Minicon, and Chaz's sites are down until the server comes back, as are all associated email addresses (including mine!).
I've got a rough plan for server recovery. I have to balance a bunch of things, including preserving the chance of recovering data from damaged disks, getting some software upgrades done, giving status reports, and getting things working again. There's also often a tradeoff available between hack workarounds that can be implemented fairly quickly, and real fixes that will last but take longer to put in place.
First step was to preserve the data that was "clearly available" — files still obviously on the disks in the right places. That's essentially done.
Now...
- Install a new hard drive for /home: done
- Block-copy the existing /home to a same-size partition on the new drive: done
- Block-copy the existing /home to a same-size partition on the external backup drive (the original /home has not been mounted writable since the problem was discovered, and is being saved for possible drastic measures by one of the users): failed, but may not need to retry
- Replace the drive in the spare swappable tray with a bigger one (from spares): done; controller was bad, replaced that too
- Install Debian 3.1 on the spare swappable tray: done
- Get firewall working: done
- Get DNS working: done (and put in hazelchaz's change for lasfs.org and loscon.org)
- Use fsck and e2undel and perhaps fsdebug to see what else might be recoverable from the /home copies: first pass done
- Destroy /home copy on new drive and start creating replacement /home structure from recovered data: last piece running
- Get minimal web service going, for static pages and very simple CGI
- Get email working for local users: done
- Get email working for virtual users
- Get mailing lists working
- Get more serious web scripting working
- Get backup system upgraded to Debian 3.1
- Get secondary DNS working after upgrade
- Get backup MX handling working on secondary system
- Clean up various little problems (for about the next year)
There are, of course, a bazillion sysadmin issues left after this, but
I can describe the end of the above list as having "recovered" the
server to the extent possible. Further work isn't part of the
"recovery".
This will not, obviously, be finished this weekend. Besides, I'm going over to the Minn-StF picnic in an hour or so (I guess because it's too hot here in the computer room? No, wait, that can't be it).
15-Jul-2005 14:00
Big Time!
My server melted down bigtime the night of 14-Jul-2005. I found some
troubles with the primary IDE controller, and some apparent
strangeness with two of the drives (though that may have been mostly
because of the controller; those drives aren't in the current
configuration, though). I can't rule out a breakin, either; since the
recovery involves reinstalling from scratch from known-good media, it
doesn't matter terribly.
Whatever the cause, lots of disk files were lost from both web and
software directories.
I'm bringing it back up. Slowly, that being the only way I
can.
Those of you interested primarily in the latest changes should see
the recovery journal, where
I add notes as the mood takes me. On this page, I'm trying to
give an overall sense of what happened.
A surprising number of sites beyond my own are hosted here,
including but not limited to
allamagoosa.com,
anime-la.org,
animelosangeles.com,
animelosangeles.org,
boston-baden.com,
bostonbaden.com,
bottag.com,
chaparraltree.com,
clue-server.net,
ctein.com,
daedala.net,
dd-b.net,
demesne.com,
dragaera.info,
dreamcafe.com,
dyer-bennet.net,
e-lie.com,
ellegon.com,
fireopal.org,
gamesaddiction.com,
jinzougen.net,
jinzougen.org,
lacon4.net,
lasfs.net,
lasfs.org,
lehnet.org,
lensman.net,
lighthunters.net,
loscon.org,
maria-rodriguez.com,
minicon.org,
mnstf.org,
mpls.cx,
nhs72.org,
noguns-nomoney.com,
pegkerr.com,
photo-repair.com,
sciencefictionleague.com,
sciencefictionleague.net,
sciencefictionleague.org,
scifiinc.com,
scifiinc.net,
tigersandbeer.com,
ytilaer.com,
and Pamela Dean, Lydia Nickerson, and my own sites on subdomains of
dd-b.net.
They're all down or mostly down or in degraded mode while the recovery
proceeds, including email in those domains. (Some of them have been
permanently or temporarily redirected to other hosting locations at
this point.)
I've spent more time on reconstruction than on forensics, but I
noted no signs of a breakin, and the pattern of behavior seemed
strange for a breakin (certainly they weren't just trying to zombify
me!). That doesn't mean it wasn't a breakin; it's hard to
prove a negative. However, there were definite problems with one of
the disk controllers and two of the drives, too. So my current theory
is that this was a hardware problem.
Much of what was on the disk has been recovered in one form or
another, and restored. There were backups of text files, and a lot
was recovered from the disks.
The state of the backups was not all one could hope, partly because
I was waiting to get a software upgrade accomplished to have access to
an external firewire disk drive to make full-size backups on.
My apologies to all who have been inconvenienced by this (and I
know there are a lot of you!). This is what I meant when I
warned people hosting sites here that I was not offering
commercial-quality service :-(.
This server began as a co-located server I got to set up at
gofast.net when they were still in the basement, in return for serving
as their news admin. That was before it was simple and easy for
anybody to buy web hosting for $4/month, and it seemed only right to
share this unusual new capability. I gradually accumulated quite a
collection of users. Then, eventually, I lost the gofast co-location
(after they were acquired), and I moved the server home on my DSL line
(I got a 768k symmetrical connection; that's slow inbound today, but
it's still fast outbound, and it's still about what I'm running,
though specification creep has it up to more like 900k now). But due
to established habit and friendships, I continued to allow friends to
host their sites here. It was fun to be part of a group of people who
would toss up a web site on short notice for some amusing small
project.
This meltdown and recovery has been quite costly in time and actual
money—new disk controllers, firewire card, several disk drives, for
example. Several people have come forward with big donations, and
another is pending in the "ask for something you need" form. The
people hosting sites here have already been helpful and generous,
which is greatly appreciated. But while I was writing this, I thought
I'd take the opportunity to mention that any users of the sites on
this system could also contribute to the recovery, and to keeping it
running, if they wished.
Donations from users have made it possible to improve the backup
situation considerably, and the complete software reinstall has
probably contributed to security and reliability, and has definitely
allowed some new features to be added. I'd still kinda like to have
my two weeks back, though.
15-Jul-2005 11:36 am
The fucking fucker's fucking fucked
Specifically, my server, which handles web and email for me and far too many other people. There's been some kind of disk subsystem meltdown, and I currently consider it unlikely that either the configuration information or the user data can be recovered.
So I'm going to lunch, and think about it, before I do anything rash that might constrain future recovery efforts.
David Dyer-Bennet <dd-b@dd-b.net>
Last modified: Fri Sep 23 18:37:45 CDT 2005