"Sometime" was finally determined to be about 2:30pm today.
External net connectivity was down from about noon until -- sometime. Qwest says it should be back on fairly soon.
Both gw and ns2 will be down for a moderate period of time after 10pm this evening to exchange a UPS and reroute some power wiring.
Crashed at about 18:55, and I didn't notice and reboot until 19:43 or some such. The last valid stuff in the log was BOINC stuff, but on the console there was some virtual memory mapping trouble showing.
I declare victory in the recovery from the 15-Jul meltdown. Not that everything is recovered; not even all my own sites are back up. But I'm now past the point of considering "recovery" to be the top priority.
So, anybody with stuff here that they still think I need to do stuff to, please get in touch with details.
Kernel crash on something apparently while running SETI-at-home in idle time. Down from 20:35 until 21:10 (CDT).
Total server meltdown, resulting in some hardware replacement and a complete software rebuild, and a lot of lost files. It has a separate journal here.
4-Apr-2005 23:50: Some more restores, index.cgi and index.htm where available, and some linked files that got destroyed through the links.
4-Apr-2005 18:08: I see they hit some other filenames as well; I hadn't even realized the server recognized index.htm, but apparently some people were using it. Also index.cgi. I'm expanding my replacement campaign. So be careful; if you replace it yourself you will prevent my replacing it from the backup. This may or may not be what you want; think it through.
Well, that was an undisputed hack. Somebody came in and changed all the index.html files. I've restored those from backups, which takes things back to January. Some files had no backups; they're listed below. Remember that index.html files not on that list have still been reverted to an old state, and you may need to restore or recreate them by some method from information you have.
And of course there's nothing to prevent them from doing it again any time, until I get the software rebuilt, which is taking absolutely fucking forever.
/home/httpd/RKBA/p32/index.html
/home/httpd/lacon3/lacon3/Funds/index.html
/home/httpd/olegv/visitors/mj/index.html
/home/httpd/olegv/photosinc/index.html
/home/httpd/olegv/fuzzykittens/index.html
/home/httpd/olegv/vanderbilt/index.html
/home/httpd/mpls.cx/html/learncss/index.html
/home/httpd/ellegon/bl-20030813-2221/mark/index.shtml_files/index.html
/home/httpd/animela/html/anime-la/bbs-bin/Database/index.html/index.html
/home/httpd/animela/html/anime-la/2005/bbs/uploads/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/avatars/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/email/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/emoticons/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/mime_types/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/Skin/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/Skin/Default/images/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/ssi/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/sys-img/index.html
/home/httpd/animela/html/anime-la/2005/bbs/non-cgi/team_icons/index.html
Had the system down briefly at 3 to work on some filesystem issues. For the next month or so it's probably a very good thing for you to keep your backups up-to-date. Also, if you have compiled code in your home or web directory spaces (anything in the /home partition), think about how you could live without it, or how you'd go about rebuilding it quickly, or what package I could supply to give you the same code only centrally installed, because when the big system rebuild comes along, existing binary code in /home is probably going away.
A rash of hits on missing pictures in a photo gallery is causing massive system load (due to the recovery processing of that gallery). I've had to shut down the web server a couple of times to keep other things working.
System was down from about 8 until 9 this evening, it crashed on a kernel panic and then I had to fix some filesystem corruption.
Email is now moving again. Log rotation doesn't happen yet, but nobody but me is likely to notice or care. Outbound email connections were for a while being monopolized by bounces to a Las Vegas spammer (mediadreamland.com) which has their mail servers set up to infinitely delay anybody trying to mail them. I firewalled them, and that cleared up fairly fast; but if you for some reason need to make a tcp connection to any IP in 188.8.131.52/19, it'll fail right now.
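For anyone wondering whether an address they care about falls inside that firewalled /19, here is a rough Python sketch. It takes the network exactly as written in the entry above; the helper name is_blocked is made up for illustration, and the real firewall rule may of course be expressed differently.

```python
import ipaddress

# The address as written has host bits set, so strict=False is needed
# to let Python round it down to the enclosing /19 network.
blocked = ipaddress.ip_network("188.8.131.52/19", strict=False)


def is_blocked(host: str) -> bool:
    """Return True if host falls inside the blocked /19 (hypothetical helper)."""
    return ipaddress.ip_address(host) in blocked


print(blocked)                  # 188.8.128.0/19
print(blocked.num_addresses)    # 8192
print(blocked[0], blocked[-1])  # 188.8.128.0 188.8.159.255
```

A /19 leaves 13 host bits, so the block covers 8192 addresses, not just the spammer's own mail servers.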
Something *major* has gone wrong, or we've been hacked. The thing I'm clearly noticing is that the entire log directory and everything in it was missing just now (I've recreated the directories). No doubt there are other things funny elsewhere. Email is not moving right now (just because the log directories are missing).
I don't think I've ever seen the load average hit 144 before! I don't know yet what was going on, but the system was thrashing horribly. I was able to shut it down cleanly and reboot.
Something different went wrong! Variety is the spice of life, they say. Starting at 10:42 tonight, something was going wrong in accessing the main /home partition. Messages were logged indicating attempts to access beyond the end of the device. A reset was necessary, and fsck found errors on the disk. Half a dozen files may even have been lost.
Mail system has reached nearly its old state, after being torn apart to deal with duplicates for a while. The new old state is better than the old old state, because there's no longer an every-10-minutes process to move mail from ns2 to gw (and note that ns2 is the primary MX for most domains I'm handling, so most mail ends up there).
System was off the net for a couple of minutes around 4:30pm CDT to replace the main household switch; the old one was flaking and had a bad fan.
We appear to have had a *hub* outage that cut the server off from the DSL line for an hour, from 10 to 11 this morning local time.
Web service crashed just after midnight last night. Nothing anomalous in any log file. Corwin restarted it a bit before 9 this morning.
Another power glitch, while I'm sitting here at the console. No system crashed, again. I'm feeling *even* better about the battery replacement having solved the UPS problem.
We've now had several power glitches that triggered the UPSs since I replaced the battery, and GW hasn't gone down. So I'm willing to conclude that that problem, at least, is resolved.
Another power glitch. GW is now running on NS2's UPS. NS2 is running without a UPS. I'll try replacing the battery, and see how that works.
We've had at least three more power-related outages, lasting from seconds to a few dozen minutes (I've been lucky to be by the computer several times when it dropped). I guess I've got another bad UPS; and that battery was replaced only a couple of years ago, I thought. Maybe it's been longer. When things open up tomorrow I'll have to look into that.
System down from 03:54 to 12:02 on a power failure. (It also glitched briefly slightly earlier, but rebooted without trouble). There were thunderstorms, and the lights flickered quite significantly.
System down from 8:55 to 11:28 on some kind of power failure (the other systems rode it out, but this one was down).
System crashed at about noon on a kernel panic again; came back up like a lamb when I reset it, and was back online by 12:45.
System down from 9:05am until 11pm today. There appears to have been some sort of power or power supply failure. The box was found powered off, with no log evidence of trouble leading to the crash.
System down from 3:20pm to 3:37pm on some kind of kernel crash. No records in log, fragmentary records on-screen; may relate to memory management, swap space, or memory errors.
Disk upgrade went okay, so there's lots of space. Please hold off a little longer on loading lots of new stuff -- I haven't tested the backup disk connected via Firewire yet, so we don't currently have anything to back up onto!
System will be down for a medium to long amount of time starting sometime after 10pm Central Time tonight. I'm going to try putting in the new disk drive and the controller for the backup disk. Wish me luck.
System was down from about 7am CST until 5pm CST. So far as I can tell it shut down on a short power glitch that the UPS for whatever reason didn't adequately protect it from; no indication of other problems anywhere around then in the log.
System hung sometime overnight, I had to push reset to get it back this morning. Last real entry in the log appears to be at 9:14, and it got restarted about 9:38. No trace of a *reason* that I can find yet.
I'm playing with some email changes. Some of my playing may possibly result in delayed emails, or possibly (if I make mistakes) in some spurious bounces. I apologize in advance. And I'm changing the anti-spam provisions, which also will change what gets through and what does not. I hope it will result in less spam, and more of the spam blocked before acceptance.
One URL hosted here is apparently listed in this month's Playboy. Another is on Slashdot (I see the references in the log, but I can't find the actual link there yet). So far we're surviving....
System got partially wedged this morning, don't know what really happened yet. Had to repair disks to get it going again. The logs seem to show it was wedged from shortly after midnight until nearly noon today, though low-level services were still running this morning. Several files seem to have been lost; probably earlier, actually, and just uncovered by fsck on the reboot.
Internet connectivity has been somewhat intermittent this afternoon. I believe the immediate cause is some denial-of-service activity against gw3.mpls.cx, which lives here.
Had some more DNS trouble off and on late last night and today. I've upgraded to bind 9.2, and am now running it under supervise, so it should restart in future when it crashes.
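For those curious what "running it under supervise" buys us: the idea is simply that a parent process starts the service, waits for it to exit, and starts it again. The Python sketch below illustrates that idea only; it is not daemontools and not how the server is actually configured. The function name, the max_restarts parameter and the one-second delay are all assumptions made for the example.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=None, delay=1.0):
    """Run command, restarting it whenever it exits.

    Returns the number of times the command was started. With
    max_restarts=None the loop never ends, which is the normal case
    for a supervised service.
    """
    starts = 0
    while True:
        starts += 1
        status = subprocess.call(command)  # blocks until the service exits
        print(f"{command[0]} exited with status {status}", file=sys.stderr)
        if max_restarts is not None and starts > max_restarts:
            return starts
        time.sleep(delay)  # don't spin if the service dies immediately


if __name__ == "__main__":
    # Demo: a "service" that crashes at once, restarted twice.
    crashing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(crashing, max_restarts=2, delay=0))  # 3
```

The supervised program has to stay in the foreground for this to work; if it forks into the background, the parent sees an immediate exit and keeps restarting it.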
gw was down from 10:10pm to 11:40pm Central time tonight to install additional memory. The memory install went without a hitch -- and then the system refused to recognize the boot drive. Replacing the cable cleared it up.
gw will be down from 10pm to 11pm Central time tonight to install additional memory.
The home/web directory partition is getting critically full. Do you have accidental, unexplored, directory subtrees, old copies of your web site, and so forth hanging around? This'd be a great time to clean them up!
Power glitch confused something in the network infrastructure, and lots of connection weirdnesses ensued. Rebooting seems to have cleared it up.
Server was down from 03:40 this morning until about 10. An actual hardware crash, so far as I can tell, and (so far) no signs of damage.
Email backed up from about 1pm until about 6pm, on what looks like a very strange cause which I'm still investigating.
Down for 10 minutes just before 23:00 CDT to put on repaired UPS and run some quick tests. It passed tests, repeatedly, that it had failed before the repair, so I'm cautiously optimistic about that.
We were down from 04:22 until 08:10 because a power glitch knocked down GW through the UPS (the other systems stayed up fine).
Again at 16:30; this time my windows box went too.
We were off the net for half an hour, from 18:28 until 19:04, because the cats knocked the networking equipment over while I was upstairs at dinner. I didn't notice until I tried to do stuff afterwards.
Down for a few minutes just before 17:30. Again, the other systems stayed up, but this one shut down. I double-checked the wiring, and it actually is plugged into the UPS that's supposed to be covering it. I suppose that means I have to debug a UPS problem now. Drat.
Down from 16:54 to 17:36, roughly, due to a power failure. The UPS didn't kick in in time, I guess. One other computer, upstairs, had the same problem, so perhaps something about the profile of the power loss and recovery tricked them somehow. No sign of any damage.
Once again there's some sort of massive disk subsystem meltdown. 3 of the 4 drives in the system can't be found even at the BIOS level. I'm investigating possible heat problems (which involves considerable waiting, unfortunately). Disks are unfindable on two different controllers, which makes it hard to come up with a reasonable theory to explain.
I have replaced the power supply, as the most likely thing to cause this sort of weirdness (multiple drives not visible across multiple controllers), especially in the light of my previous runarounds with this problem where controller and drive faults were investigated fairly thoroughly.
Shut down for about 10 minutes at about 11:30 CDT to move gw from the desktop up to the shelf. For a wonder, it seems to have gone smoothly.
We were down from about 10:30 until 4am, due to a "simple" software upgrade gone bad. I'm not sure the actual thing being upgraded is fixed yet, either, so there may well be more trouble ahead.
Web service was down from about 20:20 to 22:30, for no apparent reason. The rest of the system was up, other services were working, apache would shut down and restart itself, netstat showed apache listening on port 80. But no web service. Very weird.
Power failure for something like 20 minutes, ending about 1pm. Systems are now back up okay (1:10).
I believe I have the nightly Analog statistical run working again, so the report linked from the statistics page at www.dd-b.net/stats will be changing once again.
Scheduled downtime successfully completed. I'll need some additional downtime some morning later this week and will announce it when I figure out when.
The server will be down tomorrow, Tuesday, for probably a couple of hours in the early morning CDT. Early morning is unlikely to be before 9am. I am, for once, performing a safe operation -- doing a clean install into a new partition to have a clean system available for comparison and replacement of corrupted files. I'll be intercepting web traffic and directing it to a status page during the interruption.
I may have fixed the problem with logging in; I've been replacing and locking down corrupted software packages, and at least *I* can now log in. It'll depend somewhat on what you have put in your startup scripts.
Mongo drive problems all of which came to nothing (i.e. no signs of lost data). Heat problems seem the most likely (otherwise it would have to be 3 drives at once, or 2 controllers at once, or a power supply that can't handle 4 drives in normal use but *can* handle 5 under torture).
The system will still be up and down for a while, as lingering problems get ironed out. And there's still some sort of software issue going on....
I've now seen three occurrences tonight of a problem where logins get blocked after the system has been up a little while. So far, email and web service have continued when that has happened. I don't know what's causing it yet.
[11:03] I'm seeing some failures to create new processes, but only in some contexts. Currently running just a single processor, so loads may feel a bit high. Waiting to get the new SMP kernel, although the problem description doesn't actually sound like what I'm seeing here.
I've made a drastic change to how email coming into the box is handled (using the users/assign mechanism, for those who care). Nothing *should* be visibly changed....
System will be taken down repeatedly on short notice, mostly at the least convenient possible times, this afternoon and probably evening. Corwin and I are trying to move dreamcafe web files over off the disk they're currently on, and resolving anything that gets in the way.
System was effectively down due to a transient disk or controller error from about 03:18 this morning until 09:30. No data lost (because it was on the /home partition, which is mirrored).
/cgi-bin/FormMail.cgi now available again. I've put up an improved version. This new version will *not* accept submissions from outside domains (a worthless security feature, but it came that way), and it also will not send email outside the local domains. So if you need to send email via formmail to an address I'm not hosting or forwarding, you need to get me to add it to the list of allowed email addresses for formmail.
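The recipient rule described above amounts to an allow-list check. The Python sketch below shows the general shape of such a check; it is not the code in the installed FormMail script, and the function name, the example.org domain and the sample addresses are all placeholders.

```python
def recipient_allowed(address, local_domains, extra_addresses=frozenset()):
    """Decide whether a form may send mail to address.

    Allowed if the domain is one hosted locally, or if the full address
    was added to the explicit allow-list. Anything that looks like an
    attempt to smuggle in extra recipients or headers is refused.
    """
    address = address.lower()
    # Commas, semicolons and whitespace (including newlines) have no
    # business in a single recipient address.
    if any(ch in address for ch in ",; \t\r\n"):
        return False
    if address.count("@") != 1:
        return False
    local_part, domain = address.split("@")
    if not local_part or not domain:
        return False
    return domain in local_domains or address in extra_addresses


# Placeholder configuration, for illustration only.
LOCAL_DOMAINS = {"example.org"}
EXTRA_ADDRESSES = {"friend@elsewhere.example"}

print(recipient_allowed("webmaster@example.org", LOCAL_DOMAINS, EXTRA_ADDRESSES))     # True
print(recipient_allowed("victim@elsewhere.example", LOCAL_DOMAINS, EXTRA_ADDRESSES))  # False
```

Checking the recipient on the server side, rather than trusting whatever the request supplies, is what keeps the script from being usable as a general-purpose mailer.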
Pop email collection is now working!
I have removed FormMail.cgi from the /cgi-bin directory because somebody on the web has decided exploiting FormMail is an interesting way of sending spam. I would have thought the overhead was too high, myself, but they were doing it. It's too powerful a script -- every aspect of the email can be set in the request, so you can exploit it for general-purpose mailing.
Well, some progress has been made. Basic system functions are secure and working. DNS service is fully functional. Web service is restored; though there may well be undiscovered bugs lurking. Email service is not yet restored, and web email won't work. User logons are not yet restored.
I expect to finish restoring web service and user logons today. Email may be restored today.
9:30pm: I've reenabled users, but email is not yet turned on. I needed users on *first* since email arriving when I first turn that on could cause trouble otherwise. Yes, that means I think I'm close.
9:44pm: Email is turned on. I've verified it as well as I can before I turn it on, but I'm still testing. The pending email from the downtime should start showing up in about 7 minutes. Mailing lists are the most uncertain element at this point.
System going down at 1pm for major rebuild; should be back up tonight or tomorrow. (I'm making a major effort to be back up by 9am Thursday).
I will finally be taking the system down tomorrow to rebuild it and get rid of the lingering traces of the breakin. I expect to shut it down sometime in the early afternoon. I hope to give at least a half hour warning before shutting it down.
Once I shut it down it will be down for some hours or days, depending how well or poorly things go.
I've probably learned rather a lot while rebuilding ns2, so it is perhaps not *completely* crazy to hope things will go well this time.
During that time, all the email accounts and web pages handled through gw will not be available.
Incoming mail was down for a good part of the day. I rebooted and fixed it at about 9pm. This, unfortunately, demonstrates quite clearly the need for a drastic update, which I now hope to carry out this week (having failed to do so last week; I'm learning things while updating the secondary server). The system may be down most of the day Wednesday for that rebuild.
Definitely still unstable. Expect another day of downtime later this week, though maybe I'll get lucky. I'm going to have to steel-wool some bits of this disk I think.
System was down since last night; I caught a partially failed hack attack, and have been trying to clean up the mess. I'll probably be pulling the system down now and then for the next week or so -- as some of you know, recovering a root-compromised system by any method other than formatting the disks, unplugging the network, and installing from known-good read-only media is highly problematic at best.
As of noon, local email isn't being delivered. Based on one recent experience, a reboot should cure this, but since it's happened before I want to try to figure out WTF is going ON first, so it may be a little bit before email starts being delivered again.
As of 13:15, I seem to have gotten mail flowing again without a reboot. I know more than I did before, but NOT enough.
Rebooted about 21:50 to bring up new kernel.
The servers were down from about 7:30 to 8:15 this morning due to a power failure (which started somewhat earlier, but it lasted longer than the UPS does).
Email was held up for some amount of time until about 4pm this afternoon. I don't know what the cause of the problem was, but it appears to have been cleared (by brute force).
The servers were down from 15:45 to 16:11 due to a power failure. Shutdown and restart were clean (there is a UPS, just not enough battery to cover the duration of this power failure), so nothing much ought to have broken.
Finally got the last of the bugs out of /home; at least I finally got fsck to run twice in a row with no errors. So this could mean another batch of lost or damaged files. Probably nothing that wasn't clearly damaged already, though.
Well, things are about back to normal; anything found from here out I'm going to deal with as some new bug. Thanks for your patience!
Qwest seems to have had some internal network trouble that prevented most (but not absolutely all) connections into and out of my systems from about 21:30 CDT until about 23:00.
ftp, telnet for selected IP addresses, and nethack are now working. ftp sometimes seems to take a long time to start up, for reasons I don't understand. So does smtp, but that's not a user interactive service (normally) and besides I think I know why that is.
ssh 2 connections are very slow to start up (it uses a different protocol; I think it's actually an inferior protocol, but it is not patent-encumbered). ssh 1 is still supported, but software may be defaulting to 2 now that I support 2 here, so you may have switched to 2 without realizing it. The various public-key authorization options are different for ssh 2, also.
22:56. Most things except web mail are working at least as viewed from the outside. At least my web areas don't seem badly damaged. A lot of software upgrades have gotten done, which is good.
00:56. Good news, but read carefully. Things are mostly back up, but that's *mostly*. Email is flowing in and out again. Mutt is known to work; haven't tried anything else. Elm probably isn't even there, and may well not come back. User accounts have been turned back on.
There's still a lot of recovery to do, plus the things I shoved aside to get this far this fast.
Off the top of my head, some things I know aren't back, fixed, or whatever yet: nn isn't installed. Elm isn't installed. Nethack isn't installed. Managing and reading the virtual email domains on the web isn't back yet. The web interface to the mailing list archives isn't back yet. There is no telnet access (even for the places that once had special holes). There is no ftp access (but ssh users can probably find the scp or pscp programs that come with various ssh clients, to do what ftp does, only securely).
There are probably a lot of things not on that list that also aren't back yet; but I'm less likely to rip your head off for asking about them.
Anybody who was running a crontab, you have to reinsert it (use the crontab program) to start it running again. I see one of you has figured that out already; about 5 others still need to do it....
There's the possibility of some files lost from the user and web disks (/home is the mount point). I *think* the files that ended up in lost&found were files I put on that drive somewhat irregularly for safekeeping during the recovery, rather than older user files. But I can't prove it. So you may be missing some files, and if you are there's some small chance that we can find them in lost&found.
00:25, so just barely the 10th.
Be sure to read the entries for 8 and 9 Oct 2001.
I've just finished testing email and mailing lists. Based on that test, email to real users and to virtual pop users works, and mailing lists should work.
The web interface to email hasn't been reinstalled yet, but that should happen during the day tomorrow.
Mail sent over the last two days was delivered to the backup mail system, and I've just turned it loose to send the 2000 some messages over. Since a lot of them are to mailing lists, the outbound traffic is getting pretty fierce (and the load average has just hit 9).
Looking good, but I need to configure mail programs and things before letting users back on.
Be sure to read the entry for 8-Oct-2001.
While nothing visible has really changed, progress is being made. There are a lot of software packages that go into the email subsystem, and they all have to be configured, built, and installed, and I have to make sure they're working just right before I turn email back on (or else email arriving into the bad configuration will be lost). So this is the hardest and slowest part of the process. It's the part I'm working on now. I have hopes of getting email back on today (the old configuration is still available in archives, so I know for sure what worked then), but testing will be very laborious.
Hoo boy; where to begin. I'm rebuilding the server pretty much from scratch, although user data is mostly preserved (looks like at most 500 files lost, and probably far fewer because I think I know what most of the ones lost were). Suffice it to say that I tried to fix something that wasn't really urgent.
Web is starting to creep towards working, but there are lots of little bits broken everywhere. Discarded bits overflowing the bit bucket, in fact. Static pages are mostly okay, but server-side includes are missing. Scripts may actually work.
Selected user passwords have been restored, for people with web stuff on the server. Email isn't working, remember, and there's probably other things too.
Email isn't working yet.
The only thing that worries me is that I seem to have errors on the user disk again, and I don't know where they came from, and that disk had nothing to do with the problem that started this.
Qwest says the DSL line problem is fixed. It might be; at least it's been staying up for hours at a stretch instead of minutes. There may still be a problem with the router locking up periodically, which may be related to code-red worm activity (excessive probes). I've made a couple of reconfigurations suggested at web sites that the Qwest tech is not allowed to recommend to people, but which he did mention to me :-) .
More intermittent DSL problems, most of the day (starting at 5am, though there have been breaks of an hour or so sometimes). Qwest thinks it can be Code-Red related, trying to hit the web server in the Cisco 675 router. Mine was set to accept internal IP connects only, but I've disabled it completely just on principle. We'll see if it helps. Also upgraded to 2.4.1 and 2.4.2 of the router code. So far it's still bouncing, and the problem has been left in the hands of Qwest tech support.
We seem to have been having intermittent DSL problems, starting about, oh, around midnight last night, some still ongoing at 11:26 this morning.
There was a DSL outage at about 21:45; and some confusion among the systems as to what was going on for a while after it came back, too. I'm still not sure precisely what was going on; may have been mostly in my head. Seems to be okay at midnight, anyway.
Hmm; on closer log examination, the DSL trouble may have started an hour or more earlier.
The CPU fan replacement is completed.
The CPU fan on one of the CPUs is dead; just noticed it today, don't know how long that's been true. I'm shutting down tonight to remove it, and I hope there will be a very brief shutdown tomorrow to put the replacement into service. I hope the system will otherwise remain up, but there's some possibility the processor will fry of course.
We had a DSL outage for a while around 4, with a few cycles before it settled down. And the server itself got wedged, but I brought it back while the DSL was down.
Well, I'm starting to get this system back together. Let me know if anything you host here isn't working. Once again the symbolic links have danced around, and some may be broken.
There shouldn't be a shortage of disk space for a while now.
The server will be down this afternoon (Sunday) from about 2pm until later (perhaps much later) to move to a different physical box with more disk space.
Server automatically rebooted at 09:57 CST, probably via software watchdog. It was back up by 10:05 CST.
The server was down from 04:15 to 10:15 due to some disk-related freeze (nothing logged, though).
Two unexplained crashes this morning; in the first there was some sort of dump on the screen, but nothing went into the logs. In the second, the screen wouldn't light up.
The server was down from about 8:45 until 11:00 CST. It appears that a kernel data structure related to disk allocation became corrupted somehow.
The server was down mysteriously from 12:20 until 13:20 CST this afternoon.
DSL down intermittently (but mostly) from midnight to 7 this morning.
System down from 17:05 CST to 18:47 CST for no apparent reason.
CST Email is a bit stable again.
People who read mail exclusively on the server will find things essentially unchanged, except that there's yet another mail client available (I've installed mutt).
People who pick up their mail from here exclusively via pop need to make some changes. NOTE: some of you *forward* your email from here to *another server* where you then pick it up via pop. I'm not talking about you. I'm talking about people who pick it up directly from gw via pop. I'll have a script to automate this soon. What you need to do is create a maildir (/var/qmail/bin/maildirmake) named ~/Maildir, then mv Mailbox oldmail, then ~ddb/mbox2maildir.pl oldmail Maildir, then echo ./Maildir/ > .qmail . If you already have a .qmail file then don't do that last step blindly, it'll overwrite what's already there! Oh, and rm oldmail at the end.
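Until the promised script exists, here is a rough Python sketch of the same steps, for the curious. It is not the script I'll be supplying, and it uses Python's standard mailbox module in place of maildirmake and mbox2maildir.pl, so treat it as an illustration of the procedure, not a tested replacement. The function name and the choice to refuse when .qmail already exists are my assumptions. It leaves oldmail in place, so you can check the result before removing it.

```python
import mailbox
from pathlib import Path


def convert_mbox_to_maildir(home: Path) -> int:
    """Convert home/Mailbox to home/Maildir; return the number of messages."""
    mbox_path = home / "Mailbox"
    old_path = home / "oldmail"
    maildir_path = home / "Maildir"
    dot_qmail = home / ".qmail"

    # Don't blindly overwrite an existing .qmail, as warned above.
    if dot_qmail.exists():
        raise FileExistsError(f"{dot_qmail} already exists; edit it by hand")

    # Equivalent of "mv Mailbox oldmail".
    mbox_path.rename(old_path)

    # create=True makes the cur/, new/ and tmp/ subdirectories,
    # which is the job maildirmake does.
    maildir = mailbox.Maildir(str(maildir_path), create=True)
    source = mailbox.mbox(str(old_path), create=False)
    count = 0
    try:
        for message in source:
            # MaildirMessage() carries over read/replied flags where it can.
            maildir.add(mailbox.MaildirMessage(message))
            count += 1
    finally:
        source.close()
        maildir.close()

    # Equivalent of "echo ./Maildir/ > .qmail".
    dot_qmail.write_text("./Maildir/\n")
    return count


if __name__ == "__main__":
    print(convert_mbox_to_maildir(Path.home()), "messages converted")
```

As with the manual steps, mail that arrives between the rename and the .qmail change still gets delivered to a fresh Mailbox file, so check for one afterwards.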
The short-term workaround for pop users is to log on and read locally using mutt for now (if you're not comfortable with that sketch of how to convert that I gave above) until I get the script ready. Write me to let me know you're waiting on the script; I may need to convert individual users manually instead of writing a script, or at least it might be simpler.
People who read *both* via pop *and* locally have a small problem. To read via pop, you must convert to receiving your mail in maildir format. Currently, the only local mail reader that understands maildirs is mutt. So, you need to give up reading via pop, or change to using mutt to read locally (mutt defaults to elm-compatible commands, and is highly configurable).
13:30 CST email is currently turned off; no new mail will be received, and outbound mail will not be sent. I'm working on the mail upgrade.
15:42 CST email is currently back on, and mostly caught up. I'll probably be blocking it for short periods again later, though.
17:06 CST People who pick up their email from gw via POP are about to lose that capability. To recover it, we'll need to go through a process of changing to "maildir" rather than "mbox" delivery for your mail, which in turn will affect the tools you can use to read directly on gw (if you do).
I'll be doing some major software adjustments this weekend; the server will be up and down (mostly up), and email service will be confused. There's also the possibility that I'll break something spectacularly....
It's possible that some of the mail reading tools people have been using will become unavailable in some cases. I'm going to be making a change that requires all POP users to have their mail delivered to maildirs rather than mailbox files. If you read mail only via pop, this won't make much difference; but if you read both via pop and locally on the server, it will.
When things have settled down, I'll describe here and in a MOTD what's changed; you might not be able to read your mail here then.
The DSL link was down from 04:26 until 05:05 this morning (CST). And 06:57 until 07:04.
Upgraded network card to good PCI 100Base-T (3com); system was down about 20 minutes. I still plan to change the video card, and relocate the system, later today or tomorrow.
Still haven't done the server maintenance that I keep threatening; but I *have* added some vents to the wall behind the computers, and put them on pull-out trays so I can reach the back more easily. Friday or Saturday I still hope to change the server graphic and network cards, and put it up on the shelf where it belongs (finally freeing up my desk). I hope.
Having failed to last night, I hope to take the server down later tonight, maybe around 9pm Central, for some hardware replacements and physical rearrangement (from the desk to a shelf). It should be back up by midnight, I hope.
I plan to take the server down later tonight, maybe around 9pm Central, for some hardware replacements and physical rearrangement (from the desk to a shelf). It should be back up by midnight, I hope.
The server rebooted just before midnight for no apparent reason. Before that, there was a brief DSL outage, from 11:44 to 11:50. After, there were a couple of additional brief DSL outages, 11:59 - 00:05, 00:07 - 00:13, 00:15 - 00:19, 00:40 - 00:42, and finally 01:13 - 01:25.
DSL or router down from about 11 last night until 10:30 or
DSL was down this morning, from 11:17 until about 1pm, CDT.
Finally actually implemented the telnet block at 14:20 CDT today.
Telnet access to gw will be terminated later tonight (well, maybe tomorrow). ssh remote terminal access will continue to be available. Search for "ssh client" on Yahoo or dmoz to find a range of free and commercial ssh clients for Windows and Mac systems. The best commercial one for the PC is SecureCRT from VanDyke Software. A free one that I know works is PuTTY. I know at least one of you is in the position of being unable to install software on your work computer. I'm willing to make special exceptions to allow telnet from certain IP addresses in that case; I've already put that in for one of you.
Added a virtual domain for Oleg's firearms pages, and have been playing around with various mappings to get it and the links to and from it to work right. Some of you may have seen intermediate states that didn't work; for that matter there could be real problems even now.
The DSL line or router was down from 02:34:24 until about 9:30 CDT. It seemed to come up briefly a bit earlier, but died again. I hope it stays up this time!
The DSL line went down at 13:49 CDT. I've just (16:23) entered a repair order with Qwest, no clue when they'll actually repair it. Last week, it recovered fairly quickly after I placed the order, and then a technician called a day later. (Came back up about 18:05)
The DSL line was bouncy for a bit from 10:10 until 10:25 CDT this morning; some of you may have lost connections or had trouble connecting then.
The DSL line was essentially down from about 12:30 until about 2pm CDT. Some small number of connections may have succeeded during this time, but 80% of the packets received were rejected because of CRC errors.
We had a power failure, and the system was down from about 8:30 until about 11:00 this morning.
Disk cabling wasn't the problem either; it's crashed twice since then. I also found that the CPU cooling fan was out (it's original, which makes it over 4 years old). I've replaced it. Nothing left to replace at this point but the drive, the CPU, and the motherboard.
Well, we're back online. I probably tried to fix the hang problem too late at night yesterday. I've made some disk cabling changes that may help the hang problem, too.
10:30pm: I'm going to be playing with the network configuration scripts, so the network will be up and down for the next hour or so. If you come in via telnet or ssh, you're likely to be unexpectedly disconnected. You'd be better off to wait until later.
System will be down today from 3pm to 5pm CDT for hardware maintenance.
Later: The power supply has been replaced; whether this will actually help remains, of course, to be seen.
System crashed again, was down for about a day (I was in New York, and it needed manual fsck). And again shortly after rebooting. Clearly something is going on, too bad it doesn't log anything. Also, the monitor died. This system is on a UPS, and the whole house has a surge suppressor.
System was off the net or down from last Thursday afternoon. We had the DSL (and dialtone) out, and then it got fixed when I was up to my elbows in software guts. Hope it's stable; I've updated a lot of software versions.
System was down from 06:50 until 11:09 CDT this morning, for unknown reasons. Not pingable, or anything. Nothing interesting in the logs. (The last few have all had DNS stuff as the last thing in the logs, but that's a periodic statistics dump and it's hanging at low-use times, so that's fairly likely from pure chance.)
System was down from 05:21 until 08:55 CDT this morning, for unknown reasons again. The system was pingable, but didn't respond at a higher level. Nothing interesting was logged.
System was down from 20:45 CDT until 23:15 for no apparent (so far) reason.
I had the server down for 5 minutes just before 2pm (Central Daylight here) to physically relocate it. In addition, there would have been various brief (a few seconds) network connectivity losses as I rerouted some cables.
The server was down from 22:00 until 22:30, following some disk-related problem. I think SETI-at-home was involved; or at least it was the running process. A large core file was written and trashed. It appears to have come back up without problems.
Rebooted manually at 11:20; system seemed okay from the console, could be pinged from inside the firewall, but wasn't responding to other stuff.
(Later: Apparently the problem was somehow in the DSL router. I couldn't telnet to it, and while it showed activity, nothing from it was coming through to the server. I rebooted *it*, and things now seem fine.)
Well, we seem to have survived into the new year/century/millennium (yes, I know the arguments why it isn't really the new millennium; bollocks).
I've replaced the power supply and wiggled everything in its socket, in hopes of fixing the random hang problem.
Most services seem to be functional, but the system is continuing to hang periodically. I now have physical access to it, so the down periods will likely be shorter at least. Obviously I'm working to try to debug this, but it's a hard kind of problem.
Well, we're sort-of up. There's old DNS information out there, so some of the time you get directed to where this system really is, and some of the time you get directed to where it used to be. So *some* web hits come through, and *some* email comes through. (And, on retries more than an hour later, the email may still come through.)
The system has hung twice since I've had it home; haven't had time to work that out yet.
This system will go down around 5pm Wed 24-Nov-1999, and will come up sometime this weekend. It's moving and changing IP addresses -- expect unreliability, performance problems, missing services, and so forth for a while.
Then it died again at about 23 Nov 03:30, still for no apparent reason. Kernel upgraded, watchdog being installed.
The server was down through most of the weekend. It died for no apparent reason at Nov 20 04:23, and I don't have physical access over the weekend. It finally got rebooted at Nov 22 06:17. No sign in the logs of what the problem was, either. Grump.
System was down again from 04:25 until 08:44 Tuesday morning 2-Nov-99 due to a power problem at gofast (transfer switch failed to transfer *back* from diesel backup generator to building power when building power came back on).
System was down from Sunday 31-Oct-99 03:xx until Monday early afternoon for unknown reasons.
Web service was down from about 3 this morning until 8:17. There was a reboot at 3 for no apparent reason, and web service didn't come back up because of an entry in the control file that's been there for years. I am confused; but web service is back up.
The primary name of this server will be changing to www.dd-b.net on roughly August 16th. Please update any bookmarks you have in this site! (In fact if you look at your address window you'll see that www.ddb.com is redirecting to www.dd-b.net already).
gw.ddb.com was down from about 4pm to 8pm on Friday, 2-July-1999. I've reconfigured some disk drives, removed one, and generally engaged in provocative maintenance. Please be a little forgiving to people whose websites were broken by this; it'll take them a few days to find the things I broke and fix them.
The server has been working way too well for a long time. It was down for an hour around 7:15 this morning, with first a power failure and then an ethernet card seating problem. No apparent other damage, though. Oh, and it was flakey for about another hour after that.
Kinda a rough upgrade to Apache 1.3.3 this afternoon, down for a half hour or so from 3pm Central Time. Lots of BIG changes not noted in the documentation (like the whole DSO thing). And ddb.com web service was down overnight.
Memory replaced with stuff that works; we now have 96 meg.
System has been upgraded a little; more memory (not all functional for some reason) and more disk (not all in service yet). We were down from 4:30 until 8 this evening for the upgrade.
Incoming email connections are once again checked against the RBL; some people won't be able to mail here, especially spammers.
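As a rough illustration of how a DNS-based blocklist check like this generally works (not the mail server's actual code): the connecting IP's octets are reversed, the list's zone is appended, and the resulting name is looked up in DNS. If it resolves, the address is listed. The zone `rbl.example.org` used below is a placeholder, not the real list's name, and both function names are made up.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    # IPv4Address() validates the input; raises ValueError on junk.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the blocklist zone has an entry for this address.

    Simplification: any lookup failure (including a DNS timeout) is
    treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        return False
    return True
```

So a connection from 192.0.2.99 would be checked by looking up `99.2.0.192.rbl.example.org`.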
Web service was down for half an hour in the afternoon. I've upgraded the logging some, and upgraded to Analog 3.1. For the next 30 days the web stats will be a bit confused, as the old logs without virtual host information phase out and the new ones take over.
Problem from last library upgrade finally surfaced when I rebooted the system. Took a while to iron out (and the RedHat "rescue" mode on the boot disks isn't much help).
Down for a few minutes around 8:30pm to upgrade the web server (Apache). Earlier today, considerable performance testing was being done, which may have caused noticeable system slowdowns.
There will probably be assorted brief outages tonight and tomorrow; gofast.net is switching their net connectivity over to fiber.
In preparation for a probable conversion to running the Apache web server, I'm running a secondary server here. This server serves all the same documents, but it's running Apache. If you want to try it out and see if it works for you, start here. If you spot any problems, report them to me.
Changed the welcome map again, and I now implement it using client side image maps so you can see what the regions do, and so that Lynx can build a menu out of it.