Server Upgrade Chronicles III

Very briefly, since it’s late!

I’ve temporarily given up on getting incremental send/receive working. I’m liking my results with full backups, I have some reason to believe that the next OpenSolaris release will fix the bug I’m seeing, and I can’t see how to proceed on that in the short term anyway. With a third backup drive, having just full backups isn’t so limiting, either.
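For reference, the shape of a full backup is simple. The pool names and snapshot naming below are hypothetical, not my actual configuration, and the commands are echoed rather than executed so they can be reviewed first:

```shell
# Full (non-incremental) ZFS backup sketch. Pool names are examples,
# not my actual configuration; commands are echoed for review, not run.
SRC=rpool                    # hypothetical source pool
DST=bup-wrack                # hypothetical pool on the USB backup drive
SNAP="bup-$(date +%Y%m%d)"   # dated snapshot name

echo "zfs snapshot -r ${SRC}@${SNAP}"
echo "zfs send -R ${SRC}@${SNAP} | zfs receive -Fd ${DST}"
```

The -R on send replicates the whole hierarchy with its snapshots; the -F on receive forces the backup pool to roll back to match, which is fine when the target exists only to hold this backup.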

I’ve got the 4x 2.5″ adapter installed in the 5.25″ bay, and the 8 port SATA controller hooked up to it, and two drives in it, and I’ve copied the system rpool to a new pool there called rp2. And am now looking at details of how to finish making that pool bootable, probably involving booting from a livecd to rename it among other things.
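My working notes on those remaining steps, with a hypothetical boot-environment name and disk slice, and the commands echoed rather than executed: point bootfs at the boot environment, install GRUB on the new disk, and rename the pool from the livecd by exporting it and importing it under the new name.

```shell
# Sketch of making the copied pool bootable; BE name and disk slice are
# hypothetical, and commands are echoed for review rather than run.
POOL=rp2
BE=opensolaris        # hypothetical boot environment name
DISK=c8t0d0s0         # hypothetical slice holding the new pool

echo "zpool set bootfs=${POOL}/ROOT/${BE} ${POOL}"
echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/${DISK}"

# Renaming a pool is an export followed by an import under the new
# name; it has to happen from the livecd because the running root
# pool can't be exported:
echo "zpool export ${POOL}"
echo "zpool import ${POOL} rpool"
```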

Weird problem with the SATA controller—the bracket on the end of the card doesn’t match the cutout in the case. I had to remove the bracket, leaving the card unsupported, which is clearly not viable in the long term with two sets of stiff cables attached to it (right now it’s supported with a bit of gaffer’s tape).

Haven’t looked into booting from the new controller, either; it’s possible I can’t, I suppose, but if so putting both boot disks on the old controller isn’t terribly painful, though it ruins my perfect plan to split every mirror pair across controllers.

There’s also a problem with the bottom left 2.5″ tray, but I’m ignoring that for now since I only need two drives in the 2.5″ bay to finish my upgrade.

It might well be better to install to the new drives from scratch, but then there are issues duplicating the configuration down to UIDs and GIDs, which is necessary for the data pool to be accessible to the right users when I import it.
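The UID/GID half of that is at least easy to audit mechanically: pull name, UID, and GID out of each system’s passwd file and diff them. The passwd excerpts below are made-up stand-ins:

```shell
# Compare numeric IDs between two installs; the passwd excerpts here
# are made-up examples, not my real accounts.
cd "$(mktemp -d)"
printf 'ddb:x:101:10:David:/export/home/ddb:/bin/bash\n' > passwd.old
printf 'ddb:x:102:10:David:/export/home/ddb:/bin/bash\n' > passwd.new

# Keep just name, UID, and GID from each side.
awk -F: '{print $1, $3, $4}' passwd.old | sort > ids.old
awk -F: '{print $1, $3, $4}' passwd.new | sort > ids.new

# Any diff output is a user whose numeric IDs differ between installs,
# so that user's files in the imported data pool would show the wrong
# owner.
diff ids.old ids.new || true
```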

Still, all the new hardware seems to be working, which is good.

Disk Prices

I happened to compute the price per byte for a disk drive I bought Tuesday.  Then it occurred to me to compute the price per byte of the first hard drive I ever bought.

Then it occurred to me to compute the ratio.

2,916,667

The price per byte of that first disk drive was very nearly three million times higher.

Three.  Million.

That’s a lot.

I like living in the future.
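For the curious, the arithmetic goes like this, with illustrative figures rather than my actual drives and prices: something like a 10 MB drive at $2,100 against a 1 TB drive at $70 lands in the same neighborhood.

```shell
# Price-per-byte ratio with made-up but era-plausible figures -- not
# the actual drives or prices from the post.
old_price=2100; old_bytes=$((10 * 1024 * 1024))          # ~10 MB drive
new_price=70;   new_bytes=$((1000 * 1000 * 1000 * 1000)) # 1 TB drive

awk -v op="$old_price" -v ob="$old_bytes" \
    -v np="$new_price" -v nb="$new_bytes" \
    'BEGIN { printf "%.0f\n", (op / ob) / (np / nb) }'
# Prints a ratio on the order of three million.
```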

Server Upgrade Chronicles IIb

Recreated wrack, the USB backup drive with the oldest data on it, using my updated scripts, and started a full backup, also with the updated scripts.  Was running fine when I went to bed.

Seems to be hung when I got up this morning, dammit.  System and pools are responsive, but there’s been no progress and no disk IO since I first checked when I got up. Haven’t tried to kill any processes yet; waiting to see if the zfs-discuss list has any data-gathering suggestions.

This older software version doesn’t support the -p option in zfs send, but that won’t be the cause of the hang; that will simply require me to recreate some key properties manually if I have to restore from backup.
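Since -p is what carries properties across, the locally-set ones would have to be reapplied after a restore. One way to prepare for that is to turn `zfs get -H -s local -o name,property,value all` output into ready-made `zfs set` commands; the tab-separated input lines below are made-up samples standing in for real output:

```shell
# Turn 'zfs get -H -s local -o name,property,value all' output into
# zfs set commands. The input lines are made-up sample data.
printf 'tank/home\tcompression\ton\ntank/home\tquota\t100G\n' |
awk -F'\t' '{ printf "zfs set %s=%s %s\n", $2, $3, $1 }'
# -> zfs set compression=on tank/home
# -> zfs set quota=100G tank/home
```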

ETA: My detailed report on the zfs-discuss mailing list.

It’s sitting at the same spot after work, after sitting all day.  Officially hung.

I wonder what it will take to stop the backup and export the pool?  Well, that’s nice; a straight “kill” terminated the processes, at least.

zpool status shows no errors. zfs list shows backup filesystems mounted.

zpool export -f is running…no disk I/O now…starting to look hung.

Ah, the zfs receive process is still in the process table.  kill -9 doesn’t help.

Kill and kill -9 won’t touch the zpool export process, either.

Pulling the USB cable on the drive doesn’t seem to be helping any either.

zfs list now hangs, but giving it a little longer just in case.

Kill -9 doesn’t touch any of the hung jobs.

Closing the ssh sessions doesn’t touch any of them either.

zfs list on pools other than bup-wrack works. zpool list works, and shows bup-wrack.

Attempting to set failmode=continue gives an I/O error.
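Worth noting what failmode does here, since wait is the default and is exactly why everything touching this pool blocks instead of erroring out. A sketch of the change I was attempting, echoed rather than run:

```shell
# failmode controls how a pool behaves after catastrophic device
# failure:
#   wait     - block I/O until the device returns (the default, and why
#              everything touching the pool just hangs)
#   continue - return EIO to new writes instead of blocking
#   panic    - panic the host
echo "zpool set failmode=continue bup-wrack"   # echoed, not executed
```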

Plugging the USB back in and then setting failmode gives the same I/O error.

cfgadm -al lists known disk drives and usb3/9 as “usb-storage connected”. I think that’s the USB disk that’s stuck.

cfgadm -cremove usb3/9 failed with “configuration operation not supported”.

cfgadm -cdisconnect usb3/9 queried if I wanted to suspend activity, then failed with “cannot issue devctl to ap_id: /devices/pci@0,0/pci10de,cb84@2,1:9”

cfgadm -al still shows the same thing.

cfgadm -cunconfigure gives the same error as disconnect.

I was able to list properties on bup-wrack:

bash-3.2$ zpool get all bup-wrack
NAME       PROPERTY       VALUE               SOURCE
bup-wrack  size           928G                -
bup-wrack  used           438G                -
bup-wrack  available      490G                -
bup-wrack  capacity       47%                 -
bup-wrack  altroot        /backups/bup-wrack  local
bup-wrack  health         UNAVAIL             -
bup-wrack  guid           2209605264342513453  default
bup-wrack  version        14                  default
bup-wrack  bootfs         -                   default
bup-wrack  delegation     on                  default
bup-wrack  autoreplace    off                 default
bup-wrack  cachefile      none                local
bup-wrack  failmode       wait                default
bup-wrack  listsnapshots  off                 default

It’s not healthy, alright. And the attempt to set failmode really did fail.

ETA: So I had to reboot.  However, that worked fine, and I recreated the pool, and I ran the same full backup script overnight, and it completed successfully.  Took 392:23, about six and a half hours, but it completed. (Something broke the ssh connection, but luckily I had run the backup under screen, so it just got detached and I could reconnect and see what happened. And it was making a log file, anyway.)
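The screen habit is worth writing down, since it’s what preserved the session when ssh dropped. Session and script names here are hypothetical, and the commands are echoed rather than run:

```shell
# Run a long backup under screen so a dropped ssh session merely
# detaches it. Names are hypothetical; commands echoed for review.
echo "screen -L -S backup"    # -L also logs output to screenlog.0
echo "./full-backup.sh"       # then run the backup inside the session
echo "screen -r backup"       # reattach after reconnecting over ssh
```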

There’s a ‘cut’ version error in some of my after-backup processing that I’ll need to fix.

Server Upgrade Chronicles IIa

Didn’t find any useful information.  Asked online today, and found a good part of what I need at least, so I can revisit the install and upgrade tomorrow (got better things to do tonight). I need to resolve the uncertainties about updating before considering doing it to the real server. Maybe I should just wait for the next stable release, due in March and expected in April.

Meanwhile, yesterday the new SAS disk controller arrived. I’ve now got all the hardware to install into the box, and I can do that without changing anything associated with the current boot or data disks, so I probably will: try things out on the new hardware and new disks before cutting over.

Meanwhile, not sure why WordPress is displaying the time in UTC.

ETA: Okay, install log is fairly clean.  There’s “device pciclass,030000@3(display#0) keeps up device sd@0,0(sd#0), but the former is not power managed”, “/usr/lib/power/powerd: [ID 387247 daemon.error] Able to open /dev/srn”, and “SUNW_piclmemcfg init mc failed!”

Added Emacs.  Going to shut down, snapshot, and start the update from there.

ETA: On boot, several popup errors, and this in the logs:

Feb  5 07:47:08 osol-play-002 nwamd[37]: [ID 116842 daemon.error] sysevent_bind_handle: Permission denied
Feb  5 07:47:11 osol-play-002 genunix: [ID 127566 kern.info] device pciclass,030000@2(display#0) keeps up device sd@0,0(sd#0), but the former is not power managed
Feb  5 07:47:11 osol-play-002 /usr/lib/power/powerd: [ID 387247 daemon.error] Able to open /dev/srn
Feb  5 07:48:53 osol-play-002 inetd[6089]: [ID 702911 daemon.error] Failed to update state of instance svc:/application/x11/xfs:default in repository: entity not found
Feb  5 07:48:53 osol-play-002 inetd[6089]: [ID 702911 daemon.error] Failed to get instance for svc:/application/x11/xfs:default
Feb  5 07:48:53 osol-play-002 inetd[6089]: [ID 702911 daemon.error] Failed to update state of instance svc:/application/x11/xfs:default in repository: No such file or directory
Feb  5 07:48:53 osol-play-002 inetd[6089]: [ID 702911 daemon.error] Failed to update state of instance svc:/application/x11/xfs:default in repository: entity not found
Feb  5 07:48:53 osol-play-002 inetd[6089]: [ID 702911 daemon.error] Failed to get instance for svc:/application/x11/xfs:default
Feb  5 07:48:53 osol-play-002 inetd[6089]: [ID 702911 daemon.error] Failed to update state of instance svc:/application/x11/xfs:default in repository: No such file or directory
Feb  5 07:49:01 osol-play-002 ip: [ID 224711 kern.warning] WARNING: Memory pressure: TCP defensive mode on

But, unlike the previous attempt, the system did boot. I halted and snapshotted it immediately after the boot.

Server Upgrade Chronicles II

Good news in automatic email: a bug I filed is fixed in build 122. Now, it’s a duplicate of another bug that I apparently failed to find, and there’s been an easy workaround all this time (turns out it was a pointer problem in parsing file paths, triggered by not having a “/” at the end of a directory path). This was apparently what was blocking my ability to do incremental backups with ZFS send/receive.

This makes software update key, not that it wasn’t already.

So I have had to re-install VirtualBox (because VMware Player won’t work with virtual machines on my network drive, whereas VirtualBox will), and reinstall Solaris. Then I will learn how to upgrade to various builds, because I’m ashamed to say I don’t know how except to “current”, which may not be the place to be.

Huh; almost looks like there isn’t a way. In future, I can update more often and keep the old snapshots around. Though that doesn’t give any way to reinstall if what I really need is an old version.
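What I do know is the documented way to move an image onto the dev branch: repoint the publisher and run image-update. My understanding of the steps, with the commands echoed rather than executed (the repository URL is from memory; verify it before use):

```shell
# Switch an OpenSolaris image to the dev repository and update.
# URL from memory -- verify before use; commands echoed, not run.
echo "pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org"
echo "pfexec pkg image-update"
# image-update creates a new boot environment; the old BE stays
# selectable from GRUB if the new one fails to boot.
```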

I’m updating a virtual system, to test techniques and such. It’s downloading very slowly, equally slowly in bridged or NAT mode. So it’s not going to be done tonight, which means the testing will be delayed and the actual upgrade thus even more delayed. Well, things take time.

So far, knock on wood, nothing has gone terribly wrong.

ETA: The update (switching to the dev branch) completed overnight, with a number of errors. The new Boot Environment doesn’t come all the way up. No time to check more this morning.