Server Upgrade Chronicles IIb

Recreated wrack, the USB backup drive with the oldest data on it, using my updated scripts, and started a full backup, also with the updated scripts.  Was running fine when I went to bed.

Seems to be hung as of this morning, dammit.  System and pools are responsive, but there’s been no progress and no disk I/O since I first checked when I got up. Haven’t tried to kill any processes yet; waiting to see if the zfs-discuss list has any data-gathering suggestions.

This older software version doesn’t support the -p option in zfs send, but that won’t be the cause of the hang; that will simply require me to recreate some key properties manually if I have to restore from backup.
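A minimal sketch of one way to make that manual step less painful, assuming the locally set properties were saved at backup time with something like `zfs get -H -o name,property,value -s local all <dataset>` into a file. The function name `props_to_set_cmds` and the file name in the usage line are made up for illustration, and it only prints the commands so they can be reviewed (and the dataset names adjusted) before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: turn saved "zfs get -H -o name,property,value" output
# (tab-separated: dataset, property, value) into "zfs set" commands.
# It only prints the commands; nothing is applied to any pool.
# Assumption: values contain no single quotes.
props_to_set_cmds() {
    tab=$(printf '\t')
    while IFS="$tab" read -r name prop value; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        printf "zfs set %s='%s' %s\n" "$prop" "$value" "$name"
    done
}

# Usage (file name is an assumption):
#   props_to_set_cmds < saved-properties.tsv
```

Printing rather than executing is deliberate: some local properties (mountpoint, for one) may not be something to copy over blindly on a restore.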

ETA: My detailed report on the zfs-discuss mailing list.

It’s sitting at the same spot after work, after sitting all day.  Officially hung.

I wonder what it will take to stop the backup and export the pool?  Well, that’s nice; a straight “kill” terminated the processes, at least.

zpool status shows no errors. zfs list shows backup filesystems mounted.

zpool export -f is running…no disk I/O now…starting to look hung.

Ah, the zfs receive process is still in the process table.  kill -9 doesn’t help.

Kill and kill -9 won’t touch the zpool export process, either.

Pulling the USB cable on the drive doesn’t seem to be helping any either.

zfs list now hangs, but giving it a little longer just in case.

Kill -9 doesn’t touch any of the hung jobs.

Closing the ssh sessions doesn’t touch any of them either.

zfs list on pools other than bup-wrack works. zpool list works, and shows bup-wrack.

Attempting to set failmode=continue gives an I/O error.

Plugging the USB back in and then setting failmode gives the same I/O error.

cfgadm -al lists known disk drives and usb3/9 as “usb-storage connected”. I think that’s the USB disk that’s stuck.

cfgadm -cremove usb3/9 failed “configuration operation not supported”.

cfgadm -cdisconnect usb3/9 queried if I wanted to suspend activity, then failed with “cannot issue devctl to ap_id: /devices/pci@0,0/pci10de,cb84@2,1:9”

cfgadm -al still shows the same thing.

cfgadm -cunconfigure same error as disconnect.

I was able to list properties on bup-wrack:

bash-3.2$ zpool get all bup-wrack
NAME       PROPERTY       VALUE               SOURCE
bup-wrack  size           928G                -
bup-wrack  used           438G                -
bup-wrack  available      490G                -
bup-wrack  capacity       47%                 -
bup-wrack  altroot        /backups/bup-wrack  local
bup-wrack  health         UNAVAIL             -
bup-wrack  guid           2209605264342513453  default
bup-wrack  version        14                  default
bup-wrack  bootfs         -                   default
bup-wrack  delegation     on                  default
bup-wrack  autoreplace    off                 default
bup-wrack  cachefile      none                local
bup-wrack  failmode       wait                default
bup-wrack  listsnapshots  off                 default

It’s not healthy, all right. And the attempt to set failmode really did fail.
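For a backup script, the health value above could be checked before starting a run. This is only a sketch, assuming the input comes from `zpool list -H -o name,health` (one pool per line, name then health, no header); the function name `unhealthy_pools` is hypothetical. Given how many zfs commands hung during this episode, the check itself could hang too, so it is no substitute for looking at the machine.

```shell
#!/bin/sh
# Hypothetical helper: read "name health" lines, as printed by
#   zpool list -H -o name,health
# and print every pool whose health is anything other than ONLINE.
unhealthy_pools() {
    while read -r name health; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        [ "$health" = "ONLINE" ] || printf '%s: %s\n' "$name" "$health"
    done
}

# Usage:
#   zpool list -H -o name,health | unhealthy_pools
```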

ETA: So I had to reboot.  That worked fine, though: I recreated the pool and ran the same full backup script overnight, and it completed successfully.  Took 392:23 (about six and a half hours), but it completed. (Something broke the ssh connection, but luckily I had run the backup under screen, so it just got detached and I could reconnect and see what happened. And it was writing a log file anyway.)
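For the record, the elapsed time converts like this, assuming 392:23 is minutes:seconds (the other reading, hours:minutes, would be over two weeks). The function name `mmss_to_hms` is just for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert an elapsed time given as MINUTES:SECONDS
# into hours, minutes and seconds.
mmss_to_hms() {
    mins=${1%%:*}   # part before the colon
    secs=${1#*:}    # part after the colon
    printf '%dh %02dm %ss\n' $((mins / 60)) $((mins % 60)) "$secs"
}

mmss_to_hms 392:23   # prints "6h 32m 23s"
```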

There’s a ‘cut’ version error in some of my after-backup processing that I’ll need to fix.

