Recreated wrack, the USB backup drive with the oldest data on it, using my updated scripts, and started a full backup, also with the updated scripts. Was running fine when I went to bed.
It seemed to be hung when I got up this morning, dammit. System and pools are responsive, but there’s been no progress and no disk I/O since I first checked. Haven’t tried to kill any processes yet; waiting to see if the zfs-discuss list has any data-gathering suggestions.
This older software version doesn’t support the -p option in zfs send, but that won’t be the cause of the hang; that will simply require me to recreate some key properties manually if I have to restore from backup.
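For the manual property recreation, here is a minimal sketch of one way to do it, not what my scripts actually do. It assumes the locally set properties were saved at backup time with zfs get in tab-separated form; the function name props_to_set_cmds, the file props.tsv and the pool name tank are all made up for illustration, and an old ZFS version may not accept every flag shown in the comments.

```shell
# Hypothetical helper: turn saved "zfs get -H -o name,property,value" lines
# (tab-separated) into "zfs set" commands, printed for review rather than run.
props_to_set_cmds() {
    while IFS="$(printf '\t')" read -r name prop value; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        printf "zfs set %s='%s' %s\n" "$prop" "$value" "$name"
    done
}

# Capture locally set properties at backup time (pool name is an assumption):
#   zfs get -H -r -o name,property,value -s local all tank > props.tsv
# After a restore, print the commands needed to reapply them:
#   props_to_set_cmds < props.tsv
```

Printing the commands instead of running them leaves a chance to drop properties that shouldn't be restored as-is, such as mountpoints. Values containing a single quote would need extra escaping.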
ETA: My detailed report on the zfs-discuss mailing list.
It’s sitting at the same spot after work, after sitting all day. Officially hung.
I wonder what it will take to stop the backup and export the pool? Well, that’s nice; a straight “kill” terminated the processes, at least.
zpool status shows no errors. zfs list shows backup filesystems mounted.
zpool export -f is running…no disk I/O now…starting to look hung.
Ah, the zfs receive process is still in the process table. kill -9 doesn’t help.
Kill and kill -9 won’t touch the zpool export process, either.
Pulling the USB cable on the drive doesn’t seem to be helping any either.
zfs list now hangs, but giving it a little longer just in case.
Kill -9 doesn’t touch any of the hung jobs.
Closing the ssh sessions doesn’t touch any of them either.
zfs list on pools other than bup-wrack works. zpool list works, and shows bup-wrack.
Attempting to set failmode=continue gives an I/O error.
Plugging the USB back in and then setting failmode gives the same I/O error.
cfgadm -al lists known disk drives and usb3/9 as “usb-storage connected”. I think that’s the USB disk that’s stuck.
cfgadm -cremove usb3/9 failed “configuration operation not supported”.
cfgadm -cdisconnect usb3/9 queried if I wanted to suspend activity, then failed with “cannot issue devctl to ap_id: /devices/pci@0,0/pci10de,cb84@2,1:9”.
cfgadm -al still shows the same thing.
cfgadm -cunconfigure gives the same error as disconnect.
I was able to list properties on bup-wrack:
bash-3.2$ zpool get all bup-wrack
NAME       PROPERTY       VALUE                SOURCE
bup-wrack  size           928G                 -
bup-wrack  used           438G                 -
bup-wrack  available      490G                 -
bup-wrack  capacity       47%                  -
bup-wrack  altroot        /backups/bup-wrack   local
bup-wrack  health         UNAVAIL              -
bup-wrack  guid           2209605264342513453  default
bup-wrack  version        14                   default
bup-wrack  bootfs         -                    default
bup-wrack  delegation     on                   default
bup-wrack  autoreplace    off                  default
bup-wrack  cachefile      none                 local
bup-wrack  failmode       wait                 default
bup-wrack  listsnapshots  off                  default
It’s not healthy, alright. And the attempt to set failmode really did fail.
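A script could read that health value before trying anything like an export. This is only a sketch: pool_health is a hypothetical helper name, it assumes the whitespace-separated column layout shown in the listing above, and it would not have prevented this hang, only flagged the state.

```shell
# Hypothetical helper: pull the health value out of "zpool get" output on stdin.
# Assumes columns are NAME PROPERTY VALUE SOURCE, separated by whitespace.
pool_health() {
    awk '$2 == "health" { print $3 }'
}

# Example use, with the pool name from the listing above:
#   state=$(zpool get health bup-wrack | pool_health)
#   [ "$state" = "ONLINE" ] || echo "bup-wrack is $state; not exporting" >&2
```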
ETA: So I had to reboot. That worked fine, though: I recreated the pool and ran the same full backup script overnight, and it completed successfully. It took 392:23, about six and a half hours. (Something broke the ssh connection, but luckily I had run the backup under screen, so it just got detached and I could reconnect and see what happened. And it was making a log file, anyway.)
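For reading elapsed times like that one, a small sketch of the conversion, assuming the figure is in minutes:seconds (the function name mins_to_hm is made up for illustration):

```shell
# Hypothetical helper: convert an elapsed time given as MINUTES:SECONDS
# into hours, minutes and seconds.
mins_to_hm() {
    mins=${1%%:*}   # part before the colon
    secs=${1#*:}    # part after the colon
    printf '%dh%02dm%ss\n' $((mins / 60)) $((mins % 60)) "$secs"
}

mins_to_hm 392:23   # prints 6h32m23s
```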
There’s a ‘cut’ version error in some of my after-backup processing that I’ll need to fix.