Recreated wrack, the USB backup drive with the oldest data on it, using my updated scripts, and started a full backup, also with the updated scripts. Was running fine when I went to bed.
Seemed to be hung when I got up this morning, dammit. System and pools are responsive, but there’s been no progress and no disk I/O since I first checked when I got up. Haven’t tried to kill any processes yet; waiting to see if the zfs-discuss list has any data-gathering suggestions.
This older software version doesn’t support the -p option in zfs send, but that won’t be the cause of the hang; that will simply require me to recreate some key properties manually if I have to restore from backup.
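Since -p isn’t available here, one way to keep track of the properties that would need recreating is to dump the locally set ones to a file at backup time and turn them back into zfs set commands at restore time. This is only a rough sketch of that idea, not what my scripts actually do: the function name props_to_set_commands, the pool name sourcepool, and the file name props.txt are all made up for illustration, and it assumes no property value contains a single quote.

```shell
# Hypothetical helper: read tab-separated "name<TAB>property<TAB>value"
# lines on stdin (the shape produced by `zfs get -H -o name,property,value`)
# and print one `zfs set` command per line. It only prints the commands;
# nothing is executed.
props_to_set_commands() {
    while IFS="$(printf '\t')" read -r name prop value; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        printf "zfs set '%s=%s' '%s'\n" "$prop" "$value" "$name"
    done
}

# Assumed usage (sourcepool and props.txt are placeholders):
#   zfs get -rH -s local -o name,property,value all sourcepool > props.txt
#   props_to_set_commands < props.txt
```

The generated commands would still need reviewing by hand before running them, since dataset names and mountpoints on a restored pool may not match the originals.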
ETA: My detailed report on the zfs-discuss mailing list.
It’s sitting at the same spot after work, after sitting all day. Officially hung.
I wonder what it will take to stop the backup and export the pool? Well, that’s nice; a straight “kill” terminated the processes, at least.
zpool status shows no errors. zfs list shows backup filesystems mounted.
zpool export -f is running…no disk I/O now…starting to look hung.
Ah, the zfs receive process is still in the process table. kill -9 doesn’t help.
Kill and kill -9 won’t touch the zpool export process, either.
Pulling the USB cable on the drive doesn’t seem to be helping any either.
zfs list now hangs, but giving it a little longer just in case.
Kill -9 doesn’t touch any of the hung jobs.
Closing the ssh sessions doesn’t touch any of them either.
zfs list on pools other than bup-wrack works. zpool list works, and shows bup-wrack.
Attempting to set failmode=continue gives an I/O error.
Plugging the USB back in and then setting failmode gives the same I/O error.
cfgadm -al lists known disk drives and usb3/9 as “usb-storage connected”. I think that’s the USB disk that’s stuck.
cfgadm -cremove usb3/9 failed “configuration operation not supported”.
cfgadm -cdisconnect usb3/9 queried if I wanted to suspend activity, then failed with “cannot issue devctl to ap_id: /devices/pci@0,0/pci10de,cb84@2,1:9”.
cfgadm -al still shows the same thing.
cfgadm -cunconfigure gives the same error as disconnect.
I was able to list properties on bup-wrack:
bash-3.2$ zpool get all bup-wrack
NAME      PROPERTY      VALUE              SOURCE
bup-wrack size          928G               -
bup-wrack used          438G               -
bup-wrack available     490G               -
bup-wrack capacity      47%                -
bup-wrack altroot       /backups/bup-wrack local
bup-wrack health        UNAVAIL            -
bup-wrack guid          2209605264342513453 default
bup-wrack version       14                 default
bup-wrack bootfs        -                  default
bup-wrack delegation    on                 default
bup-wrack autoreplace   off                default
bup-wrack cachefile     none               local
bup-wrack failmode      wait               default
bup-wrack listsnapshots off                default
It’s not healthy, alright. And the attempt to set failmode really did fail.
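Given that zpool get kept working while everything else on the pool hung, a backup script could check the health property before it starts writing. A minimal sketch, assuming the column layout shown in the listing above. The function name pool_property is hypothetical, and this is not part of my current scripts.

```shell
# Hypothetical helper: read `zpool get all <pool>` output on stdin and
# print the VALUE column for one property (default: health).
pool_property() {
    want=${1:-health}
    while read -r name prop value src; do
        if [ "$prop" = "$want" ]; then
            printf '%s\n' "$value"
        fi
    done
}

# Assumed usage:
#   if [ "$(zpool get all bup-wrack | pool_property health)" != "ONLINE" ]; then
#       echo "bup-wrack is not healthy, skipping backup" >&2
#   fi
```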
ETA: So I had to reboot. That worked fine, though; I recreated the pool and ran the same full backup script overnight, and it completed successfully. It took 392:23 (minutes:seconds), about six and a half hours, but it completed. (Something broke the ssh connection, but luckily I had run the backup under screen, so it just got detached and I could reconnect and see what happened. And it was making a log file, anyway.)
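As a quick check on the 392:23 figure, here is a tiny conversion, assuming the format is minutes:seconds. The function name mmss_to_hm is made up for this example, and it simply drops the seconds.

```shell
# Hypothetical helper: convert an elapsed time given as MINUTES:SECONDS
# into whole hours and minutes (seconds are ignored).
mmss_to_hm() {
    printf '%s\n' "$1" | awk -F: '{ printf "%dh %02dm\n", int($1 / 60), $1 % 60 }'
}

mmss_to_hm 392:23   # prints "6h 32m"
```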
There’s a ‘cut’ version error in some of my after-backup processing that I’ll need to fix.