zfsonlinux / pkg-zfs

Native ZFS packaging for Debian and Ubuntu
https://launchpad.net/~zfs-native/+archive/daily

ubuntu ZOL mount at boot hangs on "large" zil replay #135

Closed prantikk closed 9 years ago

prantikk commented 9 years ago

I recently performed an experiment on a ZFS filesystem that, despite being a little crazy, produced a ZIL replay hang at boot that I'd like to understand better and hopefully be able to deal with if/when it arises again.

My hardware is: 24 4TB NL-SAS drives behind an LSI SAS expander, connected to a SAS HBA (mpt2sas). I have a 4-way Xeon machine with 256GB ECC RAM. I'm running Ubuntu 14.04 LTS with ZOL 0.6.3 via the Ubuntu PPA. The disks are set up as a 3x8 raidz2. No L2ARC or SLOG devices.

  1. With the default ZOL config (no compression), multiple sequential bonnie++ runs with fsync'd writes (-b) report about 400MB/s write, 250MB/s rewrite, and 950MB/s read. These bonnie++ runs complete without fail.
  2. Then, I start a script that runs 4 parallel bonnie++ procs with fsync'd writes (-b), writing 800GB datasets (-s). I put this operation in a loop of 25 iterations and leave the script to run overnight. I come back the next morning (12 hours later), and the disks seem to be still reading/writing and no errors are reported, which is good.
  3. I press Ctrl+C at the terminal to kill the script, and disk reading/writing stops. I run ps -aux and see the bonnie++ processes still there, apparently hanging as zombie processes. kill -9 does not remove them. I reboot.
  4. Upon boot, ZFS automount seems to begin replaying the ZIL, as indicated by the hard drive lights blinking. It runs for about 8 minutes, then the blinking stops but boot does not progress. The three-finger salute tries to reboot but hangs, and after a little while a kernel dump/trace comes up.
  5. A manual restart goes through the boot process, and at ZFS mount the replay starts again and hangs after roughly the same amount of time. I destroy the pool.
  6. I run the same test (2) on a separate set of 4TB SATA drives. This time, step (2) actually seems to crash the array or hang the system after some time (maybe hours, exact time unknown). Rebooting then leads to the same ZIL replay hang issue. After about 10-20 reboots, surprisingly, boot succeeds and the long replay is not observed again. In this case, the ZIL may have been less populated since the SATA-behind-SAS array crashed the OS early on.
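
For anyone who wants to reproduce step (2), the workload can be sketched roughly as below. This is a minimal sketch and not the original script: the function names, the per-process working directories, and the `-u root` argument are assumptions, and it assumes each of the 4 processes writes its own 800GB dataset (the description above does not say whether 800GB is per process or in total). The `-d`, `-s`, `-b`, and `-u` options are standard bonnie++ flags; `-s` is given in MiB here.

```python
import subprocess
from pathlib import Path

PARALLEL_PROCS = 4        # step (2): 4 parallel bonnie++ processes
ITERATIONS = 25           # step (2): loop of 25 iterations
SIZE_MIB = 800 * 1024     # 800GB dataset, expressed in MiB for -s


def build_bonnie_cmd(target_dir, size_mib=SIZE_MIB, user="root"):
    """Return the argv for one bonnie++ run with fsync'd writes (-b)."""
    return ["bonnie++", "-d", str(target_dir), "-s", str(size_mib), "-b", "-u", user]


def run_stress(base_dir, procs=PARALLEL_PROCS, iterations=ITERATIONS):
    """Run `iterations` rounds of `procs` concurrent bonnie++ processes."""
    base = Path(base_dir)
    for i in range(iterations):
        workers = []
        for n in range(procs):
            # Hypothetical layout: one working directory per process.
            work_dir = base / f"bonnie-{n}"
            work_dir.mkdir(parents=True, exist_ok=True)
            workers.append(subprocess.Popen(build_bonnie_cmd(work_dir)))
        # Wait for every process in this round before starting the next.
        codes = [w.wait() for w in workers]
        print(f"iteration {i + 1}/{iterations}: exit codes {codes}")
```

Calling `run_stress("/path/to/dataset")` on a directory inside the pool starts the load; the path is a placeholder. Nothing runs on import, so the command line can be inspected with `build_bonnie_cmd()` first.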

Any idea what leads to the situation in (3) and (4)? How can I probe what's going on more effectively? Why does the pool stop the ZIL replay and hang the boot after a certain amount of writing, especially as it seems that after enough piecewise replays the ZIL is flushed and the system can boot? I have my SAS disks back in and I'd like to keep them there. Given very large write scenarios, which will happen in my application, how can I make sure this issue doesn't arise again?
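
As one possible starting point for the probing question, here is a small sketch that lists tasks stuck in uninterruptible sleep (state `D` in `/proc/<pid>/stat`) along with their kernel stacks. It rests on assumptions: the bonnie++ processes that ignored kill -9 in step (3) may have been in `D` state rather than true zombies, which the report does not confirm, and `/proc/<pid>/stack` is only readable as root on kernels built with stack tracing. The function names are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, comm, kernel_stack) for tasks in uninterruptible sleep."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "stack")) as fh:
                stack = fh.read()
        except OSError:
            stack = "(stack unavailable; needs root and kernel stack tracing)"
        yield pid, comm, stack
```

Running `for pid, comm, stack in blocked_tasks(): print(pid, comm, stack, sep="\n")` as root while the hang is in progress would show where in the kernel each blocked task is waiting, which is usually more informative than the ps output alone.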

I feel that this issue is reproducible, and I'm willing to run some experiments if people are interested.

prantikk commented 9 years ago

Moved this issue to zfsonlinux/zfs.