zboxfs / zbox

Zero-details, privacy-focused in-app file system.
https://zbox.io/fs/
Apache License 2.0

Problems when `file://` storage runs out of space #67

Open vi opened 4 years ago

vi commented 4 years ago
Steps to reproduce:
  1. Mount tmpfs with a size limit
  2. Create a ZboxFS file:// storage on that tmpfs
  3. Create a zbox::File there and start filling it with data (without calling finish()) ad infinitum.
Expected:

Eventually std::io::Write::write returns an IO error indicating that the available storage space is exhausted. Ideally the file can still be finish()ed at that point, with the data previously accepted by std::io::Write::write without error guaranteed to fit in the remaining space.

Actual:
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] getxattr: "/file" "security.capability"
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] getxattr: error 61
[2020-01-01T00:45:51Z DEBUG fuse::request] WRITE(153376) ino 0x0000000000000002, fh 0, offset 314085376, size 4096, flags 0x0
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] write: "/file" 0x1000 @ 0x12b89000
[2020-01-01T00:45:51Z DEBUG fuse::request] GETXATTR(153377) ino 0x0000000000000002, name "security.capability", size 0
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] getxattr: "/file" "security.capability"
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] getxattr: error 61
[2020-01-01T00:45:51Z DEBUG fuse::request] WRITE(153378) ino 0x0000000000000002, fh 0, offset 314089472, size 4096, flags 0x0
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] write: "/file" 0x1000 @ 0x12b8a000
[2020-01-01T00:45:51Z DEBUG zbox::trans::txmgr] run tx failed: Io(Custom { kind: Other, error: "other os error" })
[2020-01-01T00:45:51Z DEBUG zbox::trans::txmgr] abort tx#3
[2020-01-01T00:45:51Z WARN  zbox::trans::txmgr] abort tx#3 failed: No space left on device (os error 28)
[2020-01-01T00:45:51Z ERROR zboxmount::zboxfuse] IO error: other os error
[2020-01-01T00:45:51Z WARN  fuse::request] Ignoring unknown FUSE operation 40
[2020-01-01T00:45:51Z DEBUG fuse::request] GETXATTR(153380) ino 0x0000000000000002, name "security.capability", size 0
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] getxattr: "/file" "security.capability"
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] getxattr: error 61
[2020-01-01T00:45:51Z DEBUG fuse::request] WRITE(153381) ino 0x0000000000000002, fh 0, offset 314089472, size 4096, flags 0x0
[2020-01-01T00:45:51Z DEBUG fuse_mt::fusemt] write: "/file" 0x1000 @ 0x12b8a000
[2020-01-01T00:45:51Z DEBUG zbox::trans::txmgr] run tx failed: Io(Custom { kind: Other, error: "other os error" })
[2020-01-01T00:45:51Z DEBUG zbox::trans::txmgr] abort tx#3
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libcore/macros/mod.rs:15:40
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:84
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:61
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1025
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:65
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:50
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:193
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:210
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:471
  11: rust_begin_unwind
             at src/libstd/panicking.rs:375
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:84
  13: core::panicking::panic
             at src/libcore/panicking.rs:51
  14: core::option::Option<T>::unwrap
             at /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libcore/macros/mod.rs:15
  15: zbox::trans::txmgr::TxMgr::abort_trans
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/zbox-0.8.8/src/trans/txmgr.rs:157
  16: zbox::trans::txmgr::TxHandle::abort
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/zbox-0.8.8/src/trans/txmgr.rs:238
  17: zbox::trans::txmgr::TxHandle::run
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/zbox-0.8.8/src/trans/txmgr.rs:208
  18: <zbox::file::File as std::io::Write>::write
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/zbox-0.8.8/src/file.rs:586
...

Additionally, storage cannot be opened again after that:

[2020-01-01T00:52:25Z INFO  zbox::base] ZboxFS v0.8.8 - Zero-details, privacy-focused in-app file system
[2020-01-01T00:52:25Z INFO  zbox::fs::fs] open repo: file:///tmp/limit/q, read_only: false
[2020-01-01T00:52:25Z DEBUG zbox::volume::volume] volume opened: Storage(FileStorage { base: "/tmp/limit/q" })
[2020-01-01T00:52:25Z DEBUG zbox::trans::wal] cold redo abort tx#3
[2020-01-01T00:52:25Z DEBUG zbox::trans::wal] cold abort completed
[2020-01-01T00:52:25Z DEBUG zbox::trans::wal] WalQueue opened, seq: 6, watermarks: (txid: 3, block: 6)
Error: Io(Custom { kind: Other, error: "Decrypt error" })
vi commented 4 years ago

A similar panic happens in sqlite:// mode (with a "database is full" error instead). In that case, however, the storage can be opened again (if I grow the tmpfs a little bit). Unfinished information in the file is lost, obviously (how do I recover it, by the way?).

burmecia commented 4 years ago

Thank you for the issue, that's a good edge case.

The unfinished data cannot be recovered, as writing is processed in a transaction. If writing doesn't call finish(), the transaction doesn't commit and all uncommitted data will be lost.

vi commented 4 years ago

If writing doesn't call finish(), the transaction doesn't commit and all uncommitted data will be lost.

After a crash with unfinished writes, the storage is still full in terms of megabytes. Are those files securely inaccessible (e.g. no decryption key available) or just unreferenced?

vi commented 4 years ago

With mem:// storage, running out of space is even nastier: memory allocation of 8192 bytes failed, then just plain abort.


I think ZboxFS should actively mind the available room (in file://, in memory, in the database) and reserve some space, so writes start failing while the transaction can still be committed.

burmecia commented 4 years ago

If writing doesn't call finish(), the transaction doesn't commit and all uncommitted data will be lost.

After a crash with unfinished writes, the storage is still full in terms of megabytes. Are those files securely inaccessible (e.g. no decryption key available) or just unreferenced?

Yes, those files are still secured but they are inaccessible.

burmecia commented 4 years ago

With mem:// storage, running out of space is even nastier: memory allocation of 8192 bytes failed, then just plain abort.

I think ZboxFS should actively mind the available room (in file://, in memory, in the database) and reserve some space, so writes start failing while the transaction can still be committed.

That could be hard, as some storages are not able to report free space at run time, and even when it can be retrieved, the cost of constantly monitoring it is still high.

vi commented 4 years ago

That could be hard, as some storages are not able to report free space at run time, and even when it can be retrieved, the cost of constantly monitoring it is still high.

ZboxFS can just pre-allocate some space needed to complete a transaction.

vi commented 4 years ago

Yes, those files are still secured but they are inaccessible.

Secured with the same key as the rest of the files, or secured with some session key that is supposed to be written to storage only on transaction commit?

If the former, there should be some recovery function that trades consistency for availability, allowing possibly broken data to be read outside the normal transaction finish. Like git fsck + git cat-file for Git.

burmecia commented 4 years ago

That could be hard, as some storages are not able to report free space at run time, and even when it can be retrieved, the cost of constantly monitoring it is still high.

ZboxFS can just pre-allocate some space needed to complete a transaction.

Accurately pre-allocating space is impossible because ZboxFS has no idea how much data the user will write. File writing is like streaming: until the transaction is committed there is no way to know the actual size in advance.

vi commented 4 years ago

Accurately pre-allocating space is impossible because ZboxFS has no idea how much data the user will write. File writing is like streaming: until the transaction is committed there is no way to know the actual size in advance.

I see the megabytes used by file:// storage going up while writing a file, even before committing a transaction. Does that mean only some constant additional space is needed to commit the transaction once the std::io::Write::write calls are already done?

Anyway, even a heuristic "pre-allocate X kilobytes for transaction commits", where X is configurable by the user, can be beneficial to reduce the probability of finish() also failing after write has failed.


In short: running out of space should not lead to big data loss, only maybe a limited one.

burmecia commented 4 years ago

Yes, those files are still secured but they are inaccessible.

Secured with the same key as the rest of the files, or secured with some session key that is supposed to be written to storage only on transaction commit?

If the former, there should be some recovery function that trades consistency for availability, allowing possibly broken data to be read outside the normal transaction finish. Like git fsck + git cat-file for Git.

ZboxFS uses the former. When writing data to a file, it first writes the data to a staging area. When the transaction is committing, that staging data is transferred to the content management system. That means any uncommitted data will not be in the content management system, thus it cannot be accessed and it is discarded when the transaction aborts.

vi commented 4 years ago

When writing data to a file, it first writes the data to a staging area.

How do I recover data from that staging area if the transaction was neither cleanly finished nor cleanly aborted?

Does that staging area contain all the information needed to complete a transaction, or are some things only kept in RAM?

burmecia commented 4 years ago

Accurately pre-allocating space is impossible because ZboxFS has no idea how much data the user will write. File writing is like streaming: until the transaction is committed there is no way to know the actual size in advance.

I see the megabytes used by file:// storage going up while writing a file, even before committing a transaction. Does that mean only some constant additional space is needed to commit the transaction once the std::io::Write::write calls are already done?

Anyway, even a heuristic "pre-allocate X kilobytes for transaction commits", where X is configurable by the user, can be beneficial to reduce the probability of finish() also failing after write has failed.

In short: running out of space should not lead to big data loss, only maybe a limited one.

You see the storage going up because of the staging data; the file content has already been written to storage, and committing the tx just 'marks' it as permanent. Committing a tx needs to write a WAL file; its size is not large but still depends on how large the file you are writing is. A larger file produces more content nodes to be added to the tx, thus the WAL file size will increase. Heuristically estimating the WAL file size in advance is impossible, but I agree that pre-allocating some blocks for the WAL is a good idea, and it might be able to handle most cases.

burmecia commented 4 years ago

When writing data to a file, it first writes the data to a staging area.

How do I recover data from that staging area if the transaction was neither cleanly finished nor cleanly aborted?

Does that staging area contain all the information needed to complete a transaction, or are some things only kept in RAM?

The staging area only contains the file content, and it is referenced by a temporary content node which exists only in memory. That content node is merged with the pre-existing file content node and written to storage when committing. Thus, if the tx is not completed, the temporary content is lost and the staging data cannot be accessed.

burmecia commented 4 years ago

Would you mind sharing your code and setup, so I can reproduce this issue locally? That would be helpful. Thanks.

vi commented 4 years ago

High-level setup is outlined in the opening comment.

Here is example code.

src/main.rs:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    env_logger::init();
    zbox::init_env();
    // Create (or overwrite) a repo on the size-limited tmpfs.
    let mut r = zbox::RepoOpener::new().create(true).force(true).open("file:///tmp/limit/q", "123asd")?;
    use std::io::Write;
    let mut f = r.create_file("/test")?;
    // Keep writing 64 KiB chunks until the backing storage runs out of space.
    loop {
        match f.write(&[77; 65536]) {
            Ok(_) => {
                eprint!(".")
            }
            Err(e) => {
                eprintln!("\nwrite error: {}", e);
                break;
            }
        }
    }
    // Try to commit after the failed write; this is where the panic occurs.
    f.finish()?;
    Ok(())
}

Cargo.toml:

[package]
name = "zboxtest"
version = "0.1.0"
edition = "2018"

[dependencies]
zbox = {version="0.8.8",features=["storage-file"]}
env_logger = "0.7.1"
  1. On Linux, mount a tmpfs with a size limit: mkdir /tmp/limit; mount -t tmpfs tmp /tmp/limit/ -o size=100M
  2. RUST_BACKTRACE=1 cargo run
vi commented 4 years ago

Heuristically estimating the WAL file size in advance is impossible

Either this is not true (I suppose it should be something like a logarithm of the file size being written) or the design is flawed somewhere.

It may be reasonable that it's not possible to calculate the size exactly beforehand, but the impossibility of estimating it (e.g. providing some upper bound) even heuristically (i.e. something that works with high probability, but can fail) looks very suspicious.

The space reserved for the WAL file could even grow dynamically based on the size of data already written but not yet committed.


If I want, for example, to store 10 terabytes of data in ZboxFS, how often am I recommended to finish()? Does Zbox like mega-transactions, or are medium-sized (e.g. 64-megabyte) transactions preferred?

vi commented 4 years ago

Thus, if the tx is not completed, the temporary content is lost and the staging data cannot be accessed.

Do I understand correctly that recovery of that lost content is theoretically possible (given the correct password, obviously), but is not currently implemented?

burmecia commented 4 years ago

Heuristically estimating the WAL file size in advance is impossible

Either this is not true (I suppose it should be something like a logarithm of the file size being written) or the design is flawed somewhere.

It may be reasonable that it's not possible to calculate the size exactly beforehand, but the impossibility of estimating it (e.g. providing some upper bound) even heuristically (i.e. something that works with high probability, but can fail) looks very suspicious.

The space reserved for the WAL file could even grow dynamically based on the size of data already written but not yet committed.

If I want, for example, to store 10 terabytes of data in ZboxFS, how often am I recommended to finish()? Does Zbox like mega-transactions, or are medium-sized (e.g. 64-megabyte) transactions preferred?

ZboxFS uses mega-transactions, so if you use a single finish() for a large file the result will be either full success or nothing. I suggest splitting the file into smaller chunks, then writing them and using multiple finish() calls. But how small the chunks should be depends on your situation; ZboxFS doesn't have a preferred value for that.

burmecia commented 4 years ago

Thus, if the tx is not completed, the temporary content is lost and the staging data cannot be accessed.

Do I understand correctly that recovery of that lost content is theoretically possible (given the correct password, obviously), but is not currently implemented?

Theoretically it is possible, but hard to implement, and I also think it is not necessary. The application should be aware that any uncommitted data can be lost, and take proper actions to deal with this situation. ZboxFS itself only gives an atomicity promise: a transaction either succeeds or fails in full. Data recovery is the application's responsibility. You can think of ZboxFS as an RDBMS, which requires the user to re-submit the whole transaction in case of failure.

burmecia commented 4 years ago

So for this issue, the root cause is actually improper error propagation while loading a cow object. A cow object has two copies (arms); each change makes a copy of the current arm and writes it to the other arm. When loading the cow object, the error should pop up only when loading both arms failed. If either arm was loaded successfully, we should use that arm.

In this case, the correct result should be like this:

I've changed it in c30a303, can you test it again on your end? Thanks.

vi commented 4 years ago

Failed transaction needs to abort

What is a failed transaction (in terms of the public ZboxFS API)? Does any std::io::Error from std::io::Write::write mark the transaction as doomed? Can I check whether a zbox::File is in the middle of a failed transaction?

vi commented 4 years ago

I've changed it in c30a303, can you test it again on your end?

Tried again with zbox commit c30a303cdf417f8b8f491a950ed8195d95f51bb5:

[2020-01-08T20:46:42Z WARN  zbox::trans::txmgr] abort tx#3 failed: No space left on device (os error 28)
[2020-01-08T20:46:42Z ERROR zboxmount::zboxfuse] IO error: other os error
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libcore/macros/mod.rs:15:40
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:84
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:61
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1025
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:65
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:50
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:193
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:210
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:471
  11: rust_begin_unwind
             at src/libstd/panicking.rs:375
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:84
  13: core::panicking::panic
             at src/libcore/panicking.rs:51
  14: core::option::Option<T>::unwrap
             at /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libcore/macros/mod.rs:15
  15: zbox::trans::txmgr::TxMgr::abort_trans
             at /mnt/src/git/zbox/src/trans/txmgr.rs:157
  16: zbox::trans::txmgr::TxHandle::abort
             at /mnt/src/git/zbox/src/trans/txmgr.rs:238
  17: zbox::trans::txmgr::TxHandle::run
             at /mnt/src/git/zbox/src/trans/txmgr.rs:208
  18: <zbox::file::File as std::io::Write>::write
             at /mnt/src/git/zbox/src/file.rs:586
  19: <zboxmount::zboxfuse::ZboxFs as fuse_mt::types::FilesystemMT>::write
             at src/zboxfuse.rs:450
  20: <fuse_mt::fusemt::FuseMT<T> as fuse::Filesystem>::write::{{closure}}
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse_mt-0.5.0/src/fusemt.rs:351
  21: fuse_mt::fusemt::FuseMT<T>::threadpool_run
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse_mt-0.5.0/src/fusemt.rs:76
  22: <fuse_mt::fusemt::FuseMT<T> as fuse::Filesystem>::write
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse_mt-0.5.0/src/fusemt.rs:350
  23: fuse::request::Request::dispatch
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse-0.3.1/src/request.rs:278
  24: fuse::request::dispatch
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse-0.3.1/src/request.rs:34
  25: fuse::session::Session<FS>::run
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse-0.3.1/src/session.rs:79
  26: fuse::mount::{{closure}}
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse-0.3.1/src/lib.rs:376
  27: core::result::Result<T,E>::and_then
             at /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libcore/result.rs:715
  28: fuse::mount
             at /home/vi/.cargo/registry/src/github.com-1ecc6299db9ec823/fuse-0.3.1/src/lib.rs:376
  29: zboxmount::zboxfuse::mount
             at src/zboxfuse.rs:710
  30: zboxmount::main
             at src/main.rs:144
  31: std::rt::lang_start::{{closure}}
             at /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libstd/rt.rs:67
  32: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  33: std::panicking::try::do_call
             at src/libstd/panicking.rs:292
  34: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:78
  35: std::panicking::try
             at src/libstd/panicking.rs:270
  36: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  37: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  38: std::rt::lang_start
             at /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libstd/rt.rs:67
  39: main
  40: __libc_start_main
  41: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2020-01-08T20:46:42Z WARN  fuse::reply] Reply not sent for operation 26374, replying with I/O error
[2020-01-08T20:46:42Z INFO  fuse::session] Unmounted /home/vi/code/zboxmount/m
[2020-01-08T20:46:42Z INFO  zbox::fs::fs] repo closed

Still the same panic.

burmecia commented 4 years ago

Failed transaction needs to abort

What is a failed transaction (in terms of the public ZboxFS API)? Does any std::io::Error from std::io::Write::write mark the transaction as doomed? Can I check whether a zbox::File is in the middle of a failed transaction?

Transactions are used internally when mutating a file or the file system. There is no API to check their status. Before writing a file, a transaction is created internally and keeps being updated while that file is written. Any error during the writing will mark that transaction as failed and to be aborted. If everything goes fine, finish() will commit that transaction in the end and form a permanent content version.

burmecia commented 4 years ago

I've changed it in c30a303, can you test it again on your end?

Tried again with zbox commit c30a303:

[2020-01-08T20:46:42Z WARN  zbox::trans::txmgr] abort tx#3 failed: No space left on device (os error 28)
[2020-01-08T20:46:42Z ERROR zboxmount::zboxfuse] IO error: other os error
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /rustc/3eeb8d4f2fbae0bb1c587d00b5abeaf938da47f4/src/libcore/macros/mod.rs:15:40
...

Still the same panic.

This panic is because you're calling finish after a failed write. A write failure causes the transaction to abort and be removed from the transaction manager, thus finish should not be called after that. I've added more defensive code to deal with this case in eb8b135. Can you try it again?
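
From the application side, a minimal sketch of honouring that contract could look like the following (reusing the path, password and buffer size from the repro code above; this is only an illustration, not an official recommendation):

fn main() -> Result<(), Box<dyn std::error::Error>> {
    env_logger::init();
    zbox::init_env();
    use std::io::Write;

    let mut r = zbox::RepoOpener::new().create(true).force(true).open("file:///tmp/limit/q", "123asd")?;
    let mut f = r.create_file("/test")?;

    // Track whether any write failed; a failed write means the internal
    // transaction has already been aborted, so finish() must be skipped.
    let mut write_failed = false;
    for _ in 0..1024 {
        if let Err(e) = f.write(&[77; 65536]) {
            eprintln!("\nwrite error, giving up on this file: {}", e);
            write_failed = true;
            break;
        }
    }

    // Only commit if every write succeeded.
    if !write_failed {
        f.finish()?;
    }
    Ok(())
}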

vi commented 4 years ago

After eb8b135, finish indeed no longer causes a panic.

Similar protection, however, should be added to zbox::File::write and other things that interact with transactions.

A similar panic is still happening with modified test code:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    env_logger::init();
    zbox::init_env();
    use std::io::Write;
    let mut f: zbox::File;

    let mut r = zbox::RepoOpener::new().create(true).force(true).open("file:///tmp/limit/q", "qwe123")?;
    f = r.create_file("/test")?;

    // Keep calling write() a few more times after the first failure.
    let mut remaining_failed_writes = 10;

    loop {
        match f.write(&[77; 65536]) {
            Ok(_) => {
                eprint!(".")
            }
            Err(e) => {
                eprintln!("\nwrite error: {}", e);
                if remaining_failed_writes == 0 { break; }
                remaining_failed_writes -= 1;
            }
        }
    }
    f.finish()?;
    Ok(())
}

Actually, I expect that a File::write with a smaller buffer size may succeed after another File::write failed because of ENOSPC. Not every IO error should be unrecoverable.

vi commented 4 years ago

What is the official way to recover from ENOSPC and write a smaller file which would fit in the storage?

Even after a proper (non-panic) repository close following ENOSPC, it remains cluttered with an uncleared aborted transaction and no more transactions are accepted, even smaller ones.


Here is source code illustrating what I mean:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    env_logger::init();
    zbox::init_env();
    use std::io::Write;
    let mut f : zbox::File;

    let mut r = zbox::RepoOpener::new().create(true).force(true).open("file:///tmp/limit/q", "qwe123")?;
    f = r.create_file("/test")?;

    let mut written_size : usize = 0;

    eprintln!("Trying to create a file as big as possible");

    loop {
        match f.write(&[77;65536]) {
            Ok(x) => {
                eprint!(".");
                written_size+=x;
            }
            Err(e) => {
                eprintln!("\nwrite error: {}", e);
                break;
            }
        }
    }

    eprintln!("\nOK, we cannot write {} bytes here. Let's try storing a {}-byte file instead", written_size, written_size/2);

    r.remove_file("/test")?;
    f = r.create_file("/test.smaller")?;

    let mut written_size2 : usize = 0;

    loop {
        match f.write(&[77;65536]) {
            Ok(x) => {
                eprint!(".");
                written_size2+=x;
                if written_size2 * 2 >= written_size {
                    eprintln!("\nOK, enough. Committing");
                    break;
                }
            }
            Err(e) => {
                eprintln!("\nwrite error: {}", e);
                break;
            }
        }
    }
    f.finish()?;

    Ok(())
}

Log:

[2020-01-09T08:21:12Z INFO  zbox::base] ZboxFS v0.8.8 - Zero-details, privacy-focused in-app file system
[2020-01-09T08:21:12Z INFO  zbox::fs::fs] create repo: file:///tmp/limit/q
[2020-01-09T08:21:12Z DEBUG zbox::volume::volume] volume initialised
[2020-01-09T08:21:12Z DEBUG zbox::trans::txmgr] begin tx#1
[2020-01-09T08:21:12Z DEBUG zbox::trans::trans] commit tx#1, cohorts: 2
[2020-01-09T08:21:12Z DEBUG zbox::trans::txmgr] tx#1 committed
[2020-01-09T08:21:12Z INFO  zbox::fs::fs] repo created
[2020-01-09T08:21:12Z DEBUG zbox::trans::txmgr] begin tx#2
[2020-01-09T08:21:12Z DEBUG zbox::trans::trans] commit tx#2, cohorts: 4
[2020-01-09T08:21:12Z DEBUG zbox::trans::txmgr] tx#2 committed
Trying to create a file as big as possible
[2020-01-09T08:21:12Z DEBUG zbox::trans::txmgr] begin tx#3
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
[2020-01-09T08:21:21Z DEBUG zbox::trans::txmgr] run tx failed: Io(Custom { kind: Other, error: "other os error" })
[2020-01-09T08:21:21Z DEBUG zbox::trans::txmgr] abort tx#3
[2020-01-09T08:21:21Z WARN  zbox::trans::txmgr] abort tx#3 failed: No space left on device (os error 28)

write error: other os error

OK, we cannot write 104660992 bytes here. Let's try storing a 52330496-byte file instead
[2020-01-09T08:21:21Z DEBUG zbox::trans::wal] hot redo abort tx#3
[2020-01-09T08:21:21Z DEBUG zbox::trans::wal] hot redo abort failed: Io(Os { code: 28, kind: Other, message: "No space left on device" })
[2020-01-09T08:21:21Z INFO  zbox::fs::fs] repo closed
Error: Io(Os { code: 28, kind: Other, message: "No space left on device" })

Expected: after failing to write the big file, it succeeds in writing and committing the smaller file.

burmecia commented 4 years ago

There is no way to recover when writing has failed; the reason is described above. The recommended way to write a big file is to use smaller batches and call finish() multiple times when a batch is full. That way, file data is gradually committed in multiple batches rather than in one big transaction. When the disk is full and the error pops up, your file will end up partially written.

The example code is like below:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    env_logger::init();
    zbox::init_env();

    let mut r = zbox::RepoOpener::new()
        .create(true)
        .force(true)
        .open("file:///mnt/q", "123asd")?;

    use std::io::Write;

    let src = vec![77; 65536 * 1024];
    let batch_size = 8 * 1024 * 1024;  // choose a batch size that suits your need
    let mut offset = 0;
    let mut written_in_batch = 0;
    let mut retry = 3;

    let mut f = r.create_file("/test")?;

    while offset < src.len() {
        match f.write(&src[offset..]) {
            Ok(written) => {
                eprint!(".");

                offset += written;
                written_in_batch += written;

                // Commit the current batch once it is full, so at most one
                // batch of data can be lost on failure.
                if written_in_batch >= batch_size {
                    f.finish()?;
                    written_in_batch = 0;
                }
            }
            Err(e) => {
                eprintln!("\nwrite error (retry #{}): {}", retry, e);
                retry -= 1;
                if retry == 0 {
                    break;
                }
            }
        }
    }

    // Commit the final partial batch if everything was written successfully.
    if written_in_batch > 0 && offset == src.len() {
        f.finish()?;
    }

    Ok(())
}
vi commented 4 years ago

The recommended way to write a big file is to use smaller batches and call finish() multiple times when a batch is full.

Can this be documented in the API?

Does Zbox like mega-transactions, or are medium-sized (e.g. 64-megabyte) transactions preferred?

Looks like the answer is that ZboxFS does not like multi-gigabyte transactions, and keeping a reasonable upper bound on the number of unfinished bytes is recommended.

burmecia commented 4 years ago

The recommended way to write a big file is to use smaller batches and call finish() multiple times when a batch is full.

Can this be documented in the API?

Sure, I will make it clearer in the documentation.

Does Zbox like mega-transactions, or are medium-sized (e.g. 64-megabyte) transactions preferred?

Looks like the answer is that ZboxFS does not like multi-gigabyte transactions, and keeping a reasonable upper bound on the number of unfinished bytes is recommended.

Actually, ZboxFS doesn't have a preferred transaction style or size for writing big files; it all depends on the application's requirements.

If the app wants integrity and atomicity, use a single finish() and put everything in one transaction. If the app wants to keep as much data as possible and cannot tolerate losing it in case of failure, use multiple finish() calls and adjust the commit granularity to meet its loss tolerance.

vi commented 4 years ago

As of 330ed30840e1564f08a0cfe1c67604a51844d13e it no longer panics.

However, once it runs out of free space, I cannot even delete an existing file to free up some space. The repository effectively becomes read-only.

burmecia commented 4 years ago

As of 330ed30 it no longer panics.

However, once it runs out of free space, I cannot even delete an existing file to free up some space. The repository effectively becomes read-only.

Yes, that's true. Each transaction needs to write a WAL file beforehand; when there is no free space left, no WAL file can be persisted, thus no transaction can be made and the repo becomes effectively read-only. There might be a better way to deal with this situation, but it needs thorough thinking and might require some trade-offs.

vi commented 4 years ago

Is there an upper bound on the WAL file size needed to remove one ZboxFS File?

Such a file could be kept allocated in advance for such cases.

Also, why doesn't an aborted transaction clean up its unused files in file:// storage?
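
Regarding keeping such a file allocated in advance: an application that controls the backing directory could approximate this today with a ballast file it frees when ENOSPC is hit. A rough sketch of the idea only; the file name and reserve size are arbitrary, and this is not a ZboxFS feature:

use std::fs;
use std::io::Write;
use std::path::Path;

// Purely illustrative reserve size; in practice it would be user-configurable.
const RESERVE_BYTES: usize = 1024 * 1024;

// Pre-allocate a ballast file next to the file:// storage directory.
fn reserve_space(dir: &Path) -> std::io::Result<()> {
    let mut f = fs::File::create(dir.join("zbox-reserve"))?;
    // Write real bytes so the space is actually consumed (a sparse file
    // would not reserve anything on tmpfs).
    f.write_all(&vec![0u8; RESERVE_BYTES])?;
    f.sync_all()
}

// Drop the ballast when ENOSPC is hit, so a pending abort or a small
// deletion transaction has room for its WAL files.
fn release_reserve(dir: &Path) -> std::io::Result<()> {
    fs::remove_file(dir.join("zbox-reserve"))
}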

burmecia commented 4 years ago

Is there an upper bound on the WAL file size needed to remove one ZboxFS File?

Such a file could be kept allocated in advance for such cases.

The difficulty is that the WAL file size is dynamic and unknown in advance. Pre-allocation might work in most cases but cannot be guaranteed.

Also, why doesn't an aborted transaction clean up its unused files in file:// storage?

The same reason. The sector file needs to write a copy first when changed; that's how COW works. When no space is left, the aborted tx cannot even reclaim used space. That's why you see the tx abort failing.

burmecia commented 4 years ago

I just checked with sqlite; it also cannot complete any transaction when the disk is full. So this behavior might be valid for ZboxFS as well, but I will keep trying to mitigate it as much as possible. My thinking is that, when the disk is full, at least the transaction for file deletion should still be able to execute, otherwise the app will get stuck in this limbo. The solution might be pre-allocating some fixed-size space for the WAL and rotating it as transactions flow in.

burmecia commented 4 years ago

After carefully checking different possible solutions, I think repo mutability when the disk is full is hard to implement, so I will have to leave it unimplemented. The expected result will be the same as sqlite: when the disk is full, the repo becomes read-only and no more transactions can be made, not even deletion.

The major difficulties I found are:

vi commented 4 years ago

What does a WAL file consist of? What affects its size?

Maybe the pre-allocated WAL size should just be a user-configurable parameter? That file would normally be unused except when:

* The transaction in question is just deleting one file;

* An attempt to do that transaction normally failed with `Io::Error::Other`.

burmecia commented 4 years ago

What does a WAL file consist of? What affects its size?

Maybe the pre-allocated WAL size should just be a user-configurable parameter? That file would normally be unused except when:

* The transaction in question is just deleting one file;

* An attempt to do that transaction normally failed with `Io::Error::Other`.

The WAL file contains all the objects affected in the transaction, including new, updated and deleted ones. When writing data to a file, the data is split into chunks and the chunks are grouped into limited-size segments; each segment is a new object to be put in the transaction. So as data keeps being written, more and more new segments are created and added to the transaction, thus the WAL file size keeps increasing.

It's hard to predict the WAL size because we don't know how many bytes users will write to a file. Even if we knew it, the number of chunks and segments would still be hard to know because they depend on the Rabin rolling hash algorithm, which produces variable-length chunks based on the data content.

vi commented 4 years ago

How many chunks/segments are required to delete one file?

Maybe in "only delete 1 file" special case WAL size is more predictable compared to general case?

burmecia commented 4 years ago

To delete one file, one transaction is needed, which consists of 2 WAL files (one for the tx itself, one for the WAL queue). For deletion, if it is not an extreme case, it is normally no more than several kilobytes. But the difficulty is not that; it is how to reliably capture the disk-full error without sacrificing efficiency.

vi commented 4 years ago

Just capture it unreliably.

If it is not a disk-full error, it is probably no problem to handle it as if it were a disk-full error (i.e. by trying to use the reserved files). It would just fail again.
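
As an illustration of what "unreliable" detection could mean (purely a sketch; the helper name is made up, and ENOSPC is 28 on Linux, as seen in the logs above):

// Treat an error as disk-full when the raw OS error code says so (ENOSPC = 28
// on Linux); anything else can optionally fall back to the same handling and
// simply fail again, as suggested above.
fn looks_like_disk_full(e: &std::io::Error) -> bool {
    e.raw_os_error() == Some(28)
}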

burmecia commented 4 years ago

That would be overkill; that error in Rust is too general, and you would end up doing unnecessary error handling for all other errors.