westerndigitalcorporation / zenfs

ZenFS is a storage backend for RocksDB that enables support for ZNS SSDs and SMR HDDs.
GNU General Public License v2.0

IO Error when enabling `use_direct_io_for_flush_and_compaction` #124

Closed · royguo closed this issue 2 years ago

royguo commented 2 years ago

Error Message:

```
try recovering from put error: IO error: positioned append not at write pointer
```

Debug Info:

```
test/000007.sst PositionedAppend, offset: 0,      wp=0,      data size = 798720
test/000007.sst PositionedAppend, offset: 794624, wp=798720, data size = 4096
```

The file 000007.sst saw only two operations: the first appended 798720 bytes, but the second started its pwrite at offset 794624.
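
For reference, ZenFS only allows strictly sequential writes: an append must land exactly at the file's current write pointer. A minimal sketch of that invariant (hypothetical, simplified types; not the exact ZonedWritableFile source):

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the real IOStatus / ZonedWritableFile types.
struct Status {
  bool ok;
  std::string msg;
};

struct ZoneFileSketch {
  uint64_t wp = 0;  // write pointer: where the next append must start

  Status PositionedAppend(uint64_t offset, uint64_t size) {
    if (offset != wp) {
      // Second write above: offset 794624 != wp 798720 -> rejected.
      return {false, "positioned append not at write pointer"};
    }
    // ... issue the sequential write to the zone here ...
    wp += size;  // first write above: wp advances from 0 to 798720
    return {true, ""};
  }
};
```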

Please take a look!

aravind-wdc commented 2 years ago

Hi, thanks for submitting the issue. Could you please provide some more context on how you hit this issue? Also, any specific steps to reproduce it would help greatly.

royguo commented 2 years ago

```
./db_bench \
        --zbd_path=$DEVICE \
        --benchmarks=fillrandom \
        --use_existing_db=0 \
        --histogram=1 \
        --statistics=0 \
        --stats_per_interval=1 \
        --stats_interval_seconds=60 \
        --max_background_flushes=3 \
        --max_background_compactions=5 \
        --enable_lazy_compaction=0 \
        --level0_file_num_compaction_trigger=4 \
        --sync=1 \
        --allow_concurrent_memtable_write=1 \
        --bytes_per_sync=32768 \
        --wal_bytes_per_sync=32768 \
        --delayed_write_rate=419430400 \
        --enable_write_thread_adaptive_yield=1 \
        --threads=16 \
        --num_levels=7 \
        --key_size=36 \
        --value_size=16000 \
        --level_compaction_dynamic_level_bytes=true \
        --mmap_read=false \
        --compression_type=none \
        --memtablerep=skip_list \
        --write_buffer_size=268435456 \
        --max_write_buffer_number=20 \
        --target_file_size_base=134217728 \
        --target_blob_file_size=134217728 \
        --blob_file_defragment_size=33554432 \
        --max_dependence_blob_overlap=128 \
        --optimize_filters_for_hits=true \
        --optimize_range_deletion=true \
        --num=60000000 \
        --db=test_kuankuan \
        --benchmark_write_rate_limit=100000000 \
        --prepare_log_writer_num=0 \
        --use_direct_io_for_flush_and_compaction=1
```

@aravind-wdc

yhr commented 2 years ago

Looking at the debug info, the last block of the first write is overwritten: the first append ends at offset 798720, and the second starts at 794624 = 798720 - 4096, re-writing the file's last 4096-byte block. ZenFS does not support overwrites. Upstream rocksdb does not do this, so we'll need to look into what is going on in terarkdb.

yhr commented 2 years ago

I added an assert on the error condition. This is the backtrace from gdb:

```
Thread 1 "db_bench" received signal SIGSEGV, Segmentation fault.
0x0000555555983380 in terarkdb::ZonedWritableFile::PositionedAppend(terarkdb::Slice const&, unsigned long, terarkdb::IOOptions const&, terarkdb::IODebugContext*) ()
(gdb) bt
#0  0x0000555555983380 in terarkdb::ZonedWritableFile::PositionedAppend(terarkdb::Slice const&, unsigned long, terarkdb::IOOptions const&, terarkdb::IODebugContext*) ()
#1  0x000055555594ef33 in terarkdb::ZenfsWritableFile::PositionedAppend(terarkdb::Slice const&, unsigned long) ()
#2  0x00005555558c53fe in terarkdb::WritableFileWriter::WriteDirect() ()
#3  0x00005555558c591f in terarkdb::WritableFileWriter::Flush() ()
#4  0x00005555558c617e in terarkdb::WritableFileWriter::Close() ()
#5  0x000055555598caa7 in terarkdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, terarkdb::VersionSet*, terarkdb::Env*, terarkdb::ImmutableCFOptions const&, terarkdb::MutableCFOptions const&, terarkdb::EnvOptions const&, terarkdb::TableCache*, terarkdb::InternalIteratorBase<terarkdb::LazyBuffer>* (*)(void*, terarkdb::Arena&), void*, std::vector<std::unique_ptr<terarkdb::FragmentedRangeTombstoneIterator, std::default_delete<terarkdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<terarkdb::FragmentedRangeTombstoneIterator, std::default_delete<terarkdb::FragmentedRangeTombstoneIterator> > > > (*)(void*), void*, std::vector<terarkdb::FileMetaData, std::allocator<terarkdb::FileMetaData> >*, terarkdb::InternalKeyComparator const&, std::vector<std::unique_ptr<terarkdb::IntTblPropCollectorFactory, std::default_delete<terarkdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<terarkdb::IntTblPropCollectorFactory, std::default_delete<terarkdb::IntTblPropCollectorFactory> > > > const*, std::vector<std::unique_ptr<terarkdb::IntTblPropCollectorFactory, std::default_delete<terarkdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<terarkdb::IntTblPropCollectorFactory, std::default_delete<terarkdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, terarkdb::SnapshotChecker*, terarkdb::CompressionType, terarkdb::CompressionOptions const&, bool, terarkdb::InternalStats*, terarkdb::TableFileCreationReason, terarkdb:
```

yhr commented 2 years ago

There are some deltas against upstream rocksdb, including specific changes around direct IO that explain the difference in behavior. See this commit for example: https://github.com/bytedance/terarkdb/commit/512059363607df22b8398bb1788a3f9174c78a05#diff-5a497572c52e60ba25fce7450f621ff517320963fd87ac37d3d85e3a3ee17670

entire history: https://github.com/bytedance/terarkdb/commits/dev.1.4/util/file_reader_writer.cc
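
For orientation, upstream rocksdb's direct-IO flush path advances the file offset only by whole aligned pages, carrying any unaligned tail over to the next flush. A paraphrased sketch of that bookkeeping (simplified from upstream's WritableFileWriter::WriteDirect; this is not terarkdb's exact code, and the device write and error handling are omitted):

```cpp
#include <cstdint>

// Round a byte count down to the previous alignment boundary.
static uint64_t TruncateToPageBoundary(uint64_t alignment, uint64_t s) {
  return s - (s % alignment);
}

struct DirectWriterSketch {
  uint64_t alignment = 4096;       // direct-IO alignment
  uint64_t next_write_offset = 0;  // offset of the next PositionedAppend
  uint64_t buffered = 0;           // bytes currently in the aligned buffer

  void FlushDirect() {
    // Advance only by whole aligned pages; any unaligned tail stays
    // buffered and is carried into the next flush (or Close()).
    uint64_t file_advance = TruncateToPageBoundary(alignment, buffered);
    uint64_t leftover_tail = buffered - file_advance;
    // ... pad the buffer to alignment and call
    //     PositionedAppend(buf, next_write_offset) here ...
    buffered = leftover_tail;
    next_write_offset += file_advance;
    // If this bookkeeping drifts, a later append's offset no longer
    // matches the zone's write pointer: the exact error reported above.
  }
};
```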

royguo commented 2 years ago

Hi @yhr, I just reviewed the commit you mentioned and didn't see any change that would cause the overwrite problem. I will dig into it a bit more soon.

yhr commented 2 years ago

@royguo, it looks like terarkdb is missing this patch: https://github.com/facebook/rocksdb/pull/4771/commits/f0e1840d15137e632d9ee99f37394c81b7fa30a5

After applying that, it looks like things are working with --use_direct_io_for_flush_and_compaction in terarkdb.
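
For anyone carrying a terarkdb branch, one way to pick up the fix is to fetch the PR ref from upstream and cherry-pick that commit, e.g. `git fetch https://github.com/facebook/rocksdb.git pull/4771/head` followed by `git cherry-pick f0e1840d15137e632d9ee99f37394c81b7fa30a5`, assuming util/file_reader_writer.cc has not diverged too far for the patch to apply cleanly.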