wvwwvwwv / scalable-concurrent-containers

High performance containers and utilities for concurrent and asynchronous programming
Apache License 2.0

non-deterministic test failures in v2.1.6 on different CPU architectures (other than x86_64) #153

Closed decathorpe closed 2 months ago

decathorpe commented 2 months ago

I'm working on packaging the scc crate for Fedora Linux as a new dependency of serial_test, and I am encountering test failures on CPU architectures other than x86_64. I ran four builds on five architectures each, and every build failed because of test failures on at least one architecture:

On i686-unknown-linux-gnu, the following tests failed:

tests::correctness::bag_test::mpmc
tests::correctness::hashmap_test::insert_read_remove
tests::correctness::hashmap_test::retain_any
tests::correctness::treeindex_test::prop_remove_range
tree_index::leaf::test::calculate_boundary

On a second and third try, these failed:

tests::correctness::bag_test::mpmc
tests::correctness::treeindex_test::prop_remove_range
tree_index::leaf::test::calculate_boundary
Full test output from first try

---- tests::correctness::bag_test::mpmc stdout ----
thread 'tests::correctness::bag_test::mpmc' panicked at src/bag.rs:84:9:
assertion failed: ARRAY_LEN <= DEFAULT_ARRAY_LEN
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- tests::correctness::treeindex_test::prop_remove_range stdout ----
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2312:13:
assertion `left == right` failed
  left: 3
 right: 2
proptest: Saving this and future failures in /builddir/build/BUILD/rust-scc-2.1.6-build/scc-2.1.6/proptest-regressions/tests/correctness.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 074a97fceb48a66a19b3da68788380536ecc139810233239b80d635a2a4e758f
thread 'tests::correctness::treeindex_test::prop_remove_range' panicked at src/tests/correctness.rs:2303:5:
Test failed: assertion `left == right` failed
  left: 3
 right: 2.
minimal failing input: lower = 0, range = 0
    successes: 0
    local rejects: 0
    global rejects: 0
---- tree_index::leaf::test::calculate_boundary stdout ----
thread 'tree_index::leaf::test::calculate_boundary' panicked at src/tree_index/leaf.rs:1060:9:
assertion `left == right` failed
  left: 5
 right: 3
---- tests::correctness::hashmap_test::insert_read_remove stdout ----
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tests::correctness::hashmap_test::insert_read_remove' panicked at src/tests/correctness.rs:332:17:
assertion failed: r.is_ok()
---- tests::correctness::hashmap_test::retain_any stdout ----
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tests::correctness::hashmap_test::retain_any' panicked at src/tests/correctness.rs:543:17:
assertion failed: r.is_ok()

On aarch64-unknown-linux-gnu, the following tests failed:

tests::correctness::treeindex_test::remove
Full test output

---- tests::correctness::treeindex_test::remove stdout ----
thread '' panicked at src/tests/correctness.rs:2130:21:
assertion `left == right` failed: 18 17 17
  left: 18
 right: 17
thread 'tests::correctness::treeindex_test::remove' panicked at src/tests/correctness.rs:2135:27:
called `Result::unwrap()` on an `Err` value: Any { .. }

On powerpc64le-unknown-linux-gnu, the following tests failed:

tests::correctness::treeindex_test::remove
tree_index::internal_node::test::durability

On a second try, these tests failed:

tests::correctness::treeindex_test::remove
tree_index::internal_node::test::durability
tree_index::leaf_node::test::durability

On a third try, these tests failed:

tests::correctness::treeindex_test::remove
tree_index::internal_node::test::durability

On a fourth try, these tests failed:

tests::correctness::treeindex_test::remove
tests::performance::benchmark::benchmarks_sync
tree_index::internal_node::test::durability
tree_index::leaf_node::test::durability
Full test output from first try

---- tests::correctness::treeindex_test::remove stdout ----
thread '' panicked at src/tests/correctness.rs:2128:21:
assertion `left == right` failed
  left: 1
 right: 0
thread '' panicked at src/tests/correctness.rs:2128:21:
assertion `left == right` failed
  left: 1
 right: 0
thread '' panicked at src/tests/correctness.rs:2130:21:
assertion `left == right` failed: 1 0 0
  left: 1
 right: 0
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread '' panicked at src/tree_index/leaf_node.rs:861:46:
internal error: entered unreachable code
thread 'tests::correctness::treeindex_test::remove' panicked at src/tests/correctness.rs:2135:27:
called `Result::unwrap()` on an `Err` value: Any { .. }
---- tree_index::internal_node::test::durability stdout ----
thread 'tokio-runtime-worker' panicked at src/tree_index/internal_node.rs:1302:37:
assertion `left == right` failed
  left: 61
 right: 52
thread 'tokio-runtime-worker' panicked at src/tree_index/internal_node.rs:1302:37:
assertion `left == right` failed
  left: 61
 right: 52
thread 'tokio-runtime-worker' panicked at src/tree_index/internal_node.rs:1302:37:
assertion `left == right` failed
  left: 61
 right: 52
thread 'tree_index::internal_node::test::durability' panicked at src/tree_index/internal_node.rs:1325:21:
assertion failed: r.is_ok()

In particular, these errors sound concerning and could be indicative of soundness or correctness issues:

The fact that none of the tests failed on x86_64 makes me think that some code path might rely on the stronger memory ordering guarantees of x86_64 compared to other CPU architectures, so possibly atomics are related.
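
To illustrate the kind of ordering dependency suspected here, below is a minimal sketch of publishing a value across threads with a Release/Acquire pair. It is not taken from scc; `publish_and_read` and its variables are hypothetical names chosen for the example. x86_64 hardware does not reorder a store with an earlier store, so code that uses `Relaxed` where `Release`/`Acquire` is needed is more likely to go unnoticed there than on aarch64 or powerpc64le.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical example: publishes `value` from one thread and reads it from another.
fn publish_and_read(value: u64) -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(value, Ordering::Relaxed);
            // Release: the store to `data` above is visible to any thread
            // that observes `ready == true` through an Acquire load.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire pairs with the Release store in the producer thread.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // Guaranteed to see `value` because of the Release/Acquire pair.
    let seen = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    seen
}
```

Calling `publish_and_read(42)` returns 42 on every architecture. If both flag operations were downgraded to `Ordering::Relaxed`, the language would no longer guarantee that result, even though the bug might never show up on x86_64.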

If it helps, the tests were compiled and run with Rust 1.80.0 and with --release.

wvwwvwwv commented 2 months ago

Hi, thanks for reporting this!

  1. Problems on i686. Most tests assume size_of::<usize>() == 8. I'll need to fix them.
thread 'tokio-runtime-worker' panicked at src/hash_table.rs:815:21:
index out of bounds: the len is 16 but the index is 16
thread 'tests::correctness::hashmap_test::retain_any' panicked at src/tests/correctness.rs:543:17:

-> It is not a test issue: BUCKET_LEN does not vary with the size of usize, which causes trouble on 32-bit machines. -> This needs to be fixed for serial_test to work correctly on 32-bit machines -> trivial.

  2. Problems on AARCH64 and PPC64LE with TreeIndex.

Yes, these are definitely data races exposed by relaxed memory models. Fixing those issues will take some time. Rest assured, TreeIndex is not used by serial_test.
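
As an illustration of the 32-bit problem in item 1, here is a minimal sketch of a bucket whose slot count is derived from the width of usize instead of being hard-coded for 64-bit targets. This is not scc's actual code: `SLOTS_PER_WORD`, `Bucket`, and its methods are hypothetical names, and the layout (one usize occupancy bitmap per bucket) is an assumption made only for the example.

```rust
/// Hypothetical: number of slots that one `usize` occupancy bitmap can track.
/// This is 32 on i686 and 64 on x86_64, so the array below always matches the bitmap.
const SLOTS_PER_WORD: usize = usize::BITS as usize;

/// Hypothetical bucket: bit `i` of `occupied` is set when `slots[i]` holds a value.
struct Bucket {
    occupied: usize,
    slots: [Option<u64>; SLOTS_PER_WORD],
}

impl Bucket {
    fn new() -> Self {
        Bucket { occupied: 0, slots: [None; SLOTS_PER_WORD] }
    }

    /// Stores `value` in the first free slot and returns its index,
    /// or `None` when every slot is taken.
    fn insert(&mut self, value: u64) -> Option<usize> {
        // When the bitmap is full, `!occupied` is 0 and `trailing_zeros`
        // returns `usize::BITS`, which the bounds check below rejects.
        let index = (!self.occupied).trailing_zeros() as usize;
        if index >= SLOTS_PER_WORD {
            return None;
        }
        self.occupied |= 1usize << index;
        self.slots[index] = Some(value);
        Some(index)
    }

    /// Reads the value at `index`, returning `None` for empty or out-of-range slots.
    fn get(&self, index: usize) -> Option<u64> {
        self.slots.get(index).copied().flatten()
    }
}
```

Because both the array length and the bit arithmetic come from `usize::BITS`, an index computed from the bitmap cannot exceed the array length on either 32-bit or 64-bit targets, which is the class of mismatch behind a panic such as "the len is 16 but the index is 16".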

wvwwvwwv commented 2 months ago

2.1.7 addresses the mentioned issues unrelated to tree_index. I'll take time to look into the remaining tree_index problems.

tree_index currently has two problems.

  1. Index mis-calculation on 32-bit CPUs.
  2. Incorrect visibility on the "search/remove" execution path, most likely caused by overly relaxed use of memory barriers.

decathorpe commented 2 months ago

Thank you for the fast response and for investigating!

wvwwvwwv commented 2 months ago

Remaining problems with tree_index.

  1. (general) fn remove_range: entries may remain if the depth >= 3.
  2. (multi-threaded, non-intel) fn remove: may fail to remove an entry.
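
The first problem can be checked with a simple property: after removing a key range, no key inside that range may remain. The sketch below shows the shape of such a check. It uses `std::collections::BTreeMap` as a stand-in for the container under test, and `remove_range` and `leftover_keys` are hypothetical helpers written for this example, not scc APIs. A real check would call TreeIndex instead and insert enough keys to build a tree at least three levels deep.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Stand-in for the operation under test (assumption: a real check would
/// call the container's own range removal here).
fn remove_range(map: &mut BTreeMap<usize, usize>, range: Range<usize>) {
    map.retain(|k, _| !range.contains(k));
}

/// Inserts keys `0..num_keys`, removes `range`, and returns every key
/// that is still present although it lies inside `range`.
fn leftover_keys(num_keys: usize, range: Range<usize>) -> Vec<usize> {
    let mut map: BTreeMap<usize, usize> = (0..num_keys).map(|k| (k, k)).collect();
    remove_range(&mut map, range.clone());
    map.keys().copied().filter(|k| range.contains(k)).collect()
}
```

A correct implementation yields an empty vector for every input, including the empty range `0..0`. Any key returned by `leftover_keys` is an entry that wrongly survived the removal.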

wvwwvwwv commented 2 months ago

Update.

  1. (general) remove_range: resolved, need cleanup -> SCC 2.1.8 (in 5 days).
  2. (relaxed memory ordering) TreeIndex::{remove, find}: on-going.
  3. (relaxed memory ordering, new) Bag::pop, sometimes non-linearisable: on-going.

wvwwvwwv commented 2 months ago

Update.

  1. (relaxed memory ordering) TreeIndex::{remove, find}, misses an entry in a node being split/removed: on-going.
  2. (relaxed memory ordering) Bag::pop, sometimes non-linearisable: on-going. -> SCC 2.1.9 (in 5 days).

wvwwvwwv commented 2 months ago

Update (SCC 2.1.9).

Now the only remaining issue is the data race in TreeIndex on relaxed memory ordering architectures. -> SCC 2.1.10 (in 7 days).

changgyoopark-db commented 2 months ago

Added a deterministic loom model that reproduces the exact same problem. -> Will be fixed in SCC 2.1.10: scheduled for 11th Aug at the latest.

changgyoopark-db commented 2 months ago

I'm closing this ticket for the time being, but I'm not 100% sure that all the tree_index issues have been perfectly resolved because,

So, @decathorpe if you encounter a similar problem again, please file a new ticket, and temporarily pin SCC to 2.1.9.

decathorpe commented 2 months ago

Awesome. Thank you! If I encounter any more issues, I will report back.