"Not enough replicas" exception condition checked against numReplicas instead quorum. Quorum is a non zero variable this.quorum = this.numReplicas / 2 + 1; and the condition doesn't get triggered when no partition is available as it is supposed to.
As a result during recovery process a thread is stuck in a recovery completion process holding lock on StoreSessionManager object blocking another thread from adding replica after assign-partition Cli is called.
This happens during creation of a new cluster and deployment of server nodes first.
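For illustration, here is a minimal sketch of the arithmetic behind the bug. Names such as `availableReplicas` and the standalone class are assumptions made for this example, not the actual Waltz source:

```java
// Minimal sketch of the guard with simplified, hypothetical names
// (availableReplicas, QuorumCheckSketch) -- not the actual Waltz code.
public class QuorumCheckSketch {

    public static void main(String[] args) {
        int numReplicas = 0;              // new cluster, no replicas assigned yet
        int quorum = numReplicas / 2 + 1; // always >= 1, per this.quorum = this.numReplicas / 2 + 1;
        int availableReplicas = 0;        // nothing is reachable

        // Buggy condition: checked against numReplicas. With 0 replicas this is
        // 0 < 0, so "Not enough replicas" is never thrown and recovery keeps
        // waiting for a completion that can never happen.
        boolean buggyTriggers = availableReplicas < numReplicas;

        // Intended condition: checked against quorum. With 0 replicas quorum is 1,
        // so 0 < 1 triggers immediately and recovery fails fast instead of hanging.
        boolean intendedTriggers = availableReplicas < quorum;

        System.out.println("buggy check triggers:    " + buggyTriggers);    // false
        System.out.println("intended check triggers: " + intendedTriggers); // true
    }
}
```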
"Thread-2-Append-P0" #14 daemon prio=5 os_prio=0 tid=0x00007fa324012800 nid=0x2a in Object.wait() [0x00007fa36cf4c000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000a7d16a98> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at com.wepay.waltz.store.internal.RecoveryManagerImpl.awaitCompletion(RecoveryManagerImpl.java:430)
- locked <0x00000000a7d16a98> (a java.lang.Object)
at com.wepay.waltz.store.internal.RecoveryManagerImpl.highWaterMark(RecoveryManagerImpl.java:400)
at com.wepay.waltz.store.internal.StoreSessionImpl.open(StoreSessionImpl.java:100)
at com.wepay.waltz.store.internal.StoreSessionManager.createSession(StoreSessionManager.java:203)
at com.wepay.waltz.store.internal.StoreSessionManager.getStoreSession(StoreSessionManager.java:144)
- locked <0x00000000a76ad680> (a com.wepay.waltz.store.internal.StoreSessionManager)
at com.wepay.waltz.store.internal.StorePartitionImpl.highWaterMark(StorePartitionImpl.java:175)
at com.wepay.waltz.server.internal.Partition$AppendTask.init(Partition.java:543)
at com.wepay.riff.util.RepeatingTask.lambda$new$0(RepeatingTask.java:20)
at com.wepay.riff.util.RepeatingTask$$Lambda$11/1905485420.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
"pool-2-thread-1" #28 daemon prio=5 os_prio=0 tid=0x00007fa324018000 nid=0x37 waiting for monitor entry [0x00007fa35c99a000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.wepay.waltz.store.internal.StoreSessionManager.getStoreSession(StoreSessionManager.java:132)
- waiting to lock <0x00000000a76ad680> (a com.wepay.waltz.store.internal.StoreSessionManager)
at com.wepay.waltz.store.internal.StoreImpl.getStoreSession(StoreImpl.java:170)
at com.wepay.waltz.store.internal.StoreImpl.lambda$onReplicaAssignmentsUpdate$1(StoreImpl.java:152)
at com.wepay.waltz.store.internal.StoreImpl$$Lambda$52/2053985767.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The fix:
If the number of store replicas is 0, a StoreException is thrown and the store session is closed; it remains closed until `assign-partition` is called, which opens the store session. The partition itself is kept open throughout.
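A hedged sketch of that intended flow, under stated assumptions: the class, method names, and exception wiring below are illustrative placeholders, not the real Waltz API.

```java
// Illustrative sketch of the fixed flow; names here are placeholders,
// not the real Waltz classes or methods.
class StoreSessionSketch {

    private boolean sessionOpen = false;

    // Called when the partition is opened. The partition itself stays open
    // regardless of whether the store session could be opened.
    void openSession(int numReplicas) throws Exception {
        if (numReplicas == 0) {
            // Fail fast instead of waiting forever in recovery: the session
            // stays closed until replicas are assigned.
            sessionOpen = false;
            throw new Exception("StoreException: no replicas assigned to this partition");
        }
        sessionOpen = true;
    }

    // Called when the assign-partition CLI adds a replica assignment:
    // this is where the store session is (re)opened.
    void onReplicaAssignmentsUpdate(int numReplicas) throws Exception {
        if (!sessionOpen && numReplicas > 0) {
            openSession(numReplicas);
        }
    }
}
```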
How to reproduce the error:
1. Comment out everything after the line `echo "----- assigning partitions to the storage -----"` in `add-storage.sh`.
2. Create a new cluster: `bin/test-cluster.sh start`
3. Add a partition: `bin/storage-cli.sh add-partition -c config/local-docker/waltz_cluster/waltz-tools.yml -s localhost:55281 -p 0`
4. Assign the partition: `bin/zookeeper-cli.sh assign-partition -c config/local-docker/waltz_cluster/waltz-tools.yml -s waltz_cluster_storage:55280 -p 0`
5. See the error / the fix working: `bin/zookeeper-cli.sh list --cli-config-path ./config/local-docker/waltz-tools.yml`. Without the fix we see
store [/waltz_cluster/store/partition/0] replica states:
No node found
instead of
store [/waltz_cluster/store/partition/0] replica states:
ReplicaId(0,waltz_cluster_storage:55280), SessionId: 2, closingHighWaterMark: UNRESOLVED
"Not enough replicas" exception condition checked against
numReplicas
insteadquorum
. Quorum is a non zero variablethis.quorum = this.numReplicas / 2 + 1;
and the condition doesn't get triggered when no partition is available as it is supposed to. As a result during recovery process a thread is stuck in a recovery completion process holding lock on StoreSessionManager object blocking another thread from adding replica afterassign-partition
Cli is called.This happens during creation of a new cluster and deployment of server nodes first.
The fix: if number of store replicas is 0, StoreException is thrown, store session is closed and remains closed until
assign-partition
is called, which opens the store session. Partition is kept open through out the whole time.How to reproduce the error: comment everything after line
echo "----- assigning partitions to the storage -----"
inadd-storage.sh
create new cluster:bin/test-cluster.sh start
add-partition:bin/storage-cli.sh add-partition -c config/local-docker/waltz_cluster/waltz-tools.yml -s localhost:55281 -p 0
assign-partition:bin/zookeeper-cli.sh assign-partition -c config/local-docker/waltz_cluster/waltz-tools.yml -s waltz_cluster_storage:55280 -p 0
see the error/fix working:bin/zookeeper-cli.sh list --cli-config-path ./config/local-docker/waltz-tools.yml
- without the fix we seeinstead of