zsmartsystems / com.zsmartsystems.zigbee

ZigBee Cluster Library Java framework supporting multiple dongles
Eclipse Public License 1.0
141 stars 87 forks source link

ArrayIndexOutOfBoundsException during shuffle of outstanding queues (transaction manager) #1363

Closed mikomarrache closed 1 year ago

mikomarrache commented 1 year ago

We got the following exception:

java.lang.ArrayIndexOutOfBoundsException: 3
    at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:388)
    at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:397)
    at java.util.Collections.swap(Collections.java:499)
    at java.util.Collections.shuffle(Collections.java:460)
    at java.util.Collections.shuffle(Collections.java:427)
    at com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager.sendNextTransaction(ZigBeeTransactionManager.java:656)
    at com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager.transactionComplete(ZigBeeTransactionManager.java:540)
    at com.zsmartsystems.zigbee.transaction.ZigBeeTransaction.failTransaction(ZigBeeTransaction.java:446)
    at com.zsmartsystems.zigbee.transaction.ZigBeeTransaction.transactionStatusReceived(ZigBeeTransaction.java:491)
    at com.zsmartsystems.zigbee.transaction.ZigBeeTransactionManager$3.run(ZigBeeTransactionManager.java:579)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

It happened only once but I think it may happen if the list is shuffled while it is modified.

cdjackson commented 1 year ago

I thought that this shouldn't be possible since this is a CopyOnWriteArrayList so it should manage concurrent access -:

A thread-safe variant of ArrayList in which all mutative operations (add, set, and so on) are implemented by making a fresh copy of the underlying array.

Maybe something in here isn't safe (I guess given the OOB exception). Also, I would have expected a CME if there was a concurrent modification problem?

mikomarrache commented 1 year ago

The shuffle() implementation uses the get and set methods which are thread-safe in CopyOnWriteArrayList. However, the get and set methods are called with indices that may not be relevant anymore (therefore the OOB). This implementation is thread-safe inside the modifying methods by creating a copy of the underlying array, but nothing prevents you from keeping bad indices in client code that calls these methods. That's what we have here.

Do we need the shuffle method to be called here?

If yes, I'm afraid we'll have to synchronize on client code.

cdjackson commented 1 year ago

Thanks for the explanation - I'd not looked at how this was being used.

Do we need the shuffle method to be called here?

Yes - at least we need some way to randomise the queues - otherwise we'd send all frames for one device before moving on to the next device. So while shuffle probably isn't required - some way to select a random queue is needed. Maybe there's a better option though than shuffle.

ViToni commented 1 year ago

Yes - at least we need some way to randomise the queues - otherwise we'd send all frames for one device before moving on to the next device. So while shuffle probably isn't required - some way to select a random queue is needed. Maybe there's a better option though than shuffle.

Is there a special reason which makes it better to have multiple queues to take work from compared to a common queue where outstanding work is put in and then all outstanding work is processed from there?

cdjackson commented 1 year ago

Yes, because we need a way to manage different devices. The system needs to treat sleepy devices differently to other devices for example. We release only one message to a sleepy device over a certain period. Similar issues for broadcasts - only a certain number are allowed in a 7 second period.

Due to these requirements each device was managed in a separate queue which has its own parameters.

Sent from my iPhone

On 9 Dec 2022, at 19:18, Mickael Marrache @.***> wrote:

 The shuffle() implementation uses the get and set methods which are thread-safe in CopyOnWriteArrayList. However, the get and set methods are called with indices that may not be relevant anymore (therefore the OOB). This implementation is thread-safe inside the modifying methods by creating a copy of the underlying array, but nothing prevents you from keeping bad indices in client code that calls these methods. That's what we have here.

Do we need the shuffle method to be called here?

If yes, I'm afraid we'll have to synchronize on client code.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.

ViToni commented 1 year ago

@cdjackson Thanks for claryfing. Maybe I can rephrase the question: can we uses a central queue to manage the transactions queues, thereby keeping order and covering the side cases like sleepy devices and "blocked" devices? I created PR #1381 to clarify my approach. Maybe you could give it a look to see if the idea is feasible.

The idea is that a transaction queue can be added multiple times to the main queue, once for each transaction it needs to send out. When polling the outstandingQueues queue we get a transaction queue for each pending transaction. The transaction queue doesn't need to be removed when cancelled, it's just need to be cleared. When a clared queue is retrieved from outstandingQueues it results into a noop. This should also remove the need for shuffeling since all transactions are orderered by time due to the insertion order of the transaction queues into outstandingQueues.