Make sure all operations can be run concurrently multiple times

mitar commented 10 years ago

Make sure all operations can be run concurrently multiple times. There are two main issues.

Ensuring that concurrent downsampling runs do the expected thing (neither overwriting nor duplicating work). We could probably lock each stream when its downsampling starts and have other runs skip it. We should make sure that streams do not stay locked indefinitely. The same applies to backprocessing of dependent streams.

Ensuring that datapoints can be appended concurrently. This mostly works already, including for processing of dependent streams. The only known issue is with the derive operator, which expects the reset stream to be processed before the data stream so that it can know whether a reset happened. Maybe we should just document this and require the user to ensure that ordering? Or should we make it work regardless of the order? The issue with the latter approach is that it seems we would have to store datapoints not only for when a reset happened, but also for when it did not.

kostko commented 10 years ago

We have now implemented the following:

Multiple downsample operations can be run concurrently and use per-stream locking. (7a11b4c9f36e630e2526a58814ac91e2467052a0) Other downsamplers do not wait for the lock to be released, but simply skip to the next stream. This introduces two new fields in stream metadata: _lock_mt, which holds the timestamp at which the lock expires, and downsample_count, which holds a monotonically increasing counter of performed downsample operations. During downsampling, if the lock is near expiry, we extend it.
Interleaving of append and downsample operations is handled properly. (99d6fd3b120730bccc486b46544e6d97b4376609, 71243246d27425e0baf040c838e6861244451a8a) Before inserting a datapoint we update stream metadata to reflect the timestamp of the last inserted datapoint. To properly handle cases where multiple appends to the same stream interleave with downsample operations, we use a safety margin of 10 seconds. We maintain a per-stream list of timestamps of datapoints inserted (or in the middle of being inserted) in the last 10 seconds. Before downsampling, this list is checked and the minimum of these timestamps is selected as the reference point for downsampling the stream. This guarantees that, as long as an append takes less than 10 seconds to complete (between updating stream metadata and the actual datapoint insertion), downsampling will be consistent and will not skip datapoints that are pending insertion.
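
For readers following along, the per-stream locking described above can be sketched roughly as follows. This is only an in-memory illustration, not the code from the referenced commits: the `StreamLocks` class, the `downsample_all` helper, and the `LOCK_DURATION` and `EXTEND_THRESHOLD` values are all hypothetical, and the real implementation keeps the expiry in the `_lock_mt` field of stream metadata rather than in a Python dict.

```python
import threading

LOCK_DURATION = 60.0     # assumed lock lifetime in seconds (hypothetical value)
EXTEND_THRESHOLD = 10.0  # assumed: extend when less than this much time remains


class StreamLocks:
    """In-memory stand-in for per-stream lock metadata (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()
        # stream_id -> expiry timestamp; plays the role of the _lock_mt field.
        self._expiry = {}

    def try_acquire(self, stream_id, now):
        """Take the lock unless another holder has an unexpired lock."""
        with self._guard:
            expiry = self._expiry.get(stream_id)
            if expiry is not None and expiry > now:
                return False  # still locked: the caller should skip this stream
            self._expiry[stream_id] = now + LOCK_DURATION
            return True

    def extend_if_near_expiry(self, stream_id, now):
        """Push the expiry forward when the held lock is about to run out."""
        with self._guard:
            if self._expiry[stream_id] - now < EXTEND_THRESHOLD:
                self._expiry[stream_id] = now + LOCK_DURATION
                return True
            return False

    def release(self, stream_id):
        with self._guard:
            self._expiry.pop(stream_id, None)


def downsample_all(locks, stream_ids, downsample, clock):
    """Downsample every stream that is not locked; never wait for a lock."""
    done = []
    for stream_id in stream_ids:
        if not locks.try_acquire(stream_id, clock()):
            continue  # another downsampler is working on it, skip to the next
        try:
            downsample(stream_id)
            done.append(stream_id)
        finally:
            locks.release(stream_id)
    return done
```

Because the lock carries an expiry rather than being a plain flag, a downsampler that crashes cannot leave a stream locked indefinitely; a long-running one would call something like `extend_if_near_expiry` periodically while it works.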

Handling concurrent backprocessing and derived streams is still pending.

mitar commented 10 years ago

Just to add to the comment above: this currently means that you can downsample only up to 10 seconds before the last datapoint. (The 10 seconds is the safety margin mentioned above.)
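
A rough sketch of how such a reference point could be chosen, to make the margin concrete. The function name `reference_point`, the shape of `recent_appends` (pairs of registration time and datapoint timestamp), and the fallback to the latest datapoint are assumptions made for illustration; the actual logic lives in the commits referenced above.

```python
SAFETY_MARGIN = 10.0  # seconds, the margin mentioned in this thread


def reference_point(recent_appends, latest_datapoint, now, margin=SAFETY_MARGIN):
    """Pick the timestamp up to which downsampling may safely proceed.

    recent_appends: iterable of (registered_at, datapoint_timestamp) pairs,
        one per append that has updated stream metadata (assumed layout).
    latest_datapoint: timestamp of the last inserted datapoint, used when
        no append was registered within the margin (assumed fallback).
    """
    # Only appends registered within the margin may still be in flight.
    recent = [
        datapoint_ts
        for registered_at, datapoint_ts in recent_appends
        if now - registered_at <= margin
    ]
    # Downsampling must not go past the earliest possibly pending datapoint.
    return min(recent) if recent else latest_datapoint


appends = [(100.0, 95.0), (104.0, 99.0), (80.0, 75.0)]
print(reference_point(appends, latest_datapoint=99.0, now=105.0))
print(reference_point(appends, latest_datapoint=99.0, now=200.0))
```

In the first call the append registered at time 80 is older than the margin and is ignored, so the minimum of the remaining datapoint timestamps is used; in the second call nothing is recent, so the sketch falls back to the latest datapoint.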

wlanslovenija / datastream

Make sure all operations can be run concurrently multiple times #23