wimleers / fileconveyor

File Conveyor is a daemon written in Python to detect, process and sync files. In particular, it's designed to sync files to CDNs. Amazon S3 and Rackspace Cloud Files, as well as any Origin Pull or (S)FTP Push CDN, are supported. Originally written for my bachelor thesis at Hasselt University in Belgium.
https://wimleers.com/fileconveyor
The Unlicense

Implement "event merging" #68

Closed · wimleers closed this issue 13 years ago

wimleers commented 13 years ago

Whenever a change on the file system is detected, File Conveyor adds it to the "discover queue", after which it is moved ASAP to the "pipeline queue". In the pipeline queue, items have the form (input_file, event). Now, if a file is changed, that may surface as "1) delete, 2) create": two separate events in the queue. I want to add functionality to merge such events.

Also, when a file is first created, then deleted, then created, then deleted, and so on, and the daemon has so many files to process that all those events are still queued, it is not smart enough: it will delete the file from the CDN, then create it, then delete it, and so on. That is, it replays the exact sequence of events instead of merging them into the single event that describes the net result. Merging these events will make the daemon faster and more efficient. This is the only major oversight I'm aware of.

All merging patterns:

This implies that a new data structure has to be developed: the current PersistentQueue data structure does not allow for this merging of events.
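As a rough sketch of what such a structure could look like (this is not fileconveyor code; the merge table is my reading of the patterns referred to above, and the event names mirror the FSMonitor constants discussed later in this thread):

# Keep at most one pending event per file and merge newly detected
# events into it, instead of appending them verbatim.
CREATED, MODIFIED, DELETED = "created", "modified", "deleted"

# (queued event, new event) -> merged event; None means the two cancel out.
MERGE = {
    (CREATED, MODIFIED): CREATED,   # still unsynced, so it remains a plain create
    (CREATED, DELETED): None,       # never synced, so nothing needs to happen
    (MODIFIED, MODIFIED): MODIFIED,
    (MODIFIED, DELETED): DELETED,
    (DELETED, CREATED): MODIFIED,   # the old copy still exists at the destination
}

class MergingQueue:
    def __init__(self):
        self.pending = {}  # input_file -> event

    def put(self, input_file, event):
        if input_file not in self.pending:
            self.pending[input_file] = event
            return
        merged = MERGE.get((self.pending[input_file], event), event)
        if merged is None:
            del self.pending[input_file]
        else:
            self.pending[input_file] = merged

With a structure like this, the create/delete/create/... churn described above collapses into the single net event instead of being replayed verbatim against the CDN.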

wimleers commented 13 years ago

Related issues: #2, #12 & #30.

tedfordgif commented 13 years ago

What is the reason to have new and modified be different states? It seems that state determination should be based on interrogation of the endpoint(s) of the transfer(s). In other words, a file is new if it does not exist at the destination, and modified if it does. If I have multiple endpoints, they might somehow get into different states.

wimleers commented 13 years ago

The state at the endpoint is also taken into account.

However, this is purely about the state of file events in the pipeline queue. The pipeline queue contains files that are yet to be processed (and then transported). Nothing has happened to them yet, other than their events being detected and the files thereby ending up in the discover queue. Therefore, it makes sense to mark a new, unsynced file that is then modified (while it is still unsynced) as just "new", instead of creating both a "new" and a "modified" event for it in the pipeline queue (which is what currently happens). The result of the current (unfortunate and inefficient) implementation is that the file is synced twice even though its contents haven't changed in between: once for the creation ("new") event, and again because it was changed ("modify") before the first sync could happen.

That's what this is about.
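To make that concrete, here is a minimal sketch of the intended coalescing for this case (the file path and event labels are illustrative, not taken from the codebase):

# Current behavior: events are queued verbatim, so this file is synced twice.
queue = [("/a/file.css", "new"), ("/a/file.css", "modified")]
# Desired behavior: merge while still queued, so it is synced exactly once.
merged = {}
for input_file, event in queue:
    if merged.get(input_file) == "new" and event == "modified":
        event = "new"  # still unsynced, so the net effect is a plain create
    merged[input_file] = event
assert list(merged.items()) == [("/a/file.css", "new")]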

The state at the endpoint is only taken into account as soon as we arrive at the transporting step.

I hope this answers your concerns.

tedfordgif commented 13 years ago

I understand the purpose of this ticket; perhaps my question could have been clearer. Is there any need to track three file statuses (new, modified, deleted), or would two decision statuses suffice (needs-to-be-transferred, needs-to-be-deleted)? I suppose there is no need to throw away that information when the notification library supplies it, but there might not be a need to keep it either. If so, instead of worrying about merging events, it's just a binary decision:

if event == "deleted":
    queue_status = "delete"
else:  # "new" or "modified"
    queue_status = "transfer"

It doesn't change the decision, but simplifying like this might have made this design consideration more obvious in the first place.

wimleers commented 13 years ago

First of all: these design decisions were made >2 years ago, so I can't immediately give you "the complete explanation".

Those two decision statuses are indeed all that's necessary to know whether to transfer the file to, or delete it from, one of its destinations. However, this simplification of events only happens at the very last step. Before that, I keep all metadata around (you never know what kind of edge case, optimization or functionality might need additional metadata). File Conveyor's components (i.e. all code except for arbitrator.py) are as generic as possible, allowing them to be interchanged fairly easily.

Also, I think there may be some edge cases deep in arbitrator.py that require this knowledge. They might not strictly need it, but for now, I'm not going to change this.

Finally, in the case of FSMonitorInotify, there is the fact that it triggers multiple events for even a simple touch of a file in a monitored directory. When that file does not yet exist, the touch generates both an IN_CREATE (mapped to FSMonitor.CREATED) and an IN_ATTRIB (mapped to FSMonitor.MODIFIED) event. If we were able to merge those into a single FSMonitor.CREATED event, we would have the correct interpretation of what happened without simplifying it to "transfer" so early in the process. I.e., other applications may be able to benefit from this as well.
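For illustration, here is a minimal pyinotify sketch (assuming pyinotify is installed; this is not FSMonitorInotify itself) that logs the raw events a single touch of a new file produces:

import pyinotify

class LogEvents(pyinotify.ProcessEvent):
    def process_default(self, event):
        # For a touch of a not-yet-existing file, this prints IN_CREATE
        # followed by IN_ATTRIB for the same path.
        print(event.maskname, event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/monitored/directory', pyinotify.IN_CREATE | pyinotify.IN_ATTRIB)
pyinotify.Notifier(wm, LogEvents()).loop()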

I'm afraid I can't give you a better answer than the above at this time.

wimleers commented 13 years ago

Oops, that commit should have closed #69! Reopening.

wimleers commented 13 years ago

There's something stupid I've overlooked: the changed DB schema for PersistentQueue implies I need to provide an upgrade path! Reported by @unn. Will fix this.
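A sketch of what such an upgrade path could look like, assuming PersistentQueue's SQLite backing store (the table and column names below are made up for illustration; the real schema may differ):

import sqlite3

def upgrade_persistent_queue(db_path, table="queue"):
    # Detect a pre-event-merging schema and migrate it in place, so that
    # existing installs keep their queued items across the upgrade.
    conn = sqlite3.connect(db_path)
    columns = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    if "event" not in columns:  # hypothetical column added by this change
        conn.execute("ALTER TABLE %s ADD COLUMN event INTEGER" % table)
        conn.commit()
    conn.close()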