wimleers / fileconveyor

File Conveyor is a daemon written in Python to detect, process and sync files. In particular, it's designed to sync files to CDNs. Amazon S3 and Rackspace Cloud Files, as well as any Origin Pull or (S)FTP Push CDN, are supported. Originally written for my bachelor thesis at Hasselt University in Belgium.
https://wimleers.com/fileconveyor
The Unlicense
340 stars 95 forks

When no `fileDeletionDelayAfterSync` attribute is set, no deletion should happen at all #108

Open kneedrag opened 12 years ago

kneedrag commented 12 years ago

With the Mosso and symlink_or_copy transporters, File Conveyor deletes all of the source files. A workaround is to set fileDeletionDelayAfterSync to an extremely high number that will hopefully never be hit. It would be better to find out why this is occurring and fix it.

This workaround seems to be working for me:

<rule for="items" label="cdn" fileDeletionDelayAfterSync="60000000000000000">

satheshf12000 commented 12 years ago

@kneedrag Did you find a workaround yet? If so, please post it here for the community. Thanks.

kneedrag commented 12 years ago

I found adding this to my rule worked fine:

fileDeletionDelayAfterSync="60000000000000000"

Then periodically I use sqlite from the command line and open persistent_data.db. From there you can issue the command:

delete from files_to_delete_list where id > 0 ;
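
The same cleanup can also be scripted. Here's a minimal sketch, assuming the daemon is stopped first and that persistent_data.db sits in the current directory (adjust the path to wherever your installation keeps it):

import sqlite3

# Empty File Conveyor's scheduled-deletion queue so the absurdly large
# fileDeletionDelayAfterSync value never actually fires.
conn = sqlite3.connect("persistent_data.db")
conn.execute("DELETE FROM files_to_delete_list WHERE id > 0;")
conn.commit()
conn.close()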

Not the best solution, there are likely others -- but this has served me fine.
Good luck.

mnestor commented 12 years ago

I don't even have a delete section on mine and it nuked everything. Ugh!

mnestor commented 12 years ago

Probably not the best Python code.

Change fileconveyor/arbitrator.py:778 from:

if rule["deletionDelay"] > 0:

to:

if rule["deletionDelay"] is None:
    self.logger.warning("Not going to delete '%s'" % (input_file))
elif rule["deletionDelay"] > 0:
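
In context, the guarded branch would look roughly like this. This is only a sketch: the scheduling and deletion bodies are paraphrased from arbitrator.py and may not match the file verbatim (os and time are already imported at the top of that module).

if rule["deletionDelay"] is None:
    # No fileDeletionDelayAfterSync attribute on the rule: leave the
    # source file alone instead of deleting it immediately.
    self.logger.warning("Not going to delete '%s'" % (input_file))
elif rule["deletionDelay"] > 0:
    # A positive delay: schedule the deletion for later.
    self.lock.acquire()
    self.files_to_delete.append((input_file, time.time() + rule["deletionDelay"]))
    self.lock.release()
else:
    # An explicit delay of 0: delete the source file right away.
    if os.path.exists(input_file):
        os.remove(input_file)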

satheshf12000 commented 12 years ago

@kneedrag Thanks, and yeah, for now that seems to be the only solution.

@mnestor Thanks for the code. I will try it.

chrisivens commented 12 years ago

I've been looking into this, and it looks like every rule's deletion settings are applied to every file. That didn't make sense to me, so I borrowed the code from the filter section to check whether the file matches the rule before applying the delete timeout.

I've made a patch to show you guys what I mean.

diff --git a/fileconveyor/arbitrator.py b/fileconveyor/arbitrator.py
index 2eec1f3..f755cf5 100644
--- a/fileconveyor/arbitrator.py
+++ b/fileconveyor/arbitrator.py
@@ -28,7 +28,7 @@ from persistent_list import *
 from fsmonitor import *
 from filter import *
 from processors.processor import *
-from transporters.transporter import Transporter
+from transporters.transporter import Transporter, ConnectionError
 from daemon_thread_runner import *

@@ -774,15 +774,21 @@ class Arbitrator(threading.Thread):
                 # does not make sense, of course).
                 for rule in self.rules:
                     if event != FSMonitor.DELETED:
-                        if rule["deletionDelay"] > 0:
-                            self.lock.acquire()
-                            self.files_to_delete.append((input_file, time.time() + rule["deletionDelay"]))
-                            self.logger.warning("Scheduled '%s' for deletion in %d seconds, as per the '%s' rule." % (input_file, rule["deletionDelay"], rule["label"]))
-                            self.lock.release()
-                        else:
-                            if os.path.exists(input_file):
-                                os.remove(input_file)
-                            self.logger.warning("Deleted '%s' as per the '%s' rule." % (input_file, rule["label"]))
+                        if input_file.startswith(self.config.sources[rule["source"]]["scan_path"]) and \
+                            (rule["filter"] is None
+                            or
+                            rule["filter"].matches(input_file, file_is_deleted=False)):
+                            if rule["deletionDelay"] > 0:
+                                self.lock.acquire()
+                                self.files_to_delete.append((input_file, time.time() + rule["deletionDelay"]))
+                                self.logger.warning("Scheduled '%s' for deletion in %d seconds, as per the '%s' rule." % (input_file, rule["deletionDelay"], rule["label"]))
+                                self.lock.release()
+                            else:
+                                if os.path.exists(input_file):
+                                    os.remove(input_file)
+                                self.logger.warning("Deleted '%s' as per the '%s' rule." % (input_file, rule["label"]))
+

                 # The file went all the way through the pipeline, so now it's safe
                 # to remove it from the persistent 'files_in_pipeline' list.

And here's the commit in my fork

I expect at some point we also need to make sure a deletion delay is actually set to begin with, as per mnestor's suggestion.

mrfelton commented 12 years ago

I have this fixed in my fork as per mnestor's suggestion: https://github.com/systemseed/fileconveyor/commit/68c40ebaad532d330d7dcd14271ad1cdb3c7007b

chrisivens commented 12 years ago

I had to deliberately break file deletion on my fork. I'm running GlusterFS and I'm not sure whether it's causing some issues with pyinotify or something else in the chain.

Moving to a new box in a week or so, so I'll be checking whether mrfelton's fork works well. I might run a hybrid for a bit, with deletion disabled but good logging, to make sure it's behaving.

It'll be a production box and I can't risk deletions so I'm erring on the side of caution.

Incidentally, is there a good migration strategy for File Conveyor? If the paths are the same I can't see an issue, but I guess the pickling makes it harder to move things.

wimleers commented 11 years ago

This is obviously a super-critical bug that was introduced in #100. My apologies; this should never have happened, of course.

I merged @mrfelton's https://github.com/systemseed/fileconveyor/commit/68c40ebaad532d330d7dcd14271ad1cdb3c7007b, with an updated commit message and an added trailing period. Thanks, @mrfelton!

chrisivens commented 11 years ago

I still have a feeling this needs to check the individual rules for a deletion delay. I'll perhaps re-read the code to see if I'm right, but if memory serves me correctly, a deletion delay setting on any rule seemed to set the value on a file regardless of whether the rule matched.

wimleers commented 11 years ago

Can you expand on what you mean, @chrisivens? It's not entirely clear to me (but perhaps that's because I haven't got the code in my head anymore).

chrisivens commented 11 years ago

Likewise, I don't think I can be as clear as I should be, but I'll try to paint a scenario.

Let's say you have two rules set up: one has a deletion delay set and one has nothing. The system notices a change on a file and checks which actions should be performed. It then checks to see whether deletionDelay has been set. From memory, it doesn't check for a specific rule match against the file; it'll check both of those rules, and if it finds a deletionDelay setting it'll use it.

It might only happen for rules with overlapping paths, but I can't be certain. Either way, I added a check (borrowed from elsewhere in the code) to see whether the file is affected by a rule before setting the delete time.

I could have things totally wrong but I did add that code (above) as a reaction to delete times being added where they shouldn't.

EDIT: looking at the code very quickly: for each rule, it checks the deletion delay setting and applies it if found. There is no check to see whether the rule matches the file. All rules get checked, and presumably the last rule with a deletionDelay value is the one that sticks.
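
Here's a standalone reduction of that behavior; the rules, paths, and dictionary shapes below are invented for illustration and are not the actual arbitrator code:

# Two rules: only 'cdn' sets a deletion delay, and only 'cdn' should
# ever govern files under /var/www/static.
rules = [
    {"label": "cdn", "path": "/var/www/static", "deletionDelay": 3600},
    {"label": "backup", "path": "/var/backups", "deletionDelay": None},
]

input_file = "/var/backups/db.sql"

# Buggy shape: a non-matching rule's delay is applied to the file anyway.
for rule in rules:
    if rule["deletionDelay"] is not None:
        print("BUG: '%s' scheduled for deletion by rule '%s'" % (input_file, rule["label"]))

# Fixed shape: only rules that actually govern the file are considered.
for rule in rules:
    if input_file.startswith(rule["path"]) and rule["deletionDelay"] is not None:
        print("'%s' scheduled for deletion by rule '%s'" % (input_file, rule["label"]))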

wimleers commented 11 years ago

You're right that multiple rules might apply to a single file and thus cause conflicting behavior: one rule assumes the file continues to exist, while the other assumes it gets deleted. The interesting thing then is that the rule without the deletionDelay will sync the deletion to its targets…

Without deletionDelay, multiple rules applying to a single file wasn't an issue, since the source was essentially always untouched. Now it is.

How do you propose we solve this? AFAICT the only clean solution is to document very clearly that once you start using deletionDelay on a rule, you must ensure that no other rule governs the same file.

Thoughts?