wimleers / fileconveyor

File Conveyor is a daemon written in Python to detect, process and sync files. In particular, it's designed to sync files to CDNs. Amazon S3 and Rackspace Cloud Files, as well as any Origin Pull or (S)FTP Push CDN, are supported. Originally written for my bachelor thesis at Hasselt University in Belgium.
https://wimleers.com/fileconveyor
The Unlicense

8-bit bytestrings vs. Unicode strings in Python: fix this once and for all #62

Open haleagar opened 13 years ago

haleagar commented 13 years ago

I've seen issue #25 and pulled the most recent version with the related update, but I'm still getting similar errors, now in arbitrator.py. I believe this is one of the problem files: WoodyHallæ°åç_0.jpg

Python 2.6.5, Ubuntu 10.04.1 LTS

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "arbitrator.py", line 289, in run
    self.process_db_queue()
  File "arbitrator.py", line 634, in process_db_queue
    self.dbcur.execute("SELECT COUNT(*) FROM synced_files WHERE input_file=? AND server=?", (input_file, server))
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

haleagar commented 13 years ago

So I tried a simple fix of adding

    input_file = input_file.decode('utf-8')
    output_file = output_file.decode('utf-8')
    transported_file = transported_file.decode('utf-8')

to arbitrator.py at line 623 (it should probably be done earlier, though). That "fixed" the above error, but not quite my problem, because I'm using Rackspace CloudFiles.
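For reference, the failing query does accept the path once it is a Unicode string. The following is only a minimal sketch: it assumes the paths are UTF-8 encoded byte strings, and the table layout is illustrative (only the query text comes from the traceback above). It is written so that it also runs on Python 3.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative schema; only the column names come from the query in the traceback.
con.execute("CREATE TABLE synced_files (input_file TEXT, server TEXT)")

# A path as it might arrive from the file system: a UTF-8 encoded byte string.
raw_input_file = u"/files/IT\u6295\u8cc72011.jpg".encode("utf-8")

# Decode once, before the value reaches sqlite3 (assumption: the bytes are UTF-8).
input_file = raw_input_file.decode("utf-8")
server = u"cloudfiles"

con.execute("INSERT INTO synced_files VALUES (?, ?)", (input_file, server))
count = con.execute(
    "SELECT COUNT(*) FROM synced_files WHERE input_file=? AND server=?",
    (input_file, server)
).fetchone()[0]
print(count)
```

Decoding this late only protects the database call; anything upstream that still mixes byte strings and Unicode strings can fail in the same way.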

Now I get these errors, which probably means that yes, the decoding should happen earlier, or maybe we have to convert to non-Unicode strings altogether to use those processors and transporters.

2011-06-10 17:08:29,332 - Arbitrator.Transporter - ERROR - The transporter 'mosso' has failed while transporting the file '/tmp/daemon/var/www/webroot/sites/default/files/imagefield_thumbs/featured_img/WoodyHallæ°åç_0_1306282603.jpg' (action: 1). Error: 'u'\u6c0f''.
2011-06-10 17:08:29,534 - Arbitrator.ProcessorChain - ERROR - The processsor 'link_updater.CSSURLUpdater' has failed while processing the file '/var/www/webroot/sites/all/modules/jquery_ui/jquery.ui/tests/unit/testsuite.css'. Exception class: <class 'xml.dom.SyntaxErr'>. Message: CSSValue: No match: ('CHAR', u':', 62, 16).

I googled and found this update: https://github.com/rackspace/python-cloudfiles/pull/29, so I grabbed that version of python-cloudfiles, but the errors above still persist. I've fallen out of my depth, but I hope that's helpful.

haleagar commented 13 years ago

One more error is cropping up now as well, then I'll be quiet.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "arbitrator.py", line 289, in run
    self.process_db_queue()
  File "arbitrator.py", line 709, in process_db_queue
    self.remaining_transporters[key].remove(server)
KeyError: u"/var/www/webroot/sites/default/files/imagecache/detail_featured/featured_img/IT\u6295\u8cc72011.jpg2{'filter': <filter.Filter object at 0x1b1e210>, 'source': 'drupal', 'destinations': {'cloudfiles': {'path': 'static'}}, 'processorChain': ['yui_compressor.YUICompressor', 'link_updater.CSSURLUpdater', 'unique_filename.Mtime'], 'label': 'CSS, JS, images and Flash'}"

EricB1021 commented 13 years ago

Thank you for the fix above. However after those 3 lines of code fixed the original problem, File Conveyor runs for a little while and I get this output. I am not very familiar with Python, but any help would be appreciated, Thanks.

2011-06-20 15:32:50,795 - Arbitrator - WARNING - Arbitrator is initializing.
2011-06-20 15:32:50,797 - Arbitrator - WARNING - Loaded config file.
2011-06-20 15:32:51,103 - Arbitrator - WARNING - Created 'ftp' transporter for the 'ftp push cdn' server.
2011-06-20 15:32:51,103 - Arbitrator - WARNING - Server connection tests succesful!
2011-06-20 15:32:51,104 - Arbitrator - WARNING - Setup: created transporter pool for the 'ftp push cdn' server.
2011-06-20 15:32:51,112 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 15055 items.
2011-06-20 15:32:51,113 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 0 items.
2011-06-20 15:32:51,113 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
2011-06-20 15:32:51,114 - Arbitrator - WARNING - Setup: moved 0 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
2011-06-20 15:32:51,114 - Arbitrator - WARNING - Setup: connected to the synced files DB. Contains metadata for 0 previously synced files.
2011-06-20 15:32:51,151 - Arbitrator - WARNING - Setup: initialized FSMonitor.
2011-06-20 15:32:51,151 - Arbitrator - WARNING - Fully up and running now.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in bootstrap_inner
    self.run()
  File "/etc/file_conveyor/code/fsmonitor_inotify.py", line 107, in run
    self.__process_queues()
  File "/etc/file_conveyor/code/fsmonitor_inotify.py", line 132, in process_queues
    self.add_dir(path, event_mask)
  File "/etc/file_conveyor/code/fsmonitor_inotify.py", line 58, in add_dir
    FSMonitor.generate_missed_events(self, path, event_mask)
  File "/etc/file_conveyor/code/fsmonitor.py", line 121, in generate_missed_events
    for event_path, result in self.pathscanner.scan_tree(path):
  File "/etc/file_conveyor/code/pathscanner.py", line 240, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/etc/file_conveyor/code/pathscanner.py", line 240, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/etc/file_conveyor/code/pathscanner.py", line 240, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/etc/file_conveyor/code/pathscanner.py", line 227, in scan_tree
    result = self.scan(path)
  File "/etc/file_conveyor/code/pathscanner.py", line 191, in scan
    for path, filename, mtime, is_dir in self.__listdir(path):
  File "/etc/file_conveyor/code/pathscanner.py", line 76, in __listdir
    path_to_file = os.path.join(path, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 13: ordinal not in range(128)

andriy-gerasika commented 13 years ago

IMHO true fix should involve dbcon.text_factory = str code, not str.decode('utf-8')

http://www.gerixsoft.com/sites/gerixsoft.com/files/fileconveyor-utf8.patch fixed the problem for me.

wimleers commented 13 years ago

Uploaded the patch you linked to to gist.github.com, in case your site goes offline: https://gist.github.com/1118004.

wimleers commented 13 years ago

The patch posted by andriy-gerasika is definitely interesting. Look at the documentation of pysqlite.Connection.text_factory: http://pysqlite.googlecode.com/svn/doc/sqlite3.html#sqlite3.Connection.text_factory.

However, the log output provided at #25 suggests this is not the right way to solve the problem: ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Clearly, this implies that we should be using proper Unicode strings in Python. Whatever that may be, because that's absolutely not clear. (If anything is messy in Python, it's unicode strings.) I guess a good starting point is http://docs.python.org/howto/unicode.html.

jacobSingh commented 13 years ago

This one just won't die huh? I did a bunch of research on this before, but I really don't remember at which point we should be intercepting this. If I remember correctly, we should be handling it at the point where we harvest the path names, before they go in the DB.

I think I tried dbcon.text_factory = str and it didn't work IIRC. I can't remember why, but I think it had something to do with the point at which it was trying to massage the text. Sorry I can't be more help, I'm kinda foggy on this one.

unn commented 13 years ago

That's what we'd tried Jacob. I don't specifically remember if we'd implemented the dbcon.text_factory = str every where it was needed, but I do remember we tried it and were still seeing similar stack traces as https://github.com/wimleers/fileconveyor/issues/62#issuecomment-1405653

wimleers commented 13 years ago

@jacobSingh: Thanks for weighing in! :) @unn: Thanks for also reporting back on this.

It's clear that additional attention will be needed to solve this once and for all, and no clear solution is available at the moment.

unn commented 13 years ago

I'll email you some of the files that were causing the issues we were seeing.

wimleers commented 13 years ago

I've read through Python's entire Unicode HOWTO (which seems to be authoritative). Especially the Unicode filenames section is interesting. The os.listdir() trick mentioned there just might be all that we need… (This will make the changes introduced in 12f2ddffdf83323f9cda for #25 obsolete; these changes will be undone.)

This is combined with sqlite3.Connection.text_factory = unicode (which is the default, but setting it explicitly should prevent any confusion, assumptions or different defaults in a specific SQLite build).

There's also this daunting post on Stack Overflow, which was also fairly informative.

From effbot.org's "Python Unicode Objects", I discovered the need to use re.UNICODE when using regular expressions.
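To illustrate the re.UNICODE point, here is a small sketch with a made-up file name. On Python 2, `\w` only matches non-ASCII letters in Unicode strings when re.UNICODE is passed; on Python 3 the flag is already the default for text patterns, so passing it is harmless.

```python
import re

# With re.UNICODE, \w also covers letters outside ASCII.
word = re.compile(r"^\w+$", re.UNICODE)

# An explicit ASCII character class never matches accented letters.
ascii_word = re.compile(r"^[a-zA-Z0-9_]+$")

name = u"p\xeache"  # "peche" with a circumflex, as in one of the reported file names

matches_unicode = word.match(name) is not None
matches_ascii = ascii_word.match(name) is not None
print(matches_unicode, matches_ascii)
```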

Further, I changed the Config module: I added Config.__ensure_unicode() and used this for all values parsed from the config.xml file through xml.etree.ElementTree, to ensure that all strings used from the config file are Python Unicode strings (i.e. u'string' instead of 'string'). This was necessary because xml.etree.ElementTree will try to optimize memory consumption by only storing Unicode strings as Python Unicode strings (u'string') if it's impossible to represent it as a regular string (in the system's default encoding). By calling Config.__ensure_unicode() on every string, we can be sure that all strings are in Unicode.
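The real Config.__ensure_unicode() is not shown in this thread, so the following is only a guess at its shape: a helper that decodes byte strings and passes Unicode strings through. The function name `ensure_unicode`, the UTF-8 default and the XML fragment are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def ensure_unicode(value, encoding="utf-8"):
    """Hypothetical stand-in for Config.__ensure_unicode()."""
    if isinstance(value, bytes):
        # Byte string (Python 2 `str`): decode it explicitly.
        return value.decode(encoding)
    # Already a Unicode string (or None): leave it alone.
    return value


# Illustrative config fragment; the element and attribute names are made up.
root = ET.fromstring('<source name="drupal" scanPath="/var/www/files" />')

scan_path = ensure_unicode(root.get("scanPath"))
name = ensure_unicode(b"drupal")
print(repr(scan_path), repr(name))
```

Applying the helper to every value read from the config file means the rest of the program never has to ask which string type it was handed.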

After all, the initial call to os.listdir() will happen with a parameter read directly from the config file. If that string is a Unicode string, then os.listdir() will return Unicode strings, which implies that future calls to os.listdir() (to traverse the directory tree) will also be called with a Unicode string as a parameter. Hence these changes to the Config module were necessary.
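The os.listdir() behaviour described above can be checked with a throwaway directory. This sketch uses an ASCII file name so that it runs regardless of the locale; only the type of the returned names is of interest.

```python
import os
import shutil
import sys
import tempfile

root = tempfile.mkdtemp()
try:
    open(os.path.join(root, "photo.jpg"), "w").close()

    # Unicode string in, Unicode strings out.
    text_names = os.listdir(u"" + root)

    # Byte string in, byte strings out.
    byte_names = os.listdir(root.encode(sys.getfilesystemencoding()))
finally:
    shutil.rmtree(root)

print(text_names, byte_names)
```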

Next, I had to ensure all strings received through FSMonitor (i.e. coming from inotify and FSEvents) were in Unicode. On OS X, this is easy, since the file system always uses UTF-8. On Linux, many different encodings are possible, hence we use sys.getfilesystemencoding() to make sure we decode from the right one.
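Why the choice of encoding matters can be seen by decoding the same bytes two ways. This is a sketch only: `decode_fs_path` is a hypothetical helper, not the function used in File Conveyor, and the path is made up.

```python
import sys


def decode_fs_path(raw_path, encoding=None):
    """Decode a byte-string path; defaults to the file system encoding."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return raw_path.decode(encoding)


raw = b"/files/p\xc3\xaache.jpg"  # UTF-8 bytes for a name with an accented letter

as_utf8 = decode_fs_path(raw, encoding="utf-8")
as_latin1 = decode_fs_path(raw, encoding="latin-1")

# Same bytes, different names: the wrong guess silently yields mojibake.
print(repr(as_utf8), repr(as_latin1))
```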

Because we're now using Unicode strings everywhere in File Conveyor (as it should be), we'll need to encode it to byte strings to be able to use certain functions (like this: u'unicode string'.encode('utf-8')), such as hashlib.md5() in PersistentQueue.__hash_key().
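A sketch of the hashing case: hashlib works on bytes, so a Unicode key has to be encoded first. `hash_key` here is a hypothetical stand-in for PersistentQueue.__hash_key(), whose actual body is not shown in this thread.

```python
import hashlib


def hash_key(key):
    """Hash a Unicode key by encoding it to UTF-8 bytes first."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


digest = hash_key(u"/files/p\xeache.jpg")
print(digest)
```

Always encoding with the same codec keeps the hash stable for a given path, whatever the locale of the machine.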

Finally, I was having problems with PersistentQueue and PersistentList: both of these cPickle.dumps() arbitrary Python data and then store it in a SQLite DB for persistency. I was already loading data correctly using sqlite3.register_converter("pickle", cPickle.loads), but apparently it should be stored in a SQLite BLOB column, not a TEXT column (source). Then, it needs to be inserted with special care, using sqlite3.Binary(). This function is documented nowhere, not even on the official sqlite3 documentation page on docs.python.org!
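The pickle round trip could look roughly like this. It is a sketch under assumptions: the table and column names are made up, `pickle` is used instead of cPickle so that it also runs on Python 3, and the converter is attached through a column alias (PARSE_COLNAMES) so that the column itself can be declared as BLOB.

```python
import pickle
import sqlite3

# Values read back through a column tagged [pickle] are unpickled automatically.
sqlite3.register_converter("pickle", pickle.loads)

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_COLNAMES)
# Illustrative table; a BLOB column stores the pickled bytes untouched.
con.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, item BLOB)")

item = (u"/files/p\xeache.jpg", 1)

# sqlite3.Binary() marks the value as binary data instead of text.
con.execute(
    "INSERT INTO queue (item) VALUES (?)",
    (sqlite3.Binary(pickle.dumps(item, 2)),)
)

restored = con.execute('SELECT item AS "item [pickle]" FROM queue').fetchone()[0]
print(restored == item)
```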

This covers Unicode issues 99% of the way, but there's still the potential problem of not knowing the encoding of the file system of the destination — for that, I just created #75.

Phew. That was not easy! I hope I didn't forget to mention anything.

P.S.: some more interesting functions:

patrickfournier commented 13 years ago

I still get errors (using release d1c55b8):

/usr/lib/python2.6/urllib.py:1222: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  res = map(safe_map.getitem, s)
2011-08-27 02:27:05,165 - Arbitrator.Transporter - ERROR - The transporter 'S3' has failed while transporting the file '/var/www/sites/curiosae/files/RB 47 - Départ pour la pêche des huîtres un jour de grande marée.jpg' (action: 1). Error: 'u'\xe9''.

and later (maybe for a different file, I am not sure):

2011-08-27 02:27:17,199 - Arbitrator - ERROR - Unhandled exception of type 'ProgrammingError' detected, arguments: '('You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.',)'.
Traceback (most recent call last):
  File "arbitrator.py", line 301, in run
    self.process_retry_queue()
  File "arbitrator.py", line 778, in __process_retry_queue
    if (input_file, event) not in self.failed_files and (input_file, event) not in self.pipeline_queue:
  File "/home/ubuntu/wimleers-fileconveyor-d1c55b8/code/persistent_queue.py", line 77, in __contains
    return self.dbcur.execute("SELECT COUNT(item) FROM %s WHERE item=?" % (self.table), (cPickle.dumps(item), )).fetchone()[0]
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

wimleers commented 13 years ago

Did you start from scratch with File Conveyor or did you upgrade from a previous File Conveyor installation? In the latter case, did you run the upgrade script?

ykyuen commented 13 years ago

I am using release d1c55b8. I want to use File Conveyor to sync my Drupal site to Rackspace Cloud Files.

The server is Ubuntu 10.04 with Python 2.6.5. I got the following error, which is the same as EricB1021's:


Exception in thread FSMonitorThread:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in bootstrap_inner
    self.run()
  File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 122, in run
    self.process_queues()
  File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 155, in process_queues
    self.__add_dir(path, event_mask)
  File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 91, in add_dir
    FSMonitor.generate_missed_events(self, path)
  File "/home/halo/fileconveyor/code/fsmonitor.py", line 128, in generate_missed_events
    for event_path, result in self.pathscanner.scan_tree(path):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 226, in scan_tree
    result = self.scan(path)
  File "/home/halo/fileconveyor/code/pathscanner.py", line 190, in scan
    for path, filename, mtime, is_dir in self.listdir(path):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 77, in listdir
    path_to_file = os.path.join(path, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)


Then I tried with Python 2.5, but no luck:


2011-09-16 04:06:58,628 - Arbitrator - WARNING - Fully up and running now.
Exception in thread FSMonitorThread:
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in bootstrap_inner
    self.run()
  File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 122, in run
    self.process_queues()
  File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 155, in process_queues
    self.__add_dir(path, event_mask)
  File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 91, in add_dir
    FSMonitor.generate_missed_events(self, path)
  File "/home/halo/fileconveyor/code/fsmonitor.py", line 128, in generate_missed_events
    for event_path, result in self.pathscanner.scan_tree(path):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
    for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 226, in scan_tree
    result = self.scan(path)
  File "/home/halo/fileconveyor/code/pathscanner.py", line 190, in scan
    for path, filename, mtime, is_dir in self.listdir(path):
  File "/home/halo/fileconveyor/code/pathscanner.py", line 77, in listdir
    path_to_file = os.path.join(path, filename)
  File "/usr/lib/python2.5/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)


Then I replaced dependencies/cloudfiles with the latest Rackspace cloudfiles (https://github.com/rackspace/python-cloudfiles), as suggested at https://github.com/wimleers/fileconveyor/issues/75. Still no luck.

I checked the following in the Python console:

Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'ANSI_X3.4-1968'
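The encoding name reported above, ANSI_X3.4-1968, is an alias for plain ASCII, which is consistent with the 'ascii' codec error in the tracebacks. A quick check, with a made-up file name:

```python
import codecs

reported = "ANSI_X3.4-1968"

# Resolve the alias to Python's canonical codec name.
canonical = codecs.lookup(reported).name

# A non-ASCII file name cannot be represented in that encoding.
try:
    u"caf\xe9.jpg".encode(reported)
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(canonical, encodable)
```

This usually points at the process running without a UTF-8 locale rather than at the file system itself.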

Any way to solve the problem? Thanks.

j0rd commented 12 years ago

Same problem using Amazon CloudFront.

ykyuen commented 12 years ago

I found that the problem is related to the uploaded file name. For example, if I have some CCK images with spaces in their file names, those files cannot be synced, and I notice that the spaces are converted into %20, as shown in the Drupal CDN module statistics.

Anyway, I tried to solve the problem by converting the string to UTF-8. Everything seems to work fine, but there are still some errors in daemon.log, so the issue is probably not completely resolved in the right way.

I forked the code and committed my changes. See if it works for you: https://github.com/ykyuen/fileconveyor

checkerap commented 12 years ago

Thanks ykyuen,

Your fork worked perfectly for me.

wimleers commented 12 years ago

The (hopefully) last Unicode problem has been fixed at #90.

ykyuen commented 12 years ago

thanks @wimleers =)

wimleers commented 12 years ago

@ykyuen Please let me know if you have any more suggestions or bugs :)

woutrbe commented 11 years ago

I'm actually still getting the same error

Exception in thread FSMonitorThread:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/src/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 122, in run
    self.__process_queues()
  File "/src/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 155, in __process_queues
    self.__add_dir(path, event_mask)
  File "/src/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 74, in __add_dir
    wdd = self.wm.add_watch(path, event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
  File "/usr/local/lib/python2.6/dist-packages/pyinotify.py", line 1853, in add_watch
    for rpath in self.__walk_rec(apath, rec):
  File "/usr/local/lib/python2.6/dist-packages/pyinotify.py", line 2041, in __walk_rec
    for root, dirs, files in os.walk(top):
  File "/usr/lib/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib/python2.6/os.py", line 284, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

wimleers commented 11 years ago

@woutrbe Which version of File Conveyor? Which particular file name triggers this error?

woutrbe commented 11 years ago

Sorry about the late reply, I'm using the latest version here on github, installed with

pip install -e git+https://github.com/wimleers/fileconveyor@master#egg=fileconveyor

I tried outputting the file name, but it seems it doesn't even get to that point. It seems that the error is quite similar to what others have experienced.

wimleers commented 11 years ago

Can you enable DEBUG logging?

woutrbe commented 11 years ago

It's not giving that much more information when I enable debug logging for both CONSOLE_LOGGER_LEVEL and FILE_LOGGER_LEVEL. (http://pastebin.com/zmCiQ0RN)

I've printed the path in pathscanner.py, but that doesn't seem to be outputting anything.

wimleers commented 11 years ago

Wow. The error you're getting doesn't occur in File Conveyor; it occurs in Python's internals!

  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)
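What happens inside that `path += '/' + b` line can be reproduced in isolation. On Python 2, concatenating a Unicode string with a byte string triggers an implicit ASCII decode of the bytes; the sketch below spells that decode out explicitly so that it behaves the same on Python 3 (which refuses to mix the two types at all). The directory and file name are made up.

```python
def python2_style_join(text_part, byte_part):
    """Mimic the implicit ASCII decode Python 2 applied when mixing types."""
    return text_part + u"/" + byte_part.decode("ascii")


directory = u"/var/www/files"  # Unicode string, e.g. from the config file
filename = b"caf\xc3\xa9.jpg"  # byte string, e.g. from a bytes-based listing

try:
    python2_style_join(directory, filename)
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding explicitly with the right codec avoids the error (assumes UTF-8 names).
joined = directory + u"/" + filename.decode("utf-8")
print(failed, repr(joined))
```

So the exception is raised inside posixpath, but the cause is a Unicode path and a byte-string name meeting further up the call chain.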

I'm afraid there's not much I can do there then. Some googling led me to these things:

import locale
locale.setlocale( locale.LC_ALL, 'C.UTF-8' )

fails


As per the latter link, I'm convinced this is the solution:

$ git d
 fileconveyor/fsmonitor_inotify.py |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fileconveyor/fsmonitor_inotify.py b/fileconveyor/fsmonitor_inotify.py
index 04a46ed..938c3f5 100644
--- a/fileconveyor/fsmonitor_inotify.py
+++ b/fileconveyor/fsmonitor_inotify.py
@@ -28,6 +28,11 @@ class FSMonitorInotify(FSMonitor):
     """inotify support for FSMonitor"""

+    # On Linux, you can choose which encoding is used for your file system's
+    # file names. Hence, whenever we interact with pyinotify, we must ensure
+    # that the paths we pass it are encoded in the file system's encoding.
+    encoding = sys.getfilesystemencoding()
+
     EVENTMAPPING = {
         FSMonitor.CREATED             : pyinotify.IN_CREATE,
         FSMonitor.MODIFIED            : pyinotify.IN_MODIFY | pyinotify.IN_ATTRIB,
@@ -71,7 +76,7 @@ class FSMonitorInotify(FSMonitor):
         # Immediately start monitoring this directory.
         event_mask_inotify = self.__fsmonitor_event_to_inotify_event(event_mask)
         try:
-            wdd = self.wm.add_watch(path, event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
+            wdd = self.wm.add_watch(path.encode(cls.encoding), event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
         except WatchManagerError, e:
             raise FSMonitorError, "Could not monitor '%s', reason: %s" % (path, e)
         # Verify that inotify is able to monitor this directory and all of its
@@ -79,7 +84,7 @@ class FSMonitorInotify(FSMonitor):
         for monitored_path in wdd:
             if wdd[monitored_path] < 0:
                 code = wdd[monitored_path]
-                raise FSMonitorError, "Could not monitor %s (%d)" % (monitored_path, code)
+                raise FSMonitorError, "Could not monitor %s (%d)" % (monitored_path.decode(cls.encoding), code)
         self.monitored_paths[path] = MonitoredPath(path, event_mask, wdd)
         self.monitored_paths[path].monitoring = True

@@ -100,7 +105,7 @@ class FSMonitorInotify(FSMonitor):
     def __remove_dir(self, path):
         """override of FSMonitor.__remove_dir()"""
         if path in self.monitored_paths.keys():
-            self.wm.rm_watch(path, rec=True, quiet=True)
+            self.wm.rm_watch(path.encode(cls.encoding), rec=True, quiet=True)
             del self.monitored_paths[path]

Could you please try that?


If that doesn't work, can you do this on your system and report back your output (mine is inline):

python2.5
>>> import locale
>>> locale.getlocale()
(None, None)
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'
>>> sys.getdefaultencoding()
'ascii'

In this case I think that the solution might be to do this in FSMonitor.__init__(), and possibly also in Arbitrator.__init__():

sys.setdefaultencoding('utf-8')

MaffooBristol commented 11 years ago

Hi, I know this is an old post, but I'm still having this issue... I followed your steps in the post above but it seems to just throw up this error:

Exception in thread ArbitratorThread:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "arbitrator.py", line 286, in run
    self.__setup()
  File "arbitrator.py", line 271, in __setup
    self.fsmonitor = fsmonitor_class(self.fsmonitor_callback, True, True, self.config.ignored_dirs.split(":"), "fsmonitor.db", "Arbitrator")
  File "/opt/fileconveyor/fileconveyor/code/fsmonitor_inotify.py", line 43, in __init__
    sys.setdefaultencoding('utf-8')
AttributeError: 'module' object has no attribute 'setdefaultencoding'

wimleers commented 11 years ago

A quick googling reveals that it's essentially evil to call sys.setdefaultencoding(), so on some systems/builds, that function has been removed. What a mess, Python!
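For context: on Python 2, the standard site module deletes sys.setdefaultencoding during startup, which is why the call above raises AttributeError in a normal interpreter; Python 3 does not have the function at all. A sketch of checking for it and decoding explicitly instead (the sample bytes are made up):

```python
import sys


def default_encoding_is_settable():
    """True only if this interpreter still exposes sys.setdefaultencoding."""
    return hasattr(sys, "setdefaultencoding")


settable = default_encoding_is_settable()

# Rather than changing a global default, decode at the boundary with a known codec.
decoded = b"caf\xc3\xa9".decode("utf-8")
print(settable, repr(decoded))
```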

lfourcade commented 11 years ago

Hi @wimleers ,

Thank you very much for this fantastic tool. Unfortunately, I still have a problem after applying your changes to fsmonitor_inotify.py.

Here's my output, hope you can help. Thank you.

/var/fileconveyor/fileconveyor/filter.py:10: DeprecationWarning: the sets module is deprecated
  from sets import Set, ImmutableSet
2013-07-31 16:24:48,836 - Arbitrator - WARNING - File Conveyor is initializing.
2013-07-31 16:24:48,836 - Arbitrator - INFO - Loading config file.
2013-07-31 16:24:48,838 - Arbitrator.Config - INFO - Parsing sources.
2013-07-31 16:24:48,839 - Arbitrator.Config - INFO - Parsing servers.
2013-07-31 16:24:48,839 - Arbitrator.Config - INFO - Parsing rules.
2013-07-31 16:24:48,840 - Arbitrator - WARNING - Loaded config file.
2013-07-31 10:24:49,727 - Arbitrator - WARNING - Created 'cumulus' transporter for the 'rackspace' server.
2013-07-31 10:24:49,727 - Arbitrator - WARNING - Server connection tests succesful!
2013-07-31 10:24:49,728 - Arbitrator - WARNING - Setup: created transporter pool for the 'rackspace' server.
2013-07-31 10:24:49,729 - Arbitrator - INFO - Setup: collected all metadata for rule 'cdn' (source: 'rackspace').
2013-07-31 10:24:49,730 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 0 items.
2013-07-31 10:24:49,731 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 0 items.
2013-07-31 10:24:49,731 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
2013-07-31 10:24:49,732 - Arbitrator - WARNING - Setup: initialized 'files_to_delete' persistent list, contains 0 items.
2013-07-31 10:24:49,733 - Arbitrator - WARNING - Setup: moved 0 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
2013-07-31 10:24:49,733 - Arbitrator - WARNING - Setup: connected to the synced files DB. Contains metadata for 0 previously synced files.
2013-07-31 10:24:49,913 - Arbitrator.FSMonitor - INFO - FSMonitor class used: FSMonitorInotify.
2013-07-31 10:24:49,914 - Arbitrator - WARNING - Setup: initialized FSMonitor.
2013-07-31 10:24:49,914 - Arbitrator - INFO - Setup: monitoring '/var/www/vhosts/packshot-creator.com/httpdocs/' (rackspace).
2013-07-31 10:24:49,915 - Arbitrator - INFO - Cleaned up the working directory '/tmp/fileconveyor'.
2013-07-31 10:24:49,915 - Arbitrator - WARNING - Fully up and running now.
Exception in thread FSMonitorThread:
Traceback (most recent call last):
  File "/usr/local/lib/python2.6/threading.py", line 532, in bootstrap_inner
    self.run()
  File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 122, in run
    self.process_queues()
  File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 155, in process_queues
    self.__add_dir(path, event_mask)
  File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 74, in add_dir
    wdd = self.wm.add_watch(path.encode(cls.encoding), event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
NameError: global name 'cls' is not defined

wimleers commented 11 years ago

d'oh, the mention of cls on line 74 of fsmonitor_inotify.py should be replaced by FSMonitorInotify. Then all should be well. Small mistake :(

Can you try that?
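The NameError comes from using `cls` inside a regular instance method, where that name is not defined; a class attribute has to be reached through the class name (or through `self`). A stripped-down sketch, with a hypothetical class and method names standing in for FSMonitorInotify:

```python
import sys


class FSMonitorInotifySketch(object):
    """Illustrative stand-in; not the real FSMonitorInotify class."""

    # Class attribute, as in the patch above.
    encoding = sys.getfilesystemencoding()

    def encode_path(self, path):
        # `cls` only exists inside a @classmethod, so name the class explicitly.
        return path.encode(FSMonitorInotifySketch.encoding)

    def decode_path(self, raw_path):
        # Reaching the attribute through `self` works as well.
        return raw_path.decode(self.encoding)


monitor = FSMonitorInotifySketch()
encoded = monitor.encode_path(u"/var/www/files")
decoded = monitor.decode_path(encoded)
print(repr(encoded), repr(decoded))
```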

Trozz commented 11 years ago

FYI

Exception in thread FSMonitorThread:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in bootstrap_inner
    self.run()
  File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 122, in run
    self.process_queues()
  File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 155, in process_queues
    self.__add_dir(path, event_mask)
  File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 74, in add_dir
    wdd = self.wm.add_watch(path, event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
  File "/usr/lib/python2.6/site-packages/pyinotify.py", line 1742, in add_watch
    for rpath in self.__walk_rec(apath, rec):
  File "/usr/lib/python2.6/site-packages/pyinotify.py", line 1929, in __walk_rec
    for root, dirs, files in os.walk(top):
  File "/usr/lib64/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib64/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib64/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib64/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib64/python2.6/os.py", line 294, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/lib64/python2.6/os.py", line 284, in walk
    if isdir(join(top, name)):
  File "/usr/lib64/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 2: ordinal not in range(128)

wimleers commented 11 years ago

Sigh :(

I won't have time any time soon to dive deeper into this. Sorry.

insparrow commented 9 years ago

I haven't been able to resolve the UnicodeDecodeError issue. However, a few grey hairs later I was able to come up with a good enough workaround (for my use case) which has enabled me to use File Conveyor with Rackspace Cloud Files.

For my server setup (Ubuntu 10.04, Python 2.6.5 and latest File Conveyor), I encountered two UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 2: ordinal not in range(128) issues.

First issue: The daemon would throw an exception before it attempted to transfer any files. If you encounter the same problem, check your server's locale settings with locale. I wish I had checked this first! When I ran locale, the following errors were reported:

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory

Adding export LC_ALL="en_US.UTF-8" to the server's .bashrc file resolved the issues reported by locale and File Conveyor would then (in part) work.
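The same locale check can be made from inside Python, which shows what the daemon process itself sees (this may differ from an interactive shell if .bashrc is not read when the daemon starts). A small sketch; the function name and the dictionary keys are made up:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that affect file name handling."""
    return {
        "preferred": locale.getpreferredencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

If the file system encoding comes back as ASCII (or ANSI_X3.4-1968), non-ASCII file names are likely to fail in the way described in this issue.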

Second issue: The daemon would throw an exception when it attempted to transfer a filename which contained special characters. I tried in vain to fix this but I was unsuccessful (until today I hadn't written a single line of python code).

Creating a solution that worked with Rackspace Cloud Files was non-negotiable, so I set out to create an acceptable workaround for a Drupal 7 site that has over 50GB of images. I patched arbitrator.py to skip files that I knew would cause the daemon to throw an exception:

diff --git a/fileconveyor/arbitrator.py b/fileconveyor/arbitrator.py
index 394b4b4..6fa3e5a 100644
--- a/fileconveyor/arbitrator.py
+++ b/fileconveyor/arbitrator.py
@@ -347,6 +347,7 @@ class Arbitrator(threading.Thread):
         while self.discover_queue.qsize() > 0:

             # Discover queue -> pipeline queue.
             (input_file, event) = self.discover_queue.get()
             item = self.pipeline_queue.get_item_for_key(key=input_file)
             # If the file does not yet exist in the pipeline queue, put() it.
@@ -400,6 +401,16 @@ class Arbitrator(threading.Thread):
             (input_file, event) = self.filter_queue.get()
             self.lock.release()

+            # Skip filenames which we know will not work with File Convyeor or the CDN module.
+            import re
+            path, filename = os.path.split(input_file)
+            regexp = re.compile(r'^[a-zA-Z0-9_ .-]+$')
+            if regexp.search(filename) is None:
+                import codecs
+                output_file = codecs.open('skipped_files.txt', 'a', 'utf8')
+                output_file.write(input_file + '\n')
+                continue
+
             # The file may have already been deleted, e.g. when the file was
             # moved from the pipeline list into the pipeline queue after the
             # application was interrupted. When that's the case, drop the

With this patch in, the daemon doesn't throw an exception and will transfer the files that it is able to transfer. The patch also logs the problematic files to skipped_files.txt which gives me / the client a list to fix. For a Drupal 7 site, the transliteration module gives you the option to bulk rename files. I have installed this onto the site I'm working with to take care of new files. The bulk rename function doesn't work for me but I believe that's an isolated issue related to the images stored in complex parent and child field collections.
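The filename check from the patch can also be tried outside the daemon, for example to list in advance which files would be skipped. This is a sketch that reuses the regular expression from the patch; the function names and sample paths are made up.

```python
import os
import re

# Same character whitelist as in the patch above.
SAFE_FILENAME = re.compile(r"^[a-zA-Z0-9_ .-]+$")


def is_safe_filename(input_file):
    """True if the file name (not the directory part) uses only whitelisted characters."""
    filename = os.path.basename(input_file)
    return SAFE_FILENAME.search(filename) is not None


def partition(paths):
    """Split paths into (kept, skipped) according to is_safe_filename()."""
    kept, skipped = [], []
    for path in paths:
        (kept if is_safe_filename(path) else skipped).append(path)
    return kept, skipped


kept, skipped = partition([u"/files/photo 1.jpg", u"/files/p\xeache.jpg"])
print(kept, skipped)
```

Note that only the file name is checked, as in the patch, so a non-ASCII directory name would still get through.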