williamperkinsfrance / gmail-backup-com

Automatically exported from code.google.com/p/gmail-backup-com
GNU General Public License v3.0

Error when email subject length is more than 255 chars on Windows. #16

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Back up an email whose subject is longer than 255 characters.
2. Run the backup on Windows.

According to:
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx#maximum%5Fpath%5Flength

Fix, in gmb.py:
def store(self, msg):
    ...
    # The \\?\ prefix lets Windows accept paths longer than MAX_PATH.
    full_fn_num = '\\\\?\\' + os.path.abspath(full_fn_num)

    fw = file(full_fn_num, 'wb')

    ...
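
A minimal sketch of the same idea as a standalone helper, assuming the caller already has an absolute Windows path. The name `to_extended_path` is hypothetical and not part of gmb.py. The path is normalized first because Windows does not normalize paths that carry the `\\?\` prefix, and UNC paths need the `\\?\UNC\` form instead:

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"   # the literal characters \\?\
UNC_PREFIX = "\\\\"           # the literal characters \\


def to_extended_path(abs_path):
    """Return abs_path in Windows extended-length form (hypothetical helper).

    abs_path is assumed to be an absolute Windows path already.
    """
    if abs_path.startswith(EXTENDED_PREFIX):
        # Already prefixed, leave it alone.
        return abs_path
    # Collapse "..", "." and forward slashes before adding the prefix.
    normalized = ntpath.normpath(abs_path)
    if normalized.startswith(UNC_PREFIX):
        # \\server\share\x becomes \\?\UNC\server\share\x
        return EXTENDED_PREFIX + "UNC\\" + normalized[2:]
    return EXTENDED_PREFIX + normalized
```

Other programs may still be unable to open files stored this way, so this only addresses the write inside the backup tool itself.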

Original issue reported on code.google.com by adriano....@gmail.com on 30 Apr 2013 at 2:52

GoogleCodeExporter commented 8 years ago
Another solution is to shorten all file names to a certain 'reasonable' limit, 
e.g. 128 characters. The full, correct subject of the mail is still saved 
inside the file, so no information is lost by doing that.

The advantage of shortening the file name to something reasonable like 128 
characters is also that the backed up files will then be easier to handle by 
other programs. Even if Gmail Backup could write the files with long file names 
(full path > 255 characters), other file handling software might not be able to 
handle those files, e.g. if you want to move them, or open them in another 
program.

Also, even Gmail Backup itself might have problems reading files with long file 
names when it needs to restore those mails, which is an important part of its 
functionality. So if Gmail Backup is changed so it can write long file names it 
must also be changed so it can read those files. Right now, it would probably 
fail at that. But if the file names are shortened before writing, and the 
shortened file names are stored correctly in the index files "ids.txt" and 
"labels.txt", I am pretty sure Gmail Backup can also restore those mails 
without further change in the software.
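
As a rough sketch of the shortening idea (not the program's actual code), the function below truncates only the last component of a relative name such as "2011/06/...", leaving the year and month directories intact. The name `shorten_relative_name` and the default of 128 characters are assumptions taken from the suggestion above:

```python
import posixpath


def shorten_relative_name(rel_name, max_len=128):
    """Truncate only the final component of rel_name to max_len characters.

    rel_name is assumed to use forward slashes, as in "2011/06/<name>".
    """
    directory, base = posixpath.split(rel_name)
    return posixpath.join(directory, base[:max_len])
```

The shortened name is what would have to be written to ids.txt and labels.txt for a later restore to find the file.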

Original comment by jesper.h...@gmail.com on 21 May 2013 at 3:17

GoogleCodeExporter commented 8 years ago
I can see in gmb.py, class DirectoryStorage, that all file names are "cleaned" 
with a call to _cleanFilename(), and that this function actually truncates 
*part of* (but only a part of) the file path to a maximum of 240 characters:

    def _cleanFilename(self, fn):
        '''Cleans the filename - removes diacritics and other filesystem special characters
        '''
        ...
        ret = ret[:240]
        return ret

The part that is shortened to 240 characters is this:
"2011/06/20110603-084202-breakingnews@mail_cnn_com-CNN_Breaking_News"

That is, it includes two directories too (year and month), but not the backup 
folder (the path leading up to those two directories, for instance 
"D:\Backup\Gmail Backup\"), and not the final part "-1.eml", where "1" could 
potentially be a larger number but most often is just a one digit number.

As "-1.eml" takes 6 characters there are only 255-240-6 = 9 characters left for 
the backup folder path. 

So, according to these calculations: 

1) If the backup folder path is bigger than 9 characters, Gmail Backup fails 
when there is a long subject in a mail. 
2) If the backup folder path is 9 characters or less, Gmail Backup never fails, 
even when there is an extremely long subject in a mail. 

A backup folder path like "D:\Backup" (9 characters) would do fine, but any 
path longer than that will fail for mails with long subject lines.

So, the problem seems to be that only *part* of the file path is being 
truncated and that the backup folder is not included in this truncation.
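
The arithmetic above can be written down as a small sketch. The constants are the numbers used in this comment; the function names are made up for illustration. Like the calculation above, it does not count the separator between the backup folder and the cleaned name:

```python
MAX_PATH_LEN = 255       # limit used in the calculation above
CLEANED_NAME_LEN = 240   # truncation applied by _cleanFilename()
SUFFIX = "-1.eml"        # typical suffix added after the cleaned name


def root_budget(max_path=MAX_PATH_LEN, cleaned=CLEANED_NAME_LEN, suffix=SUFFIX):
    """Characters left for the backup folder path in the worst case."""
    return max_path - cleaned - len(suffix)


def may_fail_on_long_subject(backup_root):
    """True if a maximally long cleaned name would exceed the limit."""
    return len(backup_root) > root_budget()
```

With these numbers `root_budget()` is 9, so `may_fail_on_long_subject("D:\\Backup")` is False while any longer folder path gives True.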

Another problem is that even if Gmail Backup fails to write to a file (for 
instance because the path is too long), the email is still being marked as 
having been backed up in ids.txt:

        self.message_iid2fn[msg_iid] = msg_fn_num
        self.message_fn2iid[msg_fn_num] = msg_iid
        fw = file(full_fn_num, 'wb')
        try:
            fw.write(msg)
        finally:
            fw.close()

This should have read:

        fw = file(full_fn_num, 'wb')
        try:
            fw.write(msg)
        finally:
            fw.close()
        self.message_iid2fn[msg_iid] = msg_fn_num
        self.message_fn2iid[msg_fn_num] = msg_iid

so that the last two lines are only executed if the file was written 
successfully.
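
A self-contained sketch of that ordering, assuming the two index dictionaries are passed in explicitly (in gmb.py they are attributes of the storage object; the function name here is hypothetical). If opening or writing the file raises, the function exits before the index is touched:

```python
def store_message(iid2fn, fn2iid, msg_iid, msg_fn_num, full_fn_num, msg):
    """Write msg (bytes) to full_fn_num, then register it in both indexes.

    Any exception from open() or write() propagates to the caller, and
    in that case neither dictionary is modified.
    """
    with open(full_fn_num, "wb") as fw:
        fw.write(msg)
    # Reached only if the file was written and closed without error.
    iid2fn[msg_iid] = msg_fn_num
    fn2iid[msg_fn_num] = msg_iid
```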

Because of this extra bug, even if the original bug is solved, Gmail Backup 
will think that the mails that previously had a too long path to be backed up 
*have* already been backed up - even though they haven't. The simple way to 
solve this is to delete the existing backup (or move it to a safe place 
temporarily and delete it later) and make a new backup from scratch - after 
these two bugs have been fixed, that is.

Original comment by jesper.h...@gmail.com on 21 May 2013 at 5:13

GoogleCodeExporter commented 8 years ago
Actually, I personally have 411 mails that have not been backed up but occur 
in the ids.txt file. I thought it was just a few, but it wasn't.

I ran this Python 2.7 script, placed in the root of the backup folder, to check 
which files actually exist:

import os
import os.path

def fileDoesNotExist(x):
    return not os.path.isfile(x)

# Get the lines.
idsList = file("ids.txt", "r").readlines()
# Get the file names.
filenameList = map(lambda x: x.split("\t")[0], idsList)
# Get the file names of non-existing files.
nonExistingFilenameList = filter(fileDoesNotExist, filenameList)
# Print the number of non-existing files.
print len(nonExistingFilenameList)
outfilename = "nonexisting.txt"
file(outfilename, "w").write("\n".join(nonExistingFilenameList))
os.startfile(outfilename)

and it gave me a list of 411 files that the "database" file ids.txt claims 
exist, but that don't. That means that 411/77205 * 100 % = 0.5 % of all my 
mails have not been backed up because of this bug. Note that I am not saying 
this as a criticism of the programmers (after all, they gave the software away 
for free and it did back up 99.5 % of my mails) but just as a fact and as 
information for others to be aware of. 

Of course I will return if I can find a good solution to the problem. 

Original comment by jesper.h...@gmail.com on 21 May 2013 at 11:13

GoogleCodeExporter commented 8 years ago
Hi

Same problem with me for some time. I back up the e-mails to a local directory 
and copy it to a server, where the problem occurs.

A small update regarding this as described from other users would be very much 
appreciated ... ;-))

Very best regards,

Markus

Original comment by Markus.G...@gmail.com on 1 Jul 2013 at 12:22

GoogleCodeExporter commented 8 years ago
Okay, I've had a look at the code and I am able to run it with some simple 
bugfixes that address both this problem and Issue 17 
(https://code.google.com/p/gmail-backup-com/issues/detail?id=17). I am testing 
it right now on my own Gmail account. It will run for many hours as I am 
creating a whole new backup. (As I mentioned earlier, it is necessary to create 
a new backup from scratch when using the new version of the program.) Until now 
the test looks promising with 4000 mails backed up without a single failure.

It also seems like I am able to compile the program and create a setup program 
for Windows. 

So if my test turns out successfully, I plan to (somehow) release a setup 
program for Windows with the bugfixed version. I also plan to release a diff 
file to show exactly what I changed in the source code.

Original comment by jesper.h...@gmail.com on 2 Jul 2013 at 2:17

GoogleCodeExporter commented 8 years ago
Thank you so much! Hope it all worked fine and you'll be able to provide a new 
setup routine.

Original comment by Markus.G...@gmail.com on 2 Jul 2013 at 5:58

GoogleCodeExporter commented 8 years ago
Well, I tested it with around 67,000 emails now. 

And it does look better: It saved 67,061 emails as .eml files, and all the 
files have different content (I tested that), but in the database file ids.txt 
only 67,034 emails were registered, which is a bit strange given that no errors 
were reported. So, 27 emails out of 67,061 emails were not registered in the 
database; that is 0.04%. So 99.96% of the emails were backed up and registered 
correctly. That is to be compared with the 99.5% of the previous version.

And all the emails in the ids.txt database are actually present as files, which 
is a big improvement. :o) 

So while the program still does have bugs, it backs up more emails in "my" 
version of it than in the previous version.

If anyone wants to try "my" new version, be sure to start an entirely new 
backup with the new version. This is because it will cut the subject lines 
shorter (140 characters shorter for the longest subject lines) and therefore 
the old database files and old .eml files do not match with this version. (This 
version might though be able to restore mails backed up by an earlier version 
of the program. It just can't continue a backup made by an earlier version).

As it seems like the original authors are not listening in here, I am supplying 
a new Windows installer and a diff to show what I changed. They are both 
attached to this comment. I probably ought to fork the new version as a new 
project, but as I haven't done that before I will see if I can find the energy 
for that later. I guess the most important thing is to just get a (more) 
working version out there. Then we can see about forking into a new project if 
the original authors really don't want to continue this project. Otherwise they 
can simply implement my changes into their own project.

I have given "my" version the revision number 10000 to distinguish it clearly 
from the proper versioning. I would have preferred to call it "20a", but the 
program crashes if the revision is not an integer :o).

I am also attaching the source code.

Original comment by jesper.h...@gmail.com on 3 Jul 2013 at 12:41

Attachments:

GoogleCodeExporter commented 8 years ago
saloveju20@gmail.com

Original comment by 222sa...@gmail.com on 29 Sep 2013 at 5:07

GoogleCodeExporter commented 8 years ago
@jesper.h...@gmail.com:

Thank you so much! It's quite strange: first I installed your version, but at 
the end I had the same problem with file length while copying. Later I did the 
same again and then it worked fine!

Thank you very much!

Original comment by Markus.G...@gmail.com on 30 Sep 2013 at 6:55

GoogleCodeExporter commented 8 years ago
You're very welcome! 

Maybe the first time you tried my version, you didn't do a clean new backup to 
an empty folder or delete the message files created by the previous version 
first? Because in that case, there would still be old files with too long file 
names lying around, as my version isn't as clever as to try to rename the 
existing old long file names to the new, shorter ones.

It would be quite cool if the program was that clever, though. Because then the 
maximum file name lengths could be adjusted at any time as a user setting, and 
the program could then automagically adapt the file name lengths of the 
existing files to the new maximum length. I'll put that on my list of Potential 
Cool Improvements. But the most important thing for me is to make sure that 
there are no major bugs in the source code, i.e. that the created backup is 
both correct and complete.

I have been working on the source code for some time in the summer to try to 
get an overview over it and clean it up a bit (called "refactoring"). During 
that process I also found some very small bugs and fixed them. None of those 
bugs had anything to do with the core backup or restore process, though.

I have been thinking of releasing the most recent version of the source code 
after my refactoring with the minor bugs fixed. I would just like to test it 
thoroughly first to prove (or rather make it probable) that I did not introduce 
any new bugs by refactoring the source code (I don't think I did, but I would 
like to prove it to a reasonable extent). I am still thinking about a good way 
to do that testing, as there are some challenges connected to that. 

Original comment by jesper.h...@gmail.com on 1 Oct 2013 at 11:30

GoogleCodeExporter commented 8 years ago
Sounds pretty good!

Unfortunately I can't remember exactly what I did before starting with your 
version, but since you advised to start with a new backup I assume I did 
accordingly, but am not completely sure ...

Anyway it's working perfectly for me - that only counts ... ;-)

Very best regards,

Markus

Original comment by Markus.G...@gmail.com on 1 Oct 2013 at 9:51

GoogleCodeExporter commented 8 years ago
I am very happy to hear that, Markus!

Very best regards to you too,

Jesper

Original comment by jesper.h...@gmail.com on 1 Oct 2013 at 11:07

GoogleCodeExporter commented 8 years ago
i so carzy  to seek my friends

Original comment by 222sa...@gmail.com on 15 Oct 2013 at 5:30

GoogleCodeExporter commented 8 years ago
Hi Jesper 
Could your 27 missed (unregistered) messages by chance have the 140 characters 
of the name identical to those of some other message(s) after the shortening?
I am not yet trained enough to see if you have addressed that possibility in 
your re-naming procedure.
- Just an idea!

Original comment by ener...@gmail.com on 19 Nov 2013 at 8:25

GoogleCodeExporter commented 8 years ago
Hi ener...,

Thank you for your suggestion! I don't think that is the case, though, because 
all filenames get a dash and a number added ("-1", "-2", etc.) *after* the 
shortening of the name to make sure the resulting file name is unique. The 
program simply increments the number by one again and again until the resulting 
file name is unique. And it is the resulting unique file name that is put in 
the ids.txt file.
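
The numbering described above could look roughly like this. It is a sketch, not the code from gmb.py: the function name is invented, and the set `taken` stands in for whatever the program checks to see whether a name is already in use:

```python
def unique_name(base, taken, ext=".eml"):
    """Return base-N + ext for the smallest N >= 1 not present in taken.

    base is the already shortened name; taken is a set of names in use.
    """
    n = 1
    while True:
        candidate = "%s-%d%s" % (base, n, ext)
        if candidate not in taken:
            return candidate
        n += 1
```

Because the number is appended after the shortening, two mails whose shortened names are identical still end up in different files.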

I am more inclined to believe that maybe the supposed "unique" IDs of some 
emails are not always unique. And there is no check that all the IDs are 
actually unique. 

Also, there is a function in the program which contains a bug that potentially 
could make two different IDs identical if one or both of the IDs contain 
non-ascii characters.

Jesper

Original comment by jesper.h...@gmail.com on 20 Nov 2013 at 4:25

GoogleCodeExporter commented 8 years ago
Hi Jesper....

Thank you for your efforts, I am presently backing up my 60,000+ messages, and 
it certainly takes a while.

Could it be an option inside the backup2000 to actually:
1) Test for the uniqueness you mention, and, if necessary, make them unique.
2) Slightly different subject: Maybe it is worth to test for the actual length 
of the sub-directory that the backup shall store to, and shorten the filenames 
accordingly, instead of simply nominating 140 characters or some other magic 
limit, which in some cases could result in exceeding the length limit of 255.

BTW: Could the words Vojens, and Ebeltoft, by chance have any special 
significance for you?  ..  Just curious!

Regards,

Ole Knudsen

Original comment by ener...@gmail.com on 20 Nov 2013 at 8:28

GoogleCodeExporter commented 8 years ago
Hi Ole,

Thank you for your suggestions!

> 1) Test for the uniqueness you mention, and, if necessary, make them unique.

Yes. If there really is a problem with conflicting IDs, we do need to find a 
way to make really unique IDs.

But we can't just make up our own IDs to make the IDs unique in that way. 
Because to determine whether to backup a specific email or not, the program 
reads the ID of the email from the Gmail server and compares it to the IDs of 
emails that have already been downloaded. If it matches an already downloaded 
message, the message will not be downloaded again. But if we change the ID of a 
downloaded message, we cannot compare that new local ID to the online ID 
anymore, as they will for sure be different, and this would trigger a new 
download of the message at every backup (and even adding the downloaded file as 
a new file next to the already downloaded file).

A better option would perhaps be to always create an artificial ID composed of 
several of the metadata fields of an email, including its stated ID and its 
date and time stamp. That would make the risk of duplicate IDs less. 
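
One way such an artificial ID might be built, as a sketch only. Which fields to combine is an open choice; here it is assumed that the stated ID and the Date header are enough, and the function name `composite_id` is hypothetical:

```python
import hashlib


def composite_id(message_id, date_header):
    """Combine the stated ID and the date stamp into one artificial ID.

    Both arguments are text. A NUL byte separates the fields so that
    ("ab", "c") and ("a", "bc") cannot produce the same key.
    """
    key = (message_id.strip() + "\x00" + date_header.strip()).encode("utf-8")
    return hashlib.sha1(key).hexdigest()
```

Two mails with the same stated ID but different dates then get different artificial IDs, while the same mail always maps to the same value.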

But of course, the program should still check that the resulting artificial ID 
really is unique. And if not, the program should handle that situation in a 
sensible way. I haven't quite figured out what sensible way that could be. 
Maybe it could mark the emails that have duplicate IDs in the database in a 
special way and in the future *always* download these exceptional emails at 
*every* backup – simply because it cannot determine whether that email has 
been backed up before. The problem with that is that it would again have to 
create a new file at every backup and in that way pile up files with identical 
messages. 

A special problem is that every time you download the same message it can be 
binary different. This happens if you have installed an antivirus scanner (such 
as Avast) that inserts a special message header saying that the message was 
scanned and at what date and time it was scanned. As the time is new at every 
download, every downloaded message will be different – unless you filter out 
headers with often used antivirus header names (such as the "X-Antivirus" 
header). So in that case you can't even check to see if the downloaded file is 
identical to an already downloaded file, as it never will be.
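
A sketch of the header filtering mentioned above, using Python 3's standard `email` package. Only "X-Antivirus" comes from this discussion; "X-Antivirus-Status" is an assumed second header name, and the list would need to be adjusted for the scanner actually installed:

```python
import email

# Header names assumed to change on every download.
VOLATILE_HEADERS = ("X-Antivirus", "X-Antivirus-Status")


def comparable_bytes(raw_message):
    """Return the message re-serialized without the volatile headers."""
    msg = email.message_from_bytes(raw_message)
    for name in VOLATILE_HEADERS:
        # Deleting removes every occurrence; a missing header is not an error.
        del msg[name]
    return msg.as_bytes()


def same_message(raw_a, raw_b):
    """True if two downloads differ at most in the volatile headers."""
    return comparable_bytes(raw_a) == comparable_bytes(raw_b)
```

Both downloads go through the same parse and re-serialize step, so the comparison is between the filtered forms and not between the original bytes.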

Another way to solve the problem could be to perhaps use some Gmail extension 
to the IMAP protocol to get a guaranteed unique ID. Because I believe Gmail 
internally must have unique IDs on all emails. But I don't know if those IDs 
are available..... Let me see.... Well, they are!
https://developers.google.com/gmail/imap_extensions#access_to_the_gmail_unique_message_id_x-gm-msgid
This would of course lock the program even more to Gmail 
(making it slightly more difficult to generalize it to handle any IMAP server), 
but as it is already somewhat locked to Gmail, that wouldn't be a big problem.
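
A rough sketch of how the X-GM-MSGID attribute might be read with Python 3's `imaplib`. It assumes `conn` is an already logged-in `imaplib.IMAP4_SSL` connection to Gmail with a mailbox selected, and that the FETCH response has the shape shown in the linked documentation. Both function names are invented for this example:

```python
import re

_MSGID_RE = re.compile(rb"X-GM-MSGID (\d+)")


def parse_gmail_msgid(fetch_line):
    """Extract the numeric ID from a line like b'1 (X-GM-MSGID 1278...)'."""
    match = _MSGID_RE.search(fetch_line)
    return int(match.group(1)) if match else None


def fetch_gmail_msgid(conn, seq_num):
    """Ask the server for X-GM-MSGID of one message; None if unavailable.

    conn: logged-in imaplib.IMAP4_SSL with a mailbox selected (assumed).
    """
    status, data = conn.fetch(str(seq_num), "(X-GM-MSGID)")
    if status != "OK" or not data or data[0] is None:
        return None
    return parse_gmail_msgid(data[0])
```

On a server that does not implement the Gmail extension the fetch would fail or return no ID, which is the lock-in mentioned above.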

So, a solution could be to switch entirely to use Gmail's internal IDs instead. 

The only problem I can see with that is that people already using the program 
would have to start new backups again from scratch. Unless I can come up with a 
clever way of automatically updating the IDs without downloading all messages 
again. Maybe I can, but it would make the change more complex to program. And 
the first backup using the new version of the program would take a looooong 
time, as all IDs would have to be updated. But I guess it would take an even 
longer time to actually download all messages again.

> 2) Slightly different subject: Maybe it is worth to test for the actual 
> length of the sub-directory that the backup shall store to, and shorten the
> filenames accordingly, instead of simply nominating 140 characters or some 
> other magic limit, which in some cases could result in exceeding the length
> limit of 255.

It is a good idea, and I did think a lot of doing that too. But some do copy 
the entire backup to another folder, maybe even to another disk or file system 
(see Markus' comment #4 above). And if we use the longest paths possible, the 
risk that the file names won't fit on another file system or even in another 
folder increases. So I still think it is best to simply keep the file names 
relatively short and not assume anything about the destination file system. 

> BTW: Could the words Vojens, and Ebeltoft, by chance have any special 
> significance for you?  ..  Just curious!

Well, I recognize them as towns here in Denmark, and my brain associates Vojens 
with speedway and Ole Olsen, and Ebeltoft with the beautiful place near which I 
have often arrived to Jutland (or left it) by car :o). But otherwise I have no 
personal connection to these towns.

Regards,

Jesper

Original comment by jesper.h...@gmail.com on 23 Nov 2013 at 6:09

GoogleCodeExporter commented 8 years ago
Thanks for your detailed reply.

Re 1):
 I see no harm in implementing a version that will play with Gmail only, as long as it is clearly marked as such.  No need to continue maintaining the general IMAP version, but to leave that to others.  I even tried to locate the officially latest version 20, but to no avail, so the original developers may already have dropped the ball.
 So yes, if you are game to implement a version using Gmail's internal IDs, then I'm ready to try it out, even if it takes more than 24 hours to back up my 74,000 messages.

Re 2):
 I do not know what IMAP or Gmail reads in a filename, but if it were in my days, I would have implemented something that made each name unique right up front, and then maybe either use the very first part of the name as a lookup key into a separate database, or simply record the original complete filename somewhere in the file's header.

Re BTW): You have a namesake in Ebeltoft, who was born in Vojens :)  - - A UNI 
year one mate of mine.

Original comment by ener...@gmail.com on 23 Nov 2013 at 8:53