Thanks a lot for sharing! I'll add your project to the list of related projects.
I think I would rather not bring in a dependency on the par2 format, or the par2 command line tool unless it provides some large benefit. In this case, the obvious benefit would be allowing restore with missing QR codes, without 'qr-backup' present (the biggest reason I was avoiding this feature to start with is that it would require qr-backup for restore). I'll have to think about whether I can write a bash oneliner to do what you're describing.
I re-read your comment and I'm still a little lost on what you mean by either "alignment" (just in the sense of padding?) or scrambling. Could you provide a reference on the par2 format? I'm familiar with the tool, but not the actual file format, and staring at it would probably help.
What dependencies are required for restore is always a critical question.
Personally, the approach I've taken* is to throw a tar file header on the first QR code, and use Par2 for the rest.
My par2 implementation is still in progress, but explicitly has the goal of supporting the optional "Input File Slice packet", and zero padding the packets for maximum recoverability in QR codes.
One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled. An intermediate solution is to use a tar file header (512 bytes) in front of the file data, and another header (512 bytes) in front of the par2 file. As long as the QR codes are read in the proper order and those headers survived, this would allow for recovery even if a data QR code was unreadable.
* Different requirements, and design philosophy.
Oh, another approach: the master branch of par2cmdline supports generating recovery blocks over 100%. I have confirmed that at 100% recovery, the software can restore a file even if it was completely deleted.
The downside is you must use a par2 program to do so. However, my approach ends up with that requirement anyway, so thinking about it, I will probably ignore the whole "Input File Slice packet" thing and just generate over 100% recovery data. That allows any par2 program to recover the data!
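As a rough illustration, here is a hedged sketch with stock par2cmdline (assuming a build that accepts this redundancy level; the file names are placeholders):

    # Generate recovery data equal to 100% of the input, enough to rebuild
    # 'file' even if the original data is missing entirely.
    par2 create -r100 file.par2 file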
That does add a dependency you may not be happy with, however, which is another reason I am working on a pure Python implementation.
One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled.
Again, what do you mean by "scrambled"?
I'll point out that the QR codes I'm using have ~200 bytes, so 64-byte overhead would be a lot. If you're having luck with actually scanning/photographing significantly bigger ones, can you let me know what tools/hardware you're using?
I'm taking a look at par2. Hand-rolling par2 IMO seems more complex than using reed-solomon directly--it would only be worthwhile for the restore process. I think we may have the same design requirements, not different ones, if I'm understanding correctly. https://github.com/za3k/qr-backup/blob/master/docs/FAQ.md#what-are-the-design-goals-of-qr-backup
Are you decoding to a 2-member tar archive (separate file and .par2)? Or is it a 1-member archive? I'd like the restore to work without par2, which seems doable, but needs a lot of magic knowledge on the encoding end. The main downside is that no one will be able to understand the format any more--only the restore process.
P.S. Make sure to think about what happens if your headers are the QR codes which are damaged--add redundancy to these! I'm sure you already are :)
OK yeah on the whole I think this seems like a good parity option because of the command-line restore option. Thanks for taking the extra time to explain the format you're using to me.
Obstacles:
Let me know if you have any answers here.
Sorry I took so long to respond. Yes, I mean something like the pages of QR codes being mixed up, or the scanner deciding to read top to bottom instead of left to right.
I have focused on 512-byte QR codes, and am considering bumping up to 1024 bytes, with a full 25% per-code redundancy. I'm just using my phone at this point; 512 scans every time, but 1024 does not.
The tar format works on 512-byte blocks and is a streaming format. Basically, you take a regular file and slap a tar header in front of it (see the sketch below).
Python has a tar file parser as part of the standard library, so there are no additional dependencies to do that.
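A minimal sketch of the idea (hedged; file names are placeholders, and the same thing can be done from Python's tarfile module):

    # Wrap the file in a tar archive: the first 512 bytes are the tar header,
    # followed by the file data padded to 512-byte blocks (plus an
    # end-of-archive marker).
    tar cf file.tar file
    # The payload for the first QR code is then just the header block:
    head -c 512 file.tar > header.bin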
My goal is to be able to make a book of QR codes* that, if I ever need to recover, I can just throw into an auto-document-feed scanner, convert the whole thing to one big raw file, and then recover everything with one command. Even if a page is missing, the pages are not in the right order, or water/fire damage destroyed a corner of every page.
Par2, with each packet being on a separate QR code, supports everything I just said. Especially since the format supports something like cat *.par2 > ../big.par2, and then being able to just use one big file, even if everything was done at different times. The downside is that it actually adds almost too much redundancy to the "header" packets. This is because, like tar, Par2 supports multiple archives in one file.
Unfortunately, packet size consistency and alignment are explicitly not goals of the par2 format. Here is what I have found, using the names from my Par2Py:
par2cmdline deliberately duplicates the critical "header" packets to reduce the risk of losing them.

* Even at different times, provided all file names are different.
** Which is an optional parameter to par2cmdline
If you haven't checked it out already, I have a much simpler method to "de-scramble" things: sort -u | cut -c5-. It doesn't handle missing codes on its own, though.
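For reference, a hedged sketch of that kind of prefix scheme (GNU coreutils shown; not necessarily qr-backup's exact format):

    # Encode: base64 the backup (76-character lines) and prefix each line with
    # a 4-digit sequence number (assumes at most 9999 chunks).
    base64 backup.bin | awk '{printf "%04d%s\n", NR, $0}' > codes.txt
    # Restore: scanned lines can arrive in any order and with duplicate scans;
    # sorting on the prefix de-scrambles them, and cut strips the prefix off.
    sort -u scanned.txt | cut -c5- | base64 -d > restored.bin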
Thanks! That's a super simple way of handling it, and has far less overhead than the 64 bytes of a Par2 packet header.
Re-reading your last comment, the largest issue is that each Par2 archive is limited to 32768 input blocks, which at a 444-byte block size results in a maximum size of just under 14 MB (32768 × 444 bytes ≈ 13.9 MiB). Here is me discussing the issue with the par2cmdline people, and being informed of this limitation.
However, that's per archive. You can have multiple archives; they just won't share recovery data.
As a note to my future self, the backup format would be a file always called 'file' and a file always called 'file.par2' both inside a nameless tarball. Here are the needed qr-backup QR "packet" types:
Restore process
Minor notes:
Edit: More notes:
Here is a spec for Tar headers.
The Tar file format explicitly allows a single Tar file to have two headers with the same file name. This is because it was designed for tape drives where you could be backing up a file multiple times. It is implementation defined how those are unpacked. Most implementations either use the last file in the archive, or the file with the newest timestamp.
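For illustration, a hedged example of that duplicate-name behavior (GNU tar shown; other implementations may differ):

    tar cf demo.tar file    # archive contains 'file'
    tar rf demo.tar file    # append a second member, also named 'file'
    tar tf demo.tar         # lists 'file' twice
    tar xf demo.tar         # members extract in order, so the last copy wins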
Revised: there's no reason to invoke tar, what was I thinking?
Restore process
To decode an uncorrupted file without par2 installed, take "N" packets and concatenate them. To decode a corrupted file (needs par2):
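A hedged sketch of what the corrupted-file path could look like (file names are placeholders; missing data blocks may need zero-filling at their original offsets):

    # Reassemble whatever scanned payloads survived into the data file and the
    # par2 recovery file, then let par2 rebuild damaged or missing blocks.
    cat data-packet-*.bin > file
    cat par2-packet-*.bin > file.par2
    par2 repair file.par2 file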
OK, looked into whether it's possible to trim the 64 bytes--yes, but not while keeping it as a readable bash oneliner.
All right, overall I'm leaning toward not using par2 for now. On qr-backup's default settings, 64 binary bytes overhead for each code would mean something like doubling the number of QR codes, which is quite bad. Also, par2 adds a must-have "header" QR code, which I have been specifically avoiding because it makes things complicated for users to think about. My feeling is that currently the costs outweigh the benefits.
I do think it's still a good idea, so let me know if/when you get it working on your own project please. And I may revisit if I get larger QR codes working, and a system for requesting duplicate copies of QR codes.
there's no reason to invoke tar, what was I thinking?
You were following one of my early approaches, not yours. Different design goals and considerations. Pros and cons to each. I suspect we have slightly different definitions of "simple".
I'm glad we could discuss this. Thanks :)
Blergh, all right, given that the alternatives look worse (reed-solomon coding isn't clearly a real option), let's give this a try! I'll see what the blowup looks like in practice. I do like the possibility of CLI restore quite a lot.
Hey, don't stress out about it. I'm already working on things to make it easier/aligned. Let me deal with that part, and you can focus on how you want everything to be packed / laid out.
Heh, I appreciate it. I just want parity working, and I'm frustrated that all the options are bad (IMO anyway). But having something is better than having nothing. I'll probably do it in a couple days.
That said, if you want to give me a PR feel free!
Investigated a little further. Linux, Windows, and Mac should be able to restore the par2 format. There are no apps to do so on Android.
Closed since I added erasure coding
I've been working on this in my own implementation.
There's an easy solution that already exists. Par2 is a recovery file format that uses Reed-Solomon codes. It's a packetized format, so you can zero-pad each packet to fit in a QR code and the programs will still read them.
The downside is that each recovery packet is 64 bytes larger than the recovery block size, because of the packet header.
If you don't care about exact alignment, you can just use Par2 as is. If you do, I am currently working on Par2Py, which will allow for alignment and for an optional packet type that stores the data itself inside the Par2 file.
The advantage of this approach is if you do one Par2 packet per QR code, then not only can you recover from some codes missing, but you can also recover from codes being scrambled!
Originally posted by @EmperorArthur in https://github.com/za3k/qr-backup/issues/2#issuecomment-890360305