wrye-bash / CBash

Home of CBash.DLL sources
GNU General Public License v2.0

Provide support for handling archive packages #11

Open leandor opened 7 years ago

leandor commented 7 years ago

We talked about this on (#9), and since I've been investigating a bit what options are out there, I thought I'd dump what I've found in an issue for future reference.

First, the links:

For now I'll create this as a repository of information, until we decide what to do.

Utumno commented 7 years ago

I'm replying here, re: binaries and Bash, to your post in #7.

It depends on the time frame you'd want to wait. I mentioned the 7zip binding as a long term project, but if you need to focus on it, I could switch from CBash and start looking into it.

Just spend your available time on whatever pleases you most (unlike poor me, who has 100k lines of code to care for and has to do all the legwork, duh). It's good you started to look at 7z though :+1:

I don't even know yet if the provided library has all the functionality needed for Bash, so it'd all depend on that.

At some point I refactored the archives handling code (first big merge https://github.com/wrye-bash/wrye-bash/commit/c29d7b85f0509d82b5e5fd2208f3c90c096f7462, but there was more work in later merges) - it was one of the trickiest, most tedious parts of the refactoring, due to details, details and more details. I still need to factor all this out into a module (and add some leftovers, like ancient omod handling code that makes me feel like an archaeologist).

But the bottom line is: we use subprocess to call the exe directly - see the code in the "Archives" section in bolt. I don't know if the python binding will be faster, so that's what we must investigate (plus whether it supports rar). If you look at the compiled repository, lojack at some point tried to bin the exe and keep the dll, but the dll had no rar support, so that was not an option.
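To make the current approach concrete, here is a rough sketch of driving the 7z executable via subprocess, as described above. The function names are illustrative, not the actual bolt code; only the 7z switches (`x`, `-o`, `-y`) are standard:

```python
import subprocess

def build_extract_command(seven_zip_exe, archive, out_dir):
    """Build a 7z command line for extracting an archive.

    `x` extracts with full paths, `-o<dir>` sets the output directory
    (no space after -o), `-y` answers yes to all prompts.
    """
    return [seven_zip_exe, 'x', archive, '-o' + out_dir, '-y']

def extract(seven_zip_exe, archive, out_dir):
    """Run the external 7z executable and raise on failure."""
    cmd = build_extract_command(seven_zip_exe, archive, out_dir)
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError('7z failed: %r' % out)
    return out
```

The upside of this model is simplicity and full format support (including rar); the cost is process spawn overhead and having to parse textual output, which is what a real binding could avoid.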


Even then, I'd like to think about a proper solution to the dependency problem... as in, depending on self-produced binaries, not on 3rd party stuff.

How do other projects handle this?

  • Do they use GitHub releases?
  • Do they depend on sources and compile whole packages?
  • Do they link to a submodule that contains the produced binaries?
  • How does GitHub even handle releases?
  • Are releases located in a Git repo by themselves, or is there a different API altogether?

We should look for projects here on GitHub that have many modules that compile independently, and see how they handle integration, IMO.

In any case, I'd think that having a whole Git repo for handling releases (or, probably ideally, one for each project that produces binaries) is not too problematic (at least in the meantime, until a proper solution can be devised.)

Well, the ideal solution would be to have python-only code handle everything - ideal in terms of maintainability, extensibility and cross-platform support (I would like Bash to run on linux, bare bones, one day - although the biggest hurdle for that is the case-sensitive file system). After fighting with libloadorder.dll for a couple of years, I finally got round to implementing it in python (https://github.com/wrye-bash/wrye-bash/commit/66d7b4d695289f6dc29142f94a6004494e3306bd). Not only did it solve a gazillion issues and allow me to implement undo load order, which is a first-class feature, it also taught me what I already knew - we can only use the full potential of Bash's machinery and caches iff we stay in python land.

There are two obstacles - 7z and CBash (libbsa will be re-implemented in python). So what do we do about those?

I have investigated submodules (see the wiki article, https://github.com/wrye-bash/wrye-bash/wiki/[git]-Submodules, and the compiled repo), but there are many complications - the "clone and run" model we have now is admittedly much better, usability-wise, than having the submodules (which are not yet exactly user friendly) in our way. Plus I really don't know how submodules will play with everyday git flow - I routinely check out older commits, bisect etc., and submodules could prove a regrettable decision. Now that you've started using them in this repo, we may get some real-world feedback and see how that goes.

But for now I'd say let's stay with the current model. Once refactoring is finalized we may consider rebooting the repo and introducing submodules from the start - which IIRC will play better with the git workflow.

AFAIK, the whole "do not commit binaries to Git" thing applies when you have source and binaries intermingled, but if you dedicate a whole repo to releases, I don't see the problem... Git handles binaries very well, and even if they become too big to handle, you can always delete the whole repo and regenerate it, since you aren't really interested in the history, right?

Not sure about the last part - checking out older versions may need the submodule history intact - so back to rebooting once we have a better understanding of the consequences (and maybe have found a way to have 7z in python somehow - zip support is in the standard library, but not 7z or rar, nor any time soon I'm afraid). Re: a repo for releases - well, it would be a repo for the binary files we use (CBash.dll and the 7z dll) - see the compiled repo here. Releases are not an issue; they are handled fine by github.

As long as you keep one branch of binaries for stable, and one branch for dev (and maybe an extra branch for experimental releases, occasionally), it just shouldn't become too big to handle, IMHO.

Sure - and even on the dev branch we should commit stable stuff, so yes, it won't grow out of proportion.

((As an example, at my work we have a separate SVN repo just for the released binaries, with a mirrored structure w.r.t. the source repo, and the actual source working copies bring in the release directory from that release repo using SVN externals. The build bot just compiles the source and then auto-commits the produced binaries that end up in the release dir, which commits to the binaries repo.))

And I suppose that another solution could be to create 'proper' Python packages using setuptools/pip and host them on PyPI (assuming PyPI even hosts the binaries themselves? I know they host source tarballs in some cases...), but I don't know how much work that option would entail.

For releases we use github releases, which is ok - as for hosting on PyPI - well, as I said, the goal would be to have a "clone and run" repo - which at some point may become a "clone and init 2 submodules and run" repo - 7z and CBash. I don't see for now why we would need PyPI, but we may consider that too.


So that's the picture re: binaries we bundle with Bash. The ideal solution would be to have everything in python, but that is impossible, as we must bundle CBash, and 7z is nowhere written in python (and actually, even if it were, performance would probably be an issue). We could have the user install 7z and use that, but it would be a big regression in user friendliness, which Bash is not renowned for anyway. So I think we keep things as they are for now, trying to reduce binary commits to a minimum, while we continue cleaning things up and while you investigate submodules in your repo.

For 7z bindings - I would appreciate some more information on what exactly those bindings are (hey, I do c++ with the dict in hand). If it proves an easy thing and increases performance of the 7z operations (always welcome), it should be easy to plug it in instead of the subprocess calls (as the archives code is nearly all in one place) - but just go at your pace. When I feel the beta is ready I will commit 7z 16.02 (we can afford a binary commit from time to time; anyway, we are much more careful now) if we are not ready to use the "bindings" - or if we are not sure they buy us anything vs the subprocess stuff. Then we can settle the matter for 307 proper. Just go at your pace - I am really relieved to have a coder around after 3 years, I wouldn't want to press you in the least :)

Keep up the good work.

leandor commented 7 years ago

So that's the picture re: binaries we bundle with Bash.

My thinking is that maybe we can absorb the required 7z functionality into CBash[1] itself.

That would allow me to provide the needed API to you without Wrye Bash depending on the 7z binaries anymore (I hope), as I can link with it statically. That would leave just one binary to depend on, and no 3rd party dependencies (in Wrye Bash code.)

[1] Take that as a generic name; I mostly mean this "repo"/project, since maybe when the Python binding is fully in place, there will be no CBash.dll anymore, just a single <name>.pyd, whichever name we decide. (For now it's called cint.pyd, but that can change easily. I could rename the output to libbash.pyd or bashapi.pyd, f.e. - i.e. anything goes :P )

Just spend your available time on whatever pleases you most (unlike poor me, who has 100k lines of code to care for and has to do all the legwork, duh). It's good you started to look at 7z though

No worries, I can handle a few "diversions" :P I like being on to anything that helps, really. I love working with C++ on any project, and since at my current work I'm stuck with C#/Delphi, I take the opportunity to relieve my C++ urges here :D (To be completely honest, I have to say I'm a bit rusty w.r.t. my C++ skills, and I'm really, really more proficient with C# at this point.)

When I wrote that, I simply wanted to be clear that it's something that will take time, and it maybe won't be ready as quickly as you'd need it to be part of the next beta... that's all.

For 7z bindings - I would appreciate some more information on what exactly those bindings are (hey, I do c++ with the dict in hand). If it proves an easy thing and increases performance of the 7z operations (always welcome), it should be easy to plug it in instead of the subprocess calls (as the archives code is nearly all in one place) - but just go at your pace

Hmmm... tbh I'm not totally sure yet... I've seen the API of the 7zip.dll and verified that it handles creating (7z/zip) and unpacking archives (7z, zip, rar, and others.) Also, I've seen that it can handle all the work in-memory, on buffers... so it should allow extraction/compression to be done without the need to copy/move files around (I think... not sure if that would be a benefit.)

Also, I haven't found yet how you would go about getting the list of files contained in an archive, and all their info like size, CRC, etc... and that's critical for Bash's operation, so the first thing I'd need to do is make sure that can be accomplished.
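As a fallback, until the dll route pans out, that listing can be obtained today by parsing the output of `7z l -slt archive`, which prints one "Key = Value" block per entry (Path, Size, CRC, ...) after a `----------` separator line. A minimal sketch of such a parser (this is illustrative, not existing Bash code):

```python
def parse_7z_listing(text):
    """Parse `7z l -slt` technical-listing output into per-entry dicts.

    Each entry is a block of "Key = Value" lines; blocks are separated
    by blank lines, and the entry section starts after '----------'.
    """
    entries = []
    current = {}
    in_entries = False
    for line in text.splitlines():
        line = line.strip()
        if line == '----------':
            in_entries = True  # header/archive info ends here
            continue
        if not in_entries:
            continue
        if not line:  # blank line closes the current entry block
            if current:
                entries.append(current)
                current = {}
            continue
        if ' = ' in line:
            key, _, value = line.partition(' = ')
            current[key] = value
    if current:  # flush a trailing entry with no final blank line
        entries.append(current)
    return entries
```

The Size and CRC fields come back as strings and would need converting, but this would give Bash the per-file metadata it needs without any new binary dependency.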

For releases we use github releases which is ok

Yeah, agreed. They seem very good.

as for hosting on PyPI

Yeah, forget that I mentioned that... I listed it just for completeness' sake. I'd feel weird hosting any non-Python projects there, and I'm not sure I'd want to investigate how to use setuptools/pip and convince it to compile C++ from source, especially with all the dependencies (boost, mainly.)

Now that you start using them in this repo we may get some real world feedback and see how that goes.

Yeah, I remember submodules being a PITA some years ago when we first tried to migrate from SVN to Git at work, and we simply decided to stay on SVN for a bit longer.

But I've been reading a bit, and it seems Git has solved a few of the issues and they really seem more seamless now; that's what prompted me to try them in the first place.

It seems that you still need to do git submodule update --init --recursive once after cloning, though doing git clone --recursive supposedly works too, and after that you have a fully functional repo with all your content intact, including submodules. I've also found that switching to a different branch triggers the initialization, though.

Content from submodules gets linked to your branches by commit hash, so if you switch branches and each refers to a different submodule state, it all gets handled automatically - as in, submodules will point to the state matching the branch you're checking out (according to which submodule commit that branch points to.)

They are still a bit cumbersome to operate on when you specifically need to change which branch of the submodule you point to (as in, say, in CBash, suppose it's linked to the 'cotire' branch at 1.7.8, and after they release 1.7.9 I want to switch to that.)

But the good news is you can now do it by hand, by going into the submodule folder and using git commands there on a case-by-case basis, and then afterwards committing the index change in your main repo.

leandor commented 7 years ago

Here's a good read regarding submodules and GitHub recommendations.

leandor commented 7 years ago

W.r.t. dependencies, I found that CMake has an ExternalProject module that seems to be able to download dependencies from several different sources and set them up in different ways.

Worth checking out; if we're going to stick with CMake, it may be useful in the future.

Of course, that's only useful if the project is already using CMake. Not sure I would want to push for using it for python code (it's theoretically possible, though not sure if it's worth it.)

Perhaps a better option is to use the standard python library to just do an HTTP GET to a known URL, downloading the content found there (if newer.)
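A minimal sketch of that idea using the Python 3 standard library (Bash's Python 2 would use urllib2 instead; the URL, file names and "newer" policy here are placeholders, and the check assumes the server sends a Last-Modified header):

```python
import calendar
import email.utils
import os
import urllib.request

def is_newer(last_modified_header, local_path):
    """True if the remote Last-Modified date is newer than the local
    file's mtime, or if the local file does not exist."""
    if not os.path.exists(local_path):
        return True
    parsed = email.utils.parsedate(last_modified_header)
    if parsed is None:  # missing/unparseable header: re-download to be safe
        return True
    return calendar.timegm(parsed) > os.path.getmtime(local_path)

def download_if_newer(url, local_path):
    # HEAD first to check the timestamp, then GET only when needed.
    # A real implementation would also want timeouts and error handling.
    head = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(head) as resp:
        if not is_newer(resp.headers.get('Last-Modified', ''), local_path):
            return False  # local copy is up to date
    with urllib.request.urlopen(url) as resp, open(local_path, 'wb') as out:
        out.write(resp.read())
    return True
```

Comparing a hash or version file instead of timestamps would be more robust, but even this much would cover the "fetch the latest CBash.dll if ours is stale" case with no extra dependencies.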

Utumno commented 7 years ago

My thinking is that maybe we can absorb the required 7z functionality into CBash[1] itself.

That would allow me to provide the needed API to you without Wrye Bash depending on the 7z binaries anymore (I hope), as I can link with it statically. That would leave just one binary to depend on, and no 3rd party dependencies (in Wrye Bash code.)

This sounds like a great idea actually :+1:


Perhaps a better option is to use the standard python library to just do an HTTP GET to a known URL, downloading the content found there (if newer.)

Yes, for python let's do that when the time comes