wrye-bash / CBash

Home of CBash.DLL sources
GNU General Public License v2.0

Provide auto-generated Python bindings #7

Open leandor opened 7 years ago

leandor commented 7 years ago

Steps and tasks

Step 1

leandor commented 7 years ago

After following the hint from @Utumno about pybind being better than Boost.Python, I've investigated and come to the same conclusion: it's best to use pybind.

Documentation of pybind is here
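For a sense of how little boilerplate pybind needs, here's a minimal module sketch (the function and module names are hypothetical, not actual CBash code):

```cpp
// Minimal pybind11 module: compiles to a .pyd that Python imports directly,
// with no ctypes declarations needed on the Python side.
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical stand-in for a function CBash might export.
int get_record_count() { return 42; }

PYBIND11_MODULE(cbash, m) {
    m.doc() = "CBash bindings (illustrative sketch)";
    m.def("get_record_count", &get_record_count,
          "Return the number of loaded records (illustrative only)");
}
```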

leandor commented 7 years ago

Interesting point about the buffer protocol, explained here, and how to implement it with pybind.

Perhaps it's something to keep in mind, as a buffer doesn't necessarily mean an array of bytes. It could be any big chunk of similarly sized elements (as the pybind example demonstrates), so it could prove useful if there's a similar pattern in some of the record types that require access to portions of them.
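To illustrate, a minimal sketch of pybind's buffer protocol support, using a hypothetical fixed-size-element container (the PathGrid/PathPoint types below are made up, not actual CBash record types):

```cpp
#include <pybind11/pybind11.h>
#include <vector>

namespace py = pybind11;

// Hypothetical fixed-size element: every item has the same layout and size.
struct PathPoint { float x, y, z; };

struct PathGrid {
    std::vector<PathPoint> points;
};

PYBIND11_MODULE(example, m) {
    py::class_<PathGrid>(m, "PathGrid", py::buffer_protocol())
        .def(py::init<>())
        .def_buffer([](PathGrid &g) -> py::buffer_info {
            // Expose the point array as an (n x 3) buffer of floats, so
            // Python can view it with memoryview/numpy without copying.
            return py::buffer_info(
                g.points.data(),                        // pointer to the data
                sizeof(float),                          // size of one scalar
                py::format_descriptor<float>::format(), // struct-style format
                2,                                      // two dimensions
                { (py::ssize_t)g.points.size(), (py::ssize_t)3 },
                { (py::ssize_t)sizeof(PathPoint), (py::ssize_t)sizeof(float) });
        });
}
```

On the Python side, `memoryview(grid)` (or `numpy.asarray(grid)`) then sees the points as an n-by-3 float array, with no byte-by-byte copying involved.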

Utumno commented 7 years ago

Yes - in fact I just started investigating the memoryview API (it replaces buffer in Python 3, and was backported to 2.7). I'm doing so because I strongly suspect the Python patchers can run much faster using this API. As in, much faster.

leandor commented 7 years ago

Nice... We'll need to identify how to make proper use of it and where it's best applied.

My gut feeling is that it's something reserved for big chunks of data that hold the same repeating pattern of elements, so candidates could be PATH GRID / LANDSCAPE records in Oblivion, and maybe FACEGEN data (as a chunk of bytes, if needed). Skyrim and the Fallouts probably have similar cases.

It could also be something the client requests for any container that has a non-polymorphic item type, which guarantees the same 'record size' for each element - or at least, for which you can produce a representation that has the same size for each element.

From what I gather, any class can have a buffer handler and return an object that exposes the proper buffer interface.


More docs for my own reference for later:

- http://stackoverflow.com/questions/3422685/what-is-python-buffer-type-for
- https://docs.python.org/2/c-api/buffer.html
- https://docs.python.org/3/library/stdtypes.html#memoryview

leandor commented 7 years ago

I'm starting to see a bit of the road forward for this, so I wanted to write up my ideas.

  1. The first move, I think, will be to expose the functions already published by CBash, along with the needed types and classes (as opaque types), without too much modification (see the first sketch after this list).
    • I think I can manage to get that working very quickly, and that would remove 'cint.py' from the picture as a 'moving part', so to speak.
    • This will take ctypes out of the picture very quickly.
    • This step already solves the error-reporting problem, as C++ functions can wrap and throw exceptions into Python land.
    • In this first step, I can absorb the Python declarations found in 'cint.py' to provide a look-alike API to Wrye Bash, so the code can still work; or I can convert cint into a Python package and provide a hybrid implementation where that package imports the pyd module and extends it with Python declarations.
    • I personally think the hybrid implementation is the best road, as it allows more flexibility at the WB<->CBash API boundary, as long as it stays simple.
  2. The next step would be to deprecate functions that provide no added value and exist only to be usable through the limited ctypes API.
    • I'm talking mostly about functions whose only motivation was to give access to class methods (prime example: cb_LoadCollection just wraps Collection::Load, and there are many similar ones).
    • In this second step, code in Wrye Bash will probably need corrections and adaptation to stop using deprecated functions and switch to the direct route: e.g. replace cb_LoadCollection(Collection, x) with Collection.Load(x).
    • Of course, this is just a simplistic example to make the point; there are probably more complex scenarios that would need to be checked.
    • If all the cases are as simple as this example, this step might be very quick to complete, and could probably be merged with 1).
  3. Start making helper classes and types visible via the API - mainly the visitor mechanism, and other useful stuff that is currently only usable from C++.
    • This won't improve the Wrye Bash code much by itself, but it will (I hope) make new interactions accessible - maybe some stuff that was done in an ugly way can be refactored to use a more convenient access mode.
    • The main idea in this change is that since callbacks aren't a problem anymore, many of the use cases could probably be resolved by methods receiving lambdas, a list comprehension defining a subset, temporary collections for partial results, etc. (see the second sketch after this list).
  4. Pythonize the API as much as possible, either by
    1. adding constructs where there are none yet, and adding standard Python support (iterators, indexing, context managers - whatever applies in each case) to classes that already exist, or,
    2. in the worst case, introducing an extra layer of classes/constructs that allow easier traversal from Python.
      • I'd prefer to go with option i), since the internal structure of the CBash classes is already very clean, so making them compliant with Python protocols (via lightweight wrappers, where needed) shouldn't be too much work (see the third sketch after this list).
  5. Explore enhancements and new constructs, as mentioned in prior posts (e.g. the buffer protocol).
    • Anything that makes the API faster and more flexible, reduces repetitive coding, and makes usage of the API less error-prone.
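To make steps 1) and 2) concrete, here's a sketch of binding the existing class directly, so Python calls Collection.Load(...) instead of cb_LoadCollection(...). The signatures are illustrative; the real CBash ones will differ:

```cpp
#include <pybind11/pybind11.h>
#include <stdexcept>
#include <string>

namespace py = pybind11;

// Stand-in for the real CBash class.
class Collection {
public:
    void Load(const std::string &modName) {
        if (modName.empty())
            // pybind11 translates std::exception subclasses into Python
            // exceptions (std::invalid_argument becomes ValueError), which
            // is what solves the error-reporting problem mentioned above.
            throw std::invalid_argument("empty mod name");
        // ... actual loading would happen here ...
    }
};

PYBIND11_MODULE(cbash, m) {
    py::class_<Collection>(m, "Collection")
        .def(py::init<>())
        .def("Load", &Collection::Load, py::arg("mod_name"));
}
```

From Python this reads as `coll = cbash.Collection(); coll.Load('Oblivion.esm')`, and a bad argument surfaces as a ValueError instead of an error code that has to be checked manually.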
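For step 3), pybind's functional.h converts any Python callable into a std::function, so a visitor-style method can take a lambda straight from Python. The ModFile/Record types here are hypothetical:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>  // enables std::function <-> Python callable
#include <functional>
#include <vector>

namespace py = pybind11;

struct Record { unsigned formID; };

struct ModFile {
    std::vector<Record> records;
    // Visit every record, handing each one to a caller-supplied callback.
    void VisitRecords(const std::function<void(const Record &)> &visit) const {
        for (const auto &r : records)
            visit(r);
    }
};

PYBIND11_MODULE(cbash, m) {
    py::class_<Record>(m, "Record")
        .def_readonly("formID", &Record::formID);
    py::class_<ModFile>(m, "ModFile")
        .def(py::init<>())
        .def("visit_records", &ModFile::VisitRecords);
}
```

which would allow Python code like `mod.visit_records(lambda rec: print(hex(rec.formID)))`.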
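And for step 4.i), adding the standard Python protocols on top of an existing container is mostly declarative in pybind11 (again with hypothetical types):

```cpp
#include <pybind11/pybind11.h>
#include <vector>

namespace py = pybind11;

struct Record { unsigned formID; };
struct ModFile { std::vector<Record> records; };

PYBIND11_MODULE(cbash, m) {
    py::class_<Record>(m, "Record")
        .def_readonly("formID", &Record::formID);
    py::class_<ModFile>(m, "ModFile")
        .def(py::init<>())
        // len(mod)
        .def("__len__",
             [](const ModFile &f) { return f.records.size(); })
        // mod[i]
        .def("__getitem__",
             [](const ModFile &f, size_t i) -> const Record & {
                 if (i >= f.records.size())
                     throw py::index_error();
                 return f.records[i];
             },
             py::return_value_policy::reference_internal)
        // for rec in mod: ...
        .def("__iter__",
             [](const ModFile &f) {
                 return py::make_iterator(f.records.begin(), f.records.end());
             },
             // keep the ModFile alive while an iterator on it exists
             py::keep_alive<0, 1>());
}
```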

Why do I think this is a good way to move forward?

My idea is to avoid touching internal CBash code as much as I can, to avoid introducing new bugs/regressions while there are no test cases protecting the code. So steps 1), 2) and 3) can probably be done with almost no alteration of the inner workings of the existing deep 'core' stuff, like record handling, mod file parsing, etc. And I hope that by the time steps 4) and 5) enter the picture, there will already be more test coverage and protection against the most common regressions.

Of course, I'm open to suggestions, alterations, rebuttals and comments :)

Utumno commented 7 years ago

Well - it just occurred to me that I have a binary commit ready on my branch with 7z version 16.02 - if we are to look into the Python bindings for the 7z dll, then maybe now is the time to do it, to avoid a double binary commit. On the other hand, I do want 16.02 in the beta due to the vulnerability discussed here: https://github.com/wrye-bash/wrye-bash/issues/313 - so I might as well go ahead and commit, and we can revisit this at our ease. The beta has some work ahead - including loot, merging my wip branch which needs more work, etc.

leandor commented 7 years ago

It depends on how long you're willing to wait. I mentioned the 7zip binding as a long-term project, but if you need it prioritized, I could switch from CBash and start looking into it.

I don't even know yet if the provided library has all the functionality needed for Bash, so it'd all depend on that.

It's something that would take me between 2 weeks and 1 month to figure out and implement, I think - and that's assuming I can keep doing stuff during work hours (while my work is not too demanding).

Even then, I'd like to think through a proper solution to the dependency problem... as in, dependency on self-produced binaries, not on 3rd-party stuff.

How do other projects handle this?

We should look for projects here on GitHub that have many independently compiled modules, and see how they handle integration, IMO.

In any case, I think having a whole Git repo for handling releases (or, ideally, one for each project that produces binaries) is not too problematic - at least in the meantime, until a proper solution can be devised.

AFAIK, the whole "do not commit binaries to Git" thing applies when you have source and binaries intermingled; if you dedicate a whole repo to releases, I don't see the problem. Git handles binaries very well, and even if they become too big to handle, you can always delete the whole repo and regenerate it, since you aren't really interested in the history, right? As long as you keep one branch of binaries for stable and one branch for dev (and maybe an extra branch for experimental releases, occasionally), it just shouldn't become too big to handle, IMHO.

((As an example, at my work we have a separate SVN repo just for released binaries, with a structure mirroring the source repo; the source working copies pull the release directory from that release repo using SVN externals. The build bot just compiles the source and then auto-commits the produced binaries that end up in the release dir, which goes to the binaries repo.))

And I suppose another solution could be to create 'proper' Python packages using setuptools/pip and host them on PyPI (assuming PyPI even hosts the binaries themselves? I know they host source tarballs in some cases...), but I don't know how much work that option would entail.

leandor commented 7 years ago

I've been toying with a continuous integration setup on a personal fork of this repo (just in case I messed something up), and it seems to work well.

I found AppVeyor and decided to test it since it's free for open source projects (with limitations) and supports the environment we need.

There's also travis-ci.org, which is better IMO, but it doesn't support Windows builds (yet).

I'm gonna play with Travis maybe tomorrow, to see if I can make CBash compile in a Linux environment.

See here for the build that passed.

I've already read up a bit on how to generate releases automatically and how to handle dependencies.

Also, here's the docs on Releases: https://help.github.com/articles/creating-releases/


The next thing I'm gonna do is try to optimize the build time a bit, because compiling everything takes 15 minutes, and I'm already dreading having to change code due to the compilation times.

So, the first step is to try the precompiled headers stuff I found around, and the second is to split the CBash project into several static libraries that get linked together at the end.

After that, I'll start coding step 1) from the list above.

Utumno commented 7 years ago

Well, of note: apparently (https://github.com/wrye-bash/wrye-bash/pull/329) we have some serious problems packing the pyd file with the standalone - I had similar problems making scandir work (a Python module that uses a pyd and depends on the MSVC runtime).

leandor commented 7 years ago

Hmm... that worries me a bit. As soon as I have a working binding for CBash, I'll take some time to experiment with the standalone version too.

Utumno commented 7 years ago

Yes, I'll keep an eye on this too - anyway, I want scandir to work as well, since it speeds up filesystem listing quite a bit.

Ortham commented 7 years ago

The pyd has no problems being packed; it's the (LOOT API) DLL that it calls that I haven't been able to pack.

leandor commented 7 years ago

Hey Wrinkly,

Maybe you can just solve it by compiling against a static version of that lib when building the pyd?

That would mean having two different builds, if you need the DLL output, as I don't think CMake allows generating both the shared and static versions of a target from the same build process.

Ortham commented 7 years ago

If you link statically then you need to be very careful about memory allocation, because your library will allocate memory in its own heap, so you can't have objects freed externally (they'd try to free memory on a heap they don't live on).

If you're providing a C API this isn't an issue, because the memory management is all explicit anyway; but I found (at least with pybind11) that this leads to pointer hell, primarily re: output-argument pointers, which just don't map to Python (which I guess is why ctypes constructs object wrappers for pointers, and which is probably a factor in its large overhead).

EDIT: I'm talking about linking statically to the MSVC runtime here. If you mean linking to the library itself and not the runtime, then you pull that runtime linking into the pyd, which would probably just move the issue from the DLL into the pyd, leaving importing it totally broken.
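For reference, the convention C APIs usually follow to stay safe here is that the DLL that allocates is always the DLL that frees - a minimal sketch with made-up names (not the actual LOOT or CBash API):

```cpp
// Built into the DLL: allocation and deallocation both happen on this
// module's CRT heap, so the caller never frees memory it didn't allocate.
extern "C" {

struct Handle { int state; };

__declspec(dllexport) Handle *handle_create() {
    return new Handle{0};  // allocated on the DLL's own heap
}

__declspec(dllexport) void handle_destroy(Handle *h) {
    delete h;              // freed on the same heap it came from
}

}  // extern "C"
```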

leandor commented 7 years ago

Yes, I meant just for your (intermediate, at least from the pyd standpoint) DLL library.

Agreed - even if you link to it statically, the C/C++ runtimes (and the Python runtime too) would still have to be linked dynamically, as you said.