weld-project / weld

High-performance runtime for data analytics applications
https://www.weld.rs
BSD 3-Clause "New" or "Revised" License

Grizzly is Python 2.7 only #110

Open · wesm opened this issue 7 years ago

wesm commented 7 years ago

It will be important to run on Python 3, preferably supporting both 2.7 and 3.5/3.6 from a single codebase (the six module helps with this)
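
For illustration, a minimal sketch of the kind of 2/3 shim six makes possible; the helper name to_weld_bytes is hypothetical and not part of the Weld codebase:

    # Hypothetical helper showing a single code path for Python 2.7 and 3.x.
    # six.text_type is unicode on Python 2 and str on Python 3.
    import six

    def to_weld_bytes(s, encoding="ascii"):
        """Normalize a Python string to bytes before handing it to Weld."""
        if isinstance(s, six.text_type):
            return s.encode(encoding)
        return s

    # Behaves identically on 2.7 and 3.5/3.6:
    assert to_weld_bytes(u"abc") == b"abc"
    assert to_weld_bytes(b"abc") == b"abc"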

cirla commented 7 years ago

I'm working on supporting 2.7 + 3.5/3.6 simultaneously in https://github.com/weld-project/weld/pull/132, though Unicode support muddies things a bit.

wesm commented 7 years ago

While you're at it, it would be nice to plot a course toward conda install weld and getting all the Python pieces from a single import weld statement. This probably means a package structure like

weld/
    grizzly/ ...
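
For what it's worth, a rough setup.py sketch for that layout (the metadata below is assumed, not the actual Weld build configuration), with weld/__init__.py re-exporting the Grizzly API so that import weld is enough:

    # Hypothetical setup.py for a single top-level weld package containing grizzly.
    from setuptools import setup, find_packages

    setup(
        name="weld",
        version="0.0.1",
        packages=find_packages(),               # picks up weld/ and weld/grizzly/
        install_requires=["numpy", "pandas"],   # assumed runtime dependencies
    )
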
cirla commented 7 years ago

I'm also exploring what it will take to bundle the shared libs into the Python package so binary wheels can be built and distributed.
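
One common approach, sketched below with assumed file names (not necessarily what Weld will end up doing), is to ship the prebuilt shared library as package data and force a platform-specific wheel:

    # Hypothetical setup.py fragment bundling a prebuilt libweld into the wheel.
    from setuptools import setup
    from setuptools.dist import Distribution

    class BinaryDistribution(Distribution):
        """Mark the distribution as binary so a platform-specific wheel is built."""
        def has_ext_modules(self):
            return True

    setup(
        name="weld",
        packages=["weld"],
        # Shared libraries copied next to the Python sources before building.
        package_data={"weld": ["libweld.so", "libweld.dylib"]},
        distclass=BinaryDistribution,
    )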

wesm commented 7 years ago

You can look at what we did in Apache Arrow with manylinux1: https://github.com/wesm/arrow/blob/master/python/manylinux1/build_arrow.sh

and https://github.com/wesm/arrow/blob/master/python/setup.py#L210

so all the shared libs (built with CMake) get bundled in the wheel. It's probably possible to do something similar on OS X. We have an extra layer of complexity in that we also want to expose a C API for Arrow via the binary wheel (similar to the NumPy C API), but we're still working on that.

wesm commented 7 years ago

conda is the easiest way since you can package libweld (the shared libraries) and weld-python (the Python package and C extensions) as separate components

wesm commented 7 years ago

There seems to be some GitHub snafu right now so all the Apache git mirrors on GitHub are down at the moment

snakescott commented 7 years ago

@cirla how is work on this going? I'm new to Weld, but Python 3.6 + packaging support lines up with my interests; is there a part of this work (and/or #132) that is sufficiently self-contained that I could try tackling it?

Thanks!

cirla commented 7 years ago

@snakescott I haven't been actively working on this recently, but there are two separate things to tackle for #132:

  1. Get Travis CI to run against multiple Python versions.
    • We may have to change language: in .travis.yml to python so that Travis spawns a separate job for each version listed under python:, and then install Rust ourselves during the install: phase (e.g. curl -sSf https://build.travis-ci.org/files/rustup-init.sh | sh -s -- --default-toolchain=$TRAVIS_RUST_VERSION -y). The downside is that this would preclude testing against multiple versions of Rust.
  2. The work done so far in that PR handles the low-hanging fruit needed to get Weld running on Python 3, but the result still isn't very usable: unicode strings are the default (rather than byte strings) in Python 3, and all of the Weld string operations assume single-byte characters. You can still get the Weld string operations to work if you force pandas to load strings as byte strings and the data contains no multi-byte characters, but that is counter-intuitive. To really claim Python 3 support there needs to be real unicode support, which raises all kinds of questions:
    • Weld operations work over vectors of fixed-size items. We could convert every Python unicode string from CPython's internal representation to UCS4, giving an array of 4-byte WeldInt/i32 values. The downsides are the overhead of converting the string representation, the extra memory used by a wide fixed-width encoding (especially when most characters are ASCII), the differences between the CPython 2 and 3 APIs for converting Unicode, and ambiguity when decoding (is it an array of ints or an array of UCS4 codepoints?).
    • Alternatively, we could encode all unicode strings as UTF-8 byte strings and keep the current array-of-WeldChar/i8 implementation, but those operations would behave incorrectly (e.g. lengths and slices counted in bytes rather than characters) whenever a string contains multi-byte characters, combining marks, etc. A small sketch comparing the two representations follows this list.
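
To make the trade-off concrete, here is a small sketch in plain Python/NumPy (not Grizzly code) comparing the two representations for a string with one non-ASCII character:

    import numpy as np

    s = u"h\u00e9llo"  # "héllo", 5 characters

    # Option 1: UCS4 codepoints -> fixed-width 4-byte ints (would map to WeldInt/i32).
    ucs4 = np.frombuffer(s.encode("utf-32-le"), dtype=np.uint32)
    print(len(ucs4))   # 5: one element per character, but 4 bytes each

    # Option 2: UTF-8 bytes -> 1-byte values (would map to WeldChar/i8).
    utf8 = np.frombuffer(s.encode("utf-8"), dtype=np.uint8)
    print(len(utf8))   # 6: the accented character takes two bytes, so byte length != character count
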
snakescott commented 7 years ago

@cirla

I will take a look at the Travis config and see if I can puzzle anything out -- maybe hit up Travis experts in Slack if I get stuck.

Unicode seems trickier (and more interesting!). A few thoughts/questions:

  1. Perhaps there's some way to sidestep these issues by pushing the choice to the user and picking sane defaults? The ability to run on Python 3.6 but only on ASCII strings could be a useful incremental step?
  2. Even ASCII strings in Grizzly (perhaps Weld?) seem a bit confusing today. While str does map to WeldVec(WeldChar()), Grizzly represents string arrays as a vector of pointers (#136). So I'm not completely sure which Weld core operations (arithmetic, len, etc.) apply to Python strings in practice, and which apply to vectors of pointers to strings. What do you think? Are there simple Grizzly snippets that expose performance differences between ASCII and unicode?
snakescott commented 7 years ago

For travis, it seems like it should be possible to

  1. delete the python block
  2. install e.g. python 2.7.12 and python 3.6.1 in addons
  3. add the Python version to test as an argument to the test_llvm_version.sh script
  4. move the code from the install block to test_llvm_version.sh, using a virtualenv to handle the choice of Python environment
  5. (Eventually/potentially) figure out conda-forge

If this sounds sufficiently promising, I can work on a PR. I expect it to be possible to merge something like this prior to full Python 3.6 support (just don't add a test_llvm_version.sh invocation for 3.6 until it is ready).

cirla commented 7 years ago

That sounds reasonable; just make sure that the C++ code/shared library is built against the right version of Python for each one.

snakescott commented 7 years ago

Sorry about spamming the ticket with my Travis adventures. I'll leave the issue reference off until it's ready for a PR next time!

snakescott commented 7 years ago

Ah, sorry -- I now have a better handle on the Unicode side of things. The example I was missing was slice, which is unfortunately absent from the language doc. From what I can see in the codebase, NumPy uses UCS4 internally, so maybe that's appropriate for Grizzly? Support for both ASCII (no unnecessary memory tax) and UCS4 (for NumPy unicode compat) might be a good place to start.
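
A quick way to confirm the UCS4 point in plain NumPy (nothing Weld-specific):

    import numpy as np

    arr = np.array([u"abc"])
    print(arr.dtype)           # <U3  (3 code points)
    print(arr.dtype.itemsize)  # 12   (4 bytes per code point)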

If we're comfortable with the memory and conversion overhead, I can look into the CPython unicode API differences as well as the decoding ambiguity.