Make PySPQR into an installable Python module

yig / PySPQR

Python wrapper for the sparse QR decomposition in SuiteSparseQR.

Creative Commons Zero v1.0 Universal

34 stars 27 forks source link

Make PySPQR into an installable Python module #5

Closed Technologicat closed 7 years ago

Technologicat commented 7 years ago

Attached is a PR to make PySPQR into an installable Python module. This is especially convenient when it is needed by several local projects, as it is then enough to install it once, avoiding the need to copy files around.

I have picked the name "sparseqr" (which can still be changed if you want), because the name "spqr" is already taken on PyPI for a module supporting Roman numerals. The project title "pyspqr" is confusingly close to the existing "spqr", on PyPI the "py" prefix would be somewhat redundant, and also in the conventional all-lowercase, it becomes much less readable.

In my opinion, in the long run, the optimal solution would be to merge PySPQR into scikit-sparse. The library already wraps CHOLMOD, and depends on SuiteSparse just like PySPQR does, but it currently has no active maintainer.

So, as an interim solution, this PR makes PySPQR installable separately. Install instructions in README.md have been updated.

This PR is sufficient to support setuptools, i.e. python setup.py install (with various options). If it is desirable to have PySPQR available also on PyPI, there are basically two options - either you can upload it as the original author, or I can take up the packaging.

I took the liberty to mark this version as v0.1.0, since with setuptools it is highly recommended to have a version number. The platform has been marked as Linux, as that is what I'm testing on, but it will probably work on any platform.

yig commented 7 years ago

This looks very nice. I can accept the pull request. Do you think I should rename the git repository "sparseqr"? As a git repository, the name becomes ambiguous without a "py" in front. I can also rename the spqr.py modules inside.

Note that this does not work for me on macOS. It seems to install the module fine, but runs into a permission error when installing the test data:

running install_data
creating /usr/local/test
error: could not create '/usr/local/test': Permission denied

I will look into fixing this before I pull.

Also, I think this should be called version 1.0, not 0.1.0.

I agree that in the long run this would find a better home inside scikit-sparse. I took a different approach than they do to wrapping the C library by using cffi.

yig commented 7 years ago

It seems like package_data would be better than data_files for the test and README. Could you please make the change?

Technologicat commented 7 years ago

I think it is fine to leave the repository name as-is. Any existing users (and their friends) expect to find the project by that name.

A usage example in the README (and an included test.py for good measure) should be sufficient to tell users how to import and call the library, even if the project name and the Python package name differ. The various scikits already do this (e.g. project scikit-sparse vs. corresponding Python package sksparse).

Good point about GitHub not being Python-specific; also, to be fair, there are quite many libraries on PyPI starting or ending with a redundant "py", so it's not strictly necessary to get rid of it.

I'm not sure if "sparseqr" is even the optimal choice as the Python package name for PySPQR - do whatever you like :)

In my opinion, for the user it only matters that the package name is somewhat sensible (to maximize the first-glance readability of the source code, following the Python spirit), and that the functions are callable directly under that. There is no need to present nested namespaces to the user, as this is a small library. Internally, the code can of course be organized in whatever way is the most logical and maintainable; we can make the package-level __init__.py expose the public API (as the code currently in the PR already does).

To me the error sounds like setuptools on Mac OS is trying to install non-package data files (and directories) directly under /usr/local. Ouch, and thanks for the heads-up. Probably all of my own packages are similarly broken on Mac OS at the moment, as is my setuptools template.

For the record, on Linux (Mint), this works differently: Python (3.4) packages install into /usr/local/lib/python3.4/dist-packages, with individual per-package directories, such as sparseqr, under that. Then test/*, specified in data_files, becomes /usr/local/python3.4/dist-packages/sparseqr/test/*.

I was specifically using data_files instead of package_data, because code-wise, the README and the test/example script are technically not part of the library (Python package) itself, and yet they should be included in an sdist of the project.

Maybe using package_data is the easiest option that works on both OSs - I'll change this.

I'll also change the version as requested. The 0.1.0 was an instinctive default I use for packaging new projects - but PySPQR already does pretty much everything it wants to, so indeed 1.0.0 may be better.

Updated PR to follow soon.

Technologicat commented 7 years ago

...but before updating the PR, one more comment:

About scikit-sparse, I also took a quick look at their repository before deciding to package PySPQR separately.

As you already said, the approaches to wrapping C code are different; specifically, Cython vs. cffi. Also, I noticed they have a pending pull request that has conflicts with the base branch, which may be somewhat time-consuming to figure out without a maintainer already familiar with the project.

I wonder, in the long term, what would be the best way forward:

just glue the libraries together (as separate submodules),
re-wrap CHOLMOD using cffi, or
re-wrap SuiteSparseQR using Cython?

Solution 1 is ugly, since the code then depends on two independent systems to wrap (and load) the same C library. It also introduces unnecessary duplication of the data conversion routines that convert between SuiteSparse and NumPy/SciPy formats, since each of the two wrapper libraries has its own implementation of these.

Solution 2 discards a tested, rather thoughtfully written (e.g. shared allocator for the triplet vectors of the same sparse matrix!) and most importantly fully working wrapper library that users already depend on.

Solution 3 similarly discards a rather simple and elegant solution, similarly already tested and fully working.

yig commented 7 years ago

I'll think about the name a bit more. sparseqr is fine, as would be pysuitesparseqr. I do think we should rename the internal files from spqr.py to whatever we choose, because I like the simplicity and usability of being able to copy the two files inside one's own codebase and avoid package management entirely. (I thought about combining all the files together into a single python file. I think pip or PyPI may have a special, simple case for single file modules.)

Regarding the long term, I don't know what the best way forward is. I think it's up to whoever chooses to maintain scikit-sparse (or scipy, if they want to take this on). I don't really have a problem with there being two entirely separate libraries, so long as they can be found and used. It's somewhat suboptimal, but I already use scipy and cvxopt a lot and have to convert matrix formats. If cvxopt or scikit-sparse ever stopped working (bitrot) and I needed CHOLMOD, I would probably add it to this library using cffi.

Technologicat commented 7 years ago

Naming of the internal files, good point. To support both use cases, the names of the package and the main module indeed need to match.

If you want an opinion, I think sparseqr is both descriptive and short to type. I prefer to have libraries available on PyPI, or at least to come with a setuptools install script, so I can keep the git repos of my own projects clean of any library code, just declaring a dependency in the README and possibly setup.py if the project has one.

While at it, perhaps we could also get rid of the qr prefix in qr_solve() and qr_solve_sparse()? Thinking on this, given Python's aggressively explicit namespacing, the qr in the solver function names (which I did originally put there) is redundant. The name of qr() itself is fine, as it describes what the function does.

Digging a bit, now I remember another reason why I originally steered clear of package_data. It behaves somewhat unintuitively:

Lies, More Lies and Python Packaging Documentation on package_data

Searching the web for _packagedata setuptools turns up also:

How to include package data with setuptools/distribute? How can I include package_data without a MANIFEST.in file? JonathonReinhart/setuptools-package_data-issue pypa/sampleproject: neither package_data or data_files get installed

It seems that to include data files in both sdist and bdist, writing a MANIFEST.in is the way to go:

Install package data with setuptools I don't understand python MANIFEST.in

Then there are at least pbr and Flit offering a simplified interface for Python packaging, but for this higher level of abstraction no single dominant tool seems to have emerged yet, so it's probably better for now to stay on setuptools.

Technologicat commented 7 years ago

Ok, investigation complete, and PR updated. Version bumped to 1.0.0, and install on Mac OS hopefully fixed. Please test the updated version.

There is still the module and function renaming to do - if we agree on the names, I could do that before you pull?

One final detail is that, in order to have GitHub detect and display the license type automatically, I would like to include a LICENSE.md, if that's ok. This seems to be the official text for CC0 1.0.

About the installation issue, this SO comment confirms that with the previous setup.py, setuptools was trying to install the files declared as non-package data files directly under prefix, because I had specified their paths as relative.

By default, in Linux (Mint) each Python package gets its own prefix, which is why it worked for me. In the case of sparseqr, this is .../lib/python3.4/site-packages/sparseqr-1.0.0-py3.4-linux-x86_64.egg. Here ... is either $HOME/.local or /usr/local, depending on whether or not the --user flag was given to python3 setup.py install.

On Mac OS, based on your report, it sounds like all packages share the same prefix /usr/local. Of course, this behavior is more in line with what prefix usually means in *nix environments.

As the SO comment notes, non-package data files have no natural install location. So, the current solution simply includes them in the sdist (by specifying them in MANIFEST.in), but leaves them out when installing. I think from the user's perspective, this should be ok.

yig commented 7 years ago

I think at this point we can radically simplify the setup.py so that the entirety of the file is:

from setuptools import setup

setup(
    name = "sparseqr",
    version = "1.0.0",
    author = "Yotam Gingold and Juha Jeronen",
#    author_email = "juha.jeronen@tut.fi",
    url = "https://github.com/yig/PySPQR",

    description = "Python wrapper for SuiteSparseQR",
    long_description = """This module wraps the SuiteSparse QR decomposition and QR-based sparse linear solver functions for use with SciPy.

This is Matlab's sparse `[Q,R,E] = qr()`.

The solvers work also for overdetermined linear systems, making them useful for solving linear least-squares problems.

Supports Python 2.7 and 3.4.
""",

    license = "Public Domain CC0",

    # free-form text field; http://stackoverflow.com/questions/34994130/what-platforms-argument-to-setup-in-setup-py-does
    platforms = ["any"],

    # See
    #    https://pypi.python.org/pypi?%3Aaction=list_classifiers
    #
    # for the standard classifiers.
    #
    classifiers = [ "Development Status :: 4 - Beta",
                    "Environment :: Console",
                    "Intended Audience :: Developers",
                    "Intended Audience :: Science/Research",
                    "License :: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication",
                    "Operating System :: POSIX :: Linux",
                    "Programming Language :: Python",
                    "Programming Language :: Python :: 2",
                    "Programming Language :: Python :: 2.7",
                    "Programming Language :: Python :: 3",
                    "Programming Language :: Python :: 3.4",
                    "Topic :: Scientific/Engineering",
                    "Topic :: Scientific/Engineering :: Mathematics",
                    "Topic :: Software Development :: Libraries",
                    "Topic :: Software Development :: Libraries :: Python Modules"
                  ],

    # See
    #    http://setuptools.readthedocs.io/en/latest/setuptools.html
    #
    setup_requires = ["cffi>=1.0.0"],
    cffi_modules = ["sparseqr/spqr_gen.py:ffibuilder"],
    install_requires = ["numpy", "scipy", "cffi>=1.0.0"],
    provides = ["sparseqr"],

    # keywords for PyPI (in case you upload your project)
    #
    # e.g. the keywords your project uses as topics on GitHub, minus "python" (if there)
    #
    keywords = ["suitesparse bindings wrapper scipy numpy qr-decomposition qr-factorisation sparse-matrix sparse-linear-system sparse-linear-solver"],

    # Declare packages so that  python -m setup build  will copy .py files (especially __init__.py).
    #
    # This **does not** automatically recurse into subpackages, so they must also be declared.
    #
    packages = ["sparseqr"],

    zip_safe = False  # includes a binary extension, not zip safe
)

yig commented 7 years ago

Sure, we can/should add a LICENSE. Here is what I've used for CC0 in the past:

https://github.com/yig/rig_converter/blob/master/LICENSE

yig commented 7 years ago

Regarding renaming the functions themselves, we should probably follow the C API itself, which means:

qr() -> QR()
qr_solve() -> backslash()
qr_solve_sparse() -> backslash_sparse()

Technologicat commented 7 years ago

Re setup.py simplification, the code detecting data_files is indeed no longer needed here, so well spotted, and good riddance. :)

The reason behind the proposed automatic detection for package __version__ was to declare the version only once, eliminating a common source of packaging mistakes, at the cost of some added complexity in setup.py. In my own projects, when preparing a new release, I always forget to update the version number somewhere, unless it's declared in one place only.

As many things, it's a tradeoff - if you think it is better to keep setup.py as simple as possible, and specify the version manually in both places, that is also fine.

I've 'pulled' your version of setup.py from the previous comment. I left in the character coding declaration and the __future__ imports, but otherwise the version of setup.py currently in the PR is identical to the one posted above.

Re LICENSE text, thanks for the file! Added to the PR as LICENSE.md. (.md because also README is .md, and README/LICENSE/CHANGELOG/AUTHORS usually come as a set in the same format.)

Speaking of which, are there any other contributors to name? Asking just to make sure; there weren't any on the GitHub network graph.

Do you want your email address in setup.py, or should we put mine, or leave it blank? In any case, it has a link to the project's GitHub page, which I think is sufficient as a minimal kind of contact information.

Re naming the functions, while transparency in naming does have its merit, there are a couple of contrasting considerations:

PySPQR extends SciPy's sparse solver functionality, and qr_solve() is a partial drop-in replacement for SciPy's spsolve(): its most important arguments A and b are compatible, have the same name, and are specified in the same order. Thus qr_solve() -> spsolve() would be what the average SciPy user expects. (Above I suggested solve(), but upon a closer look that's not consistent; both NumPy and SciPy use solve() only in the case where A is a dense matrix.)
We currently have two different sparse solver functions in the public API (necessitating two different names), SciPy has just one. This hints that it would be cleaner if we also had just one. We can dispatch internally based on whether b is sparse, i.e. isinstance(b, scipy.sparse.spmatrix).
Backslash is a MATLAB concept that does not exist in Python. Consider that NumPy exposes solve() instead of dgesv(), although it internally uses LAPACK.
In my opinion, to follow Python conventions, we should avoid uppercase; to a Python programmer, QR() sounds like a constructor for a QR object, because in Python everything except class names is usually lowercase. qr() is the standard pythonic spelling, regardless that it does a QR (uppercase) decomposition.

Technologicat commented 7 years ago

Tested to make sure: isinstance() against scipy.sparse.spmatrix correctly reports all the sparse matrix types in SciPy:

import numpy as np
import scipy.sparse

A_dense = np.eye(5)  # can use any dense matrix
for clsname in [s for s in dir(scipy.sparse) if s.endswith('_matrix')]:
    clsobj = getattr(scipy.sparse, clsname)
    A_sparse = clsobj(A_dense)
    print( isinstance(A_sparse, scipy.sparse.spmatrix) )  # True

Technologicat commented 7 years ago

I'd like to finish this, for which I need two decisions from you as the primary author of the project:

1: API

option a): expose two functions: qr and solve. In solve, detect type of b and dispatch automatically.
option b): expose three functions: QR, backslash, and backslash_sparse.
option c): other, please specify?

2: Module name

option a): sparseqr
option b): pysuitesparseqr
option c): other, please specify?

Once I have the decisions, I'll then rename the functions and the module, and update the PR to a final state that should be suitable for merging.

As for the other questions I had, I'll default to "no other contributors", and leave the email address field blank, unless otherwise specified.

yig commented 7 years ago

Thanks for your work on this.

For the API, I prefer option (a). qr() and solve() are straightforward names. I don't have a problem with solve() even if scipy uses spsolve() for sparse matrices. This entire package is about sparse matrices. For what it's worth, cvxopt's interface to UMFPACK uses solve().
Let's use the module name sparseqr. pysuitesparseqr is too long :-(.

The current setup.py in your repository is good. I prefer to err on the side of simplicity.

There are no other authors. I don't know what to say about the email field. I guess blank is fine. If they are wondering about the code, I'd answer the question. If they are wondering about the packaging, you'd answer the question. I don't know which is more likely. Since there is a link to the github, it is fine.

Technologicat commented 7 years ago

Oops, I meant to say spsolve (for consistency with SciPy), but solve is also fine. Good point about UMFPACK in cvxopt, shows that other libraries aren't obsessive about naming consistency with SciPy either. Going for option a) and solve.
sparseqr it is.

Ok, let's keep that version of setup.py, maybe just ditch the commented-out email line. As you probably noticed, I usually err on the side of slightly over-engineered automation :)

Authors and blank email, ok.

I'll update the PR soon-ish (probably tomorrow).

Technologicat commented 7 years ago

PR updated. Overview of everything so far:

Installable!
Has a version number, v1.0.0 (specified in setup.py and in sparseqr/__init__.py)
Has a LICENSE.md for GitHub to pick up
The module is now sparseqr
The exported functions are qr() and solve()
solve() dispatches automatically based on whether b is dense or sparse

Details of the latest changes in the log as usual.

Final things

Maybe this is getting silly, but I find there is one minor thing remaining:

There is a third function in the public API, namely permutation_from_E(), because it is needed for use with the output from qr(). However, its name is somewhat long and obscure.

We are already breaking backward compatibility in terms of naming of functions and modules, so if we want to rename the function, it would be best to do this before we release 1.0.0. (And we should tag and release it when we're done, to make the version number actually mean something. :) )

After that, any name changes would have to wait until 2.0.0 (if we follow semantic versioning), and considering that PySPQR in its current state is pretty much complete, there might never be a need to release a 2.0.0.

So, what to use as a name? This is a general routine to convert a permutation vector to a permutation matrix, not specific to the E returned by qr().

permutation_vector_to_matrix() is fully descriptive, but very long.
permutation_as_matrix()? Slightly long, but somewhat descriptive.
matrixify() or similar would be misleading, as this operation is specific to having a permutation vector as input.
Make qr() return the permutation matrix, and remove this function. In my opinion, not good - the matrix is more easily constructed from the vector than vice versa, and the vector is needed more often (as in the usage examples).
Not so important, leave as-is?

Then, a final final thing would be to convert the docstrings to follow the NumpyDoc format, for consistency with NumPy, SciPy and the scikits.

The NumpyDoc format also has the bonus that the Spyder IDE pretty-renders it, making the documentation easier to read for Spyder users.

Also, the docstrings should have max. 75 characters per line to make them readable in (standard default size) text terminals. Currently they wrap in an unreadable manner - some (maybe much) of this is to blame on my habit of maximizing my editor and terminal windows.

But I think the docstring update can wait until a 1.0.1 dedicated to that.

yig commented 7 years ago

The changes look great. I agree with changing permutation_from_E(). It is long and not really descriptive. I like descriptive names, so I prefer permutation_vector_to_matrix().

I agree with waiting to change the docstrings for a later minor version.

Technologicat commented 7 years ago

Descriptivized.

Maybe this is now complete? :)