westurner / pyline

Pyline is a grep-like, sed-like, awk-like command-line tool for line-based text processing in Python. https://pypi.python.org/pypi/pyline
https://pyline.readthedocs.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Support vectorized numpy operations #27

Closed talwrii closed 8 years ago

talwrii commented 8 years ago

Example

seq 10 | pyline -N 'd**2'
seq 100 | pyline -N 'sum(d)'
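For readers unfamiliar with the idea, here is a rough sketch of what expressions like 'd**2' and 'sum(d)' could mean. It assumes that d is bound to the whole input parsed as a float array; the helper name eval_on_array is hypothetical and is not part of pyline.

```python
import io

try:
    import numpy as np
except ImportError:  # numpy is optional in this sketch
    np = None


def eval_on_array(expr, stream):
    """Evaluate expr with d bound to the whole input as a float array.

    Hypothetical helper, not pyline's actual implementation.
    """
    if np is None:
        raise RuntimeError("numpy is required for this sketch")
    d = np.array([float(line) for line in stream if line.strip()])
    # eval() runs arbitrary code; only use it on expressions you typed yourself.
    return eval(expr, {"np": np}, {"d": d})


if np is not None:
    squares = eval_on_array("d**2", io.StringIO("1\n2\n3\n"))
    total = eval_on_array("sum(d)", io.StringIO("1\n2\n3\n"))
```

The point of the vectorized form is that 'd**2' squares every input value in one expression, without an explicit per-line loop.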

This will likely be contentious:

That said, all things being equal, the world might only want one command line tool to quickly run python from the command line.

talwrii commented 8 years ago

See https://github.com/talwrii/pyline/tree/talwrii--numpy--2016-09-21--WIP--issue-27 for a reference implementation.

I wanted these features to exist, so I coded them; if this is deemed an ill fit for pyline then I can pull this into a separate tool without too much work.

Note that this is a work in progress. There is an unresolved question of the numpy dependency:

Opinions?

westurner commented 8 years ago

Are we happy for this tool to have a numpy dependency?

You could add it as an extras_require in setup.py; though I think expecting users to have numpy installed is a fair assumption.
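As a rough illustration, an optional dependency could be declared along these lines. The extra name "numpy" is an assumption for this sketch, not something taken from pyline's actual setup.py.

```python
# Hypothetical setup.py fragment; the extra name "numpy" is an assumption.
EXTRAS_REQUIRE = {
    "numpy": ["numpy"],
}

# In setup.py this mapping would be passed to setuptools, e.g.:
#     setup(name="pyline", ..., extras_require=EXTRAS_REQUIRE)
```

With a declaration like this, users who want the feature could run pip install pyline[numpy], while a plain pip install pyline would stay dependency-free.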

If not, are we happy for certain functionality to have a numpy dependency? If so, how should tests work?

import unittest

try:
    import numpy as np
except ImportError:
    np = None  # numpy is optional; tests can check for None

# ...

unittest.skipIf(np is None, "NumPy is not installed")

So, probably pytest

import pytest

@pytest.mark.skipif(np is None,
                    reason="NumPy is not installed")
def testfuncname():
    # ...
    pass

westurner commented 8 years ago

I'm sort of partial to using full imports in the command (for the sake of making it easier to copy and paste to a program), so, personally, I'd prefer e.g. np.sum() over just sum(); but there may be a justifiable reason to be sage-y?

westurner commented 8 years ago

(I am open to a (first!) PR)

westurner commented 8 years ago

I wanted these features to exist, so I coded them; if this is deemed an ill fit for pyline then I can pull this into a separate tool without too much work.

In terms of scope; IDK about other functions which consume the whole file

westurner commented 8 years ago

You could add it as an extras_require in setup.py; though I think expecting users to have numpy installed is a fair assumption.

If there's an ImportError in the pyline function, $ pyline should return a nonzero exit code.

talwrii commented 8 years ago

Cool cool. It seems like you're open to the idea of this feature.

Numpy dependency while running

No one's going to use the extras_require :). Yep, just erroring out if numpy is missing and --numpy is used seems reasonable, perhaps together with instructions to install numpy.
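A minimal sketch of that behaviour, assuming -N is the short form of --numpy. The function name main, the prog name, and the module_name parameter (there only so the failure path can be exercised) are all hypothetical.

```python
import argparse
import importlib
import sys


def main(argv=None, module_name="numpy"):
    """Return a process exit code; nonzero when --numpy cannot be honoured."""
    parser = argparse.ArgumentParser(prog="pyline-sketch")
    parser.add_argument("-N", "--numpy", action="store_true",
                        help="enable numpy mode (flag spelling assumed)")
    args = parser.parse_args(argv)

    if args.numpy:
        try:
            importlib.import_module(module_name)
        except ImportError:
            # Tell the user how to fix it, then signal failure to the shell.
            print(f"error: --numpy requires {module_name}; "
                  f"try: pip install {module_name}", file=sys.stderr)
            return 1
    return 0
```

A script entry point would pass the return value to sys.exit(), so shell pipelines can detect the failure.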

Numpy dependency while testing

So my concern about skipping tests is that it's an invitation for tests to not be run when they need to be. There appears to be no way to automatically have different requirements for python setup.py test. tests_require would appear to exist solely to mislead! (http://stackoverflow.com/questions/9607565/how-do-i-force-setup-py-test-to-install-dependencies-into-my-virtualenv).

Some options:

Continuous integration

There seems to be some sort of continuous integration (CI) for this, but a brief inspection of the source code failed to tell me how this worked. Is there anything I should bear in mind here?

What numpy things to put into the namespace

Some comments

Opinions? I don't have strong opinions, other than that writing out numpy.blah is a little wordy.

Actions

If you give me some judgement calls then I'll finish off my branch and give you a pull request.

P.S. What does sage-y mean :) ? Do you mean like the computer algebra system or like the accountancy system? I don't use either extensively!

talwrii commented 8 years ago

In terms of scope; IDK about other functions which consume the whole file

More details please!

westurner commented 8 years ago

In terms of scope; IDK about other functions which consume the whole file

More details please!

As-is, pyline iterates through the input without reading the whole file into RAM (thus avoiding issues with memory consumption).

The proposed changes build a list with one entry per line of the input file (which consumes the whole file) and then copy that list into a numpy array.
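To make the contrast concrete, here is a small sketch of the two approaches. Both function names are hypothetical, and neither is pyline's actual code.

```python
import io


def streaming_total(stream):
    """Consume one line at a time; memory use stays roughly constant."""
    total = 0.0
    for line in stream:
        if line.strip():
            total += float(line)
    return total


def whole_input_values(stream):
    """Materialize every line first; memory grows with the input size.

    Converting this list to a numpy array would make a second copy.
    """
    return [float(line) for line in stream if line.strip()]


text = "1\n2\n3\n"
streamed = streaming_total(io.StringIO(text))
loaded = whole_input_values(io.StringIO(text))
```

The streaming version can start before the input ends and works on unbounded pipes; the whole-input version cannot produce anything until it has read everything.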

I'm hesitant to expand the scope of this utility; though I do recognize the usefulness of a --numpy option.

westurner commented 8 years ago

[CI]

[long function names]

  • import numpy; np = numpy
  • from numpy import * could potentially cause namespace collision
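A small sketch of why the collision matters: after from numpy import *, the bare name sum would refer to numpy's function rather than the builtin, and the two behave differently on the same input. The example keeps the explicit np alias so both can be compared side by side.

```python
try:
    import numpy as np  # explicit alias leaves the builtins untouched
except ImportError:
    np = None

nested = [[1, 2], [3, 4]]

# Builtin sum with a start value concatenates the inner lists.
flattened = sum(nested, [])

# numpy's sum adds up every element instead.
grand_total = int(np.sum(nested)) if np is not None else None
```

An expression written against one meaning of sum can silently change behaviour under the other, which is one argument for spelling out np.sum().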

numpy dependency

As-is, pyline requires zero dependencies. The pyline.py file can be copied to sys.path and runs fine.

There is obviously a tradeoff between {zero dependencies, fast-tests} and {third-party library features}.

westurner commented 8 years ago
talwrii commented 8 years ago

TBH, I don't need this functionality

Cool cool. You can have it if you want though :) (the "otherwise" approach seems reasonable).

Okay, unless you tell me otherwise I'm going to extract out a tool called npcli which

If it becomes apparent that other people want this feature then I imagine we can easily finish off this branch and merge.

westurner commented 8 years ago

Cool cool. You can have it if you want though :) (the "otherwise" approach seems reasonable).

:) cool

I suppose, since the tests already require third party packages, requiring numpy for tests and --numpy would be fine

westurner commented 8 years ago

AFAIK, there's still no numpy/__main__.py (for python -m numpy or python -m numpy.cli).

Now that I've added other optional dependencies to pyline, it's unlikely that it would be accepted for inclusion as a numpy CLI module.