Support for the use case to enforce single-line strings, but not multiline strings

karamanolev commented 7 years ago

In our projects, we are trying to enforce always using ' for single line strings and flake8-quotes is perfect for that. However, for multiline strings, we use " for docstrings and ' for regular strings. Right now, they single-line strings and multiline strings emit the same warning, so we can't just ignore the latter. Also, --multiline-quotes allows configurability for that, but it's either ' or ", so in our project we can't use flake8-quotes at the moment. I think a good first step is to emit different errors for single-line and multiline. An optional next step is to have separate configurations for quotes for docstrings and regular multiline strings.

twolfson commented 7 years ago

I'm open to supporting different errors for inline and multiline strings. Would you be interested in writing a PR for that?

I'm on the fence about supporting separate configurations for docstrings and non-docstring multiline strings. It doesn't feel like a common use case =/

bpmcl commented 7 years ago

I'm on the fence about supporting separate configurations for docstrings and non-docstring multiline strings. It doesn't feel like a common use case =/

We are one such use-case. We use the different formats to clearly demarcate docstrings (double-quoted) from all other code, including multiline strings, which are all single-quoted. We have pinned to 0.9.0 to resolve our issues for now.

A potential workaround if this is not a use-case you would like to support is a separate error as suggested by @karamanolev, or an option to disable multiline string parsing. If he is not open to creating a PR for the former option, I can help.

karamanolev commented 7 years ago

@twolfson @bpmcl Thanks. I'll give it a spin in the next day or two and I'll keep you posted. I'll first start with separating the warnings for single line and multiline. The next step would be separate warnings for docstring-style strings and regular multiline. After that it would be easy to make it configurable - my idea is that docstrings get their setting by default from --multiline-quotes, but that can be overwritten using --docstring-quotes. WDYT?

bpmcl commented 7 years ago

@karamanolev That would definitely cover our usage. The different error messages is a good first step, but we would still like to correctly parse multiline strings according to our internal guidelines, so having this be configurable would be ideal.

If @twolfson is happy with this approach, I would be a willing tester for your PR.

karamanolev commented 7 years ago

@bpmcl @twolfson What's a good way to have a discussion about the way to implement this?

twolfson commented 7 years ago

@karamanolev @bpmcl Sorry, I misspoke last night due to being tired and rushed on other tasks. I'm open to supporting an option like --docstring-quotes but I think it will be difficult with our current implementation. That is tokenize doesn't seem to easily support detection for functions:

https://docs.python.org/3/library/tokenize.html

If you can find a way to achieve that easily or a library to replace tokenize that supports our current implementation easily, then I'm glad to support it.

With respect to separate inline/multiline errors, it's probably easiest to iterate on a PR so we can comment directly on the code

fxfitz commented 7 years ago

FWIW, we're hitting this problem now with 0.10.x. We would also like to see this added. :-)

/me pins 0.9.0 for now.

karamanolev commented 7 years ago

This is a naive first pass at the first feature. There is a somewhat longer explanation in the PR itself. Please let me know what you think: https://github.com/zheller/flake8-quotes/pull/57

karamanolev commented 7 years ago

Keeping the discussion about the specific PR over there, I think this is a good place to discuss proper support for docstrings. What I've found out so far is that the ast module can be helpful, but doesn't solve all our problems. You can get the docstring for a node using get_docstring, but it's just returned as a string. You don't get the AST node, so it's basically useless. As a result that, we'll need to check what is a docstring ourselves. I think this boils down to checking if the first expression within a module, class and function is a ast.Str and if so - treat it as the docstring. The worst of all, for a Str node you can only access the .s attribute to get the string value, but this is completely removed from the actual source code - as far as I can tell, there is no way to get syntactic information like what quotes were used. The only workaround I can think of is:

Use ast to parse the file into a tree, go over the source code and get a mapping of (lineno, offset) -> (string meta information). For now I think the meta information is only going to contain whether it's a docstring or not. In the future, if we need, we can extract more from the AST. We can then go over the tokenized form like we do right now and consult the prebuilt meta information to check how to treat the string.

twolfson commented 7 years ago

I think that strategy seems reasonable and was kind of expecting something like that. In JavaScript, there are typically libraries which interlink the token tree with the AST for this very purpose:

https://github.com/millermedeiros/rocambole

Another library to check would be flake8 itself. I think they've introduced more parsing options since we started using it so there might be an API we can leverage:

http://flake8.readthedocs.io/en/latest/plugin-development/plugin-parameters.html

twolfson commented 7 years ago

We have updated multliline errors from Q000 to Q001 via @karamanolev in #57 and released it in 0.11.0. I'm going to leave this PR open as it sounds like we still want to support --docstring-quotes

karamanolev commented 7 years ago

Guys, I've run into this Python bug http://bugs.python.org/issue16806 . It seems than other than extremely ugly workarounds, we won't be able to do much:

Use the token stream to detect docstring ourselves. I think this is relatively difficult and completely out of scope in this project.
Match the strings by lineno + content, which is going to work for 99.9% of the cases, but can be broken with something like:


    class Cls:
        "hello" 'hello' '''hello'''
        pass

Any ideas?

twolfson commented 7 years ago

Could you elaborate more on using the token stream to detect docstrings?

I concur that the lineno + content matching is probably going to cause use more issues than it's worth

karamanolev commented 7 years ago

Here's a proposal: For the module docstring it's easy - in the start of the token stream, ignore all NL, NEWLINE, INDENT tokens. If you hit a STRING token, that's the docstring. For classes and functions it's a bit more involved - look for a NAME token with a value 'class' or 'def'. They aren't valid identifiers, so they can't be anything else. After that token look for an OP token :. It's required to be present and it can't mean anything but start of the body of the class/function. After the :, ignore all NEWLINE, NL, INDENT tokens. If you hit a STRING, that's the docstring. If not, there's no docstring. I'll give it a shot and report if it works.

karamanolev commented 7 years ago

A sample implementation with some more comments is in a PR: https://github.com/zheller/flake8-quotes/pull/59

twolfson commented 6 years ago

The initial issue was resolved a while ago (#57). It looks like we left this open for docstrings which has been resolved via #59 and #73, released in 1.0.0.

zheller / flake8-quotes

Support for the use case to enforce single-line strings, but not multiline strings #56