winnow-rs / winnow

Making parsing a breeze
https://docs.rs/winnow
Other
572 stars 44 forks source link

`separated{0,1}` could accept non-consuming `sep` if `parser` is consuming. #325

Closed nicopap closed 9 months ago

nicopap commented 1 year ago

Please complete the following tasks

winnow version

0.5.15

Describe your use case

I have:

I want to use separated{0,1} such that sep is the separator and parser the element.

Currently, such a setup may panic in winnow if sep matches 0 tokens.

Click for minimal reproducible example ```rust use winnow::ascii::{alphanumeric1, multispace0}; use winnow::combinator::{alt, delimited, separated0}; use winnow::{PResult, Parser}; fn token_tree(input: &mut &str) -> PResult<()> { let token_trees = separated0::<_, _, (), _, _, _, _>(token_tree, multispace0); let mut parser = alt(( alphanumeric1.void(), delimited('(', delimited(multispace0, token_trees, multispace0), ')').void(), )); parser.parse_next(input) } fn main() { let separated = separated0::<_, _, (), _, _, _, _>; let text = std::hint::black_box(" hello (world(hello) world) ()(hello) world "); let parser = separated(token_tree, multispace0).recognize(); let parsed = delimited(multispace0, parser, multispace0).parse(text); assert_eq!(parsed, Ok("hello (world(hello) world) ()(hello) world")); } ```

Describe the solution you'd like

The implementation is trivial.

In the implementation of separated1 (and separated0) functions, move the if i.eof_offset() == len block to within the parser.parse_next(i) match, just before the res.accumulate(o), so that the infinite loop check accounts for the option of the parser being consuming, rather than just the sep.

This successfully prevent infinite loops, as we still guarantee the parser advances and never reaches a fixed point.

A side effect is that it makes rustfmt happy and reduces 1 level of indentation.

Alternatives, if applicable

For my specific use-case, the sep in question are whitespaces, and all I need to do is trim the parser output. I also am considering implementing a streaming tokenizer through the Stream trait to not have to care about whitespaces (and comments) in my parser.

Additional Context

When seeing separated, people often think of comma-separated list of elements as you'd see them in ALGOL-like languages. But I think whitespace delimited languages (say LISP) it makes sense to have a whitespace separator.

This use-case came in as I was designing a simple version of the rust TokenTree syntax element.

epage commented 1 year ago

This also recently came up in nom's issue tracker (rust-bakery/nom#1691) which was also discussed in the chat (https://matrix.to/#/!siAWDgffrOjdrhkczN:gitter.im/$dwtC3OMAmaJi7R5gspxJrPdujXnnb60J9gDk3v1bF6k?via=gitter.im&via=matrix.org&via=xiretza.xyz)

For easier reference for others, the relevant code is here: https://github.com/winnow-rs/winnow/blob/main/src/combinator/multi.rs#L303

epage commented 1 year ago

If I'm understanding your proposal, it is taking it from "sep must not be empty" to "(sep or parser) must not be empty".

On the surface that seems reasonable. There is a less direct, human aspect that I'm concerned about which is by having the assert fire less often, it'll make it less obvious and be a bug lurking in programs.

epage commented 9 months ago

This was also discussed in #365.

I'm going to close this. Higher-level built-in parsers are provided for convenience and don't have to be all things to all people. Rather than having logic that might be more difficult to predict, I think we should be consistent.

If there is a reason we should re-evaluate this, let us know!