winnow-rs / winnow

Making parsing a breeze
https://docs.rs/winnow
Other
525 stars 40 forks source link

offset position error #426

Closed baoyachi closed 8 months ago

baoyachi commented 8 months ago

Please complete the following tasks

rust version

rustc 1.74.0 (79e9716c9 2023-11-13)

winnow version

0.5.34

Minimal reproducible code

use winnow::ascii::{alpha1, digit1, multispace0};
use winnow::{Parser, PResult};
use winnow::combinator::{delimited, opt, repeat};

fn main() {
    let input1 = r#"(123,456,789)
    (000,111,22~~~2)
    "#;
    let offset1 = parse_digits.parse(input1).err().unwrap().offset();
    dbg!(offset1);// offset position error

    let input2 = r#"(123,456,789)
    (000,1~~~11,222)
    "#;
    let offset2 = parse_digits.parse(input2).err().unwrap().offset();
    dbg!(offset2);// offset position error
    assert_eq!(offset1, offset2) 
}

fn parse_digits<'a>(input: &mut &'a str) -> PResult<Vec<Vec<&'a str>>> {
    delimited(
        multispace0,
        repeat(1.., digit_group),
        multispace0,
    )
        .parse_next(input)
}

fn digit_group<'a>(input: &mut &'a str) -> PResult<Vec<&'a str>> {
    delimited(
        (multispace0, '(', multispace0),
        repeat(1.., (digit1, opt(",")).map(|x| x.0)),
        (multispace0, ')', multispace0),
    )
        .parse_next(input)
}

Steps to reproduce the bug with the above code

cargo run

Actual Behaviour

Offset always returns 18

+[src/main.rs:10] offset1 = 18
+[src/main.rs:10] offset2 = 18

input1 and input2 unexpectedly returned to the same position, which should be a bug.

Expected Behaviour

Should return true offset

Additional Context

No response

epage commented 8 months ago

I've added parse_nexts to help demonstrate what is happening

#!/usr/bin/env nargo
---
[dependencies]
winnow.path = "../winnow"
---

use winnow::ascii::{alpha1, digit1, multispace0};
use winnow::combinator::{delimited, opt, repeat};
use winnow::{PResult, Parser};

fn main() {
    let input1 = r#"(123,456,789)
    (000,111,22~~~2)
    "#;
    let offset1 = parse_digits.parse(input1).err().unwrap().offset();
    dbg!(offset1); // offset position error

    let mut input1 = r#"(123,456,789)
    (000,111,22~~~2)
    "#;
    parse_digits.parse_next(&mut input1);
    dbg!(input1);

    let input2 = r#"(123,456,789)
    (000,1~~~11,222)
    "#;
    let offset2 = parse_digits.parse(input2).err().unwrap().offset();
    dbg!(offset2); // offset position error

    let mut input2 = r#"(123,456,789)
    (000,1~~~11,222)
    "#;
    parse_digits.parse_next(&mut input2);
    dbg!(input2);

    assert_eq!(offset1, offset2);
}

fn parse_digits<'a>(input: &mut &'a str) -> PResult<Vec<Vec<&'a str>>> {
    delimited(multispace0, repeat(1.., digit_group), multispace0).parse_next(input)
}

fn digit_group<'a>(input: &mut &'a str) -> PResult<Vec<&'a str>> {
    delimited(
        (multispace0, '(', multispace0),
        repeat(1.., (digit1, opt(",")).map(|x| x.0)),
        (multispace0, ')', multispace0),
    )
    .parse_next(input)
}
$ ./winnow-426.rs
[/home/epage/.cargo/target/05/6398075af02aec/winnow-426.rs:16] offset1 = 18
[/home/epage/.cargo/target/05/6398075af02aec/winnow-426.rs:22] input1 = "(000,111,22~~~2)\n    "
[/home/epage/.cargo/target/05/6398075af02aec/winnow-426.rs:28] offset2 = 18
[/home/epage/.cargo/target/05/6398075af02aec/winnow-426.rs:34] input2 = "(000,1~~~11,222)\n    "

When encountering the second group, digit_group fails. The repeat inside of parse_digits then backtracks, finds the repeat is satisfied, and then finds parse_digits is satisfied. parse_next then succeeds, pointing at the start of the second group. The error then comes from the implicit eof that is a part of parse.

You can verify this by running with the debug feature

$ nargo run --manifest-path winnow-426.rs -F winnow/debug
> delimited                                                                  | "(123,456,789)\n    (000,111,22~~~2)\n
 > multispace0                                                               | "(123,456,789)\n    (000,111,22~~~2)\n
  > take_while                                                               | "(123,456,789)\n    (000,111,22~~~2)\n
  < take_while                                                               | +0
 < multispace0                                                               | +0
 > repeat                                                                    | "(123,456,789)\n    (000,111,22~~~2)\n
  > delimited                                                                | "(123,456,789)\n    (000,111,22~~~2)\n
   > multispace0                                                             | "(123,456,789)\n    (000,111,22~~~2)\n
    > take_while                                                             | "(123,456,789)\n    (000,111,22~~~2)\n
    < take_while                                                             | +0
   < multispace0                                                             | +0
   > one_of                                                                  | "(123,456,789)\n    (000,111,22~~~2)\n
    > any                                                                    | "(123,456,789)\n    (000,111,22~~~2)\n
    < any                                                                    | +1
    | verify                                                                 |
   < one_of                                                                  | +1
   > multispace0                                                             | "123,456,789)\n    (000,111,22~~~2)\n
    > take_while                                                             | "123,456,789)\n    (000,111,22~~~2)\n
    < take_while                                                             | +0
   < multispace0                                                             | +0
   > repeat                                                                  | "123,456,789)\n    (000,111,22~~~2)\n
    > digit1                                                                 | "123,456,789)\n    (000,111,22~~~2)\n
     > take_while                                                            | "123,456,789)\n    (000,111,22~~~2)\n
     < take_while                                                            | +3
    < digit1                                                                 | +3
    > opt                                                                    | ",456,789)\n    (000,111,22~~~2)\n
     > tag                                                                   | ",456,789)\n    (000,111,22~~~2)\n
     < tag                                                                   | +1
    < opt                                                                    | +1
    > digit1                                                                 | "456,789)\n    (000,111,22~~~2)\n    "
     > take_while                                                            | "456,789)\n    (000,111,22~~~2)\n    "
     < take_while                                                            | +3
    < digit1                                                                 | +3
    > opt:1                                                                  | ",789)\n    (000,111,22~~~2)\n    "∅
     > tag                                                                   | ",789)\n    (000,111,22~~~2)\n    "∅
     < tag                                                                   | +1
    < opt:1                                                                  | +1
    > digit1                                                                 | "789)\n    (000,111,22~~~2)\n    "∅
     > take_while                                                            | "789)\n    (000,111,22~~~2)\n    "∅
     < take_while                                                            | +3
    < digit1                                                                 | +3
    > opt:2                                                                  | ")\n    (000,111,22~~~2)\n    "∅
     > tag                                                                   | ")\n    (000,111,22~~~2)\n    "∅
     < tag                                                                   | backtrack
    < opt:2                                                                  | +0
    > digit1                                                                 | ")\n    (000,111,22~~~2)\n    "∅
     > take_while                                                            | ")\n    (000,111,22~~~2)\n    "∅
     < take_while                                                            | backtrack
    < digit1                                                                 | backtrack
   < repeat                                                                  | +11
   > multispace0                                                             | ")\n    (000,111,22~~~2)\n    "∅
    > take_while                                                             | ")\n    (000,111,22~~~2)\n    "∅
    < take_while                                                             | +0
   < multispace0                                                             | +0
   > one_of                                                                  | ")\n    (000,111,22~~~2)\n    "∅
    > any                                                                    | ")\n    (000,111,22~~~2)\n    "∅
    < any                                                                    | +1
    | verify                                                                 |
   < one_of                                                                  | +1
   > multispace0                                                             | "\n    (000,111,22~~~2)\n    "∅
    > take_while                                                             | "\n    (000,111,22~~~2)\n    "∅
    < take_while                                                             | +5
   < multispace0                                                             | +5
  < delimited                                                                | +18
  > delimited                                                                | "(000,111,22~~~2)\n    "∅
   > multispace0                                                             | "(000,111,22~~~2)\n    "∅
    > take_while                                                             | "(000,111,22~~~2)\n    "∅
    < take_while                                                             | +0
   < multispace0                                                             | +0
   > one_of                                                                  | "(000,111,22~~~2)\n    "∅
    > any                                                                    | "(000,111,22~~~2)\n    "∅
    < any                                                                    | +1
    | verify                                                                 |
   < one_of                                                                  | +1
   > multispace0                                                             | "000,111,22~~~2)\n    "∅
    > take_while                                                             | "000,111,22~~~2)\n    "∅
    < take_while                                                             | +0
   < multispace0                                                             | +0
   > repeat                                                                  | "000,111,22~~~2)\n    "∅
    > digit1                                                                 | "000,111,22~~~2)\n    "∅
     > take_while                                                            | "000,111,22~~~2)\n    "∅
     < take_while                                                            | +3
    < digit1                                                                 | +3
    > opt                                                                    | ",111,22~~~2)\n    "∅
     > tag                                                                   | ",111,22~~~2)\n    "∅
     < tag                                                                   | +1
    < opt                                                                    | +1
    > digit1                                                                 | "111,22~~~2)\n    "∅
     > take_while                                                            | "111,22~~~2)\n    "∅
     < take_while                                                            | +3
    < digit1                                                                 | +3
    > opt:1                                                                  | ",22~~~2)\n    "∅
     > tag                                                                   | ",22~~~2)\n    "∅
     < tag                                                                   | +1
    < opt:1                                                                  | +1
    > digit1                                                                 | "22~~~2)\n    "∅
     > take_while                                                            | "22~~~2)\n    "∅
     < take_while                                                            | +2
    < digit1                                                                 | +2
    > opt:2                                                                  | "~~~2)\n    "∅
     > tag                                                                   | "~~~2)\n    "∅
     < tag                                                                   | backtrack
    < opt:2                                                                  | +0
    > digit1                                                                 | "~~~2)\n    "∅
     > take_while                                                            | "~~~2)\n    "∅
     < take_while                                                            | backtrack
    < digit1                                                                 | backtrack
   < repeat                                                                  | +10
   > multispace0                                                             | "~~~2)\n    "∅
    > take_while                                                             | "~~~2)\n    "∅
    < take_while                                                             | +0
   < multispace0                                                             | +0
   > one_of                                                                  | "~~~2)\n    "∅
    > any                                                                    | "~~~2)\n    "∅
    < any                                                                    | +1
    | verify                                                                 | backtrack
   < one_of                                                                  | backtrack
  < delimited                                                                | backtrack
 < repeat                                                                    | +18
 > multispace0                                                               | "(000,111,22~~~2)\n    "∅
  > take_while                                                               | "(000,111,22~~~2)\n    "∅
  < take_while                                                               | +0
 < multispace0                                                               | +0
< delimited                                                                  | +18
> eof                                                                        | "(000,111,22~~~2)\n    "∅
< eof                                                                        | backtrack

What you need to do is to add cut_err calls for where it should disable backtracking. Likely this would be on the repeat inside of digit_group and the closing delimiter.

baoyachi commented 8 months ago

Thx @epage . "cut_err" is solvable, but during the process of analyzing the grammar, it's crucial to accurately identify which parts are disable backtracking. If "cut_err" is employed, manual marking of this portion of work is required, which does make the usage somewhat cumbersome.

epage commented 8 months ago

If you have ideas for another strategy, I would be open to it.

Some examples of what rely on backtracking

A rough approximation of combine is that it took the approach of making Cut the default and requires opting in to Backtrack. I said roughly which makes the above combinators work in some circumstances and then fail in others. Backtrack by default is much easier to work with.

baoyachi commented 8 months ago

@epage There is currently no better idea. I will close it.