winnow-rs / winnow

Making parsing a breeze
https://docs.rs/winnow
Other
572 stars 44 forks source link

Checkpoint the start of a Context #384

Closed epage closed 9 months ago

epage commented 11 months ago

Discussed in https://github.com/winnow-rs/winnow/discussions/373

Originally posted by **mernen** December 3, 2023 Hi, I'm porting a nom-based project to winnow, and while the redesigned API is very nice, I've noticed my error messages have degraded significantly: with nom, `context()` references always point to where in the input that particular parser began, whereas winnow's context references are pointing to where the error occurred — in effect, all contexts point to the same place. For comparison, I've created two roughly equivalent programs for parsing digits. I ask them both to parse the invalid number "0xZ123", and then print the detailed error:
nom version ```rust use nom::branch::alt; use nom::bytes::complete::{tag, take_while1}; use nom::combinator::{all_consuming, cut, map_res}; use nom::error::{context, VerboseError}; use nom::sequence::preceded; use nom::IResult; fn number(input: &str) -> IResult<&str, u64, VerboseError<&str>> { context("number", alt((hex, decimal)))(input) } fn hex(input: &str) -> IResult<&str, u64, VerboseError<&str>> { context( "hex number", preceded( tag("0x"), cut(context( "hex digits", map_res(take_while1(|c: char| c.is_ascii_hexdigit()), |s: &str| { u64::from_str_radix(s, 16) }), )), ), )(input) } fn decimal(input: &str) -> IResult<&str, u64, VerboseError<&str>> { context( "decimal number", map_res(take_while1(|c: char| c.is_ascii_digit()), |s: &str| { u64::from_str_radix(s, 10) }), )(input) } fn main() { let test = "0xZ123"; // println!("{:#?}", number(test)); match all_consuming(number)(test) { Ok((_, n)) => println!("number: {n}"), Err(nom::Err::Error(e) | nom::Err::Failure(e)) => eprintln!("{e}"), Err(err) => eprintln!("{err}"), } } ```
Output: ``` Parse error: TakeWhile1 at: Z123 in section 'hex digits', at: Z123 in section 'hex number', at: 0xZ123 in section 'number', at: 0xZ123 ```
winnow version ```rust use winnow::combinator::{alt, cut_err, preceded}; use winnow::error::{StrContext, TreeError}; use winnow::token::take_while; use winnow::{PResult, Parser}; fn number<'i>(input: &mut &'i str) -> PResult> { alt((hex, decimal)) .context(StrContext::Label("number")) .parse_next(input) } fn hex<'i>(input: &mut &'i str) -> PResult> { preceded( "0x", cut_err( take_while(1.., |c: char| c.is_ascii_hexdigit()) .try_map(|s| u64::from_str_radix(s, 16)) .context(StrContext::Label("hex digits")), ), ) .context(StrContext::Label("hex number")) .parse_next(input) } fn decimal<'i>(input: &mut &'i str) -> PResult> { take_while(1.., |c: char| c.is_ascii_digit()) .try_map(|s| u64::from_str_radix(s, 10)) .context(StrContext::Label("decimal number")) .parse_next(input) } fn main() { let test = "0xZ123"; // println!("{:#?}", number.parse(test)); match number.parse(test) { Ok(n) => println!("number: {n}"), Err(err) => eprintln!("{err}"), } } ```
Output: ``` 0xZ123 ^ in slice at 'Z123' 0: invalid hex digits at 'Z123' 1: invalid hex number at 'Z123' 2: invalid number at 'Z123' ``` I'm not sure if this is a bug or a difference in intention (the "invalid *label* at *input*" does convey a different idea from "in section *label*"), but stacking seems less useful.
epage commented 11 months ago

In trivial parsers, Input is just a fat pointer. Once you start adding Located and Recoverable (see #96), it grows dramatically and part of what you are seeing is a side effect of work to make those parsers more performant. In some cases, adding a couple of usizes to the Input, without any other change, slowed parsing down by 30%. The way we solved that is by switching to &mut Input. That has a risk of not doing all the bookkeeping right when reverting (and adding overhead in the happy/success path for that bookkeeping). The realization I had is we could get away without being transactional by letting errors stay pointing to where the error occurred. This also allows errors to not need to carry around the error-side's Input any more, allowing them to be shrunk as well, offering more performance benefits (e.g. the performance hit for enabling ContextError in gix is about 5%, dramatically less than the old VerboseError).

Another reason you are running into is the parsers I work with don't need the input for each Context, leading me to design the default case that way.

The final reason you are running into this is TreeError was a late edition to the API that I made to help with the port of gix from nom to winnow and mostly exists for debugging purposes because the errors literally dump the parser state and can't really be design for humans in mind and has a major performance hit.

The way to solve this would be, in a breaking change (slowly building up to being ready to make v0.6) is to take a Checkpoint at the start of context and pass that to add_context to let the error type decide what it wants to do with it (TreeError can decide to do clones, resets, etc). We should evaluate the other error methods to see if they should accept a Checkpoint as well.