zesterer / ariadne

A fancy diagnostics & error reporting crate
https://crates.io/crates/ariadne
MIT License
1.69k stars 74 forks source link

`Source::chars()` iter is lossy but not documented as so #13

Open robojeb opened 2 years ago

robojeb commented 2 years ago

I recently switched my parsing from going over a stream of characters produced by str::chars() to parsing over a stream of characters from ariadne::Source::chars(). This makes it a lot easier to load all of my source files through a cache which can be directly used by ariadne for printing reports.

Unfortunately it isn't well documented that str::char() and ariadne::Source::chars() don't produce the same, or even roughly equivalent, output.

I spent some time debugging why none of my line-ending parsers were succeeding before I realized that ariadne::Source::chars() doesn't produce any line-endings (or even whitespace).

fn main() {
    let raw_src = "a \nb";
    let src = ariadne::Source::from(raw_src);

    let raw_char_vec = raw_src.chars().collect::<Vec<_>>();
    let src_char_vec = src.chars().collect::<Vec<_>>();

    println!("{:#?}", raw_char_vec);
    println!("{:#?}", src_char_vec);

    assert_eq!(raw_src.len(), src.len());
    assert_eq!(raw_src.len(), src.chars().count());
}

It would be nice if ariadne::Source::chars() were documented as not producing any white-space.

It would also be nice if ariadne::Source provided another method which could get a stream of characters including line-endings with appropriate spans.

I mocked up an adapter which can re-engineer a character stream which is mostly equivalent to str::chars() as long as your parser doesn't care about white-space. It returns all the characters in the file with their correct indices, and then also injects newline characters using the index at the end of the line span. This works as a replacement for str::chars().enumerate() in my parsing code.

struct Renumerate<'a> {
    produced: usize,
    line_info: Option<(usize, usize, Vec<char>)>,
    lines: Box<dyn Iterator<Item = &'a ariadne::Line> + 'a>,
}

impl<'a> Renumerate<'a> {
    pub fn new(src: &'a ariadne::Source) -> Renumerate<'a> {
        let mut lines = src.lines();

        Renumerate {
            produced: 0,
            line_info: lines
                .next()
                .map(|line| (line.span().start, line.span().end, line.chars().collect())),
            lines: Box::new(lines),
        }
    }
}

impl<'a> Iterator for Renumerate<'a> {
    type Item = (usize, char);

    fn next(&mut self) -> Option<Self::Item> {
        if let Some((start, end, ref line_chars)) = self.line_info {
            if self.produced < line_chars.len() {
                self.produced += 1;
                Some((start + self.produced - 1, line_chars[self.produced - 1]))
            } else {
                self.line_info = self
                    .lines
                    .next()
                    .map(|line| (line.span().start, line.span().end, line.chars().collect()));
                self.produced = 0;
                Some((end - 1, '\n'))
            }
        } else {
            None
        }
    }
}
zesterer commented 2 years ago

Hi, thanks for opening this issue. I definitely agree that this should be documented better and I'll be adding this to my TODO list (including moving to byte offsets instead of character positions) for my planned refactor of the library.

From my end, I wonder whether it would be better for me to keep Source's internal view of things as an implementation detail: it's set up to make it as easy as possible for Ariadne's diagnostic generator to work with and isn't really intended as a way for library users to directly interact with a file's contents.

robojeb commented 2 years ago

Thanks for taking a look.

It would be fine if Source wasn't intended for use as an input type. Having percolated on it over night, I would change the flow of my code to: Parse on a str, convert str to Source and place in Cache, Print diagnostics from the Cache. This may already be what you intended it just wasn't the first thing that came to my mind because I was looking at the ariadne::Cache which directly returns Source objects.

I ended up writing my own cache so I didn't have to load the files from the filesystem at both parsing and diagnostics.

exoticorn commented 2 years ago

I ran into the same issue just now. I do agree that it doesn't have to be Sources job to act as input to a parser but given the state of documentation it's not unreasonable to try to use it that way.

For me, my main motivation to use Source as input to Chumsky was that it looked like the safest bet to avoid getting bitten by changes to the newline handling, ie.:

len: line.chars().count() + 1, // TODO: Don't assume all newlines are a single character!

Now, if you can promise not to work on this TODO until switching to byte offsets (at which point this is probably moot, anyway), I'd just switch to iterate the source string directly, filtering out any '\r' characters.

zesterer commented 2 years ago

This is handled slightly better in master, fwiw. I've not had time to sit down and do the rework of the crate I've been meaning to in order to properly address this though.