zesterer closed this issue 1 year ago
Does this imply being able to do something like this?
```rust
fn parse() -> impl Parser<char, &str, Error = ....> {}
```
Yes, among other things. There are a few complications, though: feeding lifetimes through is quite an intricate process, and it's going to require a fairly fundamental rework of the `Stream` trait.
There are also complications with existing parsers: if `just('!').repeated()` results in an `impl Parser<char, &str>`, then what does `just('!')` result in? A `char`? But then it's owned, which implies that `.repeated()` is getting an unowned `&str` out of thin air. It might require some extra combinators to be used effectively.
One option I'm considering is a combinator that complements `map_with_span` by fetching the slice (`&[T]` or `&str`) of the input that was parsed, rather than its span: something like `map_with_slice`. This would allow building zero-copy parsing on top of the existing API without needing to significantly change the API, while avoiding allocation.
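To make the idea concrete, here is a minimal self-contained sketch (not chumsky's actual API; `take_while` and its tuple return shape are invented for illustration) of how returning a borrowed slice of the input avoids allocation entirely:

```rust
// Toy illustration of slice-based ("zero-copy") output: the matcher returns
// the consumed prefix as a borrow of the original input, so no String is built.
fn take_while(input: &str, pred: impl Fn(char) -> bool) -> (&str, &str) {
    let end = input
        .char_indices()
        .find(|&(_, c)| !pred(c))
        .map_or(input.len(), |(i, _)| i);
    (&input[..end], &input[end..]) // (matched slice, remaining input)
}

fn main() {
    let (ident, rest) = take_while("hello123 world", |c| c.is_alphanumeric());
    assert_eq!(ident, "hello123"); // borrowed from the input, not allocated
    assert_eq!(rest, " world");
}
```

The key property is that both returned `&str`s borrow from the caller's input, which is exactly what a `map_with_slice`-style combinator would hand to the user.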
@zesterer Are you still looking for any help with the zero-copy implementation? I'd love to lend a helping hand. I just finished browsing the whole branch and it looks fantastic so far. Really interesting!
@dark-fusion Definitely! I've been on a bit of a hiatus over the past few weeks (a mixture of burnout and a lot of work), but it's beginning to look like something that's of merge quality. I think it's probably time for me to write a list of things that need doing before merge (aside from waiting for GAT stabilisation). What sort of things were you interested in working on?
I'm most interested in adding support for binary-encoded formats, however I also understand that it would involve quite a bit of work and may be best suited as an extension crate (at least for now). This was also already discussed in #48.
While waiting for certain features to land on Rust's stable channel, I'm not sure where an extra set of hands would be most welcome.
Hello hello :D
I am looking to contribute more to chumsky, and am curious whether there are any open tasks related to zero-copy parsing that need to be worked on, and if so, what those might be!
Thanks for your time!
> am curious on if there are any open tasks that need to be worked on
Hey, thanks for showing interest! There are a few things on the list, but it really depends on what you're interested in.
- Error recovery: `master` works okay for simple cases, but it would be nice to be able to generalise it a bit, perhaps by letting users provide their own recovery parsers instead of a handful of pre-built ones, as mentioned in #77
- `filter`/`filter_map`, which are now just both methods on `Parser` (`filter` and `try_map`, for example)
- Porting the parsers in the `text` module, perhaps making them a bit more flexible in the process
- Something like the `custom` parser in `master`, but actually useful, by strategically making methods like `InputRef::next` public so that people can write their own inline parsers that manually pull tokens out of the input
- `Parser` implementations for a bunch of types to improve the utility of things (`&impl Parser`, `Box<impl Parser>`, `Rc<impl Parser>`, etc.)
- Porting over newer combinators from `master` (`.not()`, etc.)
- Making `zero-copy` very fast. `master` works fine, but each parser returns a rather complex type to represent the errors. It would be better to return just a `Result<O, ()>` and have the errors stored inline within `InputRef` (I've already started working on this, but it needs work), then figuring out how best to choose between error sets with combinators like `Or` and `Repeated`
Any one or part of these would be really appreciated and would definitely help speed up the process of getting the branch ready for merge. I'd love to dedicate more time to it myself, but I'm going through quite a busy period in my life right now and have a bit less time to spare than I normally do :(
Thanks for the list :D I will take a crack at `text` and get a pull request going when I can. And no worries about being busy! If someone blames you for that, they can write it themselves ;)
I have been trying out `zero-copy` with a pratt-parser, which can be found here. I am not sure what the exact issue is, but I don't know if `zero-copy` is going to have a great time with compile times. For reference, I am compiling my Lua parser, and you know you are in for a great time when `cargo check` takes almost 10 minutes. When I first ran `cargo run --release`, I killed the process after about 60 minutes.

I am up for any and all recommendations you have here, as I am kinda at a loss. I know you are busy, so there's no rush! Thanks for your time in advance.
@Zij-IT The very long `or` chain here is a bit suspect to me. Due to deficiencies in rustc's ability to memoise trait obligations, `Or` parsers tend to have exponential compilation times. I'd recommend trying `choice` instead; it tends not to have this issue.
That aside, make sure to throw `.boxed()` on the end of any large parsers. It has a relatively insubstantial runtime cost, but it improves compilation times very significantly!
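The type nesting behind this advice can be seen in a toy model (the `Just` and `Or` types here are invented stand-ins, not chumsky's real combinators): each `.or` call wraps the previous parser's type in another layer, so a chain of N `or`s yields a type nested N deep that rustc must re-check at every use site, whereas a `choice`-style tuple stays flat:

```rust
// Invented stand-ins for combinator types (not chumsky's).
struct Just(char);
struct Or<A, B>(A, B);

// Count how many `Or<` layers appear in a type's name.
fn type_depth<T>() -> usize {
    std::any::type_name::<T>().matches("Or<").count()
}

fn main() {
    // a.or(b).or(c).or(d) produces a type shaped like this:
    type Chain = Or<Or<Or<Just, Just>, Just>, Just>;
    assert_eq!(type_depth::<Chain>(), 3);

    // A choice((a, b, c, d))-style tuple keeps the nesting flat:
    type Flat = (Just, Just, Just, Just);
    assert_eq!(type_depth::<Flat>(), 0);
}
```

Boxing has a similar flattening effect: `.boxed()` erases the whole nested type behind a single trait object, which is why it helps compile times so much.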
I mentioned in #197 that I was able to get it to compile, but I figured I would show you some of the fruits of your labor :D Parsing each `.lua` file in the luvit Lua library:

- `master` parser here: 212_198ms
- `zero-copy` parser here: 149_218ms

I would say that is quite the achievement! I am curious where there is still room for improvement, and excited to see that number working its way down!
That's quite a nice improvement! Not as much as I'd hoped though, unfortunately. I think there's still more to be done here. Are these just the numbers for the lexer alone, or the parser too?
I've realised that the `foldl`/`foldr` combinators are extremely commonly hit in most parsers, so I'm interested to see whether we could make them ingest a `Repeated`/`SeparatedBy` directly instead of requiring a `.collect()` into a vector beforehand, saving a lot of allocation and memory juggling.
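As a rough illustration of the idea, with plain iterators standing in for chumsky's combinators (the `Expr` type and `fold_add` are invented for this sketch), a left fold can consume operands as they are produced, with no intermediate `Vec`:

```rust
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

// Fold a stream of operands into a left-associative tree as they arrive,
// instead of collecting them into a Vec first.
fn fold_add(first: i64, rest: impl Iterator<Item = i64>) -> Expr {
    rest.fold(Expr::Num(first), |lhs, n| {
        Expr::Add(Box::new(lhs), Box::new(Expr::Num(n)))
    })
}

fn main() {
    // 1 + 2 + 3 => Add(Add(1, 2), 3), built without an intermediate Vec
    let expr = fold_add(1, [2, 3].into_iter());
    assert_eq!(
        expr,
        Expr::Add(
            Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
            Box::new(Expr::Num(3)),
        )
    );
}
```

Note that a right fold (`foldr`) needs the whole sequence before it can start building from the right, which is part of why avoiding the intermediate allocation is trickier there.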
It's for both the lexer and the parser! Each of the 200ish files goes from `&str` to `Chunk`!
It's me again, with some fantastic news for me and uncertain news for `zero-copy` :D I was going back through my parser looking for those optimizations you were hoping for, and I found THE problem in my implementation. I was looking at a flamegraph and noticed that the `BinaryInfixOperator` followed by `Foldr` was taking up a large portion of the time. I thought this was odd because I only use `foldr` in two locations, so I went looking at them, and it hit me. This little portion of my expression parser has one CRITICAL issue:
```rust
let power = atom
    .clone()
    .then(just(TokenKind::OpPow).to(BinaryOp::Power))
    .repeated()
    .then(atom.labelled("binary operand"))
    .foldr(|(lhs, op), rhs| Expr::Binary {
        lhs: Box::new(lhs),
        rhs: Box::new(rhs),
        op,
    });
```
It parses EVERY `atom` twice if that `atom` is not followed by a `TokenKind::OpPow`. This is wonderfully terrifying when the singular expression is, say, a giant function definition like this one:
```lua
test('fs chmod', function(expect)
  local mode_async
  local mode_sync
  -- On Windows chmod is only able to manipulate read-only bit
  -- TODO: test on windows
  if is_windows then
    mode_async = 256 --[[tonumber('0400', 8)]] -- read-only
    mode_sync = 438 --[[tonumber('0600', 8)]] -- read-write
  else
    mode_async = 511 --[[tonumber('0777', 8)]]
    mode_sync = 420 --[[tonumber('0644', 8)]]
  end
  local file1 = Path.join(__dirname, 'fixtures', 'a.lua')
  local file2 = Path.join(__dirname, 'fixtures', 'a1.lua')
  local function maskMode(mode, mask)
    return bit.band(mode, mask or 511 --[[tonumber('0777',8)]])
  end
  fs.chmod(file1, mode_async, expect(function(err)
    assert(not err)
    if is_windows then
      assert(maskMode(maskMode(fs.statSync(file1).mode), mode_async))
    else
      assert(mode_async == maskMode(fs.statSync(file1).mode))
    end
    -- TODO: accept mode in number
    assert(fs.chmodSync(file1, mode_sync))
    if is_windows then
      assert(maskMode(maskMode(fs.statSync(file1).mode), mode_sync))
    else
      assert(mode_sync == maskMode(fs.statSync(file1).mode))
    end
  end))
  fs.open(file2, 'a', tonumber('0666', 8), expect(function(err, fd)
    assert(not err)
    fs.fchmod(fd, mode_async, expect(function(err)
      assert(not err)
      if is_windows then
        assert(maskMode(maskMode(fs.fstatSync(fd).mode), mode_async))
      else
        assert(mode_async == maskMode(fs.fstatSync(fd).mode))
      end
      -- TODO: accept mode in number
      assert(fs.fchmodSync(fd, mode_sync))
      if is_windows then
        assert(maskMode(maskMode(fs.fstatSync(fd).mode), mode_sync))
      else
        assert(mode_sync == maskMode(fs.fstatSync(fd).mode))
      end
      fs.close(fd)
    end))
  end))
  -- lchmod
  if fs.lchmod then
    local link = Path.join(__dirname, 'tmp', 'symbolic-link')
    fs.unlinkSync(link)
    fs.symlinkSync(file2, link)
    fs.lchmod(link, mode_async, expect(function(err)
      assert(not err)
      p(fs.lstatSync(link).mode)
      assert(mode_async == maskMode(fs.lstatSync(link).mode))
      -- TODO: accept mode in number
      fs.lchmodSync(link, string.format('%o', mode_sync))
      assert(mode_sync == maskMode(fs.lstatSync(link).mode))
    end))
  end
end)
```
There are, of course, as you know, many ways to solve this problem; I just went ahead and used the `binary_infix_operator` from #144 (though I had to fix the `Clone` implementation). This led to the following improved times:
> It's for both the lexer and the parser! Each of the 200ish files goes from `&str` to `Chunk`!
Out of interest, how do just the lexing times compare? My suspicion is that lexing probably got a lot faster, but parsing didn't (because parsing already requires a lot of collecting into vectors, recursive data structures, etc., so there probably wasn't much that `zero-copy` could improve on).
> I was going back through my Parser looking for those optimizations you were hoping for and I found THE problem in my implementation.
Ah yes, this is quite a common pitfall. Chumsky doesn't have memoisation, so it'll eagerly attempt to parse both interpretations twice, which can rapidly become exponential!
I'm quite surprised that `master` and `zero-copy` now have similar timings, though. I would have hoped that at least some of its improvements would be visible...
I will give it a look when I can tomorrow, but I have only moved over the parser so far, so both lexers are using the non-zero-copy parser combinators and are identical.

Sorry for the confusion 😅 My branch naming scheme when working on my own stuff needs work.
> It parses EVERY atom twice if that atom is not followed by a `TokenKind::OpPow`
@Zij-IT Could you further elaborate? I do not really understand why this is the case.
> @Zij-IT Could you further elaborate? I do not really understand why this is the case.
Here is the parser again to prevent you having to scroll up :D
```rust
let power = atom
    .clone()
    .then(just(TokenKind::OpPow).to(BinaryOp::Power))
    .repeated()
    .then(atom.labelled("binary operand"))
    .foldr(|(lhs, op), rhs| Expr::Binary {
        lhs: Box::new(lhs),
        rhs: Box::new(rhs),
        op,
    });
```
To put this into pseudocode and hopefully make it more understandable, here it is as EBNF:

```
power := (atom Pow)* atom
```
Now, the important thing here is that the left- and right-hand sides of the `TokenKind::Pow` are the same item. I am going to go through parsing this with the following example:

```
12 ^ 12 ^ 12
```
To accurately parse the above expression, we have to repeat the `(atom Pow)` parser until it fails, and then parse the final `atom`. Here are the steps:

1. `(atom Pow)` results in `12 ^ | 12 ^ 12`
2. `(atom Pow)` results in `12 ^ 12 ^ | 12`
3. `(atom Pow)` results in an error, because after parsing the last `12`, there is no `Pow`.
4. `atom` results in `12 ^ 12 ^ 12 |`

The problem is that in both steps 3 and 4 we have to parse `12`, because there is no way to pass the `12` successfully parsed in step 3 on to the final `atom`.
For this tiny example, that is no issue. But in the Lua example above, some of the lines are nested within three lambda definitions, and because each expression will be parsed twice, every item within the third call will be parsed a minimum of 2^3 times.
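The double-parse described above can be made visible with a small self-contained counting model (a hand-rolled token matcher standing in for the combinators; none of this is chumsky's API):

```rust
use std::cell::Cell;

// `parse_atom` bumps a counter so we can observe how often atoms get parsed.
fn parse_atom(tokens: &[&str], pos: usize, count: &Cell<usize>) -> Option<usize> {
    count.set(count.get() + 1);
    match tokens.get(pos) {
        Some(&t) if t != "^" => Some(pos + 1),
        _ => None,
    }
}

// The `(atom Pow)* atom` strategy: the repetition's final attempt parses an
// atom, fails to find `^`, backtracks, and the trailing `atom` parser then
// parses that same atom again.
fn parse_power(tokens: &[&str], count: &Cell<usize>) -> Option<usize> {
    let mut pos = 0;
    loop {
        match parse_atom(tokens, pos, count) {
            Some(p) if tokens.get(p) == Some(&"^") => pos = p + 1,
            _ => break, // backtrack: discard the atom we just parsed
        }
    }
    parse_atom(tokens, pos, count) // ...and parse it again
}

// The `atom (Pow atom)*` strategy parses each atom exactly once.
fn parse_power_fixed(tokens: &[&str], count: &Cell<usize>) -> Option<usize> {
    let mut pos = parse_atom(tokens, 0, count)?;
    while tokens.get(pos) == Some(&"^") {
        pos = parse_atom(tokens, pos + 1, count)?;
    }
    Some(pos)
}

fn main() {
    let tokens = ["12", "^", "12", "^", "12"];

    let naive = Cell::new(0);
    parse_power(&tokens, &naive).unwrap();
    assert_eq!(naive.get(), 4); // 3 atoms, 4 parses: the last atom is parsed twice

    let fixed = Cell::new(0);
    parse_power_fixed(&tokens, &fixed).unwrap();
    assert_eq!(fixed.get(), 3); // each atom parsed exactly once
}
```

Since each atom is itself a whole sub-expression, the duplicated work multiplies at every nesting level, which is where the 2^depth blow-up comes from.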
Heyyo :) Me again! Life's been hectic for the last couple of months, but I am excited to start helping out again.

Any issues you are currently hoping to get marked off the todo list for zero-copy that I can help with?
Hey, hello again! I think a big one would be porting the docs / `text` parsers over, and perhaps some of the newer parsers like `not` and `and_is` (suggested in #208).
I'm also considering something big (but fundamentally uncomplicated): switching out the `Output` associated type for an `O` type parameter, as in `master`. The primary reason is to allow implementing `Parser` for `!`, but in hindsight it just makes more sense, and consistency with `master` is good. That's quite a big & boring job though (lots of copy & paste, renaming, etc.), so I can completely understand if you're not interested in it.
> lots of [boring] copy & paste, renaming, etc.
If you can describe the tasks (or make a small example to replicate), I might try to play with it and attempt to automate most of it with vim, if that would help you 🤔
It's a relatively simple process: everywhere you see `Parser<I, Output = O>`, you just turn it into `Parser<I, O>`. Obviously this gets slightly more complicated than a pure copy & paste job in some of the trait bounds, but there shouldn't be anything particularly wild.
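As a toy sketch of the difference (the trait and type names here are invented for illustration, not chumsky's real definitions): an associated type forces each implementor to fix exactly one output, whereas a type parameter lets a single type implement the trait for several outputs, which is what makes a blanket `impl` for `!` over every output type possible:

```rust
// Associated-type style: each implementor fixes exactly one Output.
trait ParserAssoc<I> {
    type Output;
    fn parse(&self, input: I) -> Self::Output;
}

// Type-parameter style: one type may implement the trait for many outputs.
trait ParserParam<I, O> {
    fn parse(&self, input: I) -> O;
}

struct Len;

impl ParserParam<&str, usize> for Len {
    fn parse(&self, input: &str) -> usize {
        input.len()
    }
}

// A second impl for the SAME type with a different output: impossible with
// an associated type, since `type Output` could only be defined once.
impl ParserParam<&str, bool> for Len {
    fn parse(&self, input: &str) -> bool {
        !input.is_empty()
    }
}

fn main() {
    let n = <Len as ParserParam<&str, usize>>::parse(&Len, "abc");
    let b = <Len as ParserParam<&str, bool>>::parse(&Len, "abc");
    assert_eq!(n, 3);
    assert!(b);
}
```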
This is now being developed in #82