yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
900 stars 112 forks source link

Eager calculation of line_info is a perf bottleneck #80

Closed toddlipcon closed 4 years ago

toddlipcon commented 4 years ago

In my workload which uses enable_ast, I noticed that the constructor of line_info is a perf hotspot. Each of the AST-generating rules creates a line_info struct to pass into the AstNode, even though in most cases (successfuly parses) the line info will never be used. The actual construction of line_info scans from the beginning of the parse input, which means parsing is O(num_tokens * number of characters). Considering that the number of tokens is probably linear with respect to the number of characters, this is O(n^2) overall.

yhirose commented 4 years ago

@toddlipcon, thanks for another great feedback! I just want to make sure that I understand what you are talking about correctly. Are you talking about this like, right?

https://github.com/yhirose/cpp-peglib/blob/c0d2318b1582b0679618d53ede8bf9c98d7f06c8/peglib.h#L3348

I am thinking to cache previous results, so that a next call can start from the last processed position instead of the start of text.

toddlipcon commented 4 years ago

Yep, that is the line that is showing up in my profile (the loop scanning for '/n' inside that function, in particular).

Caching previous results could work but may be complex from a code flow perspective.

yhirose commented 4 years ago

@toddlipcon, I decided to simply drop line and column members from AstBase. So it's now a user's responsibility to calculate line and column numbers.

yhirose commented 4 years ago

@toddlipcon, I realized that some my apps depend on the line information on the Ast node... So I reverted my change and reimplemented with the line end index (std::vector<size_t> source_line_index) and binary search (std::lower_bound). If you think that it's a still bottle neck, I'll consider getting rid of them again.