yeslogic / fathom

🚧 (Alpha stage software) A declarative data definition language for formally specifying binary data formats. 🚧
Apache License 2.0
258 stars 14 forks source link

Global string interner #491

Closed Kmeakin closed 1 year ago

Kmeakin commented 1 year ago

I think we should make the StringInterner global. Usually global variables are seen as a code smell, but I think it can be justified in this case:

brendanzab commented 1 year ago

Yeah it does clean some stuff up. I imagine it might make things like a language server or interactive binary format editor/explorer harder to implement in the future? We’re a long way from that though, so saddling ourselves with this pain right now might be counterproductive?

Kmeakin commented 1 year ago

How could a global interner make LSP or interactive features harder?

brendanzab commented 1 year ago

Ahh, I imagine it would mean you were allocating strings in an unbounded fashion without cleaning them up later? Ie. a memory leak? Probably not a massive issue though.

Kmeakin commented 1 year ago

I think whatever the use case: batch compiler, LSP implementation, interactive editor, we would need to keep all strings live, since there is no way of knowing which strings are referenced and which are not

brendanzab commented 1 year ago

Yeah that makes sense. Come to think of it I think rust-analyzer uses some sort of reference counting for strings? Let’s do this then – I think the improved clarity will be a win.

Kmeakin commented 1 year ago

With a batch compiler with a clear separation of passes, you might be able to all interned strings at some late stage (eg once rust code/LLVM IR has been emmited and you hand off to rustc/llvm). But an LSP has no clear separating of passes. You can always receive new input from the user. Trying to figure out a way of identifying unreferenced strings would probably be a lot of effort for a small reduction in peak memory usage

Kmeakin commented 1 year ago

Yeah that makes sense. Come to think of it I think rust-analyzer uses some sort of reference counting for strings? Let’s do this then – I think the improved clarity will be a win.

Last time I checked they use smol-str: basically Arc<str> with a small string optimization. Saves on allocations and clones but no O(1) comparison

brendanzab commented 1 year ago

Yeah… we can always switch to that sometime if that’s a better approach.