String interning - Githubissues

yberreby commented 8 years ago

We might want to intern identifier strings. String interning speeds up comparisons, which are likely going to be very frequent during various stages of the compilation pipeline. It should also reduce memory allocation during parsing.

rustc uses a string interner, from which we could take inspiration.

MovingtoMars commented 8 years ago

Is is possible that rgo will have some kind of parallel parsing? In this case, we should consider making the interner thread-safe with synchronisation primitives.

yberreby commented 8 years ago

Yes, it is possible. I'm not sure a shared interner would be very efficient, though, because of false sharing and the synchronization overhead. Maybe it would be more efficient not to intern during parsing, and to do a single interning pass once all files have been parsed?

MovingtoMars commented 8 years ago

Well, that seems more like delaying the problem than solving it, if semantic/codegen could be parallel.

yberreby commented 8 years ago

Does it? AFAICT we shouldn't need to mutate the interner once parsing is complete, so the other passes could read interned strings concurrently without false sharing nor synchronization.

MovingtoMars commented 8 years ago

Hmmm, I guess you're right on that.

MovingtoMars commented 8 years ago

Not sure how names would be stored in the AST, if interning is done in a separate pass.

yberreby commented 8 years ago

They could be stored as an enum that can hold either a String or a interner ID, I guess.

The problem with this approach (interning as a separate pass) is that we're allocating many strings just to throw them away immediately, though. I wonder how other parallel compilers do it?

Your synchronized approach is starting to look more interesting. And we could convert the synchronized interner to one without synchronization once parsing is finished, since we won't mutate it further.

MovingtoMars commented 8 years ago

Liking the sound of that idea.

yberreby commented 8 years ago

The enum, or stripping away the synchronization after parsing?

MovingtoMars commented 8 years ago

I was referring to the stripping away method.

However, I now think it could be a bad idea to get rid of synchronisation after parsing. I'm sure there are valid use cases for modifying/adding identifiers in semantic analysis, especially if rgo is used as a library in another crate.

MovingtoMars commented 8 years ago

I've got a working implementation using RwLock right now. I don't see any way stripping away synchronisation would work.

yberreby commented 8 years ago

I don't see any way stripping away synchronisation would work.

Here's how I see it.

This is rustc's interner:

pub struct Interner<T> {
    map: RefCell<HashMap<T, Name>>,
    vect: RefCell<Vec<T> >,
}

The idea is to have two different structs: one where the HashMap and the Vec are wrapped in RwLock / Mutex, and one where they are stored as-is, or in a RefCell (I don't know why a RefCell is used in rustc's interner). During parsing, we'd use the first kind of struct, then we'd move out the map and the vector into the "plain" struct for use in other passes.

Given the added complexity of parallel parsing and the design trade-offs it implies, maybe it would be better to use a sequential implementation first, then, when the project is more mature, switch to a parallel one if the performance gain is substantial.

MovingtoMars commented 8 years ago

The biggest problem I see is from a usability standpoint.

If we strip of synchonisation, then if one thread modifies an interner, the other threads won't see it. Unless, we make the result read-only, but that provides a limited API to clients and semantic passes.

yberreby commented 8 years ago

I was thinking of making it read-only, yes. Cloning the interner seems wasteful.

Could you provide an example of a situation where a semantic pass would need to modify the interner? One case I can think of would be rewriting the AST to apply our own optimizations before LLVM.

Either way, I think this is premature optimization. Parsing is usually much, much faster than other passes such as codegen, and we cannot make an informed decision yet: the parser is not finished, and the type checking, semantic analysis, translation and codegen passes have not been written.

MovingtoMars commented 8 years ago

I agree, let's leave it until later.

yberreby / rgo

String interning #3