wz1000 / HieDb

Generates a references DB from .hie files
BSD 3-Clause "New" or "Revised" License
64 stars 24 forks source link

Indexing `typerefs` produces duplicates #61

Closed josephsumabat closed 7 months ago

josephsumabat commented 11 months ago

Indexing type refs seems to produce complete duplicate rows in the database.

Reproducible on https://github.com/josephsumabat/static-ls by running hiedb -D .hiedb index .hiefiles after building.

sqlite> select count(*) from typerefs;
3413
sqlite> select count(*) from (select distinct * from typerefs);
2090

On larger projects this can be an over 10x difference which greatly degrades the performance of indexing since so many unnecessary inserts are performed.

jhrcek commented 7 months ago

This bug report really piqued my interest. I tried indexing our modest codebase (~40k LOC of Haskell) After indexing we have 1 million rows in typerefs, but only 50k actually distinct rows.

Where is all of that duplication coming from?

This screenshot illustrates the issue for the most often repeated typeref within our codebase. It represents 2384 references to the type ghc-prim:GHC.Types:Type, that are all hiding behind this small "deriving JSON" src span.

Screenshot from 2024-02-07 16-14-05

Do you think it would be possible to tweak the indexing process to avoid all this duplication (because it only bloats the db, makes the inserts and lookups unnecessarily slower)?

I can look into that if you consider that as potentially promissing direction.

wz1000 commented 7 months ago

It would certainly be worthwhile to look into this.

josephsumabat commented 7 months ago

We noticed something similar on Mercury's code base as well (similar with something like 17 million rows but similar order of magnitude of thousands of unique rows). I did take some time to look into it a year ago but the --skip-typerefs feature (https://github.com/wz1000/HieDb/pulls?q=is%3Apr+is%3Aclosed) was originally motivated by this being a big bottleneck. Notably most ide features can be preserved without this table though.