wz1000 / HieDb

Generates a references DB from .hie files
BSD 3-Clause "New" or "Revised" License
64 stars 24 forks source link

Avoid inserting duplicate records in typerefs (fixes #61) #67

Closed jhrcek closed 7 months ago

jhrcek commented 7 months ago

First naive attempt at solving https://github.com/wz1000/HieDb/issues/61.

Here's a quick comparison of this PR with master:

  1. I built current master of haskell-language-server (to get bunch of .hie files to index): cabal clean && cabal build all --ghc-options=-fwrite-ide-info

  2. I indexed that directory with hiedb binary:

    # build hiedb with optimizations:
    cabal build --ghc-options=-O2
    # find hiedb exe in cabal build dir and use it to index build artifacts in hls repo:
    dist-newstyle/build/x86_64-linux/ghc-9.4.8/hiedb-0.5.0.1/x/hiedb/build/hiedb/hiedb -- -D tmp.db index ~/Devel/github.com/haskell/haskell-language-server/

You can see that difference with hls codebase is not that significant. But the difference is much more significant with our work codebase which has much more deriving of stuff:

wz1000 commented 7 months ago

It would also be good to have a comparison with using UNIQUE constraints in the db and using INSERT OR IGNORE when inserting.

jhrcek commented 7 months ago

I tried with the approach with UNIQUE constraint and it didn't find noticeable difference in perf. between it and master branch. I won't open PR with that change and just paste the diff so you can check it's the change you had in mind:

git diff --patch master..HEAD 
diff --git a/src/HieDb/Create.hs b/src/HieDb/Create.hs
index a29838d..3f28842 100644
--- a/src/HieDb/Create.hs
+++ b/src/HieDb/Create.hs
@@ -161,6 +161,7 @@ initConn (getConn -> conn) = do
                 \, ec      INTEGER NOT NULL \
                 \, FOREIGN KEY(id) REFERENCES typenames(id) DEFERRABLE INITIALLY DEFERRED \
                 \, FOREIGN KEY(hieFile) REFERENCES mods(hieFile) ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED \
+                \, CONSTRAINT uniqtyperef UNIQUE (id, hieFile, depth, sl, sc, el, ec) ON CONFLICT IGNORE \
                 \)"
   execute_ conn "CREATE INDEX IF NOT EXISTS typeref_id ON typerefs(id)"
   execute_ conn "CREATE INDEX IF NOT EXISTS typerefs_mod ON typerefs(hieFile)"
diff --git a/src/HieDb/Utils.hs b/src/HieDb/Utils.hs
index 0b5c073..596e4bf 100644
--- a/src/HieDb/Utils.hs
+++ b/src/HieDb/Utils.hs
@@ -56,7 +56,7 @@ addTypeRef (getConn -> conn) hf arr ixs sp = go 0
         Nothing -> pure ()
         Just occ -> do
           let ref = TypeRef occ hf d sl sc el ec
-          execute conn "INSERT INTO typerefs VALUES (?,?,?,?,?,?,?)" ref
+          execute conn "INSERT OR IGNORE INTO typerefs VALUES (?,?,?,?,?,?,?)" ref
       let next = go (d+1)
       case arr A.! i of
         HTyVarTy _ -> pure ()
wz1000 commented 7 months ago

I think we should add the UNIQUE constraint anyway so we can detect violations of this property.

jhrcek commented 7 months ago

I think we should add the UNIQUE constraint anyway so we can detect violations of this property.

Would this require schema version bump?

jhrcek commented 7 months ago

Tried this and the issue is that with the UNIQUE constraint the indexing time doubles. It's not as bad as on master, but still much worse than without it. Not sure if it's worth it..

wz1000 commented 7 months ago

Would this require schema version bump?

not strictly, though I guess it would be good if all the existing databases used by HLS are rebuilt so they don't violate the property.

wz1000 commented 7 months ago

Tried this and the issue is that with the UNIQUE constraint the indexing time doubles. It's not as bad as on master, but still much worse than without it. Not sure if it's worth it..

ok, I guess checking in haskell should be sufficient.

jhrcek commented 7 months ago

Turns out when switching to int map/set I forgot to flip the boolean condition: when (Set.notMember ...) -> unless (ISet.member ..., that's why it was apparently faster. Fixed in the last commit and the indexing speed is back to what it was before.

wz1000 commented 7 months ago

Bump the schema version and this should be good to go I think

jhrcek commented 7 months ago

Bumped. Is that just to force users to regenerate new hiedb?

jhrcek commented 7 months ago

Hey @wz1000 this should be ready to merge. It would be also great to get this on hackage/use that in hls. I can prepare a version bump PR if you want (and then push the update to hackage as I believe I'm still among the mainainers of this package).

wz1000 commented 7 months ago

@jhrcek feel free to prepare a release after updating the changelogs. Thanks!