yale-datachemist / entity-resolution

Apache License 2.0

Next Steps #60

Open timathom opened 1 month ago

timathom commented 1 month ago

Benchmark data

I updated the README for the benchmark data and was also able to add the new generated IDs for each name (in the Gen_IDs column).

Current approach for generating the embedding objects:

  1. Index the BIBFRAME RDF/XML Yale catalog data in an instance of the BaseX XML database.
  2. Use XQuery to extract the relevant data from the BIBFRAME resources. Code is here: extract-names.xq.
  3. This approach works and was easy for me to implement, but it takes a long time, even when running in parallel with the data split across 20 separate databases.
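For readers who do not use XQuery, the extraction step can be sketched in Python. This is not the contents of extract-names.xq, which is not shown in this thread. It assumes agent names appear as rdfs:label values on bf:Agent elements nested inside a bf:Work, and both the sample record and the function name extract_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for BIBFRAME 2.0, RDF Schema, and RDF.
BF = "http://id.loc.gov/ontologies/bibframe/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Made-up record for illustration only; real catalog data is far richer.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:bf="{BF}">
  <bf:Work rdf:about="http://example.org/work/1">
    <bf:contribution>
      <bf:Contribution>
        <bf:agent>
          <bf:Agent rdf:about="http://example.org/agent/1">
            <rdfs:label>Schubert, Franz, 1797-1828</rdfs:label>
          </bf:Agent>
        </bf:agent>
      </bf:Contribution>
    </bf:contribution>
  </bf:Work>
</rdf:RDF>"""


def extract_names(rdf_xml):
    """Return one {work, name} record per labelled bf:Agent under each bf:Work."""
    root = ET.fromstring(rdf_xml)
    names = []
    for work in root.iter(f"{{{BF}}}Work"):
        work_uri = work.get(f"{{{RDF}}}about")
        for agent in work.iter(f"{{{BF}}}Agent"):
            label = agent.findtext(f"{{{RDFS}}}label")
            if label:  # skip agents that carry no label
                names.append({"work": work_uri, "name": label.strip()})
    return names


records = extract_names(SAMPLE)
```

Each record pairs a name string with the URI of the work it came from, which is roughly the kind of context an embedding object would need.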

On the server:

To do

matko commented 1 month ago

Also, when calling vectorlink scan-neighbors for an intra-domain comparison, both A--B and B--A are compared, but we only need A--B because the relationship is undirected/intransitive.

Are you sure? scan-neighbors only opens and searches in one index. If A is your indexed domain and B is your sequence domain, only A's index is opened, and we search for all elements in B using the A index. So the search runs in one direction.

timathom commented 1 month ago

Right, but I was using the same domain as both A and B to try to cluster within the domain :)
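To make the intra-domain case concrete, here is a toy Python sketch. It does not call vectorlink or reproduce its API: scan is a hypothetical stand-in for a one-directional search (every query element looked up against an index), and the IDs and similarity relation are invented. It shows that using the same domain on both sides yields each match twice, and that sorting the endpoints of each pair keeps one copy.

```python
def scan(index_ids, query_ids, is_match):
    """Toy one-directional search: look up every query element against the index."""
    return [
        (q, i)
        for q in query_ids
        for i in index_ids
        if q != i and is_match(q, i)
    ]


def undirected(pairs):
    """Collapse (a, b) and (b, a) into a single pair by sorting the endpoints."""
    return sorted({tuple(sorted(p)) for p in pairs})


# Invented IDs and matches, for illustration only.
domain = ["n1", "n2", "n3"]
similar = {frozenset(("n1", "n2")), frozenset(("n2", "n3"))}


def is_match(a, b):
    return frozenset((a, b)) in similar


# Same domain as both index and query: each match shows up in both directions.
directed = scan(domain, domain, is_match)
unique = undirected(directed)
```

Here directed holds four ordered pairs while unique holds two, so half of the comparisons reported in the intra-domain run are redundant.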

matko commented 1 month ago

Apologies, I misunderstood! For an intra-domain comparison, there might already be a better approach in vectorlink in the form of the duplicates subcommand, which does a nearest-neighbors search for each element in the index. We could explore whether that's an option here.