Next Steps

timathom commented 1 month ago

Benchmark data

I updated the README for the benchmark data and was also able to add the new generated IDs for each name (in the Gen_IDs column).

Current approach for generating the embedding objects:

Index the BIBFRAME RDF/XML Yale catalog data in an instance of the BaseX XML database.
Use XQuery to extract the relevant data from the BIBFRAME resources. Code is here: extract-names.xq.
This approach works and was easy for me to implement, but it takes a long time, even when running in parallel with the data split across 20 separate databases.
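One direction that might be worth testing for the extraction step is a streaming parse outside the database. The sketch below is not the logic in extract-names.xq. It assumes a hypothetical record shape (bf:Work > bf:contribution > bf:Contribution > bf:agent > bf:Agent > rdfs:label) that may not match the Yale data, and whether it would actually be faster than the BaseX setup is untested.

```python
import io
import xml.etree.ElementTree as ET

# Standard namespace URIs; the element layout used below is an assumption.
BF = "http://id.loc.gov/ontologies/bibframe/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def iter_contributor_names(source):
    """Yield (work_uri, label) for agent labels found under bf:Contribution.

    The file is streamed with iterparse, and each bf:Work is cleared once
    it has been handled so memory use stays roughly flat.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != f"{{{BF}}}Work":
            continue
        work_uri = elem.get(f"{{{RDF}}}about")
        for contribution in elem.iter(f"{{{BF}}}Contribution"):
            for agent in contribution.iter(f"{{{BF}}}Agent"):
                label = agent.findtext(f"{{{RDFS}}}label")
                if label:
                    yield work_uri, label.strip()
        elem.clear()  # drop the processed subtree


# Hypothetical sample record, for illustration only.
SAMPLE = b"""
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:bf="http://id.loc.gov/ontologies/bibframe/">
  <bf:Work rdf:about="http://example.org/work/1">
    <bf:contribution>
      <bf:Contribution>
        <bf:agent>
          <bf:Agent>
            <rdfs:label>Doe, Jane, 1900-1980</rdfs:label>
          </bf:Agent>
        </bf:agent>
      </bf:Contribution>
    </bf:contribution>
    <bf:subject>
      <bf:Agent>
        <rdfs:label>Roe, Richard</rdfs:label>
      </bf:Agent>
    </bf:subject>
  </bf:Work>
</rdf:RDF>
"""

names = list(iter_contributor_names(io.BytesIO(SAMPLE)))
print(names)
```

Only the agent under bf:Contribution is returned here. The agent under bf:subject is skipped, which mirrors the current contributor-only extraction.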

On the server:

The ops file for the complete dataset is located in /data/ops/lib_people_data_for_embedding.json.
The ops file for the benchmark dataset is located in /data/ops-sample/lib_people_benchmark_data_for_embedding.json.
The benchmark data is indexed in data/vector_storage, domain benchmark, commit b1.

To do

[ ] Create a visualization and report comparing the benchmark data to the results from the corresponding embeddings.
[ ] Prepare content to explain embeddings, HNSW, quantization, ANN, etc., to a general audience.
[ ] Update the matches_table code to include additional columns, as in lib_people_benchmark_matches.csv.
[ ] Also, when calling vectorlink scan-neighbors for an intra-domain comparison, both A--B and B--A are compared, but we only need A--B because the relationship is undirected (symmetric).
[ ] Modify the methodology to improve the results.
[ ] Port the XQuery code to WOQL? Need to implement a better approach to extracting the data so that the process is faster.
[ ] Currently, only names for people as contributors are being extracted, but people can also be subjects. We need to develop an approach that accounts for different kinds of relationships between person and work. Should we combine all names in a single vector domain, or store them in separate domains?
[ ] Remap and reindex the Library of Congress dataset for name lookups and reconciliation of disambiguated personal names in the Yale data.
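For the A--B / B--A item, a minimal post-processing sketch follows. It assumes the matches can be loaded as (left_id, right_id, distance) tuples. That layout and the function name are hypothetical, not vectorlink's output format. Note that this only removes the mirrored rows afterwards; it does not avoid the duplicate comparisons themselves.

```python
def undirected_pairs(matches):
    """Collapse A--B and B--A into a single record and drop self-matches.

    `matches` is an iterable of (left_id, right_id, distance) tuples
    (a hypothetical layout). When both directions are present, the
    smaller distance is kept.
    """
    best = {}
    for left, right, distance in matches:
        if left == right:
            continue  # an element matching itself carries no information
        # Order the two IDs so both directions map to the same key.
        key = (left, right) if left < right else (right, left)
        if key not in best or distance < best[key]:
            best[key] = distance
    return sorted((a, b, d) for (a, b), d in best.items())


matches = [
    ("n1", "n2", 0.12),
    ("n2", "n1", 0.12),  # mirror of the row above
    ("n1", "n1", 0.0),   # self-match
    ("n2", "n3", 0.4),
]
deduped = undirected_pairs(matches)
print(deduped)
```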

matko commented 1 month ago

Also, when calling vectorlink scan-neighbors for an intra-domain comparison, both A--B and B--A are compared, but we only need A--B because the relationship is undirected (symmetric).

Are you sure? scan-neighbors opens and searches only one index. If A is your indexed domain and B is your sequence domain, only A's index is opened, and we search for every element of B using A's index. So the search runs in one direction.

timathom commented 1 month ago

Right, but I was using the same domain as both A and B to try to cluster within the domain :)

matko commented 1 month ago

Apologies, I misunderstood! For an intra-domain comparison, there might already be a better approach in vectorlink in the form of the duplicates subcommand. This does a nearest-neighbors search for each element in the index. We could explore whether that's an option here.
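To make the intra-domain idea concrete, here is an exhaustive stand-in written in plain Python. It is not how vectorlink's duplicates subcommand is implemented, and it would not scale to the full dataset. The function names and the distance threshold are hypothetical. It only illustrates considering each unordered pair of vectors once.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def candidate_duplicates(vectors, threshold):
    """Return (id_a, id_b, distance) for each unordered pair within threshold.

    `vectors` maps an ID to its embedding. Every pair is visited exactly
    once (a < b), so A--B and B--A are never both produced.
    """
    ids = sorted(vectors)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            d = cosine_distance(vectors[a], vectors[b])
            if d <= threshold:
                pairs.append((a, b, d))
    return pairs


# Toy two-dimensional vectors; real embeddings have many more dimensions.
vectors = {
    "n1": [1.0, 0.0],
    "n2": [0.99, 0.05],
    "n3": [0.0, 1.0],
}
pairs = candidate_duplicates(vectors, threshold=0.05)
print([(a, b) for a, b, _ in pairs])
```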

yale-datachemist / entity-resolution

Next Steps #60