Open timathom opened 1 month ago
Also, when calling vectorlink scan-neighbors for an intra-domain comparison, both A--B and B--A are compared, but we only need A--B because the relationship is undirected (symmetric).
Are you sure? scan-neighbors only opens and searches one index. If A is your indexed domain and B is your sequence domain, only A's index is opened, and we search for all elements in B using the A index. So the search goes in one direction.
Right, but I was using the same domain as both A and B to try to cluster within the domain :)
Apologies, I misunderstood!
For an intra-compare, there might already be a better approach in vectorlink in the form of the `duplicates` subcommand. This does a nearest-neighbors search for each element in the index. We could explore whether that's an option here.
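
To make the exchange above concrete, here is a minimal brute-force sketch of a one-direction search: every element of the sequence domain is looked up against the indexed domain. It is not vectorlink's implementation (which searches a prebuilt index); `scan_neighbors_sketch`, the toy vectors, and the threshold are all made up for illustration. It only shows why passing the same domain on both sides reports each pair in both orders, plus self-matches.

```python
from math import dist


def scan_neighbors_sketch(index_domain, sequence_domain, threshold):
    """Search each sequence-domain vector against the indexed domain.

    Hypothetical stand-in for a one-direction neighbor scan: only the
    indexed domain is searched, and the sequence domain supplies queries.
    """
    matches = []
    for query_id, query_vec in sequence_domain.items():
        for index_id, index_vec in index_domain.items():
            distance = dist(query_vec, index_vec)
            if distance <= threshold:
                matches.append((query_id, index_id, distance))
    return matches


# Toy vectors; the ids and coordinates are invented for this example.
people = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (5.0, 5.0)}

# Intra-domain comparison: the same domain is both indexed and queried.
matches = scan_neighbors_sketch(people, people, threshold=0.5)
pairs = {(query_id, index_id) for query_id, index_id, _ in matches}
```

With the same domain on both sides, `pairs` contains both `("A", "B")` and `("B", "A")`, along with a self-match for every element.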
Benchmark data
I updated the README for the benchmark data and was also able to add the new generated IDs for each name (in the `Gen_IDs` column).

Current approach for generating the embedding objects:
On the server:
- `/data/ops/lib_people_data_for_embedding.json`
- `./data/ops-sample/lib_people_benchmark_data_for_embedding.json`
- `.data/vector_storage`, domain `benchmark`, commit `b1`.

To do
- `matches_table` code to include additional columns, as in `lib_people_benchmark_matches.csv`.
- Also, when calling `vectorlink scan-neighbors` for an intra-domain comparison, both `A--B` and `B--A` are compared, but we only need `A--B` because the relationship is undirected (symmetric).
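
One possible way to handle the second item is a post-processing filter on the match rows, sketched below. It assumes each row can be reduced to a `(left_id, right_id, distance)` tuple and that the ids are comparable strings; the function name and the sample rows are hypothetical and not part of vectorlink or the existing `matches_table` code.

```python
def dedupe_undirected(matches):
    """Keep one row per unordered pair and drop self-matches.

    Each pair is stored under a canonical key (smaller id first), so
    A--B and B--A collapse into a single A--B row.
    """
    seen = set()
    unique = []
    for left, right, distance in matches:
        if left == right:
            continue  # an element trivially matches itself
        key = (left, right) if left < right else (right, left)
        if key in seen:
            continue  # the reverse direction was already recorded
        seen.add(key)
        unique.append((key[0], key[1], distance))
    return unique


# Invented sample rows standing in for intra-domain scan output.
rows = [("A", "B", 0.1), ("B", "A", 0.1), ("A", "A", 0.0), ("C", "B", 0.4)]
deduped = dedupe_undirected(rows)
```

Here `deduped` keeps the first-seen distance for each pair, which is harmless when the distance measure is itself symmetric.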