Open ramSeraph opened 9 months ago
Results:
writing parameter_space/osm-devanagari-downsampled-parameter-space-clustered.jsonl...
writing parameter_space/wikipedia-devanagari-downsampled-parameter-space-clustered.jsonl...
writing parameter_space/wikidata-devanagari-downsampled-parameter-space-clustered.jsonl...
original...
wikidata size 5093
osm size 2350
wikipedia size 42605
osm is subset of wikidata False
len(osm - wikidata) 252
osm is subset of wikipedia False
len(osm - wikipedia) 29
wikidata is subset of wikipedia False
len(wikidata - wikipedia) 126

downsampled...
wikidata size 4988
osm size 2268
wikipedia size 42495
osm is subset of wikidata False
len(osm - wikidata) 251
osm is subset of wikipedia False
len(osm - wikipedia) 29
wikidata is subset of wikipedia False
len(wikidata - wikipedia) 126
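For readers who want to reproduce this kind of comparison, here is a minimal sketch of how the size, subset, and set-difference lines could be produced. It is not the script from this pull request: the function name `compare_sets` and the toy inputs are hypothetical, and it assumes each parameter space has already been loaded into a Python `set` of hashable entries.

```python
def compare_sets(named_sets, pairs):
    """Build report lines in the style of the output above.

    named_sets: dict mapping a source name to a set of entries (assumed input shape).
    pairs: list of (a, b) name tuples to compare, in the order to report them.
    """
    # One size line per source, in insertion order.
    lines = [f"{name} size {len(entries)}" for name, entries in named_sets.items()]
    for a, b in pairs:
        set_a, set_b = named_sets[a], named_sets[b]
        # issubset() is True only if every entry of a also appears in b.
        lines.append(f"{a} is subset of {b} {set_a.issubset(set_b)}")
        # The set difference counts entries present in a but missing from b.
        lines.append(f"len({a} - {b}) {len(set_a - set_b)}")
    return lines


if __name__ == "__main__":
    # Toy data only; these are not the real parameter spaces.
    toy = {"osm": {1, 2, 3}, "wikidata": {1, 2}}
    for line in compare_sets(toy, [("osm", "wikidata")]):
        print(line)
```

Passing the pairs explicitly keeps the report limited to the comparisons of interest (osm against wikidata, osm against wikipedia, wikidata against wikipedia) rather than every ordered combination.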
Thanks for this analysis. Can you also share some frequencies? For example, it may turn out that of the 42k downsampled Wikipedia clusters, only about 500 account for 99.99 percent of all values...
Did the percentile analysis. Here are the results:
Entries within 99.0 percentile: 2012
Entries within 99.9 percentile: 9351
Entries within 99.99 percentile: 31926
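A minimal sketch of one way such numbers could be computed, under a stated assumption: "entries within the p percentile" is read here as the number of most frequent entries needed to cover p percent of all occurrences. The thread does not confirm that this is the definition the script uses, and the function name `entries_within_coverage` and the toy counts are hypothetical.

```python
from collections import Counter


def entries_within_coverage(counts, percent):
    """Return how many of the most frequent entries cover `percent` of all occurrences.

    counts: Counter mapping an entry to its occurrence count (assumed input shape).
    percent: coverage target between 0 and 100.
    """
    total = sum(counts.values())
    if total == 0:
        return 0
    threshold = total * percent / 100.0
    running = 0
    # most_common() yields entries from most to least frequent.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= threshold:
            return rank
    return len(counts)


if __name__ == "__main__":
    # Toy frequencies only; these are not the real Wikipedia counts.
    toy = Counter({"a": 90, "b": 9, "c": 1})
    for p in (99.0, 99.9, 99.99):
        print(f"Entries within {p} percentile: {entries_within_coverage(toy, p)}")
```

Under this reading, a long tail shows up as a steep rise in the entry count between adjacent thresholds, which is the pattern in the figures above.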
Attaching the whole Wikipedia parameter space as well.
Great result, thanks!
Not to be merged. This pull request was created only to verify the scripts and post the results.