Closed by wipfli 9 months ago
The Khmer Wikipedia, with 10,000,000 words, requires 1509 unique positioned glyphs with Noto Sans Khmer. After downscaling by 1/64 to MapLibre SDF glyphs, 530 unique glyphs are left.
Calculated with count_unique_glyphs.py
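To illustrate why downscaling shrinks the set, here is a minimal sketch. It is not the code from count_unique_glyphs.py; the `PositionedGlyph` record, the round-to-nearest rule, and the sample values are all assumptions made up for illustration.

```python
from collections import namedtuple

# Hypothetical record: field names follow the metrics discussed in this thread.
PositionedGlyph = namedtuple("PositionedGlyph", "index x_offset y_offset advance")


def downscale(glyph, factor=64):
    """Scale the metrics by 1/factor and round to the nearest integer.

    The rounding rule is an assumption; Python's round() rounds exact
    halves to the nearest even integer.
    """
    def scale(value):
        return int(round(value / factor))

    return PositionedGlyph(
        glyph.index,
        scale(glyph.x_offset),
        scale(glyph.y_offset),
        scale(glyph.advance),
    )


def count_unique(glyphs, factor=64):
    """Return (unique count before downscaling, unique count after)."""
    before = set(glyphs)
    after = {downscale(g, factor) for g in before}
    return len(before), len(after)


# Made-up sample values, not measurements from the Khmer corpus.
sample = [
    PositionedGlyph(120, 0, 0, 0),
    PositionedGlyph(120, 0, 10, 0),    # collapses onto (0, 0, 0)
    PositionedGlyph(120, 0, -64, 0),
    PositionedGlyph(120, -256, 0, 0),
    PositionedGlyph(120, -250, 0, 0),  # collapses onto (-4, 0, 0)
]
print(count_unique(sample))  # (5, 3)
```

Positions that differ by less than one downscaled unit end up on the same integer, which is why the count drops.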
When extracting all tag values whose tag key starts with "name" from the cambodia-latest.osm.pbf Geofabrik download, I found that 292 unique MapLibre glyphs are needed to render all the Khmer words.
All but one of the positioned glyphs needed for this were already contained in the Wikipedia set of positioned glyphs.
The one that was missing looks like this:
It is the glyph with index 120 in NotoSans-Khmer.ttf with these metrics:
x_offset, y_offset, advance
0, 3, 0
We have the glyph with index 120 in the Wikipedia set, but only with these 4 sets of metrics:
0, 0, 0
0, -1, 0
-4, 0, 0
-5, 0, 0
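The comparison above amounts to a set difference. Here is a rough sketch, assuming each positioned glyph is stored as a (glyph_index, x_offset, y_offset, advance) tuple; the tuple layout and the function name `missing_from` are my own assumptions, and only the glyph 120 entries quoted in this thread are listed.

```python
def missing_from(needed, available):
    """Return the positioned glyphs in `needed` that `available` lacks."""
    return sorted(set(needed) - set(available))


# Glyph 120 entries quoted for the Wikipedia set (the real set is much larger).
wikipedia_set = {
    (120, 0, 0, 0),
    (120, 0, -1, 0),
    (120, -4, 0, 0),
    (120, -5, 0, 0),
}

# The OSM entry quoted in this thread, plus one hypothetical overlapping entry.
osm_set = {
    (120, 0, 3, 0),
    (120, 0, 0, 0),  # illustrative only
}

print(missing_from(osm_set, wikipedia_set))  # [(120, 0, 3, 0)]
```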
Quite good I would say...
perl -pe 's/[[:ascii:]]/\n/g' kmwiki-20231201-pages-articles-multistream.xml | sed '/^$/d' | grep -P "[\x{1780}-\x{17FF}]" | sort | uniq > khmer.txt
sed -n '2p' input.json | sed 's/,$//' | jq | perl -pe 's/[[:ascii:]]/\n/g' | sed '/^$/d' | sort | uniq
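For anyone who prefers Python over the shell, here is a rough equivalent of the first pipeline (the one run on the Wikipedia dump): split on ASCII characters, keep the pieces that contain at least one character from U+1780 to U+17FF, and deduplicate. The function name `khmer_words` is my own, and the result is sorted by code point, which may differ from the locale-dependent order of `sort`.

```python
import re

# Any run of ASCII characters acts as a separator, like replacing each
# ASCII character with a newline and dropping the empty lines.
ASCII_RUN = re.compile(r"[\x00-\x7f]+")
# Khmer block, same range as the grep pattern above.
KHMER = re.compile(r"[\u1780-\u17ff]")


def khmer_words(text):
    """Return the sorted unique non-ASCII tokens that contain Khmer characters."""
    tokens = ASCII_RUN.split(text)
    return sorted({t for t in tokens if KHMER.search(t)})


print(khmer_words("ខ្មែរ abc ភាសា, ខ្មែរ"))  # ['ខ្មែរ', 'ភាសា']
```

For a large dump it would make sense to feed the file line by line and collect the tokens into one set instead of reading everything at once.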
Factored out into https://github.com/wipfli/word-corpus
I think for the OSM corpus of Khmer labels we need something like 1.8k unique glyphs before downscaling by 1/64.
Curious to see how many glyphs are needed for all of Khmer Wikipedia...
https://github.com/phylypo/khmer-text-data