
wipfli / positioned-glyph-font

Positioned glyph font for MapLibre

https://wipfli.github.io/positioned-glyph-font/examples/protomaps/name-ne

MIT License


Khmer text corpus #1

Closed wipfli closed 9 months ago

wipfli commented 11 months ago

I think for the OSM corpus of Khmer labels we need something like 1.8k unique glyphs before downscaling by 1/64.

Curious to see how many glyphs are needed for all of Khmer Wikipedia...

https://github.com/phylypo/khmer-text-data

wipfli commented 11 months ago

The Khmer Wikipedia, with 10,000,000 words, requires 1509 unique positioned glyphs with Noto Sans Khmer. After downscaling by 1/64 to MapLibre SDF glyphs, 530 unique glyphs are left.

Calculated with count_unique_glyphs.py
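As a rough illustration of the counting step (this is not the actual count_unique_glyphs.py, which is not shown in this thread), the sketch below treats a positioned glyph as a tuple of glyph index, x_offset, y_offset, and advance, and counts distinct tuples before and after scaling the metrics by 1/64. The function name, the sample records, and the use of round() for the downscaling are all assumptions made for the example.

```python
# Hypothetical sketch: count unique positioned glyphs before and after
# downscaling the metrics by 1/64. The rounding mode (Python's round)
# is an assumption; the original script may round differently.

def count_unique_positioned_glyphs(records, scale=1 / 64):
    """records: iterable of (glyph_index, x_offset, y_offset, advance)."""
    records = list(records)
    full_resolution = set(records)
    downscaled = {
        (gid, round(x * scale), round(y * scale), round(adv * scale))
        for gid, x, y, adv in records
    }
    return len(full_resolution), len(downscaled)


# Made-up shaper output in font units, for illustration only.
sample = [
    (120, 0, 0, 0),
    (120, 10, 0, 0),   # collapses onto (120, 0, 0, 0) after scaling
    (120, 0, 192, 0),  # becomes (120, 0, 3, 0)
    (5, 0, 0, 640),
    (5, 0, 0, 650),    # collapses onto (5, 0, 0, 10) after scaling
]

before, after = count_unique_positioned_glyphs(sample)
print(before, after)
```

Small offset differences that are distinct in font units can merge once the metrics are quantized, which is why the count drops after downscaling.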

wipfli commented 11 months ago

When extracting all tag values whose tag key starts with "name" from the cambodia-latest.osm.pbf Geofabrik download, I found that 292 unique MapLibre glyphs are needed to render all the Khmer words.
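The tag filter described here can be sketched as follows. Reading the .osm.pbf file itself is left out; the example assumes the tags of each OSM element are already available as plain dictionaries, and the function name and sample data are hypothetical.

```python
# Hypothetical sketch: collect the values of all tags whose key starts
# with "name" (name, name:km, name:en, ...). Parsing the .osm.pbf file
# is out of scope; each element is assumed to be a dict of its tags.

def name_tag_values(elements):
    values = set()
    for tags in elements:
        for key, value in tags.items():
            if key.startswith("name"):
                values.add(value)
    return values


# Made-up elements for illustration.
elements = [
    {"name": "ភ្នំពេញ", "name:en": "Phnom Penh", "highway": "primary"},
    {"name:km": "ភ្នំពេញ", "amenity": "cafe"},
]

print(sorted(name_tag_values(elements)))
```

The resulting values still mix scripts, so a Khmer-only filter has to be applied afterwards before counting glyphs.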

All but one of the positioned glyphs needed for this were already contained in the Wikipedia set of positioned glyphs.

The one that was missing looks like this:

It is the glyph with index 120 in NotoSans-Khmer.ttf, with these metrics:

x_offset, y_offset, advance: 0, 3, 0

The Wikipedia set does contain the glyph with index 120, but only with these four sets of metrics (x_offset, y_offset, advance):

0, 0, 0
0, -1, 0
-4, 0, 0
-5, 0, 0
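The comparison between the two sets boils down to a set difference over positioned glyphs. The sketch below uses the metrics quoted above for glyph 120; the variable names are hypothetical, and the OSM set is cut down to two entries for illustration rather than being the real 292-glyph set.

```python
# Hypothetical sketch: find positioned glyphs needed for OSM labels
# that are absent from the Wikipedia set. Each entry is
# (glyph_index, x_offset, y_offset, advance).

wikipedia_glyphs = {
    (120, 0, 0, 0),
    (120, 0, -1, 0),
    (120, -4, 0, 0),
    (120, -5, 0, 0),
}

# Illustrative subset only, not the full OSM result.
osm_glyphs = {
    (120, 0, 0, 0),
    (120, 0, 3, 0),
}

missing = sorted(osm_glyphs - wikipedia_glyphs)
print(missing)  # [(120, 0, 3, 0)]
```

Because the glyph index alone is not the key, a glyph can count as missing even when the same index is present with other offsets.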

wipfli commented 11 months ago

Quite good I would say...

wipfli commented 11 months ago

perl -pe 's/[[:ascii:]]/\n/g' kmwiki-20231201-pages-articles-multistream.xml | sed '/^$/d' | grep -P "[\x{1780}-\x{17FF}]" | sort | uniq > khmer.txt
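For readers who prefer Python, the pipeline above can be approximated as below: split the text on runs of ASCII characters, keep only the pieces that contain at least one character from the Khmer block (U+1780 to U+17FF), and deduplicate. The function name is hypothetical, and the sort order is by code point, which may differ from the locale-dependent order of sort.

```python
# Hypothetical Python approximation of the perl | sed | grep | sort | uniq
# pipeline: extract unique non-ASCII "words" that contain Khmer characters.
import re

ASCII_RUN = re.compile(r"[\x00-\x7F]+")
KHMER_CHAR = re.compile(r"[\u1780-\u17FF]")


def extract_khmer_words(text):
    pieces = ASCII_RUN.split(text)          # ASCII acts as the separator
    words = {p for p in pieces if p and KHMER_CHAR.search(p)}
    return sorted(words)                    # code point order


sample_text = "<title>ភ្នំពេញ</title> abc កម្ពុជា, ភ្នំពេញ 123 é"
print(extract_khmer_words(sample_text))
```

For a full Wikipedia dump the file would need to be read in chunks or line by line instead of as a single string.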

wipfli commented 11 months ago

sed -n '2p' input.json | sed 's/,$//' | jq | perl -pe 's/[[:ascii:]]/\n/g' | sed '/^$/d' | sort | uniq

wipfli commented 9 months ago

Factored out into https://github.com/wipfli/word-corpus