Available Unicode ranges

wipfli / positioned-glyph-font

Positioned glyph font for MapLibre

https://wipfli.github.io/positioned-glyph-font/examples/protomaps/name-ne

MIT License


Available Unicode ranges #5

Open wipfli opened 9 months ago

wipfli commented 9 months ago

We need to map the 843 unique positioned glyphs for Devanagari found in #4 to some Unicode ranges. The question is which ranges are available for this?

In principle, we could use the Private Use Area (PUA) for this purpose. See https://en.wikipedia.org/wiki/Private_Use_Areas. It contains 6400 codepoints in the Basic Multilingual Plane, i.e., among the codepoints smaller than 2**16. However, MapLibre GL JS already uses the PUA for storing images embedded in text. See shaping.ts. MapLibre fills up the PUA codepoints from low to high, so we could fill them up from high to low, but that feels a bit dangerous.
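A minimal sketch of the "fill from high to low" idea, assuming the BMP Private Use Area spans U+E000 to U+F8FF (which gives the 6400 codepoints mentioned above). The function name `pua_codepoints_high_to_low` is hypothetical, and the sketch does not check how far up MapLibre's own image placeholders actually reach:

```python
# BMP Private Use Area: U+E000..U+F8FF, i.e. 6400 codepoints.
PUA_START = 0xE000
PUA_END = 0xF8FF


def pua_codepoints_high_to_low(count: int) -> list[int]:
    """Return `count` PUA codepoints, starting at U+F8FF and counting down.

    Hypothetical helper: it only avoids collisions with a low-to-high
    allocator as long as the two never meet in the middle.
    """
    capacity = PUA_END - PUA_START + 1
    if count > capacity:
        raise ValueError(f"requested {count} codepoints, PUA only has {capacity}")
    return [PUA_END - i for i in range(count)]


# 843 is the number of positioned Devanagari glyphs reported in #4.
glyph_codepoints = pua_codepoints_high_to_low(843)
```

With 843 glyphs this would occupy U+F5B5 to U+F8FF, leaving the lower part of the PUA untouched.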

Next, we could just use the codepoints that Unicode reserves for Devanagari anyway. There are 3 ranges:

Devanagari, 128 codepoints
Devanagari Extended, 32 codepoints
Devanagari Extended-A, 10 codepoints

So that is a total of 170 codepoints available in the Devanagari Unicode ranges, which is not enough to cover all 843 positioned glyphs.

Update: tagged_string.cpp in MapLibre Native seems to do the same as MapLibre GL JS, i.e., images in strings are referenced by codepoints in the PUA.

wipfli commented 9 months ago

The next option is to use all codepoints starting from U+0000. However, I think this is a bad idea because MapLibre has some special behavior for whitespace and also for Latin letters. For example, to-uppercase can be used with Latin letters in MapLibre.

Finally, I think the best option is to use the codepoints that MapLibre considers to belong to unsupported scripts, i.e., the ones for which is-supported-script of the style spec returns false. MapLibre GL JS defines these codepoints with a heuristic here:

    if ((char >= 0x0900 && char <= 0x0DFF) ||
        // Main blocks for Indic scripts and Sinhala
        (char >= 0x0F00 && char <= 0x109F) ||
        // Main blocks for Tibetan and Myanmar
        isChar['Khmer'](char)) {
        // These blocks cover common scripts that require
        // complex text shaping, based on unicode script metadata:
        // http://www.unicode.org/repos/cldr/trunk/common/properties/scriptMetadata.txt
        // where "Web Rank <= 32" "Shaping Required = YES"
        return false;
    }
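For tooling on the font-generation side, the same check could be mirrored in Python. This is only a sketch of the quoted heuristic, not MapLibre code: the name `in_unsupported_range` is hypothetical, and the Khmer block U+1780 to U+17FF is assumed to be what `isChar['Khmer']` tests.

```python
# Ranges taken from the quoted MapLibre GL JS heuristic.
UNSUPPORTED_RANGES = [
    (0x0900, 0x0DFF),  # main blocks for Indic scripts and Sinhala
    (0x0F00, 0x109F),  # main blocks for Tibetan and Myanmar
    (0x1780, 0x17FF),  # assumption: Khmer block, standing in for isChar['Khmer']
]


def in_unsupported_range(codepoint: int) -> bool:
    """True if the codepoint falls in a range the heuristic treats as unsupported."""
    return any(lo <= codepoint <= hi for lo, hi in UNSUPPORTED_RANGES)
```

For example, `in_unsupported_range(0x0915)` (DEVANAGARI LETTER KA) is true, while `in_unsupported_range(ord("A"))` is false.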

wipfli commented 9 months ago

Here we have

    num_codepoints = (0x0DFF - 0x0900 + 1) + (0x109F - 0x0F00 + 1) + (0x17FF - 0x1780 + 1)

that is, 1280 + 416 + 128 = 1824 codepoints in total. This is enough to cover all 843 positioned Devanagari glyphs in one MapLibre font.
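As a rough illustration of how the positioned glyphs could be assigned to these ranges, the sketch below hands out codepoints in ascending order. The names `available_codepoints` and `assign_codepoints` and the `glyph_N` identifiers are hypothetical placeholders, not part of the repository:

```python
from itertools import chain

# The three ranges from the calculation above (Khmer assumed to be U+1780..U+17FF).
RANGES = [(0x0900, 0x0DFF), (0x0F00, 0x109F), (0x1780, 0x17FF)]


def available_codepoints() -> list[int]:
    """All codepoints in the unsupported-script ranges, in ascending order."""
    return list(chain.from_iterable(range(lo, hi + 1) for lo, hi in RANGES))


def assign_codepoints(glyph_ids: list[str]) -> dict[str, int]:
    """Map each positioned glyph id to the next free codepoint."""
    pool = available_codepoints()
    if len(glyph_ids) > len(pool):
        raise ValueError(f"{len(glyph_ids)} glyphs do not fit in {len(pool)} codepoints")
    return dict(zip(glyph_ids, pool))


# Placeholder ids for the 843 positioned Devanagari glyphs from #4.
mapping = assign_codepoints([f"glyph_{i}" for i in range(843)])
```

Since the first range alone holds 1280 codepoints, all 843 glyphs land in U+0900 to U+0C4A under this ordering, and the Tibetan, Myanmar, and Khmer ranges stay free for other scripts.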

bdon commented 9 months ago

Is there any small change we can make to MapLibre to allow the use of the CJK BMP range here?

It would be ideal if most unsupported scripts could use a single font stack, instead of needing a separate font stack for every script.

wipfli commented 9 months ago

I thought about CJK too, but I don't see how that could work. From the point of view of MapLibre, how should the renderer know if a codepoint in a CJK range should be rasterized locally (default behavior) or if it should fetch the glyph from the server?

wipfli commented 9 months ago

Another option I was considering is to use codepoints outside of the Basic Multilingual Plane for this, i.e., >= 2**16...
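One thing to keep in mind with that idea: codepoints at or above 2**16 take two UTF-16 code units (a surrogate pair), and JavaScript strings are sequences of UTF-16 code units. The sketch below only demonstrates that encoding fact. Whether MapLibre's text pipeline handles such codepoints is not established here, and U+F0000 is just an example taken from Supplementary Private Use Area-A:

```python
BMP_LIMIT = 2**16  # first codepoint outside the Basic Multilingual Plane


def utf16_code_units(codepoint: int) -> int:
    """Number of 16-bit code units needed to encode the codepoint in UTF-16."""
    return len(chr(codepoint).encode("utf-16-le")) // 2


def is_outside_bmp(codepoint: int) -> bool:
    """True for codepoints in the supplementary planes."""
    return codepoint >= BMP_LIMIT


bmp_units = utf16_code_units(0x0915)             # DEVANAGARI LETTER KA, inside the BMP
supplementary_units = utf16_code_units(0xF0000)  # example codepoint outside the BMP
```

Here `bmp_units` is 1 and `supplementary_units` is 2, so any code that indexes glyphs by a single UTF-16 code unit would need to be checked before relying on this range.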

wipfli commented 5 months ago

It looks like MapLibre GL JS accepts text in the PUA.