yeslogic / allsorts

Font parser, shaping engine, and subsetter implemented in Rust
https://yeslogic.com/blog/allsorts-rust-font-shaping-engine/
Apache License 2.0

Tracking the correspondence between input text and output glyphs #31

Open mikeday opened 4 years ago

mikeday commented 4 years ago

Shaping proceeds roughly as follows:

  1. Characters are mapped to glyphs via the cmap table on a one-to-one basis, possibly after Unicode normalisation has taken place. There are some exceptions and special cases, such as variation selector characters and zero width (non-)joiners.

  2. The glyph array is transformed by the substitution lookups found in the GSUB table, and other script-specific reordering may be applied.

  3. The glyphs are finally positioned relative to each other by the positioning lookups found in the GPOS table and their intrinsic metrics such as advance width.

During this process we attempt to track the connection between the original text input and the glyphs by remembering which characters each glyph came from and updating that appropriately in response to ligature substitutions. This is enough to support the ToUnicode mapping needed by PDF files so that copy and paste works, but not adequate for interactive applications that need to handle caret positioning, text selection, or efficient line-breaking via shaping boundaries as described in #29 (which Prince could also benefit from).

As a contrived example, consider shaping the text "aba" and getting back glyphs [17-'b', 18-'a', 18-'a']. From this alone you can't tell which 'a' ended up where.
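To make the "aba" example concrete, here is a minimal sketch of what per-glyph tracking could look like. `TrackedGlyph`, `map_chars`, and `toy_reorder` are hypothetical names for illustration, not allsorts types, and the closure stands in for a real cmap lookup. Indices are character indices, not byte offsets.

```rust
/// Hypothetical output glyph: a glyph ID plus the index of the input
/// character it came from. Not an allsorts type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TrackedGlyph {
    glyph_id: u16,
    source_index: usize,
}

/// Map each character to one glyph, remembering its position in the text.
/// `cmap` is a stand-in for a real cmap table lookup.
fn map_chars(text: &str, cmap: impl Fn(char) -> u16) -> Vec<TrackedGlyph> {
    text.chars()
        .enumerate()
        .map(|(source_index, ch)| TrackedGlyph {
            glyph_id: cmap(ch),
            source_index,
        })
        .collect()
}

/// Toy reordering step: move the glyph at index 1 to the front.
/// This only exists to reproduce the [b, a, a] output from the example.
fn toy_reorder(glyphs: &mut Vec<TrackedGlyph>) {
    if glyphs.len() > 1 {
        let moved = glyphs.remove(1);
        glyphs.insert(0, moved);
    }
}
```

With glyph IDs alone the two 'a' glyphs are identical; with `source_index` carried along, the first output 'a' can be traced back to character 0 and the second to character 2.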

mikeday commented 4 years ago

Although the character->glyph relationship starts out roughly one-to-one, substitutions can break that in the following ways:

mikeday commented 4 years ago

An application may want to make high level queries such as:

The shaping process needs to maintain sufficient correspondence between the input text and output glyphs that these questions can be answered, even if the answer may not always be particularly useful in the general case, such as if the font has erased all of the glyphs or converted the entire text into a single ligature.

mikeday commented 4 years ago

Idea for future investigation: associate an (index, length) pair with every character and every glyph, representing the first and last glyph corresponding to that character, and vice versa. This is only an approximation but potentially a useful one.
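A rough sketch of the (index, length) idea, assuming each glyph already records the input character indices it came from. `Span`, `glyph_spans`, and `char_spans` are hypothetical names, and the span is just the min-to-max range, which is where the approximation comes in: it may cover items that don't actually correspond.

```rust
/// Hypothetical (index, length) pair pointing into the other buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    index: usize,
    len: usize,
}

/// Smallest span covering all the given indices, or None if there are none.
fn span_of(indices: impl Iterator<Item = usize>) -> Option<Span> {
    let mut bounds: Option<(usize, usize)> = None;
    for i in indices {
        bounds = Some(match bounds {
            None => (i, i),
            Some((lo, hi)) => (lo.min(i), hi.max(i)),
        });
    }
    bounds.map(|(lo, hi)| Span { index: lo, len: hi - lo + 1 })
}

/// For every glyph, the span of characters it came from.
/// `glyph_sources[g]` lists the character indices behind glyph `g`.
fn glyph_spans(glyph_sources: &[Vec<usize>]) -> Vec<Option<Span>> {
    glyph_sources
        .iter()
        .map(|sources| span_of(sources.iter().copied()))
        .collect()
}

/// For every character, the span of glyphs that mention it as a source.
/// A character with no remaining glyphs (e.g. deleted by the font) gets None.
fn char_spans(char_count: usize, glyph_sources: &[Vec<usize>]) -> Vec<Option<Span>> {
    (0..char_count)
        .map(|c| {
            span_of(
                glyph_sources
                    .iter()
                    .enumerate()
                    .filter(|(_, sources)| sources.contains(&c))
                    .map(|(g, _)| g),
            )
        })
        .collect()
}
```

For example, if characters 0..3 form a single ligature glyph followed by one ordinary glyph, the ligature gets `Span { index: 0, len: 3 }` and each of the three characters gets `Span { index: 0, len: 1 }`.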

Or perhaps a better starting point would be to consider the text buffer and the glyph buffer each split into contiguous subranges that map to each other, one character to one glyph in the simple case and potentially the entire input to the entire output in the case of complex script reordering, one giant ligature, or a pathological font like Addition.
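The subrange idea could be sketched like this. `Segment`, `segment_for_char`, and `merge` are hypothetical names, and the sketch assumes the segments are sorted and cover the text without gaps. Renumbering glyph ranges after a substitution changes the glyph count is left out.

```rust
use std::ops::Range;

/// Hypothetical pairing of a contiguous run of characters with the
/// contiguous run of glyphs it produced.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Segment {
    chars: Range<usize>,
    glyphs: Range<usize>,
}

/// Find the segment containing character index `c` by binary search.
/// Assumes `segments` is sorted by character position with no gaps.
fn segment_for_char(segments: &[Segment], c: usize) -> Option<&Segment> {
    let i = segments.partition_point(|s| s.chars.end <= c);
    segments.get(i).filter(|s| s.chars.contains(&c))
}

/// Coalesce segments `first..=last` into one, as would be needed when a
/// lookup or reordering step entangles glyphs across their boundaries.
fn merge(segments: &mut Vec<Segment>, first: usize, last: usize) {
    assert!(first <= last && last < segments.len());
    let merged = Segment {
        chars: segments[first].chars.start..segments[last].chars.end,
        glyphs: segments[first].glyphs.start..segments[last].glyphs.end,
    };
    segments[first] = merged;
    segments.drain(first + 1..=last);
}
```

In the simple case every segment is one character to one glyph. In the worst case everything collapses into a single segment spanning both buffers, which is still a valid (if not very useful) answer.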

behdad commented 3 years ago

This is what the HarfBuzz hb_glyph_info_t::cluster is about. I suggest you study what we do there.

Here's the section in our docs: https://harfbuzz.github.io/clusters.html

See also the following which makes it closer to what you propose: https://github.com/harfbuzz/harfbuzz/issues/1392
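As a rough illustration of the cluster idea, the HarfBuzz docs describe a merged cluster as taking the minimum of the incoming cluster values. The following is a standalone Rust sketch of that rule, not HarfBuzz's API or code; `Glyph` and `ligate` are hypothetical names.

```rust
/// Hypothetical glyph record carrying a cluster value, loosely modelled on
/// the `cluster` field of HarfBuzz's hb_glyph_info_t.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Glyph {
    id: u16,
    cluster: u32,
}

/// Replace `count` glyphs starting at `start` with a single ligature glyph.
/// The ligature takes the minimum cluster value of its components.
fn ligate(glyphs: &mut Vec<Glyph>, start: usize, count: usize, ligature_id: u16) {
    let cluster = glyphs[start..start + count]
        .iter()
        .map(|g| g.cluster)
        .min()
        .expect("a ligature needs at least one component");
    glyphs[start] = Glyph { id: ligature_id, cluster };
    // Remove the remaining components; later glyphs keep their clusters.
    glyphs.drain(start + 1..start + count);
}
```

After a ligature the glyph array no longer has one entry per cluster value, so the gap between neighbouring cluster values tells the caller how much input text the ligature covers.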

adrianwong commented 3 years ago

Thanks for the pointers @behdad, much appreciated.

LoganDark commented 2 years ago

Are there any unresolved questions here or is it just waiting on an implementation?

wezm commented 2 years ago

> Are there any unresolved questions here or is it just waiting on an implementation?

I think the approach that the implementation would take is still undecided.

LoganDark commented 2 years ago

> > Are there any unresolved questions here or is it just waiting on an implementation?
>
> I think the approach that the implementation would take is still undecided.

What if you provided a "userdata" field with a trait so the user can decide how to handle splitting and combining?

RawGlyph currently has a generic but it's never propagated into the Infos. Maybe you could make Info generic as well and allow the user to implement some trait to handle how things are propagated from the RawGlyphs?

You could add some sort of Userdata trait bound to RawGlyph. That would be another breaking change of course, but it would be a good one.

() could have an implementation that does nothing in all cases.
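Something like this is what I have in mind, as a rough sketch only. None of this is existing allsorts API: the `Userdata` trait, its `combine` and `split` methods, `CharRange`, and the toy `Glyph` and `ligate` below are all made up for illustration.

```rust
use std::ops::Range;

/// Hypothetical trait: the user decides what happens to their data when
/// glyphs are combined or split during shaping.
trait Userdata: Clone {
    /// Several glyphs became one (e.g. a ligature).
    fn combine(parts: &[Self]) -> Self;
    /// One glyph became `count` glyphs; produce the data for glyph `index`.
    fn split(&self, index: usize, count: usize) -> Self;
}

/// The unit type does nothing in all cases.
impl Userdata for () {
    fn combine(_: &[Self]) -> Self {}
    fn split(&self, _: usize, _: usize) -> Self {}
}

/// Example user type: the range of input characters behind a glyph.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CharRange(Range<usize>);

impl Userdata for CharRange {
    fn combine(parts: &[Self]) -> Self {
        let start = parts.iter().map(|p| p.0.start).min().unwrap_or(0);
        let end = parts.iter().map(|p| p.0.end).max().unwrap_or(0);
        CharRange(start..end)
    }
    fn split(&self, _index: usize, _count: usize) -> Self {
        // Every piece points back at the same source characters.
        self.clone()
    }
}

/// Toy glyph type standing in for a glyph that carries user data.
struct Glyph<T: Userdata> {
    id: u16,
    data: T,
}

/// Toy ligature substitution that propagates user data via the trait.
fn ligate<T: Userdata>(glyphs: &mut Vec<Glyph<T>>, start: usize, count: usize, lig: u16) {
    assert!(count >= 1 && start + count <= glyphs.len());
    let parts: Vec<T> = glyphs[start..start + count]
        .iter()
        .map(|g| g.data.clone())
        .collect();
    glyphs[start] = Glyph { id: lig, data: T::combine(&parts) };
    glyphs.drain(start + 1..start + count);
}
```

Users who don't care pay nothing with `()`, and users who do care can pick whatever tracking scheme suits their application.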