Produce standalone fonts when subsetting

wezm commented 4 years ago

2022 Update:

cmap generation when subsetting, which the original text of this issue focussed on, landed in Allsorts 0.9. However, this does not get us all the way to generating standalone fonts. This issue will act as a tracking issue for the things that are required to achieve that.

Original text:

The subsetting feature is currently tailored for the needs of subsetting fonts for embedding in PDFs, since that was our primary use case when developing Allsorts. The issue is that we don't include a cmap table in the subset font, which makes it invalid for use outside PDF. When a subset font is embedded in a PDF the cmap info is contained in the PDF directly, so we don't need to include it in the font.

In order to support more general subsetting it would be convenient to have an entry point that takes a list of chars and produces a font with glyphs for just those chars. This would be an incremental improvement on what we have so far and would still have some limitations: with chars as input there wouldn't be a way to include ligature glyphs. Doing so would require subsetting the GPOS and GSUB tables as well, which is a problem for another day.

The subsetting code lives in subset.rs. The new function signature could be along these lines:

/// Subset this font so that it only contains the glyphs for the supplied `chars`.
pub fn subset_chars(
    provider: &impl FontTableProvider,
    chars: &[char],
) -> Result<Vec<u8>, ReadWriteError>

The implementation would need to map chars to glyph ids using a technique similar to this. The subset font would need to include a new cmap table (probably using the Unicode platformID). There are a bunch of formats to choose from to encode the data. An initial implementation might just choose one of the simpler ones at the cost of the size of the resulting font. A more sophisticated implementation could examine the data to determine the best option.
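To make the format trade-off concrete, here is a minimal sketch of how char to glyph id mappings could be collapsed into the sequential map groups used by cmap format 12, where each group records a start code point, an end code point, and a start glyph id. The names `SequentialMapGroup` and `build_groups` are hypothetical and only for illustration; this is not code from Allsorts.

```rust
/// One run of consecutive code points mapped to consecutive glyph ids,
/// mirroring the group record of cmap format 12.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SequentialMapGroup {
    start_char_code: u32,
    end_char_code: u32,
    start_glyph_id: u32,
}

/// Hypothetical helper: collapse (char, glyph id) pairs into groups.
/// A pair extends the current group only if both its code point and its
/// glyph id continue the run; otherwise a new group is started.
fn build_groups(mappings: &[(char, u16)]) -> Vec<SequentialMapGroup> {
    let mut sorted: Vec<(u32, u32)> = mappings
        .iter()
        .map(|&(ch, gid)| (ch as u32, u32::from(gid)))
        .collect();
    sorted.sort_unstable();
    // Keep a single mapping per code point (the lowest glyph id after sorting).
    sorted.dedup_by_key(|pair| pair.0);

    let mut groups: Vec<SequentialMapGroup> = Vec::new();
    for (code, gid) in sorted {
        if let Some(last) = groups.last_mut() {
            let run_len = last.end_char_code - last.start_char_code + 1;
            if code == last.end_char_code + 1 && gid == last.start_glyph_id + run_len {
                last.end_char_code = code;
                continue;
            }
        }
        groups.push(SequentialMapGroup {
            start_char_code: code,
            end_char_code: code,
            start_glyph_id: gid,
        });
    }
    groups
}
```

Since a subset font renumbers its glyphs, chars that are adjacent in Unicode often end up with adjacent glyph ids, so the number of groups can be much smaller than the number of chars. Counting the groups is one way an implementation could "examine the data" before settling on a format.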

ebraminio commented 4 years ago

The issue is that we don't include a cmap table in the subset font, which makes it invalid for use outside PDF.

Some PDF readers also won't work without a valid cmap (https://crbug.com/1071958). I guess it is needed for their text selection to work properly.

yisibl commented 4 years ago

Looking forward to this feature.

wezm commented 2 years ago

I've just released 0.9, which implements building of a proper cmap table for subset fonts.

yisibl commented 2 years ago

@wezm Can you upgrade the dependency version in allsorts-tools?

Looks like it can be solved: https://github.com/yeslogic/allsorts-tools/issues/16

wezm commented 2 years ago

Yes, I'm working on that next. I have a draft PR open for it: https://github.com/yeslogic/allsorts-tools/pull/18

wezm commented 2 years ago

Reopening as we strip the OS/2 table, which is required in OpenType fonts.

yisibl commented 2 years ago

@wezm I tried to submit a PR to fix it, PTAL. https://github.com/yeslogic/allsorts/pull/58

yisibl commented 1 year ago

Happy New Year! Any progress here?

wezm commented 1 year ago

No, sorry, it's a pretty big piece of work that has not been scheduled yet.

dnlmlr commented 1 year ago

Hey! I am also trying to use subsetting for embedded fonts in PDF documents. Since I want to avoid getting too deep into the low-level PDF structure, I am just using the genpdf -> printpdf -> lopdf stack. The plan was to embed the full subsetted font into the PDF files without touching the PDF internal mappings (/Differences).

I got it to work on all tested PDF readers and printers with the current implementation of subset even though the OS/2 table is missing, but only if Unicode Encoding Records are used (mappings with CharExistence::BasicMultilingualPlane, CharExistence::AstralPlane). If CharExistence::MacRoman or CharExistence::DivinePlane is used, it doesn't work.

Would it be a sensible thing to allow forcing the default mode to be Unicode or are there any problems with this?

One workaround that I think I'll be using for now is to manually add a '€' character to the glyph_ids subset so that it can't be encoded with MacRoman, but this is not the nicest solution and will be a problem if a font doesn't actually have '€'.

wezm commented 1 year ago

Would it be a sensible thing to allow forcing the default mode to be Unicode or are there any problems with this?

I don't think that would make sense as a default as it would unnecessarily inflate the font. There is already an internal CmapStrategy enum used to drive some of the cmap generation behaviour. A new variant could be added to that and then some way to select that strategy could be added.

dnlmlr commented 1 year ago

Yeah, I agree that it shouldn't be the default, since this is kind of an edge case. What I meant was a way to externally change the encoding mode, for example as a parameter to the subset function. Basically any mechanism that would optionally prevent encoding with MacRoman.
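As a rough sketch of what such a mechanism might look like: the caller passes a preference, and the encoding choice skips MacRoman when Unicode is forced. Everything here is hypothetical (`CmapPreference`, `ChosenEncoding`, `choose_encoding`), and it is not the existing CmapStrategy enum. The MacRoman membership test is passed in as a closure because the real repertoire table is not reproduced here.

```rust
/// Hypothetical caller-facing knob; not part of the Allsorts API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CmapPreference {
    /// Pick the most compact encoding that covers every char.
    Smallest,
    /// Never use MacRoman; always emit a Unicode encoding record.
    ForceUnicode,
}

/// Hypothetical result of the choice.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChosenEncoding {
    /// Single-byte Macintosh Roman encoding.
    MacRoman,
    /// Unicode, Basic Multilingual Plane only (e.g. cmap format 4).
    UnicodeBmp,
    /// Unicode, full repertoire (e.g. cmap format 12).
    UnicodeFull,
}

/// Choose an encoding for `chars`. `in_mac_roman` is an assumed predicate
/// that reports whether a char is representable in MacRoman.
fn choose_encoding(
    chars: &[char],
    preference: CmapPreference,
    in_mac_roman: impl Fn(char) -> bool,
) -> ChosenEncoding {
    // Anything beyond U+FFFF needs an encoding that covers all of Unicode.
    if chars.iter().any(|&ch| ch as u32 > 0xFFFF) {
        return ChosenEncoding::UnicodeFull;
    }
    // MacRoman is only considered when the caller has not ruled it out.
    let mac_roman_ok = preference == CmapPreference::Smallest
        && chars.iter().all(|&ch| in_mac_roman(ch));
    if mac_roman_ok {
        ChosenEncoding::MacRoman
    } else {
        ChosenEncoding::UnicodeBmp
    }
}
```

With something like this the default behaviour stays unchanged (`Smallest`), and only callers that opt in to `ForceUnicode` pay the size cost of a Unicode cmap for text that would otherwise fit in MacRoman.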

yeslogic / allsorts

Produce standalone fonts when subsetting #27