yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Text passed to receiver is not UTF-8 encoded #447

Closed martinadamek closed 2 years ago

martinadamek commented 2 years ago

Maybe I misunderstood the README, which says:

Regardless of the internal encoding used in the PDF all text will be converted to UTF-8 before it is passed back from PDF::Reader.

While parsing some invoices, I am receiving ASCII-8BIT (or US-ASCII) encoded strings in my receiver:

def show_text(arg)
  puts arg.encoding
end

Should this be possible, or did I misunderstand the API and docs? Btw, when I don't use a receiver and check page.text instead, it is UTF-8 encoded, as expected.

yob commented 2 years ago

Apologies for the confusion.

page.text should always return UTF-8 encoded text that's marked as such. The show_text callback is lower level: it returns the raw character codes from the PDF content stream, and those are very rarely in any recognisable encoding. They usually have to be converted into UTF-8 via a mapping process.
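
To illustrate what "a mapping process" means here, below is a minimal sketch in plain Ruby. It does not use PDF::Reader at all. The table `CODE_TO_UNICODE` and the helper `raw_codes_to_utf8` are hypothetical names invented for this example, and it assumes one byte per character code, which is not true for every PDF font.

```ruby
# Hypothetical lookup table from raw character codes to Unicode code points.
# In a real PDF this information would come from the font (for example its
# encoding or ToUnicode data); the values here are made up for illustration.
CODE_TO_UNICODE = {
  0x01 => 0x0048, # "H"
  0x02 => 0x0069, # "i"
  0x03 => 0x20AC  # euro sign
}.freeze

# Map each raw byte to a code point and build a UTF-8 string.
# Codes missing from the table become U+FFFD (the replacement character).
def raw_codes_to_utf8(raw, table = CODE_TO_UNICODE)
  raw.bytes.map { |code| table.fetch(code, 0xFFFD) }.pack("U*")
end

raw  = "\x01\x02\x03".b        # binary string, similar to what show_text receives
text = raw_codes_to_utf8(raw)

puts raw.encoding              # the binary (ASCII-8BIT) encoding
puts text.encoding             # UTF-8
puts text
```

The point of the sketch is that the raw bytes carry no meaning on their own. Only the per-font table turns them into readable text, which is why the string passed to show_text is not tagged as UTF-8.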

martinadamek commented 2 years ago

Thanks!