tgoeg closed this issue 7 months ago
I think it could be encoded one step earlier. Could you test the devel branch?
I can confirm devel solves it and the outcome is the same!
Just out of interest: do you understand why this is considered invalid UTF-8, even though pdftotext seems to output valid UTF-8 characters as substitutes for what indeed seem to be invalid characters in the source document?
No, I don't.
Alright, then it wasn't just me pulling my (non-existent) hair out to find an actual reason :-) Thanks again, I'll close this issue, then.
There are real-world PDFs that use pretty strange characters. A sample page of one such document is this: sample_invalid_UTF8specialchar.pdf
redmine_xapian and/or Redmine itself has problems with extracts of this file. I'm not sure what's really to blame, as I know too little about both projects, even though I have already debugged this issue for several hours.

pdftotext extracts the following:

These question marks are indeed valid UTF-8 code points that denote a replaced, invalid or undefined glyph. I would have thought everything should work out (though not really prettily, but that's a different story) if anything that comes after this expects valid UTF-8, which the text seems to be after pdftotext.

What happens, however, is the following:
xapian_indexer.rb uses omindex to index attachments, which in turn uses pdftotext outside of the configuration/code of this plugin. It took me quite some time to understand that redmine_xapian does not index attachment files itself :-) With default options, the very string extracted above gets put into the xapian DB:

If I search for any keyword in this document in Redmine, I get an HTTP 500 stating the following in the logs:
I am not sure where the offending string comes from. I actually don't think it's the data in xapian's DB, as the function event_description in patches/attachment_patch.rb seems to fetch this description data with Redmine::Search.cache_store.fetch, which seems like some Redmine-internal data and not omindex'ed data, but I might be wrong (as I said, I don't understand enough to judge this):

We even have a forced encoding here, so one could believe that everything rendered in the ActionView::Template should be valid UTF-8. It seems it is not. It has a proper UTF-8 encoding, but some glyphs don't seem to be valid (which is strange, as pdftotext explicitly replaced all undefined glyphs! But again, it seems this data does not come from my pdftotext run via omindex). It may however be that Redmine itself extracted a description into its own DB (and did so with invalid characters).

Note that my SQL DB does not support
utf8mb4, as I haven't felt the additional storage inefficiency to be worth it until now. Still, I don't think this is the culprit, as this would already have led to problems when uploading the file initially. I don't think I can retrieve an invalid UTF-8 character if my SQL DB does not even support saving it in the first place.

Now I even tried to get rid of the replacement glyphs altogether by defining the following for OMINDEX in xapian_indexer.rb:

OMINDEX = '/usr/bin/omindex --overwrite -Fapplication/pdf:"/usr/bin/pdftotext -enc Latin1 %f - | iconv -f LATIN1 -t UTF-8"'

(overwrite just once, so I can be sure I have fresh entries in the xapian DB)

It does indeed get rid of the glyphs shown above:
However (and that's why I think Redmine does some indexing/extracting of a teaser text itself), the problem still remains the same, which cannot stem from xapian anymore, at least I think so.
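The following Ruby sketch separates two things that are easy to conflate in this discussion: a string being labeled as UTF-8 and a string actually containing well-formed UTF-8. It only illustrates core Ruby String behavior. The helper name utf8_report and the sample strings are hypothetical and are not taken from Redmine or redmine_xapian. The byte-truncation case is an assumption about one way a teaser text could become invalid, not a confirmed cause of this issue.

```ruby
# Hypothetical helper for illustration only; not part of Redmine or redmine_xapian.
# It labels the given bytes as UTF-8 (force_encoding does not check or convert
# anything) and then reports whether the bytes are well-formed UTF-8.
def utf8_report(bytes)
  labeled = bytes.dup.force_encoding(Encoding::UTF_8)
  { encoding: labeled.encoding.name, valid: labeled.valid_encoding? }
end

# U+FFFD, the replacement character, is itself a perfectly valid code point.
REPLACEMENT_CHAR = "\uFFFD"

# "für" as ISO-8859-1 bytes: the single byte 0xFC is not a valid UTF-8 sequence.
LATIN1_BYTES = "f\xFCr".b

# Assumed scenario: a valid UTF-8 string cut after 2 bytes, in the middle of "ü".
TRUNCATED = "f\u00FCr".byteslice(0, 2)

p utf8_report(REPLACEMENT_CHAR)
p utf8_report(LATIN1_BYTES)
p utf8_report(TRUNCATED)
```

In all three cases the reported encoding is UTF-8, but only the replacement character passes valid_encoding?. A forced encoding therefore gives no guarantee that rendering will succeed.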
If I change the template line to
I can circumvent the HTTP 500, but I get the � for valid UTF-8 special characters (like ä, ö, ü umlauts) as well, which is not what I want.
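The symptom described above (umlauts turning into � as well) is what one would expect if the bytes reaching the template were not UTF-8 to begin with. The sketch below illustrates this under two assumptions: that the template change replaced invalid sequences in a way similar to String#scrub, and that the data was ISO-8859-1 encoded. Neither is confirmed by this report, and the helper names are hypothetical.

```ruby
# Hypothetical helpers for illustration only; not taken from Redmine or redmine_xapian.

# Label the bytes as UTF-8 and replace every invalid sequence with U+FFFD.
def scrub_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

# Assumption: the bytes are really ISO-8859-1. Transcode them instead of scrubbing.
def transcode_latin1(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

utf8_text    = "f\u00FCr"    # "für" as real UTF-8
latin1_bytes = "f\xFCr".b    # "für" as ISO-8859-1 bytes

puts scrub_as_utf8(utf8_text)        # valid input passes through unchanged
puts scrub_as_utf8(latin1_bytes)     # the umlaut byte is replaced by U+FFFD
puts transcode_latin1(latin1_bytes)  # the umlaut survives the transcoding
```

Scrubbing leaves genuinely valid UTF-8 untouched, so umlauts that come out as � suggest that the encoding of the data should be checked one step earlier.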
Changing the following is the best solution for me now, but I am very unsure whether it is a proper fix:
I don't even need to convert to Latin1 and back to UTF-8 this way.
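For comparison, this is roughly what a round trip through Latin1 does to text, sketched in Ruby. It models the general effect of converting to ISO-8859-1 and back, not the exact behavior of pdftotext -enc Latin1 or iconv. The helper name latin1_round_trip and the choice of "?" as the substitute are assumptions made for illustration.

```ruby
# Hypothetical helper for illustration only. It converts text to ISO-8859-1,
# substituting "?" for every character that Latin-1 cannot represent, and then
# converts the result back to UTF-8.
def latin1_round_trip(text)
  text.encode(Encoding::ISO_8859_1, undef: :replace, replace: "?")
      .encode(Encoding::UTF_8)
end

puts latin1_round_trip("Gr\u00F6\u00DFe")  # umlaut and sharp s exist in Latin-1 and survive
puts latin1_round_trip("a\uFFFDb")         # U+FFFD has no Latin-1 equivalent and becomes "?"
puts latin1_round_trip("\u20AC")           # the euro sign is lost in the same way
```

The output is always valid UTF-8, but everything outside Latin-1 is discarded, so a fix that avoids the round trip keeps more of the original text.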
Can you reproduce the error with my sample file or is this something setup-specific? I am using the following versions:
I think this is related to #111.
Thanks for this very helpful plugin!