xelkano / redmine_xapian

This plugin allows searches across attachments with the Xapian search engine
GNU General Public License v2.0

HTTP 500 when indexing files with non-UTF-8 characters #145

Closed tgoeg closed 7 months ago

tgoeg commented 8 months ago

There are real-world PDFs that use pretty strange characters. A sample page of one such document is this: sample_invalid_UTF8specialchar.pdf

redmine_xapian and/or Redmine itself has problems with text extracted from this file. I'm not sure which one is to blame: I know too little about both projects, although I have already spent several hours debugging this issue.

pdftotext extracts the following:

$ pdftotext sample_invalid_UTF8specialchar.pdf - | head -n2
Syntax Warning: Invalid Font Weight
Inhalt
Summary������������������������������������������������������������������������������������������������������������������������������������������������������ 3

These question marks are in fact valid UTF-8 code points that denote a replaced invalid or undefined glyph. Since the output of pdftotext appears to be valid UTF-8, I would have expected everything downstream that expects valid UTF-8 to work (not prettily, but that's a different story).

What happens instead is the following:

xapian_indexer.rb uses omindex to index attachments, and omindex in turn calls pdftotext outside of this plugin's configuration and code. It took me quite some time to understand that redmine_xapian does not index attachment files itself :-) With default options, the exact string extracted above ends up in the xapian DB:

$ /usr/bin/omindex --overwrite -s german --db $redmine/file_index/german $redmine/files/2022/12 --url / --depth-limit=0 -v 2>/dev/null && quest -d $redmine/file_index/german "Zukunft"
[Entering directory ""]
Indexing "221202160005_d27cf288872aabed82423fb9f00c8306.pdf" as application/pdf ... added
Parsed Query: Query(zukunft@1)
Exactly 1 matches
MSet:
1: [0.264258]
url=/221202160005_d27cf288872aabed82423fb9f00c8306.pdf
sample=Klimawandel und psychische Gesundheit Positionspapier einer Task-Force der DGPPN Inhalt Summary ������������������������������������������������������������������������������������������������������������������������������������������������������...
caption=Klimawandel und psychische Gesundheit – Positionspapier einer Task-Force der DGPPN
author=DGPPN – Deutsche Gesellschaft für Psychiatrie und Psychotherapie
type=application/pdf
modtime=1704804210
pages=44
size=886379

If I search for any keyword in this document in Redmine, I get an HTTP 500 stating the following in the logs:

Processing by SearchController#index as HTML
  Parameters: {"utf8"=>"✓", "scope"=>"", "q"=>"dgppn"}
  Current user: my.user (id=3)
Can't open Xapian database /my/install/dir/file_index/repodb - #<IOError: DatabaseOpeningError: Couldn't stat '/my/install/dir/file_index/repodb' (No such file or directory)>
  Rendering plugins/redmine_xapian/app/views/search/index.html.erb within layouts/base
  Rendered plugins/redmine_xapian/app/views/search/index.html.erb within layouts/base (4.9ms)
Completed 500 Internal Server Error in 353ms (ActiveRecord: 310.6ms)

ActionView::Template::Error (invalid byte sequence in UTF-8):
    93:           <% # end %>
    94:           <%= link_to(highlight_tokens(e.event_title.truncate(255), @tokens), e.event_url) %>
    95:         </dt>
    96:         <dd><span class="description"><%= highlight_tokens(e.event_description, @tokens) %></span>
    97:         <span class="author"><%= format_time(e.event_datetime) %></span></dd>
    98:       <% end %>
    99:     </dl>

app/helpers/search_helper.rb:27:in `split'
app/helpers/search_helper.rb:27:in `highlight_tokens'
plugins/redmine_xapian/app/views/search/index.html.erb:96:in `block in _plugins_redmine_xapian_app_views_search_index_html_erb__3883580160676297093_84780'
plugins/redmine_xapian/app/views/search/index.html.erb:82:in `each'
plugins/redmine_xapian/app/views/search/index.html.erb:82:in `_plugins_redmine_xapian_app_views_search_index_html_erb__3883580160676297093_84780'
lib/redmine/sudo_mode.rb:61:in `sudo_mode'
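
For context on the trace above: a regex-based `String#split` in Ruby raises `ArgumentError` when the receiver contains bytes that are not valid in its tagged encoding, and ActionView reports that as a `Template::Error`. A minimal sketch follows; the string and the pattern are stand-ins and are not taken from Redmine's `highlight_tokens` or from the actual attachment.

```ruby
# Stand-in data: a string tagged UTF-8 that contains a truncated multi-byte sequence.
broken = "Summary \xEF\xBF 3".dup.force_encoding('UTF-8')
pattern = /(summary)/i # stand-in for a search-token pattern

split_error =
  begin
    broken.split(pattern)
    nil
  rescue ArgumentError => e
    e.message
  end

puts broken.valid_encoding?   # false
puts split_error.inspect      # the error message, if split raised
```

Checking `valid_encoding?` before handing a string to a regex-based helper is one way to find out which record carries the bad bytes.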

I am not sure where the offending string comes from. I actually don't think it's the data in xapian's DB: the function event_description in patches/attachment_patch.rb seems to fetch the description with Redmine::Search.cache_store.fetch, which looks like Redmine-internal data rather than data indexed by omindex. I might be wrong, though (as I said, I don't understand enough to judge this):

      # Event methods module
      module EventMethods
        def event_description
          desc = Redmine::Search.cache_store.fetch("Attachment-#{id}")
          if desc
            Redmine::Search.cache_store.delete("Attachment-#{id}")
          else
            desc = description
          end
          desc&.force_encoding('UTF-8')
        end
      end

We even have a forced encoding here, so one might expect everything rendered in the ActionView::Template to be valid UTF-8. It seems it is not: the string is tagged as UTF-8, but some of its byte sequences don't seem to be valid. That is strange, as pdftotext explicitly replaced all undefined glyphs! But again, it seems this data does not come from my pdftotext run via omindex. It may be that Redmine itself extracted a description into its own DB, and did so with invalid characters.
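
A note on the `force_encoding` call: it only changes the encoding label on the string. It neither validates nor converts the bytes, so a string can be tagged UTF-8 and still be invalid. A minimal sketch follows; the byte sequence is an arbitrary example of a truncated multi-byte character, not a claim about what this attachment's description actually contains.

```ruby
# Arbitrary example data: two of the three bytes (EF BF BD) that encode U+FFFD.
raw = "Summary \xEF\xBF".b              # .b returns a binary (ASCII-8BIT) copy

tagged = raw.dup.force_encoding('UTF-8')

puts tagged.encoding             # UTF-8
puts tagged.valid_encoding?      # false: the label changed, the bytes did not
puts tagged.bytes == raw.bytes   # true
```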

Note that my SQL DB does not support utf8mb4, as I haven't considered the additional storage overhead worth it so far. Still, I don't think this is the culprit, as it would already have caused problems when the file was first uploaded. I don't think I can retrieve an invalid UTF-8 character if my SQL DB does not even support saving it in the first place.
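
To put rough numbers on that, assuming the database uses MySQL's legacy 3-byte `utf8` character set (utf8mb3), which is not stated explicitly here: that character set stores at most three bytes per character. The replacement character fits within that limit, while characters outside the Basic Multilingual Plane need four bytes. A quick check with arbitrary sample characters:

```ruby
# Arbitrary sample characters, chosen only to show 2-, 3- and 4-byte encodings.
samples = {
  'umlaut'                => "ä",
  'replacement character' => "\uFFFD",
  'emoji (outside BMP)'   => "\u{1F600}"
}

byte_sizes = samples.transform_values(&:bytesize)

byte_sizes.each do |label, size|
  puts "#{label}: #{size} byte(s), fits in 3 bytes: #{size <= 3}"
end
```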

I even tried to get rid of the replacement glyphs altogether by setting OMINDEX in xapian_indexer.rb as follows: OMINDEX = '/usr/bin/omindex --overwrite -Fapplication/pdf:"/usr/bin/pdftotext -enc Latin1 %f - | iconv -f LATIN1 -t UTF-8"' (with --overwrite just this once, so I can be sure I have fresh entries in the xapian DB)

It does indeed get rid of the glyphs shown above:

# quest -d $redmine/file_index/german "Zukunft"
Parsed Query: Query(zukunft@1)
Exactly 1 matches
MSet:
2: [0.660578]
url=/2022/12/221202160005_d27cf288872aabed82423fb9f00c8306.pdf
sample=Klimawandel und psychische Gesundheit Positionspapier einer Task-Force der DGPPN Inhalt Summary 3 1. Auswirkungen des Klimawandels auf die psychische Gesundheit 4 1.1 Direkte Auswirkungen auf die Psyche Luftverschmutzung Hitze Extremwetter und Naturkatastrophen Angst vor der Zukunft Indirekte Folgen des Klimawandels auf die Psyche Nahrungsmittelunsicherheit Flucht und Migration Klimaungerechtigkeit 4 4 4 5 5 6 6 6 6 2. Handlungsempfehlungen für eine klimaneutrale Psychiatrie 7 2.1 2.2 2.3 Versorgung 7 ...
type=application/pdf
modtime=1704804210
size=886379

However, the problem remains the same, which is why I think Redmine does some indexing or extracting of a teaser text itself. As far as I can tell, it cannot stem from xapian anymore.

If I change the template line to

<dd><span class="description"><%= highlight_tokens(e.event_description.force_encoding('BINARY').encode("UTF-8", invalid: :replace, undef: :replace), @tokens) %></span>

I can circumvent the HTTP 500, but I get the � for valid UTF-8 special characters (like ä, ö, ü umlauts) as well, which is not what I want.
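
The umlauts break with this variant because the string is first relabelled as binary. Transcoding from binary to UTF-8 treats every byte above 0x7F as undefined, and that includes the bytes of perfectly valid multi-byte characters. A small sketch with made-up sample text:

```ruby
text = "für"   # valid UTF-8; "ü" is encoded as two bytes (C3 BC)

via_binary = text.dup
                 .force_encoding('BINARY')
                 .encode('UTF-8', invalid: :replace, undef: :replace)

puts via_binary.valid_encoding?   # true, so no more HTTP 500 ...
puts via_binary.include?("ü")     # false ... but the umlaut is gone
puts via_binary
```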

Changing the following is the best solution for me now, but I am very unsure whether it is a proper fix:

--- redmine_xapian/lib/redmine_xapian/patches/attachment_patch.rb.bak   2024-01-09 17:01:16.595888284 +0100
+++ redmine_xapian/lib/redmine_xapian/patches/attachment_patch.rb   2024-01-09 17:01:09.223888544 +0100
@@ -49,7 +49,7 @@
           else
             desc = description
           end
-          desc&.force_encoding('UTF-8')
+          desc&.encode("UTF-8", invalid: :replace, undef: :replace)
         end
       end

I don't even need to convert to Latin1 and back to UTF-8 this way.
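
An alternative with the same intent, as a sketch only: `String#scrub` (available since Ruby 2.1) replaces just the invalid byte sequences and leaves valid characters such as umlauts intact. `safe_utf8` is a hypothetical helper name, not part of redmine_xapian or Redmine, and this is not necessarily what the devel branch does.

```ruby
# Hypothetical helper: tag as UTF-8, then replace only the invalid byte sequences.
def safe_utf8(str)
  return nil if str.nil?

  str.dup.force_encoding('UTF-8').scrub
end

# Made-up sample: valid text followed by a truncated multi-byte sequence.
sample = "Gesundheit für alle ".b + "\xEF\xBF".b
clean = safe_utf8(sample)

puts clean.valid_encoding?     # true
puts clean.include?("für")     # true: valid characters are kept
puts clean
```

Working on a `dup` keeps the cached or stored string untouched, which may or may not matter in the plugin's context.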

Can you reproduce the error with my sample file or is this something setup-specific? I am using the following versions:

Environment:
  Redmine version                5.0.6.stable
  Ruby version                   3.0.2-p107 (2021-07-07) [x86_64-linux-gnu]
  Rails version                  6.1.7.6
  Environment                    production
  Database adapter               Mysql2
Plugins:
redmine_xapian                 3.0.4

I think this is related to #111.

Thanks for this very helpful plugin!

picman commented 8 months ago

I think it could be encoded one step earlier. Could you test the devel branch?

tgoeg commented 8 months ago

I can confirm devel solves it and the outcome is the same!

Just out of interest: Do you understand why this is considered invalid UTF-8 if pdftotext does seem to output valid UTF-8 chars as a substitute for what indeed seem to be invalid characters in the source doc?

picman commented 7 months ago

No, I don't.

tgoeg commented 7 months ago

Alright, then it wasn't just me pulling my (non-existent) hair out to find an actual reason :-) Thanks again, I'll close this issue, then.