Open Jasmeet107 opened 4 years ago
I can imagine ways that the logic here can result in situations with an unreasonably large array.
100,000 x 100,000 is VERY large, and I'm trying to imagine just how small the characters would need to be for that size to be the result.
Could you try with the version I released in the past 24 hours (2.4.0)? It fixes the character width calculation for some non-embedded fonts and there's a chance it will improve your situation.
On Fri, 22 Nov. 2019, 04:59 Jasmeet Arora, notifications@github.com wrote:
When a PDF has a lot of really small characters in it, reading the PDF can use a huge amount of memory.

In `page_layout.rb`:

```ruby
def to_s
  return "" if @runs.empty?

  page = row_count.times.map { |i| " " * col_count }
  @runs.each do |run|
    x_pos = ((run.x - @x_offset) / col_multiplier).round
    y_pos = row_count - (run.y / row_multiplier).round
    if y_pos <= row_count && y_pos >= 0 && x_pos <= col_count && x_pos >= 0
      local_string_insert(page[y_pos-1], run.text, x_pos)
    end
  end
  interesting_rows(page).map(&:rstrip).join("\n")
end
```

Specifically, the second line of this method's body (`page = row_count.times.map { ... }`) creates a really large array in some cases.

The dimensions of this array are calculated using `internal_show_text` in `page_text_receiver.rb`, which uses the average font size of all characters. When there are a lot of characters with tiny font sizes (as PDFs often have, e.g. small symbols), the widths and heights of the rows and columns used to build this array become extremely small, resulting in arrays of 100k x 100k or even more. An array that large crashes our application in these cases.

@EugeneNF has also been working on this.
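To put rough numbers on the scale described above, here is a minimal sketch. It assumes the grid is sized by dividing the page dimensions by the mean glyph width and mean font size, and that each cell costs about one byte. That is an approximation of the behaviour described in this report, not pdf-reader's exact formula. The `estimated_grid` helper and the sample sizes are hypothetical.

```ruby
# Hypothetical helper: estimates the text grid for one page, ASSUMING
# cols ~ page_width / mean glyph width and rows ~ page_height / mean font size.
def estimated_grid(page_width, page_height, mean_glyph_width, mean_font_size)
  cols = (page_width / mean_glyph_width).floor
  rows = (page_height / mean_font_size).floor
  # One byte per cell, since each row is a String of `cols` spaces.
  { cols: cols, rows: rows, bytes: cols * rows }
end

# US Letter page (612 x 792 points) with ordinary 12pt text.
normal = estimated_grid(612.0, 792.0, 6.0, 12.0)

# Same page, but tiny glyphs drag the means down to thousandths of a point.
tiny = estimated_grid(612.0, 792.0, 0.006, 0.008)

[["normal", normal], ["tiny", tiny]].each do |label, grid|
  gib = grid[:bytes] / (1024.0**3)
  puts format("%-6s %d cols x %d rows, about %.4f GiB", label, grid[:cols], grid[:rows], gib)
end
```

Under these assumptions the ordinary page needs only a few kilobytes, while the tiny-glyph page lands at roughly 100k x 100k cells, which is on the order of 10 GB before any per-object overhead.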
Just tried 2.4.0 and I'm still seeing the memory spike. We work around this issue by implementing a custom PageTextReceiver that ignores characters with a very small font size:

```ruby
class PdfTextReceiver < PDF::Reader::PageTextReceiver
  def show_text(string)
    # ...
    unless utf8_chars == SPACE || @state.font_size < 1
      @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
    end
    # ...
  end
end
```
Cool, that seems like a good workaround for now.
If you want to upstream something, I'd be happy to accept a patch that changes PageLayout to optionally drop characters that are very small. Something like this:
```ruby
class PDF::Reader
  class PageLayout
    def initialize(runs, mediabox, minimum_font_size: 0)
      runs = runs.reject { |run| run.font_size < minimum_font_size }
      # ...
    end
  end
end
```
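As a rough illustration of what that filter would do, here is a self-contained sketch. The `Run` struct is a stand-in for `PDF::Reader::TextRun` reduced to a few fields, and `drop_small_runs` is a hypothetical helper rather than part of pdf-reader. Only the `reject` line comes from the suggestion above.

```ruby
# Stand-in for PDF::Reader::TextRun, reduced to the fields used here.
Run = Struct.new(:x, :y, :font_size, :text)

# Hypothetical helper mirroring the proposed minimum_font_size option.
# With the default of 0, no run with a non-negative font size is dropped.
def drop_small_runs(runs, minimum_font_size: 0)
  runs.reject { |run| run.font_size < minimum_font_size }
end

# Mean font size of a list of runs (0.0 for an empty list).
def mean_font_size(runs)
  return 0.0 if runs.empty?
  runs.sum(&:font_size) / runs.size
end

runs = [
  Run.new(72, 700, 12.0, "Hello"),
  Run.new(80, 700, 0.01, "."),
  Run.new(90, 700, 0.5, "x"),
]

kept = drop_small_runs(runs, minimum_font_size: 1)

puts kept.map(&:text).inspect  # prints ["Hello"]
puts format("mean before: %.2f, mean after: %.2f", mean_font_size(runs), mean_font_size(kept))
```

Dropping the tiny runs before any averaging keeps the mean font size close to the size of the visible text, so the grid dimensions derived from it stay bounded.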
This small ~1KB PDF file almost instantly causes OOM. Tested on latest master.
I haven't investigated further. It is possible that this is an unrelated issue. The file renders fine in other PDF readers.
```
$ ls -la fuzz.pdf
-rw-rw-r-- 1 user user 1191 Apr 17 00:55 fuzz.pdf
$ ./tools/read-pdf.rb fuzz.pdf
[*] Processing 'fuzz.pdf'
[+] Processing complete
[*] Parsing 'fuzz.pdf'
[*] Version: 1.3
[*] Info: {:Creator=>"Prawn", :Producer=>"Prawn"}
[*] Metadata:
[*] Objects: <PDF::Reader::ObjectHash size: 5>
[*] Pages: 1
[*] Parsing PDF contents...
Killed
```

```
[2293925.772727] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=ruby,pid=975755,uid=1000
[2293925.772754] Out of memory: Killed process 975755 (ruby) total-vm:5572072kB, anon-rss:4907892kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:10892kB oom_score_adj:0
[2293925.953717] oom_reaper: reaped process 975755 (ruby), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
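For a sense of scale, the memory figures in the kernel log can be converted from kB to GiB. This is a small sketch that parses a shortened copy of the "Out of memory" line above. The regular expression and variable names are illustrative only.

```ruby
# Kernel OOM line from the report, shortened to the two largest memory fields.
line = "Out of memory: Killed process 975755 (ruby) total-vm:5572072kB, anon-rss:4907892kB"

# Collect each "<name>:<number>kB" pair into a hash of name => kilobytes.
fields = line.scan(/([\w-]+):(\d+)kB/).to_h { |name, kb| [name, kb.to_i] }

fields.each do |name, kb|
  gib = kb / (1024.0 * 1024.0)
  puts format("%-10s %.2f GiB", name, gib)
end
```

The process held about 4.7 GiB of resident memory when it was killed, all from parsing a 1191-byte file.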