yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

creating page layout uses a lot of memory #314

Open Jasmeet107 opened 4 years ago

Jasmeet107 commented 4 years ago

When a PDF contains a lot of very small characters, reading it can use a huge amount of memory.

In page_layout.rb:

def to_s
  return "" if @runs.empty?

  page = row_count.times.map { |i| " " * col_count }
  @runs.each do |run|
    x_pos = ((run.x - @x_offset) / col_multiplier).round
    y_pos = row_count - (run.y / row_multiplier).round
    if y_pos <= row_count && y_pos >= 0 && x_pos <= col_count && x_pos >= 0
      local_string_insert(page[y_pos-1], run.text, x_pos)
    end
  end
  interesting_rows(page).map(&:rstrip).join("\n")
end

Specifically, the second line of this method (the `page = ...` assignment) creates a really large array in some cases.

The dimensions of this array are calculated using internal_show_text in page_text_receiver.rb, which uses the average font size of all characters. When there are a lot of characters with tiny font sizes (as PDFs often have, for small symbols etc.), the row heights and column widths used to build this array end up very small, which results in arrays of 100k x 100k or even more. An array that big is crashing our application in these cases.
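
To put a number on that, here is a rough back-of-the-envelope sketch. `estimated_grid_bytes` and `human_size` are hypothetical helpers written for this comment, not part of PDF::Reader, and the estimate assumes one byte per space character while ignoring Ruby's per-String overhead.

```ruby
# Hypothetical helper (not part of PDF::Reader): estimates the bytes needed
# for a blank grid of row_count strings, each col_count spaces wide.
def estimated_grid_bytes(row_count, col_count)
  # One byte per ASCII space; per-String object overhead is ignored.
  row_count * col_count
end

# Formats a byte count as GiB with one decimal place.
def human_size(bytes)
  format("%.1f GiB", bytes / (1024.0**3))
end

puts human_size(estimated_grid_bytes(100_000, 100_000))
```

Even before any text is inserted, a 100k x 100k grid of spaces needs on the order of 10 GB for the character data alone.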

@EugeneNF has also been working on this

yob commented 4 years ago

I can imagine ways that the logic here can result in situations with an unreasonably large array.

100,000 x 100,000 is VERY large, and I'm trying to imagine just how small the characters would need to be for that size to be the result.
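
As a rough way to reason about that, the sketch below assumes (as the report describes, but not checked against the library source here) that the row count is approximately the page height divided by the average font size. The 792 pt page height (US Letter) and the helper name `mean_font_size_for_rows` are illustrative assumptions.

```ruby
# Assumption: row count is roughly page_height / mean_font_size.
# 792 pt is the height of a US Letter page; other page sizes will differ.
US_LETTER_HEIGHT_PT = 792.0

# Hypothetical helper: the average font size that would produce `rows` rows.
def mean_font_size_for_rows(rows, page_height: US_LETTER_HEIGHT_PT)
  page_height / rows
end

[100, 1_000, 100_000].each do |rows|
  size = mean_font_size_for_rows(rows)
  puts format("%7d rows -> mean font size of about %.5f pt", rows, size)
end
```

Under that assumption, 100,000 rows on a 792 pt page would need an average font size below 0.01 pt.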

Could you try with the version I released in the past 24 hours (2.4.0)? It fixes the character width calculation for some non-embedded fonts and there's a chance it will improve your situation.


EugeneNF commented 4 years ago

Just tried 2.4.0 and we are still seeing the memory spike. We work around this issue by implementing a custom PageTextReceiver that ignores characters with a very small font size:

class PdfTextReceiver < PDF::Reader::PageTextReceiver
  def show_text(string)
    ...
    unless utf8_chars == SPACE || @state.font_size < 1
      @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
    end
  ...

yob commented 4 years ago

Cool, that seems like a good workaround for now.

If you want to upstream something, I'd be happy to accept a patch that changes PageLayout to optionally drop characters that are very small. Something like this:

class PDF::Reader
  class PageLayout
    def initialize(runs, mediabox, minimum_font_size: 0)
      runs = runs.reject { |run| run.font_size < minimum_font_size } 
      ...
    end
  end
end
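
The filtering idea can be tried in isolation. This is a minimal sketch, not the library's implementation: `Run` is a stand-in Struct for PDF::Reader::TextRun with only the fields needed here, and `drop_tiny_runs` is a hypothetical helper name.

```ruby
# Stand-in for PDF::Reader::TextRun; only the fields used in this sketch.
Run = Struct.new(:x, :y, :font_size, :text)

# Hypothetical helper: drops runs whose font size is below the threshold.
# The default of 0 keeps every run with a non-negative font size.
def drop_tiny_runs(runs, minimum_font_size: 0)
  runs.reject { |run| run.font_size < minimum_font_size }
end

runs = [
  Run.new(10, 700, 12, "Hello"),
  Run.new(12, 700, 0.01, "."), # tiny glyph that would shrink the averages
  Run.new(60, 700, 12, "World")
]

kept = drop_tiny_runs(runs, minimum_font_size: 1)
puts kept.map(&:text).join(" ")
```

Defaulting the threshold to 0 means existing callers see no change unless they opt in.
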
bcoles commented 2 years ago

This small ~1KB PDF file almost instantly causes OOM. Tested on latest master.

I haven't investigated further. It is possible that this is an unrelated issue. The file renders fine in other PDF readers.

fuzz.pdf

$ ls -la fuzz.pdf 
-rw-rw-r-- 1 user user 1191 Apr 17 00:55 fuzz.pdf

$ ./tools/read-pdf.rb fuzz.pdf 
[*] Processing 'fuzz.pdf'
[+] Processing complete
[*] Parsing 'fuzz.pdf'
[*] Version: 1.3
[*] Info: {:Creator=>"Prawn", :Producer=>"Prawn"}
[*] Metadata: 
[*] Objects: <PDF::Reader::ObjectHash size: 5>
[*] Pages: 1
[*] Parsing PDF contents...
Killed
[2293925.772727] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=ruby,pid=975755,uid=1000
[2293925.772754] Out of memory: Killed process 975755 (ruby) total-vm:5572072kB, anon-rss:4907892kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:10892kB oom_score_adj:0
[2293925.953717] oom_reaper: reaped process 975755 (ruby), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB