yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Reduce allocations when parsing hex strings #528

Closed by yob 9 months ago

yob commented 9 months ago

Running a script based on one shared by Aaron at [1], I noticed we allocate a surprising number of objects when parsing hex strings.

When parsing a file with lots of hex strings, the allocations.rb script (see below) shows the hex_string method as the top source of allocations. We can fix that!

before

$ ruby allocations.rb | head -n 10
                      sourcefile                        sourceline                   class                   count
------------------------------------------------------  ----------  ---------------------------------------  -----
<PWD>/lib/pdf/reader/parser.rb                                 176  Array                                    65246
<PWD>/lib/pdf/reader/parser.rb                                 176  String                                   63124
<PWD>/lib/pdf/reader/parser.rb                                 177  String                                   53500
<PWD>/lib/pdf/reader/buffer.rb                                 362  String                                   41386
<PWD>/lib/pdf/reader/buffer.rb                                 384  String                                   27386
<PWD>/lib/pdf/reader/transformation_matrix.rb                   20  Array                                    19238
<PWD>/lib/pdf/reader/page_state.rb                             243  Array                                    14846
<PWD>/lib/pdf/reader/encoding.rb                               143  Array                                    14336

$ ruby benchmark.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
                         1.000 i/100ms
Calculating -------------------------------------
                          1.973 (± 0.0%) i/s -     20.000 in  10.135409s
{:ALLOCATIONS=>772349}

after

$ ruby allocations.rb | head -n 10
                      sourcefile                        sourceline                   class                   count
------------------------------------------------------  ----------  ---------------------------------------  -----
<PWD>/lib/pdf/reader/buffer.rb                                 362  String                                   41386
<PWD>/lib/pdf/reader/buffer.rb                                 384  String                                   27386
<PWD>/lib/pdf/reader/transformation_matrix.rb                   20  Array                                    19238
<internal:pack>                                                  8  String                                   17047
<PWD>/lib/pdf/reader/page_state.rb                             243  Array                                    14846
<PWD>/lib/pdf/reader/encoding.rb                               143  Array                                    14336
<PWD>/lib/pdf/reader/page_state.rb                             342  PDF::Reader::TransformationMatrix        10743
<PWD>/lib/pdf/reader/transformation_matrix.rb                  115  Array                                    10641

$ ruby benchmark.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
                         1.000 i/100ms
Calculating -------------------------------------
                          2.097 (± 0.0%) i/s -     21.000 in  10.017634s
{:ALLOCATIONS=>561561}
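
For readers skimming the numbers: the parser.rb rows disappear from the top of the profile and an `<internal:pack>` row appears, which suggests the hex digits are now decoded with a single pack call instead of pair by pair. Below is a minimal sketch of that idea, not the actual patch. `decode_hex_pairs` and `decode_hex_pack` are hypothetical names, and the pair-by-pair version is only an assumption about what an allocation-heavy decoder could look like.

```ruby
# Hypothetical helpers for illustration only; this is not the code in
# lib/pdf/reader/parser.rb.

# Pair-by-pair decoding: allocates an Array of two-character Strings
# plus one String per decoded byte.
def decode_hex_pairs(hex)
  hex = hex.delete(" \t\r\n\f")   # whitespace inside a hex string is ignored
  hex << "0" if hex.length.odd?   # a missing final digit is treated as 0
  hex.scan(/../).map { |pair| pair.hex.chr }.join
end

# Single-call decoding: Array#pack("H*") converts every digit at once,
# so the number of allocations no longer grows with the string length.
def decode_hex_pack(hex)
  hex = hex.delete(" \t\r\n\f")
  hex << "0" if hex.length.odd?
  [hex].pack("H*")
end

# Same counting approach as the allocations helper in benchmark.rb.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

sample = "48656c6c6f" * 1_000
puts "pairs: #{allocations { decode_hex_pairs(sample) }} objects"
puts "pack:  #{allocations { decode_hex_pack(sample) }} objects"
```

Both helpers return the same bytes; only the number of intermediate objects differs, which is the kind of difference the before and after profiles show.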

benchmark.rb

$ cat benchmark.rb
#!/usr/bin/env ruby

$LOAD_PATH << "lib"
require "pdf/reader"
require "benchmark/ips"

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def go
  doc = PDF::Reader.new(File.join(File.dirname(__FILE__), "spec/data/cairo-unicode.pdf"))
  doc.pages.each do |page|
    page.text # extract the text but do nothing with it
  end
end

Benchmark.ips { |x|
  x.config(:time => 10, :warmup => 5)
  x.report {
    go
  }
}
p ALLOCATIONS: allocations { go }

allocations.rb

$ cat allocations.rb
#!/usr/bin/env ruby

$LOAD_PATH << "lib"
require "pdf/reader"
require "allocation_stats"

FILENAME = File.join(File.dirname(__FILE__), "spec/data/cairo-unicode.pdf")

def go
  doc = PDF::Reader.new(FILENAME)
  doc.pages.each do |page|
    page.text # extract the text but do nothing with it
  end
end

stats = AllocationStats.trace { go }
puts stats.allocations(alias_paths: true).group_by(:sourcefile, :sourceline, :class).sort_by_size.to_text

[1] https://tenderlovemaking.com/2023/09/02/fast-tokenizers-with-stringscanner.html