yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Rails 6.1.4.4 Mediabox is returned as nil #408

Closed mtownsen closed 2 years ago

mtownsen commented 2 years ago

I have a PDF parser that is no longer working after upgrading to Rails 6.1.4.4. I tracked the issue to a nil reference on the mediabox field for the PageTextReceiver. I went back to Rails 6.1.4.3 and the issue went away using the same version of pdf-reader.

yob commented 2 years ago

That's surprising! It's hard to understand how a rails bump with no change to pdf-reader would have this impact.

Were you able to get to the bottom of the issue?

mtownsen commented 2 years ago

Not yet. It’s very odd. I’m not seeing anything in that rails release that could have anything to do with this either from reviewing the change log and edits.

yob commented 2 years ago

Is it possible to share a stack trace that shows where in pdf-reader the exception is coming from?

mtownsen commented 2 years ago

I am getting an exception out of PageLayout stating "must not be nil". I pulled a backtrace off the exception in hopes it would help.

["/usr/local/bundle/gems/pdf-reader-2.7.0/lib/pdf/reader/error.rb:55:in `validate_not_nil'",
"/usr/local/bundle/gems/pdf-reader-2.7.0/lib/pdf/reader/page_layout.rb:22:in `initialize'",
"/var/www/fj/current/app/lib/PDFProcessor.rb:23:in `new'",
"/var/www/fj/current/app/lib/PDFProcessor.rb:23:in `block in process'",
"/var/www/fj/current/app/lib/PDFProcessor.rb:17:in `each'",
"/var/www/fj/current/app/lib/PDFProcessor.rb:17:in `process'",
"/var/www/fj/current/app/lib/Linkedin.rb:14:in `process'",
"/var/www/fj/current/app/controllers/profiles_controller.rb:653:in `parse_linkedinpdf'",
"/var/www/fj/current/app/controllers/profiles_controller.rb:149:in `update'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal/basic_implicit_render.rb:6:in `send_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/abstract_controller/base.rb:228:in `process_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal/rendering.rb:30:in `process_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/abstract_controller/callbacks.rb:42:in `block in process_action'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/callbacks.rb:117:in `block in run_callbacks'",
"/usr/local/bundle/gems/actiontext-6.1.4.4/lib/action_text/rendering.rb:20:in `with_renderer'",
"/usr/local/bundle/gems/actiontext-6.1.4.4/lib/action_text/engine.rb:59:in `block (4 levels) in <class:Engine>'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/callbacks.rb:126:in `instance_exec'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/callbacks.rb:126:in `block in run_callbacks'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/callbacks.rb:137:in `run_callbacks'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/abstract_controller/callbacks.rb:41:in `process_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal/rescue.rb:22:in `process_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal/instrumentation.rb:34:in `block in process_action'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/notifications.rb:203:in `block in instrument'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/notifications/instrumenter.rb:24:in `instrument'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/notifications.rb:203:in `instrument'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal/instrumentation.rb:33:in `process_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal/params_wrapper.rb:249:in `process_action'",
"/usr/local/bundle/gems/activerecord-6.1.4.4/lib/active_record/railties/controller_runtime.rb:27:in `process_action'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/abstract_controller/base.rb:165:in `process'",
"/usr/local/bundle/gems/actionview-6.1.4.4/lib/action_view/rendering.rb:39:in `process'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal.rb:190:in `dispatch'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_controller/metal.rb:254:in `dispatch'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/routing/route_set.rb:50:in `dispatch'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/routing/route_set.rb:33:in `serve'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/journey/router.rb:50:in `block in serve'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/journey/router.rb:32:in `each'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/journey/router.rb:32:in `serve'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/routing/route_set.rb:842:in `call'",
"/usr/local/bundle/gems/meta_request-0.7.3/lib/meta_request/middlewares/app_request_handler.rb:15:in `call'",
"/usr/local/bundle/gems/meta_request-0.7.3/lib/meta_request/middlewares/meta_request_handler.rb:15:in `call'",
"/usr/local/bundle/gems/bullet-6.1.5/lib/bullet/rack.rb:12:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/tempfile_reaper.rb:15:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/etag.rb:27:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/conditional_get.rb:40:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/head.rb:12:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/http/permissions_policy.rb:22:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/http/content_security_policy.rb:18:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/session/abstract/id.rb:266:in `context'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/session/abstract/id.rb:260:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/cookies.rb:689:in `call'",
"/usr/local/bundle/gems/activerecord-6.1.4.4/lib/active_record/migration.rb:601:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/callbacks.rb:27:in `block in call'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/callbacks.rb:98:in `run_callbacks'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/callbacks.rb:26:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/executor.rb:14:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/actionable_exceptions.rb:18:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/debug_exceptions.rb:29:in `call'",
"/usr/local/bundle/gems/rack-contrib-2.3.0/lib/rack/contrib/response_headers.rb:19:in `call'",
"/usr/local/bundle/gems/meta_request-0.7.3/lib/meta_request/middlewares/headers.rb:18:in `call'",
"/usr/local/bundle/gems/web-console-4.2.0/lib/web_console/middleware.rb:132:in `call_app'",
"/usr/local/bundle/gems/web-console-4.2.0/lib/web_console/middleware.rb:19:in `block in call'",
"/usr/local/bundle/gems/web-console-4.2.0/lib/web_console/middleware.rb:17:in `catch'",
"/usr/local/bundle/gems/web-console-4.2.0/lib/web_console/middleware.rb:17:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/show_exceptions.rb:33:in `call'",
"/usr/local/bundle/gems/railties-6.1.4.4/lib/rails/rack/logger.rb:37:in `call_app'",
"/usr/local/bundle/gems/railties-6.1.4.4/lib/rails/rack/logger.rb:26:in `block in call'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/tagged_logging.rb:99:in `block in tagged'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/tagged_logging.rb:37:in `tagged'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/tagged_logging.rb:99:in `tagged'",
"/usr/local/bundle/gems/railties-6.1.4.4/lib/rails/rack/logger.rb:26:in `call'",
"/usr/local/bundle/gems/sprockets-rails-3.4.2/lib/sprockets/rails/quiet_assets.rb:13:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/remote_ip.rb:81:in `call'",
"/usr/local/bundle/gems/request_store-1.5.0/lib/request_store/middleware.rb:19:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/request_id.rb:26:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/method_override.rb:24:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/runtime.rb:22:in `call'",
"/usr/local/bundle/gems/activesupport-6.1.4.4/lib/active_support/cache/strategy/local_cache_middleware.rb:29:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/executor.rb:14:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/static.rb:24:in `call'",
"/usr/local/bundle/gems/rack-2.2.3/lib/rack/sendfile.rb:110:in `call'",
"/usr/local/bundle/gems/actionpack-6.1.4.4/lib/action_dispatch/middleware/host_authorization.rb:113:in `call'",
"/usr/local/bundle/gems/utf8-cleaner-1.0.0/lib/utf8-cleaner/middleware.rb:21:in `call'",
"/usr/local/bundle/gems/webpacker-4.3.0/lib/webpacker/dev_server_proxy.rb:23:in `perform_request'",
"/usr/local/bundle/gems/rack-proxy-0.7.0/lib/rack/proxy.rb:63:in `call'",
"/usr/local/bundle/gems/railties-6.1.4.4/lib/rails/engine.rb:539:in `call'",
"/usr/local/bundle/gems/puma-5.5.2/lib/puma/configuration.rb:249:in `call'",
"/usr/local/bundle/gems/puma-5.5.2/lib/puma/request.rb:77:in `block in handle_request'",
"/usr/local/bundle/gems/puma-5.5.2/lib/puma/thread_pool.rb:340:in `with_force_shutdown'",
"/usr/local/bundle/gems/puma-5.5.2/lib/puma/request.rb:76:in `handle_request'",
"/usr/local/bundle/gems/puma-5.5.2/lib/puma/server.rb:447:in `process_client'",
"/usr/local/bundle/gems/puma-5.5.2/lib/puma/thread_pool.rb:147:in `block in spawn_thread'"]

yob commented 2 years ago

pdf-reader 2.7.0 did change the way we assert that the mediabox is non-nil: https://github.com/yob/pdf-reader/blob/df93850d7e20056c55451aa4c476069dd2497dc5/lib/pdf/reader/page_layout.rb#L22

Here's how we did it in 2.6.0: https://github.com/yob/pdf-reader/blob/e3a02e081d1a5ca440d961bffb2eaec9de48e25e/lib/pdf/reader/page_layout.rb#L19

I'd expect them to behave identically though (with a slightly different error message).

Did you upgrade to pdf-reader v2.7.0 at the same time as upgrading Rails to 6.1.4.4? Can you add some debugging just before app/lib/PDFProcessor.rb:23 to confirm what arguments you're passing to PageLayout? Is the second argument nil?
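
For readers following along, the kind of debugging asked for here could look roughly like the sketch below. The helper name describe_layout_args is hypothetical and not part of pdf-reader; it only summarises the two values about to be handed to PageLayout.new so a nil second argument is easy to spot in the log.

```ruby
# Hypothetical debugging helper (not part of pdf-reader): summarises the two
# arguments that are about to be passed to PageLayout.new.
def describe_layout_args(characters, mediabox)
  # Guard against characters itself being nil or not a collection.
  count = characters.respond_to?(:size) ? characters.size : 0
  "characters=#{characters.class} (#{count} items), mediabox=#{mediabox.inspect}"
end

# Example with a populated character list and a nil mediabox,
# the combination reported in this issue.
puts describe_layout_args(%w[a b], nil)
# prints: characters=Array (2 items), mediabox=nil
```

In an application this string would typically go to the logger on the line just before the PageLayout.new call.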

mtownsen commented 2 years ago

We had previously upgraded to version 2.7.0 without any issues a while back. Originally, I thought it was possible the logic changed as part of an upgrade until I realized the same version was working on Rails 6.1.4.3 for us.

I am passing in the characters and mediabox received from PageTextReceiver. The characters are populated, but the mediabox is being returned as nil, which is triggering the exception.

yob commented 2 years ago

I'm a bit stumped on this one!

Are you able to create an isolated reproduction that I can run?

mtownsen commented 2 years ago

Ditto. It really does not make sense that that one arbitrary Rails upgrade broke this. It was a minor release with not much in it. I will continue to hack on the code to figure out what is going on. Perhaps it's something in my stack. We did a few Ruby version upgrades around the same time but those did not appear to break it either. I am sure it's something in my stack.

The code I am using was mostly pulled from the below link. Then adapted to take the returned lines and sift thru them for the data points needed. It contains a demo script near the bottom if you are so inclined.

http://blog.peschla.net/2014/04/parsing-pdf-text-with-coordinates-in-ruby/

mrVVoo commented 2 years ago

Hi! I'm seeing the same issue outside of Rails. I stumbled across this commit and wonder whether the removal of the mediabox instance variable causes the error. Line 62 of the last example in the referenced tutorial passes the mediabox from a subclass of PageTextReceiver (where it is no longer available as an instance variable) to a subclass of PageLayout, which requires it.

Changing text_receiver.mediabox to page.rectangles[:MediaBox] results in no error, but I didn't check whether the results are equal across versions 2.6.0 and 2.8.0.
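
The failure mode described in this comment can be reproduced in plain Ruby. The sketch below uses stand-in classes only: OldStyleReceiver, NewStyleReceiver and the two tutorial-style subclasses are hypothetical and are not pdf-reader's actual code. It shows that a subclass exposing a parent's instance variable through attr_reader quietly starts returning nil once the parent stops assigning that variable.

```ruby
# Stand-in for the parent class BEFORE the change: it keeps the page's
# MediaBox in an instance variable (hypothetical, not pdf-reader code).
class OldStyleReceiver
  def page=(page)
    @mediabox = page[:MediaBox]
  end
end

# Stand-in for the parent class AFTER the change: the instance variable
# is no longer assigned (hypothetical, not pdf-reader code).
class NewStyleReceiver
  def page=(page)
    @page = page
  end
end

# Tutorial-style subclasses that expose the parent's ivar via attr_reader.
class OldTutorialReceiver < OldStyleReceiver
  attr_reader :mediabox
end

class NewTutorialReceiver < NewStyleReceiver
  attr_reader :mediabox
end

# Feed the same page to a receiver class and return what #mediabox reports.
def mediabox_seen_by(receiver_class)
  receiver = receiver_class.new
  receiver.page = { MediaBox: [0, 0, 612, 792] }
  receiver.mediabox
end

p mediabox_seen_by(OldTutorialReceiver) # prints [0, 0, 612, 792]
p mediabox_seen_by(NewTutorialReceiver) # prints nil
```

No error is raised at the point where the value goes missing, which is consistent with the exception only surfacing later, when the nil is handed to PageLayout.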

yob commented 2 years ago

good find!

I definitely didn't consider subclasses when considering if #402 was a non-breaking change.

> Changing text_receiver.mediabox to page.rectangles[:MediaBox] results in no error, but I didn't check whether the results are equal across versions 2.6.0 and 2.8.0.

This is my recommended fix, and my expectation is that it should continue to provide the same results between 2.6.0 and 2.8.0. The only caveat is that if PageLayout is also subclassed, I can't guarantee it will work with the MediaBox provided as a Rectangle object rather than an Array.
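
A minimal sketch of the recommended change, for illustration. The helper page_layout_args is hypothetical; it relies only on the page responding to rectangles (as in page.rectangles[:MediaBox] above) and on the receiver exposing characters the way the tutorial's subclass does. CustomReceiver and CustomLayout in the usage comment stand for the tutorial-style subclasses and are likewise assumptions.

```ruby
# Hypothetical helper: collects the two PageLayout arguments, taking the
# MediaBox from the page itself rather than from the text receiver.
def page_layout_args(page, receiver)
  mediabox = page.rectangles[:MediaBox]
  # Fail early with a clear message instead of passing nil along.
  raise ArgumentError, 'page has no MediaBox' if mediabox.nil?

  [receiver.characters, mediabox]
end

# Intended use with pdf-reader (not executed here; CustomReceiver and
# CustomLayout are assumed tutorial-style subclasses):
#
#   reader.pages.each do |page|
#     receiver = CustomReceiver.new
#     page.walk(receiver)
#     layout = CustomLayout.new(*page_layout_args(page, receiver))
#   end
```

As the caveat above notes, a PageLayout subclass that treats the MediaBox as an Array may need adjusting, because the value obtained this way is a Rectangle object.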

mrVVoo commented 2 years ago

> This is my recommended fix, and my expectation is it should continue to provide the same results between 2.6.0 and 2.8.0. The only caveat is that if PageLayout is also subclassed then I can't guarantee it will work with the MediaBox provided as a Rectangle object rather than an Array.

Unfortunately it is. group_chars_into_runs was redefined in this example in order to "filter out duplicate chars before going on with regular logic" as stated in code comments in the tutorial. Anyway, perhaps @mtownsen can figure this out in more detail. I intend to use that in a new project so I cannot compare to older versions.

yob commented 2 years ago

I'm going to close this issue as I'm not planning to revert the internal @mediabox change in PDF::Reader::PageTextReceiver. Keeping the internals of classes unchanged so subclasses continue to work isn't viable for a (mostly) one person project.

Thanks again for tracking down the issue @mrVVoo.

mtownsen commented 2 years ago

@yob Sorry for the delay. Thank you for your help in tracking this down. Great product and excited we can continue to use it. Thank you @mrVVoo for tracking this down! This is a huge help, and after implementing your solution the code works as it did before. I am using the mediabox to calculate the position of text blocks within a PDF; all the calculations return the same values as before and it is parsing correctly. Reach out if you have any questions about what I was doing when you work through your project.