yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Getting Hyperlink Text #342

Open kepp14 opened 3 years ago

kepp14 commented 3 years ago

Hi! I'm having trouble getting the text associated with a hyperlink in my PDF. By that I mean I have some text in my PDF, say SampleHyperlinkHere, that opens another PDF when clicked. I'm able to get the PDF attached to the hyperlink using this script (https://gist.github.com/danlucraft/5277732#gistcomment-2675302), but I want to be able to tell which attachment belongs to which text.

For example I have this page with 16 Annots:

irb(main):475:0> annots = page.attributes[:Annots]
=> #<PDF::Reader::Reference:0x00007fed3d4f1780 @id=59, @gen=0>
irb(main):476:0> objects = page.objects[annots]
=> [#<PDF::Reader::Reference:0x00007fed3d421350 @id=60, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420e78 @id=61, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420cc0 @id=62, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420b08 @id=63, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420950 @id=64, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420478 @id=65, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d4202c0 @id=66, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d420108 @id=67, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42bbe8 @id=68, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42ba30 @id=69, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b878 @id=70, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b3a0 @id=71, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b1e8 @id=72, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42b030 @id=73, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42ae78 @id=74, @gen=0>, #<PDF::Reader::Reference:0x00007fed3d42a978 @id=75, @gen=0>]
irb(main):477:0> objects.count
=> 16

and I notice that 8 of those are links (as expected) and I'm able to grab the attachment from the link for those 8 just fine.

irb(main):478:0> objects.each do |o|
irb(main):479:1* puts page.objects[o]
irb(main):480:1> end
{:A=>#<PDF::Reader::Reference:0x00007fed3d532c08 @id=391, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[296.7999, 702.28, 357.6099, 713.28]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d53b560 @id=384, @gen=0>, :Rect=>[295.2999, 700.78, 359.1099, 703.78], :AP=>#<PDF::Reader::Reference:0x00007fed3d53a570 @id=385, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d53a228 @id=386, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[296.7999, 702.28, 357.6099, 702.28]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d542c98 @id=377, @gen=0>, :Rect=>[238.0749, 688.65, 301.845, 691.65], :AP=>#<PDF::Reader::Reference:0x00007fed3d541d98 @id=378, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d541b40 @id=379, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[239.5749, 690.15, 300.345, 690.15]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d54aa88 @id=370, @gen=0>, :Rect=>[88.525, 584.9, 152.3049, 587.9], :AP=>#<PDF::Reader::Reference:0x00007fed3d549c28 @id=371, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d5499d0 @id=372, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[90.025, 586.4, 150.8049, 586.4]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d552828 @id=363, @gen=0>, :Rect=>[135.3999, 597.03, 199.21, 600.03], :AP=>#<PDF::Reader::Reference:0x00007fed3d551978 @id=364, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d5516f8 @id=365, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[136.8999, 598.53, 197.71, 598.53]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d55a5a0 @id=356, @gen=0>, :Rect=>[325.8299, 517.38, 389.98, 520.38], :AP=>#<PDF::Reader::Reference:0x00007fed3d5596f0 @id=357, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d559498 @id=358, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[327.3299, 518.88, 388.48, 518.88]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d5623e0 @id=349, @gen=0>, :Rect=>[416.07, 461.76, 479.85, 464.76], :AP=>#<PDF::Reader::Reference:0x00007fed3d561558 @id=350, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d561300 @id=351, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[417.57, 463.26, 478.35, 463.26]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d56a220 @id=342, @gen=0>, :Rect=>[153.287, 386.3599, 217.0671, 389.3599], :AP=>#<PDF::Reader::Reference:0x00007fed3d569398 @id=343, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d569140 @id=344, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[154.787, 387.8599, 215.5671, 387.8599]}
{:Type=>false, :Subtype=>:Line, :BS=>#<PDF::Reader::Reference:0x00007fed3d572010 @id=335, @gen=0>, :Rect=>[313.82, 330.85, 377.6, 333.85], :AP=>#<PDF::Reader::Reference:0x00007fed3d571160 @id=336, @gen=0>, :"AAPL:AKExtras"=>#<PDF::Reader::Reference:0x00007fed3d570f08 @id=337, @gen=0>, :C=>[0, 0, 1], :IC=>[0, 0, 1], :F=>4, :L=>[315.32, 332.35, 376.1, 332.35]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d57a648 @id=334, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[239.5749, 690.15, 300.345, 701.15]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d583d38 @id=333, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[90.025, 586.4, 150.8049, 597.4]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d5814c0 @id=332, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[136.8999, 598.53, 197.71, 609.53]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d58ab88 @id=331, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[327.3299, 518.88, 388.48, 529.88]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d588310 @id=330, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[417.57, 463.26, 478.35, 474.26]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d591a00 @id=329, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[191.9299, 387.8599, 252.71, 398.8599]}
{:A=>#<PDF::Reader::Reference:0x00007fed3d59b140 @id=328, @gen=0>, :Border=>[0, 0, 0], :Type=>true, :Subtype=>:Link, :Rect=>[315.32, 332.35, 376.1, 343.35]}

Is there a way to use the other 8 annotations to get the text associated with the hyperlinks, or another way that I'm missing? Appreciate the help!

yob commented 3 years ago

I'm not super familiar with the annotation options. However, my guess is the 8 Line annotations won't have any text associated with them. I also suspect that the text for the hyperlink is just part of the standard content stream of the page, and the Link annotations define an invisible region that sits on top of the text to handle clicks.
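To make the Link/Line distinction concrete, here is a minimal sketch that separates the Link annotations from the rest and normalises each Rect. It works on plain hashes shaped like the dereferenced output earlier in the thread (values copied from there); `link_rects` is a hypothetical helper name, not part of pdf-reader.

```ruby
# Annotation hashes shaped like the dereferenced output above.
annotations = [
  { Subtype: :Link, Rect: [296.7999, 702.28, 357.6099, 713.28] },
  { Subtype: :Line, Rect: [295.2999, 700.78, 359.1099, 703.78] },
  { Subtype: :Link, Rect: [239.5749, 690.15, 300.345, 701.15] }
]

# Hypothetical helper: keep only Link annotations and return each Rect
# with named edges. A PDF rectangle is given by two diagonally opposite
# corners, so min/max is used rather than trusting the order.
def link_rects(annotations)
  annotations
    .select { |a| a[:Subtype] == :Link }
    .map do |a|
      x1, y1, x2, y2 = a[:Rect]
      left, right = [x1, x2].minmax
      bottom, top = [y1, y2].minmax
      { left: left, bottom: bottom, right: right, top: top }
    end
end

link_rects(annotations).each { |rect| p rect }
```

The Line entries are simply dropped here, on the assumption stated above that they carry no text of their own.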

In theory it'd be possible to grab the Rect attribute from the 8 Link annotations, and then fetch only the text from the page that sits within those boundaries. Unfortunately, pdf-reader doesn't offer a nice API to do that, so you'd have to create a customised version of PDF::Reader::PageTextReceiver.
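The filtering step itself can be sketched independently of pdf-reader. The example below assumes a customised receiver has already produced a list of positioned text runs; `TextRun`, `text_in_rect`, the `tolerance` parameter and the run coordinates are all hypothetical and made up for illustration. Only the Rect is taken from the output above.

```ruby
# Hypothetical stand-in for one positioned piece of text. A customised
# receiver would have to produce these; x/y are the run's origin in
# page coordinates.
TextRun = Struct.new(:x, :y, :text)

# Return the text of every run whose origin falls inside rect.
# tolerance loosens the bounds slightly, since a text origin can sit a
# little outside the annotation rectangle.
def text_in_rect(runs, rect, tolerance: 1.0)
  x1, y1, x2, y2 = rect
  left, right = [x1, x2].minmax
  bottom, top = [y1, y2].minmax

  runs
    .select do |run|
      run.x.between?(left - tolerance, right + tolerance) &&
        run.y.between?(bottom - tolerance, top + tolerance)
    end
    .sort_by { |run| [-run.y, run.x] } # top to bottom, then left to right
    .map(&:text)
    .join
end

# Made-up runs for illustration only.
runs = [
  TextRun.new(72.0, 704.5, "See "),
  TextRun.new(297.0, 704.5, "SampleHyperlinkHere"),
  TextRun.new(72.0, 690.0, "Next line")
]

puts text_in_rect(runs, [296.7999, 702.28, 357.6099, 713.28])
# prints "SampleHyperlinkHere"
```

Testing only the run's origin is a simplification. A run that starts before the rectangle but extends into it would be missed, so a real version would probably need run widths as well.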