yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

extracted text does not match text of pdf #118

Open pblesi opened 10 years ago

pblesi commented 10 years ago

reader.pages.at(3).text produces this output:

• FAX/Scanner/Copiers • 2 Digital Cameras • 1 Cisco Router • Hub

However, the text shown when the PDF is rendered is:

4 FAX/Scanner/Copiers 2 Digital Cameras 1 Cisco Router 1 Hub

As you can see, the numbers for two of the list items are missing from the extracted text.

It appears I cannot attach the PDF file, but the raw content for this page is:

/C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 186.9 Tm <0078>Tj /TT2 1 Tf -0.0004 Tc 0.0026 Tw 0.46 0 Td [( )-760(2 Poly Com systems )]TJ ET EMC
/P <>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 172.26 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.7624 Tw 0.46 0 Td [( 4 )760(FAX/Scanner/Copiers )]TJ ET EMC
/P <>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 157.68 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.0024 Tw 0.46 0 Td [( )-760(2 Digita)-4(l)2( Cameras )]TJ ET EMC
/P <>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 143.04 Tm <0078>Tj /TT2 1 Tf -0.0002 Tc 0.0024 Tw 0.46 0 Td [( )-760(1 Cisco Router )]TJ ET EMC
/P <>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 128.46 Tm <0078>Tj /TT2 1 Tf -0.0014 Tc 0.7636 Tw 0.46 0 Td [( 1 )760(Hub )]TJ ET EMC
/P <>BDC BT /C2_0 1 Tf 0 Tc 0 Tw 12 0 0 12 97.2 113.82 Tm <0078>Tj /TT2 1 Tf -0.0004 Tc 0.0026 Tw 0.46 0 Td [( )-760(6 NEC projectors mounted on portable carts )]TJ ET EMC

aarmora commented 9 years ago

Did you find a solution for this? I believe I'm facing a similar issue.

yob commented 7 years ago

I suspect this is an issue with our text layout algorithms in the PageLayout class.

Unfortunately I'm short on time at the moment, but I'll happily accept patches if you want to investigate further.
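
For anyone picking this up, a rough sketch of what the raw content above seems to be doing. The two lines that lose their number ("4 FAX/Scanner/Copiers" and "1 Hub") put the digit in the same string as the leading space, use a large word spacing (Tw around 0.76) and then a positive TJ adjustment of 760. The other lines use a small Tw and a negative adjustment of -760 before the digit. The plain Ruby below does not use pdf-reader at all; it only applies the text-positioning rule from the PDF specification (each glyph advances by its width plus Tc, a space also adds Tw, and a TJ number n moves the pen by -n/1000) to compare where the digit lands in both encodings. `offset_of_first_digit` and `SPACE_WIDTH` are hypothetical names, and the space width of 0.25 is an assumption, since the real value lives in the /TT2 font, which is not part of this issue.

```ruby
# Offsets are in text-space units at font size 1 (the "1 Tf" in the stream).
# The 12x scale from the Tm operator applies equally to both lines, so it is
# left out of the comparison.

SPACE_WIDTH = 0.25 # ASSUMPTION: real width comes from the /TT2 font metrics

# elements: the Strings and Numerics found inside one [ ... ] TJ array.
# Returns the pen offset just before the first digit glyph, or nil if the
# array contains no digit.
def offset_of_first_digit(elements, tc:, tw:)
  x = 0.0
  elements.each do |el|
    if el.is_a?(Numeric)
      x -= el / 1000.0 # TJ adjustment: positive moves left, negative moves right
      next
    end
    el.each_char do |ch|
      return x if ch.match?(/\d/)
      # In these samples only spaces precede the digit, so a single assumed
      # width is enough for the comparison.
      x += SPACE_WIDTH + tc
      x += tw if ch == " "
    end
  end
  nil
end

# Values copied from the raw content in the original report.
cameras = offset_of_first_digit([" ", -760, "2 Digita", -4, "l", 2, " Cameras "],
                                tc: -0.0002, tw: 0.0024)
fax     = offset_of_first_digit([" 4 ", 760, "FAX/Scanner/Copiers "],
                                tc: -0.0002, tw: 0.7624)

puts format("2 Digital Cameras:     %.4f", cameras)
puts format("4 FAX/Scanner/Copiers: %.4f", fax)
```

Under that assumed space width, both digits land at the same offset, which would explain why the rendered page looks identical for all list items. It also shows the digits are present in the content stream, so the loss presumably happens after parsing. That is consistent with the PageLayout suspicion above, but it does not confirm it.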