wojtekmaj / react-pdf

Display PDFs in your React app as easily as if they were images.
https://projects.wojtekmaj.pl/react-pdf
MIT License

Text layer may contain overlapping areas (react-pdf 9.0.0) #1828

Open obecker opened 3 weeks ago

obecker commented 3 weeks ago

Description

After upgrading react-pdf from 8.0.2 to 9.0.0 I observed that consecutive spans in the same line within the text layer may overlap (i.e. the spans are too wide). This prevents the correct selection of text in the document.

This is an example from the provided sample.pdf (page 2, penultimate paragraph):

[Screenshot 2024-06-11 at 12.57.07]

You can see the overlapping area at the word "bibendum".
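
To check for the overlap without relying on a visual selection, the bounding boxes of the text layer spans can be compared directly. The following is only a minimal sketch, not part of react-pdf or pdf.js: `SpanBox`, `onSameLine` and `findOverlaps` are hypothetical names, and the "same line" rule and the 0.5 px tolerance are assumptions chosen for illustration.

```typescript
// Hypothetical helper types; the fields mirror those of a DOMRect.
interface SpanBox {
  text: string;
  left: number;
  right: number;
  top: number;
  bottom: number;
}

interface Overlap {
  first: SpanBox; // the span that starts further left
  second: SpanBox;
  overlapPx: number; // width of the shared horizontal range
}

// Assumption: two boxes are on the same line when their vertical ranges
// intersect by more than half the height of the shorter box.
function onSameLine(a: SpanBox, b: SpanBox): boolean {
  const shared = Math.min(a.bottom, b.bottom) - Math.max(a.top, b.top);
  const minHeight = Math.min(a.bottom - a.top, b.bottom - b.top);
  return shared > minHeight / 2;
}

// Returns every pair of same-line boxes whose horizontal ranges intersect
// by more than `tolerancePx`. Compares all pairs, which is fine for a
// single page but not optimized.
function findOverlaps(boxes: SpanBox[], tolerancePx: number = 0.5): Overlap[] {
  const result: Overlap[] = [];
  for (let i = 0; i < boxes.length; i++) {
    for (let j = i + 1; j < boxes.length; j++) {
      const a = boxes[i];
      const b = boxes[j];
      if (!onSameLine(a, b)) continue;
      const overlapPx = Math.min(a.right, b.right) - Math.max(a.left, b.left);
      if (overlapPx > tolerancePx) {
        const [first, second] = a.left <= b.left ? [a, b] : [b, a];
        result.push({ first, second, overlapPx });
      }
    }
  }
  return result;
}
```

In a browser console, the boxes could be collected by calling `getBoundingClientRect()` on each element returned by `document.querySelectorAll(".react-pdf__Page__textContent span")`. That selector is an assumption about the rendered markup and may need adjusting. A non-empty result for the line containing "bibendum" would confirm that the spans are too wide.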

I assumed this had to originate in the core pdf.js library, but I am unable to reproduce the behavior in the pdf.js demo. I even downloaded the latest release (4.3.136) from https://github.com/mozilla/pdf.js/releases, ran npx serve in the extracted folder, and opened web/viewer.html with sample.pdf: the issue does not occur there.

If you want to test it with a different PDF, try https://www.vbg.de/cms/_Resources/Persistent/7/0/d/c/70dc78bec739e6cbe27bc8ba77a16d15347461d7/M_Arzt_Anforderungen.pdf and look at the last list item on the first page ("über Kenntnisse in der erforderlichen Röntgentechnik und Röntgendiagnostik verfügen.").

Steps to reproduce

Run yarn run dev in sample/create-react-app-5, scroll to page 2 and select the first line of the penultimate paragraph.

Try to select and copy the word 'justo'.

Expected behavior

The selected areas don't overlap. The word 'justo' gets copied.

Actual behavior

The selected areas do overlap. The copied text is 'utat'.

Additional information

No response

Environment

wojtekmaj commented 3 weeks ago

Hmmmm, I can reproduce this:

[Screenshot]

Oddly enough, this doesn't happen for me in all cases. Our test suite is free from this issue (it seems), but samples are not.