wojtekmaj / react-pdf

Display PDFs in your React app as easily as if they were images.
https://projects.wojtekmaj.pl/react-pdf
MIT License

When Pages are all displayed inside a <Document>, pages after first can have a scaled up scaleX text layer (only found on some documents) #1848

Open jkrubin opened 1 month ago

jkrubin commented 1 month ago

Description

I am using react-pdf and need the text layer for highlighting purposes. I wanted to switch from the "single page" recipe to the "all pages" recipe shown here: https://github.com/wojtekmaj/react-pdf/wiki/Recipes

This code seems to work fine on the surface, but for some PDFs, rendering all pages gives some text-layer spans after the first page very wrong scaleX transforms (far larger than intended). In single-page mode, all spans on the offending pages have the expected width.

I tested this with versions 9.0 and 9.1 and with different workers, and the bug always appears.

Steps to reproduce

I made a minimal reproducible example below:

import { Document, Page, pdfjs } from "react-pdf";
import "./App.css";
import { useState } from "react";
import "react-pdf/dist/esm/Page/AnnotationLayer.css";
import "react-pdf/dist/esm/Page/TextLayer.css";

// import pdfWorker from "./assets/pdf.worker.min.mjs?url";
// pdfjs.GlobalWorkerOptions.workerSrc = pdfWorker;

pdfjs.GlobalWorkerOptions.workerSrc = new URL(
  "pdfjs-dist/build/pdf.worker.min.mjs",
  import.meta.url
).toString();

import pdf from "./data/apl_23_003.pdf";

function App() {
  const [numPages, setNumPages] = useState<number>();
  const [pageNumber, setPageNumber] = useState<number>(1);

  function onDocumentLoadSuccess({ numPages }: { numPages: number }): void {
    setNumPages(numPages);
  }

  function onDocumentLoadError(error: Error): void {
    console.error("Failed to load PDF document:", error);
  }

  return (
    <>
      <div>
        <button onClick={() => setPageNumber((prev) => Math.max(prev - 1, 1))}>
          Previous
        </button>
        <button
          onClick={() =>
            setPageNumber((prev) =>
              numPages && prev < numPages ? prev + 1 : prev
            )
          }
        >
          Next
        </button>
      </div>
      <Document
        file={pdf}
        onLoadSuccess={onDocumentLoadSuccess}
        onLoadError={onDocumentLoadError}
      >
        <Page pageNumber={pageNumber} />
        {Array.from(new Array(numPages), (el, index) => (
          <Page key={`page_${index + 1}`} pageNumber={index + 1} />
        ))}
      </Document>
    </>
  );
}

export default App;

As stated above, this doesn't happen with every PDF. The best examples are in non-public PDFs, but I found a public document (attached) that shows the bug from page 3 onwards.

apl_23_003.pdf

Expected behavior

I expect the text layer to fit over the text exactly, as it does when displayed like this:

      <Document
        file={pdf}
        onLoadSuccess={onDocumentLoadSuccess}
        onLoadError={onDocumentLoadError}
      >
        <Page pageNumber={pageNumber} />
      </Document>

Actual behavior

On pages after page 1, the text layer gets a larger scaleX transform when displayed like this:

      <Document
        file={pdf}
        onLoadSuccess={onDocumentLoadSuccess}
        onLoadError={onDocumentLoadError}
      >
        <Page pageNumber={pageNumber} />
        {Array.from(new Array(numPages), (el, index) => (
          <Page key={`page_${index + 1}`} pageNumber={index + 1} />
        ))}
      </Document>

Additional information

The bug never occurs on the first <Page> rendered. For example, if page 4 is one that gets stretched and I render page 4 ten times in a row, the first copy is normal and the subsequent copies are stretched.

This bug does not affect all PDFs, only a few that I have found. While debugging I noticed that the PDF uses some encoding that my VSCode does not support; I can still open the file, but VSCode warns that "this document contains many invisible unicode characters". This may contribute to a parsing error, but I don't know why it would occur only after the first page.

jasoncardinale commented 1 month ago

I am seeing a similar issue. On a given page, the vast majority of the spans that make up the text layer have no transform and appear to be positioned correctly (no stray overlap). Some spans, however, have a transform applied along the x-axis, something like transform: scaleX(n) where n is close to 1. When the page state updates (say I want to highlight the text with <mark>, so I modify it in a customTextRenderer), this transform suddenly jumps to a much larger or smaller value (n approaches 0 or 2). This only happens for very specific words or lines within a PDF; in most cases I don't see the problem with large transforms.

As a temporary solution, I am following the advice mentioned here: https://github.com/wojtekmaj/react-pdf/issues/332#issuecomment-458121654.

However, I found that the transform is applied to the nested line spans rather than to react-pdf__Page__textContent, so I do this instead:

const removeTextLayerOffset = () => {
  const spans = document.querySelectorAll("span[role='presentation']")
  spans.forEach((span) => {
    const { style } = span as HTMLElement
    style.transform = ''
  })
}

And then pass the function here:

<Page ... onRenderTextLayerSuccess={removeTextLayerOffset} />

This removes all the transforms and for the most part yields good results. However, as mentioned before, some of the lines already had a small transform applied, so once it is removed the text does not overlap perfectly (though the difference is not nearly as drastic as with the worst offenders).

This seems to work well enough for all the PDFs I have tried so far, but I don't find it satisfying. pdfjs is clearly doing some calculation under the hood to determine these transforms based on font size, screen width, etc., so simply removing them is definitely a workaround.

See https://github.com/mozilla/pdf.js/blob/300e806efe7e6438e0b37d8eeb1a97d9e5d27daa/src/display/text_layer.js#L419 for how this transform is calculated. My best guess is that width is wrong in this case because the text width of the line cannot be measured properly. It may be related to unrecognized fonts whose spacing and character widths pdfjs does not support.
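
To make that calculation concrete: the expression at that location has the form (canvasWidth * scale) / width, where width comes from measureText. The sketch below is a hypothetical standalone helper (scaleXFactor and its parameter names are illustrative, not pdf.js API) showing how a wrong measured width alone pushes the factor away from 1.

```typescript
// Hypothetical helper, not pdf.js API: it only restates the expression
// (canvasWidth * scale) / measuredWidth as a pure function.
function scaleXFactor(
  canvasWidth: number, // width the text should occupy, in PDF units
  scale: number, // page scale applied to the text layer
  measuredWidth: number // width reported by measureText for the span's text
): number | null {
  // With nothing measurable there is no factor to apply.
  if (measuredWidth <= 0) {
    return null;
  }
  return (canvasWidth * scale) / measuredWidth;
}

// The measurement matches the target width: no stretching.
console.log(scaleXFactor(100, 1, 100)); // 1

// The text was measured with a font that is too small: the span is stretched.
console.log(scaleXFactor(100, 1, 50)); // 2

// The text was measured with a font that is too large: the span is squeezed.
console.log(scaleXFactor(100, 1, 200)); // 0.5
```

This matches the symptom described above: the target width does not change, so a factor drifting towards 0 or 2 suggests the measured width is what went wrong.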

jkrubin commented 1 month ago

I don't believe this is an issue with pdfjs. I uploaded my PDF to the pdfjs demo at https://mozilla.github.io/pdf.js/web/viewer.html and the issue did not occur. I can also display the page normally in paginated mode, so the issue has something to do with rendering subsequent pages.

I believe this needs a fix in react-pdf.

szl1993 commented 2 weeks ago

I found the cause. See https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js. pdf.js uses the canvas context's measureText to calculate the actual display width of <span/> elements, and it reuses a static canvas as a performance optimization. I logged the measurement information and found that when the issue occurred, ctx.font did not match the expected value.

  // Excerpt from pdf.js src/display/text_layer.js with console.log calls added for debugging.
  if (prevFontSize !== fontSize || prevFontFamily !== fontFamily) {
    console.log("---------ctx font--------");
    console.log("textContent:", div.textContent);
    console.log("pageIndex", this);
    console.log("prevFontSize:", prevFontSize);
    console.log("fontSize:", fontSize);
    console.log("this.#scale", this.#scale);
    console.log("fontSize * this.#scale:", fontSize * this.#scale);
    console.log("-------------------------");
    ctx.font = `${fontSize * this.#scale}px ${fontFamily}`;
    params.prevFontSize = fontSize;
    params.prevFontFamily = fontFamily;
  }

  // Only measure the width for multi-char text divs, see `appendText`.
  const { width } = ctx.measureText(div.textContent);

  if (width > 0) {
    transform = `scaleX(${(canvasWidth * this.#scale) / width}) ${transform}`;
  }

  if (div.textContent === "scale x show error text") {
    console.log("----------measureText------------");
    console.log("transform:", transform);
    console.log("width:", width);
    console.log("ctx.font:", ctx.font);
    console.log("fontSize:", fontSize);
    console.log("fontFamily:", fontFamily);
    console.log("oldPrevFontSize:", oldPrevFontSize);
    console.log("oldPrevFontFamily:", oldPrevFontFamily);
    console.log("this.#scale:", this.#scale);
    console.log("canvasWidth:", canvasWidth);
  }
}

So the cause is that <Page/> components render in parallel, which leaves the shared canvas context with the wrong font. My solution is to check whether the page is currently displayed and, if it is not, skip rendering its TextLayer.
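
One way to picture this race is the simplified model below. It is not pdf.js code: it assumes, based on the params.prevFontSize check in the excerpt, that each text layer keeps its own "previous font" cache while all layers share one measuring context. LayerModel and sharedCtx are hypothetical names.

```typescript
// Simplified model, NOT pdf.js code. Assumption: the measuring context is
// shared by all text layers, while each layer caches the font it set last.
const sharedCtx = { font: "" };

class LayerModel {
  private prevFont = "";
  private font: string;

  constructor(font: string) {
    this.font = font;
  }

  // Returns the font the shared context would actually measure with.
  measure(): string {
    // The layer only touches the shared context when its own cache says
    // the font changed; it cannot see writes made by other layers.
    if (this.prevFont !== this.font) {
      sharedCtx.font = this.font;
      this.prevFont = this.font;
    }
    return sharedCtx.font;
  }
}

const pageA = new LayerModel("10px serif");
const pageB = new LayerModel("20px sans-serif");

const measureA1 = pageA.measure(); // "10px serif": A sets the shared font
const measureB1 = pageB.measure(); // "20px sans-serif": B overwrites it
const measureA2 = pageA.measure(); // "20px sans-serif": A's cache is stale

console.log(measureA1, "|", measureB1, "|", measureA2);
```

In this model, if page A finished all its measurements before page B started, the stale read could not happen, which is consistent with the first rendered page always looking correct.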

jkrubin commented 2 weeks ago

I implemented a mutex approach like this, and it solved all of the width issues:

// Assumption: Mutex comes from the async-mutex package, whose acquire()
// resolves to a release function. The original comment does not name the library.
import { Mutex } from "async-mutex";
import { useEffect, useRef, useState } from "react";
import { Page } from "react-pdf";

type PageProps = {
  pageNumber: number;
  pageLoadLock: Mutex;
  scale: number;
};

export const PageWrapper: React.FC<PageProps> = ({
  pageNumber,
  pageLoadLock,
  scale,
}) => {
  const [readyToLoadTextLayer, setReadyToLoadTextLayer] = useState<boolean>(false);
  const releaseRef = useRef<(() => void) | null>(null);

  useEffect(() => {
    let cancelled = false;

    const acquireLock = async () => {
      console.log(`page ${pageNumber} waiting for mutex`);
      const releaseLock = await pageLoadLock.acquire();

      // If the component unmounted while waiting, give the lock straight back
      if (cancelled) {
        releaseLock();
        return;
      }

      // Store the release function in a ref
      releaseRef.current = releaseLock;
      setReadyToLoadTextLayer(true);
      console.log(`page ${pageNumber} acquired`);
    };

    acquireLock();

    // Cleanup to release lock if the component unmounts before the lock is released
    return () => {
      cancelled = true;
      if (releaseRef.current) {
        releaseRef.current();
        releaseRef.current = null;
      }
    };
  }, [pageNumber, pageLoadLock]);

  const handleTextLayerLoad = () => {
    console.log(`page ${pageNumber} loaded text layer`);
    if (releaseRef.current) {
      releaseRef.current();
      releaseRef.current = null; // Clear the ref after releasing
      console.log(`page ${pageNumber} has released lock`);
    }
  };

  return (
    <Page
      pageNumber={pageNumber}
      scale={scale}
      renderTextLayer={readyToLoadTextLayer}
      onRenderTextLayerSuccess={handleTextLayerLoad}
      onLoadError={(error) => console.error(error)}
    />
  );
};
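
The component above relies on a shared lock whose acquire() resolves to a release function. As a minimal dependency-free sketch of that idea (SimpleMutex and renderSerially are hypothetical names, not part of react-pdf or any library), the following uses a stand-in for the asynchronous text-layer render to show that the work runs one page at a time:

```typescript
// Hypothetical minimal lock: acquire() resolves to a release function
// once every earlier holder has released.
class SimpleMutex {
  private tail: Promise<void> = Promise.resolve();

  acquire(): Promise<() => void> {
    let release!: () => void;
    // `released` settles when the new holder calls release().
    const released = new Promise<void>((resolve) => {
      release = resolve;
    });
    // The caller gets the lock after all previous holders are done.
    const ready = this.tail.then(() => release);
    // Later callers must also wait for this holder.
    this.tail = this.tail.then(() => released);
    return ready;
  }
}

// Stand-in for rendering text layers: each "page" holds the lock while it
// works, so start/end entries in the log never interleave.
async function renderSerially(
  lock: SimpleMutex,
  pages: number[],
  log: string[]
): Promise<void> {
  await Promise.all(
    pages.map(async (page) => {
      const release = await lock.acquire();
      try {
        log.push(`start ${page}`);
        await Promise.resolve(); // placeholder for the async render
        log.push(`end ${page}`);
      } finally {
        release();
      }
    })
  );
}

const demoLog: string[] = [];
renderSerially(new SimpleMutex(), [1, 2, 3], demoLog).then(() => {
  console.log(demoLog.join(", "));
});
```

With the component above, a single lock instance has to be created once (for example at module scope or with useMemo) and passed to every PageWrapper; creating one lock per page would serialize nothing.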

Flagging this for @wojtekmaj: could a fix for this race condition be included in react-pdf?