Fix viewer.py Adapt page size on the fly if necessary

alexawake commented 6 months ago

It fixes the issue of documents with page sizes different from the first one, these might not be viewed correctly afterwards since the viewer assumes all pages being the size of the first page. This is a typical scenario in documents where a first page with extra information is added, for instance on exporting this PDF https://eudml.org/doc/202780 (Access to full text + Export)

Related with pull request #2480 "Add support to python only pdfviewer for displaying pdf files where not all pages are the same size"

but with the adaption solution it is not needed that all pages of the document are computed in advance

jmoraleda commented 6 months ago

The approach in this PR renders the content of all pages in bounding boxes of the same size, choosing a bounding box large enough to fit the largest page. This means that, for smaller pages, very large and usightly page gaps are produced.

For example compare the output produced with this document by this approach when compared to choosing a bounding box for each page that exactly fits the page, as is done in PR #2480.

Result when using this PR

Result when using PR #2480

(Disclaimer: I am the author of PR #2480)

alexawake commented 6 months ago

Hi Jorge, testing your approach #2480 "choosing a bounding box for each page that exactly fits the page" with the example pdf, I see that the zoom changes on each page in order to do that fit of the page. But I don't think that this decision should be taken automatically. If the PDF is designed with pages of different sizes for whatever reason this should be kept in that way. Both Chrome and Edge browsers keep the different sizes of the pages in the example pdf, but certanly in contrast to my approach (#2492 ) they center the pages and eliminate the gaps.

jmoraleda commented 6 months ago

Hi Alejandro,

Thank you for taking the time to look at this.

I see that the zoom changes on each page in order to do that fit of the page.

In PR #2480 The zoom changes automatically when the pdfviewer is set to "fit page width" (the initial setting of the pdfviewer widget) or "fit page height" . In that case, the viewer adjusts the zoom for each page so the page fits the width or height as intended. When the pdfviewer is set to display at a constant zoom then the zoom stays at the target zoom and each page is scaled to that level of zoom.

IMHO this is the intended and most correct behavior.

In all cases the page gap is adjusted to not leave extraneous space between pages.

If other people are interested in this topic, I think it would be helpful to get them to test both approaches and get their opinion. Also, if anyone has real-world examples of pdf's with different page sizes that can be made publicly available, that would be helpful. (I have such examples, hence the motivation for my PR, but they contain private patient data and I cannot share them)

alexawake commented 6 months ago

Hi Jorge, I've taken a deeper look into your change and the whole viewer.py in general and I can tell you following.

your approach getting in advance all page sizes is the correct one: I've seen that takes almost no time even in huge documents, for instance in one about 1400 pages. But precisely testing that document it happens that if one page has issues (pdf corrupt) then an exception prevents to show anything of the document. It can be easily fixed by adding one more except in the lines
```
    """ Return width, height for the page """
    try:
        page = self.pdfdoc.load_page(pageNum)
    except AttributeError: # old PyMuPDF version
        page = self.pdfdoc.loadPage(pageNum)
    except: # otherwise if PDF is corrupt at some page
        return (100, 100) # for example or any other value
```

In general, not only your change but in all try / except AttributeError in viewer.py a "except:" line should be added with the right action, for instance in RenderPage it could be simply return

def RenderPage(self, gc, pageno, scale=1.0):
    " Render the set of pagedrawings into gc for specified page "
    try:
        page = self.pdfdoc.load_page(pageno)
    except AttributeError: # old PyMuPDF version
        page = self.pdfdoc.loadPage(pageno)
    except: # otherwise if for example PDF is corrupt at some page
        return


regarding the adaption to the zoom on each page, testing your example pdf in other viewers all them in "page width mode" take the maximum width in the document and show all pages centered to it, which I think is the right approach.

- testing PyPDF2 capability
viewer.py dedicate more than 2 thirds of the code to support this library but : it is outdated and not working anymore due to deprecated calls or members
After fixing the deprecated stuff, I've seen that the viewer actually cannot render almost anything from a simple PDF.
I can fully understand the whish to work and improve the viewer with this full open library and free of C modules but in any case, even if it were comparable with MuPDF, I would separate these two different approaches in two files, for example
          pdfViewerMuPDF.py
          pdfViewerPyPDF2.py
and if desired another module could do the "module switch" between both 

by now personally I will only work on  pdfViewerMuPDF.py and I will publish it in my fork when finished.

jmoraleda commented 6 months ago

Hello Alejandro,

Thank you again for your time. I am glad we have reached consensus. I wonder what is the best link to move/copy the discussion to PR #2480 since we have agreed that is the approach we would like to merge.

Being able to handle partially broken documents

I think this would be a good feature. And from your comments, it seems its implementation is straight forward. I can add support for it in PR #2480 before it gets merged or later in a separate PR if it gets merged before then. Could you add there a link to the broken document(s) you used for testing this, if they are public, so I can test this new feature when the user navigates to broken pages, or scrolls around broken pages? Thank you.

Dropping support for PyPDF2

I don't use PyPDF2 either. The last time I tried was four year ago and I could not get it to work either. But I just checked and the library seems to have a new version that is actively developed: https://github.com/py-pdf/pypdf, so perhaps before considering to drop support for it, we should upgrade to the latest version and try it. Regardless, I think this work should be done as part of a separate PR though.

wxWidgets / Phoenix