xperseguers / t3ext-extractor

TYPO3 Extension extractor
https://extensions.typo3.org/extension/extractor
GNU General Public License v2.0

Error when uploading files: "DOMDocument::loadXML(): Start tag expected, '<' not found in Entity" #63

Closed: seirerman closed this issue 1 year ago

seirerman commented 1 year ago

Extractor works mostly fine, but I get fatal errors when uploading certain files:

PHP Warning: DOMDocument::loadXML(): Start tag expected, '<' not found in Entity, line: 1 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 265 | TYPO3\CMS\Core\Error\Exception thrown in file /var/dvt/html/t3tiro/web/typo3/sysext/core/Classes/Error/ErrorHandler.php in line 137

As a result, TYPO3 uploads the file despite the error but does not create a sys_file_metadata record, so editors cannot edit the file's metadata within TYPO3.

This PDF should hopefully reproduce the error during upload. A coworker of mine who specializes in automated document creation checked the file for PDF syntax and also ran a preflight check (no idea what that is :-) ) in Adobe Acrobat. He said the file looks fine.

Uploading the same file without extractor produces no errors. Any idea what causes this?

seirerman commented 1 year ago

There's no error when basic.enable_php = 0, btw. So the bug lies somewhere within the native PHP processing.

xperseguers commented 1 year ago

Internally, the PDF you provided leads to the following XML being loaded:

?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      dc:format="application/pdf"
      pdf:Keywords="Geschäftszahl: KB-WR/B-2691/14-2023"
      pdf:Producer="Aspose.Words for Java 23.1.0; modified using iText® 7.1.15 ©2000-2021 iText Group NV (Land Tirol, p. A. DVT-Daten-Verarbeitung-Tirol GmbH; licensed version)"
      pdfaid:conformance="A"
      pdfaid:part="1"
      xmp:CreateDate="2023-10-03T07:38:00Z"
      xmp:CreatorTool="Microsoft Office Word"
      xmp:ModifyDate="2023-10-03T11:32:06+02:00">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Öffentliche Bekanntmachung WE GmbH</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>Bezirkshauptmannschaft Kitzbühel</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:subject>
        <rdf:Bag>
          <rdf:li>Geschäftszahl: KB-WR/B-2691/14-2023</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

<?xpacket end="w"?>

So the leading < of the opening <?xpacket processing instruction is somehow missing.
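The symptom can be reproduced in isolation. The following is a minimal sketch, not code from the extension: `tryLoadXml()` is a hypothetical helper that feeds a string to `DOMDocument::loadXML()` and collects libxml's messages instead of letting them surface as PHP warnings. It assumes the `dom` and `libxml` extensions are available.

```php
<?php

/**
 * Hypothetical helper (not part of the extension): tries to parse $xml
 * and returns whether it loaded, plus any libxml error messages.
 */
function tryLoadXml(string $xml): array
{
    // Collect libxml errors internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new \DOMDocument();
    $loaded = $document->loadXML($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message);
    }

    // Restore the previous error handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['loaded' => $loaded, 'errors' => $errors];
}

// Same packet start as in the dump above, once truncated and once intact.
$truncated = '?xpacket begin="" id="x"?><root/><?xpacket end="w"?>';
$intact = '<' . $truncated;

var_dump(tryLoadXml($truncated)['loaded']);
var_dump(tryLoadXml($intact)['loaded']);
```

With the leading `<` in place the packet parses; without it the parser rejects the input at line 1, which matches the warning reported above.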

seirerman commented 1 year ago

Thank you for checking the file. I'll forward that to my coworker...

xperseguers commented 1 year ago

Still investigating...

The PDF uses LF line endings, whereas the code skips CR + LF ("DOS"/Windows format). That is why the extracted XML is off by one character.
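One way to avoid this kind of offset is to skip whatever line break is actually present instead of assuming two bytes. The sketch below is an illustration only and not the patch applied to the extension; `skipLineBreak()` and the sample data are hypothetical.

```php
<?php

/**
 * Hypothetical helper: returns the offset of the first byte after the
 * line break starting at $offset, accepting CR+LF, LF or CR.
 * If there is no line break at $offset, the offset is returned unchanged.
 */
function skipLineBreak(string $data, int $offset): int
{
    // Windows/DOS style: two bytes.
    if (substr($data, $offset, 2) === "\r\n") {
        return $offset + 2;
    }

    // Unix (LF) or old Mac (CR) style: one byte.
    $char = substr($data, $offset, 1);
    if ($char === "\n" || $char === "\r") {
        return $offset + 1;
    }

    return $offset;
}

// The same marker followed by an LF-only and a CR+LF line break.
$marker = 'stream';
$unix = $marker . "\n" . '<x:xmpmeta/>';
$dos = $marker . "\r\n" . '<x:xmpmeta/>';

// Both calls print the first character of the payload.
echo substr($unix, skipLineBreak($unix, strlen($marker)), 1), PHP_EOL;
echo substr($dos, skipLineBreak($dos, strlen($marker)), 1), PHP_EOL;
```

A fixed two-byte skip applied to the LF-only input would start one character too late and drop the leading `<`.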

seirerman commented 1 year ago

I've added your patch, but now I occasionally get this error: Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 46 in /typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268

xperseguers commented 1 year ago

That is certainly some other oddity in the parsing (which I didn't write in the first place). Could you please send me an example document that causes the problem so I can double-check what exactly happens?

seirerman commented 1 year ago

I found another file that causes a ParsePI error. The file gets uploaded, a sys_file_metadata record is created and no error is visible to the editor. But there are 3 warnings in the protocol module:

Unfortunately I don't have the tools to check the integrity of PDF files like this one...

tehplague commented 1 year ago

It would be nice if extractor gracefully handled metadata extraction errors by catching and logging the problem but returning with an empty $metadata array. That way, TYPO3 could at least insert a sys_file_metadata record for an editor to edit subsequently. With the current implementation, TYPO3 ends up without this record, thus disabling the edit metadata functionality for the file in the backend. I monkey-patched extractor here by wrapping extractMetadataFromPdf with try ... catch.
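The idea described above can be sketched as follows. This is not the actual monkey patch: `extractMetadataSafely()` is a hypothetical wrapper, and the `$extractor` callable stands in for the real extraction call (such as the `extractMetadataFromPdf` method mentioned above). It relies on TYPO3's error handler turning the PHP warning into an exception, as the first report shows, so that `catch` can intercept it.

```php
<?php

/**
 * Hypothetical wrapper: runs the extraction and falls back to an empty
 * metadata array if anything is thrown, optionally logging the reason.
 */
function extractMetadataSafely(callable $extractor, string $fileName, ?callable $logger = null): array
{
    try {
        $metadata = $extractor($fileName);

        // Guard against extractors that return something other than an array.
        return is_array($metadata) ? $metadata : [];
    } catch (\Throwable $e) {
        if ($logger !== null) {
            $logger(sprintf('Metadata extraction failed for "%s": %s', $fileName, $e->getMessage()));
        }

        // An empty array lets the caller continue and create the record.
        return [];
    }
}

// Usage with a stand-in extractor that always fails.
$messages = [];
$metadata = extractMetadataSafely(
    static function (string $fileName): array {
        throw new \RuntimeException('Start tag expected');
    },
    'broken.pdf',
    static function (string $message) use (&$messages): void {
        $messages[] = $message;
    }
);

var_dump($metadata);
```

Returning an empty array rather than rethrowing trades lost metadata for a file that editors can still describe by hand, which is the behavior requested here.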