xperseguers / t3ext-extractor

TYPO3 Extension extractor
https://extensions.typo3.org/extension/extractor
GNU General Public License v2.0
15 stars 24 forks

Error when uploading files: "DOMDocument::loadXML(): Start tag expected, '<' not found in Entity" #63

Closed: seirerman closed this issue 10 months ago

seirerman commented 1 year ago

Extractor mostly works fine, but I get fatal errors when uploading certain files:

PHP Warning: DOMDocument::loadXML(): Start tag expected, '<' not found in Entity, line: 1 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 265 | TYPO3\CMS\Core\Error\Exception thrown in file /var/dvt/html/t3tiro/web/typo3/sysext/core/Classes/Error/ErrorHandler.php in line 137

As a result, TYPO3 uploads the file despite the error but does not create a sys_file_metadata record, so editors cannot edit the metadata within TYPO3.

You can use this PDF, which should reproduce the error during the upload process. A coworker of mine who specializes in automated document creation checked the file for PDF syntax and also ran a preflight check (no idea what that is :-) ) in Adobe Acrobat. He said the file looks fine.

Uploading the same file without extractor produces no errors. Any idea what causes this?

seirerman commented 12 months ago

There's no error when basic.enable_php = 0, btw. So the bug lies somewhere within the native PHP processing.

xperseguers commented 12 months ago

Internally, the PDF you provided results in the following XML being loaded:

?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      dc:format="application/pdf"
      pdf:Keywords="Geschäftszahl: KB-WR/B-2691/14-2023"
      pdf:Producer="Aspose.Words for Java 23.1.0; modified using iText® 7.1.15 ©2000-2021 iText Group NV (Land Tirol, p. A. DVT-Daten-Verarbeitung-Tirol GmbH; licensed version)"
      pdfaid:conformance="A"
      pdfaid:part="1"
      xmp:CreateDate="2023-10-03T07:38:00Z"
      xmp:CreatorTool="Microsoft Office Word"
      xmp:ModifyDate="2023-10-03T11:32:06+02:00">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Öffentliche Bekanntmachung WE GmbH</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>Bezirkshauptmannschaft Kitzbühel</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:subject>
        <rdf:Bag>
          <rdf:li>Geschäftszahl: KB-WR/B-2691/14-2023</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

<?xpacket end="w"?>

So the leading < is somehow missing.

seirerman commented 12 months ago

Thank you for checking the file. I'll forward that to my coworker...

xperseguers commented 12 months ago

Still investigating....

The PDF uses LF line endings, whereas the code skips CR + LF ("DOS"/Windows format). That's why there is a character offset and the leading < gets cut off.
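
To illustrate the offset problem, here is a minimal sketch of a line-ending-agnostic skip. It is not the extension's actual code: the function name `skipLineEnding`, the `stream` marker and the sample strings are all made up for this example. It only shows how assuming a fixed two-byte CR + LF swallows the first character when the file uses a bare LF.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of extractor): returns the offset of the
 * first byte after the line ending that starts at $offset.
 * Handles CR + LF (Windows), bare LF (Unix) and bare CR.
 */
function skipLineEnding(string $data, int $offset): int
{
    if (substr($data, $offset, 2) === "\r\n") {
        return $offset + 2;
    }
    $char = $data[$offset] ?? '';
    if ($char === "\n" || $char === "\r") {
        return $offset + 1;
    }
    return $offset; // no line ending at this position
}

// Made-up sample data: a marker, a line ending, then the start of an XMP packet.
$marker = 'stream';
$unix = $marker . "\n" . '<?xpacket begin=""?>';
$dos = $marker . "\r\n" . '<?xpacket begin=""?>';

// Always skipping two bytes swallows the "<" when the file only uses LF.
$naive = substr($unix, strlen($marker) + 2);

// Detecting the actual line ending keeps the packet intact in both cases.
$fixedUnix = substr($unix, skipLineEnding($unix, strlen($marker)));
$fixedDos = substr($dos, skipLineEnding($dos, strlen($marker)));
```

With the LF-only sample, `$naive` starts with `?xpacket`, which matches the truncated XML shown above, while both `$fixedUnix` and `$fixedDos` start with `<?xpacket`.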

seirerman commented 12 months ago

I've added your patch, but now I occasionally get this error:

Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 46 in /typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268

xperseguers commented 12 months ago

Certainly some other oddity with the parsing (which I didn’t write in the first place). Could you please send me an example document causing the problem so I could double check what happens exactly?

seirerman commented 11 months ago

I found another file that causes a ParsePI error. The file gets uploaded, a sys_file_metadata record is created and no error is visible to the editor. But there are 3 warnings in the protocol module:

Unfortunately I don't have the tools to check the integrity of PDF files like this one...

tehplague commented 11 months ago

It would be nice if extractor gracefully handled metadata extraction errors by catching and logging the problem, then returning an empty $metadata array. That way, TYPO3 could at least insert a sys_file_metadata record for an editor to edit subsequently. With the current implementation, TYPO3 ends up without this record, which disables the edit metadata functionality for the file in the backend. I monkey-patched extractor here by wrapping extractMetadataFromPdf with try ... catch.
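
A rough sketch of what such a wrapper could look like, for illustration only. The function `extractMetadataSafely`, the `$log` array and the stand-in extractor are hypothetical; the real signature of extractMetadataFromPdf is not shown in this thread, so a plain callable takes its place. The stand-in also uses PHP's `libxml_use_internal_errors()` so that malformed XML is reported as an exception instead of a PHP warning.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper: runs an extraction callable and falls back to an
 * empty metadata array if it throws. $extract stands in for a call such as
 * extractMetadataFromPdf (assumed here to take a file name and return an array).
 */
function extractMetadataSafely(callable $extract, string $fileName, array &$log = []): array
{
    try {
        return $extract($fileName);
    } catch (\Throwable $e) {
        // Record the problem, but let the caller continue with no metadata.
        $log[] = sprintf('Metadata extraction failed for %s: %s', $fileName, $e->getMessage());
        return [];
    }
}

// Stand-in extractor that parses deliberately malformed XML (missing leading "<").
$brokenExtractor = static function (string $fileName): array {
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new \DOMDocument();
    $ok = $dom->loadXML('?xpacket begin=""?><x/>');
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($ok === false) {
        $message = isset($errors[0]) ? trim($errors[0]->message) : 'Invalid XML';
        throw new \RuntimeException($message);
    }
    return ['rootElement' => $dom->documentElement->nodeName];
};

$log = [];
$metadata = extractMetadataSafely($brokenExtractor, 'example.pdf', $log);
```

Here `$metadata` ends up as an empty array and `$log` holds one entry describing the failure, so the caller could still go on to create the sys_file_metadata record. In a real TYPO3 setup the message would presumably go to a proper logger rather than an array.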