Error when uploading files: "DOMDocument::loadXML(): Start tag expected, '<' not found in Entity"

seirerman commented 1 year ago

Extractor works mostly fine, but I get fatal errors when uploading certain files:

PHP Warning: DOMDocument::loadXML(): Start tag expected, '<' not found in Entity, line: 1 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 265 | TYPO3\CMS\Core\Error\Exception thrown in file /var/dvt/html/t3tiro/web/typo3/sysext/core/Classes/Error/ErrorHandler.php in line 137

As a result, TYPO3 uploads the file despite the error but does not create a sys_file_metadata record, so editors cannot edit the metadata within TYPO3.

You can use this PDF that can hopefully reproduce this error during the upload process. A coworker of mine who specializes in automated document creation checked the file for PDF syntax and also did a preflight check (no idea what that is :-) ) in Adobe Acrobat. He said the file looks fine.

Uploading the same file without extractor produces no errors. Any idea what causes this?

TYPO3 11.5.31
extractor 2.3.0
exiftool 12.60
pdfinfo 0.26.5
basic.enable_tika=0
basic.enable_tools_exiftool=1
basic.enable_tools_pdfinfo=1
basic.enable_php=1

seirerman commented 1 year ago

There's no error when basic.enable_php = 0, btw. So the bug lies somewhere within the native PHP processing.

xperseguers commented 1 year ago

Internally, the PDF you provided leads to the following XML being loaded:

?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      dc:format="application/pdf"
      pdf:Keywords="Geschäftszahl: KB-WR/B-2691/14-2023"
      pdf:Producer="Aspose.Words for Java 23.1.0; modified using iText® 7.1.15 ©2000-2021 iText Group NV (Land Tirol, p. A. DVT-Daten-Verarbeitung-Tirol GmbH; licensed version)"
      pdfaid:conformance="A"
      pdfaid:part="1"
      xmp:CreateDate="2023-10-03T07:38:00Z"
      xmp:CreatorTool="Microsoft Office Word"
      xmp:ModifyDate="2023-10-03T11:32:06+02:00">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Öffentliche Bekanntmachung WE GmbH</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>Bezirkshauptmannschaft Kitzbühel</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:subject>
        <rdf:Bag>
          <rdf:li>Geschäftszahl: KB-WR/B-2691/14-2023</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

<?xpacket end="w"?>

so the start < is somehow missing

seirerman commented 1 year ago

Thank you for checking the file. I'll forward that to my coworker...

xperseguers commented 1 year ago

Still investigating....

The PDF uses LF line endings, whereas the code skips CR + LF ("DOS"/Windows format). That's why there is a character offset and the leading < gets cut off.
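
For illustration, here is a minimal sketch of a skip that tolerates both line-ending styles. It assumes the parser locates a marker in the raw PDF bytes and then steps over the line break that follows it. `skipLineEnding()` and the `stream` marker are hypothetical names chosen for this example, not the extension's actual code.

```php
<?php
// Hypothetical helper (not taken from the extension): return the offset of
// the first byte after a single line break starting at $offset.
function skipLineEnding(string $buffer, int $offset): int
{
    // Windows style: CR + LF occupies two bytes.
    if (substr($buffer, $offset, 2) === "\r\n") {
        return $offset + 2;
    }
    // Unix (LF) or old Mac (CR) style: one byte.
    $char = $buffer[$offset] ?? '';
    if ($char === "\n" || $char === "\r") {
        return $offset + 1;
    }
    // No line break at this position: leave the offset untouched.
    return $offset;
}

// Assumed marker, for demonstration only.
$marker = 'stream';
$samples = [
    "stream\r\n<?xpacket begin=\"\"?>",
    "stream\n<?xpacket begin=\"\"?>",
];
foreach ($samples as $raw) {
    $start = skipLineEnding($raw, strlen($marker));
    // Both samples print the same 9 characters, including the leading "<".
    echo substr($raw, $start, 9), PHP_EOL;
}
```

Skipping a fixed two bytes would eat the first character of the packet in the LF case, which would match the missing < seen above.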

seirerman commented 1 year ago

I've added your patch, but now I occasionally get this error: Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 46 in /typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268

xperseguers commented 1 year ago

Certainly some other oddity with the parsing (which I didn’t write in the first place). Could you please send me an example document causing the problem so I could double check what happens exactly?

seirerman commented 1 year ago

I found another file that causes a ParsePI error. The file gets uploaded, a sys_file_metadata record is created and no error is visible to the editor. But there are 3 warnings in the protocol module:

Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 52 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268
Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 614 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268
Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 620 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268

Unfortunately I don't have the tools to check the integrity of PDF files like this one...
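
The "PI xpacket never end" message means libxml found an `<?xpacket` processing instruction without its closing delimiter, for example because the extracted packet was cut short. One possible workaround, sketched below, is to hand only the `x:xmpmeta` element to the parser and ignore the xpacket wrappers. This is an assumption about what might help, not the fix applied in the extension. `extractXmpMeta()` is a hypothetical name, and the sketch assumes the usual `x:` namespace prefix, which XMP does not strictly require.

```php
<?php
// Hypothetical helper: cut the <x:xmpmeta>...</x:xmpmeta> element out of a
// raw buffer so that broken xpacket wrappers never reach the XML parser.
function extractXmpMeta(string $buffer): ?string
{
    $start = strpos($buffer, '<x:xmpmeta');
    if ($start === false) {
        return null; // no XMP element found
    }
    $endTag = '</x:xmpmeta>';
    $end = strpos($buffer, $endTag, $start);
    if ($end === false) {
        return null; // element is truncated
    }
    return substr($buffer, $start, $end - $start + strlen($endTag));
}

// Example: the trailing xpacket instruction is deliberately left unterminated.
$raw = "<?xpacket begin=\"\"?>\n"
    . "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"><a/></x:xmpmeta>\n"
    . "<?xpacket end=\"w\"";
$xml = extractXmpMeta($raw);

$dom = new DOMDocument();
var_dump($xml !== null && $dom->loadXML($xml));
```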

tehplague commented 1 year ago

It would be nice if extractor gracefully handled metadata extraction errors by catching and logging the problem but returning with an empty $metadata array. That way, TYPO3 could at least insert a sys_file_metadata record for an editor to edit subsequently. With the current implementation, TYPO3 ends up without this record, thus disabling the edit metadata functionality for the file in the backend. I monkey-patched extractor here by wrapping extractMetadataFromPdf with try ... catch.
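
A rough sketch of that idea in plain PHP, under stated assumptions: `parseXmp()` is a hypothetical stand-in for the real extraction code and `extractMetadataSafely()` is an invented wrapper name. Neither is the extension's API, and a real TYPO3 implementation would use the TYPO3 logger rather than `error_log()`.

```php
<?php
// Hypothetical stand-in for the real extraction: parse an XMP packet and
// return the dc:title, or throw if the XML is not well-formed.
function parseXmp(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        $dom = new DOMDocument();
        if (!$dom->loadXML($xml)) {
            $error = libxml_get_last_error();
            throw new RuntimeException($error ? trim($error->message) : 'Invalid XMP packet');
        }
        $titles = $dom->getElementsByTagNameNS('http://purl.org/dc/elements/1.1/', 'title');
        return $titles->length > 0
            ? ['title' => trim($titles->item(0)->textContent)]
            : [];
    } finally {
        // Restore the previous libxml error handling state.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

// Hypothetical wrapper: log the problem and fall back to an empty array,
// so the caller can still create a metadata record.
function extractMetadataSafely(string $xml): array
{
    try {
        return parseXmp($xml);
    } catch (Throwable $e) {
        error_log('Metadata extraction failed: ' . $e->getMessage());
        return [];
    }
}

// The packet from this issue starts without "<", so parsing fails gracefully.
var_dump(extractMetadataSafely('?xpacket begin=""?><x/>'));
```

Catching `Throwable` rather than `Exception` also covers the `ValueError` that PHP 8 raises when `loadXML()` is given an empty string.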

xperseguers / t3ext-extractor

Error when uploading files: "DOMDocument::loadXML(): Start tag expected, '<' not found in Entity" #63
