Closed: seirerman closed this issue 1 year ago
There's no error when basic.enable_php = 0, btw. So the bug lies somewhere within the native PHP processing.
Internally, the PDF you provided leads to loading the following XML:
?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
dc:format="application/pdf"
pdf:Keywords="Geschäftszahl: KB-WR/B-2691/14-2023"
pdf:Producer="Aspose.Words for Java 23.1.0; modified using iText® 7.1.15 ©2000-2021 iText Group NV (Land Tirol, p. A. DVT-Daten-Verarbeitung-Tirol GmbH; licensed version)"
pdfaid:conformance="A"
pdfaid:part="1"
xmp:CreateDate="2023-10-03T07:38:00Z"
xmp:CreatorTool="Microsoft Office Word"
xmp:ModifyDate="2023-10-03T11:32:06+02:00">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Öffentliche Bekanntmachung WE GmbH</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Seq>
<rdf:li>Bezirkshauptmannschaft Kitzbühel</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:subject>
<rdf:Bag>
<rdf:li>Geschäftszahl: KB-WR/B-2691/14-2023</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
so the starting < is somehow missing
Thank you for checking the file. I'll forward that to my coworker...
Still investigating....
The PDF uses LF line endings, whereas the code skips CR + LF ("DOS"/Windows format); that's why there is a one-character offset.
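To illustrate the offset, here is a minimal PHP sketch. The extension's actual parsing code is not quoted in this thread, so the `stream` marker, the fixed two-byte skip, and the helper name `skipEol` are assumptions made for illustration only. It shows how a skip that assumes CR + LF drops the leading `<` when the file uses a bare LF, and how checking the actual line ending avoids that.

```php
<?php
/**
 * Hypothetical helper, not taken from the extension: returns the offset of
 * the first byte after the end-of-line marker that starts at $offset.
 */
function skipEol(string $data, int $offset): int
{
    if (substr($data, $offset, 2) === "\r\n") {
        return $offset + 2; // CR + LF
    }
    if (substr($data, $offset, 1) === "\n") {
        return $offset + 1; // LF only
    }
    return $offset; // no end-of-line marker at this position
}

$marker = 'stream'; // assumed marker preceding the XMP packet
$lf     = "stream\n<?xpacket begin=\"\"?>";
$crlf   = "stream\r\n<?xpacket begin=\"\"?>";

// Fixed two-byte skip: fine for CR + LF, but swallows the '<' after a bare LF.
$naive = substr($lf, strlen($marker) + 2);

// Skip based on the line ending that is actually present.
$fixedLf   = substr($lf, skipEol($lf, strlen($marker)));
$fixedCrlf = substr($crlf, skipEol($crlf, strlen($marker)));
```

With the LF input, `$naive` starts with `?xpacket`, which matches the XML shown above, while both `$fixedLf` and `$fixedCrlf` start with `<?xpacket`.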
I've added your patch, but now I occasionally get this error:
Core: Error handler (BE): PHP Warning: DOMDocument::loadXML(): ParsePI: PI xpacket never end ... in Entity, line: 46 in /typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 268
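Messages such as "ParsePI: PI xpacket never end" are produced by libxml while DOMDocument::loadXML() parses the packet. As a sketch only (the helper name `collectXmlErrors` is hypothetical and not part of the extension), libxml's internal error handling can collect those messages for logging instead of raising PHP warnings:

```php
<?php
/**
 * Hypothetical helper: parses $xml and returns the libxml messages instead
 * of letting DOMDocument::loadXML() raise PHP warnings.
 *
 * @return string[] empty when the document could be loaded
 */
function collectXmlErrors(string $xml): array
{
    if ($xml === '') {
        // loadXML() rejects an empty string outright, so report it directly.
        return ['empty input'];
    }

    // Route libxml errors into an internal buffer and remember the old setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    try {
        $messages = [];
        $dom = new DOMDocument();
        if ($dom->loadXML($xml) === false) {
            foreach (libxml_get_errors() as $error) {
                $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
            }
        }
        return $messages;
    } finally {
        // Always restore the previous libxml state.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

$wellFormed    = collectXmlErrors('<a/>');
$missingStart  = collectXmlErrors('?xpacket begin=""?><a/>');
$unterminated  = collectXmlErrors('<?xpacket begin="" <a/>');
```

The returned messages could then be written to a log together with the file name, which makes it easier to tell a truncated packet apart from an offset problem.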
Certainly some other oddity with the parsing (which I didn’t write in the first place). Could you please send me an example document causing the problem so I could double check what happens exactly?
I found another file that causes a ParsePI error. The file gets uploaded, a sys_file_metadata record is created and no error is visible to the editor. But there are 3 warnings in the protocol module:
Unfortunately I don't have the tools to check the integrity of PDF files like this one...
It would be nice if extractor gracefully handled metadata extraction errors by catching and logging the problem but returning with an empty $metadata array. That way, TYPO3 could at least insert a sys_file_metadata record for an editor to edit subsequently. With the current implementation, TYPO3 ends up without this record, thus disabling the edit metadata functionality for the file in the backend.
I monkey-patched extractor here by wrapping extractMetadataFromPdf with try ... catch.
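The monkey patch itself is not shown in the thread; a minimal sketch of the idea could look like the following. The function name `extractMetadataSafely` and the callable-based signature are assumptions for illustration, while `extractMetadataFromPdf` is the method named above. Since the reported trace shows TYPO3's error handler throwing an exception for the warning, catching \Throwable should be enough to fall back to an empty array.

```php
<?php
/**
 * Hypothetical wrapper: runs an extraction callable and falls back to an
 * empty metadata array if it throws.
 *
 * @param callable $extractor e.g. a closure calling extractMetadataFromPdf()
 * @return array extracted metadata, or [] on failure
 */
function extractMetadataSafely(callable $extractor, string $fileName): array
{
    try {
        $metadata = $extractor($fileName);
        return is_array($metadata) ? $metadata : [];
    } catch (\Throwable $e) {
        // Log the problem, but let the caller continue with empty metadata.
        error_log(sprintf('Metadata extraction failed for "%s": %s', $fileName, $e->getMessage()));
        return [];
    }
}

// Usage with stand-in extractors (the real one would call the extension).
$ok = extractMetadataSafely(fn (string $f): array => ['title' => 'Example'], 'good.pdf');
$failed = extractMetadataSafely(function (string $f): array {
    throw new \RuntimeException('Start tag expected');
}, 'broken.pdf');
```

Returning an empty array rather than rethrowing means the upload can proceed and the metadata record can still be created and edited by hand, at the cost of hiding the failure from the editor; the log entry is what keeps the problem visible.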
Extractor works mostly fine, but I get fatal errors when uploading certain files:
PHP Warning: DOMDocument::loadXML(): Start tag expected, '<' not found in Entity, line: 1 in /var/dvt/html/t3tiro/web/typo3conf/ext/extractor/Classes/Service/Php/PhpService.php line 265 | TYPO3\CMS\Core\Error\Exception thrown in file /var/dvt/html/t3tiro/web/typo3/sysext/core/Classes/Error/ErrorHandler.php in line 137
This leads to TYPO3 uploading the file despite the error, but not creating a sys_file_metadata record, so editors are not able to edit the metadata within TYPO3. You can use this PDF, which should hopefully reproduce the error during the upload process. A coworker of mine who specializes in automated document creation checked the file for PDF syntax and also did a preflight check (no idea what that is :-) ) in Adobe Acrobat. He said the file looks fine.
Uploading the same file without extractor produces no errors. Any idea what causes this?