zendtech / IbmiToolkit

PHP frontend to XMLSERVICE for IBM i development.
BSD 3-Clause "New" or "Revised" License
46 stars 34 forks source link

Bug or not, when using xmltoolkit/xmlservice with invalid characters in IBM i database field #178

Open ChriMech opened 2 years ago

ChriMech commented 2 years ago

I am using php xmltoolkit with xmlservice to fetch text from an ibm i database file through a call to an RPG programm. Everything works fine, unless the character database field contains "invalid" characters below EBCDIC x'40' (i.e x'3F'). This surely should not be, but somehow the database has thousands of records with one or more of this non-text characters within a text field, which should include an article description.

When the RPG Program call fetches such a text field through an RPG program parameter, i get this : Warning: simplexml_load_string(): Entity: line 36: parser error : CData section not finished Articl in /QOpenSys/pkgs/share/zendphp/libraries/xmltoolkit/1.8.5/ToolkitServiceXML.php on line 807 Warning: simplexml_load_string(): <data var='TEXT' type='50a' ><![CDATA[Article « description]]></data> in /QOpenSys/pkgs/share/zendphp/libraries/xmltoolkit/1.8.5/ToolkitServiceXML.php on line 807 Warning: simplexml_load_string(): ^ in /QOpenSys/pkgs/share/zendphp/libraries/xmltoolkit/1.8.5/ToolkitServiceXML.php on line 807 Warning: simplexml_load_string(): Entity: line 36: parser error : PCDATA invalid Char value 26 in /QOpenSys/pkgs/share/zendphp/libraries/xmltoolkit/1.8.5/ToolkitServiceXML.php on line 807

The invalid characters show as

My question: Is this a bug in xmlservice or ToolkitServiceXML.php or do you have any suggestions, on how i should deal with the situation? Thank you very much, Christian

alanseiden commented 2 years ago

@ChriMech You could specify the binary data type, which returns data as a hex string. In your PHP code, you could then convert the hex characters into something usable. https://github.com/zendtech/IbmiToolkit/blob/5ce2382183bbbfe5ca0a9c40357bfd7c080ef828/ToolkitApi/Toolkit.php#L1631

$params[] = $conn->AddParameterBin('out', 50, 'TEXT', 'mytext', '');

Try this and let us know if you receive hex characters that you can work with.

alanseiden commented 2 years ago

If you are curious what XMLSERVICE does for binary type, here is where some of the relevant XMLSERVICE code for the binary type (return as hex) is done: https://github.com/IBM/xmlservice/blob/20800d6be4aa4748b6eb226896c899c9c7e752ab/src/plugconv.rpgle#L1208 It shows the example node:


<data type='5b'>F0F1F2CDEF</data>
ChriMech commented 1 year ago

Thank you, but to be honest, I had something different in my mind, i.e. a suggestion for a future improvement of XMLSERVICE.

Sometimes by mistake, text within character based Database fields (A-Z, etc.) contain some Bytes with EBCDIC Codes below x'40'. This is often caused by improper data import from non IBM i data sources to IBM i tables, when character conversion from UTF-8 to EBCDIC CCSID 1141 is not done correctly with special german or european characters like ä, ö, ü.

In many cases this is not a big problem, because most IBM i related functions substitute “bad” characters with a replacement symbol (mainly a solid block or a square). This is true within all 5250 greenscreen applications, but also when using SQL to show table contents, or when using SQL from within PHP scripts. Even the remote program call APIs included in IBM i Client Access perform this substitution automatically, not to mention the old Net.Data scripting language on IBM i.

Only XMLSERVICE does nothing like this, but produces invalid xml when such bad characters are encountered. This invalid xml eventually results in fatal errors within the toolkit.

So my proposal for improvement of XMLSERVICE is: When preparing the XML to output data for a character typed program parameter, check to make sure the XML will be valid. If needed, replace invalid characters with a substitution (maybe even a blank). I don’t know whether this check should be done before or after translation from IBM i to webserver character set. Probably the easiest will be to modify the translation table to achieve this, but I am not the expert in this.

alanseiden commented 1 year ago

@ChriMech No guarantees, but would these options to ignore errors help on the PHP side? Perhaps you could test your XML example with this command: $xml = simplexml_load_string('your XML string here', "SimpleXMLElement", LIBXML_NOERROR | LIBXML_ERR_NONE);

alanseiden commented 1 year ago

@ChriMech Have you had a chance to test a "bad" XML string with the command above? If it helps, we could incorporate these options into the toolkit. Thank you.

ChriMech commented 1 year ago

@alanseiden I use the php xmltoolkit together with CW to call RPG programs. To do the test you suggested, i modified the php code in .../xmltoolkit/1.8.5/ToolkitServiceXML.php on line 807 (within function getParamsFromXml() ) to read like this: $xmlobj = simplexml_load_string($xml, "SimpleXMLElement", LIBXML_NOERROR | LIBXML_ERR_NONE);

With this, the "ugly" PHP warnings don't come up anymore, but the function call to getParamsFromXml() returns false and produces the bad.xml file. (see attached zip file bad.zip). Consequently the toolkit program call fails.

The invalid character within the xml string is a x'1A' shown in lines 36 and 50 of bad.xml. So just ignoring the incorrect XML characters while doing the simplexml_load_string(), does not look like an option. I still think XMLSERVICE on IBM i should check for invalid characters when preparing the xml for charcter typed fields. But I don't dare to review the XMLSERVICE source code myself.

alanseiden commented 1 year ago

Thanks! We'll review your results.

alanseiden commented 5 months ago

Although I understand the intent behind this issue, the difficulty of deciding what is "bad" data in a particular character set will make this request unlikely to be implemented unless someone submits a PR that can be tested under various character encodings.

ChriMech commented 5 months ago

I don't object closing the issue. It really is a rare case.

alanseiden commented 5 months ago

@ChriMech I also encountered this issue when creating the CW and calling various "Q*" system APIs. Certain data structure fields produced binary data. I handled them using the "hole" or "binary" data types. The difficulty would be if we could not predict the binary data in advance, as in your case.

@ChriMech I don't mind keeping the issue open if we want to resolve it on the PHP side, but if it's going to be an xmlservice enhancement then we should move it to the xmlservice project. Am I correct in thinking you feel it should be part of xmlservice? I do recall, for comparison, that the Easycom toolkit replaced binary characters with spaces. In the xmlservice project, we need to get momentum to improve the test suite structure.

alanseiden commented 5 months ago

I'm going to reopen this issue. We have some ideas.