On 2015-06-11, 13:01h: xquery added the steps label.
On 2015-10-07, 14:37h: ndw added this to the XProc 2.0 LC milestone.
Steps for doing encoding and decoding
Opened by: ndw on 2015-02-09, 12:44h
ndw said on 2015-02-09, 12:44h:
Allowing non-XML to flow through the pipeline will be handy, but encoding/decoding may still be necessary at the boundaries. We could have p:encode/p:decode steps that convert to and from base64. They could also do hexBinary, maybe uuencode, etc.
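To make the proposal concrete, here is a minimal Python sketch of the round trip such steps would perform at a pipeline boundary. It is not XProc and not a proposed step signature: the function names `encode` and `decode` and the `scheme` parameter are hypothetical and only mirror the p:encode/p:decode idea; the base64 and hexBinary conversions use the Python standard library.

```python
import base64
import binascii


def encode(data: bytes, scheme: str = "base64") -> str:
    """Turn raw bytes into a text form that can travel inside XML.

    `scheme` is a hypothetical option name, not taken from any XProc draft.
    """
    if scheme == "base64":
        return base64.b64encode(data).decode("ascii")
    if scheme == "hexBinary":
        # Upper-case hex digits, two per byte.
        return binascii.hexlify(data).decode("ascii").upper()
    raise ValueError(f"unsupported scheme: {scheme}")


def decode(text: str, scheme: str = "base64") -> bytes:
    """Inverse of encode(): recover the original bytes."""
    if scheme == "base64":
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(text, validate=True)
    if scheme == "hexBinary":
        return binascii.unhexlify(text)
    raise ValueError(f"unsupported scheme: {scheme}")


# Round trip for the two UTF-8 bytes of "ü" (U+00FC).
payload = "\u00fc".encode("utf-8")
as_base64 = encode(payload, "base64")
as_hex = encode(payload, "hexBinary")
assert decode(as_base64, "base64") == payload
assert decode(as_hex, "hexBinary") == payload
```

Both forms carry bytes, not characters, so the character encoding of the payload still has to be agreed on separately.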
jallabine said on 2015-03-24, 14:03h:
That would be great! I need that feature for a natural language processing pipeline that reads and writes text files containing German umlauts. The individual transformations, mostly XSLT transformations plus one carried out in an external application, work fine on their own. But when I combine them in an XProc pipeline, the umlauts are distorted, although the encoding seems to be UTF-8 (according to Notepad++). So if there is any way to fix that issue, I would be delighted. Thanks a lot!
ndw said on 2015-06-11, 12:49h:
That sounds like a bug. @jallabine, can you send me an example that demonstrates it?
ndw said on 2015-10-07, 14:37h:
Nudge. @jallabine, can you provide more detail?
jallabine said on 2015-12-22, 13:32h:
Dear Norman,
I’m so sorry I could not reply to your e-mail earlier – lots of work and family issues.
My pipeline contains one p:exec step integrating an external application named TreeTagger (which is a part-of-speech tagger for several languages). This application receives text files, tokenizes them and adds POS information related to the respective words. Since my mother tongue is German, I used German texts with some umlauts for processing.
Initially, the result text looks fine, but in the following steps the umlauts appear distorted in a way that suggests an encoding problem. In the meantime, I noticed two things:
First, it apparently is a matter of display, because Notepad++ (on a Windows machine) shows “UTF-8” in the status bar and it actually IS possible to process the text files further. However, the characters are then not counted correctly, and I absolutely need correct counts for a later XSLT step in which strings are sorted by length. (Just to mention it: there is no alternative to using text files.)
Second, the problem seems to vanish if I add an encoding attribute with the value “utf-8” to each p:store step following the TreeTagger p:exec step. As far as I understood, XProc is UTF-8 based in its own right. In any case, the workaround works, which is all I needed in the first place. It would be interesting, though, to know why the issue arose.
I wish you, your family and friends a merry Christmas and a happy New Year, and thanks for the good work you have been doing so far in developing the XProc standard!
Best regards,
Sabine
word b sign
Sabine Mahr
Graduate Translator, Technical Writer
Schoenbacher Hauptstr. 57
D-35745 Herborn
Germany
Phone +49-(0)2777/911184
Fax +49-(0)2777/911185
sabine.mahr@wordbsign.com
www.wordbsign.com
From: Norman Walsh [mailto:notifications@github.com]
Sent: Wednesday, 7 October 2015 16:38
To: xproc/specification <specification@noreply.github.com>
Cc: jallabine <sabine.mahr@wordbsign.com>
Subject: Re: [specification] Steps for doing encoding and decoding (#134)
Nudge. @jallabine, can you provide more detail?
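The thread does not establish why the umlauts were distorted, so the following Python sketch is only an illustration of one common way the symptoms from the 2015-12-22 comment can arise. It assumes that some step read UTF-8 bytes as Windows-1252; that assumption, the helper names `misread_utf8` and `repair`, and the sample words are not taken from the pipeline in question. Under that assumption, each umlaut turns into two characters, which changes string lengths and therefore a length-based sort order.

```python
def misread_utf8(text: str, wrong_encoding: str = "cp1252") -> str:
    """Simulate a step that decodes UTF-8 bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo the mistake, assuming nothing else altered the text in between."""
    return text.encode(wrong_encoding).decode("utf-8")


door = "T\u00fcr"                # "Tür": 3 characters, 4 bytes in UTF-8
distorted = misread_utf8(door)   # "Ã¼" now stands where "ü" was

assert len(door) == 3
assert len(distorted) == 4       # one extra character per umlaut

# Sorting by length gives a different order once the text is distorted.
words = ["Haus", door]
correct_order = sorted(words, key=len)
distorted_order = sorted([misread_utf8(w) for w in words], key=len)
assert correct_order == [door, "Haus"]
assert distorted_order == ["Haus", distorted]

# Decoding the same bytes as UTF-8 restores the original text.
assert repair(distorted) == door
```

If something like this happened, stating the encoding explicitly wherever text is written or read back would be consistent with the p:store workaround described above, but that remains a guess until an example pipeline confirms it.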