xproc / 1.0-specification

The 1.0 XProc specification and now abandoned drafts of a 2.0 XML specification

Steps for doing encoding and decoding #134

Open ndw opened 9 years ago

ndw commented 9 years ago

Allowing non-XML to flow through the pipeline will be handy, but encoding/decoding may still be necessary at the boundaries. We could have p:encode/p:decode steps that convert to and from base64. They could also handle hexBinary, maybe uuencode, etc.
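
As a rough illustration of what such steps would do at a pipeline boundary, here is a minimal Python sketch of base64 and hexBinary round-trips. The function names `encode` and `decode` and the `scheme` parameter are hypothetical, chosen only to mirror the proposed p:encode/p:decode; they are not part of any XProc implementation.

```python
import base64
import binascii


def encode(data: bytes, scheme: str = "base64") -> str:
    """Turn raw bytes into a text form that can travel inside XML."""
    if scheme == "base64":
        return base64.b64encode(data).decode("ascii")
    if scheme == "hexBinary":
        # Uppercase hex digits are the canonical form for xs:hexBinary.
        return binascii.hexlify(data).decode("ascii").upper()
    raise ValueError(f"unsupported scheme: {scheme}")


def decode(text: str, scheme: str = "base64") -> bytes:
    """Recover the raw bytes from their encoded text form."""
    if scheme == "base64":
        return base64.b64decode(text)
    if scheme == "hexBinary":
        return binascii.unhexlify(text)
    raise ValueError(f"unsupported scheme: {scheme}")


payload = "Grüße".encode("utf-8")  # hypothetical sample: non-ASCII text as bytes
for scheme in ("base64", "hexBinary"):
    # Each scheme must round-trip the bytes unchanged.
    assert decode(encode(payload, scheme), scheme) == payload
```

Both encodings are lossless for arbitrary bytes, so the character encoding of the payload only matters once the decoded bytes are interpreted as text again.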

jallabine commented 9 years ago

That would be great! I need that feature for a natural language processing pipeline that operates on and creates text files containing German umlauts. The individual transformations, mostly XSLT transformations and one carried out in an external application, work fine on their own. But when I combine them in an XProc pipeline, the umlauts are distorted, although the encoding seems to be UTF-8 (according to Notepad++). So if there is any possibility of fixing that issue, I would be delighted. Thanks a lot!

ndw commented 9 years ago

That sounds like a bug. @jallabine, can you send me an example that demonstrates it?

ndw commented 9 years ago

Nudge: @jallabine, can you provide more detail?

jallabine commented 8 years ago

Dear Norman,

I’m so sorry I could not reply to your e-mail earlier – lots of work and family issues.

My pipeline contains one p:exec step integrating an external application named TreeTagger (which is a part-of-speech tagger for several languages). This application receives text files, tokenizes them and adds POS information related to the respective words. Since my mother tongue is German, I used German texts with some umlauts for processing.

Initially, everything is okay with the result text but in the following steps, the umlauts appear distorted in a way that indicates encoding problems. In the meantime, I noticed two things:

First, it apparently is a matter of display, because Notepad++ (on a Windows machine) indicates “UTF-8” in the status bar and it actually IS possible to process the text files further. But the number of characters is then not counted correctly, which I absolutely need for a later XSLT step in which strings are sorted according to their lengths. (There is no alternative to using text files, just to mention that.)

Second, the problem seems to vanish if I assign an encoding attribute with the value of “utf-8” to each p:store step following the TreeTagger p:exec step. As far as I understood, XProc is UTF-8 based in its own right – anyway, the mentioned workaround is functional which is all I needed in the first place. It would be interesting, though, to know why the issue arose.

I wish you, your family and friends a merry Christmas and a happy New Year, and thanks for the good work you have been doing so far in developing the XProc standard!

Best regards,

Sabine
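
One common way symptoms like these arise is that UTF-8 bytes get decoded with a single-byte Windows code page somewhere along the way. This is an assumption; the thread does not confirm the cause. The Python sketch below uses a hypothetical sample string and cp1252 as the assumed wrong decoder, and shows how that mistake both distorts umlauts and inflates the character count.

```python
text = "Grüße"                    # hypothetical sample with two non-ASCII letters
raw = text.encode("utf-8")        # ü and ß each take two bytes in UTF-8

# Decoding the UTF-8 bytes with the wrong code page yields one character
# per byte, so every umlaut turns into two unrelated characters.
garbled = raw.decode("cp1252")

assert len(text) == 5             # what a length-based sort should see
assert len(raw) == 7
assert len(garbled) == 7          # miscounted: one character per byte

# If nothing else touched the data, reversing the wrong step restores the text.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == text
```

Under that assumption, stating the encoding explicitly wherever bytes are written or read removes the guesswork, which would be consistent with the p:store workaround described above.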


