Escaping of note content in RDF export

avram commented 12 years ago

People in the TEI world have noticed that our RDF export makes a mess of HTML tags in item data:

<rdf:value>&lt;h6>1256 to 1272&lt;/h6>
&lt;p>&amp;nbsp;&lt;/p>
&lt;p>page 32&amp;nbsp; roll 1218a 1272 John the Clerk against William de Grendon regarding the warrant of 8 acres&lt;/p>
&lt;p>page 40 ditto&lt;/p>
&lt;p>&amp;nbsp;p108 roll 144 1269 Claim by Margery who was the wife of Henry of Ashbourne&amp;nbsp; re dower from various individuals including Stephen of Ireton the third part of an acre of meadow in Snelston, and ?( William de ) Hulton in Clifton .&amp;nbsp; William de hylton gives up dower amongst others.&amp;nbsp; Makes one wonder whether&amp;lt;per corresp='#williamofhultonclerk' role='m'&amp;gt;William de Hulton&amp;lt;/per&amp;gt; and William the clerk are the same person.&lt;/p>
&lt;p>page 109 ditto Roger is the son of Henry of Ashbourne and is in the custody of Margaret countess Derby&amp;nbsp; and lands in the custody of Edmund king's son&lt;/p>
&lt;p>page 9&amp;nbsp; and 10 1258 Information re Henry of Ashbourne.&amp;nbsp; Holds a court. Case of villeinage.&amp;nbsp; Confirms Henry heir of&amp;nbsp; Robert of Ashbourne.&amp;nbsp; Stephen of Ireton one of the pledges for Henry.&lt;/p>
</rdf:value>
</bib:Memo>

We are presumably doing the same with things like <i> in item titles. A proper solution to this, as suggested in the linked thread on eXist-TEIXML, is to namespace those tags. We would also need to replace non-XML entities like &nbsp;.

Unfortunately, this behavior has its roots in the underlying Tabulator RDF engine; I don't know how we'd convince it to handle this with namespacing.
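As a rough illustration of the entity replacement mentioned here, the sketch below rewrites named HTML entities as numeric character references, which any XML parser accepts. The function name `replaceHtmlEntities` and the short entity table are assumptions made for this example; they are not part of Zotero or the Tabulator RDF engine, and a real table would need to cover far more entities.

```javascript
// Hypothetical helper, not Zotero code. Only a few entities are listed here.
const HTML_ENTITY_CODE_POINTS = new Map([
  ["nbsp", 160],
  ["copy", 169],
  ["ndash", 8211],
  ["mdash", 8212],
  ["hellip", 8230],
]);

// The five entities that XML itself predefines; these are left untouched.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

function replaceHtmlEntities(markup) {
  return markup.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name)) return match;
    const codePoint = HTML_ENTITY_CODE_POINTS.get(name);
    // Names missing from the table pass through unchanged rather than being guessed.
    return codePoint === undefined ? match : `&#${codePoint};`;
  });
}

console.log(replaceHtmlEntities("<p>page 32&nbsp; roll 1218a</p>"));
// prints: <p>page 32&#160; roll 1218a</p>
```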

I would like help on this, if we have anyone still on the team who has experience with the RDF engine.

simonster commented 12 years ago

This is not all that crazy; it is pretty standard to escape HTML in Atom/RSS. I think that implementing the RDF changes (effectively, permitting parseType="Literal") is relatively trivial. What might be harder is writing code that reliably converts the output of TinyMCE from HTML to valid XML so that we don't end up with corrupt RDF by doing this.
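For reference, the kind of escaping under discussion can be sketched as below. This is an assumption based on the sample export quoted earlier, where only "&" and "<" appear escaped; it is not the Tabulator engine's actual code, and both function names are made up for the example.

```javascript
// Hypothetical sketch: escape note HTML so it can sit inside an XML text node.
// "&" must be handled first, otherwise the "&" in "&lt;" would be escaped again.
function escapeXmlText(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Inverse of the above, also accepting "&gt;". "&amp;" is decoded last so that
// "&amp;lt;" comes back as the literal text "&lt;" rather than as "<".
function unescapeXmlText(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

const note = "<p>&nbsp;</p>";
console.log(escapeXmlText(note));
// prints: &lt;p>&amp;nbsp;&lt;/p>
console.log(unescapeXmlText(escapeXmlText(note)) === note);
// prints: true
```

The double escaping visible in the sample (`&amp;lt;per ...`) falls out of the same rule: the note already contained the text `&lt;per`, so its ampersand was escaped once more on export.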

dstillman commented 12 years ago

TinyMCE output should be valid XHTML already.

avram commented 12 years ago

We can probably ignore it for titles until Zotero has real rich text support for other fields, since there's no way to enforce proper use of <i>, etc. in other fields at present.

> I think that implementing the RDF changes (effectively, permitting parseType="Literal") is relatively trivial.

We still need to namespace this out of the main RDF namespace and into the XHTML one, but maybe that is easy too.
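A minimal sketch of what the namespaced output might look like, assuming the note is already well-formed XHTML with no named HTML entities left in it. The helper name `noteToXmlLiteral` and the choice of a wrapping `div` are assumptions for illustration only; nothing here reflects how the RDF engine actually serializes values.

```javascript
// The XHTML namespace URI.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: wraps note markup in a single XHTML-namespaced element
// so that the child tags inherit the XHTML namespace instead of the RDF one.
// Assumes `xhtml` is well-formed XML; no validation is done here.
function noteToXmlLiteral(xhtml) {
  return (
    '<rdf:value rdf:parseType="Literal">' +
    `<div xmlns="${XHTML_NS}">${xhtml}</div>` +
    "</rdf:value>"
  );
}

console.log(noteToXmlLiteral("<h6>1256 to 1272</h6><p>&#160;</p>"));
```

Declaring the default namespace once on the wrapper keeps the inner markup unchanged, at the cost of adding an element that was not in the original note.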

simonster commented 12 years ago

@dstillman Even if TinyMCE's output is XHTML, what do we do about Word copy/paste and people who use the HTML editor? We could escape it in that case, but then someone parsing Zotero RDF with a plain XML parser might be confused by the inconsistency (and to an RDF parser there shouldn't be any difference anyway).

dstillman commented 12 years ago

We have TinyMCE set to automatically clean up code. <div><blink>Foo</blink></div (without the trailing ">") entered into the code editor becomes <div>Foo</div>.

aurimasv commented 12 years ago

What are the issues with storing data that contains < in CDATA sections?

simonster commented 12 years ago

Since it would be the same to an XML parser, I'm not sure putting things in a CDATA section is worth the hassle (unless people are really parsing Zotero RDF without an XML parser, which seems like a bad idea to me). I admit I'm not entirely convinced there's anything wrong with the status quo. Even if our HTML happens to be XHTML, as a standard, XHTML is dead.
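One practical issue with CDATA is that the note itself may contain the terminator `]]>`. A common workaround, sketched here with a hypothetical `toCdata` helper that is not part of Zotero, is to split that sequence across two adjacent sections.

```javascript
// Hypothetical helper: wrap arbitrary text in CDATA.
// "]]>" would end the section early, so it is split into "]]" and ">"
// across two adjacent CDATA sections; a parser concatenates them again.
function toCdata(text) {
  return "<![CDATA[" + text.split("]]>").join("]]]]><![CDATA[>") + "]]>";
}

console.log(toCdata("<h6>1256 to 1272</h6>"));
// prints: <![CDATA[<h6>1256 to 1272</h6>]]>
console.log(toCdata("a]]>b"));
// prints: <![CDATA[a]]]]><![CDATA[>b]]>
```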

aurimasv commented 12 years ago

I don't follow. From Avram's initial post (shortened): `<rdf:value>&lt;h6>1256 to 1272&lt;/h6> &lt;p>&amp;nbsp;&lt;/p></rdf:value>`

To make it look nice:

`<rdf:value><h6>1256 to 1272</h6> <p>&nbsp;</p></rdf:value>`

Clearly invalid XML and will not parse.

`<rdf:value><![CDATA[<h6>1256 to 1272</h6> <p>&nbsp;</p>]]></rdf:value>`

Valid and looks nice. I feel like I'm missing something about this discussion though.

simonster commented 12 years ago

What I'm saying is that there is no difference between the first and last example to a parser, i.e.,

(new DOMParser()).parseFromString("<value>&lt;h6>1256 to 1272&lt;/h6>&lt;p>"+
"&amp;nbsp;&lt;/p></value>", "text/xml").documentElement.firstChild.nodeValue 
== (new DOMParser()).parseFromString("<value><![CDATA[<h6>1256 to 1272</h6><p>"+
"&nbsp;</p>]]></value>", "text/xml").documentElement.firstChild.nodeValue
/*
true
*/

If someone is trying to parse these files with something besides an XML parser, they are doing it wrong. I don't think it's worth putting any effort into this for the pleasure of people who sit down and read the XML, because that's not what it's meant for.

If there's any change needed, it'd be to something more like the second example:

<rdf:value rdf:parseType="Literal"><h6>1256 to 1272</h6>
<p>&#160;</p></rdf:value>

This is valid and should look the same as the other examples to an RDF parser, but is (slightly) cleaner and easier to deal with if you want to parse note contents with an XML parser. However, in the age of HTML5, I'm not convinced that people should be parsing (X)HTML with an XML parser at all as a matter of principle, because XHTML is dead.

dstillman commented 12 years ago

> However, in the age of HTML5, I'm not convinced that people should be parsing (X)HTML with an XML parser at all as a matter of principle, because XHTML is dead.

That's probably overstating things—there's still a valid XML serialization of HTML5 (XHTML5), and I imagine TinyMCE will continue to produce valid XML markup even when it adds HTML5 elements. The general point is reasonable, but there is something to be said for readability, and if we don't see ourselves outputting non-XML HTML at any point, is there any reason not to use the literal example? We do embed XHTML directly in various server API modes, so this would be consistent with that.

zotero / translators

Escaping of note content in RDF export #81
