zeldigas / text2confl

Publisher of documents to confluence
Apache License 2.0
13 stars 2 forks source link

using a & in an image URL makes text2confl crash #183

Closed feliksik closed 3 weeks ago

feliksik commented 3 weeks ago

imagine the following asciidoc:

= test

image::https://example.com/image?id=12&param=b[image]

https://example.com/image?id=12&param=b[link]

this results in the following Confluence XML:

<div class="openblock">
<div class="content">

</div>
</div>
<p><ac:image ac:alt="image"><ri:url ri:value="https://example.com/image?id=12&param=b" /></ac:image></p>
<p><a href="https://example.com/image?id=12&amp;param=b">link</a></p>

And it makes the tool crash:

Some pages content is invalid:
1. ./test.adoc: [6:84] The reference to entity "param" must end with the ';' delimiter.

The cause is that & in an XML attribute must apparently be escaped. As we can see, this is done for the <a href, but not for the <ri:url ri:value attribute.

zeldigas commented 3 weeks ago

@feliksik do you know if confluence itself is ready to process such content?

feliksik commented 3 weeks ago

Confluence is ready to process content with the escaped variant;

  <ac:image ac:alt="image">
    <ri:url ri:value="https://example.com/image.png?id=12&amp;param=b" />
  </ac:image>

This is properly rendered a an image to the unescaped variant (and it's the only valid XML to represent an &).

If the current implementation would not trip over the XML validation, Confluence actually accepts the invalid (unescaped) XML, and then converts it to store it as an escaped &amp; in the storage format (but it would still be ugly to depend on Confluence fixing it).

I just discovered that with an inline image (i.e., using image: instead of image::), it does already work as it should:

= test

image:https://example.com/image.png?id=12&param=b[image]

is converted to

  <ac:image ac:alt="image">
    <ri:url ri:value="https://example.com/image.png?id=12&amp;param=b" />
  </ac:image>
zeldigas commented 3 weeks ago

@feliksik fixed and released as 0.17.0

feliksik commented 1 week ago

The following also breaks with some pages content is invalid: ./bug.adoc: [13:47] The reference to entity "param" must end with the ';' delimiter.


++++
<a href="https://example.com/image?id=12&param=b">this links breaks text2confl</a>
++++

I think we should conclude that the proper approach is not to to have macro-specific escapings, like we did before, but instead, only do a replacement of the & in the html output, because it will be treated as XML, which will first unescape the &.

I'm not entirely sure though - alternatively, we could also apply the mechanism used above for literal sections (++++, +++, and i believe ~~~ or something).

zeldigas commented 1 week ago

@feliksik please create separate issue for this finding. The challenge for asciidoc here is that some rendered stuff is escaped while other parts are not, so it's not possible just "normalize" whole output as some post processing step. Maybe i'm missing something, but this is my current understanding..

feliksik commented 1 week ago

I think I might have been mistaken. It look like a href-attribute in an HTML anchor needs & characters to be escaped as &amp;, so it's really the responsibility of the author of the passthrough block to provide this to make sure the AsciiDoc HTML output is correct -- that is, the fact that's it's later interpreted/treated as XML does not add this as a new requirement. So I think issue regarding the ++++ should be ignored. Thanks Dmitry!