weavejester / hiccup

Fast library for rendering HTML in Clojure
http://weavejester.github.io/hiccup
Eclipse Public License 1.0
2.68k stars 174 forks source link

Escape all strings by default #122

Closed gyim closed 8 years ago

gyim commented 8 years ago

This commit implements issue #114 (Safe HTML by default). It introduces a few breaking changes:

  1. Since all strings are escaped by default, adding HTML markup from string literals or expressions will not work. Use hiccup.util/raw-str to prevent these values from being escaped.
  2. hiccup.util/escape-html, hiccup.core/h and hiccup.core/html returns a RawString object instead of a String. They can be converted to java.lang.String with the built-in str function.

Note that unlike hiccup.util/html, macros in hiccup.page (html4, html5, xhtml) still return a string so that their result can be passed directly to Ring. In the rare cases when the output is embedded to another html page (e.g. into an iframe) it should be also tagged with hiccup.util/raw-str.

weavejester commented 8 years ago

Thanks for the patch. I have a few suggested changes.

  1. Let's rename raw-str to raw-string. We don't need to abbreviate a function that isn't going to be used often.
  2. Let's remove the h function entirely. The abbreviation isn't necessary anymore.
  3. The escape-html function should probably still return a string. With automatic escaping, it's not going to be used under the html macro as it was before, so it should be treated like a utility function for edge case uses.
  4. I don't think the html4, xhtml and html5 macros should return strings. I think we should favor consistency over convenience, especially since Ring integration problems can be solved with middleware or similar.
  5. I think the RawString type itself should be moved into the hiccup.compiler namespace, as it's directly to do with the compiler. However, the hiccup.util/raw-string function should remain in the same place.
  6. You backtick a few things that could be calculated inline. I'll add comments to the lines themselves.
  7. Some more tests are necessary. In particular, what happens if you have a html inside a html? What happens if you call a function that uses html? That sort of thing.
gyim commented 8 years ago

Thanks for the feedback, it is very helpful, and I agree with most of your points. There are a few things though where I would take a different approach:

  1. I would introduce as few breaking changes as possible, and make these changes easy to discover and fix. Ideally, users should not notice any change unless they rely on the current non-escaping behavior (e.g. by embedding HTML entites to string literals). If this is not possible, a few minor changes are OK, but requiring hundreds of changes per project would certainly annoy users and possibly slow down the adoption. So I would not remove or break the (h) function, at least not immediately. I would make it unnecessary and mark it as obsolete, and maybe remove it a few years later.
  2. I would expose RawString as little as possible. It is an implementation detail, and users should not care about its existence: they want a java.lang.String output, not some custom object. I agree about preferring consistency over convenience though: it is not right to return RawString at one place and java.lang.String at another.

You pointed out that escape-html could return a string, and this gave me an idea: why not make everything return a string except raw-string? So how about this:

  1. html, html4, html5 and xhtml should all return a string. When you embed a "partial template" (html inside html) you have to use raw-string. I think this is a clean and consistent behavior, and using "partial templates" is probably not a very common use case. Also, if you miss it, you will immediately see it on the output (a whole HTML subpage will be escaped, which is easy to spot)
  2. h should become an alias for identity, and be marked as obsolete in the doc string. I would also rename the escape-html function so if somebody uses it direcly instead of h (which should be the minority of the use cases) it will cause a compile-time error rather than a lot of double-escaped strings (which are hard to discover if escape-html is used on non-HTML data)
  3. raw-string returns a RawString. Since it is a new function there will be no surprises for existing users. Since this is the only function that returns a RawString, it will be possible to render an entire webpage without ever creating a RawString instance.
weavejester commented 8 years ago

I would introduce as few breaking changes as possible

I'm going to have to disagree pretty strongly with this point. If we must introduce breaking changes, then it's better to have them all in one release, rather than spread across multiple releases. In the long term it's much less painful.

The reason for this is that any breaking change, no matter how small, carries with it a huge fixed cost. The is particularly true for a language like Clojure that doesn't allow multiple versions of the same package in the same dependency tree. For example, if someone is using the Ring-Devel package, which depends on Hiccup 1.x, they won't be able to use Hiccup 2.x.

So I'd prefer that people have to deal with all the issues with moving to Hiccup 2.0 once, rather than spread the pain across Hiccup 3.0, 4.0 and so forth.

I would expose RawString as little as possible. It is an implementation detail, and users should not care about its existence

RawString is the return value from html and raw-string, which means it's part of the API and therefore not an implementation detail. We've introduced a new type so we can distinguish between a string and the output from html. It's very much part of the design.

We shouldn't be half-hearted in how we define APIs. Either we hide an abstraction away entirely, or we embrace it. We shouldn't have an abstraction that we expect the end user to make use of, and at the same time try and pretend it's not there.

html, html4, html5 and xhtml should all return a string. When you embed a "partial template" (html inside html) you have to use raw-string. I think this is a clean and consistent behavior, and using "partial templates" is probably not a very common use case.

This would work, except that "partial templates" are a very common use case in Hiccup. Typically in Hiccup you'll have functions like:

(defn panel [title & body]
  [:div.panel [:h2.title title] body])

And if you want to speed them up, you add a html macro in:

(defn panel [title & body]
  (html [:div.panel [:h2.title title] body]))

This precomputes a lot of the output, effectively reducing the runtime problem down to a series of string concatenations, which is much more performant.

On the other hand, actually transforming the RawString into an output string is something that typically will only happen once in an application. In fact, you could easily add something like:

(extend-protocol compojure.response/Renderable
  hiccup.compiler.RawString
  (render [raw _] (str raw))

And then you could keep using html in the same way as you were before. The next version of Ring will have a similar protocol for extending the types allowed in the :body key. So from a practical perspective, not much would change.

If we're going to optimise for the common case, then html should return a RawString.

gyim commented 8 years ago

All right, I tried to finish the implementation as discussed above. I hope it's ready (or very close to ready) for a merge now.

So I'd prefer that people have to deal with all the issues with moving to Hiccup 2.0 once, rather than spread the pain across Hiccup 3.0, 4.0 and so forth.

Got it. To be honest, I would probably go with a backwards-compatible change (e.g. keep h forever, or make auto-escaping opt-in), even if it is a bit half-hearted or unconvenient. But I know that it has downsides, and it is clearly your choice.

RawString is the return value from html and raw-string, which means it's part of the API and therefore not an implementation detail. We've introduced a new type so we can distinguish between a string and the output from html. It's very much part of the design.

Let me put it in a different way.

The semantics of the RawString class is something like this: "it is a value that is almost a string, but hiccup.core/html does not escape it". This is a perfectly acceptable return value for the raw-string function, but it does not seem right to me that the top-level API should return such a value. The caller does not want an "almost a string": it cannot do anything with it except to turn it to a string immediately. But then what is the advantage of not returning a string in the first place?

It is true that it is more convenient if there are nested html calls all over the code. But nested html calls are an optimization, and while I admit that I do not have much data on this particular case, I think optimization is always a minority of use cases. People normally optimize when they measured a bottleneck - and in a typical web app there are a lot of things to optimize before HTML generation becomes a bottleneck (database queries, caching of computation-intensive data, etc.). And in a many use cases (e.g. admin pages, computation-intensive web apps) it cannot be a real bottleneck at all.

So I would not optimize for convenience of the nested html calls, at least not this way. If it is really a widely used feature, I would create an explicit shorthand function for that, or perhaps use a dynamic var-based technique to detect nested calls.

Anyway, it is my last wall-of-text ;) I just wanted to make sure that this change does not make things more difficult than necessary... If you change your mind, I would be happy to modify the code. Or I will finish it this way (if it still needs improvements).

gyim commented 8 years ago

I think the RawString type itself should be moved into the hiccup.compiler namespace, as it's directly to do with the compiler. However, the hiccup.util/raw-string function should remain in the same place.

Unfortunately, hiccup.compiler requires the hiccup.util namespace, so it would cause a circular dependency. I left RawString in hiccup.util for now, maybe it could be moved to a separate namespace.

weavejester commented 8 years ago

My preference is to favour consistency over convenience in APIs. If html returns a RawString, then the type of the input matches the type of the output. If html returns a string, then we introduce a degree of inconsistency.

Even if a user never nests html in their own code, what happens when they use someone else's code? Because functions are opaque, we can only optimise them if we have control over their source. This means that a performant library should use html liberally, in order to maximise the amount of HTML that can be precomputed.

Further, returning a custom type by default opens the way for additional functionality in future without breaking backward compatibility. At the moment we simply wrap a string, but we could potentially wrap a more complex structure that would allow, for instance, rendering with different whitespace options, or for different doctypes.

The fundamental problem with Hiccup 1.x is that strings are not a rich enough data type to represent everything we might want to communicate in the output. Not escaping by default is merely the most visible problem caused by the underlying design decision to rely on strings.

I also don't think we're losing much in terms of convenience. At most we're introducing an additional six characters the user has to type. At least we introduce zero and everything works as before. That seems a small price to pay for a more consistent, functional, and future-proof API.

weavejester commented 8 years ago

I lost track of this PR a little over Christmas. I believe there's still a little work to do on it? I'll add some notes to the source code.

weavejester commented 8 years ago

Aside from those two points, I think this PR can be squashed and merged.

gyim commented 8 years ago

Sorry for the long delay... I modified the code according to your requests. If it's OK, I will squash the commits (the last commit is quite large, so I thought it's easier to review it this way).

weavejester commented 8 years ago

Thanks for the update! I have one small code suggestion, and if you could remove the (set? x) change I think we're just about ready to merge.

Squashing the commits is a good idea :)

gyim commented 8 years ago

Done. I also removed an unnecessary test: checking whether escape-html returns a string (and not a RawString). Since this function did not change after all, and RawStrings cannot be compared to strings anymore, I think this test is no longer useful.

weavejester commented 8 years ago

Thanks for the updated commit! It looks good. I'll find some time this weekend to give it a last look over and manual test. If all goes well, I'll merge.

weavejester commented 8 years ago

Thanks for all your hard work on this PR, and thanks for being patient with my feedback :)

Incidentally, I think I've changed my mind about dropping h. If we keep it, but alias it to str and add a deprecation notice, then we can create Hiccup code that works in both 1.0 and 2.0.