yogthos / markdown-clj

Markdown parser in Clojure
Eclipse Public License 1.0

Describe best practices for sanitising non-markdown HTML properly with markdown-clj #155

Closed: Ashe closed this issue 5 years ago

Ashe commented 5 years ago

Hey there, I was wondering whether you could show more examples using :replacement-transformers, as I'm not sure how to go about escaping HTML properly.

;; Body
[:div
  {:dangerouslySetInnerHTML
    {:__html (md/md->html
               (:post-summary p)
               :replacement-transformers
               (cons escape-html mdt/transformer-vector))}}]
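
For readers unfamiliar with the transformer contract used above, here is a minimal sketch of the idea: each transformer takes a line of text plus a state map and returns a `[text state]` pair, and the chain is threaded over each line in order. This is a simplified model, not markdown-clj's actual implementation; `shout`, `count-line`, `apply-transformers` and the `:lines` key are hypothetical names made up for illustration.

```clojure
(require '[clojure.string :as string])

;; Hypothetical transformer: upper-cases the line, leaves the state untouched.
(defn shout [text state]
  [(string/upper-case text) state])

;; Hypothetical transformer: leaves the text alone, counts lines in the state.
(defn count-line [text state]
  [text (update state :lines (fnil inc 0))])

;; Simplified model (an assumption, not the library's code): thread one line
;; and its state through every transformer, in order.
(defn apply-transformers [transformers text state]
  (reduce (fn [[t s] transformer] (transformer t s))
          [text state]
          transformers))

;; Prepending with cons, as in the snippet above, makes `shout` run first.
(apply-transformers (cons shout [count-line]) "hello" {})
;; => ["HELLO" {:lines 1}]
```

Under this model, a transformer placed at the front of the vector sees the raw line before any of the later transformers have touched it.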

(def ^:dynamic ^:no-doc *html-mode* :xhtml)

(defn- escape-html
  "Change special characters into HTML character entities."
  [text state]
  [ (if (and (not= :code state) (not= :codeblock state))
      (-> text
        (s/replace #"&"  "&amp;")
        (s/replace #"<"  "&lt;")
        (s/replace #">"  "&gt;")
        (s/replace #"\"" "&quot;")
        (s/replace #"'" (if (= *html-mode* :sgml) "&#39;" "&apos;")))
      text) state])

The code above attempts to escape any HTML that was not generated by markdown-clj, and it succeeds, but it fails when HTML is placed inside a code block. Since it is important to sanitise HTML from queries while still producing readable code blocks, I feel it'd be useful to new users to show an example of how to go about this.
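
One likely reason the check in the snippet above never skips code blocks: it compares the keywords `:code` and `:codeblock` against `state` itself. Assuming `state` is a map with those keys as flags (which is how the later snippet in this thread reads it), a keyword is never equal to a map, so both `not=` tests are always true. A small sketch of the difference, with hypothetical helper names:

```clojure
;; Assumed shape of the state while inside a fenced code block.
(def state-in-codeblock {:codeblock true})

;; Check as written in the snippet above: compares a keyword with the map.
(defn escapes-by-comparison? [state]
  (and (not= :code state) (not= :codeblock state)))

;; Check that looks the flags up in the map instead.
(defn escapes-by-lookup? [state]
  (not (or (:code state) (:codeblock state))))

(escapes-by-comparison? state-in-codeblock) ;; => true  (still escapes)
(escapes-by-lookup? state-in-codeblock)     ;; => false (skips the code block)
(escapes-by-lookup? {})                     ;; => true  (ordinary text)
```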

Ashe commented 5 years ago

As described in #36 I tried this:

(defn- escape-html
  "Change special characters into HTML character entities."
  [text state]
  (let [sanitized-text 
          (clojure.string/escape text 
             {\& "&amp;" 
              \< "&lt;" 
              \> "&gt;" 
              \" "&quot;"
              \' "&#39;"})]
    [(if (not (or (:code state) (:codeblock state)))
      sanitized-text text) state]))

This works for the most part, although single-line (inline) code still gets escaped. It will have to do for now, though.

yogthos commented 5 years ago

Hi,

Yeah, avoiding escaping within inline code is trickier, so this is a reasonable approach. As a note, it's probably better to check for a code block before sanitizing, since that avoids the work in cases where it's not needed. I'll add this as an example in the docs to help others running into this problem.
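
A minimal sketch of that suggestion, reusing the escape map from the snippet earlier in the thread: the state is checked first, so `clojure.string/escape` only runs on lines that actually need it. The `html-escapes` name is introduced here for illustration, and this is not necessarily the example that ended up in the docs.

```clojure
(require '[clojure.string :as string])

;; Characters to replace, taken from the snippet earlier in this thread.
(def html-escapes
  {\& "&amp;"
   \< "&lt;"
   \> "&gt;"
   \" "&quot;"
   \' "&#39;"})

(defn escape-html
  "Escape HTML special characters unless the line is inside code.
  Checks the state first so no escaping work is done for code lines."
  [text state]
  (if (or (:code state) (:codeblock state))
    [text state]
    [(string/escape text html-escapes) state]))

(escape-html "<b>hi</b>" {})
;; => ["&lt;b&gt;hi&lt;/b&gt;" {}]
(escape-html "<b>hi</b>" {:codeblock true})
;; => ["<b>hi</b>" {:codeblock true}]
```

It would be wired in the same way as before, with `(cons escape-html mdt/transformer-vector)` passed as `:replacement-transformers`. Inline code on an otherwise ordinary line is still escaped, as noted above.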