wvuweb / cleanslate-cms

A place to file issues and view releases for CleanSlate CMS. http://cleanslatecms.wvu.edu
6 stars 0 forks source link

Sanitize bad markup in Mercury editable regions #2

Open adamjohnson opened 9 years ago

adamjohnson commented 9 years ago

From @adamjohnson on October 9, 2013 19:30

When using Chrome, Mercury outputs style tags, random classes, divs, and other strange things in some crazy places after working with content (highlight something, then format it a bit).

Keeping solid, semantic markup is vital to SEO and content reuse.

Style Attribute

http://cl.ly/Rrpj

Apple Converted Space

http://cl.ly/RroE

Copied from original issue: wvuweb/cleanslate#6

adamjohnson commented 9 years ago

While we're at it, everyone should star this issue in the Blink issue tracker by clicking the grey star next to "Issue 226941" in the top left:

https://code.google.com/p/chromium/issues/detail?id=226941

adamjohnson commented 9 years ago

This article from The Guardian is a very worthwhile read:

http://www.theguardian.com/info/developer-blog/2014/mar/20/inside-the-guardians-cms-meet-scribe-an-extensible-rich-text-editor

They seem to be dealing with the same contentEditable issues that we're dealing with (there are no standards, extremely messy output). So, they made Scribe. Looks very cool! They've made it to be extensible via plugins. Interesting ones look to be Sanitize and Intelligent Unlink Command, among others. Another nice aspect of Scribe is that The Guardian is paying these guys to work on it and its under active development.

adamjohnson commented 9 years ago

From @nreckart on June 4, 2014 14:49

Very interesting! I can see the potential for incorporating Scribe into Mercury. That could save us from a lot of headaches. And the fact that an organization the size of The Guardian is the driving force behind it is encouraging.

adamjohnson commented 9 years ago

A rather lengthy post, but I think it brings up valid points:

https://medium.com/medium-eng/why-contenteditable-is-terrible-122d8a40e480

dmolsen commented 9 years ago

So scribe uses HTMLJanitor which is also maintained by The Guardian. That can be used sans scribe which might make this easier to implement. An example:

<html>
    <body>
        <div id="test">
            <p foo bar="bad"><a href="">Hello World</a></p>
        </div>
        <script src="html-janitor.js"></script>
        <script>
            var el = document.getElementById("test");
            var janitor = new HTMLJanitor({
                tags: {
                  p: { foo: true, bar: 'baz' }
                }
              });
              console.log(el.innerHTML);
              console.log(janitor.clean(el.innerHTML));
        </script>
    </body>
</html>

I'm not seeing a list of events that are associated with Mercury so not sure when it would be best to run sanitize.

dmolsen commented 9 years ago

I don't think a sanitize will work that well but it's worth a try assuming we're not mucking with mercury (e.g. trying to roll in scribe). If there is an event that we can hook into let's throw a sanitize test at it. Is that something you can do @adamjohnson?

adamjohnson commented 9 years ago

I can help with UI stuff. Happy to get that mocked up in Mercury as long as either Nathan or Steve helps with the event and other JS stuff.

I like Steve's idea about having a "Sanitize" button in the HTML editor. Click the button, HTML Janitor runs. Opt in > auto.


For the record, Scribe gets my vote over Mercury. As Nathan has said, its better maintained with a big organization behind it.

On that note, if big implementation changes are required for Scribe, it'd be worth looking into others like Redactor, TinyMCE and the lot to make sure we're comfortable with where those projects are headed and their support.

dmolsen commented 9 years ago

@adamjohnson -

If you want to open a new issue regarding the WYSIWYG that's fine. It's one that will require a lot of discussion and won't be a Q1 delivery. It's off-topic for this issue.

nreckart commented 9 years ago

Scribe would not be a replacement for Mercury. We're too heavily invested in some of the design decisions made for Mercury to ditch it at this point. And Scribe isn't an editor. It's an editor framework.

It is conceivable that Scribe could be integrated into Mercury. Mercury implements many of it's own custom 'editor' functions when interacting with a contentEditable. These functions could be handled by Scribe while still maintaining the interaction with Mercury.

dmolsen commented 9 years ago

For ref, let's make sure that we double-check Word crap when implementing this solution. See #114.

dustinmazon commented 9 years ago

+1

nreckart commented 9 years ago

I made an attempt to integrate Scribe with Mercury and didn't have much luck. Perhaps Mercury's biggest advantage, keeping the site separate from the editor via an iframe, proved to be a big downside here. I couldn't get Scribe to work correctly within the iframe. Scribe is setup to use require.js, which was also complicating things.

I guess we'll just have to stick with some sort of sanitizing for now. I'll look into that next.

nreckart commented 9 years ago

@dmolsen @adamjohnson

So...can anyone tell me exactly what we want to "sanitize"? I'll need specifics. Unfortunately, we're not going to be able to do much at this time to reign in the behavior of contentEditable across browsers. I'm not sure how much me can do to clean up the random divs and spans that sometimes get generated. Are we willing to say that there can never be a legitimate case for inline styles?

joshrcook commented 9 years ago

At the very least, I think the Microsoft Office tags would be low-hanging fruit. From what I see, they always seem to start with <!--[if gte mso and end with <![endif]-->. Could we do some kind of regex pattern matching to get rid of those tags?

There also seems to be <!--StartFragment--> and <!--EndFragment--> comments that Word inserts that could be removed.

architarious commented 9 years ago

+1

Getting rid of all of the div/span tags that popup everytime you hit enter would be a great start. Old slate wasn't really all that bad when it came to handling content. If there where just a way to tell the user where tags end and start it might be a decent temporary fix (for chrome users) until something more elaborate can be devised.

adamjohnson commented 9 years ago

These tags should always be sanitized:

http://stackoverflow.com/questions/9228277/deprecated-tags-in-html5/9228371#9228371

and (in Chrome), this <span> tag is always irrelevant and undesireable:

<span style="font-size: 1em; line-height: 1.6;"></span>

As seen in the photo in the original post, you'll get this style as an attribute as well:

style="font-size: 1em; line-height: 1.6;"

This should also always be sanitized (although I haven't seen it in a while):

<span class="Apple-converted-space"></span>

Chrome finds great joy in adding extraneous <span> tags to your contentEditable regions. I wish there was a way to avoid this but I doubt there is.


Sanitize word:

After digging around the internet for something that cleans up text when pasting from word, here's what I found:

http://patisserie.keensoftware.com/en/pages/remove-word-formatting-from-rich-text-editor-with-javascript https://github.com/summernote/summernote/issues/303 https://github.com/jamster/text_cleaner https://github.com/metaraine/wordsoap https://github.com/metaraine/wordsoap-regex https://github.com/ivinstar/sanitize_msword

Many of these do similar stuff. If you need specific tags that MS Word inserts to blacklist, I could pull those out. But it seems that the regexes from one of these tools should be able to convey the general idea.


Opt-in Sanitize HTML:

When editing a page, click to view the HTML of the editable region. A button could be added to clean/sanitize HTML. A (growl-style?) notification could pop up saying "HTML Cleaned. Click 'Save and Replace' to keep these changes."

https://github.com/punkave/sanitize-html

https://code.google.com/p/jquery-clean/

Whitelist: https://github.com/joetannenbaum/mr-clean#html

I often use and like http://www.dirtymarkup.com/ for this; however, I don't know what library it uses to sanitize HTML.

If you're wondering what elements to specifically clean here, perhaps we could reference one of the libraries above OR we could reference the whitelist in mercury.js and be sure to add a few that we removed (like img and maybe a few others).


These tags should always be converted:

Semantic Elements:

Convert <b> to <strong>

and

<i> to <em>

https://github.com/guardian/scribe-plugin-formatter-html-ensure-semantic-elements

Related: http://stackoverflow.com/questions/271743/whats-the-difference-between-b-and-strong-i-and-em


New Pages

After creating a new page, all editable regions (contentEditable) should—if possible—start with an empty paragraph tag (<p></p>).

I've found starting with semantic markup from the onset helps contentEditable do a good job to create semantic markup while you're typing.

adamjohnson commented 9 years ago

Josh and Cole, could you provide us with a few pages in the wild that have bad markup caused by contentEditable? That would help us tremendously.

adamjohnson commented 9 years ago

Here's a quick test I did to see if I could make some jank markup (I succeeded):

http://pastie.org/10067511

I especially like the last sentence surrounded by random <span> tags as well as the plethora of empty div tags.

Perhaps this can be used as a test when making the opt-in sanitize HTML feature.

joshrcook commented 9 years ago

Sure, here's a good example of bad Word markup, in the first content block on the page. All I did to produce this code was copy the content into Word on a Mac and then copy it right back over to the page:

http://it.wvu.edu/governance

On that same page, look at the content block and the links surrounding the IT Oversight Committee. I see this a lot in CleanSlate, where you'll have a bunch of empty links. I can recreate CleanSlate creating the empty links, but I think that might be another issue all together other than just bad markup.

Those are two of the biggest markup issues that I don't think Adam has mentioned.

Is this what you were asking for, @adamjohnson?