Is beautiful soup entity encoding your inserted html?

worldcompany / djangoembed

rich media consuming and providing with django

http://djangoembed.readthedocs.org

MIT License

Is beautiful soup entity encoding your inserted html? #33

Open sligodave opened 12 years ago

sligodave commented 12 years ago

Hi, I could be wrong here but just in case, I said I'd bring this to your attention.

At the end of the parse_data method of the HTMLParser, where you call "replaceWith" on the matched URL, it appears that with the step from BeautifulSoup 3.2.0 to BeautifulSoup 3.2.1 the inserted HTML is now being entity encoded, which breaks things.

```
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("")
soup.insert(0, "<b>YAY</b>")
print unicode(soup)
```

The above under BS 3.2.0 printed:

```
<b>YAY</b>
```

Under BS 3.2.1 it prints:

```
&lt;b&gt;YAY&lt;/b&gt;
```

I haven't had the time to dig an awful lot but the solution might be to create a BS representation of the replacement html and pass that to replaceWith.

Thanks, Dave

coleifer commented 12 years ago

Yep, you're totally right. I thought I had opened an issue to that effect here, but I guess I had not. I actually opened a bug on their Launchpad and need to respond with some info. I will use the example you provided, thanks for that. The maintainer has some suggestions and you can follow up here: https://bugs.launchpad.net/beautifulsoup/+bug/949074

coleifer commented 12 years ago

You may be interested in my replacement project for djangoembed, http://micawber.readthedocs.org/ -- the HTML parser does not have this issue.

azreda commented 12 years ago

This issue remains, breaking the HTML parsing method. Downgrading to 3.2.0 is a temporary solution.

The proper solution is described on the bug report: "If you put a string into the soup its XML characters should always be escaped. Since you want "YAY" to be treated as an HTML tag, you can create a Tag object instead"
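The string-versus-tag distinction in that quote can be sketched without BeautifulSoup at all. The example below is only an analogy using the standard library's `xml.etree.ElementTree`, not BeautifulSoup's API: markup assigned as a plain string is escaped on output, while markup parsed into an element first is inserted as a real tag. The `div` wrapper and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Case 1: markup handed over as a plain string.
# The serializer treats it as text and escapes the XML characters.
root_text = ET.Element("div")  # hypothetical wrapper element
root_text.text = "<b>YAY</b>"
as_text = ET.tostring(root_text, encoding="unicode")

# Case 2: markup parsed into an element before insertion.
# It is inserted as a real child tag and serialized unescaped.
root_tag = ET.Element("div")
root_tag.insert(0, ET.fromstring("<b>YAY</b>"))
as_tag = ET.tostring(root_tag, encoding="unicode")

print(as_text)  # <div>&lt;b&gt;YAY&lt;/b&gt;</div>
print(as_tag)   # <div><b>YAY</b></div>
```

Applied to this issue, the equivalent would presumably be to parse the replacement HTML into a BeautifulSoup object and pass that to `replaceWith` instead of the raw string, as Dave suggested above. That is an untested assumption here, so check it against the bug report before relying on it.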

coleifer commented 12 years ago

You can also see on lines 130/131 of micawber, I have fixed this: https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L130

Please note - I am not working on this project anymore. I've written a replacement:

https://github.com/coleifer/micawber