zenhack / iron-blogger2

GNU General Public License v3.0

post.summary scrubs some images #47

Closed jywarren closed 8 years ago

jywarren commented 9 years ago

I'm not sure if it's because it scrubs some markup, or if it is reading <description> and not <content:encoded>. The latter includes inline images in a standard WordPress feed:

<description><![CDATA[Here&#8217;s a Gosper curve cut into paper with a Silhouette Cameo desktop paper cutter. Thanks to Owen Maresh for showing me the Gosper curve, which is a space-filling curve formed with a single line, and therefore, here, with a single cut.]]></description>
<content:encoded><![CDATA[<p><a href="http://unterbahn.com/wp-content/uploads/2015/08/tmp_28008-IMG_20150820_135709-1665396675.jpg"><img src="http://unterbahn.com/wp-content/uploads/2015/08/tmp_28008-IMG_20150820_135709-1665396675-1024x758.jpg" alt="tmp_28008-IMG_20150820_135709-1665396675" width="100%" class="alignnone size-large wp-image-2255" /></a></p>
<p>Here&#8217;s a Gosper curve cut into paper with a <a href="//publiclab.org/notes/warren/07-30-2015/silhouette-cameo-desktop-paper-cutter-for-prototyping">Silhouette Cameo</a> desktop paper cutter. Thanks to <a href="http://owen.maresh.info">Owen Maresh</a> for showing me the Gosper curve, which is a space-filling curve formed with a single line, and therefore, here, with a single cut.</p>]]></content:encoded>
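
A minimal sketch of the difference between the two elements, using only Python's standard library rather than feedparser. The item below is a shortened, hypothetical stand-in for the entry above (text and image URL are placeholders), and `has_image` is a hypothetical helper made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the WordPress feed item quoted above.
RSS = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel><item>
<description><![CDATA[Here's a Gosper curve cut into paper.]]></description>
<content:encoded><![CDATA[<p><img src="http://example.com/gosper.jpg" /></p>
<p>Here's a Gosper curve cut into paper.</p>]]></content:encoded>
</item></channel></rss>"""

# Namespace of the RSS "content" module, needed to look up content:encoded.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

item = ET.fromstring(RSS).find("channel/item")
description = item.findtext("description")
full_content = item.findtext("content:encoded", namespaces=NS)


def has_image(html):
    """Crude check for an inline image; enough for this illustration."""
    return "<img" in html.lower()


print("description has image:", has_image(description))
print("content:encoded has image:", has_image(full_content))
```

If the summary is built from `<description>`, the image never reaches the sanitizer in the first place, which would explain the missing image without any scrubbing being involved.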

zenhack commented 9 years ago

It does do some scrubbing, but it should allow image tags (and clearly does, since other images are getting through). This is the sanitized summary as stored in the DB:

Here&#8217;s a Gosper curve cut into paper with a Silhouette Cameo desktop paper cutter. Thanks to Owen Maresh for showing me the Gosper curve, which is a space-filling curve formed with a single line, and therefore, here, with a single cut.<div class="crp_related"><h3>Related Posts:</h3><ul><li><a class="crp_title" href="http://unterbahn.com/2015/08/vectorizing-sketches-and-photos-with-your-smartphoneweb-browser/">Vectorizing sketches and photos with your smartphone/web&hellip;</a></li><li><a class="crp_title" href="http://unterbahn.com/2013/05/for-controlling-spacetime/">For controlling spacetime</a></li><li><a class="crp_title" href="http://unterbahn.com/2013/05/space-oddity-cover-music-video-recorded-on-the-iss/">Space Oddity cover, music video recorded on the ISS</a></li><li><a class="crp_title" href="http://unterbahn.com/2012/10/muddling-through-tax-revenue-numbers/">Muddling through tax revenue numbers</a></li><li><a class="crp_title" href="http://unterbahn.com/2014/03/balopaixao-is-gone/">BaloPaixão is gone</a></li></ul></div>

Docs about feedparser's sanitization:

https://pythonhosted.org/feedparser/html-sanitization.html

The above is consistent with the theory that it's using the description element; if you actually download the feed and look at the entry you'll see that it does include the rest of the content there, not just the text that you quoted.

I'm not totally sure what the right approach is here. I don't know if feedparser exposes a way to change this, and I'm not keen on mucking around in its plumbing unless I need to. Thoughts?

jywarren commented 9 years ago

Well, do you specify the "description" element, or is that a feedparser default? It does seem to have methods for accessing common RSS elements; could we just grab "content:encoded" instead of "description"? Can you point me at the relevant code in IB? Thanks!

https://pythonhosted.org/feedparser/common-rss-elements.html#accessing-common-channel-elements

zenhack commented 9 years ago

https://github.com/zenhack/iron-blogger2/blob/master/ironblogger/model.py#L242

...I'd forgotten how gross that bit was.

What we're actually trying to grab is "summary", but I think feedparser abstracts things a bit and falls back to other things if it's not there. The library isn't a 1:1 mapping to Atom or RSS; it's intended to let the programmer not care about the differences.
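
A rough sketch of what a summary lookup with a fallback could look like. Plain dicts stand in for parsed entries, and `pick_summary` is a hypothetical helper, not the code in model.py. It assumes feedparser exposes RSS `<description>` or Atom `<summary>` as `summary`, and the full body as a `content` list whose items carry the text under `value`.

```python
def pick_summary(entry):
    """Return the entry's summary, falling back to the full content.

    `entry` is a dict shaped like a feedparser entry (an assumption for
    this sketch): an optional "summary" string and an optional "content"
    list whose items carry the body under "value".
    """
    summary = entry.get("summary")
    if summary:
        return summary
    # No summary available: fall back to the first non-empty content part.
    for part in entry.get("content", []):
        if part.get("value"):
            return part["value"]
    return ""


# Hypothetical entries: one with both fields, one with only full content.
rss_entry = {
    "summary": "Here's a Gosper curve cut into paper.",
    "content": [{"value": '<p><img src="http://example.com/gosper.jpg" /></p>'}],
}
entry_without_summary = {
    "content": [{"value": "<p>Full post body.</p>"}],
}

print(pick_summary(rss_entry))
print(pick_summary(entry_without_summary))
```

With both fields present the summary wins, so an image that only appears in the full content is never seen.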

Whatever we end up doing, I have two concerns:

  1. Needs to be portable between RSS and Atom
  2. Should prefer summaries over yanking the entire post.

zenhack commented 9 years ago

Had a look; content:encoded is supposed to be the full post, so it fails criterion (2). I don't suppose there's a way to configure WordPress to put the right things in the description? It's doing some other weird things too, like putting related posts in there...
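
Purely as an illustration of how the two criteria could coexist, here is a sketch that keeps the summary but borrows a single image from the full post when the summary has none. `FirstImageFinder` and `summary_with_lead_image` are hypothetical names; nothing like this exists in iron-blogger2 as far as this thread shows, and any output would still need to go through sanitization.

```python
from html import escape
from html.parser import HTMLParser


class FirstImageFinder(HTMLParser):
    """Remember the src of the first <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_src(html):
    finder = FirstImageFinder()
    finder.feed(html)
    return finder.src


def summary_with_lead_image(summary, full_content):
    """Keep the summary; prepend one image from the full post if it has none."""
    if first_image_src(summary):
        return summary
    src = first_image_src(full_content)
    if src is None:
        return summary
    return '<img src="%s" />' % escape(src, quote=True) + summary


# Hypothetical inputs modeled on the feed entry in this issue.
summary = "Here's a Gosper curve cut into paper."
full_content = (
    '<p><a href="http://example.com/gosper.jpg">'
    '<img src="http://example.com/gosper-1024x758.jpg" alt="gosper" /></a></p>'
    "<p>Here's a Gosper curve cut into paper.</p>"
)

print(summary_with_lead_image(summary, full_content))
```

The summary text stays short, so criterion (2) still holds, and since the helper only takes two HTML strings it does not care whether they came from RSS or Atom.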

zenhack commented 8 years ago

@jywarren, I don't see a clean way to fix this. Unless you have any ideas I'd like to tag this as wontfix and close. Let me know.

jywarren commented 8 years ago

Just to summarize, I think it's looking for <summary> and gets <description>, which is actually a summary that WordPress prepares and which doesn't include images.

I guess we just leave it -- it's too bad that the combined display won't show some of the really nice images, especially since in a non-zero number of my posts, there is no content except for images. Text over image content-type bias? :-)

Anyhow I'll file a related idea I had which helps a little bit.

zenhack commented 8 years ago

Alright, closing then.