wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

OG:Image:url being overwritten #16

Closed SeanDunford closed 9 years ago

SeanDunford commented 9 years ago

I'm attempting to use html-metadata on this url http://www.lemonde.fr/

If you visit the page and inspect it, you can see the og tags:

[screenshot: 2015-05-13 at 8:35:58 pm]

I expect to get http://s1.lemde.fr/medias/web/1.2.672/img/placeholder/opengraph.jpg back as the og:image:url, but instead I get this:

[screenshot: 2015-05-13 at 8:37:57 pm]

After some inspection, it looks like the parseOpenGraph function populates the value with the correct url at one point but then later overwrites it. =[

Changing https://github.com/wikimedia/html-metadata/blob/master/index.js#L208 to
if (root && !root[propertyValue[2]]){ property = propertyValue[2]; root[property] = content; }

seems to prevent overwriting the expected value, but it has the side effect of not adding any of the other image metadata to the opengraph.image object. I'm not sure what other side effects it may have. Any idea how this could be edited to add new properties to the subobjects without overwriting existing ones? Is that in scope for the project?
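To make the question concrete, here is a minimal sketch of "add but don't overwrite" behaviour. The helper name addSubProperty and the second url are made up for illustration; this is not html-metadata's actual code, only one way the guard could look.

```javascript
// Hypothetical helper, not part of html-metadata: attach a sub-property to an
// Open Graph object only if that key has not been set already.
function addSubProperty(root, property, content) {
  if (!root || typeof root !== 'object') {
    return false; // nothing to attach the sub-property to
  }
  if (Object.prototype.hasOwnProperty.call(root, property)) {
    return false; // keep the first value that was seen
  }
  root[property] = content;
  return true;
}

const image = {
  url: 'http://s1.lemde.fr/medias/web/1.2.672/img/placeholder/opengraph.jpg'
};

// A later tag trying to set url again is ignored (placeholder url, assumption).
const overwritten = addSubProperty(image, 'url', 'http://example.com/other.jpg');
// A key that is not present yet is still added.
const added = addSubProperty(image, 'width', '200');
```

The guard is per key rather than per object, so existing values survive while missing ones can still be filled in.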

mvolz commented 9 years ago

Whoa, that was a pretty bad bug! One issue was that non-og tags were getting added as subproperties, see: https://github.com/wikimedia/html-metadata/pull/17

I pushed through an emergency fix, so if you pin to version 0.1.2, it should fix that particular problem. However, it's not wholly resolved, because the code still doesn't verify that the third property belongs to the correct second property; it just assumes that an og:blank:tag belongs to og:image, for instance. I'll fix that soon too.

You'll notice that in this case you'll get the correct url but no image size properties. That's because lemonde.fr is not up to spec; it's required that any subproperties go after the super property declaration. Since the og:image tag is declared after og:image:width, for instance, og:image:width won't get added to anything. There's a very good reason for this, which is that multiple images in a page are allowed, and if we aren't strict about this then the wrong sub-properties might get added to og:image.
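A rough sketch of the ordering rule, in case it helps. This is not the library's actual implementation; the function collectOgImages, the { property, content } input shape, and the example.com urls are all assumptions made for illustration. Sub-properties attach to the most recently declared og:image, and anything that appears before the first og:image is dropped.

```javascript
// Hypothetical sketch, not html-metadata's real parser.
// `tags` is assumed to be an array of { property, content } in document order.
function collectOgImages(tags) {
  const images = [];
  let current = null; // the most recently declared og:image, if any

  for (const { property, content } of tags) {
    const [prefix, root, sub] = property.split(':');
    if (prefix !== 'og' || root !== 'image') {
      continue; // this sketch only handles og:image and og:image:*
    }
    if (sub === undefined) {
      // A bare og:image starts a new image object.
      current = { url: content };
      images.push(current);
    } else if (current !== null &&
               !Object.prototype.hasOwnProperty.call(current, sub)) {
      // A sub-property belongs to the image declared just before it.
      current[sub] = content;
    }
    // og:image:* seen before any og:image has nothing to attach to: dropped.
  }
  return images;
}

const images = collectOgImages([
  { property: 'og:image:width', content: '200' }, // out of order, dropped
  { property: 'og:image', content: 'http://example.com/a.jpg' },
  { property: 'og:image:height', content: '100' },
  { property: 'og:image', content: 'http://example.com/b.jpg' },
  { property: 'og:image:width', content: '50' }
]);
```

With two images on the page, the leading og:image:width cannot be assigned to either one safely, which is why being strict about order avoids attaching it to the wrong image.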

mvolz commented 9 years ago

0.1.3 is published and this should resolve all remaining issues about subproperties getting added in the wrong spot.

Thanks for reporting this!

SeanDunford commented 9 years ago

Wow, thanks. Websites not implementing OG correctly are out of scope for us, but I knew there was something wrong when it was grabbing the url from another, non-og element and overwriting the actual information. Thanks for looking into this.