zotero / translators

Zotero Translators
http://www.zotero.org/support/dev/translators

[EM] add support for additional OpenGraph formats #932

Closed adam3smith closed 8 years ago

adam3smith commented 9 years ago

See http://ogp.me/. I haven't checked in detail what we currently recognize, but we'd want to support the article:, book:, video:, and music: tags.

adam3smith commented 9 years ago

E.g. this has some of the article: tags: https://hbr.org/2015/08/how-to-do-walking-meetings-right

adam3smith commented 8 years ago

Also add the simple meta name="author": http://www.w3schools.com/tags/att_meta_name.asp

zuphilip commented 8 years ago

PR #973 also has examples with og tags as well as <meta name="author" content="..."/>.

adam3smith commented 8 years ago

closed in https://github.com/zotero/translators/pull/998

owcz commented 8 years ago

Not all OG, but for EM, I think the following should be supported:

  1. <span itemprop="name" content="Andrew Goldfarb"> (ex) for author (spec)
  2. <meta name="live_date" value="2012-07-27"> (ex) for date
  3. (then the IGN translator can be deprecated)
  4. <meta name="pub_date" content="2015-12-21T20:15:19+00:00"> (ex) for date
  5. <time class="meta__time updated" datetime="2014-08-04T18:00:00-04:00"> (ex) for date
  6. Should Wired be using <meta itemprop="datePublished" content="2015-05-25T05:45:37+00:00"> or <meta name="DisplayDate" content="2015-05-25"> instead of what it's using right now?
  7. Eurogamer would use <span itemprop="datePublished" content="2015-11-05"> too (many of these sites still need custom author support—I'll work on that)
  8. USgamer supports <p class="published" itemprop="datePublished" content="2016-16-01">, but do we support that date format? (YYYY-DD-MM rather than YYYY-MM-DD)
  9. EM didn't use <meta property="og:author" content="Jonathan Holmes"> for author with Destructoid
  10. JSON-LD, such as "name": "Jake Muncy" in A.V. Club
  11. Disqus comments sections also hold some date-created metadata, but I think it might be a better route to just have sites implement real metadata
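
The date formats in the list above (plain dates, ISO timestamps with a timezone, and the swapped YYYY-DD-MM order from item 8) could be normalized along these lines. This is only a sketch in Python, not translator code; `normalize_date` is a hypothetical helper, and the fallback to YYYY-DD-MM is an assumption that only triggers when the standard order is invalid, so an ambiguous value like 2016-03-04 is still read as YYYY-MM-DD.

```python
from datetime import datetime


def normalize_date(value):
    """Return YYYY-MM-DD for a metadata date string, or None if unparseable.

    Hypothetical helper: tries the standard order first and only falls
    back to YYYY-DD-MM when the standard order is not a valid date.
    """
    date_part = value.strip()[:10]  # drop any time and timezone suffix
    for fmt in ("%Y-%m-%d", "%Y-%d-%m"):
        try:
            return datetime.strptime(date_part, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


print(normalize_date("2012-07-27"))                 # 2012-07-27
print(normalize_date("2015-12-21T20:15:19+00:00"))  # 2015-12-21
print(normalize_date("2016-16-01"))                 # 2016-01-16
```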

zuphilip commented 8 years ago

See also ticket for microdata #366, for JSON-LD #917

adam3smith commented 8 years ago

OK on 2 and 4. Will look at 9. 10 definitely, likely in a separate translator that can then be called from EM (and, as a final goal, should also enable export, I think), as per the ticket @zuphilip links to.

6 uses the date in the parsely-page JSON

<meta name='parsely-page' content='{"title": "If Sex Videogames Make You Feel Weird, That&#8217;s the Point", "link": "http://www.wired.com/2015/05/sex-videogames-make-feel-weird-thats-point/", "image_url": "http://www.wired.com/wp-content/uploads/2015/05/Screen-Shot-2015-05-21-at-12.32.53-PM-150x150-e1432241140212.png", "type": "post", "post_id": "1784713", "pub_date": "2015-05-25T05:45:37+00:00", "section": "WIRED", "author": "Wired Staff", "tags": []}'>

which makes sense to me
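
For illustration, reading the date out of that tag might look roughly like the following. It is a sketch in Python using only the standard library, not how EM actually does it; `ParselyMetaParser` is a hypothetical name and the sample markup is an abbreviated copy of the Wired tag above.

```python
import json
from html.parser import HTMLParser


class ParselyMetaParser(HTMLParser):
    """Collects the JSON payload of the first <meta name="parsely-page">."""

    def __init__(self):
        super().__init__()
        self.data = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "parsely-page" and self.data is None:
            try:
                # Attribute values arrive with character references decoded.
                self.data = json.loads(attrs.get("content") or "")
            except json.JSONDecodeError:
                self.data = None


# Abbreviated version of the tag quoted above.
SAMPLE = (
    "<meta name='parsely-page' content='{\"title\": \"If Sex Videogames Make "
    "You Feel Weird, That&#8217;s the Point\", \"pub_date\": "
    "\"2015-05-25T05:45:37+00:00\", \"author\": \"Wired Staff\", \"tags\": []}'>"
)

parser = ParselyMetaParser()
parser.feed(SAMPLE)
print(parser.data["pub_date"])  # 2015-05-25T05:45:37+00:00
print(parser.data["author"])    # Wired Staff
```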

1, 5, 7, 8: generally speaking, no. The problem is that these will often end up being used multiple times on a page, e.g. under a link+description for another article or even a comment.

What we could do is add 1 to the byline search and add a date search that follows the same logic (find the element closest to the title). If so, like the current byline search, that should be a last "desperate" attempt if all else fails, given the chance of getting something wrong.
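
A rough sketch of what such a "closest to the title" date search could look like, in Python rather than translator code. Everything here is an assumption for illustration: the title is taken to be the first <h1>, "closest" is measured as distance in document order between start tags, and `closest_date` and `DateCandidates` are hypothetical names.

```python
from html.parser import HTMLParser


class DateCandidates(HTMLParser):
    """Records the document-order position of the title and of date candidates."""

    def __init__(self):
        super().__init__()
        self.position = 0
        self.title_position = None
        self.candidates = []  # (position, date string)

    def handle_starttag(self, tag, attrs):
        self.position += 1
        attrs = dict(attrs)
        if tag == "h1" and self.title_position is None:
            self.title_position = self.position
        if attrs.get("itemprop") == "datePublished":
            value = attrs.get("content") or attrs.get("datetime")
            if value:
                self.candidates.append((self.position, value))


def closest_date(html):
    """Return the datePublished value nearest the title, or None."""
    parser = DateCandidates()
    parser.feed(html)
    if parser.title_position is None or not parser.candidates:
        return None
    nearest = min(parser.candidates, key=lambda c: abs(c[0] - parser.title_position))
    return nearest[1]


SAMPLE = """
<article>
  <h1>Main story</h1>
  <span itemprop="datePublished" content="2015-11-05"></span>
  <p>Body</p>
</article>
<aside>
  <a href="/other">Other story</a>
  <span itemprop="datePublished" content="2014-01-01"></span>
  <div class="comment"><time itemprop="datePublished" datetime="2015-11-06T10:00:00Z"></time></div>
</aside>
"""

print(closest_date(SAMPLE))  # 2015-11-05
```

The dates attached to the related link and to the comment are ignored because they sit further from the title, which is exactly the failure case described above.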

Let us know if you're interested in taking any of these.

zuphilip commented 8 years ago

Microdata are nested, i.e. 1 looks like

<... itemscope itemtype="http://schema.org/Article">
   <... itemprop="articleBody">
      <... itemprop="author" itemscope itemtype="http://schema.org/Person">
         <... itemprop="name" content="Andrew Goldfarb">
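
To make the nesting concrete, here is a small Python sketch that only accepts an itemprop="name" when it sits inside an itemprop="author" element. The concrete tag names (div, span) in the sample are assumptions, since the outline above elides them, and `AuthorNameParser` is a hypothetical name; it also only reads the content attribute, not element text.

```python
from html.parser import HTMLParser

# Elements without a closing tag must not be pushed onto the stack.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class AuthorNameParser(HTMLParser):
    """Collects itemprop="name" values nested inside itemprop="author"."""

    def __init__(self):
        super().__init__()
        self.open_props = []  # itemprop (or None) of each currently open element
        self.authors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop == "name" and "author" in self.open_props and attrs.get("content"):
            self.authors.append(attrs["content"])
        if tag not in VOID_TAGS:
            self.open_props.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_props:
            self.open_props.pop()


SAMPLE = """
<div itemscope itemtype="http://schema.org/Article">
  <span itemprop="name" content="Some Headline"></span>
  <div itemprop="articleBody">
    <div itemprop="author" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name" content="Andrew Goldfarb"></span>
    </div>
  </div>
  <div class="related"><span itemprop="name" content="Other Person"></span></div>
</div>
"""

parser = AuthorNameParser()
parser.feed(SAMPLE)
print(parser.authors)  # ['Andrew Goldfarb']
```

The headline and the unrelated name are skipped because neither has an author ancestor, which is why matching on itemprop="name" alone is not enough.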