Zotero Translators
http://www.zotero.org/support/dev/translators

CRAN translator is blocked by their use of an HTML frame #1487

Open katrinleinweber opened 7 years ago

katrinleinweber commented 7 years ago

Zotero doesn't use R-Packages.js when browsing CRAN "the normal way" (red). One has to open or deep-link the package's individual page for the translator to find all the metadata (green).

[Screenshots 1 and 2: CRAN's frames prevent Zotero from easily extracting citations]

Please let me know whether I should report this issue to CRAN as well. My grasp of JavaScript is weak, but I presume it could be used to traverse into the frame.

katrinleinweber commented 7 years ago

Dear @gaborcsardi and @jeroen, because of your involvement in METACRAN, might this be relevant for you?

adam3smith commented 7 years ago

Zotero no longer recognizes URLs of embedded frames like it used to, which has likely caused this regression. This will need to be addressed in the translator (unless, of course, CRAN could be convinced to do away with those hideous iframes, but I'd assume that's understandably just not a priority for them).

jeroen commented 7 years ago

I don't know why I get tagged here, but you can get package metadata from the DESCRIPTION file in the root of the package. The most important fields are also indexed in the PACKAGES file.

CRAN won't change their webpage or URL to accommodate your scraper.

katrinleinweber commented 7 years ago

I read over at https://ropensci.org/blog/2013/06/14/goals-for-year/ that R package citations were one success metric for rOpenSci. Therefore, I started wondering about which side here has the best lever to make importing citation info as easy for users as possible (single click, for example). Seems like in this case it's Zotero, correct?

dstillman commented 7 years ago

> Zotero no longer recognizes URLs of embedded frames like it used to

Just as a reminder for whoever works on this, this can be re-enabled for a given translator with the targetAll property, but if the frames are on the same domain then they can just be referenced from the root document.

katrinleinweber commented 7 years ago

If any of the translators used targetAll (none is found by the search here), I'd try working out the two lines in this case.

dstillman commented 7 years ago

See the comment I linked to above, which explains how the property works, and let us know if you have questions. Again, though, unless the frames are on a different domain, this shouldn't use targetAll — it should just trigger on the root document and access the frame content from the matching root doc.
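A rough sketch of what "access the frame content from the matching root doc" could look like, using only standard DOM properties. The names `findFrameDocument` and `PACKAGE_PAGE` are hypothetical, and the `index.html` URL shape for CRAN package pages is an assumption; this is not taken from R-Packages.js.

```javascript
// Assumed URL shape of a CRAN package page loaded inside a frame.
const PACKAGE_PAGE = /\/web\/packages\/[^/]+\/index\.html$/;

// Return the document of the first same-origin frame whose URL matches
// urlPattern, or null if there is none.
function findFrameDocument(rootDoc, urlPattern) {
  for (const frame of rootDoc.querySelectorAll('frame, iframe')) {
    // contentDocument is null when the frame is on a different origin.
    const frameDoc = frame.contentDocument;
    if (frameDoc && urlPattern.test(frameDoc.location.href)) {
      return frameDoc;
    }
  }
  return null;
}
```

A translator's detection code could call `findFrameDocument(doc, PACKAGE_PAGE)` on the root document and scrape the returned document if it is not null. Cross-domain frames would not be reachable this way.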

adam3smith commented 7 years ago

Also, since the main URL actually doesn't change, this will also need ZU.monitorDOMChanges, and then the Framework doesn't work well, so we're likely looking at a full rewrite, I'm afraid. Given this, I'm not sure if this ever worked -- either the site changed, or I don't see how this would have worked even before the detection change.

dstillman commented 7 years ago

It's possible Zotero 4.0 triggered detection on frame location changes even if the root document stayed the same? I'm not sure whether targetAll would do the same. (Probably not.)

adam3smith commented 7 years ago

I don't think it did, but either way, targetAll doesn't (just tested).

gaborcsardi commented 7 years ago

@katrinleinweber I am not sure what you mean. Do you want to use an alternate source?

There are metadata files on CRAN, e.g. PACKAGES.gz in https://cran.r-project.org/src/contrib/. This is a better source than web scraping CRAN.

METACRAN has its own proper package database, see https://github.com/metacran/crandb#the-raw-api. It is somewhat similar to the node.js registry. But this is not a primary source, of course.

adam3smith commented 7 years ago

I think @katrinleinweber was just tagging people who she hoped might be able and interested to help improve support for citing R packages using popular reference manager software.

Not sure if you're the right people to ask, but it's certainly a justified question in general given that increasing citations to packages is one of rOpenSci's goals. I'm not following software citation initiatives as closely, but on the data citation end, facilitating direct import into reference managers from repositories has been one of the key strategies (see e.g. https://www.biorxiv.org/content/early/2017/10/09/097196).

Zotero is such a reference manager (integrates nicely with Rmd/RStudio via citr, btw.), so the use case here is not importing massive amounts of metadata (for which the PACKAGES.gz would indeed be far better) but to enable users to import metadata for the R packages they are using -- maybe half a dozen -- so that they can then cite them easily, ideally at the same time as they add them to R.

The crandb API sounds good, the JSON I get from http://crandb.r-pkg.org/citr, e.g., looks great. Only question though is if that's up to date with CRAN? If so we could use that.

gaborcsardi commented 7 years ago

> importing massive amounts of metadata

Note that PACKAGES.gz is less than 400KB, so it is not exactly massive amounts. It does not have all metadata, though.

> enable users to import metadata for the R packages they are using

If you want to do this package by package, you can use the DESCRIPTION files, e.g. https://cran.r-project.org/web/packages/scientoText/DESCRIPTION instead of the web pages. Assuming you are not doing this with R code.

adam3smith commented 7 years ago

Correct, this is all in JavaScript. The DESCRIPTION files are structured, right? As in -- the parts before the colon are standardized?

gaborcsardi commented 7 years ago

Yes, somewhat. The important fields are standard, yes. The set of fields is extensible, but this is rarely used as people don't know it (it is undocumented), and you can ignore these custom fields, anyway.

Here is a parser I wrote in JS: https://github.com/r-hub/rdesc-parser. It is not bulletproof, e.g. the encoding is not always UTF-8, but there can be an Encoding: field in the file itself, which means that you need to recode. In practice it is either ASCII (if no Encoding field), UTF-8, or latin1.

The format is quite simple DCF (Debian Control File), but the individual fields have their own formats, and there are a number of them.
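To illustrate the DCF layout, here is a minimal sketch of a field splitter. `parseDescription` is a hypothetical helper, not the rdesc-parser API; it assumes the text is already decoded, handles a single record only, and leaves each field's value as an unparsed string.

```javascript
// Split one DESCRIPTION record into { fieldName: rawValue } pairs.
function parseDescription(text) {
  const fields = {};
  let current = null;
  for (const line of text.split(/\r?\n/)) {
    if (/^\s*$/.test(line)) continue; // ignore blank lines
    if (/^\s/.test(line)) {
      // Lines starting with whitespace continue the previous field.
      if (current !== null) fields[current] += ' ' + line.trim();
      continue;
    }
    const match = /^([^\s:]+):\s*(.*)$/.exec(line);
    if (match) {
      current = match[1];
      fields[current] = match[2].trim();
    }
  }
  return fields;
}
```

Field-specific formats (author lists, dependency lists, and so on) would still need their own parsing on top of this.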

katrinleinweber commented 7 years ago

@adam3smith This, exactly.

@gaborcsardi Seems like sometimes, a few lines of written motivation might have told more than 2 screenshots ;-) How do regular users, students, scientists, etc. get this DESCRIPTION content into a reference manager, to then cite it like any other reference in their respective writing workflows/toolchains? Neither copy-paste, nor command-line foo should IMHO be the only solutions.

Can Mendeley, EndNote, Citavi etc. import the proper citation info from the package's landing page? Or, vice versa: can package authors who provide the citation info (in DESCRIPTION for example) be sure that users can easily import it into the reference manager of their choice?

My motivation is to help align the workflow (or even click-trace) of software citation with that of regular journal article citations etc. Preferably using FLOSS tools & platforms :-)

gaborcsardi commented 7 years ago

@katrinleinweber Well, I have no idea what Zotero is, I am sorry. I also haven't ever used EndNote, etc.

> How do regular users, students, scientists, etc. get this DESCRIPTION content into a reference manager,

> Or, vice versa: can package authors who provide the citation info

I suppose the reference manager supports the BibTeX format. Then there is a standard R way to do both of these. E.g.:

❯ citation(package = "ggplot2")

To cite ggplot2 in publications, please use:

  H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

A BibTeX entry for LaTeX users is

  @Book{,
    author = {Hadley Wickham},
    title = {ggplot2: Elegant Graphics for Data Analysis},
    publisher = {Springer-Verlag New York},
    year = {2016},
    isbn = {978-3-319-24277-4},
    url = {http://ggplot2.org},
  }

❯ citation(package = "dplyr")

To cite package ‘dplyr’ in publications use:

  Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller
  (2017). dplyr: A Grammar of Data Manipulation. R package version
  0.7.4. https://CRAN.R-project.org/package=dplyr

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {dplyr: A Grammar of Data Manipulation},
    author = {Hadley Wickham and Romain Francois and Lionel Henry and Kirill Müller},
    year = {2017},
    note = {R package version 0.7.4},
    url = {https://CRAN.R-project.org/package=dplyr},
  }

Package authors can add their own references, and if they don't do that, citation auto-generates one from the package metadata in DESCRIPTION.
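For a translator written in JavaScript, the auto-generated entry could be approximated roughly as below. This is a sketch modeled on the dplyr output above, not a port of R's citation(); `descriptionToBibtex` is a hypothetical name, the author list is assumed to be already split into plain names, and taking the year from the Date/Publication or Date field is an assumption.

```javascript
// Build a BibTeX @Manual entry from already-parsed DESCRIPTION fields.
// No BibTeX escaping is done here.
function descriptionToBibtex(fields, authors) {
  const dateField = fields['Date/Publication'] || fields.Date || '';
  const yearMatch = /^(\d{4})/.exec(dateField);
  const lines = [
    '@Manual{,',
    `  title = {${fields.Package}: ${fields.Title}},`,
    `  author = {${authors.join(' and ')}},`,
  ];
  // Only emit a year when one could be read from the date field.
  if (yearMatch) lines.push(`  year = {${yearMatch[1]}},`);
  lines.push(`  note = {R package version ${fields.Version}},`);
  lines.push(`  url = {https://CRAN.R-project.org/package=${fields.Package}},`);
  lines.push('}');
  return lines.join('\n');
}
```

The resulting string could then be handed to whatever BibTeX import the reference manager offers.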

gaborcsardi commented 7 years ago

Btw. if there is custom citation information, that is also available on the web page. E.g: https://cran.r-project.org/web/packages/ggplot2/citation.html

But unfortunately this URL does not exist otherwise. I think it would make sense to always create it, and fall back to the auto-generated reference. I think it is not entirely hopeless to convince CRAN maintainers about this, if you think that having a BibTeX record for each CRAN package would help your work.

gaborcsardi commented 7 years ago

@adam3smith

> The crandb API sounds good, the JSON I get from http://crandb.r-pkg.org/citr, e.g., looks great. Only question though is if that's up to date with CRAN? If so we could use that.

It is fairly up to date. An update is attempted every 6 minutes (current config, might change slightly), but sometimes it fails, since the number of connections to the main CRAN machine is limited.

But if you are after citation information, then it is actually better to use the references provided by the package author, right? So you could

  1. use the custom bibtex from https://cran.r-project.org/web/packages/ggplot2/citation.html etc. if available, and
  2. parse DESCRIPTION or query crandb otherwise.
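The priority order above could be sketched like this. `citationSources` and `pickSource` are hypothetical names, the URL patterns are generalized from the examples in this thread, and `isAvailable` stands in for whatever request the translator would really make (which would be asynchronous in practice).

```javascript
// Candidate metadata sources for one CRAN package, best first.
function citationSources(pkg) {
  const name = encodeURIComponent(pkg);
  return [
    { kind: 'citation', url: `https://cran.r-project.org/web/packages/${name}/citation.html` },
    { kind: 'description', url: `https://cran.r-project.org/web/packages/${name}/DESCRIPTION` },
    { kind: 'crandb', url: `http://crandb.r-pkg.org/${name}` },
  ];
}

// Return the first source for which isAvailable(url) is truthy, else null.
function pickSource(pkg, isAvailable) {
  for (const source of citationSources(pkg)) {
    if (isAvailable(source.url)) return source;
  }
  return null;
}
```

Keeping the availability check as a parameter makes the ordering easy to test without touching the network.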

adam3smith commented 7 years ago

That seems right, especially given Hadley's preference to have his book cited.

gaborcsardi commented 7 years ago

Btw. I just checked what Zotero is, and I agree that it is very cool, and it would be great if it could import the correct citation information for CRAN packages. :)

gaborcsardi commented 7 years ago

Btw. 2: two advantages of crandb over parsing DESCRIPTION from the CRAN web site are that it also has:

  1. archived packages
  2. all package versions. E.g.: http://crandb.r-pkg.org/citr/all, http://crandb.r-pkg.org/citr/0.1.0 etc

katrinleinweber commented 6 years ago

Related to https://github.com/force11/force11-sciwg/issues/9