Open katrinleinweber opened 7 years ago
Dear @gaborcsardi and @jeroen, because of your involvement in METACRAN, might this be relevant for you?
Zotero no longer recognizes URLs of embedded frames like it used to, which has likely caused this regression. This will need to get addressed in the translator (unless, of course, CRAN could be convinced to do away with those hideous iframes, but I'd assume that's just not a priority for them, understandably)
I don't know why I get tagged here, but you can get package metadata from the DESCRIPTION file in the root of the package. The most important fields are also indexed in the PACKAGES file.
CRAN won't change their webpage or URL to accommodate your scraper.
I read over at https://ropensci.org/blog/2013/06/14/goals-for-year/ that R package citations were one success metric for rOpenSci. Therefore, I started wondering about which side here has the best lever to make importing citation info as easy for users as possible (single click, for example). Seems like in this case it's Zotero, correct?
Zotero no longer recognizes URLs of embedded frames like it used to
Just as a reminder for whoever works on this, this can be re-enabled for a given translator with the targetAll property, but if the frames are on the same domain then they can just be referenced from the root document.
If any of the translators used targetAll (none is found by the search here), I'd try working out the two lines in this case.
See the comment I linked to above, which explains how the property works, and let us know if you have questions. Again, though, unless the frames are on a different domain, this shouldn't use targetAll — it should just trigger on the root document and access the frame content from the matching root doc.
Also, since the main URL actually doesn't change, this will also need ZU.monitorDOMChanges, and then the Framework doesn't work well, so we're likely looking at a full re-write, I'm afraid.
Given this, I'm not sure if this ever worked -- either the site changed or I don't see how this would have worked even before the detect change.
It's possible Zotero 4.0 triggered detection on frame location changes even if the root document stayed the same? I'm not sure whether targetAll would do the same. (Probably not.)
I don't think it did, but either way, targetAll doesn't (just tested).
@katrinleinweber I am not sure what you mean. Do you want to use an alternate source?
There are metadata files on CRAN, e.g. PACKAGES.gz in https://cran.r-project.org/src/contrib/; this is a better source than web scraping CRAN.
METACRAN has its own proper package database, see https://github.com/metacran/crandb#the-raw-api It is somewhat similar to the node.js registry. But this is not a primary source, of course.
I think @katrinleinweber was just tagging people who she hoped might be able and interested to help improve support for citing R packages using popular reference manager software.
Not sure if you're the right people to ask, but it's certainly a justified question in general given that increasing citations to packages is one of rOpenSci's goals. I'm not following software citation initiatives as closely, but on the data citation end, facilitating direct import into reference managers from repositories has been one of the key strategies. (see e.g. https://www.biorxiv.org/content/early/2017/10/09/097196)
Zotero is such a reference manager (integrates nicely with Rmd/RStudio via citr, btw.), so the use case here is not importing massive amounts of metadata (for which the PACKAGES.gz would indeed be far better) but to enable users to import metadata for the R packages they are using -- maybe half a dozen -- so that they can then cite them easily, ideally at the same time as they add them to R.
The crandb API sounds good, the JSON I get from http://crandb.r-pkg.org/citr, e.g., looks great. Only question though is if that's up to date with CRAN? If so we could use that.
importing massive amounts of metadata
Note that PACKAGES.gz is less than 400KB, so it is not exactly massive amounts. It does not have all metadata, though.
enable users to import metadata for the R packages they are using
If you want to do this package by package, you can use the DESCRIPTION files, e.g. https://cran.r-project.org/web/packages/scientoText/DESCRIPTION instead of the web pages. Assuming you are not doing this with R code.
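For a JavaScript client, the per-package URLs mentioned in this thread could be built from a package name roughly as follows. This is only a sketch: the function names are hypothetical, and the name check assumes the rule from "Writing R Extensions" (ASCII letters, digits and dots, at least two characters, starting with a letter, not ending in a dot).

```javascript
// Hypothetical helpers for illustration; not part of Zotero, CRAN or crandb.

// R package names: ASCII letters, digits and dots, at least two characters,
// starting with a letter and not ending in a dot.
function isValidPackageName(name) {
  return /^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$/.test(name);
}

// Build the per-package URLs that appear in this thread.
function packageUrls(name) {
  if (!isValidPackageName(name)) {
    throw new Error("Not a valid R package name: " + name);
  }
  const base = "https://cran.r-project.org/web/packages/" + name;
  return {
    description: base + "/DESCRIPTION",
    // Only exists when the package ships custom citation information.
    citation: base + "/citation.html",
    crandb: "http://crandb.r-pkg.org/" + name
  };
}
```

Validating the name first keeps arbitrary text out of the request URL.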
Correct, this is all in javascript. The description files are structured, right? As in -- the parts before the colon are standardized?
Yes, somewhat. The important fields are standard, yes. The set of fields is extensible, but this is rarely used as people don't know it (it is undocumented), and you can ignore these custom fields, anyway.
Here is a parser I wrote in JS:
https://github.com/r-hub/rdesc-parser
It is not bullet proof, e.g. the encoding is not always UTF-8, but there can be an Encoding: field in the file itself, which means that you need to recode. In practice it is either ASCII (if no Encoding field), UTF-8, or latin1.
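The recoding step could look roughly like this in JavaScript. It is a sketch, not what rdesc-parser does: the function name is hypothetical, it assumes a runtime whose TextDecoder accepts the "latin1" label (browsers, and Node.js builds with full ICU), and it assumes any declared encoding other than latin1 can be read as UTF-8.

```javascript
// Hypothetical helper: decode raw DESCRIPTION bytes (a Uint8Array) to a string.
function decodeDescription(bytes) {
  // Field names and encoding names are ASCII, so an ASCII-compatible first
  // pass is enough to locate the Encoding field.
  const probe = new TextDecoder("latin1").decode(bytes);
  const match = /^Encoding:[ \t]*(\S+)/m.exec(probe);
  const declared = match ? match[1].toLowerCase() : "ascii";
  // ASCII is a subset of UTF-8, so both decode as UTF-8.
  // Assumption: anything other than latin1 is treated as UTF-8.
  // Note: TextDecoder maps the "latin1" label to windows-1252, which differs
  // from ISO-8859-1 only in the 0x80-0x9F range.
  const label = declared === "latin1" ? "latin1" : "utf-8";
  return new TextDecoder(label).decode(bytes);
}
```

Decoding twice is wasteful but keeps the logic simple for files of this size.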
The format is a quite simple DCF (Debian Control File), but the individual fields have their own formats, and there are a number of them.
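The outer DCF layer can be sketched in a few lines of JavaScript. This is a simplified illustration, not the rdesc-parser implementation: the function name is hypothetical, the input is assumed to be an already decoded string holding a single record, and the per-field formats (dependency lists, Authors@R and so on) are left unparsed.

```javascript
// Hypothetical helper: parse one DCF record into a { field: value } object.
function parseDcf(text) {
  const fields = {};
  let current = null;
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") {
      continue; // a DESCRIPTION file holds a single record; skip blank lines
    }
    if (/^[ \t]/.test(line)) {
      // Continuation lines start with whitespace and extend the previous field.
      if (current === null) {
        throw new Error("Continuation line before any field: " + line);
      }
      const part = line.trim();
      fields[current] = fields[current] ? fields[current] + "\n" + part : part;
      continue;
    }
    const colon = line.indexOf(":");
    if (colon <= 0) {
      throw new Error("Malformed line: " + line);
    }
    current = line.slice(0, colon);
    fields[current] = line.slice(colon + 1).trim();
  }
  return fields;
}
```

Custom fields simply end up as extra keys, so a caller can ignore the ones it does not know.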
@adam3smith This, exactly.
@gaborcsardi Seems like sometimes, a few lines of written motivation might have told more than 2 screenshots ;-) How do regular users, students, scientists, etc. get this DESCRIPTION content into a reference manager, to then cite it like any other reference in their respective writing workflows/toolchains? Neither copy-paste, nor command-line foo should IMHO be the only solutions.
Can Mendeley, EndNote, Citavi etc. import the proper citation info from the package's landing page? Or, vice versa: can package authors who provide the citation info (in DESCRIPTION for example) be sure that users can easily import it into the reference manager of their choice?
My motivation is to help align the workflow (or even click-trace) of software citation with that of regular journal article citations etc. Preferably using FLOSS tools & platforms :-)
@katrinleinweber Well, I have no idea what Zotero is, I am sorry. I also haven't ever used EndNote, etc.
How do regular users, students, scientists, etc. get this DESCRIPTION content into a reference manager,
Or, vice versa: can package authors who provide the citation info
I suppose the reference manager supports the bibtex format. Then there is a standard R way to do both of these. E.g:
❯ citation(package = "ggplot2")
To cite ggplot2 in publications, please use:
H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York, 2016.
A BibTeX entry for LaTeX users is
@Book{,
author = {Hadley Wickham},
title = {ggplot2: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag New York},
year = {2016},
isbn = {978-3-319-24277-4},
url = {http://ggplot2.org},
}
❯ citation(package = "dplyr")
To cite package ‘dplyr’ in publications use:
Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller
(2017). dplyr: A Grammar of Data Manipulation. R package version
0.7.4. https://CRAN.R-project.org/package=dplyr
A BibTeX entry for LaTeX users is
@Manual{,
title = {dplyr: A Grammar of Data Manipulation},
author = {Hadley Wickham and Romain Francois and Lionel Henry and Kirill Müller},
year = {2017},
note = {R package version 0.7.4},
url = {https://CRAN.R-project.org/package=dplyr},
}
Package authors can add their own references, and if they don't do that, citation() auto-generates one from the package metadata in DESCRIPTION.
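The auto-generated dplyr entry above has a simple shape that a JavaScript client could reproduce from already extracted metadata. The following is a sketch only: the function name and the layout of the `meta` object are hypothetical, it is not R's actual implementation, and it does no escaping of BibTeX special characters.

```javascript
// Hypothetical helper: build an @Manual entry shaped like the dplyr example.
// meta = { pkg, title, version, year, authors: [...] } is an assumed layout.
function toBibtexManual(meta) {
  const lines = [
    "@Manual{,",
    "  title = {" + meta.pkg + ": " + meta.title + "},",
    "  author = {" + meta.authors.join(" and ") + "},",
    "  year = {" + meta.year + "},",
    "  note = {R package version " + meta.version + "},",
    "  url = {https://CRAN.R-project.org/package=" + meta.pkg + "},",
    "}"
  ];
  return lines.join("\n");
}

// Usage with the values from the dplyr output quoted in this thread.
const entry = toBibtexManual({
  pkg: "dplyr",
  title: "A Grammar of Data Manipulation",
  version: "0.7.4",
  year: 2017,
  authors: ["Hadley Wickham", "Romain Francois", "Lionel Henry", "Kirill Müller"]
});
```

Where a package provides its own reference, as ggplot2 does, that reference should take precedence over an entry built this way.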
Btw. if there is custom citation information, that is also available on the web page. E.g: https://cran.r-project.org/web/packages/ggplot2/citation.html
But unfortunately otherwise this url does not exist. I think it would make sense to always create it, and fall back to the auto-generated reference. I think it is not entirely hopeless to convince CRAN maintainers about this, if you think that having a bibtex record for each CRAN package would help your work.
@adam3smith
The crandb API sounds good, the JSON I get from http://crandb.r-pkg.org/citr, e.g., looks great. Only question though is if that's up to date with CRAN? If so we could use that.
It is fairly up to date. An update is attempted every 6 minutes (current config, might change slightly), but sometimes it fails, since the number of connections to the main CRAN machine is limited.
But if you are after citation information, then it is actually better to use the references provided by the package author, right? So you could
that seems right, especially given Hadley's preference to have his book cited.
Btw. I just checked what Zotero is, and I agree that it is very cool, and it would be great if it could import the correct citation information for CRAN packages. :)
Btw. 2. two advantages of crandb over parsing DESCRIPTION from the CRAN web site are that it also has:
Zotero doesn't use R-Packages.js when browsing CRAN "the normal way" (red). One has to open / deep-link the package's individual page for the translator to find all the meta-data (green).
Please let me know whether I should report this issue to CRAN as well. My grasp of JavaScript is weak, but I presume it could be used to traverse into the frame.