zotero / translators

Zotero Translators
http://www.zotero.org/support/dev/translators

Import from CSV #773

Closed zuphilip closed 10 years ago

zuphilip commented 10 years ago

This morning I had the idea that importing from CSV could be quite handy for some use cases. For example, here at the library we have a lot of lists of bibliographic data, e.g. all books contained in an ebook package, all journals contained in a "big-deal" collection of some publisher, or a wish list from a new professor. If we could import them at once and then maybe even export them as MARC data, or run a duplicate detection (overlap analysis), that could be interesting. What do you think in general about a CSV import?

Well, you might argue that CSV is not a reasonable bibliographic format. Yes, but we can just restrict ourselves to CSV files whose column names come from the exportFields list of the CSV export translator. If you have any Excel file with bibliographic metadata, it should be easy to preprocess it so that the column names match the fields from this list and save it as CSV. Everything else could just be handled as a note. What do you think?

aurimasv commented 10 years ago

I wouldn't have anything against parsing back the format that we put out (column order can vary). I would probably discard columns we can't recognize instead of adding them as notes.

Parsing CSV is a bit messy though, due to the way quotes are escaped, but obviously doable. This would also mean that we expect authors, and other multi-value fields, to be in a specific format (one that we use for export), so I'm not sure how generally applicable this translator would be to tabular data. I guess you could pre-process the data as needed before importing into Zotero.
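To illustrate the two concerns above (quote escaping and multi-value fields), here is a minimal Python sketch. It is not the translator's code. The sample data, the `Author` column name, and the `"; "` separator between creators are assumptions made for illustration; the actual export format should be checked against the CSV export translator.

```python
import csv
import io

# Hypothetical sample row. The "Author" column name and the "; " separator
# between creators are assumptions, not confirmed details of the export format.
SAMPLE = (
    'Title,Author\r\n'
    '"Say ""Hello"", World","Doe, Jane; Smith, John"\r\n'
)


def read_rows(text):
    """Parse CSV text into dicts; the csv module undoes doubled-quote escaping."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


def split_creators(cell, sep="; "):
    """Split a multi-value cell into individual names, dropping empty parts."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


rows = read_rows(SAMPLE)
```

Here the doubled quotes and the comma inside the quoted title are handled by the CSV parser, while the creator cell only splits cleanly if the input follows the assumed separator convention. That second step is where arbitrary tabular data would likely fail.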

Feel free to start a pull request and we'll see how this works out. I, personally, have little interest in supporting such an import format.

adam3smith commented 10 years ago

I'm actually worried about this, so before we put too much work into this, let's talk this through a little more. My concern is that someone will see "CSV" as a supported import format and we'll start getting error reports from people who try to import any type of CSV. So I don't think we should package that in Zotero. I do understand the use, though, so if you want to host it somewhere separately so that people would need to seek it out and would, presumably, be more likely to read instructions, that might make sense.

aurimasv commented 10 years ago

We could rename it Zotero CSV, but yes, raising user expectations is a concern.

zuphilip commented 10 years ago

Yes, talking it through is exactly what I wanted before starting to work too much in a "wrong direction". The users would see the option if they click on the gear icon, then "Import...", and then look at the data formats. My current list of data formats is:

Bibliontology RDF (*.rdf)
MODS (*.xml)
Bookmarks (*.html)
CSL JSON (*.json)
CTX (*.^https?://freecite\.library\.brown\.edu)
Endnote XML (*.xml)
ISI Web of Knowledge (*.^https?://[^/]*webofknowledge\.com/)
MAB2 (*.mab2)
MARC (*.marc)
MARCXML (*.xml)
MEDLINE/nbib (*.txt)
NCBI PubMed ...
OVID Tagged (*.txt)
RDF (*.rdf)
Refer/BibIX (*.txt)
RefWorks Tagged Format (*.)
RIS (*.ris)
Zotero TestCase (*.json)
Better BibTeX (*.bib)
BibTeX (*.bib)

aurimasv's suggestion is to add a line like this:

Zotero CSV (*.csv)

That is fine with me. I would argue that the CSV import option seems similar to "Zotero TestCase", which will accept any JSON file but will only work properly with JSON files following the same rules as Zotero does. Moreover, TXT is also a common format for storing data in different ways. BTW, XML and RDF are also very general data formats, but here you can explicitly say which "rules" you are following.

aurimasv commented 10 years ago

I doubt many users check that list, since we choose All Files by default. I don't think we should be worried about users seeing .txt and thinking that all TXT files will be parsable. There are just some users that will expect ridiculous things no matter what we do. The label next to the extension should provide sufficient guidance though, and all labels seem quite specific (with the exception of RDF, where I don't know what to call what we are actually able to import).

"Zotero TestCase" is not an official translator btw.

zuphilip commented 10 years ago

@adam3smith What do you think about the suggested label? Would this "calm down" user expectations to a reasonable level?

I can provide some documentation for CSV import/export on the appropriate documentation pages: https://www.zotero.org/support/getting_stuff_into_your_library#importing_from_other_tools , https://www.zotero.org/support/dev/data_formats . On these pages we can try to describe what can be expected from import and what cannot. Moreover, we could then also link to these pages if there is any strange question in the forum.

dstillman commented 10 years ago

I don't think we want CSV import enabled in Zotero proper. CSV is a terrible format for Zotero data, and just supporting it sends the message that it's something people could reasonably use for data interchange. Among other things, people could easily pick "Zotero CSV" from the export list just because "CSV" is something they've heard of.

And for importing new data, to be useful at all, this would basically require someone to study the CSV export translator and figure out all the subtle ways that this could lead to bad data (quote escaping, creator formatting, etc.) and format things exactly as expected, treating this as a real format rather than as a hacky way to get data into Excel.

If you have to generate custom data for Zotero anyway, you might as well put it in an exchange format not from the '60s. For example, Zotero can easily be made to import and export Zotero API JSON, which can handle the entire Zotero data model unambiguously.

zuphilip commented 10 years ago

Okay, I can accept the strategic decision (no CSV import).

One could try to first convert the CSV (XLS) to appropriate JSON and then import it. There are some tools like csv2json, but they all seem too simple. Well, maybe at some point it will be possible to export to JSON from Excel...
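As a rough idea of what such a CSV-to-JSON preprocessing step might look like, here is a small Python sketch. Everything in it is an assumption for illustration: the column names, the `COLUMN_TO_FIELD` mapping, and the shape of the output items (loosely modeled on item JSON with `itemType`, `title`, and a `creators` list). It is not an existing tool or an official format definition, and unrecognized columns are simply dropped.

```python
import csv
import io
import json

# Hypothetical mapping from CSV column headers to item field names.
COLUMN_TO_FIELD = {
    "Item Type": "itemType",
    "Title": "title",
    "Date": "date",
    "ISBN": "ISBN",
}


def parse_creator(name):
    """Turn 'Last, First' into a two-field creator; keep other names whole."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return {"creatorType": "author", "firstName": first, "lastName": last}
    return {"creatorType": "author", "name": name.strip()}


def csv_to_items(text):
    """Convert CSV text to a list of item dicts, ignoring unknown columns."""
    items = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        item = {
            field: row[col]
            for col, field in COLUMN_TO_FIELD.items()
            if row.get(col)
        }
        authors = row.get("Author") or ""
        item["creators"] = [
            parse_creator(a) for a in authors.split(";") if a.strip()
        ]
        items.append(item)
    return items


def to_json(items):
    """Serialize the items as indented JSON text."""
    return json.dumps(items, indent=2, ensure_ascii=False)
```

Calling `to_json(csv_to_items(text))` on the contents of a CSV file would give JSON text that a later import step could consume, provided the input columns have been renamed to match the assumed mapping beforehand.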

> Zotero can easily be made to import and export Zotero API JSON

Would this be welcomed as an official translator?

dstillman commented 10 years ago

> > Zotero can easily be made to import and export Zotero API JSON
>
> Would this be welcomed as an official translator?

Yes, but the code to generate and parse API JSON either exists or will soon exist in the pre–API syncing branch I'm currently working on, so it should wait for that. Separate issue created.