RDF/ Embedded metadata - tags separated by comma aren't split on import

adam3smith commented 12 years ago

see e.g. here: http://www.ajol.info/index.php/thrb/article/view/56367 The keywords are given as While separate subject tags - one per keyword - might be more elegant, I don't see anything in the DC specs that would make this illegal: http://dublincore.org/usage/terms/history/#subjectT-001

The Embedded Metadata translator calls the RDF translator, which parses this here https://github.com/zotero/translators/blob/master/RDF.js#L605 I fixed this for me locally using the code below, but I don't know the translator or the data format very well @ajlyon @aurimasv - any thoughts?

    if (subject.match(/,|;/)){
         var tags = subject.split(/,|;/);
         for (i in tags){
            newItem.tags.push(tags[i]);
         }
        }
        else{
            newItem.tags.push(subject);
        }

simonster commented 12 years ago

I'd rather fix this in the Embedded Metadata translator, since I think seeing this in real RDF would be rather unusual, but my opinion is not very strong.

avram commented 12 years ago

I agree that this shouldn't be changed in the RDF translator-- it's an example of black magic that is likely to cause unexpected and hard-to-detect issues in the future. I suggest adding the logic to Embedded Metadata, but requiring that the tag have at least 2 commas to be split by comma, or at least two semicolons to be split by semicolon, with the semicolons taking precedence, and splitting on only one delimiter.

That's a confusing way to put it, but I think you get it.

adam3smith commented 12 years ago

yep, I had actually thought about requiring two commas myself, will do.

aurimasv commented 12 years ago

Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.

I suppose we can also make Embedded Metadata parse the "keyword" value (and "description") if they are not retrieved by other means. Even though it's not part of Highwire, it is embedded metadata and it's very common.

On Thu, Mar 15, 2012 at 00:44, Sebastian Karcher < reply@reply.github.com

wrote:

yep, I had actually thought about requiring two commas myself, will do.

Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514312

adam3smith commented 12 years ago

Yes, I'll be assuming that splitting on a semicolon is safe and yes, I'll include the keywords metatag as a fallback.

On Thu, Mar 15, 2012 at 12:56 AM, Aurimas Vinckevicius reply@reply.github.com wrote:

Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.

I suppose we can also make Embedded Metadata parse the "keyword" value (and "description") if they are not retrieved by other means. Even though it's not part of Highwire, it is embedded metadata and it's very common.

On Thu, Mar 15, 2012 at 00:44, Sebastian Karcher < reply@reply.github.com

wrote:

yep, I had actually thought about requiring two commas myself, will do.

Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514312

Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514797

Sebastian Karcher Ph.D. Candidate Department of Political Science Northwestern University

avram commented 12 years ago

Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.

Smith, Adam (1723-1790); Keynes, John Maynard (1883-1946)

Invented, but supremely plausible.

Semicolons less so; maybe we can always split on them.

aurimasv commented 12 years ago

So the logic then is that if we can't split by semicolon, we need at least two commas to split.

@Sebastian: could you also include "citation_keywords" (Wiley Online uses it, with commas as delimiter)

On Thu, Mar 15, 2012 at 02:01, Avram Lyon < reply@reply.github.com

wrote:

Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.

Smith, Adam (1723-1790); Keynes, John Maynard (1883-1946)

Invented, but supremely plausible.

Semicolons less so; maybe we can always split on them.

Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514820

adam3smith commented 12 years ago

Agree with ajlyon - I'm concerned about commas in names. Here is what I have - I'm not sure if this is the best way of doing this, but it works very stably even for complete worst case data scenarios. Any concerns?

//Deal with tags in a string
    //we might want to look at the keyword metatag later
    var keywords = ZU.xpath(doc, '//meta[@name="keywords"]/@content')

    //If we already have tags - run through them one by one and create a semicolon delimited string
    //This  will deal with multiple tags, some of them comma delimited, some semicolon, some individual
    if (newItem.tags) {
        var tags = "";
        for (var i in newItem.tags) {
            if (newItem.tags[i].match(/;/)) {
                tags += ";" + newItem.tags[i];
            } else if (newItem.tags[i].match(/.+,.+,/)) {
                tags +=  ";" + newItem.tags[i].replace(/,/g, ";");
            }
            else {
                tags += ";" + newItem.tags[i];
            }
        }
        //cleanup the string and split to create tags
        newItem.tags = tags.replace(/^;/, "").split(/;/);
        //if we don't have tags from the DC.subject fields - fall back on keywords
    } else if (keywords.length > 0) {
        var tags = "";
        for (var i in keywords) {
            if (keywords[i].textContent.match(/;/)) {
                tags += ";" +keywords[i].textContent;
            } else if (keywords[i].textContent.match(/.+,.+,/)) {
                tags += ";" +keywords[i].textContent.replace(/,/g, ";");
            }
                else {
                tags += ";" + keywords[i].textContent;
            }
        }
        newItem.tags = tags.replace(/^;/, "").split(/;/);
    }

adam3smith commented 12 years ago

@aurimasv - will add citation_keywords once this general format is OKd.

avram commented 12 years ago

Suggested change:

if (newItem.tags) {
    var tags = [];
    for (var i in newItem.tags) {
        if (newItem.tags[i].match(/;/)) {
            tags = tags.concat(newItem.tags[i].split(/;/));
        } else if (newItem.tags[i].match(/.+,.+,/)) {
            tags = tags.concat(newItem.tags[i].split(/,/));
        }
        else {
            tags.push(newItem.tags[i]);
        }
    }
newItem.tags = tags;

Functionally equivalent, but keeps the focus on building the array we need.

aurimasv commented 12 years ago

How about

//Deal with tags in a string
    //we might want to look at the citation_keyword metatag later
    if(!newItem.tags && !newItem.tags.length)
         newItem.tags = ZU.xpath(doc, '//meta[@name="citation_keywords"]/@content').map(function(t) { return t.textContent; });

  //fall back to "keywords"
    if(!newItem.tags.length)
         newItem.tags = ZU.xpath(doc, '//meta[@name="keywords"]/@content').map(function(t) { return t.textContent; });

    //If we already have tags - run through them one by one and create a semicolon delimited string
    //This  will deal with multiple tags, some of them comma delimited, some semicolon, some individual
    if (newItem.tags.length) {
        var tags = [], t;
        for (var i in newItem.tags) {
            t = newItem.tags[i].split(/\s*,\s*/);
            if (newItem.tags[i].indexOf(';') == -1 && t.length > 2) {
                tags.concat(t);
            } else {
                tags.concat(newItem.tags[i].split(/\s*;\s*/));
            }
        }
        newItem.tags = tags;
    }

I didn't test this

aurimasv commented 12 years ago

Ooops, it's meant to be concat instead of concatenate

and

tags.concatenate(t.split(/\s;\s/); should be tags.concat(newItem.tags[i].split(/\s;\s/);

Edit: Fixed above

aurimasv commented 12 years ago

@ajlyon, didn't mean to replicate your code. I was editing on github and didn't notice the new comment pop up before I posted.

avram commented 12 years ago

@aurimasv No problem-- now @adam3smith has plenty of examples to work from!

adam3smith commented 12 years ago

pull request is here: https://github.com/zotero/translators/pull/210 basically used Aurimas' code, but it needs tags = tags.concat(... to work. Also changed the comments to reflect this.

adam3smith commented 12 years ago

new translator is up.

zotero / translators

RDF/ Embedded metadata - tags separated by comma aren't split on import #209