Closed adam3smith closed 12 years ago
I'd rather fix this in the Embedded Metadata translator, since I think seeing this in real RDF would be rather unusual, but my opinion is not very strong.
I agree that this shouldn't be changed in the RDF translator-- it's an example of black magic that is likely to cause unexpected and hard-to-detect issues in the future. I suggest adding the logic to Embedded Metadata, but requiring that the tag have at least 2 commas to be split by comma, or at least two semicolons to be split by semicolon, with the semicolons taking precedence, and splitting on only one delimiter.
That's a confusing way to put it, but I think you get it.
yep, I had actually thought about requiring two commas myself, will do.
Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.
I suppose we can also make Embedded Metadata parse the "keyword" value (and "description") if they are not retrieved by other means. Even though it's not part of Highwire, it is embedded metadata and it's very common.
On Thu, Mar 15, 2012 at 00:44, Sebastian Karcher < reply@reply.github.com
wrote:
yep, I had actually thought about requiring two commas myself, will do.
Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514312
Yes, I'll be assuming that splitting on a semicolon is safe and yes, I'll include the keywords metatag as a fallback.
On Thu, Mar 15, 2012 at 12:56 AM, Aurimas Vinckevicius reply@reply.github.com wrote:
Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.
I suppose we can also make Embedded Metadata parse the "keyword" value (and "description") if they are not retrieved by other means. Even though it's not part of Highwire, it is embedded metadata and it's very common.
On Thu, Mar 15, 2012 at 00:44, Sebastian Karcher < reply@reply.github.com
wrote:
yep, I had actually thought about requiring two commas myself, will do.
Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514312
Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514797
Sebastian Karcher Ph.D. Candidate Department of Political Science Northwestern University
Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.
Smith, Adam (1723-1790); Keynes, John Maynard (1883-1946)
Invented, but supremely plausible.
Semicolons less so; maybe we can always split on them.
So the logic then is that if we can't split by semicolon, we need at least two commas to split.
@Sebastian: could you also include "citation_keywords" (Wiley Online uses it, with commas as delimiter)
On Thu, Mar 15, 2012 at 02:01, Avram Lyon < reply@reply.github.com
wrote:
Is it that common to have a comma or semicolon inside a keyword/subject? I don't think I've ever encountered it before. It should pretty much always be safe to split by semicolon.
Smith, Adam (1723-1790); Keynes, John Maynard (1883-1946)
Invented, but supremely plausible.
Semicolons less so; maybe we can always split on them.
Reply to this email directly or view it on GitHub: https://github.com/zotero/translators/issues/209#issuecomment-4514820
Agree with ajlyon - I'm concerned about commas in names. Here is what I have - I'm not sure if this is the best way of doing this, but it works very stably even for complete worst case data scenarios. Any concerns?
//Deal with tags in a string
//we might want to look at the keyword metatag later
var keywords = ZU.xpath(doc, '//meta[@name="keywords"]/@content')
//If we already have tags - run through them one by one and create a semicolon delimited string
//This will deal with multiple tags, some of them comma delimited, some semicolon, some individual
if (newItem.tags) {
var tags = "";
for (var i in newItem.tags) {
if (newItem.tags[i].match(/;/)) {
tags += ";" + newItem.tags[i];
} else if (newItem.tags[i].match(/.+,.+,/)) {
tags += ";" + newItem.tags[i].replace(/,/g, ";");
}
else {
tags += ";" + newItem.tags[i];
}
}
//cleanup the string and split to create tags
newItem.tags = tags.replace(/^;/, "").split(/;/);
//if we don't have tags from the DC.subject fields - fall back on keywords
} else if (keywords.length > 0) {
var tags = "";
for (var i in keywords) {
if (keywords[i].textContent.match(/;/)) {
tags += ";" +keywords[i].textContent;
} else if (keywords[i].textContent.match(/.+,.+,/)) {
tags += ";" +keywords[i].textContent.replace(/,/g, ";");
}
else {
tags += ";" + keywords[i].textContent;
}
}
newItem.tags = tags.replace(/^;/, "").split(/;/);
}
@aurimasv - will add citation_keywords once this general format is OKd.
Suggested change:
if (newItem.tags) {
var tags = [];
for (var i in newItem.tags) {
if (newItem.tags[i].match(/;/)) {
tags = tags.concat(newItem.tags[i].split(/;/));
} else if (newItem.tags[i].match(/.+,.+,/)) {
tags = tags.concat(newItem.tags[i].split(/,/));
}
else {
tags.push(newItem.tags[i]);
}
}
newItem.tags = tags;
Functionally equivalent, but keeps the focus on building the array we need.
How about
//Deal with tags in a string
//we might want to look at the citation_keyword metatag later
if(!newItem.tags && !newItem.tags.length)
newItem.tags = ZU.xpath(doc, '//meta[@name="citation_keywords"]/@content').map(function(t) { return t.textContent; });
//fall back to "keywords"
if(!newItem.tags.length)
newItem.tags = ZU.xpath(doc, '//meta[@name="keywords"]/@content').map(function(t) { return t.textContent; });
//If we already have tags - run through them one by one and create a semicolon delimited string
//This will deal with multiple tags, some of them comma delimited, some semicolon, some individual
if (newItem.tags.length) {
var tags = [], t;
for (var i in newItem.tags) {
t = newItem.tags[i].split(/\s*,\s*/);
if (newItem.tags[i].indexOf(';') == -1 && t.length > 2) {
tags.concat(t);
} else {
tags.concat(newItem.tags[i].split(/\s*;\s*/));
}
}
newItem.tags = tags;
}
I didn't test this
Ooops, it's meant to be concat instead of concatenate
and
tags.concatenate(t.split(/\s;\s/); should be tags.concat(newItem.tags[i].split(/\s;\s/);
Edit: Fixed above
@ajlyon, didn't mean to replicate your code. I was editing on github and didn't notice the new comment pop up before I posted.
@aurimasv No problem-- now @adam3smith has plenty of examples to work from!
pull request is here: https://github.com/zotero/translators/pull/210 basically used Aurimas' code, but it needs tags = tags.concat(... to work. Also changed the comments to reflect this.
new translator is up.
see e.g. here: http://www.ajol.info/index.php/thrb/article/view/56367 The keywords are given as While separate subject tags - one per keyword - might be more elegant, I don't see anything in the DC specs that would make this illegal: http://dublincore.org/usage/terms/history/#subjectT-001
The Embedded Metadata translator calls the RDF translator, which parses this here https://github.com/zotero/translators/blob/master/RDF.js#L605 I fixed this for me locally using the code below, but I don't know the translator or the data format very well @ajlyon @aurimasv - any thoughts?