zerocrates / OaiPmhRepository

OAI-PMH repository plugin for Omeka
http://omeka.org/codex/Plugins/OaiPmhRepository
GNU General Public License v3.0

Conformity to Oai-Pmh protocol #8

Open Daniel-KM opened 10 years ago

Daniel-KM commented 10 years ago

Hi,

I checked the OAI-PMH responses on the three main online validators. On http://validator.oaipmh.com, the plugin works fine, except for ListRecords (only Dublin Core works). On http://oval.base-search.net and http://re.cs.uct.ac.za, there are some other errors and recommendations too.

Sincerely,

Daniel Berthereau Infodoc & Knowledge management

zerocrates commented 10 years ago

Can you reproduce the messages (or some of them) here?

These things often only show up for particular combinations of data and can sometimes be hard to get "from scratch."

zerocrates commented 10 years ago

As a concrete example, I ran my own instance of the Repository against the first validator you mentioned, from oaipmh.com (which is much nicer-looking than any of the validators that existed when I first wrote this plugin), and I didn't get any errors from any of the "available commands" options on the left.

re.cs.uct.ac.za (the same old ugly validator that I did have when first writing this) has been churning away for quite some time with no results.

I do get quite a few notices and a few errors from the base-search validator.

The ones I don't consider relevant or valid

The two errors are interesting, and don't show up from other sources.

One complains about the "toolkit" metadata in Identify. As far as I can tell, this one is happening because the schema document for the toolkit namespace isn't where it used to be anymore. Virginia Tech (where it's supposed to be) doesn't seem to have much to do with OAI-PMH anymore, so I wouldn't expect that to change. I'm not sure this really matters all that much, but it could be resolved by simply removing that metadata section altogether.

The one really valid one I see there is a complaint about day-granularity harvesting not working correctly. There does seem to be a problem with "until" that's making it wrongly exclude records with datestamps exactly the same as the date requested. The spec's very clear that both sides should be inclusive.

zerocrates commented 10 years ago

I've fixed several issues with the date processing: day-granularity dates weren't interpreted as UTC dates, and they weren't correctly "inclusive" of the whole day. The new code forces UTC interpretation of standalone days, and converts the "until" handling to tack one "granularity unit" (a day or a second) onto the specified date and then uses an exclusive < operator in the SQL.
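As an illustration of the "until" handling described above, here is a minimal Python sketch. It is not the plugin's code (the plugin is written in PHP); the function names, the `modified` column default, and the `%s` placeholder style are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

DAY_FORMAT = "%Y-%m-%d"
SECOND_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_oai_date(value):
    """Return (UTC datetime, granularity unit) for an OAI-PMH date string."""
    if len(value) == 10:  # YYYY-MM-DD, day granularity
        parsed = datetime.strptime(value, DAY_FORMAT)
        unit = timedelta(days=1)
    else:  # YYYY-MM-DDThh:mm:ssZ, second granularity
        parsed = datetime.strptime(value, SECOND_FORMAT)
        unit = timedelta(seconds=1)
    # Force UTC interpretation instead of the server's local time zone.
    return parsed.replace(tzinfo=timezone.utc), unit


def exclusive_until(value):
    """Turn an inclusive "until" argument into an exclusive upper bound."""
    parsed, unit = parse_oai_date(value)
    return parsed + unit  # tack one granularity unit onto the date


def until_clause(value, column="modified"):
    """Build a hypothetical SQL fragment and its bound parameter."""
    bound = exclusive_until(value)
    return f"{column} < %s", bound.strftime("%Y-%m-%d %H:%M:%S")
```

With this approach, `until=2014-09-30` becomes a comparison against `2014-10-01 00:00:00`, so every record stamped at any time on 30 September (UTC) is still included.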

Additionally, there's an update to the date handling to actually handle the checks against added and modified separately. The old code only worked right in the cases where both dates were inside, before, or after the requested range. If modified was on one side of the range and added on the other, each could pass one of the from/until sides of the check and give a false result.
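The failure mode described above can be sketched in Python as follows. The thread does not show the old check, so `matches_combined` is only a guess at the kind of logic that produces this false result, and `matches_separate` is one plausible reading of "separately"; all names are hypothetical.

```python
from datetime import datetime, timezone


def in_range(stamp, start, end):
    """True if start <= stamp < end (exclusive upper bound)."""
    return start <= stamp < end


def matches_combined(added, modified, start, end):
    """Flawed check: each side of the range may be satisfied by a different date."""
    passes_from = added >= start or modified >= start
    passes_until = added < end or modified < end
    return passes_from and passes_until


def matches_separate(added, modified, start, end):
    """Each date must fall inside the range on its own to count as a match."""
    return in_range(added, start, end) or in_range(modified, start, end)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# A record added before the requested range and modified after it.
added, modified = utc(2014, 1, 1), utc(2014, 12, 1)
start, end = utc(2014, 6, 1), utc(2014, 7, 1)

false_positive = matches_combined(added, modified, start, end)
correct_result = matches_separate(added, modified, start, end)
```

Here `false_positive` is `True` because `modified` passes the "from" side and `added` passes the "until" side, even though neither date lies in June 2014; `correct_result` is `False`.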

The OVAL validator still complains about the selective harvesting after these changes, but I believe that error is actually the result of a bug in the validator.

Daniel-KM commented 10 years ago

Hi,

I'm sometimes busy, and with the time difference you reply faster than I can...

I have tried your update.

For the first validator, I didn't get results for ListRecords with CDWALITE / MODS / OMEKA-XML (only OAI_DC), but this is due to a slow response from one server (more than 30 seconds). It doesn't happen on the other sites I checked. I'm looking for the reason why the response takes so long.

For the Repository Explorer of the University of Cape Town, the problem is that no complete response comes back (a test against a non-Omeka OAI repository works, even if it needs some minutes to check). It seems to be related to the response speed too.

For OVAL, ok for the toolkit. I get no alert about language, because my documents have it. For batch size, you could set the default to 100 instead of 50. Same for the token expiration (1440). And why are some settings in config.ini and others in the config form?

The last error was the one you mentioned (No incremental (day granularity) harvesting of ListRecords. Harvest for reference date 2014-09-30 returned record with date 2014-10-13.), and it is an error in the validator. It doesn't understand that the harvest is done against both the added and modified dates, while the datestamp of a record is only the newest of the two.
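To make the validator's complaint concrete, here is a small Python sketch using the two dates from the error message. It assumes, purely for illustration, that the record was added on 2014-09-30 and modified on 2014-10-13; the function names are hypothetical and this is not the plugin's code.

```python
from datetime import date


def record_datestamp(added, modified):
    """The datestamp reported for a record is the newest of its two dates."""
    return max(added, modified)


def harvested_on(day, added, modified):
    """A day-granularity harvest matches if either date falls on that day."""
    return added == day or modified == day


# Assumed dates, chosen to mirror the validator's message.
added = date(2014, 9, 30)
modified = date(2014, 10, 13)
reference_day = date(2014, 9, 30)

is_returned = harvested_on(reference_day, added, modified)
reported = record_datestamp(added, modified).isoformat()
```

Under these assumptions the record is returned for the reference date because of its added date, yet it is reported with the datestamp `2014-10-13`, which is exactly the mismatch the validator flags.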

Also, can you set the earliest datestamp for Identify (<earliestDatestamp>1970-01-01T00:00:00Z</earliestDatestamp>)? See https://github.com/zerocrates/OaiPmhRepository/pull/6.

Thanks,

Sincerely,

Daniel Berthereau Infodoc & Knowledge management

Daniel-KM commented 10 years ago

Hi,

I just thought of another issue related to added / modified, but it's hard to resolve. The protocol says that the date should be updated only if there is a change in the metadata of the record. But records are often edited by users or contributors who save them without any change, or with a change to non-exposed metadata (public/reserved, featured, etc.). So the modified date is updated even if all the exposed metadata are identical.

Sincerely,

Daniel Berthereau Infodoc & Knowledge management

Daniel-KM commented 10 years ago

Hi,

The slow responses for the METS and omeka-xml record lists come from the fact that files are always added, even when the option oaipmh_repository_expose_files is not checked. In my case, there may be more than one thousand files attached to a single item (pages of digitized books), so there is a timeout. See https://github.com/zerocrates/OaiPmhRepository/pull/9.

Sincerely,

Daniel Berthereau Infodoc & Knowledge management

zerocrates commented 10 years ago

Wow, that is quite a large number of files per item. I could see a memory issue easily, as METS and omeka-xml (especially) are very verbose formats so the sheer size of the XML response could get unwieldy. A timeout is an interesting result, though. Even many thousand extra files over expectations shouldn't easily cause a timeout here. Well, at any rate, the expose flag should be being consistently applied.

I think for at least one of the validators there were (or still are) also some timeouts when trying to load that "toolkit" XSD. I reached out to the organization that once hosted that schema to see if it could be restored or the timeout fixed, but I haven't heard back.

Daniel-KM commented 10 years ago

Hi,

There are many files, but they are displayed with the Internet Archive BookReader (https://github.com/Daniel-KM/BookReader) (example: https://patrimoine.mines-paristech.fr/document/Combes_Traite_1844). And the PDFs can't be embedded, because they are too heavy.

Sincerely,

Daniel Berthereau Infodoc & Knowledge management