whole-tale / serialization-format

Holds documentation about Whole Tale's serialization format

Fix schema.org domain in 'aggregates' section #3

Open ThomasThelen opened 3 years ago

ThomasThelen commented 3 years ago

We have a couple of unbound terms (not in a vocab) and schema.org terms with incorrect domains being used in physical files that are aggregated. For example,

    "aggregates": [
        {
            "md5": "81faaedac351f28092bd845a48c6d0a5",
            "size": 170,
            "schema:license": "CC-BY-4.0",
            "mimeType": "text/plain",
            "uri": "../data/LICENSE"
        }
    ],

This can be read as the RDF triple wt:Tale ore:aggregates ore:AggregatedResource.

Solution

From the ORE documentation,

Note that asserting that a resource is a member of the class of Aggregated Resources does not imply anything other than that it is aggregated by at least one Aggregation. As such, this class is mostly informative and there is no need to assert that aggregated resources are instances of the ore:AggregatedResource class.

In short, calling this object an ore:AggregatedResource tells us only that it is aggregated, which isn't particularly useful.

Calling downloaded files a plain CreativeWork is probably a stretch. schema.org also has the DataDownload type, which sounds promising but represents an entire dataset.

One clean solution is to create our own type that's a subclass of schema:CreativeWork and ore:AggregatedResource. This lets us keep using the object mostly as it's currently used.

An example of what this looks like as an OWL class,

<https://vocabularies.wholetale.org/wt/1.0/wt#physicalFile>
  a owl:Class ;
  rdfs:subClassOf <https://schema.org/CreativeWork>, <http://www.openarchives.org/ore/terms/AggregatedResource> ;
  rdfs:comment "A class that represents a file that physically exists on disk."@en ;
  rdfs:label "Physical File"@en .

Alternative

The alternative is to use something from https://id.loc.gov/ontologies/premis-3-0-0.html

It supports the notions of a file, a license, and a cryptographic signature, but schema.org isn't compatible with it. It's also arguably not lightweight. We would still run into the issue of needing to subclass ore:AggregatedResource.

Dealing with 'md5'

If we turn the object into a schema.org class, we can use the suggestion here. This results in replacing md5 with "schema:identifier": "ni:///md5;81faaedac351f28092bd845a48c6d0a5"
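
As a rough illustration of that replacement, the sketch below builds the identifier string from an MD5 hex digest. The function names are hypothetical and not part of any Whole Tale code. It keeps the hex form shown in this issue; note that RFC 6920 itself describes `ni` URIs with base64url-encoded digest values, so the hex form here is an assumption carried over from the example above.

```python
import hashlib
import re


def md5_identifier(hex_digest: str) -> str:
    """Return the identifier string in the hex form used in this issue."""
    digest = hex_digest.strip().lower()
    # An MD5 digest is 128 bits, i.e. exactly 32 hex characters.
    if not re.fullmatch(r"[0-9a-f]{32}", digest):
        raise ValueError(f"not a 32-character hex MD5 digest: {hex_digest!r}")
    return f"ni:///md5;{digest}"


def md5_identifier_for_bytes(data: bytes) -> str:
    """Hash raw file contents and wrap the digest as an identifier."""
    return md5_identifier(hashlib.md5(data).hexdigest())
```

For example, `md5_identifier("81faaedac351f28092bd845a48c6d0a5")` yields the value shown above.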

Dealing with 'size'

CreativeWork objects have a property for size, schema:size.

"schema:size": {
   "@type": "schema:QuantitativeValue",
   "schema:value": 170
}

or as a string "schema:size": "170"

Dealing with 'schema:license'

schema:license is a valid property for CreativeWork classes, but its value is expected to be a CreativeWork or URL rather than a plain string such as "CC-BY-4.0".
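
One possible way to satisfy the expected range is to map short license identifiers to URLs before serializing. This is only a sketch: the lookup table and function name are hypothetical, and the choice of the Creative Commons deed URL for "CC-BY-4.0" is an assumption, not something decided in this issue.

```python
# Hypothetical lookup table; only the identifier used in this issue is filled in.
LICENSE_URLS = {
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
}


def license_value(license_id: str) -> str:
    """Return a URL suitable for schema:license.

    Values that already look like URLs pass through unchanged; unknown
    identifiers raise KeyError so they are not silently serialized as
    plain strings.
    """
    if license_id.startswith(("http://", "https://")):
        return license_id
    return LICENSE_URLS[license_id]
```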

Dealing with 'mimeType'

This key should change to schema:encodingFormat.

Example

    "aggregates": [
        {
            "@type": "wt:PhysicalFile",
            "schema:identifier": "ni:///md5;81faaedac351f28092bd845a48c6d0a5",
            "schema:size": {
               "@type": "schema:QuantitativeValue",
               "schema:value": 170
            },
            "schema:license": "CC-BY-4.0",
            "schema:encodingFormat": "text/plain",
            "uri": "../data/LICENSE"
        }
    ],
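
The mapping from the current entry to the proposed one could be sketched as a small migration function. This is an illustration under the assumptions in this issue, not existing Whole Tale code: `migrate_aggregate` is a hypothetical name, the license value is passed through unchanged as in the example, and any keys not discussed here (such as `uri`) are copied as-is.

```python
def migrate_aggregate(entry: dict) -> dict:
    """Convert an old-style 'aggregates' entry to the proposed form."""
    migrated = {"@type": "wt:PhysicalFile"}

    if "md5" in entry:
        # Hex form of the identifier, as written in this issue.
        migrated["schema:identifier"] = f"ni:///md5;{entry['md5']}"
    if "size" in entry:
        migrated["schema:size"] = {
            "@type": "schema:QuantitativeValue",
            "schema:value": entry["size"],
        }
    if "mimeType" in entry:
        migrated["schema:encodingFormat"] = entry["mimeType"]

    # Copy everything else (schema:license, uri, ...) unchanged.
    handled = {"md5", "size", "mimeType"}
    for key, value in entry.items():
        if key not in handled:
            migrated[key] = value
    return migrated


old_entry = {
    "md5": "81faaedac351f28092bd845a48c6d0a5",
    "size": 170,
    "schema:license": "CC-BY-4.0",
    "mimeType": "text/plain",
    "uri": "../data/LICENSE",
}
new_entry = migrate_aggregate(old_entry)
```

Applied to the entry from the top of this issue, `new_entry` has the same keys and values as the example above.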

Misc

It might make sense to leverage the idea of os:File here.

craig-willis commented 3 years ago

As discussed, I would just use our own vocabulary (wt) for the unbound terms and define them.