zimeon / ocfl-py

OCFL tools in Python
MIT License

Missing file when creating OCFL object from directory #80

Closed johnguirgis closed 3 years ago

johnguirgis commented 3 years ago

I am trying to create an OCFL object from a bag that was generated from some online content. I used the following command:

python3 ocfl-py/ocfl-object.py --create --srcdir 5 --id 5:5 --objdir ./ocfl_obj --name name --message hello --address a@company.com

which generated this OCFL object. The issue is that one of the files from the bag (5/data/6/5/10/OBJ.jpg) is not being added to the corresponding directory (ocfl_obj/v1/content/data/6/5/10) in the OCFL object. Any guidance on why this happens or how to resolve it?

zimeon commented 3 years ago

There are two files in the source directory that are identical:

5> diff -s data/5/4/9/OBJ.jpg data/6/5/10/OBJ.jpg
Files data/5/4/9/OBJ.jpg and data/6/5/10/OBJ.jpg are identical
simeon@RottenApple 5> shasum -a 512 data/5/4/9/OBJ.jpg data/6/5/10/OBJ.jpg 
07c9212cf01a37532499a721d515c418f1128e4cc1aff3c195030e7b5ebf7d5e667482277a6b27cf6ce3480225796694f8ca89c6616e0de2718be617fe1e106a  data/5/4/9/OBJ.jpg
07c9212cf01a37532499a721d515c418f1128e4cc1aff3c195030e7b5ebf7d5e667482277a6b27cf6ce3480225796694f8ca89c6616e0de2718be617fe1e106a  data/6/5/10/OBJ.jpg

Because the two files are identical, only one copy is stored in the OCFL object, which is the usual behavior. The inventory lists both logical paths in the version state under the single shared digest:

...
  "versions": {
    "v1": {
      "created": "2021-08-12T13:05:03.056950Z",
      "message": "hello",
      "state": {
        "07c9212cf01a37532499a721d515c418f1128e4cc1aff3c195030e7b5ebf7d5e667482277a6b27cf6ce3480225796694f8ca89c6616e0de2718be617fe1e106a": [
          "data/5/4/9/OBJ.jpg",
          "data/6/5/10/OBJ.jpg"
        ],
...
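To see in advance which source files will collapse to a single stored copy, you can group them by digest yourself. The following is a minimal sketch using only the Python standard library; it does not use ocfl-py's own code, and the function names (`sha512_of`, `build_state`, `duplicates`) are hypothetical helpers made up for illustration. It produces a mapping with the same shape as the `state` block above: one digest, a list of logical paths.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha512_of(path, chunk_size=65536):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_state(srcdir):
    """Map each digest to the sorted list of relative paths with that content.

    This mirrors the shape of an inventory "state" block; it is an
    illustration, not the code ocfl-py actually runs.
    """
    src = Path(srcdir)
    state = defaultdict(list)
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        state[sha512_of(path)].append(path.relative_to(src).as_posix())
    return dict(state)


def duplicates(state):
    """Keep only the digests shared by more than one logical path."""
    return {digest: paths for digest, paths in state.items() if len(paths) > 1}
```

Calling `duplicates(build_state("5"))` on the source directory from this issue should report the two `OBJ.jpg` paths under one digest, assuming the files are as shown in the `shasum` output above.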

The ocfl-object.py code has a flag --no-dedupe to turn off this deduplication. I'm not sure why one would want to use it in production, but as a test it does show that an object with duplicate content can be created:

> ocfl-object.py --create --no-dedupe --srcdir 5 --id 5:5 --objdir ./ocfl_obj --name name --message hello --address a@company.com
INFO:ocfl.object:Created OCFL object 5:5 in ./ocfl_obj

> ls -l ocfl_obj/v1/content/data/5/4/9/OBJ.jpg ocfl_obj/v1/content/data/6/5/10/OBJ.jpg 
-rw-r--r--  1 simeon  staff  70727 Aug 12 10:41 ocfl_obj/v1/content/data/5/4/9/OBJ.jpg
-rw-r--r--  1 simeon  staff  70727 Aug 12 10:41 ocfl_obj/v1/content/data/6/5/10/OBJ.jpg

> more ocfl_obj/inventory.json
{
  "digestAlgorithm": "sha512",
  "head": "v1",
  "id": "5:5",
  "manifest": {
    "07c9212cf01a37532499a721d515c418f1128e4cc1aff3c195030e7b5ebf7d5e667482277a6b27cf6ce3480225796694f8ca89c6616e0de2718be617fe1e106a": [
      "v1/content/data/5/4/9/OBJ.jpg",
      "v1/content/data/6/5/10/OBJ.jpg"
    ],
...
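Whether or not deduplication is on, the stored bytes for any logical file can be found by going from the version `state` to the `manifest` via the digest. Here is a minimal sketch of that lookup, assuming the inventory has been loaded with `json.load` and that digests match exactly as written; `content_paths_for` is a hypothetical helper, not part of ocfl-py.

```python
def content_paths_for(inventory, logical_path, version=None):
    """Return the manifest content paths that hold a logical file's bytes.

    inventory    -- parsed inventory.json as a dict
    logical_path -- path as it appears in the version state
    version      -- version directory name, defaults to the inventory head
    """
    version = version or inventory["head"]
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # The manifest is keyed by the same digest as the state.
            return inventory["manifest"][digest]
    raise KeyError(f"{logical_path} not found in state of {version}")
```

With a deduplicated object, looking up `data/5/4/9/OBJ.jpg` and `data/6/5/10/OBJ.jpg` returns the same single content path, which is why only one of the two directories contains the file on disk.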

Note also that the command you used is not interpreting the source as a BagIt bag but just as a directory structure. To interpret it as a bag, use --srcbag instead of --srcdir, e.g.:

> ocfl-object.py --create --srcbag 5 --id 5:5 --objdir ./ocfl_obj --name name --message hello --address a@company.com
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/node.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/node_en.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/node.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/node_en.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/node.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/node_en.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/4/media.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/4/media_en.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/5/media.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/5/media_en.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/6/media.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/6/media_en.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/4/9/OBJ.jpg
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/4/9/file.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/5/4/9/file.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/5/10/OBJ.jpg
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/5/10/file.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/6/5/10/file.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/6/11/OBJ.jpg
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/6/11/file.json
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/data/7/6/11/file.jsonld
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/bag-info.txt
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/bagit.txt
INFO:bagit:Verifying checksum for file /Users/simeon/Downloads/dd/5/manifest-sha1.txt
INFO:ocfl.object:Created OCFL object 5:5 in ./ocfl_obj