zimeon / ocfl-py

OCFL tools in Python
MIT License

Harvard 1: Load existing, standalone OCFL object #105

Closed awoods closed 2 weeks ago

awoods commented 1 year ago

As a part of our bulk download process, we would like to pull down individual OCFL objects from S3 to local disk, then use ocfl-py to inspect and pull out specific files.

This will involve three new functions in ocfl-py:

  1. load individual object
  2. ~list files in object (optional version arg, default head)~
  3. ~get content (arg: logical path, optional version arg)~

This issue is to design the CLI interaction for step 1.

zimeon commented 1 year ago

This is the function I'm not sure about: where is the object loaded to/from?

awoods commented 1 year ago

Actually, I should probably rephrase this ticket (and the other two: https://github.com/zimeon/ocfl-py/issues/106 & https://github.com/zimeon/ocfl-py/issues/107) to remove the "CLI" design comment. I would like to load an object and interact with it by using ocfl-py as an imported library. For this ticket, I would like to load the object into memory as a Python object.

zimeon commented 1 month ago

@awoods - I don't think loading an object into memory makes much sense; that could be really big! I certainly understand loading the inventory associated with an object on storage. Currently one can load the inventory and get a dict based on the parsed JSON:

>>> import ocfl
>>> obj = ocfl.Object(path="fixtures/1.1/good-objects/spec-ex-full")
>>> inv = obj.parse_inventory()
>>> inv
{'digestAlgorithm': 'sha512', 'fixity': {'md5': {'184f84e28cbe75e050e9c25ea7f2e939': ['v1/content/foo/bar.xml'], '2673a7b11a70bc7ff960ad8127b4adeb': ['v2/content/foo/bar.xml'], 'c289c8ccd4bab6e385f5afdd89b5bda2': ['v1/content/image.tiff'], 'd41d8cd98f00b204e9800998ecf8427e': ['v1/content/empty.txt']}, 'sha1': {'66709b068a2faead97113559db78ccd44712cbf2': ['v1/content/foo/bar.xml'], 'a6357c99ecc5752931e133227581e914968f3b9c': ['v2/content/foo/bar.xml'], 'b9c7ccc6154974288132b63c15db8d2750716b49': ['v1/content/image.tiff'], 'da39a3ee5e6b4b0d3255bfef95601890afd80709': ['v1/content/empty.txt']}}, 'head': 'v3', 'id': 'ark:/12345/bcd987', 'manifest': {'4d27c86b026ff709b02b05d126cfef7ec3aed5f83f5e98df7d7592f7a44bd1dc7f29509cff06b884158baa36a2bbeda11ab8a64b56585a70f5ce1fa96e26eb53': ['v2/content/foo/bar.xml'], '7dcc352f96c56dc5b094b2492c2866afeb12136a78f0143431ae247d02f02497bbd733e0536d34ec9703eba14c6017ea9f5738322c1d43169f8c77785947ac31': ['v1/content/foo/bar.xml'], 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e': ['v1/content/empty.txt'], 'ffccf6baa21809716f31563fafb9f333c09c336bb7400088f17e4ff307f98fc9b14a577f92f3285913b7f53a6d5cf004503cf839aada1c885ac69336cbfb862e': ['v1/content/image.tiff']}, 'type': 'https://ocfl.io/1.1/spec/#inventory', 'versions': {'v1': {'created': '2018-01-01T01:01:01Z', 'message': 'Initial import', 'state': {'7dcc352f96c56dc5b094b2492c2866afeb12136a78f0143431ae247d02f02497bbd733e0536d34ec9703eba14c6017ea9f5738322c1d43169f8c77785947ac31': ['foo/bar.xml'], 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e': ['empty.txt'], 'ffccf6baa21809716f31563fafb9f333c09c336bb7400088f17e4ff307f98fc9b14a577f92f3285913b7f53a6d5cf004503cf839aada1c885ac69336cbfb862e': ['image.tiff']}, 'user': {'address': 'mailto:alice@example.com', 'name': 'Alice'}}, 'v2': {'created': '2018-02-02T02:02:02Z', 'message': 
'Fix bar.xml, remove image.tiff, add empty2.txt', 'state': {'4d27c86b026ff709b02b05d126cfef7ec3aed5f83f5e98df7d7592f7a44bd1dc7f29509cff06b884158baa36a2bbeda11ab8a64b56585a70f5ce1fa96e26eb53': ['foo/bar.xml'], 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e': ['empty.txt', 'empty2.txt']}, 'user': {'address': 'mailto:bob@example.com', 'name': 'Bob'}}, 'v3': {'created': '2018-03-03T03:03:03Z', 'message': 'Reinstate image.tiff, delete empty.txt', 'state': {'4d27c86b026ff709b02b05d126cfef7ec3aed5f83f5e98df7d7592f7a44bd1dc7f29509cff06b884158baa36a2bbeda11ab8a64b56585a70f5ce1fa96e26eb53': ['foo/bar.xml'], 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e': ['empty2.txt'], 'ffccf6baa21809716f31563fafb9f333c09c336bb7400088f17e4ff307f98fc9b14a577f92f3285913b7f53a6d5cf004503cf839aada1c885ac69336cbfb862e': ['image.tiff']}, 'user': {'address': 'mailto:cecilia@example.com', 'name': 'Cecilia'}}}}
>>> inv['digestAlgorithm']
'sha512'
>>> inv['versions']['v1']
{'created': '2018-01-01T01:01:01Z', 'message': 'Initial import', 'state': {'7dcc352f96c56dc5b094b2492c2866afeb12136a78f0143431ae247d02f02497bbd733e0536d34ec9703eba14c6017ea9f5738322c1d43169f8c77785947ac31': ['foo/bar.xml'], 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e': ['empty.txt'], 'ffccf6baa21809716f31563fafb9f333c09c336bb7400088f17e4ff307f98fc9b14a577f92f3285913b7f53a6d5cf004503cf839aada1c885ac69336cbfb862e': ['image.tiff']}, 'user': {'address': 'mailto:alice@example.com', 'name': 'Alice'}}
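
As a rough illustration of working with that dict directly, listing the files in a version could look something like the sketch below. The helper name `logical_paths` and the trimmed-down inventory (with shortened placeholder digests) are assumptions made for this example; they are not part of ocfl-py.

```python
# Hypothetical helper, NOT part of ocfl-py: operates on a plain inventory dict
# shaped like the parsed JSON shown above.
def logical_paths(inventory, version=None):
    """Return the sorted logical paths in one version (default: the head version)."""
    vdir = version or inventory["head"]
    state = inventory["versions"][vdir]["state"]
    # state maps digest -> list of logical paths that share that content
    return sorted(path for paths in state.values() for path in paths)


# Trimmed-down inventory for illustration; digests are shortened placeholders
inventory = {
    "head": "v2",
    "versions": {
        "v1": {"state": {"aaa": ["foo/bar.xml"], "bbb": ["empty.txt"]}},
        "v2": {"state": {"ccc": ["foo/bar.xml"], "bbb": ["empty.txt", "empty2.txt"]}},
    },
}

print(logical_paths(inventory))        # ['empty.txt', 'empty2.txt', 'foo/bar.xml']
print(logical_paths(inventory, "v1"))  # ['empty.txt', 'foo/bar.xml']
```

This only needs the `head` and `versions` blocks of the inventory, so it works the same whether or not the object has a `fixity` block.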

awoods commented 1 month ago

Agreed, loading the inventory into memory, rather than the entire object with its content files, makes sense. Our use cases involve an OCFL repository that is created and managed by a separate application, and we need a Python library to help read/inspect existing OCFL objects.

For example, we need to ask an OCFL object (or its inventory) for the "content path" of a specific file for which we have the "logical path". We neither know nor care in which version the file was created, or whether the logical file was de-duplicated.

Ideally, the Python code importing an ocfl-py module would not need to parse/understand the underlying JSON structure of OCFL inventories.
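
To make the request concrete, here is a minimal sketch of the lookup we would otherwise have to write ourselves against the raw JSON: find the digest for the logical path in the version state, then look that digest up in the manifest. The function name `content_path` and the tiny inventory with placeholder digests are hypothetical, chosen only for this example.

```python
# Hypothetical helper, NOT an ocfl-py API: resolves a logical path to a content
# path using only the "state" and "manifest" blocks of an inventory dict.
def content_path(inventory, logical_path, version=None):
    """Map a logical path in a version (default: head) to a content path."""
    vdir = version or inventory["head"]
    for digest, paths in inventory["versions"][vdir]["state"].items():
        if logical_path in paths:
            # manifest maps digest -> list of content paths holding those bytes;
            # any entry will do, so take the first
            return inventory["manifest"][digest][0]
    raise KeyError("%s not found in %s" % (logical_path, vdir))


# Placeholder digests; empty2.txt was added in v2 but de-duplicated against v1
inventory = {
    "head": "v2",
    "manifest": {
        "aaa": ["v1/content/foo/bar.xml"],
        "bbb": ["v1/content/empty.txt"],
        "ccc": ["v2/content/foo/bar.xml"],
    },
    "versions": {
        "v1": {"state": {"aaa": ["foo/bar.xml"], "bbb": ["empty.txt"]}},
        "v2": {"state": {"ccc": ["foo/bar.xml"], "bbb": ["empty.txt", "empty2.txt"]}},
    },
}

print(content_path(inventory, "empty2.txt"))   # v1/content/empty.txt
print(content_path(inventory, "foo/bar.xml"))  # v2/content/foo/bar.xml
```

The `empty2.txt` case is the de-duplication scenario: the caller gets a usable content path without knowing that the bytes were first stored in v1.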

zimeon commented 2 weeks ago

For the specific case of finding a logical path in some version of an object, a search like the following would work (close to #105):

>>> import ocfl
>>> obj = ocfl.Object(path="fixtures/1.1/good-objects/spec-ex-full")
>>> inv = obj.parse_inventory()
>>> logical_path = "empty.txt"
>>> for vdir in reversed(inv.version_directories):
...     if logical_path in inv.version(vdir).logical_paths:
...         print("Found %s in %s with content in %s" % (logical_path, vdir, inv.version(vdir).content_path_for_logical_path(logical_path)))
...         break
... 
Found empty.txt in v2 with content in v1/content/empty.txt

I guess I'm open to creating something like the following, which would search backward through versions to find the last version and content path for a specified logical path:

vdir, content_path = inv.find_logical_path("empty.txt")  

if that is of particular interest.
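
For illustration only, the backward search could be sketched on a plain inventory dict roughly as follows. This is not the ocfl-py implementation; the function name `find_in_versions`, the `(None, None)` return for a missing path, and the placeholder digests are all assumptions made for the example.

```python
# Hypothetical sketch, NOT the ocfl-py implementation: search versions from
# newest to oldest for a logical path, using a plain inventory dict.
def find_in_versions(inventory, logical_path):
    """Return (version directory, content path) for the latest version that
    contains logical_path, or (None, None) if no version contains it."""
    # Sort numerically so that zero-padded names such as v0002 also order correctly
    vdirs = sorted(inventory["versions"], key=lambda v: int(v[1:]), reverse=True)
    for vdir in vdirs:
        for digest, paths in inventory["versions"][vdir]["state"].items():
            if logical_path in paths:
                return vdir, inventory["manifest"][digest][0]
    return None, None


# Same shape as the spec-ex-full fixture, with shortened placeholder digests
inventory = {
    "head": "v3",
    "manifest": {
        "d-bar1": ["v1/content/foo/bar.xml"],
        "d-bar2": ["v2/content/foo/bar.xml"],
        "d-empty": ["v1/content/empty.txt"],
        "d-tiff": ["v1/content/image.tiff"],
    },
    "versions": {
        "v1": {"state": {"d-bar1": ["foo/bar.xml"], "d-empty": ["empty.txt"],
                         "d-tiff": ["image.tiff"]}},
        "v2": {"state": {"d-bar2": ["foo/bar.xml"],
                         "d-empty": ["empty.txt", "empty2.txt"]}},
        "v3": {"state": {"d-bar2": ["foo/bar.xml"], "d-empty": ["empty2.txt"],
                         "d-tiff": ["image.tiff"]}},
    },
}

print(find_in_versions(inventory, "empty.txt"))   # ('v2', 'v1/content/empty.txt')
print(find_in_versions(inventory, "image.tiff"))  # ('v3', 'v1/content/image.tiff')
```

Searching newest-first means a file deleted in the head version (such as `empty.txt` here) is still found in the last version that included it.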

zimeon commented 2 weeks ago

Added the method in #129, closing.

Further suggestions welcome.

awoods commented 2 weeks ago

Thanks, @zimeon !