wtsi-hgi / python-baton-wrapper

Python 3 wrapper for baton
GNU Lesser General Public License v3.0

Bug in _BatonIrodsMetadataMapper -- index out of range #44

Closed irinaColgiu closed 8 years ago

irinaColgiu commented 8 years ago

I am pretty sure this is a bug, though I haven't investigated it very thoroughly. The code iterates with i over the range (0, number of AVUs) (called metadata_for_paths in the code), and then attempts to access the i-th element of paths:

        for i in range(len(metadata_for_paths)):
            entity = self._create_entity_with_path(paths[i])

This is obviously only going to work when the number of AVUs equals the number of paths; otherwise, if there is 1 path and 3 AVUs, one gets an IndexError:

  File "/nfs/users/nfs_i/ic4/Projects/python3/serapis/ENV/lib/python3.5/site-packages/baton/_baton/baton_metadata_mappers.py", line 82, in remove
    self._modify(paths, metadata, BATON_METAMOD_OPERATION_REMOVE)
  File "/nfs/users/nfs_i/ic4/Projects/python3/serapis/ENV/lib/python3.5/site-packages/baton/_baton/baton_metadata_mappers.py", line 104, in _modify
    entity = self._create_entity_with_path(paths[i])
IndexError: list index out of range

from: https://github.com/wtsi-hgi/python-baton-wrapper/blob/master/baton/_baton/baton_metadata_mappers.py#L87-L88

colin-nolan commented 8 years ago

Either a single IrodsMetadata collection is supplied for all paths, or a separate collection is supplied for each path, i.e.:

mapper.add(["/path_1", "/path_2"], metadata_for_path_1_and_2)
# Or
mapper.add(["/path_1", "/path_2"], [metadata_for_path_1, metadata_for_path_2])

If the mapper is used correctly, the len(paths) == len(metadata_for_paths) assertion always passes. However, to help in the case where it is not used correctly, e.g.

# Wrong!
mapper.add(["/path_1", "/path_2", "/path_3"], [metadata_for_path_1, metadata_for_path_2])

I'll add a check so that a more informative exception is raised than the IndexError that you pointed out.
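A minimal sketch of such a check (function name and message are hypothetical, not the library's actual code) might look like:

```python
def validate_paths_and_metadata(paths, metadata_for_paths):
    """Fail fast with a clear message instead of a later IndexError.

    A single metadata collection applies to all paths; a list of
    collections must contain exactly one entry per path.
    """
    if isinstance(metadata_for_paths, list) \
            and len(paths) != len(metadata_for_paths):
        raise ValueError(
            "%d metadata collections supplied for %d paths; supply one "
            "collection per path, or a single collection for all paths"
            % (len(metadata_for_paths), len(paths)))
```

Raising ValueError at the top of _modify would surface the misuse at the call site rather than deep inside the iteration.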

irinaColgiu commented 8 years ago

I don't understand; the two seem unrelated to me. Why would the number of paths I want to modify depend on the length of the metadata, i.e. the number of AVUs I want to add or remove? Most of the time I will want to add a set of, say, 10-30 AVUs to 1 data_object, or remove a couple of AVUs from 1 data_object.

For me this part throws an exception both when I want to add and when I want to remove metadata:

In [16]: irods.data_object.metadata.add("/humgen/projects/serapis_staging/test-baton/test_metadata_add_rm.txt", [IrodsMetadata({'key1': {'val1'}}), IrodsMetadata({'key2': {'val2'}})])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-16-11ee5a0305c5> in <module>()
----> 1 irods.data_object.metadata.add("/humgen/projects/serapis_staging/test-baton/test_metadata_add_rm.txt", [IrodsMetadata({'key1': {'val1'}}), IrodsMetadata({'key2': {'val2'}})])

/nfs/users/nfs_i/ic4/Projects/python3/serapis/ENV/lib/python3.5/site-packages/baton/_baton/baton_metadata_mappers.py in add(self, paths, metadata)
     56 
     57     def add(self, paths: Union[str, Iterable[str]], metadata: IrodsMetadata):
---> 58         self._modify(paths, metadata, BATON_METAMOD_OPERATION_ADD)
     59 
     60     def set(self, paths: Union[str, Iterable[str]], metadata: IrodsMetadata):

/nfs/users/nfs_i/ic4/Projects/python3/serapis/ENV/lib/python3.5/site-packages/baton/_baton/baton_metadata_mappers.py in _modify(self, paths, metadata_for_paths, operation)
    102 
    103 
--> 104 class BatonDataObjectIrodsMetadataMapper(_BatonIrodsMetadataMapper):
    105     """
    106     iRODS data object metadata mapper, implemented using baton.

IndexError: list index out of range

colin-nolan commented 8 years ago

Remember that IrodsMetadata is a collection of key-(set)value pairs. Instead of the invalid:

irods.data_object.metadata.add("/path", [IrodsMetadata({'key1': {'val1'}}), IrodsMetadata({'key2': {'val2'}})])

you'll need:

irods.data_object.metadata.add("/path", IrodsMetadata({'key1': {'val1'}, 'key2': {'val2'}}))
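If the key/value pairs start out in separate dicts, they can be merged into a single collection before the call. A sketch using plain dicts of sets (standing in for IrodsMetadata; merge_avus is a hypothetical helper, not part of the library):

```python
def merge_avus(*avu_dicts):
    """Merge several {attribute: {values}} dicts into one,
    unioning the value sets of any shared attributes."""
    merged = {}
    for avus in avu_dicts:
        for attribute, values in avus.items():
            merged.setdefault(attribute, set()).update(values)
    return merged

merged = merge_avus({'key1': {'val1'}}, {'key2': {'val2'}})
# merged == {'key1': {'val1'}, 'key2': {'val2'}}
```

The merged dict could then be wrapped in a single IrodsMetadata for the add call.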

I don't understand, the 2 seem to me unrelated - why would the number of paths I want to modify depend on the length of metadata - ie number of avus I want to add or remove?!

It is possible to specify that a single collection of metadata is to be used with every path, e.g. in the example below, metadata is set for all paths:

mapper.set(["/path_1", "/path_2"], metadata)

The mapper also allows different metadata to be set for each path, e.g. in this example, metadata_1 is set for /path_1 and metadata_2 is set for /path_2:

mapper.set(["/path_1", "/path_2"], [metadata_1, metadata_2])

I have made some corrections/improvements to the documentation and interface to make this clearer. I have also added the aforementioned check for an invalid set of arguments so as to raise a more informative exception.

Most of the times I will want to add a set of 10-20-30 AVUs let's say, to 1 data_object, or remove a couple of AVUs from 1 data_object.

In that case, most of the time you won't be using the more elaborate "multi-path, multi-metadata" functionality. Instead, you'll just need:

mapper.set("/path/data_object", metadata)

irinaColgiu commented 8 years ago

OK, I see... That wasn't clear to me from the docs. Well, in this case, calling it len(metadata_for_paths) in the context of len(paths) == len(metadata_for_paths) was very confusing: since metadata is a dict, the length of a dict is the number of elements in it, which is the number of attribute names (keys in this case).

As an observation, I don't think this:

mapper.set(["/path_1", "/path_2"], metadata)

is going to be used, as the metadata for 2 files will always be different. I can't think of any use case in which there would be identical metadata for 2 data objects/collections.

I think in the case of the set method there is too much functionality crowded into one method, and it makes it confusing as to what it does, in the sense that it's not obvious if/how the behaviour of the set method changes depending on the types of the params.

Listing all of the functionality of this one method, it a) looks to me like overloading a method in Java (which is not necessarily bad, it just smells more like Java than Python), and b) to me it would be clearer if you just called them differently, because the mental effort to understand how the functionality differs in each case grows with the number of cases, and can be avoided by giving each a relevant name, e.g. set(list1, list2) could be called set_all, and set(list1, object) -> set_to_each, or something like that. IMHO.
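The renaming suggestion above could be sketched as thin wrappers over the existing interface (names set_all and set_to_each are the hypothetical ones proposed here, not the library's API):

```python
def set_to_each(mapper, paths, metadata_list):
    """Apply a distinct metadata collection to each path (one per path)."""
    if len(paths) != len(metadata_list):
        raise ValueError("exactly one metadata collection required per path")
    for path, metadata in zip(paths, metadata_list):
        mapper.set(path, metadata)

def set_all(mapper, paths, metadata):
    """Apply the same metadata collection to every path."""
    for path in paths:
        mapper.set(path, metadata)
```

With separate names, each function accepts only one argument shape, so no runtime type dispatch is needed.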

colin-nolan commented 8 years ago

As an observation, I don't think this:

mapper.set(["/path_1", "/path_2"], metadata)

is going to be used, as the metadata for 2 files will always be different. I can't think of any use case in which there would be identical metadata for 2 data objects/collections.

set could be used to reset the metadata to an initial shared set of key/values, e.g. study_id, upon which unique key/values may be written for each entity? Probably not as useful as the equivalent add method though, which could add shared key/values on top of existing keys.

looks to me like overloading a method in Java...

I guess the more Pythonic way is to overload methods by use of different parameters, e.g.:

def add(self, path: str=None, paths: Iterable[str]=None, metadata: IrodsMetadata=None, metadata_list: List[IrodsMetadata]=None)
# Instead of
def add(self, paths: Union[str, Iterable[str]], metadata: Union[IrodsMetadata, List[IrodsMetadata]])

It can certainly make the parameters themselves clearer. However, it comes at the cost of making the input to the function more opaque: the former method signature offers no hint of the minimal set of arguments that must be supplied or of which combinations are valid.

The way this currently is makes it consistent with the rest of the interface. I do agree: it is indeed consistently more Java-like than Pythonic!

to me it would be clearer if you just called them differently, because the mental effort to understand how the functionality differs in each case grows with the number of cases, and can be avoided by giving each a relevant name

I'm not sure such a change would make things much better: the mental effort is shifted from knowing the purpose of the parameters to knowing the purpose of a collection of related functions.