Using pre-tokenized queries / documents does not work at the moment

webis-de / ir_axioms

↕️ Intuitive axiomatic retrieval experimentation.

https://pypi.org/project/ir_axioms/

MIT License


Using pre-tokenized queries / documents does not work at the moment #50

Open mam10eks opened 3 months ago

mam10eks commented 3 months ago

This commit adds some failing unit tests: https://github.com/webis-de/ir_axioms/commit/4a747d4bd22f4aea1fb754ebef48dbb5febbcc8a

This should be simple to resolve. We load the term pipeline from the Terrier index, and we implemented that at a time when the pre-tokenized feature was not yet available in PyTerrier, so we likely use the wrong pipeline when pre-tokenized is specified.

mam10eks commented 3 months ago

cc @Parry-Parry, @heinrichreimer.

mam10eks commented 3 months ago

Alright, for pre-tokenized indexes, termpipelines= (with an empty value) is in the index/data.properties file, and in this case ir_axioms falls back to a default term pipeline that applies some normalization.
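To make the difference concrete, here is a minimal sketch of how one could tell the two cases apart by reading the termpipelines entry. The helper read_termpipelines and the sample file contents are hypothetical and only for illustration; this is not how ir_axioms or PyTerrier actually read the file.

```python
from typing import List, Optional


def read_termpipelines(properties_text: str) -> Optional[List[str]]:
    """Return the configured term pipeline stages (hypothetical helper).

    Returns None if the key is absent, and an empty list if the key is
    present but has no value (``termpipelines=``).
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments of the Java properties format.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "termpipelines":
            return [stage.strip() for stage in value.split(",") if stage.strip()]
    return None


# Illustrative file contents, not copied from a real index.
default_index = "some.other.key=value\ntermpipelines=Stopwords,PorterStemmer\n"
pretokenized_index = "some.other.key=value\ntermpipelines=\n"

print(read_termpipelines(default_index))       # ['Stopwords', 'PorterStemmer']
print(read_termpipelines(pretokenized_index))  # []
```

Distinguishing an empty list from None would let a caller treat "explicitly no pipeline" differently from "nothing configured", which seems to be the case that currently triggers the normalizing default.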

@heinrichreimer Do you have any preference for how we could solve this? E.g., so that it is usable but still compatible with the previous behaviour?

Parry-Parry commented 3 months ago

@heinrichreimer @mam10eks So I assume the default pipeline is stopwords plus Porter stemmer. This is always included in data.properties, so it shouldn't be an issue in the default case.

mam10eks commented 3 months ago

One possible suggestion could also be that we introduce a new PreTokenizedTerrierIndexContext that is a TerrierIndexContext and just overrides the term pipeline property?
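A rough sketch of that idea, under loud assumptions: the TerrierIndexContext below is a simplified stand-in written for this example, not the real ir_axioms class, and the property name term_pipeline, the method terms, and the lowercasing default are all hypothetical.

```python
from typing import Callable, List


class TerrierIndexContext:
    """Simplified stand-in for illustration, NOT the real ir_axioms class."""

    @property
    def term_pipeline(self) -> Callable[[str], str]:
        # Assumption: the default pipeline applies some normalization.
        # Lowercasing is only a placeholder for that normalization.
        return str.lower

    def terms(self, tokens: List[str]) -> List[str]:
        # Run every token through whatever pipeline the context provides.
        return [self.term_pipeline(token) for token in tokens]


class PreTokenizedTerrierIndexContext(TerrierIndexContext):
    """Variant for pre-tokenized input: tokens pass through unchanged."""

    @property
    def term_pipeline(self) -> Callable[[str], str]:
        # Identity pipeline: the caller already tokenized and normalized.
        return lambda term: term


tokens = ["Pre-Tokenized", "Query"]
print(TerrierIndexContext().terms(tokens))              # ['pre-tokenized', 'query']
print(PreTokenizedTerrierIndexContext().terms(tokens))  # ['Pre-Tokenized', 'Query']
```

The appeal of this shape is that existing users of the base context keep the previous behaviour, while pre-tokenized users opt in explicitly by choosing the subclass.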

janheinrichmerker commented 3 months ago

I'd say it would be best to fix this in the PyTerrier backend here: https://github.com/webis-de/ir_axioms/blob/4212946b2d96ab5175e3df886fca13cb35368fd0/ir_axioms/backend/pyterrier/__init__.py#L229-L234

Is there a PyTerrier API to access the pre-tokenized terms given the document ID?