Open mam10eks opened 3 months ago
cc @Parry-Parry, @heinrichreimer.
Alright, for pretokenized indexes, termpipelines= is in the index/data.properties file, and in this case ir_axioms uses a default term-pipeline that applies some normalization.
@heinrichreimer Do you have any preferences on how we could solve this? E.g., so that it is usable but maybe still compatible with the previous behaviour?
@heinrichreimer @mam10eks So I assume the default pipe is stopwords, porter stemmer. This is always included in data.properties, so it shouldn't be an issue in the default case.
One possible suggestion could also be that we introduce a new PreTokenizedTerrierIndexContext that is a TerrierIndexContext and just overrides the termpipeline property?
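To make the suggestion concrete, here is a rough sketch of how such a subclass could look. Everything in it is an assumption for illustration: the `...Sketch` classes are stand-ins and not the real ir_axioms `TerrierIndexContext`, the property name `term_pipelines` is made up, and the two pipeline stages are toy functions rather than Terrier's actual stopword list or Porter stemmer.

```python
from typing import Callable, Dict, List, Optional, Sequence


def _drop_stopwords(term: str) -> Optional[str]:
    # Toy stopword filter (hypothetical, tiny word list).
    return None if term in {"the", "of", "a"} else term


def _toy_stem(term: str) -> Optional[str]:
    # Crude stand-in for a real stemmer: strips a trailing "s".
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


# Maps stage names to toy implementations (assumption, not Terrier code).
TOY_STAGES: Dict[str, Callable[[str], Optional[str]]] = {
    "Stopwords": _drop_stopwords,
    "PorterStemmer": _toy_stem,
}


class TerrierIndexContextSketch:
    """Hypothetical stand-in for the existing index context."""

    @property
    def term_pipelines(self) -> Sequence[str]:
        # Assumed default: stopword removal followed by stemming.
        return ("Stopwords", "PorterStemmer")

    def process(self, terms: Sequence[str]) -> List[str]:
        # Run every term through each configured stage in order.
        processed: List[str] = []
        for term in terms:
            current: Optional[str] = term
            for name in self.term_pipelines:
                if current is None:
                    break  # Term was dropped by an earlier stage.
                current = TOY_STAGES[name](current)
            if current is not None:
                processed.append(current)
        return processed


class PreTokenizedTerrierIndexContextSketch(TerrierIndexContextSketch):
    """Same context, but terms are left exactly as they were indexed."""

    @property
    def term_pipelines(self) -> Sequence[str]:
        # Pre-tokenized index: no normalization at all.
        return ()
```

With this shape, existing callers would keep the previous behaviour, and only users of the pre-tokenized variant would get unmodified terms.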
I'd say it would be best to fix this in the PyTerrier backend here: https://github.com/webis-de/ir_axioms/blob/4212946b2d96ab5175e3df886fca13cb35368fd0/ir_axioms/backend/pyterrier/__init__.py#L229-L234
Is there a PyTerrier API to access the pre-tokenized terms given the document ID?
This commit adds some failing unit tests: https://github.com/webis-de/ir_axioms/commit/4a747d4bd22f4aea1fb754ebef48dbb5febbcc8a
This should be simple to resolve. We load the term pipeline from the Terrier index, which we implemented at a time when the pre-tokenized feature was not yet available in PyTerrier, so we likely end up with the wrong pipeline when pre-tokenized is specified.
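As a minimal sketch of the distinction discussed in this thread: an explicitly empty termpipelines= entry in data.properties could be treated as "no pipeline", while only a missing entry falls back to a default. This is not the ir_axioms implementation. The function names are hypothetical, the parser handles only simple key=value lines, and the comma-separated value format and the fallback of stopwords plus Porter stemmer are assumptions.

```python
from typing import Dict, List, Sequence


def parse_properties(text: str) -> Dict[str, str]:
    # Simplified reader for key=value lines; skips blanks and comments.
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def term_pipeline_from_properties(
    text: str,
    default: Sequence[str] = ("Stopwords", "PorterStemmer"),  # assumed fallback
) -> List[str]:
    props = parse_properties(text)
    if "termpipelines" not in props:
        # Key absent: keep the previous default behaviour.
        return list(default)
    # Key present: an empty value means "apply no pipeline stages".
    return [s.strip() for s in props["termpipelines"].split(",") if s.strip()]
```

The important design choice is to check for the key's presence rather than for a truthy value, so that an empty entry is not silently replaced by the default.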