zhengxxn / adaptive-knn-mt


Data preprocessing steps used in the paper #4

Closed: mdrpanwar closed this issue 2 years ago

mdrpanwar commented 2 years ago

Thanks a lot for this awesome work, and for releasing the code!

I used your repository and was able to reproduce results for vanilla kNN-MT (K=8) and Adaptive kNN-MT (K=4) on the provided preprocessed data for the IT domain.

I have two queries:

  1. I would like to run your model on other datasets (e.g. WMT'19, as mentioned in Section 4.1 of the kNN-MT paper). Could you please point me to the preprocessing scripts I should use for this?

  2. Could you also confirm that the K reported in Table 2 of your paper is actually the max-k that the Meta-k network was trained with?

Thanks a lot!

zhengxxn commented 2 years ago
  1. The preprocessing consists of the following steps (see the sketch below):

    • Tokenize and apply BPE to the raw data files; you can refer to the scripts in examples/translation. Make sure you use the same BPE codes as the pre-trained model.
    • Binarize the data, following the official fairseq documentation.
    • Create the datastore and build the Faiss index, just as the README shows, but adjust the hyper-parameters to the dataset size.
  2. Yes, that's correct. Did you find anything incorrect in the code?
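
For concreteness, here is a minimal sketch of that pipeline. The paths, the language pair, and the tokenizer/BPE tool (Moses + subword-nmt) are placeholder assumptions, not the exact commands used in the paper; the repository's README scripts and the base model's actual BPE codes and dictionaries take precedence.

```bash
# Sketch only: all paths, the language pair, and the tokenizer/BPE tool are
# placeholders. Reuse the pre-trained base model's BPE codes and dictionaries.
SRC=de; TGT=en
RAW=path/to/raw; BPE=path/to/bpe; BIN=path/to/data-bin
mkdir -p $BPE $BIN

# 1) Tokenize and apply BPE with the base model's bpecodes
for split in train valid test; do
  for lang in $SRC $TGT; do
    perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l $lang \
      < $RAW/$split.$lang > $RAW/$split.tok.$lang
    subword-nmt apply-bpe -c bpecodes \
      < $RAW/$split.tok.$lang > $BPE/$split.bpe.$lang
  done
done

# 2) Binarize with fairseq, reusing the pre-trained model's dictionaries
fairseq-preprocess --source-lang $SRC --target-lang $TGT \
  --trainpref $BPE/train.bpe --validpref $BPE/valid.bpe --testpref $BPE/test.bpe \
  --srcdict dict.$SRC.txt --tgtdict dict.$TGT.txt \
  --destdir $BIN --workers 8

# 3) Create the datastore and build the Faiss index with the commands in this
#    repository's README, adjusting DSTORE_SIZE and the index hyper-parameters
#    to the dataset size.
```

Reusing the pre-trained model's dictionaries via --srcdict/--tgtdict (or --joined-dictionary for a shared vocabulary) is what keeps the binarized data compatible with the base model.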

mdrpanwar commented 2 years ago

Thanks a lot!

  1. I will follow these steps and get back in case I need more clarification.
  2. No, nothing incorrect as such. I just wanted to confirm my understanding.

I am closing the issue for now. I shall reopen if needed.

mdrpanwar commented 2 years ago

Hi,

I have a few more questions:

  1. I noticed that in the multi-domains dataset, the source and target languages share the same dictionary. Is that a requirement for (Adaptive) kNN-MT? What if we have a base model with different source and target dictionaries?

  2. How do you calculate the datastore size for a dataset? Since the datastore is created from training data, it seems that the number of unique tokens in the training set for the target language should be the datastore size. I just want to know how you compute it from the binarized data.

Your detailed steps above have been of much help. Thanks a lot!

zhengxxn commented 2 years ago

Sorry for the late reply.

  1. It's certainly fine to use a base model with different source and target dictionaries; it depends on how you trained the base model. Recent studies show that a shared dictionary (or shared BPE) helps translation quality, especially for European languages, so for De<->En, Fr<->En, etc., we usually use a shared dictionary. But that is not a requirement for NMT or for (Adaptive) kNN-MT.

  2. The DSTORE_SIZE is the number of target-language tokens in the training data. You can get it in two ways (see the sketch after this list):

    • Find it in the preprocess.log file, which fairseq-preprocess writes into the binarized data folder.
    • Calculate it with bash: wc -w plus wc -l on the raw (BPE-processed) target file, since we need an end-of-sentence token for each sentence.
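
To illustrate the second option, a minimal bash sketch (the file name is a placeholder for the BPE-processed target side of the training data):

```bash
# Placeholder file name: the BPE-processed target side of the training data.
TGT_FILE=train.bpe.en
TOKENS=$(wc -w < $TGT_FILE)   # number of target tokens
SENTS=$(wc -l < $TGT_FILE)    # one end-of-sentence token per sentence
echo $((TOKENS + SENTS))      # use this (or a slightly larger value) as DSTORE_SIZE
```
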
mdrpanwar commented 2 years ago

Thanks a lot for the details. It's very helpful.

  1. I noticed that the provided DSTORE_SIZE is the actual number of target tokens from the preprocess.log file, rounded up to a nearby multiple of 10. Any sufficiently large number would work, right? (Since the purpose is just to fit all target tokens in the datastore.)

  2. To use any new pre-trained base model with (Adaptive) kNN-MT, we need to register the architecture in adaptive-knn-mt/fairseq/models/transformer.py (where we can override the default values of the fairseq Transformer to match the base NMT model's architecture, e.g. layer sizes). Then we can pass this newly registered architecture as the --arch argument in the inference scripts. Is there any other step that I missed?

zhengxxn commented 2 years ago
  1. I think a value a little larger than, or equal to, the number of target tokens is fine, since empty items may have an impact on the training and retrieval of Faiss.

  2. Yep, if you only change the hidden size, number of layers, or similar settings of the standard Transformer architecture, you can simply register the new arch in transformer.py, like the "transformer_wmt19_de_en" at the end of that file.

mdrpanwar commented 2 years ago

Thank you very much!