webis-de / wi21-query-obfuscation-with-keyqueries


Issue with Anserini version + with SearchArgs #16

Open liv-daliberti opened 11 months ago

liv-daliberti commented 11 months ago

Hello -- this repo calls for anserini-0.9.5-SNAPSHOT. I've checked Anserini's tagged versions, and it does not appear that 0.9.5 exists! This matters because there appears to be a large discrepancy between the SearchArgs in Anserini version 10.0.0 and the SearchArgs expected by this code base. What do you recommend? :)

mam10eks commented 11 months ago

Hello :)

Nice that you stumbled upon this repo! I think the 0.9.5-SNAPSHOT version was, at some point, the version mentioned in the pom file on the main branch of anserini.

I found the source code for which the anserini .pom specifies version 0.9.5-SNAPSHOT, e.g., here: https://github.com/castorini/anserini/blob/2af2026a0ea8ba367b7ff9c8300322d9e72d8b45/pom.xml#L5

I also have the maven artifacts that we used in the experiments, so you can use exactly the same anserini jar as we had. I uploaded them as a zip file here: https://files.webis.de/data-in-production/data-research/wi21-query-obfuscation/anserini-0.9.5-snapshot.zip

If you unzip this and place the contents of the zip file (e.g., anserini-0.9.5-SNAPSHOT.jar, anserini-0.9.5-SNAPSHOT.pom, etc.) into your local maven repository, this should resolve your problem. For example, I have maven 3.8 on arch linux, and there I have to place the files into the directory ~/.m2/repository/io/anserini/anserini/0.9.5-SNAPSHOT.
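The manual copy described above can be sketched as a small shell function. This is only an illustration: the function name `install_into_m2` and its optional repository-root argument are hypothetical, and the layout assumes Maven's default local repository structure, in which the dots of a groupId become directory separators.

```shell
# Hypothetical helper: copy already-unzipped artifact files into the local Maven repository.
# Usage: install_into_m2 <src_dir> <group_id> <artifact_id> <version> [m2_root]
install_into_m2() {
    src_dir=$1
    group_id=$2
    artifact_id=$3
    version=$4
    m2_root=${5:-"$HOME/.m2/repository"}

    # Maven maps the dots of a groupId to directory separators.
    group_path=$(printf '%s' "$group_id" | tr '.' '/')
    target="$m2_root/$group_path/$artifact_id/$version"

    mkdir -p "$target" || return 1
    cp "$src_dir"/* "$target"/ || return 1

    # Print the directory that now holds the artifact files.
    printf '%s\n' "$target"
}
```

Assuming the zip was extracted into a directory called `./anserini-0.9.5-snapshot` (an assumed name), the call would be `install_into_m2 ./anserini-0.9.5-snapshot io.anserini anserini 0.9.5-SNAPSHOT`.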

Best regards, Maik

liv-daliberti commented 10 months ago

This returned a 504 error on my end: https://files.webis.de/data-in-production/data-research/wi21-query-obfuscation/anserini-0.9.5-snapshot.zip

Additionally -- I'm not able to access the repo.webis-snapshots.de -- this dependency produces an issue. Any repo.webis materials appear to be off-limits. Do you have any suggestions here?

I would love to rebuild your materials exactly as shown to reproduce the jupyter notebook results.

liv-daliberti commented 10 months ago

To clarify -- I have access to ClueWeb09 data; I am missing some of the dependencies to construct this repo in order to recreate :)

mam10eks commented 10 months ago

Ah, indeed, the 504 error is because we have a scheduled downtime of our cluster today and tomorrow (sorry, I forgot that).

In two days, you should be able to download this again (i.e., on 31.08.2023).

Best regards,

Maik

Voronsky commented 10 months ago

Hello!

I also followed the steps here to try to recompile what is in the repository, and I've hit somewhat of a wall. Based on the discussion here, I was able to use your anserini-0.9.5-snapshot.jar file (much appreciated), but the compilation got stuck trying to resolve dependencies for

de.webis.corpora:webis-uuid:1.0.jar

It seems the pom file uses a repo that is unavailable (it returns a 500 error)

                <repository>
                        <id>repo.webis.de</id>
                        <url>https://repo.webis.de/artifactory/libs-release/</url>
                </repository>
                <repository>
                        <id>repo.webis-snapshots.de</id>
                        <url>https://repo.webis.de/artifactory/libs-snapshot-webis-gradle/</url>
                </repository>

That said, I looked through github and saw this repository: https://github.com/chatnoir-eu/webis-uuid , of matching name and seems affiliated with the webis organization?

Anyhow, after compiling their .jar artifact and building this project against it, I received this:

[INFO] Compiling 72 source files to /anserini/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /anserini/src/main/java/de/webis/keyqueries/KeyQueryChecker.java:[8,33] package com.google.common.collect does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/DocumentTfIdfKeyQueryCandidateGenerator.java:[10,33] package com.google.common.collect does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/chatnoir/NounPhraseExtraction.java:[18,33] package com.google.common.collect does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/chatnoir/SensitiveTerms.java:[13,32] package org.apache.commons.lang3 does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/DocumentCollectionTfIdfKeyQueryCandidateGenerator.java:[12,38] package org.apache.commons.lang3.tuple does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/DocumentCollectionTfIdfKeyQueryCandidateGenerator.java:[14,33] package com.google.common.collect does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/util/Util.java:[40,33] package com.google.common.collect does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/util/Util.java:[42,38] package org.apache.commons.lang3.tuple does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/util/Util.java:[327,32] cannot find symbol
  symbol:   class Pair
  location: interface de.webis.keyqueries.util.Util
[ERROR] /anserini/src/main/java/de/webis/keyqueries/util/Util.java:[331,32] cannot find symbol
  symbol:   class Pair
  location: interface de.webis.keyqueries.util.Util
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/chatnoir/JawsSynonyms.java:[7,30] package edu.smu.tspell.wordnet does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/chatnoir/JawsSynonyms.java:[8,30] package edu.smu.tspell.wordnet does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/chatnoir/JawsSynonyms.java:[9,30] package edu.smu.tspell.wordnet does not exist
[ERROR] /anserini/src/main/java/de/webis/keyqueries/generators/chatnoir/JawsSynonyms.java:[52,44] cannot find symbol
  symbol:   class WordNetDatabase
  location: class de.webis.keyqueries.generators.chatnoir.JawsSynonyms

My initial thought is that what is in the chatnoir-eu repository is clearly not what this repo's pom.xml is asking for.

Would you also be able to upload the webis-uuid artifact that you used, so that I can compile this successfully? At this point the build seems to depend on it.

Thank you :)

mam10eks commented 10 months ago

Hi all, I uploaded the corresponding files to https://files.webis.de/data-in-production/data-research/wi21-query-obfuscation/de-webis-corpora-webis-uuid-1-0.zip

With that, you should hopefully be able to compile it.

The zip de-webis-corpora-webis-uuid-1-0.zip contains all files that you have to move into your local maven repository, i.e., into the directory ~/.m2/repository/de/webis/corpora/webis-uuid/1.0.

I hope this resolves the problems :)

Best regards, Maik

mam10eks commented 10 months ago

Dear all,

as there were quite a few issues with dependencies that went missing over time, I have now built a docker development image webis/wi21-query-obfuscation-with-keyqueries:0.0.1-dev that already contains all dependencies. The source code compiles without errors in this docker image. I also adapted the compile-anserini.sh file to use this new docker image.

I added a file .devcontainer.json to the repository that points to this development docker image. VS-Code can now open the project in this container, and the issues with the anserini version should disappear with this.

For some reason, the unit tests no longer compile, but as we did not change anything in the code, I assume this is "ok-ish", as I am pretty sure I only committed code for which the tests were successful. I had to add -DskipTests=true -Dmaven.test.skip=true to the maven command so that it skips the tests. I also had to add a program configuration to the appassembly plugin in maven (this was a bit weird to me because the versions did not change; maybe the appassembly plugin now expects a different configuration although the version is the same).

Does this docker image resolve your problems?

I really like dev-containers as they simplify everything quite substantially. You can find a very good introduction to them here: https://code.visualstudio.com/docs/devcontainers/containers

Best regards, Maik

Voronsky commented 10 months ago

Yes, I made progress! Thank you for the update. I was able to build successfully via compile-anserini.sh, although I've hit another snag.

If I continue with the instructions in the README.md, when I run

./crypsor-indexing/index_everything.sh

ls: cannot access '/mnt/ceph/storage/data-in-progress/data-research/web-search/private-web-search-with-keyqueries/reranking-index-anserini/allow-lists': No such file or directory

I get that error. Judging by the last part of the path, it expects a folder that was perhaps never created. So I created the folder allow-lists/ inside the path where it is expected. I re-ran the script and simply got no output, and the folder remains empty.

If I then run ./arampatzis-hbc.sh

I am hit with this Java exception due to a missing class

Error: Could not find or load main class de.webis.crypsor.BuildArampatzisHbc
Caused by: java.lang.ClassNotFoundException: de.webis.crypsor.BuildArampatzisHbc

The same happens for the other bash scripts as well.

Am I missing a step perhaps? Thanks again for your continued help!

mam10eks commented 10 months ago

Ah, I remember, the /mnt/ceph/storage was an internal path that I forgot to upload.

This directory contains allow-lists with document ids that are used for follow-up experiments. I now uploaded https://files.webis.de/data-in-production/data-research/wi21-query-obfuscation/allow-lists.zip which contains all the expected content of this directory. Can you please re-run this step with all the files contained in allow-lists.zip extracted into your allow-lists/ directory?

The problem with the Error: Could not find or load main class de.webis.crypsor.BuildArampatzisHbc seems to be unrelated. Is there a jar called target/anserini-0.9.5-SNAPSHOT-fatjar.jar in your compiled resources? This jar is configured here, maybe it is named differently on your system for some reason?
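A quick way to answer both questions (which jars were built, and whether one of them contains the class) could look like the following sketch. The helper names `list_jars` and `jar_has_class` are hypothetical, and the second one assumes that the `unzip` tool is installed.

```shell
# Hypothetical helper: list the jar files directly inside a build directory (default: target).
list_jars() {
    find "${1:-target}" -maxdepth 1 -type f -name '*.jar' | sort
}

# Hypothetical helper: succeed if <jar> contains the class <fully.qualified.Name>.
# Exit status 2 means the jar itself is missing. Assumes the unzip tool is installed.
jar_has_class() {
    jar_path=$1
    [ -f "$jar_path" ] || { echo "missing jar: $jar_path" >&2; return 2; }

    # A jar is a zip archive; a class is stored as path/to/Name.class.
    entry="$(printf '%s' "$2" | tr '.' '/').class"
    unzip -l "$jar_path" | grep -qF "$entry"
}
```

For example, `list_jars target` shows how the jars are actually named, and `jar_has_class target/anserini-0.9.5-SNAPSHOT-fatjar.jar de.webis.crypsor.BuildArampatzisHbc` checks for the class from the error message.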

Best regards,

Maik

Voronsky commented 10 months ago

Regarding the allow-lists: I now have all that data in the proper path. I then cd'ed into the crypsor-indexing folder, because index_everything.sh tries to run another bash script at the same level, run-indexing-command.sh, and otherwise fails with:

./crypsor-indexing/index_everything.sh: line 5: ./run-indexing-command.sh: No such file or directory

Once I went into that folder and ran it again, I got this error:

/usr/local/bin/mvn-entrypoint.sh: 50: exec: ./target/appassembler/bin/IndexCollection: not found

Seems the binary IndexCollection is missing. This is all I have in the target folder

target/appassembler/bin/
├── app
└── app.bat

0 directories, 2 files

As for the unrelated error, I in fact did not have that jar file in the target folder. I moved it over, reran the command, and was met with this:

Read results from /mnt/ceph/storage/data-in-progress/data-research/web-search/private-web-search-with-keyqueries/scrambling-on-anserini/arampatzis-bm25/2.jsonl
Exception in thread "main" java.nio.file.NoSuchFileException: /mnt/ceph/storage/data-in-progress/data-research/web-search/private-web-search-with-keyqueries/scrambling-on-anserini/arampatzis-bm25/2.jsonl
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219)
    at java.base/java.nio.file.Files.newByteChannel(Files.java:371)
    at java.base/java.nio.file.Files.newByteChannel(Files.java:422)
    at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:420)
    at java.base/java.nio.file.Files.newInputStream(Files.java:156)
    at java.base/java.nio.file.Files.newBufferedReader(Files.java:2839)
    at java.base/java.nio.file.Files.readAllLines(Files.java:3330)
    at java.base/java.nio.file.Files.readAllLines(Files.java:3370)
    at de.webis.crypsor.BuildArampatzisHbc.process(BuildArampatzisHbc.java:36)
    at de.webis.crypsor.BuildArampatzisHbc.main(BuildArampatzisHbc.java:27)

It seems there is a missing jsonl file that it expects?

mam10eks commented 10 months ago

Ok, the problem with the missing IndexCollection seems to come from the changes in the appassembler-maven-plugin. I specified the version of this plugin in the pom, but it still seems to name things differently and also behave differently than three years ago when we initially implemented this.

I tracked the changes that the original Anserini pom.xml needed because of the changes in the appassembler-maven-plugin and committed them. With that, re-compiling produces the IndexCollection binary again.

I am currently zipping the contents of the scrambling-on-anserini directory and will upload the archive so that you have all the required data. This might take a while: uncompressed, it is 50GB, and I hope that it amounts to 10GB or so when compressed. With a zip archive, you can still extract only the sub-paths that you need.

I will post a message as soon as this zip file is available online.

Best regards, Maik

mam10eks commented 10 months ago

Compression is awesome: this directory scrambling-on-anserini was 50GB on disk, but the archive is only 1GB.

You can download it via https://files.webis.de/data-in-production/data-research/wi21-query-obfuscation/scrambling-on-anserini.zip, and then extract the subpaths of this zip that are used by the tool.

Voronsky commented 10 months ago

Greetings, first of all thank you for the missing 50GB of data. That helped reproduce further steps in the README.md . So I was able to successfully do this part

./arampatzis-hbc.sh

and saw all the jsonl files being written. But with that said, I still cannot get the indexing to work.

This error still persists:

/usr/local/bin/mvn-entrypoint.sh: 50: exec: ./target/appassembler/bin/IndexCollection: not found

Based on the script, it seems to expect that binary to be inside the docker container itself, but it is not there.

I do have it in my target folder :

target/appassembler/bin/
├── ApproximateNearestNeighborEval
├── ApproximateNearestNeighborEval.bat
├── ApproximateNearestNeighborSearch
├── ApproximateNearestNeighborSearch.bat
├── DumpAnalyzedQueries
├── DumpAnalyzedQueries.bat
├── ExtractAverageDocumentLength
├── ExtractAverageDocumentLength.bat
├── ExtractDocumentLengths
├── ExtractDocumentLengths.bat
├── ExtractNorms
├── ExtractNorms.bat
├── ExtractTopDfTerms
├── ExtractTopDfTerms.bat
├── FeatureExtractorCli
├── FeatureExtractorCli.bat
├── IndexCollection
├── IndexCollection.bat
├── IndexHnswDenseVectors
├── IndexHnswDenseVectors.bat
├── IndexReaderUtils
├── IndexReaderUtils.bat
├── IndexVectors
├── IndexVectors.bat
├── SearchCollection
├── SearchCollection.bat
├── SearchHnswDenseVectors
├── SearchHnswDenseVectors.bat
├── SimpleSearcher
├── SimpleSearcher.bat
├── SimpleTweetSearcher
└── SimpleTweetSearcher.bat

0 directories, 32 files

But I am wondering whether the script is missing a step that places that binary inside the docker container. From what I gathered, the docker container does not have it at the path where it expects it, hence the error.

mam10eks commented 10 months ago

Alright, that's a good start.

For the indexing: at least you now have the IndexCollection binary. I guess there is now only some kind of an "off by one" error in some paths.

I.e., this run-indexing-command.sh script tries to mount the root of this repository to /anserini in the docker container and if there is the target directory that you posted above it should work.

But it seems like this run-indexing-command.sh script assumes that you execute it from within the crypsor-indexing directory. Could it be that you have to execute it from there? At least from the corresponding line in the script, I assume it has to be executed from this directory.
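One common way to make a script independent of the caller's working directory is to resolve the script's own location first. The following is a generic sketch of that idiom, not the repository's code; the helper name `script_dir_of` is hypothetical.

```shell
# Hypothetical helper: print the absolute directory that contains the given script.
# The subshell keeps the caller's working directory unchanged.
script_dir_of() {
    (cd "$(dirname "$1")" && pwd)
}

# Inside a script such as index_everything.sh, one could then write (sketch):
#   cd "$(script_dir_of "$0")" || exit 1
#   ./run-indexing-command.sh ...
```

With this, relative paths such as ./run-indexing-command.sh resolve the same way no matter where the outer script is started from.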

Best regards,

Maik

Voronsky commented 10 months ago

Yes. As per the README.md instructions, the initial steps were to clone the repo, cd into it, and run the compile script, which now works. But the next step says this:

./crypsor-indexing/index_everything.sh

That script assumes another exists at the same level as where you ran it. https://github.com/webis-de/wi21-query-obfuscation-with-keyqueries/blob/91ba37663d18d6573f6b1d2523550c9e475e55f6/crypsor-indexing/index_everything.sh#L5

The run-indexing-command.sh script, will then pull in data into the /anserini folder inside that Docker container

https://github.com/webis-de/wi21-query-obfuscation-with-keyqueries/blob/main/crypsor-indexing/run-indexing-command.sh#L9

But it seems the subsequent lines try to call, inside that docker container, a script at a path where it does not exist, which seems to produce this error:

/usr/local/bin/mvn-entrypoint.sh: 50: exec: ./target/appassembler/bin/IndexCollection: not found

Now what I have done to somewhat isolate the issue was that I placed the generated /target folder , with all of its generated binaries from the compile command, inside the crypsor-indexing folder and I still get that same error.

wi21-query-obfuscation-with-keyqueries/crypsor-indexing$ tree -L 1
.
├── indexes
├── index_everything.sh
├── rank-everything.sh
├── run-indexing-command.sh
├── run-retrieval-command.sh
└── target

I also took those scripts out and placed it at the same directory level as where the /target folder is generated initially, and I still get that same error.

So it could be that I'm missing a step or that my environment is not as it should be, but I'm wondering if there might be an issue in the docker container instead?

Thanks for the prolonged help here, I do hope this has not been daunting for you. I feel we are almost half way there! As I was so far able to run the compile command and the arampatzis command successfully :smile:

mam10eks commented 10 months ago

Dear @Voronsky,

I also think we are close to resolving this :) I made this commit, which resolves the "off by one error" in the index command so that the indexing works as described in the readme again: https://github.com/webis-de/wi21-query-obfuscation-with-keyqueries/commit/962aeda51c61a2f1c0eed3238b77068d71549dea

Does this also resolve the problem on your end?

Best regards, Maik

Voronsky commented 10 months ago

Huzzah, that commit moved us further along!

So yes, I was now able to index everything in the ceph path via the crypsor script. Yet with that said, when I now run the final part

./run-all-cw09.sh

I am met with this error:

Run Topic 2
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in MMapDirectory@/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@33bc72d1: files: []
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:675)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:76)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)
    at io.anserini.search.SimpleSearcher.<init>(SimpleSearcher.java:202)
    at io.anserini.search.SimpleSearcher.<init>(SimpleSearcher.java:183)
    at de.webis.crypsor.CrypsorArgs.getSimpleSearcher(CrypsorArgs.java:52)
    at de.webis.crypsor.Main.<init>(Main.java:40)
    at de.webis.crypsor.Main.main(Main.java:59)
real    0m1.324s
user    0m0.008s
sys 0m0.016s

The "no segments* file found" message tells me it was trying to find something that does not exist in the /index subfolder. Do I need to have that in a specific folder path, or what might I be missing here?
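A Lucene index directory contains a commit file named segments_N, so whether a directory holds an index at all can be checked before starting the run. A minimal sketch, with the hypothetical helper name `has_lucene_segments`:

```shell
# Hypothetical helper: succeed if the directory contains at least one Lucene segments_N file.
has_lucene_segments() {
    index_dir=$1
    for f in "$index_dir"/segments_*; do
        # If the glob matched nothing, $f is the literal pattern and this test fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

Running `has_lucene_segments /path/to/index || echo "no index here"` on the host directory that gets mounted as /index would show whether the mount source is empty, which the `files: []` part of the exception suggests.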

Given that the script is named after running this against the ClueWeb09 data, I looked inside the script itself. It seems to call this other script, which tries to load data from a /mnt/raid path into the docker container. Is that something I need to create myself and put the data in, or am I missing information here?

https://github.com/webis-de/wi21-query-obfuscation-with-keyqueries/blob/main/run-cw09-topic.sh#L5

Thank you!

mam10eks commented 10 months ago

Awesome, now we are close to resolving the problem :)

Indeed, please replace /mnt/raid/data/av80sybu/anserini/indexes/lucene-index.cw12b13 with the location of your ClueWeb12b13 index and /mnt/ceph/storage/data-in-progress/data-research/web-search/private-web-search-with-keyqueries/scrambling-on-anserini with the directory where you want to have stored the outputs.
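The replacement can be done by hand in an editor; as a sketch, it could also be scripted with sed. The helper name `replace_path` is hypothetical, and it prints the rewritten script to standard output instead of editing the file in place.

```shell
# Hypothetical helper: print <file> with every occurrence of <old> replaced by <new>.
# Both paths must not contain '|' or '&', which are special in this sed expression.
# <old> is treated as a regular expression, so a '.' in it matches any character.
replace_path() {
    old=$1
    new=$2
    file=$3
    sed "s|$old|$new|g" "$file"
}
```

For example, `replace_path /mnt/raid/data/av80sybu/anserini/indexes/lucene-index.cw12b13 /data/cw12b13-index run-cw09-topic.sh > run-cw09-topic.local.sh`, where /data/cw12b13-index and the output file name are placeholders for your own locations.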

I created the ClueWeb12b13 index using this documentation (pointing to the current version of Anserini): https://github.com/castorini/anserini/blob/0ad723af3c935cbb3305192d31492e612a824a38/docs/regressions-cw12b13.md

Best regards, Maik

liv-daliberti commented 10 months ago

We've got the ClueWeb09 data; I'm going to get the ClueWeb12 data mailed to us. I'd thought it might be possible to do without the ClueWeb12 data. I was incorrect. We're going to get that data ordered out to us ASAP (and then we will pick this effort back up!)

mam10eks commented 10 months ago

Awesome, that sounds good! The ClueWeb12 is not required per se; we only used it to have a more realistic setup where the corpora of the private and the public search engine are different.

You could also set up a similar environment by splitting the ClueWeb09 into two parts.
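How to split is left open here. As one possible illustration, assuming a text file with one document id per line and assuming that two disjoint halves are wanted, the ids could be divided by alternating lines. The helper name `split_ids` and the file names are hypothetical.

```shell
# Hypothetical helper: write odd-numbered lines of <ids_file> to <out_a>
# and even-numbered lines to <out_b>, giving two disjoint id lists.
split_ids() {
    awk -v a="$2" -v b="$3" 'NR % 2 == 1 { print > a; next } { print > b }' "$1"
}
```

A call such as `split_ids cw09-ids.txt private-ids.txt public-ids.txt` would then yield one id list per simulated search engine.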