web-arena-x / webarena

Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
https://webarena.dev
Apache License 2.0

Issue with the reddit website? #136

Open gagb opened 2 months ago

gagb commented 2 months ago

I am not super sure but I noticed something weird with the reddit website in the benchmark:

I was looking at task 29: "Tell me the count of comments that have received more downvotes than upvotes for the user who made the latest post on the DIY forum."

But when I go to the website, something doesn't look right. If I sort by hot, the website shows a submission that was made a day ago. But if I sort by new, the website shows the latest post as being at least a year old. Am I not reading this correctly?

If the website is broken, it will confuse the agents and I am not sure we can have a valid evaluation.

cc: @afourney @cheng-tan

peterychang commented 2 months ago

It looks like there's a task to create a post in the DIY sub about midjourney, which is where those entries are coming from... Maybe we should create a tool that resets the sites before the tests get run? Not sure how to avoid conflicting with multiple people running simultaneously though

afourney commented 2 months ago

I think the broader point is that "New" does not appear to be accurate. There's something functionally wrong with the sort order.

shuyanzhou commented 2 months ago

Hi @gagb @peterychang, the demo websites we hosted are not recommended as test environments for the exact reason you mentioned. They are only for demonstration purposes.

To ensure reproducibility, you can find how to host the websites here. After running the full evaluation set, you can reset the environment to a deterministic initial state.

Let me dig into the issue of "New" functionality.

gagb commented 2 months ago

Yes my comment was more about the unexpected behavior of the Hot/New button.

afourney commented 2 months ago

@gagb, my understanding is that this behavior was found on our own locally-hosted site, correct? The New/Hot functionality is not working as expected on at least one of our locally-hosted instances.

@shuyanzhou I believe the concern is that there are many questions that operate on the same forums or projects, and so running those questions in parallel, or in a different order, can lead to situations like this. Is the advice to reset the Docker images after each task? Or after each run?

gagb commented 2 months ago

Yes, on our locally hosted instance.

shuyanzhou commented 2 months ago

Thank you all, I can reproduce the issue of sorting by New. It smells like the timestamps of posts are wrong? @frankxu2004 can you investigate it?

@afourney, good question. The instances need to be reset to their initial states after evaluating the full set of 812 examples. Please check out the details here. We have ordered the examples to prevent earlier ones from influencing later evaluations. For example, if there is a task of checking the newest post on subreddit X and a task of making a post on X, the second one will always be executed later than the first one. That being said, parallel running can cause issues. We have this script that can roughly support 3-4 parallel runs. It is also possible to reset the environment every N examples.
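
As a rough illustration of the "reset every N examples" idea, here is a minimal sketch. It is not the WebArena runner: `run_task` and `reset_env` are hypothetical callables standing in for whatever evaluates one example and restores the sites to their initial state, and the task ids are assumed to be iterated in the official order.

```python
from typing import Callable, Iterable, List


def run_with_periodic_reset(
    task_ids: Iterable[int],
    run_task: Callable[[int], None],   # hypothetical: evaluates one example
    reset_env: Callable[[], None],     # hypothetical: restores the initial site state
    reset_every: int,
) -> List[str]:
    """Run tasks in the given order, resetting the environment every `reset_every` tasks.

    Returns a log of what happened, which makes the schedule easy to inspect.
    """
    if reset_every < 1:
        raise ValueError("reset_every must be >= 1")

    log: List[str] = []

    # Start from a known state before the first task.
    reset_env()
    log.append("reset")

    for count, task_id in enumerate(task_ids, start=1):
        run_task(task_id)
        log.append(f"task {task_id}")
        # Reset after every N-th task so later tasks see a clean state.
        if count % reset_every == 0:
            reset_env()
            log.append("reset")

    return log
```

With `reset_every` equal to the size of the full set, this reduces to a single reset before and after the whole run; smaller values trade extra reset time for less state carried over between tasks.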

gagb commented 2 months ago

Awesome! Really appreciate the quick response. We'll try to use this order.

frankxu2004 commented 1 month ago

Hey @gagb @afourney, sorry for the late response. We can confirm that there's a small bug in the "New" sort option. Internally it uses the post id as the sort criterion, and newly created posts have a very small id compared to pre-populated posts, which keep their original Reddit ID, so new posts end up sorted below old ones.

To solve this issue, we can change the sort criterion for the "New" button. Do note that doing so might invalidate test cases in the original WebArena dataset, so we might not change the official Docker image. However, if you would like to fix this issue, please go inside the postmill Docker container and modify src/Pagination/SubmissionPage.php. Change

    Submission::SORT_NEW => ['id' => true],

to

    Submission::SORT_NEW => ['timestamp' => true],

And modify the isFieldValid function to the following:

    public function isFieldValid(string $fieldName, $value): bool {
        switch ($fieldName) {
        // Integer-valued fields.
        case 'ranking':
        case 'id':
        case 'netScore':
        case 'commentCount':
            return is_numeric($value) && \is_int(+$value);
        // Date-valued fields must parse as ATOM-formatted date strings.
        case 'lastActive':
            return (bool) @\DateTime::createFromFormat(\DateTime::ATOM, $value);
        // Added case: validate 'timestamp' the same way as 'lastActive'.
        case 'timestamp':
            return (bool) @\DateTime::createFromFormat(\DateTime::ATOM, $value);
        default:
            return false;
        }
    }

Save the file, and the "New" sort should then work properly by sorting by timestamp.
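
To make the cause concrete, here is a small sketch of why sorting by id misorders new posts. It is only an illustration in Python, not Postmill code: the `Post` type, the ids, and the dates are all made up, and "newest first" is assumed to mean descending order.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Post(NamedTuple):
    id: int              # made-up ids for illustration
    title: str
    timestamp: datetime  # made-up creation times


def sort_new_by_id(posts: List[Post]) -> List[Post]:
    """Buggy 'New' ordering: highest id first."""
    return sorted(posts, key=lambda p: p.id, reverse=True)


def sort_new_by_timestamp(posts: List[Post]) -> List[Post]:
    """Fixed 'New' ordering: most recent timestamp first."""
    return sorted(posts, key=lambda p: p.timestamp, reverse=True)


# A pre-populated post keeps a large imported id; a freshly created post
# gets a small local id even though it is more recent.
EXAMPLE_POSTS = [
    Post(120000, "pre-populated post", datetime(2022, 1, 1, tzinfo=timezone.utc)),
    Post(5, "newly created post", datetime(2024, 1, 1, tzinfo=timezone.utc)),
]
```

Sorting `EXAMPLE_POSTS` by id puts the pre-populated post first, while sorting by timestamp puts the newly created post first, which is what a "New" listing is expected to show.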