[Open] gagb opened this issue 2 months ago
It looks like there's a task to create a post in the DIY sub about midjourney, which is where those entries are coming from... Maybe we should create a tool that resets the sites before the tests get run? Not sure how to avoid conflicting with multiple people running simultaneously though
I think the broader point is that "New" does not appear to be accurate. There's something functionally wrong with the sort order.
Hi @gagb @peterychang, the demo websites we hosted are not recommended as test environments for the exact reason you mentioned. They are only for demonstration purposes.
To ensure reproducibility, you can find how to host the websites here. After running the full evaluation set, you can reset the environment to a deterministic initial state.
Let me dig into the issue of "New" functionality.
Yes my comment was more about the unexpected behavior of the Hot/New button.
@gagb, my understanding is that this behavior was found on our own locally-hosted site, correct? The New/Hot functionality is not working as expected on at least one of our locally-hosted instances.
@shuyanzhou I believe the concern is that there are many questions that operate on the same forums or projects, and so running those questions in parallel, or in a different order, can lead to situations like this. Is the advice to reset the Docker images after each task? Or after each run?
Yes, on our locally-hosted instance.
Thank you all, I can reproduce the issue of sorting by "New". It smells like the timestamps of posts are wrong? @frankxu2004 can you investigate it?

@afourney, good question. The instances need to be reset to their initial states after evaluating the full set of 812 examples. Please check out the details here. We have ordered the examples to prevent earlier ones from influencing later evaluations. For example, if there is a task of checking the newest post on subreddit X and a task of making a post on X, the second one will always be executed after the first. That said, parallel running can cause issues. We have this script that can roughly support 3-4 parallel runs. It is also possible to reset the environment every N examples.
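The ordering idea above can be sketched as a simple sort: group tasks by the resource they touch, then run read-only tasks before tasks that write to that resource. This is only an illustration; the task schema and field names below are hypothetical, not the benchmark's actual format.

```python
# Hypothetical task records: a read-only task and a write task on the same
# subreddit. The "writes" flag is an illustrative field, not WebArena's schema.
tasks = [
    {"id": 2, "resource": "subreddit_X", "writes": True},   # make a post on X
    {"id": 1, "resource": "subreddit_X", "writes": False},  # check newest post on X
]

# Stable sort: group by resource, then put read-only tasks first
# (False sorts before True), so reads never see earlier writes.
ordered = sorted(tasks, key=lambda t: (t["resource"], t["writes"]))
print([t["id"] for t in ordered])  # [1, 2]
```

With this ordering, the "check the newest post" task always executes before the "make a post" task on the same subreddit, which is why sequential runs in the prescribed order stay reproducible while parallel runs can still interfere.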
Awesome! Really appreciate the quick response. We'll try to use this order.
Hey @gagb @afourney, sorry for the late response. We can confirm that there's a small bug in the "New" sort option. Internally it uses the post ID as the sort criterion, but posts created during a run get very small IDs, while the pre-populated posts keep their original (much larger) Reddit IDs, so new posts sort behind old ones. To solve this, we can change the sort criterion for the "New" button. Do note that doing so might invalidate test cases in the original WebArena dataset, so we may not change the official Docker image. However, if you would like to fix this issue locally, go inside the Postmill Docker container and modify the following code:
src/Pagination/SubmissionPage.php
Change
Submission::SORT_NEW => ['id' => true],
to
Submission::SORT_NEW => ['timestamp' => true],
And modify this function to the following:
public function isFieldValid(string $fieldName, $value): bool {
    switch ($fieldName) {
        case 'ranking':
        case 'id':
        case 'netScore':
        case 'commentCount':
            return is_numeric($value) && \is_int(+$value);
        case 'lastActive':
        case 'timestamp':
            return (bool) @\DateTime::createFromFormat(\DateTime::ATOM, $value);
        default:
            return false;
    }
}
Save the file and it should work properly by sorting by timestamp.
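The effect of the fix can be sketched outside PHP. The records below are invented for illustration: an imported post keeps a large Reddit-style numeric ID despite being old, while a post created during a run gets a small auto-increment ID, so sorting by ID descending ranks the old post first, whereas sorting by timestamp surfaces the genuinely newest post.

```python
from datetime import datetime

# Illustrative data: pre-populated posts keep their original (large) numeric
# IDs; posts created during a benchmark run get small auto-increment IDs.
posts = [
    {"id": 987654, "timestamp": datetime(2022, 3, 1)},   # imported, old
    {"id": 3, "timestamp": datetime(2023, 5, 20)},       # created during a run
]

by_id = sorted(posts, key=lambda p: p["id"], reverse=True)         # buggy "New"
by_time = sorted(posts, key=lambda p: p["timestamp"], reverse=True)  # fixed "New"

print(by_id[0]["id"])    # 987654 -> old imported post wrongly ranks first
print(by_time[0]["id"])  # 3 -> timestamp sort surfaces the actually-new post
```

This is exactly the mismatch reported in the thread: under the ID-based sort, a post "at least a year old" appears at the top of "New".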
I am not super sure but I noticed something weird with the reddit website in the benchmark:
I was looking at task 29: "Tell me the count of comments that have received more downvotes than upvotes for the user who made the latest post on the DIY forum."
But when I go to the website, something doesn't look right. If I sort by Hot, the website shows a submission made a day ago; but if I sort by New, the latest post shown is at least a year old. Am I not reading this correctly?
If the website is broken, it will confuse the agents and I am not sure we can have a valid evaluation.
cc: @afourney @cheng-tan