wimglenn / advent-of-code-data

Get your Advent of Code data with a single import statement
MIT License

example data is not parsed correctly #99

Closed · emteeoh closed 1 year ago

emteeoh commented 1 year ago

Working on AoC 2022, day 7. The actual example_data is:

$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k

but aocd thinks it is:

$ system-update --please --pretty-please-with-sugar-on-top
Error: No space left on device

I guess you're just scraping the webpage for the first <code> tags... It would be nice if there were a way to either be smarter about it, or to let us get at all the example code fragments through aocd...

wimglenn commented 1 year ago

You're right, aocd.models.Puzzle.example_data was implemented fairly simply as "whatever the first code block contains". I also noticed that it got it wrong for AoC 2022, day 7 - there have been a few other puzzles where it misses too. Some puzzles don't even have an example at all!

I'm not sure returning all the code fragments would be that useful - 2022/day/7 contains 60 code blocks. There might be something clever we could do where we try to find a code block with "similar" structure or contents to the actual user's data, but that doesn't sound like an easy task to get 100% correct.

For context, this was a user request from #66, where I commented that we could make some best-effort attempt. Unfortunately, I don't really have a better idea here. I'll leave the issue open in case you (or someone else) want to attempt something smarter in a pull request.

Ryton commented 1 year ago

An idea:

  • The sentence before the code block (as far as I can tell) mostly contains (and often starts with) 'for example', and always ends with a colon ':'.

Some examples:


For example:
For example, consider this ticket:
For example, suppose that in your bag, you have adapters with the following joltage ratings:
For example, consider the following initial state:
For example, suppose you have the following list of contents from six rucksacks:

But not always. We (I?) could look for this text pattern and, if it's not present, return the first block - i.e. the current behaviour?
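
In Python, that pattern-matching could look something like this quick sketch (using BeautifulSoup; the exact tag layout of the puzzle pages is an assumption on my side):

import re
from bs4 import BeautifulSoup

def find_example_block(html):
    """Return the <pre> block introduced by a sentence like 'for example ...:',
    falling back to the first block (== current behaviour)."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("pre")
    for pre in blocks:
        intro = pre.find_previous_sibling("p")  # the sentence before the code block
        if intro is not None:
            text = intro.get_text().strip()
            if text.endswith(":") and re.search(r"\bfor example\b", text, re.IGNORECASE):
                return pre.get_text()
    return blocks[0].get_text() if blocks else None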

@emteeoh: If this happens, you can (manually) replace the text in the file "..config\aocd\github.\__example_input.txt" with the correct content, and then reload. But you probably already knew that?

wimglenn commented 1 year ago

It would also be nice to provide programmatic access to the example solutions. But the patterns in the HTML are not predictable enough to detect easily.

rnbwdsh commented 1 year ago

I wrote a small heuristic to find the test data:

  1. Substitute A-Z with A, a-z with a, and 0-9 with 1.
  2. Calculate an "inverse mutual information" of the character sets: len(different) / len(common).
  3. Take the first, lowest-difference-scoring block.
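
Roughly, in code (a condensed sketch of the idea, not the exact PR code):

import re

def normalize(text):
    # collapse character classes: A-Z -> A, a-z -> a, 0-9 -> 1
    text = re.sub(r"[A-Z]", "A", text)
    text = re.sub(r"[a-z]", "a", text)
    return re.sub(r"[0-9]", "1", text)

def difference(block, real_input):
    # "inverse mutual information" of the two normalized character sets
    a, b = set(normalize(block)), set(normalize(real_input))
    common, different = a & b, a ^ b
    return len(different) / len(common) if common else float("inf")

def pick_example(blocks, real_input):
    # min() keeps the first of equally low-scoring blocks
    return min(blocks, key=lambda b: difference(b, real_input))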

Here is a gist with the diff results for all of the inputs so far. I've only checked 2015, 2016, and 2022, but in general this seems to work WAY better.

I've also already created a PR: #100.

rnbwdsh commented 1 year ago

I've also created a bot that uses ChatGPT to "intelligently" extract the inputs and outputs. You must have a valid, confirmed email account and EMAIL/PASSWORD in your env.

Judging from the input extraction, this performs significantly worse than the heuristic. To be specific, only 69% of ChatGPT's outputs are a proper JSON file (even though I explicitly prompt for it), and 91% contain a valid JSON structure. But probably <60% are actually correct.

Here is a small statistics POC on another branch. You can review the answers yourself; hardcoding them for the years 2015-2021 would probably be the easiest solution, as this would reduce the load on the server, and it's like... 10 kB?

So if ChatGPT can't get it consistently right, I don't think there is a heuristic that can properly find it.

wimglenn commented 1 year ago

Hardcoding the locations of the code blocks is OK, but I do not think it is acceptable to redistribute the examples or the prose directly. And a better heuristic is still a good idea, because hardcoded code-block location offsets can only help with past puzzles, whereas the heuristic should help with future puzzles too.

rnbwdsh commented 1 year ago

I asked @ericwastl on Twitter whether he has any opinions on it, and linked this thread. Let's see what he answers.

undergroundmonorail commented 1 year ago

I wish I had seen this issue before trying it myself, haha. I went with a much simpler heuristic, just:

  • Split the input data and a potential example code block into "sections", where a section is just the string split on '\n\n'
  • If there are any candidate blocks with exactly the same number of sections as the input data, throw out all others
  • Return the one with the most similar average line length
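
In Python, that's roughly the following (a quick sketch; the names are just illustrative):

def pick_example(candidates, input_data):
    def sections(s):
        return s.split("\n\n")

    def avg_line_len(s):
        lines = s.splitlines() or [""]
        return sum(map(len, lines)) / len(lines)

    # keep only blocks with the input's exact section count, if there are any
    pool = [c for c in candidates if len(sections(c)) == len(sections(input_data))]
    pool = pool or candidates
    # return the candidate whose average line length is closest to the input's
    target = avg_line_len(input_data)
    return min(pool, key=lambda c: abs(avg_line_len(c) - target))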

It seems to perform fairly well. I wonder if there's a simple way to get an acceptably high hit rate.

rnbwdsh commented 1 year ago

Got a reply:

I don't currently have plans to support an API for this. You may not hard-code the puzzle text in the tool. I might some day add annotations that would support this, but currently I recommend actually reading the puzzle.

Hardcoding locations from previous years doesn't seem that elegant, especially if they can change. Hardcoding the proper texts would have been quite a robust solution, but I get his POV. And generally, a solution that works for not-yet-released puzzles would be best, even though I don't think extracting the example solutions is that easy. But the example inputs seem to be a mostly solvable thing.

The ChatGPT one was more of a fun experiment, in case you (@undergroundmonorail) were referring to your heuristic being simpler than it. Otherwise, I think my variant from the first post, with the 3 steps, is shorter than yours, already performs insanely well, and does better than the current one.

Your heuristic sounds nice in theory too, but not all levels are line-separated, and levels like 2019/16 have very short input lines in the "given" and very long ones in the solution. You can use the gist I linked to test yours, but I'm quite sure it performs worse.

wimglenn commented 1 year ago

v2.0.0 is released. All previous puzzles should have correct example parsing now. Future puzzles may not be scraped correctly at first, but they'll eventually be correct once the logic in the aocd-example-parser package is updated to handle them. The parser code will be kept separate from aocd, so that the parser implementation can be updated/improved without needing to cut a new release of advent-of-code-data itself.

Note: the puzzle.example_data (str) attribute is gone. Now we have puzzle.examples, which returns a list of Example instances. Example instances carry the associated answers too, when available.
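
Usage looks something like this, for the puzzle from this thread:

from aocd.models import Puzzle

puzzle = Puzzle(year=2022, day=7)
for example in puzzle.examples:
    print(example.input_data)  # the scraped example input
    print(example.answer_a)    # the part a answer, when available
    print(example.answer_b)    # the part b answer, when available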

The aoc runner script can now execute solvers against the example data/answers instead of the user's data/answers. That is done with aoc --example.

aocd now has a plugin system for parsing example data. The default parser is still the fairly naive "first pre block has the input data, last code block has the puzzle answer" idea. This approach is only right about 40% of the time.

For the people in this thread who were brainstorming better approaches, it should be easy to try them out in this new plugin system. A user's example parser can be plugged into aocd by supplying a function: your parser will be sent a Page instance with raw and parsed HTML, plus a list of "real" input datas for comparison, and it should return a list of Example instances to the caller. The module aocd_example_parser/plugins.py provides two such parser functions.
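
A minimal plugin could look something like this sketch (the Page attribute names here are assumptions on my part):

from aocd.examples import Example, Page  # import path assumed

def my_example_parser(page: Page, datas: list[str]) -> list[Example]:
    # the naive default idea: first <pre> block is the example input,
    # last <code> block is the part a answer
    pres = page.soup.find_all("pre")    # page.soup: parsed HTML (attribute name assumed)
    codes = page.soup.find_all("code")
    input_data = pres[0].get_text().rstrip("\n") if pres else ""
    answer_a = codes[-1].get_text() if codes else None
    return [Example(input_data, answer_a=answer_a)]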

The performance of an example parser can be checked against the reference implementation by using aoce --example-parser=my_example_parser.

@Ryton @undergroundmonorail @rnbwdsh If you're interested, try out the new parser plugin system and see if you can come up with a better example parser than aocd's default. Let me know what you think!