yourealwaysbe / forkyz

Forkyz Crosswords
GNU General Public License v3.0

Support for automatic downloading of non-dated puzzles such as Private Eye #50

Closed: pseudomonas closed this issue 2 years ago

pseudomonas commented 2 years ago

Private Eye has excellent Across Lite-format puzzles, where the URL corresponds not to the date but to the crossword number.

eg https://www.private-eye.co.uk/crossword links to https://www.private-eye.co.uk/pictures/crossword/download/729.puz

There seem to be a few approaches to getting these:

a) derive the crossword number from the date (tricky, given that the Eye has some double issues and so is not strictly fortnightly)

b) visit the /crossword URL and scrape out the link (if present; some specials have larger-than-15×15 crosswords which are PDF-only)

c) visit the parent directory https://www.private-eye.co.uk/pictures/crossword/download/ and extract the most recent .puz file there (relies on the directory remaining browsable)
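A minimal sketch of option (b), assuming the page layout stays roughly as in the example above: fetch the /crossword page and pull out the first numbered .puz link. The regex and URL handling here are assumptions, not Forkyz code; a PDF-only special would simply yield no match.

```python
import re
from typing import Optional

CROSSWORD_PAGE = "https://www.private-eye.co.uk/crossword"

def latest_puz_url(html: str) -> Optional[str]:
    # Look for a quoted link ending in a numbered .puz file,
    # e.g. /pictures/crossword/download/729.puz
    match = re.search(r'["\'](\S*?/download/(\d+)\.puz)["\']', html)
    if not match:
        return None  # PDF-only special, or the page layout changed
    link = match.group(1)
    if link.startswith("http"):
        return link
    # Resolve a site-relative link against the Private Eye host.
    return "https://www.private-eye.co.uk" + link

# Usage (requires network access):
# import urllib.request
# with urllib.request.urlopen(CROSSWORD_PAGE) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
# print(latest_puz_url(html))
```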

Obviously I can download them manually when I remember; this is not a high-priority request.

yourealwaysbe commented 2 years ago

Thanks for these.

I think (b) is the best way of doing it, and should be fairly straightforward when i get around to it.

pseudomonas commented 2 years ago

I agree that (b) is the sensible approach. Scheduling a page check for "every Thursday" should be OK, even if slightly over half the Thursdays yield no new crossword.

yourealwaysbe commented 2 years ago

Could you give an example of the not quite fortnightly numbering? Glancing through, i just see that over Christmas there's an issue missing, but the numbering also skips accordingly. Maybe deriving the number from the date will work.

yourealwaysbe commented 2 years ago

Ah, i see some irregularity. There's a 5-week break over Christmas 2020/21 and, by the looks of it, a one-week shift at another point. The publication date on the back-issues page also seems to show the day of the week changing over the years.

The dates on the download index are also not reliable -- they can be a few days before the publication (often on Tuesdays).

Option (c) might be better. There's already the page scraper facility (in the "experimental" sources). It would just need a slight adjustment to download the most recent puzzle not already downloaded (currently it takes the first three from the top of the page, i think).
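The adjustment described above could look roughly like this sketch of option (c): collect every numbered .puz link from the directory index, subtract the ones already downloaded, and take the highest remaining number. This is an illustration only, not the actual Forkyz scraper; the directory URL and filename pattern are taken from the example earlier in the thread.

```python
import re
from typing import Optional, Set

DOWNLOAD_DIR = "https://www.private-eye.co.uk/pictures/crossword/download/"

def newest_undownloaded_puz(index_html: str, already_have: Set[int]) -> Optional[str]:
    # Collect every NNN.puz link on the (assumed browsable) index page.
    numbers = {int(n) for n in re.findall(r'(\d+)\.puz', index_html)}
    new = numbers - already_have
    if not new:
        return None  # nothing new since the last check
    # The highest number is the most recent issue's puzzle.
    return DOWNLOAD_DIR + "%d.puz" % max(new)
```

Tracking `already_have` as a set of issue numbers sidesteps the unreliable dates on the index entirely, since ordering by number is what actually matters.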