zooniverse / Talk-archiver

A static site generator for old Talk forums, based on elevenpack.
Apache License 2.0

Remaining projects #56

Closed eatyourgreens closed 3 years ago

eatyourgreens commented 4 years ago

Larger projects (~50k subjects or more) that have yet to be fully archived.

Approved

Noting that all pending issues have been dealt with.

Pending

Using https://docs.google.com/document/d/1sfVy7O-dQK7vgWn10-f9oqNhnh2uKIyUzSe4lKizIEA/edit for reviews

eatyourgreens commented 4 years ago

Recents, boards and discussion pages are built for each of these projects. Users, collections, tags and subjects haven’t been done.

eatyourgreens commented 4 years ago

Tags should be done now for each of those projects.

eatyourgreens commented 4 years ago

Galaxy Zoo has many broken images from subjects that were hosted in their own S3 bucket, outside of zooniverse-static.

Screenshot of the Galaxy Zoo Talk home page, showing many broken images for recent subject comments.

wgranger commented 4 years ago

Just finished auditing Penguin Watch. Two things I've noticed:

beckyrother commented 4 years ago

Auditing Planet Hunters:

eatyourgreens commented 4 years ago

Planet Hunters:

One board is missing: https://talk.planethunters.org/#/boards/BPH0000008

https://talk.planethunters.org/boards/BPH0000008 is present, so I think this is fine.

Subject images missing on new site

Planet Hunters didn't use images as subjects. We decided quite early on that we weren't going to build custom subject viewers for each project. Subjects that can't display an image or a video won't be shown, but there should be a link to the original JSON data, including file locations and metadata for each subject.

eatyourgreens commented 4 years ago

Penguin Watch:

https://talk.penguinwatch.org/#/boards - under chat, has “we’re moving to a new slot with the Zooniverse”

https://talk.penguinwatch.org/boards/BPZ000000v is completely missing from the new site. It is listed as 0 posts and 0 discussions on the original site, which might explain why it was skipped. Science Gossip, similarly, has a discussion which is present on the old site but not present in the API responses that are used to build the static pages. https://talk.sciencegossip.org/#/boards/BSC0000003/discussions/DSC0000036

EDIT: removing the check on board.discussions here fixed the missing Penguin Watch board. I'm not sure of the implications of this for other projects, so I haven't committed that change. https://github.com/zooniverse/Talk-archiver/blob/b5eb813082970ea249111ec44a3f28ed95160e79/src/helpers/boards.js#L10-L20
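
To illustrate the kind of check being discussed, here is a minimal sketch, not the actual code in `src/helpers/boards.js`. The record shape, the `discussions` count field, the function names and the ID `BPZ000000a` are all assumptions made for the example. It shows how a filter on a zero discussion count would skip a board like BPZ000000v, and how dropping that condition keeps it.

```javascript
// Hypothetical board records. The field names are assumptions for this
// sketch, not the archiver's real schema. 'BPZ000000a' is a made-up ID.
const boards = [
  { id: 'BPZ000000a', title: 'Help', discussions: 12 },
  { id: 'BPZ000000v', title: 'Chat', discussions: 0 },
];

// Strict filter: drops any board whose reported discussion count is zero.
function boardsWithDiscussions(list) {
  return list.filter((board) => board.discussions > 0);
}

// Relaxed filter: keeps every board that has an ID, whatever its count says.
function allBoards(list) {
  return list.filter((board) => Boolean(board.id));
}

console.log(boardsWithDiscussions(boards).map((b) => b.id));
console.log(allBoards(boards).map((b) => b.id));
```

The trade-off is that the relaxed filter may also publish genuinely empty boards on other projects, which is the uncertainty noted above.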

eatyourgreens commented 4 years ago

Penguin Watch:

https://talk.penguinwatch.org/tags/egg/collections.html - shows no collections

Tagged collections seem to be broken in general. This Galaxy Zoo page is also empty, but should list 6 collections. https://talk.galaxyzoo.org/tags/edgeon/collections.html

EDIT: here's another example of broken collection tags. https://talk.milkywayproject.org/tags/starcluster/

This line should be tag.userCollections not tag.collections (collections is already used by Eleventy.) I don't know how many sites are affected by this bug. https://github.com/zooniverse/Talk-archiver/blob/b5eb813082970ea249111ec44a3f28ed95160e79/src/tags/userTags/collections.njk#L15
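
A small sketch of why the wrong property name produces an empty page. The tag record, its values and the `listCollections` helper are hypothetical; only the `userCollections` and `collections` names come from the comment above. A template loop over a missing property simply renders nothing, without raising an error.

```javascript
// Hypothetical tag record. Only the `userCollections` key is named in the
// thread; the values here are placeholders.
const tag = {
  name: 'edgeon',
  userCollections: ['collection-1', 'collection-2'],
};

// Mimics a template loop: a missing property behaves like an empty list,
// so the page is built successfully but shows no collections.
function listCollections(tagData, key) {
  return tagData[key] ?? [];
}

console.log(listCollections(tag, 'collections').length);     // wrong key
console.log(listCollections(tag, 'userCollections').length); // intended key
```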

srallen commented 4 years ago

Likely non-blocking issue for Disk Detective: https://github.com/zooniverse/Talk-archiver/issues/76

Approved.

mcbouslog commented 4 years ago

Operation War Diary - I think all of the following are known issues and have been identified previously or in this issue

Collections

Tags

eatyourgreens commented 4 years ago

Operation War Diary

https://talk.operationwardiary.org/tags/badscan/ (notes 10 discussions, shows 2)

All 10 are listed on https://talk.operationwardiary.org/tags/badscan/discussions.html but maybe using headings as links was a bad idea? See #16.

mcbouslog commented 4 years ago

All 10 are listed on https://talk.operationwardiary.org/tags/badscan/discussions.html but maybe using headings as links was a bad idea? See #16.

Ah I see now! Hmm, not sure, but I think it's fine as is, I think I just missed it.

camallen commented 4 years ago

All 10 are listed on https://talk.operationwardiary.org/tags/badscan/discussions.html but maybe using headings as links was a bad idea? See #16.

Ah I see now! Hmm, not sure, but I think it's fine as is, I think I just missed it.

@mcbouslog if all your pending comments are resolved - please approve the site and move the site to the approved section in https://github.com/zooniverse/Talk-archiver/issues/56#issue-649834276

mcbouslog commented 4 years ago

I'm not sure if:

Collections

Has been addressed?

I've updated my pending link to this comment, as I think this is the only remaining open item for Operation War Diary.

camallen commented 4 years ago

https://talk.galaxyzoo.org/recent - some pages still have broken images and need to be rebuilt with the new code, which uses the direct S3 thumbnail URLs instead of the thumbnail server. See https://talk.galaxyzoo.org/manifest/hosts.json and the underlying issue with thumbnails: https://github.com/zooniverse/thumbnailer/pull/14#issuecomment-672949262

We might be better served by merging the www.galaxyzoo.org/ bucket data into the zooniverse-static/www.galaxyzoo.org/ paths and then using the thumbnail / static server, to avoid serving data out of S3 in perpetuity, but this might be more effort than the task at hand requires. TBD

eatyourgreens commented 4 years ago

I'm seeing all the images load on https://talk.galaxyzoo.org/recent, since rebuilding that page on Friday using #77.

shaunanoordin commented 4 years ago

I'm taking on Chimp and See this week. Finally I can prove which one of us is the superior primate.

wgranger commented 4 years ago

Space Warps:

Comparing build logs:
https://talk.spacewarps.org/logs/build.log - Not seeing "users" appear in JSON/HTML output
https://talk.spacewarps.org/manifest/build.json - seeing 31,816 users here

Discussions linked to subjects:
https://talk.spacewarps.org/#/subjects/ASW0008kij - Mentions several linked discussions
https://talk.spacewarps.org/subjects/ASW0008kij/ - Doesn't mention any linked discussions

User collections don't sync up:
https://talk.spacewarps.org/#/users/c_cld
https://talk.spacewarps.org/users/c_cld/

Discussions linked to collections:
https://talk.spacewarps.org/collections/CSWL00000p/ - No discussions linked
https://talk.spacewarps.org/#/collections/CSWL00000p - Many discussions linked

Mismatch in discussion count linked to tags:
https://talk.spacewarps.org/#/search?tags[lens]=true
https://talk.spacewarps.org/tags/lens/

eatyourgreens commented 4 years ago

Space Warps:

User collections don't sync up https://talk.spacewarps.org/#/users/c_cld https://talk.spacewarps.org/users/c_cld/

Looks like c_cld was called C_cld at one point. Collections are matched up by exact matches on username. https://talk.spacewarps.org/#/collections/CSWS000h5x https://talk.spacewarps.org/collections/CSWS000h5x/ (links to a C_cld user profile that doesn't exist.)
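
The exact-match behaviour described above can be sketched as follows. The index structure, the `ownerName` field, the user ID and both function names are assumptions for illustration, not the archiver's data model. The case-insensitive variant is included only for contrast and is not a change anyone in the thread proposed.

```javascript
// Hypothetical data: the user index holds the current, lower-case username,
// while an older collection record still carries the earlier capitalisation.
const usersByName = new Map([['c_cld', { id: 'user-1', name: 'c_cld' }]]);
const collection = { id: 'CSWS000h5x', ownerName: 'C_cld' };

// Exact, case-sensitive match: returns undefined when the name was re-cased.
function findOwnerExact(index, record) {
  return index.get(record.ownerName);
}

// Case-insensitive match, shown only for contrast. It assumes no two
// accounts differ by case alone, which may not hold for real data.
function findOwnerIgnoringCase(index, record) {
  const wanted = record.ownerName.toLowerCase();
  for (const [name, user] of index) {
    if (name.toLowerCase() === wanted) return user;
  }
  return undefined;
}

console.log(findOwnerExact(usersByName, collection));
console.log(findOwnerIgnoringCase(usersByName, collection));
```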

Discussions linked to collections https://talk.spacewarps.org/collections/CSWL00000p/ - No discussions linked https://talk.spacewarps.org/#/collections/CSWL00000p - Many discussions linked

That's interesting. I haven't come across the "Discussions mentioning this" section in any other projects.

eatyourgreens commented 4 years ago

Space Warps is using the mentions feature. We can probably get subject.mentions from the API responses for subjects. I'm not sure if the collections export included collection.mentions for each collection. https://github.com/zooniverse/Talk/blob/2e8ad17390c1d623f1868d078379e73958ff74e4/app/views/focus/discussions.eco#L18-L24

eatyourgreens commented 4 years ago

Following up on the Space Warps mentions feature, I'm looking at the JSON files from the archived site.

Collections are built from the data exports, which don't include mentions. https://talk.spacewarps.org/api/collections/CSWL00000p.json

Subjects are archived directly from the Ouroboros API, and do include mentions (which we ignore in the HTML.) https://talk.spacewarps.org/api/subjects/ASW0008kij.json

eatyourgreens commented 4 years ago

Space Warps:

User collections don't sync up https://talk.spacewarps.org/#/users/c_cld https://talk.spacewarps.org/users/c_cld/

I've fixed this, for that account, by keying users by user ID, rather than name. However, I'm now running into a username collision for Space Warps. Two different user IDs are trying to use the same URL.

Output conflict: multiple input files are writing to `dist/api/users/pandamonium2956.json`. Use distinct `permalink` values to resolve this conflict.
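
One way to spot this kind of clash before the build runs is to group users by the path they would be written to. This is a rough sketch under assumptions: the user records, their IDs, and the `permalinkFor` and `findPermalinkCollisions` helpers are hypothetical, and the path pattern is only inferred from the error message above.

```javascript
// Hypothetical user records: two distinct IDs sharing one username.
const users = [
  { id: 'u1', name: 'pandamonium2956' },
  { id: 'u2', name: 'pandamonium2956' },
  { id: 'u3', name: 'c_cld' },
];

// Assumed output path pattern, based on the path in the error message.
function permalinkFor(user) {
  return `api/users/${user.name}.json`;
}

// Groups user IDs by output path and returns only the paths that are
// claimed by more than one user.
function findPermalinkCollisions(list) {
  const byPermalink = new Map();
  for (const user of list) {
    const permalink = permalinkFor(user);
    const ids = byPermalink.get(permalink) ?? [];
    ids.push(user.id);
    byPermalink.set(permalink, ids);
  }
  return [...byPermalink].filter(([, ids]) => ids.length > 1);
}

console.log(findPermalinkCollisions(users));
```

Keying the internal lookup by user ID fixes the matching problem, but the output path still depends on the username, so any colliding accounts need a distinct `permalink` value.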

camallen commented 4 years ago

I'm seeing all the images load on https://talk.galaxyzoo.org/recent, since rebuilding that page on Friday using #77.

@eatyourgreens excellent

eatyourgreens commented 4 years ago

Operation War Diary

If each of those extra 11 collections has been archived to its own page (eg. https://talk.operationwardiary.org/collections/CWDS0000sz/) then this isn't a problem. If any of them are missing, then that would be a problem.

camallen commented 4 years ago

All GZ pages rebuilt to use non s3 hosts, see https://talk.galaxyzoo.org/manifest/hosts.json and https://zooniverse.slack.com/archives/C0138Q1LVCL/p1598041404108100?thread_ts=1598015745.086600&cid=C0138Q1LVCL

AnLand commented 4 years ago

Hi! We have seen that videos containing humans have been removed from the static sites. For Chimp&See, this means that chimp and gorilla videos from habituated communities have also been removed, because researchers appear in the same video as a chimp or gorilla. Is it possible to differentiate here? The problem might be limited to the videos in this collection (as we hopefully tagged all of these cases with #habituated). I am posting here in addition to the respective Zooniverse Talk thread.

AnLand commented 4 years ago

Hi again! Two additions the Chimp&See science team asked me to report:

What would be cool - but is not essential! - is to sort the science board with the chimp matching sites according to their number: https://talk.chimpandsee.org/boards/ :-)

Thanks!

camallen commented 4 years ago

@AnLand we've explicitly dropped the "Discussions mentioning this" feature, see https://github.com/zooniverse/Talk-archiver/issues/80#issuecomment-676577771

Images are missing partly from discussion threads and I am not sure about the pattern, e.g., here with all kinds of different displays, not displays, misses, etc.

Can you please provide URL links that we can review?

What would be cool - but is not essential! - is to sort the science board with the chimp matching sites according to their number: https://talk.chimpandsee.org/boards/ :-)

We won't have time to do this. Our aim here was to turn off our old API infrastructure (save $$$) but archive the volunteer content for posterity.

Noting this effort is not only for Chimp & See but for all our projects that ran on this infrastructure (~36 projects) and I believe we have achieved these aims.

AnLand commented 4 years ago

@camallen Thanks for checking - whatever is possible to achieve! Here is a link that shows the different image displays within one thread quite well: https://talk.chimpandsee.org/boards/BCP000000s/discussions/DCP0001uc1/ Sorry for not including it earlier.

camallen commented 4 years ago

@AnLand those images are now showing, e.g. https://talk.chimpandsee.org/boards/BCP000000s/discussions/DCP0001uc1/

AnLand commented 4 years ago

Looks all good to me. Thank you so much!

AnLand commented 4 years ago

Hi again, two links to discussion boards are suddenly empty. They worked last week and all other boards seem to be fine. Could you please have a look? Thanks!

It seems that I am able to open discussions within this folder, when I find them via google search.

camallen commented 4 years ago

I'm seeing content for the links above. Perhaps an intermittent issue?

eatyourgreens commented 4 years ago

That's interesting. I've been consistently getting a blank page here, but only when I access it on my phone. https://talk.galaxyzoo.org/boards/BGZ0000008/discussions/DGZ0002r39/

I'd thought it was my phone, but maybe there's an issue with caching for these URLs?

AnLand commented 4 years ago

I emptied my cache and can now see this page https://talk.chimpandsee.org/boards/BCP000000v/, but now the boards are blank: https://talk.chimpandsee.org/boards/ Sorry, I can't provide any more information.

eatyourgreens commented 4 years ago

@AnLand thanks, that's useful to know. If you visit the blank page in a private window, does it still come up empty? Also, if you right click on the blank page and choose View Page Source, do you get any HTML code at all for the page?

I'd check myself, but https://talk.chimpandsee.org/boards/ loads successfully for me.

AnLand commented 4 years ago

You're asking for this view, view-source:https://talk.chimpandsee.org/boards/, right? No, there is nothing. Just blank.

eatyourgreens commented 4 years ago

Thanks, that's exactly what I wanted to know. It sounds like the browser isn't downloading anything at all, not even a partial page.

camallen commented 3 years ago

@AnLand I cannot reproduce this behaviour at all. I see content via https://talk.chimpandsee.org/boards/ and view-source:https://talk.chimpandsee.org/boards/

Can you test again and, if it is still not working, provide your browser details via https://www.whatismybrowser.com/?

AnLand commented 3 years ago

I tested again and all seems to be fine now. Thank you! (Sorry for the late response as well.)

camallen commented 3 years ago

these have all been done