web-archive-group / WALK

Web Archives for Longitudinal Knowledge

An Access Request for the HCF collection #7

Closed tsuomela closed 8 years ago

tsuomela commented 8 years ago

@ianmilligan1 @ruebot

UAL was recently contacted by the Edmonton Oilers to recover some images from the Heritage Community Foundation collection.

The seed is http://www.oilersheritage.com/

Would warcbase be able to extract the images from the seed after the collection is added to the cloud?

ruebot commented 8 years ago

It might be related to this ticket: https://github.com/lintool/warcbase/issues/211

tsuomela commented 8 years ago

At this point it would probably be enough to just extract everything from the single seed and then share that with the Oilers.

Being able to identify the image using a checksum is probably overkill for this initial request but certainly has utility for analysis and comparison of images.

ianmilligan1 commented 8 years ago

Right now, we could definitely grab all the image URLs, but then I think we'd have to wget them from the Internet Archive, which doesn't strike me as an ideal solution (unless there aren't that many images).
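A rough Python sketch of that approach: filter a list of URLs down to likely images by file extension, then build a Wayback Machine URL for each one. The helper names and the extension list are hypothetical. The `id_` suffix after the timestamp is assumed to return the raw capture without the Wayback toolbar, and the sketch assumes the captures are reachable through web.archive.org at all, which may not hold for this collection.

```python
from urllib.parse import urlsplit

# Assumption: these extensions cover the images of interest.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def is_image_url(url: str) -> bool:
    """Guess from the URL path alone whether the resource is an image."""
    path = urlsplit(url).path.lower()
    return path.endswith(IMAGE_EXTENSIONS)


def wayback_raw_url(url: str, timestamp: str) -> str:
    """Build a Wayback Machine URL for `url` near `timestamp` (YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}id_/{url}"
```

Each resulting URL could then be fetched with wget or `urllib.request`, one request per image, which is why this only scales to a modest number of images.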

ruebot commented 8 years ago

Heritage Community Foundation is done downloading, and can be found here: /data/heritage_community_foundation

Should I set up warcbase on the machine?

ianmilligan1 commented 8 years ago

Great! Sure, or I am happy to set it up.

ruebot commented 8 years ago

I can get it set up here in a few. I'm setting up the last few scripts to grab the low-priority datasets.

ruebot commented 8 years ago

Warcbase is all set up. I'd hold off on running anything until I get all the datasets copied over. That should be done in the next day or so.

tsuomela commented 8 years ago

Wonderful. Thanks, guys, for tackling this. I know the issue may be a bit tangential, but I thought it might interest some in the library community, because how to access WARC files and collections is one of the questions I hear from librarians. I'm curious to see what we can do through warcbase.

ianmilligan1 commented 8 years ago

Need to liaise with Jimmy at some point and see what the best way to do this is. We can run an image extract job and see how big the numbers are, but if it's a ton of images we'll want to extract them from the WARCs themselves rather than make a bunch of slow calls to the Internet Archive.
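For a sense of what extracting images from the WARCs themselves involves, here is a minimal, standard-library-only Python sketch. It is not warcbase code: it is a simplified reader that assumes well-formed records with a `Content-Length` header, and it ignores folded header lines, chunked transfer encoding, and compressed HTTP bodies. All function names are hypothetical.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def read_headers(stream: BinaryIO) -> Dict[str, str]:
    """Read 'Name: value' lines up to the first blank line; keys are lower-cased."""
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"", b"\r\n", b"\n"):
            return headers
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (WARC headers, record block) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = read_headers(stream)
        block = stream.read(int(headers.get("content-length", "0")))
        yield headers, block


def iter_images(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, image bytes) for response records with an image/* type."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        http = io.BytesIO(block)
        http.readline()  # discard the HTTP status line
        http_headers = read_headers(http)
        if http_headers.get("content-type", "").lower().startswith("image/"):
            yield headers.get("warc-target-uri", ""), http.read()
```

For a gzipped file, passing the object returned by `gzip.open(path, "rb")` as `stream` should work, since Python's `gzip` module reads concatenated members as one stream. Reading the payloads locally this way avoids one network request per image.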

ianmilligan1 commented 8 years ago

OK! Now that we have warcbase set up, I could write a job to get all the image URLs from the Oilers Heritage site. We could download them with wget and get them to you somehow.

Is this still useful to you @tsuomela? And is it only for one crawl or for multiple crawls?

ianmilligan1 commented 8 years ago

I'll close for now, unless we hear that this is still an active research question!