Closed tsuomela closed 8 years ago
It might be related to this ticket: https://github.com/lintool/warcbase/issues/211
At this point it would probably be enough to just extract everything from the single seed and then share that with the Oilers.
Being able to identify the image using a checksum is probably overkill for this initial request but certainly has utility for analysis and comparison of images.
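A minimal sketch of the checksum idea, assuming the image bytes have already been pulled out of the archive by some other step. The function names, the sample URLs, and the choice of MD5 are illustrative assumptions, not anything warcbase provides:

```python
import hashlib
from collections import defaultdict


def image_checksum(data: bytes) -> str:
    """Return the hex MD5 digest of the raw image bytes."""
    return hashlib.md5(data).hexdigest()


def group_by_checksum(images):
    """Map each checksum to the list of URLs whose bytes are identical.

    `images` is an iterable of (url, bytes) pairs (a hypothetical input shape).
    """
    groups = defaultdict(list)
    for url, data in images:
        groups[image_checksum(data)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data: two URLs that serve the same bytes, one that differs.
    sample = [
        ("http://www.oilersheritage.com/a.jpg", b"same-bytes"),
        ("http://www.oilersheritage.com/b.jpg", b"same-bytes"),
        ("http://www.oilersheritage.com/c.jpg", b"other-bytes"),
    ]
    for checksum, urls in group_by_checksum(sample).items():
        print(checksum, urls)
```

Grouping by digest like this would make it easy to spot the same image served under several URLs or captured in several crawls.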
Right now, we could definitely grab all the URLs, but then I think we'd have to wget them from the Internet Archive, which doesn't strike me as an ideal solution (unless there aren't that many images).
Heritage Community Foundation is done downloading, and can be found here: /data/heritage_community_foundation
Should I set up warcbase on the machine?
Great! Sure, or I am happy to set it up.
I can get it set up here in a few. Setting up the last few scripts to grab the low priority datasets.
Warcbase is all set up. I'd hold off on running anything until I get all the datasets copied over. That should be done in the next day or so.
Wonderful. Thanks guys for tackling this. I know the issue may be a bit tangential but I thought it may be of interest to some in the library community because accessing WARC files and collections is one of the questions I hear from librarians. I'm curious to see what we can do through warcbase.
Need to liaise with Jimmy at some point and see what the best way to do this is. We can run an image extract job and see how big the numbers are, but if it's a ton of images we'll want to extract them from the WARCs themselves rather than make a bunch of slow calls to the Internet Archive.
OK! Now that we have warcbase set up, I could write a job to get all the image URLs from the oilers heritage site. We could wget them down and get them to you somehow.
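As a rough illustration of what the URL-filtering step of such a job might look like, here is a standalone Python sketch. It is not warcbase code: it assumes a plain list of crawled URLs is already available, and the extension list, function names, and sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumed set of image extensions; adjust to whatever the crawl actually contains.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")
SEED_HOST = "www.oilersheritage.com"


def is_seed_image(url: str, host: str = SEED_HOST) -> bool:
    """True if the URL is on the seed host and its path ends in an image extension."""
    parsed = urlparse(url)
    return parsed.hostname == host and parsed.path.lower().endswith(IMAGE_EXTENSIONS)


def image_urls(urls, host: str = SEED_HOST):
    """Return the de-duplicated image URLs for the seed host, in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if is_seed_image(url, host) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    # Made-up crawl output for illustration only.
    crawled = [
        "http://www.oilersheritage.com/index.html",
        "http://www.oilersheritage.com/images/team.JPG",
        "http://www.oilersheritage.com/images/team.JPG",
        "http://example.com/images/other.png",
    ]
    for url in image_urls(crawled):
        print(url)
```

The resulting list could be written one URL per line and handed to wget with its `-i` option, which is only practical if the list turns out to be short.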
Is this still useful to you @tsuomela? And is it only for one crawl or for multiple crawls?
I'll close for now, unless we hear that this is still an active research question!
@ianmilligan1 @ruebot
UAL was recently contacted by the Edmonton Oilers to recover some images from the Heritage Community Foundation collection.
The seed is http://www.oilersheritage.com/
Would warcbase be able to extract the images from the seed after the collection is added to the cloud?