webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Apache License 2.0

Taking too long to load Coursera archive files #38

Open cksajil opened 6 years ago

cksajil commented 6 years ago

Hi,

I have tried to open the Coursera archive files (https://archive.org/details/archiveteam_coursera), but they take too long to load. Even after loading, the pages are not rendered properly. I think this is because the files are huge (~50GB). Is there a way to view the archives efficiently?

ikreymer commented 6 years ago

We're looking at ways to use the existing CDX indexes, which in this case you can also download from archive.org along with the WARC files. That would speed up the loading.
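To illustrate why an existing CDX index helps, here is a minimal sketch (not Webrecorder Player's code). It assumes the common 11-field CDX layout (`N b a m s k r M S V g`), where the ninth and tenth fields are the compressed record length and byte offset, and it assumes each WARC record is stored as its own gzip member. The helper names `parse_cdx_line` and `read_record_at` are hypothetical.

```python
import zlib


def parse_cdx_line(line):
    """Split one CDX line into the fields needed for a lookup.

    Assumption: 11 space-separated fields in the order
    urlkey, timestamp, url, mime, status, digest, redirect, meta,
    compressed length, offset, filename.
    """
    fields = line.split()
    return {
        "urlkey": fields[0],
        "timestamp": fields[1],
        "url": fields[2],
        "length": int(fields[8]),
        "offset": int(fields[9]),
        "filename": fields[10],
    }


def read_record_at(warc_path, offset, length):
    """Return the decompressed bytes of the single gzip member at `offset`."""
    with open(warc_path, "rb") as f:
        f.seek(offset)
        compressed = f.read(length)
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(compressed)
```

With an index entry in hand, one record can be fetched with a single seek instead of decompressing the whole file, which is where the speed-up would come from.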

However, since this particular data set comes from a crawl, there's no guarantee that a complete page is contained in one WARC file; it may be spread over several files (individual items in that collection). You may need to open more than one WARC file and use a combined index to see all the data. This scale of operation is outside the scope of Webrecorder Player at this point.
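A combined index could be sketched roughly as follows, assuming each downloaded CDX file is already sorted (CDX files usually are) and that any header line starts with `CDX` after optional whitespace. The function names are hypothetical, and this is only an illustration of the idea, not something the player does today.

```python
import heapq


def iter_cdx_lines(path):
    """Yield the data lines of one CDX file, skipping headers and blanks."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            stripped = line.strip()
            if not stripped or stripped.startswith("CDX "):
                continue
            yield line.rstrip("\n")


def merge_cdx(paths):
    """Lazily merge several sorted CDX files into one sorted stream.

    Assumption: every input file is sorted on its own; heapq.merge
    then yields a globally sorted sequence without loading the files
    into memory.
    """
    return heapq.merge(*(iter_cdx_lines(p) for p in paths))
```

Because the last CDX field names the WARC file, a merged entry still tells you which file to open for a given URL.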

What is the exact use case that you have? Are you trying to get the entire Coursera collection running locally, or just looking for a few pages?

cksajil commented 6 years ago

Hi, last year Coursera migrated to its new platform and, while doing so, removed some of its old courses (https://www.class-central.com/report/coursera-old-platform-shutdown-download-courses/). Archive Team was able to preserve most of them, and they are available at https://archive.org/details/archiveteam_coursera.

As I understand it, each file contains a set of courses (please see https://github.com/cksajil/Bioinfo/blob/master/foo_names.md). I wanted to see the content of the course High Performance Scientific Computing, which is in this archive (https://archive.org/download/archiveteam_coursera_20160627202720/coursera_20160627202720.megawarc.warc.gz). But after about 1 hour on my laptop (i5, 4GB), only the first 500 entries are shown, and most of them are blank template pages. What if I want to see beyond 500 entries?
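As a possible workaround outside the player, a WARC file can be scanned as a stream to list only the URLs of interest. The following is a rough standard-library sketch, not a full WARC parser: it assumes well-formed records with a `Content-Length` header, and the function names and the search string are hypothetical. It keeps memory use small, but it still has to decompress the whole file once.

```python
import gzip


def iter_warc_headers(stream):
    """Yield a dict of WARC headers for each record, skipping payloads."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body in chunks so large payloads never sit in memory.
        remaining = int(headers.get("content-length", 0))
        while remaining > 0:
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                return  # truncated file
            remaining -= len(chunk)
        yield headers


def find_urls(warc_gz_path, needle):
    """Yield target URIs of response records whose URL contains `needle`."""
    with gzip.open(warc_gz_path, "rb") as stream:
        for headers in iter_warc_headers(stream):
            uri = headers.get("warc-target-uri", "")
            if headers.get("warc-type") == "response" and needle in uri:
                yield uri
```

For example, `for url in find_urls("coursera_20160627202720.megawarc.warc.gz", "scicomp"): print(url)` would print matching URLs as they are found; the search string here is only a guess at what the course URLs contain.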

TSSans-art commented 5 years ago

I think I am having the same problem. I torrented this archive, https://archive.org/details/warc_wonderfl_net_20151009, and I can't open any of parts 0-7 in Webrecorder Player; it just gets to 100% and says "Please wait while the archive is indexed..."

3lit3h4xx0r666 commented 4 years ago

So what's the deal? Did you figure out how to read the index file, or has development stalled like the re-indexing of the already indexed files? I'm tryin' to look at this porn and this thing ain't working. Nothing worse than downloading a giant porn file and then you're all ready to sit back and look at it and, oh... wait... you gotta write some code.

3lit3h4xx0r666 commented 4 years ago

anyway. where do i start?