web-archive-group / WALK

Web Archives for Longitudinal Knowledge

Failures in Various Derivative Generations #12

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

I've gone through and extracted URLs and link networks for all the collections. However, we've had quite a few failures (you can read about all the fun stuff over in this warcbase issue).

The following collections have both URLs and networks:

alberta_education_curriculum/
alberta_floods_2013/
alberta_oil_sands/
canadian_business_grey_literature/
elxn42/
energy_environment/
hcf_alberta_online_encyclopedia/
health_sciences_grey_literature/
heritage_community_foundation/
humanities_computing/
lfrancophonie_de_louest_canadien/
ottawa_shooting_october_2014/
prarie_provinces/
web_archive_general/

The following collections have URLs. Of these, only idle_no_more has no network, as its network generation failed twice:

alberta_education_curriculum/
alberta_floods_2013/
alberta_oil_sands/
canadian_business_grey_literature/
elxn42/
energy_environment/
hcf_alberta_online_encyclopedia/
health_sciences_grey_literature/
heritage_community_foundation/
humanities_computing/
idle_no_more/
lfrancophonie_de_louest_canadien/
ottawa_shooting_october_2014/
prarie_provinces/
web_archive_general/

The following collections have no derivatives generated yet:

alberta_education_curriculum_francais/
circumpolar/
government_information/
university_of_alberta_websites/

ianmilligan1 commented 8 years ago

Update!

Government_Information and University_of_Alberta_Websites now have URLs generated. I am going to try to generate networks for Idle No More, Government Information, and UA Websites.

Circumpolar continues to fail, due to the presence of a 44GB WARC file.
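
For anyone hitting a similar failure, a minimal sketch of how one might spot unusually large archive files in a collection directory before running a job. It uses only the Java standard library and can be pasted into spark-shell. The helper names `warcSizes` and `oversized` and the 10 GB threshold are hypothetical, chosen for illustration; this is not part of warcbase and is not how the failure was eventually resolved.

```scala
import java.io.File

// Hypothetical helper: (file name, size in bytes) for every .gz file in a directory.
// Returns an empty sequence if the directory does not exist or cannot be read.
def warcSizes(dir: String): Seq[(String, Long)] = {
  val entries = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
  entries.toSeq
    .filter(f => f.isFile && f.getName.endsWith(".gz"))
    .map(f => (f.getName, f.length()))
}

// Hypothetical helper: keep only files larger than limitBytes, largest first.
def oversized(sizes: Seq[(String, Long)], limitBytes: Long): Seq[(String, Long)] =
  sizes.filter(_._2 > limitBytes).sortBy(-_._2)

// Assumed threshold of 10 GB, for illustration only.
val limit = 10L * 1024 * 1024 * 1024

oversized(warcSizes("/data/circumpolar"), limit).foreach {
  case (name, bytes) => println(f"$name%s\t${bytes / math.pow(1024, 3)}%.1f GB")
}
```

Any file this reports could then be moved aside or processed separately, so that the rest of the collection can still be run.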

ianmilligan1 commented 8 years ago

Currently running:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val circumpolar =
  RecordLoader.loadArchives("/data/circumpolar/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
  .countItems()
  .saveAsTextFile("/data/derivatives/urls/circumpolar")

I will post the error log once it crashes, which it still does even with a 300GB swap file in place.

ianmilligan1 commented 8 years ago

This was all fixed in a more recent warcbase build, and we've generated these derivatives.