Closed ianmilligan1 closed 8 years ago
Update!
Government_Information and University_of_Alberta_Websites now have URLs generated. I am going to try to generate networks for Idle No More, Government Information, and UA Websites.
Circumpolar continues to fail, due to the presence of a 44 GB WARC file.
Running
    import org.warcbase.spark.matchbox._
    import org.warcbase.spark.rdd.RecordRDD._

    val circumpolar =
      RecordLoader.loadArchives("/data/circumpolar/*.gz", sc)
        .keepValidPages()
        .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
        .countItems()
        .saveAsTextFile("/data/derivatives/urls/circumpolar")
will error (log to follow once it crashes), even with a 300 GB swap file in place.
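For anyone unfamiliar with what the URL derivative contains, here is a minimal sketch in plain Scala (no Spark or warcbase needed) of the (crawl month, domain) tally the script above is meant to produce. `Page`, `UrlCounts`, and the sample URLs are hypothetical names for illustration only. `domain` is an assumed stand-in for `ExtractDomain`, and the descending sort is an assumption about how `countItems` orders its output, not a copy of warcbase's implementation.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical stand-in for an archive record: crawl month (YYYYMM) plus URL.
final case class Page(crawlMonth: String, url: String)

object UrlCounts {
  // Host part of a URL, or "" when the URL cannot be parsed.
  // Assumed analogue of ExtractDomain; the real helper may behave differently.
  def domain(url: String): String =
    Try(Option(new URI(url).getHost).getOrElse("")).getOrElse("")

  // Tally pages per (month, domain) pair, most frequent pair first;
  // ties are broken by the (month, domain) key in ascending order.
  def countItems(pages: Seq[Page]): Seq[((String, String), Int)] =
    pages
      .map(p => (p.crawlMonth, domain(p.url)))
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size) }
      .toSeq
      .sortBy { case (key, n) => (-n, key) }

  def main(args: Array[String]): Unit = {
    // Made-up sample records, not taken from any of the collections.
    val sample = Seq(
      Page("201401", "http://example.ca/a"),
      Page("201401", "http://example.ca/b"),
      Page("201402", "http://example.org/")
    )
    countItems(sample).foreach(println)
  }
}
```

The real job does the same grouping across every record in the collection, which is why a single very large WARC can dominate the run.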
This was all fixed in a more recent warcbase build, and we've generated these derivatives.
I've gone through and extracted URLs and link networks for all the collections. However, we've had quite a few failures (you can read about all the fun stuff over in this warcbase issue).
The following collections have both URLs and networks:
The following collections have only URLs, no networks - IdleNoMore failed twice:
The following collections have no derivatives generated yet: