Closed: ianmilligan1 closed this issue 7 years ago
Ian will try to generate the text derivatives (we've had failures before).
Running:

import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/data/ALBERTA_alberta_education_curriculum/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_alberta_education_curriculum-not-boilerpiped")

RecordLoader.loadArchives("/data/ALBERTA_alberta_floods_2013/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_alberta_floods_2013-not-boilerpiped")

RecordLoader.loadArchives("/data/WAHR_ymmfire/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/WAHR_ymmfire-not-boilerpiped")

RecordLoader.loadArchives("/data/ALBERTA_idle_no_more/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_idle_no_more-not-boilerpiped")
and will then use bash to calculate word frequencies, etc.
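(Note that ExtractBoilerpipeText is imported above but never used, which is why the output directories are named "-not-boilerpiped". If we later want boilerplate removal, a minimal sketch along the same lines, shown for one collection, could look like this; the "-boilerpiped" output path is an assumption, not something we've run.)

// Sketch: same extraction as above, but using Boilerpipe to strip page
// boilerplate (navigation, headers, footers) rather than just removing HTML tags.
// The output directory name is hypothetical.
import org.warcbase.spark.matchbox.{ExtractBoilerpipeText, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/data/ALBERTA_alberta_education_curriculum/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, ExtractBoilerpipeText(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_alberta_education_curriculum-boilerpiped")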
@greebie Having issues w/ the Idle No More collection (others have generated well). Do you have a backup?
Seems like T-Space might be the best from a structural point of view. Snowden would be second. Let's do Snowden, since it's perhaps more meaningful.
Great. We already have that processed. I'll hopefully find a cycle today or tomorrow to calculate the word frequency/etc. from the raw extracted text!
OK now processing word frequencies with
cat TORONTO_snowden_archive-text.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr > snowden_frequency.txt
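The same pipeline can be reused for the other collections; a rough sketch, assuming the extracted text has been concatenated into *-text.txt files named as above (the glob and output names are assumptions):

# Run the same frequency pipeline over every concatenated text file in the
# current directory. Output naming (<collection>_frequency.txt) is hypothetical.
for f in *-text.txt; do
  base=$(basename "$f" -text.txt)
  tr -d '[:punct:]' < "$f" \
    | tr '[:upper:]' '[:lower:]' \
    | tr '[:space:]' '\n' \
    | grep -v "^\s*$" \
    | sort | uniq -c | sort -bnr > "${base}_frequency.txt"
done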
Progress:
Ok @greebie – frequency files done and available in /data/WALK-Compare/WALK-Compare/frequencies on WALK.
-rw-rw-r-- 1 ubuntu 27M Oct 29 15:51 ALBERTA_alberta_education_curriculum_frequency.txt
-rw-rw-r-- 1 ubuntu 327M Oct 29 16:52 ALBERTA_alberta_floods_frequency.txt
-rw-rw-r-- 1 ubuntu 6.0M Oct 29 15:30 snowden_frequency.txt
-rw-rw-r-- 1 ubuntu 8.6M Oct 29 15:38 UVIC_Calendar_frequency.txt
-rw-rw-r-- 1 ubuntu 57M Oct 29 17:04 WAHR_panamapapers_frequency.txt
Note the sizes (especially on floods and panama papers). I think maybe take the first X lines of each for your analysis? There's a lot of cruft in the floods file, not 100% sure why. Maybe use head to get 10,000 lines or something like that (see the sketch below)? If you do want me to trim them, just let me know how many lines you'd need to extract meaningful data.
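For example, something like this (file names taken from the listing above; the 10,000 cut-off and the _top10k output names are just placeholders):

# The frequency files are already sorted by count, so head keeps the most frequent terms.
head -n 10000 ALBERTA_alberta_floods_frequency.txt > ALBERTA_alberta_floods_frequency_top10k.txt
head -n 10000 WAHR_panamapapers_frequency.txt > WAHR_panamapapers_frequency_top10k.txt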
Actually I'll also put them into Dropbox and slack you the links.
Thanks Ian. It is fairly easy to select lines in Python, so all should be fine. I am working on this now.
I've applied them to the analysis in the latest push. More collections could make for a more interesting analysis, I think. It's also interesting because I was able to show which words were the matching ones in the correspondence analysis.
Fantastic! I'm happy to generate more word frequency data if that helps. Just let me know collections and I can do so. Looking forward to chatting tomorrow!
Okay. How about Francophonie, ymmfire, Alberta oil sands, and UVIC environmental orgs?
Also, if there are any others already available, let's add those for good luck.
Great! I have those four collections in plain text, so I'll begin running frequencies on them. Some of them are quite large, so it might take a while. We can chat today and see if this is something we want to bake in at the processing level.
And done! Here is /data/WALK-Compare/WALK-Compare/frequencies now.
-rw-rw-r-- 1 ubuntu ubuntu 27M Oct 29 15:51 ALBERTA_alberta_education_curriculum_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 327M Oct 29 16:52 ALBERTA_alberta_floods_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 13M Nov 11 14:29 ALBERTA_alberta_oil_sands-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 1.4G Nov 11 14:45 ALBERTA_lfrancophonie_de_louest_canadien-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 6.0M Oct 29 15:30 TORONTO_snowden_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 8.6M Oct 29 15:38 UVIC_Calendar_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 5.1M Nov 11 14:46 UVIC_environmental_organizations_and_resources_of_bc-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 57M Oct 29 17:04 WAHR_panamapapers_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 57M Nov 11 14:03 WAHR_panamapapers-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 6.5M Nov 11 15:10 WAHR_ymmfire-not-boilerpiped-frequency.txt
Word frequencies might be useful when comparing collections. If @greebie wants word-comparison data to compare collections, check out /data/derivatives/text and let me know; I can generate it and send it to you.