web-archive-group / WALK

Web Archives for Longitudinal Knowledge

Word frequency on collections #46

Closed ianmilligan1 closed 7 years ago

ianmilligan1 commented 8 years ago

Word frequency might be useful when comparing collections. @greebie, if you want word frequency data to compare collections, check out /data/derivatives/text and let me know. I can generate the data and send it to you.

ianmilligan1 commented 8 years ago

Ian will try to generate these (we've had failures).

ianmilligan1 commented 8 years ago

Running

import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/data/ALBERTA_alberta_education_curriculum/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_alberta_education_curriculum-not-boilerpiped")

RecordLoader.loadArchives("/data/ALBERTA_alberta_floods_2013/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_alberta_floods_2013-not-boilerpiped")

RecordLoader.loadArchives("/data/WAHR_ymmfire/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/WAHR_ymmfire-not-boilerpiped")

RecordLoader.loadArchives("/data/ALBERTA_idle_no_more/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/derivatives/text/ALBERTA_idle_no_more-not-boilerpiped")

and will then use bash to calculate word frequency, etc.

ianmilligan1 commented 8 years ago

@greebie Having issues w/ the Idle No More collection (others have generated well). Do you have a backup?

greebie commented 8 years ago

Seems like T-Space might be the best from a structural point of view. Snowden would be second. Let's do Snowden, since it's perhaps more meaningful.

ianmilligan1 commented 8 years ago

Great. We already have that processed. I'll hopefully find a cycle today or tomorrow to calculate the word frequency/etc. from the raw extracted text!

ianmilligan1 commented 8 years ago

OK now processing word frequencies with

cat TORONTO_snowden_archive-text.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr > snowden_frequency.txt
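
For reference, a minimal sketch of what that pipeline does, run on a tiny made-up sample rather than a real collection. The `word_freq` function name and the sample text are assumptions for illustration only. It uses the POSIX `[[:space:]]` class in `grep` instead of `\s`, which should behave the same here.

```shell
# Hypothetical helper wrapping the pipeline above; reads text on stdin.
word_freq() {
  tr -d '[:punct:]' \
    | tr '[:upper:]' '[:lower:]' \
    | tr '[:space:]' '\n' \
    | grep -v '^[[:space:]]*$' \
    | sort | uniq -c | sort -bnr
}

# Made-up sample text, not taken from any collection.
sample_counts=$(printf 'The archive, the Archive.\nThe web!\n' | word_freq)

# Each output line is "<count> <word>", most frequent first.
printf '%s\n' "$sample_counts"
```

On this sample, `the` should come out on top with a count of 3, followed by `archive` (2) and `web` (1). Quoting the character classes matters: unquoted, the shell may expand `[:punct:]` as a glob if a matching single-character filename exists in the working directory.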

ianmilligan1 commented 8 years ago

Ok @greebie – frequency files done and available in /data/WALK-Compare/WALK-Compare/frequencies on WALK.

-rw-rw-r-- 1 ubuntu  27M Oct 29 15:51 ALBERTA_alberta_education_curriculum_frequency.txt
-rw-rw-r-- 1 ubuntu 327M Oct 29 16:52 ALBERTA_alberta_floods_frequency.txt
-rw-rw-r-- 1 ubuntu 6.0M Oct 29 15:30 snowden_frequency.txt
-rw-rw-r-- 1 ubuntu 8.6M Oct 29 15:38 UVIC_Calendar_frequency.txt
-rw-rw-r-- 1 ubuntu  57M Oct 29 17:04 WAHR_panamapapers_frequency.txt

Note the sizes (esp. for floods and Panama Papers). Maybe take the first X lines of each for your analysis? There's a lot of cruft in the floods file; I'm not 100% sure why. You could use head to get 10,000 lines or something like that. Let me know if you'd like me to do so, and how many values you'd need to extract meaningful data.
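
A rough sketch of that truncation, assuming the frequency files are already sorted with the most frequent words first (as `sort -bnr` leaves them). The `top_n` helper, the `_top<N>` output naming, and the demo directory are all hypothetical, not something that exists on WALK.

```shell
# Hypothetical helper: copy the first N lines of every *frequency.txt file
# in src_dir into out_dir, leaving the originals untouched.
top_n() {
  n=$1
  src_dir=$2
  out_dir=$3
  mkdir -p "$out_dir"
  for f in "$src_dir"/*frequency.txt; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    head -n "$n" "$f" > "$out_dir/$(basename "$f" .txt)_top${n}.txt"
  done
}

# Demo on a throwaway file with 25 dummy lines (not real frequency data).
demo_dir=$(mktemp -d)
seq 1 25 > "$demo_dir/demo_frequency.txt"
top_n 10 "$demo_dir" "$demo_dir/top"
wc -l < "$demo_dir/top/demo_frequency_top10.txt"
```

On the real files this would be something like `top_n 10000 /data/WALK-Compare/WALK-Compare/frequencies ./top`, with the line count adjusted to whatever turns out to be meaningful.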

ianmilligan1 commented 8 years ago

Actually I'll also put them into Dropbox and slack you the links.

greebie commented 8 years ago

Thanks Ian. It is fairly easy to select lines in Python, so all should be fine. I am working on this now.

greebie commented 8 years ago

I've applied them to the analysis in the latest push. I think more collections would make for a more interesting analysis. It's also interesting because I was able to show which words matched in the correspondence analysis.

ianmilligan1 commented 8 years ago

Fantastic! I'm happy to generate more word frequency data if that helps. Just let me know collections and I can do so. Looking forward to chatting tomorrow!

greebie commented 8 years ago

Okay. How about Francophonie, ymmfire, Alberta oil sands and UVIC environmental orgs?

greebie commented 8 years ago

Also, if you have any others that are already available, let's add those for good luck.

ianmilligan1 commented 8 years ago

Great! I have those four collections in plain text, so I will begin running frequency on them. Some of them are quite large, so it might take a while. We can chat today and see if this is something we want to bake in at the processing level.

ianmilligan1 commented 8 years ago

And done! Here is /data/WALK-Compare/WALK-Compare/frequencies now.

-rw-rw-r-- 1 ubuntu ubuntu  27M Oct 29 15:51 ALBERTA_alberta_education_curriculum_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 327M Oct 29 16:52 ALBERTA_alberta_floods_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu  13M Nov 11 14:29 ALBERTA_alberta_oil_sands-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 1.4G Nov 11 14:45 ALBERTA_lfrancophonie_de_louest_canadien-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 6.0M Oct 29 15:30 TORONTO_snowden_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 8.6M Oct 29 15:38 UVIC_Calendar_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 5.1M Nov 11 14:46 UVIC_environmental_organizations_and_resources_of_bc-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu  57M Oct 29 17:04 WAHR_panamapapers_frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu  57M Nov 11 14:03 WAHR_panamapapers-text-frequency.txt
-rw-rw-r-- 1 ubuntu ubuntu 6.5M Nov 11 15:10 WAHR_ymmfire-not-boilerpiped-frequency.txt