roger-mahler closed 3 years ago
The release is live on https://westac.humlab.umu.se
@fredrik1984 @Stubbendorff @Gunnarnordahl
┌──[16:57:32]─[roger@p2]─[~/.../welfare-state-analytics/workbench_sandbox]
└──> (poetry) λ co_occurrence --help
Usage: co_occurrence [OPTIONS] CORPUS_CONFIG INPUT_FILENAME OUTPUT_FILENAME

Options:
  -c, --concept TEXT              Concept
  --no-concept                    Filter out concept word
  --count-threshold INTEGER       Filter out co_occurrences below threshold
  -w, --context-width INTEGER     Width of context on either side of concept.
                                  Window size = 2 * context_width + 1
  -m, --phrase TEXT               Phrase
  -n, --phrase-file TEXT          Phrase filename
  -p, --partition-key TEXT        Partition key(s)
  -i, --pos-includes TEXT         List of POS tags to include e.g. "|NN|JJ|".
  -m, --pos-paddings TEXT         List of POS tags to replace with a padding
                                  marker.
  -x, --pos-excludes TEXT         List of POS tags to exclude e.g.
                                  "|MAD|MID|PAD|".
  -b, --lemmatize / --no-lemmatize
                                  Use word baseforms
  -l, --to-lowercase / --no-to-lowercase
                                  Lowercase words
  -r, --remove-stopwords [swedish|english]
                                  Remove stopwords using given language
  --min-word-length INTEGER RANGE
                                  Min length of words to keep
  --max-word-length INTEGER RANGE
                                  Max length of words to keep
  --keep-symbols / --no-keep-symbols
                                  Keep symbols
  --keep-numerals / --no-keep-numerals
                                  Keep numerals
  --only-alphabetic               Keep only tokens having only alphabetic
                                  characters
  --only-any-alphanumeric         Keep tokens with at least one alphanumeric
                                  char
  -f, --force / --no-force        Ignore checkpoints
  --help                          Show this message and exit.
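To make the `--context-width` rule in the help text concrete (window size = 2 * context_width + 1), here is a minimal Python sketch of how windows around a concept word could be cut out of a token list. It is not the tool's own code: the function name `concept_windows`, the `PAD` marker and the padding at document edges are all assumptions made for illustration.

```python
from typing import Iterator, List

PAD = "*"  # hypothetical padding marker, not taken from co_occurrence


def concept_windows(
    tokens: List[str], concept: str, context_width: int
) -> Iterator[List[str]]:
    """Yield one window of 2 * context_width + 1 tokens per concept hit."""
    for i, token in enumerate(tokens):
        if token != concept:
            continue
        left = tokens[max(0, i - context_width):i]
        right = tokens[i + 1:i + 1 + context_width]
        # Assumption: pad at the document edges so every window has equal size.
        left = [PAD] * (context_width - len(left)) + left
        right = right + [PAD] * (context_width - len(right))
        yield left + [token] + right


tokens = "a b information c d e".split()
windows = list(concept_windows(tokens, "information", context_width=2))
```

With `context_width=2` each window holds 5 tokens; with `--context-width 5` the same rule gives 11.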
┌─[16:55:58]─[roger@p2]─[~/.../welfare-state-analytics/workbench_sandbox]
└──> (poetry) λ nohup co_occurrence --pos-includes "NN|PM" --pos-paddings "UO|JJ|AB|HA|IE|IN|PL|PP|KN|SN|VB|PC" --context-width 5 --lemmatize --to-lowercase --partition-key year --remove-stopwords swedish --concept information ./doit.yml /data/westac/riksdagens-protokoll.1920-2019.sparv4.csv.zip ./output/information_w5_NNPM_UOJJABHAIEINPLPPKNSNVBPC_lemma_no_stops_NEW/information_w5_NNPM_UOJJABHAIEINPLPPKNSNVBPC_lemma_no_stops_NEW &
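The run above combines `--pos-includes` with `--pos-paddings`. As a rough illustration of what the help text describes (included tags are kept, padding tags are replaced with a marker, excluded tags are dropped), here is a hedged Python sketch. The names `parse_tags`, `filter_by_pos` and `PAD`, the precedence between the three lists, and the sample tokens are assumptions, not the tool's actual implementation.

```python
from typing import Iterable, List, Set, Tuple

PAD = "*"  # hypothetical padding marker, not taken from co_occurrence


def parse_tags(spec: str) -> Set[str]:
    """Split a '|'-delimited tag list such as "NN|PM" or "|MAD|MID|PAD|"."""
    return {tag for tag in spec.split("|") if tag}


def filter_by_pos(
    tagged: Iterable[Tuple[str, str]],
    includes: str,
    paddings: str = "",
    excludes: str = "",
) -> List[str]:
    """Keep included tags, pad padding tags, drop everything else."""
    keep, pad, drop = parse_tags(includes), parse_tags(paddings), parse_tags(excludes)
    result: List[str] = []
    for word, pos in tagged:
        if pos in drop:
            continue  # assumed precedence: excludes win over the other lists
        if pos in pad:
            result.append(PAD)  # token is hidden but still occupies a position
        elif not keep or pos in keep:
            result.append(word)
    return result


# Invented sample tokens with Swedish POS tags, for illustration only.
tagged = [("regering", "NN"), ("ge", "VB"), ("information", "NN"),
          (".", "MAD"), ("Sverige", "PM")]
filtered = filter_by_pos(
    tagged,
    includes="NN|PM",
    paddings="UO|JJ|AB|HA|IE|IN|PL|PP|KN|SN|VB|PC",
    excludes="|MAD|MID|PAD|",
)
```

In this sketch, padding a token instead of dropping it keeps the distance between the remaining words, which is presumably why the option exists alongside `--pos-excludes`.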
Select "riksdagens-protokoll" in the drop-down list for PoS statistics. The other options are not configured.
This release includes a number of new features as well as bug fixes.
Noteworthy changes are: