welfare-state-analytics / welfare_state_analytics

Welfare State Analytics

Release v0.3.18 (penelope) #136

Closed roger-mahler closed 3 years ago

roger-mahler commented 3 years ago

This release includes a number of new features as well as bug fixes.

Noteworthy changes are:

roger-mahler commented 3 years ago

The release is live on https://westac.humlab.umu.se

roger-mahler commented 3 years ago

@fredrik1984 @Stubbendorff @Gunnarnordahl

roger-mahler commented 3 years ago

Command Line Interface

┌──[16:57:32]─[roger@p2]─[~/.../welfare-state-analytics/workbench_sandbox]
└──>  (poetry) λ co_occurrence --help

Usage: co_occurrence [OPTIONS] CORPUS_CONFIG INPUT_FILENAME OUTPUT_FILENAME

Options:
  -c, --concept TEXT              Concept
  --no-concept                    Filter out concept word
  --count-threshold INTEGER       Filter out co_occurrences below threshold
  -w, --context-width INTEGER     Width of context on either side of concept.
                                  Window size = 2 * context_width + 1

  -m, --phrase TEXT               Phrase
  -n, --phrase-file TEXT          Phrase filename
  -p, --partition-key TEXT        Partition key(s)
  -i, --pos-includes TEXT         List of POS tags to include e.g. "|NN|JJ|".
  -m, --pos-paddings TEXT         List of POS tags to replace with a padding
                                  marker.

  -x, --pos-excludes TEXT         List of POS tags to exclude e.g.
                                  "|MAD|MID|PAD|".

  -b, --lemmatize / --no-lemmatize
                                  Use word baseforms
  -l, --to-lowercase / --no-to-lowercase
                                  Lowercase words
  -r, --remove-stopwords [swedish|english]
                                  Remove stopwords using given language
  --min-word-length INTEGER RANGE
                                  Min length of words to keep
  --max-word-length INTEGER RANGE
                                  Max length of words to keep
  --keep-symbols / --no-keep-symbols
                                  Keep symbols
  --keep-numerals / --no-keep-numerals
                                  Keep numerals
  --only-alphabetic               Keep only tokens having only alphabetic
                                  characters

  --only-any-alphanumeric         Keep tokens with at least one alphanumeric
                                  char

  -f, --force / --no-force        Ignore checkpoints
  --help                          Show this message and exit.
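
The help text for `--context-width` states that the window size is `2 * context_width + 1`. The following is a minimal Python sketch of that relationship, not the actual penelope implementation: the function names (`window_size`, `concept_windows`) and the `*` padding marker are assumptions made for illustration only.

```python
from typing import Iterator, List


def window_size(context_width: int) -> int:
    """Total window length: context_width tokens on each side plus the concept."""
    return 2 * context_width + 1


def concept_windows(
    tokens: List[str], concept: str, context_width: int, pad: str = "*"
) -> Iterator[List[str]]:
    """Yield one window per occurrence of `concept`, centred on that token.

    The token list is padded on both sides (hypothetical marker `pad`) so
    that windows near the edges still have the full length.
    """
    padded = [pad] * context_width + tokens + [pad] * context_width
    size = window_size(context_width)
    for i, token in enumerate(tokens):
        if token == concept:
            # tokens[i] sits at padded[i + context_width], the window centre
            yield padded[i : i + size]


if __name__ == "__main__":
    words = ["regering", "lämna", "information", "om", "budget"]
    for window in concept_windows(words, "information", context_width=2):
        print(window)
```

With `--context-width 5`, as in the example call in this thread, each window would span 11 tokens under this reading.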

Example call using the CLI


┌─[16:55:58]─[roger@p2]─[~/.../welfare-state-analytics/workbench_sandbox]
└──>  (poetry) λ nohup co_occurrence --pos-includes "NN|PM" --pos-paddings "UO|JJ|AB|HA|IE|IN|PL|PP|KN|SN|VB|PC" --context-width 5 --lemmatize --to-lowercase --partition-key year --remove-stopwords swedish --concept information ./doit.yml  /data/westac/riksdagens-protokoll.1920-2019.sparv4.csv.zip ./output/information_w5_NNPM_UOJJABHAIEINPLPPKNSNVBPC_lemma_no_stops_NEW/information_w5_NNPM_UOJJABHAIEINPLPPKNSNVBPC_lemma_no_stops_NEW &
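
The call above combines `--pos-includes`, `--pos-paddings`, `--lemmatize` and `--to-lowercase`. The sketch below shows one plausible reading of how those options interact on a tagged token stream. It is an assumption-laden illustration, not penelope's code: the `(word, lemma, pos)` tuple layout, the function names, the `*` padding marker, and the choice to drop tokens whose tag is in neither list are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str, str]  # (word, lemma, pos), an assumed layout


def parse_tags(spec: str) -> Set[str]:
    """Split a tag spec such as "NN|PM" or "|NN|JJ|" into a set of tags."""
    return {tag for tag in spec.split("|") if tag}


def filter_tokens(
    tokens: Iterable[Token],
    pos_includes: str,
    pos_paddings: str,
    lemmatize: bool = True,
    to_lowercase: bool = True,
    pad: str = "*",
) -> List[str]:
    """Keep included tags, replace padded tags with `pad`, drop the rest."""
    includes = parse_tags(pos_includes)
    paddings = parse_tags(pos_paddings)
    result: List[str] = []
    for word, lemma, pos in tokens:
        if pos in includes:
            text = lemma if lemmatize else word
            result.append(text.lower() if to_lowercase else text)
        elif pos in paddings:
            # Padding keeps the token's position so window distances are preserved
            result.append(pad)
    return result


if __name__ == "__main__":
    sentence = [
        ("Regeringen", "regering", "NN"),
        ("lämnar", "lämna", "VB"),
        ("information", "information", "NN"),
        (".", ".", "MAD"),
    ]
    print(filter_tokens(sentence, "NN|PM", "UO|JJ|AB|HA|IE|IN|PL|PP|KN|SN|VB|PC"))
```

Replacing a token with a padding marker, rather than removing it, would keep the distance between the remaining words unchanged, which matters when the context window is counted in tokens.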

roger-mahler commented 3 years ago

Select "riksdagens-protokoll" in the drop-down list for PoS statistics. The other options are not configured.