A package for supervised LDA that can incorporate labels, tree priors, and hinge loss. Most of the code can be applied to general data, as long as the data meets the format requirements. The code for computing the Pearson correlation coefficient is implemented specifically for the EmoInt data.

The commands below expect slda.jar in the root directory and the dependency jars in the lib/ directory.

The general command line format is

java -cp slda.jar:lib/* cmd.{Tools} -arg1 <arg1-value> -arg2 <arg2-value> ... -argn <argn-value>

On Windows, replace slda.jar:lib/* with slda.jar;lib/*.
Add the JVM option -Xmx20G to request a maximum of 20GB of memory if your dataset is large.

{Tools} are
CmdSLDA: Run supervised LDA.
CmdTSLDA: Run supervised LDA with tree priors.
CmdTree: Build tree priors using pre-trained word embeddings.
CmdEval: Compute the Pearson correlation coefficient between predictions and gold labels. This is implemented specifically for the EmoInt data.

Use the -h option to get help information.

To run supervised LDA:

java -cp slda.jar:lib/* cmd.CmdSLDA -v <vocab-file> -d <corpus-file> -l <label-file> -m <model-file>

Required arguments
-v <vocab-file>: Vocabulary file. Each line contains a unique word.
-d <corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format
    <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
    <doc-len> is the total number of tokens in this document. <word-type-i> is the index of a word in <vocab-file>, i.e., its line number starting from 0. Words with zero frequency can be omitted.
-m <model-file>: Trained model file in JSON format. Read and written by the program.
-l <label-file> [optional in test]: Label file. Each line contains the corresponding document's numeric label. If a document's label is not available, leave the corresponding line empty.

Optional arguments
-t: Use the model for test (default: false).
-a <alpha-value>: Parameter of the Dirichlet prior of document distributions over topics (default: 0.01). Must be a positive real number.
-b <beta-value>: Parameter of the Dirichlet prior of topic distributions over words (default: 0.01). Must be a positive real number.
-k <num-topics>: Number of topics (default: 20). Must be a positive integer.
-i <num-iters>: Number of iterations (default: 500). Must be a positive integer.
-mu <mu-value>: The mean of the Gaussian priors for regression parameters (default: 0.0).
-n <nu-value>: The variance of the Gaussian priors for regression parameters (default: 1.0). Must be a positive real number.
-s <sigma-value>: The variance of the Gaussian distribution for generating documents' response labels (default: 1.0). Must be a positive real number.
-hl: Use hinge loss as the loss function (default: false).
-c <c-value>: The regularization parameter for hinge loss (default: 1.0). Must be a positive real number.
-e <epsilon-value>: The error bound for hinge loss (default: 0.1). Must be a positive real number.
-tc <topic-count-file>: File for documents' topic counts. Each line contains the numbers of a document's tokens assigned to each topic, separated by spaces.
-r <topic-file>: File for showing human-readable topics and top positive/negative words.
-w <num-top-word>: Number of top words for human-readable topics and for positive/negative weights (default: 20). Must be a positive integer.
-p <pred-file>: File for predicted values. Each line contains a predicted value.

To run supervised LDA with tree priors:

java -cp slda.jar:lib/* cmd.CmdTSLDA -v <vocab-file> -tp <tree-prior-file> -d <corpus-file> -l <label-file> -m <model-file>

-tp <tree-prior-file>: File of tree priors. Tree priors can be built using the tree prior construction tool and pre-trained word embeddings, or you can build your own following the format. The representation of a leaf node is <word-id>:<word>, where <word-id> is the word's ID, i.e., the line number of this word in the <vocab-file> (starting from 0), and <word> is the string representation of the word itself.

To build tree priors:

java -cp slda.jar:lib/* cmd.CmdTree -v <vocab-file> -e <embedding-file> -o <tree-prior-file>

-v <vocab-file>: Vocabulary file. Same format as in supervised LDA.
-e <embedding-file>: Pre-trained word embedding file. Follows the format of word2vec output: the first line contains the number of words and the number of dimensions, separated by a space; each of the following lines contains a word and its embedding, separated by spaces.
-o <tree-prior-file>: The file for storing human-readable tree priors.
-t <tree-prior-type>: The type of tree priors (default: 1).
-k <child-number>: The number of child nodes per internal node for a two-level tree (default: 10). Must be a positive integer.

To evaluate predictions:

java -cp slda.jar:lib/* cmd.CmdEval -p <prediction-file> -l <gold-label-file>

<prediction-file> and <gold-label-file> must have the same number of lines.
-p <prediction-file>: The predicted value file. Each line contains a predicted value. Can be written by the -p <pred-file> option in supervised LDA.
-l <gold-label-file>: The gold label file. Same format as <prediction-file>.
-o <output-file>: The file for writing the two Pearson correlation coefficients. The first line contains the Pearson correlation coefficient of all examples. The second line contains the Pearson correlation coefficient of the examples with gold labels greater than 0.5.

References
Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems.
Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2017. Adapting Topic Models using Lexical Associations with Tree Priors. In Proceedings of Empirical Methods in Natural Language Processing.
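
Example: preparing input files

The vocabulary and corpus formats required by supervised LDA (one unique word per line; one document per line as <doc-len> followed by <word-type>:<frequency> pairs with 0-based word indexes) can be produced from raw text with a short script. The following is a minimal Python sketch and is not part of this package. The function names build_vocab, to_corpus_line, and write_files, the whitespace tokenization, and the choice to number words by first occurrence are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(docs):
    """Map each unique word to a 0-based index, in order of first occurrence.

    Assumption: documents are plain strings tokenized on whitespace.
    """
    vocab = {}
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def to_corpus_line(doc, vocab):
    """Format one document as '<doc-len> <word-type>:<frequency> ...'.

    Words missing from the vocabulary are skipped, and <doc-len> counts
    only the tokens that were kept. Zero-frequency words are omitted.
    """
    counts = Counter(vocab[w] for w in doc.split() if w in vocab)
    doc_len = sum(counts.values())
    pairs = " ".join(f"{idx}:{freq}" for idx, freq in sorted(counts.items()))
    return f"{doc_len} {pairs}".strip()


def write_files(docs, vocab_path, corpus_path):
    """Write a vocabulary file and a corpus file for the given documents."""
    vocab = build_vocab(docs)
    with open(vocab_path, "w", encoding="utf-8") as f:
        # dicts keep insertion order, so line i (from 0) holds word i
        for word in vocab:
            f.write(word + "\n")
    with open(corpus_path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(to_corpus_line(doc, vocab) + "\n")
    return vocab
```

For example, given the two documents "the cat sat" and "the the dog", the vocabulary file lists the, cat, sat, dog (indexes 0 to 3), and the second corpus line is "3 0:2 3:1". The label file would then need one numeric label per line, in the same document order.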