Closed Ramay7 closed 5 years ago
Hi, I ran into some problems with SCL again while running experiments.
First, I tried to run it on the Multi-Domain Sentiment Dataset. The raw samples are texts. How do you convert these text-based data into numerical data? If you use a bag-of-words model, how is the dictionary created (by combining the source and target data together)? And is the dimensionality of each sample reduced (e.g. by selecting the top D most frequent words)?
What's more, I also tried to run the SCL method on another domain adaptation benchmark dataset, which can be found here. The examples in that dataset are already numerical. However, no matter how many pivot features I choose, error_naive
is always the same as error_adapt
. Does scl.m
need modifications to fit other kinds of datasets? If so, what modifications should be made?
Looking forward to your reply. Thanks!
Hi @Ramay7, thanks for taking an interest in libTLDA. Sorry for the late response, I was apparently not being notified about new issues.
Regarding your first question:
Can you explain this abnormal phenomenon?
It is entirely possible that domain adaptive classifiers and transfer learners can perform worse than naive classifiers. This is known as 'negative transfer'. It arises when you make false assumptions about your data (e.g. invalid covariate shift) or when the domains are too dissimilar (http://web.engr.oregonstate.edu/~tgd/publications/rosenstein-marx-kaelbling-dietterich-hnb-nips2005-transfer-workshop.pdf is a great, short paper describing this).
How do you change these text-based data into numerical data?
Scikit-learn has a module to help you encode text into numerical data (e.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
If you use a bag-of-words model, how is the dictionary created (by combining the source and target data together)?
For the dictionary, you need all data, i.e. both source and target samples combined.
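A minimal sketch of the two answers above, using scikit-learn's `CountVectorizer` on toy review snippets (the example documents are made up; the key point is fitting the vocabulary on source and target together):

```python
from sklearn.feature_extraction.text import CountVectorizer

source_docs = ["great book , loved it", "terrible plot"]
target_docs = ["battery life is great", "awful camera"]

# Fit the vocabulary on source and target combined, so both domains
# share the same feature space (required for pivot-based methods like SCL).
vectorizer = CountVectorizer()
vectorizer.fit(source_docs + target_docs)

# Transform each domain separately into count matrices.
X_source = vectorizer.transform(source_docs)
X_target = vectorizer.transform(target_docs)

print(X_source.shape[1] == X_target.shape[1])  # same dimensionality
```

Fitting on each domain separately would produce incompatible feature spaces, since words occurring only in the target would get no column in the source representation.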
And is the dimensionality of each sample reduced (e.g. by selecting the top D most frequent words)?
Top-D most frequent (after tf-idf extraction) is one way to do that. You could try other feature extraction methods, such as PCA. Whatever you think is a good idea for the data that you're using.
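For the top-D option, scikit-learn's `TfidfVectorizer` can do the truncation directly via `max_features`, which keeps only the D terms with the highest corpus frequency (toy documents below are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was great", "the plot was thin", "great acting overall"]

# Keep only the D most frequent terms across the corpus; all other
# words are dropped from the vocabulary.
D = 5
vectorizer = TfidfVectorizer(max_features=D)
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, 5): three documents, D tf-idf features each
```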
Does it need some modifications in scl.m to fit other kinds of dataset? If so, what kind of modifications should be made?
This sounds like the regularization parameter is set to a too large value. Did you cross-validate it?
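As a generic sketch of such cross-validation (not scl.m's own API; the toy data and the logistic-regression stand-in are assumptions), one can grid-search the regularization strength on the labeled source data. An over-regularized classifier can be flattened to the point where the adapted and naive solutions coincide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy labeled source data (hypothetical stand-in for your source domain).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] > 0).astype(int)

# Cross-validate the inverse regularization strength C; a value that is
# too small (i.e. too much regularization) can make the adapted
# classifier indistinguishable from the naive one.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```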
Thanks for your detailed response! I will try more!
Hi, thanks for sharing.
However, when I ran
example.m
with `aclfr = 'scl';`
20 times, SCL did not seem to perform better than the naive method. Below are the detailed results: SCL beats the naive method in 3 cases and ties in 4 cases; namely, SCL loses in 13 cases.
Can you explain this abnormal phenomenon? Or is it just because the instances used in
example.m
are synthetic? Thanks!