wush978 closed this issue 9 years ago
Another related issue: https://github.com/wush978/FeatureHashing/issues/64#issuecomment-85996108
@wush978 regarding the format of the vignette: if you want to build an HTML file, why not use Markdown instead of LaTeX? IMO LaTeX is good for PDF but may be overkill for HTML.
Kind regards, Michael
@pommedeterresautee , thanks for reminding me. I'll switch the build tools to HTML based vignette.
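For reference, an HTML-based vignette built with knitr/rmarkdown typically just needs a header like the following (a sketch; the title and index entry here are illustrative placeholders, not the actual vignette's):

```yaml
---
title: "Sentiment analysis with FeatureHashing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sentiment analysis with FeatureHashing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```

With `VignetteBuilder: knitr` in DESCRIPTION, `R CMD build` then renders the `.Rmd` to HTML instead of requiring a LaTeX toolchain.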
Hi. This is my early draft of the sentiment analysis vignette: https://github.com/lewis-c/lewis-c.github.io/blob/master/sentiment%20analysis%20with%20FeatureHashing.Rmd
I'm still planning to review the text to clarify things and add hyperlinks so it's a little early for comments on the detailed drafting. But it would be useful to check -
Ah, I see there is a vignette folder already in 0.9.1 - I will amend the format to match.
Yes, the vignette folder in branch dev/0.9.1 shows the expected format for the vignette. Please feel free to give any suggestions about it.
If there are no issues with v0.9 on CRAN, I'll check the draft this weekend and we can start to discuss the upcoming features in v0.9.1.
Dear @lewis-c ,
After reading your draft, maybe a new piece of functionality would be to directly create an xgb.DMatrix object from hashed.model.matrix, as suggested by @pommedeterresautee before (sorry, I forget where @pommedeterresautee suggested this feature).
This feature requires C-level linking between FeatureHashing and xgboost. The good news is that I implemented such C-level linking between FeatureHashing and digest before in #26. However, xgboost would first need to properly expose its declaration of xgb.DMatrix in C/C++ and register the C function with R. I contributed this kind of feature to digest, so I would be happy to help xgboost implement it.
The point is that this feature requires xgboost to implement some API first, and the implementation needs to be confirmed by the xgboost maintainers.
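Until such C-level linking exists, there is a user-level path, because hashed.model.matrix returns a sparse dgCMatrix and xgboost::xgb.DMatrix already accepts that class directly (a sketch; the toy data frame and formula are made up for illustration):

```r
library(FeatureHashing)
library(xgboost)

# Toy data: two categorical features and a binary label (illustrative only)
df <- data.frame(
  a = c("x", "y", "x"),
  b = c("p", "p", "q"),
  y = c(1, 0, 1)
)

# Hash the features into a sparse dgCMatrix with 2^8 columns;
# rows are observations (transpose = FALSE is the default)
m <- hashed.model.matrix(~ a + b, data = df, hash.size = 2^8)

# xgb.DMatrix accepts a dgCMatrix, so the conversion works at the R level;
# a C-level API would only avoid the intermediate copy
dtrain <- xgb.DMatrix(data = m, label = df$y)
```

The proposed C linking would let the hashed matrix be written straight into xgboost's internal format, skipping the intermediate dgCMatrix allocation.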
Sorry, I think my lazy language may have created some confusion: in my earlier question I was asking whether you wanted any existing FeatureHashing functions to be illustrated in the sentiment analysis vignette. But maybe it's best to keep that to the essential steps. I've created a pull request for my latest draft.
As for the separate question of development ideas for FeatureHashing, I would say:
And thanks again for all your work on the package. I'm still learning about how to handle larger datasets in R but it's good to know I have the option of FeatureHashing for text data or features with many categorical values.
Let's focus on the second point.
IMO, the construction of bi-grams should be implemented in another package whose purpose is semantic analysis. I searched the existing CRAN packages and noticed that qlcMatrix and biogram might be able to construct bi-grams. Would it be possible to feed FeatureHashing with the results of these packages?
Thanks Wush. I hadn't seen those packages but I was really thinking about building the bigrams via FeatureHashing rather than a data pre-processing stage. I guess I was thinking of Vowpal Wabbit and how easy it is there to just specify the ngram in the main command.
I only asked because, when I looked at other text processing packages in R, I was surprised that even the most popular packages such as 'tm' rely on RWeka to build bigrams, which seems a bit clumsy and slow.
But I completely understand that you need to decide on your priorities for FeatureHashing, and maybe bigrams aren't a high enough priority or, as you suggest, maybe you want a 'special function' that is more flexible.
Dear @lewis-c ,
Could I close this issue now that the vignette is done in https://github.com/wush978/FeatureHashing/commit/a1c5cb2289022f62138498fdfb068290a4a4b89a?
Yes, of course - happy to close this one.
Thanks :)
@lewis-c , recently I implemented n-gram tokenization as part of the tmlite corpus construction pipeline. It is very fast, because it is written in C and tightly integrated into the text vectorization process. For details, see the ?create_dict_corpus or ?create_hash_corpus documentation.
The hashing trick is also implemented, many thanks to @wush978 for the collaboration. But note, tmlite is far less general than FeatureHashing: it has no formula interface. I could also expose the ngram generator function if someone wants to use it in a formula interface, but I suppose it would be less efficient to use it that way.
Thanks @dselivanov - I will take a closer look at tmlite when I get a chance.
Overall, with tasks similar to the Kaggle Bag-of-words competition I think I'm happy to use my own manual bigram function, then FeatureHashing, then the linear version of Xgboost - that all seemed to work very well and perhaps I should add the manual construction of bigrams to the vignette here. But I can see there are times when that pipeline would struggle with larger datasets so I'll be curious to see how far you can push things with tmlite.
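The manual bigram step in that pipeline could look something like the sketch below (make_bigrams is a hypothetical helper, not part of FeatureHashing; the tokenization rules are my own assumptions):

```r
# Hypothetical helper: turn a character vector of documents into
# space-separated unigram + bigram tokens, e.g. "good_movie"
make_bigrams <- function(docs) {
  vapply(docs, function(doc) {
    tokens <- strsplit(tolower(doc), "\\s+")[[1]]
    if (length(tokens) < 2) return(paste(tokens, collapse = " "))
    # Pair each token with its successor to form the bigrams
    bigrams <- paste(tokens[-length(tokens)], tokens[-1], sep = "_")
    paste(c(tokens, bigrams), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}
```

The resulting space-delimited token strings can then be hashed via FeatureHashing's split() tag in the formula interface before being handed to xgboost's linear booster.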
Dear @lewis-c,
This is the issue for tracking https://github.com/wush978/FeatureHashing/issues/60#issuecomment-81660067. Thanks for contributing this.
Please let me know if you need any help.
Best, Wush