wush978 / FeatureHashing

Implement feature hashing with R
GNU General Public License v3.0

Vignette of sentiment analysis #66

Closed wush978 closed 9 years ago

wush978 commented 9 years ago

Dear @lewis-c,

This is the issue for tracking https://github.com/wush978/FeatureHashing/issues/60#issuecomment-81660067. Thanks for contributing this.

Please let me know if you need any help.

Best, Wush

wush978 commented 9 years ago

Another related issue: https://github.com/wush978/FeatureHashing/issues/64#issuecomment-85996108

pommedeterresautee commented 9 years ago

@wush978 regarding the format of the Vignette, if you want to build an HTML file, why not use Markdown instead of LaTeX? IMO LaTeX is good for PDF but may be overkill for HTML.

Kind regards, Michael

wush978 commented 9 years ago

@pommedeterresautee, thanks for reminding me. I'll switch the build tools to an HTML-based vignette.

formwork commented 9 years ago

Hi. This is my early draft of the sentiment analysis vignette: https://github.com/lewis-c/lewis-c.github.io/blob/master/sentiment%20analysis%20with%20FeatureHashing.Rmd

I'm still planning to review the text to clarify things and add hyperlinks, so it's a little early for comments on the detailed drafting. But it would be useful to check -

formwork commented 9 years ago

Ah, I see there is a vignette folder already in 0.9.1 - I will amend the format to match.

wush978 commented 9 years ago

Yes, the vignette folder in branch dev/0.9.1 is the format for vignettes. Please feel free to offer any suggestions about it.

If CRAN reports no issues with v0.9, I'll check the draft this weekend and we can start discussing the upcoming features for v0.9.1.

wush978 commented 9 years ago

Dear @lewis-c ,

After reading your draft, I think a useful new feature would be to create an xgb.DMatrix object directly from hashed.model.matrix, as suggested by @pommedeterresautee before (sorry, I forget where @pommedeterresautee suggested this feature).

This feature requires C-level linking between FeatureHashing and xgboost. The good news is that I already did this kind of C-linking between FeatureHashing and digest in #26. However, xgboost needs to properly expose its declaration of xgb.DMatrix in C/C++ and register the C function with R first. Since I contributed this kind of feature to digest, I'd be happy to help xgboost implement it.

The point is that this feature requires xgboost to implement some API first, and the implementation needs to be confirmed by the xgboost maintainers.
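As a stopgap before any C-level linking lands, the sparse dgCMatrix that hashed.model.matrix() returns can already be handed to xgb.DMatrix() through R, at the cost of an extra copy. A minimal sketch, assuming a data frame `df` with a text column `review` and a 0/1 label `sentiment` (both hypothetical names):

```r
library(FeatureHashing)
library(xgboost)

# Hash the whitespace-delimited tokens of the (hypothetical) text column
# into a 2^16-dimensional sparse dgCMatrix.
m <- hashed.model.matrix(~ split(review, delim = " "),
                         data = df, hash.size = 2^16)

# Wrap the sparse matrix in an xgb.DMatrix; this conversion goes through R
# and copies the data, which is what the proposed C-linking would avoid.
dtrain <- xgb.DMatrix(m, label = df$sentiment)
model  <- xgboost(data = dtrain, nrounds = 10,
                  objective = "binary:logistic", booster = "gblinear")
```

The direct C-level route discussed above would construct the xgb.DMatrix without materializing the intermediate copy in R.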

formwork commented 9 years ago

Sorry, I think my lazy language may have created some confusion: in my earlier question I was asking whether you wanted any existing FeatureHashing functions to be illustrated in the sentiment analysis vignette. But maybe it's best to keep that to the essential steps. I've created a pull request for my latest draft.

As for the separate question of development ideas for FeatureHashing, I would say:

And thanks again for all your work on the package. I'm still learning about how to handle larger datasets in R but it's good to know I have the option of FeatureHashing for text data or features with many categorical values.

wush978 commented 9 years ago

Let's focus on the second point.

IMO, the construction of bi-grams should be implemented in another package whose purpose is semantic analysis. I searched the existing CRAN packages and noticed that qlcMatrix and biogram might be able to construct bi-grams. Is it possible to feed FeatureHashing with the results from these packages?

formwork commented 9 years ago

Thanks Wush. I hadn't seen those packages, but I was really thinking about building the bigrams via FeatureHashing rather than in a data pre-processing stage. I guess I was thinking of Vowpal Wabbit and how easy it is there to just specify the ngram option in the main command.

I only asked because, when I looked at other text processing packages in R, I was surprised that even the most popular packages such as 'tm' rely on RWeka to build bigrams, which seems a bit clumsy and slow.

But I completely understand that you need to decide on your priorities for FeatureHashing, and maybe bigrams isn't a high enough priority or, as you suggest, maybe you want a 'special function' that is more flexible.
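In the meantime, bigrams can be built in a small pre-processing step and hashed alongside the unigrams through the formula interface. A sketch with a hypothetical helper (`make_bigrams` is not part of FeatureHashing, and the column names `review`/`bigrams` are assumptions):

```r
library(FeatureHashing)

# Hypothetical helper: turn each document into a space-separated string
# of "w1_w2" bigram tokens, suitable for FeatureHashing's split() tag.
make_bigrams <- function(text) {
  vapply(strsplit(text, "\\s+"), function(tokens) {
    if (length(tokens) < 2) return("")
    paste(paste(head(tokens, -1), tail(tokens, -1), sep = "_"),
          collapse = " ")
  }, character(1))
}

df$bigrams <- make_bigrams(df$review)  # assumes a text column 'review'

# Hash unigrams and bigrams into the same feature space.
m <- hashed.model.matrix(~ split(review, delim = " ") +
                           split(bigrams, delim = " "),
                         data = df, hash.size = 2^16)
```

This keeps the n-gram construction outside FeatureHashing itself, along the lines of the pre-processing approach discussed above.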

wush978 commented 9 years ago

Dear @lewis-c ,

Could I close this issue, now that the vignette is done in https://github.com/wush978/FeatureHashing/commit/a1c5cb2289022f62138498fdfb068290a4a4b89a?

formwork commented 9 years ago

Yes, of course - happy to close this one.

wush978 commented 9 years ago

Thanks :)

dselivanov commented 9 years ago

@lewis-c, I recently implemented n-gram tokenization as part of the tmlite corpus construction pipeline. It is very fast, because it is written in C and tightly integrated into the text vectorization process. For details, see the ?create_dict_corpus or ?create_hash_corpus documentation. The hashing trick is also implemented, many thanks to @wush978 for the collaboration. But note that tmlite is far less general than FeatureHashing:

Also, I can expose the n-gram generator function if someone wants to use it in the formula interface. But I suppose it would be less efficient to use it that way.

formwork commented 9 years ago

Thanks @dselivanov - I will take a closer look at tmlite when I get a chance.

Overall, for tasks similar to the Kaggle Bag-of-words competition, I think I'm happy to use my own manual bigram function, then FeatureHashing, then the linear version of xgboost - that all seemed to work very well, and perhaps I should add the manual construction of bigrams to the vignette here. But I can see there are times when that pipeline would struggle with larger datasets, so I'll be curious to see how far you can push things with tmlite.