Example Data

wush978 commented 9 years ago

The iPinYou dataset is in the public domain, so we can use part of it as an example in the documentation.

Inspired by @pommedeterresautee's comment: https://github.com/wush978/FeatureHashing/issues/41#issuecomment-77178420

pommedeterresautee commented 9 years ago

The dataset looks interesting, but it is very big.

I was thinking of something we could embed in the package, or one of the datasets directly available on CRAN.

The issue with a large dataset is that it can't be embedded: you need space and time to download it and run the examples and, most importantly, you can't hold all the data in your head to predict the result of a command.

But maybe you are thinking of a small sample of this dataset? By small I mean fewer than 50 observations.

The issue I see with a sample is that you won't get interesting results from an ML algorithm unless the sample is carefully constructed.

For the xgboost vignettes I chose one of the datasets embedded in vcd. The one used in the xgboost examples is good too, but it is already prepared (one-hot encoded...). Both are small.

Kind regards, Michaël

pommedeterresautee commented 9 years ago

In a perfect world, the feature importances would be similar in the sample and in the full data, and easy to interpret.

wush978 commented 9 years ago

Thanks for your comment. However, we are caught in a dilemma: feature hashing is useful for processing large amounts of data, but the size of the data shipped in a package is limited by CRAN policy.
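
To make the dilemma concrete, here is a minimal sketch of the hashing trick in Python. It is only an illustration, not the FeatureHashing implementation: the function name `hash_features`, the column names, and the choice of CRC32 as the hash are all assumptions made for this example.

```python
import zlib


def hash_features(row, n_buckets=2 ** 4):
    """Map a dict of categorical features to a sparse {index: value} vector.

    Hypothetical helper for illustration; CRC32 is an arbitrary stable hash.
    """
    vec = {}
    for name, value in row.items():
        token = f"{name}={value}".encode("utf-8")
        h = zlib.crc32(token)  # unsigned 32-bit integer, stable across runs
        index = h % n_buckets  # position in the fixed-size output vector
        # Use the top hash bit as a sign so colliding features tend to cancel
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vec[index] = vec.get(index, 0.0) + sign
    return vec


# Invented example row; the column names are not taken from iPinYou
row = {"city": "beijing", "ad_exchange": "2"}
print(hash_features(row))
```

The output width is fixed by `n_buckets` no matter how many distinct values appear in the data, which is why the technique pays off on large inputs and is hard to show off on a tiny embedded one.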

I would prefer to use only a small part of the iPinYou data to demonstrate how to use feature hashing. Moreover, I can report the benchmark performance (AUC and log loss) obtained when the same code is applied to the full dataset. I have done this in my own research before and, as far as I know, the result is close to the benchmark reported in "Real-Time Bidding Benchmarking with iPinYou Dataset".
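
For readers unfamiliar with the two metrics, a minimal sketch of how they could be computed, in plain Python. The function names and the toy labels and scores are made up for illustration and are not results from iPinYou.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for this example
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auc(y, s))       # 0.75
print(log_loss(y, s))
```

The pairwise AUC loop is quadratic in the number of rows, so it suits a small sample only; on the full dataset a rank-based computation would be the usual choice.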

pommedeterresautee commented 9 years ago

A solution where you embed a small sample and provide a way to download the full dataset and run the same code on it would be perfect, both for a quick try and for those who want to go deeper.
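
A rough sketch of that idea in Python, assuming the data is a CSV file. Everything here is hypothetical: the names `EMBEDDED_SAMPLE`, `fetch_full` and `iter_rows`, the column names, and the sample rows, which are invented rather than taken from iPinYou. The download location is left to the caller because none is given in this thread.

```python
import csv
import io
import os
import urllib.request

# Invented rows standing in for a small embedded sample (not real iPinYou data)
EMBEDDED_SAMPLE = """click,city,ad_exchange
0,a,1
1,b,2
0,a,2
"""


def fetch_full(url, dest):
    """Download the full dataset to dest, unless it is already there."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


def iter_rows(full_path=None):
    """Yield rows from the full file if it exists, otherwise from the embedded sample."""
    if full_path is not None and os.path.exists(full_path):
        with open(full_path, newline="") as f:
            yield from csv.DictReader(f)  # streamed, so the file is never fully in memory
    else:
        yield from csv.DictReader(io.StringIO(EMBEDDED_SAMPLE))


# Quick try: no download needed
for row in iter_rows():
    print(row)
```

The same downstream code can then consume `iter_rows()` for a quick look or `iter_rows(fetch_full(url, dest))` for the full run, with `url` and `dest` supplied by the user.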

wush978 / FeatureHashing

Example Data #47