wush978 / FeatureHashing

Implement feature hashing with R
GNU General Public License v3.0
97 stars 38 forks

Empty features #124

Closed topepo closed 7 years ago

topepo commented 7 years ago

I'm trying to understand why there are some hash features that have no keys mapped to them.

Here's an example where the number of features is significantly smaller than the number of original values but there are a few thousand features that are zero across all rows.

> library(FeatureHashing)
> library(stringi)
> library(Matrix)
> 
> n <- 10^6
> 
> set.seed(42276)
> tmp_dat <- data.frame(x = stri_rand_strings(n, 20),
+                       stringsAsFactors = FALSE)
> ## make sure there are no dups
> tmp_dat <- tmp_dat[!duplicated(tmp_dat),, drop = FALSE]
> 
> 
> hash_mat <- hashed.model.matrix(~x, data = tmp_dat)
> keys_per_hash <- Matrix::colSums(hash_mat)
> table(keys_per_hash)
keys_per_hash
      0       1       2       3       4       5       6       7 
   5784   22025   42139   53316   51041   39003   24666   13386 
      8       9      10      11      12      13      14      15 
   6475    2748    1052     350     106      39       9       4 
1000005 
      1 

Thanks

─ Session info ─────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.3.3 (2017-03-06)
 os       macOS Sierra 10.12.6        
 system   x86_64, darwin13.4.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2017-08-20                  

─ Packages ─────────────────────────────────────────────────────────────────────────────
 package        * version date       source                                 
 clisymbols       1.2.0   2017-05-21 CRAN (R 3.3.2)                         
 digest           0.6.12  2017-01-27 CRAN (R 3.3.2)                         
 FeatureHashing * 0.9.1.1 2017-08-02 Github (wush978/FeatureHashing@f97b03f)
 lattice          0.20-35 2017-03-25 CRAN (R 3.3.3)                         
 magrittr         1.5     2014-11-22 CRAN (R 3.3.0)                         
 Matrix         * 1.2-8   2017-01-20 CRAN (R 3.3.3)                         
 Rcpp             0.12.12 2017-07-15 cran (@0.12.12)                        
 sessioninfo    * 1.0.0   2017-06-21 CRAN (R 3.3.2)                         
 stringi        * 1.1.5   2017-04-07 CRAN (R 3.3.2)                         
 withr            2.0.0   2017-07-28 CRAN (R 3.3.2)  

wush978 commented 7 years ago

Hi @topepo,

Let's take https://wush978.github.io/FeatureHashing/#36 as an example:

As you can see, if the data have only one instance (one row), the hashed features are 0x64, 0x9b and 0x36. Therefore, 2^4 - 3 columns will be all zero.

That is to say, feature hashing might not map the real features onto every hashed feature. Which columns end up empty depends on the data, the size of the hash space and the hashing algorithm.
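
A rough sanity check on the numbers in the original post can be sketched in base R. It assumes the hash behaves like a uniform random map (a balls-in-bins model, not the package's actual hash function) and that the hash space has 2^18 columns, which is what the counts in the table above sum to. The variable names are illustrative only.

```r
# Balls-in-bins sketch, ASSUMING the hash spreads keys uniformly at random.
n_keys <- 10^6   # distinct strings in the example from the original post
n_cols <- 2^18   # assumed hash size; the table counts above sum to 262144

# Probability that one particular column receives none of the keys.
p_empty <- (1 - 1 / n_cols)^n_keys

# Expected number of columns that are zero across all rows.
expected_empty <- n_cols * p_empty

# Poisson approximation to the number of keys landing in each column.
lambda <- n_keys / n_cols
expected_counts <- round(n_cols * dpois(0:5, lambda))
names(expected_counts) <- 0:5

print(expected_empty)
print(expected_counts)
```

Under these assumptions the expected number of empty columns comes out a little under 5,800, in line with the 5784 observed, so a few thousand all-zero columns are what a well-behaved hash would produce here.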

topepo commented 7 years ago

Thanks!