zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0
283 stars 28 forks source link

Conversion of Spark SparseVector to Clojure looses important information #294

Closed behrica closed 3 years ago

behrica commented 3 years ago

a SparseVector has three pieces of information:

Current code here: https://github.com/zero-one-group/geni/blob/5b570b5db4413ea8152232b66e9aba7905cf4273/src/clojure/zero_one/geni/interop.clj#L142

only collects values, but we need all three:

(seq (.indices sparse-vector))
(seq (.values sparse-vector))
(.size sparse-vector)

Maybe a map with all three should be returned instead

behrica commented 3 years ago

I have a fix and could do a PR.,

Now a some tests fail, as they assert the "wrong" behaviour. Like: https://github.com/zero-one-group/geni/blob/3ea740d3c863530439b520a67c5dddea39108a6f/test/zero_one/geni/ml_test.clj#L721

I needed to fix them as well.

anthony-khong commented 3 years ago

Hi @behrica, yes, I remember implementing this and I did cut some corners. A PR to fix it would be great! Thank you!!

behrica commented 3 years ago

I have a problem with one test, which fails: In this test https://github.com/zero-one-group/geni/blob/d0bc4e734d88cda0d288eb25df92683c6313d8de/test/zero_one/geni/data_sources_test.clj#L235

The expected and found sparse vectors differ in the "size" (780 vs 692) The indices and values seem to be identical, so technical the sparse vector are the same, correct ?

But maybe there is a bug hidden somewhere.


Diffs: in [:features :size] expected 692, was 780
behrica commented 3 years ago

So this definitely passes:

(map  #(get-in % [:features :indices]) (g/collect (libsvm-df))) => (map  #(get-in % [:features :indices]) (g/collect read-df))
 (map  #(get-in % [:features :values]) (g/collect (libsvm-df))) => (map  #(get-in % [:features :values]) (g/collect read-df))

But why are the sizes different, I don't know.

behrica commented 3 years ago

Can be closed.