zeno-ml / zeno

AI Data Management & Evaluation Platform
https://zenoml.com
MIT License
212 stars 9 forks

calculate projection in preprocessing #582

Open cabreraalex opened 1 year ago

cabreraalex commented 1 year ago

If a user provides embeddings, we should compute the projections as a preprocessing step and cache the result. That will make interaction from then on much, much faster. We can also add an option to skip computing projections if we want.
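A minimal sketch of the compute-once-and-cache idea, not Zeno's actual code. The names `cache_key` and `cached_projection`, the JSON cache file, and the `project` callable (a stand-in for whatever runs the projection, e.g. t-SNE) are all assumptions for illustration. Keying the cache on both the embeddings and the parameters means a changed parameter triggers a recompute rather than returning stale coordinates.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional

Points = List[List[float]]


def cache_key(embeddings: Points, params: Dict) -> str:
    """Hash the embeddings and projection parameters into a stable key."""
    payload = json.dumps({"embeddings": embeddings, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_projection(
    embeddings: Points,
    project: Callable[..., Points],  # hypothetical: any function returning 2D coordinates
    cache_dir: str,
    params: Optional[Dict] = None,
) -> Points:
    """Return cached 2D coordinates, computing and storing them on a cache miss."""
    params = params or {}
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"projection_{cache_key(embeddings, params)}.json"

    if path.exists():
        # Cache hit: skip the expensive projection entirely.
        return json.loads(path.read_text())

    # Cache miss: run the projection once and persist the coordinates.
    coords = [[float(value) for value in point] for point in project(embeddings, **params)]
    path.write_text(json.dumps(coords))
    return coords
```

The first call pays the projection cost during preprocessing; later calls with the same embeddings and parameters only read the small coordinate file.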

cabreraalex commented 1 year ago

@xnought any thoughts on this? Any downside? One I can think of is that you have to store the projection coordinates, which uses up disk space, but that should be minimal?

xnought commented 1 year ago

Depending on the data format, yeah, disk space would not be too bad.

Sidenote: it could be better to use parquet when caching columns for that extra compression.

xnought commented 1 year ago

I do like your idea. I think I'll give that a shot next.

xnought commented 1 year ago

There is also something else to think about: should users be able to mess with tsne parameters (like perplexity)?

Should the user be able to recompute tsne? Given how much the results vary with the tsne parameters, maybe?

xnought commented 1 year ago

Also, if their dataset is too large and tsne ends up taking an eternity, what then?

That would favor our current method, where they can just load one tsne instead of preloading all of them.

cabreraalex commented 1 year ago

We could add options to the TOML for the TSNE parameters?

For your last point, if the dataset is too large, the current method would be worse: leaving the screen would stop processing and lose your progress.