willtownes / nsf-paper

Nonnegative spatial factorization for multivariate count data
GNU Lesser General Public License v3.0

Implementation questions #8

Closed tmchartrand closed 1 year ago

tmchartrand commented 1 year ago

I enjoyed the paper and am exploring testing the technique on some MERSCOPE data!

A couple of questions, if you have a moment to answer: I see you cited work relating to the SVGP support in GPflow, so I'm curious why you chose not to (or weren't able to) build this tool on that framework (or on GPyTorch, if that was considered)? I got curious about this in part while wondering whether the inducing point (IP) locations could also be adjusted during variational inference, which I saw is supported in GPflow. I'm a bit of a TensorFlow newbie, but would that perhaps be an easy tweak in your code as well?

willtownes commented 1 year ago

Thanks for your interest! Those are good questions.

GPflow: I actually did try to use GPflow at first, but I found the documentation too confusing and it didn't give me enough flexibility to experiment, which is why I switched to the lower-level TensorFlow interface. TensorFlow Probability also has some GP functionality that I could have used, but again I felt I could better understand what was going on by working with simpler building blocks that have a more thoroughly documented interface.

PyTorch: If I were to implement something like this again, I would definitely use PyTorch instead of TensorFlow. PyTorch is much simpler and easier to understand, and well suited to the experimentation needed for research projects. I have been told TensorFlow is preferable for industry applications, but even that may be waning now. I'm not as familiar with GPyTorch, although the fact that the VNNGP method is available there is very appealing. I feel that VNNGP is the way to scale this type of model to more than a few tens of thousands of cells, where IP methods alone (like my implementation here) start to hit a wall.

Learning the inducing point locations: It's true that this can be done in principle, but I am skeptical it is worth the hassle (it will greatly increase the number of parameters, thus slowing down the optimization and possibly also introducing numerical instability). If you look through the supplement of the MEFISTO paper, they basically agree with this perspective. As long as you disperse the IP locations relatively evenly throughout the data domain, you can get a pretty good result. I did this by using k-means clustering (of the spatial coordinates, NOT the gene expression values). It's not a good idea to use a regular grid, because you want more IPs in regions with denser cell concentrations. Others have suggested more complex strategies like determinantal point processes, but I didn't get a chance to look at that in detail.
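
For anyone reading along, here is a minimal sketch of the k-means idea described above: cluster the spatial coordinates and use the cluster centroids as fixed inducing point locations. This is not the code from this repository. The function name `kmeans_inducing_points`, its parameters, and the toy data are all hypothetical, and it uses a plain NumPy version of Lloyd's algorithm rather than whatever k-means routine the actual implementation calls.

```python
import numpy as np

def kmeans_inducing_points(coords, num_inducing, num_iters=50, seed=0):
    """Hypothetical helper: run Lloyd's k-means on spatial coordinates
    (NOT gene expression) and return the centroids, shape (num_inducing, D)."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    # Initialize the centroids at distinct, randomly chosen cells.
    idx = rng.choice(coords.shape[0], size=num_inducing, replace=False)
    centers = coords[idx].copy()
    for _ in range(num_iters):
        # Squared distance from every cell to every centroid, shape (N, M).
        # This broadcast builds an (N, M, D) array, so it only suits small data.
        d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(num_inducing):
            members = coords[labels == k]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return centers

# Toy 2-D coordinates (made up): a dense patch of cells plus sparse background.
rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.25, 0.25], scale=0.05, size=(900, 2))
sparse = rng.uniform(0.0, 1.0, size=(100, 2))
coords = np.vstack([dense, sparse])

Z = kmeans_inducing_points(coords, num_inducing=20)  # fixed IP locations
```

The returned `Z` would then be treated as a constant during optimization rather than as a trainable parameter. Unlike a regular grid, the centroids follow the cells, so inducing points sit closer together where the cells are more concentrated.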

tmchartrand commented 1 year ago

meant to reply, thanks for the insights @willtownes! feel free to leave open or closed