Closed lzj1769 closed 11 months ago
Not sure of the exact solution, but some possible suggestions:

- try `Z = misc.kmeans_inducing_pts(Xtr, 1000)`
- try the `maxtry` parameter in the `tro.train_model` part.

Let me know if any of those are helpful.
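For context, a minimal sketch of what the first suggestion does, assuming `misc.kmeans_inducing_pts` selects inducing-point locations as k-means centroids of the spatial coordinates (this is an illustrative re-implementation, not the actual NSF code):

```python
import numpy as np

def kmeans_inducing_pts(X, M, n_iter=20, seed=0):
    """Pick M inducing-point locations as k-means centroids of the
    spatial coordinates X (N x D). Minimal Lloyd's-algorithm sketch."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), M, replace=False)].astype(float).copy()
    for _ in range(n_iter):
        # assign each point to its nearest current centroid
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each centroid to the mean of its assigned points
        for m in range(M):
            pts = X[labels == m]
            if len(pts):
                Z[m] = pts.mean(0)
    return Z

Xtr = np.random.default_rng(1).uniform(size=(500, 2))  # toy coordinates
Z = kmeans_inducing_pts(Xtr, 50)
print(Z.shape)  # (50, 2)
```

Fewer inducing points (e.g. 1000 instead of one per observation) shrinks the Gram matrices that must be inverted, which can help both speed and numerical stability.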
Hi,
Thanks for your reply.
I found the problem might be caused by running TF on the GPU, which can be numerically unstable.
Previously I was running NSF on an A100; after moving the computation to the CPU with `os.environ['CUDA_VISIBLE_DEVICES'] = '-1'`, it worked smoothly.
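For reference, a minimal sketch of that workaround; the environment variable must be set before TensorFlow is imported:

```python
import os

# Hide all CUDA devices so TensorFlow falls back to the CPU.
# This must run *before* `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

After this, `tf.config.list_physical_devices('GPU')` should return an empty list.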
Glad you figured it out! I forgot to mention, I only tested the code on CPU with 32-bit precision. I would not recommend using 16-bit precision.
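A quick toy illustration (not NSF code) of why 16-bit precision is risky here: half precision keeps only ~10 mantissa bits, so small updates to values near 1.0 are lost to rounding, while float32 retains them.

```python
import numpy as np

# float16 keeps roughly 3 significant decimal digits, so a small
# increment to a value near 1.0 underflows the rounding; float32 keeps it.
print(np.float32(1.0) + np.float32(1e-4))  # slightly above 1.0
print(np.float16(1.0) + np.float16(1e-4))  # exactly 1.0 -- update lost
```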
Hi, I am experiencing exactly the same problem when trying nsf on a publicly available 10x Visium dataset. I tried many different solutions, including the ones you recommend above, but nothing worked. Here is the code (I also tried matern32 and both Poisson and NB likelihoods):
I would appreciate your help!
@sokratiag Sorry for your difficulties, are you running this on GPU or CPU? If you could please provide the code either as text in the comment or as an attached file, rather than screenshots, that would be helpful for debugging. It looks like the issue is with the initialization. Are you doing any feature selection before trying to fit NSF(H)? What are the number of cells and number of features?
Hi @willtownes,
I am running it locally on CPU - I attach the code.
Many thanks, Sokratia

nsf_prostate_visium.md
@sokratiag could you try running this with NSFH instead of NSF. Sometimes including the nonspatial factors can make things more numerically stable.
Hi @willtownes,
I tried it but unfortunately I got the same error:

```python
L = 10
fit = sfh.SpatialFactorizationHybrid(Ntr, J, L, Z, lik="poi", nonneg=True, psd_kernel=ker)
fit.elbo_avg(Dtr["X"], Dtr["Y"], Dtr["idx"])
fit.init_loadings(Dtr["Y"], X=Dtr["X"])
pp = fit.generate_pickle_path("scanpy", base=mpth)
tro = training.ModelTrainer(fit, pickle_path=pp)
```
output:

```
Temporary checkpoint directory: /var/folders/00/f4hclsjx6fv56zx24vbt8d_r0000gp/T/tmp9v2i6l4r
WARNING:tensorflow:From /Users/user/opt/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow_probability/python/distributions/distribution.py:342: calling MultivariateNormalDiag.__init__ (from tensorflow_probability.python.distributions.mvn_diag) with scale_identity_multiplier is deprecated and will be removed after 2020-01-01.
Instructions for updating:
scale_identity_multiplier is deprecated; please combine it into scale_diag directly instead.
0010 train: 1.431e+04, val: 1.460e+04
0020 train: 1.324e+04, val: 1.370e+04
0030 train: 1.273e+04, val: 1.329e+04
0040 numerical instability (try 1)
0000 learning rate: 5.00e-03
```
Hmm, that seems like a different error; it's not failing right away. Maybe try restarting it with a lower learning rate?
I changed the length scale from 0.1 to 1 and it seems more stable now (it converged). Thank you for your help.
hooray! I'm glad you figured it out.
Interestingly, I'm having the opposite issue - numerical instability when running using CPU, but no issues when running on GPU (all parameters the same)! Any ideas?
An update to the above: decreasing the learning rate by half (to 0.005) for the CPU training seems to stabilize it, and the results appear essentially identical.
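The intuition for why halving the learning rate helps can be seen in a toy gradient-descent example (illustrative only; the exact stability threshold depends on the curvature of the actual loss):

```python
def gd(lr, steps=50, x0=1.0):
    # Gradient descent on f(x) = x**2, whose gradient is 2*x.
    # Each step multiplies x by (1 - 2*lr): stable when |1 - 2*lr| < 1
    # (i.e. lr < 1); beyond that the iterates oscillate and blow up.
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(abs(gd(0.4)))  # converges toward 0
print(abs(gd(1.1)))  # diverges -- the numerical-instability regime
```

The same mechanism applies to any step where the step size is too large for the local curvature; shrinking the learning rate pulls the updates back into the stable regime.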
thank you @tmchartrand !
Hi,
I was trying to run nsf but got the following error:
Basically, I followed the code to preprocess the data as follows:
Then selected 2000 genes:
Then ran nsf as:
Can you let me know how to solve the error?
Thanks and best, Zhijian