NumericalDivergenceError

lzj1769 commented 1 year ago

Hi,

I was trying to run nsf but got the following error:

NumericalDivergenceError                  
Traceback (most recent call last)
File <timed eval>:1

File nsf-paper/utils/training.py:325, in ModelTrainer.train_model(self, lr_reduce, maxtry, verbose, ckpt_freq, *args, **kwargs)
    323 except (tf.errors.InvalidArgumentError,NumericalDivergenceError) as err: #cholesky failure
    324   tries+=1
--> 325   if tries==maxtry: raise err
    326   #else: #not yet reached the maximum number of tries
    327   if verbose:

File nsf-paper/utils/training.py:319, in ModelTrainer.train_model(self, lr_reduce, maxtry, verbose, ckpt_freq, *args, **kwargs)
    317 while tries < maxtry:
    318   try:
--> 319     self._train_model_fixed_lr(mgr, *args, ptic=ptic, wtic=wtic,
    320                                verbose=verbose, ckpt_freq=ckpt_freq,
    321                                **kwargs)
    322     if self.epoch>=len(self.loss["train"])-1: break #finished training
    323   except (tf.errors.InvalidArgumentError,NumericalDivergenceError) as err: #cholesky failure

File nsf-paper/utils/training.py:246, in ModelTrainer._train_model_fixed_lr(self, ckpt_mgr, Dtrain, Ntr, Dval, S, verbose, num_epochs, ptic, wtic, ckpt_freq, kernel_hp_update_freq, status_freq, span, tol, pickle_freq)
    243 self.loss["train"][i] = trl
    245 if not np.isfinite(trl) or trl>self.loss["train"][1]:
--> 246   raise NumericalDivergenceError
    248 if i%status_freq==0 or i==num_epochs:
    249   if Dval:

NumericalDivergenceError:

Basically, I followed the code to precess the data as follows:

adata = sc.read_h5ad("../../../data/10x_visium_DLPFC/raw/151508.h5ad")
adata.var_names_make_unique()
adata.var["mt"] = adata.var_names.str.startswith("mt-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata = adata[adata.obs.pct_counts_mt < 20]
sc.pp.filter_genes(adata, min_cells=100)
sc.pp.filter_cells(adata, min_counts=100)

adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata, inplace=True, layers=None, key_added="sizefactor")
sc.pp.log1p(adata)

Then selecte 2000 genes:

# %% normalization, feature selection and train/test split
adata.var['deviance_poisson'] = preprocess.deviancePoisson(adata.layers["counts"])
o = np.argsort(-adata.var['deviance_poisson'])
idx = list(range(adata.shape[0]))
random.shuffle(idx)
adata = adata[idx,o]
adata = adata[:,:2000]

Then run nsf as

#raw count data for NSF
Dtr,Dval = preprocess.anndata_to_train_val(adata, layer="counts", sz="scanpy")
Dtr_n, Dval_n = preprocess.anndata_to_train_val(adata)
fmeans,Dtr_c,Dval_c = preprocess.center_data(Dtr_n,Dval_n) #centered features

Xtr = Dtr["X"] #note this should be identical to Dtr_n["X"]
Ntr = Xtr.shape[0]
Dtf = preprocess.prepare_datasets_tf(Dtr, Dval=Dval,shuffle=False)
Dtf_n = preprocess.prepare_datasets_tf(Dtr_n,Dval=Dval_n,shuffle=False)
Dtf_c = preprocess.prepare_datasets_tf(Dtr_c,Dval=Dval_c,shuffle=False)

Z = misc.kmeans_inducing_pts(Xtr, 500)
M = Z.shape[0]
ker = tfk.MaternThreeHalves
S = 3 #samples for elbo approximation

J = 2000

fit = sf.SpatialFactorization(J, 12, Z, psd_kernel=ker, nonneg=True,lik="poi")
fit.elbo_avg(Xtr, Dtr["Y"],sz=Dtr["sz"])
fit.init_loadings(Dtr["Y"], X=Xtr, sz=Dtr["sz"])
fit.elbo_avg(Xtr,Dtr["Y"], sz=Dtr["sz"])

tro = training.ModelTrainer(fit)

%time tro.train_model(*Dtf)

Can you let me know how to solve the error?

Thanks and best, Zhijian

willtownes commented 1 year ago

not sure of the exact solution but some possible suggestions:

use a larger number of inducing points eg Z = misc.kmeans_inducing_pts(Xtr, 1000)
decrease the starting learning rate, in the ModelTrainer constructor. The default is lr=0.01.
increase the number of repetitions before it gives up, the maxtry parameter in the tro.train_model part.
try using negative binomial likelihood instead of Poisson
try fitting a nonspatial factor model and see if it also has numerical problems
check for extreme values in the data (genes that are all zero, cells that are all zero, very high counts, negative numbers)
try fitting with a smaller number of factors
try using a Matern 1/2 kernel
try different kernel hyperparameter initializations
use more samples in the ELBO approximation

Let me know if any of those are helpful.

lzj1769 commented 1 year ago

Hi,

Thanks for your reply.

I found the problem might be caused by running TF on GPU which can be numerically unstable.

Previously, I was running NSF with A100, now I moved the computation to CPU with os.environ['CUDA_VISIBLE_DEVICES'] = '-1', it worked smoothly.

willtownes commented 1 year ago

Glad you figured it out! I forgot to mention, I only tested the code on CPU with 32-bit precision. I would not recommend using the 16 bit precision.

sokratiag commented 1 year ago

Hi, I am experiencing exactly the same problem when trying nsf to a visium 10x publicly availble dataset. I tried many different solutions, including the ones you recommend above but nothing worked. Here is the code (tried also matern32 and both Poisson and NB likelihoods) :

Screenshot 2023-07-21 at 14 24 09

Screenshot 2023-07-21 at 14 24 30

Screenshot 2023-07-21 at 14 25 26

I would appreciate your help!

willtownes commented 12 months ago

@sokratiag Sorry for your difficulties, are you running this on GPU or CPU? If you could please provide the code either as text in the comment or as an attached file, rather than screenshots, that would be helpful for debugging. It looks like the issue is with the initialization. Are you doing any feature selection before trying to fit NSF(H)? What are the number of cells and number of features?

sokratiag commented 11 months ago

Hi @willtownes,

I am running it locally on CPU - I attach the code.

Many thanks, Sokratia nsf_prostate_visium.md

willtownes commented 11 months ago

@sokratiag could you try running this with NSFH instead of NSF. Sometimes including the nonspatial factors can make things more numerically stable.

sokratiag commented 11 months ago

Hi @willtownes,

I tried it but unfortunately I got the same error: L = 10 fit = sfh.SpatialFactorizationHybrid(Ntr, J, L, Z, lik="poi", nonneg=True, psd_kernel=ker) fit.elbo_avg(Dtr["X"],Dtr["Y"],Dtr["idx"]) fit.init_loadings(Dtr["Y"],X=Dtr["X"]) pp = fit.generate_pickle_path("scanpy",base=mpth) tro = training.ModelTrainer(fit,pickle_path=pp)

output:

Temporary checkpoint directory: /var/folders/00/f4hclsjx6fv56zx24vbt8d_r0000gp/T/tmp9v2i6l4r WARNING:tensorflow:From /Users/user/opt/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow_probability/python/distributions/distribution.py:342: calling MultivariateNormalDiag.init (from tensorflow_probability.python.distributions.mvn_diag) with scale_identity_multiplier is deprecated and will be removed after 2020-01-01. Instructions for updating: scale_identity_multiplier is deprecated; please combine it into scale_diag directly instead. 0010 train: 1.431e+04, val: 1.460e+04 0020 train: 1.324e+04, val: 1.370e+04 0030 train: 1.273e+04, val: 1.329e+04 0040 numerical instability (try 1) 0000 learning rate: 5.00e-03

willtownes commented 11 months ago

hmm, that seems like a different error, it's not failing right away, maybe try initiating it with a lower learning rate?

sokratiag commented 11 months ago

I changed the length scale from 0.1 to 1 and it seems more stable now (it converged). Thank you for your help.

willtownes commented 11 months ago

hooray! I'm glad you figured it out.

tmchartrand commented 6 months ago

Interestingly, I'm having the opposite issue - numerical instability when running using CPU, but no issues when running on GPU (all parameters the same)! Any ideas?

tmchartrand commented 6 months ago

an update to the above - decreasing lr by half (to 0.005) for the CPU training seems to stabilize it, and the results appear essentially identical.

willtownes commented 6 months ago

thank you @tmchartrand !

willtownes / nsf-paper

NumericalDivergenceError #6