yoavram / curveball

Predicting microbial growth in a mixed culture from growth curve data
http://curveball.yoavram.com
Other
14 stars 5 forks source link

Better bounds in sample_params #20

Closed yoavram closed 8 years ago

yoavram commented 9 years ago

Currently the bounds on the sampled params are enforced by simply ommitting out of bounds param values. This decreases number of samples and the deviation. Should sample from a bounded/truncated multivariate normal distribution, instead.

yoavram commented 9 years ago

I can't find a good example for a test.

yoavram commented 9 years ago

So, a good example is ZD564wt:

> df = pd.read_csv("gltA growth curve data.csv")
> df = pd.melt(df, id_vars=['Time','Replicate'], value_name='OD', var_name='Strain')
> df = df[df.Time <= 10]
> df.OD -= df.OD.min()
> m_ZDB564_wt,fig,ax = curveball.models.fit_model(df[df.Strain=='ZDB564 (Cit+ gltAwt)'])
> curveball.models.sample_params(m_ZDB564_wt[0], 100)
...\curveball\models.py:226: RuntimeWarning: covariance is not positive-semidefinite.
  param_samples = pd.DataFrame(param_samples, columns=names)
d:\workspace\curveball_project\code\curveball\models.py:235: UserWarning: Warning: truncated 93 parameter samples; please report at https://github.com/yoavram/curveball/issues, including the data and use case.

The covariance matrix I get from the fitting procedure has large elements which causes many samples to be outside of the legal range:

[[  1.16157014e-09  -1.44749963e-07  -7.29273845e-12   1.00000000e+00]
 [ -1.44749963e-07   1.88978499e-05   1.14607840e-09  -5.13777521e+03]
 [ -7.29273845e-12   1.14607840e-09   4.79221797e-08  -1.40152197e+00]
 [  1.00000000e+00  -5.13777521e+03  -1.40152197e+00   1.00000000e+00]]

I translated the code for a univariate truncated normal distribution from R, so one possibility is to use it and just ignore the correlation between the parameters. Another options is to define some heuristic to when to use this instead of the multivariate distribution.

yoavram commented 9 years ago

I've tried using lmfit's parameters confidence intervals instead of the covar matrix, but the function got stuck. Then I figured out that the params are changing inside the function. But it didn't happen when I tried to reproduce the problem outside of curveball to generate a bug report. So then I figured this is related to the "weird-bug" things, and did what was previously suggested - use all the data points in the fit, instead of the averages and the weights. Then I could get the confidence intervals,

Per @uriobolski suggestion, I can use the lmfit.conf_interval2d function to estimate the pairwise mutual density of parameters, and from this to estimate their covariance. I should probably do this when the variance/covariance I get from the least squares is > 10*estimated param or something similar.

Other benefits:

yoavram commented 8 years ago

This is no longer relevant as param distribution can be estimated using bootstrap with bootstrap_params (#111).