wilsonrljr / sysidentpy

A Python Package For System Identification Using NARMAX Models
https://sysidentpy.org
BSD 3-Clause "New" or "Revised" License
380 stars 77 forks

Model performance changing a lot. #111

Closed himanshupant24 closed 2 months ago

himanshupant24 commented 1 year ago

I tried the model on an example with Y_lag=26 and x_lag=10. I ran the model 2-3 times, using NSE and RMSE as performance measures, and found my NSE changing from 0.45 to -0.3. My question is: is the performance of the model expected to change this much with the same data and configuration?

ericglem commented 1 year ago

I think the best way to assess whether the variability is really that large is to run a proper benchmark (i.e., many more than 2-3 runs) and take some summary statistics (mean, std, median, whatever). Have you tried that?
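Something like this sketch, with a stand-in `fit_and_predict` (a name I made up, not part of sysidentpy) that you would swap for your actual training and prediction code:

```python
import numpy as np

def nse(y_true, y_pred):
    # Nash-Sutcliffe efficiency: 1 - SSE / total sum of squares around the mean
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def fit_and_predict(y_true, seed):
    # Hypothetical stand-in for "train the model and predict"; here it just
    # adds seeded noise so each "run" differs, like a stochastic optimizer would.
    rng = np.random.default_rng(seed)
    return y_true + rng.normal(0.0, 0.1, size=y_true.shape)

y = np.sin(np.linspace(0.0, 10.0, 200))
scores = [nse(y, fit_and_predict(y, s)) for s in range(30)]
print(f"NSE mean={np.mean(scores):.3f} std={np.std(scores):.3f} "
      f"median={np.median(scores):.3f}")
```

If the std across 30 runs is large relative to the mean, the variability is real and worth digging into; if not, the 0.45 vs -0.3 spread was probably bad luck over too few runs.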

wilsonrljr commented 1 year ago

I agree with @ericglem. Besides, as I mentioned by mail, if you could share a small sample of your data and the code you are using, it'd be great. Without some details, it's difficult to check what is happening.

For example: since you are using a neural network with a stochastic optimizer, the performance can change each time you run it. But if the performance is changing too much, we have to look at the data to check whether it is a data-related problem or whether something else is happening behind the scenes in the package implementation.
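If you want to rule the optimizer out, one option is to pin every random seed before each run; a minimal sketch (add the seed call for your network backend as well, e.g. `torch.manual_seed`, if your model is PyTorch-based):

```python
import random
import numpy as np

def set_seeds(seed=42):
    # Pin every RNG the training loop might touch. If the network backend is
    # PyTorch, also call torch.manual_seed(seed) here (omitted to keep this
    # sketch dependency-free).
    random.seed(seed)
    np.random.seed(seed)

set_seeds(0)
a = np.random.normal(size=5)
set_seeds(0)
b = np.random.normal(size=5)
print(np.allclose(a, b))  # → True: the stochastic part is now repeatable
```

With seeds pinned, any remaining run-to-run difference points at the data or the model configuration rather than the optimizer.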

Moreover, you said in issue #110 that you are having some trouble defining the lags for multiple-input cases, so the problem may also be related to a mistake in that part.
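To illustrate what per-input lags mean structurally, here is a toy regressor-matrix builder (illustrative only; the function and names are mine, not sysidentpy's API — in sysidentpy you pass `ylag`/`xlag` to the model class instead of building this matrix yourself):

```python
import numpy as np

def build_lagged_matrix(y, X, ylag, xlags):
    """Toy linear-regressor builder: one column per lagged signal.

    ylag: max output lag; xlags: list with one max lag per input column,
    so two inputs with different memories get different numbers of columns.
    """
    max_lag = max([ylag] + xlags)
    # output lags y[t-1] ... y[t-ylag]
    cols = [y[max_lag - k:len(y) - k] for k in range(1, ylag + 1)]
    # input lags x_j[t-1] ... x_j[t-xlags[j]] for each input column j
    for j, lx in enumerate(xlags):
        cols += [X[max_lag - k:len(X) - k, j] for k in range(1, lx + 1)]
    return np.column_stack(cols)

y = np.arange(10.0)
X = np.column_stack([np.arange(10.0) * 2, np.arange(10.0) * 3])
Phi = build_lagged_matrix(y, X, ylag=2, xlags=[1, 3])
print(Phi.shape)  # → (7, 6): 10 - max_lag rows, 2 + 1 + 3 columns
```

Getting the per-input lag lists wrong silently changes which columns the model can select from, which alone can swing the performance numbers.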

But as I said, I have to look at your code and data (just a sample) to give a proper answer.

himanshupant24 commented 1 year ago

Hi @wilsonrljr, I sent you the data and other information by email on 26th May. My email id is "himanshupant2411@gmail.com". Can you please have a look and give me your insights?

wilsonrljr commented 1 year ago

Hey @himanshupant24 , I'll make some tests with the sample data you sent me.

Regarding the model performance changing: it can absolutely happen in your case (given the details you sent me by mail). You are changing the lags and, with them, the model structure, so the performance can vary a lot depending on the data.
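As a rough illustration of how fast the model form grows with the lags (this assumes a polynomial basis of degree 2; your neural-network case is different, but the candidate-regressor count for a polynomial model follows the standard (n+l)!/(n!·l!) formula):

```python
from math import comb

# With a polynomial model of degree l over n = ylag + xlag lagged signals,
# the number of candidate regressors (including the constant term) is C(n+l, l).
# Using the lags from the first comment, ylag=26 and xlag=10, with degree 2:
n = 26 + 10
degree = 2
candidates = comb(n + degree, degree)
print(candidates)  # → 703
```

With hundreds of candidate terms, small changes in the data or lags can flip which terms get selected, so sizeable performance swings between configurations are not surprising.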

I'll keep you updated.

himanshupant24 commented 1 year ago

Hi @wilsonrljr,

Sorry to bother you again. Did you get a chance to have a look at the dataset?

Thanks, Himanshu

wilsonrljr commented 1 year ago

Hey @himanshupant24 ! I was looking at your case:

First, you said you got the best results when you used water_level as both input and output. This is actually misleading, because you end up with a kind of trivial "causal" relation between the input and output of your system. Maybe I'm not getting the idea right, but that's how it looks to me. Let me know if I'm wrong.

So you tried to use the rain data as an input and got worse results. Compared with using the same signal as both input and output, this is expected. We should work on improving the model that uses rain as an input, but we probably can't reach the same accuracy level as the "causal" setup.
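To make the point concrete, here is a small sketch (synthetic data, not your dataset): even a naive persistence forecast (predict y[t] = y[t-1]) scores a very high NSE on a slowly varying level signal, which is why a model fed its own output can look deceptively good. A rain-driven model should be judged against this kind of baseline instead:

```python
import numpy as np

# Synthetic, slowly varying "water level" series (a random walk around 10.0).
rng = np.random.default_rng(0)
y = 10.0 + np.cumsum(rng.normal(0.0, 0.05, 500))

# Persistence baseline: predict each value with the previous one.
y_true, y_pred = y[1:], y[:-1]
nse = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"persistence NSE = {nse:.3f}")  # typically very close to 1
```

If the rain-as-input model beats this trivial baseline, it is genuinely adding information; merely beating a bad model is not evidence of that.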

In this respect, I want to ask a few more questions:

- Did you remove outliers from your data? I checked the data and there are some outliers in it, so we can try to improve the model a little by processing them.
- Did you try to decimate your data in any way, or are you using all samples in the training process?
- Have you tried models other than neural networks, or is a neural network your goal in this case?

As soon as I have your answers, we can try out some new ideas.

himanshupant24 commented 1 year ago

Hi @wilsonrljr, thanks so much for your response. Please find my answers to your questions below:

“First, you said you got the best results when you used water_level as input and output”: The logic behind it was to use lagged values of the column for forecasting. In this case, I am trying to forecast the value of water_level using its own lags; that's why it is both input and output.

“So you tried to use the rain data as an input and got worse results”: Sorry, I forgot to mention one thing: I used rain data along with water_level as input. I expected better performance because the rain data gives the model extra information. Note: as displayed below, the cross-correlation of the rain data with water_level is high up to 20-25 lags.

[image: cross-correlation between rain and water_level over lags]
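For reference, the lag-by-lag correlation check above can be sketched like this (synthetic rain/water_level series for illustration; replace them with the real columns):

```python
import numpy as np

# Synthetic data: water_level responds slowly to rain (a 20-sample moving average).
rng = np.random.default_rng(1)
rain = rng.gamma(2.0, 1.0, 600)
water_level = np.convolve(rain, np.ones(20) / 20, mode="full")[:600]

def lagged_corr(x, y, k):
    # Pearson correlation of x[t] with y[t + k]
    n = len(x) - k
    return np.corrcoef(x[:n], y[k:k + n])[0, 1]

corrs = [lagged_corr(rain, water_level, k) for k in range(31)]
best = int(np.argmax(corrs))
print(f"strongest correlation at lag {best}: {corrs[best]:.2f}")
```

Lags with high correlation are natural candidates for the xlag setting of the rain input.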

"Did you remove outliers from your data? I checked the data and there are some outliers in it, so we can try to improve the model a little by processing them.": Actually, these are not outliers, because those points represent high water levels caused by rain/blockage.

"Did you try to decimate your data in any way, or are you using all samples in the training process?": Yes, the data is split into train, test, and validation in the ratio of 65/35/35.

"Have you tried models other than neural networks, or is a neural network your goal in this case?": Yes, I started with ARIMA and Prophet.

[image: ARIMA and Prophet results]

If you want data from a different sensor, I can provide it. Let me know if you need any clarification from my side.

wilsonrljr commented 2 months ago

First, I'm sorry if I didn't answer everything you needed (I don't remember whether we also discussed this on Discord or by mail). I'm closing this issue because you have probably found a solution by now. However, if you still need any help, please contact me. Because more people have started using the package, I now have a specific slot in my agenda every week just to assist people with questions about it.

Again, sorry if your question was "ghosted".