Open · mayang113 opened this issue 2 months ago
It's normal to see differences between the generated samples; that is how the model captures variability. GANO is a generative model, which is stochastic. Basically, you cannot make one-to-one comparisons; instead, compare statistics calculated from the generated samples.
Let me explain my point in detail:
Unlike deterministic numerical simulation methods, GANO is stochastic and takes a sample from a Gaussian process as input. You cannot make a one-to-one comparison for a stochastic method. That's why we group events to compare statistics (Figure 9 in our paper). The one-to-one comparison in the supplementary materials (Figure S3) is meant to give you a general sense of the model's performance, but it is not statistically rigorous. Quantitative comparison should involve engineering metrics such as FAS, RotD50, and residual analysis for the median and variability derived from the generated samples.
You can freely change the number of generated samples per condition; 100 is not a fixed number. When calculating the statistics, you should also quantify the uncertainty (standard deviation in log scale). Uncertainty quantification is very important in earthquake engineering (that's why residual analysis is widely used).
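To make the statistics point concrete, here is a minimal sketch of summarizing an ensemble of generated samples by its log-space median and standard deviation. The synthetic PGA values below are made up for illustration; only the log-space mean/std pattern reflects the practice described above.

```python
import numpy as np

# Hypothetical example: 100 GANO realizations for one scenario, summarized
# by peak ground acceleration (PGA). The samples here are synthetic.
rng = np.random.default_rng(0)
samples = 10 ** rng.normal(loc=-1.0, scale=0.3, size=100)  # toy PGA values

log_samples = np.log10(samples)
median = 10 ** np.mean(log_samples)    # mean in log space -> geometric median
log_std = np.std(log_samples, ddof=1)  # variability: std in log10 units

print(median, log_std)
```

The standard deviation in log scale is the uncertainty measure mentioned above; comparing it (and the median) between observed and generated ensembles is what replaces one-to-one waveform matching.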
I also recommend reading the previous work developed by our group, which may help you better understand the validation pipeline: https://pubs.geoscienceworld.org/ssa/bssa/article/112/4/1979/613199/Data-Driven-Synthesis-of-Broadband-Earthquake
I would suggest reading the references in the introduction of our paper, which can be useful for understanding how people validate different waveform generation methods.
Hello author. In your paper, why are there four acceleration time histories for a specific scenario? Shouldn't there be only one?
My understanding: the right panel shows 4 acceleration time histories randomly selected from the 100 generated by GANO under the conditions M 4.5, 50 km, 300 m/s, shallow crustal. The left panel shows records randomly selected within a neighborhood of the same condition (M 4.5, 50 km, 300 m/s, shallow crustal).
Yes, your understanding is correct.
The dataset you collected is always discrete w.r.t. the conditional variables, so a perfect match for any combination of conditions is almost impossible. Thus, for a specific scenario, you actually define narrow bands for each condition. For example, "Observation M 4.5, 50 km, 300 m/s" represents M 4.4-4.6, 40-60 km, 250-350 m/s. If you have a large dataset, you can choose a smaller bandwidth for each condition, which will make the comparison more accurate.
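The band selection described above can be sketched as a boolean mask over the catalog's conditional variables. The field names, catalog values, and bandwidths below are illustrative, not taken from the released code.

```python
import numpy as np

# Hypothetical catalog of 1000 records with their conditional variables.
rng = np.random.default_rng(1)
mag = rng.uniform(3.0, 7.0, size=1000)     # magnitude
rrup = rng.uniform(10.0, 200.0, size=1000)  # rupture distance, km
vs30 = rng.uniform(150.0, 800.0, size=1000)  # site Vs30, m/s

# Narrow bands around the target scenario M 4.5, 50 km, 300 m/s.
mask = (
    (np.abs(mag - 4.5) <= 0.1)        # M 4.4-4.6
    & (np.abs(rrup - 50.0) <= 10.0)   # 40-60 km
    & (np.abs(vs30 - 300.0) <= 50.0)  # 250-350 m/s
)
n_in_bin = int(mask.sum())
print(n_in_bin)  # number of observations falling in the bin
```

Shrinking the half-widths (0.1, 10.0, 50.0) tightens the bin, which improves comparability at the cost of fewer observations per bin.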
Hello author
It's based on the acceleration, and the FAS is normalized (FAS_normalized = FAS * dt).
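As a small sketch of the FAS * dt convention mentioned above, computed from a toy acceleration signal (the signal and sampling interval are made up):

```python
import numpy as np

dt = 0.01                            # sampling interval, s (assumed)
t = np.arange(0, 10, dt)
acc = np.sin(2 * np.pi * 2.0 * t)    # toy 2 Hz acceleration time history

fas = np.abs(np.fft.rfft(acc))       # raw one-sided FFT amplitude
fas_normalized = fas * dt            # the normalization used in this thread
freqs = np.fft.rfftfreq(len(acc), d=dt)

peak_freq = freqs[np.argmax(fas_normalized)]
print(peak_freq)  # 2.0 (the toy signal's frequency)
```

Multiplying by dt makes the discrete amplitude approximate the continuous Fourier transform, so spectra from records with different sampling rates remain comparable.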
Yes, in the shown scenario we have 76 observations in the bin. For each observation, we provide the same metadata (mag, vs30, rrup, f_type) to GANO and ask it to generate only 1 realization. In this way, the number of synthesized records matches the number of observations.
Hello, author
Yes, that's the postprocessing code. If you train the model with acceleration data, the GANO model will generate acceleration time histories
You can modify the code; we added that part because we are not allowed to share our dataset. That function converts the normalized log10_PGA back to the actual PGA, and it is a built-in method of the SeisData class.
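For readers without access to the dataset, here is a hypothetical illustration of that kind of denormalization. The actual scaling constants live inside the authors' SeisData class; the mean and std below are made up, and only the undo-z-score-then-exponentiate pattern is the point.

```python
import numpy as np

# Assumed dataset statistics for log10(PGA); placeholders, not real values.
log_pga_mean, log_pga_std = -2.0, 0.8

def denormalize_pga(x_norm):
    # Undo the z-score normalization in log space, then return to PGA.
    log_pga = x_norm * log_pga_std + log_pga_mean
    return 10.0 ** log_pga

pga = denormalize_pga(0.0)
print(pga)  # a normalized value of 0 maps back to 10**mean
```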
Thanks to the author for answering my questions; GANO is so advanced that no one around me has this knowledge.
Yes, change the 3 to 2
For the conditional variables, I would suggest keeping only magnitude and rupture distance, since Vs30 is a constant, not a variable, in your case.
I didn't understand your question well. What do you mean by "the generated data is a multiple of the recorded data"? Besides, we usually use log-log plots for the Fourier amplitude spectrum.
From the figure you showed, there is a constant offset between the ground-truth mean and the predicted mean. You may need to check your postprocessing method.
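A quick residual check makes this kind of error easy to spot: a constant offset in log space corresponds to a constant multiplicative factor (e.g., a unit or dt mistake) in the postprocessing. The data below are synthetic, with a deliberate factor-of-2 bias injected.

```python
import numpy as np

rng = np.random.default_rng(3)
truth = 10 ** rng.normal(-1.0, 0.3, size=200)  # toy observed amplitudes
pred = truth * 2.0                             # injected constant bias

# Residuals in log10 space; a scaling error shows up as a flat offset.
residual = np.log10(pred) - np.log10(truth)
bias = residual.mean()
print(bias)  # ~0.301 = log10(2), constant across all samples
```

If the residual is flat like this (near-zero scatter around a nonzero mean), look for a constant factor in the pipeline rather than a modeling problem.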
Hello author, I am once again having problems 😭😭😭. When I use the trained model to generate acceleration time histories, the program is set to generate 100 samples, but there are big differences among these 100 samples. Is this normal? Is it OK to choose the best one of the 100 (the one that best fits the observed data) as the final generated result?