yingtaoluo / Spatial-Temporal-Attention-Network-for-POI-Recommendation

Code for a WWW'21 paper. POI recommender system for location/trajectory prediction.
https://doi.org/10.1145/3442381.3449998

Some questions about the model training and the code #13

Closed · ShelveyJ closed this 2 years ago

ShelveyJ commented 2 years ago

Hello, I'm very interested in your work. Thanks for sharing the dataset and the resources. After reading your paper and the related source code, I have some questions.

1. You set `part=100`, so for the NYC dataset of about 1000 users we need to run `train.py` 10 times. Does this mean we train STAN from scratch each time we use a part of the dataset and average the performance, or do we use the parameters trained on the previous parts to train the new part? If it's the first case, wouldn't training on data from only a hundred users at a time lead to overfitting?
2. What is the difference between using only 100 users' data at a time and using the whole dataset, e.g. NYC? How do their results differ in terms of recall?
3. Line 103 of `load.py` is `lens.append(user_len - 2)`. Why do we need to subtract 2 from `user_len`? As far as I can see, the check `if mask_len <= person_traj_len[0] - 2` on line 119 of `train.py` already achieves the same effect. If we subtract 2 from `user_len`, the max length of a user's training sequence would be 96 instead of 98. Am I misinterpreting this?

yingtaoluo commented 2 years ago

First question: since the data is divided by users, each part's 100 users are trained and tested on those same 100 users. Yes, it is possible to use all the data at once instead of averaging over parts of 100 if you have a larger model capacity. We offer the per-part setting as an option to make evaluation easier to reproduce, since many people do not have access to that much compute.

Second question: when all users are used, we find it hard to reproduce the results of some prior papers. To align with the results those papers reported on these datasets, we needed a protocol that is both fair and feasible for reproduction. It is a compromise, definitely not a mandatory setting. If more users lead to lower accuracy, it could indicate a challenge for personalization: we suspect the model capacity we used for reproduction is not large enough to accommodate so many users' personalization (each user has a different embedding). Please feel free to report results in your preferred setting, as long as you use the proposed spatiotemporal embedding and architecture correctly (the main differences from TiSAS are the modifications, i.e. the embedding and the bi-layer design, stated in the introduction).

Third question: the initial sequence length for preprocessing is 100. Training data must not have access to the validation (99th) and test (100th) check-ins in the sequence, so we subtract 2.
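To make the third point concrete, here is a minimal sketch of the leave-last-two-out split implied by `lens.append(user_len - 2)` in `load.py`. The variable names are illustrative, not the repo's actual code:

```python
# Toy trajectory of length 100 (the preprocessing cap mentioned above).
seq = list(range(1, 101))
user_len = len(seq)

train_len = user_len - 2      # what lens.append(user_len - 2) records
train_part = seq[:train_len]  # check-ins 1..98: visible during training
valid_target = seq[-2]        # 99th check-in: validation target
test_target = seq[-1]         # 100th check-in: test target

print(train_len, valid_target, test_target)  # 98 99 100
```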
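And for the first question, a hypothetical sketch of the per-part protocol: each part of 100 users gets a freshly initialized model, each part is tested on its own users, and recall is averaged across parts. `build_model` and `train_and_eval` are placeholders standing in for the repo's actual training loop, not real functions from this codebase:

```python
# Hypothetical per-part evaluation; only the split-by-users logic
# mirrors the repo, the rest is stubbed out for illustration.
import random

PART = 100  # users per part, matching part=100 in the repo

def build_model():
    return object()  # stands in for a freshly initialized STAN

def train_and_eval(model, part_users):
    # stands in for training on these users' trajectories and
    # returning recall on the SAME users' held-out check-ins
    return random.random()

def evaluate_in_parts(all_users):
    recalls = []
    for start in range(0, len(all_users), PART):
        part_users = all_users[start:start + PART]
        model = build_model()  # training restarts from scratch per part
        recalls.append(train_and_eval(model, part_users))
    return sum(recalls) / len(recalls)  # performance averaged over parts

print(evaluate_in_parts(list(range(1000))))  # ~10 parts for NYC's ~1000 users
```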