wouterkool / attention-learn-to-route

Attention based model for learning to solve different routing problems
MIT License
1.04k stars · 337 forks

Question - how long does training the TSP model take? #40

Open LuciaTajuelo opened 3 years ago

LuciaTajuelo commented 3 years ago

Hi!

I'm executing the following command to train a TSP model:

`python run.py --graph_size 20 --baseline rollout --run_name 'tsp20_rollout'`

I've run the code for 1 epoch and it took 10 hours. Is this expected?

I'm running the code under Windows on an MSI laptop with a CUDA-enabled GPU.

Thanks in advance!

wouterkool commented 3 years ago

Hi and thanks for trying the code! Off the top of my head, training a single epoch with default settings should take around 5 minutes on a 1080Ti GPU. This does not sound as if you are actually using the GPU. Please verify GPU usage through the `nvidia-smi` command (not sure whether this exists under Windows) and check that tensors are actually moved to the GPU (e.g. add some print statements printing `tensor.device`).
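A minimal way to do that check (a sketch using plain PyTorch, not code from this repo) is to inspect the device of a module's parameters:

```python
import torch

def report_device(module: torch.nn.Module) -> str:
    # A module's parameters normally all live on one device;
    # inspect the first one (assumes the module has parameters).
    return str(next(module.parameters()).device)

layer = torch.nn.Linear(4, 4)
print(report_device(layer))  # 'cpu' unless the layer was moved with .to('cuda')
```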

LuciaTajuelo commented 3 years ago

Hi!

Thanks for your quick reply. I've checked `nvidia-smi`, but I don't really see any usage from Python when running the code. I've added some prints to the code: in run.py, line 58, I've added `print(opts.device)`, and I've added some prints in nets/graph_encoder.py, line 44:

    print(self.W_query)
    print(self.W_key)
    print(self.W_val)

I see that opts.device is cuda:0, but the tensors are on the CPU. However, I'm not sure whether this is the proper way to test it.

To sum up, I'd say that I'm not actually using the GPU when running the code. How can I use it?

Thanks a lot!!
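For reference, the usual PyTorch pattern is to move both the model and every input batch to the chosen device; this is a generic sketch, not the repo's actual training code:

```python
import torch

# Fall back to CPU when CUDA is unavailable (e.g. a misconfigured driver).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 8).to(device)  # moves all parameters
batch = torch.randn(2, 8).to(device)      # each input tensor must be moved too

out = model(batch)
print(out.device)  # same device as `model` and `batch`
```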

XxwlW commented 2 years ago

Hi! I have the same problem as you. Have you solved it?

Hessen525 commented 2 years ago

I have the same problem, too.

Sinyo-Liu commented 2 years ago

Perhaps you can decrease `--epoch_size` from 1280000 to 128000, 12800, or 1280; the time per epoch will drop proportionally.

zbh888 commented 1 year ago

Same here.

wouterkool commented 1 year ago

Sorry this caused trouble. For people running into this, please have a look at #11; I'll copy it here for reference:

Hi, I had the same issue, and here is how I solved it: it turns out this is related to `num_workers=1`. If you change it to `num_workers=0`, the code will run properly. This is on line 78 of train.py:

    training_dataloader = DataLoader(training_dataset, batch_size=opts.batch_size, num_workers=1)

At first I thought this was because of `enumerate`, but it is not. I hope this solves it for you. Cheers
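The fix boils down to loading batches in the main process rather than in a worker subprocess. A self-contained sketch, with a toy dataset standing in for the repo's `training_dataset` (assumption):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the repo's training_dataset (assumption).
dataset = TensorDataset(torch.randn(16, 3))

# num_workers=0 loads batches in the main process; num_workers=1 spawns a
# worker subprocess, which triggered the hang reported in this issue.
training_dataloader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in training_dataloader:
    print(batch.shape)  # torch.Size([4, 3])
```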