Training slows down by 2-3x in later epochs (3.0.1)

When training Conformer U2++ with version 3.0.1, I noticed that the training time per batch grew from 1-2 minutes in early epochs to 4-5 minutes in later epochs. I have checked the items listed below and they all look normal. Could you suggest anything else I should look at? Thanks!

Change

1. In this version, I added an online data augmentation step (after speed perturbation and before spectral augmentation) that dynamically mixes noise and reverberation into the audio during training. It is not enabled for the validation set.
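
For context, the kind of augmentation described above can be sketched as follows. This is only an illustration, not the code actually used in my run: it assumes additive noise scaled to a target SNR and reverberation applied by convolving with a room impulse response, and the function names `add_noise_at_snr` and `add_reverb` are hypothetical.

```python
import numpy as np


def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
    # Repeat the noise if it is shorter than the speech, then crop to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db / 10).
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve `speech` with a room impulse response, keeping the input length."""
    # Normalize the impulse response to unit energy (an assumption of this sketch).
    rir = rir / (np.sqrt(np.sum(rir ** 2)) + 1e-12)
    # Direct convolution: cost grows with len(speech) * len(rir).
    return np.convolve(speech, rir, mode="full")[: len(speech)]
```

Both functions run on the CPU inside the data pipeline, so their cost depends on the length of each utterance and of the noise or impulse response clip.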

Observation

1. Only training has slowed down; validation runs at the same speed as before.
2. The overall loss is still decreasing.

Investigation

1. I am using shards mode to read data over HTTP, so I checked the network and IO of the machine that stores the samples. Both are normal.
2. I checked the network, CPU, IO, and disk of the training machine and found no performance bottleneck.
3. I checked the GPU on the training machine. Memory usage, power, and temperature are all normal, and GPU utilization is still at 100%.
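
One further check I have not run yet is to split the per-batch time into time spent waiting on the data pipeline and time spent in the training step. A minimal sketch, assuming the batches come from any Python iterable and the training step is a plain callable; `time_data_vs_compute` is a hypothetical helper, not part of wenet:

```python
import time
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def time_data_vs_compute(batches: Iterable[T], step: Callable[[T], None]) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent inside `step`)."""
    data_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the data pipeline produces a batch
        except StopIteration:
            break  # the final, empty fetch is not counted
        t1 = time.perf_counter()
        # On a GPU, `step` should synchronize before returning (for example with
        # torch.cuda.synchronize()), otherwise asynchronous kernels are not counted.
        step(batch)
        t2 = time.perf_counter()
        data_s += t1 - t0
        step_s += t2 - t1
    return data_s, step_s
```

If most of the time lands in the first number, the data pipeline (including the online augmentation) is the more likely cause; if it lands in the second, the model step itself is.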
