Closed: TmacTmac1992 closed this issue 5 months ago
Hi, thanks for open-sourcing this awesome work! Have you compared the stage 1-3 training method versus only stage-3 joint training?

With our compute resources (8xA5000, 24 GB memory each), the three-stage training works better than stage-3 joint training alone. A large batch size benefits convergence early in the vector module's training, so the stage-2 warmup is important (the BEV module is pre-trained and frozen, which makes the large batch size feasible). I'm not sure whether this conclusion would hold when training only stage 3 on GPUs with more memory (e.g., A100 80 GB).
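For readers skimming the thread, here is a minimal, framework-free Python sketch of the staged schedule described in the answer above. The module and stage names are assumptions for illustration, not the repository's actual API; the key point is that in stage 2 only the vector module receives gradient updates while the pre-trained BEV module stays frozen, freeing memory for a larger batch.

```python
# Hypothetical illustration of the three-stage training schedule.
# Module names ("bev", "vector") are assumed for this sketch.

def trainable_modules(stage: int) -> set:
    """Return the set of modules that receive gradient updates in a stage."""
    if stage == 1:
        # Stage 1: pre-train the BEV module on its own.
        return {"bev"}
    if stage == 2:
        # Stage 2: warm up the vector module with the BEV module frozen.
        # Frozen BEV weights need no gradient buffers, so a larger batch
        # size fits in the same GPU memory.
        return {"vector"}
    if stage == 3:
        # Stage 3: joint training of both modules.
        return {"bev", "vector"}
    raise ValueError(f"unknown stage: {stage}")
```

In a real training loop this would translate to setting `requires_grad = False` (or the framework's equivalent) on the frozen module's parameters before building the optimizer for each stage.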