zhaoyl18 / SEIKO

SEIKO is a novel reinforcement learning method to efficiently fine-tune diffusion models in an online setting. Our methods outperform all baselines (PPO, classifier-based guidance, direct reward backpropagation) for fine-tuning Stable Diffusion.
https://arxiv.org/abs/2402.16359

Question about the implementation of bootstrap reward #3

Closed: Guo-Stone closed this issue 3 weeks ago

Guo-Stone commented 1 month ago

Dear authors, I would like to ask a question about the code implementation of the bootstrap reward. In my opinion, the bootstrap method involves training several models on different resampled datasets, and the reward $r$ and its uncertainty $g$ should then be the mean and standard deviation of the outputs of all the bootstrapped models. In your code below, why do you only choose the best output as the reward $r$ and not take the uncertainty $g$ into consideration?

optimistic_rewards, _ = torch.max(stacked_outputs, dim=1, keepdim=True)
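
For reference, a minimal sketch of the two alternatives being contrasted here, assuming stacked_outputs is a [batch, num_heads] tensor of per-head reward predictions (the bonus coefficient is a placeholder, not a value from the repo):

```python
import torch

# Illustrative only: per-head reward predictions, shape [batch, num_heads].
stacked_outputs = torch.randn(8, 4)

# What the question suggests: mean plus a scaled deviation bonus.
mean_r = stacked_outputs.mean(dim=1, keepdim=True)
std_r = stacked_outputs.std(dim=1, keepdim=True)
ucb_rewards = mean_r + 1.0 * std_r  # the coefficient 1.0 is a placeholder

# What the snippet above does: take the per-sample max over heads.
optimistic_rewards, _ = torch.max(stacked_outputs, dim=1, keepdim=True)
```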

zhaoyl18 commented 3 weeks ago

Hi, thanks for your interest in our work. Indeed, in statistical theory, bootstrapping refers to resampling data. In RL, however, bootstrapping is used as a way to encourage exploration; see https://rail.eecs.berkeley.edu/deeprlcourse/deeprlcourse/static/slides/lec-13.pdf

When implementing bootstrapping in RL, there are several common considerations.

  1. Training N large neural networks is expensive. To save computation at training and inference time, a useful practice is to construct a shared backbone with multiple heads, so that most of the architecture is shared across the bootstrapped models (see page 35 of the linked slides). In our implementation, we found that 4 predictor heads work well.
  2. In practice, researchers have found bootstrapping to be very effective for RL exploration. Even training two different neural networks and taking the min/max of their outputs already works well.
  3. Regarding inference, there are several options. The most principled one is closely related to Thompson sampling (see page 36 of the slides), but picking the prior distribution is highly manual; taking an average of all outputs is, to some extent, related to this. In RL practice, taking the min/max or softmin/softmax of the outputs are all simple yet effective approaches.
  4. For our work, it is possible to use (1) softmax, (2) max, or (3) average. Experimentally, we found that (1) requires choosing a temperature parameter, which is hard to tune and can be avoided by using (2); our implementation is based on (2). A rough sketch of the shared-backbone setup and these reductions follows this list.
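
As an illustration of the points above (this is not the repo's actual code; the layer sizes, number of heads, and temperature are placeholder values), a shared-backbone reward model with multiple heads and the three candidate reductions might look like the following sketch:

```python
import torch
import torch.nn as nn


class BootstrappedReward(nn.Module):
    """Sketch: shared backbone with multiple bootstrapped reward heads.

    Sizes are illustrative; SEIKO's actual reward model operates on
    diffusion-model features rather than generic feature vectors.
    """

    def __init__(self, feature_dim=512, hidden_dim=256, num_heads=4):
        super().__init__()
        # Most parameters live in the shared trunk (point 1).
        self.backbone = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
        )
        # Lightweight per-head predictors, each trained on resampled data.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_heads)]
        )

    def forward(self, features):
        h = self.backbone(features)
        # Shape [batch, num_heads]: one scalar prediction per head.
        return torch.cat([head(h) for head in self.heads], dim=1)


def optimistic_reward(stacked_outputs, mode="max", temperature=1.0):
    """Reduce per-head predictions to a single optimistic reward (points 3-4)."""
    if mode == "average":
        return stacked_outputs.mean(dim=1, keepdim=True)
    if mode == "softmax":
        # Softmax-weighted average; requires tuning the temperature.
        weights = torch.softmax(stacked_outputs / temperature, dim=1)
        return (weights * stacked_outputs).sum(dim=1, keepdim=True)
    # Default: per-sample max over heads, as in the snippet quoted above.
    values, _ = torch.max(stacked_outputs, dim=1, keepdim=True)
    return values


# Usage sketch:
model = BootstrappedReward()
features = torch.randn(8, 512)
rewards = optimistic_reward(model(features), mode="max")
```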