Closed roussel-ryan closed 1 month ago
A general comment: I'd suggest we do a more complete implementation with more randomization as a follow-up PR. There is also the question of how to compute feasibility: just drop infeasible points as is done now, or sample each candidate to determine feasibility probabilistically, potentially handling borderline candidates more accurately.
More specifically:
BoTorch does a more complicated process with posterior sampling + Pareto downselect in the pruning function for the qNEHVI family - see here. Note how, if any of the samples at a baseline point are infeasible, the point is set equal to the reference point and excluded from the Pareto front. This encodes the model's noise knowledge into the surviving candidates and should in general be better than removing Pareto points based on observed data, at the cost of extra computation.
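To make the idea concrete, here is a small NumPy sketch of that feasibility treatment (not the BoTorch implementation; function and argument names are illustrative): any baseline point with at least one infeasible posterior sample gets its objectives forced down to the reference point, so it cannot survive the non-dominated filter.

```python
import numpy as np

def prune_with_feasibility(obj_samples, con_samples, ref_point):
    """Illustrative sketch of qNEHVI-style pruning with feasibility.

    obj_samples: (n_samples, n_points, n_obj) posterior objective draws
    con_samples: (n_samples, n_points) constraint draws, feasible if <= 0
    ref_point:   (n_obj,) hypervolume reference point (maximization)
    Returns a boolean mask of points that survive pruning.
    """
    mean_obj = obj_samples.mean(axis=0)
    # A point is infeasible if ANY posterior sample violates the constraint.
    infeasible = (con_samples > 0).any(axis=0)
    # Force infeasible points to the reference point so they drop out.
    mean_obj[infeasible] = ref_point
    # Non-dominated mask under maximization.
    n = mean_obj.shape[0]
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        ge = (mean_obj >= mean_obj[i]).all(axis=1)
        gt = (mean_obj > mean_obj[i]).any(axis=1)
        dominated[i] = (ge & gt).any()
    return ~dominated & ~infeasible
```

This is where the "sample each candidate" feasibility question above would plug in: the per-sample constraint draws encode the model's uncertainty about borderline points.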
In fact, since `prune_inferior_points_multi_objective` will frequently get called in the acquisition function, it might be useful to cache the baseline as part of PF initialization and feed the result in as the new `X_baseline` with `prune_baseline=False`, to save on repeating the Pareto front computation.
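The caching pattern being proposed might look roughly like this (a hypothetical sketch; the class and names are not from the codebase): run the expensive pruning once at PF initialization, then reuse the cached result everywhere the baseline is needed.

```python
import numpy as np

class CachedBaseline:
    """Hypothetical sketch: prune once, then hand the cached result to
    each acquisition construction as X_baseline with pruning disabled
    (the prune_baseline=False idea), so the Pareto computation is not
    repeated on every acquisition call."""

    def __init__(self, prune_fn, X):
        self._prune_fn = prune_fn  # e.g. a pruning routine like BoTorch's
        self._X = X
        self._pruned = None

    def get(self):
        if self._pruned is None:  # prune only on first access
            self._pruned = self._prune_fn(self._X)
        return self._pruned
```

In BoTorch terms this would amount to passing `X_baseline=cache.get()` together with `prune_baseline=False` when constructing the qNEHVI acquisition function.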
For choosing which points to use when `len(initial_points) > num_restarts`, it might be good to use stochastic behavior. First, use the 'around best' logic (see here) to generate `raw_pf_samples` points, with `raw_pf_samples = num_restarts * factor`. Then, use the same stochastic logic as in the BoTorch `raw_samples` parameter: compute the acquisition function at all points and pick exactly `num_restarts` probabilistically, with a bias toward higher acquisition values (see here). One can argue that acquisition function values will be fairly similar around each point unless the perturbations are large, in which case this complicated procedure will not be particularly useful. A simpler alternative is to only generate at most `num_restarts` candidates without a downselect, for example by picking `num_restarts` Pareto points using the weighted procedure above and then generating one nudged candidate per point. This needs benchmarking to see whether it is worth it. The overall goal here is to make initialization not use completely random parts of the Pareto front, but be softly biased towards more promising areas.
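A sketch of the generate-then-downselect procedure described above, in plain NumPy (function names and the Boltzmann weighting are illustrative, in the spirit of BoTorch's `initialize_q_batch`, not its actual code):

```python
import numpy as np

def nudge(points, factor, sigma, bounds, rng):
    """'Around best'-style generation sketch: tile the Pareto points
    `factor` times, add Gaussian noise, and clip to the box bounds."""
    base = np.tile(points, (factor, 1))
    noisy = base + rng.normal(scale=sigma, size=base.shape)
    return np.clip(noisy, bounds[0], bounds[1])

def select_restarts(candidates, acq_values, num_restarts, eta=2.0, rng=None):
    """Downselect sketch: sample exactly num_restarts candidates without
    replacement, with probability biased toward higher acquisition values
    via a Boltzmann transform of the standardized values."""
    rng = np.random.default_rng() if rng is None else rng
    z = acq_values - acq_values.mean()
    std = acq_values.std()
    if std > 0:
        z = z / std  # standardize before exponential weighting
    weights = np.exp(eta * z)
    probs = weights / weights.sum()
    idx = rng.choice(len(candidates), size=num_restarts,
                     replace=False, p=probs)
    return candidates[idx]
```

With `num_restarts == len(candidates)` the downselect degenerates to keeping everything, which corresponds to the simpler one-nudged-candidate-per-point scheme suggested above.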
It would be interesting to plot how many points are on the Pareto front vs. dimensionality. I have a feeling `num_restarts` might need to be scaled up considerably for larger problems if the fully random scheme is kept.
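A quick numerical check one could run (my own sketch, not from the PR): for i.i.d. random objective values, count the non-dominated points as the number of objectives grows. The Pareto front fraction grows rapidly with objective dimensionality, which supports scaling `num_restarts` with problem size.

```python
import numpy as np

def pareto_count(Y):
    """Count non-dominated rows of Y under maximization."""
    n = Y.shape[0]
    count = 0
    for i in range(n):
        ge = (Y >= Y[i]).all(axis=1)
        gt = (Y > Y[i]).any(axis=1)
        if not (ge & gt).any():  # no row dominates row i
            count += 1
    return count

rng = np.random.default_rng(0)
N = 500
# Expected Pareto front size for uniform points scales roughly as
# (ln N)^(m-1) / (m-1)! with m objectives, i.e. steeply with m.
counts = {m: pareto_count(rng.random((N, m))) for m in (2, 4, 6)}
```

Plotting `counts` against `m` (and against input dimension for a real test problem) would give a concrete basis for choosing `num_restarts`.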
@nikitakuklev for the record here, I'll reiterate that we are happy to incorporate your suggested improvements to this process in a future PR
`use_pf_as_initial_points` flag, which uses points on the Pareto frontier to initialize optimization of the EHVI acquisition function; this results in a substantial speed-up of convergence to the Pareto front in high-dimensional input spaces.
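For readers following along, the core of the flag's behavior can be sketched like this (an illustrative NumPy version, not the actual implementation): take the observed inputs whose objectives are non-dominated and use them as initial conditions, randomly subsampling when there are more than `num_restarts` of them (the fully random scheme discussed above).

```python
import numpy as np

def pf_initial_points(X, Y, num_restarts, rng=None):
    """Sketch of the use_pf_as_initial_points idea: return observed
    inputs whose objectives (maximization) lie on the Pareto front,
    subsampled at random down to num_restarts if needed."""
    rng = np.random.default_rng() if rng is None else rng
    n = Y.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        ge = (Y >= Y[i]).all(axis=1)
        gt = (Y > Y[i]).any(axis=1)
        mask[i] = not (ge & gt).any()  # keep non-dominated rows
    pf_X = X[mask]
    if len(pf_X) > num_restarts:
        idx = rng.choice(len(pf_X), size=num_restarts, replace=False)
        pf_X = pf_X[idx]
    return pf_X
```

These points then seed the gradient-based optimization of the EHVI acquisition function in place of purely random starting points.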