Deep dive on random search and the loss landscape

The big idea is to get your model to the lowest point in the loss landscape.
Let's say your team lead wants to optimize for accuracy this sprint, but the BA gave you a hot tip that it's really going to be recall, finance insists ROI depends on precision, and the DL library only cares about CE loss (or *insert buzzword fn* loss). The darnedest thing is that each combination of hyperparameters lands you on a single point, and that point sits somewhere different on each map. Scale that up to thousands or millions of features, and you have yourself a proper mess.

Map 1: Cross-Entropy Loss. The model's default map. The lowest point is where the training algorithm naturally heads when you run backprop with CE loss as the criterion.

Map 2: Accuracy. The team lead's map. Completely different coordinates for the "lowest point." Optimizing CE loss doesn't guarantee the best accuracy, though how much that matters operationally is debatable.

Map 3: Recall. The BA's map. The sweet spot for grabbing as many true positives as possible is somewhere else entirely. You want to minimize opportunity cost, or no?

Map 4: Precision. Finance's map. To minimize false positives, we must navigate to yet another optimal coordinate. You want to give out fewer soon-to-default credit lines, or no?
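
To make the "four maps" concrete, here's a minimal sketch with made-up numbers: one set of predicted probabilities scored four ways. CE loss is threshold-free, while accuracy, recall, and precision each peak at a different decision threshold - a tiny stand-in for "different coordinates for the lowest point."

```python
import numpy as np

# Hypothetical labels and predicted probabilities, purely for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.6, 0.7, 0.4, 0.3, 0.55, 0.8, 0.2, 0.35, 0.1])

def cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def scores_at(threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision

print("CE loss (threshold-free):", round(cross_entropy(y_true, y_prob), 3))
for t in (0.3, 0.5, 0.7):
    acc, rec, prec = scores_at(t)
    print(f"threshold={t}: acc={acc:.2f} recall={rec:.2f} precision={prec:.2f}")
# Recall is best at the low threshold, precision (and here accuracy) at the
# high one: different maps reward different coordinates.
```
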
The challenge is clear: each set of hyperparameters takes you to a single spot, but that spot's "performance" is judged differently depending on the flavor of eval_perf of the day. Finding the "best" model is thus heavily about tradeoffs - but let's assume that's been resolved. Say the current challenge is whether we can even converge towards the global minimum of a single map. Sure, you can brute-force it, but a full grid is easily exponential: n hyperparameters with k settings each means O(k^n) evaluations. So you do randomized search instead to at least PAC converge (which is still better than intractability).
Wait, what's PAC?
It'd take a literal forever to flip every rock on the map to find the best spot. This is where Probably Approximately Correct (PAC) learning comes in. Instead of demanding a guaranteed, perfect solution, you settle for one that is "good enough" (approximately correct) with a very high likelihood (probably). Randomized search is how you bring this philosophy to life. By sampling random spots from the enormous problem space, we aren't trying to find the single best needle in an infinite haystack; instead, we travel to different spots on the map and search the neighborhood there - making it highly probable that we will eventually stumble upon a high-performing region. Thus, we'd be PAC, and the gains from brute-forcing until the very best become marginal or negative.
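
Here's a minimal random-search sketch under assumed conditions: a hypothetical three-knob search space and a synthetic evaluate function standing in for the expensive train-and-validate step. The point is the budget: sample a fixed number of random configurations instead of enumerating all O(k^n) of them.

```python
import math
import random

def sample_config():
    # Hypothetical search space; learning rate sampled log-uniformly.
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "dropout": random.uniform(0.0, 0.5),
        "hidden_units": random.randint(16, 256),
    }

def evaluate(cfg):
    # Stand-in for "train the model and measure validation loss";
    # a synthetic surface so the sketch runs end to end.
    return (abs(math.log10(cfg["learning_rate"]) + 2)
            + abs(cfg["dropout"] - 0.2)
            + abs(cfg["hidden_units"] - 128) / 128)

budget = 50                                   # how many rocks we can afford to flip
best_cfg, best_loss = None, float("inf")
for _ in range(budget):
    cfg = sample_config()
    loss = evaluate(cfg)
    if loss < best_loss:                      # keep the best spot seen so far
        best_cfg, best_loss = cfg, loss

print(best_loss, best_cfg)
```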
It's ok to just PAC and RO
This is where Randomized Optimization (RO) strategies come in.
Real quick - why RO? As you may have noticed, loss maps are often rugged and non-convex. If your starting point is on some random hill and you descend into the neighborhood well, then declare that well to be the deepest spot in the world while asserting that the Mariana Trench is fake news, that'd be suboptimal. That is, the search may hit a local minimum and unknowingly get trapped in its basin of attraction.
By randomly restarting, we can "escape" local basins and hopefully explore a better spot (and eventually stumble upon the basin of the global minimum). We can frame this process as exploration vs exploitation. Each time we land somewhere, we "exploit" the local slopes and descend towards the lowest point possible in the neighborhood. Once there, we do some quick mental math to decide whether this is a good stopping point or whether to "explore" again from somewhere new. For a quick recap, let's consider four different RO strategies.

Randomized Hill Climbing: Respawn at a random point and keep taking the next best step - each run can easily get stuck in the first "peak" it finds (or "basin", depending on whether we're multiplying by -1 or not). Repeat until the restart quota runs out.
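
A minimal sketch of randomized hill climbing with restarts, on a made-up 1-D rugged landscape. The toy loss, step size, iteration count, and restart quota are all arbitrary illustration choices.

```python
import math
import random

def loss(x):
    # Hypothetical rugged, non-convex 1-D landscape with several basins.
    return (x ** 2) / 50 + 2 * math.sin(x)

def hill_climb(start, step=0.1, iters=500):
    x = start
    for _ in range(iters):
        candidate = x + random.choice([-step, step])
        if loss(candidate) < loss(x):         # greedy: only accept downhill moves
            x = candidate
    return x

best_x, best_loss = None, float("inf")
for _ in range(20):                           # restart quota
    x = hill_climb(start=random.uniform(-10, 10))
    if loss(x) < best_loss:                   # remember the deepest basin found so far
        best_x, best_loss = x, loss(x)

print(best_x, best_loss)
```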

Simulated Annealing: Explore or exploit depending on a temperature parameter that "cools" down over the loop. Inspired by annealing hot metal, whose flexible particle lattice gradually settles into a more robust arrangement as it cools.
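
A minimal simulated annealing sketch on the same kind of toy landscape. The acceptance rule exp(-delta / T) and geometric cooling are the standard textbook choices; the starting temperature and cooling rate here are arbitrary.

```python
import math
import random

def loss(x):
    # Same hypothetical rugged 1-D landscape as before.
    return (x ** 2) / 50 + 2 * math.sin(x)

x = random.uniform(-10, 10)
T, cooling = 5.0, 0.99
for _ in range(2000):
    candidate = x + random.uniform(-0.5, 0.5)
    delta = loss(candidate) - loss(x)
    # Always accept downhill moves; accept uphill moves with probability
    # exp(-delta / T), so exploration fades as the temperature cools.
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = candidate
    T *= cooling                              # shift from exploration to exploitation

print(x, loss(x))
```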

Genetic Algorithm: Evolves a population of solutions (aka hyperparameter combinations), recombining the best to create supposedly better offspring.
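
A compact genetic algorithm sketch. The fitness function is a made-up stand-in (in practice: train and evaluate the model), and the encoding, selection, crossover, and mutation rate are deliberately bare-bones.

```python
import random

def fitness(individual):
    # Stand-in score: higher is better. In practice this is the expensive
    # train-and-evaluate step on one hyperparameter combination.
    return -sum((g - 0.5) ** 2 for g in individual)

def crossover(a, b):
    cut = random.randrange(1, len(a))         # single-point crossover
    return a[:cut] + b[cut:]

def mutate(individual, rate=0.1):
    # Mutation keeps diversity: each gene is re-randomized with small probability.
    return [random.random() if random.random() < rate else g for g in individual]

population = [[random.random() for _ in range(4)] for _ in range(20)]
for _ in range(50):                           # generations
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                 # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print(max(population, key=fitness))
```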

MIMIC: Builds a statistical probability map of where the best solutions are, then samples from it. Depicted is Stable Diffusion's understanding of MIMIC - surprisingly close to that of the average ML student.
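
A heavily simplified sketch of the "fit a distribution to the elite samples, then resample" loop. Full MIMIC fits a dependency-tree model over the variables; this toy uses independent Gaussians per dimension (closer in spirit to a cross-entropy-method / UMDA flavor), purely to show the shape of the idea.

```python
import random
import statistics

def score(x):
    # Stand-in objective (hypothetical): lower is better.
    return sum((xi - 3.0) ** 2 for xi in x)

dims, pop_size, elite_frac = 4, 100, 0.2
means, stds = [0.0] * dims, [5.0] * dims      # initial, very broad "probability map"
for _ in range(30):
    samples = [[random.gauss(means[d], stds[d]) for d in range(dims)]
               for _ in range(pop_size)]
    samples.sort(key=score)
    elite = samples[: int(pop_size * elite_frac)]
    for d in range(dims):                     # refit the map to where the best samples live
        column = [s[d] for s in elite]
        means[d] = statistics.mean(column)
        stds[d] = max(statistics.stdev(column), 1e-3)

print(min(samples, key=score))
```
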
| RO Hyperparameter | Algorithm | Effect | Neural Net Equivalent |
|---|---|---|---|
| Population Size | GA, MIMIC | Search breadth / greediness | Batch Size |
| Mutation Rate | GA | Increases diversity / exploration | Regularization / Dropout Rate |
| Initial Temperature | SA | Initial exploration rate | Initial Learning Rate |
| Cooling Schedule | SA, MIMIC | Shift from exploration to exploitation | Learning Rate Decay |
Conclusion
Since canvassing the entire map to guarantee the best solution is intractable, we make do with PAC. The question now becomes how to structure our RO around explore vs exploit: how to systematically "escape" local basins and thereby sample enough of the big wide world.
There is no single "best" algorithm. RHC and SA offer speed, but at the risk of getting stuck. GA offers a more robust and powerful search, but at a higher computational cost. Not to mention that if the eval_perf metric changes, you have to redo everything; likewise if data shift or concept drift happens. The key is to understand the nature of the problem space, navigate the tradeoffs as you set a direction for your search, and delegate the rest to the on-call.



TL;DR: there is no free lunch.