Deep dive on random search and loss landscape

The big idea is to get your model to the lowest point in the loss landscape.


Let's say your team leader wants to optimize for accuracy this sprint, but the BA gave you a hot tip that it's really going to be recall, while finance insists ROI depends on precision; meanwhile, the DL library only cares about CE loss or *insert buzzword fn* loss. The darndest thing is, each combination of hyperparameters lands on a different point on each of these maps. Scale that up to thousands or millions of features, and you have yourself a proper mess.

Loss landscape for cross-entropy loss

Map 1: Cross-Entropy Loss. The model's default map. The lowest point is where the training algorithm naturally wants to go, since backprop is minimizing CE loss as its criterion.

Loss landscape for accuracy

Map 2: Accuracy. The team lead's map. Completely different coordinates for the "lowest point." Optimizing for CE loss doesn't guarantee the best accuracy, though how much that matters operationally is debatable.

Loss landscape for recall

Map 3: Recall. The BA's map. The sweet spot for best recall - grabbing as many true positives as possible - is somewhere else entirely. You want to minimize opportunity cost or no?

Loss landscape for precision

Map 4: Precision. Finance's map. To minimize false positives, we must navigate to yet another optimal coordinate. You want to give out fewer soon-to-default credit lines or no?

The challenge is clear: each set of hyperparameters takes you to a single spot, but its "performance" is judged differently based on the flavor of eval_perf of the day. Finding the "best" model is thus heavily about tradeoffs - but let's assume that's been resolved. Say the current challenge is whether we can even converge towards the global minimum of a single map. Sure, you can brute force it, but the number of combinations grows exponentially with the number of hyperparameters. So you do randomized search instead, to at least converge in the PAC sense (which is still better than intractability).
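To put a rough number on "exponential," here's a back-of-the-envelope sketch (the counts of hyperparameters and candidate values are made-up assumptions, purely for illustration):

```python
# Hypothetical numbers: with k hyperparameters and v candidate values each,
# exhaustive grid search costs v**k evaluations, while randomized search
# spends whatever budget you give it, regardless of k.
values_per_hyperparameter = 10          # assumed for illustration
for num_hyperparameters in (2, 4, 8, 16):
    grid_size = values_per_hyperparameter ** num_hyperparameters
    print(f"{num_hyperparameters} hyperparameters -> {grid_size:,} grid points")
# 2 -> 100, 4 -> 10,000, 8 -> 100,000,000, 16 -> 10,000,000,000,000,000
randomized_search_budget = 200          # a fixed budget, independent of k
```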


Wait, what's PAC?

It'd take a literal forever to flip every rock on the map to find the best spot. This is where Probably Approximately Correct (PAC) learning comes in. Instead of demanding a guaranteed, perfect solution, you settle for one that is "good enough" (approximately correct) with a very high likelihood (probably). Randomized search is how you bring this philosophy to life. By sampling random spots from the enormous problem space, we aren't trying to find the single best needle in an infinite haystack; instead, we hop across different spots on the map and evaluate each one - making it highly probable that we will eventually stumble upon a high-performing region. Thus, we'd be PAC, and the gains from brute-forcing until we find the absolute best become marginal or negative.
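Here's a minimal sketch of what that looks like in practice (the search space, the evaluate stub, and the budget are all illustrative assumptions, not any particular library's API): sample random configurations, score each on whatever metric is the flavor of the day, and keep the best seen so far.

```python
import random

search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "hidden_units": [32, 64, 128, 256],
    "dropout": [0.0, 0.1, 0.3, 0.5],
}

def evaluate(config):
    # Placeholder for "train the model and return the metric of the day"
    # (CE loss, accuracy, recall, precision... pick your map).
    return random.random()

def randomized_search(space, budget=50, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        # Sample a random spot on the map...
        config = {name: rng.choice(options) for name, options in space.items()}
        # ...evaluate it, and keep it if it's the best so far.
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

print(randomized_search(search_space))
```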


It's ok to just PAC and RO

This is where Randomized Optimization (RO) strategies come in.

Real quick - why RO? As you may have noticed, loss maps are often rugged and non-convex. If your starting point is on some random hill and you descend to the nearest well, then declare that well to be the deepest spot in the world while asserting that the Mariana Trench is fake news, that'd be suboptimal. That is, the model may hit a local minimum and unknowingly get trapped in its basin of attraction.

By randomly restarting, we can "escape" local basins and hopefully explore a better spot (and eventually stumble upon the basin of the global minimum). We can frame this process as exploration vs exploitation. Whenever we start over, we "exploit" the local slopes and descend towards the lowest point possible in the neighborhood. Once there, we do some quick mental math to decide whether this is a good stopping point or whether to "explore" again. For a quick recap, let's consider four different RO strategies.

Conceptual visualization of Randomized Hill Climbing

Randomized Hill Climbing: Takes the next best step from wherever you randomly respawn - can easily get stuck on the first "peak" you find (or "basin", depending on whether we're multiplying by -1 or not). Repeat until quota.
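A minimal sketch of randomized hill climbing with restarts, on a toy bit-string problem (the fitness function and the one-bit-flip neighborhood are assumptions for illustration, not a specific library's implementation):

```python
import random

def fitness(bits):
    return sum(bits)  # toy objective: count the 1s, higher is better

def neighbors(bits):
    # All states one bit-flip away from the current state.
    return [bits[:i] + [1 - bits[i]] + bits[i + 1:] for i in range(len(bits))]

def random_hill_climb(n_bits=20, restarts=10, seed=0):
    rng = random.Random(seed)
    best_state, best_fit = None, float("-inf")
    for _ in range(restarts):                                # random respawn
        state = [rng.randint(0, 1) for _ in range(n_bits)]
        while True:
            candidate = max(neighbors(state), key=fitness)   # next best step
            if fitness(candidate) <= fitness(state):
                break                                        # stuck on a local peak
            state = candidate
        if fitness(state) > best_fit:
            best_state, best_fit = state, fitness(state)
    return best_state, best_fit

print(random_hill_climb())
```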

Conceptual visualization of Simulated Annealing

Simulated Annealing: Explores or exploits depending on a temperature metric that "cools" down over the loop. Inspired by annealing hot metal, whose flexible particle lattice gradually "settles" into a more robust arrangement as it cools.
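A minimal simulated annealing sketch on the same toy bit-string problem (the geometric cooling schedule and the starting temperature are illustrative assumptions):

```python
import math
import random

def simulated_annealing(n_bits=20, t_start=10.0, cooling=0.95, steps=2000, seed=0):
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_bits)]
    temperature = t_start
    for _ in range(steps):
        i = rng.randrange(n_bits)
        candidate = state[:]
        candidate[i] = 1 - candidate[i]              # random single-bit move
        delta = sum(candidate) - sum(state)          # change in fitness
        # Always accept improvements; accept worse moves with probability
        # exp(delta / T): high temperature = explore, low temperature = exploit.
        if delta > 0 or rng.random() < math.exp(delta / temperature):
            state = candidate
        temperature *= cooling                       # cool down over the loop
    return state, sum(state)

print(simulated_annealing())
```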

Conceptual visualization of a Genetic Algorithm

Genetic Algorithm: Evolves a population of solutions (aka hyperparameter combinations), combining the best to create supposedly better offspring.
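A bare-bones GA sketch on the same toy problem (the population size, mutation rate, and keep-the-top-half selection scheme are illustrative choices):

```python
import random

def genetic_algorithm(n_bits=20, pop_size=30, mutation_rate=0.05,
                      generations=50, seed=0):
    rng = random.Random(seed)
    fitness = lambda bits: sum(bits)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)           # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, fitness(best)

print(genetic_algorithm())
```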

Conceptual visualization of MIMIC

MIMIC: Builds a statistical probability map of where the best solutions are, then samples from it. Depicted is Stable Diffusion's understanding of MIMIC - surprisingly close to that of the average ML student.
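A heavily simplified sketch of the MIMIC idea (real MIMIC fits a dependency tree over the variables; here each bit is modeled independently, purely to keep the sketch short): score a population, keep the top slice, re-fit the probability map to it, and sample the next population from that map.

```python
import random

def mimic_like(n_bits=20, pop_size=100, keep_pct=0.2, iterations=30, seed=0):
    rng = random.Random(seed)
    fitness = lambda bits: sum(bits)
    probs = [0.5] * n_bits                           # initial "probability map"
    best, best_fit = None, float("-inf")
    for _ in range(iterations):
        # Sample a population from the current probability map.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        population.sort(key=fitness, reverse=True)
        elite = population[: int(pop_size * keep_pct)]
        # Re-estimate where the good solutions live from the elite samples.
        probs = [sum(bits[i] for bits in elite) / len(elite) for i in range(n_bits)]
        if fitness(population[0]) > best_fit:
            best, best_fit = population[0], fitness(population[0])
    return best, best_fit

print(mimic_like())
```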

| RO Hyperparameter | Algorithm | Effect | Neural Net Equivalence |
|---|---|---|---|
| Population Size | GA, MIMIC | Search breadth / greediness | Batch Size |
| Mutation Rate | GA | Increases diversity / exploration | Regularization / Dropout Rate |
| Initial Temperature | SA | Initial exploration rate | Initial Learning Rate |
| Cooling Schedule | SA, MIMIC | Shift from exploration to exploitation | Learning Rate Decay |

Conclusion

Since canvassing the entire map to guarantee the best solution is intractable, we make do with PAC. The question now becomes how we structure our RO to balance exploit vs explore: how to systematically "avoid" local basins and, in doing so, sample enough of the big wide world.

There is no single "best" algorithm. RHC and SA offer speed, but at the risk of getting stuck. GA offers a more robust and powerful search, but at a higher computational cost. Not to mention, if the eval_perf metric changes, you have to redo everything. Of course, if data drift or concept drift happens, you also have to redo everything. The key is to understand the nature of the problem space, navigate tradeoffs as you set a direction for your search, and delegate to the oncall.

Chart showing Knapsack problem fitness score by algorithm as problem size increases.
Figure 1: Knapsack Performance vs. Problem Size. This chart illustrates how each algorithm's fitness score scales as the complexity of the Knapsack problem (the number of items) grows.
Box plot chart showing the distribution of fitness scores for the TSP after many Monte Carlo runs.
Figure 2: TSP Fitness Distribution (Monte Carlo Runs). By running the simulation many times, we can see the consistency of each algorithm. Note the wide variance for some, indicating a high sensitivity to the random starting conditions.
Chart validating Simulated Annealing performance on the TSP with different cooling schedules.
Figure 3: Impact of Cooling Schedule on Simulated Annealing. This validation curve shows how a single hyperparameter—the cooling schedule in SA—can drastically affect the final solution's quality and consistency.

Tldr: there is no free lunch.


Ben Truong
ML Engineer

I build ML/GenAI apps for the cloud and private clusters.
