If anyone knows of popular, proven techniques for dynamically adjusting the step size in gradient descent, or how to handle approaching a minimum (switch to binary search?), or even escaping local minima, let me know!
@runevision look at ADAM for step sizes. You could also try L-BFGS (not certain how it will do on 10^6 params). Can try random restarts to avoid local minima.
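A minimal sketch of what that could look like with SciPy's L-BFGS-B plus random restarts; `evaluate` and `gradient` are toy placeholders for the real objective and its gradient, and the parameter count is assumed:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the real scalar evaluation function (a shifted quadratic).
def evaluate(params):
    return float(np.sum((params - 1.0) ** 2))

# Analytic gradient of the toy objective. Without one, SciPy falls back to
# finite differences, which costs one extra evaluation per parameter per step.
def gradient(params):
    return 2.0 * (params - 1.0)

rng = np.random.default_rng(0)
n_params = 10_000  # assumed dimensionality

# Random restarts: run the local optimizer from several starting points and
# keep the best result, as a cheap guard against bad local minima.
best = None
for _ in range(5):
    start = rng.normal(size=n_params)
    res = minimize(evaluate, start, jac=gradient, method="L-BFGS-B",
                   options={"maxiter": 500})
    if best is None or res.fun < best.fun:
        best = res

print(best.fun, best.nit)
```

L-BFGS builds a low-memory curvature estimate from recent gradients, so it usually needs far fewer iterations than plain gradient descent, even in high dimensions.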
@avi I'm looking into ADAM, but all descriptions I could find assume familiarity with stochastic gradient descent. And when I look into that, I can't figure out how to map the terminology to what I'm doing, or whether it even applies. In the image here (from Wikipedia) I don't know what the summand Q_i functions are supposed to correspond to. I have many parameters/dimensions (10^4) but only one evaluation function. I also don't know if I have anything corresponding to the "i-th observation".
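(For reference, the formula being referred to is presumably the standard objective from the Wikipedia SGD article, Q(w) = (1/n) * sum_{i=1..n} Q_i(w), where each summand Q_i(w) is the loss contributed by the i-th of n observations, and SGD estimates the gradient of Q from one or a few randomly chosen Q_i per step.)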
@runevision I think you can just treat yours as a degenerate case where you have exactly 1 observation, at which point SGD = GD. The momentum and adaptive learning rates from Adam should still apply.
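A minimal sketch of that degenerate case: Adam driven by the full (non-stochastic) gradient of a single evaluation function. `evaluate` and `num_gradient` are toy placeholders; beta1/beta2/eps are the usual defaults, while the learning rate and step count are made up for the toy:

```python
import numpy as np

# Toy stand-in for the single scalar evaluation function (a shifted quadratic).
def evaluate(params):
    return float(np.sum((params - 1.0) ** 2))

# Central-difference gradient, assuming only an evaluation function is available.
# This costs 2 evaluations per parameter per step, so with many parameters a
# cheaper (analytic) gradient is strongly preferable.
def num_gradient(params, eps=1e-5):
    grad = np.empty_like(params)
    for i in range(params.size):
        d = np.zeros_like(params)
        d[i] = eps
        grad[i] = (evaluate(params + d) - evaluate(params - d)) / (2 * eps)
    return grad

def adam(params, steps=1000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(params)  # running average of gradients (momentum)
    v = np.zeros_like(params)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = num_gradient(params)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero-initialized averages
        v_hat = v / (1 - beta2 ** t)
        # Per-parameter step size: roughly lr where gradients are consistent,
        # smaller where they fluctuate.
        params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params

result = adam(np.zeros(20))  # tiny dimensionality to keep the toy cheap
print(evaluate(result))
```

Note that with a constant learning rate Adam tends to hover near a minimum rather than settling exactly on it; decaying lr over time tightens the final result.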
@runevision alternatively, and maybe more conventionally, you could consider each pixel distance to be a separate observation.
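And a sketch of that framing: each pixel's error is treated as one observation, and each step uses the gradient of a random mini-batch of pixels. The per-pixel error here is a made-up linear toy, not the real distance function:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_params = 5_000, 50  # toy sizes for illustration
A = rng.normal(size=(n_pixels, n_params))
b = rng.normal(size=n_pixels)

# Toy per-pixel error: squared residual of a linear model. In the real problem
# this would be the distance contributed by each pixel.
def pixel_errors(params, idx):
    return (A[idx] @ params - b[idx]) ** 2

# Gradient of the mean per-pixel error over the sampled pixels only.
def batch_gradient(params, idx):
    residuals = A[idx] @ params - b[idx]
    return 2.0 * A[idx].T @ residuals / idx.size

params = np.zeros(n_params)
batch_size, lr = 256, 0.01
for step in range(2_000):
    idx = rng.choice(n_pixels, size=batch_size, replace=False)  # sample "observations"
    params -= lr * batch_gradient(params, idx)  # plain SGD step; Adam would slot in here

print(float(np.mean(pixel_errors(params, np.arange(n_pixels)))))
```

Each step then only touches a subset of pixels, which is where the "stochastic" in SGD (and the noise Adam's averaging is designed to smooth out) comes from.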