Model Fitting – In depth
Once you’ve tuned hyperparameters (or if you already in advance have hyperparameters which you know will work well from previous experiments), you’re ready to fit. Fitting is faster, easier and more foolproof than tuning, and the default options usually work well, but if you want more fine-grained control, there’s some additional knobs you can turn.
Preconditioning
Preconditioning is the heart and core of how iterative (non-exact) model
fitting is done in xGPR. If fitting in mode=exact, of course, a
preconditioner is not required; preconditioners are used for mode=cg.
mode=exact is very fast for smaller numbers of RFFs (e.g. 4096), but
for a larger number of RFFs, CG is the way to go, and this is where
preconditioners are useful.
Building a preconditioner
requires either one or two passes over the dataset. When you call
fit, xGPR will ordinarily automatically choose a preconditioner
size for you. If you’d like to control that process yourself
instead, or even have a little more control over the automated
process that xGPR is using, here’s how to do it.
To build your own preconditioner, use:
method = 'srht' uses a fast Hadamard transform based construction
with a single pass over the dataset. method = 'srht_2', by contrast,
runs two passes over the dataset and does involve some matrix
multiplications, so it is more expensive, but the resulting preconditioner
is always better. A preconditioner built with ‘srht_2’ will usually reduce
the number of iterations required to fit by about 20-25% compared with one
built with ‘srht’.
The ratio is a pretty good predictor of how many iterations it will
take xGPR to fit. See the following chart, where ratio is referred
to as beta / lambda^2. (The ratios here are all for “srht”. “srht_2”
will generally increase the ratio slightly but decrease the actual number
of iterations needed to fit as illustrated).
Smaller ratios in general mean fewer iterations are required to fit. Each iteration
is a single pass over the dataset (similar to an epoch when training a deep learning
model but generally much faster). The ratio in turn is a dataset-dependent
function of max_rank. Notice that the relationship between ratio and
number of iterations is different for srht and srht_2; srht_2 is
better but as noted requires two passes over the data to build.
If based on the ratio we obtain, we think that fitting
will take more iterations than we would like, we can reconstruct the preconditioner
using a larger max_rank and get a better ratio (doubling max_rank is
usually a good step to take). Bear in mind however that large values of
max_rank are also expensive, because preconditioner construction requires
a matrix decomposition, which is cheap for say 512 x 512 or 1024 x 1024, but
not so much for very large max_rank. Therefore the goal is to push
max_rank high enough to get an acceptable likely number of iterations,
but not so high that it is unnecessarily expensive to construct. In our
experiments, we have rarely needed to take max_rank higher than 2000,
but it is of course possible to do so if necessary.
Note that the chart above was constructed using fits with tol=1e-6. The number
of iterations will be greater if you select a tighter tol and less with a
looser tol. Also note that for very small values of the lambda hyperparameter
(shared by all kernels, and the first hyperparameter that is returned
from model.get_hyperparams()) is very small, the ratio may
dramatically overestimate how many iterations are required to fit. See the
molecule example on the main page for a case where the ratio is > 1000 but the model
fits in << 100 iterations! Consequently, the graph above should
only be used as a rough guide. When in doubt, a quick experiment on a subset of the
data may prove helpful in deciding what ratio is “good enough” before
building a preconditioner on the full dataset and fitting.
Fitting
If you build your own preconditioner, you can pass it to the fit() function. Alternatively, you can still use xGPR’s automated procedure for building a preconditioner while taking more fine-grained control of that automated procedure. This avoids the hassle of building the preconditioner yourself while giving you some additional control to ensure the fastest fit possible. Here’s an overview of what you can do:
Note that the alternative mode=exact is preferable if the
number of RFFs is small, where it will generally get better speed (it has
poor scalability to large numbers of RFFs). For exact fitting, no
preconditioner is required.
As a side note: SGD is popular in the literature because it works well for deep learning. Most papers that recommend some flavor of SGD haven’t tried to use it for high-dimensional linear problems that may sometimes be ill-conditioned and where a tight fit is desired – SGD does NOT work well compared to preconditioned CG under this particular set of circumstances, at least Adam, AMSGrad, SVRG, and the other usual suspects. The amount of learning rate and learning rate schedule tuning required to get SGD to work well is simply not acceptable for an out of the box fitting routine.