xGPR quickstart

Build your training set

Start by building a Dataset, which is similar to a DataLoader in PyTorch. If your data is organized in one of a couple fairly common ways, you can use a built-in xGPR function to build this dataset. If your data is in some other form(e.g. a fasta file, an SQLite db or an HDF5 file) and you don’t want to make a copy of it, you can instead subclass xGPR’s DatasetBaseclass and build a custom Dataset object. We’ll look at the more common situations first. You can use either build_regression_dataset or build_classification_dataset, which take the same arguments, as shown below::

from xGPR import build_regression_dataset, build_classification_dataset

reg_train_set = build_regression_dataset(x, y, sequence_lengths = None,
                      chunk_size = 2000)

x and y can be either numpy arrays OR a list of filepaths to .npy files saved on disk. If the latter, no single one of the files can contain more datapoints than the chunk_size setting. chunk_size controls how many datapoints xGPR processes at a time while training. Unlike deep learning, this doesn’t affect the fitting trajectory in any way; it just affects memory consumption. If your datapoints are really large, you can use a smaller chunk size to reduce memory consumption.

For regression, y should be all real values. For classification, it should be class labels starting from 0 (e.g. for a three-class problem the labels should be 0, 1, or 2).

The x array(s) can be either 2d for tabular data or 3d for sequences and graphs. If they are 2d, they should have shape (N, M) for N datapoints and M features. If they are 3d, they should have shape (N, D, M) for N datapoints, D sequence elements / timepoints / nodes and M features. If they are arrays saved on disk, you can save the data as float32 or even uint8, and xGPR will convert it to float32 when loading (this can often save considerable disk space and also make model fitting faster).

If your data is 3d (i.e. sequences or graphs), you have to also supply sequence_lengths. If x is an array with shape[0]=N, sequence_lengths should also be an array with shape (N,) that indicates the length of each sequence (or number of nodes in each graph) excluding zero-padding (unless you want to include zero-padding in the kernel calculation for whatever reason, in which case, just set all sequence_lengths to be x.shape[1]). xGPR will then “mask” the zero- padding when doing the kernel calculations.

If your data is 3d but x is a list of .npy files on disk, sequence_lengths should be a list of .npy files on disk of the same length. For each x file with shape (N,D,M), the corresponding sequence_length file should be an npy array of shape (N,) indicating the length of each corresponding sequence in that x file again (typically) excluding zero- padding.

If using 3d data, you’ll also have to pass the sequence lengths of your test datapoints to model.predict when making predictions – see below for examples.

See the Examples section for some illustrations of working with sequence and graph data. For fixed-vector 2d data, of course, sequence lengths are not required.

When you create the dataset, xGPR will do some checks to make sure that what you fed it makes sense. If the dataset is very large, these may take a second.

Finally, let’s say your data is a fasta file, a csv file, an HDF5 file or some other format. You can create your own Dataset that loads the data in minibatches during training and does any preprocessing you want to do on each minibatch. To see how to do this, check out Example: Using a custom Dataset object.

Fit your model and make predictions

Let’s create two models using RBF kernels and fit them::

# xGPRegression is for regression, xGPDiscriminant is for classification
from xGPR import xGPRegression, xGPDiscriminant

#Notice that we set the device to be "cuda". If there are multiple
#cuda devices, xGPR will use whichever one is currently active. You can control
#this by setting the environment variable CUDA_VISIBLE_DEVICES,
#e.g. "export CUDA_VISIBLE_DEVICES". Also note that for regression,
#we specify a quantity, "variance_rffs", which must be < num_rffs.
#This controls the accuracy of the variance approximation when
#calculating uncertainty on predictions. We've found 512 - 1024 is
#usually fine.

reg_model = xGPRegression(num_rffs = 2048,
                      variance_rffs = 512,
                      random_seed = 123.
                      kernel_choice = "RBF",
                      device = "cuda", kernel_settings = {})

#Nearly all xGPR kernels have either one or two hyperparameters. We'll
#talk more about how to optimize hyperparameters shortly.
my_hyperparams = np.array([-3.1, -1.25])
reg_model.set_hyperparams(my_hyperparams, reg_train_set)

#We can change the number of RFFs at any time up until we fit the model.
reg_model.num_rffs = 2000



#We can fit using "exact" mode or "cg" mode. "cg" mode is much faster
#if num_rffs is very large. "exact" is much faster if "num_rffs" is small,
#e.g. < 3000. We'll use both here just for illustrative purposes.
#tol controls how tight the fit is; 1e-6 (the default) is usually fine;
#1e-7 (tighter fit) will improve performance slightly, especially if
#data is close to noise-free, but is not usually necessary; 1e-8 is
#usually overkill.

reg_model.fit(reg_train_set, mode="cg", tol=1e-6)

#To make predictions, just feed in a numpy array.
#If we want to switch over to CPU for inference now that the model
#is fitted, we can do that too, e.g.:

model.device = "cpu"

#For regression, predictions are just
#a numpy array of predicted y-values. chunk_size controls
#how many datapoints xGPR processes at a time in the input array; larger
#slightly increases speed but increases memory usage. If you're worried
#about memory usage, set a small chunk_size, otherwise default of 2000 is fine.

reg_preds = reg_model.predict(xtest_reg, sequence_lengths = None, chunk_size = 2000)

#For regression, we can also get the variance on the predictions, which
#is useful as a measure of uncertainty.

reg_preds, reg_var = reg_model.predict(xtest_reg, sequence_lengths = None, get_var = True)



# Classification is similar but with a few slight differences. For one thing,
# we don't need to specify the number of variance rffs since variance is not
# calculated separately.
class_train_set = build_classification_dataset(xclass, yclass,
                      sequence_lengths = None, chunk_size = 2000)

classifier = xGPDiscriminant(num_rffs = 2048,
                      random_seed = 123.
                      kernel_choice = "RBF",
                      device = "cuda", kernel_settings = {})

my_hyperparams = np.array([-3.1, -1.25])
classifier.set_hyperparams(my_hyperparams, class_train_set)

#We can change the number of RFFs at any time up until we fit the model.
classifier.num_rffs = 2000

#We can fit using "exact" mode or "cg" mode. "cg" mode is much faster
#if num_rffs is very large. "exact" is much faster if "num_rffs" is small,
#e.g. < 3000. We'll use both here just for illustrative purposes.
#tol controls how tight the fit is; 1e-6 (the default) is usually fine;
#1e-7 (tighter fit) will improve performance slightly, especially if
#data is close to noise-free, but is not usually necessary; 1e-8 is
#usually overkill.

classifier.fit(class_train_set, mode="cg", tol=1e-6)

#To make predictions, just feed in a numpy array.
#If we want to switch over to CPU for inference now that the model
#is fitted, we can do that too, e.g.:

model.device = "cpu"

#For classification, predictions are a shape (N,M) array for N
# datapoints and M classes. These are the class probabilities.
# The actuall class assignments are np.argmax(preds, axis=1). chunk_size controls
#how many datapoints xGPR processes at a time in the input array; larger
#slightly increases speed but increases memory usage. If you're worried
#about memory usage, set a small chunk_size, otherwise default of 2000 is fine.

class_preds = classifier.predict(xclass)

And that’s basically it!

We used “RBF” kernels here, but there are plenty of other options; see the Kernels section on the main page. Some are designed for sequences, others for tabular data. Some of them have options that you can supply under the kernel_settings dict (e.g. the convolution width if using a sequence kernel).

Notice also that we had to specify num_rffs when setting up the model (but can change it subsequently as well, at least right up until we fit). num_rffs controls how accurately the kernel is approximated. The error in the kernel approximation falls off exponentially with larger num_rffs values, so increasing num_rffs generally makes the model more accurate, but with diminishing returns. It also increases computational expense (fitting using num_rffs=1024 will be much faster than fitting with num_rffs=32,768).

Finally, notice that when calling model.predict just as when building a dataset, sequence_lengths is None if you’re using a fixed-length kernel; if you’re inputting a 3d array and using a convolution kernel, you have to supply a numpy array of sequence lengths so that xGPR can mask zero-padding (if you’re using zero-padding).

There’s one big missing piece we haven’t discussed so far of course, which is…

How to find good hyperparameter values?

Most kernels in xGPR have either two hyperparameters (“lambda”, “sigma”) or one (“lambda”). There are some exceptions to this. (See the Kernels section on the main page for details about each kernel.) The Lambda hyperparameter is like the ridge penalty in ridge regression: it provides regularization and is roughly related to how “noisy” the data is expected to be. Larger (more positive) values = stronger regularization. xGPR squares the Lambda hyperparameter when fitting.

The “sigma” hyperparameter, for kernels that have it, is an inverse lengthscale that (to oversimplify a little) determines how close datapoints must be in order to be considered similar. Smaller (more negative values) cause points that are farther apart to be considered “similar”.

xGPR always uses the natural log of the hyperparameters as input, and internally converts those to actual values. So if you have::

my_model.set_hyperparams(np.array([-1., 0.]), my_train_dataset)

the Lambda value that xGPR will use when fitting is (1 / e)^2, and the sigma value will be 1. (This is really just for internal convenience). For numerical stability, “lambda” probably should not go below -7 for regression or -10 for classification.

One simple way to find good hyperparameter values is to fit the model using different hyperparameter settings and look at performance on a validation set. So in this scheme, for each set of hyperparameters you’re considering, you would::

def my_hparam_evalation_function(my_new_hyperparams, my_validation_set_array,
                 my_validation_set_sequence_lengths = None):
    my_model.set_hyperparams(my_new_hyperparams, my_train_dataset)
    my_model.fit(my_train_dataset, mode="cg")
    preds = my_model.predict(my_validation_set_array, my_validation_set_sequence_lengths)
    ##Add some score evaluation, R^2, MAE, accuracy, etc. here...
    return score

where my_new_hyperparams is a numpy array. You can easily plug this into Optuna or some other hyperparameter tuning package, do Bayesian optimization or grid search or any other procedure you like. The xGPDiscriminant classifer comes with some built-in functions that will do this kind of optimization for you just for convenience (for details on these see the “Classifier Tuning” section below).

For regression, there’s also a much nicer way to evaluate hyperparameters, which uses negative log marginal likelihood (what xGPR calls NMLL). In Bayesian inference, the marginal likelihood is the probability of the training data averaged over all possible parameter values. A lower NMLL means a better model (and a higher NMLL means a worse model). The NMLL on the training set in general correlates very well with performance on held-out data. So, for regression we don’t really even need a validation set to tune hyperparameters; we can just calculate the NMLL for different hyperparameter settings and see which one gives us the best result.

Here’s an example::

def my_regression_hparam_evalation_function(my_new_hyperparams):
    #If num_rffs is small, use this function
    nmll = my_model.exact_nmll(my_new_hyperparams, my_training_dataset)
    #If num_rffs is large, use this function
    nmll = my_model.approximate_nmll(my_new_hyperparams, my_training_dataset)
    return nmll

Now, we just minimize the value returned by this function – again, we can use Optuna, grid search, Bayesian optimization, what have you.

Notice one thing in the function above. exact_nmll is much faster if the number of RFFs is small. On GPU, it can be reasonably fast up to about 8,000 RFFs or so. It has cubic scaling, however, so for large numbers of RFFs it can get very slow very quickly. approximate_nmll has much better scaling and so is your friend if you want to tune using a large num_rffs. It does involve an additional approximation (above and beyond the random feature approximation used throughout xGPR). This additional approximation is very good in general but its quality and speed can be fine-tuned if desired by using some additional knobs; see the Advanced Tutorials for more.

Finally, for regression, xGPR offers two build-in functions that can do hyperparameter tuning for you by minimizing the NMLL. These are:

my_model.tune_hyperparams_crude(my_training_dataset, bounds = None, max_bayes_iter = 30,
                                       subsample = 1)
my_model.tune_hyperparams(my_training_dataset, bounds = None, max_iter = 50, tuning_method = "Powell",
                          starting_hyperparams = None, n_restarts = 1,
                          nmll_method = "exact")

#Get the final hyperparameters optimized by tuning as a numpy array
my_final_hyperparams = my_model.get_hyperparams()

(There are some other knobs we can turn on tune_hyperparams; see Advanced Tutorials for more.)

The first function, tune_hyperparams_crude, is a remarkably efficient way to rapidly search the whole hyperparameter space for 1 and 2 hyperparameter kernels. It lets you specify a “subsample” argument; if this is less than 1(e.g. 0.5), it will use the specified fraction of the training data when tuning. Both functions let you specify search boundaries or just pass None (the default) for bounds; if None, xGPR uses some default search boundaries.

tune_hyperparams_crude uses an SVD, which means it doesn’t scale well – it can get pretty slow for num_rffs = 3,000 or above. Fortunately, we’ve generally found that the hyperparameters which give good NMLL with a small number of RFFs (a sketchy kernel approximation) are usually not too terribly far away from those which give good NMLL with a larger number of RFFs (a better kernel approximation). (This is a rule of thumb, and like all rules of thumb should be used with caution.) So, one way to use these two functions together is to use tune_hyperparams_crude for a fast initial search, then (if desired) further fine-tune the hyperparameters using tune_hyperparams. For example::

my_model.device = "cuda"
my_model.num_rffs = 1024
my_model.tune_hyperparams_crude(my_train_dataset)
rough_hparams = my_model.get_hyperparams()

#Use rough_hparams as a starting point for fine-tuning.
#We could also set a bounding box around rough_hparams,
#pass that as bounds, set n_restarts to say 3 and
#thoroughly explore the space around rough_hparams. Or
#even just do a gridsearch across the space around rough
#hparams. See the examples section for some illustrations.
my_model.num_rffs = 4096
my_model.tune_hyperparams(my_train_dataset, max_iter = 50,
                      tuning_method = "L-BFGS-B",
                      starting_hyperparams = rough_hparams,
                      n_restarts = 1,
                      nmll_method = "exact")

tune_hyperparams can use one of three different algorithms or tuning_method: Powell, L-BFGS-B and Nelder-Mead. L-BFGS-B uses the fewest iterations, but has to calculate the gradient on each, so it’s slow if num_rffs is large. If num_rffs is large, instead, consider Powell and Nelder-Mead. Nelder-Mead is usually better than Powell at finding the absolute best possible value, but it can take a lot of iterations to converge, so it’s only good if you’re not in a hurry. We generally prefer Powell to Nelder-Mead.

Remember that when calculating NMLL, we could use exact_nmll or approximate_nmll. The function tune_hyperparams offers you the same choice: you can set nmll_method to either nmll_method=exact or nmll_method=approximate, and the considerations are the same. Again, exact is faster if num_rffs is small, while approximate_nmll has better scaling.

Finally, one important thing to keep in mind. Most of these methods run at reasonable speed on GPU. On CPU, however, tuning with a large num_rffs can be slow because of all of the matrix multiplications that are required. We recommend doing hyperparameter tuning and fitting on GPU when possible. Making predictions, by contrast, is reasonably fast on CPU. So fitting on GPU and doing inference on CPU is a perfectly viable way to go if desired.

Classifier tuning

As discussed above, the xGPDiscriminant classifier does not currently calculate marginal likelihood, so currently the supported way to tune hyperparameters is to evaluate performance on a validation set or using cross-validation (for larger datasets, validation set performance is much faster and is preferred). For convenience, xGPR incorporates several functions that can automate validation set tuning using either Optuna or a Scipy-based optimizer; these are described below:

That’s really all you absolutely need to know! For lots of useful TMI, see Advanced Tutorials.