Example: Fitting tabular data

This straightforward example makes use of a small, fairly random UCI repository dataset with about 45,000 datapoints. We’ll download this data, do some light preprocessing, and fit an RBF kernel.

These experiments used xGPR v0.4.8. Note that if setting device to cuda, xGPR always uses the currently active cuda device. To control which device this is, you can set the environment variable “CUDA_VISIBLE_DEVICES”.

[1]:
import os
import math
import time

import wget
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

from xGPR import xGPRegression as xGPReg
from xGPR import build_regression_dataset
/ssd1/Documents/gp_proteins/venv_testing/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[2]:
fname = wget.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv")
raw_data = pd.read_csv(fname)
os.remove(fname)
-1 / unknown
[3]:
raw_data
[3]:
RMSD F1 F2 F3 F4 F5 F6 F7 F8 F9
0 17.284 13558.30 4305.35 0.31754 162.1730 1.872791e+06 215.3590 4287.87 102 27.0302
1 6.021 6191.96 1623.16 0.26213 53.3894 8.034467e+05 87.2024 3328.91 39 38.5468
2 9.275 7725.98 1726.28 0.22343 67.2887 1.075648e+06 81.7913 2981.04 29 38.8119
3 15.851 8424.58 2368.25 0.28111 67.8325 1.210472e+06 109.4390 3248.22 70 39.0651
4 7.962 7460.84 1736.94 0.23280 52.4123 1.021020e+06 94.5234 2814.42 41 39.9147
... ... ... ... ... ... ... ... ... ... ...
45725 3.762 8037.12 2777.68 0.34560 64.3390 1.105797e+06 112.7460 3384.21 84 36.8036
45726 6.521 7978.76 2508.57 0.31440 75.8654 1.116725e+06 102.2770 3974.52 54 36.0470
45727 10.356 7726.65 2489.58 0.32220 70.9903 1.076560e+06 103.6780 3290.46 46 37.4718
45728 9.791 8878.93 3055.78 0.34416 94.0314 1.242266e+06 115.1950 3421.79 41 35.6045
45729 18.827 12732.40 4444.36 0.34905 157.6300 1.788897e+06 229.4590 4626.85 141 29.8118

45730 rows × 10 columns

[4]:
train_data, test_data = train_test_split(raw_data, test_size = 0.2, random_state=123)

train_y, test_y = train_data["RMSD"].values, test_data["RMSD"].values
train_x, test_x = train_data.iloc[:,1:].values, test_data.iloc[:,1:].values

#Standardizing the features

train_mean, train_std = train_x.mean(axis=0), train_x.std(axis=0)
train_x = (train_x - train_mean[None,:]) / train_std[None,:]

test_x = (test_x - train_mean[None,:]) / train_std[None,:]

Next, we’ll set the data up for use as a training set by xGPR. If the data is too large to fit in memory, we can save it in “chunks” to disk, each chunk as a .npy file with the corresponding y-values as another .npy file, then build an OfflineDataset. In this case, we’ll build an OnlineDataset as well to illustrate.

The chunk_size parameter indicates how much data the Dataset will feed to xGPR at any one given time during training. It’s a little like a minibatch for deep learning. If you’re using a large number of random features to ensure a highly accurate model, or if your data has a large number of features per datapoint, set chunk_size small to avoid excessive memory consumption. This does not affect the accuracy of the model or training in any way, merely memory and to some extent speed (larger chunk sizes are slightly faster).

[5]:
online_train_data = build_regression_dataset(train_x, train_y, chunk_size = 2000)

For the OfflineDataset, we’ll save the data to .npy files; each file can only contain up to chunk_size datapoints. The files don’t all have to be the same size.

[6]:
chunk_size = 2000
xfiles, yfiles = [], []

for i in range(0, math.ceil(train_x.shape[0] / chunk_size)):
    xfiles.append(f"{i}_xblock.npy")
    yfiles.append(f"{i}_yblock.npy")
    start = i * chunk_size
    end = min((i + 1) * chunk_size, train_x.shape[0])
    np.save(xfiles[-1], train_x[start:end,:])
    np.save(yfiles[-1], train_y[start:end])


offline_train_data = build_regression_dataset(xfiles, yfiles, chunk_size = 2000)

We’ll do an initial quick and dirty hyperparameter tuning run using the tune_hyperparams_crude method, using a small number of random features. Later on we’ll use a larger number of random features (for a more accurate kernel approximation) and fine-tune the hyperparameters to get a better model. Crude tuning with a smaller number of random features is useful as an initial experiment when you’re deciding whether to use xGPR for your problem and if so, what features and kernel to use. It also gives us a good idea what region of hyperparameter space to search when fine-tuning, because the “best” hyperparameters from fine-tuning are usually not too far from those identified in an initial crude experiment.

[7]:
#Variance_rffs controls the accuracy of the uncertainty / variance approximation.
#512 - 1024 is usually fine.
uci_model = xGPReg(num_rffs = 1024, variance_rffs = 512,
                  kernel_choice = "RBF", verbose = True, device = "cuda",
                  random_seed = 123)

start_time = time.time()
uci_model.tune_hyperparams_crude(online_train_data)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")
Grid point 0 acquired.
Grid point 1 acquired.
Grid point 2 acquired.
Grid point 3 acquired.
Grid point 4 acquired.
Grid point 5 acquired.
Grid point 6 acquired.
Grid point 7 acquired.
Grid point 8 acquired.
Grid point 9 acquired.
New hparams: [-0.2041134]
Additional acquisition 10.
New hparams: [0.1916695]
Additional acquisition 11.
New hparams: [0.2469573]
Best score achieved: 40022.306
Best hyperparams: [-0.5406061  0.2469573]
Wallclock: 7.319413185119629

Just for fun, let’s repeat this using the offline dataset…this requires loading data from disk in batches on each iteration.

[8]:
start_time = time.time()
uci_model.tune_hyperparams_crude(offline_train_data)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")
Grid point 0 acquired.
Grid point 1 acquired.
Grid point 2 acquired.
Grid point 3 acquired.
Grid point 4 acquired.
Grid point 5 acquired.
Grid point 6 acquired.
Grid point 7 acquired.
Grid point 8 acquired.
Grid point 9 acquired.
New hparams: [-0.2041134]
Additional acquisition 10.
New hparams: [0.1916695]
Additional acquisition 11.
New hparams: [0.2469573]
Best score achieved: 40022.306
Best hyperparams: [-0.5406061  0.2469573]
Wallclock: 6.411782503128052

We can retrieve the resulting hyperparameters and save them somewhere for future use if needed. This function always returns the log of the hyperparameters, and if you’re passing hyperparameters to the fitting function, you should use the log of the hyperparameters as well.

[9]:
uci_model.get_hyperparams()
[9]:
array([-0.5406061,  0.2469573])

Now we’ll increase the number of RFFs for a more accurate kernel approximation and fit the model. When fitting, mode=exact is faster if the number of RFFs is small – say < 6,000 or so – and mode=cg is faster for larger numbers of RFFs.

[10]:
uci_model.num_rffs = 8192
start_time = time.time()
uci_model.fit(offline_train_data, mode = "cg", tol = 1e-6)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")
starting fitting
Chunk 0 complete.
Chunk 10 complete.
Using rank: 512
Chunk 0 complete.
Chunk 10 complete.
0 iterations complete.
5 iterations complete.
10 iterations complete.
15 iterations complete.
20 iterations complete.
25 iterations complete.
30 iterations complete.
CG iterations: 35
Now performing variance calculations...
Fitting complete.
Wallclock: 3.184067964553833

We can get the uncertainty on predictions by setting get_var = True. In this case, we don’t need it, so we’ll skip it. chunk_size ensures we only process up to chunk_size datapoints at one time to limit memory consumption.

[11]:
test_predictions, test_var = uci_model.predict(test_x, get_var = True, chunk_size = 1000)
[12]:
mae = np.mean( np.abs(test_predictions - test_y))
print(f"MAE: {mae}")
MAE: 3.029597679749727

Suppose we are unhappy with this result. We could of course consider a different kernel or modeling approach; alternatively, we can increase the number of random features for either tuning or fitting, which will almost invariably improve performance.

Generally increasing the number of random features used for fitting gives a bigger performance boost than increasing the number for tuning. For fitting, it can be beneficial to use as many as 32,768 random features, while for tuning, we seldom see large performance gains for more than 10,000. Either way, however, increasing the number of random features yields diminishing returns. Going from 1024 to 2048 gives a more substantial improvement than going from 2048 to 4096, and so on. If you ever find yourself needing to go to very high numbers, the model & kernel may not be a good fit for that particular problem.

First, let’s increase the number used to fit and see what happens…

[13]:
uci_model.num_rffs = 32768

start_time = time.time()
uci_model.fit(offline_train_data, mode = "cg", tol = 1e-6)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")
starting fitting
Chunk 0 complete.
Chunk 10 complete.
Using rank: 512
Chunk 0 complete.
Chunk 10 complete.
0 iterations complete.
5 iterations complete.
10 iterations complete.
15 iterations complete.
20 iterations complete.
25 iterations complete.
30 iterations complete.
CG iterations: 35
Now performing variance calculations...
Fitting complete.
Wallclock: 9.839282274246216
[14]:
test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)
mae = np.mean( np.abs(test_predictions - test_y))
print(f"MAE: {mae}")
MAE: 2.9649900823083315

As discussed above, we could also retune hyperparameters using a larger number of random features. In xGPR, if the number of random features is < 8000 or so, we can use model.exact_nmll to calculate the NMLL (a Bayesian measure of model quality that is strongly correlated with validation / test set performance). model.approximate_nmll is slower for small numbers of random features but scales better to larger numbers of random features (it uses a highly accurate approximation for log determinants). We could use Optuna or set up a grid search and call model.exact_nmll or model.approximate_nmll for each hyperparameter combo we want to evaluate. We’ll illustrate all of these alternatives under the other examples. Optuna is one of our favorite methods and is often better than simple gridsearch.

Or we can use xGPR’s built-in model.tune_hyperparams(), which lets us use either exact or approximate nmll using either L-BFGS-B, Nelder-Mead or Powell optimzation methods, using the hyperparameters from the crude initial tuning run as a starting point. We’ll do that here. Nelder-Mead tends to be (very) thorough but take a long number of iterations to converge; it’s not our preferred approach for this reason. Powell tends to be a little less optimal but converges quickly. L-BFGS-B doesn’t scale well to more than 5,000 rffs or so because it has to calculate the gradient but converges very quickly in most cases.

Increasing the number of RFFs yields diminishing returns. We could get an even better result by tuning using 16,384 RFFs for example instead of 8,192, but it would be slower and the benefit would be quite small typically.

[15]:
uci_model.num_rffs = 8192

start_time = time.time()

uci_model.tune_hyperparams(offline_train_data, tuning_method = "Powell",
                            nmll_method = "exact", max_iter = 50)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Evaluated NMLL.
Best score: 36552.343775204354
Wallclock: 353.921005487442
[16]:
print(uci_model.get_hyperparams())
[-0.6049985   0.75809609]

Now let’s refit using our new hyperparameters, using 8192 fitting rffs so we can compare to what we used at first. We’ll see that we get a slight improvement over our initial tuning run, but nothing to write home about. (This isn’t always true – sometimes fine-tuning the hyperparameters can make a big difference – see the other examples.) Of course, by fitting using this new hyperparameter set with 32768 RFFs instead of 8192 we could get some small additional improvement too.

[17]:
uci_model.num_rffs = 8192
uci_model.fit(online_train_data, mode = "cg", tol = 1e-6)
test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)
mae = np.mean( np.abs(test_predictions - test_y))
print(mae)
starting fitting
Chunk 0 complete.
Chunk 10 complete.
Using rank: 512
Chunk 0 complete.
Chunk 10 complete.
0 iterations complete.
5 iterations complete.
10 iterations complete.
15 iterations complete.
20 iterations complete.
25 iterations complete.
30 iterations complete.
35 iterations complete.
40 iterations complete.
45 iterations complete.
50 iterations complete.
55 iterations complete.
CG iterations: 57
Now performing variance calculations...
Fitting complete.
2.874895428012307
[18]:
#We can switch the model over to CPU if we want to do inference on CPU (training is best
#done on GPU if possible.)
uci_model.device = "cpu"
[19]:
#Finally, we'll delete the .npy files we created earlier.
for xfile, yfile in zip(xfiles, yfiles):
    os.remove(xfile)
    os.remove(yfile)

Now let’s look at some more interesting examples.

[ ]: