Example: Using a custom Dataset object

You can use the build_regression_dataset and build_classification_dataset calls in xGPR to build a Dataset object that wraps your training data, then pass this to the fitting and tuning routines. Both functions work with numpy arrays either in memory or saved on disk. However, there may be situations where your data is not in the form of a numpy array or list of .npy files – if your data is stored in an HDF5 file or SQLite db, for example – and while you could take your data and save it to disk as .npy files, it can sometimes be more convenient to keep your data in its original form without making a copy of it unnecessarily, especially if the input dataset is large. In these situations it’s easy to create a custom Dataset object by subclassing the DatasetBaseclass object in xGPR (a little like a custom Dataloader in PyTorch).

In this example, we’ll illustrate how to build a custom Dataset that we can pass to all of the training and tuning functions. You can also use this in situations where there’s some special prep you want to run on each datapoint before it’s passed to xGPR.

[1]:
import os
import math
import time

import wget
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

from xGPR import xGPRegression as xGPReg
from xGPR import DatasetBaseclass
/ssd1/Documents/gp_proteins/venv_testing/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[2]:
fname = wget.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv")
raw_data = pd.read_csv(fname)
os.remove(fname)
-1 / unknown
[3]:
raw_data
[3]:
RMSD F1 F2 F3 F4 F5 F6 F7 F8 F9
0 17.284 13558.30 4305.35 0.31754 162.1730 1.872791e+06 215.3590 4287.87 102 27.0302
1 6.021 6191.96 1623.16 0.26213 53.3894 8.034467e+05 87.2024 3328.91 39 38.5468
2 9.275 7725.98 1726.28 0.22343 67.2887 1.075648e+06 81.7913 2981.04 29 38.8119
3 15.851 8424.58 2368.25 0.28111 67.8325 1.210472e+06 109.4390 3248.22 70 39.0651
4 7.962 7460.84 1736.94 0.23280 52.4123 1.021020e+06 94.5234 2814.42 41 39.9147
... ... ... ... ... ... ... ... ... ... ...
45725 3.762 8037.12 2777.68 0.34560 64.3390 1.105797e+06 112.7460 3384.21 84 36.8036
45726 6.521 7978.76 2508.57 0.31440 75.8654 1.116725e+06 102.2770 3974.52 54 36.0470
45727 10.356 7726.65 2489.58 0.32220 70.9903 1.076560e+06 103.6780 3290.46 46 37.4718
45728 9.791 8878.93 3055.78 0.34416 94.0314 1.242266e+06 115.1950 3421.79 41 35.6045
45729 18.827 12732.40 4444.36 0.34905 157.6300 1.788897e+06 229.4590 4626.85 141 29.8118

45730 rows × 10 columns

[4]:
train_data, test_data = train_test_split(raw_data, test_size = 0.2, random_state=123)

# We will store these values so we can standardize the data. This isn't necessary
# but is often beneficial. For a large dataset we could calculate these values
# by loading the raw data in chunks, but for simplicity since this data frame is
# small enough to hold in memory, we'll just work with the full data frame here.
trainy_mean, trainy_std = train_data["RMSD"].values.mean(), train_data["RMSD"].values.std()
train_mean, train_std = train_data.values[:,1:].mean(axis=0), train_data.values[:,1:].std(axis=0)

# Again, not necessary but just for fun here we'll save the training data in a csv so
# we can see how we can load it in chunks in our CustomDataset.
train_data.to_csv("training_data.csv", index=False)

In this case, of course, the data is a relatively small numpy array that we can easily store in memory, so there’s not much benefit to a custom Dataset here. However we’ll use this as an example since it’s straightforward. We’ll look at how the custom Dataset is set up then illustrate with an example.

When you build a custom Dataset, your Dataset should inherit from DatasetBaseclass. This means your Dataset will always look like:

class CustomDataset(DatasetBaseclass):

    def __init__('''arguments here'''):
        super().__init__(xdim, chunk_size, trainy_mean,
                trainy_std, max_class)

    def get_chunked_data(self):
        '''implement logic for getting the next minibatch here --
            loading it from a database for example. length_array is
            either None if your dataset only generates 2d arrays,
            or an array of shape (minibatch size) with the length
            of the sequence in each corresponding datapoint if
            you generate 3d arrays -- this is so that zero-padding
            can be "masked". If you don't care about masking
            zero-padding, you can just set all elements of length_array
            to equal xarray.shape[1].

            All three arrays should be c-contiguous numpy arrays,
            otherwise an exception may be generated. xarray can be
            any type, but yarray must be np.float64, and length_array
            (if not None) must be np.int32. Returning
            arrays that contain np.inf or np.nan may cause training to
            fail, do that at your own risk.'''
        return xarray, yarray, length_array


    def get_chunked_x_data(self):
        '''implement logic for getting the next minibatch here in situations
            where the yvalues are not needed. This is the same in all other respects
            as get_chunked_data.'''
        return xarray, length_array

There are five arguments you have to pass when initializing the parent through super(). xdim is either a two-tuple (if your input arrays will all be 2d) or a three-tuple (if they will all be 3d). The only element of the tuple that matters is the last one, which indicates the dimensionality of your input; the other tuple elements can be set to 1. So for example if all your input arrays will be 3d arrays with variable dim0 and dim1 where dim2 is always 21, you should set xdim=(1,1,21). If all your input arrays are 2d arrays with variable dim0 but dim1 is always 200, you should set xdim=(1,200), and so on.

chunk_size is the maximum number of datapoints you will give to xGPR in any given minibatch. The size of minibatches doesn’t affect the training outcome in any way, it only affects speed and memory consumption (larger is slightly faster but takes up more memory).

trainy_mean is the mean of your y-values. You can set this to zero if you don’t want to standardize your y-values (it’s usually a good idea but not required). In your get_chunked_data function, if trainy_mean is not zero, you should subtract trainy_mean from the yvalues that the function returns, otherwise this may cause a dramatic crash in test set performance (xGPR adds the trainy_mean value it gets from your Dataset to predictions).

trainy_std is the standard deviation of your y-values. You can set this to 1 if you don’t want to standardize your y-values. In your get_chunked_data function, if trainy_std is not 1, you should divide your y-values by trainy_std after subtracting trainy_mean, otherwise this may cause a dramatic crash in test set performance (xGPR multiplies predictions by the trainy_std value it gets from your Dataset).

max_class is only used for classification problems so can be set to any value for regression (it doesn’t matter). For classification, you should set a max_class value to be the maximum class expected in the dataset.

Let’s see what this looks like using this data as an example.

[5]:
class CustomDataset(DatasetBaseclass):

    def __init__(self, input_file, xsize, ymean, ystd, xmean, xstd):
        # xdim should be a two-tuple for 2d array input and a three-tuple
        # for 3d array input. All elements except the last are ignored
        # so can just be set to 1.
        xdim = (1, xsize)

        # Since this is not classification max_class can be set arbitrarily.
        super().__init__(xdim = xdim, chunk_size = 2000, trainy_mean = ymean,
                trainy_std = ystd, max_class = 1)

        # Save some values we'll use for standardizing data as we load it.
        self.xmean = xmean
        self.xstd = xstd
        self.input_file = input_file


    def get_chunked_data(self):
        # Note that self.get_chunk_size() is a build-in function that
        # returns whatever chunk_size we passed when initializing the
        # parent through super().
        with pd.read_csv(self.input_file,
                    chunksize=self.get_chunk_size()) as reader:
            for chunk in reader:
                #Standardize the x and y values as we load them.
                # self.get_ymean() and self.get_ystd() are build-in
                # functions that return whatever ymean and ystd we
                # passed when initializing the parent class through super().
                xvalues = (chunk.values[:,1:] - self.xmean[None,:]) / self.xstd[None,:]
                yvalues = (chunk.values[:,0] - self.get_ymean()) / self.get_ystd()
                # Here since this is a 2d array we return None for the last return
                # value. If these were 3d arrays, we would return a numpy array of
                # type np.int32 for the third return value where each element indicates
                # the corresponding sequence length for that datapoint (this is so that
                # zero-padding can be masked if desired). Note that yvalues should
                # always be type np.float64.
                yield xvalues, yvalues.astype(np.float64), None


    def get_chunked_x_data(self):
        # Note that self.get_chunk_size() is a build-in function that
        # returns whatever chunk_size we passed when initializing the
        # parent through super().
        with pd.read_csv(self.input_file,
                    chunksize=self.get_chunk_size()) as reader:
            for chunk in reader:
                #Standardize the x and y values as we load them.
                # self.get_ymean() and self.get_ystd() are build-in
                # functions that return whatever ymean and ystd we
                # passed when initializing the parent class through super().
                xvalues = (chunk.values[:,1:] - self.xmean[None,:]) / self.xstd[None,:]
                # Here since this is a 2d array we return None for the last return
                # value. If these were 3d arrays, we would return a numpy array of
                # type np.int32 for the last return value where each element indicates
                # the corresponding sequence length for that datapoint (this is so that
                # zero-padding can be masked if desired).
                yield xvalues, None
[6]:
my_dataset = CustomDataset("training_data.csv", train_data.shape[1] - 1,
                           trainy_mean, trainy_std, train_mean, train_std)
[7]:
uci_model = xGPReg(num_rffs = 1024, variance_rffs = 512,
                  kernel_choice = "RBF", verbose = True, device = "cuda",
                  random_seed = 123)

start_time = time.time()
uci_model.tune_hyperparams_crude(my_dataset)
end_time = time.time()

print(f"Wallclock: {end_time - start_time}")
Grid point 0 acquired.
Grid point 1 acquired.
Grid point 2 acquired.
Grid point 3 acquired.
Grid point 4 acquired.
Grid point 5 acquired.
Grid point 6 acquired.
Grid point 7 acquired.
Grid point 8 acquired.
Grid point 9 acquired.
New hparams: [-0.2041134]
Additional acquisition 10.
New hparams: [0.1916695]
Additional acquisition 11.
New hparams: [0.2469573]
Best score achieved: 40022.306
Best hyperparams: [-0.5406061  0.2469573]
Wallclock: 7.114676237106323

Compare this to the tabular data tutorial and you’ll notice that the hyperparameters and score we achieved are the same – our custom Dataset works just fine. Now let’s use it to fit the tuned model.

[8]:
uci_model.num_rffs = 8192
start_time = time.time()
uci_model.fit(my_dataset, mode = "cg", tol = 1e-6)
end_time = time.time()
print(f"Wallclock: {end_time - start_time}")
starting fitting
Chunk 0 complete.
Chunk 10 complete.
Using rank: 512
Chunk 0 complete.
Chunk 10 complete.
0 iterations complete.
5 iterations complete.
10 iterations complete.
15 iterations complete.
20 iterations complete.
25 iterations complete.
30 iterations complete.
CG iterations: 35
Now performing variance calculations...
Fitting complete.
Wallclock: 4.269773960113525

Notice one catch: because we standardized our training x data, we should do the same to our testing x data, otherwise we’ll get really strange results. xGPR will already take into account the y mean and y standard deviation that we passed to DatasetBaseclass when initializing the Dataset.

[9]:
test_x = (test_data.values[:,1:] - train_mean) / train_std
test_y = test_data.values[:,0]
[10]:
test_predictions, test_var = uci_model.predict(test_x, get_var = True, chunk_size = 1000)
[11]:
mae = np.mean( np.abs(test_predictions - test_y))
print(f"MAE: {mae}")
MAE: 3.029597679749587
[12]:
os.remove("training_data.csv")

And there you have it. You can use the approach outlined above to set up a custom Dataset that wraps a fasta file, SQLite db or some other object instead of the csv we used here if desired.

[ ]: