{ "cells": [ { "cell_type": "markdown", "id": "b1382a54", "metadata": {}, "source": [ "## Example: Using a custom Dataset object\n", "\n", "You can use the `build_regression_dataset` and `build_classification_dataset` calls in xGPR to build a Dataset object that wraps your training data, then pass this to the fitting and tuning routines. Both functions work with numpy arrays\n", "either in memory or saved on disk. However, your data may not be in the form of a\n", "numpy array or list of `.npy` files; it may be stored in an HDF5 file or SQLite db, for example.\n", "You could convert your data and save it to disk as `.npy` files, but it is often more convenient to\n", "keep your data in its original form rather than making an unnecessary copy, especially if the input\n", "dataset is large. In these situations you can create a custom Dataset object by subclassing the\n", "`DatasetBaseclass` object in xGPR (a little like a custom Dataloader in PyTorch).\n", "\n", "In this example, we'll illustrate how to build a custom Dataset that we can pass to all of the training\n", "and tuning functions. You can also use this in situations where there's some special prep you want to\n", "run on each datapoint before it's passed to xGPR." ] }, { "cell_type": "code", "execution_count": 1, "id": "c1d0a6db", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/ssd1/Documents/gp_proteins/venv_testing/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import os\n", "import math\n", "import time\n", "\n", "import wget\n", "import pandas as pd\n", "import numpy as np\n", "import sklearn\n", "from sklearn.model_selection import train_test_split\n", "\n", "from xGPR import xGPRegression as xGPReg\n", "from xGPR import DatasetBaseclass" ] }, { "cell_type": "code", "execution_count": 2, "id": "39d7217e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-1 / unknown" ] } ], "source": [ "fname = wget.download(\"https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv\")\n", "raw_data = pd.read_csv(fname)\n", "os.remove(fname)" ] }, { "cell_type": "code", "execution_count": 3, "id": "ec1e70f2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
         RMSD        F1       F2       F3        F4            F5        F6       F7   F8       F9
0      17.284  13558.30  4305.35  0.31754  162.1730  1.872791e+06  215.3590  4287.87  102  27.0302
1       6.021   6191.96  1623.16  0.26213   53.3894  8.034467e+05   87.2024  3328.91   39  38.5468
2       9.275   7725.98  1726.28  0.22343   67.2887  1.075648e+06   81.7913  2981.04   29  38.8119
3      15.851   8424.58  2368.25  0.28111   67.8325  1.210472e+06  109.4390  3248.22   70  39.0651
4       7.962   7460.84  1736.94  0.23280   52.4123  1.021020e+06   94.5234  2814.42   41  39.9147
...       ...       ...      ...      ...       ...           ...       ...      ...  ...      ...
45725   3.762   8037.12  2777.68  0.34560   64.3390  1.105797e+06  112.7460  3384.21   84  36.8036
45726   6.521   7978.76  2508.57  0.31440   75.8654  1.116725e+06  102.2770  3974.52   54  36.0470
45727  10.356   7726.65  2489.58  0.32220   70.9903  1.076560e+06  103.6780  3290.46   46  37.4718
45728   9.791   8878.93  3055.78  0.34416   94.0314  1.242266e+06  115.1950  3421.79   41  35.6045
45729  18.827  12732.40  4444.36  0.34905  157.6300  1.788897e+06  229.4590  4626.85  141  29.8118
\n", "

45730 rows × 10 columns

\n", "
" ], "text/plain": [ " RMSD F1 F2 F3 F4 F5 F6 \\\n", "0 17.284 13558.30 4305.35 0.31754 162.1730 1.872791e+06 215.3590 \n", "1 6.021 6191.96 1623.16 0.26213 53.3894 8.034467e+05 87.2024 \n", "2 9.275 7725.98 1726.28 0.22343 67.2887 1.075648e+06 81.7913 \n", "3 15.851 8424.58 2368.25 0.28111 67.8325 1.210472e+06 109.4390 \n", "4 7.962 7460.84 1736.94 0.23280 52.4123 1.021020e+06 94.5234 \n", "... ... ... ... ... ... ... ... \n", "45725 3.762 8037.12 2777.68 0.34560 64.3390 1.105797e+06 112.7460 \n", "45726 6.521 7978.76 2508.57 0.31440 75.8654 1.116725e+06 102.2770 \n", "45727 10.356 7726.65 2489.58 0.32220 70.9903 1.076560e+06 103.6780 \n", "45728 9.791 8878.93 3055.78 0.34416 94.0314 1.242266e+06 115.1950 \n", "45729 18.827 12732.40 4444.36 0.34905 157.6300 1.788897e+06 229.4590 \n", "\n", " F7 F8 F9 \n", "0 4287.87 102 27.0302 \n", "1 3328.91 39 38.5468 \n", "2 2981.04 29 38.8119 \n", "3 3248.22 70 39.0651 \n", "4 2814.42 41 39.9147 \n", "... ... ... ... \n", "45725 3384.21 84 36.8036 \n", "45726 3974.52 54 36.0470 \n", "45727 3290.46 46 37.4718 \n", "45728 3421.79 41 35.6045 \n", "45729 4626.85 141 29.8118 \n", "\n", "[45730 rows x 10 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data" ] }, { "cell_type": "code", "execution_count": 4, "id": "649cb19e", "metadata": {}, "outputs": [], "source": [ "train_data, test_data = train_test_split(raw_data, test_size = 0.2, random_state=123)\n", "\n", "# We will store these values so we can standardize the data. This isn't necessary\n", "# but is often beneficial. 
For a large dataset we could calculate these values\n", "# by loading the raw data in chunks, but for simplicity since this data frame is\n", "# small enough to hold in memory, we'll just work with the full data frame here.\n", "trainy_mean, trainy_std = train_data[\"RMSD\"].values.mean(), train_data[\"RMSD\"].values.std()\n", "train_mean, train_std = train_data.values[:,1:].mean(axis=0), train_data.values[:,1:].std(axis=0)\n", "\n", "# Again, not necessary, but for illustration we'll save the training data in a csv so\n", "# we can see how to load it in chunks in our CustomDataset.\n", "train_data.to_csv(\"training_data.csv\", index=False)" ] }, { "cell_type": "markdown", "id": "c590e1cc", "metadata": {}, "source": [ "In this case, of course, the data is a relatively small numpy array that we can easily store in memory,\n", "so there's not much benefit to a custom Dataset here. However, we'll use this as an example since\n", "it's straightforward. We'll look at how the custom Dataset is set up, then illustrate with an example.\n", "\n", "When you build a custom Dataset, your Dataset should inherit from `DatasetBaseclass`. This means your\n", "Dataset will always look like:\n", "```\n", "class CustomDataset(DatasetBaseclass):\n", "\n", "    def __init__(self, xdim, chunk_size, trainy_mean, trainy_std, max_class):\n", "        # Add any other arguments your Dataset needs here.\n", "        super().__init__(xdim, chunk_size, trainy_mean,\n", "                         trainy_std, max_class)\n", "\n", "    def get_chunked_data(self):\n", "        '''Implement logic for getting the next minibatch here --\n", "        loading it from a database for example. length_array is\n", "        either None if your dataset only generates 2d arrays,\n", "        or an array of shape (minibatch size) with the length\n", "        of the sequence in each corresponding datapoint if\n", "        you generate 3d arrays -- this is so that zero-padding\n", "        can be \"masked\". 
If you don't care about masking\n", "        zero-padding, you can just set all elements of length_array\n", "        to equal xarray.shape[1].\n", "\n", "        All three arrays should be C-contiguous numpy arrays,\n", "        otherwise an exception may be generated. xarray can be\n", "        any type, but yarray must be np.float64, and length_array\n", "        (if not None) must be np.int32. Returning\n", "        arrays that contain np.inf or np.nan may cause training to\n", "        fail; do so at your own risk.'''\n", "        yield xarray, yarray, length_array\n", "\n", "\n", "    def get_chunked_x_data(self):\n", "        '''Implement logic for getting the next minibatch here in situations\n", "        where the y-values are not needed. This is the same in all other respects\n", "        as get_chunked_data.'''\n", "        yield xarray, length_array\n", "```\n", "There are five arguments you have to pass when initializing the parent through `super()`. `xdim` is either a two-tuple (if your\n", "input arrays will all be 2d) or a three-tuple (if they will all be 3d). The only element of the tuple that matters is the last\n", "one, which indicates the dimensionality of your input; the other tuple elements can be set to 1. So, for example, if all your\n", "input arrays will be 3d arrays with variable dim0 and dim1 where dim2 is always 21, you should set `xdim=(1,1,21)`. If all\n", "your input arrays are 2d arrays with variable dim0 but dim1 is always 200, you should set `xdim=(1,200)`, and so on.\n", "\n", "`chunk_size` is the maximum number of datapoints you will give to xGPR in any given minibatch. The size of minibatches doesn't\n", "affect the training outcome in any way; it only affects speed and memory consumption (larger is slightly faster but takes up\n", "more memory).\n", "\n", "`trainy_mean` is the mean of your y-values. You can set this to zero if you don't want to standardize your y-values\n", "(it's usually a good idea but not required). 
In your `get_chunked_data` function, if `trainy_mean` is not zero,\n", "you should subtract `trainy_mean` from the y-values that the function returns. Otherwise test set performance\n", "may degrade dramatically, because xGPR adds the `trainy_mean` value it gets from\n", "your Dataset to its predictions.\n", "\n", "`trainy_std` is the standard deviation of your y-values. You can set this to 1 if you don't want to standardize your\n", "y-values. In your `get_chunked_data` function, if `trainy_std` is not 1,\n", "you should divide your y-values by `trainy_std` after subtracting `trainy_mean`. Otherwise test set performance\n", "may degrade dramatically, because xGPR multiplies its predictions by the `trainy_std` value it gets from\n", "your Dataset.\n", "\n", "`max_class` is only used for classification problems, so it can be set to any value for regression.\n", "For classification, set `max_class` to the largest class label expected in the dataset.\n", "\n", "Let's see what this looks like using this data as an example." ] }, { "cell_type": "code", "execution_count": 5, "id": "865cfcaa-fc36-452b-8480-a599ca4e0aca", "metadata": {}, "outputs": [], "source": [ "class CustomDataset(DatasetBaseclass):\n", "\n", "    def __init__(self, input_file, xsize, ymean, ystd, xmean, xstd):\n", "        # xdim should be a two-tuple for 2d array input and a three-tuple\n", "        # for 3d array input. 
All elements except the last are ignored\n", "        # so can just be set to 1.\n", "        xdim = (1, xsize)\n", "\n", "        # Since this is not classification max_class can be set arbitrarily.\n", "        super().__init__(xdim = xdim, chunk_size = 2000, trainy_mean = ymean,\n", "                         trainy_std = ystd, max_class = 1)\n", "\n", "        # Save some values we'll use for standardizing data as we load it.\n", "        self.xmean = xmean\n", "        self.xstd = xstd\n", "        self.input_file = input_file\n", "\n", "\n", "    def get_chunked_data(self):\n", "        # Note that self.get_chunk_size() is a built-in method that\n", "        # returns whatever chunk_size we passed when initializing the\n", "        # parent through super().\n", "        with pd.read_csv(self.input_file,\n", "                         chunksize=self.get_chunk_size()) as reader:\n", "            for chunk in reader:\n", "                # Standardize the x and y values as we load them.\n", "                # self.get_ymean() and self.get_ystd() are built-in\n", "                # methods that return whatever ymean and ystd we\n", "                # passed when initializing the parent class through super().\n", "                xvalues = (chunk.values[:,1:] - self.xmean[None,:]) / self.xstd[None,:]\n", "                yvalues = (chunk.values[:,0] - self.get_ymean()) / self.get_ystd()\n", "                # Since this is a 2d array, we yield None as the last\n", "                # value. If these were 3d arrays, we would yield a numpy array of\n", "                # type np.int32 as the third value, where each element indicates\n", "                # the corresponding sequence length for that datapoint (this is so that\n", "                # zero-padding can be masked if desired). 
Note that yvalues should\n", "                # always be type np.float64.\n", "                yield xvalues, yvalues.astype(np.float64), None\n", "\n", "\n", "    def get_chunked_x_data(self):\n", "        # Note that self.get_chunk_size() is a built-in method that\n", "        # returns whatever chunk_size we passed when initializing the\n", "        # parent through super().\n", "        with pd.read_csv(self.input_file,\n", "                         chunksize=self.get_chunk_size()) as reader:\n", "            for chunk in reader:\n", "                # Standardize the x values as we load them. The y values\n", "                # are not needed here, so we skip the first column.\n", "                xvalues = (chunk.values[:,1:] - self.xmean[None,:]) / self.xstd[None,:]\n", "                # Since this is a 2d array, we yield None as the last\n", "                # value. If these were 3d arrays, we would yield a numpy array of\n", "                # type np.int32 as the last value, where each element indicates\n", "                # the corresponding sequence length for that datapoint (this is so that\n", "                # zero-padding can be masked if desired).\n", "                yield xvalues, None" ] }, { "cell_type": "code", "execution_count": 6, "id": "0ebea394-5b15-4fe0-82df-7390add7602b", "metadata": {}, "outputs": [], "source": [ "my_dataset = CustomDataset(\"training_data.csv\", train_data.shape[1] - 1,\n", "                           trainy_mean, trainy_std, train_mean, train_std)" ] }, { "cell_type": "code", "execution_count": 7, "id": "6c8a2cc5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grid point 0 acquired.\n", "Grid point 1 acquired.\n", "Grid point 2 acquired.\n", "Grid point 3 acquired.\n", "Grid point 4 acquired.\n", "Grid point 5 acquired.\n", "Grid point 6 acquired.\n", "Grid point 7 acquired.\n", "Grid point 8 acquired.\n", "Grid point 9 acquired.\n", "New hparams: [-0.2041134]\n", "Additional acquisition 10.\n", "New hparams: [0.1916695]\n", "Additional acquisition 11.\n", "New hparams: [0.2469573]\n", 
"Best score achieved: 40022.306\n", "Best hyperparams: [-0.5406061 0.2469573]\n", "Wallclock: 7.114676237106323\n" ] } ], "source": [ "uci_model = xGPReg(num_rffs = 1024, variance_rffs = 512,\n", " kernel_choice = \"RBF\", verbose = True, device = \"cuda\",\n", " random_seed = 123)\n", "\n", "start_time = time.time()\n", "uci_model.tune_hyperparams_crude(my_dataset)\n", "end_time = time.time()\n", "\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "markdown", "id": "c9707c70-afe4-458f-b211-fa435b0a8788", "metadata": {}, "source": [ "Compare this to the tabular data tutorial and you'll notice that the hyperparameters and score we achieved\n", "are the same -- our custom Dataset works just fine. Now let's use it to fit the tuned model." ] }, { "cell_type": "code", "execution_count": 8, "id": "5e12d8c9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "starting fitting\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "Using rank: 512\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "0 iterations complete.\n", "5 iterations complete.\n", "10 iterations complete.\n", "15 iterations complete.\n", "20 iterations complete.\n", "25 iterations complete.\n", "30 iterations complete.\n", "CG iterations: 35\n", "Now performing variance calculations...\n", "Fitting complete.\n", "Wallclock: 4.269773960113525\n" ] } ], "source": [ "uci_model.num_rffs = 8192\n", "start_time = time.time()\n", "uci_model.fit(my_dataset, mode = \"cg\", tol = 1e-6)\n", "end_time = time.time()\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "markdown", "id": "d0583d66-dd97-4bb2-8087-62fcc209424a", "metadata": {}, "source": [ "Notice one catch: because we standardized our training x data, we should do the same to our testing\n", "x data, otherwise we'll get really strange results. 
xGPR will already take into account the\n", "y mean and y standard deviation that we passed to DatasetBaseclass when initializing the Dataset." ] }, { "cell_type": "code", "execution_count": 9, "id": "a23dcb07-8461-4768-ae1f-bc4d2494de9a", "metadata": {}, "outputs": [], "source": [ "test_x = (test_data.values[:,1:] - train_mean) / train_std\n", "test_y = test_data.values[:,0]" ] }, { "cell_type": "code", "execution_count": 10, "id": "66136407", "metadata": {}, "outputs": [], "source": [ "test_predictions, test_var = uci_model.predict(test_x, get_var = True, chunk_size = 1000)" ] }, { "cell_type": "code", "execution_count": 11, "id": "e6e1aef6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 3.029597679749587\n" ] } ], "source": [ "mae = np.mean( np.abs(test_predictions - test_y))\n", "print(f\"MAE: {mae}\")" ] }, { "cell_type": "code", "execution_count": 12, "id": "758ef555-f392-4d49-ad1a-ef1a33172e7e", "metadata": {}, "outputs": [], "source": [ "os.remove(\"training_data.csv\")" ] }, { "cell_type": "markdown", "id": "3aa66e48", "metadata": {}, "source": [ "And there you have it. You can use the approach outlined above to set up a custom Dataset that wraps a fasta file, SQLite db\n", "or some other object instead of the csv we used here if desired." ] }, { "cell_type": "code", "execution_count": null, "id": "141e83df", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }