{ "cells": [ { "cell_type": "markdown", "id": "b1382a54", "metadata": {}, "source": [ "## Example: Fitting tabular data\n", "\n", "This straightforward example makes use of a small,\n", "fairly random UCI repository dataset with about 45,000 datapoints. We'll\n", "download this data, do some light preprocessing, and fit an RBF kernel.\n", "\n", "These experiments used xGPR v0.4.8. Note that if setting device to cuda,\n", "xGPR always uses the currently active cuda device. To control which\n", "device this is, you can set the environment variable \"CUDA_VISIBLE_DEVICES\"." ] }, { "cell_type": "code", "execution_count": 1, "id": "c1d0a6db", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/ssd1/Documents/gp_proteins/venv_testing/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import os\n", "import math\n", "import time\n", "\n", "import wget\n", "import pandas as pd\n", "import numpy as np\n", "import sklearn\n", "from sklearn.model_selection import train_test_split\n", "\n", "from xGPR import xGPRegression as xGPReg\n", "from xGPR import build_regression_dataset" ] }, { "cell_type": "code", "execution_count": 2, "id": "39d7217e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-1 / unknown" ] } ], "source": [ "fname = wget.download(\"https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv\")\n", "raw_data = pd.read_csv(fname)\n", "os.remove(fname)" ] }, { "cell_type": "code", "execution_count": 3, "id": "ec1e70f2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RMSDF1F2F3F4F5F6F7F8F9
017.28413558.304305.350.31754162.17301.872791e+06215.35904287.8710227.0302
16.0216191.961623.160.2621353.38948.034467e+0587.20243328.913938.5468
29.2757725.981726.280.2234367.28871.075648e+0681.79132981.042938.8119
315.8518424.582368.250.2811167.83251.210472e+06109.43903248.227039.0651
47.9627460.841736.940.2328052.41231.021020e+0694.52342814.424139.9147
.................................
457253.7628037.122777.680.3456064.33901.105797e+06112.74603384.218436.8036
457266.5217978.762508.570.3144075.86541.116725e+06102.27703974.525436.0470
4572710.3567726.652489.580.3222070.99031.076560e+06103.67803290.464637.4718
457289.7918878.933055.780.3441694.03141.242266e+06115.19503421.794135.6045
4572918.82712732.404444.360.34905157.63001.788897e+06229.45904626.8514129.8118
\n", "

45730 rows × 10 columns

\n", "
" ], "text/plain": [ " RMSD F1 F2 F3 F4 F5 F6 \\\n", "0 17.284 13558.30 4305.35 0.31754 162.1730 1.872791e+06 215.3590 \n", "1 6.021 6191.96 1623.16 0.26213 53.3894 8.034467e+05 87.2024 \n", "2 9.275 7725.98 1726.28 0.22343 67.2887 1.075648e+06 81.7913 \n", "3 15.851 8424.58 2368.25 0.28111 67.8325 1.210472e+06 109.4390 \n", "4 7.962 7460.84 1736.94 0.23280 52.4123 1.021020e+06 94.5234 \n", "... ... ... ... ... ... ... ... \n", "45725 3.762 8037.12 2777.68 0.34560 64.3390 1.105797e+06 112.7460 \n", "45726 6.521 7978.76 2508.57 0.31440 75.8654 1.116725e+06 102.2770 \n", "45727 10.356 7726.65 2489.58 0.32220 70.9903 1.076560e+06 103.6780 \n", "45728 9.791 8878.93 3055.78 0.34416 94.0314 1.242266e+06 115.1950 \n", "45729 18.827 12732.40 4444.36 0.34905 157.6300 1.788897e+06 229.4590 \n", "\n", " F7 F8 F9 \n", "0 4287.87 102 27.0302 \n", "1 3328.91 39 38.5468 \n", "2 2981.04 29 38.8119 \n", "3 3248.22 70 39.0651 \n", "4 2814.42 41 39.9147 \n", "... ... ... ... \n", "45725 3384.21 84 36.8036 \n", "45726 3974.52 54 36.0470 \n", "45727 3290.46 46 37.4718 \n", "45728 3421.79 41 35.6045 \n", "45729 4626.85 141 29.8118 \n", "\n", "[45730 rows x 10 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data" ] }, { "cell_type": "code", "execution_count": 4, "id": "649cb19e", "metadata": {}, "outputs": [], "source": [ "train_data, test_data = train_test_split(raw_data, test_size = 0.2, random_state=123)\n", "\n", "train_y, test_y = train_data[\"RMSD\"].values, test_data[\"RMSD\"].values\n", "train_x, test_x = train_data.iloc[:,1:].values, test_data.iloc[:,1:].values\n", "\n", "#Standardizing the features\n", "\n", "train_mean, train_std = train_x.mean(axis=0), train_x.std(axis=0)\n", "train_x = (train_x - train_mean[None,:]) / train_std[None,:]\n", "\n", "test_x = (test_x - train_mean[None,:]) / train_std[None,:]" ] }, { "cell_type": "markdown", "id": "c590e1cc", "metadata": {}, "source": [ "Next, we'll set the data up for use as a training set by xGPR. If \n", "the data is too large to fit in memory, we can save it in \"chunks\"\n", "to disk, each chunk as a .npy file with the corresponding y-values\n", "as another .npy file, then build an OfflineDataset.\n", "In this case, we'll build an OnlineDataset as well to illustrate.\n", "\n", "The chunk_size parameter indicates how much data the Dataset\n", "will feed to xGPR at any one given time during training. It's a \n", "little like a minibatch for deep learning. If you're using a\n", "large number of random features to ensure a highly accurate model,\n", "or if your data has a large number of features per datapoint,\n", "set chunk_size small to avoid excessive memory consumption. This\n", "does not affect the accuracy of the model or training in any way,\n", "merely memory and to some extent speed (larger chunk sizes are\n", "slightly faster)." ] }, { "cell_type": "code", "execution_count": 5, "id": "9498c163", "metadata": {}, "outputs": [], "source": [ "online_train_data = build_regression_dataset(train_x, train_y, chunk_size = 2000)" ] }, { "cell_type": "markdown", "id": "51e576ee", "metadata": {}, "source": [ "For the OfflineDataset, we'll save the data to .npy files; each file\n", "can only contain up to chunk_size datapoints. The files don't all have to be the same size." ] }, { "cell_type": "code", "execution_count": 6, "id": "4d3cd484", "metadata": {}, "outputs": [], "source": [ "chunk_size = 2000\n", "xfiles, yfiles = [], []\n", "\n", "for i in range(0, math.ceil(train_x.shape[0] / chunk_size)):\n", " xfiles.append(f\"{i}_xblock.npy\")\n", " yfiles.append(f\"{i}_yblock.npy\")\n", " start = i * chunk_size\n", " end = min((i + 1) * chunk_size, train_x.shape[0])\n", " np.save(xfiles[-1], train_x[start:end,:])\n", " np.save(yfiles[-1], train_y[start:end])\n", "\n", "\n", "offline_train_data = build_regression_dataset(xfiles, yfiles, chunk_size = 2000)" ] }, { "cell_type": "markdown", "id": "8bad2770", "metadata": {}, "source": [ "We'll do an initial quick and dirty hyperparameter tuning run using\n", "the ``tune_hyperparams_crude`` method, using a small number of random\n", "features. Later on we'll use\n", "a larger number of random features (for a more accurate kernel\n", "approximation) and fine-tune the hyperparameters to get a better\n", "model. Crude tuning with a smaller number of random features is\n", "useful as an initial experiment when you're deciding whether to\n", "use xGPR for your problem and if so, what features and kernel\n", "to use. It also gives us a good idea what region of hyperparameter\n", "space to search when fine-tuning, because the \"best\" hyperparameters\n", "from fine-tuning are *usually* not too far from those identified\n", "in an initial crude experiment." ] }, { "cell_type": "code", "execution_count": 7, "id": "6c8a2cc5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grid point 0 acquired.\n", "Grid point 1 acquired.\n", "Grid point 2 acquired.\n", "Grid point 3 acquired.\n", "Grid point 4 acquired.\n", "Grid point 5 acquired.\n", "Grid point 6 acquired.\n", "Grid point 7 acquired.\n", "Grid point 8 acquired.\n", "Grid point 9 acquired.\n", "New hparams: [-0.2041134]\n", "Additional acquisition 10.\n", "New hparams: [0.1916695]\n", "Additional acquisition 11.\n", "New hparams: [0.2469573]\n", "Best score achieved: 40022.306\n", "Best hyperparams: [-0.5406061 0.2469573]\n", "Wallclock: 7.319413185119629\n" ] } ], "source": [ "#Variance_rffs controls the accuracy of the uncertainty / variance approximation.\n", "#512 - 1024 is usually fine.\n", "uci_model = xGPReg(num_rffs = 1024, variance_rffs = 512,\n", " kernel_choice = \"RBF\", verbose = True, device = \"cuda\",\n", " random_seed = 123)\n", "\n", "start_time = time.time()\n", "uci_model.tune_hyperparams_crude(online_train_data)\n", "end_time = time.time()\n", "\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "markdown", "id": "205aeb4e", "metadata": {}, "source": [ "Just for fun, let's repeat this using the offline dataset...this requires\n", "loading data from disk in batches on each iteration." ] }, { "cell_type": "code", "execution_count": 8, "id": "4defbf34", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grid point 0 acquired.\n", "Grid point 1 acquired.\n", "Grid point 2 acquired.\n", "Grid point 3 acquired.\n", "Grid point 4 acquired.\n", "Grid point 5 acquired.\n", "Grid point 6 acquired.\n", "Grid point 7 acquired.\n", "Grid point 8 acquired.\n", "Grid point 9 acquired.\n", "New hparams: [-0.2041134]\n", "Additional acquisition 10.\n", "New hparams: [0.1916695]\n", "Additional acquisition 11.\n", "New hparams: [0.2469573]\n", "Best score achieved: 40022.306\n", "Best hyperparams: [-0.5406061 0.2469573]\n", "Wallclock: 6.411782503128052\n" ] } ], "source": [ "start_time = time.time()\n", "uci_model.tune_hyperparams_crude(offline_train_data)\n", "end_time = time.time()\n", "\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "markdown", "id": "3832e2bb", "metadata": {}, "source": [ "We can retrieve the resulting hyperparameters and save them somewhere\n", "for future use if needed. This function always returns the\n", "log of the hyperparameters, and if you're passing\n", "hyperparameters to the fitting function, you should use\n", "the log of the hyperparameters as well." ] }, { "cell_type": "code", "execution_count": 9, "id": "c2058512", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-0.5406061, 0.2469573])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uci_model.get_hyperparams()" ] }, { "cell_type": "markdown", "id": "6a52c002", "metadata": {}, "source": [ "Now we'll increase the number of RFFs for a more accurate kernel approximation and fit\n", "the model. When fitting, ``mode=exact`` is faster if the number of RFFs is small -- say < 6,000\n", "or so -- and ``mode=cg`` is faster for larger numbers of RFFs." ] }, { "cell_type": "code", "execution_count": 10, "id": "5e12d8c9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "starting fitting\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "Using rank: 512\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "0 iterations complete.\n", "5 iterations complete.\n", "10 iterations complete.\n", "15 iterations complete.\n", "20 iterations complete.\n", "25 iterations complete.\n", "30 iterations complete.\n", "CG iterations: 35\n", "Now performing variance calculations...\n", "Fitting complete.\n", "Wallclock: 3.184067964553833\n" ] } ], "source": [ "uci_model.num_rffs = 8192\n", "start_time = time.time()\n", "uci_model.fit(offline_train_data, mode = \"cg\", tol = 1e-6)\n", "end_time = time.time()\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "markdown", "id": "c80bcc80", "metadata": {}, "source": [ "We can get the uncertainty on predictions by setting get_var = True.\n", "In this case, we don't need it, so we'll skip it. chunk_size ensures\n", "we only process up to chunk_size datapoints at one time to limit\n", "memory consumption." ] }, { "cell_type": "code", "execution_count": 11, "id": "66136407", "metadata": {}, "outputs": [], "source": [ "test_predictions, test_var = uci_model.predict(test_x, get_var = True, chunk_size = 1000)" ] }, { "cell_type": "code", "execution_count": 12, "id": "e6e1aef6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 3.029597679749727\n" ] } ], "source": [ "mae = np.mean( np.abs(test_predictions - test_y))\n", "print(f\"MAE: {mae}\")" ] }, { "cell_type": "markdown", "id": "6655429c", "metadata": {}, "source": [ "Suppose we are unhappy with this result. We could of course consider\n", "a different kernel or modeling approach; alternatively, we can\n", "increase the number of random features for either tuning or\n", "fitting, which will almost invariably improve performance.\n", "\n", "Generally increasing the number of random features used for fitting gives a\n", "bigger performance boost than increasing the number for tuning. For fitting,\n", "it can be beneficial to use as many as 32,768 random features, while\n", "for tuning, we seldom see large performance gains for more than 10,000.\n", "Either way, however, increasing the number of random features\n", "yields diminishing returns. Going from 1024 to 2048 gives\n", "a more substantial improvement than going from 2048 to\n", "4096, and so on. If you ever find yourself needing to\n", "go to very high numbers, the model & kernel may not be\n", "a good fit for that particular problem.\n", "\n", "First, let's increase the number used to fit and see what happens..." ] }, { "cell_type": "code", "execution_count": 13, "id": "65ceb406", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "starting fitting\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "Using rank: 512\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "0 iterations complete.\n", "5 iterations complete.\n", "10 iterations complete.\n", "15 iterations complete.\n", "20 iterations complete.\n", "25 iterations complete.\n", "30 iterations complete.\n", "CG iterations: 35\n", "Now performing variance calculations...\n", "Fitting complete.\n", "Wallclock: 9.839282274246216\n" ] } ], "source": [ "uci_model.num_rffs = 32768\n", "\n", "start_time = time.time()\n", "uci_model.fit(offline_train_data, mode = \"cg\", tol = 1e-6)\n", "end_time = time.time()\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "548d75f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 2.9649900823083315\n" ] } ], "source": [ "test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)\n", "mae = np.mean( np.abs(test_predictions - test_y))\n", "print(f\"MAE: {mae}\")" ] }, { "cell_type": "markdown", "id": "eff215bc", "metadata": {}, "source": [ "As discussed above, we could also retune hyperparameters using a larger number of random features.\n", "In xGPR, if the number of random features is < 8000 or so, we can use ``model.exact_nmll`` to calculate\n", "the NMLL (a Bayesian measure of model quality that is strongly correlated with validation / test set\n", "performance). ``model.approximate_nmll`` is slower for small numbers of\n", "random features but scales better to larger numbers of random features (it uses a highly\n", "accurate approximation for log determinants). We could use Optuna or set\n", "up a grid search and call ``model.exact_nmll`` or ``model.approximate_nmll`` for each hyperparameter\n", "combo we want to evaluate. We'll illustrate all of these alternatives under the other examples.\n", "Optuna is one of our favorite methods and is often better than simple gridsearch.\n", "\n", "Or we can use xGPR's built-in ``model.tune_hyperparams()``, which lets us use either exact or approximate\n", "nmll using either L-BFGS-B, Nelder-Mead or Powell optimzation methods, using the hyperparameters from\n", "the crude initial tuning run as a starting point. We'll do that here. Nelder-Mead tends to be (very)\n", "thorough but take a long number of iterations to converge; it's not our preferred approach for this reason. Powell tends to be a little less optimal but converges quickly. L-BFGS-B doesn't scale well to more than 5,000 rffs or so because it\n", "has to calculate the gradient but converges very quickly in most cases.\n", "\n", "Increasing the number of RFFs yields diminishing returns. We could get\n", "an even better result by tuning using 16,384 RFFs for example instead of 8,192, but it would\n", "be slower and the benefit would be quite small typically." ] }, { "cell_type": "code", "execution_count": 15, "id": "b7208786", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Evaluated NMLL.\n", "Best score: 36552.343775204354\n", "Wallclock: 353.921005487442\n" ] } ], "source": [ "uci_model.num_rffs = 8192\n", "\n", "start_time = time.time()\n", "\n", "uci_model.tune_hyperparams(offline_train_data, tuning_method = \"Powell\",\n", " nmll_method = \"exact\", max_iter = 50)\n", "end_time = time.time()\n", "\n", "print(f\"Wallclock: {end_time - start_time}\")" ] }, { "cell_type": "code", "execution_count": 16, "id": "afb6c7c0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-0.6049985 0.75809609]\n" ] } ], "source": [ "print(uci_model.get_hyperparams())" ] }, { "cell_type": "markdown", "id": "eaf5f026", "metadata": {}, "source": [ "Now let's refit using our new hyperparameters, using 8192 fitting rffs so we \n", "can compare to what we used at first. We'll see that we get a slight improvement\n", "over our initial tuning run, but nothing to write home about. (This isn't always\n", "true -- sometimes fine-tuning the hyperparameters can make a big difference -- see\n", "the other examples.) Of course,\n", "by fitting using this new hyperparameter set with 32768 RFFs instead of 8192\n", "we could get some small additional improvement too." ] }, { "cell_type": "code", "execution_count": 17, "id": "90fbdc7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "starting fitting\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "Using rank: 512\n", "Chunk 0 complete.\n", "Chunk 10 complete.\n", "0 iterations complete.\n", "5 iterations complete.\n", "10 iterations complete.\n", "15 iterations complete.\n", "20 iterations complete.\n", "25 iterations complete.\n", "30 iterations complete.\n", "35 iterations complete.\n", "40 iterations complete.\n", "45 iterations complete.\n", "50 iterations complete.\n", "55 iterations complete.\n", "CG iterations: 57\n", "Now performing variance calculations...\n", "Fitting complete.\n", "2.874895428012307\n" ] } ], "source": [ "uci_model.num_rffs = 8192\n", "uci_model.fit(online_train_data, mode = \"cg\", tol = 1e-6)\n", "test_predictions = uci_model.predict(test_x, get_var = False, chunk_size = 1000)\n", "mae = np.mean( np.abs(test_predictions - test_y))\n", "print(mae)" ] }, { "cell_type": "code", "execution_count": 18, "id": "e3d3fe41", "metadata": {}, "outputs": [], "source": [ "#We can switch the model over to CPU if we want to do inference on CPU (training is best\n", "#done on GPU if possible.)\n", "uci_model.device = \"cpu\"" ] }, { "cell_type": "code", "execution_count": 19, "id": "11e757eb", "metadata": {}, "outputs": [], "source": [ "#Finally, we'll delete the .npy files we created earlier.\n", "for xfile, yfile in zip(xfiles, yfiles):\n", " os.remove(xfile)\n", " os.remove(yfile)" ] }, { "cell_type": "markdown", "id": "3aa66e48", "metadata": {}, "source": [ "Now let's look at some more interesting examples." ] }, { "cell_type": "code", "execution_count": null, "id": "141e83df", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }