xGPR quickstart
===============================================

Build your training set
-------------------------

Start by building a Dataset, which is similar to a DataLoader in PyTorch. If
your data is organized in one of a couple fairly common ways, you can
use a built-in xGPR function to build this dataset. If your data is in
some other form(e.g. a fasta file, an SQLite db or an HDF5 file) and you
don't want to make a copy of it, you can instead
subclass xGPR's ``DatasetBaseclass`` and build a
custom ``Dataset`` object. We'll look at the more common
situations first. You can use either ``build_regression_dataset``
or ``build_classification_dataset``, which take the same arguments,
as shown below:::

  from xGPR import build_regression_dataset, build_classification_dataset

  reg_train_set = build_regression_dataset(x, y, sequence_lengths = None,
                        chunk_size = 2000)

``x`` and ``y`` can be *either* numpy arrays OR a list of filepaths
to ``.npy`` files saved on disk. If the latter, no single one of the
files can contain more datapoints than the ``chunk_size`` setting.
``chunk_size`` controls how many datapoints xGPR processes at a time
while training. Unlike deep learning, this doesn't affect the fitting
trajectory in any way; it just affects memory consumption. If your
datapoints are really large, you can use a smaller chunk size to
reduce memory consumption.

For regression, ``y`` should be all real values. For classification,
it should be class labels starting from 0 (e.g. for a three-class problem
the labels should be 0, 1, or 2).

The ``x`` array(s) can be either 2d for tabular data or 3d for sequences
and graphs. If they are 2d, they should have shape (N, M) for N datapoints
and M features. If they are 3d, they should have shape (N, D, M) for N
datapoints, D sequence elements / timepoints / nodes and M features.
If they are arrays saved on disk, you can save the data as float32 or
even uint8, and xGPR will convert it to float32 when loading (this can
often save considerable disk space and also make model fitting faster).

If your data is 3d (i.e. sequences or graphs), you have to also supply
``sequence_lengths``. If ``x`` is an array with ``shape[0]=N``,
``sequence_lengths`` should also be an array with shape ``(N,)`` that
indicates the length of each sequence (or number of nodes in each graph)
excluding zero-padding (unless you want to include zero-padding in the 
kernel calculation for whatever reason, in which case, just set all
sequence_lengths to be ``x.shape[1]``). xGPR will then "mask" the zero-
padding when doing the kernel calculations.

If your data is 3d but ``x`` is a list of ``.npy`` files on disk,
``sequence_lengths`` should be a list of ``.npy`` files on disk
of the same length. For each ``x`` file with shape ``(N,D,M)``,
the corresponding sequence_length file should be an npy array
of shape ``(N,)`` indicating the length of each corresponding
sequence in that ``x`` file again (typically) excluding zero-
padding.

If using 3d data, you'll also have to pass the sequence lengths of your
test datapoints to ``model.predict`` when making predictions -- see below
for examples.

See the Examples section for some illustrations of working with sequence
and graph data. For fixed-vector 2d data, of course, sequence lengths are not required.

When you create the dataset, xGPR will do some checks to make sure that
what you fed it makes sense. If the dataset is very large, these may take a
second.

Finally, let's say your data is a fasta file, a csv file, an HDF5 file or some
other format. You can create your own Dataset that loads the data in minibatches
during training and does any preprocessing you want to do on each minibatch.
To see how to do this, check out :doc:`notebooks/custom_dataset_example`.

Fit your model and make predictions
-------------------------------------

Let's create two models using RBF kernels and fit them:::

  # xGPRegression is for regression, xGPDiscriminant is for classification
  from xGPR import xGPRegression, xGPDiscriminant

  #Notice that we set the device to be "cuda". If there are multiple
  #cuda devices, xGPR will use whichever one is currently active. You can control
  #this by setting the environment variable CUDA_VISIBLE_DEVICES,
  #e.g. "export CUDA_VISIBLE_DEVICES". Also note that for regression,
  #we specify a quantity, "variance_rffs", which must be < num_rffs.
  #This controls the accuracy of the variance approximation when
  #calculating uncertainty on predictions. We've found 512 - 1024 is
  #usually fine.

  reg_model = xGPRegression(num_rffs = 2048,
                        variance_rffs = 512, 
                        random_seed = 123.
                        kernel_choice = "RBF",
                        device = "cuda", kernel_settings = {})

  #Nearly all xGPR kernels have either one or two hyperparameters. We'll
  #talk more about how to optimize hyperparameters shortly.
  my_hyperparams = np.array([-3.1, -1.25])
  reg_model.set_hyperparams(my_hyperparams, reg_train_set)

  #We can change the number of RFFs at any time up until we fit the model.
  reg_model.num_rffs = 2000


  #We can fit using "exact" mode or "cg" mode. "cg" mode is much faster
  #if num_rffs is very large. "exact" is much faster if "num_rffs" is small,
  #e.g. < 3000. We'll use both here just for illustrative purposes.
  #tol controls how tight the fit is; 1e-6 (the default) is usually fine;
  #1e-7 (tighter fit) will improve performance slightly, especially if
  #data is close to noise-free, but is not usually necessary; 1e-8 is
  #usually overkill.

  reg_model.fit(reg_train_set, mode="cg", tol=1e-6)

  #To make predictions, just feed in a numpy array.
  #If we want to switch over to CPU for inference now that the model
  #is fitted, we can do that too, e.g.:

  model.device = "cpu"

  #For regression, predictions are just
  #a numpy array of predicted y-values. chunk_size controls
  #how many datapoints xGPR processes at a time in the input array; larger
  #slightly increases speed but increases memory usage. If you're worried
  #about memory usage, set a small chunk_size, otherwise default of 2000 is fine.

  reg_preds = reg_model.predict(xtest_reg, sequence_lengths = None, chunk_size = 2000)

  #For regression, we can also get the variance on the predictions, which
  #is useful as a measure of uncertainty.

  reg_preds, reg_var = reg_model.predict(xtest_reg, sequence_lengths = None, get_var = True)


  # Classification is similar but with a few slight differences. For one thing,
  # we don't need to specify the number of variance rffs since variance is not
  # calculated separately.
  class_train_set = build_classification_dataset(xclass, yclass,
                        sequence_lengths = None, chunk_size = 2000)
  
  classifier = xGPDiscriminant(num_rffs = 2048,
                        random_seed = 123.
                        kernel_choice = "RBF",
                        device = "cuda", kernel_settings = {})

  my_hyperparams = np.array([-3.1, -1.25])
  classifier.set_hyperparams(my_hyperparams, class_train_set)

  #We can change the number of RFFs at any time up until we fit the model.
  classifier.num_rffs = 2000

  #We can fit using "exact" mode or "cg" mode. "cg" mode is much faster
  #if num_rffs is very large. "exact" is much faster if "num_rffs" is small,
  #e.g. < 3000. We'll use both here just for illustrative purposes.
  #tol controls how tight the fit is; 1e-6 (the default) is usually fine;
  #1e-7 (tighter fit) will improve performance slightly, especially if
  #data is close to noise-free, but is not usually necessary; 1e-8 is
  #usually overkill.

  classifier.fit(class_train_set, mode="cg", tol=1e-6)

  #To make predictions, just feed in a numpy array.
  #If we want to switch over to CPU for inference now that the model
  #is fitted, we can do that too, e.g.:

  model.device = "cpu"

  #For classification, predictions are a shape (N,M) array for N
  # datapoints and M classes. These are the class probabilities.
  # The actuall class assignments are np.argmax(preds, axis=1). chunk_size controls
  #how many datapoints xGPR processes at a time in the input array; larger
  #slightly increases speed but increases memory usage. If you're worried
  #about memory usage, set a small chunk_size, otherwise default of 2000 is fine.

  class_preds = classifier.predict(xclass)


And that's basically it!

We used "RBF" kernels here, but there are plenty of other options; see the Kernels
section on the main page. Some are designed for sequences, others for tabular data.
Some of them have options that you can supply under the
``kernel_settings`` dict (e.g. the convolution width if using a sequence kernel).

Notice also that we had to specify ``num_rffs`` when setting up the model (but can
change it subsequently as well, at least right up until we fit). ``num_rffs`` controls
how accurately the kernel is approximated. The error in the kernel approximation falls
off exponentially with larger ``num_rffs`` values, so increasing ``num_rffs`` generally
makes the model more accurate, but with diminishing returns. It also increases
computational expense (fitting using ``num_rffs=1024`` will be much faster than fitting
with ``num_rffs=32,768``).

Finally, notice that when calling ``model.predict`` just as when building a dataset,
``sequence_lengths`` is None if you're using a fixed-length kernel; if you're inputting
a 3d array and using a convolution kernel, you have to supply a numpy array of sequence
lengths so that xGPR can mask zero-padding (if you're using zero-padding).

There's one big missing piece we haven't discussed so far of course, which is...


How to find good hyperparameter values?
----------------------------------------

Most kernels in xGPR have either two hyperparameters ("lambda", "sigma") or one ("lambda").
There are some exceptions to this.
(See the Kernels section on the main page for details about each kernel.) The Lambda hyperparameter is like the ridge
penalty in ridge regression: it provides regularization and is roughly related to how "noisy"
the data is expected to be. Larger (more positive) values = stronger regularization.
xGPR squares the Lambda hyperparameter when fitting.

The "sigma" hyperparameter, for kernels that have it, is an inverse lengthscale that (to oversimplify
a little) determines how close datapoints must be in order to be considered similar. Smaller (more
negative values) cause points that are farther apart to be considered "similar".

xGPR *always* uses *the natural log of the hyperparameters* as input, and internally converts those
to actual values. So if you have:::

  my_model.set_hyperparams(np.array([-1., 0.]), my_train_dataset)

the Lambda value that xGPR will use when fitting is ``(1 / e)^2``, and the sigma value will be ``1``.
(This is really just for internal convenience). For numerical stability, "lambda" probably
should not go below -7 for regression or -10 for classification.

One simple way to find good hyperparameter values is to fit the model using different 
hyperparameter settings and look at performance on a validation set. So in this scheme, for each set of
hyperparameters you're considering, you would:::

   def my_hparam_evalation_function(my_new_hyperparams, my_validation_set_array,
                    my_validation_set_sequence_lengths = None):
       my_model.set_hyperparams(my_new_hyperparams, my_train_dataset)
       my_model.fit(my_train_dataset, mode="cg")
       preds = my_model.predict(my_validation_set_array, my_validation_set_sequence_lengths)
       ##Add some score evaluation, R^2, MAE, accuracy, etc. here...
       return score

where ``my_new_hyperparams`` is a numpy array. You can easily plug this into Optuna or
some other hyperparameter tuning package, do Bayesian optimization or grid search or
any other procedure you like. The ``xGPDiscriminant`` classifer comes with some
built-in functions that will do this kind of optimization for you just for
convenience (for details on these see the "Classifier Tuning" section below).

For regression, there's also a much nicer way to evaluate hyperparameters,
which uses negative log marginal likelihood (what xGPR calls NMLL). In Bayesian
inference, the marginal likelihood is the probability of the training data averaged
over *all possible parameter values*.  A lower NMLL means a better model (and a higher
NMLL means a worse model). The NMLL on the training set in general correlates *very*
well with performance on held-out data. So, for regression we don't really even need a
validation set to tune hyperparameters; we can just calculate the NMLL for different
hyperparameter settings and see which one gives us the best result.

Here's an example:::

   def my_regression_hparam_evalation_function(my_new_hyperparams):
       #If num_rffs is small, use this function
       nmll = my_model.exact_nmll(my_new_hyperparams, my_training_dataset)
       #If num_rffs is large, use this function
       nmll = my_model.approximate_nmll(my_new_hyperparams, my_training_dataset)
       return nmll

Now, we just minimize the value returned by this function -- again, we can use Optuna,
grid search, Bayesian optimization, what have you.

Notice one thing in the function above. ``exact_nmll`` is much faster if the
number of RFFs is small. On GPU, it can be reasonably fast up to about 8,000 RFFs or so.
It has cubic scaling, however, so for large numbers of RFFs it can get very
slow very quickly. ``approximate_nmll`` has much better scaling and so is your
friend if you want to tune using a large ``num_rffs``. It does involve an additional
approximation (above and beyond the random feature approximation used throughout xGPR).
This additional approximation is very good in general but its quality and speed can
be fine-tuned if desired by using some additional knobs; see the Advanced Tutorials for
more.

Finally, for regression, xGPR offers two build-in functions that can do hyperparameter
tuning for you by minimizing the NMLL. These are::

  my_model.tune_hyperparams_crude(my_training_dataset, bounds = None, max_bayes_iter = 30,
                                         subsample = 1)
  my_model.tune_hyperparams(my_training_dataset, bounds = None, max_iter = 50, tuning_method = "Powell",
                            starting_hyperparams = None, n_restarts = 1,
                            nmll_method = "exact")

  #Get the final hyperparameters optimized by tuning as a numpy array
  my_final_hyperparams = my_model.get_hyperparams()


(There are some other knobs we can turn on ``tune_hyperparams``;
see Advanced Tutorials for more.)

The first function, ``tune_hyperparams_crude``, is a remarkably efficient way to rapidly
search the whole hyperparameter space for 1 and 2 hyperparameter kernels. It lets you
specify a "subsample" argument; if this is less than 1(e.g. 0.5), it will use the specified fraction
of the training data when tuning. Both functions let you specify search boundaries or just pass
None (the default) for ``bounds``; if None, xGPR uses some default search boundaries.

``tune_hyperparams_crude`` uses an SVD, which means it doesn't scale well
-- it can get pretty slow for  ``num_rffs = 3,000`` or above. Fortunately, we've generally
found that the hyperparameters which give good NMLL with a small number of RFFs
(a sketchy kernel approximation) are usually not too terribly far away from those which give
good NMLL with a larger number of RFFs (a better kernel approximation).
(This is a rule of thumb, and like all rules of thumb should be used with caution.)
So, one way to use these two functions together is to use ``tune_hyperparams_crude`` for a
fast initial search, then (if desired) further fine-tune the hyperparameters using
``tune_hyperparams``. For example:::

  my_model.device = "cuda"
  my_model.num_rffs = 1024
  my_model.tune_hyperparams_crude(my_train_dataset)
  rough_hparams = my_model.get_hyperparams()

  #Use rough_hparams as a starting point for fine-tuning.
  #We could also set a bounding box around rough_hparams,
  #pass that as bounds, set n_restarts to say 3 and
  #thoroughly explore the space around rough_hparams. Or
  #even just do a gridsearch across the space around rough
  #hparams. See the examples section for some illustrations.
  my_model.num_rffs = 4096
  my_model.tune_hyperparams(my_train_dataset, max_iter = 50,
                        tuning_method = "L-BFGS-B",
                        starting_hyperparams = rough_hparams,
                        n_restarts = 1,
                        nmll_method = "exact")

``tune_hyperparams`` can use one of three different algorithms or ``tuning_method``:
``Powell``, ``L-BFGS-B`` and ``Nelder-Mead``. ``L-BFGS-B`` uses the fewest iterations,
but has to calculate the gradient on each, so it's slow if ``num_rffs`` is large.
If ``num_rffs`` is large, instead, consider ``Powell`` and ``Nelder-Mead``. ``Nelder-Mead``
is usually better than ``Powell`` at finding the absolute best possible value, but
it can take a *lot* of iterations to converge, so it's only good if you're not in a
hurry. We generally prefer Powell to Nelder-Mead.

Remember that when calculating NMLL, we could use ``exact_nmll`` or
``approximate_nmll``. The function ``tune_hyperparams`` offers you the same choice:
you can set ``nmll_method`` to either ``nmll_method=exact`` or ``nmll_method=approximate``,
and the considerations are the same. Again, ``exact`` is faster if ``num_rffs`` is small,
while ``approximate_nmll`` has better scaling.

Finally, one important thing to keep in mind. Most of these methods run at reasonable
speed on GPU. On CPU, however, tuning with a large ``num_rffs`` can be slow because of
all of the matrix multiplications that are required.
We recommend doing hyperparameter tuning and fitting on GPU when possible. Making predictions,
by contrast, is reasonably fast on CPU. So fitting on GPU and
doing inference on CPU is a perfectly viable way to go if desired.


Classifier tuning
----------------------------------------

As discussed above, the xGPDiscriminant classifier does not currently calculate marginal
likelihood, so currently the supported way to tune hyperparameters is to evaluate
performance on a validation set or using cross-validation (for larger datasets,
validation set performance is much faster and is preferred). For convenience,
xGPR incorporates several functions that can automate validation set tuning using
either Optuna or a Scipy-based optimizer; these are described below:

.. autofunction:: xGPR.tune_classifier_optuna

   
.. autofunction:: xGPR.cv_tune_classifier_optuna

   
.. autofunction:: xGPR.tune_classifier_powell


That's really all you absolutely need to know! For lots of useful TMI, see Advanced Tutorials.