Kernels for graphs
------------------------------------------------------

These are convolution kernels for graphs. To use one of these, when initializing the
model, set ``kernel_choice = 'kernel name'``, e.g.
``kernel_choice = "GraphRBF"``.


.. list-table:: Graph Kernels
   :align: center
   :header-rows: 1

   * - Kernel Name
     - Description
     - kernel_settings
   * - GraphRBF
     - | Compares graphs by averaging over
       | an RBF kernel applied pairwise to
       | all node representations
       | in the two graphs.
     - | "averaging": str
       | One of 'none', 'sqrt', 'full'. See
       | below.
       | "intercept":bool
   * - GraphMatern
     - | Compares graphs by averaging over
       | a Matern kernel applied pairwise to
       | all node representations
       | in the two graphs.
     - | "averaging": str
       | One of 'none', 'sqrt', 'full'. See
       | below.
       | "matern_nu":float
       | "intercept":bool
   * - GraphCauchy
     - | Compares graphs by averaging over
       | a Cauchy kernel applied pairwise to
       | all node representations
       | in the two graphs.
     - | "averaging": str
       | One of 'none', 'sqrt', 'full'. See
       | below.
       | "intercept":bool

Consider a graph where each node has an associated 
set of features. GraphRBF compares two graphs A and B by
taking each node in graph A and evaluating an RBF kernel across
that node vs each node in graph B. GraphMatern and GraphCauchy
do the same thing, except using Cauchy or Matern kernels.
A naive implementation would have quadratic scaling
in the size of the graph; in xGPR, we
implement these kernels with a "trick" that results in linear
scaling with graph size and number of datapoints.

These convolution kernels can be slower than
fixed-vector input kernels, *especially* for large graphs,
because to limit memory use, the convolutions are performed
in batches (rather than all at once).

When using any of these kernels, you are required to supply ``sequence_lengths``
when building a dataset or doing inference. This is the number of nodes
in each graph. xGPR uses this information to mask out any zero-padding
you may have applied to make all the graphs the same size.

Note that all three offer averaging as an option. What this means
is as follows. The GraphRBF kernel computes the similarity of two
graphs with :math:`L_1` nodes in graph 1, :math:`L_2` nodes in graph 2 as:

.. math::

  k(x_1, x_2) = \sum_i^{L_1} \sum_j^{L_2} e^{\sigma ||x_1[i] - x_2[j]||^2}

Cauchy and Matern are the same except with Cauchy and Matern kernels substituted.

Notice this is actually performing :math:`L_1 * L_2` node comparisons
between the two, so the result will be larger when the graphs are larger. We can compensate
for this by dividing by :math:`L_1 * L_2`, which is ``full`` averaging, or dividing by :math:`\sqrt{L_1 * L_2}`, which is
``sqrt`` averaging. Averaging is helpful if the property you are trying to predict does not
depend on graph size. It is counterproductive if graph size actually *is* important.

Usually the validation set performance difference
between ``GraphMatern``, ``GraphCauchy`` and ``GraphRBF`` is 
small; if this is your primary concern, we recommend defaulting
to ``GraphRBF`` and experimenting with the others if desired to
see if some small further performance achievement can be obtained.