Kernels for sequence and time series data (non-static)
------------------------------------------------------

These kernels handle sequence and time series data,
similar to a 1d CNN with global average pooling.
To use one of them, set ``kernel_choice`` to the kernel
name when initializing the model, e.g.
``kernel_choice = "Conv1dRBF"``.

*IMPORTANT NOTE*: In addition to these choices, you can use
FastConv1d for sequences. It is described under Feature
Extractors because it is a feature extractor rather than
a typical kernel. FastConv1d is equivalent to the ``Conv1dTwoLayer``
kernel described below, but using the feature extractor
instead of the kernel shown here is sometimes useful.

.. list-table:: Sequence Kernels
   :align: center
   :header-rows: 1

   * - Kernel Name
     - Description
     - kernel_settings
   * - Conv1dRBF
     - | Compares sequences by averaging over
       | an RBF kernel applied pairwise to
       | all subsequences of length "conv_width"
       | in the two sequences.
     - | "conv_width":int
       | "averaging": str One of 'none', 'sqrt',
       | 'full'. See below.
       | "intercept":bool
   * - Conv1dMatern
     - | Compares sequences by averaging over
       | a Matern kernel applied pairwise to
       | all subsequences of length "conv_width"
       | in the two sequences.
     - | "conv_width":int
       | "averaging": str One of 'none', 'sqrt',
       | 'full'. See below.
       | "intercept":bool
       | "matern_nu":float
   * - Conv1dCauchy
     - | Compares sequences by averaging over
       | a Cauchy kernel applied pairwise to
       | all subsequences of length "conv_width"
       | in the two sequences.
     - | "conv_width":int
       | "averaging": str One of 'none', 'sqrt',
       | 'full'. See below.
       | "intercept":bool
   * - Conv1dTwoLayer
     - | Compares sequences by performing random-weight
       | convolutions over the input, applying ReLU
       | activation with global maxpooling, then
       | supplying the resulting features as input
       | to an RBF kernel layer.
     - | "init_rffs": int The number of random
       | filter convolutions to perform.
       | "intercept": bool
       | "conv_width": int The width of the random filters.
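
As a rough illustration of the table above, the settings for a kernel can
be collected in a plain dictionary. The keys are the ones listed in the
table; the values (a width of 9, ``'sqrt'`` averaging and so on) are
arbitrary placeholders. ``model_kwargs`` is a hypothetical name, and
passing the dictionary under the name ``kernel_settings`` is an assumption
based on the table heading, so check the model documentation for the exact
argument names.

.. code-block:: python

    # Illustrative values only; tune these for your own data.
    conv1d_rbf_settings = {
        "conv_width": 9,
        "averaging": "sqrt",
        "intercept": True,
    }
    conv1d_matern_settings = {
        "conv_width": 9,
        "averaging": "full",
        "intercept": True,
        "matern_nu": 2.5,
    }
    conv1d_two_layer_settings = {
        "init_rffs": 1024,
        "conv_width": 9,
        "intercept": True,
    }

    # Hypothetical bundle of keyword arguments for the model constructor
    # (argument names assumed, not confirmed by this page).
    model_kwargs = {
        "kernel_choice": "Conv1dRBF",
        "kernel_settings": conv1d_rbf_settings,
    }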


The ``Conv1dTwoLayer`` kernel is analogous to a three-layer convolutional
neural network; it applies a set of random filters to the input, applies
ReLU activation and global maxpooling, then
uses the resulting features as input to an RBF kernel. You can control
the number of random filters using the "init_rffs" option. A larger
value for "init_rffs" makes the model slower but generally improves
accuracy (albeit with diminishing returns).
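
The pipeline described above can be sketched in NumPy. This is a conceptual
illustration only, not xGPR's implementation: the function names, the
Gaussian filter weights, the ``sigma`` value and the exact (non-approximated)
RBF at the end are all assumptions made for the example. The number of
filters plays the role of "init_rffs".

.. code-block:: python

    import numpy as np

    def two_layer_features(seq, filters):
        """Random-filter convolution, ReLU, then global max pooling.

        seq has shape (L, d); filters has shape (n_filters, k, d).
        Returns one feature per filter, shape (n_filters,).
        """
        k = filters.shape[1]
        # All length-k windows of the sequence: shape (L - k + 1, k, d)
        windows = np.stack([seq[i:i + k] for i in range(seq.shape[0] - k + 1)])
        # Apply every filter to every window: shape (L - k + 1, n_filters)
        activations = np.einsum("wkd,fkd->wf", windows, filters)
        # ReLU, then keep each filter's largest response over all positions
        return np.maximum(activations, 0.0).max(axis=0)

    def rbf(u, v, sigma=1.0):
        """Plain RBF kernel between two feature vectors."""
        return float(np.exp(-sigma * np.sum((u - v) ** 2)))

    rng = np.random.default_rng(0)
    filters = rng.standard_normal((32, 3, 4))   # 32 random filters of width 3
    seq_a = rng.standard_normal((10, 4))        # length 10, 4 features per element
    seq_b = rng.standard_normal((7, 4))         # length 7, same feature count
    feat_a = two_layer_features(seq_a, filters)
    feat_b = two_layer_features(seq_b, filters)
    similarity = rbf(feat_a, feat_b, sigma=0.01)

Because of the global max pooling, sequences of different lengths map to
feature vectors of the same size, which is what lets the final RBF layer
compare them directly.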

Let k = conv_width. To measure the similarity of two sequences (or time
series) A and B, the ``Conv1dRBF``, ``Conv1dMatern`` and ``Conv1dCauchy``
kernels take every length k subsequence of A and evaluate an RBF, Matern
or Cauchy kernel (respectively) on it against every length k subsequence
of B. The net similarity is the sum across all of these pairwise
evaluations. Implemented literally, of course, this kernel would be
extremely inefficient. xGPR, however, implements it in such a way
that we achieve *linear scaling* in both number of datapoints and
sequence length.
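
For reference, the literal (inefficient) definition can be written out in a
few lines of NumPy. This is a naive sketch for checking intuition on tiny
inputs, not the linear-scaling approach xGPR uses; the function names and
the ``sigma`` default are assumptions made for the example.

.. code-block:: python

    import numpy as np

    def windows(seq, k):
        """Flattened length-k windows of seq: shape (L - k + 1, k * d)."""
        return np.stack([seq[i:i + k].ravel()
                         for i in range(seq.shape[0] - k + 1)])

    def conv1d_rbf_naive(a, b, k, sigma=1.0):
        """Sum an RBF kernel over every pair of length-k windows of a and b."""
        wa, wb = windows(a, k), windows(b, k)
        # Squared distance between every window of a and every window of b
        sq_dists = ((wa[:, None, :] - wb[None, :, :]) ** 2).sum(axis=-1)
        return float(np.exp(-sigma * sq_dists).sum())

    rng = np.random.default_rng(0)
    a = rng.standard_normal((6, 2))   # 4 windows of width 3
    b = rng.standard_normal((4, 2))   # 2 windows of width 3
    similarity = conv1d_rbf_naive(a, b, k=3)

The cost of this version grows with the product of the two window counts
for every pair of sequences, which is why it is impractical beyond small
examples.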

Be aware that these convolution kernels are slower than
fixed-vector input kernels, *especially* for long sequences,
because the convolutions are performed in batches (rather
than all at once) to avoid using excessive memory. As a
compensating factor, they frequently need fewer random
features to achieve good performance.

When using any of these kernels, you are required to supply ``sequence_lengths``
when building a dataset or doing inference. This is the number of elements
in each sequence. xGPR uses this information to mask out any zero-padding
you may have applied to make all the sequences the same length.
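
A minimal sketch of preparing zero-padded input together with its lengths
is shown below. ``pad_sequences`` is a hypothetical helper, and the
``int32`` dtype is an assumption; how the two arrays are then handed to
xGPR when building a dataset is not shown here, so consult the dataset
documentation for that step.

.. code-block:: python

    import numpy as np

    def pad_sequences(sequences):
        """Zero-pad a list of (L_i, d) arrays to a single (N, L_max, d) array.

        Also returns the true length of each sequence so that the padding
        can be masked out later.
        """
        lengths = np.array([len(s) for s in sequences], dtype=np.int32)
        n_features = sequences[0].shape[1]
        padded = np.zeros((len(sequences), int(lengths.max()), n_features))
        for row, seq in enumerate(sequences):
            padded[row, :len(seq)] = seq
        return padded, lengths

    raw = [np.ones((5, 3)), np.ones((2, 3)), np.ones((4, 3))]
    x_padded, sequence_lengths = pad_sequences(raw)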

All of these kernels except ``Conv1dTwoLayer`` offer averaging as an
option. To see what this means, note that for convolution width *k*,
with :math:`L_1` elements in sequence 1 and :math:`L_2` elements in
sequence 2, the ``Conv1dRBF`` kernel computes the similarity of the two
sequences as:

.. math::

  k(x_1, x_2) = \sum_i^{L_1 - k + 1} \sum_j^{L_2 - k + 1} e^{-\sigma ||x_1[i:i+k] - x_2[j:j+k]||^2}

``Conv1dCauchy`` and ``Conv1dMatern`` are the same except that a Cauchy or Matern kernel is substituted for the RBF.

Notice that if :math:`K_1 = L_1 - k + 1` and :math:`K_2 = L_2 - k + 1`, this is actually performing :math:`K_1 * K_2` k-mer comparisons
between the two sequences, so the result will be larger when the sequences are longer. We can compensate
for this by dividing by :math:`K_1 * K_2`, which is ``full`` averaging, or dividing by :math:`\sqrt{K_1 * K_2}`, which is
``sqrt`` averaging. Averaging is helpful if the property you are trying to predict does not
depend on sequence length. It is counterproductive if sequence length actually *is* important.
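
The divisors can be worked out for a concrete case. The helper below is a
hypothetical illustration, not part of xGPR, and it assumes that ``none``
means the sum is left undivided. For :math:`L_1 = 10`, :math:`L_2 = 7` and
*k* = 3, we get :math:`K_1 = 8` and :math:`K_2 = 5`, so 40 k-mer
comparisons are performed.

.. code-block:: python

    import math

    def averaging_divisor(len_1, len_2, conv_width, averaging="none"):
        """Return the factor the summed kernel is divided by."""
        k_1 = len_1 - conv_width + 1   # number of windows in sequence 1
        k_2 = len_2 - conv_width + 1   # number of windows in sequence 2
        if averaging == "none":
            return 1.0                 # assumed: no length compensation
        if averaging == "sqrt":
            return math.sqrt(k_1 * k_2)
        if averaging == "full":
            return float(k_1 * k_2)
        raise ValueError("averaging must be 'none', 'sqrt' or 'full'")

    full_divisor = averaging_divisor(10, 7, 3, "full")   # K_1 * K_2
    sqrt_divisor = averaging_divisor(10, 7, 3, "sqrt")   # sqrt(K_1 * K_2)
    none_divisor = averaging_divisor(10, 7, 3, "none")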

Usually the difference in validation set performance
between ``Conv1dMatern``, ``Conv1dCauchy`` and ``Conv1dRBF`` is
small. If validation performance is your primary concern, we
recommend defaulting to ``Conv1dRBF`` and experimenting with the
others, if desired, to see whether some small further improvement
can be obtained. ``Conv1dTwoLayer``, by contrast, can sometimes
perform significantly better (or worse) than these other options.