Kernels for sequence and time series data (non-static)

These kernels handle sequence and time series data, similar to a 1d CNN with global average pooling. To use one of them, set kernel_choice to the kernel name when initializing the model, e.g. kernel_choice = "Conv1dRBF".

IMPORTANT NOTE: In addition to these choices, you can use the FastConv1d kernel for sequences. It is described under Feature Extractors, since it is actually a feature extractor rather than a typical kernel. FastConv1d is equivalent to the Conv1dTwoLayer kernel described below, but using the feature extractor in place of the kernel is sometimes useful.

Sequence Kernels

Kernel Name: Conv1dRBF
Description: Compares sequences by averaging over an RBF kernel applied pairwise to all subsequences of length “conv_width” in the two sequences.
kernel_settings:
“conv_width”: int
“averaging”: str One of ‘none’, ‘sqrt’, ‘full’. See below.
“intercept”: bool

Kernel Name: Conv1dMatern
Description: Compares sequences by averaging over a Matern kernel applied pairwise to all subsequences of length “conv_width” in the two sequences.
kernel_settings:
“conv_width”: int
“averaging”: str One of ‘none’, ‘sqrt’, ‘full’. See below.
“intercept”: bool
“matern_nu”: float

Kernel Name: Conv1dCauchy
Description: Compares sequences by averaging over a Cauchy kernel applied pairwise to all subsequences of length “conv_width” in the two sequences.
kernel_settings:
“conv_width”: int
“averaging”: str One of ‘none’, ‘sqrt’, ‘full’. See below.
“intercept”: bool

Kernel Name: Conv1dTwoLayer
Description: Compares sequences by performing random-weight convolutions over the input, applying ReLU activation with global maxpooling, then supplying the resulting features as input to an RBF kernel layer.
kernel_settings:
“init_rffs”: int The number of random filter convolutions to perform.
“intercept”: bool
“conv_width”: int The width of the random filters.
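As a rough illustration of how the settings listed above fit together, the sketch below pairs a kernel_choice with a matching kernel_settings dictionary. The ALLOWED_SETTINGS lookup and the check_settings helper are hypothetical (they are not part of xGPR), and the specific values chosen are arbitrary.

```python
# Hypothetical lookup, not part of xGPR: the kernel_settings keys listed
# above for each sequence kernel.
ALLOWED_SETTINGS = {
    "Conv1dRBF": {"conv_width", "averaging", "intercept"},
    "Conv1dMatern": {"conv_width", "averaging", "intercept", "matern_nu"},
    "Conv1dCauchy": {"conv_width", "averaging", "intercept"},
    "Conv1dTwoLayer": {"init_rffs", "intercept", "conv_width"},
}


def check_settings(kernel_choice, kernel_settings):
    """Return the sorted keys in kernel_settings not listed for this kernel."""
    allowed = ALLOWED_SETTINGS[kernel_choice]
    unknown = sorted(set(kernel_settings) - allowed)
    # Only the three averaging modes described in this section are accepted.
    if "averaging" in allowed and "averaging" in kernel_settings:
        if kernel_settings["averaging"] not in ("none", "sqrt", "full"):
            raise ValueError("averaging must be 'none', 'sqrt' or 'full'")
    return unknown


# Example values only; the right conv_width depends on your data.
kernel_choice = "Conv1dRBF"
kernel_settings = {"conv_width": 9, "averaging": "none", "intercept": True}
unknown_keys = check_settings(kernel_choice, kernel_settings)
```

A check like this catches a setting supplied to a kernel that does not list it, for example "averaging" with Conv1dTwoLayer.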

The Conv1dTwoLayer kernel is analogous to a three-layer convolutional neural network; it applies a set of random filters to the input, applies ReLU activation and global maxpooling, then uses the resulting features as input to an RBF kernel. You can control the number of random filters using the “init_rffs” option. A larger value for “init_rffs” will make the model slower but improve accuracy (albeit with diminishing returns).
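The steps just described can be sketched in NumPy as follows. This is a conceptual illustration only, not xGPR's implementation: it evaluates an exact RBF kernel on the pooled features, and the function names, the filter distribution, the array shapes and the sigma value are all assumptions made for the example. The 64 filters stand in for the number set by “init_rffs”.

```python
import numpy as np


def random_conv_features(x, filters):
    """x: (length, channels); filters: (n_filters, conv_width, channels)."""
    width = filters.shape[1]
    n_windows = x.shape[0] - width + 1
    # All length-`width` windows of x, shape (n_windows, width, channels).
    windows = np.stack([x[i:i + width] for i in range(n_windows)])
    # Convolution: dot product of every window with every random filter.
    activations = np.einsum("nwc,fwc->nf", windows, filters)
    # ReLU, then global max pooling over window positions.
    return np.maximum(activations, 0.0).max(axis=0)


def rbf(u, v, sigma=1.0):
    """Exact RBF kernel between two feature vectors."""
    return np.exp(-sigma * np.sum((u - v) ** 2))


rng = np.random.default_rng(0)
filters = rng.standard_normal((64, 3, 4))  # 64 random filters, conv_width 3
seq_a = rng.standard_normal((10, 4))       # length 10, 4 features per element
seq_b = rng.standard_normal((15, 4))       # length 15, same feature count

feat_a = random_conv_features(seq_a, filters)
feat_b = random_conv_features(seq_b, filters)
similarity = rbf(feat_a, feat_b, sigma=0.01)
```

Because of the global max pooling, both sequences map to a vector with one entry per filter regardless of their lengths, which is what allows a fixed-vector kernel to be applied afterwards.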

Suppose we have two sequences (or time series) A and B and k = conv_width. To measure their similarity, the Conv1dMatern, Conv1dRBF and Conv1dCauchy kernels take every length k subsequence of A and evaluate a Matern, RBF or Cauchy kernel on it against every length k subsequence of B. The net similarity is the sum across all of these pairwise evaluations. Implemented exactly as described, this kernel would be extremely inefficient. xGPR, however, implements it in a way that achieves linear scaling in both the number of datapoints and sequence length.
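For reference, the inefficient definition can be written directly as a double loop. This is a minimal sketch of the definition with an RBF base kernel, not of how xGPR computes it; the function name, the sigma parameter and the (length, features) array layout are assumptions made for the example.

```python
import numpy as np


def naive_conv1d_rbf(a, b, conv_width, sigma=1.0):
    """Sum an RBF kernel over all pairs of length-conv_width subsequences.

    a and b are arrays of shape (length, features). The cost grows with the
    product of the two sequence lengths, which is why this form is only
    useful as a reference.
    """
    total = 0.0
    for i in range(a.shape[0] - conv_width + 1):
        for j in range(b.shape[0] - conv_width + 1):
            diff = a[i:i + conv_width] - b[j:j + conv_width]
            total += np.exp(-sigma * np.sum(diff ** 2))
    return total


rng = np.random.default_rng(0)
seq_a = rng.standard_normal((8, 4))
seq_b = rng.standard_normal((12, 4))

k_ab = naive_conv1d_rbf(seq_a, seq_b, conv_width=3)
k_ba = naive_conv1d_rbf(seq_b, seq_a, conv_width=3)  # symmetric in its inputs
```

Swapping the exponential for a Matern or Cauchy kernel on the same pair of subsequences would give the corresponding naive form of the other two kernels.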

Be aware that these convolution kernels are slower than fixed-vector input kernels, especially for long sequences, because to avoid using excessive memory, the convolutions are performed in batches (rather than all at once). As a compensating factor, they frequently need fewer random features to achieve good performance.

When using any of these kernels, you are required to supply sequence_lengths when building a dataset or doing inference. This is the number of elements in each sequence. xGPR uses this information to mask out any zero-padding you may have applied to make all the sequences the same length.
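One way to prepare such inputs is sketched below: variable-length sequences are zero-padded into a single 3d array while their true lengths are recorded. The pad_sequences helper is hypothetical, and the array layout (sequences, elements, features) and the integer dtype are assumptions; check the dataset-building documentation for what xGPR actually expects.

```python
import numpy as np


def pad_sequences(sequences):
    """Zero-pad a list of (length_i, features) arrays to a common length.

    Returns the padded array of shape (n_sequences, max_length, features)
    and the original length of each sequence.
    """
    sequence_lengths = np.array([s.shape[0] for s in sequences], dtype=np.int32)
    max_len = int(sequence_lengths.max())
    n_features = sequences[0].shape[1]
    padded = np.zeros((len(sequences), max_len, n_features))
    for row, seq in enumerate(sequences):
        # Real data goes at the front; the remainder stays zero.
        padded[row, :seq.shape[0]] = seq
    return padded, sequence_lengths


rng = np.random.default_rng(0)
raw = [rng.standard_normal((n, 4)) for n in (5, 9, 7)]
x, sequence_lengths = pad_sequences(raw)
```

Here sequence_lengths records which rows of each padded sequence hold real data, which is the information needed to mask out the padding.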

Note that all except the Conv1dTwoLayer kernel offer averaging as an option. What this means is as follows. The Conv1dRBF kernel computes the similarity of two sequences for convolution width k, with \(L_1\) elements in sequence 1 and \(L_2\) elements in sequence 2, as:

\[k(x_1, x_2) = \sum_{i=1}^{L_1 - k + 1} \sum_{j=1}^{L_2 - k + 1} e^{-\sigma ||x_1[i:i+k] - x_2[j:j+k]||^2}\]

Cauchy and Matern are the same except with Cauchy and Matern kernels substituted.

Notice that if \(K_1 = L_1 - k + 1\) and \(K_2 = L_2 - k + 1\), this performs \(K_1 * K_2\) k-mer comparisons between the two sequences, so the result will be larger when the sequences are longer. We can compensate for this by dividing by \(K_1 * K_2\), which is full averaging, or by \(\sqrt{K_1 * K_2}\), which is sqrt averaging. Averaging is helpful if the property you are trying to predict does not depend on sequence length; it is counterproductive if sequence length actually is important.
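The three averaging modes amount to a choice of divisor, as in the small sketch below. The apply_averaging function is a hypothetical illustration of the arithmetic, not an xGPR function, and the numbers are made up for the example.

```python
import math


def apply_averaging(kernel_sum, len_1, len_2, conv_width, averaging="none"):
    """Rescale a summed kernel value by the number of k-mer comparisons."""
    k_1 = len_1 - conv_width + 1  # number of k-mers in sequence 1
    k_2 = len_2 - conv_width + 1  # number of k-mers in sequence 2
    if averaging == "none":
        return kernel_sum
    if averaging == "sqrt":
        return kernel_sum / math.sqrt(k_1 * k_2)
    if averaging == "full":
        return kernel_sum / (k_1 * k_2)
    raise ValueError("averaging must be 'none', 'sqrt' or 'full'")


# Sequences of length 10 and 20 with conv_width 3 give 8 and 18 k-mers,
# so 8 * 18 = 144 comparisons contribute to the sum.
raw_sum = 72.0
no_avg = apply_averaging(raw_sum, 10, 20, 3, "none")    # 72.0
sqrt_avg = apply_averaging(raw_sum, 10, 20, 3, "sqrt")  # 72 / 12 = 6.0
full_avg = apply_averaging(raw_sum, 10, 20, 3, "full")  # 72 / 144 = 0.5
```

With full averaging the result is the mean pairwise kernel value, which no longer grows with sequence length; sqrt averaging sits between the two extremes.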

Usually the difference in validation set performance between Conv1dMatern, Conv1dCauchy and Conv1dRBF is small. If performance is your primary concern, we recommend defaulting to Conv1dRBF and experimenting with the others, if desired, to see whether some small further improvement can be obtained. Conv1dTwoLayer, by contrast, can sometimes perform significantly better (or worse) than these other options.