Hyena Model (deep learning)
The Hyena[1] model is a neural network architecture developed to address the scalability issues of traditional self-attention[2] mechanisms. It is designed to handle very long sequences efficiently by replacing quadratic-complexity self-attention with a sub-quadratic operator that interleaves implicit long convolutions with data-controlled gating.

Architecture

At the core of the Hyena model is the concept of implicit long convolutions. Traditional convolutions use fixed kernels that are explicitly defined and stored, resulting in a parameter count that scales linearly with the kernel size. In contrast, Hyena generates convolutional filters implicitly using a parameterized function, typically implemented as a small feed-forward network. This allows the model to synthesize long filters on the fly, decoupling the filter length from the number of parameters.

In addition to implicit convolutions, the Hyena operator incorporates data-controlled multiplicative gating. In this mechanism, each token is modulated by gating signals derived from learned linear projections of the input. The gating operation is performed element-wise and dynamically adjusts the influence of the convolutional output, tailoring the operator to the specific input context.

The overall Hyena operator is defined as a recurrence that alternates between implicit long convolutions and element-wise gating. For an order-N Hyena operator, the recurrence is expressed as follows:
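$$z_t^1 = v_t, \qquad z_t^{n+1} = x_t^n \, (h^n * z^n)_t \quad (n = 1, \dots, N), \qquad y_t = z_t^{N+1},$$
where $v$ and $x^1, \dots, x^N$ are learned linear projections of the input, $h^1, \dots, h^N$ are the implicit long convolution filters, $*$ denotes convolution, $z^n$ is the intermediate state at recurrence step $n$, and $y$ is the output of the operator.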
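As a minimal sketch of how an order-N recurrence of this kind alternates convolution and gating, the following pure-Python example processes a single channel. The function names `causal_conv` and `hyena_recurrence` and all numeric values are assumptions made for illustration, the projections and filters are passed in as plain lists rather than learned, and the convolution is evaluated by direct summation for readability rather than speed.

```python
def causal_conv(h, z):
    """Causal convolution: out[t] = sum over s <= t of h[s] * z[t - s]."""
    L = len(z)
    return [sum(h[s] * z[t - s] for s in range(t + 1)) for t in range(L)]


def hyena_recurrence(v, gates, filters):
    """Order-N recurrence for one channel (illustrative, not a trained model).

    v       : list of L floats, the value-like projection of the input
    gates   : N lists of L floats, the gating projections x^1 .. x^N
    filters : N lists of L floats, the long filters h^1 .. h^N
    """
    assert len(gates) == len(filters)
    z = list(v)                                        # z^1 = v
    for x, h in zip(gates, filters):
        conv = causal_conv(h, z)                       # (h^n * z^n)_t
        z = [x_t * c_t for x_t, c_t in zip(x, conv)]   # element-wise gating
    return z                                           # y = z^{N+1}


# Order-2 example on a length-4 sequence; every value here is made up.
v = [1.0, 2.0, 3.0, 4.0]
gates = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
filters = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
y = hyena_recurrence(v, gates, filters)
print(y)  # [0.5, 1.5, 2.5, 3.5]
```

In this toy case the first filter is an identity and the second sums each position with its predecessor, so the output is half of `[1, 3, 5, 7]`. Because the convolution is causal, changing a later input position leaves earlier outputs unchanged.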
Mathematical Formulation

The implicit convolution filters in Hyena are typically parameterized as functions of time. For each filter $h^n$, the response at time $t$ is given by:
$$h_t^n = \mathrm{Window}(t) \cdot (\mathrm{FFN} \circ \mathrm{PositionalEncoding})(t),$$
where $\circ$ is the composition operator, meaning that the positional encoding is first applied to $t$ and then processed by the FFN. Here, the window function modulates the filter (for example, by imposing an exponential decay), while the feed-forward network (FFN) together with the positional encoding generates the filter values. This implicit parameterization is a key design choice that allows Hyena to capture long-range dependencies without a proportional increase in parameter count.

Efficiency and scalability

By replacing the quadratic self-attention[2] mechanism with a sequence of FFT-based convolutions and element-wise multiplications, the Hyena operator achieves an overall time complexity of $O(N L \log_2 L)$, where $N$ is the number of recurrence steps and $L$ is the sequence length. This subquadratic scaling is particularly advantageous for long sequences, allowing the model to process inputs that are orders of magnitude longer than those feasible with conventional attention.

Both the implicit convolutions and the gating functions are highly parallelizable and amenable to optimization on modern hardware accelerators. Evaluating each long convolution with the fast Fourier transform (FFT) instead of direct summation reduces its cost from $O(L^2)$ to $O(L \log L)$, which makes the model well suited to large-scale applications where both speed and memory efficiency are critical.
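The filter parameterization and the FFT-based evaluation described above can be sketched together in a few lines of standard-library Python. This is an illustration under stated assumptions, not the reference implementation: the helper names (`fft`, `fft_causal_conv`, `positional_encoding`, `make_ffn`, `implicit_filter`), the sinusoidal features, the sine activation, the random untrained weights, and the `decay` constant are all hypothetical choices made here.

```python
import cmath
import math
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out


def fft_causal_conv(h, z):
    """Causal convolution of two length-L real sequences in O(L log L)."""
    L = len(z)
    n = 1
    while n < 2 * L:              # zero-pad so circular wrap-around cannot occur
        n *= 2
    H = fft(list(h) + [0.0] * (n - len(h)))
    Z = fft(list(z) + [0.0] * (n - L))
    out = fft([a * b for a, b in zip(H, Z)], invert=True)
    return [out[t].real / n for t in range(L)]   # inverse is unnormalized


def positional_encoding(t, L, n_freqs=2):
    """Assumed features: normalized time plus a few sinusoids."""
    u = t / L
    feats = [u]
    for k in range(1, n_freqs + 1):
        feats += [math.sin(2 * math.pi * k * u), math.cos(2 * math.pi * k * u)]
    return feats


def make_ffn(in_dim, hidden, rng):
    """Tiny one-hidden-layer network with random (untrained) weights."""
    w1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) / hidden for _ in range(hidden)]

    def ffn(feats):
        hid = [math.sin(sum(w * f for w, f in zip(row, feats))) for row in w1]
        return sum(w * a for w, a in zip(w2, hid))

    return ffn


def implicit_filter(L, ffn, decay=4.0):
    """h_t = Window(t) * FFN(PositionalEncoding(t)) with an exponential window."""
    return [math.exp(-decay * t / L) * ffn(positional_encoding(t, L))
            for t in range(L)]


rng = random.Random(0)
ffn = make_ffn(in_dim=5, hidden=8, rng=rng)   # 1 + 2 * n_freqs input features
h_short = implicit_filter(16, ffn)
h_long = implicit_filter(1024, ffn)           # same weights, 64x longer filter
z = [rng.uniform(-1, 1) for _ in range(16)]
y = fft_causal_conv(h_short, z)
```

The same small set of weights produces both a 16-tap and a 1024-tap filter, which is the sense in which filter length is decoupled from parameter count. The zero-padding to at least twice the sequence length keeps the FFT's circular convolution from wrapping future positions into the past.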