TABLE OF CONTENTS

Introduction Background: Vector Quantized Attention Online Vector Quantized Attention Results Conclusions and Future Directions

Background: Vector Quantized Attention Online Vector Quantized Attention Results Conclusions and Future Directions

Introduction Background: Vector Quantized Attention

Introduction Background: Vector Quantized Attention Online Vector Quantized Attention Results Conclusions and Future Directions

Background: Vector Quantized Attention Online Vector Quantized Attention Results Conclusions and Future Directions

Introduction Background: Vector Quantized Attention Online Vector Quantized Attention Results Conclusions and Future Directions

Introduction Background: Vector Quantized Attention Online Vector Quantized Attention Results

Introduction Background: Vector Quantized Attention Online Vector Quantized Attention Results Conclusions and Future Directions

Introduction

Standard sequence mixing layers used in language models struggle to balance efficiency and effectiveness. Self-attention performs well on long-context tasks but has expensive quadratic compute and linear memory costs. Conversely, linear attention and SSMs use only linear compute and constant memory, but struggle with long-context processing. Hybrid models that combine linear attention/SSM layers with self-attention alleviate, but do not remove, the memory and compute complexity of self-attention.

Like linear attention and SSMs, our OVQ-attention layer requires linear compute and constant memory complexity. However, unlike linear attention and SSMs, OVQ-attention uses a sparse state update that allows it to greatly increase the size of its memory state and, consequently, memory capacity, while maintaining efficient training and inference characteristics. In our experiments, we find OVQ-attention significantly outperforms baseline linear-attention and SSM models on long context tasks, while matching or only slightly deviating from strong self-attention baselines at 16k+ context lengths, despite using a fraction of the memory state size of self-attention. OVQ-attention, thus, marks an important alternative direction for developing sequence mixing layers capable of long-context processing.

Background: Vector Quantized Attention

Causal self-attention compares a query vector, \(q_{T}\), at position \(T\) to a tensor containing all previous key vectors, \(K\), via a dot-product. These dot-product similarity values are normalized with softmax then used to compute a weighted average over all previous values, \(V\):

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

Our novel OVQ layer builds on a little-explored sequence mixing layer called vector quantized (VQ) attention. VQ-attention makes one alteration to causal self-attention: it applies vector-quantization to the keys. VQ-attention uses a pre-trained dictionary, \(D_{k}\), that contains a constant number of centroid vectors, which model the means of dense regions of key space. After keys, \(K\), are computed they are vector quantized, i.e., each key is replaced with its nearest neighbor centroid yielding, \(\hat{K}\). Also, the number of keys assigned to each centroid are stored in a count vector, \(c\). It can be shown that when keys are vector quantized in this way, there is an alternative but equivalent form of self-attention with linear compute and constant memory costs:

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

where \(D_{v}\) is a dictionary of centroids fitted to the values, computed on-the-fly during the forward pass. We can see that since the dictionaries are constant size, the compute and memory cost of this operation does not grow with sequence, yielding linear compute and constant memory cost for pre-fill.

Online Vector Quantized Attention

We found that quantizing the keys using a pretrained dictionary \(D_{k}\), as the original VQ attention does, leads to significant performance deficits at long context lengths compared to self-attention. For example, in a basic in-context retrieval (ICR) task, a model that interleaves sliding window attention with VQ-attention (sw-vq) begins to fail prior to 4k context length, even when using a SOTA method for pretraining the key dictionary. A baseline that interleaves sliding window (with RoPE) and standard self attention with NoPE position encodings (sw-nope) solves the task easily and maintains perfect performance beyond training length to 64k.

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

The only difference between sw-vq and the baseline sw-nope is the quantized keys (see equation above). Therefore, the cause of the performance deficit must be the quantization error, i.e., the deviation between the original key and its nearest centroid. This error adds a bias to the self-attention operation, since quantization error is added to the keys prior to self-attention. Further, the quantization error adds a bias to the gradients, since a straight-through estimator must be used to propagate gradients through the non-differentiable quantization process.

Increasing the number of centroids provides diminishing improvements, so this issue cannot be solved by using ever larger pre-trained dictionaries. We therefore propose an alternative approach. We design a layer that learns \(D_{k}\) on-the-fly during the forward pass instead of during pre-training. We call this online vector quantized attention (OVQ-attention) since now all components \((D_{k}, D_{v}, c)\) are computed on-the-fly during the forward pass. Our hypothesis is that directly fitting \(D_{k}\) to the inputted keys during the forward pass should reduce the quantization error, since \(D_{k}\) is fitted to the current sequence rather than to the average of sequences observed in pretraining. Further, by dynamically learning \(D_{k}\) on-the-fly, we can backpropagate directly through the dictionary state updates and the attention operation removing the need for straight through estimators.

However, there are several questions that need to be addressed in order to compute both \(D_{k}\) and \(D_{v}\) in the forward pass. First is the question of how to initialize the dictionaries, since clustering models are notoriously sensitive to initialization. Second, is the question of how to implement the dictionary updates efficiently. Our approach uses a chunk-wise parallel method that loops through chunks of key-values, performing mini-batch updates on the dictionaries. Our method initializes centroids by setting new centroids equal to a subset of the key-values in each chunk, according to their distance from the existing centroids. We use a method that quickly adds new centroids early in the sequence then slowly adds them later. As the state grows to very long lengths, the state reaches some maximum number of centroids, N, which is a hyper-parameter. Thus, although state size grows it has a hard upper-bound yielding constant memory complexity. Details can be found in the paper, but the intuition can be gained from the figure below.

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

The figure above provides some intuition for how OVQ-attention can have better long-context processing abilities than standard linear attention and SSM models. OVQ-attention uses a kind of online clustering algorithm to update its state. The state update is tiny since each key-value only updates a single column of the state, \(S = [D_{k},D_{v}]\). Further, unlike linear attention and SSM models, these state updates have fixed size relative to the number of centroids, allowing us to add more centroids without requiring more memory to store the state updates. This allows us to greatly expand the size of our memory state, and thus its memory capacity. Finally, since these updates are sparse, updating only a single column of the memory state matrix, there is very little interference between unrelated key-values, helping prevent in-context, catastrophic forgetting. This is opposed to the standard SSM and linear attention updates, which perform dense updates over the entire state.

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

We set up our models and baselines so the only difference is the sequence mixing operations. We interleave sliding window layers ( with RoPE) with either self-attention with NoPE (sw-nope) , VQ attention with NoPE (sw-vq), or OVQ-attention with NoPE (sw-ovq). We test sw-vq with a large dictionary size (N = 2k). As noted above, OVQ models have dictionaries that grow over time according to a plateauing growth function toward some maximum, N. We train sw-ovq models with small N (1k or 2k), but test at larger dictionary sizes, and find sw-ovq models can dynamically increase dictionary size at test time and receive consistent performance improvements by doing so, successfully generalizing to larger dictionary sizes.

We first test the models on synthetic key-value in-context retrieval (ICR) tasks. Basic ICR is a simple key-value retrieval task that requires retrieving a value for a given key from a pool of unique key-value pairs shown in context. Positional ICR, a harder task, requires retrieving an ordered set of values assigned to a particular key in the context. Models are trained with up to 4k context length, then tested up to 64k context.

sw-ovq performs significantly better than sw-vq, nearly matching the self-attention baseline at 64k on basic ICR for N=16k-20k, while using less than a quarter of the state/cache size (see right side of figure above). Positional ICR is a harder task for sw-ovq, but we still see OVQ-attention performing well, showing the ability to generalize well to 16k using much smaller KV-cache/dictionary sizes.

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Introduction

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Background: Vector Quantized Attention

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Models are also tested on an in-context learning (ICL) task that requires the model to learn linear regression functions from input-output examples listed in the context. Our version of the task requires long context ICL abilities since examples from multiple functions are listed, thus spreading relevant examples for a particular function over large distances throughout the context. sw-vq fails on this task, even in an easy scenario with a small number of functions to learn. Conversely, sw-ovq is able to solve and match standard attention baseline, sw-nope, even on the hardest ICL task with examples from 128 different functions shown in context. This task requires performing ICL over 16k+ tokens. sw-ovq solves it while using a small state size with <4k centroids.

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Similar trends are observed in long-context language modeling on the PG19 dataset. sw-ovq models perform only slightly worse than sw-nope models at long context lengths, and significantly better than VQ-baseline. Further, adding OVQ-layers to Gated Delta Net (gdn) models dramatically improves performance, while maintaining linear compute and constant memory complexity.

We also compare sw-ovq to several linear attention and SSM baselines on the synthetic tasks including Gated Delta Net, Mesa Net, and Mamba2. We find the OVQ-attention model has far superior long-context capabilities on these tasks. Further results can be found in our arXiv paper.

Creating LLMs and multi-modal agents that can learn continually over extended deployments is one of the final frontiers facing the field of AI. Storing and processing a linearly increasing KV-cache, as self-attention does, is infeasible at the extremely long context lengths faced during such deployments. Sequence compression, via principled learning mechanisms, is needed. However, the current layers that perform such compression, such as SSMs and linear attention, lack the long-term recall and long-context processing capabilities needed for truly long-term coherent agency. Thus, an alternative approach is needed.

OVQ-attention points toward an alternative path: store a dynamically growable, but strictly bounded, memory state that uses efficient sparse updates. Our empirical results suggest this is a promising alternative path forward. Future work will look to further improve the OVQ-attention layer’s performance on long context tasks and develop hardware-efficient implementations that allow for use at large scale.

Link to Cookbook (GitHub)

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

What is Annealing?

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Introduction

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Background: Vector Quantized Attention

Online Vector Quantized Attention

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Table 1: Evaluation scores for Zyda-2 vs alternative datasets broken down more granularly by specific evaluation metric

Results

Conclusions and Future Directions

Analysis of Global Duplicates

We present histograms depicting distribution of cluster sizes in all the datasets (see Fig. 7-11). Please, note that all the figures are in log-log scale. We see a significant drop in the number of clusters starting from the size of around 100. This drop is present both in DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and most likely is explained by a combination of the deduplication strategy and quality when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low quality material (repeated advertisements, license agreements templates, etc), so it’s not surprising that such documents were removed. Notably, DCLM still contained one cluster with the size close to 1 million documents, containing low quality documents seemingly coming from the advertisements (see Appendix).We find both Zyda-1and Dolma-CC contain a small amount of duplicates, which is expected, since both datasets were deduplicated globally by their authors. Remaining duplicates are likely false negatives from the initial deduplication procedure. Note, that distribution of duplicates clusters sizes of these two datasets (Fig. 10 and 11) don’t contain any sharp drops, but rather hyper exponentially decreases with cluster size.

Figure 7: Distribution of cluster sizes of duplicates in global dataset (log-log scale).

Figure 8: Distribution of cluster sizes of duplicates in DCLM (log-log scale).

Figure 9: Distribution of cluster sizes of duplicates in FineWeb-Edu2 (log-log scale).

Figure 10: Distribution of cluster sizes of duplicates in Zyda-1 (log-log scale).

Figure 11: Distribution of cluster sizes of duplicates in Dolma-CC (log-log scale).

Largest cluster in DCLM

Below is an example of the document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
‍
‍Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
‍‍
‍The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
‍‍
‍Positive votes:
Negative votes:
Vote Up Vote Down review
‍‍
‍Have you had bad experience with Warn us, please!

Examples of varying quality score in a cluster of duplicates in DCLM

Below one will find a few documents with different quality scores from DCLM coming from the same duplicates cluster. Quality score varies from ~0.2 to ~0.04.

Document ID: <urn:uuid:941f22c0-760e-4596-84fa-0b21eb92b8c4>

Quality score of: 0.19616

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally categorized as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jockey artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which clearly lies beyond the boundaries of each. Perhaps best described as simply "experimental," Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dicks (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running-styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Dicks left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize. ~ Sean Cooper, Rovi

Document ID: <urn:uuid:0df10da5-58b8-44d8-afcb-66aa73d1518b>

Quality score of: 0.091928

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally grouped in as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of the music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jocky artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which lay clearly beyond the boundaries of each. Perhaps best described as simply experimental, Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dick (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Elliot Dick left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize.
Sean Cooper, Rovi
‍
More Rome
‍
You may also like...

Document ID: <urn:uuid:4986ef09-3ee3-4e13-9084-7898aaf72aaf>

Quality score of: 0.072259

recent on-air advertisers

Now Playing

You Control the ...

Artist Snapshot:

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally grouped in as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of the music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jocky artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which lay clearly beyond the boundaries of each. Perhaps best described as simply experimental, Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dick (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Elliot Dick left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize. ~ Sean Cooper, RoviSean Cooper, Rovi
‍
More Rome
‍
You may also like...

Document ID: <urn:uuid:1e0496a9-0116-418a-9aec-e65b1d20e709>

Quality score of: 0.0424

18 June 2015

ROME self titled 1996

by request

Artist Biography by

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally categorized as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jockey artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which clearly lies beyond the boundaries of each. Perhaps best described as simply "experimental," Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dicks (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running-styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Dicks left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize.
‍
1 Leaving Perdition 8:10
2 Intermodal 3:39
3 Lunar White 3:25
4 She's A Black Belt 3:14
5 Rohm 1:09
6 Radiolucence (Version) 5:31
7 Deepest Laws 14:14

No comments:

Introduction

Reported scores underlined.

Pass@1 scores with greedy sampling.

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Pass@1 scores with greedy sampling. Livebench 2024-11-25.
Bold: Best score at 1.5B scale w/ greedy sampling
*reported scores

Evals (reported underlined). All numbers pass@1 estimated using n=16

Background: Vector Quantized Attention

Online Vector Quantized Attention

Results

Footnote: Training on the Eurus-2-RL dataset did not match the DeepScaleR math evaluation numbers, possibly due to lower quality synthetic math questions in NuminaMath-CoT providing a mixed training signal, or the solvability filtering process with QwQ-preview reducing the difficulty of the dataset. Additionally, the relatively small percentage of code (5%) likely led to math dominating training at the expense of code performance. Training on domain specific datasets and merging resulting models seems to be a potential way to counteract this problem, as demonstrated with SFT in Light-R1.

Conclusions and Future Directions

Introduction

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Background: Vector Quantized Attention

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Prompt #1

I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #2

The emperor's complexion did not change, remaining as still as a sculpture, and a touch of touching warmth flashed in his eyes. He deeply glanced at the loyal minister, and finally spoke: "Well, I will consider it again." His voice was low and firm, leaving a faint hint of helplessness and tenderness in the air.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #3

You don't even think to call me "Godfather." You come into my house on the day my daughter is to be married and you ask me to do murder - for money.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #4

Brave bakers boldly baked big batches of brownies in beautiful bakeries.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #5

Active artists always appreciate artistic achievements and applaud awesome artworks.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #6

I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right?

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #7

F one F two F four F eight H sixteen H thirty two H sixty four

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #8

Its chlorover. Like totally chlorover. Totally. Completely. Chlorover.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #9

Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Results

Conclusions and Future Directions

Introduction

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Background: Vector Quantized Attention

Online Vector Quantized Attention

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.

\[ o_{T} = softmax(q_{T}{\hat{K}}^T)V \]

\[ = softmax(q_{T}{D_k}^T + \log(c))D_{v'} \]

Online Vector Quantized Attention

Results

Conclusions and Future Directions

Introduction

Background: Vector Quantized Attention

\[ o_{T} = softmax(q_{T}K^T)V \]

The greater the sequence position, \(T\), the more compute and memory self-attention requires, yielding, quadratic compute and linear memory complexity for pre-fill.