Zyphra shares its research into a novel attention variant: Compressed Convolutional Attention (CCA). CCA dramatically reduces the compute, memory, and parameter costs of self-attention while matching or exceeding the performance of existing methods. Details of CCA and its grouped-query variant CCGQA are described in the accompanying technical report published on arXiv. CCGQA has subsequently been used to train the ZAYA suite of language models.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly growing KV-cache create fundamental bottlenecks for training and serving large language models, especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-head Latent Attention (MLA) have made important strides in reducing KV-cache size and decode latency. However, these methods leave FLOPs, which determine prefill and training speed, unchanged or even increase them.
We introduce Compressed Convolutional Attention (CCA), a novel attention method that down-projects queries, keys, and values to a lower-dimensional latent space and then, crucially, performs the entire attention operation inside that shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which lets the model designer tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA powers the ZAYA suite of language models.

CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
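To make the core idea concrete, here is a minimal PyTorch sketch of attention computed entirely at the compressed latent width E/C. This is our illustration of the idea described above, not Zyphra's reference implementation; the module name, head count, and compression factor are assumptions, and the convolutional mixing, QK-mean, and value-shift components are deferred to the sketch after the list below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedLatentAttention(nn.Module):
    """Minimal sketch: attention runs entirely in a latent space of width E // C.

    E: model width, C: compression factor, H: number of attention heads.
    Convolutional mixing, QK-mean adaptation, and value-shift are omitted here.
    """
    def __init__(self, E: int, C: int = 4, H: int = 8):
        super().__init__()
        assert (E // C) % H == 0
        self.H, self.d = H, (E // C) // H                       # latent head dimension
        self.q_down = nn.Linear(E, E // C, bias=False)          # down-project queries
        self.kv_down = nn.Linear(E, 2 * (E // C), bias=False)   # down-project keys and values
        self.out_up = nn.Linear(E // C, E, bias=False)          # only the output is up-projected

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, S, _ = x.shape
        q = self.q_down(x)
        k, v = self.kv_down(x).chunk(2, dim=-1)                 # k, v stay at width E // C: small KV-cache
        q, k, v = (t.view(B, S, self.H, self.d).transpose(1, 2) for t in (q, k, v))
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # all S^2 work happens at latent width
        return self.out_up(o.transpose(1, 2).reshape(B, S, -1))
```

With E = 1024 and C = 4, only the 256-wide k and v tensors would need to be cached during decode, which is where the KV-cache saving comes from.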
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types: one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling. (All three mechanisms are sketched below.)
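The sketch below shows one plausible way to wire these three mechanisms together on already-compressed Q/K/V latents. The kernel sizes, the ordering of the sequence and channel convolutions, and the exact form of the mean and shift combinations are our assumptions for illustration; consult the technical report for the precise formulation.

```python
import torch
import torch.nn as nn

class LatentQKVMixer(nn.Module):
    """Illustrative convolutional mixing, QK-mean adaptation, and value-shift,
    applied to already-compressed Q/K/V latents of width D = E // C."""
    def __init__(self, D: int, seq_kernel: int = 4, chan_kernel: int = 3):
        super().__init__()
        # depthwise causal convolution over the sequence dimension
        self.seq_conv = nn.Conv1d(D, D, seq_kernel, groups=D, padding=seq_kernel - 1)
        # convolution over the channel dimension (each position treated independently)
        self.chan_conv = nn.Conv1d(1, 1, chan_kernel, padding=chan_kernel // 2)

    def _mix(self, z: torch.Tensor) -> torch.Tensor:             # z: (B, S, D)
        B, S, D = z.shape
        h = self.seq_conv(z.transpose(1, 2))[..., :S].transpose(1, 2)  # trim right pad => causal
        h = self.chan_conv(h.reshape(B * S, 1, D)).reshape(B, S, D)
        return h

    def forward(self, q, k, v):                                   # each: (B, S, D)
        q_c, k_c = self._mix(q), self._mix(k)
        # QK-mean adaptation: add the mean of pre- and post-convolution Q and K,
        # sharing information between Q and K and acting as a skip connection.
        qk_mean = (q + k + q_c + k_c) / 4
        q_out, k_out = q_c + qk_mean, k_c + qk_mean
        # Value-shift: pair each position's values with the previous position's values
        # (an RWKV-style token shift), so heads can read both the current and shifted halves.
        v_prev = torch.cat([torch.zeros_like(v[:, :1]), v[:, :-1]], dim=1)
        v_out = torch.cat([v, v_prev], dim=-1)
        return q_out, k_out, v_out
```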

CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods such as MLA, which requires a shared RoPE key head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without a performance penalty and allows the compression rates of queries and keys to be decoupled.
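As a small illustration of how the two axes compose, the snippet below applies GQA-style key/value head sharing inside the compressed latent space: queries keep H latent heads while only H/G key/value heads are cached and then broadcast across each group of queries. The shapes and the group size G are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, S, H, G, d = 2, 256, 8, 2, 32       # G = query heads per shared KV head; d = latent head dim

q = torch.randn(B, H, S, d)            # H query heads, all at latent width
k = torch.randn(B, H // G, S, d)       # only H / G key heads are computed and cached,
v = torch.randn(B, H // G, S, d)       # giving the extra G-fold KV-cache reduction

# broadcast each shared KV head across its group of query heads (GQA inside the latent)
k = k.repeat_interleave(G, dim=1)
v = v.repeat_interleave(G, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # (B, H, S, d)
```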
In addition, CCA and CCGQA are amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as the tensor-parallel degree matches the number of KV heads. The TP communication for the QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of the full width E within ring or tree attention schemes (a rough communication-volume comparison follows below).
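For context parallelism, the saving is easy to quantify with a rough model: each ring-attention step exchanges K and V blocks whose width is E/C rather than E. The helper below is a back-of-the-envelope estimate under assumed shapes, not a measurement.

```python
def ring_step_kv_bytes(S: int, E: int, C: int, world_size: int, dtype_bytes: int = 2) -> float:
    """Approximate K + V bytes exchanged per ring-attention step per rank.
    C = 1 corresponds to full-width MHA; larger C uses the CCA latent width E // C."""
    block_tokens = S // world_size              # tokens held by each rank
    return 2 * block_tokens * (E // C) * dtype_bytes

S, E, world = 131_072, 4_096, 8
print(ring_step_kv_bytes(S, E, 1, world) / 2**20, "MiB per step (MHA)")
print(ring_step_kv_bytes(S, E, 16, world) / 2**20, "MiB per step (CCA, C = 16)")
```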

Because the S² terms in QKᵀ and Attn·V shrink by a factor of C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at a sequence length of 16k relative to MHA, and accelerates the backward pass by approximately 1.3×.
CCA with 16× compression can process √16 = 4× longer sequences for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 of the technical report.
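The rough FLOP model below (ours, counting only the QKV/output projections and the two S² matrix multiplications, and ignoring convolutions and softmax) makes both claims concrete: the dominant terms scale with 1/C, and a sequence √C times longer has a comparable quadratic cost.

```python
def attn_flops(S: int, E: int, C: int = 1) -> float:
    """Approximate forward attention FLOPs per layer at sequence length S,
    model width E, and compression factor C (C = 1 recovers plain MHA)."""
    d = E / C                         # width at which the attention actually runs
    proj = 2 * S * E * 4 * d          # Q/K/V down-projections plus the output up-projection
    scores = 2 * S * S * d            # Q @ K^T
    mix = 2 * S * S * d               # Attn @ V
    return proj + scores + mix

S, E, C = 16_384, 4_096, 16
print(attn_flops(S, E) / attn_flops(S, E, C))                    # roughly C-fold fewer FLOPs
print(attn_flops(int(S * C ** 0.5), E, C) / attn_flops(S, E))    # ~4x longer sequence, similar budget
```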

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward passes that fuse the convolution operations with an online softmax in the style of Flash Attention. The kernels execute the entire attention operation in the compressed latent space of width E/C, which significantly reduces both arithmetic and data-movement requirements.
Our FLOP complexity model closely matches the measured results. CCA's matrix multiplications shrink by a factor of C, and its lead grows at longer sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launches amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV-reuse benefits in the latent space while preserving CCA's 1/C scaling of the matrix multiplications. As a result, CCA provides significant performance benefits during training, prefill, and decode.
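For reference, here is what the fused kernel computes, written as plain PyTorch over a single latent head: a tiled causal attention with an online (running-max) softmax, operating entirely at the latent head width. The block size and single-head layout are simplifications; this is a readability reference, not the H100 kernel.

```python
import torch

def latent_flash_attention(q, k, v, block: int = 128):
    """Tiled causal attention with an online softmax, entirely at latent width d.
    q, k, v: (S, d) for a single latent head. Pure-PyTorch reference, not the fused kernel."""
    S, d = q.shape
    out = torch.empty_like(v)
    for i0 in range(0, S, block):
        qi = q[i0:i0 + block]                                   # (bq, d) block of queries
        bq = qi.shape[0]
        m = q.new_full((bq, 1), float("-inf"))                  # running row max
        l = q.new_zeros(bq, 1)                                  # running softmax denominator
        acc = q.new_zeros(bq, d)                                # running weighted sum of values
        q_idx = torch.arange(i0, i0 + bq, device=q.device).unsqueeze(1)
        for j0 in range(0, i0 + bq, block):                     # key blocks up to the diagonal
            kj, vj = k[j0:j0 + block], v[j0:j0 + block]
            k_idx = torch.arange(j0, j0 + kj.shape[0], device=q.device).unsqueeze(0)
            s = (qi @ kj.T) / d ** 0.5
            s = s.masked_fill(k_idx > q_idx, float("-inf"))     # causal mask
            m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
            p = torch.exp(s - m_new)
            rescale = torch.exp(m - m_new)                      # renormalize previous partial sums
            l = l * rescale + p.sum(dim=-1, keepdim=True)
            acc = acc * rescale + p @ vj
            m = m_new
        out[i0:i0 + bq] = acc / l
    return out
```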

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attention variants at matched parameter counts, while providing substantial KV-cache compression relative to MHA. CCA beats MLA in the parameter-matched setting while using 4× fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts and lower compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of the other parameter-efficient attention variants tested, with no drop in performance relative to standard MHA and to parameter-efficient variants such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.
CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.

CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.
Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.
CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.

CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.

Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.
Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.

CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.
CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.
Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.
To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.
Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.
CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.

CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.
Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.
To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.

The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.



Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.
CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.
CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.

Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.
Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.


Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.

CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.





CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.
Because the S² terms in QKᵀ and Attn·V shrink by a factor of C (the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at a sequence length of 16k relative to MHA and accelerates the backward pass by approximately 1.3×.
CCA with 16× compression allows √16 = 4× longer sequences to be processed within the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 of the technical report.
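The scaling above can be sanity-checked with a few lines of arithmetic. The sketch below uses a deliberately simplified FLOP count (four projection matmuls plus the two S² matmuls, ignoring softmax and convolution overhead), and the width, compression factor, and sequence length are illustrative assumptions rather than values taken from the report.

```python
# Simplified per-layer attention FLOP model (batch 1). `attn_width` is the width
# the attention actually runs at: E for MHA, E/C for CCA. Constants are rough.
def attn_flops(S: int, E: int, attn_width: int) -> float:
    projections = 2 * S * 4 * E * attn_width      # four E x attn_width projection matmuls
    score_and_value = 4 * S * S * attn_width      # QK^T and Attn*V, ~2*S^2*width each
    return projections + score_and_value

E, C, S = 4096, 16, 16_384                        # hypothetical width, compression, sequence length
print(attn_flops(S, E, E) / attn_flops(S, E, E // C))   # -> 16.0: every term shrinks by C

# Holding the quadratic-term budget fixed: S_cca**2 * (E / C) == S_mha**2 * E
# implies S_cca = sqrt(C) * S_mha, i.e. 4x longer context for C = 16.
print(C ** 0.5)                                   # -> 4.0
```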

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward passes that fuse the convolution operations with an online softmax in the style of Flash Attention. The kernels execute the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic and the data-movement requirements.
Our FLOP complexity model closely aligns with the measured results. The matrix multiplications of CCA shrink by a factor of C, and CCA's lead grows at larger sequence lengths as the S² terms and projections dominate while the small overheads from convolutions, reductions, and kernel launches amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV-reuse benefits in the latent space while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.
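For readers who want a concrete reference point, the following is a naive, unfused PyTorch sketch of attention running entirely at latent width, in the spirit of CCA. It is not the fused kernel and not the reference implementation from the technical report: the exact form of the convolutional mixing, the QK-mean combination, and the value-shift (realized here by concatenating current and previous-position value latents) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class LatentAttentionSketch(nn.Module):
    """Naive latent-space attention in the spirit of CCA (illustrative only)."""

    def __init__(self, embed_dim: int, num_heads: int, compression: int, kernel: int = 4):
        super().__init__()
        self.latent_dim = embed_dim // compression            # E / C
        self.num_heads = num_heads
        self.head_dim = self.latent_dim // num_heads
        # Down-projections straight into the shared latent space (E -> E/C).
        self.q_down = nn.Linear(embed_dim, self.latent_dim, bias=False)
        self.kv_down = nn.Linear(embed_dim, 2 * self.latent_dim, bias=False)
        # Depthwise causal conv over the sequence, then a pointwise conv over channels.
        self.q_seq = nn.Conv1d(self.latent_dim, self.latent_dim, kernel,
                               groups=self.latent_dim, padding=kernel - 1)
        self.k_seq = nn.Conv1d(self.latent_dim, self.latent_dim, kernel,
                               groups=self.latent_dim, padding=kernel - 1)
        self.q_chan = nn.Conv1d(self.latent_dim, self.latent_dim, 1)
        self.k_chan = nn.Conv1d(self.latent_dim, self.latent_dim, 1)
        # Output projection back to the residual stream; 2x because of the value-shift concat.
        self.out_up = nn.Linear(2 * self.latent_dim, embed_dim, bias=False)

    def _mix(self, seq_conv: nn.Conv1d, chan_conv: nn.Conv1d, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, latent). Trim the right side to keep the sequence conv causal.
        y = seq_conv(x.transpose(1, 2))[..., : x.shape[1]]
        return chan_conv(y).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_down(x)
        k, v = self.kv_down(x).chunk(2, dim=-1)

        # Convolutional mixing plus a QK-mean-style skip built from the
        # pre- and post-convolution query/key latents (one plausible variant).
        q_mix = self._mix(self.q_seq, self.q_chan, q)
        k_mix = self._mix(self.k_seq, self.k_chan, k)
        qk_mean = 0.25 * (q + q_mix + k + k_mix)
        q, k = q_mix + qk_mean, k_mix + qk_mean

        # Value-shift: give every head both the current and the previous-position value latent.
        v_prev = F.pad(v, (0, 0, 1, 0))[:, :-1]
        v = torch.cat([v, v_prev], dim=-1)

        def heads(t: torch.Tensor, d: int) -> torch.Tensor:
            return t.view(b, s, self.num_heads, d).transpose(1, 2)

        q, k, v = heads(q, self.head_dim), heads(k, self.head_dim), heads(v, 2 * self.head_dim)
        # The whole attention runs at latent width; RoPE could be applied to q/k right here.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_up(o.transpose(1, 2).reshape(b, s, 2 * self.latent_dim))

# Example: E = 1024, 8 heads, 8x compression -> attention runs at width 128.
attn = LatentAttentionSketch(embed_dim=1024, num_heads=8, compression=8)
out = attn(torch.randn(2, 64, 1024))    # -> (2, 64, 1024)
```

Even in this unfused form, the S² matmuls inside scaled_dot_product_attention run at width E/C, which is the source of the 1/C FLOP scaling that the fused kernel then exploits.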

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attention variants at matched parameter counts while providing substantial KV-cache compression relative to MHA. CCA beats MLA in the parameter-matched setting while using 4× fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts and lower compute cost. CCGQA reaches 8× KV-cache compression, half the KV-cache of the other parameter-efficient attention variants tested, with no drop in performance relative to standard MHA or to parameter-efficient alternatives such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.

Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.

CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.
CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.

Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.
To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.
CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.

CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.

Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.
CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.
CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.
Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.
To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.
Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.
Self-attention is the powerful sequence mixer at the core of modern transformer architectures, but its quadratic compute complexity and linearly-growing KV-cache create fundamental bottlenecks for training and serving large language models—especially at long context lengths. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) have made important strides in reducing the KV-cache size and reducing decode latency. However, these methods leave FLOPs—which determines prefill and training speed—either unchanged or even increased.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values to lower-dimensional latent space and then, crucially, performs the entire attention operation inside the shared latent space, unlike other attention mechanisms such as GQA and MLA. Our design cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Since CCA is orthogonal to head-sharing approaches, we can combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which enables the model designer to tune compression toward either FLOP or memory limits without sacrificing model quality. CCGQA is used to power the ZAYA suite of language models.

CCA builds on the insight that significant redundancy exists in traditional attention's parameter and activation spaces. Rather than up-projecting compressed representations back to full dimension before attention (as in MLA), CCA performs the full attention computation entirely within the compressed latent space.
To maintain and enhance performance at high compression rates, CCA introduces three key innovations:
Convolutional Mixing: We apply sequential convolutions across both sequence and channel dimensions on the compressed query and key latents. These convolutions provide additional expressivity and enable better information transfer through attention—analogous to how causal convolutions improve sequence mixing in state-space models.
QK-Mean Adaptation: We add the mean of pre- and post-convolution query and key values, which shares information between Q and K while providing a skip connection that allows the model to interpolate convolution strength.
Value-Shift: Each attention head receives two distinct value types—one from the current embedding and one from the previous position in the sequence. This inductive bias, similar to token-shift approaches in RWKV, proves beneficial for sequence modeling.

CCA allows RoPE (or any position embedding) to be applied directly within the latent space. This is unlike competing latent methods like MLA, which requires a shared key rope head and cache due to its up-projections.
Furthermore, our work is the first to note and theoretically clarify that parameter-sharing methods (like GQA) and parameter-compression methods (like MLA and CCA) are orthogonal and can be effectively combined. CCGQA applies GQA-style K and V head sharing within the already compressed latent space. This enables an additional 2× KV-cache reduction without performance penalty, and allows decoupling the compression rates of queries and keys.
In addition, CCA and CCGQA are also amenable to existing parallelism schemes:
Tensor Parallelism: Sharding CCA's latent representation incurs only the same cost as GQA, as long as TP rank matches the number of kv heads. The TP communication for QK mean can be overlapped with the compute of the convolutions.
Context Parallelism: One can communicate the smaller latent width E/C instead of full width E within ring or tree attention schemes.
Because the S² terms in QKᵀ and Attn·V shrink by 1/C (where C is the compression factor), the speedups grow with sequence length. On H100 GPUs, our fused CCA kernel reduces prefill latency by approximately 1.7× at sequence length 16k relative to MHA, and accelerates backward passes by approximately 1.3×.
CCA with 16× compression enables √16 = 4× longer sequences to be processed for the same FLOP budget. For a full theoretical FLOP analysis of CCA against its competitors, see Table II and Figure 2 (below) of the technical report.

To realize these theoretical gains in practice, we designed and implemented H100 GPU kernels for the forward and backward that fuses the convolution operations with an online softmax in the style of Flash Attention. The kernel executes the entire attention operation in the compressed latent space of width E/C, which significantly reduces both the arithmetic intensity and data-movement requirements.
Our FLOP complexity model closely aligns with implementation results. The matrix multiplications of CCA reduce by a factor of 1/C. CCA’s lead grows with larger sequence lengths as the S² terms and projections dominate and the small overheads from convolutions, reductions, and kernel launch amortize. A key difference from methods that only rebalance bandwidth during decode (like GQA and MLA) is that CCA reduces both prefill and decode compute while simultaneously shrinking the KV-cache. CCGQA inherits GQA's KV reuse benefits in the latent while preserving CCA's 1/C scaling of the matrix multiplications. This enables CCA to provide significant performance benefits during training, prefill, and decode.

Our experimental ablations demonstrate that CCA and CCGQA consistently outperform both GQA and MLA at equal KV-cache compression on both dense and MoE models.
Dense Models: CCGQA outperforms all other attentions despite matching parameters—with substantial KV-cache compression compared to MHA. CCA beats MLA in the parameter-matched setting while using 4 times fewer FLOPs.
MoE Models: CCA achieves lower loss than GQA and MLA at equivalent parameter counts with less compute cost. CCGQA achieves 8× KV-cache compression, half the KV-cache of other parameter efficient attentions tested, with no drop in performance compared to standard MHA and other parameter efficient attentions such as MLA.

The full technical report, including ablation studies, kernel implementation details, and code examples, is available on arXiv. For an example of using CCA/CCGQA in a production model training run, see our ZAYA1 technical report.