TABLE OF CONTENTS

Highlights Efficiency Evaluations Inference Performance Architecture

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves state-of-the-art (SOTA) evaluation benchmark performance and superior inference efficiency compared to models of similar or larger scale, such as Gemma-2B (Google), SmolLM-1.7B (Hugging Face), OpenELM-1.1B (Apple), StableLM-1.6B (Stability AI), and Phi-1.5 (Microsoft).
Zamba2-mini is extremely inference-efficient, achieving a 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi-1.5 (1.3B).
We release the model weights open-source under the Apache 2.0 license.

Zamba2-mini (1.2B) Efficiency

Zamba2-mini achieves the quality of a 2-3B dense transformer while requiring only the inference compute and memory of a <1B dense transformer. Our focus in designing hybrid models is to keep the best of both worlds: the efficiency of SSM/RNN architectures and the quality of the transformer architecture. The main factors behind our model's advantages over comparable dense transformers are:

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3-trillion-token pre-training dataset combines Zyda with other openly available datasets, all extensively filtered and deduplicated.
We use a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.
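
As a rough illustration of the annealing phase described above, the sketch below decays a learning rate across a 100B-token budget. The linear shape, the starting rate, and the final rate are assumptions made for the example; the post only states that the learning rate is decayed over 100B high-quality tokens.

```python
def annealing_lr(tokens_seen: float,
                 anneal_tokens: float = 100e9,   # 100B tokens, as stated in the post
                 start_lr: float = 1e-4,         # hypothetical starting rate
                 end_lr: float = 0.0) -> float:  # hypothetical final rate
    """Linearly decay the learning rate across the annealing phase.

    The linear schedule is an assumption for illustration only.
    """
    # Clamp progress to [0, 1] so calls outside the phase stay well-defined.
    progress = min(max(tokens_seen / anneal_tokens, 0.0), 1.0)
    return start_lr + (end_lr - start_lr) * progress


if __name__ == "__main__":
    for seen in (0, 25e9, 50e9, 100e9):
        print(f"{seen / 1e9:>5.0f}B tokens -> lr = {annealing_lr(seen):.2e}")
```

Any other monotone decay (cosine, exponential) would fit the same description; only the token budget comes from the text.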

Inference Efficiency:

Mamba2 blocks are extremely efficient, with roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks store only small hidden states and do not require a KV cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are well suited to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).
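
To make the KV-cache point concrete, here is a small back-of-the-envelope sketch. It uses the standard estimate that each attention invocation caches one key and one value vector per token and per KV head. All layer counts, head counts, and dimensions below are hypothetical placeholders, not Zamba2-mini's actual configuration.

```python
def kv_cache_bytes(attn_invocations: int, seq_len: int, n_kv_heads: int,
                   head_dim: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size: keys and values for every attention invocation."""
    keys_and_values = 2  # one K tensor and one V tensor per invocation
    return (keys_and_values * attn_invocations * batch_size * seq_len
            * n_kv_heads * head_dim * bytes_per_value)


if __name__ == "__main__":
    # Hypothetical dense transformer: attention in every one of 24 layers.
    dense = kv_cache_bytes(attn_invocations=24, seq_len=4096,
                           n_kv_heads=16, head_dim=128)
    # Hypothetical hybrid: only 6 invocations of a shared attention block.
    hybrid = kv_cache_bytes(attn_invocations=6, seq_len=4096,
                            n_kv_heads=16, head_dim=128)
    print(f"dense:  {dense / 2**20:.0f} MiB")
    print(f"hybrid: {hybrid / 2**20:.0f} MiB")
    print(f"ratio:  {dense / hybrid:.1f}x")
```

Under these assumptions the cache shrinks in proportion to the number of attention invocations, which is why replacing most attention layers with Mamba blocks reduces memory overhead.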

For these reasons, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention layer in Zamba1, two in Zamba2). The weights of this attention layer are shared across its invocations to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely because it better preserves information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared attention and MLP blocks, which adds expressivity and lets each invocation of a shared block specialize slightly to its own position while keeping the additional parameter overhead small.
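
The parameter trade-off described above can be sketched with simple arithmetic. For a single weight matrix reused at several depths, a per-invocation LoRA adds two low-rank factors of shapes (d_in x r) and (r x d_out). The dimensions, rank, and invocation count below are hypothetical and chosen only to show the scale of the saving.

```python
def shared_with_lora_params(d_in: int, d_out: int, rank: int,
                            invocations: int) -> int:
    """One shared weight matrix plus a separate LoRA pair per invocation."""
    base = d_in * d_out                            # stored once, reused at every depth
    lora = invocations * rank * (d_in + d_out)     # A: d_in x r, B: r x d_out
    return base + lora


def unshared_params(d_in: int, d_out: int, invocations: int) -> int:
    """Baseline: an independent full weight matrix at every invocation."""
    return invocations * d_in * d_out


if __name__ == "__main__":
    # Hypothetical sizes, not the model's real configuration.
    d, r, n = 2048, 128, 6
    shared = shared_with_lora_params(d, d, r, n)
    full = unshared_params(d, d, n)
    print(f"shared + LoRA: {shared:,} parameters")
    print(f"unshared:      {full:,} parameters")
```

With these placeholder values the shared-plus-LoRA layout uses well under a third of the parameters of six independent matrices, while still giving each invocation its own trainable low-rank correction.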

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks.
We apply a LoRA projector to both the shared attention and MLP blocks, which allows the network to specialize the shared layers at each invocation across depth.
We added rotary position embeddings to the shared attention layers, which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B in that it uses LoRAs on the shared attention layers and only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.
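
The difference between a single shared layer and an alternating scheme can be pictured as a layer schedule. The sketch below is purely illustrative: the number of Mamba2 layers, the spacing between shared-block invocations, and the choice to place the shared block before a Mamba2 layer are all assumptions, and the labels are hypothetical names rather than identifiers from the released code.

```python
from typing import List


def layer_schedule(n_mamba_layers: int, shared_every: int,
                   n_shared_blocks: int = 1) -> List[str]:
    """List the order in which blocks run through the depth of the network.

    Each invocation of a shared block gets its own LoRA index, while the
    shared block index cycles through the available shared blocks.
    """
    schedule = []
    invocation = 0
    for i in range(n_mamba_layers):
        if i % shared_every == 0:
            block_id = invocation % n_shared_blocks  # 0 only, or alternating 0/1
            schedule.append(f"shared[{block_id}]+lora[{invocation}]")
            invocation += 1
        schedule.append(f"mamba2[{i}]")
    return schedule


if __name__ == "__main__":
    print("single shared block:")
    print("  " + " -> ".join(layer_schedule(4, 2, n_shared_blocks=1)))
    print("two alternating shared blocks:")
    print("  " + " -> ".join(layer_schedule(4, 2, n_shared_blocks=2)))
```

In the single-block case every invocation reuses block 0 and differs only in its LoRA index; with two blocks the invocations alternate between block 0 and block 1.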

Zamba2-1.2B was pretrained on approximately 3T tokens from a dataset composed of Zyda and other open-access pre-training datasets (all aggressively filtered and deduplicated to ensure quality), then annealed on 100B of the highest-quality tokens.

Zamba2-1.2B will be released under an open-source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models. A Hugging Face integration is available here, and a pure-PyTorch implementation is available here.

Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Efficiency

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Link to Cookbook (GitHub)

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

What is Annealing?

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Efficiency

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, with roughly 4x the throughput of a transformer block with the same parameter count.
Mamba blocks store only a small hidden state and do not require a KV cache, so KV states need to be stored only for the invocations of the shared attention block.
We choose model dimensions that parallelize well on modern hardware (i.e. across multiple streaming multiprocessors on GPUs and multiple cores on CPUs).

Given these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.
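The KV-cache point above can be made concrete with rough arithmetic. The sketch below assumes the usual accounting of one key tensor and one value tensor per attention invocation. Every configuration value is a made-up example, not the configuration of Zamba2-mini or of any model it is compared against.

```python
# Back-of-the-envelope KV-cache size. Every configuration value below is a
# made-up example, not a real model configuration.

def kv_cache_bytes(attn_invocations, seq_len, kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every attention invocation."""
    keys_and_values = 2  # one tensor for K, one for V
    return (keys_and_values * attn_invocations * batch_size
            * seq_len * kv_heads * head_dim * bytes_per_value)

# A transformer caches K/V at every layer. A hybrid caches them only where
# the shared attention block is invoked (both counts are hypothetical).
dense = kv_cache_bytes(attn_invocations=24, seq_len=4096, kv_heads=16, head_dim=128)
hybrid = kv_cache_bytes(attn_invocations=6, seq_len=4096, kv_heads=16, head_dim=128)

print(f"dense:  {dense / 2**20:.0f} MiB")   # dense:  768 MiB
print(f"hybrid: {hybrid / 2**20:.0f} MiB")  # hybrid: 192 MiB
```

The cache grows linearly with both sequence length and the number of attention invocations, so fewer invocations shrink it proportionally. The Mamba hidden state is not counted here because it does not grow with sequence length.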

Table 1: Evaluation scores for Zyda-2 versus alternative datasets, broken down by individual evaluation metric.

Analysis of Global Duplicates

We present histograms of the distribution of duplicate cluster sizes in all the datasets (see Fig. 7-11). Note that all the figures use a log-log scale. We see a significant drop in the number of clusters starting at a size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9, respectively) and is most likely explained by a combination of the deduplication strategy and the quality filtering used when creating both datasets: DCLM was deduplicated individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents that appear to come from advertisements (see Appendix).

We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
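A minimal sketch of how such a cluster-size histogram could be computed, assuming each document has already been assigned a duplicate-cluster id by an upstream deduplication step. The function names and the toy `cluster_ids` list are hypothetical, and power-of-two binning is only one convenient choice for a log-scale axis.

```python
from collections import Counter

def cluster_size_histogram(cluster_ids):
    """Map each cluster size to the number of clusters of that size."""
    sizes = Counter(cluster_ids)       # cluster id -> number of documents
    return Counter(sizes.values())     # cluster size -> number of clusters

def log2_binned(histogram):
    """Group sizes into bins [1, 2), [2, 4), [4, 8), ... keyed by the bin's lower edge."""
    bins = Counter()
    for size, count in histogram.items():
        bins[size.bit_length() - 1] += count  # floor(log2(size)) for size >= 1
    return {2 ** exponent: count for exponent, count in sorted(bins.items())}

# Hypothetical assignment of twelve documents to five clusters.
cluster_ids = ["a", "a", "a", "b", "c", "c", "d", "e", "e", "e", "e", "e"]
histogram = cluster_size_histogram(cluster_ids)

print(sorted(histogram.items()))  # [(1, 2), (2, 1), (3, 1), (5, 1)]
print(log2_binned(histogram))     # {1: 2, 2: 2, 4: 1}
```

Clusters of size 1 are unique documents, so a plot of duplicates alone would drop that first bin.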

Figure 7: Distribution of cluster sizes of duplicates in global dataset (log-log scale).

Figure 8: Distribution of cluster sizes of duplicates in DCLM (log-log scale).

Figure 9: Distribution of cluster sizes of duplicates in FineWeb-Edu2 (log-log scale).

Figure 10: Distribution of cluster sizes of duplicates in Zyda-1 (log-log scale).

Figure 11: Distribution of cluster sizes of duplicates in Dolma-CC (log-log scale).

Largest cluster in DCLM

Below is an example document from the largest cluster of duplicates (~1M documents) in DCLM (quality score 0.482627):

Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is Safe
Safe score: 1

The higher the number, the more dangerous the website. Any number higher than 1 means DANGER.

Positive votes:
Negative votes:
Vote Up Vote Down review

Have you had bad experience with Warn us, please!

Examples of varying quality score in a cluster of duplicates in DCLM

Below are a few documents from the same duplicate cluster in DCLM that received different quality scores. The quality score varies from ~0.2 to ~0.04.

Document ID: <urn:uuid:941f22c0-760e-4596-84fa-0b21eb92b8c4>

Quality score of: 0.19616

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally categorized as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jockey artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which clearly lies beyond the boundaries of each. Perhaps best described as simply "experimental," Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dicks (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running-styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Dicks left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize. ~ Sean Cooper, Rovi

Document ID: <urn:uuid:0df10da5-58b8-44d8-afcb-66aa73d1518b>

Quality score of: 0.091928

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally grouped in as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of the music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jocky artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which lay clearly beyond the boundaries of each. Perhaps best described as simply experimental, Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dick (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Elliot Dick left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize.
Sean Cooper, Rovi

More Rome

You may also like...

Document ID: <urn:uuid:4986ef09-3ee3-4e13-9084-7898aaf72aaf>

Quality score of: 0.072259

recent on-air advertisers

Now Playing

You Control the ...

Artist Snapshot:

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally grouped in as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of the music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jocky artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which lay clearly beyond the boundaries of each. Perhaps best described as simply experimental, Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dick (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Elliot Dick left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize. ~ Sean Cooper, RoviSean Cooper, Rovi
‍
More Rome

You may also like...

Document ID: <urn:uuid:1e0496a9-0116-418a-9aec-e65b1d20e709>

Quality score of: 0.0424

18 June 2015

ROME self titled 1996

by request

Artist Biography by

Thrill Jockey instrumental duo Rome are, like many of the acts on the Chicago-based independent label, generally categorized as loose adherents of "post-rock," a period-genre arising in the mid-'90s to refer to rock-based bands utilizing the instruments and structures of music in a non-traditionalist or otherwise heavily mutated fashion. Unlike other Thrill Jockey artists such as Tortoise and Trans-Am, however, Rome draw less obviously from the past, using instruments closely associated with dub (melodica, studio effects), ambient (synthesizers, found sounds), industrial (machine beats, abrasive sounds), and space music (soundtrack-y atmospherics), but fashioning from them a sound which clearly lies beyond the boundaries of each. Perhaps best described as simply "experimental," Rome formed in the early '90s as the trio of Rik Shaw (bass), Le Deuce (electronics), and Elliot Dicks (drums). Based in Chicago, their Thrill Jockey debut was a soupy collage of echoing drums, looping electronics, and deep, droning bass, with an overwhelmingly live feel (the band later divulged that much of the album was the product of studio jamming and leave-the-tape-running-styled improvisation). Benefiting from an early association with labelmates Tortoise as representing a new direction for American rock, Rome toured the U.S. and U.K. with the group (even before the album had been released), also appearing on the German Mille Plateaux label's tribute compilation to French philosopher Gilles Deleuze, In Memoriam. Although drummer Dicks left the group soon after the first album was released, Shaw and Deuce wasted no time with new material, releasing the "Beware Soul Snatchers" single within weeks of its appearance. An even denser slab of inboard studio trickery, "Soul Snatchers" was the clearest example to date of the group's evolving sound, though further recordings failed to materialize.
‍
1 Leaving Perdition 8:10
2 Intermodal 3:39
3 Lunar White 3:25
4 She's A Black Belt 3:14
5 Rohm 1:09
6 Radiolucence (Version) 5:31
7 Deepest Laws 14:14

No comments:

Reported scores underlined.

Pass@1 scores with greedy sampling.

Pass@1 scores with greedy sampling. Livebench 2024-11-25.
Bold: best score at the 1.5B scale with greedy sampling
*reported scores

Evaluations (reported scores underlined). All numbers are pass@1, estimated using n=16 samples.
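A sketch of how pass@1 might be estimated from n=16 samples per problem, assuming the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where c is the number of correct samples. For k=1 this reduces to c/n. The source does not state which estimator was used, and the per-problem counts below are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts of correct samples out of n=16 for four problems.
correct_counts = [16, 4, 0, 8]
scores = [pass_at_k(16, c, 1) for c in correct_counts]
benchmark_score = sum(scores) / len(scores)

print(scores)           # [1.0, 0.25, 0.0, 0.5]
print(benchmark_score)  # 0.4375
```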

Footnote: Training on the Eurus-2-RL dataset did not match the DeepScaleR math evaluation numbers, possibly because lower-quality synthetic math questions in NuminaMath-CoT provide a mixed training signal, or because the solvability filtering process with QwQ-preview reduces the difficulty of the dataset. Additionally, the relatively small percentage of code (5%) likely led to math dominating training at the expense of code performance. Training on domain-specific datasets and merging the resulting models appears to be a potential way to counteract this problem, as demonstrated with SFT in Light-R1.

Prompt #1

I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #2

The emperor's complexion did not change, remaining as still as a sculpture, and a touch of touching warmth flashed in his eyes. He deeply glanced at the loyal minister, and finally spoke: "Well, I will consider it again." His voice was low and firm, leaving a faint hint of helplessness and tenderness in the air.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #3

You don't even think to call me "Godfather." You come into my house on the day my daughter is to be married and you ask me to do murder - for money.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #4

Brave bakers boldly baked big batches of brownies in beautiful bakeries.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #5

Active artists always appreciate artistic achievements and applaud awesome artworks.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #6

I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right?

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #7

F one F two F four F eight H sixteen H thirty two H sixty four

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #8

Its chlorover. Like totally chlorover. Totally. Completely. Chlorover.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

Prompt #9

Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners.

Zonos

ElevenLabs

Cartesia

Fish Speech v1.5

We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Highlights

Zamba2-mini achieves SOTA evaluation benchmark performance and superior inference efficiency compared to models of a similar scale and larger such as Gemma-2B (Google), SmolLM-1.7B (Huggingface), OpenELM-1.1B (Apple), StableLM-1.6B (StabilityAI) and Phi-1.5 (Microsoft)
Zamba2-mini is extremely inference-efficient, achieving 1.67x faster time-to-first-token and a 23.3% reduction in memory overhead compared to Phi1.5-1.3B
We release the model weights open-source (Apache 2.0)

Zamba2-mini (1.2B) Efficiency

Model Quality:

The shared transformer block allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
Our 3 trillion token pre-training dataset, which is composed of a combination of Zyda and other openly-available datasets that are extensively filtered and deduplicated.
We have a separate "annealing" pre-training phase, which decays the learning rate over 100B very high-quality tokens.

Inference Efficiency:

Mamba2 blocks are extremely efficient, and have roughly 4 times the throughput of an equal-parameter transformer block.
Mamba blocks only have small hidden states to store and don't require a KV-cache, so we only need to store KV states for the invocations of the shared attention block.
We choose model sizings that are very amenable to parallelization on modern hardware (i.e. multiple streaming multiprocessors on GPUs, multiple cores on CPUs).

Due to these results, we believe Zamba2-mini offers a significant improvement over comparable small language models.

Zamba2-mini makes some architectural improvements over Zamba1-7B:

Mamba1 blocks have been replaced with Mamba2 blocks
We apply a LoRA projector to both shared attention and MLP block, which allows the network to specialize the shared layers at each invocation of the shared layer across depth
We added Rotary Position embeddings to the shared attention layers which we found slightly improved performance.

Our architecture also differs from Zamba2-2.7B by utilizing LoRAs on the shared attention layers and using only a single shared layer instead of the alternating scheme employed in Zamba2-2.7B.

Zamba2-mini (1.2B) Evaluations

Zamba2-mini (1.2B) Inference Performance

Zamba2-mini (1.2B) Architecture