We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to “heal” the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce the computational resources needed for finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
For full information please see our paper.
Large language models (LLMs) come with significant memory and computational demands, which pose challenges for efficiency and scalability; these challenges are especially important for on-device and local inference. Yet, there appears to be a substantial degree of redundancy in how the weights and internal representations of these models are utilized during inference. This redundancy indicates that it may be possible to optimize these models, reducing their memory and compute footprint without compromising performance. There are a number of techniques aimed at reducing these overheads for pre-trained language models. For instance, quantization can significantly reduce the memory footprint of parameters, and pruning, as well as distillation, can reduce both memory usage and computational demands.
In this work, we introduce a straightforward pruning strategy which we apply to open-weight pre-trained LLMs. Specifically, we devise a method that identifies the most effective layers to prune by analyzing the similarity between representations at different layers. For a given pruning fraction, we remove the layers with the highest similarities and then "heal" the pruning-induced mismatch through a small amount of fine-tuning using QLoRA. Our key finding is that we can prune a significant portion of the deepest layers in the models while maintaining minimal degradation in downstream performance on multiple-choice benchmarks. For instance, in Llama-2-70B, we can prune nearly half of the layers before observing a significant drop in performance.
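The block-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text only says that "similarity between representations" is used, so the angular distance between hidden states is an assumption here, and the names `angular_distance` and `best_block_to_prune` are hypothetical. The input `hidden_states` is assumed to be a list of arrays in which entry `l` holds the representations entering layer `l` (so a model with `L` layers provides `L + 1` entries).

```python
import numpy as np


def angular_distance(a, b):
    """Mean angular distance, scaled to [0, 1], between matching rows of a and b."""
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi))


def best_block_to_prune(hidden_states, n):
    """Return (start, distance) for the n-layer block whose removal changes
    the representation least, i.e. layers start .. start + n - 1.

    hidden_states[l] is the input to layer l, shape [num_samples, dim];
    the final entry is the output of the last layer.
    """
    num_layers = len(hidden_states) - 1
    if not 0 < n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    # Compare the input of layer l with the input of layer l + n.
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(num_layers - n + 1)
    ]
    start = int(np.argmin(distances))
    return start, distances[start]
```

In practice the hidden states would be collected from a forward pass over a sample of text; after removing the selected block, the healing step (QLoRA fine-tuning) is applied separately and is not shown here.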
The figure below illustrates the similarity measure across blocks of layers in different models:
Pruning is an effective technique for reducing both the memory footprint and latency in large language models (LLMs). Notably, in our approach, latency reduction scales linearly with the number of layers pruned. From an interpretability standpoint, the robustness of models to the removal of deeper layers in downstream tasks suggests that the shallower layers might play a crucial role in retaining core knowledge. Moreover, post-training optimization methods such as pruning, quantization, and speculative decoding operate independently of each other, indicating that combining these techniques could yield significant benefits in further reducing overhead.
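As a rough worked example of the linear scaling claim, the sketch below estimates the latency of a pruned model relative to the original. It assumes every layer costs the same and that any cost outside the layer stack (embeddings, output head) is captured by a single `overhead_fraction`; both the function name and that parameter are hypothetical and are not taken from the source.

```python
def relative_latency(total_layers, pruned_layers, overhead_fraction=0.0):
    """Estimated latency of the pruned model as a fraction of the original.

    Assumes equal per-layer cost; overhead_fraction is the share of the
    original latency spent outside the layer stack (an illustrative knob).
    """
    if not 0 <= pruned_layers <= total_layers:
        raise ValueError("pruned_layers must be between 0 and total_layers")
    if not 0.0 <= overhead_fraction <= 1.0:
        raise ValueError("overhead_fraction must be between 0 and 1")
    kept_fraction = (total_layers - pruned_layers) / total_layers
    return overhead_fraction + (1.0 - overhead_fraction) * kept_fraction


# Llama-2-70B has 80 decoder layers; pruning half of them halves the
# estimated latency when no fixed overhead is assumed.
print(relative_latency(80, 40))
```

Measured speedups depend on hardware, batch size, and sequence length, so this estimate is only a first approximation.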
We present histograms depicting the distribution of cluster sizes in all the datasets (see Fig. 7-11). Please note that all the figures are in log-log scale. We see a significant drop in the number of clusters starting from a size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster with a size close to 1 million documents, containing low-quality documents seemingly coming from advertisements (see Appendix).
We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
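The cluster-size histograms discussed above can be computed along the following lines. This is a minimal sketch under stated assumptions: each document is assumed to carry a cluster identifier produced by some prior deduplication step, and the function names `cluster_size_histogram` and `log_log_points` are hypothetical rather than taken from the authors' pipeline.

```python
import math
from collections import Counter


def cluster_size_histogram(cluster_ids):
    """Map each cluster size to the number of clusters of that size.

    cluster_ids holds one cluster identifier per document.
    """
    sizes = Counter(cluster_ids)        # cluster id -> number of documents
    counts = Counter(sizes.values())    # cluster size -> number of clusters
    return dict(sorted(counts.items()))


def log_log_points(histogram):
    """Convert (size, count) pairs to log10 coordinates for a log-log plot."""
    return [
        (math.log10(size), math.log10(count))
        for size, count in histogram.items()
    ]


# Toy example: one cluster of 3 documents, two of 2, and one singleton.
toy_ids = ["a", "a", "a", "b", "b", "c", "c", "d"]
print(cluster_size_histogram(toy_ids))
```

Whether singleton clusters are counted as duplicate clusters is a design choice; they are kept here for simplicity and can be filtered out before plotting.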
Below is an example document from the largest cluster of duplicates (~1M documents) in DCLM (quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents with different quality scores from DCLM, all coming from the same cluster of duplicates. The quality score ranges from ~0.2 down to ~0.04.
In this work, we introduce a straightforward pruning strategy that we apply to open-weight pre-trained LLMs. Specifically, we devise a method that identifies the most effective layers to prune by analyzing the similarity between representations at different layers. For a given pruning fraction, we remove the block of layers whose representations are most similar and then "heal" the pruning-induced mismatch through a small amount of fine-tuning using QLoRA. Our key finding is that we can prune a significant portion of the deepest layers in the models with minimal degradation in downstream performance on multiple-choice benchmarks. For instance, in Llama-2-70B, we can prune nearly half of the layers before observing a significant drop in performance.
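The block-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: it assumes the similarity is measured as an angular distance between the representation entering a layer and the representation `n` layers later, averaged over samples, and that hidden states have already been collected into one array. The function names, the array layout, and the choice of angular distance are all assumptions made for this example.

```python
import numpy as np

def angular_distance(a, b, eps=1e-8):
    """Angular distance in [0, 1] between matching rows of a and b.

    0 means the vectors point the same way, 1 means opposite directions.
    """
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    )
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def most_similar_block(hidden, n):
    """Find the block of n consecutive layers that changes representations least.

    hidden: array of shape (num_layers + 1, num_samples, dim), where hidden[l]
            is the input to layer l and hidden[-1] is the final layer's output
            (an assumed layout for this sketch).
    Returns (start, mean_distance): layers start .. start + n - 1 are the
    candidates for pruning.
    """
    num_layers = hidden.shape[0] - 1
    if not 0 < n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")

    # Compare the input of layer l with the input of layer l + n, i.e. the
    # representation before and after the candidate block.
    distances = [
        float(angular_distance(hidden[l], hidden[l + n]).mean())
        for l in range(num_layers - n + 1)
    ]
    start = int(np.argmin(distances))
    return start, distances[start]
```

Given a pruning fraction, `n` would be the corresponding number of layers, and the returned block is the one whose removal is expected to disturb the representations least. The subsequent QLoRA healing step is not shown here.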
The figure below illustrates the similarity measure across blocks of layers in different models: