LLaMA-2 and LLaMA-3 Models with SparseGPT-Based Model Compression and LoRA/QLoRA Integration: Efficient Large Language Model Optimization for Resource-Constrained Environments



Authors

  • Erke Arıbaş Istanbul Technical University
  • Melisa Güler Istanbul Technical University

Keywords:

SparseGPT, LoRA, QLoRA, Model Compression, Large Language Models

Abstract

Data is generated at ever-increasing scale, and so are the models trained on it. The growing size of large language models sharply increases storage requirements, memory usage, and inference latency, limiting their applicability on resource-constrained devices such as mobile and embedded systems. By 2025, global data volume is expected to reach 175 zettabytes, underscoring the need not only for efficient storage but also for effective model optimization techniques [1]. Growing model sizes and evolving computational requirements make model compression increasingly important, and the choice of compression algorithm plays a decisive role in balancing efficiency, speed, and accuracy: a well-chosen technique can raise the compression ratio and reduce processing time, whereas an unsuitable algorithm can move the result away from the target. LLaMA-2 and LLaMA-3, with billions of parameters, pose serious challenges on devices with limited storage and compute, so the compression technique must be selected with both the models' characteristics and their overall performance metrics and size in mind. In this study, SparseGPT is used to prune weights that contribute little to the output, reducing the model. Using SparseGPT-based pruning, up to 50–60% sparsity was applied to the LLaMA-2 and LLaMA-3 models by setting insignificant weights in the parameter space to zero, yielding significant memory savings without retraining. This approach shrinks the model while eliminating the costly requirement of full retraining. The pruned model is then customized with LoRA/QLoRA: instead of fully retraining the large model, task-specific adjustments are made through lightweight adapters.
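As a rough illustration of the unstructured-sparsity step described above, the sketch below zeroes out a chosen fraction of the smallest-magnitude weights in every linear layer. This is a simplified magnitude-based stand-in, not the SparseGPT algorithm itself, which prunes via a layer-wise, Hessian-based reconstruction [2]; the 50% default mirrors the sparsity range mentioned in the abstract.

```python
import torch

@torch.no_grad()
def zero_out_smallest_weights(model: torch.nn.Module, sparsity: float = 0.5) -> None:
    """Set the smallest-magnitude weights of every Linear layer to zero.

    Simplified magnitude pruning, shown only to illustrate unstructured
    sparsity; SparseGPT itself solves a layer-wise reconstruction problem
    using second-order information rather than raw weight magnitudes.
    """
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            weight = module.weight.data
            k = int(weight.numel() * sparsity)   # how many weights to drop
            if k == 0:
                continue
            # The k-th smallest absolute value becomes the pruning threshold
            threshold = weight.abs().flatten().kthvalue(k).values
            weight.mul_((weight.abs() > threshold).to(weight.dtype))
```

Similarly, the LoRA/QLoRA customization step can be sketched with the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, LoRA rank, and target modules below are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Model id is an assumption for illustration; any LLaMA-2/3 checkpoint works
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Low-rank adapters on the attention projections; rank/alpha are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Fine-tuning then proceeds with a standard training loop or the transformers Trainer, updating only the adapter parameters while the quantized base weights remain frozen.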


Author Biographies

Erke Arıbaş, Istanbul Technical University

Türkiye

Melisa Güler, Istanbul Technical University

Türkiye

References

IDC, Worldwide Global DataSphere Forecast, 2021–2025. IDC, 2021.

E. Frantar and D. Alistarh, “SparseGPT: Massive language models can be accurately pruned in one-shot,” arXiv preprint arXiv:2301.00774, 2023. [Online]. Available: https://arxiv.org/abs/2301.00774

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” arXiv preprint arXiv:2305.14314, 2023. [Online]. Available: https://arxiv.org/abs/2305.14314

Meta AI, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. [Online]. Available: https://arxiv.org/abs/2302.13971

S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Representations (ICLR), 2016.

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. Int. Conf. Learn. Representations (ICLR), 2021.

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, et al., “LLaMA 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [Online]. Available: https://arxiv.org/abs/2307.09288

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165

E. Frantar, S. P. Singh, and D. Alistarh, “Optimal brain compression: A framework for accurate post-training quantization and pruning,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022.


Published

2025-09-08

How to Cite

Arıbaş, E., & Güler, M. (2025). LLaMA-2 and LLaMA-3 Models with SparseGPT-Based Model Compression and LoRA/QLoRA Integration: Efficient Large Language Model Optimization for Resource-Constrained Environments. International Journal of Advanced Natural Sciences and Engineering Researches, 9(9), 27–32. Retrieved from https://as-proceeding.com/index.php/ijanser/article/view/2810
