LLaMA3-Quantization is the official implementation of our paper "How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study" (PDF). Created by researchers from The University of Hong Kong, Beihang University, and ETH Zürich.
Meta's LLaMa family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released LLaMa3 models achieve impressive performance across various tasks thanks to super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMa3's capabilities when quantized to low bit-widths. This exploration can reveal new insights and challenges for low-bit quantization of LLaMa3 and other forthcoming LLMs, especially in addressing the performance degradation that LLM compression suffers from. Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMa3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMa3's low-bit quantization performance. Our experimental results indicate that LLaMa3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the significant performance gap at low bit-widths that needs to be bridged in future developments. We expect this empirical study to prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they are practical. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization and the quantized LLaMa3 models are released at https://huggingface.co/LLMQ.
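To give a feel for why accuracy drops as bit-width shrinks, here is a minimal sketch of plain round-to-nearest uniform quantization applied to random Gaussian values. It is a toy illustration only: it is not any of the methods evaluated in the paper, the function names are our own, and random values stand in for real model weights.

```python
import random

def quantize_rtn(weights, bits):
    """Asymmetric uniform round-to-nearest quantization of a list of floats."""
    levels = 2 ** bits - 1  # number of steps between the min and max value
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    # Map each weight to an integer code in [0, levels], then back to a float.
    codes = [round((w - w_min) / scale) for w in weights]
    return [w_min + c * scale for c in codes]

def mean_squared_error(a, b):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Stand-in for a weight tensor (assumption: roughly Gaussian values).
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_rtn(weights, bits))
    for bits in (8, 4, 2, 1)
}
for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

The reconstruction error grows as the number of bits decreases, which is the basic effect that the evaluated quantization methods try to mitigate with calibration, reordering, mixed precision, or finetuning.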
We provide full scripts to evaluate various quantization methods in ./scripts/. We use LLaMa-3-8B with the IR-QLoRA method as an example here:
python main.py \
--model meta-llama/Meta-Llama-3-8B \
--peft LLMQ/LLaMA-3-8B-IR-QLoRA \
--tau_range 0.1 --tau_n 100 --blocksize 256 \
--epochs 0 \
--output_dir ./log/llama-3-8b-irqlora \
--wbits 4 \
--tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande
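If you want to launch this evaluation from Python (for example, to loop over several task lists), the command can be assembled as an argument list. This is a minimal sketch: the helper name build_eval_command is hypothetical and not part of the repository, and it only uses the flags and values shown in the example command.

```python
import shlex

def build_eval_command(model, peft, wbits, tasks, output_dir,
                       tau_range=0.1, tau_n=100, blocksize=256, epochs=0):
    """Return the argument list for the example evaluation command."""
    return [
        "python", "main.py",
        "--model", model,
        "--peft", peft,
        "--tau_range", str(tau_range),
        "--tau_n", str(tau_n),
        "--blocksize", str(blocksize),
        "--epochs", str(epochs),
        "--output_dir", output_dir,
        "--wbits", str(wbits),
        "--tasks", ",".join(tasks),  # main.py takes a comma-separated list
    ]

cmd = build_eval_command(
    model="meta-llama/Meta-Llama-3-8B",
    peft="LLMQ/LLaMA-3-8B-IR-QLoRA",
    wbits=4,
    tasks=["piqa", "arc_easy", "arc_challenge", "hellaswag", "winogrande"],
    output_dir="./log/llama-3-8b-irqlora",
)

# Print the command as a single shell-safe string for inspection.
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) from the repository root. Passing a list avoids shell quoting problems such as two flags running together.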
- Evaluation results of post-training quantization on LLAMA3-8B model
- Evaluation results of post-training quantization on LLAMA3-70B model
GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
RPTQ: Reorder-Based Post-Training Quantization for Large Language Models
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
PB-LLM: Partially Binarized Large Language Models
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
QLoRA: Efficient Finetuning of Quantized LLMs
IR-QLoRA: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention