QLoRA: Efficient Finetuning of Quantized LLMs

The key innovation behind QLoRA is its ability to backpropagate gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). This cuts memory usage enough to finetune a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit finetuning performance. The resulting model family, named Guanaco, outperforms all previously released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level after only 24 hours of finetuning on a single GPU.
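
To make this concrete, here is a minimal sketch of a QLoRA-style setup using the Hugging Face transformers, peft, and bitsandbytes libraries. The base model name and the LoRA hyperparameters are illustrative assumptions, not the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as introduced in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the pretrained model with frozen, 4-bit quantized weights.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example base model; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and attach trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the LoRA weights require gradients
```

From here, the wrapped model can be passed to an ordinary training loop or trainer; gradients flow through the frozen 4-bit weights into the adapters alone.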

Read More

LoRA: Low-Rank Adaptation of Large Language Models

The core idea behind LoRA is to freeze the pretrained model weights and introduce trainable rank-decomposition matrices into each layer of the Transformer architecture. Because the update to a weight matrix is constrained to a low-rank product of two small matrices, only those matrices need gradients and optimizer state, which makes adaptation far more efficient and cost-effective. Compared to fine-tuning GPT-3 175B with Adam, LoRA reduces the number of trainable parameters by a factor of 10,000 and GPU memory requirements by a factor of 3.
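
A minimal PyTorch sketch of the idea follows; the class name LoRALinear, the rank, and the alpha value are illustrative choices, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A starts small and random, B starts at zero, so the adapted layer
        # initially computes exactly the same function as the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Example: wrap a 768 -> 768 projection with a rank-8 adapter.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
out = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12288, versus 768 * 768 = 589824 frozen weights
```

The parameter count at the end illustrates the savings: with rank 8, the adapter trains roughly 2% as many weights as the frozen layer it augments, and the gap widens as layers grow.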

Read More