LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
Microsoft Research
Problem Statement
- Fine-tuning is difficult for large language models – memory constraints
Contribution: Propose a low-rank adaptation technique for fine-tuning large language models
Approach
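LoRA freezes the pretrained weight W0 and trains only a low-rank update BA added on top of it, so far fewer parameters are updated than in full fine-tuning. A minimal PyTorch sketch of this idea, assuming a simple wrapper around an existing linear layer (the class name, initialization, and r/alpha values are illustrative, not the official loralib implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze W0
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # d_out x r, zero-init so the update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0 x + (B A) x * scaling
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Example: wrap a projection layer; only A and B remain trainable
layer = LoRALinear(nn.Linear(768, 768), r=8)
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```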
Results: GLUE scores
Results: Accuracy
LONGLORA: EFFICIENT FINE-TUNING OF LONG-CONTEXT LARGE LANGUAGE MODELS
- CUHK
- MIT
Problem Statement
- LLaMA, BERT, GPT – all trained with a fixed context size
- This makes them less effective for long documents
- Training from scratch on long sequences is expensive (self-attention cost grows quadratically with sequence length)
Intuition:
Although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done with sparse local attention.
Contributions
Previous approaches
If we have an LLM with a 2K context length but the sequence length is 8K: use multiple short attention groups
Text =[1,2,3,…..7999,8000]
Group 1: [1,2,….,2000]
Group 2: [2001,2002,….,4000]
Group 3: [4001,4002,….,6000]
Group 4: [6001,6002,….,8000]
No communication between groups (this grouping is sketched below)
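A minimal sketch of this grouped (block-local) attention, assuming queries, keys, and values laid out as (batch, sequence, heads, head_dim); the function name and tensor layout are illustrative assumptions:

```python
import torch

def grouped_attention(q, k, v, group_size):
    """Attention computed independently within non-overlapping token groups.

    q, k, v: (batch, seq_len, num_heads, head_dim); seq_len must be a multiple
    of group_size. Tokens in different groups never attend to each other, so
    no information flows across group boundaries.
    """
    B, N, H, D = q.shape
    G = group_size
    # reshape so each group of G tokens becomes its own attention problem
    q, k, v = (t.view(B, N // G, G, H, D).transpose(2, 3) for t in (q, k, v))
    attn = torch.softmax(q @ k.transpose(-2, -1) / D**0.5, dim=-1)
    return (attn @ v).transpose(2, 3).reshape(B, N, H, D)
```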
Motivation
If we have an LLM with a 2K context length but the sequence length is 8K: use multiple short attention groups, shifted by half a group
Text =[1,2,3,…..7999,8000]
Group 1: [1001,1002,….,3000]
Group 2: [3001,3002,….,5000]
Group 3: [5001,5002,….,7000]
Group 4: [7001,7002,…8000,1,2,…..,1000]
Some communication between groups (the shifted grouping is sketched below)
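A small sketch of how the shifted grouping can be produced, assuming the 8,000-token sequence and 2,000-token groups from the example above; rolling the token positions by half a group before grouping reproduces the listed groups, with the last group wrapping around to the first 1,000 positions:

```python
import torch

seq_len, group_size = 8000, 2000
positions = torch.arange(1, seq_len + 1)                  # token positions 1..8000
# shift by half a group so the new groups straddle the old group boundaries
shifted = torch.roll(positions, shifts=-(group_size // 2))
groups = shifted.view(-1, group_size)                     # 4 groups of 2000 positions
print(groups[0][:3], groups[0][-1])   # tensor([1001, 1002, 1003]) tensor(3000)
print(groups[3][:3], groups[3][-1])   # tensor([7001, 7002, 7003]) tensor(1000)
```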
S2-Attn (Shifted Sparse Attention)
- Full attention is computed within each group; half of the attention heads use the original (unshifted) grouping
- Information flows between groups by shifting the tokens in the other half of the heads by half the group size
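A sketch of S2-Attn along the lines of the pseudocode in the LongLoRA paper, assuming packed qkv of shape (batch, seq, 3, heads, head_dim); the function name and layout are assumptions, not the reference code. Half of the heads attend within the original groups, the other half within groups shifted by half the group size, and the shift is rolled back after attention:

```python
import torch

def s2_attn(qkv, group_size):
    """Shifted sparse attention (S2-Attn) sketch.

    qkv: (B, N, 3, H, D) packed queries, keys, and values; N is a multiple of
    group_size and H is even. Half of the heads attend within the original
    token groups, the other half within groups shifted by half the group
    size, so information can flow across group boundaries.
    """
    B, N, _, H, D = qkv.shape
    G = group_size
    # key step 1: roll the second half of the heads by -G/2 along the sequence
    first, second = qkv.chunk(2, dim=3)
    qkv = torch.cat((first, second.roll(-(G // 2), dims=1)), dim=3)
    # standard attention, computed independently inside each group of G tokens
    q, k, v = qkv.view(B, N // G, G, 3, H, D).permute(3, 0, 1, 4, 2, 5)
    attn = torch.softmax(q @ k.transpose(-2, -1) / D**0.5, dim=-1)
    out = (attn @ v).permute(0, 1, 3, 2, 4).reshape(B, N, H, D)
    # key step 2: roll the shifted heads back so positions line up again
    first, second = out.chunk(2, dim=2)
    return torch.cat((first, second.roll(G // 2, dims=1)), dim=2)

# toy usage: 2 sequences of 8 tokens, group size 4, 4 heads of dim 16
qkv = torch.randn(2, 8, 3, 4, 16)
print(s2_attn(qkv, group_size=4).shape)  # torch.Size([2, 8, 4, 16])
```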
LongLoRA
LongLoRA = S2-Attn + LoRA (with the embedding and normalization layers also made trainable)
Advantages
Results