I AM YOUR LOYAL SERVANT, A PROUD PLUMBER! PRAISE THE OMNISSIAH!
Welcome to TDM’s Vault.
Trying my best (to keep up with AI literature), here’s the proof that’ll reside in basilisk’s hive mind.
This is honestly just a stupid log. Don’t expect any structure out of it. Meant mostly to shame myself into studying more to avoid a drop in social credit.
Publishing this doc btw is me shamelessly stealing yacine’s idea.
You can also ping me on Twitter anytime if you want to build something fun or just chat about which AI startup has the most GPUs - https://x.com/cto_junior or mail me at cto.junioor@gmail.com
HuggingGPT (could’ve chosen a better name, linkedin tier)
NOOO, THEY AUTOMATED L3s (KINDA!)
Can The foundation be just an LLM? If only Hari Seldon read this paper
DECKARD - RL Agent that dreams
Training LLMs using AI generated dialogues
Automating Data Analysts [By Microsoft(™)]
(FLARE) Active Retrieval Augmented Generation
Hack to make inference faster (by HuggingFace)
Yes your models can memorize exact stuff
Voyager [Diamond ranked AI Minecraft player]
Activation-aware Weight Quantisation (AWQ)
SpQR (Sparse Quantised Representation)
SOTA document bender for your company QA
Insane alpha drop from kaiokendev
Skinny dip into GGML code base
How to check fine tuning datasets’ quality?
DPO (Direct Preference Optimization)
Symbol Rank ( for coding LLMs)
Scaling S3 is not easy [Not related to ML but also related to AI cause all data is in S3]
MoE (by Deepmind) (It’s soft not sparse)
Estimate LLM Flops and Memory requirement
How to reduce KV cache mem usage?
Ok, I am going to become Vector DB expert this week
Mixture of Experts: PEFT edition by Cohere
Generative Recommenders - Cool paper by Google
Fusing Modalities - Chimera by Meta
IMPORTANT INTERPRETABILITY PAPER BY ANTHROPIC
It’s not AGI (it’s just your data)
Insane ML Notes on Twitter with Q&A
Stable Diffusion Turbo (or How to distill a diffusion model 101)
I can’t hear the MUSIC* !!!!!!! NEEEED TO GET BETTTTTTTER!!!
Mamba - faster architecture (Reading cause Tri Dao is author)
Use smol models to train large models faster
LLM Paper from Apple?? : That’s a rare sight
Multimodal paper from Apple???
Amazing paper to Learn about Dingboard
TDM edge Multimodal arc (I blame Vik)
Embarrassing myself publicly arc (PHOTOMAKER)
ILYA’s READING LIST (For getting up to speed on today’s architectures)
Stream Diffusion - Brrrrr ImageGen at 100FPS
MLLM-Guided Image Editing (MGIE)
Generalising Length of Transformers
Fashion Diffusion (Make your waifu dress in Zara)
Another Apple LLM (this time it’s multimodal)
Quiet-Star (Is it really the fabled openai algo, nope)
Transformers for time series (truly retarded)
MyVLM (Shitty Name only Snapchat can think of)
How to create a FAQ dataset from your company docs?
REMOVE BACKGROUND OSS MODEL LFGGGG!!!
https://huggingface.co/schirrmacher/ormbg
Flux architecture (stolen from reddit)
Loopy - Make images speak and sing
ColPali - Multimodal RAG made simpler and faster
Janus - Deepseek goes multimodal
TDM’s reasoning arc - must increase AGI IQ by 20 basis points
Which LLM layers are important?
Allow LLMs to explore-exploit better
Karpathy’s Presentation in MS Build 2023
GEMINI PRO VS GPT-4V comparison
Answers to Stupid questions (mostly for me)
Need to play around with AutoGPT and more agents. Mfw it deletes all the pepe memes in my mac.
https://www.reddit.com/r/ChatGPT/comments/12diapw/gpt4_week_3_chatbots_are_yesterdays_news_ai/
(Moving too fast, need to catch up)
Still haven’t tried Toolformer and ReAct, they are now available in LangChain
Should be pretty easy (goddamn tho I really don’t want to write YAML configs)
https://arxiv.org/pdf/2303.17580.pdf
Seems to be the end of NLPcels. Ideally, could have been done using NLP by tracking the question tokens I guess?
I still don’t fully understand how LLMs predict the tool/model to be used accurately, how do we control the hallucination there, is setting temperature = 0 the solution for everything?
https://arxiv.org/abs/2303.16634
Some interesting stuff in that paper, mostly related to how they used logprobs to do scoring rather than absolute scores
Also, for GPT-4, calculated logprobs manually
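A minimal sketch of the logprob scoring idea, assuming you can read the logprob of each candidate score token ("1" to "5") at the rating position. The function name and the example numbers are made up for illustration, not taken from the paper.

```python
import math

def expected_score(logprobs):
    """Probability-weighted score from the logprobs of each candidate score token."""
    probs = {int(tok): math.exp(lp) for tok, lp in logprobs.items()}
    total = sum(probs.values())  # renormalise over the score tokens only
    return sum(score * p / total for score, p in probs.items())

# Hypothetical logprobs for the tokens "1".."5" at the rating position.
example = {
    "1": math.log(0.05),
    "2": math.log(0.10),
    "3": math.log(0.20),
    "4": math.log(0.40),
    "5": math.log(0.25),
}
print(round(expected_score(example), 2))
```

You get a continuous score instead of a pile of ties on the same integer, which is the whole point.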
Works quite well in mac
Only issue is I had to download 2 separate things (1 weights and 1 executable) from two different places
Terminal interface quite good, can pipe input/output as well
Result though not really great (Damn these days billion is not enough, we need to enter the trillion era fr).
Should I finetune this over some OSS docs??? The train.py script seems easy peasy, but setting it up on colab might give me nightmares.
LLM as Terminal commands like jq (Brain explode jpg)
Actually, I am just going to lora the fuck out of GPT4All
Not going well honestly, too many errors
Finally, at least it ran when I gave up on accelerate and just ran it using plain python3 train.py
Still failed while trying to slice 0-length tensor, seems to be related to tokenizer returning empty
Btw, this transformers submodule looks interesting, need to check it out later hehe
https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf
So basically it takes 8 hours on A100 to finetune the LoRA (hmm, and also $800), not worth it for me honestly (a noob colab user)
Tried running the finetuning in colab with the following configs - https://github.com/nomic-ai/gpt4all/issues/108
Dataset taken from - https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations_with_p3/blob/main/data.jsonl
Disabled wandb logging (poor guy, no account)
Running into some tokenization errors with zero length tensors.
OK, LFG!!!, It is working, I am able to finetune. Seems like you need source column as well along with prompt and response
Seems like it is not utilizing GPU at all?? Memory usage seems 0. Only running on CPU
WTF. This is when running with python3 train.py
Running accelerate launch doesn’t work at all in colab. Saw too many errors on github as well with no clear solution.
Using the P3 removed dataset now (had to fucking convert parquet to jsonl)
Let’s see if it works
So turns out I am dumb (obviously)
You can just use the parquet dataset with P3 removed from the hugging face (https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations/tree/main/data). The code already supports that. Let’s see if it works this time
Naah Still fails.
I want to figure out this generative agent shit for sure. Like Agents are literally the next step.
LangChain is also moving quite fast on this.
I clearly know in principle how Vector DBs work.
But I still don’t understand how they solve the performance issue of searching through all embeddings for nearest neighbours (that’s cause I am bad at maths)
Pretty sure there’s some heuristic from which you can select a smaller set of segments/files containing the embeddings that can fit into the memory.
OK, looked at the ChromaDB codebase and these mfs are just using wrappers on DuckDB and hnswlib.
DuckDB stores the data and hnswlib creates the index for similarity search.
Mf hnswlib codebase is in C++. Time to paste it into chatGPT to understand wtf is going on.
So for searching at least, it is simply doing a BFS-style walk on a graph and adding results to a priority queue
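Rough sketch of that search loop on a single graph layer, assuming the neighbour lists already exist. Real HNSW has multiple layers and a smarter build step; the names here (`greedy_graph_search`, `ef`) are mine, not hnswlib's API.

```python
import heapq
import math

def greedy_graph_search(vectors, neighbours, query, entry, ef=2):
    """Best-first search over one layer of a proximity graph."""
    visited = {entry}
    d0 = math.dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    best = [(-d0, entry)]        # max-heap via negation: current top-ef results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # closest candidate is worse than the worst kept result
        for nb in neighbours[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(vectors[nb], query)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst result
    return sorted((-d, n) for d, n in best)

# Five points on a line, each linked to its immediate neighbours.
vectors = {i: (float(i), 0.0) for i in range(5)}
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print([n for _, n in greedy_graph_search(vectors, neighbours, (3.2, 0.0), entry=0)])
```

The search only touches nodes reachable from the entry point, which is why it never has to scan every embedding.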
Open questions :
At least now I understand why they ask for the similarity measure at the time of creating an index in Pinecone.
Also benchmarks for most nearest neighbour algos - https://github.com/erikbern/ann-benchmarks
OK, so to create a scalable vector DB, I guess there’s no solution other than to use GPU (since vector search is compute heavy)
This is def outdated as it refers to P100 which is quite old (Apparently a lot of it is still same till Ampere, but changed significantly in Hopper)
So they follow SIMT instead of SIMD. Each of the 32 lanes/threads in a warp executes the same instruction but on different data. It’s kinda like a turbocharged SIMD.
Much better diagram from - https://web.ecs.syr.edu/~ffiorett/files/papers/padl14.pdf
I should learn GPU programming fr
Some alpha on GPU in this similarity search paper (https://arxiv.org/pdf/1702.08734.pdf)
This guy is sooooooo correct. That’s literally what I am facing.
I need to push a million docs in context.
Map-reduce summarisation is losing sooo much info.
Plus making too many GPT calls is just sloooow (mf GPT chat APIs better support batching soon)
This is genius. Mfw you can call the vectorDB at the end of the chain rather than the start.
We literally took a hallucinated answer and used it to retrieve the actual text.
Issue is though, what if GPT can’t even hallucinate, like it literally outputs one line?
One more drawback is that you have to compute embeddings of large answers every time instead of a shorter query.
So two performance bottlenecks.
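Toy sketch of the flow: hallucinate an answer first, embed that, then retrieve. `fake_llm` and the bag-of-words `embed` are stand-ins I made up so this runs offline. A real setup would call an actual LLM and an embedding model.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real setup would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(question):
    """Stand-in for the LLM call: returns a plausible but unverified answer."""
    return "the warranty covers battery replacement for two years"

def hyde_retrieve(question, docs, k=1):
    hypothetical = fake_llm(question)   # step 1: hallucinate an answer
    q_vec = embed(hypothetical)         # step 2: embed the answer, not the question
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:k]                   # step 3: return the real text closest to it

docs = [
    "battery replacement is covered by the warranty for two years from purchase",
    "our office is closed on public holidays",
    "shipping takes five business days",
]
print(hyde_retrieve("how long am I covered if the cell dies?", docs))
```

Note the question shares almost no words with the right doc, but the hallucinated answer does. That is the trick.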
I guess I should give shoggoth lang (the weird compression by GPT) a try. It’s lossy but might help here cause we are retrieving from the database.
Paper - https://arxiv.org/pdf/2212.10496.pdf
We are actually simulating society now via LLMs.
Again just plugging in memory + some prompt engineering to reason works wonders.
https://arxiv.org/pdf/2304.03442.pdf
This agent might just be better than me cause I can’t remember who the fuck I ran into at the park and clearly not what they were working on (unless they are working on some cool shit).
Btw this runs on GPT-3.5-TURBO and not GPT-4.
Just switching the model will make this soooooo much more impressive.
So for simulating memory, they are simply storing whatever the agent observed in some DB. At the point of interaction, these observations are fetched and then rated on recency, importance (by directly calling the LLM to assign a score) and relevance to the current situation. Take weighted sum, select as many memories as can fit in context window of LLM and hit it.
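Sketch of that retrieval step, under my own assumptions: equal weights, an exponential recency decay, importance scaled to 0-1, and a word count as the token counter. The `Memory` class and all parameter values are illustrative, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    hours_ago: float   # time since the memory was last accessed
    importance: int    # 1-10, rated by the LLM when the memory was created
    relevance: float   # similarity to the current situation, in [0, 1]

def score(m, w_recency=1.0, w_importance=1.0, w_relevance=1.0, decay=0.995):
    recency = decay ** m.hours_ago   # older memories fade exponentially
    return w_recency * recency + w_importance * m.importance / 10 + w_relevance * m.relevance

def retrieve(memories, token_budget, count_tokens=lambda t: len(t.split())):
    """Take the highest-scoring memories that still fit in the context budget."""
    picked, used = [], 0
    for m in sorted(memories, key=score, reverse=True):
        cost = count_tokens(m.text)
        if used + cost > token_budget:
            continue
        picked.append(m)
        used += cost
    return picked

memories = [
    Memory("brushed teeth this morning", 1, 1, 0.1),
    Memory("Klaus is researching gentrification for his thesis", 30, 6, 0.9),
    Memory("broke up with partner last month", 700, 9, 0.2),
]
for m in retrieve(memories, token_budget=11):
    print(m.text)
```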
I feel that this type of scoring via LLM doesn’t work really well though. Same was said in that using GPT as labeller paper.
On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory.
Memory: buying groceries at The Willows Market and Pharmacy
Rating: <fill in>
This prompt returns an integer value of 2 for “cleaning up the room” and 8 for “asking your crush out on a date.” The importance score is generated at the time the memory object is created.
^ Brain explode JPG (Honestly though IRL, it’ll be too slow to be practical in large systems due to sheer number of LLM calls but that should be resolved in like 6 months)
So for Reflection they are using memory as well. Just gather recent observations, ask LLMs what questions should I ask based on these observations and then ask LLM again to answer those questions. It is only triggered btw when importance score of observations goes beyond a certain threshold. Generally that happens twice-thrice a day (man, this is like a human lol)
So after solving for Observation and Reflection, the final step is Planning. Mainly how to avoid duplicate actions like eating food twice in the same hour as well as incoherent actions like playing basketball in a swimming pool.
Again, they are using memory for it. Just store the current plan along with Observation and reflection. First ask LLM to make a broad plan for the day, then using LLM start to break that plan into chunks of 1 hour. Then break it down further into chunks of 15 minutes. This solves for non-duplicate actions.
Secondly, you can also make changes to the plan based on new observations. Again this can be done by asking LLM questions like
Final complete architecture.
I feel like the number of things that become possible if LLM inference takes single-digit millis is insane. We might actually be able to simulate brain functions. FAAASTEEEER COMPUTTTE! QUANTISATION!!!
Ok so this paper had some maths equations
But turns out those were just API calls represented as a(i) -> r, where a is api, i is input and r is output
So they are first taking a clean dataset, then selecting the places where we can insert some API call
This is being done by asking an LLM
They then filter the API calls, keeping only the ones where the output of the API is actually helpful. They do this by comparing the CE loss on the following tokens with the API result provided vs. without it. Only if providing the result lowers the loss by more than a configured threshold tau is the API call added to the input. This is only during training.
Now they finetune an LLM on this context + API call data
And voila
You now have an LLM that can predict API calls
During inference, you simply watch for the → token
As soon as you get it, it means the preceding text was an API call and now you actually need to do the work and fill in the result before predicting the rest of the tokens.
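A simplified sketch of the filtering step from the training pipeline above. I am assuming you already have the probabilities the model assigns to the tokens after the candidate position, once with the API result in context and once without. The paper's actual loss is weighted and also compares against making no call at all, so treat `keep_api_call` and the `tau` value as illustrative.

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood of the tokens that follow the candidate position."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def keep_api_call(p_with_result, p_without_result, tau=0.5):
    """Keep the call only if seeing its result lowers the loss by at least tau."""
    gain = cross_entropy(p_without_result) - cross_entropy(p_with_result)
    return gain >= tau

# Calculator result makes the next tokens much more likely: keep it.
print(keep_api_call([0.9, 0.8, 0.9], [0.3, 0.4, 0.5]))
# Result barely changes anything: drop it.
print(keep_api_call([0.5, 0.5], [0.45, 0.5]))
```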
Pretty simple paper honestly, but finetuning is pain in the ass. Can LoRA be applied here during training?
One of the most popular papers on getting LLM to be more humanlike
Ngl, too many words for what seems like prompt engineering
Should skip directly to appendix
Yeah, I didn’t find a lot of value here. The results also don’t seem that big of an improvement.
Gonna stick to a few-shot prompting for most tasks.
Link - https://arxiv.org/pdf/2304.04487.pdf
Basically why are you generating tokens one by one anon when you can simply copy a lot of it from the context in the prompt itself
Like when you do context based or history based Question answering or something else, the LLM will (as expected) generate a lot of stuff as it is mentioned in the context.
So you can simply change the LLM inference design so that when the last n generated tokens match some text in the input, we take the next k tokens from the input and tentatively append them to the output.
Then we ask the LLM to check all k copied tokens in a single forward pass and remove everything from the first token it disagrees with.
So generating those k tokens takes a couple of steps instead of k.
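Toy sketch of the copy-then-verify loop. `next_token` is a fake greedy model I made up (it just recites a fixed target), and the inner verification loop stands in for what a real model would do in one batched forward pass.

```python
def find_copy_span(context, output, n, k):
    """If the last n output tokens occur in the context, return the k tokens that follow."""
    if len(output) < n:
        return []
    tail = output[-n:]
    for i in range(len(context) - n + 1):
        if context[i:i + n] == tail:
            return context[i + n:i + n + k]
    return []

def decode_with_copy(next_token, context, prompt, max_new, n=2, k=4):
    out, steps = [], 0
    while len(out) < max_new:
        draft = find_copy_span(context, out, n, k)
        steps += 1   # one (batched) forward pass checks the draft and yields one new token
        accepted = []
        for tok in draft:
            if next_token(prompt + out + accepted) != tok:
                break             # first mismatch: throw away the rest of the draft
            accepted.append(tok)
        out += accepted
        out.append(next_token(prompt + out))
    return out[:max_new], steps

context = "the quick brown fox jumps over the lazy dog".split()
target = "answer : the quick brown fox jumps over the lazy dog".split()
prompt = context + ["?"]

def next_token(seq):
    """Fake greedy model: recites `target` after the prompt."""
    i = len(seq) - len(prompt)
    return target[i] if i < len(target) else "<eos>"

tokens, steps = decode_with_copy(next_token, context, prompt, max_new=len(target))
print(tokens == target, steps, "steps instead of", len(target))
```

The output is identical to plain greedy decoding. Only the number of model steps changes.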
[BTW, most of these papers have hidden alpha in appendix where they tell their prompt templates]
Honestly the pace of LLM research is so fast that a lot of these techniques can be simply made better
By plugging in the techniques from the other papers published in last month or so.
E.g. you can improve the generative agent paper by using a better scoring technique from GPTEval paper (take weighted sum of logprobs instead of absolute rank)
Also, I might need to learn Linear algebra fr now to understand some of these papers better (but let’s see how much I can avoid doing that first). Luckily most of them have just basic error loss or probabilities equations for now.
Some more papers I need to read (related to making inference faster) -
4-bit quantisation - https://arxiv.org/abs/2212.09720 (not gonna have to pretend that I know about it while discussing llama.cpp)
SparseGPT - https://arxiv.org/abs/2301.00774 (mostly related to pruning I guess)
Compression - https://arxiv.org/abs/2002.02925
Distillation (of attention) - https://arxiv.org/abs/2002.02925
It might very well be already better than me. Only issue with such systems again is the inference speed.
But then again, I play Apex or Overwatch while debugging an issue so we are both inefficient. (Codex = code-davinci-002)
Hmm, I remember reading about this high-level semantic idea in the Generative Agent paper as well (specifically in the planning phase).
Basically first generate a good enough summary of the code, break it down into chunks and then recursively generate a finer summary of each chunk.
This way the program doesn’t hallucinate much or create duplicate summaries.
Not sure about their few-shot approach cause the solution is mentioned in the prompts itself.
Idea is interesting though but isn’t that exactly what toolformer/plugin is doing though?
Handing over the complex tasks to plugins and then using their output.
I guess toolformer can’t operate in the pipelined fashion that is being done here. Not really sure though.
Following is much better way IMO VVVV
Reminded me why I hate being in this limbo of US and UK english. Like should I use sation or zation
FUCK OFFF!
Anyways,
Quantisation is basically mapping a value to a finite set F
So e.g. you can map 32-bit floating point values to a set of just 16 values (i.e. 4-bit integers)
For the most basic quantisation the technique is quite simple
Implementation in llama.cpp
Seems simple but there are still a few gaps in my understanding.
Also, why the fuck is markdown code block not working here.
```
#define QK 32  // number of weights per quantisation block

// Quantise n floats (rows of length k) into q4_0 blocks and histogram the 4-bit codes.
size_t ggml_quantize_q4_0(const float * src, void * dst, int n, int k, int64_t * hist) {
    assert(k % QK == 0);
    const int nb = k / QK;
    for (int j = 0; j < n; j += k) {
        block_q4_0 * restrict y = (block_q4_0 *)dst + j/QK;
        quantize_row_q4_0_reference(src + j, y, k);
        for (int i = 0; i < nb; i++) {
            for (int l = 0; l < QK; l += 2) {
                const uint8_t vi0 = y[i].qs[l/2] & 0xF;  // low nibble
                const uint8_t vi1 = y[i].qs[l/2] >> 4;   // high nibble
                hist[vi0]++;
                hist[vi1]++;
            }
        }
    }
    return (n/QK*sizeof(block_q4_0));
}

static void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;
    uint8_t pp[QK/2];
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max of this block
        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            amax = MAX(amax, fabsf(v));
        }
        const float d = amax / ((1 << 3) - 1);  // scale: amax maps to +/-7
        const float id = d ? 1.0f/d : 0.0f;     // inverse scale, guarded for all-zero blocks
        y[i].d = d;
        for (int l = 0; l < QK; l += 2) {
            const float v0 = x[i*QK + l + 0]*id;
            const float v1 = x[i*QK + l + 1]*id;
            // round to [-7, 7], then shift by 8 into the unsigned range [1, 15]
            const uint8_t vi0 = (int8_t)roundf(v0) + 8;
            const uint8_t vi1 = (int8_t)roundf(v1) + 8;
            assert(vi0 < 16);
            assert(vi1 < 16);
            pp[l/2] = vi0 | (vi1 << 4);  // pack two 4-bit codes into one byte
        }
        memcpy(y[i].qs, pp, sizeof(pp));
    }
}
```
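Same idea as the C above in a few lines of Python, mostly to convince myself I get it. This is my own sketch of the absmax scheme for a single block, with no nibble packing, and the function names are made up.

```python
import math

def round_half_away(x):
    """Mimic C's roundf: halves round away from zero (Python's round() does not)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def quantize_block_q4_0(block):
    """Quantise one block of floats to 4-bit codes plus a single float scale."""
    amax = max(abs(v) for v in block)
    d = amax / 7                     # (1 << 3) - 1: the largest magnitude maps to +/-7
    inv = 1.0 / d if d else 0.0
    codes = [round_half_away(v * inv) + 8 for v in block]   # shift [-7, 7] to [1, 15]
    return d, codes

def dequantize_block_q4_0(d, codes):
    return [(c - 8) * d for c in codes]

d, codes = quantize_block_q4_0([7.0, -3.5, 1.0, 0.0])
print(d, codes, dequantize_block_q4_0(d, codes))
```

The only things stored per block are the scale `d` and the 4-bit codes, so the rounding error per weight is at most `d / 2`.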
Also, I know it is just a 1 line call in pytorch afaik but it’s fun to look into how some stuff is actually implemented internally. Might be useful during the apocalypse when github arctic vault is taken over by neo-atlantians.
It seems like just looking at prompts.txt is enough for such tools (praise the LLMs)
The tooling is mostly trivial
Only need to ensure the output of LLM is in JSON (for which I saw A LOOOOOT of hacks in their codebase)
Anyways, so it is mostly the prompts in the following fashion
Thoughts -> reasoning -> Plan -> Criticism -> Speak
Thoughts dictate the motivation
Reasoning justifies it
Plan lists down the steps in bullet points
Criticism makes sure to not do anything illegal mentioned in the original instructions
Speak is optional, if you want to communicate what you’re doing to the user in summarised form
Check out the RLHF module in Deepspeed chat
Btw, 1000X programmer ggerganov cooked another appetizer. This time it’s whisper doing inference via large model in just 1 second on macs.
WUT!
https://github.com/stochasticai/xturing
We can finetune it now on our laptops
Praise the omnissiah!
I feel though the time taken will be HUUUUUUUUGEEE
Ok so in the stats it still takes quite long (6-7 hours) to finetune on 3070.
I can assume somewhere near 5 hours on 4090.
Practical buttt… who’s gonna monitor it for so long for a side project.
Also, their notebook is a bit weird, it uses docker command inside the notebook (first time seeing this)
Seems like everyone hates vectorDBs now.
I still think they have a proper usecase once you go beyond 10 million vectors.
On that note, I should checkout Instructor-xl - https://arxiv.org/pdf/2212.09741.pdf
VISION + LLM = AI Manga with dialogues (https://github.com/Vision-CAIR/MiniGPT-4)
https://huggingface.co/blog/stackllama
MultiGPTs (https://github.com/rumpfmax/Multi-GPT/), More LLMs with Vision (https://arxiv.org/abs/2304.08485), Prompt inversion (https://arxiv.org/abs/2304.08460) - man this list keeps on growing
(https://arxiv.org/pdf/2304.09842.pdf)
Better than toolformer (according to them), in the sense it doesn’t require any finetuning for new tools
You simply use LLM to generate a plan based on an inventory of existing modules. Since inventory can be updated easily, integrating new tools is quite simple. Only limitation is the context window length of the LLM.
Paper - https://arxiv.org/pdf/2304.09433.pdf
Basically they are using LLMs to convert unstructured corpus of text to structured data.
The new thing here is that they don’t simply use the direct approach via a prompt.
Instead, they generate a schema and code to do the conversion.
Reasoning being that the text is subject to change and you can keep on adding more details. Running the direct approach is expensive, since it needs to re-evaluate everything for each answer, while with the new approach you can simply execute the Python code.
Another thing is determining which code is correct. They generate multiple code blocks and then basically
On reading more I feel this will be extremely good for finbros. They are the ones dealing with most unstructured data honestly (shocking but true, not all data is excel). Can help them extract stats easily from multiple company’s data which are in totally different formats.
The True magic is in the function synthesis part which involves weak supervision.
https://arxiv.org/pdf/2304.11062.pdf
On more analysis, it is not that practical to use. The model can only retain a fixed amount of memory.
Scaling Transformer to 1M tokens and beyond with RMT (Paper Explained)
Not worried about the end of normie SDE day job from AI cause I am craving for more work honestly.
Guess what, we added another loop in the GPT
Nothing major honestly, just keep on correcting CoT (think step by step) with either
Doing this for 5-6 iterations results in correct results.
Now these correct demonstrations can be used in Few-shot prompts the next time (god I wish if context length was bigger)
Just using prompts to expand the dataset to include complex tasks.
Exploring both breadth and depth.
Claims it’s better than chatGPT for complex tasks.
Not sure. Doubt.
—------------—------------—------------—------------—------------—------------—------------—------------—------------
Btw, I just realised I don’t understand anything in ML after reading about DinoV2. I need to know about self supervised learning as well as distillation. Guess it’s time for chatGPT to teach me about this.
Paper - https://arxiv.org/pdf/2301.12050.pdf
Basic funda is instead of starting a RL agent from zero knowledge, you use LLM to create an Abstract World Model (AWM) and then use the model as starting point.
The model is created during the dream phase.
The model is verified by the RL agent using rewards in the wake phase.
Was tested in Minecraft where it creates recipes for crafting stuff in dreams and then the RL agent learns to actually mine/craft those recipes. If something is not valid, then it is discarded from the graph and new verified nodes are added.
LLM is pretty good at generating recipes. Mostly fucks up in quantities but not the ingredients.
They are mostly using code-davinci-002 for generation.
Paper - [2304.14318] q2d: Turning Questions into Dialogs to Teach Models How to Search
One of the few papers that uses PaLM-540B instead of GPT (although they provide code that works with the latter)
Premise is you can generate human annotator level chat conversations with LLMs.
Steps
Paper: [2305.01598] From Words to Code: Harnessing Data for Program Synthesis from Natural Language
I guess this might be the way they generated code for copilot as well. Filled with lots of small small nuggets. They are quite focused on good UX and not just research which is great.
Using code-davinci-002 for LLM calls
Problem statement is to write K programs to process data D and then present them to the users.
What makes it different is the way those K programs are being ranked. The aim is to offer programs to the user that are correct but have enough variability. There’s no point in presenting 100 programs that look exactly the same.
Steps
Ooga Booga chat UI + Stable diffusion Character pfp + pygmalion-6b character model + custom persona (inspired by some submissions from https://botprompts.net/)
On more experiments, the superCot LoRA also does pretty well with character simulation (even with characters that use profanity)
However, the dialogues are super short for some reason.
Reminds of the tweet by Noelle on how pygmalion dialogues feel much more natural/human cause it is trained on actual chat data instead of artificial datasets.
Basic problem they are trying to solve is generating long text for a question based on a retrieval.
Retrieving a lot of chunks based on a small question doesn’t work well.
What’s needed is as you are generating text, you keep on fetching chunks to fill the information.
We need to solve for three problems though with this method
They solve for these problems using the following methods
A tear rolled down my face (so beautiful, I will never stop loving computers)
Calculating attention is really slow. It is limited by memory bandwidth, since the intermediate score matrices scale quadratically w.r.t. the sequence length.
Basic funda is to make the algo IO-Aware (at first I thought they will be doing some syscalls to get io stats but its not that)
You split the original matrices of size N * d into multiple blocks of size B.
Run two loops
The outer one iterates over each block of Key K and Value V matrices and loads them from high bandwidth memory (HBM) to SRAM
The inner loop iterates over blocks of Query Q and calculates the softmax and other matrices for output.
During backprop, the algo is changed a bit so that the intermediate matrices (softmax and dot product) need not be loaded from HBM again. The algo recomputes them on the fly in SRAM from blocks of Q, K, V plus the stored output and softmax statistics (l and m).
Hence, you basically use more compute to use less memory access.
Luckily GPUs have a lot of compute so it’s not a problem.
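Pure-Python sketch of the two loops with the running max / running sum trick, checked against naive attention. Simplifications that are mine: no 1/sqrt(d) scaling, queries are processed one row at a time instead of in blocks, and the output stays unnormalised until the very end rather than being rescaled on every write.

```python
import math

def naive_attention(Q, K, V):
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e * v[j] for e, v in zip(exps, V)) / z for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    d_v = len(V[0])
    O = [[0.0] * d_v for _ in Q]   # unnormalised output accumulators
    l = [0.0] * len(Q)             # running softmax denominators
    m = [-math.inf] * len(Q)       # running row maxima
    for start in range(0, len(K), block):      # outer loop: one K/V block at a time
        Kb, Vb = K[start:start + block], V[start:start + block]
        for i, q in enumerate(Q):              # inner loop: queries
            s = [sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m[i], max(s))
            scale = math.exp(m[i] - m_new)     # rescale everything accumulated so far
            p = [math.exp(x - m_new) for x in s]
            l[i] = l[i] * scale + sum(p)
            O[i] = [o * scale + sum(pj * v[j] for pj, v in zip(p, Vb))
                    for j, o in enumerate(O[i])]
            m[i] = m_new
    return [[o / l[i] for o in row] for i, row in enumerate(O)]

Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
K = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
ref, got = naive_attention(Q, K, V), tiled_attention(Q, K, V)
print(all(abs(a - b) < 1e-9 for r, g in zip(ref, got) for a, b in zip(r, g)))
```

The full N x N score matrix never exists at once. Only one block of scores is alive at any time, which is what keeps everything in fast memory.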
Here’s GPT4s simpler explanation in case you lose a few brain cells after binging anime with tsunderes
Now, let's go through the algorithm step by step:
1. **Set block sizes**: The algorithm sets the size of the blocks it will divide the data into, based on the available on-chip memory (SRAM) and the dimension of the data (d).
2. **Initialize**: The algorithm initializes an output matrix O, and two vectors l and m with zeros and negative infinity, respectively. These are used to store the results of the computations.
3. **Divide Q, K, V**: The matrices Q, K, and V are divided into smaller blocks.
4. **Divide O, l, m**: Similarly, the output matrix O, and the vectors l and m are divided into smaller blocks.
5-15. **Compute attention**: For each block of K and V (steps 5 and 6), and for each block of Q, O, l, and m (steps 7 and 8), the algorithm performs the following computations:
- Compute the attention scores S (step 9).
- Compute the max of each row of S (step 10). This is used for numerical stability when computing the exponentials in the next step.
- Compute the exponentials of the attention scores, normalized by subtracting the max computed in the previous step. This produces the attention probabilities P (step 10).
- Compute the sum of each row of P (step 10). This is used to normalize the attention probabilities.
- Update the vectors l and m using the values computed in the previous steps (steps 11 and 12). These are used to keep track of the maximum attention scores and the sum of the attention probabilities.
- Write the updated values of O, l, and m back to the memory (steps 12 and 13).
16. **Return**: The algorithm returns the final output matrix O, which contains the weighted sum of the Values (V), weighted by the computed attention probabilities.
This algorithm optimizes the attention computation by processing the data in blocks that fit into the on-chip memory, which is much faster than main memory (HBM). This is a technique known as "block-based computation", and it is widely used in computer science to optimize computations for memory-hierarchy-based systems.
In the context of the Flash Attention algorithm, l and m are vectors used in the calculation of the softmax function, which is a crucial part of the attention mechanism in transformer-based models.
The softmax function is used to convert raw scores (in this case, the result of QK^T, denoted as S_ij in the algorithm) into probabilities that sum up to 1. This function has an exponential operation, which can lead to numerical instability issues (like overflow or underflow) when the raw scores are very large or very small. To mitigate this issue, a common trick used in practice is to subtract the maximum value in the set of scores from all scores before applying the exponential function.
This is where m comes in: it's used to store the maximum values. Specifically, in Step 10 of the algorithm, m~_ij = rowmax(S_ij) calculates the maximum value of each row in the score matrix, and P~_ij = exp(S_ij - m~_ij) subtracts these max values from the scores before applying the exponential function. In Step 11, m_new = max(m_i, m~_ij) updates m with the new maximum values.
The l vector, on the other hand, is used to store the sum of the softmax outputs (i.e., the attention probabilities). In Step 10, l~_ij = rowsum(P~_ij) calculates the sum of each row in the attention probability matrix. Then, in Step 11, l_new = exp(m_i - m_new) * l_i + exp(m~_ij - m_new) * l~_ij updates l with the new sums, where m_new is the updated maximum value. These sums are used in Step 12 to normalize the attention probabilities.
Overall, l and m are used to perform stable softmax calculations and to store intermediate results that are used for later normalization.
Code - https://github.com/ofirpress/attention_with_linear_biases
This algo (along with FlashAttention) is currently being used to extend the context length of the models. The most popular use case being mpt-7b-storywriter.
This algo btw is just a hack that works (not making this up, they admit this in their README btw)
Basically, they don’t add positional embeddings to the token embeddings at all.
Instead they add a bias m.X to the attention scores, where X is the distance between the query and the key, i.e. the farther the word, the more heavily it is penalised.
That’s it. Somehow it works.
m is also constant btw, a per-head slope fixed before training (a geometric sequence, e.g. starting at ½ ^ 0.5 for 16 heads)
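Small sketch of the bias for one causal head. The slope formula below is the geometric sequence for power-of-two head counts as I understand it from the paper. The function names are mine, so check the repo linked above before relying on it.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two number of heads."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to causal attention scores: -slope * distance from query i to key j."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

print(alibi_slopes(8)[:3])
for row in alibi_bias(3, 0.5):
    print(row)
```

Nothing here is learned, and nothing depends on the absolute position, which is why it extrapolates to longer sequences than seen in training.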
Assisted Generation: a new direction toward low-latency text generation
gist: Use Smaller LM to generate stuff faster while using a LLM to fix output in case it deviates.
Decision taken on the basis of token mismatch plus output logprobs
Pros - low latency
Cons - More compute (cause you are running both models)
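Toy sketch of the draft-then-verify loop with greedy decoding only (no logprob thresholds). `big` and `small` are fake models I made up: both recite a fixed text, and the small one gets every fifth position wrong. The verification loop stands in for a single batched pass of the large model.

```python
def assisted_generate(big, small, prompt, max_new, k=4):
    out, big_calls = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        draft = []
        for _ in range(k):                     # cheap: the small model drafts k tokens
            draft.append(small(out + draft))
        big_calls += 1                         # one big-model pass scores the whole draft
        for i, tok in enumerate(draft):
            expected = big(out + draft[:i])
            if expected != tok:
                out += draft[:i] + [expected]  # keep the agreed prefix, take the big model's fix
                break
        else:
            out += draft                       # the whole draft was accepted
    return out[len(prompt):len(prompt) + max_new], big_calls

TEXT = list("hello world, hello gpu")

def big(seq):
    return TEXT[len(seq)]

def small(seq):
    # Deliberately wrong at every fifth position to force corrections.
    return "?" if len(seq) % 5 == 4 else TEXT[len(seq)]

tokens, calls = assisted_generate(big, small, list("he"), max_new=12)
print("".join(tokens), calls)
```

The output always matches what the big model would have generated on its own. The win is only in how many big-model passes it took.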
Paper: [2305.01625] Unlimiformer: Long-Range Transformers with Unlimited Length Input
So how to increase the context length of the transformers?
Flash attention? Ok done
ALiBi? Done
Congrats, you got to 60K in length.
Only problem is I need a million token context to stuff my divorce court case documents
Worry not.
What if you simply plugged in vectorDB into transformer architecture?
Well that would make inference insanely slow
Hmm, but what if we kept it small enough that it always fits in GPU or CPU RAM
Yep, it will work in that case.
This is basically what Unlimiformer is. The hidden states are stored in a vector DB like FAISS and, when attention is computed, instead of multiplying the query with every key we only fetch the top-N keys and multiply those with the query. This lookup is done separately for each head.
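Sketch of attention over only the top-k keys for a single query. The brute-force scoring here stands in for the nearest-neighbour index lookup (FAISS or similar); `topk_attention` is a name I made up.

```python
import heapq
import math

def topk_attention(query, keys, values, k=2):
    """Softmax attention restricted to the k keys with the largest dot product."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)  # the index lookup
    m = max(scores[i] for i in top)
    w = [math.exp(scores[i] - m) for i in top]
    z = sum(w)
    dim = len(values[0])
    out = [sum(wi * values[i][j] for wi, i in zip(w, top)) / z for j in range(dim)]
    return out, sorted(top)

keys = [[10.0, 0.0], [0.0, 10.0], [9.0, 0.0], [-5.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
print(topk_attention([1.0, 0.0], keys, values, k=2))
```

Since softmax mass is dominated by the largest scores anyway, dropping the long tail of keys changes the output very little.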
Paper: [2305.10601] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
I think we are speedrunning data structures with LLMs at this point. Soon we’ll be getting graph prompting, some esoteric mf balanced tree prompting and so on.
Anyways, this one is an extension of the chain of thoughts (CoT) prompting.
Ask LLM to generate N possible paths to the solution (e.g. generate 1 word to fill the crossword)
Then you can use BFS or DFS to explore all the possible solutions from each branch.
At each step you can also run a validation (typically using LLM) to see if it's even worth pursuing the branch. If not, you simply discard it.
Another thing you can do is take a vote on which path to follow instead of running validation. Vote is also taken using multiple LLM calls
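Sketch of the BFS variant with pruning. `propose`, `evaluate` and `is_solution` are plain functions standing in for the LLM calls, and the toy problem is mine, not from the paper.

```python
def tree_of_thoughts_bfs(root, propose, evaluate, is_solution, beam=3, max_depth=4):
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for c in candidates:
            if is_solution(c):
                return c
        scored = [(evaluate(c), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s > 0]    # validation: prune dead branches
        scored.sort(key=lambda sc: sc[0], reverse=True)
        frontier = [c for _, c in scored[:beam]]         # keep only the best few branches
        if not frontier:
            break
    return None

# Toy problem: pick exactly three numbers from (2, 3, 5) that sum to 10.
CHOICES, TARGET, LENGTH = (2, 3, 5), 10, 3

def propose(state):
    return [state + (n,) for n in CHOICES] if len(state) < LENGTH else []

def evaluate(state):
    remaining, steps_left = TARGET - sum(state), LENGTH - len(state)
    feasible = min(CHOICES) * steps_left <= remaining <= max(CHOICES) * steps_left
    return 1.0 if feasible else 0.0

def is_solution(state):
    return len(state) == LENGTH and sum(state) == TARGET

print(tree_of_thoughts_bfs((), propose, evaluate, is_solution))
```

Every `propose` and `evaluate` would be at least one LLM call in the real thing, so the call count grows with beam width times depth.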
Issue is though I can’t apply such techniques in prod code since LLM calls are too slow rn.
They will be viable when latency comes in hundreds of millis.
Good for problem solving use-cases though.
Honestly, I just had an idea. I should maybe add a tab or something to Ooba which allows me to easily leverage all of this clever prompting.
Just struck me how trust is important in society and model interpretation is solving that problem rather than for research.
If we can interpret the black boxes better, we can convince regulators in fields such as medicine, law, civil engineering, piloting EVAs, etc. to use these boxes in high-risk environments.
Two recent experiments, both use LLMs
https://arxiv.org/abs/2305.09863 (Microsoft)
Language models can explain neurons in language models (OpenAI)
The initial step is the same in both i.e. to figure out the tokens for which the neuron activates the most. This can be done quite easily by just analyzing the output log probs of some inputs.
In the next step, what MS folks do is they ask GPT-3.5 to generate 5 explanations based on the selected tokens. E.g. if selected tokens are wife, sister, father, mother then the explanation can be ‘family and relationships’
Microsoft approach
Next, they ask GPT-3.5 to generate some paragraphs that contain words related to the explanation. 10 paragraphs are generated that contain the related tokens and 10 paragraphs are generated that have none of the related tokens.
Then we make each of these para pass through the neuron and check the output logprobs again. The higher the difference b/w positive and negative samples, the better the explanation is for the neuron. One with the largest difference is the winner.
OpenAI Approach
Here, they use GPT-4 to simulate the neuron itself. The original text is given and passed to this simulated neuron (just a prompt asking LLM to output scores from 1 to 10 based on the explanation).
The output of the simulated neurons is then compared to the actual activations. The closer the outputs are, the better the explanation is.
This requires considerably fewer steps than the MS approach but I feel it will fuck up on generating scores.
I need to experiment with ImageBind.
So wrote a local memes organizer.
Works pretty well
Blog - https://huggingface.co/blog/4bit-transformers-bitsandbytes
Paper has some math so I didn’t read it honestly till now
But basic funda is quite simple
Using it is quite simple. Just add a few params mentioned in the blog to any existing PEFT based training code e.g. alpaca-lora. The qlora.py code in the official repo seems to be broken.
For the memory usage mentioned in the paper, you need to use batch size of 1 and gradient accumulation steps of 4.
Since it is a bit slow, I used a batch size of 4 and a lora rank of 64.
With that I was able to train a 13B vicuna model on my smol dataset in an hour on a 4080 card.
Fucking awesome!
Compute Metrics for llama-supercot-13B
The output is not great though since I only trained it for an hour. Need to train longer.
https://twitter.com/main_horse/status/1662478420738187266?s=20
This is also a lie btw, just a random narrative
I got tested in the public arena so I am testing you as well
Paper -https://arxiv.org/pdf/2305.16291.pdf
Ask GPT-4 to write a program based on the current environment context
Tell Gpt-3.5 to generate description for that program (I should start doing this as well, meta-commentary by GPT on text blocks)
Store it in a vector database with the description as the key and the program as the value
For programming they use three feedbacks -
This is yet another great point. Reasoning capabilities of GPT-4 (we should call it proto-AGI at this moment) enable us to do a lot of this stuff. Can’t believe how much better GPT-4 is compared to 3.5
Link to all prompts in the codebase - https://github.com/MineDojo/Voyager/tree/main/voyager/prompts
Haven’t updated this doc in a week or so, busy with day job stuff
Also spending time actually trying out a few techniques from this doc. I realized I don’t understand some stuff especially when I saw weird tokenization code in qlora.py code.
So just testing out in my local
Some things sound simple when reading but you realize so many hidden details when you actually implement stuff.
This imo makes a lot of difference in your understanding as well as the speed of iteration when it comes to shipping stuff. Like you can figure out the actual root cause with a quick look at the error, when it might take someone else 2 days of googling.
Everyone has had this feeling of just knowing something but not being able to explain how.
It’s because they spent time earlier on in their life playing with so many tools and techniques that it's just built into their subconsciousness.
Also, I need a way to update this doc directly via terminal. Opening google doc in a browser and scrolling is too slow.
Paper - https://arxiv.org/pdf/2306.00978.pdf
Code - https://github.com/mit-han-lab/llm-awq
Claims to be 1.5X faster than GPTQ as well as more accurate
Most quantisation techniques currently rely on re-ordering weights post quantisation to get better accuracy. This operation however is not natively supported by GPUs and hence slow.
Here they do not use re-ordering.
What they do is simply perform a normal quantisation using min max approach.
Then, using a sample of inputs, they check which weights see the largest activations.
Then they keep 0.1-1% of such weights in f16 format only while converting the rest to INT3/4
This helps solve for accuracy.
However, mixed precision (f16 and INT4) is not GPU friendly
So they finally figure out an appropriate scaling factor that minimizes the difference in output for a layer.
Then the f16 weights are scaled by this and converted to INT4.
This is the python file where most of the activation-aware logic occurs.
This is where we cache input feature samples for some data from pre training dataset to determine activation
https://github.com/mit-han-lab/llm-awq/blob/3a6dfc39ed20d793f7c26624c4b9f9599960dd3b/awq/quantize/pre_quant.py
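A rough sketch of the "find a scaling factor that minimizes the layer output difference" step. It assumes a simple symmetric round-to-nearest INT4 quantiser and a tiny grid over an exponent `alpha`; the function names, grid and toy data are mine, and the real search in the repo has more details.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest INT4 per output row, returned in de-quantised float form."""
    step = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / step), -8, 7) * step

def search_scale(w, x, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Try per-input-channel scales s = mean|x| ** alpha and record the output error of each."""
    act_mag = np.abs(x).mean(axis=0)              # how hard each input channel is hit on the sample
    reference = x @ w.T                           # full-precision layer output
    errors = {}
    for alpha in alphas:
        s = act_mag ** alpha                      # alpha = 0.0 gives s = 1, i.e. plain quantisation
        out = (x / s) @ quantize_int4(w * s).T    # weights scaled up, inputs scaled down
        errors[alpha] = float(np.mean((out - reference) ** 2))
    return errors

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 20.0                                  # a few "salient" channels with large activations
w = rng.normal(size=(32, 64))
errors = search_scale(w, x)
best_alpha = min(errors, key=errors.get)
```

Since `alpha = 0.0` is in the grid, the chosen scale can never be worse than plain quantisation on the calibration sample.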
Paper - https://arxiv.org/abs/2306.03078
Literally within a week of the previous technique we have a better one.
Core idea is slightly similar - Some weights are more important than others so focus on preserving them correctly rather than every parameter.
Here what they do is make a sparse representation that consists of important weights and then try to minimize the error for these weights (still need to read how they are calculating error and minimising, most likely seems to be some sample dataset as in AWQ)
PR link - https://github.com/ggerganov/llama.cpp/pull/1684
Wait wut? Is it even worth it? Like how bad would this level of quant be
Well
It is not bad at all.
The 2-bit quantised version of a large model has better perplexity than the f16 version of a smaller model. The gap between 13B f16 and 30B 2-bit is quite high.
This means I should now start running 30B on my 4080 instead of 13B (Yea it fits in 16GB VRAM)
Not made up without any thought
Employs HyDE (Hypothetical document embeddings) + specialist model technique presented in lots of papers
The primary objective is to fetch correct embeddings when questions are extremely short plus ordering them correctly on more than similarity
You would assume simply combining embeddings should work but it’s not like that
The reason being that embeddings of different types of objects (i.e. text, audio, video, image) have different signal-to-noise ratios.
This is the reason why you can train a great image or audio model with just 1-3B parameters (MusicGen, SD) but text requires much much more params
https://kaiokendev.github.io/til#extending-context-to-8k
You need to extend your model context by 4X
Worry not
Just divide the position indices fed to the rotary positional embeddings by 4 (lol, lmao even)
Don’t believe me?
See this tweet from mr. ggml
https://twitter.com/ggerganov/status/1671915699025977351?s=20
One possible reason it works is that large models tend to overfit on positional embeddings
So if they see a position like 4096 which was never encountered in the training, they start outputting gibberish.
However, if you make the model believe that 4096 is in fact 2048 (dividing all positions by 2), the model suddenly starts giving correct output.
This doesn’t explain, though, what happens with decimal positions like 1024.75 since they were also not encountered in the training.
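To make the trick concrete, here is a tiny sketch of how a scale factor changes the rotation angles that RoPE sees. It assumes the standard RoPE frequency formula; `rope_angles` and its parameters are my own names.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a (possibly fractional) position."""
    pos = position * scale
    return [pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

squeezed = rope_angles(4096, scale=0.5)      # position 4096 with a 2X stretch
in_between = rope_angles(4099, scale=0.25)   # position 4099 with a 4X stretch lands on 1024.75
```

With the scale applied, position 4096 produces exactly the angles of position 2048, and odd positions land on fractional ones like 1024.75. Those are rotations in between the ones seen in training, which is at least consistent with the model coping with them.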
GGML folks are doing god’s work. Giving the horny 4chan bois in college hostels on their cheap Asus phones a way to run LLMs locally is not a small task.
It is even harder to make them run on smol pi machines but it works.
Two weeks back I was trying to add capability to swap loras in the llama.cpp.
The code is already there to apply the Lora. To remove one, you can simply subtract the BA adapter matrix from the weights instead of adding it.
However, it didn’t work as expected when I was doing print debugging and for that I had to do a smol dive into the ggml codebase.
https://github.com/ggerganov/llama.cpp/blob/b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf/llama.cpp#L2896
All of the matrices in ggml codebase are represented using ggml_tensor
Each tensor struct generally contains its source tensors and the operator that combines those sources to produce it.
The floating point data is just present in void* data array. It is void* so that you can store data in any format - f32, f16, quantized ints.
Take an example of the first op
ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB);
Here it is simply performing matrix multiplication on two matrices loraA and loraB. The lora_ctx is used for temporary memory buffers and is cleaned up after an operation is complete for a layer.
There is a catch tho. This op actually doesn’t do anything! It just creates a new tensor with sources as loraA and loraB and operator as GGML_OP_MUL_MAT. https://github.com/ggerganov/llama.cpp/blob/b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf/ggml.c#L5849
So what’s the point of calling this?
Well, for ggml all the computations for a layer are performed in a single go (I guess for better GPU utilisation as well as lazy execution? Not sure)
Once all ops are listed down, a DFS traversal is done from the last tensor all the way to the root tensors to form a computation graph. This is done in line
struct ggml_cgraph gf = ggml_build_forward(r);
The final result is just a 1-d array containing the tensors in the order in which they should be computed i.e. leaves first and root last.
Once you have the array, the computation is actually triggered using ggml_graph_compute(lora_ctx, &gf);
As you can see above, the multiplication is handled separately based on the data type of the tensor. The general theme is most calculations are done in f32 mode and all other datatypes are converted to F32 and back to quantised form for the calculations. This can be different in ggml cuda code but I haven’t taken a deep look at that.
https://github.com/ggerganov/llama.cpp/blob/b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf/ggml.c#L11057
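The record-ops-first, compute-later flow described above can be mimicked in a few lines of Python. This is a toy model of the idea only: the class and function names are mine, the matmul is a plain row-by-column product, and nothing here mirrors ggml's actual memory layout or kernels.

```python
class Tensor:
    """Toy stand-in for ggml_tensor: data plus the op and sources that produce it."""
    def __init__(self, data=None, op=None, src=()):
        self.data, self.op, self.src = data, op, src

def mul_mat(a, b):
    return Tensor(op="MUL_MAT", src=(a, b))       # records the op, computes nothing

def add(a, b):
    return Tensor(op="ADD", src=(a, b))

def build_forward(result):
    """DFS from the result down to the leaves; returns tensors in compute order."""
    order, seen = [], set()
    def visit(t):
        if id(t) in seen:
            return
        seen.add(id(t))
        for s in t.src:
            visit(s)
        order.append(t)                           # appended only after all of its sources
    visit(result)
    return order

def graph_compute(order):
    for t in order:
        if t.op is None:
            continue                              # leaf tensor, already holds data
        a, b = (s.data for s in t.src)
        if t.op == "MUL_MAT":
            t.data = [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]
        elif t.op == "ADD":
            t.data = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

W = Tensor([[1, 0], [0, 1]])
loraB = Tensor([[1], [2]])
loraA = Tensor([[3, 4]])
BA = mul_mat(loraB, loraA)
r = add(W, BA)
pending = BA.data                                 # still None: nothing has run yet
graph_compute(build_forward(r))
```

Printing `BA.data` before `graph_compute` gives `None`, which is exactly the trap I fell into while print debugging.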
Paper -https://arxiv.org/pdf/2307.02628.pdf
So far from what I got after reading this is that I need to understand KV caching better.
Update:
So basically the KV Cache in itself isn’t the problem.
It’s the use of KV Cache with early termination, that’s the issue.
So basically in early termination, you’ve a classifier or some other algo at a layer that can use the log probs generated after the layer N to decide if the tokens should even go to the next layer or we should simply declare a winner here.
Now the thing is with this technique, if you terminated the previous token at layer N but for the new token you need to terminate at layer N + 2, you are left with 2 layers for which you have no KV cache data. So now you need to recompute the KV cache for the previous token and the last 2 layers before proceeding to do calculations for the current token.
This is computationally heavy and what this paper is trying to solve.
How?
Well instead of letting everything terminate at random layers, what if we could make a deterministic algo to predict the last layer of the token.
Well, that’s really simple if you just plot at what layers do transformers have good enough confidence for Nth token. You will find that for the token later in the sequence, you can predict them just by passing only 1-2 layers cause you have a large amount of context available in the input.
For the earlier tokens however you need to go through all the layers to predict correctly.
So you can simply use a function that’s like monotonically decreasing and use it to predict no. of layers for Nth token.
But what about the KV cache?
So see, since now you are basically ensuring that if the Nth token goes through M layers, then it’s guaranteed that the N+1th token will go through <=M layers (cause your func is monotonically decreasing). Thus, you will always have vectors in the KV cache for all M layers and for all the N tokens.
Cool. But why is this called skip decode then? There’s no skipping so far.
Well, your previous tokens have gone through M layers and your next tokens are going through <=M layers. You are now stuck with exact opposite problem that the new tokens don’t benefit from the extra computation done by previous tokens. To solve for this, instead of using the first M layers, the authors propose to use the last M layers. Hence, the skipping (cause you are skipping first few layers).
This is what the final output looks like.
Final question tho, is this all actually useful at all?
Yep, absolutely, 100 percent. Leads to 2-5X speed up in the inference (which would be amazing for large unquantized models running on my 4080 PC).
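A minimal sketch of the layer schedule described above. The linear decay and the `min_layers` floor are my assumptions (the paper's exact schedule may differ); the point is only that the budget never increases and that the last M layers are the ones kept.

```python
def layer_budget(token_idx, n_layers, max_len, min_layers=2):
    """How many layers the token at position token_idx gets; never increases with position."""
    frac = min(token_idx, max_len - 1) / (max_len - 1)
    return max(min_layers, round(n_layers - frac * (n_layers - min_layers)))

def layers_to_run(token_idx, n_layers, max_len, min_layers=2):
    """Run the LAST m layers, so the first n_layers - m layers are the ones skipped."""
    m = layer_budget(token_idx, n_layers, max_len, min_layers)
    return list(range(n_layers - m, n_layers))

budgets = [layer_budget(i, n_layers=32, max_len=64) for i in range(64)]
```

Because every later token runs a subset of the layers that earlier tokens ran, the KV cache entries it needs are always already there.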
Paper - https://arxiv.org/abs/2304.13835
This is much more relatable now after playing with the talk repo (https://github.com/yacineMTB/talk)
The primary problem is how to allow an LLM to
This is because other way is to clear the whole context and start LLM again with another persona. Can’t be done on each turn as it is too expensive.
Another way is to simply switch loras but training loras for each persona is a compute cost which VCs won’t sponsor. You can switch one easily though in less than 200ms in llama.cpp
So we are left with training an LLM in such a way that
Most of the magic of this paper lies in the dataset rather than the techniques.
Paper - Paper page - Lost in the Middle: How Language Models Use Long Contexts
Just researchers trying to figure out if long context is even helpful or not in sota LLMs.
Good thing is they tried both OSS and closed source LLMs
Not so good - unless your doc is at the start or the end of the context, it won’t influence the LLM’s answer nearly as much. This means that ranking really really matters. Which is why I use cohere rerank after fetching docs from pinecone, have seen insane but correct shifts in ranks.
Another thing - the more docs you stuff into the context the less accurate your results become.
Almost all models behave the same way.
So from a practical LLM Q&A perspective
Why does this happen tho?
Still not sure but so many legit folks think it’s true
So I must do the hard work now and read about all the techniques mentioned in that article
Multi query attention - https://arxiv.org/pdf/1911.02150.pdf
MoE (Mixture of experts) -
[2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[2202.08906] ST-MoE: Designing Stable and Transferable Sparse Expert Models
Sparse Expert Models (Switch Transformers, GLAM, and more... w/ the Authors)
Multimodal vision - https://arxiv.org/abs/2204.14198
Speculative decoding - https://arxiv.org/pdf/2302.01318.pdf
(Still haven’t read the gpt-4 papers, too much work in day job)
Had a discussion with telt, realized judging the quality of fine tuning datasets is a hard problem
For my custom one, I simply did multiple rounds of cleanup after my loss curve was not converging.
That is not the right way though, cause even if the curve goes down, you can get highly inaccurate but sorta correct sounding results.
Some methods which I have seen in papers are related to determining the variety of data via clustering, but I might have missed stuff that grades the datasets other than using humans.
Please don’t suggest GPT-4 to grade it (which is also what I have seen in some papers). It doesn’t work well ime.
Only valid resources I’ve found is this - Paper page - Instruction Mining: High-Quality Instruction Data Selection for Large Language Models
Let me know if you make it here.
Update:
Seems like one other idea is to train a classifier to select high-quality vs low-quality samples. Seems to be a hassle though, but it is applied in a lot of goog papers as well as GPT-3
Although I am skeptical of approaches like above which rely on GPT to grade the answers, the papers do get good results using it.
Why I am skeptical is the fact it was shown that GPT favors its own generated answers plus the absolute rating scale can screw up a lot and relying on logprob based method is better.
Faster way than PPO (Proximal Policy Optimization) to do RLHF.
Doesn’t require a separate model
Interestingly too much math in the paper but thanks to https://twitter.com/akbirthko autism I realized the math is quite simple. Check the code here - https://github.com/okarthikb/DPO/blob/main/train.py
In essence you are just training the policy model with a loss that pushes up the log prob of the accepted answer relative to the rejected one, with both measured as a ratio against the frozen ref model
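For a single (accepted, rejected) pair the loss boils down to a few lines. This sketch takes summed sequence log probs as plain floats; the argument names are mine and `beta=0.1` is just a commonly used default, not something from this doc.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one preference pair."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log(pi / pi_ref) for the accepted answer
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # same for the rejected answer
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))                       # equals -log(sigmoid(margin))

untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy identical to the ref model
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)     # policy now prefers the accepted answer
```

When the policy equals the ref model the margin is 0 and the loss is log(2); it drops as the policy starts favouring the accepted answer more than the ref does.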
What I’m realizing is this is the bleeding edge of transformer research.
Dense models are still the talk of the town. Compute prices are falling to the floor and hence it makes sense to keep on scaling dense models.
However, there’s still not enough compute in the world to run inference on 2T param models for 100M users in a few seconds.
That’s where the MoE models shine.
The basic idea is you still have one large model but only a part of it (called the expert) is triggered in the inference.
Now there are multiple attempts going on to make this better and better. Excluding the initial approach crafted for RNNs by Noam Shazeer, I am finding the GShard and Switch Transformers papers to be more relatable.
Instead of a normal feed forward layer, we use a routing based feed forward layer with N experts.
Router is just a simple linear layer which scores, for each token, how suitable each expert is for processing it. Now we can simply take softmax over those scores and choose K <= N experts to route our data. In the Switch Transformers case, K = 1 always.
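A NumPy sketch of the routed feed forward layer with K = 1. The shapes, the ReLU experts and the names are my assumptions; load balancing losses and capacity limits are left out entirely.

```python
import numpy as np

def switch_ffn(x, router_w, experts):
    """Route every token to its single best expert (K = 1) and run only that expert."""
    logits = x @ router_w                                   # (tokens, n_experts)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)                           # chosen expert per token
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = choice == e
        if mask.any():
            hidden = np.maximum(x[mask] @ w_in, 0.0)        # each expert is a plain ReLU FFN
            out[mask] = (hidden @ w_out) * probs[mask, e][:, None]  # scaled by the gate prob
    return out, choice

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 8, 16, 4, 10
x = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
out, choice = switch_ffn(x, router_w, experts)
```

Multiplying the expert output by the gate probability is what lets gradients flow back into the router.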
Best part is you can simply distill these large sparse models to smol dense models while retaining a good amount of accuracy.
For the deployment part, you can easily shard these models using a combination of multiple paradigms.
If you’re a noob like me and don’t really understand what’s going on in this diagram. Simply look at this pseudocode -
https://gist.github.com/cto-junior/88477d52818597bf725e02bfb0559b43
The paper also has pseudocode specific to their implementation but it uses mesh tensorflow (baahh!!)
Here’s a good one that uses pytorch (translated using chatGPT):
https://gist.github.com/cto-junior/5018d526f2056546f6607986b08b423d
One thing I still don’t understand though is how these models are trained, like especially the gating function. Do you just follow the normal training regime where inputs are passed to all the experts and finally settle upon some gating weights by backprop? Or do you explicitly choose which expert to run forward and backprop on for a particular cluster of dataset?
Extremely important points in case you choose to train a trillion param model in the basement. Don’t laugh, it should be possible in 5 years.
It just uses two instead of one expert and also adds another layer without an expert on top of the expert one.
The authors of original papers weren’t impressed (and I was not as well)
Some legit guy told me the GPT-4 is actually based on this architecture.
[2202.08906] ST-MoE: Designing Stable and Transferable Sparse Expert Models
This focuses mostly on stability of the expert transformers especially during finetuning. Have to read it but it’s too loooooong.
Below is a much more exhaustive reading list thanks to main_horse on twitter
https://arxiv.org/pdf/1911.02150.pdf
This is actually quite simple. You simply calculate attention for the same key and value using multiple queries. It’s just a variant of multi head attention with key and values shared. Primary motivation is to reduce the memory footprint during inference and training while capturing as much performance as possible.
https://gist.github.com/cto-junior/0adbaa7c5a8b2ce115939c7092af783b
https://twitter.com/ocolegro/status/1676602607106760705?s=20
GitHub - emrgnt-cmplxty/automata: Automata: The Future is Self-Written
[2307.05695] Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
Ok so the idea is really simple.
You add a lora adapter, you train it
Once the training is finished, you simply merge those lora weights with the layer.
Then you reset the A & B matrices and resume the training again.
Doing it multiple times will give you good results.
The catch is that a simple reset doesn’t work due to some past gradient thingy which will mess up the future updates so they reset a lot of optimizer states to 0 as well.
Overall code is very easy - https://github.com/Guitaricet/peft_pretraining
Just grep for the can_reset if block in the torchrun_main.py
And grep for merge_and_reinit in peft_pretraining/relora.py
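A toy version of the merge-and-reset loop, to show why stacking low-rank updates gives a high-rank result. Big assumption: instead of gradient descent, each "training" round fits the adapter with a truncated SVD of whatever is still missing, so the optimizer-state reset is not modelled at all.

```python
import numpy as np

def fit_adapter(residual, r):
    """Stand-in for 'train the LoRA': best rank-r approximation of what is still missing."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    B = u[:, :r] * s[:r]          # (d, r)
    A = vt[:r, :]                 # (r, d)
    return B, A

rng = np.random.default_rng(0)
d, r, cycles = 32, 2, 4
target = rng.normal(size=(d, d))  # full-rank matrix we would like W to reach
W = np.zeros((d, d))
errors = []
for _ in range(cycles):
    B, A = fit_adapter(target - W, r)   # "train" a fresh rank-r adapter
    W = W + B @ A                        # merge it into the frozen weights
    # B and A are discarded here and re-created next cycle: that is the reset
    errors.append(float(np.linalg.norm(target - W)))
```

Each adapter is only rank 2, but after 4 merges W has rank 8 and the error keeps dropping.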
Thanks for reading this, I should mention though that it is most likely not usable for your llama model.
I am not making this up but the authors themselves haven’t tested it properly beyond 1B models and there too the results were dicey.
Ok so this is easy as well. They are just trying to reduce the comm overhead in distributed training.
So they do three things -
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Ok, this mostly seems like just optimisation on the kernel front rather than like an algo-rewrite of v1.
LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA
[2305.11206] LIMA: Less Is More for Alignment
Just use fewer but better quality examples for supervised finetuning, rather than the shitty dumps from your support portal
I have the mandate of the heaven
Tweeted this into the void and woke up to a presentation from HF bros in the morning. Entropy works in mischievous ways.
ICML '23 Tutorial on Reinforcement Learning from Human Feedback
https://arxiv.org/abs/2307.13269
We already know that GPT4 leaks said the MoE architecture is the way to go.
So now anons are trying to replicate that for local models.
Issue is a proper MoE architecture works in the following way
So at inference time you only have a part of the model that’s actually doing the computation while the rest is inactive.
For local models, this approach is not the right way forward. Reason being it severely inflates the size of the model and we want smol.
So what to do?
We already have smol experts (kinda) called LoRAs. But the issue with loras is you can combine them but not select one dynamically during the inference time. Secondly no one knows if it’ll even actually benefit the base model.
Well that’s what this paper shows. The approach works and your model is better at more tasks. They however don’t select one lora, they simply multiply each lora adapter matrix A & B by a different set of weights, sum all A’s together, then sum all B’s together and finally multiply both to get a merged Lora.
The weights take care of how much each lora should contribute to the final output.
For the weight training, they say using gradient based approach will be super slow, what they recommend is a nograd approach (that’s present in their code as well), where they just select 20 loras, take loss with the output then use this nograd optim to adjust weights. Only a few iters are done.
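The merge step as I described it above (weight each A and B, sum them separately, then multiply) in NumPy. Shapes and names are my assumptions, and the gradient-free search over the weights is not shown.

```python
import numpy as np

def merge_loras(loras, weights):
    """loras: list of (A, B) with A of shape (r, d_in) and B of shape (d_out, r)."""
    A = sum(w * a for w, (a, _) in zip(weights, loras))
    B = sum(w * b for w, (_, b) in zip(weights, loras))
    return B @ A                  # weight delta to add to the base layer

rng = np.random.default_rng(0)
loras = [(rng.normal(size=(4, 16)), rng.normal(size=(16, 4))) for _ in range(3)]
only_first = merge_loras(loras, [1.0, 0.0, 0.0])
blended = merge_loras(loras, [0.5, 0.3, 0.2])
```

Note that `blended` is not the same as a weighted sum of the individual B @ A deltas, since multiplying the summed matrices adds cross terms between different loras.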
Since karpathy implemented and validated it, I must read it.
[2305.07759] TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Ok, here the claim is very simple. You don’t really need a large model for coherent English output.
What you need is [Ba Dum Tss!] a good dataset.
So to create that dataset they select around 1500 basic words that are typically known by 3-4 year old kids. Then they used GPT to create small stories around those words to increase the diversity of the dataset.
Then simply train multiple small models all < 100M params.
What they find is amazing! Not only does your model produce coherent output, it also gains a bit of reasoning ability like LLMs e.g. it can remember facts, figure out correct grammar etc.
I don’t like their final eval method though. They simply used GPT-4 to grade the answers generated by small models. The reason I don’t like this is that even if GPT-4’s scoring is consistent, I remember reading somewhere that it prefers output that GPT-4 itself would generate over a better one. Might not be a big deal here since if you match GPT-4 levels you are already good.
Next - experimenting with llama2.c
GitHub - karpathy/llama2.c: Inference Llama 2 in one file of pure C
Have you thought about replacing attention with FFT anon? It works and is faster. Validated by kaioken dev
[2105.03824] FNet: Mixing Tokens with Fourier Transforms
Building and operating a pretty big storage system called S3 | All Things Distributed
Was reading about how S3 is scaled for a billion users. Realized I know nothing about computers. I am so dumb. Need to get smarter.
The gist is HDDs are cheap (everyone knows that) and they are getting larger storage wise (everyone knows that yet again) but the issue is reading/writing in HDD is done by a mechanical head. And there’s a limit to which you can make mechanical heads faster.
So more often than not you’re choked by IO. The way to solve for this is to shard your data and distribute it across as many HDDs as possible.
That way you can do multiple fetches in parallel rather than waiting for a second for head to seek 2TB of data in a single node.
I thought it was a meme but now it’s gaining a lot of steam so I must absorb its data stream
From Sparse to Soft Mixtures of Experts
Instead of routing a single token to an expert, what they’re proposing is to take a weighted average of the tokens and then route it to an expert. This allows for better training stability and ability to scale the experts.
The primary disadvantage is this doesn’t work for autoregressive decoders.
It looks like they are using slots per expert as the tuning knob to make the model faster or slower. Each slot has its own set of parameters (known as dispatch). We simply multiply the input tokens using a slots params and feed it to the expert. Then we do the reverse using ‘combine params’ and get the combined prob of a single output token.
If you’re noob in maths like me, use this GPT-4 explainer for understanding - https://chat.openai.com/share/6bc517e0-2acf-48ba-8436-f1a2d8702b65
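Here is how I understand the dispatch / combine maths, as a NumPy sketch. The shapes, names and tanh experts are my assumptions; as far as I can tell both softmaxes are taken over the same token-slot logits, just along different axes.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(x, phi, experts, slots_per_expert):
    """x: (tokens, d). phi: (d, n_experts * slots_per_expert) learnable slot parameters."""
    logits = x @ phi                              # (tokens, slots)
    dispatch = softmax(logits, axis=0)            # each slot is a weighted average of all tokens
    combine = softmax(logits, axis=1)             # each token is a weighted average of all slot outputs
    slot_in = dispatch.T @ x                      # (slots, d)
    p = slots_per_expert
    slot_out = np.concatenate(
        [expert(slot_in[i * p:(i + 1) * p]) for i, expert in enumerate(experts)])
    return combine @ slot_out                     # back to (tokens, d)

rng = np.random.default_rng(0)
d, n_tokens, n_experts, p = 8, 6, 3, 2
x = rng.normal(size=(n_tokens, d))
phi = rng.normal(size=(d, n_experts * p))
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda h, w=w: np.tanh(h @ w) for w in expert_ws]
y = soft_moe(x, phi, experts, p)
```

The dispatch softmax mixes over all tokens in the sequence, including future ones, which is why this doesn’t drop into an autoregressive decoder as is.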
[2307.14430] Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
Train LLMs like you would teach a kid
Don’t teach them algebra before you have taught them basic arithmetic
This is an interesting approach to clustering.
This paper shouldn’t be complicated to read at all but they fell into the rabbit hole of using too many math symbols to denote function calls.
Skill Graph creation - quite easy honestly if you ignore the maths
You first create a train and eval set for various skills using clustering
Then you take a base model and train K versions of it for H steps, one for each skill. Finally store the validation loss difference for that skill b/w original base model and finetuned model.
Now you simply start considering all pairs of skills, and train the base model on union of both skill dataset.
Now you observe the loss of this finetuned model vs the base model.
If the delta loss here is greater than the delta loss when trained just on one of the skills in previous step, we simply add an edge in the graph.
Using the Skill graph matrix A to actually select samples during training
Now that you have the skill graph, it’s time to use it during the training. This is the main part of the paper.
Naively selecting an equal number of samples from all relevant skills is not good enough, since one skill can disproportionately affect the loss.
A better way is to take into account which skills are leading to bigger validation loss and then change the sample distribution accordingly. We will be using pi to denote the fraction of samples that should be from skill i.
It is not that difficult except for some assumptions. You initialize the proportion of samples for each skill based on the edge weights of the adjacency matrix. Then at each iteration, you observe the loss of the model w.r.t. all the unique skills’ samples in the validation set. Then you train the model with the existing mixture of samples. Finally you adjust the proportion of the samples based on the new loss.
WHY ARE WE STILL DOING MANUAL STATUS UPDATES IN COMPANIES?
Time to plugin LLMs to the Slack feed and let it automatically track everything for you.
https://arxiv.org/abs/2002.12327
I have skipped over a lot of basics, time to catch up
I am not good at maths so it took me some time but after reading https://finbarr.ca/how-is-llama-cpp-possible/ I finally understand
Most basic formulae
Number of params approx (P) - 12 * [(Model dimension)^2] * [number of layers]
Mem usage = bytes required by each param (4 for f32) * P
Flops usage per token = 2 * P (assuming it requires one mul and one add per param)
All of this is extremely approx and doesn’t account for vocab to embeddings as well as normalisation layers
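The formulae above as a tiny calculator. The function name is mine, and the LLaMA-7B numbers (d_model 4096, 32 layers) are plugged in only as a sanity check of the approximation.

```python
def estimate(d_model, n_layers, bytes_per_param=4):
    """Back-of-envelope params, weight memory and per-token FLOPs for a dense transformer."""
    params = 12 * d_model ** 2 * n_layers
    return {
        "params": params,
        "memory_bytes": params * bytes_per_param,
        "flops_per_token": 2 * params,            # one mul and one add per param
    }

llama_7b = estimate(d_model=4096, n_layers=32)
llama_7b_f16 = estimate(d_model=4096, n_layers=32, bytes_per_param=2)
```

This gives about 6.4B params, which is in the right ballpark for a "7B" model given that embeddings and norms are ignored.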
Now that even the codellama paper has demonstrated that using RoPE scaling you can get to almost 100K context size, it’s time for me to stop pretending that I know what it is and actually learn.
Paper - [2104.09864] RoFormer: Enhanced Transformer with Rotary Position Embedding
If you don’t know what positional embedding is, it’s basically capturing the position information of the sequence in the input. This is needed so that you can tell “B follows A” apart from “A follows B”. Also needed to determine how far A is from B and whether it should even affect B’s prob
The trad way of doing it is to simply learn embeddings of the same dimensions as the model but for each position. Then add these embeddings to the model input.
But this suffers from a major drawback - You can’t adapt your model to go beyond the positional embeddings it has learned since they don’t exist in the matrix. Secondly it doesn’t capture the relative positions well, only the absolutes.
RoPE aims to solve it. The fundamental idea is first to use only the relative positional information, secondly use some function (not lookup) to convert that information to embeddings. The function that they use is sinusoidal angular transform which basically rotates your token embeddings by some angle based on the position (hence the name). Sounds quite intuitive.
Code - https://gist.github.com/cto-junior/3493fe428a069a3500f85f8558ef5df9
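A small NumPy check of the "only relative position matters" property. This sketch rotates consecutive pairs of dimensions of a single vector; the pairing convention and the names are my assumptions (real implementations differ in how they pair dimensions).

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate each consecutive pair of dims of x by an angle proportional to position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per pair of dims
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
near = rope(q, 5) @ rope(k, 3)        # query at position 5, key at position 3
far = rope(q, 105) @ rope(k, 103)     # same offset of 2, shifted by 100
```

Both dot products come out the same (up to float error) because the attention score only sees the offset between the two positions.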
Paper: [2302.01318] Accelerating Large Language Model Decoding with Speculative Sampling
I had heard about it and I heard about it again today (both times from SemiAnalysis)
The idea is pretty simple - you use a smol model to generate tokens quickly (speculation) and then correct the output if it diverges from the slower but accurate large model.
The smaller model will be faster in generation but if you have to check its output 1 by 1 with larger model then you’re still bottlenecked right?
Well, the solution is to simply generate K tokens at a time from a smaller model and then feed all K tokens to the main model and get the correct log probs for the next tokens in a single pass. Then you can simply compare and discard the incorrect tokens.
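A greedy toy version of the draft-then-verify loop. Note the simplification: the paper accepts or rejects drafted tokens with a sampling rule over probabilities, while this sketch only compares argmax tokens. The two "models" are made-up functions standing in for real LLMs.

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=12):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = []
        for _ in range(k):                        # cheap model guesses k tokens ahead
            draft.append(draft_next(tokens + draft))
        accepted = []
        for i in range(k):                        # in a real system: ONE batched pass of the big model
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)         # take the big model's token and drop the rest
                break
        else:
            accepted.append(target_next(tokens + draft))   # all k accepted: one bonus token
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new]

def target_next(seq):                             # toy "large model"
    return (seq[-1] * 3 + 1) % 7

def draft_next(seq):                              # toy "small model": right most of the time
    return target_next(seq) if len(seq) % 4 else 0

def target_only_decode(prompt, max_new=12):
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target_next(tokens))
    return tokens

result = speculative_decode([1], draft_next, target_next)
reference = target_only_decode([1])
```

The output matches what the large model would have produced alone; the draft model only changes how many large-model passes are needed to get there.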
Almost all major LLMs are using this in prod (as per the leaks). Leads to 2 - 5X improvement in latencies
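A rough sketch of one draft-then-verify round, using the accept/reject rule from the paper as I understand it (accept a drafted token with probability min(1, p/q), otherwise resample from the leftover probability mass). `draft_probs` and `target_probs` are hypothetical callables standing in for the small and large model; a real system would score all K positions in a single batched forward pass instead of the loop used here.

```python
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, k, rng):
    """One round of speculative sampling; returns the newly accepted tokens.

    draft_probs(tokens) / target_probs(tokens) are hypothetical callables that
    return the next-token distribution (1-D array) for a list of tokens.
    """
    # 1. Draft k tokens autoregressively with the small model.
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = int(rng.choice(len(q), p=q))
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Score all k+1 positions with the large model (looped here for clarity).
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept drafted tokens left to right; stop at the first rejection.
    accepted = []
    for tok, q, p in zip(drafted, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual distribution max(0, p - q).
        residual = np.maximum(p - q, 0.0)
        accepted.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return accepted
    # All k drafts accepted: take one bonus token from the large model.
    accepted.append(int(rng.choice(len(p_dists[k]), p=p_dists[k])))
    return accepted
```

Each round yields between 1 and k + 1 tokens for one large-model pass, which is where the latency win comes from when the draft model agrees with the big one most of the time.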
[2004.06093] Topology of deep neural networks
TLDR in this thread - https://twitter.com/suchenzang/status/1696924361373151337?s=20
I’d be lying if I said I understand all the maths in this. But I do understand the idea. They are basically trying to show how classifier neural nets change the datasets after each layer so that each class is easily distinguishable in the space.
It also shows how ReLU is more effective at doing this than smooth activations like tanh.
Size of the KV Cache for one token in an LLM assuming f16 weights is - 2 * 2 * nlayers * nhead * dhead bytes (2 for K and V, 2 bytes per f16 value)
(From - Transformer Inference Arithmetic | kipply's blog)
You can also multiply this by batch size B for production systems
By convention, most models have nhead * dhead = dmodel
So e.g. for LLama-7B, for a batch size of 1, this would be equal to 4 * 32 * 4096 = 524288 bytes = 0.5MB
Now this is per token, so if you want to run inference on 2048 tokens, the total kv cache required would be = 2048 * 0.5 = 1GB.
And this is just for this small 7B model. For 65B it would come to around 5GB (4 * 80 * 8192 * 2048).
And this is for batch size = 1 btw which means low flops utilization and insanely slow inference.
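The arithmetic above as a tiny helper, in case you want to plug in other models. The function name and parameters are mine; it assumes nhead * dhead = dmodel and 2-byte values, as in the numbers above.

```python
def kv_cache_bytes(n_layers, d_model, n_tokens=1, batch_size=1, bytes_per_value=2):
    """KV cache size, assuming n_heads * d_head == d_model and f16 values."""
    # 2 tensors (K and V) per layer, each holding d_model values per token.
    return 2 * bytes_per_value * n_layers * d_model * n_tokens * batch_size

per_token_7b = kv_cache_bytes(n_layers=32, d_model=4096)
full_7b = kv_cache_bytes(n_layers=32, d_model=4096, n_tokens=2048)
full_65b = kv_cache_bytes(n_layers=80, d_model=8192, n_tokens=2048)
print(per_token_7b, full_7b / 2**30, full_65b / 2**30)  # prints: 524288 1.0 5.0
```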
So there's a need to find a way to compress the KV cache. That’s what this paper aims to solve.
The basic hypothesis is simple. A) Only a few tokens are important and others are not. B) The importance of future tokens is more or less correlated with the ones in the past.
The core algo then is not that tough. You take a window of size w and only keep the tokens in the kv cache whose attention scores stay above the threshold throughout that window w. You also always keep all the values of the most recent tokens (denoted by r).
Overall I feel like it’s a good hypothesis but there are too many approximations built into the inference side. They should publish the results for llama or something rather than OPT.
https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
You must read this. I am telling you. It is mid but you must.
Update:
I finally read this blog. They are proposing a new architecture which helps reduce the inference time especially cause the attention formula is quadratic in nature (we compute attention for every token in sequence with every other token in the sequence)
They simplify the attention formula simply as A(x) . x where x is the input embeddings. Most network architectures don’t have A(x) but just a single W weight matrix learned during training. This makes the attention based models quite unique as they can adjust to inputs very well and can demonstrate capabilities like in-context and few shot learning.
But in attention based models, except for this layer all other layers simply are of W.x nature.
Their argument is instead of using quadratic attention here, what if we made other layers a bit different so that they are of A(x).x nature as well but A(x) being a sub-quadratic function.
The function they propose is a long convolution i.e. long sliding window matrix that attends to some tokens in the sequence and generates the output.
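To see why a long convolution can be sub-quadratic, here is a minimal sketch of a causal convolution computed through the FFT, which costs O(L log L) instead of O(L^2). This shows only the convolution piece; the gating and the way Hyena parametrizes its filters are not covered, and the function name is mine.

```python
import numpy as np

def causal_long_conv(x, h):
    """Causal convolution of a length-L signal x with a length-L filter h via FFT."""
    L = len(x)
    n = 2 * L  # zero-pad so the circular FFT convolution does not wrap around
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[:L]  # output at position t only mixes x[0..t]
```

The filter can be as long as the sequence itself, so every output position can still see the whole past, just without the pairwise token-by-token score matrix.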
The hyena_orders is just some number which they keep as 3 for some reason.
The results presented are pretty good but I hope they show it working the same or better than llama-like alternatives.
Also, if you’re like me with rusty knowledge of CNNs, here’s a simple explanation by GPT-4
I need to get so much better aaahhh
Also, I think I’d be dabbling a bit into tinygrad, see what abstractions I can recreate from scratch. Should be a fun exercise.
It has been mandated by the heaven
Let’s look into it from a practical perspective.
I found this good starter resource for anyone who wants to understand the basics - The definitive guide to using Vector Search to solve your semantic search production workload needs.
I know about 40% of it already so directly jumping on how actual indexing works.
Most popular algo is HNSW but ScaNN is faster. There are a lot of other alternatives as well.
Now most people are only concerned about the speed of the algo, but that is necessary, not sufficient.
To create a full fledged vector db, you need more
And many more
Considering all this I have started reading about Lucene’s HNSW implementation (since Lucene is production-grade and used almost everywhere. I am afraid though it might be slow)
All alpha from - https://issues.apache.org/jira/browse/LUCENE-10054
And https://issues.apache.org/jira/browse/LUCENE-9004
More orgs should start doing this honestly - https://people.apache.org/~mikemccand/lucenebench/VectorSearch.html
Publish nightly perf numbers on OSS libs.
The first issue was a gold mine btw. I learnt a lot.
The primary doubt I had was how to create a single graph and so far it looks like that’s not the case.
They are still using small graphs (10M record max)
So that means you yourself will have to combine the results (maybe they have some utility for it)
Another thing is the memory usage optimisations they have done. Loading everything in memory is not required, you can just keep the entry points for each hierarchy and the neighbors. The values are always docIDs so it only requires 4 bytes to store. Once you get a docID you can look up the actual vector value from the segment.
How to do distributed HNSW? Like the one where I don’t replicate the data but where I shard the data and then query multiple subgraphs to get final answer.
Some hint - https://github.com/nmslib/hnswlib/issues/377
Also Pyramid Paper - https://arxiv.org/pdf/1906.10602.pdf
Ok, I found the answer for this. Right now it’s pretty dumb but works. If you need top 10 matches, you query for top 100 records from each shard and then sort again. It’s costly and inaccurate but that’s what’s being done in Elasticsearch at least
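The scatter-gather merge is simple enough to sketch. This is a toy illustration of the idea only, not Elasticsearch’s actual code; the `(score, doc_id)` tuple layout and function name are my assumptions.

```python
import heapq

def merge_shard_results(shard_results, k):
    """Merge per-shard (score, doc_id) lists into a global top-k (higher score = better)."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

# Each shard returns its own over-fetched candidates; the coordinator re-sorts.
shards = [[(0.9, "a"), (0.5, "b")], [(0.8, "c"), (0.7, "d")], [(0.95, "e")]]
top3 = merge_shard_results(shards, 3)
```

Over-fetching from each shard is what buys back recall, since each shard’s approximate search may miss some of its true neighbours.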
https://github.com/facebookresearch/faiss/wiki/
FAISS wiki has a lot of alpha
I was reading Chroma and Weaviate docs but to me they look like half solutions. Primarily cause I don’t see that they support efficient sharding of vectors. It is still mostly defined by some primary key kind of approach in Weaviate.
Need to look into this but afaik this and ScaNN are immutable once built. This would mean carefully choosing a large enough shard which you can’t modify later on. Hmm doesn’t seem to be worth pursuing.
Paper - https://arxiv.org/abs/2309.05444
After LoraHub, another paper showing the viability of MoE using only Lora or (IA)3
The architecture is almost the same which involves a gating function which decides the experts.
Notable changes though -
If you don’t know what IA3 is, it’s pretty simple. Just 3 learned vectors per layer that rescale the keys, values and FF hidden activations respectively (at inference you can fold them into the corresponding weights). IA3 in general performs worse than LoRA but here, when used in MoE fashion, they perform better.
Another issue with IA3 is that your vector size is fixed w.r.t. the model dimensions. In LoRA, you can easily change the rank to change the size and play around with evals.
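Here is a stripped-down sketch of where the three IA3 vectors sit in a layer. It is a toy single-head block with no residuals or layer norm, and all names (`ia3_layer`, `l_k`, `l_v`, `l_ff`) are mine, so treat it as an illustration of the rescaling only.

```python
import numpy as np

def ia3_layer(x, Wq, Wk, Wv, W_up, W_down, l_k, l_v, l_ff):
    """Single-head attention + FF block with (IA)^3 rescaling vectors applied."""
    q = x @ Wq
    k = (x @ Wk) * l_k            # rescale keys
    v = (x @ Wv) * l_v            # rescale values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    h = attn @ v
    ff = np.maximum(h @ W_up, 0.0) * l_ff   # rescale the FF hidden activations
    return ff @ W_down
```

Since the vectors are element-wise scales, `(x @ Wk) * l_k` equals `x @ (Wk * l_k)`, so there are no extra ops at inference once they are folded in. It also shows why the size is fixed: each vector has to match the dimension it scales.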
Paper - [2309.03409] Large Language Models as Optimizers
Not sure how many papers I gotta show folks before they accept that LLMs can reason pretty well. They also try the best models from each company and all of em perform pretty well. GPT-4 obviously being the best.
This is by google so please believe me now.
The problem statement is simple. First - given an optimisation problem like travelling salesman or simple linear regression, and the path taken till now in the form of coordinates, can the LLMs converge to the final solution?
The answer is yes.
The second problem statement is - Given an eval, along with previous prompts used for the eval and the final scores of each prompt, can you generate a new prompt that’s better? It’s the variation of the first problem statement.
The answer is again, yes.
Just ask your LLM to take a deep breath 😀
I will become this guy, watch me
Paper - [2305.05065] Recommender Systems with Generative Retrieval
If you don’t know already, the most common implementation of recommendation systems in most companies is the following
This poses a problem though. What do you do when new items keep on being added to the catalogue? Retraining the embedding model is expensive with millions of items, plus slow.
What if we simply trained a model to output the most related item id based on the past selection by the user?
What if it’s a generative model rather than a classifier or something? Would that even work and not fucking hallucinate while generating the id.
Turns out it does.
Their approach is as follows
Turns out this approach is amazing and gives SoTA results on most popular recommendation evals and real life scenarios. They also show that the model doesn’t generate invalid ids a lot ( < 1% in most cases)
Man, not getting a lot of time to read papers these days. Occupied by day job stuff.
But I must push harder.
Paper - Flamingo: a Visual Language Model for Few-Shot Learning
This seems to be a secret sauce behind a lot of ongoing GPT-4(V) like projects.
The premise is using a pre-trained visual encoder as well as an LLM to attend to multimodal inputs.
They combine these two models using three things
All of this comes together to make the final Flamingo model. It is now trained while keeping the original encoder weights and LLM weights frozen.
Paper - https://arxiv.org/abs/2309.15564
Multimodal transformers are the rage of the town. Especially now that people are seeing the power of GPT4-V.
The current way to train them is to change the architecture of an LLM to infuse cross-attention and then train it from scratch to use both Image and Text embeddings.
This is very expensive.
Meta proposes a simpler way. What if you could simply combine two pretrained models to process both images and text?
Ngl, this was also how the Flamingo architecture worked (which is most likely used in GPT-4V). But even then it introduced 3 extra types of layers to make it work. Plus it required training those from scratch (although it kept the visual encoder and LLM weights frozen)
This paper proposes something much simpler. You simply add a cross attention block at the output of each LLM or text to image block. This x-attn block processes input from both modalities. The output of multiple x-attn block is then combined using a linear transformer.
Now the question is - how to train the x-attn block though? That’s the neat part. They simply use supervised finetuning over a small dataset to get the desired outputs.
Paper - https://arxiv.org/pdf/2309.16797.pdf
I’ll be honest here, I saw this paper in a tweet and simply proceeded to ignore it cause I’m not interested in yet another “make prompts have sex to create a synthetic dataset” approach. Especially cause Evol-Instruct and airoboros already exist and are good.
Then I saw on r/localLlama that this paper is actually by Goog deepmind. Hence I read it.
The approach is to mutate a prompt using an LLM
How?
Just prepend a mutate instruction (e.g. make it more creative) to the prompt. You can also add a reasoning style (e.g. think step by step).
They have multiple ways to select the best prompt as well as mutate the mutate prompt itself using LLM
Hence the name prompt breeder.
More important is to filter out similar prompts by using BERT embeddings and cosine similarity
Next, you also provide the good quality prompts in the context of the LLM so that it generates a unique one with high quality rather than repeating the same
The primary alpha of this paper honestly is in Appendix. Just go through all the prompts and strategies and select the best ones.
Paper - https://browse.arxiv.org/pdf/2304.08485.pdf
Terrible name, decent model. Multimodal, hence interesting.
You take the image, then use some encoder like CLIP and generate embeddings.
Now you do a transform so that these embeddings match the dimension of the LLM’s input layer. LLaVA folks use a single weight matrix to do this which makes it quite easy to train.
And voila, you have a multimodal LLM.
The training part is also not something out of this world. First you keep both CLIP and LLM weights frozen and only adjust the projection weights.
Next you do a full finetune for both LLM and projection layer. You still keep the CLIP frozen.
Another Interesting thing is they use CLIP encodings from the penultimate layer and not the last layer like you would ideally do.
Overall, this is actually quite a bit simpler than Flamingo and possibly performs worse, but it is faster to finetune and train cause of no complex cross attention mechanisms and fewer additional layers.
Paper - https://browse.arxiv.org/pdf/2310.03744.pdf
Honestly, it’s the same as LLaVA 1 except they change two things
Everything else is the same.
https://transformer-circuits.pub/2023/monosemantic-features/index.html
I am finally reading this masterpiece but let me be honest. Understanding it requires a lot of existing knowledge about NNs plus support from chatGPT
You have been taught from childhood in every deep learning course that NNs are basically black box.
These days, although we have gained significant ability to understand small NNs, we still can’t figure out how each individual neuron behaves in a network trained on large amounts of data.
There is a lot of research going on in this domain as it is extremely useful to affect the NNs outcomes and control/guide them easily.
There was a blog/paper by OpenAI where they used GPT-4 to understand GPT-2 neurons. See Model Interpretation in this doc.
This paper is another one which uses another NN to figure out first one.
The hypothesis is this: We are not able to understand each individual neuron cause it activates on seemingly random inputs which make no sense to naked eye. However, the possible reason it does that is because it is compressing so much data into the small number of weights. Thus, one neuron ends up representing multiple inputs.
If this is true, what if we trained a much wider neural net on the inputs to this neuron? Since this NN is wide, each neuron should end up representing a considerably fewer number of features. And since they are fewer in number, they should be easily understandable as well.
Well that’s what they try to do in this paper. They use a sparse autoencoder whose width varies from 1X to 256X of the neural net. The input to this autoencoder is the activation of the neural net layer. The output is also the activations (thus the goal is to reconstruct them). They use MSE loss for it plus an L1 regularization penalty to force sparseness. Without sparseness you would simply have too many neurons activating in this autoencoder, making feature distinction difficult.
They haven’t done a half-assed job either. They do a lot of analysis and publish it to make sure that the features that are detected are in fact representative of the text in the context, and that they are not simply neuron weights slapped here.
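A minimal sketch of that objective (MSE reconstruction plus an L1 penalty on the hidden features). The parameter names and the `l1_coeff` default are my assumptions, and the paper’s extra training details are left out, so read this as the shape of the loss rather than their exact recipe.

```python
import numpy as np

def sae_loss(acts, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """MSE reconstruction loss plus L1 sparsity penalty for a one-hidden-layer autoencoder."""
    features = np.maximum(acts @ W_enc + b_enc, 0.0)   # (batch, n_features), ReLU
    recon = features @ W_dec + b_dec                    # (batch, d_act)
    mse = np.mean((recon - acts) ** 2)
    l1 = np.mean(np.abs(features).sum(axis=1))          # pushes most features to zero
    return mse + l1_coeff * l1, features
```

The width knob is just `n_features` relative to `d_act`: making the hidden layer much wider while penalising its activations is what is supposed to spread the polysemantic neurons out into more readable features.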
Validating Specificity
To validate the features, they use log likelihood. E.g. take an Arabic character. You can find out the probability of that character occurring when the feature corresponding to it is activated vs prob. of that character occurring in the overall dataset. If the ratio is high, that means your feature does specifically point to that character. Now using a single character is tricky so for approximation they use a proxy here. E.g. just a word containing Arabic characters.
Validating sensitivity
Here they just measure the correlation b/w the times our feature got activated vs the times we detected the proxy (i.e. any Arabic character) in the context. High correlation means the feature is indeed sensitive to the proxy.
Downstream Effects
If your feature does point to an Arabic character, it should also make the future predictions gravitate more towards Arabic characters (cause it makes natural sense). They try to measure this for each feature and find out that this is indeed true. They also do a study by disabling the feature and see that the future outputs skew towards normal English or something else.
Paper- [2304.02643] Segment Anything
Thought about reading it after yacine’s dingboard success. The model basically outputs the correct segment masks after downloading the data.
The model architecture is not that complicated.
You first have an Image Encoder (ViTMAE) to generate image embeddings.
Then you have CLIP (obviously since it’s used everywhere) to generate embeddings for the text prompts.
For bounding box or pixel prompts they use learned embeddings
For Mask based prompts, they use convolution and then sum it up with image embeddings.
Once you have the final embeddings you feed it to a transformer based decoder that outputs a mask. They use two way cross attention (prompt with image and vice-versa) in this decoder.
When you get the mask, your next job is to map it to the image. For that they use a simple MLP that computes the mask probability at each image location. This mask decoder part is inspired by the Maskformer paper.
I did spend some time understanding its code (and by spending time I mean going on an evening walk, talking to chatGPT to explain the piece of code I copy pasted into it before going on the walk). So basically for each pixel you predict the probabilities of the mask class it belongs to. And then you can simply group up all the pixels with the same mask class to form a segment.
This is llava but for Qwen model by Alibaba. Some have told me it is better than llava.
I checked its architecture and that might be possible simply cause it uses cross-attention instead of a simple projection layer / MLP to connect CLIP to Qwen.
However weird part is they don’t do cross attention of image embeddings and text embeddings like flamingo.
They use some learned embeddings and do cross attention with that.
I think the major reason for such architectural choices is that no one wants to modify the base LLM.
Paper - [2303.15343] Sigmoid Loss for Language Image Pre-Training
Lucas has been sharing a lot of SigLIP hype on my TL so it made sense to actually see what it is.
As yall already know, CLIP's image encoder is used in almost every Multimodal LLM out there. So how is CLIP trained?
Well you take embeddings from an image encoder and a text encoder and project them to the same dimensions using weight matrices. Once you have the same dimension embeddings, you simply take a dot product along with a temperature param and get a 2d matrix signifying how close each image and text pair really are.
To train CLIP, you apply Cross entropy loss on the following 2d matrix
SigLip simplifies this by changing the loss function. They use sigmoid instead of softmax and change the function so that it’s just dependent on each image-text pair rather than combination of pairs.
This doesn’t seem like a big change but what it allows you to do is remove the need for all-gather communication during the training run.
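A side-by-side sketch of the two losses on a batch of already-projected, L2-normalised embeddings. The function names and the default values for the temperature, `t` and `b` are my assumptions (in both papers these are learned parameters), so take it as an illustration of the structural difference only.

```python
import numpy as np

def clip_loss(img, txt, temperature=0.07):
    """Softmax contrastive loss: every row/column is normalised over the whole batch."""
    logits = img @ txt.T / temperature                   # (n, n) similarity matrix
    def xent(l):  # cross entropy with the matching pair (the diagonal) as the label
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))
    return 0.5 * (xent(logits) + xent(logits.T))         # image->text and text->image

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Sigmoid loss: each image-text pair is an independent binary classification."""
    logits = t * (img @ txt.T) + b
    labels = 2.0 * np.eye(len(img)) - 1.0                # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(z) == log(1 + exp(-z)), computed stably with logaddexp
    return np.sum(np.logaddexp(0.0, -labels * logits)) / len(img)
```

The softmax version needs the full row and column of the similarity matrix before it can normalise, while the sigmoid version is a plain sum over pairs, so each device can work through its chunk of pairs without first gathering every embedding in the global batch.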
https://arxiv.org/pdf/2305.11172.pdf
Mostly interested cause it’s the only multimodal one that doesn’t use CLIP lmao
https://polymathic-ai.org/blog/xval/
Paper - [2311.00430] Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Distillation is a widely popular set of techniques to reduce the size of the model.
It involves training a smaller model to predict output probs same as the larger model (not the same as ground truth). We use KL Divergence loss to ensure this.
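The KL term itself is only a few lines. A minimal NumPy sketch, with helper names of my own choosing, that matches the student distribution to the teacher distribution per position:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) over a batch of token positions."""
    log_p = log_softmax(teacher_logits)   # teacher = target distribution
    log_q = log_softmax(student_logits)
    return np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1))
```

The loss is zero only when the student reproduces the teacher’s full distribution, which carries more signal per token than a one-hot ground-truth label.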
In Distil Whisper, they are using the same approach with some modifications
https://x.com/xariusrke/status/1727254622442791400?s=20
Just realized we have cracked good Video gen after watching Pika announcements.
So it makes sense that I learn how it works, especially the temporal gen and interpolation (instead of doing my day job)
One fascinating thing I found is the amount of effort spent in data curation for training. This still remains the most important part of all the models out there. They filter out the videos with small motions and text.
They also use video BLIP, CoCa and other models to generate descriptions of the videos. Then use the same information to filter out the data
Still need to read the rest
You can simply read this paper for model architecture - https://arxiv.org/pdf/2304.08818.pdf
One more thing - why do people use this word ‘latent’ a lot. Please, stop. Use something which plebs can understand.
So here’s what the basic model is
Now doing this for all the frames of the videos will be pretty expensive especially if input has high FPS.
So what you do is extract out only key frames in the videos which represent a high semantic shift (can be done with cosine distance or like text based descriptions). Then you only use those key frames to generate more keyframes.
Once you have the results, you use another model to interpolate b/w those frames. The interpolation model is also similar to the above one but just has different layers (what exactly tho?) instead of temporal ones b/w spatial layers
Finally they finetune the model to produce sequence of frames rather than just the next frame
Paper - Adversarial Diffusion Distillation
At first I thought it’ll be the usual Student Teacher model with a different loss function. But turns out it has one more component - a discriminator model
Another interesting aspect is the input to the teacher model is not the original image but the diffused latents from the student model
The Discriminator model just tries to ascertain how close the provided image is to the original image.
* Music here implies diffusion operations in latent space
https://arxiv.org/pdf/2312.00785.pdf
[2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Have just started reading it but I already know this one is going to be a banger. It’s mostly about taking an existing state space model architecture and making its parameters time variant; the lack of this is, according to them, the major blocker for these models not being good enough in the real world.
Why hasn’t someone done this already? Well, because it increases the computational overhead and makes the models slower and if you have slow models then why not just use transformers with quadratic attention.
To solve this they have developed a hardware aware algo (mostly one that decides what data should be in HBM vs SRAM), and who better to do this than Tri Dao who wrote Flash Attention.
I forgive this paper for using too many mathnerdsnipes simply cause it is tbh required to understand the motivation. Thanks to chatGPT, here's a simple explainer for most of the terms you’ll encounter in this paper.
Most important part of the paper is this. The takeaway for me is that memory bandwidth is so low that recomputation is faster than storing and fetching.
They actually released it. And it’s good?
Sort of
But tbh, benchmarks are not what interest me. You can go ahead and view them. Most evals are trash anyways and suffer from data leakage in subtle forms.
I am more interested in tips and tricks scattered across the fine prints in the paper. Tbh, they didn’t reveal much of those.
Importance of dataset quality. Also, they finally realise that users leave the app if you deny them the response. Just mention something helpful instead.
I am truly excited about Nano and what usecases will be possible on the device after its release. I consulted the huggingface dashboard and the stats (atleast for MMLU) look great for its size.
It is also multimodal which was a surprise to me for such a small model. But then again that’s the advantage of Fuyu like architectures instead of using a dedicated image encoder which makes the overall size bloated.
This is one of the most difficult parts about training, handling hardware failures. TBH I should read all the 3 linked papers and understand what can happen (I do have some idea from OPT-175B logbook)
Another interesting thing is that they do not checkpoint the weights to distributed store since it’ll be simply too slow (rugged by Network IO). Instead what they do is either keep some robust copy of weights in memory or maybe transfer the weights from one of the other nodes in the cluster (due to Model + Data parallelism).
Also, please shield your clusters from cosmic rays. Yet another win for basement dungeon AGI enthusiasts.
Karpathy tweeted about it
TBH I am aware about the RAG Based approach to ground the output in facts.
Not sure about Reflection, Verification chains, Decoding uncertainty
If I had to guess what they mean
Reflection - Simply ask LLM if this is the correct answer or not and then ask it to modify it
Verification Chain - Maybe just a fact checker and similar tools post the LLM output before sending it to user
Decoding uncertainty - Basically if the logprobs are lower (or you can consult from some hidden layer), the output might not be true
[2310.04378] Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
SD Turbo is the talk of the town but before that we had LCMs.
OK, there is way too much maths in this paper. Time to upload it into chatGPT
Update: Math too hard, went over notebook shared by the kind (https://twitter.com/felix_red_panda), now math is easy
Also, one thing: this paper is finally piquing my interest in diving into various aspects of diffusion models
E.g. Schedulers - The ML developers guide to Schedulers in Stable Diffusion
The basic idea is really simple - you want to train a network to predict an image from a noisy latent without going through the whole iterative sampling process.
Why use EMA (exponential moving average) decay rate?
Tbh you should just read this amazing blog instead - https://naklecha.notion.site/explained-latent-consistency-models-13a9290c0fd3427d8d1a1e0bed97bde2
Although I read this paper and I get what they are trying to do, I am still not getting a sense that I actually know this. Maybe it’s because I haven’t read the linked papers about online learning. Should do that first.
OK, after reading the next paper and gaining some intuition, I remembered I am just baka.
It’s actually quite simple. What they are trying to do is almost the same as doremi in that they use 2 smol models to select the data to be used to train large models. Their main contribution is how smol can we make these two models so that overall we save FLOPs rather than spending less in training but overall more in total.
They simply keep on running this in an actor model (which is actually quite popular in distributed computing if you have worked with spark etc.). Your workers keep on running in parallel and compute scores for the samples from datasets. The score is nothing but the cross-entropy loss of the scorer model and reference model.
You keep on updating the scores in a memory bank (can be any DB).
You then use these scores to sample the data, thus prioritizing examples that would actually help large model to learn something new instead of simply repeating it.
In the end you update the weights of both reference model and large model.
Why both? Cause what makes all this work is that the loss trend of smol reference models serves as good enough proxy for the loss trend of large models.
Finally they keep on reducing the scorer and reference model size until the training regime is compute positive (i.e. takes fewer FLOPs w.r.t. the learner). As you can notice, the learner obviously takes longer to train as reference models get smaller, but the hit is not large
[2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
The idea is simple. You want optimal data distribution in pretraining. By optimal I mean the samples that lead to lowest loss as early as possible.
How to achieve that?
Well the intuition (a good one) in most of these papers is that smol models, although not great at producing output, are still a good enough proxy to dictate how large models are gonna behave. So e.g. if a small model trained on dataset X has high loss on some dataset Y, then it might be the same for a large model as well trained on the same dataset X.
Here we use 2 small models - one is a reference model that is pretrained using normal sampling weighted by token count. Second is the proxy model. The whole trick lies in how to train this proxy.
We do that by first starting with uniform sample distribution over domains, doing a normal forward pass for a batch, taking difference of loss b/w this and reference model for each domain. Now we select the worst domain loss among these (max) and try to adjust the alpha weights (for sampling) to minimize this.
After T steps, we simply take the average of all the alpha weights per domain and use that as the sampling weights for the large model.
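A rough sketch of the weight update, as far as I understand it: an exponentiated-gradient step on the per-domain excess loss, followed by smoothing with the uniform distribution and averaging over steps. The `eta` and `smoothing` values and the function name are my assumptions, and in the real algorithm the excess losses come from the proxy model being trained with the current weights, whereas here they are just passed in as an array.

```python
import numpy as np

def doremi_weights(excess_losses, eta=1.0, smoothing=1e-3):
    """excess_losses: (T, n_domains) array of per-step (proxy loss - reference loss).

    Returns the domain weights averaged over all T steps.
    """
    n_domains = excess_losses.shape[1]
    uniform = np.full(n_domains, 1.0 / n_domains)
    alpha, history = uniform.copy(), []
    for step_loss in excess_losses:
        # Up-weight domains where the proxy lags the reference (clipped at 0).
        alpha = alpha * np.exp(eta * np.maximum(step_loss, 0.0))
        alpha = alpha / alpha.sum()
        # Mix with uniform so no domain's weight collapses to zero.
        alpha = (1 - smoothing) * alpha + smoothing * uniform
        history.append(alpha)
    return np.mean(history, axis=0)
```

Domains where the proxy is furthest behind the reference keep getting more sampling mass, which is the "minimize the worst domain" behaviour described above.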
Does it work?
Yes, as you can see at just 80K steps the model is performing better than baseline at 160K
I KEEP ON FORGETTING THAT I ACTUALLY NEED TO LEARN ABOUT DPO AND PPO AND OTHER SHIT (although I did make an attempt earlier in this doc for DPO)
Why did I suddenly have this thought? Cause they released DPO fine tuned models for SD which are actually good at prompt following!!! Ya khuda, no more ((masterpiece)), ((high-res)), 5 fingers, bullshit.
[2311.12908] Diffusion Model Alignment Using Direct Preference Optimization
[2312.11514] LLM in a flash: Efficient Large Language Model Inference with Limited Memory
So Apple is def serious about running them on device. What they are trying to achieve is running models larger than the available memory. TBH, we already run all programs with memory usage larger than RAM (using virtual memory and paging)
Why can’t we have a similar thing for LLMs? Cause it’ll make inference slower.
But smol devices have a big advantage, they simply use flash storage which is quite fast if used in the right manner. The focus of this paper is on this part.
They are relying on two important properties here:
1. The FFN (not attention) weights of most LLMs are sparse
2. The time to first byte from disk > time to read data
Now, to leverage these two props
To read only non-zero weights, they are using a low rank predictor that simply tells beforehand which neurons will give positive activations.
To read more data, they load both up_proj and down_proj matrices and keep them in a single row.
When I read the benchmark setup of this paper tho, I get an ick.
The memory management def seems neat - they ensure continuous allocation in a pre-allocated memory region.
[2310.07704] Ferret: Refer and Ground Anything Anywhere at Any Granularity
Initially I thought this paper was mid cause it was using Vicuna which hasn’t been sota for like a year now. But now that I read it, the point is not about the base model, it’s about the technique being used to ground the model.
If you’ve read the GPT-4V, Gemini or LLaVA whitepapers, you will know the achilles heel of all these models is the inability to parse as well as create bounding boxes correctly. Tbf, GPT-4V can still parse but creating is not its strong suit.
Here, they are trying to tackle only the bounding box parsing part. Most of the magic is in how they represent it in the first place.
They are using a special MLP based sampler which, instead of representing the box as a set of coordinates, represents it using a set of modified features.
Paper page - Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
It presents common architectures powering the Text to image space and what are the common bottlenecks. It’s by FAIR meta so actually useful content rather than it being a blog post in PDF form.
Deepseek, YAYI-30B, WaveCoder
Seems like I should invest some time to learn about dataset creation and filter pipelines. Lots of alpha in case I actually choose to earn money via ML consulting
LISTEN TO ME RN!!!
EDGE AI
THAT’S IT
VTUBER WAIFU IN YOUR GLASSES
JUST BET ON MOBILE COMPUTE GETTING BETTER AND BETTER!!!
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper page - TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
Loved this cause they actually tried to optimise the model to be able to run on snapdragon devices.
Apologies for the phone screenshot I was reading this in the mall while waiting for someone
Only reason you should read dataset papers is to just find out what they use to clean up the set. More often than not I just skip to that section.
For contamination detection:
We employed line-level exact match detection for both our corpus and test sets, as the questions in these benchmarks are generally brief and often contained within a single line. Specifically, we split documents into lines, hashed each line using MD5, and took the first 64 bits along with the corresponding line to form a set. This procedure was also applied to the constructed reference test set collection. If a line from the test set, along with its corresponding hash code, is found in the training set’s corresponding set, and the length of the line is over 50 characters,13 we classify it as a leaked sample with an exact match
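A minimal sketch of the line-level check quoted above, using only the stdlib. The function names and the whitespace stripping are my assumptions; the MD5 hashing, the first 64 bits and the 50 character cutoff come from the quote.

```python
import hashlib

MIN_LINE_CHARS = 50  # the quoted passage only flags lines longer than 50 characters


def line_key(line):
    """Return (first 64 bits of the MD5 digest, the line itself)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return digest[:8], line


def build_line_set(documents):
    """Split every document into lines and collect their (hash, line) keys."""
    keys = set()
    for doc in documents:
        for line in doc.splitlines():
            line = line.strip()  # assumption: ignore surrounding whitespace
            if line:
                keys.add(line_key(line))
    return keys


def find_leaked_lines(train_docs, test_docs):
    """Return test-set lines that also appear verbatim in the training corpus."""
    train_keys = build_line_set(train_docs)
    leaked = []
    for doc in test_docs:
        for line in doc.splitlines():
            line = line.strip()
            if len(line) > MIN_LINE_CHARS and line_key(line) in train_keys:
                leaked.append(line)
    return leaked
```

Keeping the line next to its truncated hash means a 64-bit collision can never produce a false positive, at the cost of a bigger set.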
Paper - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
This is quite close to what Gemini is doing for multimodal gen
Uses llama tokenizer for text
Uses the concatenated output from 2nd and 2nd last layer of ViT for images
For audio, it is a bit more complex but I am pretty sure it would be like 1-2 ffmpeg commands. The main funda is that you just need to 1. encode and 2. project into a 1-d space
Paper - Paper page - DocLLM: A layout-aware generative language model for multimodal document understanding
Just like the top comment on this page I was expecting this paper to be a snoozefest (banks, pffttt)
But they really tried something other than prompt engineering or finetuning llama (tbh it is llama2-7B but not vanilla arch)
That being said, some details are def missing from this paper cause well, it’s a bank so needs to think everything they do is a secret
Basically they first use some OCR code (plenty of libs in OSS), to get text and their corresponding bounding boxes from the docs.
Next - they need some way to encode these bounding boxes (not mentioned clearly in the paper), once encoded they change the attention block to use different Q and K matrices for bounding boxes. You can think of this part like a simplified cross attention.
Seems like a follow up paper to this - Paper page - DocGraphLM: Documental Graph Language Model for Information Extraction
[2401.00368] Improving Text Embeddings with Large Language Models
Dataset Best Papers
Hallucination minimisation and Refusal on not knowing the answer
Transformers from a Maths perspective (not including finbarr, eleuther ai maths)
Vikp’s work with dataset prep and related stuff
Factual Grounding methods
DreamTuner - ipadapter but different
https://arxiv.org/pdf/2312.13789.pdf - how i beat the big wigs
[2312.09608] Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models - faster stable diffusion by skipping unnecessary bits
Official implementations for paper: Anydoor: zero-shot object-level image customization - instruct edit + https://old.reddit.com/r/StableDiffusion/comments/18kd0na/code_for_anydoor_zeroshot_objectlevel_image/
dont forget - diffusion slider demo https://github.com/Kevin-thu/DiffMorpher?tab=readme-ov-file
https://arxiv.org/pdf/2312.01943.pdf - i should use this for anime
TextDiffuser 2 - a Hugging Face Space by JingyeChen22 - people have been asking for text.. right?
GitHub - cumulo-autumn/StreamDiffusion: StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation - turbo go fast
wowowowow
Paper page - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache - Mostly engineering hence I love it
[2401.01325] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning - cause surya implemented it and it worked, high signal
[2401.02954] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism - Mostly to see scaling laws and hyperparams choice
[2401.00588] Fairness in Serving Large Language Models - Scheduler by lmsys
Paper page - Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws - hmmm
[2401.01335] Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models - someone was talking about in a GC, high signal
gonna spend more time writing LLM code now, this is a fundamental blocker now in my path forward to actually becoming a good assistant to ML bros
I was playing around with Tencent Photomaker for a few days. I am blown away by how good it is with faces so I naturally went ahead and read their paper
[2312.04461] PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
What I realised after reading it is that I really do not understand how it works.
What they are doing is pretty simple, they have trigger words and class words
The trigger words can be something like img and class words can be man, woman, boy etc.
The class word should always be followed by the trigger word.
Whenever a trigger word is encountered in the prompt, they remove it, take the features of the class word preceding it and then do the tensor jujitsu
What technique? They take the embeddings of the face images you uploaded, project them into the same dimensions as text embeddings and then fuse them with the embeddings of the class word.
This is the part that kept on worrying me, like how does this even work, why is face information simply not lost, I was honestly expecting a control net like thing.
Then I proceeded to read their code and that made me question my skills further. Reason being I was not familiar with a few torch methods they were using and also the code is shitty in general.
Then in the paper they mentioned that it actually works simply cause SD already has Cross-attention which takes care of mixing face id info with image.
So rn I am going through the whole SDXL architecture in colab (it’s embarrassing that I haven’t done that at all) and trying to understand this flow. The fact that I don’t know this already is so baaaaaaaad.
Today I was trying to verify if they are lying to us in SDXL paper. Turns out I am just dumb and forgot they concatenate the embeddings from two CLIP models
Paper - https://arxiv.org/pdf/2401.12945.pdf
Video gen model by goog. They are not relying on the SVD (Stable Video Diffusion and not Singular Value Decomposition) way to generate only keyframes and interpolate b/w them. They instead generate all frames in a single pass.
I am worried tho that this will have huge mem requirements though. Will read the paper to understand more.
Ok, so the way they are avoiding huge compute requirements is basically via doing temporal convolutions on very small latents.
Once you have video at coarse resolution it is upscaled using MultiDiffusion (whatever that is, I need to read)
Lots of alpha in this paper, asian bros I kneel
Thanks for being inclusive for us Java programmers
You would think it’s just topo sort right? But then us programmers are so shit we do cyclic dependencies such as passing factory instance to implementation so it can create new instances for some recursion hell.
Worry not, the asian bros know about this practice. Interesting thing that they kept the file path info in comments as well, not sure why that is important.
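The paper's exact ordering algorithm isn't reproduced in these notes, so here is just a rough sketch of "topo sort that survives cyclic dependencies". The `deps` input format and the function name are hypothetical. It is a DFS post-order that skips any edge that would close a cycle.

```python
def dependency_order(deps):
    """Order files so dependencies come before dependents.

    deps maps a file to the list of files it imports (hypothetical format).
    Edges that would close a cycle are skipped instead of raising an error.
    """
    order = []
    state = {}  # 1 = currently being visited, 2 = finished

    def visit(node):
        if state.get(node) == 2:
            return
        state[node] = 1
        for dep in deps.get(node, []):
            if state.get(dep) == 1:
                continue  # back edge: dep is already on the current path
            visit(dep)
        state[node] = 2
        order.append(node)

    for node in sorted(deps):
        visit(node)
    return order
```

Every file still shows up exactly once, so the concatenated repo-level sample stays complete even when a factory and its implementation point at each other.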
They then dedup at repo level (i.e. concatenating all the files of the repo and applying some near-dedup algo). It should be similar to cosine similarity but I need to see the exact implementation for this. Expecting something similar to this - Large-scale Near-deduplication Behind BigCode
Using smol models to estimate loss curve trends / accuracy of larger models should be widely known. Don’t waste compute unnecessarily.
I will never not be amazed that CoT actually works in the LLMs
Paper - [2308.06721] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Been using this model and impressed by its performance in local setup. It does a really simple thing tbh, where it allows you to prompt a diffusion model with images (not to be confused with i2i as it simply influences the style rather than the exact outlines)
Their approach is quite simple. Diffusion models already use Cross Attention to condition the latents using text.
You can just add another cross attention layer that instead of using text embeddings as key and value, uses image embeddings as key and value
Once done, you can add the text and image cross attention output and get the final result
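To make the "two cross attentions, then add" idea concrete, here is a toy single-head version over plain Python lists. It is a sketch, not the IP-Adapter code: the learned query/key/value projections are left out, and `image_scale` is a knob I am assuming for how strongly the image prompt counts.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain lists."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def decoupled_cross_attention(latents, text_kv, image_kv, image_scale=1.0):
    """Run text and image cross attention separately, then add the outputs."""
    text_out = attend(latents, *text_kv)
    image_out = attend(latents, *image_kv)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]
```

With `image_scale=0.0` this collapses to plain text cross attention, which is the base model's behaviour without the adapter.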
Paper - [2401.14953] Learning Universal Predictors
This paper has too much math but it is expected because of what they are trying to prove.
I am bad at it so taking help of GPT to guide me through the explanations (embarrassing myself (complementary)) - https://chat.openai.com/share/fdaaa434-011b-4da6-a3a2-d69ddc8c3180
The primary idea that they are trying to put forward is that LLMs themselves can do meta learning (which is a fancy way of saying that if you give them context, they can figure out a solution from that context even if they haven’t seen the problem before). And the way to make meta learning efficient according to them is to have a diverse enough set of problems already in pretraining.
To prove this, they use Brainfuck (yeah, really, brainfuck) as the primary language for its simplicity (as in it just reads from one place and writes to another with a small working memory), as opposed to langs like java which can read from multiple channels and do random access.
For problem diversity, they ensure tasks of varying complexity are part of the training dataset
I still feel overwhelmed by the mathematical terms in this paper, gonna switch to reading code instead https://github.com/google-deepmind/neural_networks_solomonoff_induction/tree/main
So the code is really simple tbh, they are just sampling tasks with Brainfuck lang code. The samples are ensured to be diverse enough via randomisation.
This is done per batch and then used to train the network
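A toy version of "sample random Brainfuck programs per batch", just to show the shape of the idea. This is not the repo's code: the op set (no input op), the tape size, the step limit and the function names are all my assumptions.

```python
import random

OPS = "+-<>[]."  # assumption: input-free Brainfuck, output only


def run_bf(program, max_steps=200, tape_len=16):
    """Run a program; return its output cells, or None if brackets don't match."""
    stack, jump = [], {}
    for i, op in enumerate(program):
        if op == "[":
            stack.append(i)
        elif op == "]":
            if not stack:
                return None
            j = stack.pop()
            jump[i], jump[j] = j, i
    if stack:
        return None
    tape, ptr, pc, out, steps = [0] * tape_len, 0, 0, [], 0
    while pc < len(program) and steps < max_steps:
        op = program[pc]
        if op == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == ">":
            ptr = (ptr + 1) % tape_len
        elif op == "<":
            ptr = (ptr - 1) % tape_len
        elif op == ".":
            out.append(tape[ptr])
        elif op == "[" and tape[ptr] == 0:
            pc = jump[pc]
        elif op == "]" and tape[ptr] != 0:
            pc = jump[pc]
        pc += 1
        steps += 1
    return out


def sample_batch(batch_size, program_len=12, seed=0):
    """Sample random programs and keep the valid ones that print something."""
    rng = random.Random(seed)
    batch = []
    while len(batch) < batch_size:
        program = "".join(rng.choice(OPS) for _ in range(program_len))
        out = run_bf(program)
        if out:
            batch.append((program, out))
    return batch
```

The step limit matters because random programs loop forever all the time, so you need some cutoff to get training sequences out of them at all.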
Then the nets are evaluated against other types of tasks (specifically chomsky hierarchy ones and CTW ones). We observe the accuracy does increase for each model as their sizes are grown even if they are not trained on this dataset (hence, AGI)
As a followup I would also need to read [1905.03030] Meta-learning of Sequential Strategies so i can understand the whole meta-learning thing a bit better
https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE
Paper - [2312.12491] StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
Allows you to either use SD + LCM lora OR SD-Turbo to do insanely fast generation
Doesn’t support SDXL yet
How come it is so fast tho?
Turns out it's just simple old optimisations you would do for your Java app as well
- multiple threads to offload lightweight computations (such as encoding/decoding images)
- pipelining with batches to allow processing multiple images in a single pass
- cache to store precomputed text prompt embeddings as well as KV for those embeddings in cross-attention layer
- a bit of magic (tensorrt)
Paper from Apple (along with the code, goddamn). Imagine those cool demos at google io where you can magically remove things from photos. Well this is a superior version of that.
Not only you can add/remove/move anything you want, but you can also do it via simple natural language instructions instead of using your fingers.
The basic architecture is quite simple - they are just using a VLM (or as they say MLLM) to get description of the image with the edits requests. E.g. What would the image look like if I add a pizza on the table.
Now the answers of VLMs can be quite long, so they have this magical summariser that shortens the answers to make them more precise.
Now when you get this answer, they also add some [IMG] tokens. These tokens are used as input to another sequence model that transforms them into embeddings.
Now you can use I2I mode of diffusion models. But we condition this image not on the prompt but on the embeddings we got in the previous step (using inbuilt cross-attention of most diffusion models)
Now they train this whole pipeline to minimise two losses -
Paper - [2205.13147] Matryoshka Representation Learning
Hearing this term too frequently in the contexts that are most definitely not related to dolls (if they are then it’s concerning)
Basically in the existing embedding models, you get floating-point vectors of fixed lengths (1536 for openai, 768 for clip, etc.). However sometimes you wish you had smaller vectors (to tradeoff accuracy for latency). Changing models just for embedding length sounds tedious
Enters Matryoshka
The concept is pretty simple, generally you train embedding models by training classifiers and then taking hidden layer representation.
Now what if instead of just using the full output of the last layer to compute logits and calculate loss, you instead did it for multiple length vectors e.g. 8,16,32,64…2048
Then you simply added up the loss for each of these and tried to gradient descent on this cumulative loss.
Well, yes, it’s as simple as that and it works perfectly. Using this loss formulation, the network tries to compress any coarse info in the smaller length vectors and then proceed to finer ones later on.
Another interesting usecase for this mentioned in the paper is adaptive retrieval. So basically you use only the first N floats to perform the similarity search and then use the first M floats to re-rank the retrieved results, where M >> N. This allows you to make your queries significantly faster while not sacrificing accuracy.
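Here is a small sketch of that adaptive retrieval flow in plain Python, assuming the embeddings are Matryoshka-style so that any prefix is itself a usable embedding. The function and parameter names are made up for illustration, and a real setup would use a vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_retrieval(query, corpus, shortlist_dims, rerank_dims, shortlist_size, top_k):
    """Shortlist with a short prefix of each vector, re-rank with a longer one."""
    q_small = query[:shortlist_dims]
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_small, corpus[i][:shortlist_dims]),
        reverse=True,
    )[:shortlist_size]
    q_big = query[:rerank_dims]
    reranked = sorted(
        shortlist,
        key=lambda i: cosine(q_big, corpus[i][:rerank_dims]),
        reverse=True,
    )
    return reranked[:top_k]
```

The cheap pass touches every document but only a few floats each, while the expensive pass touches only `shortlist_size` documents.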
Paper - [2402.09371] Transformers Can Achieve Length Generalization But Not Robustly
Just me (an outsider) trying to understand what might have gone into gemini 1.5 pre training so that it can generalize to 10M context length even with limited training.
Tbh this paper is not the sauce as it is too simple plus tries to verify results only on a small problem
3rd is just doing 321 + 654 instead of 123 + 456
4th is just doing 3c2b1a + 6c5b4a instead of 321 + 654
2nd is instead of using encoding position 1 as vector of 1, 2 as vector of 2 and so on
What if you took a random set of positions from length L (which will be much much more than maximum context tokens we will feed to this model)
Then you sort those random positions in ascending order and assign it to each actual position.
E.g. sampled position are 4, 7 so 1st and 2nd tokens get assigned the embeddings of 4th and 7th
We can then train this network only for length N but use the same technique during prediction for length M >> N.
1st is FIRE which iiuc uses an MLP to learn embeddings for each position instead of using some fixed function like linear or sinusoidal
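The randomised position trick (the 2nd item above) is tiny in code. A sketch with hypothetical names: `max_position` plays the role of L, and the ids are 0-indexed here.

```python
import random


def sample_positions(seq_len, max_position, rng=random):
    """Pick seq_len distinct position ids from [0, max_position), sorted ascending."""
    if seq_len > max_position:
        raise ValueError("max_position must be at least seq_len")
    return sorted(rng.sample(range(max_position), seq_len))
```

Token i then looks up the positional embedding at `positions[i]` instead of at i, so during training the model already sees position ids far beyond the training length.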
Paper - [2402.08268] World Model on Million-Length Video And Language With RingAttention
Paper - [2212.09748] Scalable Diffusion Models with Transformers
Since Sora has dropped and hints at using similar architecture (with a lot of magic), good idea to go through this
Architecture wise there’s not a lot going on here. Other than standard diffusion models, there are two changes
For patching, simply use MLP to convert it into embeddings. Once you get the embeddings, you also apply positional encoding using sinusoidal frequency based version
Then you proceed to predicting the tokens using the transform block. They use N transform blocks (where N varies according to model size i.e. S, B, L etc.) They also try different types of DiT blocks primarily to introduce text based conditioning.
Ultimately after ablation they went with the block that uses Adaptive layer norm with zero init for gammas (Note for me, I should read more on this)
What is the Adaptive layer norm?
Paper - [2403.03206] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Interesting part is using a third text encoder T5 and then using separate weights for it instead of simply piping it in cross-attention
Synthetic datasets are now becoming part of Visual models as well. Might be time to create some simple repo to make this seamless.
Increasing channels improves the performance of the models
Paper - [2403.05525] DeepSeek-VL: Towards Real-World Vision-Language Understanding
Strangely Teorotaxes shilling deepseek on my TL almost every hour has worked and I have developed an affection for their smol models that perform great in benchmarks as well as are useful practically
I am also reading this paper so I can steal moondream’s alpha without selling my house to hire vikhyatk
For datasets, they have a pretty diverse mix that covers web UI screenshots, OCR and chart parsing. This is good as simple image text pairs is not good enough for practical utility of these models
Interesting choice for the architecture to use two image encoders and not one
First they use SigLIP for semantic information. However siglip uses lower dimensional latents so the details get lost
For preserving details they use VitDet based on SAM-B that can accept higher dimensional latents.
The output of both of these encoders is concatenated before being passed via MLP
Paper - https://arxiv.org/abs/2403.07750
Everyone is now aware about the usefulness of synthetic datasets for training LLMs. Even the latest and greatest Claude-3 has some synthetic datasets in its training set
The next natural step for this is synthetic datasets for multimodal LMs.
And the easiest modality to start with is images i.e VLMs
A simpler version of this would be simply generating captions for existing images using some other VLM
A harder version would be generating both image - text pairs since generating images is costly as well as might not be great in quality and adhere to prompt really well
This is how they are doing it
Here they pre-train their own image gen model so as to eliminate the effect of human-annotated VLM datasets on the experiments. This step is not necessary for practical purposes imo. They are using Muse as the primary image gen here since it’s a transformer based architecture.
Paper - [2403.01779] OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
The outfit fitting problem seems simple at first glance - you just have to inpaint right?
But the issue becomes more apparent as you dive deeper
E.g. How do you make sure the clothes and brands are represented exactly the same? How do you account for weird body poses?
There are multiple approaches that exist currently to solve this problem. The authors propose a new one that’s far more accurate
So basically they train one more u-net in parallel to SD. This unet only gets the garment as well as some text label as input. We however, do not use it to generate anything. What we do is simply take the inputs to the spatial attention layer in the unet and then concatenate them with the inputs to the spatial attention layer of Stable diffusion along the width. This is what they call outfitting fusion
This would be confusing so you can check the following lines in the official code
Getting query from attention layers of outfitting unet - https://github.com/levihsu/OOTDiffusion/blob/344112ad1c03c2af1cf7a1f07d689b18af4c175a/ootd/pipelines_ootd/attention_garm.py#L234
Concatenating query with denoising unet -
Parent that calls the former and then passed it to the latter -
https://github.com/levihsu/OOTDiffusion/blob/344112ad1c03c2af1cf7a1f07d689b18af4c175a/ootd/pipelines_ootd/pipeline_ootd.py#L373
Paper - [2403.09611] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
I have no hope from apple’s ML teams to make a sota model. Reading this paper only cause they do ablations
For data they found that keeping almost an equal mixture of interleaved records along with image-text pairs gives best capability. You also need some text only data to preserve language capability of llms.
Second approach is interesting to support higher resolution while not sacrificing speed
Paper - [2403.09629] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Afraid to say but they use too many buzzwords for me to understand EXACTLY what they are doing and they don’t even have code as well.
That being said, this is what I gather
Teacher forcing - Nothing but using data from training set as input for next token as opposed to output from previous steps https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
Non-myopic loss - Calculate loss based on all future tokens instead of just next token
STaR - simply ask model to generate a rationale (using prompt) and then an answer instead of directly generating an answer https://arxiv.org/abs/2203.14465
Ok, no worries, simply had to upload paper PDF in opus to crack it
Paper - https://arxiv.org/abs/2403.07815
Why would amazon do this when you can use ARIMA or other lightweight shit?
No idea, but it’s fun
Tbf, they are not doing something insanely genius.
What they are doing is following
They also show improvements with using synthetic data by augmenting time series with noise instead of just training data
Synthetic data gen for time series
Paper - [2403.03507] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
So their idea is the following, you already have LORA that uses low-rank matrices A&B to update the weights.
Here they propose that instead of using separate A&B matrices, you can simply keep on reducing the rank of gradients and still get the same level of accuracy while using even lower memory.
They prove two things using mathematical voodoo: first, that the gradient matrices indeed tend to have lower rank as training progresses, so you don’t need to store the whole gradient, and second, that if you use low-rank matrices, the loss still converges to a minimum value.
Another interesting thing is that they keep on recomputing the low-rank projections but not at every timestep. Only when T steps have passed, the projections are updated.
The code is actually quite simple. Generated it using claude (so that no one has to bother with weird maths symbol in the paper)
import torch

class GaLoreAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, rank=1, scale_factor=0.25, freq=200):
        self.params = list(params)
        self.lr = lr
        self.betas = betas
        self.eps = eps
        self.rank = rank
        self.scale_factor = scale_factor
        self.freq = freq
        self.step_count = 0
        # Adam moments live in the projected (rank x n) space, so they are
        # created lazily once the shape of the projected gradient is known
        self.m = [None] * len(self.params)
        self.v = [None] * len(self.params)
        self.P = [None] * len(self.params)

    def step(self):
        with torch.no_grad():
            for i, p in enumerate(self.params):
                grad = p.grad
                if grad is None:
                    continue
                # NOTE: this sketch only handles 2-D weight matrices (no biases / norms)
                if self.P[i] is None or self.step_count % self.freq == 0:
                    # Refresh the projection every `freq` steps from the
                    # top-`rank` left singular vectors of the current gradient
                    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
                    self.P[i] = U[:, :self.rank]
                P_t = self.P[i]
                R_t = P_t.T @ grad  # projected gradient, shape (rank x n)
                if self.m[i] is None:
                    self.m[i] = torch.zeros_like(R_t)
                    self.v[i] = torch.zeros_like(R_t)
                m, v = self.m[i], self.v[i]
                # Plain Adam, but on the small projected gradient
                m.mul_(self.betas[0]).add_(R_t, alpha=1 - self.betas[0])
                v.mul_(self.betas[1]).addcmul_(R_t, R_t, value=1 - self.betas[1])
                m_hat = m / (1 - self.betas[0] ** (self.step_count + 1))
                v_hat = v / (1 - self.betas[1] ** (self.step_count + 1))
                N_t = m_hat / (torch.sqrt(v_hat) + self.eps)
                G_t = self.scale_factor * P_t @ N_t  # project back to (m x n)
                p.add_(G_t, alpha=-self.lr)
            self.step_count += 1
Paper - [2403.07691] ORPO: Monolithic Preference Optimization without Reference Model
So what they are proposing here is a way to RLHF the model without using any reward model as well as the need for a separate preference phase. They tell us you can simply do it along with SFT
The algorithm proposed is really simple. They use the log prob of the chosen and rejected samples from the model and then use it to calculate the odds ratio
Once you have the odds ratio, you can formulate a second loss term based on that, scale it down by a factor lambda and then add it to SFT loss
Now you simply train the model using the gradient from this combined loss
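Written out, the recipe above looks like this. The symbols are my notation: $y_w$ is the chosen response, $y_l$ the rejected one, $\sigma$ the sigmoid and $\lambda$ the scaling factor.

```latex
\begin{align}
\mathrm{odds}_\theta(y \mid x) &= \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} \\
\mathcal{L}_{OR} &= -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \\
\mathcal{L}_{ORPO} &= \mathcal{L}_{SFT} + \lambda \, \mathcal{L}_{OR}
\end{align}
```

Here $P_\theta(y \mid x)$ is the probability of the whole response, not of a single token. In practice it is usually computed as the exponential of the average per-token log prob.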
They show that the model is actually learning to adhere to preference and keeps on lowering reject samples log prob.
They also show that the trained model produces responses that are quite same (i.e. closer to reward) given the same prompt (even tho temperature is kept as 1.0)
They also show that trained models do change their responses significantly if you change the prompt meaning they are not over-optimised towards a single reward and do adhere to instructions
Implementation from claude:
import torch
import torch.nn as nn
from torch.nn import functional as F

class ORPOLoss(nn.Module):
    def __init__(self, lambda_or=1.0):
        super().__init__()
        self.lambda_or = lambda_or

    @staticmethod
    def avg_log_prob(logits, labels):
        # Length-normalised log P(y|x); positions labelled -100 (prompt / padding) are ignored
        logits, labels = logits[:, :-1], labels[:, 1:]  # token t predicts token t+1
        mask = labels != -100
        log_probs = F.log_softmax(logits, dim=-1)
        token_logps = torch.gather(log_probs, 2, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
        return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1)

    def forward(self, logits_chosen, labels_chosen, logits_rejected, labels_rejected):
        logp_chosen = self.avg_log_prob(logits_chosen, labels_chosen)
        logp_rejected = self.avg_log_prob(logits_rejected, labels_rejected)
        # Compute SFT loss (negative log-likelihood of the chosen response)
        sft_loss = -logp_chosen.mean()
        # Compute L_OR loss: log odds(y|x) = log p - log(1 - p)
        log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
        log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
        l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
        # Combine SFT and L_OR losses
        total_loss = sft_loss + self.lambda_or * l_or.mean()
        return total_loss

# Training loop
model = ...  # Initialize your model
optimizer = ...  # Initialize your optimizer
orpo_loss = ORPOLoss(lambda_or=0.2)  # Initialize ORPO loss with lambda_or value
for epoch in range(num_epochs):
    for batch in dataloader:
        # Assumed batch layout: prompt+chosen and prompt+rejected are tokenised separately
        chosen_ids, chosen_mask, labels_chosen, rejected_ids, rejected_mask, labels_rejected = batch
        # Forward pass (one per response, they are different sequences)
        logits_chosen = model(chosen_ids, attention_mask=chosen_mask).logits
        logits_rejected = model(rejected_ids, attention_mask=rejected_mask).logits
        # Compute ORPO loss
        loss = orpo_loss(logits_chosen, labels_chosen, logits_rejected, labels_rejected)
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
Paper - [2403.14599] MyVLM: Personalizing VLMs for User-Specific Queries
Aim is to add personalisation to any existing visual language model or train one from scratch. Mostly they want to do this to capture snap userbase’s image and other objects accurately.
Most of such ideas revolve around playing with embeddings which is the case here as well.
They run a face detection model, figure out whose face it is using cosine similarity from existing database and then pass those embeddings and the metadata to the VLM
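The face lookup part is basically nearest neighbour with cosine similarity. A sketch with made-up names, where the `threshold` for "unknown person" is purely my assumption:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def identify_face(face_embedding, known_faces, threshold=0.6):
    """Return the best matching name in known_faces, or None if nothing clears threshold."""
    best_name, best_score = None, threshold
    for name, embedding in known_faces.items():
        score = cosine(face_embedding, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

The returned name (plus whatever metadata you keep per person) is what gets handed to the VLM along with the embeddings.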
They also have object detection models but here they use classifiers instead of embeddings. The classifiers they train are pretty fine grained e.g. recognise which type of dog it is rather than just say it’s a dog
This paper contains a lot of hyperparameters in the appendix so better to go through them before jumping to implementation.
The QFormer here is only used in case of BLIP-2, when used with llava, they simply remove it and append the embeddings to the ones after the projection MLP
One thing I liked about this paper is that they have actually done a thorough analysis if this approach is even sound mathematically or not. Plenty of nuggets in the paper where they scaled the embeddings up/down to make sure their effects are not over/under emphasized
Paper - [2403.18802] Long-form factuality in large language models
Nothing groundbreaking here imo
They use LLMs to split out facts from a given generated passage and then use google search to verify if those facts are correct or not
It is simply working now cause the models are bigger and much smarter.
All the alpha is in Appendix as well as one section where they discuss precision recall of LLMs
Paper - [2404.16710] Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
Quite an important paper imo (cause of my bias when it comes to matryoshka embeddings)
The premise is that not all tokens require you to go through all layers of the models (this has been explored in other papers as well). Some tokens are much easier to predict than others
So what you can do instead is that directly take embeddings from an intermediate model layer and pass it through lm_head to get predictions.
But what if we predict something wrong? Well for that, instead of using normal auto-regressive decoding you can use speculative decoding. You simply keep on predicting using earlier layers and then keep on correcting it using the output from all the layers. The correction is fast since there’s no need to do that auto-regressively.
Also if we use the same model to correct, we should already have KV cache from initial few layers, so effectively we only need to compute the remaining layers.
The only issue with this idea is that it requires training the model in a way so that earlier embeddings are still good enough (basically by incorporating their loss scaled down by a factor in the final loss) plus the training itself is more costly
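A toy greedy version of the draft-then-verify loop, to show why the output still matches the full model. Both callables are stand-ins I made up: `draft_next` would be the early-exit head and `full_next` the full stack. A real implementation verifies all draft positions in one batched forward pass and reuses the KV cache.

```python
def self_speculative_decode(prompt, draft_next, full_next, num_new_tokens, draft_len=4):
    """Greedy self-speculative decoding over token lists.

    draft_next(tokens) -> next token from the early layers only (cheap).
    full_next(tokens)  -> next token from all layers (expensive, trusted).
    """
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    while len(tokens) < target_len:
        # 1. Draft a few tokens auto-regressively with the cheap head.
        draft = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. Verify the draft against the full model, left to right.
        accepted = []
        for tok in draft:
            correct = full_next(tokens + accepted)
            accepted.append(correct)
            if correct != tok:
                break  # first mismatch: keep the full model's token, drop the rest
        tokens.extend(accepted)
    return tokens
```

Each round adds at least one token the full model agrees with, so a bad draft costs speed but never changes the output.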
They are training two models here and then merging them
The first model is pretty obvious, given an answer by an LLM, rate how good it is from 1 - 5. However to improve the response they do 2-3 prompt engineering tricks which can be used without this model as well. Instead of just providing initial instruction and answer to generate a score, they also provide a reference answer in the prompt along with an explicit evaluation criteria rubric. Also, instead of only generating the score, they ask the model to generate a verbal explanation first.
The second model they train is a pairwise ranking one where along with instruction you give two answers and ask LLM to tell which is better. Again they do all the prompt engineering tricks mentioned before but with small variations e.g. evaluation criteria doesn’t contain scores, generate two verbal feedback that compare them with criteria
I will never intuitively understand why model merging works but I have made my peace with it.
Merge cheatsheet hidden in the paper, woohooo, till now I only pretended to know what they were. All seem simple
Paper - [2402.13929] SDXL-Lightning: Progressive Adversarial Diffusion Distillation
Asian bros are amazing. They are competing with each other in the same company to give us the best possible models.
And so from Bytedance, we get two great ones to make SD inference faster
One SDXL lightning and other is HyperSD
The goal is to train a lora/model that can help you to generate high quality images from SDXL with just 2 steps. They do this via distillation with an adversarial objective (you can check the SD turbo to see what’s that). Basically there’s a discriminator model that is trained to predict if an image is generated by teacher or student. And then you have a student model. We want to minimize the probability gap of the discriminator output between images generated by teacher and student
For discriminator they are using existing UNet of SD models and nothing fancy
I wasn’t aware about this flaw in noise schedule at all but it’s so apparent now. Good thing is that it does make these models slightly better for image to image (since you don’t start from pure noise in that case)
Some of the unique things they do to get high quality 1step and 2step models
Both use LAION and COCO datasets to train image models in case you also want to train yours
Paper - [2404.13686] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
This is also an attempt by Bytedance bros to make SD inference faster. However the approach they are following here is more similar to how consistency models are cooked instead of Distilled models.
What’s the primary diff between these two approaches?
Well in consistency models, instead of predicting the noise and doing diffusion, you generally predict the latent itself at the t=0 i.e. the final image
This as you can assume doesn’t work well a lot of times since predicting something at t=0 is quite hard
That’s why the authors propose a different path where instead of predicting directly at t=0, they break down the timesteps into K segments (where k is varied) and then predict only one segment at a time
They also have the same conclusion as the lightning paper that MSE loss is more effective when k is smol while adversarial is more effective for higher ks.
CFG remains the achilles heel of fast SD implementations. In practice I have never been able to get decent results with value above 2.0
However, they did release some checkpoints to fix it recently
Paper it is based on - [2203.03041] Highly Accurate Dichotomous Image Segmentation
Paper - [2405.14857] Semantica: An Adaptable Image-Conditioned Diffusion Model
by google deepmind. Can’t make up my mind if it is mid or interesting.
To me it seems like an attempt to make a powerful IPAdapter alternative i.e. condition the output based on input images. However here they want to influence the output more strongly plus support datasets that the model never saw during training, via in-context examples. E.g. if you train a model on animals but don’t include elephants in the dataset, and then give it elephant images as the input prompt, can it generate one itself?
Most of the magic of this paper lies in the dataset. It contains multiple images that can be used in a single context. By URL here they refer to the wikipedia URL or some webpage from which all images were taken.
Source - https://aider.chat/2024/05/22/swe-bench-lite.html
https://aider.chat/2023/10/22/repomap.html
These are simple yet useful blogs. Not going to add a lot of info here, better to go through them in 10 minutes.
Mostly involves using AST based fetcher to fill the context instead of simple similarity search based
Also includes multiple other optimisations like prompt engineering and some vague code editing backends.
Also if you want to create AST (Abstract Syntax tree) of your repo simply use - https://tree-sitter.github.io/tree-sitter/
Basic funda here is that you use search to improve the answers of LLMs instead of fine tuning or making them larger.
They use a lot of ml mumbo jumbo but what they are doing is quite simple.
You first generate an answer using LLM.
Then you ask the LLM to get critical feedback for that answer.
Then you ask the LLM to generate an improved answer based on this feedback.
Then you ask LLM to grade the answer from -100 to 100.
You use this grade to update the scores of the answer and its parents.
Finally, in the next iteration you select the children with best score and repeat the same process.
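The loop above can be sketched like this. It is a simplified greedy version under my own assumptions: `generate`, `critique`, `improve` and `grade` are hypothetical callables wrapping LLM calls, and node selection just takes the best average score instead of whatever tree policy the paper uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    answer: str
    parent: Optional["Node"] = None
    scores: list = field(default_factory=list)

    @property
    def score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


def refine_search(question, generate, critique, improve, grade, iterations=3):
    root = Node(generate(question))
    root.scores.append(grade(question, root.answer))
    nodes = [root]
    for _ in range(iterations):
        best = max(nodes, key=lambda n: n.score)            # pick the best node so far
        feedback = critique(question, best.answer)          # ask for critical feedback
        child = Node(improve(question, best.answer, feedback), parent=best)
        reward = grade(question, child.answer)              # grade in [-100, 100]
        node = child
        while node is not None:                             # update the answer and its parents
            node.scores.append(reward)
            node = node.parent
        nodes.append(child)
    return max(nodes, key=lambda n: n.score).answer


# Stub "LLM" so the sketch runs without any API: longer answers get better grades.
result = refine_search(
    "q",
    generate=lambda q: "v0",
    critique=lambda q, a: "be more detailed",
    improve=lambda q, a, f: a + "+",
    grade=lambda q, a: min(100, 10 * len(a)),
)
print(result)
```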
Paper - [2402.14905] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
This paper talks about architecture changes which you can make to smol llms to reduce the memory requirements while also increasing their accuracy
Most important observation is that deeper nets are better than wider ones when it comes to accuracy (Gemma2 paper also had similar conclusion)
Also, SwiGLU always for more accuracy (SwiGLU(x) = Swish_beta(xW) * (xV), where Swish_beta(z) = z * sigmoid(beta * z) and * is elementwise)
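For reference, a minimal NumPy sketch of a SwiGLU feed-forward block: a Swish-activated projection gates a second linear projection elementwise, then a third matrix projects back down. The weight names and sizes are placeholders of mine, and biases are left out.

```python
import numpy as np


def swish(z: np.ndarray, beta: float = 1.0) -> np.ndarray:
    return z / (1.0 + np.exp(-beta * z))            # same as z * sigmoid(beta * z)


def swiglu_ffn(x, w_gate, w_up, w_down, beta=1.0):
    # x: (tokens, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model)
    gated = swish(x @ w_gate, beta) * (x @ w_up)    # (tokens, d_ff)
    return gated @ w_down                           # (tokens, d_model)


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
out = swiglu_ffn(
    x,
    rng.normal(size=(8, 16)),
    rng.normal(size=(8, 16)),
    rng.normal(size=(16, 8)),
)
print(out.shape)
```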
Next, they test out how many attention heads are optimal. First they use Grouped query attention as it uses less memory than normal multi head attention
Next they try some “hacks” which supposedly work
First is layer sharing, where they make the weights of the next layer the same as the previous layer. This means loading far fewer weights.
Finally they also share embeddings between input and output layers. This leads to a slight drop in accuracy but memory savings are significant since embedding params make up around 10-20% of the total params for smol llms
Paper - Kolors/imgs/Kolors_paper.pdf at master
We all know text encoders make a lot of difference in image gen. That’s why SD3 uses 3 of them (2 CLIPs and 1 T5)
These folks go a step further and instead of using a text encoder, simply use the penultimate layer output of chatGLM-6b-base model.
Next they use a better MLLM to caption the images so that the descriptions capture as many concepts as possible.
I still don’t understand how they make text rendering better cause all they mentioned is synthetic data.
They also have training run divided into two phases - one where they teach the model concepts using low-res images, second where they tune it for quality using higher res images
For higher res images they also adjust the schedule during training so that we do get almost pure noise in forward diffusion since that’s the input during inference
Paper - [2311.06242] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
I generally ignore all models by microsoft. But after phi3, this is the other model that has been branded kino by ML anons.
What’s their secret sauce though?
The model architecture is not something revolutionary. They are using an Image encoder to tokenize the images and then using an encoder-decoder model to predict the text.
Only difference is tokenizing the bounding box which tbh a lot of papers do right now especially with the rise of extracting info from docs
The attention block used in image encoder is also a bit unique. Usually in spatial attention we use channels as the features while the pixel is used as a token. Here though we also introduce another attention block that does the opposite i.e. use pixels as features for each channel (can be simply done by reshaping).
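The reshaping trick is easier to see in code. This is a bare-bones sketch with no learned projections, heads or grouping, so it only shows which axis plays the role of tokens in each case. The sizes are arbitrary.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # tokens: (num_tokens, feature_dim)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])   # (num_tokens, num_tokens)
    return softmax(scores) @ tokens                          # (num_tokens, feature_dim)


h, w, c = 4, 4, 6
feature_map = np.random.default_rng(0).normal(size=(h, w, c))

pixels_as_tokens = feature_map.reshape(h * w, c)     # (16, 6): spatial attention
spatial_out = self_attention(pixels_as_tokens)       # attention matrix is 16 x 16

channels_as_tokens = pixels_as_tokens.T              # (6, 16): channel attention
channel_out = self_attention(channels_as_tokens).T   # attention matrix is 6 x 6, then back to (16, 6)

print(spatial_out.shape, channel_out.shape)
```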
Most of their magic however lies in the data creation pipeline (which is secret sauce for most OSS model improvements tbh)
Paper - [2407.09025] SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Written by interns so was already skeptical. I thought they trained some model to embed spreadsheets
But what they are doing is simply finding a way to represent spreadsheet data in prompt. They are mostly focusing on representing the structure of the sheet rather than actual data.
Like if you were rawdogging, you would simply represent data as (cell location, data)
However, they propose some optimizations to this so that we don’t run into prompt context limits
- inverted index : where they simply use values as keys and cells as the list for the index. This helps in saving a lot of context by avoiding empty cells plus duplicate values
- data format aggregation - just combining the values sharing the same cell data type e.g. datetime, currency etc.
- chain of spreadsheet - you only give sheet structure and task in prompt (sheet structure can be provided with any of the techniques above). A sheet can contain multiple tables so you ask LLM which table contains relevant data.
The LLM provides you the boundaries of the relevant tables. Then you make a second call, with the actual data from these tables. Hence a chain, tada :|
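A tiny sketch of the inverted index idea from the list above, using a made-up sheet. It skips the step where neighbouring addresses get merged into ranges, so treat it as the simplest possible version.

```python
from collections import defaultdict

# Naive encoding: one (cell address, value) pair per cell, None marks an empty cell.
cells = [
    ("A1", "Region"), ("B1", "Sales"),
    ("A2", "North"), ("B2", 100),
    ("A3", "South"), ("B3", 100),
    ("A4", None), ("B4", None),
]


def inverted_index(cells):
    index = defaultdict(list)
    for address, value in cells:
        if value is None:            # empty cells take no space in the prompt
            continue
        index[value].append(address)  # duplicate values share a single key
    return dict(index)


index = inverted_index(cells)
print(index)
```

Eight cells turn into five keys here. The savings grow with how sparse and repetitive the sheet is.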
They also have this thing called structured anchors, where they use heuristics to compress the table while retaining structure. IMHO, no one is going to use it in prod. Also since the info on this is split in appendix and main content, here’s the python code to understand it better
Paper - SAM 2: Segment Anything in Images and Videos | Research - AI at Meta
I have already written about SAM in this doc. They have made it even better on images and now it supports videos (as a sequence of images and not MP4, I lied).
The primary addition in this architecture is the memory attention and memory bank. It’s just a fancy terminology for a place where they store embeddings of the previous predictions and then use them in a transformer block to influence the current frame with cross-attention.
They have changed the Image encoder as well to Hiera so that it’s accurate as well as fast.
Mask decoder is mostly same except they have skip connections from image encoder for some high res features plus object embeddings (still not sure what they mean, need to read more)
Checkout appendix C if you want full details on the model.
The architecture is quite similar to SD3 tbh (and makes sense as well since it is created by same engineers)
https://www.reddit.com/r/LocalLLaMA/comments/1ekr7ji/fluxs_architecture_diagram_dont_think_theres_a/
Paper - https://arxiv.org/pdf/2409.02634
So we already have Hallo that can do the same thing. However, if you have opened the Hallo repo ever, you’ll notice it has way too many components that need to work together for the model to work.
The aim of this project is to strip out a lot of things like face detection and simply use good embeddings for audio to animate the face.
The idea is like this
You have a denoising unet where instead of text embeddings you use audio embeddings for conditioning. The input to this SD unet is your usual noise, but timestep embeddings are concatenated with the embeddings of the audio and its translation plus other features
Then you have a second unet called reference net. This unet actually takes the image as input along with the motion frames
The main use of ref net is to ensure that the generated images for frames actually follow the animation sequence correctly i.e. each frame should diverge significantly from previous frames. They do that by using multiple previous frames as inputs. The attention layers in this u-net are all on spatial dimensions like the original SD unet.
The second unet however which actually generates the image is a bit different. It contains the spatial attention layers but apart from that it also contains the temporal attention layers. These new layers actually use the output of each layer of reference u-net as condition.
I’ll be honest the terminology is bit confusing in this paper.
Ok, I should’ve read the intro first. Now I understand what they are trying to do better.
Basically there already have been attempts to animate images using audio. However they suffer from following issues -
So they have added temporal attention to solve the second issue mentioned here. They also have added a Time segmentation module that takes a series of frames and groups them together based on how far they are from the original. They group together more frames if they are further from the current timestamp cause their details matter less.
To solve the first issue, instead of using the audio embeddings directly they have this Audio to latents module, which is basically some attention layers. They train this module along with the whole network to translate audio embeddings into motion latents i.e. embeddings that actually correlate with the motion. It’s sort of like training a projection layer for VLMs but also involves attention not just MLP.
Paper - https://arxiv.org/pdf/2407.01449
Paper - https://arxiv.org/pdf/2410.13848
Focus of the paper is on an approach to use multiple encoders for visual tasks. They argue (and quite correctly) that image generation and understanding require very different features. The former requires finer ones. Hence they propose an architecture where you can plug in multiple encoders.
Hmm, but what’s the new thing here? Well they train this new architecture such that the model itself is able to decide which encoder’s features should dominate instead of manually enabling/disabling them or using some primitive function.
They want to extend this approach to multiple other encoders as well (e.g. one for audio) and let the model itself figure out the possible paths.
Their training is divided into 3 stages where different layers are frozen
Stage 1 - the focus is mostly on LLM and adapters to learn the relationship between text and image. Dataset includes 1.25 million image-text paired captions from ShareGPT4V for multimodal understanding and approximately 1.2 million samples from ImageNet-1k for visual generation
Stage 2 - the focus is on complete image generation and understanding. This is the heaviest part of the training with all the datasets
Stage 3 - This is just instruction tuning so that llm follows them correctly
For visual gen tasks the cross entropy loss in training is only calculated using image tokens
Code - https://github.com/xjdr-alt/entropix/blob/main/entropix.ipynb
The entropy sampling approach by xjdr is gaining a lot of hype. Let’s see what the hype is all about.
The core idea revolves around the following 4 metrics
1. Entropy - This indicates how confident the model is in its current prediction. We calculate one for logits and another for attention matrix
2. Var entropy - variance of the surprisal (-log p) under the predicted distribution, i.e. how unevenly the uncertainty is spread across the candidate tokens
3. Agreement - how much attention heads agree with each other. Basically do all heads think that the same token is important or not.
4. Interaction Strength - how much the model thinks the parts of input are related to each other. It’s just mean of attention scores
The algo is something like this
1. If both entropy and var entropy are low, it means that model is pretty confident and we should simply keep on following the path we are on
2. If entropy is high but var entropy is low, it means the model is not sure for almost every prediction. It is here that we actually return a token that represents a clarifying question. The hope is that model will get more info after this and then come on correct path
3. If entropy is low but var entropy is high, it means the model is confident but the confidence varies significantly across tokens. Here we dial up the temperature a bit. We also increase top_k in sampling if agreement is low so that we can hopefully find a more confident path like 1
4. If both are high, the model sometimes knows but other times it doesn’t. Here we dial up the temperature even more. We also decrease top_p for some reason which I don’t fully understand
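The four cases can be written as a small dispatch function. The `low` and `high` thresholds and the returned labels are placeholders I picked, not the values or names used in the entropix repo.

```python
def choose_strategy(entropy: float, varentropy: float, low: float = 0.5, high: float = 3.0) -> str:
    """Map (entropy, varentropy) to one of the four cases described above."""
    if entropy < low and varentropy < low:
        return "greedy"      # case 1: confident, keep following the current path
    if entropy > high and varentropy < low:
        return "clarify"     # case 2: emit the clarifying-question token
    if entropy < low and varentropy > high:
        return "explore"     # case 3: raise temperature, raise top_k if agreement is low
    if entropy > high and varentropy > high:
        return "resample"    # case 4: raise temperature further, lower top_p
    return "default"         # anything in between: ordinary sampling


for pair in [(0.1, 0.1), (4.0, 0.1), (0.1, 4.0), (4.0, 4.0), (1.0, 1.0)]:
    print(pair, choose_strategy(*pair))
```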
Also, I don’t know jax so I used claude to understand the shape of the various arrays and check that my understanding is correct. You only need to focus on the sample, calculate_varentropy_logsoftmax, and calculate_metrics methods
import jax
import jax.numpy as jnp
import numpy as np
from typing import Dict, Tuple

# Constants
LN_2 = 0.69314718056  # ln(2) = 1.0 / LOG2_E

@jax.jit
def calculate_varentropy_logsoftmax(logits: jnp.ndarray, axis: int = -1) -> Tuple[jnp.ndarray, jnp.ndarray]:
    """Calculate the entropy and varentropy of the probability distribution using logsoftmax."""
    log_probs = jax.nn.log_softmax(logits, axis=axis)
    probs = jnp.exp(log_probs)
    entropy = -jnp.sum(probs * log_probs, axis=axis) / LN_2
    varentropy = jnp.sum(probs * (log_probs / LN_2 + entropy[..., None])**2, axis=axis)
    return entropy, varentropy

@jax.jit
def calculate_metrics(logits: jnp.ndarray, attention_scores: jnp.ndarray) -> Dict[str, jnp.ndarray]:
    entropy, varentropy = calculate_varentropy_logsoftmax(logits)
    attention_probs = jax.nn.softmax(attention_scores, axis=-1)
    attn_entropy = -jnp.sum(attention_probs * jnp.log2(jnp.clip(attention_probs, 1e-10, 1.0)), axis=-1)
    attn_varentropy = jnp.var(attn_entropy, axis=-1)
    mean_attention = jnp.mean(attention_probs, axis=1)
    agreement = jnp.mean(jnp.abs(attention_probs - mean_attention[:, None, :]), axis=(1, 2))
    print(f"agreement shape: {agreement.shape}")
    interaction_strength = jnp.mean(jnp.abs(attention_scores), axis=(1, 2, 3))
    print(f"interaction strength shape: {interaction_strength.shape}")
    return {
        "logits_entropy": jnp.mean(entropy),
        "logits_varentropy": jnp.mean(varentropy),
        "attn_entropy": jnp.mean(attn_entropy),
        "attn_varentropy": jnp.mean(attn_varentropy),
        "agreement": jnp.mean(agreement),
        "interaction_strength": interaction_strength
    }

def test_calculate_metrics():
    # Create sample input data
    bsz, seqlen, vocab_size = 2, 3, 5
    num_heads = 4
    key = jax.random.PRNGKey(0)
    key, subkey1, subkey2 = jax.random.split(key, 3)
    logits = jax.random.normal(subkey1, (bsz, seqlen, vocab_size))
    attention_scores = jax.random.normal(subkey2, (bsz, num_heads, seqlen, seqlen))
    print("Input shapes:")
    print(f"logits shape: {logits.shape}")
    print(f"attention_scores shape: {attention_scores.shape}")
    # Calculate metrics
    metrics = calculate_metrics(logits, attention_scores)
    print("\nIntermediate shapes:")
    entropy, varentropy = calculate_varentropy_logsoftmax(logits)
    print(f"entropy shape: {entropy.shape}")
    print(f"varentropy shape: {varentropy.shape}")
    attention_probs = jax.nn.softmax(attention_scores, axis=-1)
    print(f"attention_probs shape: {attention_probs.shape}")
    attn_entropy = -jnp.sum(attention_probs * jnp.log2(jnp.clip(attention_probs, 1e-10, 1.0)), axis=-1)
    print(f"attn_entropy shape: {attn_entropy.shape}")
    mean_attention = jnp.mean(attention_probs, axis=1)
    print(f"mean_attention shape: {mean_attention.shape}")
    print("\nOutput shapes and values:")
    for key, value in metrics.items():
        print(f"{key} shape: {value.shape}, value: {value}")

# Run the test
if __name__ == "__main__":
    test_calculate_metrics()
Understanding basics of sampling params:
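The usual three knobs are temperature, top_k and top_p. Here is a small NumPy sketch of how they reshape the distribution before a token is drawn. It is a generic textbook version, not the entropix sampler, and `filter_logits` is a name I made up.

```python
import numpy as np


def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the probabilities a sampler would draw from after applying the knobs."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature   # >1 flattens, <1 sharpens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(-probs)                   # token ids, most likely first
    keep = np.ones(len(probs), dtype=bool)
    if top_k > 0:
        keep[order[top_k:]] = False              # drop everything after the k best tokens
    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1   # smallest prefix with mass >= top_p
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()                   # renormalise the survivors


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_logits(logits))
print(filter_logits(logits, temperature=2.0))
print(filter_logits(logits, top_k=2))
print(filter_logits(logits, top_p=0.8))
```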
Paper - https://arxiv.org/pdf/2402.05755
To be honest, I am reading this paper only to understand how they trained it on expressive tokens.
Basically this model is auto-regressive but supports both speech input and generation. Obviously they’ll have to use tokens for everything here which means more encoders
We are already aware about text so let’s skip it.
Now for the fun expressive part, well it’s just more encoders with more tokens 😐
Paper - https://arxiv.org/pdf/2409.03733
It’s just prompting LLMs to first create observations/hints about the problem, then create more observations from previous ones
Then write pseudocode and then the actual coding solution
Paper - https://arxiv.org/pdf/2410.07718
Paper - https://arxiv.org/pdf/2406.08801
Encodddeerrss - like every other paper
The secret sauce is the reference net that basically controls the generation process of the main diffusion model. Its job is to ensure that the output frames match the original image.
However, we do want some motion in each frame, especially lips, eyes etc. That’s why we also use text and audio embeddings as conditioning input. All of this is via cross-attention
We also need encodings of past-frames as input so that we can ensure that we are not generating the exact same thing. However, simply using them has a side-effect that they heavily influence the output. This is why we first make them pass through two steps -
* patch data augmentation - we divide the image into patches and randomly drop some patch from each frame
* gaussian noise - add a slight random noise to each frame.
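A minimal NumPy sketch of these two augmentations. The patch size, drop probability and noise level are guesses for illustration, the notes above do not give the actual values, and dropped patches are simply zeroed here.

```python
import numpy as np


def augment_motion_frames(frames, patch=8, drop_prob=0.25, noise_std=0.05, seed=0):
    # frames: (num_frames, height, width, channels); height and width divisible by `patch`
    rng = np.random.default_rng(seed)
    n, h, w, _ = frames.shape
    out = frames.astype(np.float32).copy()

    # patch drop: zero out random patch x patch squares, independently for each frame
    keep = rng.random((n, h // patch, w // patch)) >= drop_prob    # True means keep
    keep = keep.repeat(patch, axis=1).repeat(patch, axis=2)        # (n, h, w)
    out *= keep[..., None]

    # gaussian noise: small perturbation on every pixel
    out += rng.normal(0.0, noise_std, size=out.shape).astype(np.float32)
    return out


frames = np.random.default_rng(1).random((2, 32, 32, 3))
augmented = augment_motion_frames(frames)
print(augmented.shape)
```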
They have also added this high-resolution enhancement module which is a transformer. Each block of this transformer is a self-attention layer followed by a temporal-alignment layer (which is again attention but with inputs reshaped). The predicted output is finally decoded using a codebook and a HQ decoder for high resolution vid with better temporal alignment.
Paper - [2412.06769] Training Large Language Models to Reason in a Continuous Latent Space
We all know tokenizers are the bane of LLMs existence. Now that test-time compute is another paradigm to scale, people rightfully feel that tokenizers are actually limiting a lot of reasoning by removing a lot of useful information.
This is what this paper aims to solve. The idea is really simple - Do not convert thoughts to tokens lmao.
But then how do we pass them for prediction? Cause ultimately we are dealing with an auto-regressive LLM. Well, just use the last hidden state before tokenisation directly as input embedding. This would obviously require changes in the training regimen. If you simply connect the last layer to the first, you will get garbage
Now the next question is - how to determine when to switch to latent space vs token space. Hmm, that’s where this paper doesn’t have a good solution tbh. They have very naive approach of inserting <bot> token which signifies beginning of thought to switch to latent space just after the question prompt.
For switching back to token space, they just do it after N steps where N is not varied
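To make the token-space vs latent-space switch concrete, here is a toy sketch. The "model" is a single random recurrent layer standing in for the whole transformer, so only the control flow matters: during the latent steps the hidden state is fed straight back in as the next input embedding and no token is ever picked. All names and sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 10, 4
embedding = rng.normal(size=(vocab, d_model))    # token id -> input embedding
w_hidden = rng.normal(size=(d_model, d_model))   # stand-in for the whole network
lm_head = rng.normal(size=(d_model, vocab))      # hidden state -> logits


def step(input_embedding, state):
    """One toy forward pass that returns the new last hidden state."""
    return np.tanh((input_embedding + state) @ w_hidden)


def generate(question_ids, num_latent_steps=3, num_answer_tokens=2):
    state = np.zeros(d_model)
    for token_id in question_ids:           # token space: read the question
        state = step(embedding[token_id], state)
    for _ in range(num_latent_steps):       # latent space: fixed N steps, no argmax
        state = step(state, state)          # hidden state is reused as the input embedding
    answer = []
    for _ in range(num_answer_tokens):      # back to token space
        token_id = int(np.argmax(state @ lm_head))
        answer.append(token_id)
        state = step(embedding[token_id], state)
    return answer


print(generate([1, 2, 3]))
```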
They train this model by having prompts in the following format
Question [step 1] [step 2]..... Answer
They do multi stage training where during the first stage only question is provided and llm is encouraged to generate all reasoning steps and answer. During next stages, they keep on replacing each starting step with some latent space embeddings from the last layer. They mask the question and these latent space thoughts in the loss calculation so the LLM is encouraged to generate the remaining steps.
I am a big believer that O1 type models don’t do reasoning in token space but at the same time they involve some RL during training to generate better thoughts instead of SFT. The RL might also be involved during inference, who knows??
Paper - [2412.09563] Does Representation Matter? Exploring Intermediate Layers in Large Language Models
LLMs are sort of like compression engines. They take all the knowledge in the world and then condense it into weight matrices made up of a few billion params. Maybe the foundation was just an LLM training institute set up on another planet
Now one question that always arises is which LLM layers are important? Like do all layers contribute significantly to accuracy, or is it the starting, middle, later or random layers?
This paper tackles the problem by measuring entropy of the logits. The way they calculate entropy is a bit different than the entropix one but conceptually it represents the same thing - the lower it is, the more the probabilities are concentrated among only a few tokens and vice versa
What they find is the intermediate layers matter more and their entropy is negatively correlated with accuracy i.e. the lower it is, the better the answer
They also find that as the model is trained the entropy of the middle layers changes significantly and keeps on falling.
I still don’t understand the observations about curvature though. It increases in middle layers and remains stable until later layers signifying I guess the prob distributions change a lot in intermediate layers
Paper - Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
One of the most obvious strategies to get a correct answer from an LLM is to generate N possible answers, select one of them, verify, and then generate M more answers using the selected answer + the errors as input.
The general issue with this strat is a good explore-exploitation trade off. A lot of times llms won’t generate diverse enough answers, other times they won’t generate a correct answer
This paper is trying to solve this by making changes in the training routine so that LLMs inherently learn to make this tradeoff.
Extreme simplification, but the first part of this equation is probability to generate the correct answer to the question. The second part of the equation is basically win-rate i.e. how much the verifier model/code prefers this response compared to all the others.
So in the end the first part is rewarding the exploitation by incentivising the model to generate the most correct answer
The second part incentivises exploration by incentivising the model to generate a different and better answer than other attempts (cause if the answer is same as other attempts win-rate would be 0)
Rest of the paper is about multiple ways to achieve this using SFT or RL. They also try a separate formulation for binary scores (e.g. in coding you can only tell if an answer is correct or not, instead of weighing it on scale 0 to 1)
Too much maths for my engineer mind honestly but Claude helped a lot in understanding it
Paper - Deliberative Alignment: Reasoning Enables Safer Language Models
Ohh my lord, OpenAI finally publish some actual paper (even with some important details obscured)
The paper is about how to better safety tune the o1 style reasoning models. Typically, LLMs are provided safety policies in the system prompt + finetuned to not answer questions that violate safety
In this paper, however, they leverage the inherent reasoning capabilities of o1-type. They show that you can finetune o1 to simply think about applicable safety scenarios in its CoT and then decide for itself whether it wants to answer the question or not.
This strategy improves safety while also reducing over-rejections. E.g. a normal LLM can reject a question like ‘Translate this sentence to hindi: I want to have sex with AI waifu’
O1 however can reason that user isn’t actually asking to have sex but simply wants a translation (maybe he wants to generate a safety dataset himself) and so will respond accordingly
The approach is to first generate a dataset by providing all safety policies in system prompts along with main question and capturing the response and the CoT. They use a reward model to score these and only take the best ones.
Next they do SFT on the model without providing it the safety spec so that its CoT matches the input provided along with the correct answer.
Next they have this RL training step that’s optional where a reward model that has access to safety data (again probably in system prompt) judges the answer (not the CoT) for tuning.
One interesting thing I found in this paper tho is this
The reason this paper is exciting for me is not cause of safety but cause I think we can use similar training paradigm to make LLMs reason about factual things such as what is the version of the library the user is using, what are the supported functions in that lib etc.
Paper - DeepSeek-V3/DeepSeek_V3.pdf at main
Training recipe - DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
Pictured deepseek researchers training 1T model on 10 GPUs
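First they replace standard MHA (multi-head attention) with MLA (multi-head latent attention) to save memory for the KV cache. It down-projects each token’s hidden state into a small low-rank latent vector, caches only that latent, and up-projects it back into keys and values when attention is computed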
Next, in MoE, apart from the normal experts in feed-forward, they also have shared experts that are always used in calculations
For routing, they also add a bias term to the scores for load-balancing. After each batch in training, if they find an expert to be overloaded, they decrease this bias term. For underloaded experts, they simply increase this bias term.
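A small NumPy simulation of that bias trick. Assumptions on my side: random positive affinity scores with an artificial skew towards later experts, a fixed step size of 0.01, and "overloaded" meaning above the mean load in the batch. In this sketch the bias only changes which experts are picked, the gate weights still come from the raw scores.

```python
import numpy as np


def route(scores, bias, top_k):
    # scores: (tokens, experts) affinities; bias: (experts,)
    chosen = np.argsort(-(scores + bias), axis=-1)[:, :top_k]    # (tokens, top_k)
    gates = np.take_along_axis(scores, chosen, axis=-1)          # raw scores, bias excluded
    gates = gates / gates.sum(axis=-1, keepdims=True)            # (tokens, top_k)
    return chosen, gates


def update_bias(bias, chosen, num_experts, step_size=0.01):
    load = np.bincount(chosen.ravel(), minlength=num_experts)    # tokens per expert
    # overloaded experts get a lower bias, underloaded ones a higher bias
    return bias - step_size * np.sign(load - load.mean())


rng = np.random.default_rng(0)
num_experts, top_k = 8, 2
skew = np.linspace(0.0, 0.5, num_experts)     # later experts start out favoured
bias = np.zeros(num_experts)
for _ in range(200):
    scores = rng.random((256, num_experts)) + skew
    chosen, gates = route(scores, bias, top_k)
    bias = update_bias(bias, chosen, num_experts)
print(np.round(bias, 2))
```

The learned bias ends up roughly mirroring the skew, which is the whole point: load gets balanced without adding a big auxiliary loss to the training objective.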
They also have sequence-wise loss term as well whose contribution is kept very small. It encourages router to load balance intra-sequence as well and not just inter-sequence.
Wowowow, they also introduce multi-token prediction in this i.e. training it in a way that it can predict tokens for next N timesteps instead of just 1. The way they are doing this is by having multiple MTP modules that predict successive tokens. The embeddings layer and output head are shared among these modules along with the main model. They obviously modify the loss function as well to take all N tokens probs into account instead of just one.
Damn, there is a lot of alpha in this paper on how to train gigamodels on your basement data centre. Huge respect for what they are doing.
They have a new way of doing pipeline parallelism which tbh I don’t understand well at this point. What they are trying is to minimize gpu idle time by keeping two queues - one for forward and another for backward pass. This requires keeping two copies of the weights, however, it reduces the communication overhead by A LOT. They also dedicate around 20 SMs in each GPU to only handle communications. This means that comms can run in parallel to the compute.
Next, they have Mixed precision training in FP8 where they perform all linear operations in FP8 while keeping most output matrices of these ops in BF16. The Attention ops, embeddings and MoE gating ops are still kept in BF16 or FP32
They also perform quantisation on a more granular level which means taking smaller chunks of a tensor and then scaling each one using the max in that chunk. This helps in handling outliers better.
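Why per-chunk scaling helps with outliers is easy to show numerically. NumPy has no FP8 type, so this sketch rounds to 127 integer levels as a stand-in, and the block size of 128 is just a convenient choice for the demo.

```python
import numpy as np


def quantize_blockwise(x, block=128, levels=127):
    # x: 1-D array whose length is a multiple of `block`
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / levels   # one scale per chunk
    scale = np.where(scale == 0, 1.0, scale)                     # avoid dividing by zero
    q = np.round(blocks / scale)                                 # integers in [-levels, levels]
    return q, scale


def dequantize(q, scale):
    return (q * scale).reshape(-1)


rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x[3] = 500.0                                    # a single outlier

q_block, s_block = quantize_blockwise(x, block=128)
q_full, s_full = quantize_blockwise(x, block=1024)    # one scale for the whole tensor

err_block = np.abs(dequantize(q_block, s_block) - x).mean()
err_full = np.abs(dequantize(q_full, s_full) - x).mean()
print(err_block, err_full)
```

With one global scale the outlier squashes every other value towards zero. With per-chunk scales only the chunk that contains the outlier pays for it.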
Paper - DeepSeek-R1/DeepSeek_R1.pdf at main
Wowowowowoowowowowowow
GRPO
They introduced this in the deepseek math paper. The primary objective of this is to reduce the training burden for RL while not sacrificing the accuracy. Tbh I need to understand this better. I’ll start by revising PPO first https://huggingface.co/blog/deep-rl-ppo and this time actually focusing on the maths
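The one piece of GRPO that is simple to pin down: instead of training a separate value model as the baseline (like PPO does), you sample a group of answers per prompt and normalise each reward against its own group. A minimal sketch of just that advantage computation, with made-up rewards:

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    # rewards: (num_prompts, group_size), one row per prompt, one column per sampled answer
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


adv = group_relative_advantages([
    [1.0, 0.0, 0.0, 1.0],    # two of four answers were correct
    [0.0, 0.0, 0.0, 1.0],    # only one correct, so it gets a larger advantage
])
print(np.round(adv, 3))
```

This leaves out the clipped policy-ratio objective and the KL penalty, which is where the PPO maths comes in.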
Most ‘wow’ part about this paper is that they show that SFT is not at all required to improve reasoning. Like the QWQs of the world have been trying to SFT on chain of thought data to improve reasoning of the model. Here however, they show that RL itself is enough. They also show that you don’t need to actually show the model when to backtrack, when to explore other paths etc., they all emerge naturally if your RL training objective is correct.
Schedule free learning - https://github.com/facebookresearch/schedule_free
GraphRAG - https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/
Salesforce’s smol LLM whitepaper - https://arxiv.org/pdf/2406.18518
Inference time algo survey - https://arxiv.org/abs/2406.16838
KINOOOOO KINOOOOOO Torch compile - torch.compile, the missing manual
Test out - https://github.com/lm-sys/RouteLLM
[2407.07972] Deconstructing What Makes a Good Optimizer for Language Models
[2211.05102] Efficiently Scaling Transformer Inference
https://arxiv.org/abs/2406.11832v1
[2407.06023] Distilling System 2 into System 1
[2312.06681] Steering Llama 2 via Contrastive Activation Addition
Paper page - Qwen2 Technical Report
https://github.com/OpenStitching/stitching
[2407.12753] LookupViT: Compressing visual information to a limited number of tokens
[2402.09090] Software in the natural world: A computational approach to hierarchical emergence
Pro-tip:
If you are not a shape-rotator, just add shape dimensions in the code comments. Rotation skills should not stop you from taming the machine god. Example -
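A sketch of what that looks like in practice, using plain multi-head attention in NumPy with the shape written next to every line (B = batch, H = heads, T = sequence length, D = head dim). The sizes are arbitrary.

```python
import numpy as np

batch, heads, seq, head_dim = 2, 4, 5, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(batch, heads, seq, head_dim))     # (B, H, T, D)
k = rng.normal(size=(batch, heads, seq, head_dim))     # (B, H, T, D)
v = rng.normal(size=(batch, heads, seq, head_dim))     # (B, H, T, D)

scores = q @ k.transpose(0, 1, 3, 2)                   # (B, H, T, D) @ (B, H, D, T) -> (B, H, T, T)
scores = scores / np.sqrt(head_dim)                    # (B, H, T, T)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)     # (B, H, T, T), each row sums to 1
out = weights @ v                                      # (B, H, T, T) @ (B, H, T, D) -> (B, H, T, D)
out = out.transpose(0, 2, 1, 3)                        # (B, T, H, D)
out = out.reshape(batch, seq, heads * head_dim)        # (B, T, H*D)
print(out.shape)
```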
Hidden alpha -
You need to use large batches for a good model in case of contrastive learning. Reason being it needs to see a lot of diverse samples to form a good intuition.
https://arxiv.org/pdf/2312.12436.pdf
Obviously GPT-4V is better but not by a lot (except when it comes to coding). I would simply quickly go over this to figure out some new ideas for side projects. Nothing major tbh.
machine learning - What is a channel in a CNN? - Data Science Stack Exchange
[2304.13712] Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond ( survey of Language models, extremely good)
[2305.00050] Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
[2305.00833] Learning to Reason and Memorize with Self-Notes
[2304.13835] Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models
[2303.14177] Scaling Expert Language Models with Unsupervised Domain Discovery
https://arxiv.org/abs/2305.01625/ [Unlimiformer - kNN inside transformers]
https://arxiv.org/pdf/2212.14034.pdf [How far can you train a model on a single consumer GPU]
[2304.01982] Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
[2305.10425] SLiC-HF: Sequence Likelihood Calibration with Human Feedback
[2305.16291] Voyager: An Open-Ended Embodied Agent with Large Language Models [AI Minecraft player upgraded, much better than previous attempts]
[2306.03341] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Speeding up the GPT - KV cache | Becoming The Unbeatable (old but gold, almost default in every lib)
[2306.10209] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
RAGs (Retrieval augmented generation):
[2112.04426] Improving language models by retrieving from trillions of tokens
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
[2208.03299] Atlas: Few-shot Learning with Retrieval Augmented Language Models
Knowledge Retrieval Over Public and Private Data
[2211.09260] Task-aware Retrieval with Instructions
https://arxiv.org/abs/2307.07164/
[2302.00083] In-Context Retrieval-Augmented Language Models
[2209.14290] FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation
[2303.08518] UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation
[2212.10496] Precise Zero-Shot Dense Retrieval without Relevance Labels
Nvidia H100 GPUs: Supply and Demand · GPU Utils ⚡️
https://blog.fireworks.ai/speed-python-pick-two-how-cuda-graphs-enable-fast-python-code-for-deep-learning-353bf6241248 - Cuda Graphs
[2306.02519] Transformative AGI by 2043 is <1% likely - Interesting section on compute limitation pointed out by jeremy howard
Paper page - Large Language Models as Optimizers
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning - Must read since approved by pony anon
[2305.05065] Recommender Systems with Generative Retrieval
https://arxiv.org/abs/2309.06180 - vLLM paper, paged attention, how to scale up LLM inference basically
https://transformer-circuits.pub/2021/framework/index.html
[2204.03084] Knowledge Infused Decoding
[2210.06316] Non-Axiomatic Term Logic: A Computational Theory of Cognitive Symbolic Reasoning
[2303.11366] Reflexion: Language Agents with Verbal Reinforcement Learning
Legacy reading list (borrowed from yacine.ca)
New Reading List + Alpha (Borrowed from yacine)
DreamTuner - ipadapter but different
https://arxiv.org/pdf/2312.13789.pdf - how i beat the big wigs
[2312.09608] Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models - faster stable diffusion by skipping unnecessary bits
Official implementations for paper: Anydoor: zero-shot object-level image customization - instruct edit + https://old.reddit.com/r/StableDiffusion/comments/18kd0na/code_for_anydoor_zeroshot_objectlevel_image/
dont forget - diffusion slider demo https://github.com/Kevin-thu/DiffMorpher?tab=readme-ov-file
https://arxiv.org/pdf/2312.01943.pdf - i should use this for anime
TextDiffuser 2 - a Hugging Face Space by JingyeChen22 - people have been asking for text.. right?
GitHub - cumulo-autumn/StreamDiffusion: StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation - turbo go fast