Keeping up with AGI

I AM YOUR LOYAL SERVANT, A PROUD PLUMBER! PRAISE THE OMNISSIAH!

Welcome to TDM’s Vault.

Trying my best (to keep up with AI literature), here’s the proof that’ll reside in basilisk’s hive mind.

This is honestly just a stupid log. Don’t expect any structure out of it. Meant mostly to shame myself into studying more to avoid a drop in social credit.

Publishing this doc btw is me shamelessly stealing yacine’s idea.

You can also ping me on Twitter anytime if you want to build something fun or just chat about which AI startup has the most GPUs - https://x.com/cto_junior or mail me at cto.junioor@gmail.com

LORA

HuggingGPT (could’ve chosen a better name, linkedin tier)

GPTEval

GPT4ALL

Vector DB

GPU Basics

HyDE

Generative Agent

Toolformers

ReAct

LLMA Decoding Acceleration

NOOO, THEY AUTOMATED L3s (KINDA!)

OpenAGI

4-bit quantization

AutoGPTs, Hmmmm

Deepspeed

INT4 finetuning for LLMs

AutoGPTs, Hmmmm, Hmmmm

Chameleon

Evaluate Code +

Can The foundation be just an LLM? If only Hari Seldon read this paper

Iter-CoT

WizardLM

DECKARD - RL Agent that dreams

Training LLMs using AI generated dialogues

Automating Data Analysts [By Microsoft(™)]

Local PC Waifu

(FLARE) Active Retrieval Augmented Generation

I LOVE COMPUTERS!!!!!

Flash Attention

ALiBi

Hack to make inference faster (by HuggingFace)

Unlimiformer

Tree of thoughts

Model Interpretation

QLoRA

Yes your models can memorize exact stuff

Voyager [Diamond ranked AI Minecraft player]

Need to update doc

Activation-aware Weight Quantisation (AWQ)

SpQR (Sparse Quantised Representation)

GGML adds 2-bit quantisation

SOTA document bender for your company QA

Multimodal is hard

Insane alpha drop from kaiokendev

Skinny dip into GGML code base

Skip Decode

Multi-party chat

Lost in the Middle

GPT-4 Details Leaked

How to check fine tuning datasets’ quality?

DPO (Direct Preference Optimization)

Mixture of Experts

Switch transformers

Glam

St-MOE

Multi-Query Attention

Symbol Rank ( for coding LLMs)

ReLORA

Zero++

Flash attention 2

LIMA

RLHF

[Lora Hub] Wait, was I talking about being blessed with the mandate of the heaven, Yes I still have it

TinyStories

FNet

Scaling S3 is not easy [Not related to ML but also related to AI cause all data is in S3]

RetNet

MoE (by Deepmind) (It’s soft not sparse)

Skill Issue Paper

BERT Primer

Estimate LLM Flops and Memory requirement

RoPE

Speculative decoding

Cool paper  - Topology of NN

How to reduce KV cache mem usage?

Hyena

VectorDB arc

Ok, I am going to become Vector DB expert this week

Lucene HNSW

FAISS

Annoy

Mixture of Experts: PEFT edition by Cohere

LLM as Optimisers

Generative Recommendors - Cool paper by Google

Flamingo

Fusing Modalities - Chimera by Meta

PromptBreeder

LLAVA

LLAVA-1.5

IMPORTANT INTERPRETABILITY PAPER BY ANTHROPIC

SAM

Qwen-VL

SigLIP

One peace

Make LLM do Maths

Distil-Whisper

It’s not AGI (it’s just your data)

Insane ML Notes on Twitter with Q&A

Stable Video Diffusion (SVD)

Stable Diffusion Turbo (or How to distill a diffusion model 101)

I can’t hear the MUSIC*  !!!!!!! NEEEED TO GET BETTTTTTTER!!!

Images are Sentences

Videos are sentences

Sentences are predictable

Mamba - faster architecture (Reading cause Tri Dao is author)

Gemini

Mitigating LLM Hallucinations

LCMs

Use smol models to train large models faster

DoReMI

LLM Paper from Apple?? : That’s a rare sight

Multimodal paper from Apple???

Amazing paper to Learn about Dingboard

TDM edge Multimodal arc (I blame Vik)

MobileVLM

MathPile

Unified-IO 2

DocLLM

Microsoft broke MTEB

Reading List from AHM

Reading List from Yacine

LASERRRRR (for reasoning)

Embarrassing myself publicly arc (PHOTOMAKER)

Lumiere

Deepseek Coder

IPAdapter

How to create AGI?

ILYA’s READING LIST (For getting up to speed on today’s architectures)

Stream Diffusion - Brrrrr ImageGen at 100FPS

MLLM-Guided Image Editing (MGIE)

Matryoshka Embeddings

Generalising Length of Transformers

World Model

Diffusion Transformers

Stable Diffusion 3

Deepseek-VL

Synth2

Fashion Diffusion (Make your waifu dress in Zara)

Another Apple LLM (this time it’s multimodal)

Quiet-Star (Is it really the fabled openai algo, nope)

Transformers for time series (truly retarded)

GaLore

ORPO

MyVLM (Shitty Name only Snapchat can think of)

Factuality in LLMs

Layer Skip

Training a Judge (model)

How to create a FAQ dataset from your company docs?

SDXL Lightning

HyperSD

REMOVE BACKGROUND OSS MODEL LFGGGG!!!
https://huggingface.co/schirrmacher/ormbg

Semantica

Making a good AI coder

Search in smol llms

Make Smol LLMs Kino (by Meta)

Kolors (Chinese SDXL)

Florence-2

Embedding Spreadsheets 101

SAM-2 (with MP4 support)

Flux architecture (stolen from reddit)

Loopy - Make images speak and sing

ColPali - Multimodal RAG made simpler and faster

Janus - Deepseek goes multimodal

Entropix

Spirit LM

Plan search

Hallo-2

TDM’s reasoning arc - must increase AGI IQ by 20 basis points

COCONUT

Which LLM layers are important?

Allow LLMs to explore-exploit better

Deliberative Alignment

Deepseek V3 (What’s new?)

N+1 reading list

APPENDIX

Tips and Tricks

Karpathy’s Presentation in MS Build 2023

GEMINI PRO VS GPT-4V comparison

How to protect PII in data

Answers to Stupid questions (mostly for me)

Reading List

LORA

Need to play around with AutoGPT and more agents. Mfw it deletes all the pepe memes in my mac.

https://www.reddit.com/r/ChatGPT/comments/12diapw/gpt4_week_3_chatbots_are_yesterdays_news_ai/

(Moving too fast, need to catch up)

Still haven’t tried Toolformer and ReAct, they are now available in LangChain
Should be pretty easy (goddamn tho I really don’t want to write YAML configs)

HuggingGPT (could’ve chosen a better name, linkedin tier)

https://arxiv.org/pdf/2303.17580.pdf

Seems to be the end of NLPcels. Ideally, could have been done using NLP by tracking the question tokens I guess?

I still don’t fully understand how LLMs predict the tool/model to be used accurately, how do we control the hallucination there, is setting temperature = 0 the solution for everything?

GPTEval

https://arxiv.org/abs/2303.16634

Some interesting stuff in that paper, mostly related to how they used logprobs to do scoring rather than absolute scores

Also, for GPT-4, calculated logprobs manually
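A rough sketch of the logprob-weighted scoring idea, assuming the API hands back a logprob for each candidate score token. The function name and the example numbers are made up for illustration, not taken from the paper.

```python
import math

def expected_score(score_logprobs):
    """Probability-weighted score from a {score: logprob} dict of candidate score tokens."""
    # Convert logprobs to probabilities and renormalise over the candidate scores.
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(s * p / total for s, p in probs.items())

# Hypothetical logprobs for the score tokens 1..5
example = {1: math.log(0.05), 2: math.log(0.10), 3: math.log(0.20),
           4: math.log(0.40), 5: math.log(0.25)}
score = expected_score(example)
```

Instead of a flat "4", you get a continuous score that reflects how unsure the model was between 3, 4 and 5.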

GPT4ALL

Works quite well in mac

Only issue is I had to download 2 separate things (1 weights and 1 executable) from two different places

Terminal interface quite good, can pipe input/output as well

Result though not really great (Damn these days billion is not enough, we need to enter the trillion era fr).

Should I finetune this over some OSS docs??? The train.py script seems easy peasy, but setting it up on colab might give me nightmares.

LLM as Terminal commands like jq (Brain explode jpg)

Actually, I am just going to lora the fuck out of GPT4All

Not going well honestly, too many errors

Finally, at least it ran once I gave up on accelerate and just ran it using normal python3 train.py

Still failed while trying to slice 0-length tensor, seems to be related to tokenizer returning empty

Btw, this transformers submodule looks interesting, need to check it out later hehe

https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf
So basically it takes 8 hours on A100 to finetune LoRA (hmm, and also $800), not worth it for me honestly (a noob colab user)

Tried running the finetuning in colab with the following configs - https://github.com/nomic-ai/gpt4all/issues/108

Dataset taken from  - https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations_with_p3/blob/main/data.jsonl

Disabled wandb logging (poor guy, no account)
Running into some tokenization errors with zero length tensors.

OK, LFG!!!, It is working, I am able to finetune. Seems like you need source column as well along with prompt and response 

Seems like it is not utilizing GPU at all?? Memory usage seems 0. Only running on CPU
WTF. This is when running with python3 train.py

Running accelerate launch doesn’t work at all in colab. Saw too many errors on github as well with no clear solution.

Using the P3 removed dataset now (had to fucking convert parquet to jsonl)

Let’s see if it works

So turns out I am dumb (obviously)

You can just use the parquet dataset with P3 removed from the hugging face (https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations/tree/main/data). The code already supports that. Let’s see if it works this time

Naah Still fails.

I want to figure out this generative agent shit for sure. Like Agents are literally the next step.

LangChain is also moving quite fast on this.

Vector DB

I clearly know in principle how Vector DBs work.
But I still don’t understand how they solve the performance issue of searching through all embeddings for nearest neighbours (that’s cause I am bad at maths)

Pretty sure there’s some heuristic from which you can select a smaller set of segments/files containing the embeddings that can fit into the memory.

OK, looked at the chromaDB codebase and these mfs are just using wrappers on duckDB and hnswlib.

Duckdb stores the data and hnswlib creates the index for similarity search.

Mf hnsw codebase is in c++. Time to paste it into chatGPT to understand wtf is going on.

So for searching at least, it is simply doing a greedy, BFS-style traversal of a graph, pushing candidates into a priority queue ordered by distance
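Here is a minimal sketch of that graph search in plain Python. It is a single-layer simplification (real HNSW stacks several layers and hnswlib does this in C++), and the function names, the `ef` parameter usage and the toy graph are my own assumptions for illustration.

```python
import heapq

def dist(a, b):
    # Squared Euclidean distance; a real index lets you pick the metric up front.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def graph_search(vectors, neighbours, query, entry, k=2, ef=4):
    """Best-first search on a proximity graph; returns ids of the k closest nodes found."""
    visited = {entry}
    d0 = dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    best = [(-d0, entry)]        # max-heap (negated distances): the ef best nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # every remaining candidate is worse than our worst result
        for nb in neighbours[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(vectors[nb], query)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst result
    return [n for _, n in sorted((-d, n) for d, n in best)][:k]

# Toy graph: five points on a line, each linked to its immediate neighbours.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0), 4: (4.0, 0.0)}
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = graph_search(vectors, neighbours, query=(3.2, 0.0), entry=0)
```

The search only touches nodes reachable through neighbour links, which is why it never has to scan every embedding.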

Open questions :

At least now I understand why they ask for the similarity measure at the time of creating an index in pinecone.

Also benchmarks for most nearest neighbour algos -  https://github.com/erikbern/ann-benchmarks

GPU Basics

OK, so to create a scalable vector DB, I guess there’s no solution other than to use GPU (since vector search is compute heavy)

This is def outdated as it refers to P100 which is quite old (Apparently a lot of it is still same till Ampere, but changed significantly in Hopper)

So they follow SIMT instead of SIMD. Each of the 32 lanes/threads in a warp executes the same instruction but on different data. It’s kinda like a turbocharged SIMD.

Much better diagram from - https://web.ecs.syr.edu/~ffiorett/files/papers/padl14.pdf

I should learn GPU programming fr

 

Some alpha on GPU in this similarity search paper (https://arxiv.org/pdf/1702.08734.pdf)

HyDE

This guy is sooooooo correct. That’s literally what I am facing.

I need to push a million docs in context.

Map-reduce summarisation is losing sooo much info.

Plus making too many GPT calls is just sloooow (mf, gpt chat APIs support batching soon)

This is genius. Mfw you can call vectorDB at the end of the chain rather than the start.

We literally took a hallucinated answer and converted it to actual text.

Issue is though, what if GPT can’t even hallucinate, like it literally outputs one line?

One more drawback is that you have to compute embeddings of large answers every time instead of a shorter query.

So two performance bottlenecks.
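A toy sketch of the HyDE flow, under heavy assumptions: `embed` is a bag-of-letters stand-in for a real embedding model, `generate_answer` is a stand-in for the LLM call, and the docs are made up. Only the order of operations is the point here.

```python
import math

def embed(text):
    """Toy bag-of-letters embedding; a stand-in for a real embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def hyde_search(question, docs, generate_answer):
    # 1. Let the LLM write a (possibly hallucinated) answer.
    fake_answer = generate_answer(question)
    # 2. Embed that answer instead of the short question.
    q_vec = embed(fake_answer)
    # 3. Return the real document closest to the fake answer.
    return max(docs, key=lambda d: cosine(q_vec, embed(d)))

docs = ["The capital of France is Paris.", "Bananas are rich in potassium."]
fake_llm = lambda q: "Paris is the capital city of France."   # hypothetical LLM output
best_doc = hyde_search("What is the capital of France?", docs, fake_llm)
```

Both bottlenecks show up directly: one extra LLM call in step 1 and a longer text to embed in step 2.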

I guess I should give shoggoth lang (the weird compression by GPT) a try. It’s lossy but might help here cause we are retrieving from the database.

Paper - https://arxiv.org/pdf/2212.10496.pdf

Generative Agent

We are actually simulating society now via LLMs.

Again just plugging in memory + some prompt engineering to reason works wonders.

https://arxiv.org/pdf/2304.03442.pdf

This agent might just be better than me cause I can’t remember who the fuck I ran into at the park and clearly not what they were working on (unless they are working on some cool shit).

Btw this runs on GPT-3.5-TURBO and not GPT-4.

Just switching the model will make this soooooo much more impressive.

So for simulating memory, they are simply storing whatever the agent observed in some DB. At the point of interaction, these observations are fetched and then rated on recency, importance (by directly calling the LLM to assign a score) and relevance to the current situation. Take weighted sum, select as many memories as can fit in context window of LLM and hit it.
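A small sketch of that retrieval step. The field names, the equal weights, the hourly decay factor and the token counts are assumptions for illustration; relevance would normally come from embedding similarity, here it is just passed in.

```python
def retrieval_score(memory, now_hours, relevance, decay=0.99, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of recency, importance and relevance, each roughly in [0, 1]."""
    recency = decay ** (now_hours - memory["last_access_hours"])  # exponential decay
    importance = memory["importance"] / 10.0                      # LLM rating on 1..10
    w_rec, w_imp, w_rel = weights
    return w_rec * recency + w_imp * importance + w_rel * relevance

def pick_memories(memories, now_hours, relevances, token_budget):
    """Take the top-scoring memories until the context budget is used up."""
    ranked = sorted(zip(memories, relevances),
                    key=lambda mr: retrieval_score(mr[0], now_hours, mr[1]),
                    reverse=True)
    chosen, used = [], 0
    for mem, _ in ranked:
        if used + mem["tokens"] <= token_budget:
            chosen.append(mem["text"])
            used += mem["tokens"]
    return chosen

memories = [
    {"text": "brushed teeth", "importance": 1, "last_access_hours": 9, "tokens": 5},
    {"text": "asked crush out", "importance": 8, "last_access_hours": 2, "tokens": 6},
    {"text": "bought groceries", "importance": 2, "last_access_hours": 8, "tokens": 7},
]
picked = pick_memories(memories, now_hours=10, relevances=[0.1, 0.9, 0.3], token_budget=12)
```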

I feel that this type of scoring via LLM doesn’t work really well though. Same was said in that using GPT as labeller paper.

On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory.

Memory: buying groceries at The Willows Market and Pharmacy
Rating: <fill in>

This prompt returns an integer value of 2 for “cleaning up the room” and 8 for “asking your crush out on a date.” The importance score is generated at the time the memory object is created.

^ Brain explode JPG (Honestly though IRL, it’ll be too slow to be practical in large systems due to sheer number of LLM calls but that should be resolved in like 6 months)

So for Reflection they are using memory as well. Just gather recent observations, ask LLMs what questions should I ask based on these observations and then ask LLM again to answer those questions.  It is only triggered btw when importance score of observations goes beyond a certain threshold. Generally that happens twice-thrice a day (man, this is like a human lol)

So after solving for Observation and Reflection, the final step is Planning. Mainly how to avoid duplicate actions like eating food twice in the same hour as well as incoherent actions like playing basketball in a swimming pool.

Again, they are using memory for it. Just store the current plan along with Observation and reflection. First ask LLM to make a broad plan for the day, then using LLM start to break that plan into chunks of 1 hour. Then break it down further into chunks of 15 minutes. This solves for non-duplicate actions.

Secondly, you can also make changes to the plan based on new observations. Again this can be done by asking LLM questions like

Final complete architecture.

 
I feel like the number of things that would become possible if LLM inference took single-digit millis is huge. We might actually be able to simulate brain functions. FAAASTEEEER COMPUTTTE! QUANTISATION!!!

Toolformers

Ok so this paper had some maths equations

But turns out those were just API calls represented as a(i) -> r, where a is api, i is input and r is output

So they are first taking a clean dataset, then selecting the places where we can insert some API call

This is being done by asking an LLM

They then filter the API calls, keeping only the ones where the output of the API is actually helpful. They do this by comparing the CE loss on the following tokens when the API result is provided vs when it is not. Only if the loss drops by at least a configured threshold tau is the API call kept in the text. This is only during training.
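The filtering rule can be sketched in a few lines. The loss values below are invented; in practice they would be the LM's cross-entropy on the tokens after the insertion point, and the function name is mine.

```python
def keep_api_call(loss_with_result, loss_without_call, loss_call_no_result, tau=1.0):
    """Keep the call only if seeing its result lowers the loss by at least tau."""
    # Baseline: the better of "no API call at all" and "call shown but result hidden".
    baseline = min(loss_without_call, loss_call_no_result)
    return baseline - loss_with_result >= tau

# Hypothetical losses for two candidate insertion points.
useful = keep_api_call(loss_with_result=1.2, loss_without_call=2.9, loss_call_no_result=2.5)
useless = keep_api_call(loss_with_result=2.4, loss_without_call=2.9, loss_call_no_result=2.5)
```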

Now they finetune an LLM with this context + API call

And voila

You now have an LLM that can predict API calls

During inference, you simply detect → token

As soon as you get it, means the preceding text was an API call and now you actually need to do the work and fill the result before predicting rest of the tokens.
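A toy sketch of that inference hook, assuming calls are written like `[Tool(args) → result]`. The `TOOLS` registry, the regex and the calculator are hypothetical helpers, and `eval` is only acceptable here because it is a toy.

```python
import re

# Hypothetical tool registry; eval with empty builtins is for the toy calculator only.
TOOLS = {"Calculator": lambda expr: str(round(eval(expr, {"__builtins__": {}}), 2))}

def fill_api_result(text):
    """If text ends with an open call like '[Calculator(2+3) →', append the result."""
    match = re.search(r"\[(\w+)\((.*?)\)\s*→$", text)
    if match is None:
        return text                      # no pending API call, keep decoding normally
    tool, args = match.groups()
    return text + " " + TOOLS[tool](args) + "]"

filled = fill_api_result("That is [Calculator(400/1400) →")
```

After the result is spliced in, decoding continues from the filled text.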

Pretty simple paper honestly, but finetuning is pain in the ass. Can LoRA be applied here during training?

ReAct

One of the most popular papers on getting LLM to be more humanlike

Ngl, too many words for what seems like prompt engineering

Should skip directly to appendix

Yeah, I didn’t find a lot of value here. The results also don’t seem that big of an improvement.

Gonna stick to a few-shot prompting for most tasks.

LLMA Decoding Acceleration

Link - https://arxiv.org/pdf/2304.04487.pdf

Basically why are you generating tokens one by one anon when you can simply copy a lot of it from the context in the prompt itself

Like when you do context based or history based Question answering or something else, the LLM will (as expected) generate a lot of stuff as it is mentioned in the context.

So you can simply change the LLM inference design so that when the last n generated tokens match some text in the input, we take the next k tokens from the input and append them to the output as a draft.

Then we ask the LLM to check all k draft tokens in a single parallel pass and drop everything from the first token that doesn’t match what it would have generated.

So your inference is reduced from k steps to just 2.
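A sketch of the copy-then-verify idea on plain token lists. `model_tokens` stands in for what the LLM would predict at each draft position in one parallel forward pass; both helper names are made up.

```python
def find_copy_span(context, output, n, k):
    """If the last n output tokens occur in context, return the k tokens that follow."""
    if len(output) < n:
        return []
    tail = output[-n:]
    for i in range(len(context) - n + 1):
        if context[i:i + n] == tail:
            return context[i + n:i + n + k]
    return []

def accept_prefix(draft, model_tokens):
    """Keep draft tokens up to the first disagreement with the model's own tokens."""
    accepted = []
    for d, m in zip(draft, model_tokens):
        if d != m:
            break
        accepted.append(d)
    return accepted

context = "the cat sat on the mat today".split()
output = "I think the cat".split()
draft = find_copy_span(context, output, n=2, k=3)
accepted = accept_prefix(draft, ["sat", "on", "a"])   # hypothetical model predictions
```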

[BTW, most of these papers have hidden alpha in appendix where they tell their prompt templates]

Honestly the pace of LLM research is so fast that a lot of these techniques can be simply made better

By plugging in the techniques from the other papers published in last month or so.

E.g. you can improve the generative agent paper by using a better scoring technique from GPTEval paper (take weighted sum of logprobs instead of absolute rank)

Also, I might need to learn Linear algebra fr now to understand some of these papers better  (but let’s see how much I can avoid doing that first). Luckily most of them have just basic error loss or probabilities equations for now.

Some more papers I need to read (related to making inference faster) -

4-bit quantisation - https://arxiv.org/abs/2212.09720 (not gonna have to pretend that I know about it while discussing llama.cpp)

SparseGPT - https://arxiv.org/abs/2301.00774 (mostly related to pruning I guess)

Compression - https://arxiv.org/abs/2002.02925

Distillation (of attention) - https://arxiv.org/abs/2002.02925

NOOO, THEY AUTOMATED L3s (KINDA!)

It might very well be already better than me. Only issue with such systems again is the inference speed.

But then again, I play Apex or Overwatch while debugging an issue so we are both inefficient. (Codex = code-davinci-002)

Hmm, I remember reading about this high-level semantic idea in the Generative Agent paper as well (specifically in the planning phase).

Basically first generate a good enough summary of the code, break it down into chunks and then recursively generate a finer summary of each chunk.

This way the program doesn’t hallucinate much or create duplicate summaries.

OpenAGI

Not sure about their few-shot approach cause the solution is mentioned in the prompts itself.

Idea is interesting though but isn’t that exactly what toolformer/plugin is doing though?

Handing over the complex tasks to plugins and then using their output.

I guess toolformer can’t operate in the pipelined fashion that is being done here. Not really sure though.

Following is much better way IMO VVVV

4-bit quantization

Reminded me why I hate being in this limbo of US and UK english. Like should I use sation or zation

FUCK OFFF!

Anyways,

Quantisation is basically mapping a value to a finite set F

So e.g. you can map 32-bit floating point values to a set of 16 floating point (or integer) values (16 values = 4 bits)

For the most basic quantisation the technique is quite simple
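A minimal sketch of the most basic flavour (absmax scaling to a symmetric integer range), assuming one scale per block of values. Function names are mine and this is not how any particular library spells it.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    levels = (1 << (bits - 1)) - 1            # 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / levels if amax else 1.0    # one float scale for the whole block
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    # Lossy inverse: every value snaps back to the nearest multiple of scale.
    return [q * scale for q in quants]

values = [0.7, -0.3, 0.1, 0.0]
quants, scale = quantize_absmax(values)
restored = dequantize(quants, scale)
```

The error per value is at most half a step (scale / 2), which is the price for storing 4 bits instead of 32.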

Implementation in llama.cpp

Seems simple but there are still a few gaps in my understanding.

Also, why the fuck is markdown code block not working here.

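```
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK 32  // number of weights per quantisation block
#define MAX(a, b) ((a) > (b) ? (a) : (b))

// Block layout inferred from the fields used below: one float scale + QK packed 4-bit values.
typedef struct {
    float   d;           // scale (delta)
    uint8_t qs[QK / 2];  // two 4-bit quants per byte
} block_q4_0;

static void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;  // number of blocks in this row

    uint8_t pp[QK/2];

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max of the block

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            amax = MAX(amax, fabsf(v));
        }

        const float d = amax / ((1 << 3) - 1);  // scale so that amax maps to 7
        const float id = d ? 1.0f/d : 0.0f;     // inverse scale (0 for an all-zero block)

        y[i].d = d;

        for (int l = 0; l < QK; l += 2) {
            const float v0 = x[i*QK + l + 0]*id;
            const float v1 = x[i*QK + l + 1]*id;

            // round to [-7, 7], then shift into the unsigned range [1, 15]
            const uint8_t vi0 = (int8_t)roundf(v0) + 8;
            const uint8_t vi1 = (int8_t)roundf(v1) + 8;

            assert(vi0 < 16);
            assert(vi1 < 16);

            pp[l/2] = vi0 | (vi1 << 4);  // pack two quants into one byte
        }

        memcpy(y[i].qs, pp, sizeof(pp));
    }
}

size_t ggml_quantize_q4_0(const float * src, void * dst, int n, int k, int64_t * hist) {
    assert(k % QK == 0);
    const int nb = k / QK;

    for (int j = 0; j < n; j += k) {  // one row of k floats at a time
        block_q4_0 * restrict y = (block_q4_0 *)dst + j/QK;

        quantize_row_q4_0_reference(src + j, y, k);

        // histogram of the 16 possible quantised values
        for (int i = 0; i < nb; i++) {
            for (int l = 0; l < QK; l += 2) {
                const uint8_t vi0 = y[i].qs[l/2] & 0xF;
                const uint8_t vi1 = y[i].qs[l/2] >> 4;

                hist[vi0]++;
                hist[vi1]++;
            }
        }
    }

    return (n/QK*sizeof(block_q4_0));
}
```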

Also, I know it is just a 1 line call in pytorch afaik but it’s fun to look into how some stuff is actually implemented internally. Might be useful during the apocalypse when github arctic vault is taken over by neo-atlantians.

AutoGPTs, Hmmmm

It seems like just looking at prompts.txt is enough for such tools (praise the LLMs)

The tooling is mostly trivial

Only need to ensure the output of LLM is in JSON (for which I saw A LOOOOOT of hacks in their codebase)

Anyways, so it is mostly the prompts in the following fashion

Thoughts -> reasoning -> Plan -> Criticism -> Speak

Thoughts dictate the motivation

Reasoning justifies it

Plan lists down the steps in bullet points

Criticism makes sure to not do anything illegal mentioned in the original instructions

Speak is optional, if you want to communicate what you’re doing to the user in summarised form
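A sketch of the "make sure the output is JSON" part. The key names follow the list above and are not necessarily AutoGPT's actual schema; the brace-hunting is just one of the possible hacks.

```python
import json

REQUIRED = ("thoughts", "reasoning", "plan", "criticism")

def parse_agent_reply(raw):
    """Extract the outermost {...} block from an LLM reply and check the expected keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    reply = json.loads(raw[start:end + 1])
    missing = [key for key in REQUIRED if key not in reply]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    reply.setdefault("speak", "")   # speak is optional
    return reply

# Hypothetical LLM reply with chatter around the JSON.
raw = ('Sure, here you go: {"thoughts": "need data", "reasoning": "cannot plan without it", '
       '"plan": "- search web", "criticism": "do not loop forever"} Hope that helps!')
reply = parse_agent_reply(raw)
```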

Deepspeed

Check out the RLHF module in Deepspeed chat

DeepSpeed/README.md at master

Btw, 1000X programmer ggerganov cooked another appetizer. This time it’s whisper doing inference via the large model in just 1 second on macs.

INT4 finetuning for LLMs

WUT!

https://github.com/stochasticai/xturing

We can finetune it now on our laptops

Praise the omnissiah!

I feel though the time taken will be HUUUUUUUUGEEE

Ok so in the stats it still takes quite long (6-7 hours) to finetune on 3070.
I can assume somewhere near 5 hours on 4090.

Practical buttt… who’s gonna monitor it for so long for a side project.

Also, their notebook is a bit weird, it uses docker command inside the notebook (first time seeing this)

AutoGPTs, Hmmmm, Hmmmm

Seems like everyone hates vectorDBs now.

I still think they have a proper usecase once you go beyond 10 million vectors.

On that note, I should checkout Instructor-xl - https://arxiv.org/pdf/2212.09741.pdf

VISION + LLM = AI Manga with dialogues (https://github.com/Vision-CAIR/MiniGPT-4)

https://huggingface.co/blog/stackllama

MultiGPTs (https://github.com/rumpfmax/Multi-GPT/), More LLMs with Vision (https://arxiv.org/abs/2304.08485), Prompt inversion (https://arxiv.org/abs/2304.08460) - man this list keeps on growing

Chameleon

(https://arxiv.org/pdf/2304.09842.pdf)

Better than toolformer (according to them), in the sense it doesn’t require any finetuning for new tools

You simply use LLM to generate a plan based on an inventory of existing modules. Since inventory can be updated easily, integrating new tools is quite simple. Only limitation is the context window length of the LLM.
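A toy sketch of the plug-and-play part: the plan is just a list of module names from an inventory, executed in order. The modules, the state dict and the hard-coded plan are all hypothetical; in the paper the plan would come from the LLM planner.

```python
def run_plan(plan, inventory, query):
    """Run the planned modules in order, threading a shared state dict through them."""
    state = {"query": query}
    for name in plan:
        if name not in inventory:
            raise KeyError(f"planner picked a module that is not in the inventory: {name}")
        state = inventory[name](state)
    return state

# Toy inventory; adding a new tool is just adding a new entry here, no finetuning.
inventory = {
    "search": lambda s: {**s, "docs": [f"doc about {s['query']}"]},
    "answer": lambda s: {**s, "answer": f"Based on {len(s['docs'])} doc(s)"},
}
plan = ["search", "answer"]   # stand-in for the LLM-generated plan
result = run_plan(plan, inventory, "flash attention")
```

The context window limit shows up in the planner prompt, since it has to describe every module in the inventory.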

Evaluate Code +

Paper - https://arxiv.org/pdf/2304.09433.pdf

Basically they are using LLMs to convert unstructured corpus of text to structured data.

The new thing here is mainly not simply using the direct approach via prompt.

It’s to generate a schema and code to do the conversion.

Reasoning being the text is subject to change and you can keep on adding more details. Running via the direct approach is expensive since it needs to re-evaluate everything for the answer, while with the new approach you can simply execute the python code.

Another thing is determining which code is correct. They generate multiple code blocks and then basically

On reading more I feel this will be extremely good for finbros. They are the ones dealing with most unstructured data honestly (shocking but true, not all data is excel). Can help them extract stats easily from multiple company’s data which are in totally different formats.

The True magic is in the function synthesis part which involves weak supervision.

Can The foundation be just an LLM? If only Hari Seldon read this paper

https://arxiv.org/pdf/2304.11062.pdf

On more analysis, it is not that practical to use. The model can only retain a fixed amount of memory.

Scaling Transformer to 1M tokens and beyond with RMT (Paper Explained)

Not worried about the end of normie SDE day job from AI cause I am craving for more work honestly.

Iter-CoT

Guess what, we added another loop in the GPT

Nothing major honestly, just keep on correcting CoT (think step by step) with either

Doing this for 5-6 iterations results in correct results.

Now these correct demonstrations can be used in Few-shot prompts the next time (god I wish if context length was bigger)

WizardLM

Just using prompts to expand the dataset to include complex tasks.

Exploring both breadth and depth.

Claims it’s better than chatGPT for complex tasks.

Not sure. Doubt.

------------------------------------------------------------------------------------------------------------

Btw, I just realised I don’t understand anything in ML after reading about DinoV2. I need to know about self supervised learning as well as distillation. Guess it’s time for chatGPT to teach me about this.

DECKARD - RL Agent that dreams

Paper - https://arxiv.org/pdf/2301.12050.pdf 

Basic funda is instead of starting a RL agent from zero knowledge, you use LLM to create an Abstract World Model (AWM) and then use the model as starting point.

The model is created during the dream phase.

The model is verified by the RL agent using rewards in the wake phase.

Was tested in Minecraft where it creates recipes for crafting stuff in dreams and then the RL agent learns to actually mine/craft those recipes. If something is not valid, then it is discarded from the graph and new verified nodes are added.

LLM is pretty good at generating recipes. Mostly fucks up in quantities but not the ingredients.

They are mostly using code-davinci-002 for generation.

Training LLMs using AI generated dialogues

Paper - [2304.14318] q2d: Turning Questions into Dialogs to Teach Models How to Search

One of the few papers that uses PaLM-540B instead of GPT (although they provide code that works with the latter)

Premise is you can generate human annotator level chat conversations with LLMs.

Steps

  1. Get question from a database
  2. Ask PaLM to generate a conversation (using few shot examples in demonstration)
  3. Generate a reverse query based on the generated dialogues and few-shot examples. Again use PaLM
  4. Filter the conversation based on -
     1. If the intent of the generated query and the original query differ (uses SBERT)
     2. Where the answer is included in the dialogue itself. (the answer should ideally be the last response of the LLM in training data)
     3. Some of the generated dialogues match with the original question.

 

Automating Data Analysts [By Microsoft(™)]

Paper: [2305.01598] From Words to Code: Harnessing Data for Program Synthesis from Natural Language

I guess this might be the way they generated code for copilot as well. Filled with lots of small small nuggets. They are quite focused on good UX and not just research which is great.

Using code-davinci-002 for LLM calls

Problem statement is to write K programs to process data D and then present them to the users.

What makes it different is the way those K programs are being ranked. The aim is to offer programs to the user that are correct but have enough variability. There’s no point in presenting 100 programs that look exactly the same.

Steps

  1. Get a prompt to generate program
  2. Attach data schema and some rows to the prompt
  3. Ask it to generate N programs where N > K. Use multiple temperatures (0, 0.6, 0.8) for good enough variability. Generally prefer high temperatures. (this ensures your programs are not near-copies of each other, otherwise if one is wrong, all the others are wrong as well)
  4. Order generated programs by average logprob of tokens. (the default ordering just follows initial seed prob distribution, this otoh yields better results)
  5. Execute programs on a subset of data and filter out ones with exceptions.
  6. Re-rank magic (so that a diverse set of programs is shown to the user first instead of similar ones)
     1. Calculate the output of every program
     2. Group programs with similar output together
     3. Rank programs in each group (based on logprob)
     4. Return top programs from each group first, then second best programs from each group second and so on
  7. You can then filter out programs for which the output doesn’t pass the data quality check. Right now, just filtering out outputs containing null columns or empty tables.
  8. (optional) You can also ask LLM to predict N outputs for the original task and data schema. You can then compute similarity scores w.r.t. the actual program output and use that to modify the scores before step 4. This is helpful for tasks for which the model has seen very few samples in the training data e.g. numpy programs (popular) vs Power query (not so popular)
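The re-rank step from the list above can be sketched like this. "Similar output" is simplified to exact equality of the output, and the tuple layout and program names are invented for the example.

```python
from collections import defaultdict
from itertools import zip_longest

def diverse_rerank(programs):
    """programs: list of (name, avg_logprob, output). Interleave groups of equal output."""
    groups = defaultdict(list)
    for name, logprob, output in programs:
        groups[repr(output)].append((logprob, name))
    # Best program first inside each group, group with the best top program first.
    ranked_groups = sorted((sorted(g, reverse=True) for g in groups.values()),
                           key=lambda g: g[0], reverse=True)
    # Round-robin: top of every group, then second of every group, and so on.
    result = []
    for tier in zip_longest(*ranked_groups):
        result.extend(item[1] for item in tier if item is not None)
    return result

programs = [("p1", -0.1, [1, 2]), ("p2", -0.2, [1, 2]), ("p3", -0.5, [3]),
            ("p4", -0.3, [1, 2]), ("p5", -0.9, [3])]
order = diverse_rerank(programs)
```

With plain logprob ordering the user would see p1, p2, p4 (all the same output) before ever seeing p3.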

Local PC Waifu

Ooga Booga chat UI + Stable diffusion Character pfp + pygmalion-6b character model + custom persona (inspired by some submissions from https://botprompts.net/)

On more experiments, the superCot LoRA also does pretty well with character simulation (even with characters that use profanity)
However, the dialogues are super short for some reason.
Reminds me of the tweet by Noelle on how pygmalion dialogues feel much more natural/human cause it is trained on actual chat data instead of artificial datasets.

(FLARE) Active Retrieval Augmented Generation

Basic problem they are trying to solve is generating long text for a question based on a retrieval.

Retrieving a lot of chunks based on a small question doesn’t work well.

What’s needed is as you are generating text, you keep on fetching chunks to fill the information.

We need to solve for three problems though with this method

 

They solve for these problems using the following methods

I LOVE COMPUTERS!!!!!

A tear rolled down my face (so beautiful, I will never stop loving computers)

Flash Attention

Calculating attention is really slow. It’s limited by memory bandwidth, since the intermediate attention matrix scales quadratically w.r.t. sequence length (N x N).

Basic funda is to make the algo IO-Aware (at first I thought they will be doing some syscalls to get io stats but its not that)

You split the original matrices of size N * d into multiple blocks of size B.

Run two loops

The outer one iterates over each block of Key K and Value V matrices and loads them from high bandwidth memory (HBM) to SRAM

The inner loop iterates over blocks of Query Q and calculates the softmax and other matrices for output.

During backprop, the algo is changed a bit so that intermediate matrices (softmax and dot product) need not be loaded from HBM again. The algo recomputes them block by block in SRAM from Q, K, V and the softmax statistics saved during the forward pass.

Hence, you basically spend more compute to do fewer memory accesses.

Luckily GPUs have a lot of compute so it’s not a problem.
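A pure-Python sketch of the two-loop idea, just to check the maths. Simplifications that are mine, not the paper's: the inner loop goes one query row at a time instead of blocks of Q, the output is kept unnormalised and divided by the running sum at the end, and nothing here actually touches SRAM or HBM.

```python
import math

def naive_attention(Q, K, V):
    """Reference: full softmax(QK^T)V, one row of scores at a time."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e * v[j] for e, v in zip(exps, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    """Same result as naive_attention, but K/V are visited one block at a time."""
    d_v = len(V[0])
    O = [[0.0] * d_v for _ in Q]     # unnormalised running output
    l = [0.0] * len(Q)               # running softmax denominators
    m = [-math.inf] * len(Q)         # running row maxima
    for start in range(0, len(K), block):            # outer loop: K/V blocks
        Kb, Vb = K[start:start + block], V[start:start + block]
        for i, q in enumerate(Q):                    # inner loop: queries
            scores = [sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m[i], max(scores))
            rescale = math.exp(m[i] - m_new)         # fix up sums built with the old max
            p = [math.exp(s - m_new) for s in scores]
            l[i] = l[i] * rescale + sum(p)
            O[i] = [o * rescale + sum(pj * v[j] for pj, v in zip(p, Vb))
                    for j, o in enumerate(O[i])]
            m[i] = m_new
    return [[o / l[i] for o in row] for i, row in enumerate(O)]

Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.1]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref, tiled = naive_attention(Q, K, V), tiled_attention(Q, K, V)
```

The full N x N score matrix never exists in `tiled_attention`; only one block of scores per query is alive at any time.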

Here’s GPT4s simpler explanation in case you lose a few brain cells after binging anime with tsunderes

Now, let's go through the algorithm step by step:

1. **Set block sizes**: The algorithm sets the size of the blocks it will divide the data into, based on the available on-chip memory (SRAM) and the dimension of the data (d).

2. **Initialize**: The algorithm initializes an output matrix O, and two vectors l and m with zeros and negative infinity, respectively. These are used to store the results of the computations.

3. **Divide Q, K, V**: The matrices Q, K, and V are divided into smaller blocks.

4. **Divide O, l, m**: Similarly, the output matrix O, and the vectors l and m are divided into smaller blocks.

5-15. **Compute attention**: For each block of K and V (steps 5 and 6), and for each block of Q, O, l, and m (steps 7 and 8), the algorithm performs the following computations:

   - Compute the attention scores S (step 9).

   

   - Compute the max of each row of S (step 10). This is used for numerical stability when computing the exponentials in the next step.

   

   - Compute the exponentials of the attention scores, normalized by subtracting the max computed in the previous step. This produces the attention probabilities P (step 10).

   

   - Compute the sum of each row of P (step 10). This is used to normalize the attention probabilities.

   

   - Update the vectors l and m using the values computed in the previous steps (step 11). These are used to keep track of the maximum attention scores and the sum of the attention probabilities.

   

   - Write the updated values of O, l, and m back to the memory (steps 12 and 13).

16. **Return**: The algorithm returns the final output matrix O, which contains the weighted sum of the Values (V), weighted by the computed attention probabilities.

This algorithm optimizes the attention computation by processing the data in blocks that fit into the on-chip memory, which is much faster than main memory (HBM). This is a technique known as "block-based computation", and it is widely used in computer science to optimize computations for memory-hierarchy-based systems.

In the context of the Flash Attention algorithm, l and m are vectors used in the calculation of the softmax function, which is a crucial part of the attention mechanism in transformer-based models.

The softmax function is used to convert raw scores (in this case, the result of QK^T, denoted as S_ij in the algorithm) into probabilities that sum up to 1. This function has an exponential operation, which can lead to numerical instability issues (like overflow or underflow) when the raw scores are very large or very small.

To mitigate this issue, a common trick used in practice is to subtract the maximum value in the set of scores from all scores before applying the exponential function. This is where m comes in -- it's used to store the maximum values.

Specifically, in Step 10 of the algorithm:

m~ = rowmax(S_ij) calculates the maximum value of each row in the score matrix, and

P~ = exp(S_ij - m~) subtracts these max values from the scores before applying the exponential function.

In Step 11:

m_i = max(m_i, m~_ij) updates m with the new maximum values.

The l vector, on the other hand, is used to store the sum of the softmax outputs (i.e., the attention probabilities). In Step 10, l~ = rowsum(P~_ij) calculates the sum of each row in the attention probability matrix. Then, in Step 11, l_new = exp(m_i - m_new) * l_i + exp(m~_ij - m_new) * l~_ij updates l with the new sums, where m_new is the updated maximum value. The two exp factors rescale the old and new partial sums to the same maximum. These sums are used in Step 12 to normalize the attention probabilities.

Overall, l and m are used to perform stable softmax calculations and to store intermediate results that are used for later normalization.
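A minimal NumPy sketch of the two loops with the running m and l stats, to make the steps above concrete. It is not the real kernel: there is no SRAM/HBM here, the 1/sqrt(d) scaling and masking are left out, and the function names and block size are my own assumptions. It only shows that block-wise updates reproduce plain softmax attention.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: materialises the full N x N score matrix.
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    N = Q.shape[0]
    O = np.zeros((N, V.shape[1]))
    l = np.zeros(N)              # running softmax denominators
    m = np.full(N, -np.inf)      # running row maxima
    for j in range(0, K.shape[0], block):      # outer loop: K/V blocks
        Kj, Vj = K[j:j + block], V[j:j + block]
        for i in range(0, N, block):           # inner loop: Q blocks
            rows = slice(i, i + block)
            S = Q[rows] @ Kj.T
            m_blk = S.max(axis=1)
            P = np.exp(S - m_blk[:, None])
            l_blk = P.sum(axis=1)
            m_new = np.maximum(m[rows], m_blk)
            a = np.exp(m[rows] - m_new)        # rescales the old stats
            b = np.exp(m_blk - m_new)          # rescales this block's stats
            l_new = a * l[rows] + b * l_blk
            O[rows] = ((a * l[rows])[:, None] * O[rows]
                       + b[:, None] * (P @ Vj)) / l_new[:, None]
            l[rows], m[rows] = l_new, m_new
    return O
```

The full score matrix never exists at once; only one block of S and P is alive per inner iteration.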

ALiBi

Code - https://github.com/ofirpress/attention_with_linear_biases

This algo (along with FlashAttention) is currently being used to extend the context length of the models. The most popular use case being mpt-7b-storywriter.

This algo btw is just a hack that works (not making this up, they admit this in their README btw)

Basically, they remove positional embeddings from the attention calculation.


And then they bias the attention scores by -m.X where X is the distance between the query and the key, i.e. the farther the word, the more heavily it is penalised.

That’s it. Somehow it works.

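m is also constant btw, a fixed per-head slope determined before training and not learned (a geometric sequence: ½, ¼, ... for 8 heads; it starts at ½ ^ 0.5 for 16 heads)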
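A small sketch of what the bias looks like, assuming a power-of-two head count and a causal model. The function names are mine, not from the ALiBi repo; the slope recipe (geometric sequence starting at 2^(-8/n)) is the one described in the paper.

```python
import numpy as np

def alibi_slopes(n_heads):
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ... (one slope per head).
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]          # query index minus key index
    bias = -alibi_slopes(n_heads)[:, None, None] * dist[None]
    # Causal mask: keys in the future get -inf.
    return np.where(dist[None] >= 0, bias, -np.inf)
```

The returned (heads, seq, seq) array is simply added to QK^T before the softmax; no positional embedding is added to the token embeddings.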

Hack to make inference faster (by HuggingFace)

Assisted Generation: a new direction toward low-latency text generation


gist: Use Smaller LM to generate stuff faster while using a LLM to fix output in case it deviates.

Decision taken on the basis of token mismatch plus output logprobs

Pros - low latency

Cons - More compute (cause you are running both models)
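A toy sketch of the draft-then-verify loop, assuming greedy decoding and using token mismatch only (no logprobs). The two "models" are plain functions from a token list to the next token, which is my simplification; in a real setup the verification of all drafted tokens is a single batched forward pass of the big model.

```python
def assisted_generate(target_next, draft_next, prompt, max_new=8, k=4):
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < max_new:
        # 1. The small model drafts k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The big model checks them (counted as one call here).
        target_calls += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            accepted.append(expected)      # always keep the big model's token
            if expected != tok:            # first mismatch: drop the rest
                break
        out.extend(accepted)
    return out[:len(prompt) + max_new], target_calls
```

The output is identical to what the big model would produce alone under greedy decoding; the win is that it is called fewer times whenever the draft is right.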

Unlimiformer

Paper: [2305.01625] Unlimiformer: Long-Range Transformers with Unlimited Length Input

So how to increase the context length of the transformers?
Flash attention? Ok done

ALiBi? Done

Congrats, you got to 60K in length.

Only problem is I need a million token context to stuff my divorce court case documents

Worry not.

What if you simply plugged in vectorDB into transformer architecture?
Well that would make inference insanely slow

Hmm, but what if we kept it small enough that it always fits in GPU or CPU RAM

Yep, it will work in that case.

This is basically what unlimiformer is. The hidden states are stored in a vector db like FAISS and when attention is computed, instead of multiplying every key and query we only fetch top-N keys and multiply it with query. This call is done separately for each head.
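A rough sketch of the top-N idea for a single head and a single query. The brute-force dot product stands in for the FAISS lookup, and the function name and shapes are assumptions of mine rather than anything from the Unlimiformer code.

```python
import numpy as np

def topk_attention(q, keys, values, k=8):
    # Stand-in for the vector DB: score every key, keep only the best k.
    scores = keys @ q
    idx = np.argpartition(scores, -k)[-k:]
    s = scores[idx]
    w = np.exp(s - s.max())
    w /= w.sum()
    # Softmax and weighted sum run over k entries instead of all of them.
    return w @ values[idx]
```

With k equal to the number of keys this is exactly full attention; with small k it is an approximation that relies on softmax mass concentrating on the top scores.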

Tree of thoughts

Paper: [2305.10601] Tree of Thoughts: Deliberate Problem Solving with Large Language Models

I think we are speedrunning data structures with LLMs at this point. Soon we’ll be getting graph prompting, some esoteric mf balanced tree prompting and so on.



Anyways, this one is an extension of the chain of thoughts (CoT) prompting.

Ask LLM to generate N possible paths to the solution (e.g. generate 1 word to fill the crossword)

Then you can use BFS or DFS to explore all the possible solutions from  each branch.

At each step you can also run a validation (typically using LLM) to see if it's even worth pursuing the branch. If not, you simply discard it.

Another thing you can do is take a vote on which path to follow instead of running validation. Vote is also taken using multiple LLM calls
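The BFS variant can be sketched in a few lines. Here propose, score and is_valid are hypothetical callbacks that would each be LLM calls in practice; the beam width and all names are my own assumptions.

```python
def tree_of_thoughts_bfs(root, propose, score, is_valid, depth, beam=3):
    frontier = [root]
    for _ in range(depth):
        # Expand every surviving state, discarding branches that fail validation.
        candidates = [c for s in frontier for c in propose(s) if is_valid(c)]
        if not candidates:
            break
        # Keep only the most promising states for the next level.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]
```

Swapping the sorted(...) line for a vote over candidates gives the voting variant; a DFS version would recurse into the best child and backtrack when is_valid fails.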

Issue is though I can’t apply such techniques in prod code since LLM calls are too slow rn.

They will be viable when latency comes in hundreds of millis.

Good for problem solving use-cases though.

Honestly, I just had an idea. I should maybe add a tab or something to Ooba which allows me to easily leverage all of this clever prompting.

Model Interpretation

Just struck me how trust is important in society and model interpretation is solving that problem rather than for research.

If we can interpret the black boxes better, we can convince regulators in fields such as medicine, law, civil engineering, piloting EVAs, etc. to use these boxes in high-risk environments.

Two recent experiments, both use LLMs

https://arxiv.org/abs/2305.09863 (Microsoft)

Language models can explain neurons in language models (OpenAI)

The initial step is the same in both i.e. to figure out the tokens for which the neuron activates the most. This can be done quite easily by just recording the neuron’s activations on some inputs.

In the next step, what MS folks do is they ask GPT-3.5 to generate 5 explanations based on the selected tokens. E.g. if selected tokens are wife, sister, father, mother then the explanation can be ‘family and relationships’

Microsoft approach

Next, they ask GPT-3.5 to generate some paragraphs that contain words related to the explanation. 10 paragraphs are generated that contain the related tokens and 10 paragraphs are generated that have none of the related tokens.

Then we pass each of these paras through the neuron and check its activations again. The larger the difference b/w positive and negative samples, the better the explanation is for the neuron. The one with the largest difference is the winner.

OpenAI Approach

Here, they use GPT-4 to simulate the neuron itself. The original text is given and passed to this simulated neuron (just a prompt asking LLM to output scores from 1 to 10 based on the explanation).

The output of the simulated neurons is then compared to the actual activations. The closer the outputs are, the better the explanation is.

This requires considerably fewer steps than the MS approach but I feel it will fuck up on generating scores.

I need to experiment with ImageBind.

So wrote a local memes organizer.

Works pretty well

QLoRA

Blog - https://huggingface.co/blog/4bit-transformers-bitsandbytes

Paper has some math so I didn’t read it honestly till now

But basic funda is quite simple

Using it is quite simple. Just add a few params mentioned in the blog to any existing PEFT based training code e.g. alpaca-lora. The qlora.py code in the official repo seems to be broken.

For the memory usage mentioned in the paper, you need to use batch size of 1 and gradient accumulation steps of 4.

Since it is a bit slow, I used a batch size of 4  and a lora rank of 64.


With that I was able to train a 13B vicuna model on my smol dataset in an hour on a 4080 card.

Fucking awesome!

Compute Metrics for llama-supercot-13B

The output is not great though since I only trained it for an hour. Need to train longer.

Yes your models can memorize exact stuff

https://twitter.com/main_horse/status/1662478420738187266?s=20

This is also a lie btw, just a random narrative

I got tested in the public arena so I am testing you as well

Voyager [Diamond ranked AI Minecraft player]

Paper - https://arxiv.org/pdf/2305.16291.pdf

Ask GPT-4 to write a program based on the current environment context

Tell GPT-3.5 to generate a description for that program (I should start doing this as well, meta-commentary by GPT on text blocks)

Store it in a vector database with the description as the key and the program as the value

For programming they use three feedbacks -

This is yet another great point. Reasoning capabilities of GPT-4 (we should call it proto-AGI at this moment) enable us to do a lot of this stuff. Can’t believe how much better GPT-4 is compared to 3.5

Link to all prompts in the codebase - https://github.com/MineDojo/Voyager/tree/main/voyager/prompts

Need to update doc

Haven’t updated this doc in a week or so, busy with day job stuff

Also spending time actually trying out a few techniques from this doc. I realized I don’t understand some stuff especially when I saw weird tokenization code in qlora.py code.

So just testing out in my local

Some things sound simple when reading but you realize so many hidden details when you actually implement stuff.

This imo makes a lot of difference in your understanding as well as the speed of iteration when it comes to shipping stuff. Like you can figure out the actual root cause with a quick look at the error, when it might take someone else 2 days of googling.

Everyone has had this feeling of just knowing something but not being able to explain how.

It’s because they spent time earlier on in their life playing with so many tools and techniques that it's just built into their subconsciousness.

Also, I need a way to update this doc directly via terminal. Opening google doc in a browser and scrolling is too slow.

Activation-aware Weight Quantisation (AWQ)

Paper - https://arxiv.org/pdf/2306.00978.pdf

Code - https://github.com/mit-han-lab/llm-awq

Claims to be 1.5X faster than GPTQ as well as more accurate


Most quantisation techniques currently rely on re-ordering weights post quantisation to get better accuracy. This operation however is not natively supported by GPUs and hence slow.

Here they do not use re-ordering.

What they do is simply perform a normal quantisation using min max approach.

Then they check, using a sample of inputs, which weights see the largest activations.

Then they keep 0.1-1% of such weights in f16 format while converting the rest to INT3/4

This helps solve for accuracy.

Mixed precision (f16 and INT4) is not however GPU friendly

So they finally figure out an appropriate scaling factor that minimizes the difference in output for a layer.

Then the f16 weights are scaled by this and converted to INT4.
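A rough sketch of the scale-then-quantise idea, assuming a linear layer y = X @ W.T. The real code searches for the best scaling per layer; here the exponent alpha is just fixed, and every name below is my own assumption rather than something from llm-awq.

```python
import numpy as np

def quantize_minmax(W, bits=4):
    # Per-row min-max quantisation, returned in dequantised (float) form.
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((W - lo) / step) * step + lo

def awq_style_quantize(W, X, alpha=0.5, bits=4):
    # Salience of each input channel = average activation magnitude.
    s = np.abs(X).mean(axis=0) ** alpha
    # Scale salient columns up so rounding hurts them less, then quantise.
    Wq = quantize_minmax(W * s, bits)
    # Undo the scale; in practice 1/s is folded into the previous op.
    return Wq / s
```

Everything ends up in the same low-bit format, so there is no mixed f16/INT4 storage to worry about.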

This is the python file where most of the activation-aware logic occurs.

https://github.com/mit-han-lab/llm-awq/blob/3a6dfc39ed20d793f7c26624c4b9f9599960dd3b/awq/quantize/auto_scale.py

This is where we cache input feature samples for some data from pre training dataset to determine activation
https://github.com/mit-han-lab/llm-awq/blob/3a6dfc39ed20d793f7c26624c4b9f9599960dd3b/awq/quantize/pre_quant.py

SpQR (Sparse Quantised Representation)

Paper - https://arxiv.org/abs/2306.03078

Literally within a week of the previous technique we have a better one.

Core idea is slightly similar - Some weights are more important than others so focus on preserving them correctly rather than every parameter.

Here what they do is make a sparse representation that consists of important weights and then try to minimize the error for these weights (still need to read how they are calculating error and minimising, most likely seems to be some sample dataset as in AWQ)

GGML adds 2-bit quantisation

PR link - https://github.com/ggerganov/llama.cpp/pull/1684

Wait wut? Is it even worth it? Like how bad would this level of quant be

Well

It is not bad at all.

The 2-bit quantised version of a larger model has better perplexity than the f16 version of a smaller model. The gap between 13B f16 and 30B 2-bit is quite high.

This means I should now start running 30B on my 4080 instead of 13B (Yea it fits in 16GB VRAM)

SOTA document bender for your company QA

Not made up without any thought

Employs HyDE (Hypothetical document embeddings) + specialist model technique presented in lots of papers

The primary objective is to fetch correct embeddings when questions are extremely short plus ordering them correctly on more than similarity

Multimodal is hard

You would assume simply combining embeddings should work but it’s not like that

Reason being embeddings of different types of objects (i.e. text, audio, video, image) have different signal to noise ratios.

This is the reason why you can train a great image or audio model with just 1-3B parameters (MusicGen, SD) but text requires much much more params

Insane alpha drop from kaiokendev

https://kaiokendev.github.io/til#extending-context-to-8k

You need to extend your model context by 4X

Worry not

Just divide the positional embeddings by 4 (lol, lmao even)

Don’t believe me?

See this tweet from mr. ggml

https://twitter.com/ggerganov/status/1671915699025977351?s=20

One possible reason it works is that large models tend to overfit on positional embeddings

So if they see a position like 4096 which was never encountered in the training, they start outputting gibberish.

However, if you make the model believe that 4096 is in fact 2048 (dividing all positions by 2), the model suddenly starts giving correct output.

This however doesn’t explain what happens with fractional positions like 1024.75 since they were also not encountered in the training.
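A tiny sketch of the trick, assuming the model uses rotary position embeddings (which is what the kaiokendev post targets). The function name and the default dim/base values are illustrative assumptions.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0, scale=1.0):
    # Rotary embeddings rotate each channel pair by position * inv_freq.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Dividing positions by `scale` squeezes a longer sequence
    # into the position range seen during training.
    return np.outer(np.asarray(positions) / scale, inv_freq)
```

With scale=4 an 8192 token sequence only produces angles the model saw for positions below 2048, at the cost of landing on fractional positions in between.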

Skinny dip into GGML code base

GGML folks are doing god’s work. Giving the horny 4chan bois in college hostels on their cheap Asus phones a way to run LLMs locally is not a small task.

It is even harder to make them run on smol pi machines but it works.

Two weeks back I was trying to add capability to swap loras in the llama.cpp.

The code is already there to apply the Lora. To remove one, you can simply subtract the BA adapter matrix from the weights instead of adding it.

However, it didn’t work as expected when I was doing print debugging and for that I had to do a smol dive into the ggml codebase.

https://github.com/ggerganov/llama.cpp/blob/b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf/llama.cpp#L2896

All of the matrices in ggml codebase are represented using ggml_tensor 

Each tensor struct generally contains its source tensors and the operator by which those source tensors are combined to get this one.

The floating point data is just present in void* data array. It is void* so that you can store data in any format - f32, f16, quantized ints.

Take an example of the first op

ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB);

Here it is simply performing matrix multiplication on two matrices loraA and loraB. The lora_ctx is used for temporary memory buffers and is cleaned up after an operation is complete for a layer.

There is a catch tho. This op actually doesn’t do anything! It just creates a new tensor with sources as loraA and loraB and operator as GGML_OP_MUL_MAT. https://github.com/ggerganov/llama.cpp/blob/b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf/ggml.c#L5849

So what’s the point of calling this?

Well, for ggml all the computations for a layer are performed in a single go (I guess for better GPU utilisation as well as lazy execution? Not sure)

Once all ops are listed down, a DFS traversal is done from the last tensor all the way to the root tensors to form a computation graph. This is done in line

struct ggml_cgraph gf = ggml_build_forward(r);

The final result is just a 1-d array containing the tensors in the order in which they should be computed i.e. leaves first and root last.

Once you have the array, the computation is actually triggered using ggml_graph_compute(lora_ctx, &gf);
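The record-now, compute-later pattern is easy to mimic in Python. This is a toy model of the idea only: the class and function names echo ggml but are my own, and plain NumPy matmul stands in for the real kernels (ggml's mul_mat has its own layout conventions).

```python
import numpy as np

class Tensor:
    def __init__(self, data=None, op=None, src=()):
        self.data, self.op, self.src = data, op, src

def mul_mat(a, b):
    return Tensor(op="MUL_MAT", src=(a, b))   # only records the op

def add(a, b):
    return Tensor(op="ADD", src=(a, b))       # only records the op

def build_forward(root):
    # DFS from the result tensor; post-order gives leaves first, root last.
    order, seen = [], set()
    def visit(t):
        if id(t) in seen:
            return
        seen.add(id(t))
        for s in t.src:
            visit(s)
        order.append(t)
    visit(root)
    return order

def graph_compute(order):
    # The actual math happens here, in dependency order.
    for t in order:
        if t.op == "MUL_MAT":
            t.data = t.src[0].data @ t.src[1].data
        elif t.op == "ADD":
            t.data = t.src[0].data + t.src[1].data
```

Printing a tensor right after mul_mat shows no data, which matches the confusion you hit when print debugging before the graph has been computed.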

 

As you can see in the code linked below, the multiplication is handled separately based on the data type of the tensor. The general theme is that most calculations are done in F32: other datatypes are converted to F32 for the calculation and back to quantised form afterwards. This can be different in the ggml cuda code but I haven’t taken a deep look at that.

https://github.com/ggerganov/llama.cpp/blob/b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf/ggml.c#L11057

Skip Decode

Paper - https://arxiv.org/pdf/2307.02628.pdf

So far from what I got after reading this is that I need to understand KV caching better.  

Update:

So basically the KV Cache in itself isn’t the problem.

It’s the use of KV Cache with early termination, that’s the issue.

So basically in early termination, you’ve a classifier or some other algo at a layer that can use the log probs generated after the layer N to decide if the tokens should even go to the next layer or we should simply declare a winner here.

Now the thing is with this technique, if you terminated the previous token at layer N but for the new token you need to terminate at layer N + 2, you are left with 2 layers for which you have no KV cache data. So now you need to recompute the KV cache for the previous token and the last 2 layers before proceeding to do calculations for the current token.

This is computationally heavy and what this paper is trying to solve.

How?

Well instead of letting everything terminate at random layers, what if we could make a deterministic algo to predict the last layer of the token.


Well, that’s really simple if you just plot at what layers transformers have good enough confidence for the Nth token. You will find that tokens later in the sequence can be predicted by passing through only 1-2 layers cause you have a large amount of context available in the input.

For the earlier tokens however you need to go through all the layers to predict correctly.

So you can simply use a function that’s like monotonically decreasing and use it to predict no. of layers for Nth token.

But what about the KV cache?

So see, since now you are basically ensuring that if the Nth token goes through M layers, then it’s guaranteed that the (N+1)th token will go through <=M layers (cause your func is monotonically decreasing). Thus, you will always have vectors in the KV cache for every layer the new token uses, for all the previous N tokens.

Cool. But why is this called skip decode then? There’s no skipping so far.

Well, your previous tokens have gone through M layers and your next tokens are going through <=M layers. You are now stuck with exact opposite problem that the new tokens don’t benefit from the extra computation done by previous tokens. To solve for this, instead of using the first M layers, the authors propose to use the last M layers. Hence, the skipping (cause you are skipping first few layers).
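A toy version of the schedule, assuming a linear decay from all layers down to a small minimum. The actual decay function and its hyperparameters are in the paper; the numbers and names here are just placeholders of mine.

```python
def layers_for_position(pos, total_layers=12, min_layers=2, max_len=16):
    # Monotonically non-increasing layer budget per token position.
    frac = min(pos, max_len - 1) / (max_len - 1)
    return round(total_layers - frac * (total_layers - min_layers))

def active_layers(pos, total_layers=12, **kw):
    m = layers_for_position(pos, total_layers, **kw)
    # Use the LAST m layers, i.e. skip the first total_layers - m.
    return list(range(total_layers - m, total_layers))
```

Because the budget never grows and the layers are taken from the top, the layers used by token N+1 are always a subset of those used by token N, so the KV cache never has holes.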

This is what the final output looks like.

 

Final question tho, is this all actually useful at all?

Yep, absolutely, 100 percent. Leads to 2-5X speed up in the inference (which would be amazing for large unquantized models running on my 4080 PC).

Multi-party chat

Paper - https://arxiv.org/abs/2304.13835

This is much more relatable now after playing with the talk repo (https://github.com/yacineMTB/talk)

The primary problem is how to allow an LLM to

This is because other way is to clear the whole context and start LLM again with another persona. Can’t be done on each turn as it is too expensive.

Another way is to simply switch loras but training loras for each persona is a compute cost which VCs won’t sponsor. You can switch one easily though in less than 200ms in llama.cpp

So we are left with training an LLM in such a way that

Most of the magic of this paper lies in the dataset rather than the techniques.

Lost in the Middle

Paper - Lost in the Middle: How Language Models Use Long Contexts

Just researchers trying to figure out if long context is even helpful or not in sota LLMs.

Good thing is they tried both OSS and closed source LLMs

Not so good - unless your doc is at the start or the end of the context, it barely influences the LLM’s answer. This means that ranking really really matters. Which is why I use cohere rerank after fetching docs from pinecone, have seen insane but correct shifts in ranks.

Another thing - the more docs you stuff into the context the less accurate your results become.

Almost all models behave the same way.

So from a practical LLM Q&A perspective

Why does this happen tho?

  1. Cause most models are decoder-only architectures which can only see the past tokens. The ones with encoder-decoder architecture like Flan-UL2 exhibit much less variance based on the doc position.
  2. Query-aware contextualisation - basically due to the decoder-only architecture, models perform better if the query is also placed before the context rather than only after the context.
  3. Instruction based training - for most models the instruction based tuning keeps the correct results only in the initial parts of the context, hence they are heavily weighted towards choosing the first answer rather than the correct answer.

GPT-4 Details Leaked

Still not sure but so many legit folks think it’s true

So I must do the hard work now and read about all the techniques mentioned in that article

Multi query attention -
https://arxiv.org/pdf/1911.02150.pdf

MoE (Mixture of experts) -
[2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity 
[2112.06905] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts 
[2202.08906] ST-MoE: Designing Stable and Transferable Sparse Expert Models 

Sparse Expert Models (Switch Transformers, GLAM, and more... w/ the Authors)

Multimodal vision - https://arxiv.org/abs/2204.14198

Speculative decoding - https://arxiv.org/pdf/2302.01318.pdf

How to check fine tuning datasets’ quality?

(Still haven’t read the gpt-4 papers, too much work in day job)

Had a discussion with telt, realized judging the quality of fine tuning datasets is a hard problem

For my custom one, I simply did multiple rounds of cleanup after my loss curve was not converging.
That is not the right way though, cause even if the curve goes down, you can get highly inaccurate but sorta correct sounding results.

Some methods which I have seen in papers are related to determining the variety of data via clustering, but I might have missed stuff that grades the datasets other than by using humans.

Please don’t suggest GPT-4 to grade it (which is also what I have seen in some papers). It doesn’t work well ime.

Only valid resource I’ve found is this - Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

Let me know if you make it here.

Update:

Seems like one other idea is to train a classifier to separate high-quality from low-quality samples. Seems to be a hassle though, but it is applied in a lot of goog papers as well as GPT-3

Although I am skeptical of approaches like above which rely on GPT to grade the answers, the papers do get good results using it.

Why I am skeptical is the fact it was shown that GPT favors its own generated answers plus the absolute rating scale can screw up a lot and relying on logprob based method is better.

DPO (Direct Preference Optimization)

Faster way than PPO (Proximal Policy Optimization) to do RLHF.

Doesn’t require a separate reward model


Interestingly too much math in the paper but thanks to
https://twitter.com/akbirthko autism I realized the math is quite simple.  Check the code here - https://github.com/okarthikb/DPO/blob/main/train.py


In essence you are just training the policy model with a loss function that widens the gap between the accepted and rejected answers’ log-prob ratios w.r.t. the ref model (policy logprob minus ref logprob for each answer)
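The loss itself fits in a couple of lines. A NumPy sketch, assuming the inputs are the summed log-probs of each full response under the policy and the frozen ref model; the argument names and the beta default are my own choices.

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # How much more the policy likes each answer than the ref model does.
    chosen_ratio = pi_chosen - ref_chosen
    rejected_ratio = pi_rejected - ref_rejected
    margin = chosen_ratio - rejected_ratio
    # -log(sigmoid(beta * margin)), written in a numerically stable form.
    return np.mean(np.logaddexp(0.0, -beta * margin))
```

When the policy equals the ref model the margin is zero and the loss is log 2; it drops as the policy prefers the accepted answer more strongly than the ref model does.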

Mixture of Experts

What I’m realizing is this is the bleeding edge of transformer research.

Dense models are still the talk of the town. Compute prices are falling to the floor and hence it makes sense to keep on scaling dense models.

However, there’s still not enough compute in the world to run inference on 2T param models for 100M users in a few seconds.

That’s where the MoE models shine.

The basic idea is you still have one large model but only a part of it (called the expert) is triggered in the inference.

Now there are multiple attempts going on to make this better and better. Excluding the initial approach crafted for RNNs by Noam Shazeer, I am finding the GShard and Switch transformers papers to be more relatable.

Switch transformers

[2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity 

Instead of a normal feed forward layer, we use a routing based feed forward layer with N experts.

Router is just a simple linear layer which predicts, for each token, which expert is the best fit to process it. Now we can simply take softmax over the experts and choose K <= N experts to route our data. In this case, K = 1 always.
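A minimal sketch of the top-1 routing in NumPy. It leaves out expert capacity limits and the load-balancing loss from the paper; the experts are passed in as plain callables and all names are assumptions of mine.

```python
import numpy as np

def switch_ffn(x, W_router, experts):
    # x: (tokens, hidden), W_router: (hidden, n_experts)
    logits = x @ W_router
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)              # K = 1: one expert per token
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Only the chosen expert runs; its output is scaled by the gate prob.
            out[mask] = probs[mask, e][:, None] * expert(x[mask])
    return out, choice
```

Scaling the expert output by the gate probability is what keeps the router differentiable even though the argmax itself is not.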

Best part is you can simply distill these large sparse models to smol dense models while retaining a good amount of accuracy.

For the deployment part, you can easily shard these models using a combination of multiple paradigms.

If you’re a noob like me and don’t really understand what’s going on in this diagram. Simply look at this pseudocode -

https://gist.github.com/cto-junior/88477d52818597bf725e02bfb0559b43

The paper also has pseudocode specific to their implementation but it uses mesh tensorflow (baahh!!)

Here’s a good one that uses pytorch (translated using chatGPT):

https://gist.github.com/cto-junior/5018d526f2056546f6607986b08b423d

One thing I still don’t understand though is how these models are trained,  like especially the gating function. Do you just follow the normal training regime where inputs are passed to all the experts and finally settle upon some gating weights by backprop? Or do you explicitly choose which expert to run forward and backprop on for a particular cluster of dataset?

Extremely important points in case you choose to train a trillion param model in the basement. Don’t laugh, it should be possible in 5 years.

GLaM

It just uses two instead of one expert and also adds another layer without an expert on top of the expert one.

The authors of the original papers weren’t impressed (and neither was I)

Some legit guy told me the GPT-4 is actually based on this architecture.

ST-MoE

[2202.08906] ST-MoE: Designing Stable and Transferable Sparse Expert Models 

This focuses mostly on stability of the expert transformers especially during finetuning. Have to read it but it’s too loooooong.

Below is a much more exhaustive reading list thanks to main_horse on twitter

Multi-Query Attention

 https://arxiv.org/pdf/1911.02150.pdf

This is actually quite simple. You simply calculate attention for the same key and value using multiple queries. It’s just a variant of multi head attention with key and values shared. Primary motivation is to reduce the memory footprint during inference and training while capturing as much performance as possible.

https://gist.github.com/cto-junior/0adbaa7c5a8b2ce115939c7092af783b
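For comparison, a self-contained NumPy sketch of the same idea (no masking, no output projection). The shapes and names are my assumptions: every head gets its own query projection but all heads share one key and one value projection.

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    T, d = x.shape
    hd = d // n_heads
    # Per-head queries: (heads, T, hd)
    q = (x @ Wq).reshape(T, n_heads, hd).transpose(1, 0, 2)
    # One shared key and value for all heads: (T, hd) each
    k, v = x @ Wk, x @ Wv
    s = q @ k.T / np.sqrt(hd)                  # (heads, T, T)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    o = p @ v                                  # (heads, T, hd)
    return o.transpose(1, 0, 2).reshape(T, d)
```

The KV cache per token is hd numbers for K and hd for V instead of d each, i.e. n_heads times smaller than in regular multi head attention.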

Symbol Rank ( for coding LLMs)

https://twitter.com/ocolegro/status/1676602607106760705?s=20

GitHub - emrgnt-cmplxty/automata: Automata: The Future is Self-Written

ReLORA

[2307.05695] Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

Ok so the idea is really simple.

You add a lora adapter, you train it

Once the training is finished, you simply merge those lora weights with the layer.

Then you reset the A & B matrices and resume the training again.

Doing it multiple times will give you good results.

The catch is that a simple reset doesn’t work due to some past gradient thingy (stale optimizer state) which will mess up the future updates, so they reset a lot of optimizer states to 0 as well.

Overall code is very easy  - https://github.com/Guitaricet/peft_pretraining

Just grep for the can_reset if block in the torchrun_main.py 

And grep for merge_and_reinit in peft_pretraining/relora.py

Thanks for reading this, I should mention though that most likely it is not usable for your llama model.

I am not making this up but the authors themselves haven’t tested it properly beyond 1B models and there too the results were dicey.

Zero++

DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication - Microsoft Research

Ok so this is easy as well. They are just trying to reduce the comm overhead in distributed training.

So they do three things -

Flash attention 2

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Ok, this mostly seems like just optimisation on the kernel front rather than like an algo-rewrite of v1.

LIMA

LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA  LIMA LIMA LIMA

[2305.11206] LIMA: Less Is More for Alignment

Just use fewer but better quality examples for supervised finetuning, rather than the shitty dumps from your support portal

RLHF 

I have the mandate of the heaven

Tweeted this into the void and woke up to a presentation from HF bros in the morning. Entropy works in mischievous ways.

ICML '23 Tutorial on Reinforcement Learning from Human Feedback

DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py at master

[Lora Hub] Wait, was I talking about being blessed with the mandate of the heaven, Yes I still have it

sail-sg/lorahub

https://arxiv.org/abs/2307.13269

We already know that GPT4 leaks said the MoE architecture is the way to go.

So now anons are trying to replicate that for local models.

Issue is a proper MoE architecture works in the following way

So at inference time you only have a part of the model that’s actually doing the computation while the rest is inactive.

For local models, this approach is not the right way forward. Reason being it severely inflates the size of the model and we want smol.

So what to do?

We already have smol experts (kinda) called LoRAs. But the issue with loras is you can combine them but not select one dynamically during the inference time. Secondly no one knows if it’ll even actually benefit the base model.

Well that’s what this paper shows. The approach works and your model is better at more tasks. They however don’t select one lora, they simply multiply each lora adapter matrix A & B by a different set of weights, sum all A’s together, then sum all B’s together and finally multiply both to get a merged Lora.
The weights take care of how much each lora should contribute to the final output.

For the weight training, they say a gradient based approach would be super slow. What they recommend is a gradient-free (nograd) approach (that’s present in their code as well): they just select 20 loras, compute the loss on the output, then use this nograd optim to adjust the weights. Only a few iters are done.
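A rough sketch of the merge plus a gradient-free weight search. The plain random search below is only a stand-in I picked for illustration, not the optimizer the authors use, and the function names are hypothetical.

```python
import numpy as np

def merge_loras(loras, w):
    # loras: list of (A_i, B_i); w: one scalar weight per lora.
    A = sum(wi * Ai for wi, (Ai, Bi) in zip(w, loras))
    B = sum(wi * Bi for wi, (Ai, Bi) in zip(w, loras))
    return A, B          # the merged adapter's weight delta is B @ A

def random_search(loss_fn, n, iters=200, seed=0):
    # Gradient-free: propose a nearby weight vector, keep it if loss improves.
    rng = np.random.default_rng(seed)
    best_w = np.zeros(n)
    best = loss_fn(best_w)
    for _ in range(iters):
        cand = best_w + rng.normal(scale=0.3, size=n)
        cand_loss = loss_fn(cand)
        if cand_loss < best:
            best_w, best = cand, cand_loss
    return best_w, best
```

In the real setting loss_fn would run the base model with the merged adapter on a handful of examples from the new task.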

TinyStories

Since karpathy implemented and validated it, I must read it.
[2305.07759] TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Ok, here the claim is very simple. You don’t really need a large model for coherent English output.

What you need is [Ba Dum Tss!] a good dataset.

So to create that dataset they select around 1,500 words that are typically known by 3-4 year old kids. Then they use GPT to create small stories around those words to increase the diversity of the dataset.

Then simply train multiple small models all < 100M params.

What they find is amazing!  Not only does your model produce coherent output, it also gains a bit of reasoning ability like LLMs e.g. it can remember facts, figure out correct grammar etc.

I don’t like their final eval method though. They simply used GPT-4 to grade the answers generated by small models. The reason I don’t like this is that even if GPT-4’s scoring is consistent, I remember reading somewhere that it prefers output that GPT-4 itself would generate over a better one. Might not be a big deal here since if you match GPT-4 levels you are already good.

Next - experimenting with llama2.c

GitHub - karpathy/llama2.c: Inference Llama 2 in one file of pure C

FNet

Have you thought about replacing attention with FFT anon? It works and is faster. Validated by kaioken dev

[2105.03824] FNet: Mixing Tokens with Fourier Transforms
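
For a feel of what "replace attention with FFT" means: the mixing step in the paper is a 2D Fourier transform over the hidden and sequence dimensions, keeping only the real part. Below is a tiny sketch with a naive O(n^2) DFT in plain Python so it has no dependencies; a real implementation would use an FFT routine, and the function names are mine.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1D list."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fourier_mix(x):
    """Token mixing: DFT along hidden dim, then along sequence dim, keep the real part.

    x is a seq_len x hidden list of lists; the output has the same shape.
    """
    along_hidden = [dft(row) for row in x]
    # zip(*...) walks the columns, so along_seq is indexed as [hidden][seq]
    along_seq = [dft(list(col)) for col in zip(*along_hidden)]
    return [
        [along_seq[h][s].real for h in range(len(along_seq))]
        for s in range(len(x))
    ]


mixed = fourier_mix([[1.0, 2.0], [3.0, 4.0]])
print(mixed)
```

There are no learned weights in this step at all, which is where the speedup comes from.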

Scaling S3 is not easy [Not related to ML but also related to AI cause all data is in S3]

Building and operating a pretty big storage system called S3 | All Things Distributed

Was reading about how S3 is scaled for a billion users. Realized I know nothing about computers. I am so dumb. Need to get smarter.

The gist is HDDs are cheap (everyone knows that) and they are getting larger storage wise (everyone knows that yet again) but the issue is reading/writing in HDD is done by a mechanical head. And there’s a limit to which you can make mechanical heads faster.

So more often than not you’re choked by IO. The way to solve for this is to shard your data and distribute it across as many HDDs as possible.

That way you can do multiple fetches in parallel rather than waiting for a second for head to seek 2TB of data in a single node.

RetNet

I thought it was a meme but now it’s gaining a lot of steam so I must absorb its data stream

MoE (by Deepmind) (It’s soft not sparse)

From Sparse to Soft Mixtures of Experts

Instead of routing a single token to an expert, what they’re proposing is to take a weighted average of the tokens and then route it to an expert. This allows for better training stability and ability to scale the experts.

The primary disadvantage is this doesn’t work for autoregressive decoders.  

It looks like they are using slots per expert as the tuning knob to make the model faster or slower. Each slot has its own learned parameter vector. Multiplying the input tokens by these slot params and taking a softmax over the tokens gives the ‘dispatch’ weights, so each slot is fed a weighted average of all the tokens, which then goes through that slot’s expert. Then we do the reverse using the ‘combine’ weights (softmax over the slots this time) to mix the slot outputs back into a single output per token.
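
Here is how I picture the dispatch / combine flow, as a toy sketch in plain Python. The names (`soft_moe_layer`, `slot_params`, `slots_per_expert`) and the toy experts are my own, and the shapes follow my reading of the paper rather than any official code.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe_layer(tokens, slot_params, experts, slots_per_expert):
    """tokens: n_tokens x dim, slot_params: n_slots x dim, experts: list of callables."""
    n_tokens, n_slots, dim = len(tokens), len(slot_params), len(tokens[0])
    # logits[i][j]: affinity between token i and slot j
    logits = [[sum(t * p for t, p in zip(tok, sp)) for sp in slot_params] for tok in tokens]

    # Dispatch: for each slot, softmax over the TOKENS, then average the tokens.
    dispatch = [softmax([logits[i][j] for i in range(n_tokens)]) for j in range(n_slots)]
    slot_inputs = [
        [sum(dispatch[j][i] * tokens[i][k] for i in range(n_tokens)) for k in range(dim)]
        for j in range(n_slots)
    ]

    # Each expert processes only its own slots.
    slot_outputs = [experts[j // slots_per_expert](slot_inputs[j]) for j in range(n_slots)]

    # Combine: for each token, softmax over the SLOTS, then average the slot outputs.
    combine = [softmax(row) for row in logits]
    out_dim = len(slot_outputs[0])
    return [
        [sum(combine[i][j] * slot_outputs[j][k] for j in range(n_slots)) for k in range(out_dim)]
        for i in range(n_tokens)
    ]


# Two toy experts with two slots each; zeroed slot params make every softmax uniform.
experts = [lambda v: list(v), lambda v: [2.0 * a for a in v]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
slot_params = [[0.0, 0.0] for _ in range(4)]
out = soft_moe_layer(tokens, slot_params, experts, slots_per_expert=2)
print(out)
```

The dispatch softmax runs over the whole sequence, including tokens that come later, which is why this doesn’t drop straight into an autoregressive decoder.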

If you’re noob in maths like me, use this GPT-4 explainer for understanding - https://chat.openai.com/share/6bc517e0-2acf-48ba-8436-f1a2d8702b65

Skill Issue Paper

[2307.14430] Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models

Train LLMs like you would teach a kid

Don’t teach them algebra before you have taught them basic arithmetic

 

This is an interesting approach to clustering.

This paper shouldn’t be complicated to read at all but they fell into the rabbit hole of using too many math symbols to denote function calls.

Skill Graph creation - quite easy honestly if you ignore the maths

You first create a train and eval set for various skills using clustering

Then you take a base model and train K versions of it for H steps, one for each skill. Finally store the validation loss difference for that skill b/w original base model and finetuned model.

Now you simply start considering all pairs of skills, and train the base model on union of both skill dataset.
Now you observe the loss of this finetuned model vs the base model.
If the delta loss here is greater than the delta loss when trained just on one of the skills in previous step, we simply add an edge in the graph.

Using the Skill graph matrix A to actually select samples during training

Now that you have the skill graph, it’s time to use it during the training. This is the main part of the paper.
Naively selecting an equal number of samples from all relevant skills is not good enough, since one skill can disproportionately affect the loss.

A better way is to take into account which skills are leading to bigger validation loss and then change the sample distribution accordingly. We will be using p_i to denote the fraction of samples that should be from skill i.

It is not that difficult except for some assumptions. You initialize the proportion of samples for each skill based on the edge weights of the adjacency matrix. Then at each iteration, you observe the loss of the model on the validation samples of every skill. Then you train the model with the existing mixture of samples. Finally you adjust the proportion of the samples based on the new loss.

WHY ARE WE STILL DOING MANUAL STATUS UPDATES IN COMPANIES?

Time to plug LLMs into the Slack feed and let them automatically track everything for you.

BERT Primer

https://arxiv.org/abs/2002.12327

I have skipped over a lot of basics, time to catch up

Estimate LLM Flops and Memory requirement

I am not good at maths so it took me some time but after reading https://finbarr.ca/how-is-llama-cpp-possible/ I finally understand

Most basic formulae

Number of params approx (P) - 12 * [(Model dimension)^2] * [number of layers]

Mem usage = bytes required by each param (4 generally, i.e. fp32) * P

Flops usage = 2 * P per generated token (assuming it requires one mul and one add per param)

All of this is extremely approx and doesn’t account for vocab to embeddings as well as normalisation layers
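
The formulae above as a small calculator. The function name and the example config (d_model 4096, 32 layers, which is roughly a 7B-class llama) are my own picks for illustration.

```python
def estimate_llm(d_model, n_layers, bytes_per_param=4):
    """Back-of-the-envelope numbers from the formulae above (no embeddings, no norms)."""
    params = 12 * d_model ** 2 * n_layers
    return {
        "params": params,
        "memory_bytes": params * bytes_per_param,
        "flops_per_token": 2 * params,  # one mul + one add per param
    }


est = estimate_llm(d_model=4096, n_layers=32)
print(f"params          ~ {est['params'] / 1e9:.2f}B")
print(f"memory (fp32)   ~ {est['memory_bytes'] / 1e9:.1f} GB")
print(f"flops per token ~ {est['flops_per_token'] / 1e9:.1f} GFLOPs")
```

This lands around 6.4B params, so the approximation is in the right ballpark for a 7B model once you remember the embeddings are ignored.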

RoPE

Now that even the codellama paper has demonstrated that using RoPE scaling you can get to almost 100K context size, it’s time for me to stop pretending that I know what it is and actually learn.

Paper - [2104.09864] RoFormer: Enhanced Transformer with Rotary Position Embedding

If you don’t know what positional embedding is, it’s basically capturing the position information of the sequence in the input. This is needed so that you can distinguish “B follows A” from “A follows B”. It’s also needed to determine how far A is from B and whether it should even affect B’s prob.

The trad way of doing it is to simply learn embeddings of the same dimensions as the model but for each position. Then add these embeddings to the model input.

But this suffers from a major drawback - You can’t adapt your model to go beyond the positional embeddings it has learned since they don’t exist in the matrix. Secondly it doesn’t capture the relative positions well, only the absolutes.

RoPE aims to solve it. The fundamental idea is first to make attention depend only on the relative position of two tokens, and second to use some function (not a lookup) to inject that information. The function they use is a sinusoidal rotation which basically rotates your query and key vectors by some angle based on the position (hence the name), so the dot product b/w a query and a key only depends on how far apart they are. Sounds quite intuitive.

Code - https://gist.github.com/cto-junior/3493fe428a069a3500f85f8558ef5df9
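
Separately from the gist, here’s a minimal sketch of the rotation in plain Python. It assumes the usual setup where consecutive pairs of dimensions are rotated and the base is 10000; the function names are mine, and real implementations apply this to whole batches of queries and keys.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dims by an angle proportional to the position."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative distance (2) at different absolute positions.
score_far = dot(rope(q, 5), rope(k, 3))
score_near = dot(rope(q, 2), rope(k, 0))
print(score_far, score_near)
```

The two scores come out the same (up to float error), which is the whole point: the attention score only sees the offset between the two positions, not where they sit in the sequence.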

Speculative decoding

Paper: [2302.01318] Accelerating Large Language Model Decoding with Speculative Sampling

I had heard about it and I heard about it again today (both times from SemiAnalysis)

The idea is pretty simple - you use a smol model to generate tokens quickly (speculation) and then correct the output if it diverges from the slower but accurate large model.

The smaller model will be faster in generation but if you have to check its output 1 by 1 with larger model then you’re still bottlenecked right?

Well, the solution is to simply generate K tokens at a time from a smaller model and then feed all K tokens to the main model and get the correct log probs for the next tokens in a single pass.  Then you can simply compare and discard the incorrect tokens.
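
A toy version of that loop, to make the accept / discard logic concrete. Big caveat: this is the greedy simplification (accept a drafted token only if the big model would have picked the same one), not the rejection sampling scheme from the paper. The two "models" are made-up functions that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1) The small model guesses k tokens autoregressively (cheap).
    seq = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_next(seq)
        draft.append(token)
        seq.append(token)

    # 2) The big model checks every drafted position. A real system gets all of
    #    these from ONE batched forward pass; the loop here is only for clarity.
    seq = list(prefix)
    accepted = []
    for token in draft:
        correct = target_next(seq)
        if token != correct:
            accepted.append(correct)  # fix the first mismatch, drop the rest
            return accepted
        accepted.append(token)
        seq.append(token)

    # 3) Every guess matched, so the big model's own next token comes for free.
    accepted.append(target_next(seq))
    return accepted


def target_next(seq):
    """Pretend big model: always counts up by one."""
    return seq[-1] + 1


def draft_next(seq):
    """Pretend small model: same as the big one except it trips after seeing a 2."""
    return 0 if seq[-1] == 2 else seq[-1] + 1


tokens = speculative_step([0], draft_next, target_next, k=4)
print(tokens)  # [1, 2, 3]
```

Even in the bad case you still get at least one correct token per big-model pass, so you never do worse than plain decoding in terms of big-model calls.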

Almost all major LLMs are using this in prod (as per the leaks). Leads to 2 - 5X improvement in latencies

Cool paper  - Topology of NN

[2004.06093] Topology of deep neural networks

TLDR in this thread - https://twitter.com/suchenzang/status/1696924361373151337?s=20

I’d be lying if I said I understand all the maths in this. But I do understand the idea. They are basically trying to show how classifier neural nets change the datasets after each layer so that each class is easily distinguishable in the space.

It also shows how ReLU is more effective than sinusoidal functions at doing this.

How to reduce KV cache mem usage?

[2305.17118] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

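Size of the KV Cache for one token in an LLM assuming f16 weights is - 2 * 2 * n_layers * n_heads * d_head bytes (first 2 for K and V, second 2 for the bytes per f16 value)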

(From - Transformer Inference Arithmetic | kipply's blog)

You can also multiply this by batch size B for production systems

By convention, most models have n_heads * d_head = d_model

So e.g. for LLama-7B, for a batch size of 1,  this would be equal to  4 * 32 * 4096 = 524288 bytes = 0.5MB

Now this is per token, so if you want to run inference on 2048 tokens, the total kv cache required would be = 2048 * 0.5 = 1GB.

And this is just for this small 7B model. For 65B it would come to around 5GB (4 * 80 * 8192 * 2048).

And this is for batch size = 1 btw which means low flops utilization and insanely slow inference.
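
The arithmetic above as a helper, so you can plug in other configs and batch sizes. The function name and argument names are mine; the factor of 4 is 2 (K and V) times 2 bytes per f16 value.

```python
def kv_cache_bytes(n_layers, d_model, n_tokens, batch_size=1, bytes_per_value=2):
    """KV cache size, assuming n_heads * d_head == d_model (true for most models)."""
    return 2 * bytes_per_value * n_layers * d_model * n_tokens * batch_size


GIB = 2 ** 30
print(kv_cache_bytes(32, 4096, 1))              # 524288 bytes per token (7B)
print(kv_cache_bytes(32, 4096, 2048) / GIB)     # 1.0 (7B, 2048 tokens)
print(kv_cache_bytes(80, 8192, 2048) / GIB)     # 5.0 (65B, 2048 tokens)
print(kv_cache_bytes(80, 8192, 2048, 16) / GIB) # 80.0 (65B, batch of 16)
```

The last line is the scary one: at batch size 16 the cache alone is 80GB for the 65B config, before you even count the weights.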

So there's a need to find a way to compress the KV cache. That’s what this paper aims to solve.

The basic hypothesis is simple. A) Only a few tokens are important and others are not. B) The importance of future tokens is more or less correlated with the ones in the past.

The core algo then is not that tough. You make a window of size w and only keep the values in the kv cache which always have values greater than the threshold in that window w. You also always keep all the values of the most recent tokens denoted by r.

Overall I feel like it’s a good hypothesis but there are too many approximations built into the inference side. They should publish the results for llama or something rather than OPT.

Hyena

https://hazyresearch.stanford.edu/blog/2023-03-07-hyena

You must read this. I am telling you. It is mid but you must.

Update:
I finally read this blog. They are proposing a new architecture which helps reduce the inference time especially cause the attention formula is quadratic in nature (we compute attention for every token in sequence with every other token in the sequence)

They simplify the attention formula simply as A(x) . x where x is the input embeddings. Most network architectures don’t have A(x) but just a single W weight matrix learned during training. This makes the attention based models quite unique as they can adjust to inputs very well and can demonstrate capabilities like in-context and few shot learning.

But in attention based models, except for this layer all other layers simply are of W.x nature.

Their argument is instead of using quadratic attention here, what if we made other layers a bit different so that they are of A(x).x nature as well but A(x) being a sub-quadratic function.

The function they propose is a long convolution i.e. long sliding window matrix that attends to some tokens in the sequence and generates the output.  

The hyena_orders is just some number which they keep as 3 for some reason.

The results presented are pretty good but I hope they present it to work same or better w.r.t. llama-like alternatives.

Also, if you’re like me with rusty knowledge of CNNs, here’s a simple explanation by GPT-4

 I need to get so much better aaahhh

Also, I think I’d be dabbling a bit into tinygrad, see what abstractions I can recreate from scratch. Should be a fun exercise.

VectorDB arc

Ok, I am going to become Vector DB expert this week

It has been mandated by the heaven

Let’s look into it from a practical perspective.

I found this good starter resource for anyone who wants to understand the basics - The definitive guide to using Vector Search to solve your semantic search production workload needs.

I know about 40% of it already so directly jumping on how actual indexing works.

Most popular algo is HNSW but scann is faster. There are a lot of other alternatives as well.

Now most people are only concerned about speed of the algo but that is sorta necessary but not sufficient.

To create a full fledged vector db, you need more

And many more

Considering all this I have started reading about Lucene’s HNSW implementation (since Lucene is production-grade and used almost everywhere. I am afraid though it might be slow)

Lucene HNSW

All alpha from - https://issues.apache.org/jira/browse/LUCENE-10054

And https://issues.apache.org/jira/browse/LUCENE-9004

More orgs should start doing this honestly - https://people.apache.org/~mikemccand/lucenebench/VectorSearch.html 

Publish nightly perf numbers on OSS libs.

The first issue was a gold mine btw. I learnt a lot.

The primary doubt I had was whether they build one single graph over all the data, and so far it looks like that’s not the case.
They are still using small graphs (10M record max)

So that means you yourself will have to combine the results (maybe they have some utility for it)

Another thing is the memory usage optimisations they have done. Loading everything in memory is not required, you can just keep the entry points for each hierarchy and the neighbors. The values are always docIDs so it only requires 4 bytes to store. Once you get a docID you can look up the actual vector value from the segment.

How to do distributed HNSW? Like the one where I don’t replicate the data but where I shard the data and then query multiple subgraphs to get final answer.

Some hint -
https://github.com/nmslib/hnswlib/issues/377

Also Pyramid Paper - https://arxiv.org/pdf/1906.10602.pdf

Ok, I found the answer for this. Right now it’s pretty dumb but works. If you need top 10 matches, you query for top 100 records from each shard and then sort again. It’s costly and inaccurate but that’s what’s being done in Elasticsearch at least
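
The oversample-then-merge trick is easy to sketch. Everything here is a stand-in: `search_shard` does brute force over 1D values where a real shard would run its own HNSW search, and the names and the 10x oversample factor are just illustrative.

```python
import heapq


def search_shard(shard, query, top_n):
    """Stand-in for one shard's ANN search: brute force over (doc_id, value) pairs."""
    hits = [(abs(value - query), doc_id) for doc_id, value in shard]
    return heapq.nsmallest(top_n, hits)


def search_all_shards(shards, query, k, oversample=10):
    """Ask every shard for k * oversample hits, then re-sort globally and keep k."""
    candidates = []
    for shard in shards:
        candidates.extend(search_shard(shard, query, k * oversample))
    return [doc_id for _, doc_id in heapq.nsmallest(k, candidates)]


shards = [
    [("a", 1.0), ("b", 9.0), ("c", 4.0)],
    [("d", 2.0), ("e", 8.0)],
    [("f", 3.0), ("g", 7.0)],
]
top3 = search_all_shards(shards, query=0.0, k=3)
print(top3)  # ['a', 'd', 'f']
```

With exact per-shard search the merge is exact too. The inaccuracy comes from each shard’s search being approximate, which is what the oversampling is trying to paper over.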

FAISS

https://github.com/facebookresearch/faiss/wiki/

FAISS wiki has a lot of alpha

I was reading the chroma and Weaviate docs but to me they look like half solutions. Primarily cause I don’t see them supporting efficient sharding of vectors. It is still mostly defined by some primary key kind of approach in Weaviate.

Annoy

Need to look into this but afaik this and ScaNN are immutable once built. This would mean carefully choosing a large enough shard which you can’t modify later on. Hmm doesn’t seem to be worth pursuing.

Mixture of Experts: PEFT edition by Cohere

Paper - https://arxiv.org/abs/2309.05444

After LoraHub, another paper showing the viability of MoE using only Lora or (IA)3

The architecture is almost the same which involves a gating function which decides the experts.

Notable changes though -

If you don’t know what IA3 is, it’s pretty simple. Just 3 vectors per layer that are multiplied to Key, Value and FF layer weights respectively during inference. IA3 in general performs worse than lora but here when used in MoE fashion they perform better.

Another issue with IA3 is that your vector size is fixed w.r.t. the model dimensions. In LoRA, you can easily change the rank to change the size and play around with evals.

LLM as Optimisers

Paper - [2309.03409] Large Language Models as Optimizers

Not sure how many papers I gotta show folks before they accept that LLMs can reason pretty well. They also try the best models from each company and all of em perform pretty well. GPT-4 obviously being the best.

This is by google so please believe me now.

The problem statement is simple. First - given an optimisation problem like travelling salesman or simple gradient descent and the path taken till now in the form of coordinates, can the LLMs converge to the final solution?

The answer is yes.

The second problem statement is - Given an eval, along with previous prompts used for the eval and the final scores of each prompt, can you generate a new prompt that’s better? It’s the variation of the first problem statement.

The answer is again, yes.

Just ask your LLM to take a deep breath 😀

I will become this guy, watch me

Generative Recommenders - Cool paper by Google

Paper - [2305.05065] Recommender Systems with Generative Retrieval

If you don’t know already, the most common implementation of recommendation systems in most companies is the following

This poses a problem though. What to do when new items keep on being added to the catalogue. Retraining the embedding model is expensive with millions of items plus slow.

What if we simply trained a model to output the most related item id based on the past selection by the user?

What if it’s a generative model rather than a classifier or something? Would that even work and not fucking hallucinate while generating the id.

Turns out it does.

Their approach is as follows

Turns out this approach is amazing and gives SoTA results on most popular recommendation evals and real life scenarios.  They also show that the model doesn’t generate invalid ids a lot ( < 1% in most cases)


Man, not getting a lot of time to read papers these days. Occupied by day job stuff.
But I must push harder.

Flamingo

Paper - Flamingo: a Visual Language Model for Few-Shot Learning

This seems to be a secret sauce behind a lot of ongoing GPT-4(V) like projects.


The premise is using a pre-trained visual encoder as well as an LLM to attend to multimodal inputs.

They combine these two models using three things

  1. Perceiver Resampler - In its simplest form, this just reduces the feature vector coming out of the CNN to a fixed-dimension space, i.e. makes it independent of image resolution or video length. Reason - it is then easier to feed to the LLM
  2. Cross-attention Dense layer - This one is a quite obvious and popular solution. Basically now that you have embeddings from the image, you want to combine them with the embeddings of the text so both can attend to each other. Cross-Attention is the best way to do this. They add the cross-attention layer between each layer of the original LLM. The only difference you will then see is the tanh function. That’s been added so that this layer is ineffective at model initialisation, i.e. at the start of training
  3. The final piece is masking. This is helpful when you want to support multiple images in the input along with descriptive texts. The idea is that a portion of the text should only attend to the image preceding it (cause this text describes the previous image). This piece honestly might not be needed in most architectures and is just a consequence of what this model’s dataset looks like (a series of images, each followed by a text)
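
The tanh trick from the cross-attention item in the list above is small enough to show. A minimal sketch, assuming the gate is a single learned scalar that starts at 0; the function name and the toy vectors are mine, and the cross-attention output is just passed in as a plain list.

```python
import math


def gated_cross_attention_residual(text_hidden, cross_attn_out, alpha):
    """Residual add of the cross-attention output, scaled by tanh(alpha).

    alpha is a learned scalar; initialising it at 0 gives tanh(0) = 0, so the
    new layer is a no-op and the frozen LLM behaves exactly as before.
    """
    gate = math.tanh(alpha)
    return [h + gate * a for h, a in zip(text_hidden, cross_attn_out)]


text_hidden = [0.2, -1.0, 3.0]
cross_attn_out = [5.0, 5.0, 5.0]

at_init = gated_cross_attention_residual(text_hidden, cross_attn_out, alpha=0.0)
later = gated_cross_attention_residual(text_hidden, cross_attn_out, alpha=2.0)
print(at_init)  # identical to text_hidden
print(later)    # image information is now mixed in
```

So training starts from the pretrained LLM’s exact behaviour and the image signal gets dialled in gradually as alpha moves away from 0.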



All of this comes together to make the final Flamingo model. It is now trained while keeping original encoder weights and LLM weight frozen.

Fusing Modalities - Chimera by Meta

Paper - https://arxiv.org/abs/2309.15564

Multimodal transformers are the rage of the town. Especially now that people are seeing the power of GPT4-V.

The current way to train them is to change the architecture of an LLM to infuse cross-attention and then training it from scratch to use both Image and Text embeddings.
This is very expensive.

Meta proposes a simpler way. What if you could simply combine two pretrained models to process both image and texts.
Ngl, this was also how the Flamingo architecture worked (which is most likely used in GPT-4V). But even then it introduced 3 extra types of layers to make it work. Plus it required training from scratch (although it kept the visual encoder and LLM weights frozen)

This paper proposes something much simpler. You simply add a cross attention block at the output of each LLM or text to image block. This x-attn block processes input from both modalities. The output of multiple x-attn block is then combined using a linear transformer.

Now the question is - how to train the x-attn block though? That’s the neat part. They simply use supervised finetuning over a small dataset to get the desired outputs.



PromptBreeder

Paper - https://arxiv.org/pdf/2309.16797.pdf


I’ll be honest here, I saw this paper in a tweet and simply proceeded to ignore it cause I’m not interested in another “make prompts have sex to create a synthetic dataset” approach. Especially cause Evol-Instruct and airoboros already exist and are good.

Then I saw on r/localLlama that this paper is actually by Goog deepmind. Hence I read it.


The approach is to mutate a prompt using an LLM

How?

Just prepend a mutate instruction (e.g. make it more creative) to the prompt. You can also add a reasoning style (e.g. think step by step).

They have multiple ways to select the best prompt as well as mutate the mutate prompt itself using LLM

Hence the name prompt breeder.

More important is to filter out similar prompts by using BERT embeddings and cosine similarity

Next, you also provide the good quality prompts in the context of the LLM so that it generates a unique, high quality one rather than repeating the same

The primary alpha of this paper honestly is in Appendix. Just go through all the prompts and strategies and select the best ones.

LLAVA

Paper - https://browse.arxiv.org/pdf/2304.08485.pdf

Terrible name, decent model. Multimodal, hence interesting.

You take the image, then use some encoder like CLIP and generate embeddings.

Now you do a transform so that these embeddings match the dimension of the LLM’s input layer. LLava folks use a single weight matrix to do this which makes it quite easy to train.

And voila, you have a multimodal LLM.

The training part is also not something out of this world. First you keep both CLIP and LLM weights frozen and only adjust the projection weights.

Next you do a full finetune for both LLM and projection layer. You still keep the CLIP frozen.

Another Interesting thing is they use CLIP encodings from the penultimate layer and not the last layer like you would ideally do.

Overall, this is actually quite a bit simpler than Flamingo and possibly performs worse, but it is faster to finetune and train cause of no complex cross attention mechanisms and fewer additional layers.

LLAVA-1.5

Paper - https://browse.arxiv.org/pdf/2310.03744.pdf

Honestly, it’s the same as LLAVA 1 except they change two things

Everything else is the same.

IMPORTANT INTERPRETABILITY PAPER BY ANTHROPIC

https://transformer-circuits.pub/2023/monosemantic-features/index.html

I am finally reading this masterpiece but let me be honest. Understanding it requires a lot of existing knowledge about NNs plus support from chatGPT

You have been taught from childhood in every deep learning course that NNs are basically black box.

These days, although we have gained significant ability to understand small NNs, we still lack the ability to figure out how each individual neuron behaves in a network trained on large amounts of data.

There is a lot of research going on in this domain as it is extremely useful to affect the NNs outcomes and control/guide them easily.

There was a blog/paper by OpenAI where they used GPT-4 to understand GPT-2 neurons. See Model Interpretation in this doc.

This paper is another one which uses another NN to figure out first one.

The hypothesis is this: We are not able to understand each individual neuron cause it activates on seemingly random inputs which make no sense to naked eye. However, the possible reason it does that is because it is compressing so much data into  the small number of weights. Thus, one neuron ends up representing multiple inputs.

If this is true, what if we trained a much wider neural net on the inputs to this neuron? Since this NN is wide, each neuron should end up representing a considerably fewer number of features. And since they are fewer in number, they should be easily understandable as well.

Well that’s what they try to do in this paper. They use a sparse autoencoder whose width varies from 1X to 256X of the neural net. The input to this autoencoder is the activation of the neural net layer. The output is also the activations (thus the goal is to reconstruct them). They use MSE loss for it plus an L1 regularization penalty to force sparseness. Without sparseness you would simply have too many neurons activating in this autoencoder, making feature distinction difficult.
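
To make the loss concrete, here’s a stripped-down sketch: one ReLU encoder layer, one linear decoder layer, MSE on the reconstruction plus an L1 penalty on the hidden activations. It skips details of the paper’s exact setup, and the weights, names and the 0.1 coefficient are made up for the example.

```python
def sae_forward_and_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Returns (loss, hidden) for a single activation vector x.

    W_enc is n_hidden x n_in, W_dec is n_in x n_hidden (assumed layout).
    """
    # Encoder: wide ReLU layer, these activations are the "features"
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: try to rebuild the original activations from the features
    recon = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(W_dec, b_dec)
    ]
    mse = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
    l1 = sum(abs(h) for h in hidden)  # pushes most features to exactly zero
    return mse + l1_coeff * l1, hidden


# 2 input activations blown up into 4 features (2X width), hand-picked weights.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

loss, hidden = sae_forward_and_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.1)
print(hidden)  # [1.0, 0.0, 0.0, 2.0]
print(loss)
```

Only 2 of the 4 features fire for this input and the reconstruction is perfect, so the whole loss here is just the L1 term. That tension (reconstruct well while firing as few features as possible) is what makes the features readable.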

They haven’t done a half-assed job either. They do a lot of analysis and publish it to make sure that the features that are detected are in fact representative of the text in the context, and that they are not simply neuron weights slapped here.


Validating Specificity
To validate the features, they use log likelihood. E.g. take an Arabic character. You can find out the probability of that character occurring when the feature corresponding to it is activated vs the prob. of that character occurring in the overall dataset. If the ratio is high, that means your feature does specifically point to that character. Now using a single character is tricky so for approximation they use a proxy here. E.g. just a word containing Arabic characters.

Validating sensitivity

Here they just measure the correlation b/w the times our feature got activated vs the times we detected the proxy (i.e. any Arabic character) in the context. High correlation means the feature is indeed sensitive to the proxy.

Downstream Effects

If your feature does point to an Arabic character, it should also make the future predictions gravitate more towards Arabic characters (cause it makes natural sense). They try to measure this for each feature and find out that this is indeed true. They also do a study by disabling the feature and see that the future outputs skew towards normal English or something else.

SAM

Paper- [2304.02643] Segment Anything

Thought about reading it after yacine’s dingboard success. The model basically outputs the correct segment masks after downloading the data.

The model architecture is not that complicated.



You first have an Image Encoder (ViTMAE) to generate image embeddings.

Then you have CLIP (obviously since it’s used everywhere) to generate embeddings for the text prompts.


For bounding box or pixel prompts they use learned embeddings

For Mask based prompts, they use convolution and then sum it up with image embeddings.

Once you have the final embeddings you feed it to a transformer based decoder that outputs a mask. They use two way cross attention (prompt with image and vice-versa) in this decoder.

When you get the mask, your next job is to map it to the image. For that they use a simple MLP that computes the mask probability at each image location. This mask decoder part is inspired by the Maskformer paper.

I did spend some time understanding its code (and by spending time I mean going on an evening walk talking to chatGPT to explain the piece of code I copy pasted into it before going on the walk). So basically for each pixel you predict the probabilities of the mask class it belongs to. And then you can simply group up all the pixels with the same mask class to form a segment.

Qwen-VL

Paper - [2308.12966] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

This is llava but for Qwen model by Alibaba. Some have told me it is better than llava.
I checked its architecture and that might be possible simply cause it uses cross-attention instead of a simple projection layer / MLP to connect CLIP to Qwen.

However weird part is they don’t do cross attention of image embeddings and text embeddings like flamingo.

They use some learned embeddings and do cross attention with that.

I think the major reason for such architectural choices is that no one wants to modify the base LLM.

SigLIP

Paper - [2303.15343] Sigmoid Loss for Language Image Pre-Training

Lucas has been sharing a lot of SigLIP hype on my TL so it made sense to actually see what it is.

As yall already know, CLIP's image encoder is used in almost every Multimodal LLM out there. So how is CLIP trained?

Well you take embeddings from an image encoder and a text encoder and project them to the same dimensions using weight matrices. Once you have the same dimension embeddings, you simply take a dot product along with a temperature param and get a 2d matrix signifying how close each image and text pair really are.

To train CLIP, you apply Cross entropy loss on the following 2d matrix

SigLip simplifies this by changing the loss function. They use sigmoid instead of softmax and change the function so that it’s just dependent on each image-text pair rather than combination of pairs.

This doesn’t seem like a big change but what it allows you to do is remove the need for all-gather communication during the training run
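
A sketch of the sigmoid loss as I understand it from the paper: every image-text pair in the batch becomes its own binary yes/no problem (label +1 on the diagonal, -1 everywhere else), with a learnable temperature and bias. Plain Python, toy embeddings, and the temperature / bias values are made up for the example.

```python
import math


def log_sigmoid(v):
    """Numerically stable log(sigmoid(v))."""
    if v >= 0:
        return -math.log1p(math.exp(-v))
    return v - math.log1p(math.exp(v))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def siglip_loss(img_emb, txt_emb, temperature, bias):
    """Pairwise sigmoid loss over a batch of normalised image / text embeddings."""
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = temperature * dot(img_emb[i], txt_emb[j]) + bias
            label = 1.0 if i == j else -1.0  # matching pair vs everything else
            total += log_sigmoid(label * logit)
    return -total / n


images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

good = siglip_loss(images, texts_matched, temperature=10.0, bias=-5.0)
bad = siglip_loss(images, texts_swapped, temperature=10.0, bias=-5.0)
print(good, bad)
```

Each term in the double loop only needs one image and one text, with no normalisation across the row or column like softmax needs. That independence is why you can compute the loss chunk by chunk across devices without gathering the full similarity matrix.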

One peace

https://arxiv.org/pdf/2305.11172.pdf

Mostly interested cause it’s the only multimodal one that doesn’t use CLIP lmao

Make LLM do Maths

https://polymathic-ai.org/blog/xval/

Distil-Whisper

Paper - [2311.00430] Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

Distillation is a widely popular set of techniques to reduce the size of the model.

It involves training a smaller model to predict output probs same as the larger model (not the same as ground truth). We use KL Divergence loss to ensure this.
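
The KL part in toy form, for a single output position. The logits are made up and the function names are mine; in a real training loop this gets averaged over every token in the batch.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocab for one output position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 1.0, 0.1]
close_student = [1.9, 1.1, 0.0]
far_student = [0.0, 0.0, 3.0]

print(distillation_kl(teacher, teacher))        # 0.0, perfect match
print(distillation_kl(teacher, close_student))  # small
print(distillation_kl(teacher, far_student))    # large
```

The student gets pushed to copy the teacher’s whole distribution, including how unsure the teacher is, which carries more signal than just the single correct label.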

In Distil Whisper, they are using the same approach with some modifications

  1. They also use the normal CE loss but considering the output predicted by the teacher as ground truth
  2. They don’t consider every output from the teacher for the above loss but only the ones with WER (Word error rate) below a certain threshold. This is done to avoid hallucinated transcriptions.
  3. The small model keeps Whisper’s full encoder and only shrinks the decoder. The student decoder has just two layers, taken from the first and last decoder layers of the larger model.

It’s not AGI (it’s just your data)

Paper - [2311.00871] Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models

Insane ML Notes on Twitter with Q&A

 https://x.com/xariusrke/status/1727254622442791400?s=20

Stable Video Diffusion (SVD)

Paper - https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets

Just realized we have cracked good Video gen after watching Pika announcements.

So it makes sense that I learn how it works, especially the temporal gen and interpolation (instead of doing my day job)

One fascinating thing I found is the amount of effort spent in data curation for training. This still remains the most important part of all the models out there.  They filter out the videos with small motions and text.
They also use video BLIP, CoCo and other models to generate descriptions of the videos. Then use the same information to filter out the data


Still need to read the rest  

You can simply read this paper for model architecture - https://arxiv.org/pdf/2304.08818.pdf

One more thing - why do people use this word ‘latent’ a lot. Please, stop. Use something which plebs can understand.

So here’s what the basic model is

Now doing this for all the frames of the videos will be pretty expensive especially if input has high FPS.

So what you do is extract out only key frames in the videos which represent a high semantic shift (can be done with cosine distance or like text based descriptions). Then you only use those key frames to generate more keyframes.

Once you have the results, you use another model to interpolate b/w those frames. The interpolation model is also similar to the above one but just has different layers (what exactly tho?) instead of temporal ones b/w spatial layers

Finally they finetune the model to produce sequence of frames rather than just the next frame

Stable Diffusion Turbo (or How to distill a diffusion model 101)

Paper - Adversarial Diffusion Distillation

At first I thought it’ll be the usual Student Teacher model with a different loss function. But turns out it has one more component - a discriminator model

Another interesting aspect is the input to the teacher model is not the original image but the diffused latents from the student model

The Discriminator model just tries to ascertain how close the provided image is to the original image.  

I can’t hear the MUSIC*  !!!!!!! NEEEED TO GET BETTTTTTTER!!!

*  Music here implies diffusion operations in latent space

Images are Sentences

Videos are sentences

Sentences are predictable

https://arxiv.org/pdf/2312.00785.pdf

Mamba - faster architecture (Reading cause Tri Dao is author)

[2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Have just started reading it but I already know this one is going to be a banger. Mostly talking about taking an existing State space model architecture and making its parameters time variant (input dependent). According to them, the parameters being time invariant is the major blocker for these models not being good enough in real world.

Why hasn’t someone done this already? Well, because it increases the computational overhead and makes the models slower and if you have slow models then why not just use transformers with quadratic attention.

To solve this they have developed a hardware aware algo (mostly one that decides what data should be in HBM vs SRAM), and who better to do this than Tri Dao, who wrote Flash Attention.


I forgive this paper for using too many mathnerdsnipes simply cause it is tbh required to understand the motivation.  Thanks to chatGPT, here's a simple explainer for most of the terms you’ll encounter in this paper.

Most important part of the paper is this. The takeaway for me is that memory bandwidth is so low that recomputation is faster than storing and fetching.

Gemini

They actually released it. And it’s good?

Sort of

But tbh, benchmarks are not what interest me. You can go ahead and view them. Most evals are trash anyways and suffer from data leakage in subtle forms.

I am more interested in tips and tricks scattered across the fine prints in the paper. Tbh, they didn’t reveal much of those.

Importance of dataset quality. Also, they finally realise that users leave the app if you deny them the response. Just mention something helpful instead.

I am truly excited about Nano and what usecases will be possible on the device after its release. I consulted the huggingface dashboard and the stats (at least for MMLU) look great for its size.

It is also multimodal which was a surprise to me for such a small model. But then again that’s the advantage of Fuyu like architectures instead of using a dedicated image encoder which makes the overall size bloated.

This is one of the most difficult parts about training, handling hardware failures. TBH I should read all the 3 linked papers and understand what can happen (I do have some idea from OPT-175B logbook)

Another interesting thing is that they do not checkpoint the weights to distributed store since it’ll be simply too slow (rugged by Network IO). Instead what they do is either keep some robust copy of weights in memory or maybe transfer the weights from one of the other nodes in the cluster (due to Model + Data parallelism).



Also, please shield your clusters from cosmic rays. Yet another win for basement dungeon AGI enthusiasts.

Mitigating LLM Hallucinations

Karpathy tweeted about it

TBH I am aware about the RAG Based approach to ground the output in facts.

Not sure about Reflection, Verification chains, Decoding uncertainty

If I had to guess what they mean

Reflection - Simply ask LLM if this is the correct answer or not and then ask it to modify it

Verification Chain - Maybe just a fact checker and similar tools post the LLM output before sending it to user

Decoding uncertainty - Basically if the logprobs are lower (or you can consult from some hidden layer), the output might not be true
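A minimal sketch of that last guess, assuming your inference API hands back per-token logprobs. The function names and the threshold value are made up for illustration, and a low average logprob is only a hint that the output might be wrong, not proof.

```python
import math

def mean_logprob(token_logprobs):
    """Average log-probability per generated token."""
    return sum(token_logprobs) / len(token_logprobs)

def looks_uncertain(token_logprobs, threshold=-1.0):
    """Flag an answer whose average token logprob falls below the threshold.

    The threshold is an arbitrary illustration value, not a tuned one.
    """
    return mean_logprob(token_logprobs) < threshold

# Hypothetical per-token probabilities for two generated answers.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
shaky = [math.log(0.3), math.log(0.2), math.log(0.25)]
```

In practice you would probably calibrate the threshold on answers you already know are right or wrong.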

LCMs

[2310.04378] Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

SD Turbo is the talk of the town but before that we had LCMs.

OK, there is way too much maths in this paper. Time to upload it into chatGPT


Update: Math too hard, went over notebook shared by the kind felix_red_panda (https://twitter.com/felix_red_panda), now math is easy

Also, this paper is finally piquing my interest to dive into various aspects of diffusion models
E.g. Schedulers -
The ML developers guide to Schedulers in Stable Diffusion

The basic idea is really simple - you want to train a network to predict an image from a noisy latent without going through the whole iterative sampling process.


Why use EMA (exponential moving average) decay rate?

https://www.reddit.com/r/MachineLearning/comments/ucflc2/comment/i6a52yc/?utm_source=share&utm_medium=web2x&context=3

Tbh you should just read this amazing blog instead - https://naklecha.notion.site/explained-latent-consistency-models-13a9290c0fd3427d8d1a1e0bed97bde2

Use smol models to train large models faster

[2312.05328] Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

Although I read this paper and I get what they are trying to do, I am still not getting a sense that I actually know this. Maybe it’s because I haven’t read the linked papers about online learning. Should do that first.

OK, after reading the next paper and gaining some intuition, I remembered I am just baka.

It’s actually quite simple. What they are trying to do is almost the same as doremi in that they use 2 smol models to select the data to be used to train large models. Their main contribution is how smol can we make these two models so that overall we save FLOPs rather than spending less in training but overall more in total.

They simply keep on running this in an actor model (which is actually quite popular in distributed computing if you have worked with spark etc.). Your workers keep on running in parallel and compute scores for the samples from datasets. The score is nothing but the cross-entropy loss of the scorer model and reference model.

You keep on updating the scores in a memory bank (can be any DB).  

You then use these scores to sample the data, thus prioritizing examples that would actually help large model to learn something new instead of simply repeating it.
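A rough sketch of the scoring + memory bank loop, under my own assumptions: I take the score to be scorer loss minus reference loss (high when the example is still hard for the scorer but known to be learnable), and I pick the top scoring examples instead of sampling so the output is deterministic. All names here are hypothetical, not from the paper's code.

```python
def learnability(scorer_loss, reference_loss):
    """Assumed score: how much worse the scorer is than the reference on an example."""
    return scorer_loss - reference_loss

def update_memory_bank(bank, example_ids, scorer_losses, reference_losses):
    """Workers call this in parallel; the bank can be any key-value store."""
    for ex_id, s_loss, r_loss in zip(example_ids, scorer_losses, reference_losses):
        bank[ex_id] = learnability(s_loss, r_loss)

def select_batch(bank, batch_size):
    """Pick the highest scoring examples for the learner's next batch."""
    ranked = sorted(bank, key=bank.get, reverse=True)
    return ranked[:batch_size]

bank = {}
update_memory_bank(
    bank,
    ["a", "b", "c", "d"],
    scorer_losses=[2.0, 0.3, 1.5, 3.0],
    reference_losses=[0.5, 0.2, 0.7, 2.9],
)
chosen = select_batch(bank, 2)
```

Example "d" has the highest raw loss but is skipped because the reference model finds it just as hard, so it is likely noise rather than something learnable.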

In the end you update the weights of both reference model and large model.
Why both? Cause what makes all this work is that the loss trend of smol reference models serves as good enough proxy for the loss trend of large models.

Finally they keep on reducing the scorer and reference model size until the training regime is compute positive (i.e. takes fewer FLOPs w.r.t. Learner). As you can notice, the learner obviously takes longer to train as reference models get smaller but the hit is not large

DoReMI

[2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

The idea is simple. You want optimal data distribution in pretraining. By optimal I mean the samples that lead to lowest loss as early as possible.

How to achieve that?

Well the intuition (a good one) in most of these papers is that smol models, although not great at producing output, are still good enough proxy to dictate how large models are gonna behave. So e.g. if a small model trained on dataset X has high loss on some dataset Y, then it might be the same for a large model trained on the same dataset X.

Here we use 2 small models - one is a reference model that is pretrained using normal sampling weighted by token count. Second is the proxy model. The whole trick lies in how to train this proxy.

We do that by first starting with uniform sample distribution over domains, doing a normal forward pass for a batch, taking difference of loss b/w this and reference model for each domain. Now we select the worst domain loss among these (max) and try to adjust the alpha weights (for sampling) to minimize this.

After T steps, we simply take the average of all the alpha weights per domain and use that as the sampling weights for the large model.
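Here is how I picture the alpha update, as a sketch and not the official implementation. As far as I understand the paper, the domain weights are pushed up on domains where the proxy is worse than the reference (an exponentiated-gradient step) while the proxy model itself is trained to bring that weighted loss down. The step size and smoothing values below are placeholder assumptions.

```python
import math

def update_domain_weights(alpha, proxy_losses, ref_losses, step_size=1.0, smoothing=1e-3):
    """One update of the per-domain sampling weights."""
    # Excess loss per domain: how much worse the proxy is than the reference (clipped at 0).
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]
    # Multiplicative update: domains with larger excess loss get more weight.
    unnorm = [a * math.exp(step_size * e) for a, e in zip(alpha, excess)]
    total = sum(unnorm)
    k = len(alpha)
    # Renormalise and mix with uniform so no domain weight collapses to zero.
    return [(1 - smoothing) * u / total + smoothing / k for u in unnorm]

def average_weights(history):
    """Average the per-step weights; this average is what the large model samples with."""
    k = len(history[0])
    return [sum(step[i] for step in history) / len(history) for i in range(k)]

alpha = [1 / 3, 1 / 3, 1 / 3]
new_alpha = update_domain_weights(alpha, [2.0, 1.0, 3.0], [1.0, 1.0, 3.5])
```

Only the first domain has excess loss here, so it ends up with the largest weight while the other two stay equal.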

Does it work?

Yes, as you can see at just 80K steps the model is performing better than baseline at 160K

I KEEP ON FORGETTING THAT I ACTUALLY NEED TO LEARN ABOUT DPO AND PPO AND OTHER SHIT (although I did make an attempt earlier in this doc for DPO)

Why did I suddenly have this thought? Cause they released DPO fine tuned models for SD which are actually good at prompt following!!! Ya khuda, no more ((masterpiece)), ((high-res)), 5 fingers, bullshit.

[2311.12908] Diffusion Model Alignment Using Direct Preference Optimization

LLM Paper from Apple?? : That’s a rare sight

[2312.11514] LLM in a flash: Efficient Large Language Model Inference with Limited Memory

So apple is def serious about running them on device. What they are trying to figure out is how to run models larger than the available memory. TBH, we already run all programs with memory usage larger than RAM (using virtual memory and paging)

Why can’t we have a similar thing for LLMs? Cause it’ll make inference slower.

But smol devices have a big advantage, they simply use flash storage which is quite fast if used in the right manner. The focus of this paper is on this part.

They are relying on two important properties here:

1. The FFN (not attention) weights of most LLMs are sparse

2. The time to first byte from disk > time to read data

Now to leverage these two props

  1. They only load weights which give non-zero activation to save memory
  2. They read more data than necessary from flash and then discard the non-useful one.

To read only non-zero weights, they are using a low rank predictor that simply tells beforehand which neurons will give positive activations.

To read more data, they load both up_proj and down_proj matrices and keep them in a single row.
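A toy sketch of the bundling idea, with tiny hand-written matrices. Assumptions to keep in mind: the FFN is a plain ReLU MLP, `predict_active` is an exact oracle standing in for the learned low rank predictor, and each list entry in `bundles` pretends to be one contiguous record on flash. None of the names come from Apple's code.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bundle(up_proj, down_proj):
    """Store row i of up_proj next to column i of down_proj as a single record."""
    d_model = len(down_proj)
    return [(up_proj[i], [down_proj[r][i] for r in range(d_model)])
            for i in range(len(up_proj))]

def dense_ffn(x, up_proj, down_proj):
    hidden = [relu(dot(row, x)) for row in up_proj]
    return [dot(row, hidden) for row in down_proj]

def predict_active(x, up_proj):
    """Oracle stand-in for the low rank predictor: neurons with positive pre-activation."""
    return [i for i, row in enumerate(up_proj) if dot(row, x) > 0.0]

def sparse_ffn(x, bundles, active_neurons):
    """Touch only the records of neurons predicted to fire."""
    out = [0.0] * len(bundles[0][1])
    for i in active_neurons:
        up_row, down_col = bundles[i]  # one read returns both halves
        h = relu(dot(up_row, x))
        for r, w in enumerate(down_col):
            out[r] += h * w
    return out

x = [1.0, -2.0]
up_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # d_ff x d_model
down_proj = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]     # d_model x d_ff
active = predict_active(x, up_proj)
sparse_out = sparse_ffn(x, bundle(up_proj, down_proj), active)
dense_out = dense_ffn(x, up_proj, down_proj)
```

Only 2 of the 4 neurons get loaded, yet the output matches the dense FFN because the skipped neurons would have contributed zero anyway.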

When I read the benchmark setup of this paper tho, I get an ick.

The memory management def seems neat - they ensure contiguous allocation in a pre-allocated memory region.

Multimodal paper from Apple???

[2310.07704] Ferret: Refer and Ground Anything Anywhere at Any Granularity

Initially I thought this paper was mid cause it was using Vicuna which hasn’t been sota for like a year now. But now that I read it, the point is not about the base model, it’s about the technique being used to ground the model.

If you’ve read GPT 4-V, Gemini or Llava whitepapers, you will know the achilles heel of all these models is their inability to parse as well as create bounding boxes correctly. Tbf, GPT-4V can still parse them but creating them is not its strong suit.

Here, they are trying to tackle only the bounding box parsing part. Most of the magic is in how they represent it in the first place.

They are using a special MLP based sampler which, instead of representing the box as a set of coordinates, represents it using a set of modified features.

Amazing paper to Learn about Dingboard

Paper page - Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

It presents common architectures powering the Text to image space and what are the common bottlenecks. It’s by FAIR meta so actually useful content rather than it being a blog post in PDF form.

Deepseek, YAYI-30B, WaveCoder

Seems like I should invest some time to learn about dataset creation and filter pipelines. Lots of alpha in case I actually choose to earn money via ML consulting

TDM edge Multimodal arc (I blame Vik)

LISTEN TO ME RN!!!
EDGE AI
THAT’S IT
VTUBER WAIFU IN YOUR GLASSES
JUST BET ON MOBILE COMPUTE GETTING BETTER AND BETTER!!!

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Paper page - TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Paper page - MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices

MobileVLM

Paper page - MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices

Loved this cause they actually tried to optimise the model to be able to run on snapdragon devices.

Apologies for the phone screenshot I was reading this in the mall while waiting for someone

MathPile

Paper page - Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

Only reason you should read dataset papers is to just find out what they use to clean up the set. More often than not I just skip to that section.


For contamination detection:

We employed line-level exact match detection for both our corpus and test sets, as the questions in these benchmarks are generally brief and often contained within a single line. Specifically, we split documents into lines, hashed each line using MD5, and took the first 64 bits along with the corresponding line to form a set. This procedure was also applied to the constructed reference test set collection. If a line from the test set, along with its corresponding hash code, is found in the training set’s corresponding set, and the length of the line is over 50 characters, we classify it as a leaked sample with an exact match
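The quoted procedure is simple enough to sketch with just hashlib. This is my reading of the paragraph and not their pipeline: stripping whitespace and skipping empty lines are my assumptions, and the function names are made up.

```python
import hashlib

def line_key(line):
    """First 64 bits (8 bytes) of the MD5 digest, paired with the line itself."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return digest[:8], line

def build_line_set(documents):
    """Split every document into lines and collect their keys in a set."""
    keys = set()
    for doc in documents:
        for line in doc.splitlines():
            line = line.strip()
            if line:
                keys.add(line_key(line))
    return keys

def find_leaks(train_docs, test_docs, min_chars=50):
    """Return test lines longer than min_chars that also appear verbatim in training data."""
    train_set = build_line_set(train_docs)
    leaks = []
    for doc in test_docs:
        for line in doc.splitlines():
            line = line.strip()
            if len(line) > min_chars and line_key(line) in train_set:
                leaks.append(line)
    return leaks

long_q = "Find all real numbers x such that x^2 - 5x + 6 = 0 and x > 2."
train_docs = ["Some scraped page\n" + long_q + "\nAnswer: 3"]
test_docs = [long_q + "\nAnswer: 3"]
leaks = find_leaks(train_docs, test_docs)
```

The short line "Answer: 3" also matches exactly but is ignored because of the 50 character cutoff, which is the point of that rule.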

Unified-IO 2

Paper - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

This is quite close to what Gemini is doing for multimodal gen

Uses llama tokenizer for text

Uses the concatenated output from 2nd and 2nd last layer of ViT for images

For audio, it is a bit more complex but I am pretty sure it would be like 1-2 ffmpeg commands. The main funda is you just need to 1. Encode and 2. Project into 3. 1-d space

DocLLM

Paper page - DocLLM: A layout-aware generative language model for multimodal document understanding

Just like the top comment on this page I was expecting this paper to be a snoozefest (banks, pffttt)

But they really tried something other than prompt engineering or finetuning llama (tbh it is llama2-7B but not vanilla arch)

That being said, some details are def missing from this paper cause well, it’s a bank so needs to think everything they do is a secret

Basically they first use some OCR code (plenty of libs in OSS), to get text and their corresponding bounding boxes from the docs.

Next - they need some way to encode these bounding boxes (not mentioned clearly in the paper), once encoded they change the attention block to use different Q and K matrices for bounding boxes. You can think of this part like a simplified cross  attention.


Seems like a follow up paper to this - Paper page - DocGraphLM: Documental Graph Language Model for Information Extraction

Microsoft broke MTEB

[2401.00368] Improving Text Embeddings with Large Language Models

Reading List from AHM

Dataset Best Papers

Hallucination minimisation and Refusal on not knowing the answer

Transformers from a Maths perspective (not including finbarr, eleuther ai maths)

Vikp’s work with dataset prep and related stuff

 Factual Grounding methods

Reading List from Yacine

DreamTuner - ipadapter but different

https://arxiv.org/pdf/2312.13789.pdf - how i beat the big wigs

[2312.09608] Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models - faster stable diffusion by skipping unnecessary bits

Official implementations for paper: Anydoor: zero-shot object-level image customization - instruct edit + https://old.reddit.com/r/StableDiffusion/comments/18kd0na/code_for_anydoor_zeroshot_objectlevel_image/

dont forget - diffusion slider demo https://github.com/Kevin-thu/DiffMorpher?tab=readme-ov-file

https://arxiv.org/pdf/2312.01943.pdf - i should use this for anime

TextDiffuser 2 - a Hugging Face Space by JingyeChen22 - people have been asking for text.. right?

GitHub - open-mmlab/PIA: PIA, your Personalized Image Animator. Animate your images by text prompt, combing with Dreambooth, achieving stunning videos. PIA,你的个性化图像动画生成器,利用文本提示将图像变为奇妙的动画 - video generator

GitHub - cumulo-autumn/StreamDiffusion: StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation - turbo go fast

LASERRRRR (for reasoning)

[2312.13558] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

wowowowow

Paper page - Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache - Mostly engineering hence I love it

[2401.01325] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning - cause surya implemented it and it worked, high signal

[2401.02954] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism - Mostly to see scaling laws and hyperparams choice

[2401.00588] Fairness in Serving Large Language Models - Scheduler by lmsys

Paper page - Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws - hmmm

[2401.01335] Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models - someone was talking about in a GC, high signal

gonna spend more time writing LLM code now, this is a fundamental blocker now in my path forward to actually becoming a good assistant to ML bros

Embarrassing myself publicly arc (PHOTOMAKER)

I was playing around with Tencent Photomaker for a few days. I am blown away by how good it is with faces so I naturally went ahead and read  their paper

[2312.04461] PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

What I realised after reading it is, I really do not understand how it works.

What they are doing is pretty simple, they have trigger words and class words

The trigger words can be something like img and class words can be man, woman, boy etc.

The class word should always be followed by the trigger word.

Whenever a  trigger word is encountered in the prompt, they remove it, take the features of the class word preceding it and then do the tensor jujitsu

What technique? They take the embeddings of the face images you uploaded, project them into the same dimensions as text embeddings and then fuse them with the embeddings of the class word.  


This is the part that kept on worrying me, like how does this even work, why is face information simply not lost, I was honestly expecting a control net like thing.

Then I proceeded to read their code and that made me question my skills further. Reason being I was not familiar with a few torch methods they were using and also the code is shitty in general.


Then in the paper they mentioned that it actually works simply cause SD already has Cross-attention which takes care of mixing face id info with image.

So rn I am going through the whole SDXL architecture in colab (it’s embarrassing that I haven’t done that at all) and trying to understand this flow. The fact that I don’t know this already is so baaaaaaaad.

Today I was trying to verify if they are lying to us in SDXL paper. Turns out I am just dumb and forgot they concatenate the embeddings from two CLIP models

Lumiere

Paper - https://arxiv.org/pdf/2401.12945.pdf

Video gen model by goog. They are not relying on the SVD (Stable Video Diffusion and not Singular Value Decomposition) way to generate only keyframes and interpolate b/w them. They instead generate all frames in a single pass.

I am worried tho that this will have huge mem requirements. Will read the paper to understand more.

Ok, so the way they are avoiding huge compute requirements is basically via doing temporal convolutions on very small latents.

Once you have video at coarse resolution it is upscaled using MultiDiffusion (whatever that is, I need to read)

Deepseek Coder

Paper - [2401.14196] DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Lots of alpha in this paper, asian bros I kneel

Thanks for being inclusive for us Java programmers

You would think it’s just topo sort right? But then us programmers are so shit we do cyclic dependencies such as passing factory instance to implementation so it can create new instances for some recursion hell.

Worry not, the asian bros know about this practice. Interesting thing that they kept the file path info in comments as well, not sure why that is important.
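A sketch of a cycle-tolerant ordering, based on how I remember the paper's idea: instead of waiting for a file with zero unresolved dependencies (which never shows up inside a cycle), keep emitting the file with the fewest unresolved ones. The tie-break on number of dependents, the `//` comment prefix and all names are my own assumptions.

```python
def order_files(deps):
    """deps maps file -> set of files it imports. Returns a dependency-first order."""
    known = set(deps)
    remaining = {f: (set(d) & known) - {f} for f, d in deps.items()}
    ordered = []

    def priority(f):
        # Fewest unresolved deps first; among ties prefer files many others wait on.
        dependents = sum(1 for d in remaining.values() if f in d)
        return (len(remaining[f]), -dependents, f)

    while remaining:
        nxt = min(remaining, key=priority)
        ordered.append(nxt)
        del remaining[nxt]
        for pending in remaining.values():
            pending.discard(nxt)
    return ordered

def concat_repo(files, deps, comment_prefix="//"):
    """Join file contents in dependency order, each prefixed with its path as a comment."""
    parts = [f"{comment_prefix} {path}\n{files[path]}" for path in order_files(deps)]
    return "\n".join(parts)

deps = {
    "App.java": {"Service.java"},
    "Service.java": {"Factory.java"},
    "Factory.java": {"Service.java"},  # the cyclic factory mess
    "Util.java": set(),
}
files = {name: "class " + name.split(".")[0] + " {}" for name in deps}
order = order_files(deps)
sample = concat_repo(files, deps)
```

The Service and Factory cycle cannot be ordered perfectly, but everything outside the cycle still lands after the files it depends on.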

They then dedup at repo level (i.e. concatenating all the files of the repo and applying some near-dedup algo). It should be similar to cosine similarity but I need to see the exact implementation for this. Expecting something similar to this - Large-scale Near-deduplication Behind BigCode

Using smol models to estimate loss curve trends / accuracy of larger models should be widely known. Don’t waste compute unnecessarily.

I will never not be amazed that CoT actually works in the LLMs

IPAdapter

Paper - [2308.06721] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Been using this model and impressed by its performance in local setup. It does a really simple thing tbh, where it allows you to prompt a diffusion model with images (not to be confused with i2i as it simply influences the style rather than the exact outlines)

Their approach is quite simple. Diffusion models already use Cross Attention to condition the latents using text.
You can just add another cross attention layer that instead of using text embeddings as key and value, uses image embeddings as key and value

Once done,  you can add the text and image cross attention output and get the final result
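To make the "two cross attentions, then add" idea concrete, here is a tiny pure Python sketch. It assumes keys and values are already projected (the real thing has learned projection matrices for the image branch), and `image_scale` is my name for the knob that controls how much the image prompt counts.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def decoupled_cross_attention(latent_queries, text_kv, image_kv, image_scale=1.0):
    """Same queries attend to text and image separately; the outputs are summed."""
    text_out = attend(latent_queries, *text_kv)
    image_out = attend(latent_queries, *image_kv)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]

queries = [[0.5, 0.5]]
text_kv = ([[1.0, 0.0]], [[2.0, 3.0]])      # (keys, values) from the text prompt
image_kv = ([[0.0, 1.0]], [[10.0, 20.0]])   # (keys, values) from the image prompt
full = decoupled_cross_attention(queries, text_kv, image_kv)
half = decoupled_cross_attention(queries, text_kv, image_kv, image_scale=0.5)
text_only = decoupled_cross_attention(queries, text_kv, image_kv, image_scale=0.0)
```

Setting `image_scale` to 0 gives back the plain text conditioned output, which is why this can be bolted onto an existing model without breaking it.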

How to create AGI?

Paper - [2401.14953] Learning Universal Predictors

This paper has too much math but it is expected because of what they are trying to prove.
I am bad at it so taking help of GPT to guide me through the explanations (embarrassing myself (complimentary)) -
https://chat.openai.com/share/fdaaa434-011b-4da6-a3a2-d69ddc8c3180

The primary idea that they are trying to put forward is that LLMs by themselves can do meta learning (which is a fancy way of saying if you give them context, they can figure out a solution from that context even if they haven’t seen it before). And the way to make meta learning efficient according to them is to have a diverse enough set of problems already part of pretraining.

To prove this, they use Brainfuck (yeah, really, brainfuck) as the primary language for its simplicity (as in it just reads from one place and writes to another with a small working memory), as opposed to langs like java which can read from multiple channels and do random access.

For problem diversity, they ensure tasks of different complexity are part of the training dataset



I still feel overwhelmed by the mathematical terms in this paper, gonna switch to reading code instead
https://github.com/google-deepmind/neural_networks_solomonoff_induction/tree/main

So the code is really simple tbh, they are just sampling tasks with Brainfuck lang code. The samples are ensured to be diverse enough via randomisation.

This is done per batch and then used to train the network

Then the nets are evaluated against other types of tasks (specifically chomsky hierarchy ones and CTW ones). We observe the accuracy does increase for each model as their sizes are grown even if they are not trained on this dataset (hence, AGI)

As a followup I would also need to read [1905.03030] Meta-learning of Sequential Strategies so i can understand the whole meta-learning thing a bit better

ILYA’s READING LIST (For getting up to speed on today’s architectures)

https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE

Stream Diffusion - Brrrrr ImageGen at 100FPS

Paper - [2312.12491] StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Allows you to either use SD + LCM lora OR SD-Turbo to do insanely fast generation

Doesn’t support SDXL yet

How come it is so fast tho?

Turns out it's just simple old optimisations you would do for your Java app as well

-  multiple threads to offload lightweight computations (such as encoding/decoding images)

- pipelining with batches to process multiple images in a single pass

- cache to store precomputed text prompt embeddings as well as KV for those embeddings in cross-attention layer

- a bit of magic (tensorrt)
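The cache item from the list above is the easiest one to sketch. This is only an illustration with stdlib `lru_cache`: `prompt_embedding` is a fake stand-in for a real text encoder and `generate_frames` does not run any diffusion, it just shows the encoder being hit once for a whole stream of frames.

```python
from functools import lru_cache

encoder_calls = 0

@lru_cache(maxsize=32)
def prompt_embedding(prompt):
    """Stand-in for an expensive text encoder; counts how often it really runs."""
    global encoder_calls
    encoder_calls += 1
    return tuple(float(ord(ch) % 7) for ch in prompt)

def generate_frames(prompt, num_frames):
    """Pretend generation loop: every frame reuses the cached prompt embedding."""
    frames = []
    for i in range(num_frames):
        emb = prompt_embedding(prompt)  # cache hit after the first frame
        frames.append((i, len(emb)))
    return frames

frames = generate_frames("1girl, glasses", 100)
```

As long as the prompt stays the same between frames, the text encoder cost is paid once instead of 100 times.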

MLLM-Guided Image Editing (MGIE)

Paper from Apple (along with the code, goddamn). Imagine those cool demos at google io where you can magically remove things from photos. Well this is a superior version of that.
Not only can you add/remove/move anything you want, but you can also do it via simple natural language instructions instead of using your fingers.

The basic architecture is quite simple -  they are just using a VLM (or as they say MLLM) to get a description of the image with the requested edits applied. E.g. What would the image look like if I add a pizza on the table.

Now the answers of VLMs can be quite long, so they have this magical summariser that shortens the answers to make them more precise.

Now when you get this answer, they also add some [IMG] tokens. These tokens are used as input to another sequence model that transforms them into embeddings.

Now you can use I2I mode of diffusion models. But we condition this image not on the prompt but on the embeddings we got in the previous step (using inbuilt cross-attention of most diffusion models)

Now they train this whole pipeline to minimise two losses -



Matryoshka Embeddings

Paper - [2205.13147] Matryoshka Representation Learning

Hearing this term too frequently in the contexts that are most definitely not related to dolls (if they are then it’s concerning)

Basically in the existing embedding models, you get floating-point vectors of fixed lengths (1536 for openai, 768 for clip, etc.). However sometimes you wish you had smaller vectors (to tradeoff accuracy for latency). Changing models just for embedding length sounds tedious

Enter Matryoshka

The concept is pretty simple, generally you train embedding models by training classifiers and then taking hidden layer representation.

Now what if instead of just using the full output of the last layer to compute logits and calculate loss, you instead did it for multiple length vectors e.g. 8,16,32,64…2048

Then you simply added up the loss for each of these and tried to gradient descent on this cumulative loss.

Well, yes, it’s as simple as that and it works perfectly.  Using this loss formulation, the network tries to compress any coarse info in the smaller length vectors and then proceed to finer ones later on.
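A small sketch of that summed loss, assuming one separate linear classifier per prefix length and equal weight for every length. The names (`classifiers`, `sizes`) are mine.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def matryoshka_loss(embedding, classifiers, label, sizes):
    """Sum the classification loss over nested prefixes of the same embedding.

    classifiers[m] is a weight matrix (num_classes rows, m columns) used only
    with the first m floats of the embedding.
    """
    total = 0.0
    for m in sizes:
        prefix = embedding[:m]
        logits = [sum(w * x for w, x in zip(row, prefix)) for row in classifiers[m]]
        total += cross_entropy(logits, label)
    return total

embedding = [1.0, 2.0, 0.5, -0.5]
classifiers = {
    2: [[0.0, 0.0], [0.0, 0.0]],                        # untrained 2-class head
    4: [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
}
loss = matryoshka_loss(embedding, classifiers, label=1, sizes=[2, 4])
```

With all-zero heads every prefix predicts 50/50, so the loss is just ln(2) added up once per prefix length. Gradient descent on this sum is what forces the first few floats to carry the coarse info.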




Another interesting usecase for this mentioned in the paper is adaptive retrieval. So basically you use only the first N floats to perform the similarity search and then use the first M floats to perform re-ranking on the retrieved results where M >> N. This allows you to make your queries significantly faster while not sacrificing accuracy.
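The two stage search can be sketched in a few lines. Assumptions: dot product as the similarity, the full vector used for the rerank stage, and a brute force scan instead of a real vector index. Function and parameter names are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def adaptive_retrieve(query, corpus, short_dim, shortlist_size, top_k):
    """Shortlist with a short prefix of each vector, then rerank with the full vectors."""
    # Stage 1: cheap scoring with only the first short_dim floats.
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: dot(query[:short_dim], corpus[i][:short_dim]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: exact scoring, but only on the shortlist.
    fine = sorted(coarse, key=lambda i: dot(query, corpus[i]), reverse=True)
    return fine[:top_k]

query = [1.0, 0.0, 1.0, 1.0]
corpus = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.8, 0.0, 0.0, -1.0],
]
result = adaptive_retrieve(query, corpus, short_dim=2, shortlist_size=3, top_k=2)
```

Document 0 wins the coarse stage but document 1 overtakes it after reranking. Document 2 never makes the shortlist even though its full score is decent, so the shortlist size is the knob that trades speed for recall.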

Generalising Length of Transformers

Paper - [2402.09371] Transformers Can Achieve Length Generalization But Not Robustly

Just me (an outsider) trying to understand what might have gone into gemini 1.5 pre training so that it can generalize to 10M context length even with limited training.

Tbh this paper is not the sauce as it is too simple plus tries to verify results only on a small problem

3rd is just doing 321 + 654 instead of 123 + 456

4th is just doing 3c2b1a + 6c5b4a instead of 321 + 654

2nd is instead of using encoding position 1 as vector of 1, 2 as vector of 2 and so on
What if you took a random set of positions from length L (which will be much much more than maximum context tokens we will feed to this model)

Then you sort those random positions in ascending order and assign it to each actual position.

E.g. sampled position are 4, 7 so 1st and 2nd tokens get assigned the embeddings of 4th and 7th

We can then train this network only for length N but then use the same technique during prediction for length M >> N.
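The random position trick is tiny in code. A sketch, where `max_position` plays the role of L and the function name is mine:

```python
import random

def randomized_positions(seq_len, max_position, rng=random):
    """Sample seq_len distinct position ids from range(max_position), sorted ascending."""
    if seq_len > max_position:
        raise ValueError("sequence is longer than the position range")
    return sorted(rng.sample(range(max_position), seq_len))

rng = random.Random(0)
train_positions = randomized_positions(5, 1000, rng)     # short training sequence
test_positions = randomized_positions(50, 1000, rng)     # much longer sequence at inference
```

Token i then looks up the embedding for the i-th sampled id instead of position i, so during training the model already sees ids far beyond the training length.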

1st is FIRE which iiuc uses an MLP to learn embeddings for each position instead of using some fixed function like linear or sinusoidal

World Model

Paper - [2402.08268] World Model on Million-Length Video And Language With RingAttention

Diffusion Transformers

Paper - [2212.09748] Scalable Diffusion Models with Transformers

Code - GitHub - facebookresearch/DiT: Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Since Sora has dropped and hints at using similar architecture (with a lot of magic), good idea to go through this

Architecture wise there’s not a lot going on here. Other than standard diffusion models, there are two changes

For patching, simply use MLP to convert it into embeddings. Once you get the embeddings, you also apply positional encoding using sinusoidal frequency based version

Then you proceed to predicting the tokens using the transformer block. They use N transformer blocks (where N varies according to model size i.e. S, B, L etc.). They also try different types of DiT blocks primarily to introduce text based conditioning.
Ultimately after ablation they used the one with Adaptive layer norm with zero init for gammas (Note for me, I should read more on this)

What is the Adaptive layer norm?
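My current understanding, as a sketch: the layer norm has no learned scale/shift of its own, instead a small linear layer turns the conditioning vector into a shift, a scale and a gate, and that layer starts at zero so every block begins as an identity function. The pure Python below is only for intuition; the names are mine and the sublayer is a dummy in place of attention or MLP.

```python
def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def adaln_zero_block(x, cond, mod_weights, mod_bias, sublayer):
    """One residual block whose norm parameters are predicted from the conditioning."""
    d = len(x)
    mod = linear(mod_weights, mod_bias, cond)            # 3 * d numbers
    shift, scale, gate = mod[:d], mod[d:2 * d], mod[2 * d:]
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)
    # The gate scales the residual branch; zero gate means the block changes nothing.
    return [xi + g * o for xi, g, o in zip(x, gate, out)]

x = [1.0, 2.0, 3.0]
cond = [0.5, -0.5]                                       # e.g. timestep + label embedding
zero_w = [[0.0, 0.0] for _ in range(9)]                  # zero init of the modulation layer
zero_b = [0.0] * 9
identity_out = adaln_zero_block(x, cond, zero_w, zero_b, lambda h: [2.0 * v for v in h])
gated_b = [0.0] * 6 + [1.0] * 3                          # open the gate via the bias
changed_out = adaln_zero_block(x, cond, zero_w, gated_b, lambda h: [2.0 * v for v in h])
```

With the zero init the block returns its input untouched, which is supposed to make training deep stacks more stable at the start.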


Stable Diffusion 3

Paper - [2403.03206] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Interesting part is using a third text encoder T5 and then using separate weights for it instead of simply piping it in cross-attention

Synthetic datasets are now part of Visual models as well. Might be time to create some simple repo to make this seamless.

Increasing channels improves the performance of the models

Deepseek-VL

Paper - [2403.05525] DeepSeek-VL: Towards Real-World Vision-Language Understanding

Strangely Teorotaxes shilling deepseek on my TL almost every hour has worked and I have developed an affection for their smol models that perform great in benchmarks as well as are useful practically

I am also reading this paper so I can steal moondream’s alpha without selling my house to hire vikhyatk

For datasets, they have a pretty diverse mix that covers web UI screenshots, OCR and chart parsing. This is good as simple image text pairs are not good enough for practical utility of these models

Interesting choice for the architecture to use two image encoders and not one

First they use SigLIP for semantic information. However siglip uses lower dimensional latents so the details get lost

For preserving details they use VitDet based on SAM-B that can accept higher dimensional latents.

The output of both of these encoders is concatenated before being passed via MLP

Synth2

Paper - https://arxiv.org/abs/2403.07750

Everyone is now aware about the usefulness of synthetic datasets for training LLMs. Even the latest and greatest Claude-3 has some synthetic datasets in its training set

The next natural step for this is synthetic datasets for multimodal LMs.

And the easiest modality to start with is images i.e VLMs

A simpler version of this would be simply generating captions for existing images using some other VLM

A harder version would be generating both halves of the image - text pairs, since generating images is costly and the images might not be great in quality or adhere to the prompt really well

This is how they are doing it

  1. Generate Caption using some LLM (Gemini Pro here)
  2. Generate images from these captions

Here they pre-train their image gen so as to eliminate the effect of human-annotated VLM datasets on the experiments. This step is not necessary for practical purposes imo. They are using Muse as the primary image gen here since it’s a transformer based architecture.

  3. Now the next step is to simply use this data for VLM training. One small optimization they do here is that they use the same image encoder in both image gen as well as VLM. This allows them to directly use the embedding tokens output from image gen and feed it into the VLMs projection layer without the expensive decoding and encoding to pixel space.

Fashion Diffusion (Make your waifu dress in Zara)

Paper - [2403.01779] OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on

The outfit fitting problem seems simple at first glance - you just have to inpaint right?

But the issue becomes more apparent as you dive deeper

E.g. How do you make sure the clothes and brands are represented exactly the same? How do you account for weird body poses?

There are multiple approaches that exist currently to solve this problem. The authors propose a new one that’s far more accurate

So basically they train one more u-net in parallel to SD. This unet only gets the garment as well as some text label as input. We, however, do not use it to generate anything. What we do is simply take the inputs to the spatial attention layers in this unet and concatenate them with the inputs to the spatial attention layers of Stable Diffusion along the width. This is what they call outfitting fusion

This would be confusing so you can check the following lines in the official code

Getting query from attention layers of outfitting unet - https://github.com/levihsu/OOTDiffusion/blob/344112ad1c03c2af1cf7a1f07d689b18af4c175a/ootd/pipelines_ootd/attention_garm.py#L234

Concatenating query with denoising unet -

https://github.com/levihsu/OOTDiffusion/blob/344112ad1c03c2af1cf7a1f07d689b18af4c175a/ootd/pipelines_ootd/attention_vton.py#L236

Parent that calls the former and then passes it to the latter -
https://github.com/levihsu/OOTDiffusion/blob/344112ad1c03c2af1cf7a1f07d689b18af4c175a/ootd/pipelines_ootd/pipeline_ootd.py#L373
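To make the fusion step less confusing, here is a minimal numpy sketch of the idea as I understand it: concatenate the two feature maps along the width, run ordinary self-attention over the combined tokens, then keep only the person half. The single-head attention, the shapes and every name below are my own simplifications, not the repo's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over tokens of shape (N, C)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def outfitting_fusion(person_feat, garment_feat, w_q, w_k, w_v):
    """person_feat, garment_feat: (H, W, C) feature maps from the two unets."""
    h, w, c = person_feat.shape
    # Concatenate along the width: (H, 2W, C)
    fused = np.concatenate([person_feat, garment_feat], axis=1)
    # Every person token can now attend to every garment token
    out = self_attention(fused.reshape(-1, c), w_q, w_k, w_v).reshape(h, 2 * w, c)
    # Assumption: only the person half is passed on, the garment half is dropped
    return out[:, :w, :]

rng = np.random.default_rng(0)
H, W, C = 4, 3, 8  # toy sizes, purely illustrative
person = rng.normal(size=(H, W, C))
garment = rng.normal(size=(H, W, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))
fused_out = outfitting_fusion(person, garment, w_q, w_k, w_v)
```

The output has the same shape as the person feature map, so the rest of the denoising unet does not need to know the garment was ever there.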

Another Apple LLM (this time it’s multimodal)

Paper - [2403.09611] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

I have no hope of Apple’s ML teams making a SOTA model. Reading this paper only cause they do ablations

For data they found that keeping almost an equal mixture of interleaved records along with image-text pairs gives best capability. You also need some text only data to preserve language capability of llms.

Second approach is interesting to support higher resolution while not sacrificing speed
 

Quiet-Star (Is it really the fabled openai algo, nope)

Paper - [2403.09629] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Afraid to say but they use too many buzzwords for me to understand EXACTLY what they are doing and they don’t even have code as well.

That being said, this is what I gather

Teacher forcing - Nothing but using data from training set as input for next token as opposed to output from previous steps https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/

Non-myopic loss - Calculate loss based on all future tokens instead of just next token

STaR - simply ask the model to generate a rationale (using a prompt) and then an answer instead of directly generating an answer https://arxiv.org/abs/2203.14465

Ok, no worries, simply had to upload paper PDF in opus to crack it

Transformers for time series (truly retarded)

Paper - https://arxiv.org/abs/2403.07815

Why would amazon do this when you can use ARIMA or other lightweight shit?

No idea, but it’s fun

Tbf, they are not doing something insanely genius.

What they are doing is following

They also show improvements with using synthetic data by augmenting time series with noise instead of just training data

Synthetic data gen for time series

GaLore

Paper - [2403.03507] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

So their idea is the following, you already have LORA that uses low-rank matrices A&B to update the weights.



Here they propose that instead of using separate A&B matrices, you can project the gradients themselves into a low-rank subspace, keep the optimizer states there, and still get the same level of accuracy while using even less memory.

They prove with some mathematical voodoo that, first, the gradient matrices do tend to become lower rank as training progresses, so you don’t need to store the whole thing, and second, the loss still converges to a minimum when you use the low-rank projections.

Another interesting thing is that they keep on recomputing the low-rank projections but not at every timestep. Only when T steps have passed, the projections are updated.

The code is actually quite simple. Generated it using claude (so that no one has to bother with the weird maths symbols in the paper)

import torch

class GaLoreAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, rank=1, scale_factor=0.25, freq=200):
        self.params = list(params)
        self.lr = lr
        self.betas = betas
        self.eps = eps
        self.rank = rank
        self.scale_factor = scale_factor
        self.freq = freq

        self.step_count = 0
        # Adam moments live in the projected (rank x n) space, so they are
        # allocated lazily once the shape of the projected gradient is known
        self.m = [None] * len(self.params)
        self.v = [None] * len(self.params)
        self.P = [None] * len(self.params)

    def step(self):
        with torch.no_grad():
            for i, p in enumerate(self.params):
                grad = p.grad
                if grad is None:
                    continue

                # Only 2D weight matrices are projected; 1D params (biases, norms)
                # fall back to plain Adam on the full gradient
                low_rank = grad.dim() == 2

                if low_rank and (self.P[i] is None or self.step_count % self.freq == 0):
                    # Refresh the projection every `freq` steps from the top singular vectors
                    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
                    self.P[i] = U[:, :self.rank]

                P_t = self.P[i]
                R_t = P_t.T @ grad if low_rank else grad

                if self.m[i] is None:
                    self.m[i] = torch.zeros_like(R_t)
                    self.v[i] = torch.zeros_like(R_t)
                m, v = self.m[i], self.v[i]

                m.mul_(self.betas[0]).add_(R_t, alpha=1 - self.betas[0])
                v.mul_(self.betas[1]).addcmul_(R_t, R_t, value=1 - self.betas[1])

                m_hat = m / (1 - self.betas[0] ** (self.step_count + 1))
                v_hat = v / (1 - self.betas[1] ** (self.step_count + 1))

                N_t = m_hat / (torch.sqrt(v_hat) + self.eps)
                # Project the normalized update back to the full weight shape
                G_t = self.scale_factor * (P_t @ N_t) if low_rank else N_t

                p.add_(G_t, alpha=-self.lr)

        self.step_count += 1


ORPO

Paper - [2403.07691] ORPO: Monolithic Preference Optimization without Reference Model

So what they are proposing here is a way to RLHF the model without using any reference model and without the need for a separate preference phase. They tell us you can simply do it along with SFT

The algorithm proposed is really simple. They use the log probs of the chosen and rejected samples from the model to calculate an odds ratio

Once you have the odds ratio, you can formulate a second loss term based on that, scale it by a factor lambda and then add it to the SFT loss

Now you simply train the model using the gradient from this combined loss
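Written out, the loss looks roughly like this (my transcription of the paper's notation, with y_w the chosen response, y_l the rejected one, and P the length-normalised sequence probability under the model being trained):

```latex
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}

\mathcal{L}_{OR} = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right)

\mathcal{L}_{ORPO} = \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}
```

Since everything is computed from the policy's own probabilities, there is no second frozen model to keep in memory.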

They show that the model is actually learning to adhere to preference and keeps on lowering the log prob of rejected samples.

They also show that the trained model produces responses that are quite similar to each other (i.e. closer to the reward) given the same prompt (even tho temperature is kept at 1.0)



They also show that trained models do change their responses significantly if you change the prompt meaning they are not over-optimised towards a single reward and do adhere to instructions

Implementation from claude:

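import torch
import torch.nn as nn
from torch.nn import functional as F

def sequence_logps(logits, labels, ignore_index=-100):
    """Length-averaged log-probability of `labels` under `logits`, one value per sequence."""
    # Shift so that position t predicts token t + 1
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    mask = labels != ignore_index
    safe_labels = labels.masked_fill(~mask, 0)
    token_logps = torch.gather(F.log_softmax(logits, dim=-1), 2, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1)

class ORPOLoss(nn.Module):
    def __init__(self, lambda_or=1.0):
        super().__init__()
        self.lambda_or = lambda_or

    def forward(self, logits_chosen, labels_chosen, logits_rejected, labels_rejected):
        logp_chosen = sequence_logps(logits_chosen, labels_chosen)
        logp_rejected = sequence_logps(logits_rejected, labels_rejected)

        # Compute SFT loss (negative log-likelihood of the chosen response)
        sft_loss = -logp_chosen.mean()

        # Compute L_OR loss; log odds = log p - log(1 - p), done in log space for stability
        log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
        log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
        l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

        # Combine SFT and L_OR losses
        total_loss = sft_loss + self.lambda_or * l_or.mean()
        return total_loss

# Training loop
model = ...  # Initialize your model
optimizer = ...  # Initialize your optimizer
orpo_loss = ORPOLoss(lambda_or=0.2)  # Initialize ORPO loss with lambda_or value

for epoch in range(num_epochs):
    for batch in dataloader:
        # Assumed batch layout: prompt + chosen response and prompt + rejected response,
        # with prompt positions in the labels set to -100 so they are ignored
        chosen_ids, chosen_mask, labels_chosen, rejected_ids, rejected_mask, labels_rejected = batch

        # Forward pass (one per response)
        logits_chosen = model(chosen_ids, attention_mask=chosen_mask).logits
        logits_rejected = model(rejected_ids, attention_mask=rejected_mask).logits

        # Compute ORPO loss
        loss = orpo_loss(logits_chosen, labels_chosen, logits_rejected, labels_rejected)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()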

MyVLM (Shitty Name only Snapchat can think of)

Paper - [2403.14599] MyVLM: Personalizing VLMs for User-Specific Queries

Aim is to add personalisation to any existing visual language model or train one from scratch. Mostly they want to do this to capture snap userbase’s image and other objects accurately.

Most of such ideas revolve around playing with embeddings which is the case here as well.

They run a face detection model, figure out whose face it is using cosine similarity against an existing database and then pass those embeddings and the metadata to the VLM
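The lookup part is easy to picture with a small sketch. This is only my illustration of cosine-similarity matching against a database of known faces; the 3-d embeddings, the names and the 0.7 threshold are made up, not values from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_face(query_emb, database, threshold=0.7):
    """Return (name, score) of the closest known face, or (None, score) if nothing is close enough."""
    best_name, best_score = None, -1.0
    for name, emb in database.items():
        score = cosine_similarity(query_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:  # hypothetical cut-off for "unknown person"
        return None, best_score
    return best_name, best_score

# Toy database of face embeddings (real ones would come from a face recognition model)
database = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
name, score = identify_face(np.array([0.9, 0.1, 0.0]), database)
unknown, _ = identify_face(np.array([0.0, 0.0, 1.0]), database)
```

Once the person is identified, the matching embedding (plus metadata like the name) is what gets handed to the VLM.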

They also have object detection models but here they use classifiers instead of embeddings. The classifiers they train are pretty fine-grained, e.g. recognising which type of dog it is rather than just saying it’s a dog

This paper contains a lot of hyperparameters in the appendix so better to go through them before jumping to implementation.

The QFormer here is only used in case of BLIP-2, when used with llava, they simply remove it and append the embeddings to the ones after the projection MLP

One thing I liked about this paper is that they have actually done a thorough analysis of whether this approach is even sound mathematically. Plenty of nuggets in the paper where they scaled the embeddings up/down to make sure their effects are not over/under emphasized

 

Factuality in LLMs

Paper - [2403.18802] Long-form factuality in large language models

Nothing groundbreaking here imo

They use LLMs to split out facts from a given generated passage and then use google search to verify if those facts are correct or not

It is simply working now cause the models are bigger and much smarter.

All the alpha is in Appendix as well as one section where they discuss precision recall of LLMs

Layer Skip

Paper - [2404.16710] Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding

Quite an important paper imo (cause of my bias when it comes to matryoshka embeddings)

The premise is that not all tokens require you to go through all layers of the model (this has been explored in other papers as well). Some tokens are much easier to predict than others

So what you can do instead is directly take embeddings from an intermediate model layer and pass them through lm_head to get predictions.

But what if we predict something wrong? Well for that, instead of using normal auto-regressive decoding you can use speculative decoding. You simply keep on predicting using earlier layers and then keep on correcting it using the output from all the layers. The correction is fast since there’s no need to do that auto-regressively.

Also if we use the same model to correct, we should already have the KV cache from the initial few layers, so effectively we only need to compute the remaining layers.

The only issue with this idea is that it requires training the model in a way so that earlier embeddings are still good enough (basically by incorporating their loss, scaled down by a factor, in the final loss), plus the training itself is more costly
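The draft-then-verify loop is easier to see in a toy. Below is a pure-Python sketch of greedy self-speculative decoding where both "models" are stand-in functions I made up: `full_model_next` plays the full forward pass and `early_exit_next` plays the cheaper early exit that is sometimes wrong. Nothing here is the paper's code.

```python
def full_model_next(context):
    """Stand-in for a full forward pass: next token is the sum of the context mod 7."""
    return sum(context) % 7

def early_exit_next(context):
    """Stand-in for exiting at an early layer: cheap, but wrong when len(context) % 3 == 0."""
    guess = full_model_next(context)
    return (guess + 1) % 7 if len(context) % 3 == 0 else guess

def self_speculative_decode(prompt, num_new_tokens, draft_len=4):
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    while len(tokens) < target_len:
        # 1. Draft a few tokens autoregressively with the cheap early-exit path
        draft = []
        for _ in range(draft_len):
            draft.append(early_exit_next(tokens + draft))
        # 2. Verify the draft with the full model. In a real model this is one
        #    batched forward pass over all draft positions; a loop here for clarity
        accepted = []
        for i, tok in enumerate(draft):
            correct = full_model_next(tokens + draft[:i])
            if tok == correct:
                accepted.append(tok)
            else:
                accepted.append(correct)  # fix the first wrong token, drop the rest
                break
        tokens.extend(accepted)
    return tokens[:target_len]

def greedy_decode(prompt, num_new_tokens):
    """Plain autoregressive decoding with the full model, for comparison."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(full_model_next(tokens))
    return tokens

speculative = self_speculative_decode([1, 2, 3], 10)
baseline = greedy_decode([1, 2, 3], 10)
```

The point of the comparison is that the speculative path produces exactly the same tokens as plain greedy decoding; the only thing that changes is how many expensive sequential steps you pay for.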

Training a Judge (model)

Paper - Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

They are training two models here and then merging them

The first model is pretty obvious: given an answer by an LLM, rate how good it is from 1 - 5. However, to improve the rating they do 2-3 prompt engineering tricks which can be used without this model as well. Instead of just providing the initial instruction and answer to generate a score, they also provide a reference answer in the prompt along with an explicit evaluation criteria rubric. Also, instead of only generating the score, they ask the model to first generate a verbal explanation.

The second model they train is a pairwise ranking one where along with instruction you give two answers and ask LLM to tell which is better. Again they do all the prompt engineering tricks mentioned before but with small variations e.g. evaluation criteria doesn’t contain scores, generate two verbal feedback that compare them with criteria

I will never intuitively understand why model merging works but I have made my peace with it.

Merge cheatsheet hidden in the paper, woohooo, till now I only pretended to know what they were. All seem simple

How to create a FAQ dataset from your company docs?

SDXL Lightning

Paper - [2402.13929] SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Asian bros are amazing. They are competing with each other in the same company to give us the best possible models.

And so from Bytedance, we get two great ones to make SD inference faster

One is SDXL Lightning and the other is HyperSD

The goal is to train a lora/model that can help you generate high quality images from SDXL with just 2 steps. They do this via distillation with an adversarial objective (you can check SD Turbo to see what that is). Basically there’s a discriminator model that is trained to predict whether an image is generated by the teacher or the student. And then you have a student model. We want to minimize the gap in the discriminator’s output between images generated by the teacher and the student

For discriminator they are using existing UNet of SD models and nothing fancy

I wasn’t aware of this flaw in the noise schedule at all but it’s so apparent now. Good thing is that it does make these models slightly better for image to image (since you don’t start from pure noise in that case)

Some of the unique things they do to get high quality 1step and 2step models


Both use the LAION and COCO datasets to train image models in case you also want to train yours

HyperSD

Paper - [2404.13686] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis

This is also an attempt by Bytedance bros to make SD inference faster. However the approach they are following here is more similar to how consistency models are cooked instead of Distilled models.
What’s the primary diff between these two approaches?

Well in consistency models, instead of predicting the noise and doing diffusion, you generally predict the latent itself at the t=0 i.e. the final image

This as you can assume doesn’t work well a lot of times since predicting something at t=0 is quite hard

That’s why the authors propose a different path where instead of predicting directly at t=0, they break down the timesteps into K segments (where k is varied) and then predict only one segment at a time
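A tiny sketch of what "predict only one segment at a time" means for the target timestep. I am assuming equal-width segments and a 1000-step schedule here; the function names are mine.

```python
def segment_boundaries(num_timesteps, k):
    """Split [0, num_timesteps] into k equal segments and return the k + 1 boundaries."""
    return [round(i * num_timesteps / k) for i in range(k + 1)]

def segment_target(t, num_timesteps, k):
    """Timestep the model is asked to jump to from t: the lower edge of t's own segment."""
    bounds = segment_boundaries(num_timesteps, k)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if lo < t <= hi:
            return lo
    raise ValueError("t must be in (0, num_timesteps]")

bounds_k4 = segment_boundaries(1000, 4)
target_k4 = segment_target(600, 1000, 4)  # only has to reach the start of its segment
target_k1 = segment_target(600, 1000, 1)  # one segment: back to the classic t=0 target
```

With k = 1 this collapses to the usual consistency-model target, which is why smaller k is the harder setting.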

They also reach the same conclusion as the lightning paper: MSE loss is more effective when the prediction and target are close to each other (higher k, i.e. shorter segments), while adversarial loss is more effective when they are far apart (smol k).

CFG remains the Achilles heel of fast SD implementations. In practice I have never been able to get decent results with a value above 2.0

However, they did release some checkpoints to fix it recently

REMOVE BACKGROUND OSS MODEL LFGGGG!!!
https://huggingface.co/schirrmacher/ormbg

Paper it is based on - [2203.03041] Highly Accurate Dichotomous Image Segmentation

Semantica

Paper - [2405.14857] Semantica: An Adaptable Image-Conditioned Diffusion Model

by google deepmind. Can’t make up my mind if it is mid or interesting.

To me it seems like an attempt to make a powerful IPAdapter alternative i.e. condition the output based on input images. However here they want to influence the output more strongly plus support datasets that are not even part of the model’s training via In-Context examples. E.g. if you train a model on animals but don’t include elephants in the dataset, can it still generate one if you give it elephant images as the input prompt?

Most of the magic of this paper lies in the dataset. It contains multiple images that can be used in a single context. By URL here they refer to the wikipedia URL or some webpage from which all images were taken.

Making a good AI coder

Source  - https://aider.chat/2024/05/22/swe-bench-lite.html

https://aider.chat/2023/10/22/repomap.html

These are simple yet useful blogs. Not going to add a lot of info here, better to go through them in 10 minutes.

Mostly involves using an AST-based fetcher to fill the context instead of a simple similarity-search-based one

Also includes multiple other optimisations like prompt engineering and some vague code editing backends.

Also if you want to create AST (Abstract Syntax tree) of your repo simply use - https://tree-sitter.github.io/tree-sitter/
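For a feel of what an AST-based map of a file looks like, here is a small sketch using Python's built-in `ast` module instead of tree-sitter (so it only handles Python source). The `repo_map` helper and its output format are my own, not aider's.

```python
import ast

def signature(node, indent=""):
    """One-line summary of a function definition: name plus positional arg names."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"{indent}def {node.name}({args})"

def repo_map(source, filename="<unknown>"):
    """Return one line per top-level class/function (and per method) in `source`."""
    tree = ast.parse(source, filename=filename)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(signature(item, indent="    "))
    return lines

example = '''
class Cache:
    def get(self, key):
        return None

def load(path, mode):
    return open(path, mode)
'''
summary = repo_map(example)
```

A compact listing like this is much cheaper to put in the context than whole files, and the model can then ask for the full body of whatever looks relevant.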

Search in smol llms

Paper - [2406.07394] Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

Basic funda here is that you use search to improve the answers of LLMs instead of fine tuning or making them larger.

They use a lot of ml mumbo jumbo but what they are doing is quite simple.

You first generate an answer using LLM.

Then you ask the LLM to get critical feedback for that answer.

Then you ask the LLM to generate an improved answer based on this feedback.

Then you ask the LLM to grade the answer from -100 to 100.

Then you use this grade to update the scores of the answer and its parents.

Finally, in the next iteration you select the children with the best score and repeat the same process.
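The loop above can be sketched in a few lines. This is a simplified version under my own assumptions: the LLM calls are passed in as plain functions, selection is greedy on the average score (the paper uses a UCT-style rule), and the score update is a plain running average rather than the paper's exact formula.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    answer: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

    @property
    def value(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else float("-inf")

def all_nodes(root):
    stack, out = [root], []
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out

def search(question, generate, critique, refine, grade, iterations=4):
    root = Node(answer=generate(question))
    root.scores.append(grade(question, root.answer))
    for _ in range(iterations):
        node = max(all_nodes(root), key=lambda n: n.value)  # greedy selection
        feedback = critique(question, node.answer)
        child = Node(answer=refine(question, node.answer, feedback), parent=node)
        node.children.append(child)
        score = grade(question, child.answer)
        cur = child
        while cur is not None:  # record the score on the child and all its ancestors
            cur.scores.append(score)
            cur = cur.parent
    return max(all_nodes(root), key=lambda n: n.value).answer

# Toy stand-ins for the four LLM calls: the "right" answer is 3
def toy_generate(q): return "0"
def toy_critique(q, a): return "too low" if int(a) < 3 else "too high"
def toy_refine(q, a, fb): return str(int(a) + 1) if fb == "too low" else str(int(a) - 1)
def toy_grade(q, a): return max(-100, -10 * abs(3 - int(a)))

best = search("count to three", toy_generate, toy_critique, toy_refine, toy_grade, iterations=3)
```

Swap the toy functions for real prompts and you have the whole method; all the cost is in the number of LLM calls per iteration.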

Make Smol LLMs Kino (by Meta)

Paper - [2402.14905] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

This paper talks about architecture changes which you can make to smol llms to reduce the memory requirements while also increasing their accuracy

Most important observation is that deeper nets are better than wider ones when it comes to accuracy (Gemma2 paper also had similar conclusion)

Also, SwiGLU always for more accuracy (SwiGLU(x) = Swish(xW) * (xV), where Swish(z) = z * sigmoid(beta * z) and * is element-wise)
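For reference, a minimal numpy sketch of a SwiGLU feed-forward block in its usual form (gate and up projections multiplied element-wise, then a down projection). Biases are left out and the toy sizes are arbitrary; the weight names are mine.

```python
import numpy as np

def swish(z, beta=1.0):
    # z * sigmoid(beta * z)
    return z / (1.0 + np.exp(-beta * z))

def swiglu_ffn(x, w_gate, w_up, w_down, beta=1.0):
    """x: (tokens, dim); w_gate, w_up: (dim, hidden); w_down: (hidden, dim)."""
    gated = swish(x @ w_gate, beta) * (x @ w_up)  # element-wise gating
    return gated @ w_down

rng = np.random.default_rng(0)
dim, hidden = 4, 8  # toy sizes
x = rng.normal(size=(2, dim))
w_gate = rng.normal(size=(dim, hidden))
w_up = rng.normal(size=(dim, hidden))
w_down = rng.normal(size=(hidden, dim))
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

Note that this needs three weight matrices instead of the two in a plain MLP, so the hidden size is usually shrunk a bit to keep the param count comparable.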

Next, they test out how many attention heads are optimal. First they use Grouped query attention as it uses less memory than normal multi head attention

Next they try some “hacks” which supposedly work

First is layer sharing, where they make the weights of the next layer the same as the previous layer. This means loading far fewer weights.

Finally they also share embeddings between input and output layers. This leads to a slight drop in accuracy but memory savings are significant since embedding params make up for around 10-20% of the total params for smol llms
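Both hacks are simple enough to sketch. Below, `forward` reuses each block's weights several times in a row, and `param_count` shows what tying the input and output embeddings saves. All the sizes are hypothetical round numbers, not the paper's configs.

```python
def forward(x, blocks, repeats=2):
    """Layer sharing: run every block `repeats` times back to back with the same weights."""
    for block in blocks:
        for _ in range(repeats):
            x = block(x)
    return x

def param_count(vocab_size, dim, num_unique_blocks, params_per_block, tie_embeddings):
    embedding = vocab_size * dim
    # With tying, the output head reuses the embedding matrix instead of owning a copy
    lm_head = 0 if tie_embeddings else vocab_size * dim
    return embedding + lm_head + num_unique_blocks * params_per_block

# Toy "blocks" so the call order is easy to follow: (1 + 1 + 1) * 2 * 2
shared_out = forward(1, [lambda v: v + 1, lambda v: v * 2], repeats=2)

untied = param_count(32000, 512, num_unique_blocks=15, params_per_block=3_000_000, tie_embeddings=False)
tied = param_count(32000, 512, num_unique_blocks=15, params_per_block=3_000_000, tie_embeddings=True)
saved = untied - tied
```

With layer sharing you get the depth of 2x the blocks while only storing (and loading) one copy of each.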

Kolors (Chinese SDXL)

Paper - Kolors/imgs/Kolors_paper.pdf at master

We all know text encoders make a lot of difference in image gen. That’s why SD3 uses 3 of them (2 CLIPs and 1 T5)


These folks go a step further and instead of using a text encoder, simply use the penultimate layer output of the ChatGLM3-6B-base model.

Next they use a better MLLM to caption the images so that the descriptions capture as many concepts as possible.

I still don’t understand how they make text rendering better cause all they mentioned is synthetic data.

They also have training run divided into two phases - one where they teach the model concepts using low-res images, second where they tune it for quality using higher res images

For higher res images they also adjust the schedule during training so that we do get almost pure noise in forward diffusion since that’s the input during inference

 

Florence-2

Paper  - [2311.06242] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

I generally ignore all models by microsoft. But after phi3, this is the other model that has been branded kino by ML anons.

What’s their secret sauce though?

The model architecture is not something revolutionary. They are using an Image encoder to tokenize the images and then using an encoder-decoder model to predict the text.

Only difference is tokenizing the bounding box which tbh a lot of papers do right now especially with the rise of extracting info from docs

The attention block used in image encoder is also a bit unique. Usually in spatial attention we use channels as the features while the pixel is used as a token. Here though we also introduce another attention block that does the opposite i.e. use pixels as features for each channel (can be simply done by reshaping).
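The "just reshape" remark can be shown directly. In this numpy sketch the same attention function is applied once with pixels as tokens and once on the transposed matrix so that channels become tokens. Identity projections and a single head are simplifications of mine; the real block has learned projections and channel groups.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    """Self-attention with identity projections. tokens: (num_tokens, num_features)."""
    weights = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[-1]))
    return weights @ tokens, weights

rng = np.random.default_rng(0)
num_pixels, num_channels = 6, 4  # toy feature map flattened to (H*W, C)
feat = rng.normal(size=(num_pixels, num_channels))

# Spatial attention: each pixel is a token, channels are its features
spatial_out, spatial_w = attention(feat)

# Channel attention: transpose so each channel is a token and pixels are its features
channel_out_t, channel_w = attention(feat.T)
channel_out = channel_out_t.T  # back to (H*W, C)
```

The attention matrix is pixels x pixels in the first case and channels x channels in the second, which is also why the channel version stays cheap at high resolution.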

Most of their magic however lies in the data creation pipeline (which is secret sauce for most OSS model improvements tbh)

Embedding Spreadsheets 101

Paper - [2407.09025] SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

Written by interns so was already skeptical. I thought they trained some model to embed spreadsheets

But what they are doing is simply finding a way to represent spreadsheet data in prompt. They are mostly focusing on representing the structure of the sheet rather than actual data.

Like if you were rawdogging, you would simply represent data as (cell location, data)

However, they propose some optimizations to this so that we don’t run into prompt context limits

- inverted index : where they simply use values as keys and the list of cells as the index entries. This helps in saving a lot of context by avoiding empty cells and duplicate values

- data format aggregation - just combining the values sharing the same cell data type e.g. datetime, currency etc.


- chain of spreadsheet - you only give sheet structure and task in prompt (sheet structure can be provided with any of the techniques above). A sheet can contain multiple tables so you ask LLM which table contains relevant data. The LLM provides you the boundaries of the relevant tables. Then you make a second call, with the actual data from these tables. Hence a chain, tada :|
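The inverted index trick from the list above is the easiest one to show in code. A minimal sketch, assuming the sheet is already available as a dict of cell address to value (the sample sheet is made up):

```python
from collections import defaultdict

def naive_encoding(cells):
    """The rawdog version: one (cell location, data) pair per cell, empty or not."""
    return [(addr, val) for addr, val in cells.items()]

def inverted_index(cells):
    """Map each distinct value to the cells holding it, dropping empty cells."""
    index = defaultdict(list)
    for addr, val in cells.items():
        if val is None or val == "":
            continue
        index[val].append(addr)
    return dict(index)

cells = {
    "A1": "Region", "B1": "Status",
    "A2": "EU", "B2": "open",
    "A3": "US", "B3": "open",
    "A4": "", "B4": "open",
}
naive = naive_encoding(cells)
index = inverted_index(cells)
```

Eight entries shrink to five, and the repeated "open" is written once. On real sheets with lots of blanks and repeated values the savings are much bigger.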

They also have this thing called structured anchors, where they use heuristics to compress the table while retaining structure. IMHO, no one is going to use it in prod. Also since the info on this is split in appendix and main content, here’s the python code to understand it better

SAM-2 (with MP4 support)

Paper - SAM 2: Segment Anything in Images and Videos | Research - AI at Meta

I have already written about SAM in this doc. They have made it even better on images and now it supports videos (as a sequence of images and not MP4, I lied).


The primary addition in this architecture is the memory attention and memory bank. It’s just a fancy terminology for a place where they store embeddings of the previous predictions and then use them in a transformer block to influence the current frame with cross-attention.

They have changed the Image encoder as well to Hiera so that it’s accurate as well as fast.

Mask decoder is mostly same except they have skip connections from image encoder for some high res features plus object embeddings (still not sure what they mean, need to read more)

Checkout appendix C if you want full details on the model.

Flux architecture (stolen from reddit)

The architecture is quite similar to SD3 tbh (and makes sense as well since it is created by same engineers)

https://www.reddit.com/r/LocalLLaMA/comments/1ekr7ji/fluxs_architecture_diagram_dont_think_theres_a/

Loopy - Make images speak and sing

Paper - https://arxiv.org/pdf/2409.02634

So we already have Hallo that can do the same thing. However, if you have opened the Hallo repo ever, you’ll notice it has way too many components that need to work together for the model to work.

The aim of this project is to strip out a lot of things like face detection and simply use good embeddings for audio to animate the face.

The idea is like this

You have a denoising unet where instead of text embeddings you use audio embeddings for conditioning. The input to this SD unet is your usual noise, but the timestep embeddings are concatenated with the embeddings of the audio and its translation plus other features

Then you have a second unet called reference net. This unet actually takes the image as input along with the motion frames


The main use of ref net is to ensure that the generated images for frames actually follow the animation sequence correctly i.e. each frame should not diverge significantly from previous frames. They do that by using multiple previous frames as inputs. The attention layers in this u-net are all on spatial dimensions like the original SD unet.


The denoising unet, however, which actually generates the image, is a bit different. It contains the spatial attention layers but apart from that it also contains temporal attention layers. These new layers actually use the output of each layer of the reference u-net as condition.

I’ll be honest the terminology is a bit confusing in this paper.

Ok, I should’ve read the intro first. Now I understand what they are trying to do better.

Basically there already have been attempts to animate images using audio. However they suffer from following issues -

So they have added temporal attention to solve the second issue mentioned here. They also have added a Time segmentation module that takes a series of frames and groups them together based on how far they are from the current frame. They group together more frames if they are further from the current timestamp cause their details matter less.

To solve the first issue, instead of using the audio embeddings directly they have this Audio to latents module, which is basically some attention layers. They train this module along with the whole network to translate audio embeddings into motion latents i.e. embeddings that actually correlate with the motion. It’s sort of like training a projection layer for VLMs but also involves attention not just MLP.

ColPali - Multimodal RAG made simpler and faster

Paper - https://arxiv.org/pdf/2407.01449

Janus - Deepseek goes multimodal

Paper - https://arxiv.org/pdf/2410.13848

Focus of paper is on an approach to use multiple encoders for visual tasks. They argue (and quite correctly) that image generation and understanding require very different features. The former requires finer ones. Hence they propose an architecture where you can plugin multiple encoders.

Hmm, but what’s the new thing here? Well they train this new architecture such that the model itself is able to decide which encoder’s features should dominate instead of manually enabling/disabling them or using some primitive function.

They want to extend this approach to multiple other encoders as well (e.g. one for audio) and let the model itself figure out the possible paths.

Their training is divided into 3 stages where different layers are frozen

Stage 1 - the focus is mostly on LLM and adapters to learn the relationship between text and image.  Dataset includes 1.25 million image-text paired captions from ShareGPT4V  for multimodal understanding and approximately 1.2 million samples from ImageNet-1k   for visual generation

Stage 2 - the focus is on complete image generation and understanding. This is the heaviest part of the training, with all the datasets

Stage 3 - This is just instruction tuning so that llm follows them correctly

For visual gen tasks the cross entropy loss in training is only calculated using image tokens

Entropix

Code - https://github.com/xjdr-alt/entropix/blob/main/entropix.ipynb

The entropy sampling approach by xjdr is gaining a lot of hype. Let’s see what the hype is all about.

The core idea revolves around the following 4 metrics

1. Entropy - This indicates how confident the model is in its current prediction. We calculate one for logits and another for attention matrix

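2. Var entropy - the variance of the surprisal (-log p) around the entropy, i.e. how unevenly the uncertainty is spread across the candidates. For attention, it is the variance of the attention entropy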

3. Agreement - how much attention heads agree with each other. Basically do all heads think that the same token is important or not.  


4. Interaction Strength - how much the model thinks the parts of input are related to each other. It’s just the mean of the absolute attention scores


The algo is something like this

1. If both entropy and var entropy are low, it means that model is pretty confident and we should simply keep on following the path we are on

2. If entropy is high but var entropy is low, it means the model is not sure for almost every prediction. It is here that we actually return a token that represents a clarifying question. The hope is that model will get more info after this and then come on correct path

3. If entropy is low but var entropy is high, it means the model is confident but the confidence varies significantly across tokens. Here we dial up the temperature a bit. We also increase top_k in sampling if agreement is low so that we can hopefully find a more confident path like 1

4. If both are high, the model sometimes knows but other times it doesn’t. Here we dial up the temperature even more. We also decrease top_p for some reason which I don’t fully understand
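The four cases above boil down to a small decision function. This is only my paraphrase of the logic as described in the list: every threshold, multiplier and default below is a made-up placeholder, and the real repo tunes these very differently.

```python
def choose_strategy(entropy, varentropy, agreement,
                    ent_threshold=2.0, vent_threshold=2.0, agreement_threshold=0.5,
                    temperature=0.7, top_k=40, top_p=0.9):
    """Pick a sampling strategy from the entropy / varentropy quadrant (hypothetical thresholds)."""
    high_ent = entropy > ent_threshold
    high_vent = varentropy > vent_threshold

    if not high_ent and not high_vent:
        # Case 1: confident, keep following the current path
        return {"action": "argmax"}
    if high_ent and not high_vent:
        # Case 2: unsure everywhere, emit a token that asks for clarification
        return {"action": "insert_clarifying_token"}
    if not high_ent and high_vent:
        # Case 3: mostly confident but uneven, explore a little
        k = top_k * 2 if agreement < agreement_threshold else top_k
        return {"action": "sample", "temperature": temperature * 1.2, "top_k": k, "top_p": top_p}
    # Case 4: both high, raise the temperature more and tighten top_p
    return {"action": "sample", "temperature": temperature * 1.5, "top_k": top_k, "top_p": top_p * 0.8}

confident = choose_strategy(entropy=0.1, varentropy=0.1, agreement=0.9)
confused = choose_strategy(entropy=5.0, varentropy=0.1, agreement=0.9)
forking = choose_strategy(entropy=0.1, varentropy=5.0, agreement=0.1)
lost = choose_strategy(entropy=5.0, varentropy=5.0, agreement=0.9)
```

The interesting part is that the decision is made per token, so a single generation can switch between greedy and exploratory sampling many times.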

Also, I don’t know jax so I used claude to understand the shape of the various arrays and check that my understanding is correct. You only need to focus on the sample, calculate_varentropy_logsoftmax, and calculate_metrics methods

import jax

import jax.numpy as jnp

import numpy as np

from typing import Dict, Tuple

# Constants

LN_2 = 0.69314718056  # ln(2) = 1.0 / LOG2_E

@jax.jit

def calculate_varentropy_logsoftmax(logits: jnp.ndarray, axis: int = -1) -> Tuple[jnp.ndarray, jnp.ndarray]:

   """Calculate the entropy and varentropy of the probability distribution using logsoftmax."""

   log_probs = jax.nn.log_softmax(logits, axis=axis)

   probs = jnp.exp(log_probs)

   entropy = -jnp.sum(probs * log_probs, axis=axis) / LN_2

   varentropy = jnp.sum(probs * (log_probs / LN_2 + entropy[..., None])**2, axis=axis)

   return entropy, varentropy

@jax.jit

def calculate_metrics(logits: jnp.ndarray, attention_scores: jnp.ndarray) -> Dict[str, jnp.ndarray]:

   entropy, varentropy = calculate_varentropy_logsoftmax(logits)

   attention_probs = jax.nn.softmax(attention_scores, axis=-1)

   attn_entropy = -jnp.sum(attention_probs * jnp.log2(jnp.clip(attention_probs, 1e-10, 1.0)), axis=-1)

   attn_varentropy = jnp.var(attn_entropy, axis=-1)

   mean_attention = jnp.mean(attention_probs, axis=1)

   agreement = jnp.mean(jnp.abs(attention_probs - mean_attention[:, None, :]), axis=(1, 2))

 

   print(f"agreement shape: {agreement.shape}")

   interaction_strength = jnp.mean(jnp.abs(attention_scores), axis=(1, 2, 3))

   print(f"interaction strength shape: {interaction_strength.shape}")

   return {

       "logits_entropy": jnp.mean(entropy),

       "logits_varentropy": jnp.mean(varentropy),

       "attn_entropy": jnp.mean(attn_entropy),

       "attn_varentropy": jnp.mean(attn_varentropy),

       "agreement": jnp.mean(agreement),

       "interaction_strength": interaction_strength

   }

def test_calculate_metrics():

   # Create sample input data

   bsz, seqlen, vocab_size = 2, 3, 5

   num_heads = 4

 

   key = jax.random.PRNGKey(0)

   key, subkey1, subkey2 = jax.random.split(key, 3)

 

   logits = jax.random.normal(subkey1, (bsz, seqlen, vocab_size))

   attention_scores = jax.random.normal(subkey2, (bsz, num_heads, seqlen, seqlen))

 

   print("Input shapes:")

   print(f"logits shape: {logits.shape}")

   print(f"attention_scores shape: {attention_scores.shape}")

 

   # Calculate metrics

   metrics = calculate_metrics(logits, attention_scores)

 

   print("\nIntermediate shapes:")

   entropy, varentropy = calculate_varentropy_logsoftmax(logits)

   print(f"entropy shape: {entropy.shape}")

   print(f"varentropy shape: {varentropy.shape}")

 

   attention_probs = jax.nn.softmax(attention_scores, axis=-1)

   print(f"attention_probs shape: {attention_probs.shape}")

 

   attn_entropy = -jnp.sum(attention_probs * jnp.log2(jnp.clip(attention_probs, 1e-10, 1.0)), axis=-1)

   print(f"attn_entropy shape: {attn_entropy.shape}")

 

   mean_attention = jnp.mean(attention_probs, axis=1)

   print(f"mean_attention shape: {mean_attention.shape}")

 

   print("\nOutput shapes and values:")

   # Use a distinct loop variable so the PRNG `key` above is not shadowed
   for name, value in metrics.items():

       print(f"{name} shape: {value.shape}, value: {value}")

# Run the test

if __name__ == "__main__":

   test_calculate_metrics()


Understanding basics of sampling params:

Spirit LM

Paper - https://arxiv.org/pdf/2402.05755

To be honest, I am reading this paper only to understand how they trained it on expressive tokens.

Basically this model is auto-regressive but supports both speech input and generation. Obviously they’ll have to use tokens for everything here, which means more encoders.

We already know how text gets tokenized, so let’s skip it.

Now for the fun expressive part, well it’s just more encoders with more tokens 😐

Plan search

Paper - https://arxiv.org/pdf/2409.03733

It’s just prompting LLMs to first create observations/hints about the problem, then create more observations from the previous ones.
Then write pseudocode, and then the actual coding solution.
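A rough sketch of that prompt chain, just to make the flow concrete. The `llm` callable, the prompt wording and `n_obs` are all made up for illustration (the paper's actual prompts are different), and `fake_llm` only exists so this runs offline.

```python
from typing import Callable, List


def plan_search(problem: str, llm: Callable[[str], str], n_obs: int = 3) -> str:
    """Observations -> derived observation -> pseudocode -> code."""
    # Step 1: ask for first-order observations/hints about the problem.
    observations: List[str] = [
        llm(f"Problem:\n{problem}\nGive observation #{i + 1} about this problem.")
        for i in range(n_obs)
    ]
    # Step 2: derive a further observation from the previous ones.
    joined = "\n".join(observations)
    derived = llm(f"Problem:\n{problem}\nObservations:\n{joined}\nDerive a new observation.")
    # Step 3: turn all observations into pseudocode.
    pseudocode = llm(
        f"Problem:\n{problem}\nObservations:\n{joined}\n{derived}\nWrite pseudocode."
    )
    # Step 4: turn the pseudocode into the actual solution.
    return llm(f"Problem:\n{problem}\nPseudocode:\n{pseudocode}\nWrite the final code.")


# Stand-in for a real model (hypothetical) so the sketch runs without an API.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<reply {len(calls)}>"


solution = plan_search("Reverse a linked list.", fake_llm, n_obs=2)
```

Swap `fake_llm` for any function that takes a prompt string and returns the model's text.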

Hallo-2

Paper - https://arxiv.org/pdf/2410.07718

Paper - https://arxiv.org/pdf/2406.08801

Encodddeerrss - like every other paper

The secret sauce is the reference net that basically controls the generation process of the main diffusion model. Its job is to ensure that the output frames match the original image.

However, we do want some motion in each frame, especially lips, eyes etc. That’s why we also use text and audio embeddings as conditioning input. All of this is via cross-attention

We also need encodings of past-frames as input so that we can ensure that we are not generating the exact same thing. However, simply using them has a side-effect that they heavily influence the output. This is why we first make them pass through two steps -

* patch data augmentation - we divide the image into patches and randomly drop some patches from each frame

* gaussian noise - add a slight random noise to each frame.
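A toy version of those two steps on a tiny grayscale frame stored as nested lists. This is only a sketch: treating "drop" as zeroing the patch, and the values of `patch`, `drop_prob` and `noise_std`, are my assumptions, not numbers from the paper.

```python
import random
from typing import List, Optional

Frame = List[List[float]]  # H x W toy frame


def augment_frame(
    frame: Frame,
    patch: int = 2,
    drop_prob: float = 0.25,
    noise_std: float = 0.05,
    rng: Optional[random.Random] = None,
) -> Frame:
    rng = rng or random.Random()
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    # Patch drop: walk over patch x patch blocks and zero some at random.
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            if rng.random() < drop_prob:
                for y in range(top, min(top + patch, h)):
                    for x in range(left, min(left + patch, w)):
                        out[y][x] = 0.0
    # Gaussian noise: add a small random perturbation to every pixel.
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in out]


frame = [[1.0] * 4 for _ in range(4)]
noisy = augment_frame(frame, rng=random.Random(0))
```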

They have also added a high-resolution enhancement module, which is a transformer. Each block of this transformer is a self-attention layer followed by a temporal-alignment layer (which is again attention but with inputs reshaped). The predicted output is finally decoded using a codebook and an HQ decoder for high-resolution video with better temporal alignment.

TDM’s reasoning arc - must increase AGI IQ by 20 basis points

COCONUT

Paper - [2412.06769] Training Large Language Models to Reason in a Continuous Latent Space

We all know tokenizers are the bane of LLMs’ existence. Now that test-time compute is another paradigm to scale, people rightfully feel that tokenizers are limiting a lot of reasoning by throwing away a lot of useful information.

This is what this paper aims to solve. The idea is really simple - Do not convert thoughts to tokens lmao.

But then how do we pass them for prediction? Cause ultimately we are dealing with an auto-regressive LLM. Well, just use the last hidden state (before it gets converted into a token) directly as the next input embedding. This obviously requires changes in the training regimen. If you simply connect the last layer to the first, you will get garbage.

Now the next question is - how to determine when to switch to latent space vs token space. Hmm, that’s where this paper doesn’t have a good solution tbh. They have a very naive approach of inserting a <bot> token, which signifies the beginning of thought, right after the question prompt to switch to latent space.

For switching back to token space, they just do it after N steps, where N is fixed.

They train this model by having prompts in the following format

Question [step 1] [step 2]..... Answer

They do multi-stage training. During the first stage, only the question is provided and the LLM is encouraged to generate all the reasoning steps and the answer. During the next stages, they keep replacing the leading steps, one more per stage, with latent-space embeddings from the last layer. They mask the question and these latent thoughts in the loss calculation, so the LLM is trained to generate the remaining steps.
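To make the staging concrete, here is a sketch of how a training example and its loss mask could be laid out per stage. It works on whole segments instead of real tokens, and the `<latent>` placeholder, the closing `<eot>` marker and `thoughts_per_step` are my assumptions for illustration. In the real thing each `<latent>` slot would be filled with the previous position's last hidden state.

```python
from typing import List, Tuple

LATENT = "<latent>"  # placeholder slot (assumed name) for a continuous thought


def build_stage_example(
    question: str,
    steps: List[str],
    answer: str,
    stage: int,
    thoughts_per_step: int = 1,
) -> Tuple[List[str], List[int]]:
    """Return (sequence, loss_mask) for a given training stage."""
    k = min(stage, len(steps))  # number of leading steps replaced by latents
    latents = [LATENT] * (k * thoughts_per_step)
    prefix = [question, "<bot>"] + latents + ["<eot>"]
    targets = steps[k:] + [answer]
    sequence = prefix + targets
    # Loss only on the remaining language steps and the answer;
    # the question and the latent thoughts are masked out (0).
    loss_mask = [0] * len(prefix) + [1] * len(targets)
    return sequence, loss_mask


seq0, mask0 = build_stage_example("Q", ["s1", "s2", "s3"], "A", stage=0)
seq2, mask2 = build_stage_example("Q", ["s1", "s2", "s3"], "A", stage=2)
```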

I am a big believer that O1 type models don’t do reasoning in token space but at the same time they involve some RL during training to generate better thoughts instead of SFT. The RL might also be involved during inference, who knows??

Which LLM layers are important?

Paper - [2412.09563] Does Representation Matter? Exploring Intermediate Layers in Large Language Models

LLMs are sort of like compression engines. They take all the knowledge in the world and then condense it into weight matrices comprising a few billion params. Maybe the Foundation was just an LLM training institute set up on another planet.

Now one question that always arises is: which LLM layers are important? Do all layers contribute significantly to accuracy, or is it the starting, middle, later, or random layers?

This paper tackles the problem by measuring entropy of the logits. The way they calculate entropy is a bit different than the entropix one, but conceptually it represents the same thing - the lower it is, the more the probability is concentrated among a few tokens, and vice versa.

What they find is that the intermediate layers matter more and their entropy is negatively correlated with accuracy, i.e. the lower it is, the better the answer.

They also find that as the model is trained, the entropy of the middle layers changes significantly and keeps on falling.
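Quick sanity check of that intuition with plain Shannon entropy in bits. This is not necessarily the estimator the paper uses, and the two distributions are made up.

```python
import math
from typing import List


def entropy_bits(probs: List[float]) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


peaked = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one token
flat = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly over four tokens

low = entropy_bits(peaked)
high = entropy_bits(flat)  # 2.0 bits, the maximum for four outcomes
```

The peaked distribution comes out far below the flat one, which is the "lower entropy = more concentrated" reading used above.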

 

I still don’t understand the observations about curvature though. It increases in middle layers and remains stable until later layers signifying I guess the prob distributions change a lot in intermediate layers

Allow LLMs to explore-exploit better

Paper - Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

One of the most obvious strategies to get a correct answer from an LLM is to generate N possible answers, select one of them, verify, and then generate M more answers using the selected answer + the errors as input.
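A bare-bones sketch of that loop. `generate`, `score` and `verify` are hypothetical callables standing in for the LLM, the selector and the verifier, and the toy versions below only exist so the code runs.

```python
from typing import Callable, List, Optional, Tuple


def best_of_n_with_retry(
    question: str,
    generate: Callable[[str, Optional[str], int], List[str]],
    score: Callable[[str], float],
    verify: Callable[[str], Tuple[bool, str]],
    n: int = 4,
    m: int = 4,
) -> str:
    # Generate N candidates and select the highest-scoring one.
    candidates = generate(question, None, n)
    best = max(candidates, key=score)
    ok, errors = verify(best)
    if ok:
        return best
    # Feed the selected answer plus its errors back in and sample M more.
    feedback = f"Previous attempt:\n{best}\nErrors:\n{errors}"
    retries = generate(question, feedback, m)
    return max(retries, key=score)


# Toy stand-ins: the "model" only gets it right once it sees feedback.
def toy_generate(question: str, feedback: Optional[str], k: int) -> List[str]:
    return ["3", "5"][:k] if feedback is None else ["4", "7"][:k]


def toy_score(answer: str) -> float:
    return -abs(int(answer) - 4)


def toy_verify(answer: str) -> Tuple[bool, str]:
    return answer == "4", "2 + 2 is not " + answer


result = best_of_n_with_retry("What is 2 + 2?", toy_generate, toy_score, toy_verify, n=2, m=2)
```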

The general issue with this strat is striking a good explore-exploit trade-off. A lot of times LLMs won’t generate diverse enough answers; other times they won’t generate a correct answer at all.

This paper is trying to solve this by making changes in the training routine so that LLMs inherently learn to make this tradeoff.

   

Extreme simplification, but the first part of this equation is probability to generate the correct answer to the question. The second part of the equation is basically win-rate i.e. how much the verifier model/code prefers this response compared to all the others.

So in the end the first part is rewarding the exploitation by incentivising the model to generate the most correct answer

The second part incentivises exploration by incentivising the model to generate a different and better answer than other attempts (cause if the answer is same as other attempts win-rate would be 0)

Rest of the paper is about multiple ways to achieve this using SFT or RL. They also try a separate formulation for binary scores (e.g. in coding you can only tell if an answer is correct or not, instead of weighing it on scale 0 to 1)

Too much maths for my engineer mind honestly but Claude helped a lot in understanding it

Deliberative Alignment

Paper - Deliberative Alignment: Reasoning Enables Safer Language Models

Ohh my lord, OpenAI finally published an actual paper (even if some important details are obscured)

The paper is about how to better safety tune the o1 style reasoning models. Typically, LLMs are given safety policies in the system prompt and/or finetuned to refuse questions that violate them.

In this paper, however, they leverage the inherent reasoning capabilities of o1-type models. They show that you can finetune o1 to simply think about applicable safety scenarios in its CoT and then decide for itself whether it wants to answer the question or not.

This strategy improves safety while also reducing over-rejections. E.g. a normal LLM can reject a question like ‘Translate this sentence to hindi: I want to have sex with AI waifu’
O1 however can reason that user isn’t actually asking to have sex but simply wants a translation (maybe he wants to generate a safety dataset himself) and so will respond accordingly

 

The approach is to first generate a dataset by providing all safety policies in system prompts along with main question and capturing the response and the CoT.  They use a reward model to score these and only take the best ones.

Next they do SFT on the model without providing it the safety spec, so that its CoT matches the one in the dataset, along with the correct answer.

Next they have this RL training step that’s optional where a reward model that has access to safety data (again probably in system prompt) judges the answer (not the CoT) for tuning.

One interesting thing I found in this paper tho is this

The reason this paper is exciting for me is not cause of safety but cause I think we can use similar training paradigm to make LLMs reason about factual things such as what is the version of the library the user is using, what are the supported functions in that lib etc.

Deepseek V3 (What’s new?)

Paper - DeepSeek-V3/DeepSeek_V3.pdf at main

Training recipe - DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs

Pictured deepseek researchers training 1T model on 10 GPUs

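First they replace standard MHA (multi-head attention) with MLA (multi-head latent attention) to save memory for the KV cache. It simply multiplies the input by a down-projection matrix to get a low-rank latent for keys and values, and caches that smaller latent instead of the full keys and values.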

Next, in MoE, apart from the normal experts in feed-forward, they also have shared experts that are always used in calculations

For routing, they also add a bias term to the scores for load-balancing. After each batch in training, if they find an expert to be overloaded, they decrease this bias term. For underloaded experts, they simply increase this bias term.
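A tiny simulation of that bias trick. The expert scores, the step size and "overloaded = above mean load" are assumptions picked for illustration. The bias here only changes which experts get picked, not how their outputs are weighted.

```python
from typing import List


def route_top_k(scores: List[float], bias: List[float], k: int) -> List[int]:
    """Pick the k experts with the highest (score + bias)."""
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return order[:k]


def update_bias(bias: List[float], load: List[int], step: float) -> List[float]:
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, expert_load in zip(bias, load):
        if expert_load > mean_load:
            new_bias.append(b - step)
        elif expert_load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Four tokens that all prefer expert 0 (made-up scores).
token_scores = [[0.9, 0.5, 0.1]] * 4
bias = [0.0, 0.0, 0.0]
load = [0, 0, 0]
for scores in token_scores:
    for expert in route_top_k(scores, bias, k=1):
        load[expert] += 1

# Exaggerated step so the effect shows up after a single batch.
bias = update_bias(bias, load, step=0.25)
next_choice = route_top_k(token_scores[0], bias, k=1)
```

After one update the overloaded expert 0 loses out to expert 1 for the same token scores.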

They also have a sequence-wise loss term whose contribution is kept very small. It encourages the router to load balance intra-sequence as well, not just inter-sequence.

Wowowow, they also introduce multi-token prediction in this i.e. training it in a way that it can predict tokens for the next N timesteps instead of just 1. The way they are doing this is by having multiple MTP modules that predict successive tokens. The embedding layer and output head are shared among these modules along with the main model. They obviously modify the loss function as well to take all N tokens’ probs into account instead of just one.

Damn, there is a lot of alpha in this paper on how to train gigamodels on your basement data centre. Huge respect for what they are doing.

They have a new way of doing pipeline parallelism which tbh I don’t understand well at this point. What they are trying is to minimize gpu idle time by keeping two queues - one for forward and another for backward pass. This requires keeping two copies of the weights, however, it reduces the communication overhead by A LOT. They also dedicate around 20 SMs in each GPU to only handle communications. This means that comms can run in parallel to the compute.

Next, they have Mixed precision training in FP8 where they perform all linear operations in FP8 while keeping most output matrices of these ops in BF16. The Attention ops, embeddings and MoE gating ops are still kept in BF16 or FP32

   

They also perform quantisation at a more granular level, which means taking smaller chunks of a tensor and scaling each chunk using the max value in that chunk. This helps in handling outliers better.
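A toy illustration of why per-chunk scaling helps with outliers. It is a sketch only: a 1-D list, with integer rounding on an int8-like grid (`qmax=127`) standing in for FP8, and block sizes chosen just to show the effect.

```python
from typing import List


def quantize_dequantize(values: List[float], block_size: int, qmax: int = 127) -> List[float]:
    """Quantize each block with its own max-based scale, then dequantize."""
    out: List[float] = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        amax = max(abs(v) for v in chunk)
        scale = amax / qmax if amax > 0 else 1.0
        # Snap to the integer grid, then map back to the original range.
        out.extend(round(v / scale) * scale for v in chunk)
    return out


def max_error(original: List[float], restored: List[float], n: int) -> float:
    return max(abs(a - b) for a, b in zip(original[:n], restored[:n]))


# Small values sharing a tensor with a few large outliers.
values = [0.01, 0.02, -0.03, 0.04, 100.0, 50.0, -25.0, 75.0]
per_tensor = quantize_dequantize(values, block_size=len(values))  # one scale for all
per_block = quantize_dequantize(values, block_size=4)             # one scale per chunk

err_tensor = max_error(values, per_tensor, 4)
err_block = max_error(values, per_block, 4)
```

With one scale for the whole tensor the outliers set the scale and the small values collapse to zero. With per-chunk scales they survive almost intact.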

Deepseek R1 (open source sota reasoner)

Paper - DeepSeek-R1/DeepSeek_R1.pdf at main

Wowowowowoowowowowowow

GRPO 

They introduced this in the deepseek math paper. The primary objective of this is to reduce the training burden for RL while not sacrificing the accuracy. Tbh I need to understand this better. I’ll start by revising PPO first https://huggingface.co/blog/deep-rl-ppo and this time actually focusing on the maths
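The part that saves compute is that there is no separate value model: you sample a group of answers for the same prompt and use the group's own reward statistics as the baseline. A minimal sketch of just that advantage computation (not the full clipped objective or KL term). Using population std and this `eps` are my assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward there is nothing to learn from the group.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```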

The most ‘wow’ part about this paper is that they show SFT is not required at all to improve reasoning. Like the QWQs of the world have been trying to SFT on chain-of-thought data to improve the reasoning of the model. Here, however, they show that RL itself is enough. They also show that you don’t need to actually show the model when to backtrack, when to explore other paths etc.; these all emerge naturally if your RL training objective is correct.

N+1 reading list

https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20

Schedule free learning - https://github.com/facebookresearch/schedule_free

GraphRAG - https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/

Salesforce’s smol LLM whitepaper - https://arxiv.org/pdf/2406.18518

Meta 3d gen - https://ai.meta.com/research/publications/meta-3d-gen/?utm_source=threads&utm_medium=organic_social&utm_content=carousel&utm_campaign=research

Inference time algo survey - https://arxiv.org/abs/2406.16838

KINOOOOO KINOOOOOO Torch compile  - torch.compile, the missing manual

Test out - https://github.com/lm-sys/RouteLLM

[2407.07972] Deconstructing What Makes a Good Optimizer for Language Models

[2211.05102] Efficiently Scaling Transformer Inference

https://arxiv.org/abs/2406.11832v1

[2407.06023] Distilling System 2 into System 1

[2312.06681] Steering Llama 2 via Contrastive Activation Addition

[2404.07647] Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

Paper page - Qwen2 Technical Report

https://github.com/OpenStitching/stitching

[2407.12753] LookupViT: Compressing visual information to a limited number of tokens

[2402.09090] Software in the natural world: A computational approach to hierarchical emergence

APPENDIX

Tips and Tricks

Don’t know anything about LLMs?
Start here  - Paper page - Understanding LLMs: A Comprehensive Overview from Training to Inference

Not getting good results from your vector DB based search? Use these

HyDE - Hypothetical Document Embeddings — 🦜🔗 LangChain 0.0.155

Maybe this as well

Cohere launched re-rank - Say Goodbye to Irrelevant Search Results: Cohere Rerank Is Here 

This should be better than using HyDE.

Also - [2302.00093] Large Language Models Can Be Easily Distracted by Irrelevant Context (why feeding correct docs in LLM is important)

Do you want to build a scalable vector DB?

Use this - ScaNN

OR

GitHub - nmslib/hnswlib: Header-only C++/python library for fast approximate nearest neighbors

Do you want local embeddings which rival openAI ones?

https://huggingface.co/hkunlp/instructor-xl

USE THIS LIB IF YOU WANT PERFECT JSON OUTPUT FROM LLM

https://github.com/1rgs/jsonformer/

Are you Lazy? Do you still want to finetune LLaMa?

Worry not, use this to generate dataset - https://github.com/yizhongw/self-instruct

OR this https://github.com/togethercomputer/RedPajama-Data

Check this model out for good conversational capabilities (AI Waifu ready)
https://huggingface.co/PygmalionAI/pygmalion-7b

Limited Memory? Still want to run Machine god? No worries, all state secrets here

https://github.com/oobabooga/text-generation-webui/blob/main/docs/Spell-book.md

Smol model List with trusted reviews (imageboard replies)
https://rentry.org/lmg_models

Finetuning LLMs (Most asked on twitter recently)

The Novice's LLM Training Guide

​​[D] Have you tried fine-tuning an open source LLM? : r/MachineLearning  (endorsed by Hardmaru)

Github issues are a great source for learning comp sci

Learn NUMA -
https://github.com/ggerganov/llama.cpp/issues/1437

Learn Mmap -
https://github.com/ggerganov/llama.cpp/issues/91

Learn Loop Unrolling - https://github.com/ggerganov/llama.cpp/pull/1530

Adding code samples to your training data is a good way to slightly increase an LLM’s reasoning capabilities
https://arxiv.org/pdf/2210.07128.pdf

Confirmed by Teknium as well -
https://twitter.com/Teknium1/status/1658123329763409921?s=20

(yes I trust anons on twitter more than a paper)

In case you want to run AGI on your iPhone

https://github.com/mlc-ai/mlc-llm/blob/main/ios/README.md

In case you want to generate instruction data from LLM itself

https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt

Estimate memory required for training

https://blog.eleuther.ai/transformer-math/

SD QR codes

https://learn.thinkdiffusion.com/creating-qr-codes-with-controlnet/

GPU inference

Fast GPU based PyTorch model serving in 100 lines of Python | by Nikolaj Goodger | Medium

https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution

GitHub - replicate/cog: Containers for machine learning

Want to understand how transformer inference works?

Transformer Inference Arithmetic | kipply's blog

Read this and get close to cloud’s alpha:

https://en.algorithmica.org/hpc/

You don’t know who cloud is? He’s an eastern euro dev who might solve OpenAI’s inference cost issues https://twitter.com/cloud11665 

Relation-extraction near Sota

Babelscape/rebel-large · Hugging Face

OSS LLM pricing estimation

https://together.ai/pricing 

Fast Inference for LLMs

TensorRT-LLM - https://x.com/abacaj/status/1719486649610940615?s=20 (complicated setup)

vLLM + AWQ - https://x.com/HamelHusain/status/1719882599261413840?s=20 (easy setup, python native)

Deepspeed-fastgen - https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen

Python Profiling

https://github.com/plasma-umass/scalene

Kipply’s blog (for when you just need to find something to read)

https://kipp.ly/sept-oct-2023/

For when I actually decide to learn some maths behind ML

https://databookuw.com/databookV2.pdf

Training Recipe (Overfit then Regularize)
https://karpathy.github.io/2019/04/25/recipe/

A good summary of all the techniques for faster inference

https://vgel.me/posts/faster-inference/

Dataset curation

https://github.com/lilacai/lilac

CUDA tutorial

Getting Started With CUDA for Python Programmers

I should implement this

[2402.09171] Automated Unit Test Improvement using Large Language Models at Meta

Running pytorch models on edge

https://pytorch.org/executorch/stable/build-run-mps.html

Background removal

https://huggingface.co/briaai/RMBG-1.4

https://pypi.org/project/transparent-background/

https://github.com/ZhengPeng7/BiRefNet

Flux optimisation

https://github.com/sayakpaul/diffusers-torchao/

Multimodal doc retrieval

Awesome Document AI - a merve Collection

Most popular local models you can use

https://rentry.org/LocalModelsLinks

Pro-tip:

If you are not a shape-rotator, just add shape dimensions in the code comments. Rotation skills should not stop you from taming the machine god. Example -
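A minimal illustration of what that can look like, using NumPy and a made-up `attention` function (not taken from any particular codebase). The only point is the shape comment trailing each line.

```python
import numpy as np


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # q, k, v: (bsz, n_heads, seqlen, head_dim)
    scores = q @ k.transpose(0, 1, 3, 2)                      # (bsz, n_heads, seqlen, seqlen)
    scores = scores / np.sqrt(q.shape[-1])                    # (bsz, n_heads, seqlen, seqlen)
    scores = scores - scores.max(axis=-1, keepdims=True)      # subtract row max for stable softmax
    weights = np.exp(scores)                                  # (bsz, n_heads, seqlen, seqlen)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v                                        # (bsz, n_heads, seqlen, head_dim)


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 3, 8))  # (bsz=2, n_heads=4, seqlen=3, head_dim=8)
k = rng.normal(size=(2, 4, 3, 8))
v = rng.normal(size=(2, 4, 3, 8))
out = attention(q, k, v)           # (2, 4, 3, 8)
```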

Karpathy’s Presentation in MS Build 2023

Hidden alpha -

You need to use large batches for a good model in case of contrastive learning. Reason being it needs to see a lot of diverse samples to form a good intuition.

GEMINI PRO VS GPT-4V comparison

https://arxiv.org/pdf/2312.12436.pdf


Obviously GPT-4V is better but not by a lot (except when it comes to coding). I would simply quickly go over this to figure out some new ideas for side projects. Nothing major tbh.

How to protect PII in data

Answers to Stupid questions (mostly for me)

machine learning - What is a channel in a CNN? - Data Science Stack Exchange

Reading List

[2304.13712] Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond ( survey of Language models, extremely good)

[2305.00050] Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

[2305.00833] Learning to Reason and Memorize with Self-Notes

[2305.01210] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

[2304.13835] Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models

[2303.14177] Scaling Expert Language Models with Unsupervised Domain Discovery

https://arxiv.org/abs/2305.01625/ [Unlimiformer - kNN inside transformers]

https://arxiv.org/pdf/2212.14034.pdf [How far can you train a model on a single consumer GPU]

[2304.01982] Rethinking the Role of Token Retrieval in Multi-Vector Retrieval 

[2305.10425] SLiC-HF: Sequence Likelihood Calibration with Human Feedback

[2305.16291] Voyager: An Open-Ended Embodied Agent with Large Language Models [AI Minecraft player upgraded, much better than previous attempts]

[2306.03341] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Speeding up the GPT - KV cache | Becoming The Unbeatable (old but gold, almost default in every lib)

Predicting Grokking in LLMs

[2306.10209] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

RAGs (Retrieval augmented generation):

[2112.04426] Improving language models by retrieving from trillions of tokens

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

[2208.03299] Atlas: Few-shot Learning with Retrieval Augmented Language Models

Knowledge Retrieval Over Public and Private Data

[2211.09260] Task-aware Retrieval with Instructions

https://arxiv.org/abs/2307.07164/

[2302.00083] In-Context Retrieval-Augmented Language Models

[2209.14290] FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation

[2303.08518] UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation

[2212.10496] Precise Zero-Shot Dense Retrieval without Relevance Labels

Nvidia H100 GPUs: Supply and Demand · GPU Utils ⚡️

https://blog.fireworks.ai/speed-python-pick-two-how-cuda-graphs-enable-fast-python-code-for-deep-learning-353bf6241248 - Cuda Graphs

[2306.02519] Transformative AGI by 2043 is <1% likely - Interesting section on compute limitation pointed out by jeremy howard

Paper page - Large Language Models as Optimizers

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning - Must read since approved by pony anon

[2305.05065] Recommender Systems with Generative Retrieval

https://arxiv.org/abs/2309.06180 - vLLM paper, paged attention, how to scale up LLM inference basically

https://transformer-circuits.pub/2021/framework/index.html

https://en.wikipedia.org/wiki/Moravec%27s_paradox#:~:text=Moravec's%20paradox%20is%20the%20observation,skills%20require%20enormous%20computational%20resources.

[2204.03084] Knowledge Infused Decoding 

[2210.06316] Non-Axiomatic Term Logic: A Computational Theory of Cognitive Symbolic Reasoning

[2303.11366] Reflexion: Language Agents with Verbal Reinforcement Learning

Legacy reading list (borrowed from yacine.ca)

New Reading List + Alpha (Borrowed from yacine)

DreamTuner - ipadapter but different

https://arxiv.org/pdf/2312.13789.pdf - how i beat the big wigs

[2312.09608] Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models - faster stable diffusion by skipping unnecessary bits

Official implementations for paper: Anydoor: zero-shot object-level image customization - instruct edit + https://old.reddit.com/r/StableDiffusion/comments/18kd0na/code_for_anydoor_zeroshot_objectlevel_image/

dont forget - diffusion slider demo https://github.com/Kevin-thu/DiffMorpher?tab=readme-ov-file

https://arxiv.org/pdf/2312.01943.pdf - i should use this for anime

TextDiffuser 2 - a Hugging Face Space by JingyeChen22 - people have been asking for text.. right?

GitHub - open-mmlab/PIA: PIA, your Personalized Image Animator. Animate your images by text prompt, combing with Dreambooth, achieving stunning videos. - video generator

GitHub - cumulo-autumn/StreamDiffusion: StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation - turbo go fast