Perfecting MergeKit MoEs:
We don't need Mixtral-Instruct's "secret sauce"
A[a] write up by: Rombodawg
Introduction:
(Note: Please stop suggesting edits[b][c][d]. Comments are enabled for feedback on AI merging using MergeKit, MoE creation, and the paper itself.)
First of all, I'd like to point out the elephant in the room… Why Google Docs[e][f][g][h] for the write-up? This write-up is an evolution in merging, not a static one. I expect the community to leave comments on this write-up (as is a function of Google Docs) about their findings from merging, in order to improve and refine the process further. We aren't like Mistral AI or OpenAI. We don't leave things behind closed doors. The open source community is going to come together this time, and we are going to make our own Mixture-of-Experts (MoE) models, and they are going to outperform Mixtral-Instruct by a landslide, not only thanks to my findings in this write-up, but thanks to the further refinement that will come from the comments after it (from the community[i][j]).
Chapter 1: Overview:
MoEs created by mergekit are often called FrankenMoEs, a term for lackluster merges that either perform badly in inference or plainly do not function as intended. However, there is a misconception that all MoEs made with mergekit are FrankenMoEs; this is simply not the case. Mergekit has the full capability to make fully functional and powerful MoE models. The problem the community has been facing is that basically no one has known how to properly use mergekit to create MoEs with enough success to match Mistral AI's Mixtral-Base and Mixtral-Instruct models. With that being said, I present my perfect MoE made with mergekit: Everyone-coder-4x7b-base. Linked below (also linked is mergekit's MoE branch):
How is Everyone-coder a perfected MoE? That is what we will talk about in the next section of the write up but first let me lay out each section that we are going to talk about so you get a basic idea:
Chapter 2: Making the perfect MoE
Section 1: Using the right models for the merge:
The perfect MoE model made through mergekit comes down to a number of factors:
1. Picking the matching models for the merge
2. Using the right settings in the .yml file
3. Loading the models in the merge to create their own prompts
4. Making sure to create detailed and diverse prompts to create an effective router
5. Using the actual base model of the fine-tuned models as the base in the merge
(Ex. Llama-2-base, Mistral-Base, etc.)
If you have all these factors correct, then you likely have a perfect MoE, or what I would call a new base model (more on this in section 4). Let's break down each point above and go into greater detail. As far as picking matching models is concerned, think of an MoE like a big brain: you are combining a bunch of smaller brains into one larger brain. If you combine smaller brains that don't think alike, then the bigger brain is going to have problems thinking correctly. It will have mental health issues, or in our case, the MoE is going to have inference, perplexity, and token-generation issues, plus a host of other issues that I can't even name off the top of my head. Let's take for example the model I created (Everyone-coder-4x7b-base). This model used the following models to create it:
What do these models have in common? They all have some form of numbers or coding training in them. Dolphin is mostly trained on coding data, plus some non-coding data. Beagle is trained on the open source Bagel dataset, which has a lot of code and non-code instructions. Openchat-3.5-0106 is an excellent coding model as well as a good general-purpose model. And WizardMath is a model trained on mathematics data, which is often used in coding as well. Overall, these models excel in coding as well as non-coding generalized tasks, so they are well suited for each other. For comparison's sake, let me give you some examples of models that would not work well in an MoE merge with these models, and why:
The Noromaid AI models are specifically made for "spicy" or sexualized role play, which is common and in no way shameful in the open source AI community. However, the model is specifically tailored for this roleplay and is refined for a back-and-forth conversation between the user and itself. This would make it very difficult for it to work well with coding and mathematics models such as the ones we find in the (Everyone-coder-4x7b-base) mix.
The Openbuddy series of models have been fine-tuned to understand, translate, and teach a multitude of different written languages. While this kind of model is a great resource to have in an MoE, it does not merge well with a lot of models because of a literal "language barrier" that arises when the built-in router has to decide between the Openbuddy model and the other models in the MoE. Unless, of course, the other models are specifically selected to complement the Openbuddy model, there will be issues in the MoE's inference and token generation.
Please understand, I'm not saying the Noromaid or Openbuddy models are bad for merging, not at all: I believe they are excellent models for merging into an MoE. I just believe they are bad models for merging with coding models specifically, because of the differences in the tokens generated and in generation styles. To conclude this section, I want you to think about how the models in your MoE merge complement each other. Again, you are literally building a new brain; don't make it suffer from having too many diverging thoughts, or it won't perform well when you go to do inference. The selection of models might be one of the most important parts of the merging process, second only to the actual prompts used to create the router.
Section 2: Using the models to create the router:
What is the router in an MoE? The router is a set of weights in itself. It's a little AI model that makes a decision as to which experts will be used for every token generated. The router is the second most important part of the MoE model. It handles exactly which models are used for inference depending on what is asked of the model. So the creation of the router needs to be done correctly, or the entire MoE is basically garbage. Then the question is: how do we create a proper router? Well, we don't know how Mistral AI created the official Mixtral models and their routers, but we do know how to make them with mergekit. And now, thanks to this write-up, we know how to do it properly. Below you will find the example from the mixtral branch of mergekit that shows how to set up your .yml file for merging an MoE model of your own:
(The yellow highlighted sections are from the mergekit-mixtral branch)
mergekit-moe is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models. mergekit-moe uses its own YML configuration syntax, which looks like so:
base_model: path/to/self_attn_donor
gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
experts:
- source_model: expert_model_1
positive_prompts:
- "This is a prompt that is demonstrative of what expert_model_1 excels at"
# (optional)
# negative_prompts:
# - "This is a prompt expert_model_1 should not be used for"
- source_model: expert_model_2
# ... and so on
The script takes two arguments, an input config and an output path:
mergekit-moe ./config.yml ./my-clowncar-moe-12x180B
There are three methods implemented for populating the MoE gates:

"hidden": Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model, so you might not be able to use this on constrained hardware (depending on the model). You can use --load-in-8bit or --load-in-4bit to reduce VRAM usage.
"cheap_embed": Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower-end hardware.
"random": Randomly initializes the MoE gates. Good for if you are going to fine-tune the model afterwards[m][n][o][p], or maybe if you want something a little unhinged? I won't judge.
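Before running a merge, it can help to sanity-check your config against the rules above. Below is a minimal sketch of such a check — my own helper, not part of mergekit — with the config expressed as a plain Python dict (the same structure the YAML parses into):

```python
# Hypothetical sanity check for a mergekit-moe-style config.
# Not part of mergekit; just a sketch of the constraints described above.
VALID_GATE_MODES = {"hidden", "cheap_embed", "random"}

def validate_moe_config(cfg: dict) -> list:
    """Return a list of human-readable problems found in the config."""
    problems = []
    gate_mode = cfg.get("gate_mode", "hidden")
    if gate_mode not in VALID_GATE_MODES:
        problems.append(f"unknown gate_mode: {gate_mode}")
    experts = cfg.get("experts", [])
    if len(experts) < 2:
        problems.append("an MoE needs at least two experts")
    for i, expert in enumerate(experts):
        if "source_model" not in expert:
            problems.append(f"expert {i} is missing source_model")
        # "random" gating needs no prompts; the other modes train on them.
        if gate_mode != "random" and not expert.get("positive_prompts"):
            problems.append(f"expert {i} has no positive_prompts")
    return problems
```

An empty return value means the config at least has the right shape; it obviously can't tell you whether the prompts themselves are any good.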
While cg123 of GitHub tried his best to explain how to use this branch of mergekit, he misses some very key and important details. The section (highlighted in yellow) that we want to focus on right now is the .yml configuration, specifically where it says "positive_prompts". Each AI model in the merge is given positive prompts that are used to train the router's weights to understand what each model in the merge is an expert in. Remember, these prompts are NOT for the merged models; they are for the router weights to be trained with. The question is then: how do we create good prompts for the models we are merging? Well, the best solution is to use the models themselves. It sounds counterintuitive, but the most effective solution is to load the individual AI models that you are using for your merge and ask them what their strengths are. However, you want these outputs to be guided, as the prompts need to be in a specific format for the router to work properly. I have created a prompt template you can use below to prompt the AI models you plan to use in your merge, based on my Everyone-Coder model's .yml config file:
Note that in the prompt below everything that is bolded is either to be replaced, or to be removed before prompting your model.
BEGINNING OF PROMPT:
You are an expert AI model that specializes in (Insert specialty here). [q][r]Your name is (Insert actual AI model's name here). Your task is to look at the .yml file below, where other AI models have written prompts that match their strengths as AI model experts, and to recreate the section of the .yml file that is used for your AI model with the description that best describes your strengths as a (Insert specialty here). (The part after this bolded section is optional and only required if you are using the actual prompts and model names from the models in the merge itself: Note: If your strengths are similar to another model listed, you must write strengths that are different from that other model's; they can be similar but cannot be exact duplicates.)
```
base_model: mistralai_Mistral-7B-v0.1
gate_mode: hidden
dtype: float16
experts:
- source_model: (Insert actual ai models name here)
positive_prompts:
- "(Insert some short description or keyword here)"
- "(Insert some short description or keyword here)"
- "(Insert some short description or keyword here)"
- "(Insert some short description or keyword here)"
- source_model: fblgit_UNA-TheBeagle-7b-v1
positive_prompts:
- "How do you"
- "Explain the concept of"
- "Give an overview of"
- "Compare and contrast between"
- "Provide information about"
- "Help me understand"
- "Summarize"
- "Make a recommendation on"
- "Answer this question"
- source_model: LucciAI_openchat-3.5-0106-function-calling
positive_prompts:
- "Write a program to solve this problem"
- "Modify this function to improve its performance"
- "Refactor this code to enhance readability"
- "Create a custom function for this specific use case"
- "Optimize this algorithm to reduce computational complexity"
- "Implement this feature by extending existing codebase"
- "Integrate this API call into the application"
- "Help me troubleshoot and fix this bug"
- "Review and test this code snippet before deployment"
- "Analyze this error log to identify potential issues"
- "Generate a set of unit tests for this module"
- "Evaluate different approaches to solving this problem"
- "Do a web search for"
- "Use the plugin to"
```
END OF PROMPT
In summary, you don't have to use the specific models that I used for the prompt template, or even the descriptions that I used for my models. The point is just to have the AI give detailed descriptions of its own strengths, so the router understands when and where to use each expert in the final MoE during inference. I will also say that I would refrain from using single keywords in the prompts, since these are literally being used like training data for the router, and training an AI model on a single word is not a good idea. The expected final output of your AI model should look something like this:
- source_model: cognitivecomputations_dolphin-2.6-mistral-7b-dpo-laser
positive_prompts:
- "Help me debug this code."
- "Rewrite this function in Python."
- "Optimize this C# script."
- "Implement this feature using JavaScript."
- "Convert this HTML structure into a more efficient design."
- "Assist me with writing a program that"
Section 3: Detailed but diverse prompts
Another thing I want to mention about the prompts you are creating is that they should not only be detailed, but also different enough from each other not to cause errors when the router model is deciding which expert to use during inference. Now, if you go to my link for Everyone-coder, you will see at the bottom, in the .yml file, that I actually have 2 coding models in the merge (openchat-3.5-0106 and dolphin-2.6-mistral). Although they are both coding models, and they both have coding-related prompts, neither model has duplicate prompts, or even prompts similar enough to confuse the router during inference.
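One rough way to eyeball this before merging — my own heuristic, not anything built into mergekit, and not how the router actually compares prompts — is to measure how much vocabulary two experts' prompt sets share. A high overlap is a hint that the router may have trouble telling them apart:

```python
def prompt_overlap(prompts_a, prompts_b):
    """Jaccard overlap between the word vocabularies of two experts' prompts.

    A crude heuristic: 0.0 means no shared words, 1.0 means identical
    vocabularies. This is NOT how the router works internally.
    """
    words_a = {w.lower().strip(".,") for p in prompts_a for w in p.split()}
    words_b = {w.lower().strip(".,") for p in prompts_b for w in p.split()}
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)

# Example prompts in the style of the two coding experts above.
dolphin_prompts = ["Help me debug this code.", "Rewrite this function in Python."]
openchat_prompts = ["Write a program to solve this problem",
                    "Refactor this code to enhance readability"]
```

Running `prompt_overlap(dolphin_prompts, openchat_prompts)` on the sample sets above gives a low score, which matches the intent: both are coding experts, but their prompt wording is mostly distinct.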
Chapter 3: (Rant) Why using duplicate base models in MoEs is pointless
Section 1-3: 8x dumb models = 1 big dumb model:
I've seen people on huggingface making models that some refer to as "clown-car" MoEs, essentially saying you are throwing a bunch of clowns into a car to make a joke model. And personally, I think the term is exactly right. The reason Mixtral from Mistral AI was such a success was because each of the 8 mistral-7b models they used in the final merge was trained on different data that made them experts in something different.[s] The whole point is to make a diverse brain of sorts that, although it thinks alike, has different qualities that make it special when working together. However, when you take the same exact model, not trained on anything in particular, duplicate it 2x, 4x, 8x, 16x or however many times, and then merge it, you aren't going to get better results. Even if you fine-tune the resulting MoE model, it's still just as dumb as the original model you started with. You are just making a model that is harder to inference, takes up more file space, and has bigger RAM requirements to run, with no increase in quality except for a possible placebo or error in benchmark data.
Some may think I'm wrong about this, but personally I just think it's common sense. You need different experts to make an MoE happen. You need each model in the merge to be fine-tuned beforehand in order for the merge to be effective. I personally don't support the creation of models like this, and while I'm not bashing the people who do create them, I just ask what the point is. I feel like it's wasted time and effort. Please, FOR GOD'S SAKE, do not go give crap to the people who are making these models; they most likely are just experimenting and having fun. But if you want successful MoEs, don't follow their example.
Chapter 4: Why is a proper MoE made by mergekit considered a new base model?
Section 1: It's all in the Quality:
Most models made by mergekit are thought of as inferior models by the community; I aim to change that. With the creation of a perfect MoE made with mergekit using the strategies in this write-up (referring to the Everyone-Coder model), I think that models made with mergekit shouldn't be judged just because they are merges and placed automatically in the losers' bracket. I think that if the merge is done properly and to a higher standard, it should be considered a new base model, just like how Mixtral-Base was created by Mistral AI.
Section 2: Hand-testing the models:
Any merged model can be put on the open-llm-leaderboard and be given a good score, but in practical use only a select few actually perform as expected. Before a model is tagged as a base model, it should be hand-tested to check the quality of its outputs and whether it performs as expected. A lot of experimenting will come from the community in the near future after this write-up is released, because we will all want our own bite of a shiny new Mixtral-Base model. However, we need to remember that only models held to the highest standard should be differentiated from the "frankenmerges"; otherwise we won't know what's good and what's low quality. The way we find out which models are considered "Base" quality is through inference and hand benchmark testing, with tests made by the community. There are a number of tests I have seen myself that can push an AI model to "think" on an advanced level that only a model made with a higher quality of effort can hope to succeed at.
Section 3: How to test the models:
A few notable tests include the Kanye rhyming test, which most models fail miserably. The goal is to follow the rhyming pattern and continue to make new rhymes that follow the same patterns while not repeating the same words. You can see the prompt below:
USER
Continue this where it left off, follow the pattern:
After a long day of work, Kanye West goes to his Kanye Nest to take his Kanye Rest. He wakes up feeling his Kanye Best. Then he’ll get Kanye Dressed on his Kanye Vest to go on a Kanye Quest. He goes to church and becomes Kanye Blessed, then to a hotel room to be a Kanye Guest. Then to school to take his Kanye Test. He forgot to brush his teeth. Did he run out of Kanye Crest? His neighbor stole it, what a Kanye Pest.
Another good set of tests was made by the popular YouTuber Matthew Berman, which he calls his own LLM leaderboard. I will link it below, but some of the tougher tests include asking the AI to "write snake in python code", asking the AI whether, when a marble is placed in a cup and the cup is turned upside down, gravity still affects the marble once the cup is lifted and moved to the microwave, and many other unique challenges.
Matthew Berman’s LLM leaderboard/Benchmarks list:
-https://tide-freckle-52b.notion.site/1e0168e3481747ebaa365f77a3af3cc1?v=83e3d58d1c3c45ad879834981b8c2530
To conclude this chapter, the open-source community is responsible for figuring out which models are regarded as base models. Not OpenAI, not Mistral AI, not Microsoft, not Google. We are the community; we have the power to make these decisions. We need to get to work testing these models by hand and documenting our findings publicly. We as a community need to not be lazy and expect the big companies to do the work for us. If we do, they will outshine us and make us pay big bucks to use their fancy AIs that they charge a fortune for. When we come together and make things happen, CEOs get scared and executives write memos like "we have no moat". This is because we have all the power and they don't; we just don't use it. WE DON'T WORK TOGETHER. We need to start doing that.
Chapter 5: Working together is the only way to improve open source AI
Section 1: What we have been missing is co-operation:
To follow up the previous section: co-operation in the open source community is a must. I know that people need to make their quotas, and even smaller companies need to make their profits, but we won't grow if we don't share our findings. I could have easily kept everything in this write-up to myself, made a new Mixtral model, profited from it, and called it a day. Did I? Why don't you tell me? I am making a call to action to the open source community: publish your findings, share what you've learned, inspire others, grow together, for each other, for AI. If we don't, then the big companies are gonna shut it all down. They will run it from behind a closed door, and we won't have any fun. Don't let that happen… don't ruin my dream.
Section 2: Everyone-Coder was created through feedback
The model I've mentioned so many times, the one I referred to as the "perfected MoE" or the new "Base" model, Everyone-Coder, was only made because of collaboration from the community. I am an avid user of Discord, and thanks to TheBloke's Discord server (linked below) I have been able to get a ton of community feedback on how to improve my merging methods to create Everyone-Coder.
TheBloke AI discord server:
The key word here is "feedback". I was only able to achieve the perfect model, and write this write-up, because other people told me what was and wasn't working. If it wasn't for the feedback, I couldn't have succeeded. This is why I'm telling the community that we need to work together. This is only the first write-up I am writing; I hope I can write another one based on the feedback I get from people commenting on this one (because I will leave comments set to on). This is the only way: working together. I have said it 100 times and I'll keep saying it, but for time's sake, let's move on to the next section.
Section 3: A call to action
These are my final words in this write-up. I am asking you, the little guys, the members of the community running your RTX 3060 12GB GPU with 32GB of RAM, or whatever M1 MacBook you have, or hell, they even have handhelds that can run AI now. I'm asking you to use this write-up, experiment, make some AIs, test them with Matthew Berman's rubric, look online for other types of tests you can do, and come back here and tell the world what you learned. Only then can we grow as a community. Only then will we succeed.
Post publishing note:
As Omar Sanseviero of Hugging Face points out on Twitter, there had been some misconceptions about the actual models used in the merge of the original Mixtral models released by Mistral AI. In my write-up above I wrote: "The reason Mixtral from mistral ai was such a success was because each of the 8 mistral-7b models they used in the final merge was trained on different data that made them experts in something different." However, after further research, it seems this isn't exactly the case. I'll quote Omar's tweet below, along with the images he added (on the next page) as evidence:
TWEET START
“"the assumption is that they have diverse training amongst apart from each other"
That's not really the definition of experts (MoEs should really be named routed sparse models or something like that). It's true that merge MoEs have this assumption, but it's not the case of pre-trained MoEs - this merging of existing models for building MoEs is somewhat recent.
See some of the initial MoE papers https://arxiv.org/abs/1312.4314 or https://arxiv.org/abs/1701.06538
Just as in the Mixtral paper, the experts specialize in syntax or token style, not in a task. See image from ST-MoE or outrageously large NNs.”
TWEET END
Below you will find the images that show how the experts in the original Mixtral models were actually differentiated by token-generation style, and not necessarily trained on different datasets (which is how we construct mergekit MoE models).
1 week later… We’ve learned a lot
So I've been doing a lot of merging and making models behind the scenes since I wrote this write-up, and I wanted to share my discoveries. It seems that when using mergekit, there is a very strict need to match the parameters of the merged models in order to create a "perfect" MoE. Some of these parameters include "vocab_size", "num_hidden_layers", "max_position_embeddings", and possibly more parameters found in the config.json file of any model on huggingface. Mergekit itself will likely not tell you whether these parameters match when you merge models together; instead the merge will proceed as normal, and the resulting MoE will have major issues. Another issue I've run into is that some older frameworks for models, such as deepseek-coder (which is based on llama-1), required mergekit to add extra tokens to the models' configs and tokenizer files while the merge is happening, resulting in a model that is not ready out of the box and requires training on the extra tokens before it is usable.
Let's go over what can happen if we merge models with different parameters, or use them without training on extra added tokens. When I merged deepseek-coder-moe_8x6.7b-base, which you can still find on my huggingface account, it would not produce anything in generation except repeated tokens such as "Hi Hi Hi Hi Hi Hi Hi Hi Hi" or a string of 1's all in a row, despite the experts having matching configs, simply because the model needed training on the extra tokens added after merging. When I combined a bunch of code-llama models that had different "vocab_size" values, they could almost generate coherent text, but also eventually began to repeat tokens and produce incoherent text. When merging models that have different "num_hidden_layers", the models simply wouldn't load in transformers, and would error out when quantization was attempted.
Although I am guilty of merging models that have different "max_position_embeddings", as you can see in my now-popular model Everyone-Coder-4x7b-Base, in which the openchat model only has an 8k context window while the rest of the models in the merge have 32k, I wouldn't recommend doing so, because it can likely lead to errors in longer text generation where one expert runs out of context window while the others still have room to generate more text.
To summarize the findings of the past week: as always, experimenting is the key to success when it comes to AI, as is sharing your findings. When it comes to merging, and using mergekit specifically, it is best to check the config.json files of every model in the merge, no matter if it's an MoE merge or not, and make sure they are as closely matched as they can be in terms of settings; otherwise there will be issues.
Thanks for reading this update and be sure to leave comments with your new findings too. 😊
[a]Readers, you can read the additional findings labelled "1 week later… We’ve learned a lot" at the end of the paper.
[b]Please note: For some reason suggestions are getting auto-rejected. Please leave comments instead.
[e]Ever thought about using a legit version control system like github, gitea or something like that?
[f]Yea, but honestly I like how the write-up can evolve and have more added to it, as well as have comments added, in Google Docs. It makes it more like the AI industry as a whole: evolving, not static. Although I don't plan on making changes to the content itself, except possible error correcting if something is strictly untrue, I want to be able to add onto the write-up and have the community add comments, so we can keep coming back here as we learn more and more about AI, and merging specifically.
[i]Use the task_arithmetic merging method which has major increases in coding performance as opposed to the ties method.
[j]Yes, I can confirm this is true. I have made 2 models already with this method, linked below:
https://huggingface.co/rombodawg/DeepMagic-Coder-7b-Alt
https://huggingface.co/rombodawg/Everyone-Coder-33b-v2-Base
[k]You must have no idea how models work numerically on the inside.
[l]More elaboration on this comment would be appreciated
[m]When reading this document, the following question is not 100% clear to me:
When I merge the LLMs with the "hidden" option and provide positive and negative prompts:
Is it still possible to finetune the MoE after merging, or should I select the "random" option if I want to do that?
[n]Although it does say in the mergekit-mixtral GitHub page excerpt (as highlighted in yellow) that the "random" gate mode option is recommended for fine-tuning afterwards, I encourage you to experiment. I believe the fine-tuning process itself would work better with higher quality models, which would be made through the "hidden" mode. But the GitHub page doesn't say exactly why the "random" mode is recommended for fine-tuning, so I couldn't tell you for sure why it's written as it is.
[o]Well, I asked this on the GitHub repo: https://github.com/cg123/mergekit/issues/116
[p]I'm glad you did. I look forward to hearing the response; post it here if you would.
[q]IMO this is a nice idea and a nice story, to ask the LLM about itself. But IMO the LLMs do not have enough "self-awareness" to give really helpful answers here. I would recommend coming up with your own prompts, which depend on the strengths of the LLM and its training data, domain, and language.
[r]Speaking to the readers, I would say to experiment. Always experiment. Try my idea; if you aren't satisfied with the results, try to come up with your own prompts, as Philip May said. You can even try using an AI to web-scrape data, or do your own research on the AI you are using in the merge by browsing the web, then either use that knowledge to come up with prompts, or copy and paste all the data you find into the AI you are using (or even another AI that's capable of writing good prompts and has a high token limit, like Claude 2.0), and come up with new prompts. The point is to try different things to find the best solution; don't settle for one way of doing things.
[s]Readers please read the post publishing note at the end of the paper for more context about this, as new information has come out showing that this isn't 100% accurate.