Project Memo “Your_Devin”

It’s funny, but just two months before Cognition Labs gave their demo, my friend Devin and I were at the AGI House hackathon in Hillsborough, hosted by Josh from Coframe. Guess what? We built a code-generation autopilot that writes, deploys, and debugs code in a loop for front-end tasks. We won that hackathon. And what happened next?

These guys at Cognition stole our idea and didn’t even bother to change the name (it’s a joke). Well, I’m not jealous; let them take all the ideas, as long as we build something useful for the public good.

Hmm, so since our idea proved so practical, I thought: why not take it a step further? What is Devin beyond Devin? The Devin beyond Devin is “Your_Devin”.

What does that mean? And why?

Let’s answer the following question: What is the biggest problem with all coding copilots? Hallucinations. But why? Well, just having codebase context awareness isn’t enough to instruct a model to write code for you. You have to push the model further, fine-tune it. But how? How can I fine-tune it on a single codebase?

Well, here comes the Ada-Instruct paper. From a single text corpus, we generate ten high-quality question-answer pairs that serve as our seed training dataset. But is that enough? No: as the next step, we use these benchmarks to instruct a larger LLM to generate a large number of downstream task samples for the small custom model. Since models hallucinate, we run all the generated code snippets and keep only the ones that produce a valid response (a minimal sketch of that filter follows). We then fine-tune this small private model on the resulting dataset and create Your_Devin: a Devin that starts performing much better on tasks related to your codebase…
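
To make the "keep only valid responses" step concrete, here is a minimal sketch of that filter. It assumes the generated tasks sit in a JSONL file with `instruction`/`completion` fields and that each completion is a standalone Python snippet; the file names and keys are placeholders, not fixed design decisions.

```python
import json
import subprocess
import sys
import tempfile


def snippet_runs_ok(code: str, timeout_s: int = 10) -> bool:
    """Return True if the generated snippet executes without raising."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def filter_generated_tasks(in_path: str, out_path: str) -> None:
    """Keep only (instruction, completion) pairs whose code actually runs."""
    kept = []
    with open(in_path) as f:
        for line in f:
            sample = json.loads(line)  # {"instruction": ..., "completion": ...}
            if snippet_runs_ok(sample["completion"]):
                kept.append(sample)
    with open(out_path, "w") as f:
        for sample in kept:
            f.write(json.dumps(sample) + "\n")


filter_generated_tasks("generated_tasks.jsonl", "verified_tasks.jsonl")
```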

Well, it’s not the end.

What would be even better than Your_Devin? How can we generalize from a well-trained copilot of our codebases to a broader set of problems without having to share our codebase training dataset?

We crossbreed Devins by merging fine-tuned models together (the mergekit library comes in handy here); a rough sketch follows.
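
As an illustration of what crossbreeding could look like, here is a minimal mergekit sketch: a linear merge of two hypothetical fine-tuned Devins (the repo names and the 0.5/0.5 weights are placeholders), driven from Python via the `mergekit-yaml` CLI.

```python
# Write a mergekit config for a linear merge of two fine-tuned Devins,
# then invoke the mergekit-yaml CLI on it.
import subprocess

MERGE_CONFIG = """\
# Hypothetical repo names; replace with your own fine-tuned checkpoints.
models:
  - model: alice/your-devin-7b
    parameters:
      weight: 0.5
  - model: bob/your-devin-7b
    parameters:
      weight: 0.5
merge_method: linear
dtype: bfloat16
"""

with open("crossbreed.yml", "w") as f:
    f.write(MERGE_CONFIG)

# mergekit-yaml takes the config path and an output directory for the merged model.
subprocess.run(["mergekit-yaml", "crossbreed.yml", "./merged-devin"], check=True)
```

Linear merging only makes sense when both Devins were fine-tuned from the same base model; mergekit also offers SLERP, TIES, and other methods worth trying once we crossbreed at scale.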

Will it work? Not necessarily. Most of the time, it won’t. But who cares? If even one crossbreed out of a hundred performs better, we’ll be happy!

So, imagine now a world where every software engineer can create a Devin and then crossbreed them to produce a better, golden Devin! 😁 It’s a world of decentralized autonomous models which, by a law of natural selection, can hopefully pose a challenge to the huge trillion-parameter generalist models produced by big tech. 😊

How does it work?

Stack to be used based on the list of judges:

  • Mistral - base model for further fine-tuning
  • Hugging Face - storing your fine-tuned model, datasets
  • Nous Research - synthetic data generation
  • Axolotl - fine-tuning of various AI models
  • Azure - inference for Mistral Large
  • MongoDB - vector database for embeddings
  • mergekit - for merging pre-trained language models
  • Arize - observability for the app

TBD:

  • OctoAI - hosting and inference of your model
  • Fireworks - inference, function calling
  • Brave - Web search API (for user tasks to search/scrape docs)
  • OpenPipe - Fine-tuning for developers (end-to-end, train, deploy, eval, observe)


Your_Devin script flow:

  • Give access to GitHub so the app can pull all deployed repos, or a specific project [7]
  • Auto-select your code snippets as benchmarks [7]
  • Selecting Code Samples: relevance, variety, quality
  • Structuring Your Samples: prompt, completion
  • See if benchmarks pass tests (the code actually works as intended) [9]
  • Generate synthetic instructions based on the benchmarks (user-assistant style) [1]
  • Load the instructions into Mistral Large with a few-shot prompt so it becomes an instruction generator [2]
  • Generate a large volume of task-aligned instructions, i.e. synthetic data (see the generation sketch after this list) [3]
  • Use these instructions for downstream task training, enhancing model performance on specific applications: fine-tune your personal Mixtral 8x7B model (see the fine-tuning sketch after this list) [4]
  • Adapt mergekit to be able to merge your model with other users’ models [10]
  • Store your code in a vector DB for future fine-tuning [5]
  • Observe the app with Arize Phoenix [6]
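
A minimal sketch of the instruction-generation steps [2]-[3], assuming the seed benchmarks are stored as a JSONL file and using the plain Mistral Python client as a stand-in for the Azure-hosted Mistral Large endpoint (the file names, prompt wording, and output format are all placeholders).

```python
import json
import os

# Stand-in client: on Azure the endpoint and client setup differ.
from mistralai import Mistral

# Few-shot seed pairs extracted from your codebase (hypothetical file from the benchmark step).
with open("seed_benchmarks.jsonl") as f:
    seeds = [json.loads(line) for line in f][:10]

few_shot = "\n\n".join(
    f"Instruction: {s['instruction']}\nSolution:\n{s['completion']}" for s in seeds
)
prompt = (
    "You write coding tasks for a specific codebase. Here are example tasks "
    "with their solutions:\n\n" + few_shot +
    "\n\nWrite 5 new, different tasks in the same style, each as a JSON object "
    'on its own line with keys "instruction" and "completion".'
)

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
resp = client.chat.complete(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": prompt}],
)

# Naive parsing; production code would validate and retry on malformed output.
with open("generated_tasks.jsonl", "a") as out:
    for line in resp.choices[0].message.content.splitlines():
        line = line.strip()
        if line.startswith("{"):
            out.write(line + "\n")
```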
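
And a sketch of the fine-tuning step [4]. The memo plans to use Axolotl; purely as an illustration, this uses plain Hugging Face peft + trl QLoRA on a Mistral-7B stand-in. Argument names follow the trl 0.7-era SFTTrainer API and may differ in newer releases; the dataset is assumed to be a JSONL file with a single "text" field (instruction plus solution concatenated).

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

BASE = "mistralai/Mistral-7B-v0.1"  # stand-in; swap for Mixtral 8x7B if you have the GPUs

# Load the base model in 4-bit (QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

# Verified synthetic tasks from the filtering step; each row holds a "text" field.
dataset = load_dataset("json", data_files="verified_tasks.jsonl", split="train")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="your-devin-lora",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()
trainer.model.push_to_hub("your-org/your-devin-lora")  # hypothetical repo name
```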

User experience:

  • Register in some UI (everything can live in a VS Code extension), connect your GitHub, and start creating:
  • A benchmark
  • Instructions
  • Dataset
  • Fine-tuned model
  • Deploy
  • VS Code extension with a sidebar chat that generates code snippets

Demo prompt: “I’m building a RAG app for my company and would like to set up a MongoDB vector database.”
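
For reference, the kind of snippet Your_Devin’s sidebar chat might return for this prompt could look like the sketch below. It assumes a MongoDB Atlas cluster with a vector search index named "vector_index" already created on the "embedding" field, and uses sentence-transformers for embeddings; both choices are illustrative.

```python
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

# Placeholder connection string; use your own Atlas credentials.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["rag_demo"]["docs"]
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Ingest documents together with their embeddings.
docs = [
    "MongoDB Atlas supports vector search via the $vectorSearch stage.",
    "Your_Devin fine-tunes a small model on your own codebase.",
]
collection.insert_many(
    {"text": t, "embedding": embedder.encode(t).tolist()} for t in docs
)

# Query: embed the question and run an approximate nearest-neighbour search.
query_vec = embedder.encode("How do I set up vector search?").tolist()
hits = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vec,
            "numCandidates": 100,
            "limit": 3,
        }
    },
    {"$project": {"_id": 0, "text": 1}},
])
for hit in hits:
    print(hit["text"])
```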

Other notes:

Include citations:

@article{cui2023ada,
  title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},
  author={Cui, Wanyun and Wang, Qianle},
  journal={arXiv preprint arXiv:2310.04484},
  year={2023}
}

@misc{Genstruct,
  title={Genstruct},
  author={euclaise},
  url={https://huggingface.co/NousResearch/Genstruct-7B}
}

What to do next:

  1. Check other synthetic data generators based on the Eugene paper:
      • Self-Instruct: https://github.com/yizhongw/self-instruct
      • Generator overview: https://arxiv.org/html/2403.04190v1
  2. OpenPipe: can we call an API to load a dataset and start a training job?
  3. Check the Colab notebook with QLoRA
  4. Is there a simple API call to upload a model to Hugging Face?
  5. Try implementing the Ada-Instruct GitHub repo (stuck on the model step)
  6. Try implementing Genstruct (stuck on an accelerate error)
  7. Azure Mistral Large model
  8. Try using the sponsors’ fine-tuning APIs
  9. Deploy the model to OctoAI / Fireworks / Hugging Face
  10. Use MongoDB for training dataset storage?
  11. Connect to Arize to observe the performance of the new model
  12. Connect GitHub to auto-select coding snippets
  13. Package it into an extension