1 of 50

Learnings from building an AI tutor using GPT-4

Aman Dalmia

2 of 50

HyperVerge Academy (HVA)

Many young Indians from under-resourced backgrounds struggle to find decent-paying jobs due to a lack of skills and network.

HVA is a platform that brings together working professionals to mentor motivated learners so they can get credible, well-paying jobs in the tech industry.

3 of 50

What’s the Problem?

Learners tend to do things without a proper understanding of why they are doing what they are doing. Asking the right questions bridges this gap and gets them to think critically.

A mentor plays this role by:

  1. Asking the right questions
  2. Identifying the learning gap
  3. Providing pointed and actionable feedback

Learners are bottlenecked by how much time a mentor can give them. In the limited time they have, a mentor can ask only so many questions.

4 of 50

What If?

What if there was a tool that could:

  1. Ask questions on a particular topic
  2. Evaluate the answer shared by the user
  3. Provide pointed, actionable feedback on what they did well, and what could be improved, without giving away the answer?

This could potentially:

  1. Identify learning gaps that a mentor may have missed
  2. Optimise the mentor’s time spent with the learners

5 of 50

Enter: SensAI

  • Lack of awareness of the gaps in their understanding until they err in an interview
  • Lack of regular, personalized feedback on every concept
    • Mentors only spend 3-4 hrs a week with a group of 5-6 students
    • Most programming assessment tools only give binary correct/wrong feedback
  • Time wasted finding the right video to learn from, in their local language, that works for them
  • The illusion of having understood a concept just because they finished watching a video, without actually implementing what they learnt
  • Inability to get all their doubts resolved in a way that encourages them to think without giving away the answer

SensAI addresses these through three pillars: Assessment, Content, and Doubt solving.

6 of 50

SensAI Assessment Demo

7 of 50

SensAI English Demo

8 of 50

SensAI Content Prototype Demo

9 of 50

SensAI Data Captured in Google Sheets

10 of 50

Metrics that SensAI Enables Us to Track

Progress Metrics

  • Attempted questions: 81%
  • Completed questions: 77%

11 of 50

What is the impact we are seeing so far?

Case 1: Ability to identify learning gaps

  • Through this interaction, we were able to notice that the learner struggled with using arrays and objects together
  • This was something the mentor had not identified

12 of 50

More examples of learners realising they had learning gaps

13 of 50

What is the impact we are seeing so far?

Case 2: Step-by-step guidance when a learner is stuck

Note: the system can be improved further so that it does not give away the entire answer.

14 of 50

What is the impact we are seeing so far?

Case 3: Aids in practice and confidence building

“I practice on Sensai after I complete a milestone. It gives me confidence that I know this topic” - Another learner over a call

15 of 50

Building with direct feedback from our learners!

16 of 50

Building with direct feedback from our learners!

Next items in the roadmap:

  • Support for a larger number of questions
  • Not repeating the same feedback over and over

17 of 50

How does it work?

Question generation

GPT-4 takes a topic (e.g. JS), sub-topic (e.g. basics), concept (e.g. operators), Bloom’s level (e.g. remembering) and learning outcome as inputs, and generates a question.
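
Below is a minimal sketch of what this generation step might look like in code. The prompt wording and function name are illustrative assumptions; only the input fields come from the diagram above.

```python
# Minimal sketch of the question-generation step (prompt wording and function
# name are illustrative; only the input fields come from the slide).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_question(topic, sub_topic, concept, blooms_level, learning_outcome):
    system_prompt = (
        "You are a thoughtful interviewer preparing questions for a learner.\n"
        "Generate exactly one question matching the inputs delimited by ``` "
        "characters. Return only the question."
    )
    user_prompt = (
        f"Topic: ```{topic}```\n"
        f"Sub-topic: ```{sub_topic}```\n"
        f"Concept: ```{concept}```\n"
        f"Bloom's level: ```{blooms_level}```\n"
        f"Learning outcome: ```{learning_outcome}```"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic, as recommended later in the deck
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(generate_question("JS", "basics", "operators", "remembering",
                        "Recall what the arithmetic operators in JavaScript do"))
```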

18 of 50

How does it work?

Router

GPT-4 takes the topic, sub-topic, concept, Bloom’s level, learning outcome and the user query, and classifies the query into one of four intents: answer, clarification, irrelevant, or miscellaneous.
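
A minimal sketch of the router in code. The prompt wording is an assumption; the four intents are from the slide.

```python
# Minimal sketch of the router: classify the user's query into one of the
# four intents. Prompt wording is an assumption; intents are from the slide.
import json
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "You are the router of an AI tutor. Given a question and the user's "
    "query, classify the query as exactly one of: answer, clarification, "
    "irrelevant, miscellaneous.\n"
    'Respond with JSON only, e.g. {"type": "answer"}.'
)

def classify_query(question: str, user_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": (
                f"Question: ```{question}```\nUser query: ```{user_query}```"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)["type"]
```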

19 of 50

How does it work?

Personalized feedback (1)

Given the topic, sub-topic, concept, Bloom’s level and learning outcome, GPT-4 first works out its own solution to the question.

20 of 50

How does it work?

Personalized feedback (2)

GPT-4 then takes its own solution and the user’s answer (along with the topic, sub-topic, concept, Bloom’s level and learning outcome) and produces an evaluation score and feedback.
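
A sketch of this two-step chain in code. The prompt wording is an assumption; the flow (work out a solution first, then grade against it) is from the two slides above.

```python
# Sketch of the two-step feedback chain: GPT-4 first works out its own
# solution, then grades the learner's answer against it. Prompt wording is
# an assumption; the two-step flow is from the slides.
from openai import OpenAI

client = OpenAI()

def _chat(system: str, user: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content

def work_out_solution(question: str) -> str:
    # Step (1): depends only on the question, so it can be cached
    # (see the speed-tactics slide).
    return _chat(
        "You are an expert tutor. Work out a correct, step-by-step solution "
        "to the question delimited by ``` characters.",
        f"Question: ```{question}```",
    )

def give_feedback(question: str, own_solution: str, user_answer: str) -> str:
    # Step (2): evaluate the learner's answer against the reference solution.
    return _chat(
        "You are an encouraging tutor. Compare the student's answer with the "
        "reference solution. Return an evaluation score and pointed, "
        "actionable feedback on what was done well and what could improve, "
        "without giving away the answer.",
        f"Question: ```{question}```\n"
        f"Reference solution: ```{own_solution}```\n"
        f"Student answer: ```{user_answer}```",
    )
```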

21 of 50

How does it work?

Clarification, miscellaneous

GPT-4

Topic

Sub-topic

Concept

Blooms level

Learning outcome

Feedback

System Prompt

User answer

22 of 50

Speed tactics: streaming + caching (10x boost)

Applied to the personalized feedback chain: the worked-out solution from step (1) depends only on the question context (topic, sub-topic, concept, Bloom’s level, learning outcome), not on the user’s answer, so it can be cached and reused, while the final feedback is streamed to the user token by token.
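
A sketch of both tactics, assuming the cache key is the question itself: the step-(1) solution is memoised, and the feedback call uses the API's streaming mode so the learner starts reading immediately.

```python
# Sketch of both speed tactics: memoise the step-(1) solution with lru_cache
# (it depends only on the question, not the learner's answer) and stream the
# feedback tokens as they are generated.
from functools import lru_cache
from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=1024)
def cached_solution(question: str) -> str:
    # Computed once per question and reused across learners and attempts.
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content":
                   f"Work out a step-by-step solution to: ```{question}```"}],
    )
    return response.choices[0].message.content

def stream_feedback(question: str, user_answer: str) -> None:
    stream = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        stream=True,  # yields tokens as they are generated
        messages=[{"role": "user", "content": (
            f"Question: ```{question}```\n"
            f"Reference solution: ```{cached_solution(question)}```\n"
            f"Student answer: ```{user_answer}```\n"
            "Give pointed, actionable feedback without revealing the answer."
        )}],
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```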

23 of 50

Raise your hands

  • If you understand tokens

  • If you have heard of embeddings and understand what embeddings mean

  • If you understand what prompt engineering is

24 of 50

Tokens


25 of 50

Tokens
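
A quick way to see tokenisation in practice is OpenAI's tiktoken library, which exposes the same tokenizer GPT-4 uses:

```python
# See tokenisation concretely with OpenAI's tiktoken library
# (the same tokenizer GPT-4 uses).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Learnings from building an AI tutor")
print(tokens)                              # list of integer token ids
print([enc.decode([t]) for t in tokens])   # the text piece behind each id
print(len(tokens), "tokens")               # what usage is billed on
```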

26 of 50

Embeddings

Word embeddings
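
A minimal sketch of embeddings in practice: turn two phrases into vectors and compare them with cosine similarity. The model name is OpenAI's embedding model from around the time of this deck.

```python
# Minimal sketch: embed two phrases and compare them with cosine similarity.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-ada-002", input=text
    ).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Semantically similar phrases end up with nearby vectors.
print(cosine_similarity(embed("array of objects"),
                        embed("list of dictionaries")))
```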

27 of 50

What is prompt engineering?

How is GPT-3 trained? Next-word prediction

28 of 50

What is prompt engineering?

Emergence

29 of 50

Prompt engineering crash course

  • Be very specific and clear. Think of GPT-4 as an assistant that will do what you ask it to, as long as your instructions are clear and concise.

  • Correct grammar, punctuation, spacing and paragraphing matter.

  • Put user inputs inside special delimiters like ``` as these are common in programming tasks and GPT-4 is trained on a lot of code, so it recognises that content within ``` ``` is distinct.

  • Instead of “don’t do X”, say “avoid X”

  • Use a new line for every part of the prompt

30 of 50

Prompt engineering crash course

  • Assign a role to the AI model (e.g. “You are a very thoughtful and kind interviewer”). This steers the model to behave the way you want it to.

  • Divide your prompt into a system prompt (how the model should behave) and a user prompt (the specific input it needs to act on)

  • Specify a structured output format that you can easily parse (e.g. JSON)

  • Temperature: 0 for factuality and determinism, 1 for creativity and more randomness
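
The tips above, combined into a single illustrative call: a role in the system prompt, the specific input in the user prompt, a parseable JSON output format, and temperature 0 for determinism. The prompt content itself is an assumption.

```python
# The crash-course tips in one call: role, system/user split, JSON output,
# temperature 0. Prompt content is illustrative.
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # factual/deterministic; closer to 1 for creative tasks
    messages=[
        {"role": "system", "content": (
            "You are a very thoughtful and kind interviewer. Evaluate the "
            "student's response delimited by ``` characters. Respond with "
            'JSON only: {"score": <1-10>, "feedback": "<string>"}'
        )},
        {"role": "user", "content":
            "Student response: ```An array is an ordered list of values.```"},
    ],
)
result = json.loads(response.choices[0].message.content)
print(result["score"], result["feedback"])
```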

31 of 50

Prompt engineering crash course

Chain-of-thought prompting: Give the model time to think

  • The model is biased to give some output as soon as possible.
  • If you directly ask the model to give the final answer, it will likely make a mistake
  • Instead, ask the model to think through the individual steps and only after that produce the final output.
  • This is because of how the attention mechanism works
  • Reduces hallucination drastically

32 of 50

Prompt engineering crash course

Chain-of-thought prompting: Give the model time to think

E.g. specify the individual steps the model should follow before arriving at an answer
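
For instance, a system prompt along these lines (illustrative wording) forces the intermediate steps to happen before the final answer:

```python
# Illustrative system prompt that spells out the intermediate steps the model
# must follow before producing the final answer.
COT_SYSTEM_PROMPT = """\
Evaluate the student's answer by following these steps, in order:
Step 1: Work out your own solution to the question.
Step 2: Compare the student's answer against your solution and note any gaps.
Step 3: Only after steps 1-2, output the final evaluation score and feedback.
"""
```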

33 of 50

Prompt engineering crash course

Zero-shot chain-of-thought prompting: give the model time to think without spelling out the intermediate steps explicitly, and let the model figure them out itself

  • “Let’s think step by step”1
  • “Let’s work this out in a step by step way to be sure we have the right answer”
  • “Take a deep breath and work on this problem step by step”2

34 of 50

Prompt engineering crash course

  • Start with something, test it, identify issues, update the prompt to fix those issues, and keep iterating.

  • If response time is not a constraint, more advanced techniques can help, like self-consistency (generate multiple candidates and take a majority vote; sketched below) and tree of thoughts (play out different futures and pick the one that looks most likely), etc.
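
A minimal sketch of self-consistency, assuming the final answer can be extracted from a fixed "Answer:" line:

```python
# Sketch of self-consistency: sample several answers at temperature > 0 and
# take a majority vote. Works best when the final answer is short and
# directly comparable (e.g. a number or a category).
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(question: str, n: int = 5) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=1,   # diversity across the candidates
        n=n,             # n independent completions in one call
        messages=[{"role": "user", "content": (
            f"{question}\nThink step by step, then give the final answer on "
            "the last line as 'Answer: <answer>'."
        )}],
    )
    finals = [choice.message.content.rsplit("Answer:", 1)[-1].strip()
              for choice in response.choices]
    return Counter(finals).most_common(1)[0][0]
```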

35 of 50

Prompt engineering starter template

System prompt

You are a [adjective] [role] (e.g. a very good and encouraging interviewer). [context] (e.g. you are interacting with students who are preparing for an interview and you need to encourage them to think independently)

You will be given [describe the input] delimited by ``` characters (e.g. topic, question, student’s response)

You need to [describe the task] (e.g. evaluate the student’s response and give actionable feedback)

Important Instructions:

[bullet list of everything to keep in mind while answering the input + all edge cases]

Provide the output in the following format:

Let’s think step by step: {concise explanation (<50 words)}

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{"type": answer | clarification | irrelevant | miscellaneous // type of the query}
```

(Callouts: chain-of-thought prompting; ask for a specific output format)
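
Since the template asks for the output inside a fenced json snippet, the caller has to extract and parse it. A minimal sketch:

```python
# Sketch of parsing the template's output: pull the JSON out of the
# ```json ... ``` fence and load it.
import json
import re

def parse_model_output(text: str) -> dict:
    match = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    if match is None:
        raise ValueError(f"No json block found in model output: {text!r}")
    return json.loads(match.group(1))

output = ("Let's think step by step: the learner is asking the question back.\n"
          '```json\n{"type": "clarification"}\n```')
print(parse_model_output(output))  # {'type': 'clarification'}
```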

36 of 50

Prompt engineering starter template

Additional note

Here, the step-by-step thinking needs to happen before the final output so that the model can use its reasoning to generate a higher-quality response

37 of 50

Some learnings

  • Using the right prompts can go a long way. Don’t think about training or fine-tuning as the first step. Learning prompt engineering is important. Resources at the end of the deck!

  • Start with GPT-4. There is a massive performance difference between GPT-3.5 and GPT-4. If something does not work with GPT-4, it is unlikely to work with GPT-3.5.

  • The cost concerns of GPT-4 can be addressed later. The bigger problem is not providing real value. De-risk that first.

38 of 50

Some learnings

  • Don’t focus on open-source LLMs until you have a working prototype using GPT-4

  • Log whatever is happening (user response, AI response): it helps with debugging, reproducibility and understanding the model’s edge cases, and builds a diverse training/evaluation dataset (a minimal sketch follows this list)

  • Manually look at the logged data regularly to find mistakes and proactively fix them. This is the highest-RoI activity for boosting your performance.

  • Change your mindset from “This does not work” to “This does not work yet”. LLMs are getting more capable by the day.
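
A minimal sketch of that logging suggestion, appending each interaction to a JSONL file. The field names are illustrative.

```python
# Minimal sketch of the logging suggestion: append every interaction to a
# JSONL file for debugging, reproducibility and dataset building.
import json
from datetime import datetime, timezone

def log_interaction(path: str, system_prompt: str,
                    user_input: str, ai_response: str) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system_prompt": system_prompt,
        "user_input": user_input,
        "ai_response": ai_response,
    }
    # One JSON object per line: trivial to grep, replay, and turn into an
    # evaluation dataset later.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```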

39 of 50

Some learnings

  • If you’re building an open-ended chat interface, use the following prompt chain to begin:
    • Intent classification: what is the user trying to do?
    • Handle each intent with a separate prompt (sketched below)
  • Modularize pipeline components as much as possible and cache whatever you can to improve speed
  • Spend time on the OpenAI playground tinkering with different system and user prompts: https://platform.openai.com/playground?mode=chat&model=gpt-4
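
A sketch of that chain, building on the `classify_query` router sketched earlier; the per-intent handler prompts are illustrative.

```python
# Sketch of the suggested prompt chain: classify the intent first, then
# dispatch to a dedicated prompt per intent. classify_query is the router
# sketched on an earlier slide; the handler prompts are illustrative.
from openai import OpenAI

client = OpenAI()

INTENT_PROMPTS = {
    "answer": "Evaluate the student's answer and give actionable feedback "
              "without revealing the solution.",
    "clarification": "Rephrase or explain the question without giving away "
                     "the answer.",
    "irrelevant": "Gently steer the student back to the current question.",
    "miscellaneous": "Respond helpfully, then bring the student back to the "
                     "exercise.",
}

def handle_query(question: str, user_query: str) -> str:
    intent = classify_query(question, user_query)  # from the router sketch
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": INTENT_PROMPTS[intent]},
            {"role": "user", "content":
                f"Question: ```{question}```\nStudent: ```{user_query}```"},
        ],
    )
    return response.choices[0].message.content
```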

40 of 50

Handy trick for debugging

  • Copy-paste the prompts (system and user) for an error case into the OpenAI playground and set all the other parameters (temperature, model name, max_tokens, etc.) to match your API call.
  • The error case should now be reproduced.
  • Ask the model: “Explain your reasoning” (don’t use “Why”, as the model will instantly say that it has made a mistake and correct itself). This will highlight what part of the prompt was causing the mistake. Test different changes to see what fixes the error.

41 of 50

Some learnings

  • Spend time thinking hard about real use cases that can add a lot of value to your beneficiaries, as opposed to trying to mimic what might look fancy

  • Usually, avoid the hammer-looking-for-a-nail situation. But today, we are limited more by our imagination of what is possible; solutions to existing problems might not be obvious to us until we actually play around with the hammer.

  • Being as focused and specific as possible helps reduce hallucination. E.g. using learning outcomes and Bloom’s level for question generation ensures that SensAI generates a relevant question

  • Adding an AI layer can create initial excitement and get people in, but without real value, people will leave. That means all the usual product-building work still needs to be done: good engineering, rapid iteration, talking to users, fixing bugs, etc.

42 of 50

Some learnings

  • Use Langchain / LlamaIndex instead of the raw OpenAI API, as they have very useful and important helper functions
  • Create usable prototypes rapidly in Python using Streamlit (see the sketch after this list)
  • GPT-4 with temperature 0 is close to deterministic. If you find that it is not, and is instead hallucinating sometimes, there is likely an ambiguity not covered by the prompt. How to solve it: identify the errors, debug what is causing the model to fluctuate between responses, and update the prompt to disambiguate that edge case while making sure the change doesn’t hurt other edge cases
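
A minimal Streamlit sketch of such a prototype. The question and prompt are hypothetical; save as `app.py` and launch with `streamlit run app.py`.

```python
# Minimal Streamlit sketch of a tutoring prototype around one question.
import streamlit as st
from openai import OpenAI

client = OpenAI()

st.title("SensAI prototype")
question = "What does the % operator do in JavaScript?"  # illustrative
st.write(question)
answer = st.text_area("Your answer")

if st.button("Get feedback") and answer:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are an encouraging tutor. Give pointed, actionable "
                "feedback without revealing the answer."
            )},
            {"role": "user", "content":
                f"Question: ```{question}```\nAnswer: ```{answer}```"},
        ],
    )
    st.write(response.choices[0].message.content)
```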

43 of 50

Dawn of Large Multimodal Models (LMMs)

Go through all the examples: https://arxiv.org/pdf/2309.17421.pdf

44 of 50


45 of 50

System Prompt vs Knowledge base

System Prompt: specify how you want the AI model to behave

Knowledge base: contains the knowledge about the underlying system that you are letting the user chat with

The knowledge base is the treasure, the AI model is the guide, and the system prompt is the set of instructions for the guide


46 of 50

Recommendations for structuring knowledge base

Note: some recommendations are specific to the Jugalbandi API

The document can be as long as you want, but quality matters more than quantity. Make sure that you’re being crisp and concise in whatever you put into it.

Make sure that you don’t put in contradictory or even duplicate information, as it can confuse the AI model and unnecessarily increase the token count (hence, cost 💰)

Structure it into neatly separated paragraphs, with each paragraph having at most 4000 characters (including spaces, punctuation, etc.)

47 of 50

Recommendations for structuring knowledge base

Using an FAQ format can potentially lead to better performance, but keep one thing in mind: do not put an extra blank line between a question and its answer

48 of 50

RAG explainer
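
The RAG (retrieval-augmented generation) idea in a minimal, self-contained sketch: embed the knowledge-base paragraphs once, retrieve the closest ones for each query, and pass them to the model as context. Model names and prompts are illustrative.

```python
# Minimal RAG sketch: embed the knowledge base once, retrieve the closest
# paragraphs per query, and stuff them into the prompt as context.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-ada-002", input=text
    ).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Knowledge-base paragraphs (placeholders), embedded once up front.
knowledge_base = ["Paragraph one ...", "Paragraph two ..."]
kb_vectors = [embed(p) for p in knowledge_base]

def answer_with_rag(query: str, top_k: int = 2) -> str:
    q_vec = embed(query)
    ranked = sorted(
        zip(knowledge_base, kb_vectors),
        key=lambda pair: cosine_similarity(q_vec, pair[1]),
        reverse=True,
    )
    context = "\n\n".join(p for p, _ in ranked[:top_k])
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Answer using only the context delimited by ``` characters. "
                "If the context is insufficient, say you don't know."
            )},
            {"role": "user", "content":
                f"Context: ```{context}```\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```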

49 of 50

RLHF and fine-tuning

50 of 50

Resources

Prompt engineering

courses from deeplearning.ai

Follow: @bhutanisanyam1, @omarsar0