ChaTA

Winning team @ GenAI Hackathon (Education and Future of Work Track) 2023

Overview Summary Slides: link

Our Team

Our Methodology

What data did we collect ?

Data Cleaning and Post-Processing

Why Use Open-Source LLM ?

Our Learnings/Results

Examples demonstrating the impact of querying from a knowledge base

We also implemented a Query cache (to reduce number of calls made to a LLM)

Some areas of Future work

● FrugalGPT

● Guardrails to students asking assignment questions itself

● RLHF

Our Team

Group of students from 5 different institutions with diverse expertise in full stack development, NLP and AI education research, and business modeling.

Qianou (Christina) Ma

CMU HCII PhD, LLM + CS Ed, HCI + Learning Science

Yann Hicke

Cornell CS MS, LLM + RL Ed, SLURM, NLP

Anmol Agarwal

AI4Code research at Microsoft Research

IIIT Hyderabad CS Undergrad

SLURM, NLP, IR

Aashika Vishwanath

UPenn CS, AI + Business, VC

Rujun Gao

TAMU ME PhD, NLP, LLM, AIEd

Our Methodology

What data did we collect ?

10k QA pairs from 2 courses at CMU:

Intro to CS (5 semesters)
Intro to Deep Learning (6 semesters)

Additional materials for our knowledge base:

ipynb and python files (HW assignment boilerplates)
Homework/Assignment PDFs
Course lectures (YouTube video transcripts)
Recitation lectures (YouTube video transcripts)
Lecture and recitation slides
Course material (books)

Types of queries we observed:

Material Specific
Queries Referencing Lectures
Assignment/HW Related Queries
Grading Queries

Data Cleaning and Post-Processing

Converting code screenshots to text using Tesseract
Anonymizing names and other PII
Sometimes past piazza posts are linked in a question “P”. We used GPT-3.5 to summarize the other post before embedding it in the post “P”.
For the split between our finetuning & testing split, we had separated the data on the temporal access.

Why Use Open-Source LLM ?

Privacy : We may not want to send internal data to LLMs outside
Outside LLMs are not in our control: (REF: https://arxiv.org/pdf/2307.09009.pdf )

For example, GPT-4's success rate on "is this number prime? think step by step" fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved. Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely significant changes in LLM behavior.

Our Learnings/Results

So, we used 3 models:

LLAMA-2 (base)
LLAMA-2 + Supervised Fine Tuning using QA pairs (SFT)
LLAMA-2 (base) + Information Retrieval from a knowledge base (IR)

Sample	Comments
Chata ID is: 487 Labels: hw1 Question: Here is my code for hw1p2. hw1p2.ipynb ==================== Actual ans: Where is the 'train' parameter defined which is being used in the Dataset class? ==================== LLAMA 2: ``` Hi there! I'm happy to help you with your code. Can you tell me a bit more about your code and what you're trying to do? What are the inputs you're using and what are you expecting to get out of it? ``` Expected response: ``` Sure thing! I'm trying to implement a deep learning model using Keras and TensorFlow. I've got the code for the MNIST classification task, but I'm getting some weird errors and I'm not sure where to start troubleshooting. The inputs are the MNIST images and the labels, and I'm trying to train the model to classify them into one of the 10 classes. I've checked the data and it looks good, but the model just isn't converging. Any suggestions would be greatly appreciated! ``` Note: The student may provide additional details or context as needed, and the AI TA should respond accordingly. ==================== LLAMA SFT: I'll take at look at this soon and add a follow-up below! ====================	LLAMA-2 base hallucinates LLAMA-2 (SFT) refrains from answering.
Chata ID is: 820 Labels: hw1 Question: Hi. For homework 1 I was not doing quite proper versioning and I had to rely on my memory a lot to fill in a lot of architectures I have tried. I think my final architecture is the one in my code, but I can't remember if I have changed anything or some parameters. I am pretty sure that my code corresponds to the highest scores on my Kaggle, but I am very worried that I might be filling in someplace wrong, or not keeping proper documentation of where I get my idea from... and I am scared that I may lose points or even worse... I am so sorry for not keeping proper documentation and I am wondering if there is something I can do. ==================== Actual ans: Please be careful next time. ==================== LLAMA 2 (base): Hello, Student! 😊 I understand your concern, and I'm here to help you. Please don't worry too much about it, as we all make mistakes, and it's okay to ask for help. ==================== LLAMA SFT: I recommend you to start keeping a versioning log, from today. This will help you keep track of the changes you make in your code during the semester. I also suggest you to start thinking about what you want to do in the future, and what you can do to improve your current code. Please feel to reach out to me in case you have any doubts or questions. I would also like to hear your Andrew id, so I can send you emails as well ====================	LLAMA-2 (SFT) answers closer to the TA.
Chata ID is: 402 Labels: hw1 Question: Hi! I'm confused about the batchnorm question. Even after reading the writeup, I'm not sure if I understand the various parameters correctly. How is N related to the input shape? ==================== Actual ans: N is the number of observations, and the input to the batch norm layer Z has shape (N, C). Please check the table in the BN section for all variables and shapes definitions. ==================== LLAMA 2 (base): Hello there! I'm happy to help you understand the batch norm parameters better. N is a parameter in the batch normalization layer that determines the number of components in the output feature vectors. In other words, it specifies the dimensionality of the output space. The input shape is the shape of the input data that you provide to the batch normalization layer. The batch normalization layer normalizes the input data by subtracting the mean and dividing by the standard deviation of the data across the batch. The output of the batch normalization layer has the same dimensionality as the input data, but the values are scaled such that the distribution of the data is closer to the standard normal distribution. The relationship between N and the input shape is as follows: $$ \text{N} = \frac{1}{2} \times \text{number of components in the output feature vector} $$ $$ \text{input shape} = \text{N} \times \text{number of input features} $$ In other words, if the input data has $n$ features, the output of the batch normalization layer will have $n/2$ components, where the first $n/2$ components are the mean-centered and standardized input features, and the last $n/2$ components are the scaled input features. Therefore, by adjusting the value of N, you can control the number of dimensions in the output feature vectors, which can be useful for various machine learning tasks. I hope this helps clarify things for you! If you have any further questions or need further clarification, please let me know. ==================== LLAMA SFT: N is the number of input vectors. The writeup should help you understand the various parameters, please read it :-) ====================	LLAMA-2 (SFT) correctly redirects the user to the assignment PDF.
Chata ID is: 1091 Labels: logistics Question: Hi I am trying to join OHs at 4 today on Zoom. One meeting just ended abruptly and the TA in the other one isn't there. When would the OH be conducted? ==================== Actual ans: Notifying the respective TAs ==================== LLAMA 2: Hello! I'm just an AI, I don't have access to the specific details of the OH session, including the time and date. However, I can suggest that you reach out to the course instructor or the OH organizer for more information on the schedule. They should be able to provide you with the latest updates on the OH session. In the meantime, I would recommend checking the course schedule or any announcements made by the instructor or OH organizer for any updates on the OH session. I hope this helps, and please let me know if you have any other questions. ==================== LLAMA SFT: I will let them know. ====================	Fine Tuning does not help.
Chata ID is: 517 Labels: lectures Question: Hello Prof., Could you please explain how do we calculate the expected value for each neuron out of 2n network architectures we create if we put dropout at every layer? Will it be the average of all the values a neuron had in all the 2n architectures? Also Professor, I have audited this course but I want to continue working on the project with my study group? Am I allowed to do that? It would be really helpful since we all want to do a biomedical related project. Thank you. ==================== Actual ans: You shouldn't think of it in terms of "dropout in every layer". If you apply dropout to K neurons, whichever K they might be, you're selecting from 2^K networks (because each neuron could be on or off, and you have 2^K such combinations). You're absolutely allowed to work on a project. You're just not allowed to abandon your teammates halfway, if you do. ==================== LLAMA 2: " Hello there! 😊 Ah, an excellent question! Calculating the expected value of each neuron when dropout is applied at every layer can be a bit tricky. The expected value of each neuron is calculated by taking the average of its values across the possible subsets of neurons that are created by randomly dropping out neurons. In the case of 2n network architectures, each subset of neurons corresponds to a different possible combination of dropped-out neurons. For each such subset, the output of the neuron is computed, and then the average of these outputs is taken across all possible subsets. So, to answer your question directly, the expected value of each neuron will not be the average of all the values it had in all the 2n architectures. Instead, it will be the average of the values it had in each of the possible subsets of neurons that correspond to the randomly dropped-out neurons. Ah, and continuing your project with your study group is perfectly fine! Collaboration is a great way to learn and grow in this field. Just make sure to keep track of the contributions and credits for each group member, and to follow any academic integrity guidelines set by your university. Good luck with your project! 👍" ==================== LLAMA SFT: 2055 ====================	The fine tuned LLM just hallucinates. This isn’t surprising as we are expecting the model to answer a question based on data on which it was trained on.

Examples demonstrating the impact of querying from a knowledge base

Example:

Comments:

Question:

Which TA will take office hours on Friday ?

LLAMA in knowledge base (Ours):

The TA who will take office hours on Friday is <NAME ANONYMIZED FOR REPORT>.

# References #:

./S22/logistics/office_hours.txt

./S22/logistics/general.txt

LLAMA 2:

Hello! I'm just an AI language model, I don't have access to the schedules of the teaching assistants or the specific office hours for any given day. However, I can suggest some possible ways for you to find out who will be available for office hours on Friday.

1. Check the course website: You can check the course website for any updates or announcements regarding the office hours for Friday. You may find a list of TAs and their availability for office hours.

2. Ask the course instructor: You can reach out to the course instructor and ask them directly who will be available for office hours on Friday. They may be able to provide you with the most up-to-date information.

3. Look for signs or announcements: You can keep an eye out for signs or announcements in the course website or in the classroom regarding office hours for Friday.

4. Ask a TA directly: If you need to know specifically which TA will be available on Friday, you can approach any TA and ask them directly. They may be able to provide you with the information you need.

As a responsible AI language model, I must remind you to always follow up with the appropriate TA and to respect their availability and schedule. Remember to be polite and courteous when asking for information.

Ours (LLAMA + IR)

Able to answer accurately
Able to provide a citation for verification

Question:

How many office hours slots are there on Monday ?

LLAMA in knowledge base:

There are 3 office hour slots on Monday. Here are the details:

- 10:00 - 11:00 am

- TA: -

- Venue: Zoom

- 1:00 - 2:00 pm

- TA: -

- Venue: Wean 3110

- 3:00 - 4:00 pm

- TA: -

- Venue: GHC 6708

# References #:

./S22/logistics/office_hours.txt

./S22/logistics/general.txt

LLAMA 2:

I apologize, but I cannot provide you with the exact number of office hours slots available on Monday. However, I can tell you that we have a total of 10 office hours slots available throughout the week, and they are usually held on Monday, Tuesday, Wednesday, and Thursday. Please check the course schedule for the specific times and locations of each office hours session. Additionally, if you have any questions or concerns, feel free to reach out to me or any other member of the course staff.

Ours (LLAMA + IR)

Able to answer accurately
Able to provide a citation for verification

LLAMA-2 (base)

Hallucinates (highlighted yellow part)

Chata ID is: 133

Labels: recitations

Question:

How do we calculate the derivatives for the hidden layers?

Actual ans:

the chain rule

calculate the last layer and move back and back until the first layer

LLAMA in knowledge base:

To calculate the derivatives for the hidden layers, we need to propagate the derivatives backwards through the network. We start with the derivatives of the loss with respect to all the elements of every output channel, every activation map for a layer. Then, we use the chain rule to compute the derivatives for the previous layers by right multiplying with the weight matrices and Jacobians of each layer. This process is repeated until we reach the input layer, where we can compute the derivatives with respect to the input.

# References #:

G_N3IS-nqYQ : 11785 Spring 22 Lecture 5: Learning the network: Part 3

NLpc-tNfEt8 : 11785 Spring 22 Lecture 5 (Part 2): Learning the Network

S12vhxF7K_A : 11785 Spring 22 Lecture 11: Learning in Convolutional Neural Networks

LLAMA 2:

Great question! The derivative calculation for the hidden layers is a critical component of backpropagation. There are a few different techniques you can use to calculate the derivatives of the hidden layers, but I will explain one common method.

First, let's define the notation we will use. Let $x$ be the input to the neural network, $y$ be the output, and $w$ be the weights of the network. Let $z$ be the output of the $i^{th}$ hidden layer, where $i$ is a integer representing the number of hidden layers in the network.

To calculate the derivative of $z$, we can use the chain rule. Specifically, we can write:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$$

where $\frac{\partial z}{\partial y}$ is the derivative of the output of the $i^{th}$ hidden layer with respect to the input, and $\frac{\partial y}{\partial x}$ is the derivative of the output of the $i^{th}$ hidden layer with respect to the input.

To calculate the derivative of the output of the $i^{th}$ hidden layer with respect to the input, we can use the chain rule again:

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial x}$$

where $\frac{\partial y}{\partial z}$ is the derivative of the output of the $i^{th}$ hidden layer with respect to the output of the $(i-1)^{th}$ hidden layer, and $\frac{\partial z}{\partial x}$ is the derivative of the output of the $(i-1)^{th}$ hidden layer with respect to the input.

Using these two formulas, we can calculate the derivative of the output of each hidden layer with respect to the input. Once we have the derivatives of the output of each hidden layer, we can use them to calculate the derivative of the output of the entire network with respect to the input.

I hope this helps! Let me know if you have any questions or need further clarification.

```

Answer extracted from lecture transcript: Lecture 5 | Learning the Network pt .3

Chata ID is: 526

Labels: lectures

Question:

May I ask what is what is L2 Divergence?

Actual ans:

Please see the lectures.

LLAMA in knowledge base:

L2 Divergence refers to the divergence between two probability distributions, which is commonly used in deep learning for real-valued outputs. It is also known as the Mean Squared Error (MSE) loss function.

# References #:

K2e5RLneaZo : 11785 Spring 22 Lecture 13: Recurrent Networks (Part 1)

G_N3IS-nqYQ : 11785 Spring 22 Lecture 5: Learning the network: Part 3

lYYkEx1Qx8s : 11785 Spring 22 Lecture 4: Learning the Network - Backprop

LLAMA 2:

L2 Divergence is a measure of the difference between two probability distributions, where the distance is measured using the square of the difference between the probabilities of each individual event. It is also known as the "Kullback-Leibler divergence" or "information divergence".

L2 Divergence is a meaningful measure of difference between probability distributions because it takes into account the relative plausibility of each event, rather than just the probability of each event occurring. For example, if two distributions have the same probability of an event occurring, but the relative probability of that event is different between the two distributions, L2 Divergence will capture this difference.

L2 Divergence has many applications in machine learning, including:

* Measuring the similarity between distributions

* Evaluating the performance of generative models

* Defining and measuring the distance between distributions in a high-dimensional space

* Quantifying the difference between two clusters or classes.

I hope this helps! Let me know if you have any other questions.

LLAMA SFT:

L2 Divergence is a divergence that comes from the L2 norm.

Question:

What is the submission format for hw3p2?

LLAMA in knowledge base:

Your submission should be a CSV file with the headers as ”id”, and ”predictions” .

# References #:

Page 6 | ./S22/hw3/hw3p2_writeup.pdf

Page 4 | ./S22/hw3/hw3p2_writeup.pdf

Page 1 | ./S22/hw3/HW3P2 Bootcamp.pdf

LLAMA 2:

```Hello! I'm happy to help you with your question. However, I don't have access to the specific details of your assignment, so I can't give you a definitive answer. Could you please provide me with more context or information about the assignment you're working on?```

Ours taps into assignment PDFs

We also implemented a Query cache (to reduce number of calls made to a LLM)

Students often repeated queries.
As a result, we made a query cache (where we store answers to past queries).
If a repeated query comes up, we are able to detect the repetition and prevent the costly LLM call.

Some areas of Future work

FrugalGPT

We noticed LLAMA-2 struggles to handle problems involve code.
FrugalGPT (recent paper: https://arxiv.org/abs/2305.05176 ):

Using an ensemble of specialized models to handle specific tasks.
Determining a policy (within a budget) to decide when to offload difficult queries to more costlier capable LLMs.

Guardrails to students asking assignment questions itself
RLHF

Table of Contents

Our Team

Our Methodology

What data did we collect ?

Data Cleaning and Post-Processing

Why Use Open-Source LLM ?

Our Learnings/Results

Examples demonstrating the impact of querying from a knowledge base

We also implemented a Query cache (to reduce number of calls made to a LLM)

Some areas of Future work

FrugalGPT

Guardrails to students asking assignment questions itself

RLHF