Quiz 4 - Agent Evaluation & Project Overview (10/6)
* Indicates required question
Email
*
Your email
Question 1:
When designing a new benchmark for an AI agent, which of the following is the most critical principle to ensure the evaluation is meaningful?
*
1 point
A. The benchmark includes as many tasks as possible to ensure broad coverage.
B. The evaluation metrics are simple and easy to calculate automatically.
C. The benchmark has high 'Outcome Validity,' meaning a high score genuinely reflects successful task completion.
D. The tasks are completely novel and have never been seen in any other benchmark.
Question 2:
Why would a research team prefer to use a 'dynamic benchmark' (like DynaBench or LiveCodeBench) over a 'static benchmark' (like MMLU)?
*
1 point
A. Static benchmarks are always more expensive to create and maintain.
B. Static benchmarks cannot be used to compare different models directly.
C. Dynamic benchmarks only support closed-ended questions, which are easier to grade.
D. Dynamic benchmarks are designed to reduce the risk of data contamination and overfitting.
Question 3:
Evaluating an AI agent’s performance on a non-verifiable task—such as creative writing or summarizing a complex article—can be difficult. Which approach is commonly used to address this challenge?
*
1 point
A. Comparing the agent’s output to a single reference answer using metrics like F1-score.
B. Applying unit tests and code coverage to verify the agent’s internal reasoning process.
C. Measuring keyword overlap between the agent’s response and the original prompt.
D. Using human evaluators or an “LLM-as-a-Judge” approach to rate output quality based on a defined rubric.
Question 4:
According to the principles of good benchmark design, a task like 'book a flight' is considered more realistic and valuable for evaluation than a task like 'solve this abstract logic puzzle'. Why is this?
*
1 point
A. Logic puzzles are inherently biased and unfair to certain types of AI models.
B. Benchmarks should reflect useful, real-world scenarios to measure an agent's practical capabilities.
C. It is impossible to automatically determine if a logic puzzle has been solved correctly.
D. Booking a flight requires fewer computational resources for an agent to perform.
Question 5:
What is the primary function of an agent evaluation framework within the AI development lifecycle?
*
1 point
A. To provide a systematic and reproducible way to measure capability progress and assess risks.
B. To train the AI model to become more intelligent by giving it difficult problems.
C. To generate marketing materials that highlight the agent's best performances.
D. To create a leaderboard that ranks every available AI agent in the world.
A copy of your responses will be emailed to the address you provided.