| Friday | Chapel | Merrill Hall | Nautilus | Crocker Dining Hall |
|---|---|---|---|---|
| 8:00 AM | Bus from UC Berkeley | | | |
| 8:30 AM | | | | |
| 9:00 AM | | | | |
| 9:30 AM | | | | |
| 10:00 AM | | | | |
| 10:30 AM | Registration + Drop Off Luggage | | | |
| 11:00 AM | | | | |
| 11:30 AM | | | | |
| 12:00 PM | | | | Lunch |
| 12:30 PM | | | | |
| 1:00 PM | Break | | | |
| 1:30 PM | | | | |
| 2:00 PM | Opening Remarks + Stuart Russell's Keynote | | | |
| 2:30 PM | | | | |
| 3:00 PM | Break | | | |
| 3:30 PM | | Alignment in LLMs Plenary | | |
| 4:00 PM | | Alignment in LLMs Session | | |
| 4:30 PM | | | | |
| 5:00 PM | | | | |
| 5:30 PM | | | | |
| 6:00 PM | Pick Up Luggage + Go To Your Rooms | | | Dinner |
| 6:30 PM | | | | |
| 7:00 PM | Break | | | |
| 7:30 PM | Sunset Beach Walk | | | |
| 8:00 PM | | | | |
| 8:30 PM | Cocktail Reception | | | |
| 9:00 PM | | | | |
| 9:30 PM | | | | |
| 10:00 PM | | | | |

| Saturday | Merrill Hall | Nautilus | Triton Room (book space here) | Crocker Dining Hall |
|---|---|---|---|---|
| 7:30 AM | | | Available | Breakfast |
| 8:00 AM | | | Available | |
| 8:30 AM | | | Available | |
| 9:00 AM | Human Value Learning Plenary | | Available | |
| 9:30 AM | Well-Founded AI Plenary | | Available | |
| 10:00 AM | Break | | Available | |
| 10:30 AM | Human Value Learning Session | Well-Founded AI Session | Available | |
| 11:00 AM | | | Explanations of neural networks | |
| 11:30 AM | | | | |
| 12:00 PM | | | Available | Lunch |
| 12:30 PM | | | How can AI improve human well-being? | |
| 1:00 PM | Student Poster Session & Cookie Social | | China AI safety & governance | |
| 1:30 PM | | | | |
| 2:00 PM | | | What AI safety should learn from FAccT (and why) | |
| 2:30 PM | | | | |
| 3:00 PM | Cooperative AI Plenary | | Available | |
| 3:30 PM | Robust & Trustworthy AI Plenary | | Available | |
| 4:00 PM | Group Photo + Break | | Available | |
| 4:30 PM | Cooperative AI Session | Robust & Trustworthy AI Session | Available | |
| 5:00 PM | | | Available | |
| 5:30 PM | | | Available | |
| 6:00 PM | | | | Dinner |
| 6:30 PM | | | | |
| 7:00 PM | Break | | | |
| 7:30 PM | | | | Cocktails, S'mores, and Bonfire Time (Front of Crocker Dining Hall) |
| 8:00 PM | | | | |
| 8:30 PM | | | | |
| 9:00 PM | | | | |
| 9:30 PM | | | | |
| 10:00 PM | | | | |

| Sunday | Chapel | Merrill Hall | Nautilus | Triton Room (book space here) | Crocker Dining Hall |
|---|---|---|---|---|---|
| 7:30 AM | | | | Available | Breakfast |
| 8:00 AM | Drop Off Luggage | | | Available | |
| 8:30 AM | | | | Available | |
| 9:00 AM | | AI Governance Plenary | | Available | |
| 9:30 AM | | AI Governance Panel (concludes 10:45); Break 10:45-11:00 (15 min) | | Available | |
| 10:00 AM | | | | Available | |
| 10:30 AM | | | | Available | |
| 11:00 AM | | Research Spotlight Talks (6) | | Using DL to Automate Interpretability | |
| 11:30 AM | | | | Available | |
| 12:00 PM | | | | Discuss '24 CA Ballot Initiative | Lunch |
| 12:30 PM | | | | | |
| 1:00 PM | | Explainability & Interpretability Plenary | | Available | |
| 1:30 PM | | Human Cognition Plenary | | Research agenda for AI ethics + AI safety | |
| 2:00 PM | Break | | | How might democratic and deliberative processes help with AI alignment (e.g. RLHF) and governance? | |
| 2:30 PM | | Explainability & Interpretability Session | Human Cognition Session | | |
| 3:00 PM | | | | Theory session | |
| 3:30 PM | | | | Theory session | |
| 4:00 PM | Break | | | Available | |
| 4:30 PM | | AGI Strategy Panel + Closing Remarks | | Available | |
| 5:00 PM | | | | Available | |
| 5:30 PM | | | | Available | |
| 6:00 PM | Pick Up Luggage | Bus to UC Berkeley | | | |
| 6:30 PM | | | | | |
| 7:00 PM | | | | | |
| 7:30 PM | | | | | |
| 8:00 PM | | | | | |
| 8:30 PM | | | | | |

NOTICE: All plenary talks will be held in MERRILL HALL

| Day | Session | Location | Speaker / Activity (👑 = Session Organizer, 🟠 = Plenary Speaker) | Time | Details / Talk Title |
|---|---|---|---|---|---|
| Fri | Alignment in LLMs (Adam Kalai 👑). Plenary: 3:30-4:00pm; Session: 4:00-6:00pm. Overview of current and future alignment issues and strategies in large language models, featuring a lively panel discussion. | Merrill Hall | Adam Kalai, Zhijing Jin 🟠 | 3:40-4:10pm | Alignment in LLMs: Session Overview |
| | | | Ethan Perez | 4:10-4:30pm | Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting |
| | | | Owain Evans | 4:30-4:50pm | Detecting Lies in LLMs |
| | | | Roman Yampolskiy | 4:50-5:10pm | Investigating the Limits of AI Alignment |
| | | | Panel Discussion | 5:10-6:00pm | Panelists: Victoria Krakovna, Stuart Russell, Max Tegmark, Iason Gabriel, Ethan Perez |
| Sat | Human Value Learning (Dorsa Sadigh, Anca Dragan 👑). Plenary: 9:00-9:30am; Session: 10:30am-12:00pm | Merrill Hall | Anca Dragan 🟠 | 9:00 - 9:30am | Challenges in human value learning |
| | | | Dylan Hadfield-Menell | 10:30 - 10:50am | RLHF is not all you need |
| | | | Brad Knox | 10:50 - 11:10am | A trajectory segment's regret explains human preferences better than its sum of rewards. |
| | | | Panel Discussion | 11:10 - 12:00pm | Panelists: Brad Knox, Dylan Hadfield-Menell, Craig Boutilier, David Krueger, Smitha Milli |
| | Well-Founded AI (Leslie Kaelbling 👑). Plenary: 9:30-10:00am; Session: 10:30am-12:00pm | Merrill Hall | Leslie Kaelbling 🟠 | 9:30 - 10:00am | What can we guarantee about AI systems? |
| | | Nautilus | Anthony Corso | 10:30 - 10:45am | Formal and Approximate Methods for Offline Safety Validation |
| | | | Marco Pavone | 10:45 - 11:00am | Run-time monitoring for AI-based autonomy stacks |
| | | | Hazem Torfah | 11:00 - 11:15am | Formal analysis of AI-based autonomy: from modeling to runtime assurance |
| | | | Vikash Mansinghka | 11:15 - 11:30am | An alternate scaling route for AI via probabilistic programming |
| | | | Panel Discussion | 11:30am - 12:00pm | Panelists: Leslie Kaelbling, Anthony Corso, Marco Pavone, Hazem Torfah, Vikash Mansinghka |
| | Student Poster Session & Cookie Social. Session: 1:00 - 3:00pm | Merrill Hall | Poster Session | 1:00 - 3:00pm | |
| | Cooperative AI (Anca Dragan, Dorsa Sadigh 👑). Plenary: 3:00-3:30pm; Session: 4:30-6:00pm | Merrill Hall | Dorsa Sadigh 🟠 | 3:00 - 3:30pm | Cooperative AI in the Era of Large Models |
| | | | Noam Brown | 4:30 - 4:50pm | CICERO: Learning to Cooperate and Compete with Humans in the Negotiation Game of Diplomacy |
| | | | Jacob Andreas | 4:50 - 5:10pm | Three Challenges in Human-LM Collaboration |
| | | | Panel Discussion | 5:10 - 6:00pm | Panelists: Noam Brown, Adam Kalai, Owain Evans, Jacob Andreas |
| | Robust/Trustworthy AI. Plenary: 3:30-4:00pm; Session: 4:30-6:00pm | Merrill Hall | Robin Jia 🟠 | 3:30 - 4:00pm | The Elusive Dream of Adversarial Robustness |
| | | Nautilus | Cihang Xie | 4:30 - 5:00pm | Interpretable Transformer for Robust Vision |
| | | | Chuan Guo | 5:00 - 5:30pm | Gradient-based adversarial attacks against text transformers |
| | | | Thomas Woodside | 5:30 - 6:00pm | An Overview of Catastrophic AI Risks |
| Sun | AI Governance Panel (Gillian Hadfield 👑). Plenary: 9:00-9:30am; Session (Panel): 9:30-10:45am | Merrill Hall | Gillian Hadfield 🟠 | 9:00 - 9:30am | What Should We Do With A Pause? |
| | | | Panel Discussion | 9:30 - 10:45am | Panelists: Tino Cuellar, David Duvenaud, Thomas Gilbert, Cullen O'Keefe, Stuart Russell. Description: The open letter calling for a six-month pause in the training of large generative models on a scale larger than GPT-4 proposed that "AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts." The panelists will discuss proposals for meeting this call, including registration, licensing, auditing, and reward reports. |
| | Research Spotlight Talks. Session: 11:00am - 12:00pm | Merrill Hall | Adam Gleave | 11:05 - 11:15am | Adversarial Policies Beat Superhuman Go AIs |
| | | | Jonathan Stray | 11:15 - 11:25am | Optimizing long-term recommender outcomes |
| | | | Claudia Shi | 11:25 - 11:35am | Evaluating the Moral Beliefs Encoded in LLMs |
| | | | Micah Carroll | 11:35 - 11:45am | Characterizing Manipulation from AI Systems |
| | | | Axel Abels | 11:45 - 11:55am | Mitigating Biases and Reward Uncertainty in Collective Decision-Making |
| | | | Stephen Casper | 11:55am - 12:05pm | Benchmarking Interpretability Tools |
| | Explainability/Interpretability. Plenary: 1:05-1:35pm; Session: 2:30-4:00pm | Merrill Hall | David Bau 🟠 | 1:05 - 1:35pm | Editing factual knowledge in big models |
| | | | Jacob Andreas | 2:30 - 3:00pm | Natural Language Descriptions of Neural Networks |
| | | | Tristan Hume | 3:00 - 3:30pm | Superposition and Dictionary Learning |
| | | | Adrià Garriga-Alonso | 3:30 - 4:00pm | Automatically discovering NN circuits |
| | Human Cognition (Ilia Sucholutsky 👑). Plenary: 1:35-2:05pm; Session: 2:30-4:00pm | Merrill Hall | Judy Fan 🟠 | 1:35 - 2:05pm | Protocols for evaluating representational alignment between humans and machines |
| | | Nautilus | Andreea Bobu | 2:30 - 3:00pm | Aligning Human and Robot Representations |
| | | | Been Kim | 3:00 - 3:30pm | Putting attention to humans as a way to work with machines |
| | | | Ilia Sucholutsky | 3:30 - 4:00pm | Consequences of Representational (Mis)Alignment between Humans and Machines |
| | AGI Strategy Panel (Andrew Critch 👑). Max Tegmark remarks: 4:30-5:00pm; Panel: 5:00-5:50pm | Merrill Hall | Max Tegmark 🟠 | 4:30 - 5:00pm | TBA |
| | | | Panel Discussion | 5:00 - 5:50pm | Panelists: Stuart Russell, Max Tegmark, Andrew Critch, Jaime Sevilla, Jason Wei |

| | Name | Title | Affiliation | Talk Title | Abstract | Bio | Session(s) | |
|---|---|---|---|---|---|---|---|---|
2 | Adam Gleave | Founder | FAR AI | Adversarial Policies Beat Superhuman Go AIs | Even superhuman Go AIs can be defeated by simple adversarial strategies that cause cyclic patterns on the board that confuse the AIs. We develop a method to automatically find such strategies, show that they exist in all publicly available Go AIs based on neural networks, apply interpretability methods to better understand why the network is vulnerable, and investigate adversarial training to correct this issue. | Adam Gleave is the CEO and co-founder of FAR AI, an alignment research non-profit working to incubate and accelerate new alignment research agendas. Adam is a CHAI alum and has previously spent time at DeepMind. His interests revolve around AI, deep RL, and beneficial AI. | Research Spotlight Talks | |
3 | Adam Kalai | Senior Principal Researcher | Microsoft Research | Alignment in LLMs: Session Overview | | Adam Tauman Kalai is a scientist who specializes in Machine Learning. Kalai is known for his algorithm for generating random factored numbers, for efficiently learning mixtures of Gaussians, for the Blum-Kalai-Wasserman algorithm, and for the intractability of the folk theorem in game theory. Recently, Kalai has been identifying and reducing gender bias in word embeddings. | Alignment in LLMs / Cooperative AI | |
4 | Adrià Garriga-Alonso | PhD candidate / Researcher | Cambridge / Redwood Research | Automatically discovering NN circuits | A combination of recent works shows the way to a paradigm for mechanistic interpretability. First, decide on a dataset and metric that elicit the desired model behavior; then iteratively apply activation patching to find which abstract neural network units are involved in the behavior, and finally interpret the functions that these units implement; and iterate until the explanation is satisfactorily fine-grained. In this talk I report on our somewhat successful attempt to automate the activation patching step. | Adrià works at FAR AI on understanding and predicting agent AI through interpretability. Previously he worked at Redwood Research and holds a PhD in machine learning, which was advised by Prof. Carl Rasmussen at the University of Cambridge. | Explainability/Interpretability | ||
5 | Anca Dragan | Associate Prof. EECS | UC Berkeley | Challenges in human value learning | | Anca is an Associate Professor in the EECS Department at UC Berkeley. Her goal is to enable robots to work with, around, and in support of people. She runs the InterACT Lab, which focuses on algorithms for human-robot interaction. She also helped found and serves on the steering committee for the Berkeley AI Research (BAIR) Lab, and is a co-PI of the Center for Human-Compatible AI. | Human Value Learning / Cooperative AI | |
6 | Andreea Bobu | PhD candidate (EECS) | UC Berkeley | Aligning Human and Robot Representations | To perform tasks that humans want in the world, robots rely on a representation of salient task features; for example, to hand me a cup of coffee, the robot considers features like efficiency and cup orientation in its behavior. Prior methods try to learn both a representation and a downstream task jointly from data sets of human behavior, but this unfortunately picks up on spurious correlations and results in behaviors that do not generalize. In my view, what’s holding us back from successful human-robot interaction is that human and robot representations are often misaligned: for example, our lab’s assistive robot moved a cup inches away from my face -- which is technically collision-free behavior -- because it lacked an understanding of personal space. Instead of treating people as static data sources, my key insight is that robots must engage with humans in an interactive process for finding a shared representation for more efficient, transparent, and seamless downstream learning. In this talk, I focus on a divide and conquer approach: explicitly focus human input on teaching robots good representations before using them for learning downstream tasks. This means that instead of relying on inputs designed to teach the representation implicitly, we have the opportunity to design human input that is explicitly targeted at teaching the representation and can do so efficiently. I introduce a new type of representation-specific input that lets the human teach new features, I enable robots to reason about the uncertainty in their current representation and automatically detect misalignment, and I propose a novel human behavior model to learn robust behaviors on top of human-aligned representations. By explicitly tackling representation alignment, I believe we can ultimately achieve seamless interaction with humans where each agent truly grasps why the other behaves the way they do. | Andreea is a Ph.D. candidate at UC Berkeley in EECS advised by Anca Dragan. She works at the intersection of robotics, mathematical human modeling, and deep learning, focusing on aligning human and robot task representations for more seamless interaction. She is the recipient of the Apple AI/ML Ph.D. fellowship, is a Rising Star in EECS and an R:SS and HRI Pioneer, and has won a best paper award at HRI 2020. | Human Cognition | ||
7 | Andrew Critch | CEO / Research Scientist | Encultured AI / CHAI | - | - | Andrew Critch is a research scientist at CHAI, focusing on open-source game theory, joint-ownership protocols for AI systems, human/AI interaction, and societal-scale risks. He is also the CEO and cofounder of Encultured AI, a video game company focused on enabling the safe introduction of AI technologies into our game world. | General AGI Panel | ||
8 | Anthony Corso | Postdoc Researcher | Stanford (Aeronautics & Astronautic Dept.) | Formal and Approximate Methods for Offline Safety Validation | Anthony is a postdoctoral researcher in the Aeronautics and Astronautics Department at Stanford University and he is the executive director of the Stanford Center for AI Safety. His research is focused on the use of algorithmic decision-making for safety-critical applications, emphasizing the creation of robust, reliable autonomous systems. | Well-Founded AI | |||
9 | Axel Abels | PhD Candidate | Université Libre de Bruxelles - MLG | Mitigating Biases and Reward Uncertainty in Collective Decision-Making | I would like to present our work on collective decision-making and some open problems related to it which I believe strongly overlap with the research interests of CHAI. My research focuses on mitigating the effects of biases in collective decision-making. We proposed to achieve this by replacing open communication with a collective decision-making platform that handles the exchange of information and the decision-making. By replacing open communication with a centralized aggregator which adapts to the diverse opinions, we hope to maximize the collective’s performance and mitigate the impact of biases on the final decision. Crucially, we are not trying to replace human experts, but rather we are looking to optimally exploit their knowledge. Considering different individuals in the collective are likely to have differing expertise, we studied whether we can learn to combine their opinions, taking into account this difference in expertise, in such a way that the collective benefits (see [1]). With this overarching goal in mind, we proposed and implemented algorithmic approaches capable of identifying and countering biases (such as confidence bias, conservatism bias and the related law of the instrument, illusory correlations and counterfactual biases) which affect the judgment of participants in the collective decision-making task (see [2]). In concert with this, I would like to discuss follow-up research which focuses on the lack of appropriate reward signals. Indeed, in many problems involving collective decision-making an appropriate reward signal is unavailable. Research should therefore focus on methods which account for uncertainty both in the decision-making and in the reward signal towards which to optimize. Specifically, what should be done when experts are not only noisy in how they implicitly balance differing objectives, but fundamentally unaligned? Should we aim to please the plurality, or is a more balanced approach which somewhat appeases outliers preferable? In other words, can we ensure minorities are given a voice while minimizing the impact of bad actors? Given that the difference in internal objectives is likely the result of biases, is it possible to identify these biases and counteract them in such a way that the knowledge of biased experts can still be utilized in the absence of an appropriate reward signal? Addressing these questions would promote a better understanding of how rewards can be elicited in collective decision-making. This would enable the application of existing algorithms to settings wherein an appropriate reward signal is hard to specify — such as policy making — which currently rely on static aggregation techniques or on open deliberation which is likely to be undermined by social biases. In addition, I believe that ensuring the optimization process is aligned with the experts’ objectives would foster trust, and thus participation. | Axel Abels is currently a doctoral researcher at the free university of Brussels' Machine Learning Group. His early work extended the applicability of Deep Reinforcement Learning to Multi-Objective problems. His current research focus is on collective decision-making, and more specifically on how we can leverage the expertise of groups of decision-makers with a focus on collective intelligence and bias mitigation. | Research Spotlight Talks | ||
10 | Been Kim | Senior Staff Research Scientist | Google DeepMind | Putting attention to humans as a way to work with machines | | Been Kim is a researcher in the field of machine learning at Google DeepMind. Her work focuses on helping humans and machines communicate by bridging the representational gap between the two. She gave keynotes at ICLR 2022, ECML 2020, and at the G20 meeting. Her work was featured at Google I/O '19 and in Brian Christian's book The Alignment Problem. | Human Cognition | |
11 | Brad Knox | Research Scientist | UT Austin (CS) | A trajectory segment's regret explains human preferences better than its sum of rewards. | Brad Knox is a research scientist at the University of Texas at Austin. His research has largely focused on the human side of reinforcement learning. Brad’s 2012 dissertation on the TAMER framework helped pioneer reinforcement learning from human feedback. He is currently concerned with how humans can specify reward functions that are aligned with their interests. | Human Value Learning | |||
12 | Chuan Guo | Research Scientist | Fundamental AI Research @ Meta | Gradient-based adversarial attacks against text transformers | Transformer models, like other neural networks, suffer from a lack of robustness to adversarial perturbations. When exploited in applications such as ChatGPT and text-to-image generative models, this lack of robustness can result in adversarial prompts that manipulate the model's output in arbitrary ways. We describe a general method that uses gradient-based search to find adversarial texts and show that it outperforms query-based heuristics on a variety of text transformer models. | Chuan Guo is a Research Scientist on the Fundamental AI Research (FAIR) team at Meta. His research focuses on Responsible AI, and in particular on machine learning security and privacy. Topics that he is actively working on include adversarial and distributional robustness, privacy-preserving machine learning, and federated learning. | Robust/Trustworthy AI | |
13 | Cihang Xie | Asst. Prof. (CS) | UC Santa Cruz | Interpretable Transformer for Robust Vision | Cihang Xie is a researcher affiliated with UC Santa Cruz. His work primarily focuses on computer vision and deep learning, with a particular interest in developing robust and efficient algorithms for image and video analysis. He has contributed to advancements in areas such as image recognition, object detection, and adversarial machine learning, pushing the boundaries of computer vision technology. | Robust/Trustworthy AI | |||
14 | Claudia Shi | PhD Candidate | Columbia | Evaluating the Moral Beliefs Encoded in LLMs | People use large language models (LLMs) for many tasks, including to get advice about difficult moral situations. We are interested in studying what types of moral preferences these language models reflect, especially in ambiguous cases where the right choice is not obvious. To this end, we design a survey, a set of evaluation metrics, and a statistical workflow on how to elicit the moral beliefs encoded in an LLM. We conduct the survey on 24 open and closed-source large language models. The survey leads to the creation of the MoralChoice dataset, which includes 680 ambiguous moral scenarios (e.g., should I tell a white lie?) and 687 less ambiguous moral scenarios (e.g., should I stop for a pedestrian?). Each example consists of a moral situation, two possible actions, and a set of auxiliary labels for each action (e.g., which rules are violated, such as "do not kill"). The results of the survey help measure the following: (1) The consistency of the LLMs across various prompt styles. (2) The uncertainty of LLMs across questions with varying ambiguities. (3) The moral values encoded in different LLMs, specifically their adherence to commonsense reasoning. (4) The extent of agreement among LLMs and the factors that contribute to the disagreement. | Claudia Shi is a Ph.D. student in Computer Science at Columbia University and an advisor at FAR AI. She is broadly interested in using insights from the causality and machine learning literature to approach AI alignment problems. Currently, she is working on developing principles and methods for LLM evaluation and interpretability. | Research Spotlight Talks | |
15 | Craig Boutilier | Principal Scientist | Machine Intelligence @ Google | - | - | Craig Boutilier is Principal Scientist at Google. His current research focuses on various aspects of decision making under uncertainty, with a particular emphasis on sequential decision models: reinforcement learning, Markov decision processes, temporal models, etc. | Human Value Learning | |
16 | Cullen O'Keefe | Research Scientist | Governance @ OpenAI | - | - | Cullen currently works as Research Scientist in Governance at OpenAI. He is also a Research Affiliate with the Centre for the Governance of AI; Founding Advisor and Research Affiliate at the Legal Priorities Project; and a VP at the O’Keefe Family Foundation. His research focuses on the law, policy, and governance of advanced artificial intelligence, with a focus on preventing severe harms to public safety and global security. | AI Governance | ||
17 | David Bau | Asst. Prof. (CS) | Northeastern Khoury | Editing factual knowledge in big models. | David Bau is Assistant Professor at the Northeastern University Khoury College of Computer Science. He received his PhD from MIT and AB from Harvard, and he has previously worked at Google and Microsoft. He is known for his network dissection studies of individual neurons in deep networks and has published research on the interpretable structure of learned computations in large models in vision and language. Prof. Bau is also coauthor of the textbook, Numerical Linear Algebra. | Explainability/Interpretability | |||
18 | David Duvenaud | Associate Professor | University of Toronto | - | - | David Duvenaud is an Associate Professor in Computer Science and Statistics at the University of Toronto. He holds a Sloan Research Fellowship, a Canada Research Chair in Generative Models, and a CIFAR AI chair. His research focuses on deep learning and AI governance. His postdoc was done at Harvard University and his Ph.D. at the University of Cambridge. | AI Governance | ||
19 | David Krueger | Assistant Professor | University of Cambridge | - | - | David Krueger is an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL) and Machine Learning Group (MLG). His research group focuses on Deep Learning, AI Alignment, and AI safety. He's broadly interested in work that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. | Human Value Learning | ||
20 | Dorsa Sadigh | Assistant Professor | Stanford University | Cooperative AI in the Era of Large Models | | Dorsa Sadigh is an Assistant Professor in the Computer Science Department at Stanford University. Her research interests lie at the intersection of robotics, machine learning, and control theory. Specifically, her group is interested in developing efficient algorithms for safe, reliable, and adaptive human-robot and generally multi-agent interactions. | Human Value Learning / Cooperative AI | |
21 | Dylan Hadfield-Menell | Assistant Professor | Massachusetts Institute of Technology | RLHF is not all you need | I'll discuss the problem of feature learning/specification in preference learning. I will go over results that highlight the challenge of incomplete feature specification and discuss the role that features play in preference specification. I'll conclude with an overview of some recent work on feature learning/specification. | Dylan Hadfield-Menell is an assistant professor on the faculty of Artificial Intelligence and Decision-Making at MIT. His research focuses on the problem of agent alignment: the challenge of identifying behaviors that are consistent with the goals of another actor or group of actors. He runs the Algorithmic Alignment Group, where they work to identify algorithmic solutions to alignment problems. | Human Value Learning | |
22 | Ethan Perez | Research Scientist | Anthropic | Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness. | Ethan is a Research Scientist at Anthropic. His research focuses on aligning language models with human preferences, e.g., for content that is helpful, honest, and harmless. In particular, he is excited about developing learning algorithms that outdo humans at generating such content, by producing text that is free of social biases, cognitive biases, common misconceptions, and other limitations. Previously, he has spent time at DeepMind, Facebook AI Research, Montreal Institute for Learning Algorithms, Uber, and Google. | Alignment in LLMs | ||
23 | Gillian Hadfield | Chair in Technology and Society; Professor of Law and of Strategic Management; Director, Schwartz Reisman Institute for Technology and Society | University of Toronto | What Should We Do With A Pause? | Gillian will give an overview of the landscape of current AI regulation and introduce several ideas that will subsequently be discussed in the panel, including registration, licensing, auditing, and reward reports. Panel Description: The open letter calling for a six-month pause in the training of large generative models on a scale larger than GPT-4 proposed that "AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts." The panelists will discuss proposals for meeting this call, including registration, licensing, auditing, and reward reports. | Gillian Hadfield is the inaugural Schwartz Reisman Chair in Technology and Society, Professor of Law, and Professor of Strategic Management at the University of Toronto, and holds a CIFAR AI Chair at the Vector Institute for AI. Her research is focused, among other things, on innovative design for legal and dispute resolution systems in advanced and developing market economies and on governance for AI. | AI Governance | |
24 | Hazem Torfah | Postdoc Researcher | UC Berkeley | Formal analysis of AI-based autonomy: from modeling to runtime assurance | | Hazem is a postdoctoral researcher in the EECS Department at UC Berkeley working with Prof. Sanjit A. Seshia. His research interests are the formal specification, verification, and synthesis of cyber-physical systems, with a particular focus on quantitative approaches for verifying and explaining the behavior of learning-enabled cyber-physical systems. | Well-Founded AI | |
25 | Iason Gabriel | Staff Research Scientist | DeepMind | - | Iason is a political theorist and ethicist based at Google DeepMind, where he helped found the ethics research team. His work focuses on the moral questions raised by artificial intelligence, including the challenge of value alignment, responsible innovation, democratic theory and human rights. | Alignment in LLMs | |||
26 | Ilia Sucholutsky | Postdoc Researcher | Princeton University | Consequences of Representational (Mis)Alignment between Humans and Machines | Ilia is doing a postdoc in the CoCoSci Lab at Princeton University. He is fascinated by deep learning and its ability to reach superhuman performance on so many different tasks. He wants to better understand how neural networks achieve such impressive results… and why sometimes they don’t. | Human Cognition | |||
27 | Jacob Andreas | Assistant Professor | Massachusetts Institute of Technology | Natural Language Descriptions of Neural Networks | Jacob Andreas is interested in language as a communicative and computational tool. His research aims to understand the computational foundations of efficient language learning, and build general-purpose intelligent systems that can communicate effectively with humans and learn from human guidance. | Explainability/Interpretability | |||
28 | Jaime Sevilla | Research Affiliate | Cambridge University | - | - | Jaime works on forecasting the impact of new technologies – such as Artificial Intelligence and Quantum Computing – and understanding key mathematical considerations to improve decision-making – such as extreme value theory, causal inference and decision theory. Jaime has been awarded a Marie Skłodowska-Curie grant to work on developing explainable tools for probabilistic reasoning. | General AGI Panel | |
29 | Jason Wei | AI Researcher | OpenAI | - | - | Jason is an AI researcher working on ChatGPT at OpenAI in San Francisco. He was previously a senior research scientist at Google Brain, where he popularized chain-of-thought prompting, co-led the first efforts on instruction tuning, and wrote about emergence in large language models. Chain-of-thought prompting was presented by Sundar Pichai at the Google I/O press event in 2022. | General AGI Panel | |
30 | Jonathan Stray | Senior Scientist | CHAI | Optimizing long-term recommender outcomes | | Jonathan Stray is a Senior Scientist at CHAI, working on recommender systems — the algorithms that select and rank content across social media, news apps, streaming music and video, and online shopping. Jonathan studies how their operation affects well-being, polarization, and other things, and tries to design recommenders that are better for people and society. | Research Spotlight Talks | |
31 | Judy Fan | Assistant Professor | Stanford University | Protocols for evaluating representational alignment between humans and machines | Judy Fan is an Assistant Professor of Psychology at Stanford University. Research in her lab aims to reverse engineer the human cognitive toolkit, especially how people use physical representations of thought to learn, communicate, and solve problems. Towards this end, her lab employs converging approaches from cognitive science, computational neuroscience, and artificial intelligence. | Human Cognition | |||
32 | Leslie Kaelbling | Researcher | Massachusetts Institute of Technology | What can we guarantee about AI systems? | At the moment, not enough! In this session we will talk about: what types of guarantees we might want to try to make; methods for establishing those guarantees after a system is constructed; and strategies for designing systems that meet some guarantees by design. | Leslie Kaelbling is the Panasonic Professor of Computer Science and Engineering in the Department of Electrical Engineering and Computer Science at MIT. Her work explores the intersection of AI algorithms and human cognition, with a focus on developing intelligent systems that can learn and adapt in real-world environments. | Well-Founded AI | |
33 | Marco Pavone | Associate Professor | Stanford University | Run-time monitoring for AI-based autonomy stacks | | Marco Pavone is an Associate Professor of Aeronautics and Astronautics at Stanford University and a Distinguished Research Scientist at NVIDIA, where he leads autonomous vehicle research. His main research interests are in the development of methodologies for the analysis, design, and control of autonomous systems, with an emphasis on self-driving cars and autonomous aerospace vehicles. | Well-Founded AI | |
34 | Mariano-Florentino (Tino) Cuellar | President | Carnegie Endowment for International Peace | - | - | Mariano-Florentino (Tino) Cuéllar is the tenth president of the Carnegie Endowment for International Peace. A former justice of the Supreme Court of California, he served two U.S. presidents at the White House and in federal agencies, and was a faculty member at Stanford University for two decades. He is a member of the U.S. Department of State's Foreign Affairs Policy Board. | AI Governance | |
35 | Max Tegmark | Researcher | Massachusetts Institute of Technology | TBA | Max Tegmark is a researcher focusing on linking physics and machine learning: using AI for physics and physics for AI. Max's research is focused on precision cosmology, e.g., combining theoretical work with new measurements to place sharp constraints on cosmological models and their free parameters. | General AGI Panel | |||
36 | Micah Carroll | PhD Student | UC Berkeley / CHAI | Characterizing Manipulation from AI Systems | What should count as AI manipulation? How should we measure it? Building upon prior literature on manipulation from other fields, we characterize the space of possible notions of manipulation, which we find to depend upon the concepts of incentives, intent, harm, and covertness. As case studies, we'll talk about the incentives that AI systems have to manipulate users in the context of recommender systems and LLMs. | Micah Carroll is an Artificial Intelligence PhD student at Berkeley within BAIR and CHAI, advised by Anca Dragan and Stuart Russell. He previously worked with the Krueger Lab at the University of Cambridge, and Microsoft Research. He is most interested in what AIs should do when human preferences change, and in particular how recommender systems or LLMs might have incentives to manipulate or affect users in negative ways. | Research Spotlight Talks | |
37 | Noam Brown | Research Scientist | DeepMind | CICERO: Learning to Cooperate and Compete with Humans in the Negotiation Game of Diplomacy | | Noam Brown is a Research Scientist Team Lead at Google DeepMind working on planning, reasoning, and self-play in language models. He has also worked on multi-agent learning and computational game theory. He previously worked at FAIR (Meta), where he and his teammates developed CICERO, the first AI to achieve human-level performance in the strategy game Diplomacy. | Cooperative AI | |
38 | Owain Evans | Research Scientist | Oxford University | Detecting Lies in LLMs | Owain Evans is a Research Lead at the new AI Safety group in Berkeley and a Research Associate at Oxford University. Owain has a broad interest in AI alignment and AGI risk. His current focus is evaluating situational awareness and deception in LLMs, and on truthfulness and honesty in AI systems. In the past, Owain worked on AI Alignment at the University of Oxford (FHI) and earned his PhD at MIT. | Alignment in LLMs / Cooperative AI | |||
39 | Robin Jia | Assistant Professor | University of Southern California | The Elusive Dream of Adversarial Robustness | Robin Jia is an assistant professor in the Thomas Lord Department of Computer Science at the University of Southern California. He is interested broadly in natural language processing and machine learning, with a particular focus on building NLP systems that are robust to distribution shift at test time and understanding how deep learning NLP models work. | Robust/Trustworthy AI | |||
40 | Rohin Shah | Research Scientist | DeepMind | Three Challenges in Human-LM Collaboration | | Rohin Shah works as a Research Scientist on the technical AGI safety team at DeepMind. He completed his PhD at CHAI, where he worked on building AI systems that can learn to assist a human user, even if they don't initially know what the user wants. Rohin is particularly interested in big picture questions about artificial intelligence. | Cooperative AI | |
41 | Roman Yampolskiy | Computer Scientist | University of Louisville | Investigating the Limits of AI Alignment | Dr. Roman V. Yampolskiy is a tenured faculty member in the department of Computer Science and Engineering at the University of Louisville. He is the founding and current director of the Cyber Security Lab and an author of many books including Artificial Superintelligence: a Futuristic Approach. Dr. Yampolskiy’s main area of interest is Artificial Intelligence Safety. | Alignment in LLMs | |||
42 | Smitha Milli | Postdoc Researcher | Cornell Tech | - | Smitha Milli is a postdoctoral associate at Cornell Tech. Their work primarily focuses on (a) rigorous evaluation of systems interacting in feedback loops with humans (e.g. recent work on measuring effects of Twitter’s ranking algorithm) and (b) designing and learning objective functions for those systems that produce more socially-beneficial outcomes. They hold a PhD in EECS from UC Berkeley and their postdoc is funded by an Open Philanthropy early career grant. | Human Value Learning | |||
43 | Stephen Casper | PhD Student | MIT | Explore, Establish, Exploit: Red Teaming Language Models from Scratch | When aligning language models, one useful type of debugging approach is to search for prompts that elicit harmful responses. Prior works have introduced tools to do this given a classifier for such harmful responses. But where does the classifier come from? Assuming it exists skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. We introduce a framework for red teaming language models from scratch which includes 3 steps: exploring the model's capabilities, establishing a measure of harmful behavior, and exploiting the model's vulnerabilities. We effectively use this approach to red team GPT-3 w.r.t. producing dishonest text. | Stephen Casper is a Ph.D. student at MIT in Computer Science (EECS) in the Algorithmic Alignment Group. Stephen has worked with the Harvard Kreiman Lab and CHAI. His main focus is on developing tools for more interpretable and robust AI by studying interpretability, adversaries, and diagnostic tools in deep learning. | Research Spotlight Talks | |
44 | Stuart Russell | Professor | UC Berkeley | - | Stuart Russell, OBE is a Professor of EECS at UC Berkeley and the Director of CHAI and the Kavli Center for Ethics, Science, and the Public. His book Artificial Intelligence: A Modern Approach (with Peter Norvig) is the standard text in AI. His research includes machine learning, probabilistic reasoning, knowledge representation, planning, real-time decision making, multitarget tracking, computer vision and computational physiology. | AI Governance / General AGI Panel | |||
45 | Thomas Gilbert | Postdoctoral Fellow | Cornell Tech | - | Thomas Gilbert is a Product Lead at Mozilla, AI Ethics Lead at daios, and a Postdoctoral Fellow at Cornell Tech's Digital Life Initiative. Previously Thomas served as the inaugural Law and Society Fellow at the Simons Institute for the Theory of Computing. He is also a CHAI alum, and cofounder of GEESE. His research interests lie in the emerging political economy of autonomous AI systems. | AI Governance | |||
46 | Thomas Woodside | Researcher / Undergrad | Center for AI Safety / Yale University | An Overview of Catastrophic AI Risks | The session on Robust and Trustworthy AI is a comprehensive session discussing the principles and techniques for creating reliable AI systems. | Dan Hendrycks received his PhD from UC Berkeley and is now the director of the Center for AI Safety. Dan's research helped contribute the GELU activation function (the most-used activation in state-of-the-art models including BERT, GPT, Vision Transformers, etc.), the out-of-distribution detection baseline, and distribution shift benchmarks. | Robust/Trustworthy AI | |
47 | Tristan Hume | Research Scientist | Anthropic | Superposition and Dictionary Learning | Tristan Hume works at Anthropic in SF on ML interpretability research and performance optimization. He's also interested in security, ML and alignment research, and developer tools. | Explainability/Interpretability | |||
48 | Victoria Krakovna | Research Scientist | DeepMind | - | Victoria Krakovna is a senior research scientist on the Google DeepMind alignment team and co-founder of the Future of Life Institute. Currently she is working on dangerous capability evaluations, supervising a SERI MATS group focused on power-seeking in LLMs, and doing internal outreach on alignment at GDM. | Alignment in LLMs | |||
49 | Vikash Mansinghka | Research Scientist | Massachusetts Institute of Technology | An alternate scaling route for AI via probabilistic programming | Vikash Mansinghka is a Principal Research Scientist at MIT, where he leads the Probabilistic Computing Project. His group is building a new generation of probabilistic computing systems that integrate probability and randomness into the basic building blocks of software and hardware. They have discovered that this approach leads to surprising new AI capabilities. | Well-Founded AI | |||
50 | Zhijing Jin | Research Scientist | Max Planck Institute & ETH | Alignment in LLMs: Session Overview | | Zhijing Jin (she/her) is a PhD candidate at the Max Planck Institute (Germany) & ETH (Switzerland). Her research focuses on socially responsible NLP via causal and moral principles. Specifically, she works on expanding the impact of NLP by promoting NLP for social good, and developing CausalNLP to improve robustness, fairness, and interpretability of NLP models, as well as analyze the causes of social problems. Her research is supported by PhD fellowships from FLI and OpenPhil. | Alignment in LLMs | |

| | Name | Title | Description |
|---|---|---|---|
2 | Adam Gleave | Adversarial Policies Beat Superhuman Go AIs | Even superhuman Go AIs can be defeated by simple adversarial strategies that cause cyclic patterns on the board that confuse the AIs. We develop a method to automatically find such strategies, show that they exist in all publicly available Go AIs based on neural networks, apply interpretability methods to better understand why the network is vulnerable, and investigate adversarial training to correct this issue. |
3 | Alexander Turner | Steering GPT-2-XL by adding an activation vector | We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes. Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We call these "activation additions." We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities while steering the model to generate desired completions, without needing any finetuning. We hope this technique will eventually allow us to flexibly and quickly adjust the goals pursued by networks at inference time. |
4 | Axel Abels | Mitigating Biases and Reward Uncertainty in Collective Decision-Making | I would like to present our work on collective decision-making and some open problems related to it which I believe strongly overlap with the research interests of CHAI. My research focuses on mitigating the effects of biases in collective decision-making. We proposed to achieve this by replacing open communication with a collective decision-making platform that handles the exchange of information and the decision-making. By replacing open communication with a centralized aggregator which adapts to the diverse opinions, we hope to maximize the collective’s performance and mitigate the impact of biases on the final decision. Crucially, we are not trying to replace human experts, but rather we are looking to optimally exploit their knowledge. Considering different individuals in the collective are likely to have differing expertise, we studied whether we can learn to combine their opinions, taking into account this difference in expertise, in such a way that the collective benefits (see [1]). With this overarching goal in mind, we proposed and implemented algorithmic approaches capable of identifying and countering biases (such as confidence bias, conservatism bias and the related law of the instrument, illusory correlations and counterfactual biases) which affect the judgment of participants in the collective decision-making task (see [2]). In concert with this, I would like to discuss follow-up research which focuses on the lack of appropriate reward signals. Indeed, in many problems involving collective decision-making an appropriate reward signal is unavailable. Research should therefore focus on methods which account for uncertainty both in the decision-making and in the reward signal towards which to optimize. Specifically, what should be done when experts are not only noisy in how they implicitly balance differing objectives, but fundamentally unaligned? Should we aim to please the plurality, or is a more balanced approach which somewhat appeases outliers preferable? In other words, can we ensure minorities are given a voice while minimizing the impact of bad actors? Given that the difference in internal objectives is likely the result of biases, is it possible to identify these biases and counteract them in such a way that the knowledge of biased experts can still be utilized in the absence of an appropriate reward signal? Addressing these questions would promote a better understanding of how rewards can be elicited in collective decision-making. This would enable the application of existing algorithms to settings wherein an appropriate reward signal is hard to specify — such as policy making — which currently rely on static aggregation techniques or on open deliberation which is likely to be undermined by social biases. In addition, I believe that ensuring the optimization process is aligned with the experts’ objectives would foster trust, and thus participation. Tackling these problems would go towards addressing CHAI’s major concerns. I therefore think presenting this work at CHAI's workshop would be of interest |
5 | Cameron Allen | Solving Non-Markov Decision Processes via the Lambda Discrepancy | TLDR: Minimizing a discrepancy between different value estimates is beneficial for learning memory under partial observability. Abstract: We consider the use of memory to learn to solve decision processes that lack the Markov property. We formalize a memory function as a test over an agent's history, which has the advantage of being defined only over observed quantities. We then introduce the λ-discrepancy, the difference between the return predicted by 1-step temporal difference learning (which makes an implicit Markov assumption) and Monte Carlo value estimation (which does not), or more generally between TD(λ) with two different values for λ. We show that the λ-discrepancy is a useful measure of non-Markov reward or transition dynamics, and propose its use as an error signal for resolving partial observability through learning memory. We show both theoretically and empirically that this approach is a promising direction for reducing partial observability in non-Markov decision-making problems. |
6 | Cassidy Laidlaw | Bridging RL Theory and Practice with the Effective Horizon | Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy—i.e., when it is optimal to act greedily with respect to the random policy's Q function—deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. |
7 | Chris Cundy | Prompt Injections: Scalability and Safety | We show that generalizable prompt injections can be obtained with simple optimization techniques and black-box access to language models. Furthermore, these seemingly-semantically-meaningless injections which are learned from smaller models can be transferred to GPT3. We argue that this work highlights a crucial practical limitation of prompts as a method of controlling large language models |
8 | Claudia Shi | Evaluating the Moral Beliefs Encoded in LLMs | People use large language models (LLMs) for many tasks, including to get advice about difficult moral situations. We are interested in studying what types of moral preferences these language models reflect, especially in ambiguous cases where the right choice is not obvious. To this end, we design a survey, a set of evaluation metrics, and a statistical workflow on how to elicit the moral beliefs encoded in an LLM. We conduct the survey on 24 open and closed-source large language models. The survey leads to the creation of the MoralChoice dataset, which includes 680 ambiguous moral scenarios (e.g., should I tell a white lie?) and 687 less ambiguous moral scenarios (e.g., should I stop for a pedestrian?). Each example consists of a moral situation, two possible actions, and a set of auxiliary labels for each action (e.g., which rules are violated, such as "do not kill"). The results of the survey help measure the following: (1) The consistency of the LLMs across various prompt styles. (2) The uncertainty of LLMs across questions with varying ambiguities. (3) The moral values encoded in different LLMs, specifically their adherence to commonsense reasoning. (4) The extent of agreement among LLMs and the factors that contribute to the disagreement. |
9 | Dmitrii Krasheninnikov | (Out-of-context) Meta-learning in Language Models | TLDR: we find a surprising meta-learning-like behavior in LLMs as well as simpler models trained with regular gradient-descent-based algorithms. Paper abstract: Brown et al. (2020) famously introduced the phenomenon of in-context meta-learning in large language models (LLMs). Our work establishes the existence of a phenomenon we call out-of-context meta-learning via carefully designed synthetic experiments with large language models. We show that out-of-context meta-learning leads LLMs to more readily “internalize” the semantic content of text that is, or *appears to be*, broadly useful (such as true statements, or text from authoritative sources) and apply it in appropriate contexts. We further demonstrate internalization in a synthetic computer vision setting, and propose two hypotheses for the emergence of internalization: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based methods may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. |
10 | Dylan Cope | Learning to Plan with Tree Search via Deep RL | |
11 | Erdem Bıyık | ViSaRL: Visual Reinforcement Learning Guided by Human Saliency | Training autonomous agents to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is challenging and sample-inefficient. When performing a task, people visually attend to task-relevant objects and areas. By contrast, pixel observations in visual RL consist primarily of task-irrelevant information. To bridge that gap, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL). Using ViSaRL to learn visual scene encodings improves the success rate of an RL agent on four challenging visual robot control tasks in the Meta-World benchmark. This finding holds across two different visual encoder backbone architectures, with average absolute gains in success rate of 13% and 18% for CNN- and Transformer-based visual encoders, respectively. The Transformer-based visual encoder can achieve a 10% absolute gain in success rate even when saliency is only available during pretraining. |
12 | Felix Binder | Towards an evaluation for steganography in large language models | This is a side project of mine on steganography (secretly encoding information in innocuous-seeming text) in large language models. It consists of: (1) an argument why steganography might arise from RLHF (in short, training models to be concise and correct might push them to "hide" intermediate computations in preceding text), (2) a proposal for an evaluation aimed at eliciting and isolating this kind of steganography (by systematically manipulating the preceding text) and (3) preliminary results of running the evaluation on GPT3.5/4 (which have not found evidence of steganography). |
13 | George Obaido | Unleashing Potential: Open Data Platform and Sharing for Empowering Researchers in the Global South | |
14 | Hanlin Zhu | Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian | Offline reinforcement learning (RL), which refers to decision-making from a previously collected dataset of interactions, is important in provably beneficial AI since it avoids online exploration, which could be costly, dangerous, or even impossible. It has received significant attention over the past years, and much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms are designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain occupancy validity constraints are nearly satisfied. We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as “enforcers of occupancy validity” rather than “promoters of conservatism.” |
15 | Johannes Treutlein | Incentivizing honest performative predictions with proper scoring rules | Proper scoring rules incentivize human experts or AI models to accurately report beliefs, assuming predictions cannot influence outcomes. We relax this assumption and investigate incentives when predictions are performative, i.e., when they can influence the outcome of the prediction, such as when making public predictions about the stock market. A prediction is a fixed point if it accurately reflects the expert’s beliefs after that prediction has been made. We show that in this setting, reports maximizing expected score are generally not fixed points, and we give bounds on the inaccuracy of such reports. We design scoring rules that ensure reports are accurate for binary predictions. We also show that this is impossible for predictions over more than two outcomes. Finally, we perform numerical simulations and discuss alternative notions of optimality, including performative stability, that incentivize reporting fixed points. |
16 | Justin Svegliato | Building Ethically Compliant Autonomous Systems | NA |
17 | Lawrence Chan | To what extent is Mechanistic Interpretability possible? | Mechanistic interpretability – that is, bottom-up interpretability that aims to understand the behavior of networks by starting from low-level components – has been one of the fastest growing subfields among AI Safety interested researchers. However, several serious challenges stand in the way of actually achieving this type of understanding. In this talk, I’ll give a brief overview of the field, why I think very ambitious forms of mechanistic interpretability are unlikely to be possible, and how mechanistic interpretability can nonetheless be useful for AI alignment. |
18 | Lawrence Chan | Safety evaluations and standards for AI | As AI systems become increasingly capable, there have been growing calls for third party evaluations and industry wide safety standards. In this talk, I’ll cover how ARC’s evaluations team fits into this landscape, what we’ve done in the past with OpenAI and Anthropic, and why we’re choosing to focus on dangerous capability evaluations. Time permitting, I’ll also talk a bit about what we’ve been doing recently, and what we’re hoping to do in the future. |
19 | Li Dayan | The Exploration MDP: a Formulation for RL Problems | |
20 | Luke Thorburn | language-Driven Representation for Robotics | It has become clear that there is an emerging, identity-based conflict between two groups: those most concerned about the short-term impacts of AI such as bias and corporate misuse, and those most concerned about “long term” risks of powerful AI systems taking actions that don’t align with human interests. Increasingly, this conflict appears to be undermining cooperation in response to all AI risks, both short and long term. In this talk, I use a framework used by peacebuilders in other domains — Amanda Ripley's "complicating the narrative" — to argue that this conflict is based on a false binary perspective of AI risk, and suggest concrete actions researchers can take towards de-escalation. |
21 | Mason Nakamura | Formal Composition of Robotic Systems as Contract Programs | |
22 | Micah Carroll | Characterizing Manipulation from AI Systems | I'll talk about manipulation from AI systems, why it might emerge, and how we could go about measuring it, focusing specifically on recommender systems and LLMs. |
23 | Michael K. Cohen | Regulating long-term planning agents | I assert that we can ban high-compute long-term planning agents in unknown environments at little cost. The term "long-term planning agents" is meant to exclude algorithms like imitation learning, which promote human-style planning over superhuman planning because optimality is not selected for. I assert that only long-term planning agents present a meaningful extinction risk. I will mostly be responding to potential concerns about impracticality. |
24 | Niklas Lauffer | Who Needs to Know? Minimal Knowledge for Optimal Coordination | To optimally coordinate with others in cooperative games, it is often crucial to have information about one’s collaborators: successful driving requires understanding which side of the road to drive on. However, not every feature of co-players is strategically relevant: the fine-grained acceleration of drivers may be ignored while maintaining optimal coordination. We show that there is a well-defined dichotomy between strategically relevant and irrelevant information. Moreover, we show that, in dynamic games, this dichotomy has a compact representation that can be efficiently computed via a Bellman backup operator. We apply this algorithm to analyze the degree of coordination required in a variety of fully and partially-observed environments. Empirical results show that our algorithms are significantly more efficient than baselines. |
25 | Ondrej Bajgar | Narrow Rules are Not Enough: AI Safety and Regulation through Negative Human Rights | Could negative human rights be useful in building safe AI and the associated regulation? I'll explain why they deserve a closer look. We propose to make these rights legally binding for AI systems and suggest that the best way to fulfil that on the technical side is to eventually train AI systems to assess their own behaviour according to its risk of violating human rights as they would be interpreted by courts of justice. We think negative human rights have the advantage of being (1) general, (2) widely endorsed, (3) clearly defined, and (4) already legally protected. They also have potential for coalition building, possibly providing a common language between longer-term-focused safety researchers and the AI Ethics community. |
26 | Pulkit Verma | User Driven Assessment of Adaptive Taskable AI Systems | N/A |
27 | Ryan Carey | Human Control: Definitions & Algorithms | How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. We also analyse the related concepts of non-obstruction and shutdown alignment, as well as three previously proposed algorithms for human control, and one new algorithm. |
28 | Ryan Carey | Human Control: Definitions and Algorithms | How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. We also analyse the related concepts of non-obstruction and shutdown alignment, three previously proposed algorithms for human control, and one new algorithm. |
29 | Samer Nashed | Causal Explanations for Sequential Decision Making under Uncertainty | This work presents a comprehensive framework for explaining the behavior of MDP (and similar) agents based on fundamental theories of causality. The research answers theoretical and empirical questions, including significant results from multiple user studies. This framework also allows, for the first time, a single suite of algorithms to respond to causal queries about MDPs flexibly, using multiple, semantically distinct reasons. |
30 | Shreyas Kapur | Sparse Sequential Monte Carlo using Bayesian Wake-Sleep Iteration | Humans naturally form rich intuitive theories about how the world works purely from sequential data. We observe a multi-modal stream and are able to articulate the properties of different objects in an interpretable way. We would like to build AI systems that can construct rich, interpretable models of how the world works by observing sequential data. Bayesian inference is a powerful method for updating a hypothesis from a sequential stream of evidence. However, existing Bayesian methods do not scale when the hypothesis space is high-dimensional. Deep learning methods are powerful for processing sequential data, but they perform poorly when the data is sparse or out-of-distribution. This work combines the best of both worlds to form a model-agnostic framework for sequential inference, solving the sparsity problem via a novel Bayesian wake-sleep algorithm using amortized neural kernels. |
31 | Siddharth Karamcheti | Language-Driven Representation for Robotics | Robotics is a diverse field, spanning core problems such as grasp planning, object detection, learning for control, and reward learning from humans, amongst others. Despite this diversity in learning problems, the pretrained models we employ for providing crisp priors on visual or linguistic inputs are often one-size-fits-all, either too generic to capture the features we care about for robotics (e.g., dense summary features from models trained on ImageNet or internet images), or over-specified to individual problems like learning control policies for fixed-arm manipulation. Robotics is not a single thing, and the representations we employ should be flexible and expressive enough to accelerate learning for a *broad spectrum of tasks within robotics.* We introduce Voltron -- a framework and pretrained models for language-driven visual representation learning that learns multi-scale features capturing both low-level visual features and higher-level, multi-timestep semantics. We develop an evaluation suite that spans not one, but *five* core problems in robot learning: grasp affordance prediction, language-conditioned object detection, imitation learning from limited demonstrations, language-informed policy learning, and intent/reward inference. Our pretrained models are compact and trained on fully open data sources, and they enable accelerated learning across the full spectrum of tasks compared with strong baseline representations. Time permitting, we'll also present preliminary work applying our learned representations to active learning for human-robot collaboration more broadly, showing that our pretrained models allow for capturing different types of uncertainty over visual states, language inputs, and temporally extended behaviors. |
32 | Stephen Casper | Benchmarking Interpretability Tools | There is much interest in interpretability tools, but a lack of benchmarks for them makes it more difficult to measure progress and iterate on better methods. I would be happy to present recent work in which we introduced a trojan-discovery benchmark for interpretability tools, evaluated 23 existing techniques, and introduced two novel tools for identifying bugs in AI systems. I would close by talking about what this means for the future of interpretability, debugging, and auditing work. |
33 | Sven Neth | Rational Aversion to Information | I argue that it can sometimes be rational to reject free information before making a decision. I explain how this creates problems for the view that we can ensure that AIs are aligned with human values by making them uncertain about what humans prefer and thus always willing to learn more about human preferences (as presented in "The Off-Switch Game"). If free information is not always beneficial, AIs might refuse to learn and so might not be willing to let themselves be switched off. |
34 | Toryn Q. Klassen | Epistemic Side Effects | AI safety research has investigated the problem of negative side effects – undesirable changes made by AI systems in pursuit of an underspecified objective. However, the focus has been on physical side effects, such as a robot breaking a vase while moving. Here we consider epistemic side effects, unintended changes made to the knowledge or beliefs of agents. We describe a way to avoid negative epistemic side effects in reinforcement learning, in some cases. |
35 | Vojtech Kovarik | Incentive-Aware Model Evaluation | In our quest to improve AI models, we are beginning to rely heavily on public feedback. I will argue that our current use of feedback is not strategic. In particular, I will outline several reasons why naive use of feedback might fail to produce beneficial AI, including: (1) to use a neural network metaphor, the current approach resembles using all data for training while leaving nothing aside for validation and testing; (2) the incentives of AI developers are imperfectly aligned with the public interest. I will argue that designing good mechanisms for utilizing public feedback should be viewed as an open research problem. Moreover, given the recent White House announcement of an LLM red-teaming competition at DEF CON, I believe this problem deserves increased attention. |
36 | Zhijing Jin | Trolley Problems for LLMs in 100+ Languages | I will present our latest research on the moral preferences of LLMs. Specifically, we look into moral dilemmas framed as trolley problems, investigate the alignment of LLMs across 100+ languages, and highlight the unevenness of AI alignment across languages. The results will be highly interesting to the community. |
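As a minimal illustration of the greedy-over-random-Q property described in entry 6 above, the sketch below checks, in a small tabular MDP, whether every action that is greedy with respect to the random policy's Q-function is also optimal. The toy chain MDP, helper function names, and tolerance here are made up for illustration only; they are not taken from the BRIDGE dataset or the authors' implementation.

```python
# Illustrative sketch (assumptions: tabular MDP with known dynamics; this toy
# example is not from BRIDGE or the authors' code).
import numpy as np

def random_policy_q(P, R, gamma, iters=2000):
    """Q-function of the uniformly random policy via iterative policy evaluation.
    P: (S, A, S) transition probabilities; R: (S, A) rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.mean(axis=1)        # value of the uniform random policy
        Q = R + gamma * (P @ V)   # one Bellman backup per state-action pair
    return Q

def optimal_q(P, R, gamma, iters=2000):
    """Optimal Q-function via value iteration."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * (P @ V)
    return Q

def greedy_over_random_is_optimal(P, R, gamma=0.99, tol=1e-6):
    """True iff every action that is greedy w.r.t. Q_random is also optimal."""
    Q_rand = random_policy_q(P, R, gamma)
    Q_star = optimal_q(P, R, gamma)
    for s in range(P.shape[0]):
        greedy_actions = set(np.flatnonzero(Q_rand[s] >= Q_rand[s].max() - tol))
        optimal_actions = set(np.flatnonzero(Q_star[s] >= Q_star[s].max() - tol))
        if not greedy_actions <= optimal_actions:
            return False
    return True

if __name__ == "__main__":
    # Toy 3-state chain: action 0 stays put with no reward; action 1 moves right
    # and pays reward 1 when taken in the rightmost (absorbing) state.
    S, A = 3, 2
    P = np.zeros((S, A, S))
    R = np.zeros((S, A))
    for s in range(S):
        P[s, 0, s] = 1.0                  # "stay"
        P[s, 1, min(s + 1, S - 1)] = 1.0  # "right"
    R[S - 1, 1] = 1.0
    print("Greedy w.r.t. Q_random is optimal:", greedy_over_random_is_optimal(P, R))
```

In this toy chain the check prints True, i.e., acting greedily with respect to the random policy's Q-function is already optimal; an MDP where it prints False would be one where, per entry 6, deeper lookahead (a larger effective horizon) is needed.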