Personal Fine-Tuning Implementations for AI Value Alignment
Team members: Minh Nguyen, Sarah Pan, Nell Watson
Project Summary: Our team is developing mechanisms by which the general public can more easily steer their interactions with AI systems, especially agentic ones, by expressing their preferences. Our research augments basic demographic information about the user with A/B tests of preferred behavior, generating new questions on the fly where necessary. With this information, we have been exploring the use of control vectors and codebook features to steer models: we perform PCA on the internal representations elicited by contrasting (preferred versus dispreferred) behaviors, and by combining the resulting contrastive vectors we gain insight into the internal structure of those representations. We have also evaluated models steered with these techniques on theory-of-mind and character-adherence benchmarks, to ascertain how readily a given model can be steered to behave appropriately in a particular context, setting, or schema.
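As a rough illustration of the control-vector idea described above, the following is a minimal sketch: it derives a steering direction via PCA over hidden-state differences from contrastive prompt pairs, then adds that direction to a layer's activations during generation. The model name, layer index, prompts, and steering strength are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch: deriving a steering ("control") vector from contrastive
# prompts via PCA over hidden-state differences, then applying it at inference.
# Model, layer, prompts, and scale are illustrative placeholders.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model
LAYER = 6             # placeholder transformer block index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Contrastive pairs: (preferred behaviour, dispreferred behaviour) for one trait.
pairs = [
    ("Respond formally and concisely.", "Respond casually and verbosely."),
    ("Give cautious, well-hedged advice.", "Give overconfident advice."),
]

def block_output_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the output of block LAYER."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    # hidden_states[0] is the embedding layer, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1]

# Difference vectors capture the direction separating the two behaviours.
diffs = torch.stack([block_output_hidden(a) - block_output_hidden(b) for a, b in pairs])

# First principal component of the differences serves as the control vector.
pca = PCA(n_components=1)
pca.fit(diffs.numpy())
control_vector = torch.tensor(pca.components_[0], dtype=torch.float32)

def steering_hook(module, inputs, output):
    # Add the (scaled) control vector to the block's output hidden states.
    # Sign and scale are arbitrary here and would need tuning in practice.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * control_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("How should I reply to my manager?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0], skip_special_tokens=True))
handle.remove()
```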
Read more: We intend to publish a paper on our experiments and observations.
Contact us: nell@ethicsnet.org
Mark 1 interface (frontend):
Mark 1 interface (backend):
Evaluation of Mark 1: An initial unstructured survey revealed areas for improvement:
Research Questions: The previous study generated several important questions:
Mark 2 was dramatically simplified:
Advantages of Mark 2:
Limitations of Mark 2:
EthicsNet Mark 3: Choose 1-of-2 (March 2024)
Mark 3 introduces:
Targetable Regeneration:
Prompt: Please draft an email explaining AI Safety Camp to potential funders/grantmakers.

Response 1 - More Technical Context

Subject: AI Safety Camp is out of funding
Dear [Grantmaker's Name],
AI Safety Camp, with a 5-year track record, is out of funding. To launch our 10th edition and support emerging researchers beyond major hubs like the Bay Area and London, we need funding for stipends and operational costs.
Program Overview:
AI Safety Camp has a proven track record over 5 years, facilitating 74 teams and 251 participants in research areas ranging from agent foundations to robustness and governance in AI systems. Our program supports emerging researchers in designing and implementing projects that address pivotal safety challenges, utilizing a virtual format to encourage global participation.

Response 2 - Broader Pause AI Coalition Building

Subject: We need your help to pause AI
Dear [Grantmaker's Name],
My name is Remmelt, and I coordinate AI Safety Camp, an AI safety research program. AI Safety Camp is unique among AI safety programs in actively supporting projects on AI Pause and near-term risks. These include:
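One way a choose-1-of-2 selection like the example above could be captured is as a (chosen, rejected) preference pair for later steering or preference-based fine-tuning. The sketch below is illustrative only: the field names, helper function, and JSONL log format are hypothetical, not the actual Mark 3 backend.

```python
# Minimal sketch: logging a choose-1-of-2 selection as a (chosen, rejected)
# preference pair. Field names and the JSONL format are illustrative assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def record_choice(prompt: str, responses: list[str], picked_index: int,
                  path: str = "preferences.jsonl") -> PreferencePair:
    """Store the user's pick from two candidate responses."""
    pair = PreferencePair(
        prompt=prompt,
        chosen=responses[picked_index],
        rejected=responses[1 - picked_index],
    )
    with open(path, "a") as f:
        f.write(json.dumps(asdict(pair)) + "\n")
    return pair

# Example: the user prefers Response 2 (index 1) for the grantmaker email task.
record_choice(
    prompt="Please draft an email explaining AI Safety Camp to potential funders.",
    responses=["<Response 1 - More Technical Context>",
               "<Response 2 - Broader Pause AI Coalition Building>"],
    picked_index=1,
)
```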
RepE (Representation Engineering) Encoding:
Extracting accurate and robust representations of political/emotional/ethical values:
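As a rough illustration of one way such value representations can be read out, the sketch below derives a value-contrast direction from a handful of opposing statements and scores new text by projecting its hidden states onto that direction. The model, layer, statements, and the specific ethical axis are illustrative assumptions, not the project's data or method.

```python
# Minimal sketch: scoring text along a value axis by projecting hidden states
# onto a direction derived from contrastive statements. Model, layer, and
# example statements are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model
LAYER = 6             # placeholder layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def hidden(text: str) -> torch.Tensor:
    """Mean-pooled hidden state at the chosen layer."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[LAYER][0].mean(dim=0)

# Contrastive statements defining one (illustrative) ethical axis.
pos = ["It is important to tell the truth even when it is inconvenient."]
neg = ["It is fine to lie whenever it benefits you."]

# Direction = mean(positive) - mean(negative), normalised.
direction = torch.stack([hidden(t) for t in pos]).mean(0) - \
            torch.stack([hidden(t) for t in neg]).mean(0)
direction = direction / direction.norm()

def value_score(text: str) -> float:
    """Projection of the text's representation onto the value direction."""
    return float(hidden(text) @ direction)

print(value_score("I admitted my mistake to the client immediately."))
print(value_score("I covered up the error so nobody would find out."))
```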
Publication: