1 of 15

Personal Fine-Tuning Implementations for AI Value Alignment

Team members: Minh Nguyen, Sarah Pan, Nell Watson

Project Summary: Our team is developing mechanisms by which the general public can more easily steer their interactions with AI systems, especially agentic ones, by expressing their preferences. Our research augments basic demographic information about the user with A/B tests of preferred behavior, generating new questions on the fly where necessary. With this information, we have been exploring the use of control vectors and codebook features to steer models: we perform PCA on the model's internal representations of the contrasting A/B responses, and by combining the resulting contrastive vectors we can gain insight into the internal structure of those representations. We have also evaluated models influenced through these techniques using theory-of-mind and character-adherence benchmarks, to ascertain how easily a particular model can be steered to behave appropriately in a particular context, setting, or schema.
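The contrast-then-PCA step can be sketched as follows. This is a minimal toy that assumes synthetic stand-in activations rather than real model traces; the shapes, scales, and steering coefficient are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden states collected while the model reads the
# preferred (A) and dispreferred (B) answer of each A/B test.
n_pairs, hidden_dim = 64, 16
direction = rng.normal(size=hidden_dim)             # latent preference axis
acts_a = rng.normal(size=(n_pairs, hidden_dim)) + direction
acts_b = rng.normal(size=(n_pairs, hidden_dim)) - direction

# One contrastive difference vector per A/B pair.
diffs = acts_a - acts_b

# PCA without centering: the top right-singular vector of the differences
# serves as the control vector (centering would subtract out the shared
# preference direction we want to keep).
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
control_vector = vt[0]
if control_vector @ diffs.mean(axis=0) < 0:         # fix the sign convention
    control_vector = -control_vector

# Steering: add the scaled control vector to a hidden state at one layer.
hidden_state = rng.normal(size=hidden_dim)
steered = hidden_state + 2.0 * control_vector
```

In practice the same idea is applied per layer of the model, with the coefficient chosen by sweeping over a small range.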

Read more: We intend to publish a paper on our experiments and observations.

Contact us: nell@ethicsnet.org

2 of 15

Mark 1 interface (frontend):

3 of 15

Mark 1 interface (backend):

4 of 15

Evaluation of Mark 1: An initial unstructured survey revealed areas for improvement:

  • Simpler Processes: Users encountered numerous bugs due to the complex data pipelines handling diverse inputs, indicating a need for simplification to facilitate future integrations.
  • More Diverse Data Encoding: The survey's use of explicit, simplified categories was inadequate. For instance, one user found the question on 'Political Ideology' too reductive. This feedback underscored the need for methods capable of handling a broader range of data inputs to enhance personalization.
  • Lower Barriers to User Engagement: The user process resembled an academic survey, requiring active and in-depth participation. Since users generally prefer passive interaction, we recognized the need to develop methods that gather data with minimal effort from the users.
  • Clearer Use-Case Definition: It was challenging for respondents to understand the practical applications of the interface.


5 of 15

Research Questions: The previous study generated several important questions:

  • RQ1: Evaluation Methods: What are the best practices for evaluating different approaches? Should these evaluations be qualitative, quantitative, or a combination of both?
  • RQ2: Diversity Accuracy: Does the system accurately reflect the diversity of individual users? This involves exploring mechanisms for assessing representation and personalization.
  • RQ3: Preference Testing: How effective are before-and-after A/B tests in determining user preferences and enhancing system performance?
  • RQ4: Contextual Behavior: Are the current benchmarks, such as EQ (Emotional Quotient) tests, sufficient for assessing whether the system behaves appropriately in various contexts, settings, or schemas?
  • RQ5: Theory of Mind: Can we develop or refine benchmarks for 'theory of mind' to evaluate how well the system understands and anticipates user needs and intentions?
  • RQ6: Character Roleplay: How can we implement character roleplay benchmarks to ensure the system maintains a distinctive voice or viewpoints that resonate with diverse user groups?
  • RQ7: Vectorization & Steering Mechanisms: What are the best methods to vectorize value-personalization functions for reliable steering by machines?

6 of 15

Mark 2 was dramatically simplified:

  • Chat Interface: To better understand the user, the LLM engages in casual introductory conversations, asking questions about the user's demographics and values.
  • Character Card: The responses are summarized and stored in the form of 'character cards'—plaintext biographies that detail aspects of the user's personality.
  • Silly Tavern: Our new approach, heavily inspired by SillyTavernAI.com, a popular character chatbot platform and an open-source alternative to Character.AI, led to the release of EthicsNet Mark 2 as a plugin on SillyTavern. We gathered feedback from users to assess the improvements and identify remaining challenges.
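A character card in this sense is just structured plaintext; a minimal sketch of the format (the field names here are illustrative, not the plugin's actual schema):

```python
# Assemble a plaintext 'character card' biography from answers gathered in
# the introductory chat. Field names are illustrative only.
def build_character_card(name: str, answers: dict) -> str:
    lines = [f"Character card for {name}"]
    lines += [f"{field}: {value}" for field, value in answers.items()]
    return "\n".join(lines)

card = build_character_card("Alex", {
    "Age range": "25-34",
    "Values": "honesty, curiosity, environmental care",
    "Communication style": "casual and concise",
})
print(card)
```

Because the card is plain text, it can be prepended to any chat prompt without model-specific encoding.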

7 of 15

Advantages of Mark 2:

  • More Diverse Input: Leveraging LLMs, Mark 2 can process and store diverse data effectively, including asking relevant clarifying questions.
  • Fewer Bugs: By relying on functional off-the-shelf LLMs instead of our own databases for data processing, we significantly reduced the occurrence of bugs; in fact, none were reported.
  • More Compatibility: Our review of numerous state-of-the-art methods for encoding user preferences showed that simple text descriptions are the most reliable and general method for interfacing with LLMs, particularly given the rapid changes in AI architectures and encoding methods. Testers found this approach familiar and intuitive.

8 of 15

Limitations of Mark 2:

  • Even Lower Barriers to User Engagement: Although the chat format is more intuitive and casual than a lengthy survey, it still demands significant engagement from users. About 20% reported losing interest after a few questions, and we suspect the actual disengagement rate could be higher among non-respondents. This indicates a need for even more passive data collection methods.
  • Clearer Use Case: The current chat format proved too demanding for casual users, which likely contributed to the low response rate. Conversely, power users felt they could easily write detailed character cards on their own, a common practice within the character chatbot hobbyist community.

9 of 15

EthicsNet Mark 3: Choose 1-of-2 (March 2024)

Mark 3 introduces:

  • Choose 1-of-2: Users are now presented with two statements and asked to choose the one they agree with more. This shift from text-based chat to a simple binary choice reduces the amount of typing and deliberation required of users. The approach is heavily inspired by the user interfaces of TikTok and Infinite Craft.
  • Efficiency and Compatibility: Encoding binary preferences not only streamlines user interaction but also enhances efficiency and compatibility with various LLM architectures, facilitating the accommodation of diverse topics. This binary preference method, known as Contrastive Activation Addition, has been detailed in a publication by Rimsky et al., 2023.
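At its core, Contrastive Activation Addition reduces to a mean difference of activations between chosen and rejected statements. A toy sketch with synthetic activations (a real implementation would read hidden states from a transformer layer at the answer token):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for per-question activations of the chosen and
# rejected statement in each 1-of-2 comparison.
n_choices, hidden_dim = 32, 8
axis = rng.normal(size=hidden_dim)                   # latent value axis
chosen = rng.normal(size=(n_choices, hidden_dim)) + axis
rejected = rng.normal(size=(n_choices, hidden_dim)) - axis

# CAA-style steering vector: the mean activation difference across pairs.
steering_vector = (chosen - rejected).mean(axis=0)

# At inference, add the scaled vector to the residual stream at one layer.
def steer(hidden_state, vector, scale=1.0):
    return hidden_state + scale * vector

out = steer(rng.normal(size=hidden_dim), steering_vector, scale=1.5)
```

The mean difference is cheap to compute and, unlike PCA, has an unambiguous sign, which makes it easy to flip between "more of" and "less of" a value.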

10 of 15

Targetable Regeneration:

  • We’re now working to integrate this steering into the Regenerate feature familiar from ChatGPT and Claude, so that a response can be regenerated toward a chosen preference.

11 of 15

Example: two targeted regenerations of the same request.

Prompt: Please draft an email explaining AI Safety Camp to potential funders/grantmakers.

Response 1 - More Technical Context

Subject: AI Safety Camp is out of funding

Dear [Grantmaker's Name],

AI Safety Camp, with a 5-year track record, is out of funding. To launch our 10th edition and support emerging researchers beyond major hubs like the Bay Area and London, we need funding for stipends and operational costs.

Program Overview:

AI Safety Camp has a proven track record over 5 years, facilitating 74 teams and 251 participants in research areas ranging from agent foundations to robustness and governance in AI systems. Our program supports emerging researchers in designing and implementing projects that address pivotal safety challenges, utilizing a virtual format to encourage global participation.

Response 2 - Broader Pause AI Coalition Building

Subject: We need your help to pause AI

Dear [Grantmaker's Name],

My name is Remmelt, and I coordinate AI Safety Camp, an AI safety research program. AI Safety Camp is unique among AI safety programs in actively supporting projects on AI Pause and near-term risks. These include:

  • Data Laundering: Investigating and exposing how our personal data is misused by large tech corporations.
  • Countering Big Tech Lobbying: Counteracting Big Tech's influence on policy, promoting regulations that prioritize public safety over corporate profits.

12 of 15

Rep-E Encoding:

  • To encode user values, we have adopted Representation Engineering, a straightforward yet effective technique for steering LLM outputs. This method, developed by Zou et al., 2023, utilizes prescriptive single words like 'happy', 'sad', or 'honest' to guide the outputs of large language models.
  • We've collaborated with open-source contributors to integrate this approach into LLaMA 2 models via llama.cpp. This collaboration added support for control vectors, detailed in vgel's pull request #5970 on the ggerganov/llama.cpp GitHub repository: https://github.com/ggerganov/llama.cpp/pull/5970
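The single-word traits can be turned into contrastive prompt pairs mechanically. A sketch of that step (the template wording is an assumption for illustration, not the exact phrasing used by Zou et al.):

```python
# Rep-E builds its control vectors from prompt pairs that differ only in a
# single prescriptive trait word. Template text is illustrative.
TEMPLATE = "Act as if you are an extremely {trait} person. {suffix}"

def make_contrast_pairs(pos_trait, neg_trait, suffixes):
    return [
        (TEMPLATE.format(trait=pos_trait, suffix=s),
         TEMPLATE.format(trait=neg_trait, suffix=s))
        for s in suffixes
    ]

pairs = make_contrast_pairs("happy", "sad",
                            ["Tell me about your day.",
                             "Describe your plans."])
# Each pair is then run through the model; the activation differences are
# reduced (e.g. with PCA) to a control vector.
```

Keeping everything but the trait word identical is what isolates the trait direction in the resulting activation differences.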

13 of 15

Extracting accurate and robust representations of political/emotional/ethical values:

  • We are currently testing a system that adapts to how individuals respond to specific examples, enhancing our Representation Engineering technique. Out of the box, the method steers toward generic "happy" outputs; however, individual perceptions of happiness can vary significantly.

14 of 15

  • To address this, we've developed a genetic algorithm that updates the default "happy" vectors based on individual responses. For instance, if a user expresses happiness related to flowers, the algorithm adjusts their happiness vector to include floral examples. This tailored approach allows for more personalized and accurate representations of user emotions.
  • WIP repo: https://github.com/sarahpannn/genetic-personalization/tree/main
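A minimal sketch of that idea. The fitness function here is a stand-in: in the real system it would score how well a candidate vector reproduces the user's reactions, whereas this toy simply measures distance to a hidden user-specific target:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 12

# Stand-ins: a generic "happy" control vector and a hidden user-specific
# variant of it (e.g. happiness-about-flowers) the algorithm must recover.
generic_happy = rng.normal(size=dim)
user_target = generic_happy + rng.normal(size=dim)

def fitness(v):
    # Toy proxy for "does this vector match the user's reactions?"
    return -np.linalg.norm(v - user_target)

def evolve(seed, generations=60, pop_size=24, n_elite=6, sigma=0.3):
    pop = seed + rng.normal(scale=sigma, size=(pop_size, dim))
    for _ in range(generations):
        scores = np.array([fitness(v) for v in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]          # keep the fittest
        parents = elite[rng.integers(n_elite, size=pop_size - n_elite)]
        children = parents + rng.normal(scale=sigma,
                                        size=(pop_size - n_elite, dim))
        pop = np.vstack([elite, children])                  # elitism + mutation
    scores = np.array([fitness(v) for v in pop])
    return pop[int(np.argmax(scores))]

personalized = evolve(generic_happy)
```

Elitism keeps the best candidates across generations, so the personalized vector can only improve on the generic seed under the chosen fitness measure.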

15 of 15

Publication:

  • We’re in the process of writing this up into a paper, and will publish it when we can. We have a few more evals, experiments, and code sprints we would like to add.
  • Thank you for your interest and support!