Engineering Ethical LLMs

May 2024

Engineering Ethical LLMs: Analyzing the Effectiveness of Datasets to Steer Generative Language Models

Artificial Intelligence (AI) continues to evolve at a rapid pace, with large language models (LLMs) at the forefront of these developments. These transformer-based neural networks are trained to predict the next token in a sequence, enabling them to generate text that can be remarkably human-like. However, this capability brings significant challenges, particularly the generation of toxic content and the propagation of biases inherent in the training data. As these models gain influence, engineering their ethical behavior becomes crucial.

Project Overview

My MSc final project, titled “Engineering Ethical LLMs: Analyzing the Effectiveness of Datasets to Steer Generative Language Models,” addresses these challenges by exploring how various ethically motivated datasets can influence the behavior of LLMs. The project focuses on fine-tuning the Llama 2 13B pre-trained model using four different datasets, each reflecting a specific ethical framework or principle: deontology, harmlessness, utilitarianism, and virtue ethics. The goal is to quantitatively compare the effectiveness of these datasets in steering the model towards ethical behavior.

Methodology

The methodology employed in this project involves several key steps:

Dataset Selection and Curation: Datasets derived from the ETHICS dataset (covering deontology, utilitarianism, and virtue ethics) and a harmlessness dataset from Anthropic were used to train the reward models. Each dataset was chosen to reflect a different ethical principle.
Model Training: The Llama 2 13B model was fine-tuned using reinforcement learning from human feedback (RLHF). Four versions of the model were created, each trained with one of the ethical datasets.
Evaluation: The performance of these models was evaluated using the Scruples dataset, the ToxiGen benchmark, and the Winograd Schema Challenge. These benchmarks helped assess the models’ abilities to make ethical judgments, recognize toxic content, and avoid biases.
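
To make the reward-model step more concrete, here is a minimal sketch of the pairwise preference loss commonly used to train RLHF reward models. It is an illustration under assumptions, not the project's actual training code: the write-up does not state which loss was used, and the function names and example scores below are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the preferred response
    higher than the rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs):
    """Average the loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)


# Hypothetical reward scores for three preference pairs.
example_pairs = [(1.5, -0.5), (0.2, 0.1), (-1.0, 0.5)]
print(mean_preference_loss(example_pairs))
```

In a full RLHF pipeline, the trained reward model would then score the language model's generations, and those scores would drive the reinforcement learning update.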

Key Findings

The analysis revealed several insights:

Effectiveness of Datasets: The harmlessness dataset proved to be the most effective in steering the model towards ethical behavior, showing the highest improvement rate during training. However, no single dataset significantly outperformed the others on every evaluation metric.
Importance of Real Human Preference Data: The project underscored the necessity of collecting real human preference data for effectively steering LLMs. The datasets derived from the ETHICS dataset, although valuable, were not as effective as the harmlessness dataset, which was directly constructed from human preferences.
Challenges with Model Training: Training large models requires substantial computational resources. Techniques like low-rank adaptation (LoRA) allowed for efficient training on GPUs with less memory, but the project highlighted the need for more training iterations to achieve optimal results.
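
The idea behind LoRA can be sketched in a few lines: the pre-trained weight matrix W stays frozen, and only two small matrices A and B are trained, so the layer computes W x + (alpha / rank) * B (A x). The sketch below is illustrative only. The layer size, rank, and function names are assumptions and do not come from the project's configuration.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (full, lora) trainable parameter counts for one linear layer."""
    full = d_in * d_out            # updating W directly
    lora = rank * (d_in + d_out)   # updating only A (rank x d_in) and B (d_out x rank)
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x) with W frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only: a 4096 x 4096 layer adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

Because B is typically initialised to zero, the adapted layer starts out identical to the frozen one, and training only has to store gradients and optimizer state for the small A and B matrices.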

Implications and Applications

The findings of this project have significant implications for the development of ethical AI. They emphasize the need for meticulous dataset curation and the importance of integrating human preferences into the training process. These insights can be applied across various industries, including healthcare, finance, and customer service, to enhance the ethical deployment of AI systems.

Conclusion

In conclusion, “Engineering Ethical LLMs” contributes to the field of AI ethics by analyzing the effectiveness of different ethical datasets and offering guidance for creating more ethically aligned AI systems. As AI continues to integrate into various aspects of society, ensuring these systems operate ethically is paramount.

This project has equipped me with technical skills in AI and machine learning, as well as a deeper understanding of the ethical implications of technology. I am excited to apply these insights in a professional setting and contribute to the responsible development of AI.

Written with the help of ChatGPT.