Evals Hackathon November 2023 (1)

Oct 25, 2023

Into AI Safety

This episode kicks off our first subseries, which consists of recordings from my team's meetings for the AlignmentJams Evals Hackathon in November 2023. Our team won first place, so you'll be listening to a process that, at the end of the day, turned out to be pretty good.

Check out Apart Research, the group that runs the AlignmentJams Hackathons.

Links

Links to all articles and papers mentioned in the episode are listed below, in order of appearance.

Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains
- New paper shows truthfulness & instruction-following don’t generalize by default
- Generalization Analogies Website
Discovering Language Model Behaviors with Model-Written Evaluations
- Model-Written Evals Website
OpenAI Evals GitHub
METR (previously ARC Evals)
Goodharting on Wikipedia
From Instructions to Intrinsic Human Values, a Survey of Alignment Goals for Big Models
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Shadow Alignment: The Ease of Subverting Safely Aligned Language Models
Will Releasing the Weights of Future Large Language Models Grant Widespread Access to Pandemic Agents?
Building Less Flawed Metrics, Understanding and Creating Better Measurement and Incentive Systems
EleutherAI’s Model Evaluation Harness
Evalugator Library

Last updated on Jun 17, 2024

© 2026 Kairos.fm | This work is licensed under CC BY SA 4.0