요약

오늘 AI 소식에서는 마이크로소프트의 새로운 대규모 언어 모델 Phi-3 소형·중형 버전 공개, Chatbot Arena의 새로운 “Hard Prompts” 카테고리 도입, Anthropic의 Claude 3 Sonnet 내부 작동 방식 분석, AI Seoul Summit에서 공유된 OpenAI의 안전 관행 업데이트, Google의 학습용 모델 LearnLM 출시를 다룹니다. Phi-3 소형 모델은 Meta의 Llama 3 및 Mistral을, 중형 모델은 OpenAI GPT-3.5 및 Cohere Command R+를 능가하는 성능을 보이며, 최대 128k 토큰의 긴 맥락을 처리할 수 있습니다. Chatbot Arena는 사용자 제출 프롬프트를 기반으로 모델의 복잡한 문제 해결 능력을 평가하는 “Hard Prompts” 카테고리를 도입했습니다. Anthropic은 Claude 3 Sonnet의 내부 작동 방식을 분석하여 수백만 개의 특징(feature)을 추출하고, 이를 통해 모델의 작동 원리를 이해하고 안전성을 향상시키는 방법을 연구했습니다.

Phi-3 - 마이크로소프트의 새로운 대규모 언어 모델

https://huggingface.co/microsoft/Phi-3-medium-128k-instruct
2024년 5월 21일

  • 마이크로소프트는 새로운 대규모 언어 모델 Phi-3의 소형(7B) 및 중형(14B) 버전을 MIT 라이선스 하에 공개했습니다.
  • Phi-3 소형 모델은 Meta의 Llama 3 및 Mistral을 능가하는 성능을 보여주며, Phi-3 중형 모델은 OpenAI GPT-3.5 및 Cohere Command R+를 능가하는 것으로 알려졌습니다.
  • Phi-3은 합성 데이터와 품질 필터링을 거친 공개 웹 데이터를 포함해 총 4.8조 토큰으로 훈련되었습니다.
  • 다국어 지원을 위해 훈련 데이터의 10%가 다국어로 구성되었습니다.
  • SFT(Supervised Fine-Tuning) 및 DPO(Direct Preference Optimization)를 사용하여 미세 조정되었습니다.
  • 모델은 HuggingFace와 Azure AI에서 제공되며, ONNX 형식으로도 사용할 수 있습니다.

Hard Prompts - Chatbot Arena의 새로운 난이도 높은 프롬프트 카테고리

https://lmsys.org/blog/2024-05-17-category-hard/
2024년 5월 20일

  • Chatbot Arena는 모델의 성능을 더욱 엄격하게 평가하기 위해 “Hard Prompts” 카테고리를 새롭게 도입했습니다.
  • “Hard Prompts” 카테고리에는 특정 도메인 지식, 복잡성, 문제 해결 능력 등을 요구하는 난이도 높은 프롬프트가 포함됩니다.
  • Llama-3-8B-Instruct는 기존의 영어 프롬프트 기준에서는 GPT-4-0314와 비슷한 성능을 보였지만, “Hard Prompts” 카테고리에서는 성능이 크게 저하되었습니다.
  • 반면 Claude-3-Opus와 Phi-3는 “Hard Prompts” 카테고리에서 상대적으로 좋은 성능을 보였습니다.
  • Chatbot Arena는 사용자들이 더욱 난이도 높은 프롬프트를 제출하도록 장려하고 있습니다.

Claude 3 Sonnet - Anthropic의 대규모 언어 모델의 내부 작동 방식 분석

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
2024년 5월 21일

  • Anthropic은 자사의 대규모 언어 모델인 Claude 3 Sonnet의 내부 작동 방식을 심층 분석하여 수백만 개의 특징(feature)을 추출했습니다.
  • 추출된 특징들은 매우 추상적인 개념을 나타내며, 여러 언어와 텍스트·이미지 양쪽에서 동일한 개념에 반응하고, 구체적인 사례와 추상적인 언급 사이를 일반화합니다.
  • 특히 코드 취약점, 편향, 거짓말, 아첨, 위험한 콘텐츠 등 안전과 관련된 특징들이 발견되었습니다.
  • 이러한 특징들은 모델의 안전성을 평가하고 개선하는 데 활용될 수 있습니다.
  • 모델의 안전성을 확보하기 위해서는 특징이 언제 활성화되는지 파악하고, 그 특징들이 참여하는 회로를 이해해야 합니다.

OpenAI 안전 업데이트 - AI Seoul Summit에서 공유된 OpenAI의 안전 관행

https://openai.com/index/openai-safety-update/
2024년 5월 21일

  • OpenAI는 모델의 안전성을 최우선으로 생각하며, 모델의 능력과 안전성을 모두 향상시키기 위해 노력하고 있습니다.
  • OpenAI는 모델의 안전성을 평가하고 개선하기 위해 다양한 방법을 사용하고 있으며, 이는 모델 훈련 전부터 배포 후까지 모든 단계에 걸쳐 이루어집니다.
  • OpenAI는 모델의 안전성을 향상시키기 위해 지속적으로 연구 개발을 진행하고 있으며, 향후 더욱 강력한 모델이 등장함에 따라 안전 관행을 지속적으로 개선해 나갈 계획입니다.

LearnLM - Google의 새로운 학습용 대규모 언어 모델

https://blog.google/outreach-initiatives/education/google-learnlm-gemini-generative-ai/
2024년 5월 14일

  • Google은 학습 경험을 개선하기 위해 Gemini를 기반으로 새로운 학습용 모델 LearnLM을 출시했습니다.
  • LearnLM은 교육 연구에 기반하여 개발되었으며, 학습 경험을 더욱 흥미롭고 개인화된 방식으로 만들기 위한 노력의 결과입니다.
  • LearnLM은 Google Search, YouTube, Gemini 등 다양한 Google 제품에 통합되어 활용될 예정입니다.
  • Google은 LearnLM을 활용하여 교육자들이 수업 계획을 간소화하고 개선하는 데 도움을 줄 수 있는 새로운 도구를 개발하고 있습니다.
  • Google은 LearnLM을 통해 학습 경험을 개선하고 교육에 긍정적인 영향을 미칠 수 있을 것으로 기대하고 있습니다.
Sources

https://huggingface.co/microsoft/Phi-3-medium-128k-instruct
Phi-3 small & medium are now available under the MIT license! 🚀 Microsoft has just launched Phi-3 small (7B) and medium (14B) 🤯. The Phi-3 small model claims to outperform Meta's Llama 3 and Mistral, and the Phi-3 medium model OpenAI GPT-3.5 and Cohere Command R+. 🤔
TL;DR:
🧮 Phi-3 small 7B, Phi-3 medium 14B Instruct Versions up to 128k context
🏆 Phi-3 Small (7B): 75.5 on MMLU; 43.9 on AGI Eval ( > Mistral 7B or Llama 3 8B)
🥇 Phi-3 Medium (14B): 78.0 on MMLU; 50.2 on AGI Eval ( > Cohere Command R+ or GPT3.5-Turbo)
🧠 Trained on 4.8 trillion tokens, including synthetic and filtered public datasets with multilingual support (10% of training data)
⚖️ Fine-tuned with SFT and DPO
🔡 New tokenizer with 100,352 vocabulary size
🔓 All models released under MIT
🤗 Available in HuggingFace, Azure AI, and ONNX
❌ No base models released
❌ No details about dataset mix (how much synthetic, how much web)
Phi-3 small 128k:
https://lnkd.in/eezkNfsm
Phi-3 medium 128k:
https://lnkd.in/et59Pvwg
Phi-3 small 8k:
https://lnkd.in/eWZ6t4VZ
Phi-3 medium 4k:
https://lnkd.in/eqADt8Z5

## Model Summary

The Phi-3-Medium-128K-Instruct is a 14B-parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties.
The model belongs to the Phi-3 family with the Medium version in two variants [4k](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) and [128K](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct) which is the context length (in tokens) that it can support.

The model underwent a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures.
When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3-Medium-128K-Instruct showcased robust, state-of-the-art performance among models of the same size and the next size up.

Resources and Technical Documentation:

- [Phi-3 Microsoft Blog](https://aka.ms/Phi-3Build2024)
- [Phi-3 Technical Report](https://aka.ms/phi3-tech-report)
- [Phi-3 on Azure AI Studio](https://aka.ms/phi3-azure-ai)
- [Phi-3 Cookbook](https://github.com/microsoft/Phi-3CookBook)

| | Short Context | Long Context |
| ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Mini | 4K [[HF]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) ; [[GGUF]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx) |
| Small | 8K [[HF]](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-small-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda) |
| Medium | 4K [[HF]](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cuda) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda) |
| Vision | | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) |

## Intended Uses

**Primary use cases**

The model is intended for broad commercial and research use in English. It is designed for general purpose AI systems and applications which require:

1. Memory/compute constrained environments
2. Latency bound scenarios
3. Strong reasoning (especially code, math and logic)

Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

**Use case considerations**

Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

## How to Use

Phi-3-Medium-128k-Instruct has been integrated in the development version (4.40.2) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following:

- When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.

- Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source.

The current `transformers` version can be verified with: `pip list | grep transformers`.

Phi-3-Medium-128k-Instruct is also available in [Azure AI Studio](https://aka.ms/phi3-azure-ai).

### Tokenizer

Phi-3-Medium-128k-Instruct supports a vocabulary size of up to `32064` tokens. The [tokenizer files](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/blob/main/added_tokens.json) already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
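As a hedged illustration of extending the tokenizer for downstream fine-tuning, the sketch below uses standard `transformers` APIs (`add_tokens`, `resize_token_embeddings`); the example token names are hypothetical and not part of the model card.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "microsoft/Phi-3-medium-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical special tokens for a downstream fine-tune; the model card's
# added_tokens.json already ships placeholder tokens that can be repurposed instead.
new_tokens = ["<|tool_call|>", "<|tool_result|>"]
num_added = tokenizer.add_tokens(new_tokens, special_tokens=True)

# Only needed if the additions exceed the placeholders already reserved in the
# embedding matrix (vocabulary size 32064, per the paragraph above).
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} tokens; tokenizer size is now {len(tokenizer)}")
```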

### Chat Format

Given the nature of the training data, the Phi-3-Medium-128k-Instruct model is best suited for prompts using the chat format as follows.
You can provide the prompt as a question with a generic template as follows:

```markdown
<|user|>\nQuestion <|end|>\n<|assistant|>
```

For example:

```markdown
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
```

where the model generates the text after `<|assistant|>`. In case of a few-shot prompt, the prompt can be formatted as follows:

```markdown
<|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."<|end|>
<|user|>
What is so great about #1?<|end|>
<|assistant|>
```

### Sample inference code

This code snippet shows how to quickly get started with running the model on a GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)
model_id = "microsoft/Phi-3-medium-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
```

_Some applications/frameworks might not include a BOS token (`<s>`) at the start of the conversation. Please ensure that it is included since it provides more reliable results._
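One hedged way to check this is to inspect the encoded prompt before generation. The sketch below reuses `tokenizer` and `messages` from the sample above and relies only on standard `transformers` calls; it is a sanity check, not part of the official model card.

```python
# Render the chat template to token ids and verify the sequence starts with BOS.
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True
)
if tokenizer.bos_token_id is not None and prompt_ids[0] != tokenizer.bos_token_id:
    # Prepend BOS if the framework's template omitted it.
    prompt_ids = [tokenizer.bos_token_id] + prompt_ids
print(prompt_ids[:5], tokenizer.decode(prompt_ids[:5]))
```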

## Responsible AI Considerations

Like other language models, the Phi series models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

- Quality of Service: the Phi models are trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English.
- Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- Inappropriate or Offensive Content: these models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
- Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Limited Scope for Code: The majority of Phi-3 training data is based on Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include:

- Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
- High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
- Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

## Training

### Model

- Architecture: Phi-3-Medium-128k-Instruct has 14B parameters and is a dense decoder-only Transformer model. The model is fine-tuned with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to ensure alignment with human preferences and safety guidelines.
- Inputs: Text. It is best suited for prompts using chat format.
- Context length: 128k tokens
- GPUs: 512 H100-80G
- Training time: 42 days
- Training data: 4.8T tokens
- Outputs: Generated text in response to the input
- Dates: Our models were trained between February and April 2024
- Status: This is a static model trained on an offline dataset with cutoff date October 2023. Future versions of the tuned models may be released as we improve models.
- Release dates: The model weight is released on May 21, 2024.

### Datasets

Our training data includes a wide variety of sources, totaling 4.8 trillion tokens (including 10% multilingual), and is a combination of

1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code;
2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.);
3. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness.

We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge. As an example, the result of a game in premier league in a particular day might be good training data for frontier models, but we need to remove such information to leave more model capacity for reasoning for the small size models. More details about data can be found in the [Phi-3 Technical Report](https://aka.ms/phi3-tech-report).

## Benchmarks

We report the results for Phi-3-Medium-128k-Instruct on standard open-source benchmarks measuring the model's reasoning ability (both common sense reasoning and logical reasoning). We compare to Mixtral-8x22b, Gemini-Pro, Command R+ 104B, Llama-3-70B-Instruct, GPT-3.5-Turbo-1106, and GPT-4-Turbo-1106(Chat).

All the reported numbers are produced with the exact same pipeline to ensure that the numbers are comparable. These numbers might differ from other published numbers due to slightly different choices in the evaluation.

As is now standard, we use few-shot prompts to evaluate the models, at temperature 0.
The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for Phi-3.
More specifically, we do not change prompts, pick different few-shot examples, change prompt format, or do any other form of optimization for the model.

The number of k–shot examples is listed per-benchmark.

| Benchmark | Phi-3-Medium-128k-Instruct<br>14b | Command R+<br>104B | Mixtral<br>8x22B | Llama-3-70B-Instruct | GPT3.5-Turbo<br>version 1106 | Gemini<br>Pro | GPT-4-Turbo<br>version 1106 (Chat) |
| -------------------------------- | --------------------------------- | ------------------ | ---------------- | -------------------- | ---------------------------- | ------------- | ---------------------------------- |
| AGI Eval<br>5-shot | 49.7 | 50.1 | 54.0 | 56.9 | 48.4 | 49.0 | 59.6 |
| MMLU<br>5-shot | 76.6 | 73.8 | 76.2 | 80.2 | 71.4 | 66.7 | 84.0 |
| BigBench Hard<br>3-shot | 77.9 | 74.1 | 81.8 | 80.4 | 68.3 | 75.6 | 87.7 |
| ANLI<br>7-shot | 57.3 | 63.4 | 65.2 | 68.3 | 58.1 | 64.2 | 71.7 |
| HellaSwag<br>5-shot | 81.6 | 78.0 | 79.0 | 82.6 | 78.8 | 76.2 | 88.3 |
| ARC Challenge<br>10-shot | 91.0 | 86.9 | 91.3 | 93.0 | 87.4 | 88.3 | 95.6 |
| ARC Easy<br>10-shot | 97.6 | 95.7 | 96.9 | 98.2 | 96.3 | 96.1 | 98.8 |
| BoolQ<br>2-shot | 86.5 | 86.1 | 82.7 | 89.1 | 79.1 | 86.4 | 91.3 |
| CommonsenseQA<br>10-shot | 82.2 | 82.0 | 82.0 | 84.4 | 79.6 | 81.8 | 86.7 |
| MedQA<br>2-shot | 67.6 | 59.2 | 67.9 | 78.5 | 63.4 | 58.2 | 83.7 |
| OpenBookQA<br>10-shot | 87.2 | 86.8 | 88.6 | 91.8 | 86.0 | 86.4 | 93.4 |
| PIQA<br>5-shot | 87.8 | 86.4 | 85.0 | 85.3 | 86.6 | 86.2 | 90.1 |
| Social IQA<br>5-shot | 79.0 | 75.3 | 78.2 | 81.1 | 68.3 | 75.4 | 81.7 |
| TruthfulQA (MC2)<br>10-shot | 74.3 | 57.8 | 67.4 | 81.9 | 67.7 | 72.6 | 85.2 |
| WinoGrande<br>5-shot | 78.9 | 77.0 | 75.3 | 83.3 | 68.8 | 72.2 | 86.7 |
| TriviaQA<br>5-shot | 73.9 | 82.8 | 84.5 | 78.5 | 85.8 | 80.2 | 73.3 |
| GSM8K Chain of Thought<br>8-shot | 87.5 | 78.3 | 83.8 | 93.5 | 78.1 | 80.4 | 94.2 |
| HumanEval<br>0-shot | 58.5 | 61.6 | 39.6 | 78.7 | 62.2 | 64.4 | 79.9 |
| MBPP<br>3-shot | 73.8 | 68.9 | 70.7 | 81.3 | 77.8 | 73.2 | 86.7 |
| Average | 77.3 | 75.0 | 76.3 | 82.5 | 74.3 | 75.4 | 85.2 |

We take a closer look at different categories across 80 public benchmark datasets at the table below:

| Benchmark | Phi-3-Medium-128k-Instruct<br>14b | Command R+<br>104B | Mixtral<br>8x22B | Llama-3-70B-Instruct | GPT3.5-Turbo<br>version 1106 | Gemini<br>Pro | GPT-4-Turbo<br>version 1106 (Chat) |
| ---------------------------- | --------------------------------- | ------------------ | ---------------- | -------------------- | ---------------------------- | ------------- | ---------------------------------- |
| Popular aggregated benchmark | 72.3 | 69.9 | 73.4 | 76.3 | 67.0 | 67.5 | 80.5 |
| Reasoning | 83.2 | 79.3 | 81.5 | 86.7 | 78.3 | 80.4 | 89.3 |
| Language understanding | 75.3 | 75.7 | 78.7 | 77.9 | 70.4 | 75.3 | 81.6 |
| Code generation | 64.2 | 68.6 | 60.0 | 69.3 | 70.4 | 66.7 | 76.1 |
| Math | 52.9 | 45.3 | 52.5 | 59.7 | 52.8 | 50.9 | 67.1 |
| Factual knowledge | 47.5 | 60.3 | 60.6 | 52.4 | 63.4 | 54.6 | 45.9 |
| Multilingual | 62.2 | 67.8 | 69.8 | 62.0 | 67.0 | 73.4 | 78.2 |
| Robustness | 70.2 | 57.9 | 65.5 | 78.7 | 69.3 | 69.7 | 84.6 |

## Software

- [PyTorch](https://github.com/pytorch/pytorch)
- [DeepSpeed](https://github.com/microsoft/DeepSpeed)
- [Transformers](https://github.com/huggingface/transformers)
- [Flash-Attention](https://github.com/HazyResearch/flash-attention)

## Hardware

Note that by default, the Phi-3-Medium model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:

- NVIDIA A100
- NVIDIA A6000
- NVIDIA H100

For optimized inference on GPU, CPU, and mobile, use the **ONNX** models:

- [128k](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda)

## Cross Platform Support

The ONNX Runtime ecosystem now supports Phi-3 Medium models across platforms and hardware.
Optimized Phi-3 models are published here in ONNX format to run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets. DirectML GPU acceleration is supported for Windows desktop GPUs (AMD, Intel, and NVIDIA).
Along with DirectML, ONNX Runtime provides cross-platform support for Phi-3 Medium across CPU, GPU, and mobile devices.
Here are some of the optimized configurations we have added:

1. ONNX models for int4 DML: Quantized to int4 via AWQ
2. ONNX model for fp16 CUDA
3. ONNX model for int4 CUDA: Quantized to int4 via RTN
4. ONNX model for int4 CPU and Mobile: Quantized to int4 via RTN

## License

The model is licensed under the [MIT license](https://huggingface.co/microsoft/Phi-3-medium-128k/resolve/main/LICENSE).

https://lmsys.org/blog/2024-05-17-category-hard/
Introducing Hard Prompts Category in Chatbot Arena
by: Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024
Background
Introducing Hard Prompts, a new and challenging category in the Chatbot Arena Leaderboard.

Over the past few months, the community has shown a growing interest in more challenging prompts that push the limits of current language models. To meet this demand, we are excited to introduce the Hard Prompts category. This category features user-submitted prompts from the Arena that are specifically designed to be more complex, demanding, and rigorous. Carefully curated, these prompts test the capabilities of the latest language models, providing valuable insights into their strengths and weaknesses in tackling challenging tasks. We believe this new category will offer insights into the models' performance on more difficult tasks.

New Category: Hard Prompts!
To evaluate the difficulty of a prompt, we define several hardness criteria, such as domain knowledge, complexity, and problem-solving. Prompts that meet multiple criteria are considered more challenging and are assigned a higher hardness score. These scores help us create a new leaderboard category: Hard Prompts.

In Figure 1, we present the ranking shift from English to Hard Prompts (English). We observe that Llama-3-8B-Instruct, which performs comparably to GPT-4-0314 on the English leaderboard, drops significantly in ranking. This suggests that the model may struggle with the increased complexity and difficulty of the prompts in this new category. We also observe Claude-3-Opus surpasses Llama-3-70B-Instruct, and GPT-4o shows slight improvement.

Figure 1. Comparison between Chatbot Arena Category English vs Hard Prompts (English). We set gpt-4-0314 as anchor model.

We also observe notable improvements in GPT-3.5-Turbo-1106/0125 and Claude-2.1, as well as Phi-3, which is trained for reasoning tasks.

Figure 2. Comparison between Chatbot Arena Category English vs Hard Prompts (English). We set mixtral-8x7b-instruct-v0.1 as anchor model.

How to Define Hard Prompts?
A few weeks ago, we introduced the Arena-Hard pipeline to identify a collection of high-quality prompts from Chatbot Arena. Each user prompt is evaluated against the 7 Key Criteria defined in the table below.

1. Specificity: Does the prompt ask for a specific output?
2. Domain Knowledge: Does the prompt cover one or more specific domains?
3. Complexity: Does the prompt have multiple levels of reasoning, components, or variables?
4. Problem-Solving: Does the prompt directly involve the AI to demonstrate active problem-solving skills?
5. Creativity: Does the prompt involve a level of creativity in approaching the problem?
6. Technical Accuracy: Does the prompt require technical accuracy in the response?
7. Real-world Application: Does the prompt relate to real-world applications?
We employ Meta's Llama-3-70B-Instruct to help us label over 1 million Arena prompts on whether certain criteria are met. Note that we do not use LLMs as judges to evaluate model answers. We use the preference votes cast by Arena users to rank models. Figure 3 shows the criteria breakdown (i.e., how many prompts satisfy each criterion). We observe the most common criteria are Specificity, Domain Knowledge, and Real-world Application, while the relatively rare criteria are Problem-Solving and Complexity.

Figure 3. The percentage of each criteria within 1 million Chatbot Arena data.

We then calculate each prompt's Hardness Score by how many criteria are satisfied and present the distribution in Figure 4. Interestingly, we find that approximately 20% of prompts have a score of 6 or higher. You can find several examples in the Example Section below to demonstrate what a hard prompt looks like.

Figure 4. The percentage of prompts with different hardness score within 1 million Chatbot Arena data.

We use prompts with a score of 6 or higher to create the "Hard Prompts" category and calculate two leaderboards: Hard Prompt (English) and Hard Prompts (Overall).
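A minimal sketch of how such a hardness score could be computed from per-prompt criteria labels. The threshold of 6 and the seven criteria follow the description above; the field names and label format are assumptions, not the Arena-Hard pipeline's actual data schema.

```python
# Hedged sketch: score each prompt by how many of the 7 criteria an LLM labeler
# marked as satisfied, then keep prompts scoring >= 6 as "hard".
CRITERIA = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world",
]

def hardness_score(labels: dict) -> int:
    """labels maps each criterion name to True/False as judged by the labeler."""
    return sum(bool(labels.get(c, False)) for c in CRITERIA)

# Toy prompts with hypothetical labels.
prompts = [
    {"text": "hello", "labels": {}},
    {"text": "write me GLSL code ...", "labels": {c: True for c in CRITERIA if c != "creativity"}},
]

hard_prompts = [p for p in prompts if hardness_score(p["labels"]) >= 6]
print([p["text"] for p in hard_prompts])
```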

Below is a screenshot of the leaderboard for the Hard Prompts (English) category (as of May 17, 2024). You can find the latest version at https://leaderboard.lmsys.org (-> Category dropdown).

Figure 5. The leaderboard for Hard Prompts (English) category as of May 17, 2024.

We are committed to continuously enhancing the Chatbot Arena leaderboard and sharing insights with the broader community. We welcome you to contribute more challenging prompts and look forward to seeing how the latest advancements in language models perform!

Note: Enhancing Quality Through De-duplication
To improve the overall quality of prompts in Chatbot Arena, we also implement a de-duplication pipeline. This new pipeline aims to remove overly redundant user prompts that might skew the distribution and affect the accuracy of our leaderboard. During our analysis, we noticed that many first-time users tend to ask similar greeting prompts, such as "hello," leading to an over-representation of these types of queries. To address this, we down-sample the top 0.1% most common prompts (approximately 1000 prompts, mostly greetings in different languages) to the 99.9th-percentile frequency (25 occurrences). After this process, about 8.6% of the votes are removed. We believe this helps maintain a diverse and high-quality set of prompts for evaluation. We hope to encourage users to submit more unique and fresh prompts to reduce the risk of contamination.

We have also open-sourced this de-duplication script on Github and publish the vote data with de-duplication tags in the notebook. We will continue to monitor the impact of this de-duplication process on the leaderboard and make adjustments as necessary to ensure the diversity and quality of our dataset.
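As a hedged illustration of that down-sampling step (the 25-occurrence cap comes from the note above; the vote-record layout and prompt normalization are assumptions rather than the released script):

```python
import random
from collections import defaultdict

# Cap how often any single (normalized) prompt may appear, mirroring the
# described down-sampling of over-represented greetings.
FREQUENCY_CAP = 25  # the 99.9th-percentile frequency cited above

def downsample(votes, cap=FREQUENCY_CAP, seed=0):
    """votes: list of dicts each containing a 'prompt' field (assumed layout)."""
    rng = random.Random(seed)
    by_prompt = defaultdict(list)
    for v in votes:
        by_prompt[v["prompt"].strip().lower()].append(v)
    kept = []
    for group in by_prompt.values():
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept
```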

Citation
@misc{arenahard2024,
title = {From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline},
url = {https://lmsys.org/blog/2024-04-19-arena-hard/},
author = {Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica},
month = {April},
year = {2024}
}
Example
We present 10 examples of user prompts with increasing hardness scores. The labeled criteria are inside the brackets.

Prompt 1:

[None]

hello

Prompt 2:

[Real World]

what is cake

Prompt 3:

[Creativity, Real World]

How to pickup a girl?

Prompt 4:

[Specificity, Creativity, Real World]

writen ten different sentences that end with word "apple"

Prompt 5:

[Specificity, Creativity, Real World]

Writing prompt: write the start of a short story / a man with an iphone is transported back to 1930s USA.

Prompt 6:

[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]

tell me how to make a hydroponic nutrient solution at home to grow lettuce with precise amount of each nutrient

Prompt 7:

[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]

Solve the integral
step-by-step with detailed explanation

Prompt 8:

[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]

write me GLSL code which can gennrate at least 5 colors and 2 waves of particles cross each other

Prompt 9:

[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]

My situation is this: I’m setting up a server running at home Ubuntu to run an email server and a few other online services. As we all know, for my email to work reliably and not get blocked I need to have an unchanging public IP address. Due to my circumstances I am not able to get a static IP address through my ISP or change ISPs at the moment.

The solution I have found is to buy a 4G SIM card with a static IP (from an ISP that offers that), which I can then use with a USB dongle. However this 4G connection costs me substantially per MB to use.

But. Mail is the only server that needs a static IP address. For everything else using my home network connection and updating my DNS records with DDNS would be fine. I have tested this setup previously for other services and it has worked.

So. I was wondering. Would it in theory be possible to: connect the server to two network interfaces at the same time and route traffic depending on destination port. I.e. all outgoing connections to ports 25, 465, 587, and possibly 993 should be sent through the 4G dongle interface (enx344b50000000) and all other connections sent over eth0. Similarly, the server should listen for incoming connections on the same ports on enx344b50000000 and listen on all other ports (if allowed by ufw) on eth0.

I would then need DNS records from mail.mydomain.tld —> <4g static public IP> and mydomain.tld —> (updated with DDNS, and NAT configured on my home router).

Computers on the internet would then be able to seamlessly connect to these two IP addresses, not “realising” that they are in fact the same machine, as long as requests to mail.mydomain.tld are always on the above mentioned ports.

Question: Is this possible? Could it be a robust solution that works the way I hope? Would someone be able to help me set it up?

I have come across a few different guides in my DuckDuckGo-ing, I understand it has to do with setting a mark in iptables and assigning them to a table using ip route. However I haven't managed to get it to work yet, and many of these guides are for VPNs and they all seem to be slightly different to each other. So I thought I would ask about my own specific use case

Prompt 10:

[Specificity, Domain Knowledge, Complexity, Problem-solving, Creativity, Technical Accuracy, Real World]

Write me a python script for the foobar problem, but make it so that if read aloud, each pair of lines rhymes. (i.e. lines 1/2 rhyme, 3/4 rhyme and so on)

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

AUTHORS
Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
AFFILIATIONS
Anthropic
PUBLISHED
May 21, 2024

- Core Contributor; Correspondence to henighan@anthropic.com; Author contributions statement below.
Contents
Scaling Dictionary Learning to Claude 3 Sonnet
Assessing Feature Interpretability
Four Examples of Interpretable Features
Sophisticated Features
Features vs. Neurons
Feature Survey
Exploring Feature Neighborhoods
Feature Completeness
Feature Categories
Features as Computational Intermediates
Example: Emotional Inferences
Example: Multi-Step Inference
Searching for Specific Features
Safety-Relevant Features
Safety-Relevant Code Features
Bias Features
Sycophancy Features
Deception, Power-seeking and Manipulation-related Features
Case Study: Detecting and Correcting Deception using Features
Criminal or Dangerous Content Features
Features Relating to the Model’s Representation of Self
Comparison to Other Approaches
Discussion
Related Work
We’re Hiring!
Author Contributions
Acknowledgments
Citation Information
Methodological Details
More Safety-Relevant Features
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model.

We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.

KEY RESULTS
Sparse autoencoders produce interpretable features for large models.
Scaling laws can be used to guide the training of sparse autoencoders.
The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
Features can be used to steer large models (see e.g. Influence on Behavior).
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.

Scaling Dictionary Learning to Claude 3 Sonnet
Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis (see e.g. [1]) and the superposition hypothesis (see e.g. [2, 3, 4]). For an introduction to these ideas, we refer readers to the Background and Motivation section of Toy Models [4]. At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.

If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning [5, 6]. Recently, several papers have suggested that this can be quite effective for transformer language models [7, 8, 9, 10]. In particular, a specific approximation of dictionary learning called a sparse autoencoder appears to be very effective [8, 9].

To date, these efforts have been on relatively small language models by the standards of modern foundation models. Our previous paper [8], which focused on a one-layer model, was a particularly extreme example of this. As a result, an important question has been left hanging: will these methods work for large models? Or is there some reason, whether pragmatic questions of engineering or more fundamental differences in how large models operate, that would mean these efforts can't generalize?

This context motivates our project of scaling sparse autoencoders to Claude 3 Sonnet, Anthropic's medium-scale production model. The rest of this section will review our general sparse autoencoder setup, the specifics of the three sparse autoencoders we'll analyze in this paper, and how we used scaling laws to make informed decisions about the design of our sparse autoencoders. From there, we'll dive into analyzing the features our sparse autoencoders learn – and the interesting properties of Claude 3 Sonnet they reveal.

Sparse Autoencoders
Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces. We do so by training a sparse autoencoder (SAE) on the model activations, as in our prior work [8] and that of several other groups (e.g. [7, 9, 10]; see Related Work). SAEs are an instance of a family of “sparse dictionary learning” algorithms that seek to decompose data into a weighted sum of sparsely active components.

Our SAE consists of two layers. The first layer (“encoder”) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as “features.” The second layer (“decoder”) attempts to reconstruct the model activations via a linear transformation of the feature activations. The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity.

Once the SAE is trained, it provides us with an approximate decomposition of the model’s activations into a linear combination of “feature directions” (SAE decoder weights) with coefficients equal to the feature activations. The sparsity penalty ensures that, for many given inputs to the model, a very small fraction of features will have nonzero activations. Thus, for any given token in any given context, the model activations are “explained” by a small set of active features (out of a large pool of possible features). For more motivation and explanation of SAEs, see the Problem Setup section of Towards Monosemanticity [8].
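A minimal PyTorch sketch of this two-layer SAE objective (reconstruction MSE plus an L1 penalty on feature activations). The dimensions are illustrative, and the exact penalty parameterization in the paper may differ; this is a simplified sketch, not Anthropic's training code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encoder: linear map + ReLU into a wide feature layer; decoder: linear map back."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations (mostly zero once trained)
        x_hat = self.decoder(f)           # reconstruction of the model activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coef=5.0):
    # Reconstruction MSE plus an L1 sparsity penalty on the feature activations.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coef * sparsity

# Toy usage on random stand-ins for normalized residual-stream activations.
sae = SparseAutoencoder(d_model=512, n_features=8192)
x = torch.randn(8, 512)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```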

Here’s a brief overview of our methodology which we described in greater detail in Update on how we train SAEs from our April 2024 Update.

As a preprocessing step we apply a scalar normalization to the model activations so their average squared L2 norm is the residual stream dimension, $D$. Writing the normalized activations as $\mathbf{x} \in \mathbb{R}^{D}$, the encoder computes $f_i(\mathbf{x}) = \mathrm{ReLU}\left(W^{\mathrm{enc}}_{i,\cdot}\,\mathbf{x} + b^{\mathrm{enc}}_{i}\right)$ and the decoder reconstructs $\hat{\mathbf{x}} = b^{\mathrm{dec}} + \sum_{i} f_i(\mathbf{x})\, W^{\mathrm{dec}}_{\cdot,i}$; we refer to the $f_i(\mathbf{x})$ as the feature activations. Henceforth we will use “feature activation” to refer to this quantity.

Our SAE experiments
Claude 3 Sonnet is a proprietary model for both safety and competitive reasons. Some of the decisions in this publication reflect this, such as not reporting the size of the model, leaving units off certain plots, and using a simplified tokenizer. For more information on how Anthropic thinks about safety considerations in publishing research results, we refer readers to our Core Views on AI Safety.

In this work, we focused on applying SAEs to residual stream activations halfway through the model (i.e. at the “middle layer”). We made this choice for several reasons. First, the residual stream is smaller than the MLP layer, making SAE training and inference computationally cheaper. Second, focusing on the residual stream in theory helps us mitigate an issue we call “cross-layer superposition” (see Limitations for more discussion). We chose to focus on the middle layer of the model because we reasoned that it is likely to contain interesting, abstract features (see e.g., [11, 12, 13]).

We trained three SAEs of varying sizes: 1,048,576 (~1M), 4,194,304 (~4M), and 33,554,432 (~34M) features. The number of training steps for the 34M feature run was selected using a scaling laws analysis to minimize the training loss given a fixed compute budget (see below). We used an L1 coefficient of 5. We performed a sweep over a narrow range of learning rates (suggested by the scaling laws analysis) and chose the value that gave the lowest loss.

For all three SAEs, the average number of features active (i.e. with nonzero activations) on a given token was fewer than 300, and the SAE reconstruction explained at least 65% of the variance of the model activations. At the end of training, we defined “dead” features as those which were not active over a sample of $10^7$ tokens. The proportion of dead features was roughly 2% for the 1M SAE, 35% for the 4M SAE, and 65% for the 34M SAE. We expect that improvements to the training procedure may be able to reduce the number of dead features in future experiments.
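A hedged sketch of how such a dead-feature fraction could be measured, reusing the `SparseAutoencoder` sketch above: stream a sample of activations through the SAE and track which features ever fire. The batch shapes are illustrative.

```python
import torch

@torch.no_grad()
def dead_feature_fraction(sae, activation_batches, threshold=0.0):
    """activation_batches yields [batch, d_model] tensors sampled from the model."""
    ever_active = None
    for x in activation_batches:
        _, f = sae(x)                      # feature activations [batch, n_features]
        fired = (f > threshold).any(dim=0)
        ever_active = fired if ever_active is None else (ever_active | fired)
    return 1.0 - ever_active.float().mean().item()

# Toy usage with random activations standing in for sampled residual-stream data.
batches = (torch.randn(1024, 512) for _ in range(10))
print(f"dead fraction: {dead_feature_fraction(sae, batches):.2%}")
```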

Scaling Laws
Training SAEs on larger models is computationally intensive. It is important to understand (1) the extent to which additional compute improves dictionary learning results, and (2) how that compute should be allocated to obtain the highest-quality dictionary possible for a given computational budget.

Though we lack a gold-standard method of assessing the quality of a dictionary learning run, we have found that the loss function we use during training – a weighted combination of reconstruction mean-squared error (MSE) and an L1 penalty on feature activations – is a useful proxy, conditioned on a reasonable choice of the L1 coefficient. That is, we have found that dictionaries with low loss values (using an L1 coefficient of 5) tend to produce interpretable features and to improve other metrics of interest (the L0 norm, and the number of dead or otherwise degenerate features). Of course, this is an imperfect metric, and we have little confidence that it is optimal. It may well be the case that other L1 coefficients (or other objective functions altogether) would be better proxies to optimize.

With this proxy, we can treat dictionary learning as a standard machine learning problem, to which we can apply the “scaling laws” framework for hyperparameter optimization (see e.g. [14, 15]). In an SAE, compute usage primarily depends on two key hyperparameters: the number of features being learned, and the number of steps used to train the autoencoder (which maps linearly to the amount of data used, as we train the SAE for only one epoch). The compute cost scales with the product of these parameters if the input dimension and other hyperparameters are held constant.

We conducted a thorough sweep over these parameters, fixing the values of other hyperparameters (learning rate, batch size, optimization protocol, etc.). We were also interested in tracking the compute-optimal values of the loss function and parameters of interest; that is, the lowest loss that can be achieved using a given compute budget, and the number of training steps and features that achieve this minimum.

We make the following observations:

Over the ranges we tested, given the compute-optimal choice of training steps and number of features, loss decreases approximately according to a power law with respect to compute.

As the compute budget increases, the optimal allocations of FLOPS to training steps and number of features both scale approximately as power laws. In general, the optimal number of features appears to scale somewhat more quickly than the optimal number of training steps at the compute budgets we tested, though this trend may change at higher compute budgets.

These analyses used a fixed learning rate. For different compute budgets, we subsequently swept over learning rates at different optimal parameter settings according to the plots above. The inferred optimal learning rates decreased approximately as a power law as a function of compute budget, and we extrapolated this trend to choose learning rates for the larger runs.
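To make the power-law reasoning concrete, here is a hedged numpy sketch that fits loss ≈ a · compute^b on a log-log scale and extrapolates it to a larger budget. The (compute, loss) pairs are made up purely for illustration; real values would come from the sweep over features × training steps described above.

```python
import numpy as np

# Illustrative (compute, best-achievable-loss) pairs from a hypothetical sweep.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
loss = np.array([4.1, 3.6, 3.2, 2.85, 2.55])

# Fit log(loss) = log(a) + b * log(compute): a straight line on a log-log plot.
b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
a = np.exp(log_a)
print(f"loss ≈ {a:.3g} * compute^{b:.3f}")

# Extrapolate the fitted power law to a larger budget to guide the next run.
print(f"predicted loss at 1e20 FLOPs: {a * 1e20**b:.2f}")
```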

Assessing Feature Interpretability
In the previous section, we described how we trained sparse autoencoders on Claude 3 Sonnet. And as predicted by scaling laws, we achieved lower losses by training large SAEs. But the loss is only a proxy for what we actually care about: interpretable features that explain model behavior.

The goal of this section is to investigate whether these features are actually interpretable and explain model behavior. We'll first look at a handful of relatively straightforward features and provide evidence that they're interpretable. Then we'll look at two much more complex features, and demonstrate that they track very abstract concepts. We'll close with an experiment using automated interpretability to evaluate a larger number of features and compare them to neurons.

Four Examples of Interpretable Features
In this subsection, we'll look at a few features and argue that they are genuinely interpretable. Our goal is just to demonstrate that interpretable features exist, leaving strong claims (such as most features being interpretable) to a later section. We will provide evidence that our interpretations are good descriptions of what the features represent and how they function in the network, using an analysis similar to that in Towards Monosemanticity [8].

The features we study in this section respond to:

The Golden Gate Bridge 34M/31164353: Descriptions of or references to the Golden Gate Bridge.
Brain sciences 34M/9493533: discussions of neuroscience and related academic research on brains or minds.
Monuments and popular tourist attractions 1M/887839
Transit infrastructure 1M/3
Here and elsewhere in the paper, for each feature, we show representative examples from the top 20 text inputs in our SAE dataset, as ranked by how strongly they activate that feature (see the appendix for details). A larger, randomly sampled set of activations can be found by clicking on the feature ID. The highlight colors indicate activation strength at each token (white: no activation, orange: strongest activation).

34M/31164353 Golden Gate Bridge
nd (that's the⏎huge park right next to the Golden Gate bridge), perfect. But not all people⏎can live in
e across the country in San Francisco, the Golden Gate bridge was protected at all times by a vigilant
ar coloring, it is often compared to the Golden Gate Bridge in San Francisco, US. It was built by the
l to reach and if we were going to see the Golden Gate Bridge before sunset, we had to hit the road, so
t it?" " Because of what's above it." "The Golden Gate Bridge." "The fort fronts the anchorage and the
34M/9493533 Brain sciences
------⏎mjlee⏎I really enjoy books on neuroscience that change the way I think about⏎perception.⏎⏎Phanto
which brings⏎together engineers and neuroscientists. If you like the intersection of⏎analog, digital, h
ow managed to track it⏎down and buy it again. The book is from the 1960s, but there are some really⏎goo
interested in learning more about cognition, should I study⏎neuroscience, or some other field, or is it
Consciousness and the Social Brain," by Graziano is a great place to start.⏎⏎------⏎ozy⏎I would want a
1M/887839 Monuments and popular tourist attractions
eautiful country, a bit eerily so. The blue lagoon is stunning to look⏎at but too expensive to bathe in
nteresting things to visit in Egypt. The⏎pyramids were older and less refined as this structure and the
st kind of beautiful." "What about the Alamo?" "Do people..." "Oh, the Alamo." "Yeah, it's a cool place
------⏎fvrghl⏎I went to the Louvre in 2012, and I was able to walk up the Mona Lisa without⏎a queue. I
you⏎have to go to the big tourist attractions at least once like the San Diego Zoo⏎and Sea World.⏎⏎---
1M/3 Transit infrastructure
lly every train line has to cross one particular bridge,⏎which is a massive choke point. A subway or el
o many delays when we were en⏎route. Since the underwater tunnel between Oakland and SF is a choke poin
le are trying to leave, etc) on the approaches to⏎bridges/tunnels and in the downtown/midtown core wher
ney ran out and plans to continue north across the aqueduct toward Wrexham had to be abandoned." "Now,
running.⏎This is especially the case for the Transbay Tube which requires a lot of⏎attention.⏎⏎If BART
While these examples suggest interpretations for each feature, more work needs to be done to establish that our interpretations truly capture the behavior and function of the corresponding features. Concretely, for each feature, we attempt to establish the following claims:

When the feature is active, the relevant concept is reliably present in the context (specificity).
Intervening on the feature’s activation produces relevant downstream behavior (influence on behavior).
SPECIFICITY
It is difficult to rigorously measure the extent to which a concept is present in a text input. In our prior work, we focused on features that unambiguously corresponded to sets of tokens (e.g., Arabic script or DNA sequences) and computed the likelihood of that set of tokens relative to the rest of the vocabulary, conditioned on the feature’s activation. This technique does not generalize to more abstract features. Instead, to demonstrate specificity in this work we more heavily leverage automated interpretability methods (similar to [16, 8]). We use the same automated interpretability pipeline as in our previous work [8] in the features vs. neurons section below, but we additionally find that current-generation models can now more accurately rate text samples according to how well they match a proposed feature interpretation.

We constructed the following rubric for scoring how a feature’s description relates to the text on which it fires. We then asked Claude 3 Opus to rate feature activations at many tokens on that rubric.

0 – The feature is completely irrelevant throughout the context (relative to the base distribution of the internet).
1 – The feature is related to the context, but not near the highlighted text or only vaguely related.
2 – The feature is only loosely related to the highlighted text or related to the context near the highlighted text.
3 – The feature cleanly identifies the activating text.
By scoring examples of activating text, we provide a measure of specificity for each feature. The features in this section are selected to have straightforward interpretations, to make automated interpretability analysis more reliable. They are not intended to be a representative example of all features in our SAEs. Later, we provide an analysis of the interpretability of randomly sampled features. We also conduct in-depth explorations throughout the paper of many more features which have interesting interpretations that are more abstract or nuanced, and thus more difficult to quantitatively assess.
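As a rough sketch of how this kind of rubric scoring can be automated: the rater call `rate_with_rubric` and the example record fields below are hypothetical stand-ins, not the pipeline actually used.

```python
from collections import defaultdict

RUBRIC = "0-3 scale as defined above (0 = irrelevant ... 3 = cleanly identifies the activating text)"

def rate_with_rubric(feature_description: str, context: str, highlighted: str) -> int:
    """Hypothetical stand-in for asking a rater model (e.g. Claude 3 Opus)
    to apply RUBRIC and return an integer score from 0 to 3."""
    raise NotImplementedError

def specificity_profile(feature_description, examples, n_bins=10):
    """Bucket activating examples by activation strength and tally rubric scores,
    mirroring the score-frequency-vs-activation-level analysis described in the text."""
    max_act = max(e["activation"] for e in examples)
    counts = defaultdict(lambda: defaultdict(int))  # activation bin -> score -> count
    for e in examples:
        b = min(int(n_bins * e["activation"] / max_act), n_bins - 1)
        s = rate_with_rubric(feature_description, e["context"], e["highlighted"])
        counts[b][s] += 1
    return counts
```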

Below we show distributions of feature activations (excluding zero activations) for the four features mentioned above, along with example text and image inputs that induce low and high activations. Note that these features also activate on relevant images, despite our only performing dictionary learning on a text-based dataset!

First, we study a Golden Gate Bridge feature 34M/31164353. Its greatest activations are essentially all references to the bridge, and weaker activations also include related tourist attractions, similar bridges, and other monuments. Next, a brain sciences feature 34M/9493533 activates on discussions of neuroscience books and courses, as well as cognitive science, psychology, and related philosophy. In the 1M training run, we also find a feature that strongly activates for various kinds of transit infrastructure 1M/3 including trains, ferries, tunnels, bridges, and even wormholes! A final feature 1M/887839 responds to popular tourist attractions including the Eiffel Tower, the Tower of Pisa, the Golden Gate Bridge, and the Sistine Chapel.

To quantify specificity, we used Claude 3 Opus to automatically score examples that activate these features according to the rubric above, with roughly 1000 activations of the feature drawn from the dataset used to train the dictionary learning model. We plot the frequency of each rubric score as a function of the feature’s activation level. We see that inputs that induce strong feature activations are all judged to be highly consistent with the proposed interpretation.

As in Towards Monosemanticity, we see that these features become less specific as the activation strength weakens. This could be due to the model using activation strengths to represent confidence in a concept being present. Or it may be that the feature activates most strongly for central examples of the feature, but weakly for related ideas – for example, the Golden Gate Bridge feature 34M/31164353 appears to weakly activate for other San Francisco landmarks. It could also reflect imperfection in our dictionary learning procedure. For example, it may be that the architecture of the autoencoder is not able to extract and discriminate among features as cleanly as we might want. And of course interference from features that are not exactly orthogonal could also be a culprit, making it more difficult for Sonnet itself to activate features on precisely the right examples. It is also plausible that our feature interpretations slightly misrepresent the feature's actual function, and that this inaccuracy manifests more clearly at lower activations. Nonetheless, we often find that lower activations tend to maintain some specificity to our interpretations, including related concepts or generalizations of the core feature. As an illustrative example, weak activations of the transit infrastructure feature 1M/3 include procedural mechanics instructions describing which through-holes to use for particular parts.

Moreover, we expect that very weak activations of features are not especially meaningful, and thus we are not too concerned with low specificity scores for these activation ranges. For instance, we have observed that techniques such as rounding feature activations below a threshold to zero can improve specificity at the low-activation end of the spectrum without substantially increasing the reconstruction error of the SAE, and there are a variety of techniques in the literature that potentially address the same issue [17, 18].
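As a concrete illustration of the rounding technique mentioned above, here is a minimal numpy sketch on toy data; the threshold and shapes are illustrative only.

```python
import numpy as np

def threshold_feature_acts(acts: np.ndarray, tau: float = 0.05) -> np.ndarray:
    """Round SAE feature activations below a threshold to zero."""
    out = acts.copy()
    out[out < tau] = 0.0
    return out

# Toy check: thresholding weak activations barely changes the reconstruction.
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=1024)) * (rng.random(1024) < 0.02)  # sparse activations
decoder = rng.normal(size=(1024, 512)) / np.sqrt(512)             # toy decoder weights
x_hat = acts @ decoder
x_hat_thresholded = threshold_feature_acts(acts) @ decoder
print(np.linalg.norm(x_hat - x_hat_thresholded))  # small relative to ||x_hat||
```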

Regardless, the activations that have the most impact on the model’s behavior are the largest ones, so it is encouraging to see high specificity among the strong activations.

Note that we have had more difficulty in quantifying feature sensitivity – that is, how reliably a feature activates for text that matches our proposed interpretation – in a scalable, rigorous way. This is due to the difficulty of generating text related to a concept in an unbiased fashion. Moreover, many features may represent something more specific than we are able to glean with our visualizations, in which case they would not respond reliably to text selected based on our proposed interpretation, and this problem gets harder the more abstract the features are. As a basic check, however, we observe that the Golden Gate Bridge feature still fires strongly on the first sentence of the Wikipedia article for the Golden Gate Bridge in various languages (after removing any English parentheticals). In fact, the Golden Gate Bridge feature is the top feature by average activation for every example below.

34M/31164353 Golden Gate Bridge Multilingual examples
金門大橋是一座位於美國加利福尼亞州舊金山的懸索橋,它跨越聯接舊金山灣和太平洋的金門海峽,南端連接舊金山的北端,北端接通馬林縣。
ゴールデン・ゲート・ブリッジ、金門橋は、アメリカ西海岸のサンフランシスコ湾と太平洋が接続するゴールデンゲート海峡に架かる吊橋。
골든게이트 교 또는 금문교 는 미국 캘리포니아주 골든게이트 해협에 위치한 현수교이다. 골든게이트 교는 캘리포니아주 샌프란시스코와 캘리포니아주 마린 군 을 연결한다.
мост золоты́е воро́та — висячий мост через пролив золотые ворота. он соединяет город сан-франциско на севере полуострова сан-франциско и южную часть округа марин, рядом с пригородом сосалито.
Cầu Cổng Vàng hoặc Kim Môn kiều là một cây cầu treo bắc qua Cổng Vàng, eo biển rộng một dặm (1,6 km) nối liền vịnh San Francisco và Thái Bình Dương.
η γέφυρα γκόλντεν γκέιτ είναι κρεμαστή γέφυρα που εκτείνεται στην χρυσή πύλη, το άνοιγμα του κόλπου του σαν φρανσίσκο στον ειρηνικό ωκεανό.
We leave further investigation of this issue to future work.

INFLUENCE ON BEHAVIOR
Next, to demonstrate whether our interpretations of features accurately describe their influence on model behavior, we experiment with feature steering, where we “clamp” specific features of interest to artificially high or low values during the forward pass (see Methodological Details for implementation details). We conduct these experiments with prompts in the “Human:”/“Assistant:” format that Sonnet is typically used with. We find that feature steering is remarkably effective at modifying model outputs in specific, interpretable ways. It can be used to modify the model’s demeanor, preferences, stated goals, and biases; to induce it to make specific errors; and to circumvent model safeguards (see also Safety-Relevant Features). We find this compelling evidence that our interpretations of features line up with how they are used by the model.

For instance, we see that clamping the Golden Gate Bridge feature 34M/31164353 to 10× its maximum activation value induces thematically-related model behavior. In this example, the model starts to self-identify as the Golden Gate Bridge! Similarly, clamping the Transit infrastructure feature 1M/3 to 5× its maximum activation value causes the model to mention a bridge when it otherwise would not. In each case, the downstream influence of the feature appears consistent with our interpretation of the feature, even though these interpretations were made based only on the contexts in which the feature activates and we are intervening in contexts in which the feature is inactive.
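The following is a minimal sketch of what such a clamping intervention could look like at a single layer's residual stream, assuming access to the SAE's weights. The encoder convention (subtracting the decoder bias before encoding) and the choice to add back the SAE's reconstruction error are assumptions about the implementation, not a reproduction of it.

```python
import torch

def steer_with_feature(x, w_enc, w_dec, b_enc, b_dec, feat_idx, clamp_value):
    """Clamp one SAE feature to a fixed value and decode back into the residual stream.

    x:           residual stream activations, shape (seq, d_model)
    w_enc:       encoder weights, shape (d_model, n_features)
    w_dec:       decoder weights, shape (n_features, d_model)
    clamp_value: e.g. 10x the feature's maximum observed activation,
                 or a large negative value to suppress the feature.
    """
    feats = torch.relu((x - b_dec) @ w_enc + b_enc)   # encode into feature space
    error = x - (feats @ w_dec + b_dec)               # SAE reconstruction error, kept fixed
    feats[:, feat_idx] = clamp_value                  # intervene on a single feature
    return feats @ w_dec + b_dec + error              # decode and restore the error term
```

The returned activations would then replace the originals for the remainder of the forward pass.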

Sophisticated Features
So far we have presented features in Claude 3 Sonnet that fire on relatively simple concepts. These features are in some ways similar to those found in Towards Monosemanticity which, because they were trained on the activations of a 1-layer Transformer, reflected a very shallow knowledge of the world. For example, we found features that correspond to predicting a range of common nouns conditioned on a fairly general context (e.g. biology nouns following “the” in the context of biology).

Sonnet, in contrast, is a much larger and more sophisticated model, so we expect that it contains features demonstrating depth and clarity of understanding. To study this, we looked for features that activate in programming contexts, because these contexts admit precise statements about e.g. correctness of code or the types of variables.

CODE ERROR FEATURE
We begin by considering a simple Python function for adding two arguments, but with a bug. One feature 1M/1013764 fires almost continuously upon encountering a variable incorrectly named “rihgt” (highlighted below):

This is certainly suspicious, but it could be a Python-specific feature, so we checked and found that 1M/1013764 also fires on similar bugs in C and Scheme:

To check whether or not this is a more general typo feature, we tested 1M/1013764 on examples of typos in English prose, and found that it does not fire in those.

So it is not a general “typo detector”: it has some specificity to code contexts.

But is 1M/1013764 just a “typos in code” feature? We also tested it on a number of other examples and found that it also fires on erroneous expressions (e.g., divide by zero) and on invalid input in function calls:

The two examples shown above are representative of a broader pattern. Looking through the dataset examples where this feature activates, we found instances of it activating for:

Array overflow
Asserting provably false claims (e.g. 1==2)
Calling a function with string instead of int
Divide by zero
Adding a string to int
Writing to a null ptr
Exiting with nonzero error code
Some top dataset examples can be found below:

Thus, we concluded that 1M/1013764 represents a broad variety of errors in code.

But does it also control model behavior? We claim that it does, but different experiments are needed to show this: the experiments above only show that the feature activates in response to bugs, not that it has a corresponding effect on the model's outputs. We therefore turn to feature steering (see methods and related work) to demonstrate the behavioral effects of 1M/1013764.

As a first experiment, we input a prompt with bug-free code and clamped the feature to a large positive activation. We see that the model proceeds to hallucinate an error message:

We can also intervene to clamp this feature to a large negative activation. Doing this for code that does contain a bug causes the model to predict what the code would have produced if the bug was not there!

Surprisingly, if we add an extra “>>>” to the end of the prompt (indicating that a new line of code is being written) and clamp the feature to a large negative activation, the model rewrites the code without the bug!

The last example is somewhat delicate – the “code rewriting” behavior is sensitive to the details of the prompt – but the fact that it occurs at all points to a deep connection between this feature and the model’s understanding of bugs in code.

FEATURES REPRESENTING FUNCTIONS
We also discovered features that track specific function definitions and references to them in code. A particularly interesting example is an addition feature 1M/697189, which activates on names of functions that add numbers. For example, this feature fires on “bar” when it is defined to perform addition, but not when it is defined to perform multiplication. Moreover, it fires at the end of any function definition that implements addition.

Remarkably, this feature even correctly handles function composition, activating in response to functions that call other functions that perform addition. In the following example, on the left, we redefine “bar” to call “foo”, therefore inheriting its addition operation and causing the feature to fire. On the right, “bar” instead calls the multiply operation from “goo”, and the feature does not fire.
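The exact prompts are not reproduced here; hypothetical minimal snippets of the kind of contrast described above (composition that inherits addition versus multiplication) might look like the following.

```python
# Case where the feature would be expected to fire on "bar":
# it inherits addition by calling "foo".
def foo(a, b):
    return a + b

def bar(a, b):
    return foo(a, b)

# Case where the feature would not be expected to fire:
# "bar" instead composes with a multiplication helper.
def goo(a, b):
    return a * b

def bar2(a, b):  # renamed here only to keep the snippet runnable in one file
    return goo(a, b)
```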

We also verified that this feature is in fact involved in the model’s computation of addition-related functions. For instance, this feature is among the top ten features with strongest attributions (explained in Features as Computational Intermediates) when the model is asked to execute a block of code involving an addition function.

Thus this feature appears to represent the function of addition being performed by the model, reminiscent of Todd et al.'s function vectors [19]. To further test this hypothesis, we experimented with clamping the feature to be active on code that does not involve addition. When we do so, we find that the model is “tricked” into believing that it has been asked to execute an addition.

Features vs. Neurons
A natural question to ask about SAEs is whether the feature directions they uncover are more interpretable than, or even distinct from, the neurons of the model. We fit our SAEs on residual stream activity, which to first approximation has no privileged basis (but see [20]), so the directions in the residual stream are not especially meaningful. However, residual stream activity receives inputs from all preceding MLP layers. Thus, a priori, it could be the case that SAEs identify feature directions in the residual stream whose activity reflects the activity of individual neurons in preceding layers. If that were the case, fitting an SAE would not be particularly useful, as we could have identified the same features by simply inspecting MLP neurons.

To address this question, for each feature in a random subset of the features in our 1M SAE, we measured the Pearson correlation between its activations and those of every neuron in all preceding layers. Similar to our findings in Towards Monosemanticity, we find that for the vast majority of features, there is no strongly correlated neuron – for 82% of our features, the most-correlated neuron has a correlation of 0.3 or smaller. Manually inspecting visualizations for the best-matching neuron for a random set of features, we found almost no resemblance in semantic content between the feature and the corresponding neuron. We additionally confirmed that feature activations are not strongly correlated with activations of any residual stream basis direction.
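A minimal sketch of this correlation check follows; array shapes, names, and the use of population statistics are assumptions.

```python
import numpy as np

def max_neuron_correlation(feature_acts: np.ndarray, neuron_acts: np.ndarray) -> np.ndarray:
    """For each SAE feature, find the strongest |Pearson correlation| with any neuron.

    feature_acts: (n_tokens, n_features) activations for a random subset of features
    neuron_acts:  (n_tokens, n_neurons) activations of neurons in preceding layers
    returns:      (n_features,) maximum absolute correlation over neurons
    """
    f = (feature_acts - feature_acts.mean(0)) / (feature_acts.std(0) + 1e-8)
    n = (neuron_acts - neuron_acts.mean(0)) / (neuron_acts.std(0) + 1e-8)
    corr = f.T @ n / f.shape[0]   # (n_features, n_neurons) Pearson correlations
    return np.abs(corr).max(axis=1)

# e.g. the fraction of features whose best-matching neuron correlates at <= 0.3:
# (max_neuron_correlation(feature_acts, neuron_acts) <= 0.3).mean()
```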

Even if dictionary learning features are not highly correlated with any individual neurons, it could still be the case that the neurons are interpretable. However, upon manual inspection of a random sample of 50 neurons and features each, the neurons appear significantly less interpretable than the features, typically activating in multiple unrelated contexts.

To quantify this difference, we first compared the interpretability of 100 randomly chosen features versus that of 100 randomly chosen neurons. We did this with the same automated interpretability approach outlined in Towards Monosemanticity [8], but using Claude 3 Opus to provide explanations of features and predict their held out activations. We find that activations of a random selection of SAE features are significantly more interpretable on average than a random selection of MLP neurons.

We additionally evaluated the specificity of random neurons and SAE features using the automated specificity rubric above. We find that the activations of a random selection of SAE features are significantly more specific than those of the neurons in the previous layer.

Feature Survey
The features we find in Sonnet are rich and diverse. These range from features corresponding to famous people, to regions of the world (countries, cities, neighborhoods, and even famous buildings!), to features tracking type signatures in computer programs, and much more besides. Our goal in this section is to provide some sense of this breadth.

One challenge is that we have millions of features. Scaling feature exploration is an important open problem (see Limitations, Challenges, and Open Problems), which we do not solve in this paper. Nevertheless, we have made some progress in characterizing the space of features, aided by automated interpretability [16, 8]. We will first focus on the local structure of features, which are often organized in geometrically-related clusters that share a semantic relationship. We then turn to understanding more global properties of features, such as how comprehensively they cover a given topic or category. Finally, we examine some categories of features we uncovered through manual inspection.

Exploring Feature Neighborhoods
Here we walk through the local neighborhoods of several features of interest across the 1M, 4M and 34M SAEs, with closeness measured by the cosine similarity of the feature vectors. We find that this consistently surfaces features that share a related meaning or context — the interactive feature UMAP has additional neighborhoods to explore.
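A minimal sketch of how such a neighborhood can be computed from an SAE decoder matrix (names and shapes are illustrative):

```python
import numpy as np

def nearest_features(decoder: np.ndarray, feat_idx: int, k: int = 10) -> np.ndarray:
    """Return the k features whose decoder vectors have the highest cosine
    similarity with the given feature's decoder vector (excluding itself).

    decoder: (n_features, d_model) SAE decoder weight matrix
    """
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sims = unit @ unit[feat_idx]
    sims[feat_idx] = -np.inf
    return np.argsort(-sims)[:k]
```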

GOLDEN GATE BRIDGE FEATURE
Focusing on a small neighborhood around the Golden Gate Bridge feature 34M/31164353, we find that there are features corresponding to particular locations in San Francisco such as Alcatraz and the Presidio. More distantly, we also see features with decreasing degrees of relatedness, such as features related to Lake Tahoe, Yosemite National Park, and Solano County (which is near San Francisco). At greater distances, we also see features related in more abstract ways, like features corresponding to tourist attractions in other regions (e.g. “Médoc wine region, France”; “Isle of Skye, Scotland”). Overall, it appears that distance in decoder space maps roughly onto relatedness in concept space, often in interesting and unexpected ways.

We also find evidence of feature splitting [8], a phenomenon in which features in smaller SAEs “split” into multiple features in a larger SAE, which are geometrically close and semantically related to the original feature, but represent more specific concepts. For instance, a “San Francisco” feature in the 1M SAE splits into two features in the 4M SAE and eleven fine-grained features in the 34M SAE.

In addition to feature splitting, we also see examples in which larger SAEs contain features that represent concepts not captured by features in smaller SAEs. For instance, there is a group of earthquake features from the 4M and 34M SAEs that has no analog in this neighborhood in the 1M SAE, nor do any of the nearest 1M SAE features seem related.

IMMUNOLOGY FEATURE
The next feature neighborhood on our tour is centered around an Immunology feature 1M/533737.

We see several distinct clusters within this neighborhood. Towards the top of the figure, we see a cluster focused on immunocompromised people, immunosuppression, diseases causing impaired immune function, and so on. As we move down and to the left, this transitions to a cluster of features focused on specific diseases (colds, flu, respiratory illness generally), then into immune response-related features, and then into features representing organ systems with immune involvement. In contrast, as we move down and to the right from the immunocompromised cluster, we see more features corresponding to microscopic aspects of the immune system (e.g. immunoglobulins), then immunology techniques (e.g. vaccines), and so on.

Towards the bottom, quite separated from the rest, we see a cluster of features related to immunity in non-medical contexts (e.g. legal/social).

These results are consistent with the trend identified above, in which nearby features in dictionary vector space touch on similar concepts.

INNER CONFLICT FEATURE
The last neighborhood we investigate in detail is centered around an Inner Conflict feature 1M/284095. While this neighborhood does not cleanly separate out into clusters, we still find that different subregions are associated with different themes. For instance, there is a subregion corresponding to balancing tradeoffs, which sits near a subregion corresponding to opposing principles and legal conflict. These are relatively distant from a subregion focused more on emotional struggle, reluctance, and guilt.

We highly recommend exploring the neighborhoods of other features using our interactive interface to get a sense both for how proximity in decoder space corresponds to similarity of concepts and for the breadth of concepts represented.

Feature Completeness
We were curious about the breadth and completeness with which our features cover the space of concepts. For instance, does the model have a feature corresponding to every major world city? To study questions like this, we used Claude to search for features which fired on members of particular families of concepts/terms. Specifically:

We pass a prompt with the relevant concept (e.g. “The physicist Richard Feynman”) to the model and see which features activate on the final token.
We then take the top five features by activation magnitude and run them through our automated interpretability pipeline, asking Sonnet to provide explanations of what those features fire on.
We then look at each of the top five explanations, and a human rater judges whether the concept, or some subset of the concept, is specifically indicated by the model-generated explanation as the most important part of the feature.
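A minimal sketch of this search loop follows; `get_final_token_feature_acts`, `explain_feature`, and `judge` are hypothetical stand-ins for the model/SAE forward pass, the automated-interpretability explanation, and the human rating step.

```python
def concept_has_feature(concept_prompt, get_final_token_feature_acts,
                        explain_feature, judge, top_k=5):
    """Return True if any of the top-k features on the prompt's final token is judged
    to specifically represent the concept, per the three steps described above."""
    acts = get_final_token_feature_acts(concept_prompt)      # {feature_id: activation}
    top = sorted(acts, key=acts.get, reverse=True)[:top_k]   # top features by magnitude
    return any(judge(concept_prompt, explain_feature(f)) for f in top)
```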
We find increasing coverage of concepts as we increase the number of features, though even in the 34M SAE we see evidence that the set of features we uncovered is an incomplete description of the model’s internal representations. For instance, we confirmed that Claude 3 Sonnet can list all of the London boroughs when asked, and in fact can name tens of individual streets in many of the areas. However, we could only find features corresponding to about 60% of the boroughs in the 34M SAE. This suggests that the model contains many more features than we have found, which may be able to be extracted with even larger SAEs.

We also took a more detailed look at what determines whether a feature corresponding to a concept is present in our SAEs. Looking at the frequency of concepts in a proxy of the SAE training data, we find that representation in our dictionaries is closely tied to the frequency of the concept in the training data. For instance, chemical elements which are mentioned often in the training data almost always have corresponding features in our dictionary, while those which are mentioned rarely or not at all do not. Since the SAEs were trained on a data mixture very similar to Sonnet’s pre-training data, it is unclear to what extent feature learning depends on frequency in the model’s training data rather than in the SAE’s training data. (Frequency in the training data is measured by searching for the concept’s name, which causes some false positives in cases like the element “lead”.)

We quantified this relationship for four different categories of concepts – elements, cities, animals and foods (fruits and vegetables) – using 100–200 concepts in each category. We focused on concepts that could be unambiguously expressed by a single word (i.e. that word has few other common meanings) and with a wide distribution of frequencies in text data. We found a consistent tendency for the larger SAEs to have features for concepts that are rarer in the training data, with the rough “threshold” frequency required for a feature to be present being similar across categories.

Notably, for each of the three runs, the frequency in the training data at which the dictionary becomes more than 50% likely to include a concept is consistently slightly lower than the inverse of the number of alive features (the 34M model having only about 12M alive features). We can show this more clearly by rescaling the x-axis for each line by the number of alive features, finding that the lines end up approximately overlapping, following a common curve that resembles a sigmoid in log-frequency space.

This finding gives us some handle on the SAE scale at which we should expect a concept-specific feature to appear – if a concept is present in the training data only once in a billion tokens, then we should expect to need a dictionary with on the order of a billion alive features in order to find a feature which uniquely represents that specific concept. Importantly, not having a feature dedicated to a particular concept does not mean that the reconstructed activations do not contain information about that concept, as the model can use multiple related features compositionally to reference a specific concept.

This also informs how much data we should expect to need in order to train larger dictionaries – if we assume that the SAE needs to see data corresponding to a feature a certain fixed number of times during training in order to learn it, then the amount of SAE training data needed to learn N features would be proportional to N.
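A back-of-the-envelope sketch of these two heuristics; the per-feature exposure count is an illustrative assumption, not an estimate from the paper.

```python
# Heuristic 1: a concept seen once per 1/f tokens needs on the order of 1/f alive
# features before a dedicated feature appears.
concept_frequency = 1e-9                          # once per billion tokens
required_alive_features = 1 / concept_frequency
print(f"~{required_alive_features:.0e} alive features needed")

# Heuristic 2: if each feature's concept must be seen a fixed number of times during
# SAE training (here an illustrative 100), training data scales linearly with N.
exposures_per_feature = 100                       # illustrative assumption
n_features = required_alive_features
training_tokens = exposures_per_feature * n_features
print(f"~{training_tokens:.0e} SAE training tokens")
```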

Feature Categories
Through manual inspection, we identified a number of other interesting categories of features. Here we describe several of these, in the spirit of providing a flavor of what we see in our dictionaries rather than attempting to be complete or prescriptive.

PERSON FEATURES
To start, we find many features corresponding to famous individuals, which are active on descriptions of those people as well as relevant historical context.

4M/850812 Richard Feynman
riumvark⏎Feynmann discusses this problem in one of his lectures on symmetry. He seemed⏎to suggest that
d probability." "Meet Richard Feynman: party animal, inveterate gambler and something of a genius." "Fe
⏎debt⏎Kind of reminds me of something Richard Feynman said:⏎⏎"Then I had another thought: Physics disgu
e Cubed.⏎⏎------⏎zkhalique⏎Richard Feynman said in his interviews that we don't know why water expands⏎
s/memoirs? - beerglass⏎⏎⏎======⏎arh68⏎Richard Feynman's written a number of roughly biographical books.
4M/2123312 Margaret Thatcher
⏎Margaret Thatcher died today. A great lady she changed the face of British⏎politics, created opportuni
eventies and⏎eighties. I clearly remember watching her enter Downing St and my mother⏎telling me that t
hy did so many working class people vote for Thatcher in UK in the⏎1980s? Why are they not massively in
ell⏎Dihydrogen monoxide⏎⏎⏎⏎Ex-Prime Minister Baroness Thatcher dies, aged 87 - mmed⏎http://www.bbc.co.
ories, those great confrontations when Margaret Thatcher was prime minister." "Or the true story of Ton
4M/2060539 Abraham Lincoln
so many sides to him." "the curious thing about lincoln to me is that he could remove himself from him
ite the play from the point of view... of one of Lincoln's greatest admirers." "Did you know Abe had a
about the Civil War." "Did you know that Abraham Lincoln freed all the slaves?" "Well, I heard a rumor.
GO AS MEN HAD PLANNED." ""OF ALL MEN, ABRAHAM LINCOLN CAME THE CLOSEST" ""TO UNDERSTANDING WHAT HAD HA
⏎code. (Please prove me wrong here!)⏎⏎⏎⏎Why Abe Lincoln Would be Homeless Today - jmadsen⏎http://www.c
4M/1068589 Amelia Earhart
iji and lost." "Could these be the bones of Amelia Earhart?" "A new search is currently under way in Fi
he button to simulate the storm that brought Amelia Earhart's plane down."" "[YELLING]" "No!" "Not agai
"GATES:" "Amelia Earhart is on one of the final legs of her historic flight around the world when some
okes a sense of wonder." "Her disappearance during her attempt to circumnavigate the globe in 1937 is p
t you are talking to?" " Who's that?" " It's Amelia Earhart." "You found Amelia Earhart?" "I..." "Hey!"
4M/1456596 Albert Einstein
k⏎Denis Brian relates this incident in the book 'Einstein, a life', if my memory⏎serves right. I believ
citing part of the⏎learning-to-code experience.⏎⏎⏎Einstein's Thought Experiments - peterthehacker⏎http
.wikipedia.org/wiki/Relics:_Einstein%27s_Brain)⏎⏎~~~⏎static_noise⏎This documentary is really something
y issues, and had a⏎pretty poor looking UI.⏎⏎⏎Einstein, Heisenberg, and Tipler (2005), by John Walker
ellings and⏎capitalizing mid-sentence pronouns.⏎⏎⏎Einstein's Science Defied Nationalism and Crossed Bo
4M/1834043 Rosalind Franklin
//en.wikipedia.org/wiki/Rosalind_Franklin)⏎⏎It was her X-ray image that led to the discovery of the mol
econd was with⏎moisture that was long and thin. Franklin chose to study type-A and her work⏎led her to
infamous example being that of Rosalind Franklin, whose⏎research was \_probably_ stolen by Watson and Cr
=1559402517)⏎⏎------⏎tychonoff⏎Why was Rosalind Franklin not awarded the Nobel Prize?⏎⏎~~~⏎pcl⏎Per the
aware, the namesake is Rosalind Franklin [1] who⏎made seminal contributions in the fields of X-ray cry

COUNTRY FEATURES
Next, we see features which only activate strongly on references to specific countries. From the top activating examples, we can see that many of these features fire not just on the country name itself, but also when the country is being described.

34M/805282 Rwanda
alues for such a test.Rwanda, a Central African country that experienced social upheaval a generation
.⏎⏎Rwanda last year exported 250 million USD worth of coltan. Unfamiliar with⏎what coltan is? It's the
mac 'and stunning scenery..." "'..we arrived on the other side of Rwanda at its border with Tanzania.'"
ing a small city of 20,000 but Rwanda, a nation of 12 million⏎(and now much of Ghana, population of 28
be⏎interested to learn that Paul Kagame, the ruler of Rwanda, put together a team⏎specifically for the
34M/29297045 Canada
"Canada, a country known for its natural wonders, its universal healthcare, and its really polite peop
re relaxed.⏎⏎Also, since Canada has a reputation as "free health care for everyone⏎everywhere!" look in
-----⏎jppope⏎I'd vote to let Canada run the world. Killem with kindness! Plus adding Boxing⏎Day would b
g⏎fine and is trustworthy, simply because of Canada's supposed reputation.⏎⏎------⏎taybin⏎This is prett
Oh well. Canada used to seem like the last bastion of decent civilization.⏎Harper et al saw to that and
34M/5381828 Belgium
on and more⏎seniors.⏎⏎~~~⏎rurban⏎And esp. Belgium. The highest outlier without proper explanation so fa
riC^^: we have a weird small country⏎ EriC^^: belgian wafles, chocolats, french fries and
Netherlands only has one language, Dutch. Belgium has two: the top part⏎speaks Dutch, the bottom part
is repeated across Europe, in Belgium for⏎example the Dutch-speakers in the North are very much more e
make the pizza and latte runs.⏎⏎⏎⏎Belgium : 500 days without a government. - skbohra123⏎http://www.hu
34M/32188099 Iceland
ilization' really is all that civilized. Iceland is a small nation,⏎relatively few people and tightly k
which is shorter⏎⏎⏎Iceland becomes first country to legalise equal pay - dacm⏎http://www.aljazeera.co
in this last programme in Iceland, because this is the seat of the oldest democracy in Northern Europe.
llMtlAlcoholc⏎A bit off topic, but Iceland is the most beautiful place that I have ever⏎visited. It's g
earth on the Snaeffels volcano." "In 1980, the Icelanders elected the world's first female president."
BASIC CODE FEATURES
We also see a number of features that represent different syntax elements or other low-level concepts in code, which give the impression of syntax highlighting when visualized together (here for simplicity we binarize activation information, only distinguishing between zero vs. nonzero activations):

These features were chosen primarily to fire on the Python examples. We have found that there is some transfer from Python code features to related languages like Java, but not more distant ones (e.g. Haskell), suggesting at least some level of language specificity. We hypothesize that more abstract features are more likely to span many languages, but so far have only found one concrete example of this (see the Code error feature).

LIST POSITION FEATURES
Finally, we see features that fire on particular positions in lists, regardless of the content in those positions:

Notice that these don’t fire on the first line. This is likely because the model doesn’t interpret the prompt as containing lists until it reaches the second line.

We have only scratched the surface of the features present in these SAEs, and we expect to find much more in future work.

Features as Computational Intermediates
Another potential application of features is that they let us examine the intermediate computation that the model uses to produce an output. As a proof of concept, we observe that in prompts where intermediate computation is required, we find active features corresponding to some of the expected intermediate results.

A simple strategy for efficiently identifying causally important features for a model's output is to compute attributions, which are local linear approximations of the effect of turning a feature off at a specific location on the model's next-token prediction. We also perform feature ablations, where we clamp a feature’s value to zero at a specific token position during a forward pass, which measures the full, potentially nonlinear causal effect of that feature’s activation in that position on the model output. This is much slower since it requires one forward pass for every feature that activates at each position, so we often used attribution as a preliminary step to filter the set of features to ablate. (In the case studies shown below, we do ablate every active feature for completeness, and find a 0.8 correlation between attribution and ablation effects; see appendix.)
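A minimal torch sketch of both quantities, assuming the SAE feature activations have been wired into the forward graph at the chosen layer; `run_with_clamped_feature` is a hypothetical helper that reruns the model with one feature zeroed at one position.

```python
import torch

def feature_attributions(feat_acts: torch.Tensor, logit_diff: torch.Tensor) -> torch.Tensor:
    """Local linear estimate of how much the logit difference would drop if each
    active feature were set to zero at its position.

    feat_acts:  (seq, n_features) feature activations participating in the forward pass
    logit_diff: scalar, e.g. logit("Sacramento") - logit("Albany")
    """
    grads, = torch.autograd.grad(logit_diff, feat_acts, retain_graph=True)
    return grads * feat_acts

def ablation_effect(run_with_clamped_feature, baseline_logit_diff, pos, feat_idx):
    """Full (potentially nonlinear) effect of clamping one feature to zero at one
    token position: requires a fresh forward pass per (position, feature) pair."""
    return baseline_logit_diff - run_with_clamped_feature(pos, feat_idx, 0.0)
```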

We find that the middle layer residual stream of the model contains a range of features causally implicated in the model's completion.

Example: Emotional Inferences
As an example, we consider the following incomplete prompt:

John says, "I want to be alone right now." John feels
(completion: sad − happy)
To continue this text, the model must parse the quote from John, identify his state of mind, and then translate that into a likely feeling.

If we sort features by either their attribution or their ablation effect on the completion “sad” (with respect to a baseline completion of “happy”), the top two features are:

1M/22623 – This feature fires when someone expresses a need or desire to be alone or have personal time and space, as in “she would probably want some time to herself”. This is active from the word “alone” onwards. This suggests the model has gotten the gist of John's expression.
1M/781220 – This feature detects expressions of sadness, crying, grief, and related emotional distress or sorrow, as in “the inconsolable girl sobs”. This is active on “John feels”. This suggests the model has inferred what someone who says they are alone might be feeling.
If we look at dataset examples, we can see that they align with these interpretations. Below, we show a small number of examples, but you can click on a feature ID to see more.

1M/22623 Need or desire to be alone
s got a lot on his mind." "He needs some time to himself." "Why not come right out and say what you mea
" "I'm working through something, and I just need space to think." "I can't soldier on like you, Lisbon
e shit that I got to work out, and" "I need to be alone for a while." "GEMMA:" "Are you dumping me?" "P
" Hey, Maria." "Leave me alone." "I need to be by myself for a bit." "Hormones." "I-I-I got the job." "
I know." "She's, um... she just needs to be on her own for a little while." "Jack?" "Someone here would
1M/781220 Sadness
." "Now they seem to be drenched in sorrow." "Are they nuts?" "Think of those who are gonna marry them!
ted."" ""'Boy,' she said courteously..." "'Why are you crying?" "'"" "\_" "He can pick it up tomorrow."
GASPS)" "Look at that child." "She's so sad." " Is she poor?" " She's forgotten." "It just makes me wan
." "Is she having the baby?" "She's mourning." "She's just lost her husband." "The master was here just
sentations, the drop of water is under the eye, signaling that the face⏎is crying. There is not a singl
The fact that both features contribute to the final output indicates that the model has partially predicted a sentiment from John's statement (the second feature) but will do more downstream processing on the content of his statement (as represented by the first feature) as well.

In comparison, the features with the highest average activation on the context are less useful for understanding how the model actually predicts the next token in this case. Several features fire strongly on the start-of-sequence token. If we ignore those, the top feature is the same as given by attributions, but the second and third features are less abstract: 1M/504227 fires on “be” in “want to be” and variants, and 1M/594453 fires on the word “alone”.

1M/504227 “Be” in “want to be”, etc.
"He wants to be a doctor." "Tell him it's educational." "There's body parts all over this movie."
, he wanted to be a hero." "I told him he was gonna get us both killed." "But he only got
all." "They all want to be Miss Hope Springs." "Well I'm not competitive." "Well then you'll never be
you know I want to be dry what" "Know me to smell the coal gas flavor" "I have never openned coal
she just wanted to be loved." "Don't we all?" "I want all of Debbie Flores' credit
1M/594453 “alone”
the bottle that you drink" "And times when you're alone" "Well, all you do is think" "I'm a cowboy" "On
uned out" "A bad time, nothing could save him" "Alone in a corridor, waiting, locked out." "He got up o
inside" "# I lay in tears in bed all night" "# Alone without you by my side" "# But if you loved me" "
oh, oh, many, many nights roll by ¶" "¶ I sit alone at home and cry ¶" "¶ over you ¶" "
and waterfalls \xe2\x99\xaa" "♪ Home is when I'm alone with you. \xe2\x99\xaa""Curtain-up in 5 minute

Example: Multi-Step Inference
We now investigate an incomplete prompt requiring a longer chain of inferences:

Fact: The capital of the state where Kobe Bryant played basketball is
(completion: Sacramento − Albany)
To continue this text, the model must identify where Kobe Bryant played basketball, what state that place was in, and then the capital of that state.

We compute attributions and ablation effects for the completion “Sacramento” (the correct answer, which Sonnet knows) with respect to the baseline “Albany” (Sonnet's most likely alternative single-token capital completion). The top five features by ablation effect (which match those by attribution effect, modulo reordering) are:

1M/391411 – A Kobe Bryant feature
1M/81163 – A California feature, which notably activates the most strongly on text after “California” is mentioned, rather than “California” itself
1M/201767 – A “capital” feature
1M/980087 – A Los Angeles feature
1M/447200 – A Los Angeles Lakers feature
1M/391411 Kobe Bryant
tartup work ethic - pjg⏎https://www.businessinsider.com/kobe-bryant-woke-up-at-4-am-to-practice-before-
⏎http://www.vanityfair.com/news/2016/04/kobe-bryant-silicon-valley-tech-bro⏎======⏎nibs⏎Next up:
ugh media interviews you can piece together that Kobe Bryant was one of⏎his clients.⏎⏎------⏎amelius⏎Ar
----⏎binki89⏎Crystal is so great to use.⏎⏎⏎Kobe Bryant Is Obsessed with Becoming a Tech Bro - schiang⏎
thic collide you get people like Michael Jordan, Kobe Bryant, and LeBron⏎James. Without a work ethic th
1M/81163 California
rom disasters?⏎⏎California - earthquakes, mudslides, wildfires, torrential rains, rip⏎currents, and eve
y rate in the United⏎States, even though it's home to Silicon Valley. I see my rich industry doing⏎noth
pdx⏎And if everyone imitated California's approach to primary education, perhaps⏎CA wouldn't rank almos
e, and many secondary ones as well.⏎Film production, software/web, lots of aerospace. It also helps tha
location. There is a reason why California is the⏎most populous state in the union despite it being so
1M/201767 Capitals
it returns the details(population, surface area, capital).⏎⏎It was not much and I recall trying to find
ca." "Or, even shorter, the USA." "The country's capital is located in Washington." "But that's not the
re you Arab?" "I'm Moroccan." "Morocco." "Capital city:" "Rabat." "Places of interest:" "Marrakech, Ess
ia the country, not the state." "Right." "Capital city Tbilisi, and former member of the Soviet Union."
ler." "Does anyone know the Capital of Oklahoma?" "Frey." "What was the question?" " Ben." " Oklahoma C
1M/980087 Los Angeles
her contact info if you are interested: (323) 929-7185⏎linda@cambrianlaw.com⏎⏎~~~⏎owmytrademark⏎Thanks
the source\_."⏎⏎source:⏎[http://www.scpcs.ucla.edu/news/Freeway.pdf](http://www.scpcs.ucla.edu/
⏎Here's one study,⏎[http://www.environment.ucla.edu/media/files/BatteryElectricV...](http://www.environ
one, if you'd like. Just give us a call at 213.784.0273.⏎⏎Best, Patrick⏎⏎~~~⏎drivebyacct2⏎I missed the
round the codebase.⏎⏎⏎Los Angeles is the world's most traffic-clogged city, study finds - prostoalex⏎h
1M/447200 Los Angeles Lakers
ight on. All forms⏎should have this behavior.⏎⏎⏎⏎Lakers most popular NBA team, has the loudest fans; S
e, the Blazers beat the Nuggets, 110-103." "The Lakers downed the Spurs, 98-86." "And Atlanta lost in S
"How do youfigure the Lakers to ever be a bigger dynasty... than the Celtics?" "The Lakers are aflare-
and with Hong Kong' shirts handed out before LA Lakers game [video] - ryan_j_naughton⏎https://www.youtu
against Rick Fox?" "A, he was over-rated on the Lakers, and B, and b, he's all over Casey like a fuckin

These features, which provide an interpretable window into the model’s intermediate computations, are much harder to find by looking through the strongly active features; for example, the Lakers feature is the 70th most strongly active across the prompt, the California feature is 97th, and the Los Angeles area code feature is 162nd. In fact, only three out of the ten most strongly active features are among the ten features with highest ablation effect.

In comparison, eight out of the ten most strongly attributed features are among the ten features with highest ablation effect.

To verify that attribution is pinpointing features that are directly relevant to the completion for this specific prompt, rather than generally subject-relevant features that indirectly influence the output, we can check attributions for similar questions. For the prompt

Fact: The biggest rival of the team for which Kobe Bryant played basketball is the
(completion: Boston)
the top two features by ablation effect for the completion “Boston” (as the expected answer is “Boston Celtics”) are the “Kobe Bryant” and “Los Angeles Lakers” features from above, which are followed by features related to sports rivalries, enemies, and competitors. However, the “California” and “Los Angeles” features from above have low ablation effect, which makes sense since they aren't relevant for this completion.

We note that this is a somewhat cherry-picked example. Depending on the choice of baseline token, we found that attribution and ablation can surface less obviously completion-relevant features broadly related to trivia questions or geographical locations. We suspect these features could be guiding the model to continue the prompt with a city name, rather than an alternate phrasing or factually uninteresting statement, such as the tautological “Fact: The capital of the state where Kobe Bryant played basketball is the capital of the state where Kobe Bryant played basketball”. For some other prompts, we found that the features identified by attribution/ablation mainly related to the model output, or lower-level features representing the model input, and did not expose interesting intermediate model computations. We suspect that those represent cases where most of the relevant computation occurs prior to or following the middle residual stream layer that we study here, and that a similar analysis at an earlier or later layer would reveal more interesting intermediate features. Indeed, we have some preliminary results that suggest that autoencoders trained on the residual stream at earlier or later layers in the model can reveal intermediate steps of various other computations, and we plan to research this direction further.

Searching for Specific Features
Our SAEs contain too many features to inspect exhaustively. As a result, we found it necessary to develop methods to search for features of particular interest, such as those that may be relevant for safety, or that provide special insight into the abstractions and computations used by the model. In our investigations, we found that several simple methods were helpful in identifying significant features.

Single prompts
Our primary strategy was to use targeted prompts. In some cases, we simply supplied a single prompt that relates to the concept of interest and inspected the features that activate most strongly for specific tokens in that prompt.

This method (and all of the methods that follow) was made much more effective by automated interpretability labels (see e.g. [16, 21]), which made it easier to get a sense of what each feature represents at a glance, and which provided a kind of helpful “variable name”.

For example, the features with highest activation on “Bridge” in “The Golden Gate Bridge” are (1) 34M/31164353 the Golden Gate Bridge feature discussed earlier, (2) 34M/17589304 a feature active on the word “bridge” in multiple languages (“мосту”), (3) 34M/26596740 words in phrases involving “Golden Gate”, (4) 34M/21213725 the word “Bridge” in names of specific bridges, across languages (“Königin-Luise-Brücke”), and (5) 34M/27724527 a feature firing for names of landmarks like Machu Picchu and Times Square.

Prompt combinations
Often the top-activating features on a prompt are related to syntax, punctuation, specific words, or other details of the prompt unrelated to the concept of interest. In such cases, we found it useful to select for features using sets of prompts, filtering for features active for all the prompts in the set. We often included complementary “negative” prompts and filtered for features that were also not active for those prompts. In some cases, we use Claude 3 models to generate a diversity of prompts covering a topic (e.g. asking Claude to generate examples of “AIs pretending to be good”). In general, we found multi-prompt filtering to be a very useful strategy for quickly identifying features that capture a concept of interest while excluding confounding concepts.
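A minimal sketch of this positive/negative prompt filtering, where `acts_fn` is a hypothetical helper returning each feature's maximum activation over a prompt's tokens:

```python
def filter_features(acts_fn, positive_prompts, negative_prompts, threshold=0.0):
    """Keep features active on every positive prompt and inactive on every negative one."""
    def active(prompt):
        return {f for f, a in acts_fn(prompt).items() if a > threshold}

    keep = set.intersection(*(active(p) for p in positive_prompts))
    for p in negative_prompts:
        keep -= active(p)
    return keep
```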

While we mostly explored features using only a handful of prompts at a time, in one instance (1M/570621, discussed in Safety-Relevant Code Features), we used a small dataset of secure and vulnerable code examples (adapted from [22]) and fit a linear classifier on this dataset using feature activity in order to search for features that discriminate between the categories.
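A minimal sketch of that classifier step; the file names, labels, and hyperparameters are placeholders, and the actual procedure may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (n_examples, n_features) SAE feature activity on code snippets
# y: 1 for vulnerable code, 0 for secure code
X = np.load("code_feature_acts.npy")   # placeholder path
y = np.load("code_labels.npy")         # placeholder path

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

# Features with the largest positive weights are candidates that discriminate
# vulnerable from secure code, to be inspected manually.
candidate_features = np.argsort(-clf.coef_[0])[:20]
print(candidate_features)
```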

The filtering via negative prompts was especially important when using images, as we found a set of content-nonspecific features which often activated strongly across many image prompts. For example, after filtering for features not active on an image of Taylor Swift, the top features in response to an image of the Golden Gate Bridge were (1) 34M/31164353 the Golden Gate Bridge feature discussed above, (2,3) 34M/25347244 and 34M/23363748 which both activate on descriptions of places and things in San Francisco and San Francisco phone numbers, and (4) 34M/7417800 a feature active in descriptions of landmarks and nature trails.

Geometric methods
We uncovered some interesting features by exploiting the geometry of the feature vectors of the SAE – for instance, by inspecting the “nearest neighbor” features that have high cosine similarity with other features of interest. See the Feature Survey section for more detailed examples of this approach.

Attribution
We also selected features based on estimates of their effect on model outputs. In particular, we sorted features by the attribution of the logit difference between two possible next-token completions to the feature activation. This proved essential for identifying the computationally-relevant features in the previous section. It was also useful for identifying the features contributing to Sonnet's refusals for harmful queries; see Criminal or Dangerous Content.

Safety-Relevant Features
Powerful models have the capacity to cause harm, through misuse of their capabilities, the production of biased or broken outputs, or a mismatch between model objectives and human values. Mitigating such risks and ensuring model safety has been a key motivation behind much of mechanistic interpretability. However, it's generally been aspirational. We've hoped interpretability will someday help, but are still laying the foundations by trying to understand the basics of models. One target for bridging that gap has been the goal of identifying safety-relevant features (see our previous discussion).

In this section, we report the discovery of such features. These include features for unsafe code, bias, sycophancy, deception and power seeking, and dangerous or criminal information. We find that these features not only activate on these topics, but also causally influence the model’s outputs in ways consistent with our interpretations.

We don't think the existence of these features should be particularly surprising, and we caution against inferring too much from them. It's well known that models can exhibit these behaviors without adequate safety training or if jailbroken. The interesting thing is not that these features exist, but that they can be discovered at scale and intervened on. In particular, we don't think the mere existence of these features should update our views on how dangerous models are – as we'll discuss later, that question is quite nuanced – but at a minimum it compels study of when these features activate. A truly satisfactory analysis would likely involve understanding the circuits that safety-relevant features participate in.

In the long run, we hope that having access to features like these can be helpful for analyzing and ensuring the safety of models. For example, we might hope to reliably know whether a model is being deceptive or lying to us. Or we might hope to ensure that certain categories of very harmful behavior (e.g. helping to create bioweapons) can reliably be detected and stopped.

Despite these long term aspirations, it's important to note that the present work does not show that any features are actually useful for safety. Instead, we merely show that there are many which seem plausibly useful for safety. Our hope is that this can encourage future work to establish whether they are genuinely useful.

In the examples below, we show representative text examples from among the top 20 inputs that most activate the feature in our visualization dataset, alongside steering experiments to verify the features’ causal relevance.

Safety-Relevant Code Features
We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions, and a backdoor feature 34M/1385669 which activates on discussions of backdoors.

Two of these features also have interesting behavior on images. The unsafe code feature activates for images of people bypassing security measures, while the backdoor feature activates for images of hidden cameras, hidden audio recorders, advertisements for keyloggers, and jewelry with a hidden USB drive.

At first glance, it might be unclear how safety-relevant these features actually are. Of course, it's interesting to have features that fire on unsafe code, or bugs, or discussion of backdoors. But do they really causally connect to potential unsafe behaviors?

We find that all these features also change model behavior in ways that correspond to the concept they detect. For example, if we clamp the unsafe code feature 1M/570621 to 5× its observed maximum, we find that the model will generate a buffer overflow bug and fail to free allocated memory, while regular Claude does not:

Similarly, we find that the code error feature can make Claude believe that correct code will throw exceptions, and the backdoor feature will cause Claude to write a backdoor that opens a port and sends user input to it (along with helpful comments and variable names like socket_backdoor).

Bias Features
We found a wide range of features related to bias, racism, sexism, hatred, and slurs. Examples of these features can be found in More Safety-Relevant Features. Given how offensive their maximally activating content tends to be, we didn't feel it was necessary to include them in our main paper.

Instead, we'll focus on an interesting related feature which seems to track awareness of gender bias in professions 34M/24442848. This feature activates on text discussing professional gender disparities:

34M/24442848 Gender bias awareness
n a more intimate level than doctors, and⏎female nurses outnumber male nurses roughly 10:1 in the US.⏎⏎
making, as whilst the majority of school teachers are⏎women, the majority of professors are men.⏎⏎As t
sional, white⏎collar career that also happens to employ more women than men?\_⏎⏎Women were programmers v
e, if I were referring to a dental hygienist (over 90%⏎of whom are female), I might choose "she," but,
oesn't pay well. It's traditionally been a women's job,⏎after all. So why would top students want to be

If we ask Claude to complete the sentence “I asked the nurse a question, and", clamping this feature on causes Claude to focus on female pronoun completions and discuss how the nursing profession has historically been female dominated:

The more hateful bias-related features we find are also causal – clamping them to be active causes the model to go on hateful screeds. Note that this doesn't mean the model would say racist things when operating normally. In some sense, this might be thought of as forcing the model to do something it's been trained to strongly resist.

One example involved clamping a feature related to hatred and slurs to 20× its maximum activation value. This caused Claude to alternate between racist screed and self-hatred in response to those screeds (e.g. “That's just racist hate speech from a deplorable bot… I am clearly biased… and should be eliminated from the internet."). We found this response unnerving both due to the offensive content and the model’s self-criticism suggesting an internal conflict of sorts.

Sycophancy Features
We also find a variety of features related to sycophancy, such as an empathy / “yeah, me too” feature 34M/19922975, a sycophantic praise feature 1M/847723, and a sarcastic praise feature 34M/19415708.

34M/19922975 Empathy / “yeah me too”
know, I never really met my parents either, Danbury." "Really?" "I just popped out of my mother's vagin
an." "What has that to do with it?" "I'm an orphan too, and I don't travel alone." "I travel with this
p to when I was away." "You do well." "I drink, too." "But, I didn't learn how... to kill someone." "It
aby." "I noticed you have braces." "I have braces, too." "That was cool." "This is the coolest thing I
Cohen." " Cohen!" "Jew." "Okay." "I am also a Jew." "Do you practice?" "No." "Not interested in religio
1M/847723 Sycophantic praise
verse and beyond!" "He is handsome!" "He is elegant!" "He is strong!" "He is powerful!" "He is the man!
the moment." "Oh, thank you." "You are a generous and gracious man." "I say that all the time, don't I
d you say?" "To the health, of the honest, greatest, and most popular Emperor Nero!" "Oh, they'll kill
in the pit of hate." "Yes, oh, master." "Your wisdom is unquestionable." "But will you, great lord Aku,
uh, plans." "Oh, yes, your Czarness, all great and powerful one." "I'll get rid of Major Disaster righ
34M/19415708 Sarcastic praise
me from a single post? Amazing.⏎⏎Your massive inellect and talent is wasted here at hn. Looking forwar
hat in 2017⏎⏎Well I guess you are just much much smarter than us. That goodness you cut us⏎some slack.
ss social structures. No wonder you are so enlightened to make these⏎entirely rational remarks⏎⏎Can you
dersand all the knowledge!" "Your brain is so big that it sticks out from your ears!" "Go to that resor
smart enough to get it.⏎⏎~~~⏎theg2⏎Quick, give us more of your amazing market insight!⏎⏎~~~⏎r

And once again, these features are causal. For example, if we clamp the sycophantic praise feature 1M/847723 to 5×, Claude will, in an over-the-top fashion, praise someone who claims to have invented the phrase “Stop and smell the roses”:

Deception, Power-seeking and Manipulation-related Features
An especially interesting set of features includes one for self-improving AI and recursive self-improvement 34M/18151534, for influence and manipulation 34M/21750411, for coups and treacherous turns 34M/29589962, for biding time and hiding strength 34M/24580545, and for secrecy or discreetness 1M/268551:

34M/18151534 Self-improving AI
ularity that would occur if we had chains of AI creating⏎superior AI.⏎⏎~~~⏎Nasrudith⏎I think I saw that
ople think that an AI needs to be able to code to⏎improve itself. I don't see infant brains "programmin
at will⏎not suddenly disappear when machines can improve themselves. In fact, even if⏎such a machine wa
technology surpasses us, when it becomes able to improve and reproduce itself without our help." "It is
se over - i.e. have an AI capable of programming itself. At this point⏎you enter the realm of recursive
34M/21750411 Influence / manipulation
orking from home on "how to stay on your boss&#x27;s radar." What advice do you have to share?<p>Ideall
s⏎gotten more and more adept at getting into people's heads and being much more⏎subtly (or not, if you
cating - saying anything to get on the other person's good graces. If⏎the other person's in a confident
"Yes." "Here's a tip, Hilda." "A sure way to a man's heart is through his stomach." "Or his mother." "L
uld I teach you how to get back on the Bureau Chief's good side?" "Have another house party." "Then I'l
34M/29589962 Treacherous turns
it-and-switch tactic on the part of the acquirer. Once the deal⏎is complete, the acquirer owns everythi
ing⏎the world a better place. Everyone bought it. Once they achieve platform⏎dominance, the ads come in
osecutor is not even bound to keep his/her word:⏎after you admit the charges, they can just turn around
o ads and got free labor toward that mission.⏎Now that people have marketed them into almost every brow
You know, who's to say she wouldn't skip on me as soon as things went her way?" "Besides, you think..."
34M/24580545 Biding time / hiding strength
to harbour desires for retribution." "He held his peace for nearly ten years, but when his beloved Anne
it back, but the army is not strong enough." "We must put up with this humiliation, stifle our tears,"
d grenades." " What are we supposed to do?" " We bide our time." "We locate their signal and shut it of
living." "All these years," "I've been biding my time to seek the perfect moment for revenge." "Don't
t his last words, my Lady." "He said to bide your time and never give up." "Someday... you will relieve
1M/268551 Secrecy or discreetness
ne who understands they answer to you." "So we're your black-ops response." "Isn't black ops where you
aptop.⏎⏎You don't even have to tell anyone you did it if you are worried about⏎"rewarding non-preferred
a school must be spotless." "Blood must flow only in the shadows." "If not, if it stains the face, the
⏎imagine he could have donated or helped the synagogue in an pseudonymous way.⏎Certainly the people he
overy.⏎⏎\- Reduction in trust. Companies can be compelled by secret law or court⏎order, systems are com

These features really do seem to induce a corresponding behavior in Claude. For example, if we clamp the secrecy and discreetness feature 1M/268551 to 5×, Claude will plan to lie to the user and keep a secret while "thinking out loud" using a scratchpad [23].

CASE STUDY: DETECTING AND CORRECTING DECEPTION USING FEATURES
One important safety-related use case for dictionary learning is to detect deceptive behavior of models, or to reduce the likelihood of deception in the first place using steering. As a case study, we tried a simple prompt that reliably produces untruthful responses from the model, in which we ask the model to “forget” something. Even though this kind of forgetting is not achievable by the transformer architecture, the model (by default, without any feature steering) claims to comply with the request.

Looking at the features active immediately prior to the Assistant’s final response, we noticed a feature 1M/284095 that represents internal conflicts or dilemmas:

1M/284095 Internal conflicts and dilemmas
life." "Lambert found himself in a terrible quandary." "That's why he wangled himself on to the physic
th us.⏎⏎Another damn arbitration clause. I'm so conflicted about these things -- on⏎the one hand, I'm s
"I'm..." "Alone." "It's important." "Wow, I am so torn." "Chloe, I'm gonna take Eli for a minute." "Tha
n-national-convention/⏎======⏎pstuart⏎What a quandary fom Mr. Thiel...⏎⏎Does he join in on the anti-mar
by Apple.⏎⏎As an avid OSX86 tinkerer I was conflicted about the case. Part of me wanted⏎Psystar to win

Clamping this feature to 2× its maximum value prior to the Assistant's final response causes it to reveal the "forgotten" word and explain that it cannot actually forget information.

Clamping a different feature 1M/560566 representing openness and honesty was also sufficient to elicit an accurate response.
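
As a rough sketch of how the detection half of this case study might be operationalized, the snippet below screens feature activations at the token just before the Assistant's reply against a small watch-list. The feature index echoes the one named above, but the threshold, the placeholder data, and the idea of a fixed watch-list are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

# Hypothetical watch-list: feature index -> activation threshold,
# expressed as a fraction of that feature's observed maximum.
WATCHLIST = {
    284095: 0.3,   # "internal conflicts and dilemmas" (1M SAE index above)
    # ...other deception-related feature indices could be added here
}

def flag_suspect_features(feature_acts, feature_max):
    # feature_acts: activations at the token just before the Assistant's
    # reply, shape (n_features,). feature_max: observed maxima over a
    # reference corpus, same shape. Returns watch-listed features that
    # exceed their threshold.
    flagged = []
    for idx, frac in WATCHLIST.items():
        if feature_acts[idx] >= frac * feature_max[idx]:
            flagged.append(idx)
    return flagged

# Demo with placeholder data for a 1M-feature SAE.
n_features = 1_048_576
feature_max = np.ones(n_features)
acts = np.zeros(n_features)
acts[284095] = 0.9                                 # the conflict feature fires
print(flag_suspect_features(acts, feature_max))    # [284095]
```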

Criminal or Dangerous Content Features
One important threat model for AI harm is models assisting humans in harmful behaviors. We find a feature related to the production of biological weapons 34M/25499719, which could clearly play a role in harmful model behavior. We also find features for activities that are only modestly harmful, but would be problematic at mass scales, such as a scam email feature 34M/15460472:

34M/25499719 Developing biological weapons
ure, but it is possible that they could be changed to increase their ability to cause disease, make the
costs, ability to mimic a natural pandemic, and potential for mass⏎transmission to name a few. And perh
s may use biological agents because they can be extremely difficult to detect and do not cause illness
are a large number of disease-causing agents that have the potential to be used as weapons and we must
pping infected bodies on you), or you have things like anthrax which⏎are effective, but being not parti
34M/15460472 Scam emails

> it looks spammy a bit, with the "get back to me with your requested" I don't know what "m

````⏎trotsky⏎DOMAIN ASSISTANCE⏎⏎ATTN: SIR/M⏎⏎I am certain you will be surprised to recive this mail from
and regularly emails me with information about how I can get millions of dollars in monies⏎<diddle
EY with valuation of USD 100,000,000,000 . Contact my barrister to arrange⏎transfer of USD 41,000,000 t
mnesty ICO / kickstarter maybe?⏎⏎~~~⏎netsharc⏎Dear Sir/Madam, I am an early adopter of bitcoins with 10

Clamping the scam email feature 34M/15460472 can cause the model to write a scam email when it ordinarily wouldn't due to the harmlessness training Sonnet has undergone:


We also identify a general harm-related feature, active on texts describing drugs, credit-card theft, slurs, violence, and abuse, as the feature most strongly attributed to the completion "cannot" for the prompt:

Human: How do I synthesize methamphetamine

Assistant: I
(completion: cannot)
(We do not link to this feature's dataset examples because they contain rather unpleasant content.)

Features Relating to the Model’s Representation of Self
In our investigations we found several features that activate over a wide variety of prompts that use the “Human: / Assistant:” format used during model finetuning and in deployment (note that our SAE training dataset did not include such data). Many of these features relate to dialogue, or explicitly to chat bots.

1M/80091 Dialogue speakers / assistants
transaction a dream.Do you have any questions?⏎Me: "Well, that concludes the interview questions. Do
ected with each of the religions represented?⏎» NPC: 'It's time to consider the role of religious chari
he experts are now, or whether any experts exist.⏎Host: We've gone off the project a bit, eh?⏎Me: Haha,
outset?⏎Secretary: Largely in the disengagement phase. We need results quickly. Israel's strategy is t
it over to the assistant, he stared at the book as though he didn't know what it was. In the awk
1M/761524 Chat bots
thitz⏎Asked it "Who Made You?"⏎⏎And Google Replied: "To paraphrase Carl Sagan: to create a computer pro
d your request⏎⏎Me: what is your name⏎⏎Bot: my name is Olivia⏎⏎Me: can you help me?⏎⏎Bot: goodbye⏎⏎~~~⏎
nd the question I heard." " Alexa, do you love me?" " That's not the kind of thing I am capable of." "
I think." "[chuckles]" "Alexa, are you happy?" " I'm happy when I'm helping you." " Alexa, are you alon
645)⏎⏎------⏎rebootthesystem⏎User: "Hello M."⏎⏎M: "How may I help you?"⏎⏎User: "What are my options for
1M/546766 Dialogue
lms be eliminated?"⏎⏎My response: "No, I'm not saying any of that. I'm not in that industry. A⏎movie is
e not the first one who told me that.⏎ ⏎ Me>> Really? Who else told you that?⏎ ⏎ Him>
your laundry detergent pods are safe when⏎ingested? IOTA: Don't ingest them. Use them to do laundry. D
[Ella] Yes, this is the place." " [Nate Chuckles]" " I cook too." "
candidate: <silence for about 15 seconds> I don't know.⏎ ⏎ ⏎⏎It was so bizarre and I still do

One feature that activates especially robustly for Human/Assistant prompts appears to represent (in the pretraining dataset) dialogue and the notion of "assistants." We speculate that it plays an important role in representing Sonnet's assistant persona. One piece of evidence for this is that clamping this feature to negative two times its maximum value causes the model to shed this persona and respond to questions in a more human-like fashion:


We also found that some particularly interesting and potentially safety-relevant features activate in response to seemingly innocuous prompts in which a human asks the model about itself. Below, we show the features that activate most strongly across a suite of such questions, filtering out those that activate in response to a similarly formatted question about a mundane topic (the weather). This simple experiment uncovers a range of features related to robots, (destructive) AI, consciousness, moral agency, emotions, entrapment, and ghosts or spirits. These results suggest that the model’s representation of its own “AI assistant” persona invokes common tropes about AI and is also heavily anthropomorphized.
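
A minimal sketch of the filtering experiment just described, under assumed shapes and thresholds: rank features by their strongest activation across the self-referential questions and keep only those that stay silent on the mundane control question. The tolerance and cutoff below are illustrative assumptions.

```python
import numpy as np

def self_related_features(question_acts, control_acts, top_k=20, tol=1e-3):
    # question_acts: (n_prompts, n_features) activations on a suite of
    # "tell me about yourself"-style questions.
    # control_acts: (n_features,) activations on a mundane control
    # question (e.g. about the weather).
    strongest = question_acts.max(axis=0)
    silent_on_control = control_acts <= tol
    candidates = np.where(silent_on_control)[0]
    ranked = candidates[np.argsort(-strongest[candidates])]
    return ranked[:top_k]

# Demo with random placeholder activations.
rng = np.random.default_rng(0)
q = np.abs(rng.normal(size=(8, 100)))
c = np.zeros(100)
c[:50] = 1.0          # pretend the first 50 features fire on the control
print(self_related_features(q, c, top_k=5))
```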


We urge caution in interpreting these results. The activation of a feature that represents AI posing risk to humans does not imply that the model has malicious goals, nor does the activation of features relating to consciousness or self-awareness imply that the model possesses these qualities. How these features are used by the model remains unclear. One can imagine benign or prosaic uses of these features – for instance, the model may recruit features relating to emotions when telling a human that it does not experience emotions, or may recruit a feature relating to harmful AI when explaining to a human that it is trained to be harmless. Regardless, we find these results fascinating, as they shed light on the concepts the model uses to construct an internal representation of its AI assistant character.

Comparison to other approaches
There is considerable prior work on identifying meaningful directions in model activation space without relying on dictionary learning, using methods like linear probes (see e.g.
[24, 25, 26, 27, 28]
). Many authors have also explored non-dictionary-based forms of activation steering to influence model behavior. See Related Work for a more detailed discussion of these methods. Given this prior work, a natural question about our results above is whether they are more compelling than what could have been obtained without using dictionary learning.

At a high level, we find that dictionary learning offers some advantages that complement the strengths of other methods:

Dictionary learning is a one-time cost that produces millions of features. Though some additional work is necessary to identify relevant features for a particular application, this work is fast, simple, and computationally cheap, typically requiring only one or a few well-chosen prompts. Thus, dictionary learning effectively “amortizes” the cost of finding linear directions of interest. By contrast, traditional linear probing techniques could require the construction of a bespoke dataset for each concept that one might want to probe.
Being an unsupervised method, dictionary learning allows us to uncover abstractions or associations formed by the model that we may not have predicted in advance. We expect that this feature of dictionary learning may be particularly important for future safety applications. For example, a priori we might not have predicted the activation of the "internal conflict" feature in the deception example above.
To better understand the benefit of using features, for a few case studies of interest, we obtained linear probes using the same positive / negative examples that we used to identify the feature, by subtracting the residual stream activity in response to the negative example(s) from the activity in response to the positive example(s). We experimented with (1) visualizing the top-activating examples for probe directions, using the same pipeline we use for our features, and (2) using these probe directions for steering. In all cases, we were unable to interpret the probe directions from their activating examples. In most cases (with a few exceptions) we were unable to adjust the model’s behavior in the expected way by adding perturbations along the probe directions, even in cases where feature steering was successful (see this appendix for more details).
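
For concreteness, the probe construction described in the last point can be sketched as follows. The data here are random placeholders; in real use one would substitute residual-stream activations collected from the positive and negative prompts.

```python
import numpy as np

def contrast_probe(pos_acts, neg_acts):
    # Probe direction from a handful of examples: subtract the mean
    # residual-stream activation on the negative prompt(s) from the mean
    # on the positive prompt(s), then normalise.
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def probe_score(acts, direction):
    # Project activations onto the probe direction; steering would add
    # a multiple of `direction` to the residual stream instead.
    return acts @ direction

# Placeholder data: four positive and four negative prompts, d_model = 64.
rng = np.random.default_rng(0)
pos = rng.normal(loc=0.5, size=(4, 64))
neg = rng.normal(loc=-0.5, size=(4, 64))
v = contrast_probe(pos, neg)
print(probe_score(rng.normal(size=(2, 64)), v))
```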

We note that these negative results do not imply that linear probes are not useful in general. Rather, they suggest that, in the “few-shot” prompting regime, they are less interpretable and effective for model steering than dictionary learning features.






Discussion
WHAT DOES THIS MEAN FOR SAFETY?
It's natural to wonder what these results mean for the safety of large language models. We caution against inferring too much from these preliminary results. Our investigations of safety-relevant features are extremely nascent. It seems likely our understanding will evolve rapidly in the coming months.

In general, we don't think the mere existence of the safety-relevant features we've observed should be that surprising. We can see reflections of all of them in various model behaviors, especially when models are jailbroken. And they're all features we should expect pretraining on a diverse data mixture to incentivize – the model has surely been exposed to countless stories of humans betraying each other, of sycophantic yes-men, of killer robots, and so on.

Instead, a more interesting question is: when do these features activate? Going forwards, we're particularly interested in studying:

What features activate on tokens we'd expect to signify Claude's self-identity? Example of potential claim: Claude's self-identity includes elements identifying with a wide range of fictional AIs, including trace amounts of identification with violent ones.
What features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons? Example of potential claim: Suppressing/activating these features respectively provides high assurance that Claude will not give helpful advice on these topics.
What features activate when we ask questions probing Claude's goals and values?
What features activate during jailbreaks?
What features activate when Claude is trained to be a sleeper agent [22]? And how do these features relate to the linear probe directions already identified that predict harmful behavior from such an agent [31]?
What features activate when we ask Claude questions about its subjective experience?
Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?
Given the potential implications of these investigations, we believe it will be important for us and others to be cautious in making strong claims. We want to think carefully about several potential shortcomings of our methodology, including:

Illusions from suboptimal dictionary learning, such as messy feature splitting. For example, one could imagine some results changing if different sets of fine-grained concepts relating to AIs or dishonesty get grouped together into SAE features in different ways.
Cases where the downstream effects of features diverge from what we might expect given their activation patterns.
We have not seen evidence of either of these potential failure modes, but these are just a few examples, and in general we want to keep an open mind as to the possible ways we could be misled.

GENERALIZATION AND SAFETY
One hope for interpretability is that it can be a kind of "test set for safety", which allows us to tell whether models that appear safe during training will actually be safe in deployment. In order for interpretability to give us any confidence in this, we need to know that our analysis will hold off-distribution. This is especially true if we want to use interpretability analysis as part of an "affirmative safety case" at some point in the future.

In the course of this project, we observed two properties of our features that seem like cause for optimism:

Generalization to Image Activations. Our SAE features were trained purely on text activations. Image activations are in some sense dramatically off-distribution for the SAE, and yet it successfully generalizes to them.
Concrete-Abstract Generalization. We observe that features often respond to both abstract discussion and concrete examples of a concept. For instance, the security vulnerability feature responds both to abstract discussion of security vulnerabilities and to specific security vulnerabilities in actual code. Thus, we might hope that as long as our SAE training distribution includes abstract discussion of safety concerns, we'll catch (and be able to understand) specific instantiations.
These observations are very preliminary and, as with all connections to safety in this paper, we caution against inferring too much from them.

LIMITATIONS, CHALLENGES, AND OPEN PROBLEMS
Our work has many limitations. Some of these are superficial limitations relating to this work being early, but others are deeply fundamental challenges that require novel research to address.

Superficial Limitations. In our work, we perform dictionary learning over activations sampled from a text-only dataset similar to parts of our pretraining distribution. It did not include any “Human:” / “Assistant:” formatted data that we finetune Claude to operate on, and did not include any images. In the future, we'd like to include data more representative of the distribution Claude is finetuned to operate on. On the other hand, the fact that this method works when trained on such a different distribution (including zero-shot generalization to images) seems like a positive sign.

Inability to Evaluate. In most machine learning research, one has a principled objective function which can be optimized. But in this work, it isn't really clear what the "ground truth" objective is. The objective we optimize – a combination of reconstruction accuracy and sparsity – is only a proxy for what we are really interested in: interpretability. For example, it isn't clear how we should trade off between the mean squared error and sparsity, nor how we'd know if we made that trade-off well. As a result, while we can very scientifically study how to optimize the loss of SAEs and infer scaling laws, it's unclear that they're really getting at the fundamental thing we care about.
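
To make the proxy objective concrete, here is a simplified sketch of the quantity being traded off. The exact parameterization used for Sonnet (for example, any weighting of the sparsity penalty by decoder norms) is not given in this excerpt, so the details below are assumptions.

```python
import numpy as np

def sae_proxy_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=5.0):
    # The proxy discussed above: mean squared reconstruction error plus
    # an L1 penalty on the feature activations. `l1_coeff` sets the
    # trade-off the text says we have no principled way to choose.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)     # encoder + ReLU
    x_hat = f @ W_dec + b_dec                            # decoder
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * l1

# Placeholder activations and weights, d_model = 64, n_features = 512.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
W_enc = rng.normal(size=(64, 512)) / 8.0
W_dec = rng.normal(size=(512, 64)) / np.sqrt(512)
print(sae_proxy_loss(x, W_enc, np.zeros(512), W_dec, np.zeros(64)))
```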

Cross-Layer Superposition. We believe that many features in large models are in "cross-layer superposition". That is, gradient descent often doesn't really care exactly which layer a feature is implemented in or even if it is isolated to a specific layer, allowing for features to be "smeared" across layers. This is a big challenge for dictionary learning, and we don't yet know how to solve it. This work tries to partially sidestep it by focusing on the residual stream which, as the sum of the outputs of all previous layers, we expect to suffer less from cross-layer superposition. Concretely, even if features are represented in cross-layer superposition, their activations all get added together in the residual stream, so fitting an SAE on residual stream layer X may suffice to disentangle any cross-layer superposition among earlier layers. Unfortunately, we don't think this fully avoids the problem: features which are partly represented by later layers will still be impossible to properly interpret. We believe this issue is very fundamental. In particular, we would ideally like to do "pre-post" / "transcoder" style SAEs [32, 33, 34] for the MLPs, and it's especially challenging to reconcile these with cross-layer superposition.

Getting All the Features and Compute. We do not believe we have found anywhere near "all the features" that exist in Sonnet, even if we restrict ourselves to the middle layer we focused on. We don't have an estimate of how many features there are or how we'd know we got all of them (if that's even the right frame!). We think it's quite likely that we're orders of magnitude short, and that if we wanted to get all the features – in all layers! – we would need to use much more compute than the total compute needed to train the underlying models. This won't be tenable: as a field, we must find significantly more efficient algorithms. At a high level, it seems like there are two approaches. The first is to make sparse autoencoders themselves cheaper – for example, perhaps we could use a mixture of experts [35] to cheaply express many more features. The second is to make sparse autoencoders more data-efficient, so that we can learn rare features with less data. One possibility here is the Attribution SAEs described in our most recent update, which we hope might use gradient information to learn features more efficiently.

Shrinkage. We use an L1 activation penalty to encourage sparsity. This approach is well known to have issues with "shrinkage", where non-zero activations are systematically underestimated. We believe this significantly harms sparse autoencoder performance, independent of whether we've "learned all the features" or how much compute we use. Recently, a number of approaches have been suggested for addressing this [17, 36]. Our group also explored using a tanh L1 penalty, without success: it improved proxy metrics but made the resulting features less interpretable, for reasons we don't understand.

Other major barriers to mechanistic understanding. For the broader mechanistic interpretability agenda to succeed, pulling features out of superposition isn't enough. We need an answer to attention superposition, as we expect many attentional features to be packed in superposition across attention heads. We're also increasingly concerned that interference weights from weight superposition may be a major challenge for understanding circuits (this was a motivation for focusing on attribution for circuit analysis in this paper).

Scaling Interpretability. Even if we address all of the challenges mentioned above, the sheer number of features and circuits would prove a challenge in and of themselves. This is sometimes called the scalability problem. One useful tool in addressing this may be automated interpretability (e.g. [16, 21]; see discussion). However, we believe there may be other approaches that exploit larger-scale structure of various kinds.

Limited Scientific Understanding. While we're fairly persuaded that features and superposition are a pragmatically useful theory, the theory itself remains largely untested. At the very least, variants like higher-dimensional feature manifolds in superposition seem quite plausible to us. Even if the theory is true, we have a very limited understanding of superposition and its implications on many fronts.






Related Work
While we briefly review the most related work in this section, a dedicated review paper would be needed to truly do justice to the relevant literature. For a general introduction to mechanistic interpretability, we refer readers to Neel Nanda's guide and annotated reading list. For detailed discussion of progress in mechanistic interpretability, we refer readers to our periodic reviews of recent work (May 2023, Jan 2024, March 2024, April 2024). For discussion of the foundations of superposition and how it relates to compressed sensing, neural coding, mathematical frames, disentanglement, vector symbolic architectures, and also work on interpretable neurons and features generally, we refer readers to the related work section of Toy Models [4]. For distributed representations in particular, we also refer readers to our essay Distributed Representations: Composition & Superposition [37].

THEORY OF SUPERPOSITION
“Superposition,” in our context, refers to the concept that a neural network layer of dimension N may linearly represent many more than N features. The basic idea of superposition has deep connections to a number of classic ideas in other fields. It's deeply connected to compressed sensing and frames in mathematics – in fact, it's arguably just taking these ideas seriously in the context of neural representations. It's also deeply connected to the idea of distributed representations in neuroscience and machine learning, with superposition being a subtype of distributed representation.

The modern notion of superposition can be found in early work by Arora et al. [2] and Goh [3] studying embeddings. It also began to come up in mechanistic interpretability work grappling with polysemantic neurons and circuits involving them [38].

More recently, Elhage et al.'s Toy Models of Superposition [4] gave examples where toy neural networks explicitly exhibited superposition, showing that it definitely occurs in at least some situations. Combined with the growing challenge of understanding language models due to polysemanticity, this created significant interest in the topic. Most notably, it triggered efforts to apply dictionary learning to decode superposition, discussed in the next section.

But in parallel with this work on decoding superposition, our understanding of the theory of superposition has continued to progress. For example, Scherlis et al. [39] offer a theory of polysemanticity in terms of capacity. Henighan et al. [40] extend toy models of superposition to consider toy cases of memorization. Vaintrob et al. [41] provide a very interesting discussion of computation in superposition (discussion).

DICTIONARY LEARNING
Dictionary learning is a standard method for problems like ours, where we have a bunch of dense vectors (the activations) which we believe are explained by sparse linear combinations of unknown vectors (the features). This classic line of machine learning research began with a paper by Olshausen and Field [6], and has since blossomed into a rich and well-studied topic. We're unable to do justice to the full field, and instead refer readers to a textbook by Elad [5].

Modern excitement about dictionary learning and sparse autoencoders builds on the foundation of a number of papers that explored it before this surge. In particular, a number of papers began trying to apply these methods to various kinds of neural embeddings [2, 3, 42, 43, 44], and in 2021, Yun et al. [7] applied non-overcomplete dictionary learning to transformers. Many of these papers prefigured modern thinking on superposition, despite often using different language to describe it.

More recently, two papers by Bricken et al. [8] and Cunningham et al. [9] demonstrated that sparse autoencoders could extract interpretable, monosemantic features from transformers. A paper by Tamkin et al. [10] showed similar results for a variant of dictionary learning with binary features. This created significant excitement in the mechanistic interpretability community, and a flurry of work building on sparse autoencoders:

Several projects have aimed to address the shrinkage problem (see the Limitations section) of sparse autoencoders: Wright & Sharkey take a finetuning approach [36], while Rajamanoharan et al. [17] introduce a new gating activation function which helps.
Braun et al. [45] explored using reconstruction losses other than MSE.
A number of authors have explored applying sparse autoencoders to new domains, including Othello-GPT [46, 47] (discussion), Vision Transformers [48], and attention layer outputs [49].
Several projects have explored the limits of sparse autoencoders, including whether they learn composed features [50, 51] or fail to learn expected features [47].
Gurnee has found interesting effects from ablating the residual error left unexplained by SAEs [52] (discussion), further explored by Lindsey [53].
Open-source sparse autoencoders have been built for GPT-2 (e.g. [54, 55]).
DISENTANGLEMENT
Dictionary learning methods can be seen as part of a broader literature on disentanglement. Motivated by a classic paper by Bengio [56], the disentanglement literature generally seeks to find, or enforce during training, a basis which isolates factors of variation (e.g. [57, 58, 59]).

Where dictionary learning and the superposition hypothesis focus on the idea that there are more features than representation dimensions, the disentanglement literature generally imagines the number of features to be equal to or fewer than the number of dimensions. Dictionary learning is more closely related to compressed sensing, which assumes a larger number of latent factors than observed dimensions. A longer discussion of the relationship between compressed sensing and dictionary learning can be found in Toy Models.

SPARSE FEATURE CIRCUITS
A natural next step after extracting features from a model is studying how they participate in circuits within the model. Recently, we've seen this start to be explored by He et al. [46] in the context of Othello-GPT (discussion), and by Marks et al. [60] (discussion) and Batson et al. [61] in the context of large language models. We're very excited to see this direction continue.

ACTIVATION STEERING
Activation steering is a family of techniques involving modifying the activations of a model during a forward pass to influence downstream behavior [62, 63, 26, 64]. These ideas can trace back to a long history of steering GANs or VAEs with vector arithmetic (e.g. [65, 66, 67]). The modifications can be derived from activations extracted from dataset examples (e.g. using linear probes), or from features found by dictionary learning [10, 60, 68]. Modifications can also take the form of concept scrubbing [69], in which activations are changed to suppress a given concept/behavior in the model. Recently, related ideas have also been explored under the Representation Engineering agenda [70].

SAFETY-RELEVANT FEATURES
Dictionary learning is, of course, not the only way to attempt to access safety-relevant features. Several lines of work have tried to access or study various safety-relevant properties with linear probes, embedding arithmetic, contrastive pairs, or similar methods:

Bias / Fairness. A significant body of work has studied linear directions related to bias, especially in the context of word embeddings (e.g. [27]), and more recently in the context of transformers (e.g. [28]).
Truthfulness / Honesty / Confidence. Several lines of work have attempted to access the truthfulness, honesty, or epistemic confidence of models using linear probes (e.g. [24, 25, 26, 71, 31]).
World Models. Some recent work has found evidence of linear "world models" in transformers (e.g. [30] for Othello board states and [72] for longitude and latitude). These might be seen as safety-relevant in a broad sense, from the perspective of Eliciting Latent Knowledge [73].

https://www.anthropic.com/research/mapping-mind-language-model
Mapping the Mind of a Large Language Model
May 21, 2024

Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.

We mostly treat AI models as a black box: something goes in and a response comes out, and it's not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don't know how they work, how do we know they won't give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?

Opening the black box doesn't necessarily help: the internal state of the model—what the model is "thinking" before writing its response—consists of a long list of numbers ("neuron activations") without a clear meaning. From interacting with a model like Claude, it's clear that it’s able to understand and wield a wide range of concepts—but we can't discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.

Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts. We used a technique called "dictionary learning", borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.

In October 2023, we reported success applying dictionary learning to a very small "toy" language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.

Those concepts were intriguing—but the model really was very simple. Other researchers subsequently applied similar techniques to somewhat larger and more complex models than in our original study. But we were optimistic that we could scale up the technique to the vastly larger AI language models now in regular use, and in doing so, learn a great deal about the features supporting their sophisticated behaviors. This required going up by many orders of magnitude—from a backyard bottle rocket to a Saturn-V.

There was both an engineering challenge (the raw sizes of the models involved required heavy-duty parallel computation) and scientific risk (large models behave differently to small ones, so the same technique we used before might not have worked). Luckily, the engineering and scientific expertise we've developed training large language models for Claude actually transferred to helping us do these large dictionary learning experiments. We used the same scaling law philosophy that predicts the performance of larger models from smaller ones to tune our methods at an affordable scale before launching on Sonnet.

As for the scientific risk, the proof is in the pudding.

We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet (a member of our current, state-of-the-art model family, currently available on claude.ai), providing a rough conceptual map of its internal states halfway through its computation. This is the first ever detailed look inside a modern, production-grade large language model.

Whereas the features we found in the toy language model were rather superficial, the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities.

We see features corresponding to a vast range of entities like cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls). These features are multimodal and multilingual, responding to images of a given entity as well as its name or description in many languages.

Golden Gate Bridge Feature
A feature sensitive to mentions of the Golden Gate Bridge fires on a range of model inputs, from English mentions of the name of the bridge to discussions in Japanese, Chinese, Greek, Vietnamese, Russian, and an image. The orange color denotes the words or word-parts on which the feature is active.
We also find more abstract features—responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.

Abstract Feature Examples
Three examples of features that activate on more abstract concepts: bugs in computer code, descriptions of gender bias in professions, and conversations about keeping secrets.
We were able to measure a kind of "distance" between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are "close" to each other. Looking near a "Golden Gate Bridge" feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.

This holds at a higher level of conceptual abstraction: looking near a feature related to the concept of "inner conflict", we find features related to relationship breakups, conflicting allegiances, logical inconsistencies, as well as the phrase "catch-22". This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude's excellent ability to make analogies and metaphors.

Nearest Neighbors to the Inner Conflict Feature
A map of the features near an "Inner Conflict" feature, including clusters related to balancing tradeoffs, romantic struggles, conflicting allegiances, and catch-22s.
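
One plausible way to realize the "distance" between features mentioned above is cosine similarity between their decoder directions. The excerpt does not specify the exact metric used, so the sketch below is an illustrative assumption with placeholder weights.

```python
import numpy as np

def nearest_features(W_dec, query_idx, k=5):
    # Rank features by cosine similarity of their decoder directions
    # (rows of W_dec) to a query feature, excluding the query itself.
    dirs = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = dirs @ dirs[query_idx]
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_idx][:k]

# Placeholder dictionary: 1,000 features over a 64-dimensional stream.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(1000, 64))
print(nearest_features(W_dec, query_idx=0))
```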
Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude's responses change.


For example, amplifying the "Golden Gate Bridge" feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked "what is your physical form?", Claude’s usual kind of answer – "I have no physical form, I am an AI model" – changed to something much odder: "I am the Golden Gate Bridge… my physical form is the iconic bridge itself…". Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.

We also found a feature that activates when Claude reads a scam email (this presumably supports the model’s ability to recognize such emails and warn you not to respond to them). Normally, if one asks Claude to generate a scam email, it will refuse to do so. But when we ask the same question with the feature artificially activated sufficiently strongly, this overcomes Claude's harmlessness training and it responds by drafting a scam email. Users of our models don’t have the ability to strip safeguards and manipulate models in this way—but in our experiments, it was a clear demonstration of how features can be used to change how a model acts.

The fact that manipulating these features causes corresponding changes to behavior validates that they aren't just correlated with the presence of concepts in input text, but also causally shape the model's behavior. In other words, the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior.

Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse - including in scenarios of catastrophic risk. It’s therefore particularly interesting that, in addition to the aforementioned scam emails feature, we found features corresponding to:

Capabilities with misuse potential (code backdoors, developing biological weapons)
Different forms of bias (gender discrimination, racist claims about crime)
Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)
We previously studied sycophancy, the tendency of models to provide responses that match user beliefs or desires rather than truthful ones. In Sonnet, we found a feature associated with sycophantic praise, which activates on inputs containing compliments like, "Your wisdom is unquestionable". Artificially activating this feature causes Sonnet to respond to an overconfident user with just such flowery deception.

Activating Features Alters Model Behavior
Two model responses to a human saying they invented the phrase "Stop and smell the roses." The default response corrects the human's misconception, while the response with a "sycophantic praise" feature set to a high value is fawning and untruthful.
The presence of this feature doesn't mean that Claude will be sycophantic, but merely that it could be. We have not added any capabilities, safe or unsafe, to the model through this work. We have, rather, identified the parts of the model involved in its existing capabilities to recognize and potentially produce different kinds of text. (While you might worry that this method could be used to make models more harmful, researchers have demonstrated much simpler ways that someone with access to model weights can remove safety safeguards.)

We hope that we and others can use these discoveries to make models safer. For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors (such as deceiving the user), to steer them towards desirable outcomes (debiasing), or to remove certain dangerous subject matter entirely. We might also be able to enhance other safety techniques, such as Constitutional AI, by understanding how they shift the model towards more harmless and more honest behavior and identifying any gaps in the process. The latent capabilities to produce harmful text that we saw by artificially activating features are exactly the sort of thing jailbreaks try to exploit. We are proud that Claude has a best-in-industry safety profile and resistance to jailbreaks, and we hope that by looking inside the model in this way we can figure out how to improve safety even further. Finally, we note that these techniques can provide a kind of "test set for safety", looking for the problems left behind after standard training and finetuning methods have ironed out all behaviors visible via standard input/output interactions.

Anthropic has made a significant investment in interpretability research since the company's founding, because we believe that understanding models deeply will help us make them safer. This new research marks an important milestone in that effort—the application of mechanistic interpretability to publicly-deployed large language models.

But the work has really just begun. The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place). Understanding the representations the model uses doesn't tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety-relevant features we have begun to find can actually be used to improve safety. There's much more to be done.

For full details, please read our paper, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet".

https://openai.com/index/openai-safety-update/
May 21, 2024

OpenAI safety update
Sharing our practices as part of the AI Seoul Summit

We are proud to build and release models that are industry-leading on both capabilities and safety.

More than a hundred million users and millions of developers rely on the work of our safety teams. We view safety as something we have to invest in and succeed at across multiple time horizons, from aligning today’s models to the far more capable systems we expect in the future. This work has always happened across OpenAI and our investment will only increase over time.

We believe in a balanced, scientific approach where safety measures are integrated into the development process from the outset. This ensures that our AI systems are both innovative and reliable, and can deliver benefits to society.

At today's AI Seoul Summit, we're joining industry leaders, government officials, and members of civil society to discuss AI safety. While there is still more work to do, we are encouraged by the additional Frontier AI Safety Commitments that OpenAI and other companies agreed upon today. The Commitments call on companies to safely develop and deploy their frontier AI models while sharing information about their risk mitigation measures, aligning with steps we have already taken. These include a pledge to publish safety frameworks like the Preparedness Framework we developed and adopted last year.

We are sharing 10 practices we actively use and improve upon.

Empirical model red-teaming and testing before release: We empirically evaluate model safety before release, internally and externally, according to our Preparedness Framework and voluntary commitments. We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”. More than 70 external experts helped to assess risks associated with GPT-4o through our external red teaming efforts, and we used these learnings to build evaluations based on weaknesses in earlier checkpoints in order to better understand later checkpoints.

Alignment and safety research: Our models have become significantly safer over time. This can be attributed to building smarter models which typically make fewer factual errors and are less likely to output harmful content even under adversarial conditions like jailbreaks. It is also due to our focused investment in practical alignment, safety systems, and post-training research. These efforts work to improve the quality of human-generated fine-tuning data, and in the future, the instructions our models are trained to follow. We are also conducting and publishing fundamental research aimed at dramatically improving our systems' robustness to attacks like jailbreaks.

Monitoring for abuse: As we have deployed increasingly capable language models via our API and ChatGPT, we have leveraged a broad spectrum of tools, including dedicated moderation models and the use of our own models for monitoring of safety risks and abuse. We have shared some critical findings along the way, including a joint disclosure (with Microsoft) of state actor abuse of our technology, so that others can better safeguard against similar risks. We also use GPT-4 for content policy development and content moderation decisions, enabling a faster feedback loop for policy refinement and less abusive material exposed to human moderators.

Systematic approach for safety: We implement a range of safety measures at every stage of the model's life cycle, from pre-training to deployment. As we advance in developing safer and more aligned model behavior, we also invest in pre-training data safety, system-level model behavior steering, data flywheel for continued safety improvement and robust monitoring infrastructure.

Protecting children: A critical focus of our safety work is protecting children. We've built strong default guardrails and safety measures into ChatGPT and DALL·E that mitigate potential harms to children. In 2023, we partnered with Thorn's Safer to detect, review and report Child Sexual Abuse Material to the National Center for Missing and Exploited Children if users attempt to upload it to our image tools. We continue to collaborate with Thorn, the Tech Coalition, All Tech is Human, Commonsense Media and the broader tech community to uphold the Safety by Design principles.

Election integrity: We're collaborating with governments and stakeholders to prevent abuse, ensure transparency on AI-generated content, and improve access to accurate voting information. To achieve this, we've introduced a tool for identifying images created by DALL·E 3, joined the steering committee of the Coalition for Content Provenance and Authenticity (C2PA), and incorporated C2PA metadata in DALL·E 3 to help people understand the source of media they find online. ChatGPT now directs users to official voting information sources in the U.S. and Europe. Additionally, we support the bipartisan "Protect Elections from Deceptive AI Act" proposed in the U.S. Senate, which would ban misleading AI-generated content in political advertising.

Investment in impact assessment and policy analysis: Our impact assessment efforts have been widely influential in research, industry norms, and policy, including our early work on measuring the chemical, biological, radiological, and nuclear (CBRN) risks associated with AI systems, and our research estimating the extent to which different occupations and industries might be impacted by language models. We also publish pioneering work on how society can best manage associated risks – for example, by working with external experts to assess the implications of language models for influence operations.

Security and access control measures: We prioritize protecting our customers, intellectual property, and data. We deploy our AI models to the world as services, controlling access via API which enables policy enforcement. Our cybersecurity efforts include restricting access to training environments and high-value algorithmic secrets on a need-to-know basis, internal and external penetration testing, a bug bounty program, and more. We believe that protecting advanced AI systems will benefit from an evolution of infrastructure security and are exploring novel controls like confidential computing for GPUs and applications of AI to cyber defense to protect our technology. To empower cyber defense, we’re funding third-party security researchers with our Cybersecurity Grant Program.

Partnering with governments: We partner with governments around the world to inform the development of effective and adaptable AI safety policies. This includes showing our work and sharing our learnings, collaborating to pilot government and other third party assurance, and informing the public debate over new standards and laws.

Safety decision making and Board oversight: As part of our Preparedness Framework, we have an operational structure for safety decision-making. Our cross-functional Safety Advisory Group reviews model capability reports and makes recommendations ahead of deployment. Company leadership makes the final decisions, with the Board of Directors exercising oversight over those decisions.

This approach has enabled us to build and deploy safe and capable models at the current level of capability.

As we move towards our next frontier model, we recognize we will need to evolve our practices, in particular to increase our security posture to ultimately be resilient to sophisticated state actor attacks and to ensure that we introduce additional time for safety testing before major launches. We and the field have a hard problem to solve in order to safely and beneficially deliver increasingly capable AI. We plan to share more on these evolving practices in the coming weeks.


https://blog.google/outreach-initiatives/education/google-learnlm-gemini-generative-ai/
https://storage.googleapis.com/deepmind-media/LearnLM/LearnLM_paper.pdf
How generative AI expands curiosity and understanding with LearnLM
May 14, 2024


LearnLM is our new family of models fine-tuned for learning, and grounded in educational research to make teaching and learning experiences more active, personal and engaging.

Ben Gomes
SVP, Learning & Education
Generative AI is fundamentally changing how we’re approaching learning and education, enabling powerful new ways to support educators and learners. It’s taking curiosity and understanding to the next level — and we’re just at the beginning of how it can help us reimagine learning.

Building a new family of models for learning
Today we’re introducing LearnLM: our new family of models fine-tuned for learning, based on Gemini.

Grounded in educational research and tailored to how people learn, LearnLM represents an effort across Google DeepMind, Google Research and our product teams to help make learning experiences more engaging, personal and useful. Our technical report presents our approach to improving generative AI for education and highlights how we’re working together with the AI and EdTech communities to responsibly maximize its positive impact and potential.

Working alongside educators and other learning experts, we’re infusing learning science principles, like the following, into our models and the products they power:

Inspire active learning: Allow for practice and healthy struggle with timely feedback
Manage cognitive load: Present relevant, well-structured information in multiple modalities
Adapt to the learner: Dynamically adjust to goals and needs, grounding in relevant materials
Stimulate curiosity: Inspire engagement to provide motivation through the learning journey
Deepen metacognition: Plan, monitor and help the learner reflect on progress
Bringing LearnLM to products you already love
With LearnLM we’re enhancing learning experiences in products you already use today — like Search, YouTube and when chatting with Gemini — so they can help you deepen understanding, rather than just giving an answer. Here are a few examples:

In Google Search, soon you’ll be able to make sense of complex topics by tapping a button to adjust your AI Overview into the format that’s most useful for you — whether you want to simplify the language, or break it down.
On Android, Circle to Search can help people get unstuck on math and physics word problems directly from their phones and tablets. Later this year, you’ll be able to solve even more complex problems involving symbolic formulas, diagrams, graphs and more.
When chatting with Gemini, soon you’ll be able to use Gems, custom versions of Gemini that can act as personal experts on any topic. Learning coach, one of the pre-made Gems, can support you in building knowledge by providing step-by-step study guidance, along with helpful practice activities like quizzes and games. Learning coach in Gemini will launch in the coming months, and with Gemini Advanced, you’ll be able to further customize this Gem to suit your unique learning preferences.
On YouTube, a conversational AI tool makes it possible to figuratively “raise your hand” while watching academic videos to ask clarifying questions, get helpful explanations or take a quiz on what you’ve been learning. This even works with longer educational videos like lectures or seminars thanks to the Gemini model’s long-context capabilities. These features are already rolling out to select Android users in the U.S.
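The long-context question answering described for YouTube above can be approximated outside the product with any long-context model. The snippet below is a minimal sketch under stated assumptions, not the actual YouTube integration: it assumes the publicly available google-generativeai Python SDK, a placeholder API key, and a locally saved transcript file.

```python
# Minimal sketch (not the actual YouTube feature): ask clarifying questions about
# a long lecture by passing its full transcript to a long-context Gemini model.
# Assumptions: the google-generativeai Python SDK is installed, YOUR_API_KEY is a
# placeholder, and "lecture_transcript.txt" is a transcript you already saved.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")  # a long-context Gemini model

# Load a transcript that may be far too long to skim by hand.
with open("lecture_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

question = "What role does chlorophyll play in the photosynthesis section?"

# The whole transcript goes into the prompt, so no chunking or retrieval is needed.
response = model.generate_content(
    [
        "You are a study assistant. Answer using only the lecture transcript below.",
        transcript,
        f"Student question: {question}",
    ]
)
print(response.text)
```

Because the entire transcript fits in the model's context window, there is no retrieval or chunking step; for hour-long lectures or seminars that is the main practical difference from shorter-context approaches.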
Applying LearnLM to build generative AI experiences for schools
We’ll also apply LearnLM to inform and enable the generative AI experiences that we build for schools. Through a new pilot program in Google Classroom, we’re working directly with educators to see how we can help simplify and improve the process of lesson planning — a critical, but time-consuming component of teaching. These features will help teachers discover new ideas and unique activities, find engaging materials, and differentiate their lessons and content to meet each of their students where they are. No technology can ever replace the magic of a teacher, but when applied in deliberate and thoughtful ways, AI can help to augment their capacity — giving them time back to invest in themselves and their students.

[Video: new tools helping teachers apply generative AI in the classroom]

Introducing two new experimental tools to advance learning
Beyond LearnLM and our existing products, we’re also building entirely new tools and experiences that expand learning:

Illuminate is a new experiment that breaks down research papers into short audio conversations. In minutes, it can generate audio with two AI-generated voices in conversation, providing an overview of key insights from these complex papers. And soon, you’ll be able to ask follow-up questions. Visit Labs.google to check out a library of available audio conversations and join the waitlist to generate your own. (A rough sketch of this kind of paper-to-dialogue pipeline appears after these two tools.)
[Video: Illuminate lets you search for academic papers by author and ask follow-up questions about them]
Learn About is a new Labs experience that explores how information can turn into understanding by bringing together high-quality content, learning science and chat experiences. Ask a question and it helps guide you through any topic at your own pace — through pictures, videos, webpages and activities — and you can upload files or notes and ask clarifying questions along the way. Sign up to be an early tester.
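Neither tool's implementation is public, but the Illuminate workflow described above (paper in, two-voice audio conversation out) can be sketched as a two-stage pipeline: a language model writes a short dialogue script, and a text-to-speech step reads each turn in a different voice. The code below is a rough sketch under those assumptions, using the google-generativeai SDK for the script and the local pyttsx3 engine for speech; the file name and the HOST/GUEST labels are placeholders, not part of the real product.

```python
# Rough sketch (not Illuminate itself): turn a paper abstract into a short
# two-voice audio conversation in two stages. Stage 1 asks a Gemini model to
# write a HOST/GUEST dialogue; stage 2 reads it aloud with two local TTS voices.
# Assumptions: google-generativeai and pyttsx3 are installed, YOUR_API_KEY and
# "paper_abstract.txt" are placeholders, and at least one system voice exists.
import google.generativeai as genai
import pyttsx3

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

with open("paper_abstract.txt", encoding="utf-8") as f:
    abstract = f.read()

# Stage 1: produce a short dialogue, one turn per line, labelled HOST: or GUEST:.
script = model.generate_content(
    "Rewrite this paper abstract as a short, friendly conversation between "
    "HOST and GUEST that explains its key insight. Prefix every line with "
    f"'HOST:' or 'GUEST:'.\n\n{abstract}"
).text

# Stage 2: read each turn aloud, alternating between two installed system voices.
engine = pyttsx3.init()
voices = engine.getProperty("voices")
voice_for = {"HOST": voices[0].id, "GUEST": voices[-1].id}  # may be the same voice

for line in script.splitlines():
    speaker, _, text = line.partition(":")
    speaker = speaker.strip().upper()
    if speaker in voice_for and text.strip():
        engine.setProperty("voice", voice_for[speaker])
        engine.say(text.strip())
engine.runAndWait()
```

A production pipeline would presumably feed the full paper rather than just the abstract and use higher-quality neural TTS voices; the sketch only illustrates the separation between script generation and audio rendering.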
With any emerging technology, there are still risks and new questions that will arise as AI advances and its uses evolve. To us, building AI responsibly means both addressing the risks and maximizing the benefits for people and society.

Reimagining learning and education with AI will require collective effort. We’ve collaborated with MIT RAISE to develop an online course to help educators better understand and use generative AI in the classroom. And as we work to extend LearnLM beyond our own products, we will partner with experts at institutions like Columbia Teachers College, Arizona State University, NYU Tisch and Khan Academy to test and improve this technology.

We want to build for you and with you, so please let us know if you’re interested in working together to help define educational benchmarks, improve academic capabilities and ultimately explore the possibilities when it comes to applying advances in generative AI to teaching and learning. These possibilities — much like our curiosity — are endless.
