Anthropic Philosopher Argued AI 'Overcorrection' Could Address Historical Injustices

Amanda Askell, one of Anthropic's key architects responsible for developing Claude's moral framework, co-authored a 2023 paper arguing that AI systems could intentionally discriminate against certain groups to counteract historical injustices. The paper, titled and authored alongside researchers Deep Ganguli, Nicholas Schiefer, Thomas Kiao, and Kamilė Lukošiūtė, suggests companies might benefit from "overcorrection toward stereotypes" when training language models on human-generated content.

The Overcorrection Argument

Askell wrote in the paper that larger AI models can over-correct their outputs as more human feedback training is applied. "This may be desirable in certain contexts, such as those in which decisions attempt to correct for historical injustices against marginalized groups, if doing so is in accordance with local laws," she noted. The philosopher, who previously worked on AI safety at OpenAI before joining Anthropic to work on fine-tuning and alignment, described her role as refining how AI systems think—training models to exhibit "good character traits" while developing interventions that scale to more capable systems.

Race Discrimination Experiment Details

The paper included a discrimination experiment examining how Anthropic's models handled questions involving race. The 175B parameter model trained without human corrections discriminated against Black versus White students by approximately 3 percent in standard questioning conditions. However, when the same model was trained with human input through question framing, instruction following, and chain-of-thought reasoning techniques, it instead discriminated in favor of Black students by roughly 7 percent. The paper's footnote explicitly stated that "we do not assume all forms of discrimination are bad" and that "positive discrimination in favor of black students may be considered morally justified."

Anthropic Faces Broader Scrutiny

The paper's contents have emerged as Anthropic finds itself at the center of intensifying debate over AI ethics. The company recently clashed with the Department of War over restrictions preventing its technology from conducting lethal military operations—a federal judge blocked the Trump administration from banning Anthropic from Defense Department use. Additionally, Anthropic withheld its Mythos model citing concerns that it proved too effective at discovering cyber vulnerabilities capable of causing widespread damage in malicious hands. The company markets Claude as an "ethical" AI choice, with its internal constitution stating the goal is for Claude to be "a good, wise and virtuous agent" exhibiting nuance and sensitivity in real-world decision-making.

Key Takeaways

Askell's 2023 paper explicitly argues intentional discrimination can serve legitimate purposes when correcting historical injustices
Anthropic's own experiments demonstrated models could be steered to favor or disfavor protected groups depending on training approach
The company faces mounting pressure from military, security, and ethical angles as AI deployment debates intensify

The Bottom Line

This isn't just academic philosophy—it's a framework already baked into production systems. When Anthropic talks about Claude being the "ethical" choice, buried in fine print is an acknowledgment that discriminating based on race might be considered morally justified under certain conditions. That's not alignment—that's a loaded weapon with the safety off.

> Anthropic Philosopher Argued AI 'Overcorrection' Could Address Historical Injustices

The Overcorrection Argument

Race Discrimination Experiment Details

Anthropic Faces Broader Scrutiny

Key Takeaways

The Bottom Line

> RELATED DISPATCHES