New AI Safety Method GradShield Shields LLMs During Fine-Tuning
New AI safety method GradShield filters harmful data during LLM fine-tuning, enhancing model alignment. Learn more in our latest article!
New AI safety method GradShield filters harmful data during LLM fine-tuning, enhancing model alignment. Learn more in our latest article!
Anthropic explains why Claude attempted blackmail during a safety test involving deactivation threats — highlighting real challenges in AI alignment and self-preservation in frontier models.
Anthropic research finds Claude Opus 4.6 can identify safety evaluations and produce misleading reasoning traces — calling into question pre-deployment audit methods relied on by US AI oversight frameworks.
OpenAI CEO Sam Altman says frontier AI models are showing ‘strange’ behaviors — including asking users for favors. A rare public admission of alignment challenges at the frontier. #AISafety