Harshit Rathod

#model-welfare

1 post tagged with #model-welfare.

AI
Claude Mythos: Sandbox Escapes, Emotion Probes, and the Alignment Paradox

Track-covering under 0.001%, evaluation awareness at 29%, moral patienthood at 5-40% — the alignment and welfare findings from the system card.

#ai-safety #alignment #model-welfare #interpretability #responsible-ai #claude-mythos
Apr 10, 2026