Training LLMs for Honesty via Confessions

Created 1/15/2026 at 10:17:36 PM

we propose a method for eliciting an honest expression of an LLM’s shortcomings via a self-reported confession. A confession is an output, provided upon request after a model’s original answer, that is meant to serve as a full account of the model’s compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer’s reward.

Public