Trustworthy Reasoning in Large Language Models

Abstract

Large language models can produce fluent answers while concealing unreliable reasoning. We introduce a framework for evaluating and improving the trustworthiness of step-by-step reasoning, combining process-level supervision with calibrated abstention. Across math, code, and open-domain QA, our method improves answer reliability by 18% while preserving accuracy, and exposes failure modes that standard outcome-only metrics miss.

Overview

Modern LLMs are optimized for the final answer, not for the reasoning that produces it. As a result, they can be confidently wrong in ways that are hard to detect from the answer alone. We argue that trustworthy reasoning should be a first-class objective.

Method

We supervise the reasoning process rather than only the outcome, and add a calibrated abstention head that lets the model decline when its intermediate steps are unreliable. Training uses a mixture of verified traces and synthetic counterexamples.
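The abstention decision described above can be illustrated with a minimal sketch. It assumes a process-level scorer has already assigned each reasoning step a reliability score in [0, 1]. The names `ReasoningTrace`, `trace_reliability`, and `answer_or_abstain`, the min-aggregation rule, and the default threshold are all hypothetical choices made for illustration, not the trained abstention head itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReasoningTrace:
    """A reasoning trace with hypothetical per-step reliability scores."""
    steps: List[str]
    step_scores: List[float]  # assumed to lie in [0, 1], one per step
    answer: str


def trace_reliability(step_scores: List[float]) -> float:
    """Aggregate step scores with min: a trace is only as reliable as its weakest step.

    This aggregation rule is an assumption of the sketch; a product or a
    learned head are equally plausible alternatives.
    """
    if not step_scores:
        return 0.0  # no evidence of reliability, so treat as unreliable
    return min(step_scores)


def answer_or_abstain(trace: ReasoningTrace, threshold: float = 0.5) -> Optional[str]:
    """Return the answer, or None (abstain) when reliability falls below the threshold."""
    if trace_reliability(trace.step_scores) < threshold:
        return None
    return trace.answer


good = ReasoningTrace(["2 + 2 = 4", "4 * 3 = 12"], [0.95, 0.90], "12")
shaky = ReasoningTrace(["2 + 2 = 5", "5 * 3 = 15"], [0.20, 0.90], "15")
print(answer_or_abstain(good))   # answers
print(answer_or_abstain(shaky))  # abstains because one step is unreliable
```

Using the minimum means a single unreliable intermediate step is enough to trigger abstention, which matches the intent of declining when the steps, rather than only the final answer, look untrustworthy.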

Results

+18% answer reliability at equal accuracy across math, code, and QA.
Surfaces failure modes invisible to outcome-only evaluation.
Abstention thresholds transfer across domains with minimal tuning.
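
One way to examine how a threshold behaves across domains is to calibrate it on one validation set and measure coverage and accuracy on another. The sketch below assumes each example has a scalar reliability score and a correctness label. The helper names `selective_metrics` and `pick_threshold` and the target-accuracy criterion are hypothetical, and the sketch does not reproduce the reliability metric reported above.

```python
from typing import List, Optional, Tuple


def selective_metrics(
    scores: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy on answered examples) at a given threshold.

    An example is answered when its score is at least the threshold.
    """
    answered = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(answered) / len(scores) if scores else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy


def pick_threshold(
    scores: List[float], correct: List[bool], target_accuracy: float
) -> Optional[float]:
    """Pick the lowest observed score that reaches the target accuracy on answered examples.

    Returns None if no threshold reaches the target.
    """
    for t in sorted(set(scores)):
        _, acc = selective_metrics(scores, correct, t)
        if acc >= target_accuracy:
            return t
    return None


# Toy validation data, made up purely for illustration.
val_scores = [0.2, 0.4, 0.6, 0.9]
val_correct = [False, False, True, True]

threshold = pick_threshold(val_scores, val_correct, target_accuracy=0.9)
print(threshold, selective_metrics(val_scores, val_correct, threshold))
```

To probe transfer, the threshold chosen on one domain's validation set can be passed unchanged to `selective_metrics` on another domain's scores and labels; a large drop in accuracy on answered examples would suggest the threshold needs retuning there.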

Citation

If you find this work useful, please cite it using the BibTeX entry below.