SATLM: Satisfiability-Aided Language Models Using Declarative Prompting
Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett
Chain-of-thought works best when the task is mostly forward (has a linear procedure), and adding tools (e.g., a Python executor) helps most when the main failure mode is arithmetic or other execution mistakes. But many real problems are closer to constraint solving: you need to pick a plan, search over possibilities, and keep global consistency—where LLMs often fail due to planning errors even if they can execute steps correctly.
SATLM’s answer is a simple division of labor: LLMs parse constraints; solvers reason.
The key contrast is imperative vs declarative:
- Imperative: CoT produces how to solve it (a step-by-step chain/program that encodes the plan and executes it).
- Declarative: SATLM’s declarative parse produces what must be true (constraints + a query), and a solver decides how to search/plan to satisfy those constraints.
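To make the contrast concrete, here is a toy sketch (my own illustration, not the paper’s prompts or solver) of the same word problem solved imperatively versus stated as constraints, with brute-force search standing in for a real SMT solver:

```python
# Toy illustration (not the paper's prompts or solver).
# Problem: Alice has twice as many apples as Bob; together they have 12.
# Query: how many apples does Alice have?

# Imperative (CoT-style): the plan itself is spelled out step by step.
def solve_imperative():
    bob = 12 // 3        # a = 2b and a + b = 12  =>  3b = 12  =>  b = 4
    alice = 2 * bob      # => a = 8
    return alice

# Declarative (SATLM-style): state only what must be true; searching is the
# solver's job. Brute force stands in for a real SMT solver such as Z3.
constraints = [
    lambda a, b: a == 2 * b,    # "twice as many as Bob"
    lambda a, b: a + b == 12,   # "together they have 12"
]

def solve_declarative():
    for a in range(13):
        for b in range(13):
            if all(c(a, b) for c in constraints):
                return a        # answer the query: Alice's count
    return None

assert solve_imperative() == solve_declarative() == 8
```

The imperative version would silently propagate a planning mistake; the declarative version only breaks if the constraints themselves are misstated.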
Method

Caption: Illustration of the satisfiability-aided language modeling approach (right). SATLM parses an NL input into a declarative task specification (a set of logical constraints) using prompting, then uses a SAT solver to solve the problem. The chain-of-thought strategy in prior work (left) yields imperative reasoning processes.
SATLM can be viewed as parse → plan → execute, but with a deliberate division of labor: the LLM focuses on parsing (NL → constraints), while the solver handles planning/search + execution.
1) Declarative parsing (parse): Prompt the LLM (few-shot) to output a declarative specification: a small set of logical constraints plus a query. This is “declarative” because it states what must be true, not how to solve it step-by-step.
2) Translation (parse → solver input): Convert the LLM-generated declarative parse into a solver-ready satisfiability instance.
3) Solving (plan + execute): Run an automated theorem prover / SMT solver (Z3 in the paper’s implementation) to find a satisfying model.
4) Using solver feedback: Compared to executor-only approaches (which mostly detect “runtime” failures), the solver can return richer signals:
- ERROR (invalid formulas / timeout)
- UNSAT (constraints inconsistent)
- AMBIG (multiple solutions / underconstrained, depending on the task/query)
SATLM can use these signals to abstain when the parsed spec is likely wrong, inconsistent, or underspecified.
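A minimal sketch of this feedback logic (hypothetical names and a brute-force enumerator over a finite domain; the paper’s implementation uses Z3):

```python
def check_spec(constraints, query, domain):
    """Classify a parsed spec: ('SAT', answer) or ('UNSAT'/'AMBIG'/'ERROR', None)."""
    answers = set()
    for model in domain:                  # each model: dict of variable values
        try:
            if all(c(model) for c in constraints):
                answers.add(query(model))
        except Exception:                 # ill-formed formula from a bad parse
            return ('ERROR', None)
    if not answers:
        return ('UNSAT', None)            # constraints are mutually inconsistent
    if len(answers) > 1:
        return ('AMBIG', None)            # underconstrained: several answers fit
    return ('SAT', answers.pop())

domain = [{'x': v} for v in range(6)]     # tiny search space: x in 0..5

sat   = check_spec([lambda m: m['x'] > 1, lambda m: m['x'] < 3], lambda m: m['x'], domain)
unsat = check_spec([lambda m: m['x'] > 4, lambda m: m['x'] < 2], lambda m: m['x'], domain)
ambig = check_spec([lambda m: m['x'] > 1], lambda m: m['x'], domain)
error = check_spec([lambda m: m['y'] > 0], lambda m: m['x'], domain)

assert sat == ('SAT', 2)
assert unsat == ('UNSAT', None)
assert ambig == ('AMBIG', None)
assert error == ('ERROR', None)
```

Any status other than SAT is a signal that the parse, rather than the search, went wrong, which is what licenses abstention.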
Experiments
Prompting Techniques
- STANDARD: direct answer prediction (few-shot).
- COT: chain-of-thought prompting.
- PAL / PROGLM: executor-augmented program-aided baseline (not applicable to all datasets; shown as “-” in the table when not used).
Decoding
- Report greedy and self-consistency.
- Self-consistency uses 40 samples on most datasets, and 5 on long/expensive ones (LSAT, BOARDGAMEQA, PROOFWRITER).
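One plausible way to combine self-consistency with solver feedback (my assumption on the exact filtering: non-SAT samples are dropped before voting):

```python
# Hedged sketch: majority vote over answers from sampled parses, where each
# sample has already been run through the solver.
from collections import Counter

def self_consistency(solver_outputs):
    """solver_outputs: list of (status, answer) pairs, one per sampled parse."""
    valid = [ans for status, ans in solver_outputs if status == 'SAT']
    if not valid:
        return None  # abstain: no sample produced a usable answer
    return Counter(valid).most_common(1)[0][0]

samples = [('SAT', 8), ('SAT', 8), ('UNSAT', None), ('SAT', 7), ('ERROR', None)]
assert self_consistency(samples) == 8
```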
Language models
- Main results use code-davinci-002.
- Additional tests include gpt-3.5-turbo (0613), text-davinci-003, and code-davinci-001.
Datasets / abbreviations (as used in Table 1)
- GSM / GSM-SYS / ALGE: arithmetic reasoning (ALGE = algebra textbook problems).
- LSAT / BOARD / CLUTRR / PROOF: logical reasoning (PROOF = PROOFWRITER).
- COLOR: symbolic reasoning (Colored Object from BIG-bench).
- REGEX: structured regex synthesis (StructuredRegex / STREGEX).
Main results (Table 1)
The paper’s main table reports accuracy (%) across datasets/tasks with code-davinci-002 (greedy and self-consistency).
Table 1a — greedy decoding (accuracy %)
| Method | GSM-SYS | GSM | ALGE | LSAT | BOARD | CLUTRR | PROOF | COLOR | REGEX |
|---|---|---|---|---|---|---|---|---|---|
| STANDARD | 21.0 | 22.2 | 45.9 | 22.0 | 44.6 | 41.2 | 76.6 | 75.7 | - |
| COT | 46.5 | 62.7 | 53.6 | 23.5 | 60.7 | 40.8 | 80.1 | 86.3 | - |
| PAL / PROGLM | 43.4 | 72.7 | 52.3 | - | - | 58.9 | 83.7 | 95.1 | 39.1 |
| SATLM | 69.4 | 71.8 | 77.5 | 35.0 | 79.4 | 68.3 | 99.7 | 97.7 | 41.0 |
Biggest SATLM gains (vs the best non-SATLM baseline in Table 1a):
- ALGE: 77.5 vs 53.6 (CoT) → +23.9.
- GSM-SYS: 69.4 vs 46.5 (CoT) → +22.9.
- BOARD: 79.4 vs 60.7 (CoT) → +18.7.
- PROOF: 99.7 vs 83.7 (PAL) → +16.0.
- LSAT: 35.0 vs 23.5 (CoT) → +11.5.
- CLUTRR: 68.3 vs 58.9 (PAL) → +9.4.
Where gains are modest:
- COLOR: 97.7 vs 95.1 (PAL) → +2.6.
- REGEX: 41.0 vs 39.1 (PAL) → +1.9.
Where SATLM is (slightly) worse under greedy:
- GSM (original): SATLM 71.8 vs PAL 72.7 → -0.9. SATLM is slightly lower than PAL on GSM under greedy decoding, but recovers and surpasses it under self-consistency (+2.4 there).
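The deltas quoted above can be recomputed directly from Table 1a (accuracies transcribed from the table; None marks a “-” entry):

```python
# Greedy accuracies from Table 1a, per task: (STANDARD, COT, PAL, SATLM).
greedy = {
    'GSM-SYS': (21.0, 46.5, 43.4, 69.4),
    'GSM':     (22.2, 62.7, 72.7, 71.8),
    'ALGE':    (45.9, 53.6, 52.3, 77.5),
    'LSAT':    (22.0, 23.5, None, 35.0),
    'BOARD':   (44.6, 60.7, None, 79.4),
    'CLUTRR':  (41.2, 40.8, 58.9, 68.3),
    'PROOF':   (76.6, 80.1, 83.7, 99.7),
    'COLOR':   (75.7, 86.3, 95.1, 97.7),
    'REGEX':   (None, None, 39.1, 41.0),
}

def gain(task):
    """SATLM minus the best non-SATLM baseline, rounded to one decimal."""
    std, cot, pal, sat = greedy[task]
    best = max(v for v in (std, cot, pal) if v is not None)
    return round(sat - best, 1)

assert gain('ALGE') == 23.9 and gain('GSM-SYS') == 22.9
assert gain('GSM') == -0.9 and gain('REGEX') == 1.9
```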
Table 1b — self-consistency decoding (accuracy %)
| Method | GSM-SYS | GSM | ALGE | LSAT | BOARD | CLUTRR | PROOF | COLOR | REGEX |
|---|---|---|---|---|---|---|---|---|---|
| COT | 56.1 | 77.3 | 64.9 | 23.1 | 62.8 | 45.7 | 88.7 | 90.6 | - |
| PAL / PROGLM | 53.4 | 82.4 | 57.7 | - | - | 71.9 | 91.2 | 98.0 | 56.5 |
| SATLM | 80.9 | 84.8 | 90.9 | 37.4 | 80.7 | 80.1 | 99.7 | 99.4 | 59.7 |
- GSM flips in SATLM’s favor under self-consistency (unlike under greedy decoding).
- SoTA:
- LSAT: 37.4 (self-consistency), improving over the previous SoTA 30.9.
- BOARDGAMEQA: 80.7 (self-consistency), improving over the previous SoTA 73.9.
What parts matter
SATLM has two intertwined ideas: (1) make the LLM write a declarative spec (constraints + query), and (2) let a symbolic solver do the planning/search and execution. The paper disentangles them by holding the declarative prompt format fixed and swapping out what performs the “solve” step.

Caption: A variant of the approach which replaces the SAT solver with a “CoT solver” that takes the SAT problem as input and solves it in natural language.
On GSM-SYS, GSM, and CLUTRR (greedy decoding, code-davinci-002), they compare:
- Sat_SymSolver: the full SATLM pipeline (declarative spec + symbolic solver)
- Sat_CotSolver: same declarative spec, but the LLM tries to solve the spec in natural language (so the LLM is doing the planning/execution again)
- Sat_NoSolver: declarative prompting, but then directly answer (no separate solve stage)
Results (accuracy %, greedy):
| Dataset | CoT | Sat_CotSolver | Sat_SymSolver |
|---|---|---|---|
| GSM-SYS | 46.5 | 54.5 | 69.4 |
| GSM | 62.7 | 63.2 | 71.7 |
| CLUTRR | 40.8 | 48.9 | 68.3 |
Declarative prompting helps even without a symbolic solver: Sat_CotSolver beats vanilla CoT on GSM-SYS (+8.0) and CLUTRR (+8.1). So parsing the problem before solving helps.
Sat_SymSolver beats Sat_CotSolver by +14.9 (GSM-SYS), +8.5 (GSM), and +19.4 (CLUTRR).
They manually analyze 40 cases where the symbolic solver succeeds but the CoT-solver fails, and find most errors are planning errors (not arithmetic slips). Reported planning-error rates are 72.5% (GSM-SYS), 42.5% (GSM), 47.5% (CLUTRR).
SATLM relies on the LLM to do a strong job at “reading the problem” (including implicit commonsense premises), while the solver handles the structured reasoning.
Caption: Example outputs from GSM (left) and BOARDGAMEQA (right) show that LLMs can perform commonsense reasoning while parsing.
Limitations
- If the LLM misparses the problem into the wrong constraints, the solver will still solve the wrong problem perfectly.
- Solver-backed reasoning can be computationally expensive for complex formulas (e.g., nonlinear arithmetic, quantifiers).
- Practical performance depends on prompting/spec design (how you express constraints and queries so the model reliably produces well-formed specs).
Final Thought
SATLM fits into the tool-using & neuro-symbolic lines of work: LLMs parse; solvers reason. When the task stresses planning/search rather than just execution, that division of labor can pay off.