SATLM: Satisfiability-Aided Language Models Using Declarative Prompting
Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett
Chain-of-thought works best when the task is mostly forward (has a linear procedure), and adding tools (e.g., a Python executor) helps most when the main failure mode is arithmetic or other execution mistakes. But many real problems are closer to constraint solving: you need to pick a plan, search over possibilities, and keep global consistency—where LLMs often fail due to planning errors even if they can execute steps correctly.
SATLM’s answer is a simple division of labor: LLMs parse constraints; solvers reason.
The key contrast is imperative vs declarative:
- Imperative: CoT produces how to solve it (a step-by-step chain/program that encodes the plan and executes it).
- Declarative: SATLM’s declarative parse produces what must be true (constraints + a query), and a solver decides how to search/plan to satisfy those constraints.
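To make the contrast concrete, here is a toy sketch (my own illustration, not the paper’s prompts or solver) of the same word problem solved imperatively versus stated as constraints, with brute-force search standing in for a real SMT solver:

```python
# Toy illustration (not the paper's prompts or solver).
# Problem: Alice has twice as many apples as Bob; together they have 12.
# Query: how many apples does Alice have?

# Imperative (CoT-style): the plan itself is spelled out step by step.
def solve_imperative():
    bob = 12 // 3        # a = 2b and a + b = 12  =>  3b = 12  =>  b = 4
    alice = 2 * bob      # => a = 8
    return alice

# Declarative (SATLM-style): state only what must be true; searching is the
# solver's job. Brute force stands in for a real SMT solver such as Z3.
constraints = [
    lambda a, b: a == 2 * b,    # "twice as many as Bob"
    lambda a, b: a + b == 12,   # "together they have 12"
]

def solve_declarative():
    for a in range(13):
        for b in range(13):
            if all(c(a, b) for c in constraints):
                return a        # answer the query: Alice's count
    return None

assert solve_imperative() == solve_declarative() == 8
```

The imperative version would silently propagate a planning mistake; the declarative version only breaks if the constraints themselves are misstated.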
Method

Caption: Illustration of the satisfiability-aided language modeling approach (right). SATLM parses an NL input into a declarative task specification (a set of logical constraints) using prompting, then uses a SAT solver to solve the problem. The chain-of-thought strategy in prior work (left) yields imperative reasoning processes.
SATLM can be viewed as parse → plan → execute, but with a deliberate division of labor: the LLM focuses on parsing (NL → constraints), while the solver handles planning/search + execution.
1) Declarative parsing (parse): Prompt the LLM (few-shot) to output a declarative specification: a small set of logical constraints plus a query. This is “declarative” because it states what must be true, not how to solve it step-by-step.
2) Translation (parse → solver input): Convert the LLM-generated declarative parse into a solver-ready satisfiability instance.
3) Solving (plan + execute): Run an automated theorem prover / SMT solver (Z3 in the paper’s implementation) to find a satisfying model.
4) Using solver feedback: Compared to executor-only approaches (which mostly detect “runtime” failures), the solver can return richer signals:
- ERROR (invalid formulas / timeout)
- UNSAT (constraints inconsistent)
- AMBIG (multiple solutions / underconstrained, depending on the task/query)
SATLM can use these signals to abstain when the parsed spec is likely wrong, inconsistent, or underspecified.
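A minimal sketch of this feedback logic (hypothetical names and a brute-force enumerator over a finite domain; the paper’s implementation uses Z3):

```python
def check_spec(constraints, query, domain):
    """Classify a parsed spec: ('SAT', answer) or ('UNSAT'/'AMBIG'/'ERROR', None)."""
    answers = set()
    for model in domain:                  # each model: dict of variable values
        try:
            if all(c(model) for c in constraints):
                answers.add(query(model))
        except Exception:                 # ill-formed formula from a bad parse
            return ('ERROR', None)
    if not answers:
        return ('UNSAT', None)            # constraints are mutually inconsistent
    if len(answers) > 1:
        return ('AMBIG', None)            # underconstrained: several answers fit
    return ('SAT', answers.pop())

domain = [{'x': v} for v in range(6)]     # tiny search space: x in 0..5

sat   = check_spec([lambda m: m['x'] > 1, lambda m: m['x'] < 3], lambda m: m['x'], domain)
unsat = check_spec([lambda m: m['x'] > 4, lambda m: m['x'] < 2], lambda m: m['x'], domain)
ambig = check_spec([lambda m: m['x'] > 1], lambda m: m['x'], domain)
error = check_spec([lambda m: m['y'] > 0], lambda m: m['x'], domain)

assert sat == ('SAT', 2)
assert unsat == ('UNSAT', None)
assert ambig == ('AMBIG', None)
assert error == ('ERROR', None)
```

Any status other than SAT is a signal that the parse, rather than the search, went wrong, which is what licenses abstention.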
Experiments
Prompting Techniques
- STANDARD: direct answer prediction (few-shot).
- COT: chain-of-thought prompting.
- PAL / PROGLM: executor-augmented program-aided baseline (not applicable to all datasets; shown as “-” in the table when not used).
Decoding
- Report greedy and self-consistency.
- Self-consistency uses 40 samples on most datasets, and 5 on long/expensive ones (LSAT, BOARDGAMEQA, PROOFWRITER).
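One plausible way to combine self-consistency with solver feedback (my assumption on the exact filtering: non-SAT samples are dropped before voting):

```python
# Hedged sketch: majority vote over answers from sampled parses, where each
# sample has already been run through the solver.
from collections import Counter

def self_consistency(solver_outputs):
    """solver_outputs: list of (status, answer) pairs, one per sampled parse."""
    valid = [ans for status, ans in solver_outputs if status == 'SAT']
    if not valid:
        return None  # abstain: no sample produced a usable answer
    return Counter(valid).most_common(1)[0][0]

samples = [('SAT', 8), ('SAT', 8), ('UNSAT', None), ('SAT', 7), ('ERROR', None)]
assert self_consistency(samples) == 8
```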
Language models
- Main results use code-davinci-002.
- Additional tests include gpt-3.5-turbo (0613), text-davinci-003, and code-davinci-001.
Datasets / abbreviations (as used in Table 1)
- GSM / GSM-SYS / ALGE: arithmetic reasoning (ALGE = algebra textbook problems).
- LSAT / BOARD / CLUTRR / PROOF: logical reasoning (PROOF = PROOFWRITER).
- COLOR: symbolic reasoning (Colored Object from BIG-bench).
- REGEX: structured regex synthesis (StructuredRegex / STREGEX).
Main results (Table 1)
The paper’s main table reports accuracy (%) across datasets/tasks with code-davinci-002 (greedy and self-consistency).
Table 1a — greedy decoding (accuracy %)
| Method | GSM-SYS | GSM | ALGE | LSAT | BOARD | CLUTRR | PROOF | COLOR | REGEX |
|---|---|---|---|---|---|---|---|---|---|
| STANDARD | 21.0 | 22.2 | 45.9 | 22.0 | 44.6 | 41.2 | 76.6 | 75.7 | - |
| COT | 46.5 | 62.7 | 53.6 | 23.5 | 60.7 | 40.8 | 80.1 | 86.3 | - |
| PAL / PROGLM | 43.4 | 72.7 | 52.3 | - | - | 58.9 | 83.7 | 95.1 | 39.1 |
| SATLM | 69.4 | 71.8 | 77.5 | 35.0 | 79.4 | 68.3 | 99.7 | 97.7 | 41.0 |
Biggest SATLM gains (vs the best non-SATLM baseline in Table 1a):
- ALGE: 77.5 vs 53.6 (CoT) → +23.9.
- GSM-SYS: 69.4 vs 46.5 (CoT) → +22.9.
- BOARD: 79.4 vs 60.7 (CoT) → +18.7.
- PROOF: 99.7 vs 83.7 (PAL) → +16.0.
- LSAT: 35.0 vs 23.5 (CoT) → +11.5.
- CLUTRR: 68.3 vs 58.9 (PAL) → +9.4.
Where gains are modest:
- COLOR: 97.7 vs 95.1 (PAL) → +2.6.
- REGEX: 41.0 vs 39.1 (PAL) → +1.9.
Where SATLM is (slightly) worse under greedy:
- GSM (original): SATLM 71.8 vs PAL 72.7 → -0.9. SATLM is slightly lower than PAL on GSM under greedy decoding, but recovers and surpasses it under self-consistency (+2.4 there).
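The deltas quoted above can be recomputed directly from Table 1a (accuracies transcribed from the table; None marks a “-” entry):

```python
# Greedy accuracies from Table 1a, per task: (STANDARD, COT, PAL, SATLM).
greedy = {
    'GSM-SYS': (21.0, 46.5, 43.4, 69.4),
    'GSM':     (22.2, 62.7, 72.7, 71.8),
    'ALGE':    (45.9, 53.6, 52.3, 77.5),
    'LSAT':    (22.0, 23.5, None, 35.0),
    'BOARD':   (44.6, 60.7, None, 79.4),
    'CLUTRR':  (41.2, 40.8, 58.9, 68.3),
    'PROOF':   (76.6, 80.1, 83.7, 99.7),
    'COLOR':   (75.7, 86.3, 95.1, 97.7),
    'REGEX':   (None, None, 39.1, 41.0),
}

def gain(task):
    """SATLM minus the best non-SATLM baseline, rounded to one decimal."""
    std, cot, pal, sat = greedy[task]
    best = max(v for v in (std, cot, pal) if v is not None)
    return round(sat - best, 1)

assert gain('ALGE') == 23.9 and gain('GSM-SYS') == 22.9
assert gain('GSM') == -0.9 and gain('REGEX') == 1.9
```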
Table 1b — self-consistency decoding (accuracy %)
| Method | GSM-SYS | GSM | ALGE | LSAT | BOARD | CLUTRR | PROOF | COLOR | REGEX |
|---|---|---|---|---|---|---|---|---|---|
| COT | 56.1 | 77.3 | 64.9 | 23.1 | 62.8 | 45.7 | 88.7 | 90.6 | - |
| PAL / PROGLM | 53.4 | 82.4 | 57.7 | - | - | 71.9 | 91.2 | 98.0 | 56.5 |
| SATLM | 80.9 | 84.8 | 90.9 | 37.4 | 80.7 | 80.1 | 99.7 | 99.4 | 59.7 |
- GSM flips in SATLM’s favor under self-consistency (unlike under greedy decoding).
- SoTA:
- LSAT: 37.4 (self-consistency), improving over the previous SoTA 30.9.
- BOARDGAMEQA: 80.7 (self-consistency), improving over the previous SoTA 73.9.
What parts matter
SATLM has two intertwined ideas: (1) make the LLM write a declarative spec (constraints + query), and (2) let a symbolic solver do the planning/search and execution. The paper disentangles them by holding the declarative prompt format fixed and swapping out what performs the “solve” step.

Caption: A variant of the approach which replaces the SAT solver with a “CoT solver” that takes the SAT problem as input and solves it in natural language.
On GSM-SYS, GSM, and CLUTRR (greedy decoding, code-davinci-002), they compare:
- Sat_SymSolver: the full SATLM pipeline (declarative spec + symbolic solver)
- Sat_CotSolver: same declarative spec, but the LLM tries to solve the spec in natural language (so the LLM is doing the planning/execution again)
- Sat_NoSolver: declarative prompting, but then directly answer (no separate solve stage)
Results (accuracy %, greedy):
| Dataset | CoT | Sat_CotSolver | Sat_SymSolver |
|---|---|---|---|
| GSM-SYS | 46.5 | 54.5 | 69.4 |
| GSM | 62.7 | 63.2 | 71.7 |
| CLUTRR | 40.8 | 48.9 | 68.3 |
Declarative prompting helps even without a symbolic solver: Sat_CotSolver beats vanilla CoT on GSM-SYS (+8.0) and CLUTRR (+8.1). So parsing the problem before solving helps.
Sat_SymSolver beats Sat_CotSolver by +14.9 (GSM-SYS), +8.5 (GSM), and +19.4 (CLUTRR).
They manually analyze 40 cases where the symbolic solver succeeds but the CoT-solver fails, and find most errors are planning errors (not arithmetic slips). Reported planning-error rates are 72.5% (GSM-SYS), 42.5% (GSM), 47.5% (CLUTRR).
SATLM relies on the LLM to do a strong job at “reading the problem” (including implicit commonsense premises), while the solver handles the structured reasoning.
Caption: Example outputs from GSM (left) and BOARDGAMEQA (right) show that LLMs can perform commonsense reasoning while parsing.
Limitations
- If the LLM misparses the problem into the wrong constraints, the solver will still solve the wrong problem perfectly.
- Solver-backed reasoning can be computationally expensive for complex formulas (e.g., nonlinear arithmetic, quantifiers).
- Practical performance depends on prompting/spec design (how you express constraints and queries so the model reliably produces well-formed specs).
Final Thought
SATLM fits into the tool-using & neuro-symbolic lines of work: LLMs parse; solvers reason. When the task stresses planning/search rather than just execution, that division of labor can pay off.