Tamanna Hossain-Kay

SATLM: Satisfiability-Aided Language Models Using Declarative Prompting

2026-01-27


Paper: SATLM: Satisfiability-Aided Language Models Using Declarative Prompting
Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett

Chain-of-thought works best when the task is mostly forward (has a linear procedure), and adding tools (e.g., a Python executor) helps most when the main failure mode is arithmetic or other execution mistakes. But many real problems are closer to constraint solving: you need to pick a plan, search over possibilities, and keep global consistency—where LLMs often fail due to planning errors even if they can execute steps correctly.

SATLM’s answer is a simple division of labor: LLMs parse constraints; solvers reason.

The key contrast is imperative vs declarative: CoT and PAL make the LLM produce an imperative, step-by-step solution procedure, while SATLM has it produce a declarative specification (a set of constraints plus a query) and delegates the search for a satisfying answer to a solver.
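To make the contrast concrete, here is a toy word problem ("Alice is twice as old as Bob; together they are 36") done both ways. This is my own illustrative sketch, not code from the paper; the brute-force search is just a stand-in for the SAT/SMT solver SATLM actually uses.

```python
from itertools import product

def solve(constraints, domains):
    """Brute-force stand-in for the SAT solver: search the domains for
    any assignment that satisfies every constraint."""
    for assignment in product(*domains):
        if all(c(*assignment) for c in constraints):
            return assignment
    return None

# Imperative (CoT/PAL-style): the LLM must plan the solution steps itself.
def pal_style():
    bob = 36 / 3       # substitute a = 2b into a + b = 36
    alice = 2 * bob
    return alice, bob

# Declarative (SATLM-style): the LLM only states the constraints;
# the solver does the planning/search.
constraints = [
    lambda a, b: a == 2 * b,   # "Alice is twice as old as Bob"
    lambda a, b: a + b == 36,  # "Together they are 36"
]
answer = solve(constraints, [range(100), range(100)])  # → (24, 12)
```

The imperative version bakes the plan (substitution, then arithmetic) into the program; the declarative version leaves planning to the solver, which is exactly the step where LLMs tend to err.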

Method

Caption: Illustration of the satisfiability-aided language modeling approach (right). SATLM parses an NL input into a declarative task specification (a set of logical constraints) using prompting, then uses a SAT solver to solve the problem. The chain-of-thought strategy in prior work (left) yields imperative reasoning processes.

SATLM can be viewed as parse → plan → execute, but with a deliberate division of labor: the LLM focuses on parsing (NL → constraints), while the solver handles planning/search + execution.
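As a sketch of that division of labor, consider a toy LSAT-style seating puzzle (my own illustrative example, not from the paper): the LLM's parsing role is to turn the natural-language rules into a constraint set, while a permutation search plays the solver's plan/search + execute role.

```python
from itertools import permutations

# Toy puzzle: A, B, C sit in seats 1-3. A is not in seat 1; B sits
# somewhere left of C. The constraints below are what the "parse" step
# (the LLM in SATLM) would produce; the search below is the solver's job.
people = ["A", "B", "C"]

def satisfies(seating):
    seat = {p: i + 1 for i, p in enumerate(seating)}
    return seat["A"] != 1 and seat["B"] < seat["C"]

solutions = [s for s in permutations(people) if satisfies(s)]
# → [('B', 'A', 'C'), ('B', 'C', 'A')]
```

Note that the LLM never has to enumerate or prune seatings; global consistency falls out of the solver's search.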

Experiments

Prompting Techniques

Decoding

Language models

Datasets / abbreviations (as used in Table 1)

Main results (Table 1)

The paper’s main table reports accuracy (%) across datasets/tasks with code-davinci-002 (greedy and self-consistency).

Table 1a — greedy decoding (accuracy %)

Method        GSM-SYS  GSM   ALGE  LSAT  BOARD  CLUTRR  PROOF  COLOR  REGEX
STANDARD      21.0     22.2  45.9  22.0  44.6   41.2    76.6   75.7   -
COT           46.5     62.7  53.6  23.5  60.7   40.8    80.1   86.3   -
PAL / PROGLM  43.4     72.7  52.3  -     -      58.9    83.7   95.1   39.1
SATLM         69.4     71.8  77.5  35.0  79.4   68.3    99.7   97.7   41.0

Biggest SATLM gains (vs the best non-SATLM baseline in Table 1a):

  • ALGE: 77.5 vs 53.6 (CoT) → +23.9.
  • GSM-SYS: 69.4 vs 46.5 (CoT) → +22.9.
  • BOARD: 79.4 vs 60.7 (CoT) → +18.7.
  • PROOF: 99.7 vs 83.7 (PAL) → +16.0.
  • LSAT: 35.0 vs 23.5 (CoT) → +11.5.
  • CLUTRR: 68.3 vs 58.9 (PAL) → +9.4.

Where gains are modest:

  • COLOR: 97.7 vs 95.1 (PAL) → +2.6.
  • REGEX: 41.0 vs 39.1 (PAL) → +1.9.

Where SATLM is (slightly) worse under greedy:

  • GSM (original): SATLM 71.8 vs PAL 72.7 → −0.9. SATLM trails PAL slightly on GSM under greedy decoding, but recovers and surpasses it under self-consistency (+2.4 there).

Table 1b — self-consistency decoding (accuracy %)

Method        GSM-SYS  GSM   ALGE  LSAT  BOARD  CLUTRR  PROOF  COLOR  REGEX
COT           56.1     77.3  64.9  23.1  62.8   45.7    88.7   90.6   -
PAL / PROGLM  53.4     82.4  57.7  -     -      71.9    91.2   98.0   56.5
SATLM         80.9     84.8  90.9  37.4  80.7   80.1    99.7   99.4   59.7

  • GSM flips in SATLM’s favor under self-consistency (unlike under greedy decoding)
  • SoTA:
    • LSAT: 37.4 (self-consistency), improving over the previous SoTA 30.9.
    • BOARDGAMEQA: 80.7 (self-consistency), improving over the previous SoTA 73.9.
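Self-consistency here means sampling several declarative parses, executing each with the solver, and majority-voting over the resulting answers; samples that fail to parse or are unsatisfiable simply cast no vote. A minimal sketch (the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over answers from independently sampled parses.
    None marks a sample whose spec failed to execute or was
    unsatisfiable; such samples cast no vote."""
    votes = Counter(a for a in sampled_answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None

final = self_consistency([18, None, 18, 17])  # → 18
```

One nice property versus CoT self-consistency: solver failures are detectable (the spec doesn't execute), so bad samples can be excluded rather than diluting the vote.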

What parts matter

SATLM has two intertwined ideas: (1) make the LLM write a declarative spec (constraints + query), and (2) let a symbolic solver do the planning/search and execution. The paper disentangles them by holding the declarative prompt format fixed and swapping out what performs the “solve” step.

Caption: A variant of the approach which replaces the SAT solver with a “CoT solver” that takes the SAT problem as input and solves it in natural language.

On GSM-SYS, GSM, and CLUTRR (greedy decoding, code-davinci-002), they compare standard CoT against two variants that share the declarative parse: Sat_CotSolver, where the LLM solves the parsed SAT problem in natural language, and Sat_SymSolver, where the symbolic solver solves it.

Results (accuracy %, greedy):

Dataset  CoT   Sat_CotSolver  Sat_SymSolver
GSM-SYS  46.5  54.5           69.4
GSM      62.7  63.2           71.7
CLUTRR   40.8  48.9           68.3

Caption: Example outputs from GSM (left) and BOARDGAMEQA (right) show that LLMs can perform commonsense reasoning while parsing.

Limitations

Final Thought

SATLM fits into the tool-using & neuro-symbolic lines of work: LLMs parse; solvers reason. When the task stresses planning/search rather than just execution, that division of labor can pay off.