1 of 23

Faithful Logical Reasoning via

Symbolic Chain-of-Thought

Jundong Xu, Hao Fei, Liangming Pan,

Qian Liu, Mong-Li Lee, Wynne Hsu

National University of Singapore, University of California,

University of Auckland

Paper link

Accepted at ACL 2024

2 of 23

Abstract

3 of 23

4 of 23

Evaluation

5 standard datasets (PrOntoQA, ProofWriter, FOLIO, LogicalDeduction, AR-LSAT)

First-Order Logic (PrOntoQA, ProofWriter, FOLIO)

A formal system that uses predicates and quantifiers to express relationships between objects.

Constraint Optimization symbolic expressions (LogicalDeduction, AR-LSAT)

Find optimal solutions by manipulating symbolic variables subject to defined constraints.

SymbCoT outperforms CoT

The evaluation uses five different datasets. PrOntoQA, ProofWriter, and FOLIO are based on first-order logic, while LogicalDeduction and AR-LSAT are based on constraint optimization symbolic expressions. The results show that SymbCoT outperforms CoT.

First-order logic: first-order statements express relationships, and each statement is judged true or false. Example:

{

"id": "FOLIO_train_0",

"context": "All people who regularly drink coffee are dependent on caffeine. People either regularly drink coffee or joke about being addicted to caffeine. No one who jokes about being addicted to caffeine is unaware that caffeine is a drug. Rina is either a student and unaware that caffeine is a drug, or neither a student nor unaware that caffeine is a drug. If Rina is not a person dependent on caffeine and a student, then Rina is either a person dependent on caffeine and a student, or neither a person dependent on caffeine nor a student.",

"question": "Based on the above information, is the following statement true, false, or uncertain? Rina is a person who jokes about being addicted to caffeine or unaware that caffeine is a drug.",

"options": [

"A) True",

"B) False",

"C) Uncertain"

],

"answer": "A"

},
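The FOLIO item above can be checked mechanically. The following is a minimal sketch, not something from the paper: because every premise is applied to Rina, it grounds the predicates to five Boolean variables (hypothetical names) and enumerates all truth assignments. It assumes that Rina is a person, that each "either ... or" in the premises is exclusive, and that the "or" in the question is inclusive.

```python
from itertools import product

def premises_hold(coffee, dependent, jokes, unaware, student):
    """Premises of FOLIO_train_0, grounded to the single individual Rina."""
    p1 = (not coffee) or dependent            # coffee drinkers are dependent on caffeine
    p2 = coffee != jokes                      # drinks coffee XOR jokes (assumed exclusive)
    p3 = (not jokes) or (not unaware)         # nobody who jokes is unaware
    p4 = (student and unaware) != ((not student) and (not unaware))
    both = dependent and student
    neither = (not dependent) and (not student)
    p5 = both or (both != neither)            # not both -> (both XOR neither)
    return p1 and p2 and p3 and p4 and p5

def classify():
    """Return True / False / Uncertain for 'Rina jokes or is unaware'."""
    models = [m for m in product([False, True], repeat=5) if premises_hold(*m)]
    if not models:
        raise ValueError("premises are inconsistent")
    values = {jokes or unaware for (_, _, jokes, unaware, _) in models}
    if values == {True}:
        return "True"
    if values == {False}:
        return "False"
    return "Uncertain"

print(classify())
```

Under these assumptions the statement holds in every model of the premises, which agrees with the listed answer A.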

Constraint optimization: given a set of conditions, find the best solution. Example:

{

"id": "ar_lsat_200006_1-G_1_1",

"context": "Four boys—Fred, Juan, Marc, and Paul—and three girls—Nita, Rachel, and Trisha—will be assigned to a row of five adjacent lockers, numbered consecutively 1 through 5, arranged along a straight wall. The following conditions govern the assignment of lockers to the seven children: Each locker must be assigned to either one or two children, and each child must be assigned to exactly one locker. Each shared locker must be assigned to one girl and one boy. Juan must share a locker, but Rachel cannot share a locker. Nita's locker cannot be adjacent to Trisha's locker. Fred must be assigned to locker 3.",

"question": "Which one of the following is a complete and accurate list of the children who must be among those assigned to shared lockers?",

"options": [

"A) Fred, Juan",

"B) Juan, Paul",

"C) Juan, Marc, Paul",

"D) Juan, Marc, Trisha",

"E) Juan, Nita, Trisha"

],

"answer": "E"

}
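The AR-LSAT item above is small enough to solve by brute force, which shows what "manipulating variables subject to constraints" means in practice. This is a minimal sketch with hypothetical helper names (`is_valid`, `must_share`), not the paper's method: it tries every assignment of the seven children to the five lockers, keeps the assignments that satisfy all conditions, and intersects the sets of children who share a locker.

```python
from itertools import product

BOYS = ["Fred", "Juan", "Marc", "Paul"]
GIRLS = ["Nita", "Rachel", "Trisha"]
CHILDREN = BOYS + GIRLS

def is_valid(assign):
    """Check one child -> locker mapping against the stated conditions."""
    occupants = {n: [c for c in CHILDREN if assign[c] == n] for n in range(1, 6)}
    for kids in occupants.values():
        if len(kids) not in (1, 2):           # one or two children per locker
            return False
        if len(kids) == 2 and sum(k in GIRLS for k in kids) != 1:
            return False                      # a shared locker holds one girl and one boy
    if len(occupants[assign["Juan"]]) != 2:   # Juan must share
        return False
    if len(occupants[assign["Rachel"]]) != 1: # Rachel cannot share
        return False
    if abs(assign["Nita"] - assign["Trisha"]) == 1:
        return False                          # Nita and Trisha are not adjacent
    return assign["Fred"] == 3                # Fred is in locker 3

def must_share():
    """Children who share a locker in every valid assignment."""
    result = None
    for lockers in product(range(1, 6), repeat=len(CHILDREN)):
        assign = dict(zip(CHILDREN, lockers))
        if not is_valid(assign):
            continue
        sharers = {c for c in CHILDREN if lockers.count(assign[c]) == 2}
        result = sharers if result is None else result & sharers
    return result

print(sorted(must_share()))
```

The intersection is Juan, Nita, and Trisha, which agrees with the listed answer E.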

5 of 23

Introduction

6 of 23

7 of 23

SymbCoT for

Symbolic Reasoning

8 of 23

9 of 23

Prompt

+

Question

+

Premises

First-order logic format

10 of 23

Prompt

+

Translator's output

Step-by-step

solution plan

11 of 23

Prompt

+

Translator's output

+

Planner's output

Step-by-step reasoning process

+

Final conclusion

12 of 23

Prompt

+

Original Q & context

+

Component’s I/O

Component results

+

Issue

+

Final Status
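The four slides above describe each stage by its inputs and outputs. The data flow can be sketched as follows, assuming a generic `llm` callable that maps a prompt string to a completion. The function name `run_symbcot`, the placeholder instructions, and the dictionary keys are hypothetical; the stage names Solver and Verifier follow the paper's terminology rather than the slide text, and the paper's actual prompts are not reproduced here.

```python
from typing import Callable, Dict

# Placeholder instructions only; these are NOT the prompts used in the paper.
PROMPTS = {
    "translator": "Translate the premises and question into first-order logic.",
    "planner": "Write a step-by-step plan for solving the problem.",
    "solver": "Follow the plan and reason step by step to a final conclusion.",
    "verifier": "Check each component's output against the original context.",
}

def run_symbcot(llm: Callable[[str], str], premises: str, question: str) -> Dict[str, str]:
    """Chain the four stages; each stage sees the outputs of the earlier ones."""
    join = "\n\n".join
    # Translator: prompt + question + premises -> symbolic (FOL) format
    translation = llm(join([PROMPTS["translator"], question, premises]))
    # Planner: prompt + translator's output -> step-by-step solution plan
    plan = llm(join([PROMPTS["planner"], translation]))
    # Solver: prompt + translator's output + planner's output -> reasoning and conclusion
    solution = llm(join([PROMPTS["solver"], translation, plan]))
    # Verifier: prompt + original question and context + every component's I/O
    verification = llm(join([PROMPTS["verifier"], question, premises,
                             translation, plan, solution]))
    return {"translation": translation, "plan": plan,
            "solution": solution, "verification": verification}

# Usage with a stub in place of a real model call.
outputs = run_symbcot(lambda prompt: f"[{len(prompt)} prompt chars]",
                      "All cats are animals. Tom is a cat.",
                      "Is Tom an animal?")
print(list(outputs))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client.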

13 of 23

Datasets:

Five standard datasets: PrOntoQA, ProofWriter, FOLIO, LogicalDeduction, AR-LSAT.
Symbolic Structures: First-Order Logic (FOL) and Constraint Optimization (CO).

Baselines:

Naive Prompting, CoT, Logic-LM (GPT-3.5 and GPT-4).
CoT-SC, ToT, CR, DetermLR (GPT-4).

Metrics:

Accuracy (multiple-choice correctness).

Experiments

After introducing the basic concepts and architecture of SymbCoT, I will now explain in detail how the authors designed the experiments in this paper.

Selecting representative datasets is the first step.

To verify the performance and versatility of SymbCoT, the authors carefully selected five standard logical reasoning datasets: PrOntoQA, ProofWriter, FOLIO, LogicalDeduction, and AR-LSAT. These datasets vary in difficulty and focus, covering a wide range of logical reasoning scenarios.

PrOntoQA is a question-answering dataset built on ontologies, testing the model's ability to understand and reason over stated knowledge. ProofWriter asks whether a statement follows from a set of natural-language facts and rules, focusing more on the completeness of the reasoning process. FOLIO emphasizes natural language reasoning, highlighting the model's understanding of natural language and its translation into logic. LogicalDeduction and AR-LSAT pose constraint-style problems, such as ordering objects under given conditions and LSAT analytical reasoning questions, to test the model's performance under a different symbolic structure.

I introduce these datasets because the choice of symbolic structure depends on the dataset.

The authors did not use a single logical expression. To more comprehensively evaluate SymbCoT's capabilities, they used two symbolic structures: First-Order Logic (FOL) and Constraint Optimization (CO).

FOL is a powerful logical expression language that can precisely describe the relationships and attributes between objects. It is suitable for datasets such as PrOntoQA, ProofWriter, and FOLIO, which require detailed logical expressions. CO is a framework for solving constraint satisfaction problems, which can effectively handle reasoning tasks with complex constraints. It is more suitable for datasets such as LogicalDeduction and AR-LSAT, which need to handle constraints.

By using different symbolic structures, the authors can verify SymbCoT's performance under different logical expression methods.

For the baselines, the authors selected a series of methods, including Naive Prompting, Chain-of-Thought (CoT), and Logic-LM.

Naive Prompting is the most basic prompting method, directly inputting the problem into the LLM without using any additional techniques. Chain-of-Thought (CoT) is a common method to improve the reasoning ability of LLMs by guiding the LLM to think step by step to generate more reliable answers. Logic-LM is a hybrid method that combines LLMs and external symbolic solvers.

For a more comprehensive comparison, the authors ran these baselines on both GPT-3.5 and GPT-4. In addition, on GPT-4, they added more advanced baselines: CoT-SC, ToT, CR, and DetermLR. These represent different strategies for improving the reasoning ability of LLMs, such as self-consistency, tree-of-thought, and cumulative reasoning.
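Of the extra GPT-4 baselines, self-consistency (CoT-SC) is the simplest to illustrate: sample several reasoning chains and keep the most frequent final answer. A minimal sketch, assuming the final answers have already been extracted from the sampled chains; `majority_answer` is a hypothetical helper, not code from any of the baseline implementations.

```python
from collections import Counter

def majority_answer(sampled_answers):
    """Return the most frequent answer among the sampled reasoning chains."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) gives [(answer, count)]; ties go to the answer seen first.
    return Counter(sampled_answers).most_common(1)[0][0]

print(majority_answer(["A", "B", "A", "C", "A"]))
```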

The main evaluation metric is accuracy, which I think we are all familiar with, so I won't go into detail.
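For reference, accuracy on multiple-choice items is the fraction of questions whose predicted option letter matches the gold one. A short sketch with hypothetical names and made-up example labels:

```python
def accuracy(predictions, gold):
    """Fraction of items where the predicted option equals the gold option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Illustrative labels only, not results from the paper.
print(accuracy(["A", "E", "B", "C"], ["A", "E", "C", "C"]))
```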

14 of 23

Results

Overall Performance:

SymbCoT significantly outperforms Naive, CoT, and Logic-LM baselines.
Demonstrates general versatility in different symbolic reasoning expressions.

15 of 23

Results

Ablation Study:

16 of 23

Analysis and Discussion

Reasoning Depth:

SymbCoT's improvement over CoT becomes more pronounced with increasing reasoning depth.

17 of 23

Analysis and Discussion

Robustness to Symbolic Syntax Errors:

SymbCoT achieves a high execution success rate compared to methods relying on external solvers like Logic-LM.

Figure 5 shows that SymbCoT has a higher execution success rate compared to Logic-LM, which relies on external solvers. This is indeed a significant achievement.

But the paper also says, quote, “Notably, our method achieves a remarkable execution success rate of up to 100%.” Personally, I think this is something of a dirty trick. Let me explain why.

Logic-LM needs to pass the symbolic expressions generated by the LLM to an external solver for solving. If the symbolic expressions contain syntax errors, the external solver will refuse to execute, leading to reasoning failure. SymbCoT relies entirely on the LLM for reasoning and does not need to use an external solver. This makes SymbCoT more robust to symbolic syntax errors. Even if the symbolic expressions generated by the LLM contain slight syntax errors, SymbCoT can still correct them through its internal reasoning mechanism, thereby improving the success rate of reasoning.
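To make the difference concrete, here is a toy sketch of why an external solver caps the execution rate. The `strict_parse` function stands in for a solver front end and accepts only a single well-formed atom such as `Dependent(rina)`; it is a hypothetical stand-in, not Logic-LM's actual parser, and the example expressions are made up. Anything it rejects counts as a failed execution, whereas an LLM-only pipeline always returns some answer.

```python
import re

# One predicate name followed by a parenthesised, comma-separated argument list.
ATOM = re.compile(r"^[A-Za-z_]\w*\(\s*\w+(\s*,\s*\w+)*\s*\)$")

def strict_parse(expr: str) -> bool:
    """Toy solver front end: reject anything that is not a well-formed atom."""
    return bool(ATOM.match(expr.strip()))

def execution_rate(expressions, can_execute):
    """Share of generated expressions that the back end is able to run."""
    executed = sum(1 for e in expressions if can_execute(e))
    return executed / len(expressions)

# Made-up LLM outputs; two of them contain syntax errors.
generated = ["Dependent(rina)", "Student(rina", "Jokes(rina)", "Unaware rina)"]

solver_rate = execution_rate(generated, strict_parse)        # external solver
llm_only_rate = execution_rate(generated, lambda e: True)    # LLM never refuses
print(solver_rate, llm_only_rate)
```

In this toy setting the execution rate says nothing about whether the answers are correct, which is why it should be read alongside accuracy.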

So the result is fairly trivial. It is not wrong, but I think the claim is somewhat exaggerated. Anyway, I am sharing this trick with you; maybe it will help you in your own research.

18 of 23

Analysis and Discussion

Benefits of Hybrid Expression (Figure 6):

SymbCoT reduces errors caused by information loss and inaccurate translations by cross-referencing symbolic and natural language data.

19 of 23

Analysis and Discussion

Reasoning Faithfulness:

SymbCoT ensures credible, symbolic-based reasoning and reduces reliance on chance.

20 of 23

Analysis and Discussion

Impact of LLM Scale (Figure 8):

Performance gains are more significant when upgrading from GPT-3.5 to GPT-4.

21 of 23

Conclusion

Summary:

SymbCoT improves logical reasoning by integrating symbolic expressions and logical rules with CoT prompting.
Enhances vanilla CoT on logical reasoning with both FOL and CO symbolic expressions.

Significance:

Advances in faithfulness, flexibility, and explainability of logical reasoning.

22 of 23

Future Directions

Their perspective:

Combining SymbCoT with external solvers to leverage complementary strengths.
Evaluating more symbolic structures to ensure comprehensive evaluation.
Optimizing the framework's efficiency to reduce implementation costs.

My perspective:

Agent Design & Communication
Empowering Smaller LLMs
Integration with Reasoning Models (O-series, R1, S1, etc.)

Finally, they bring up three future directions. First, SymbCoT relies entirely on LLMs for reasoning. In the future, they suggest combining SymbCoT with external solvers to take full advantage of the strengths of both.

Second, this research only evaluated two symbolic structures: FOL and CO. In the future, they suggest evaluating more symbolic structures, such as description logic and rule engines, to understand SymbCoT's capabilities more comprehensively.

Third, the implementation cost of SymbCoT is relatively high, mainly because it needs to generate a large number of tokens. In the future, they suggest optimizing the efficiency of the framework to reduce the implementation cost and make it easier to apply.

But I think we can ignore that.

From my personal perspective, we could try applying these concepts to agent design, improving the accuracy and efficiency of communication between agents. Furthermore, we could try using this approach to fine-tune relatively small language models, turning them into valuable components within agent systems.

In addition, interest in reasoning models, especially chain-of-thought approaches, has surged recently. The widely anticipated DeepSeek R1 and OpenAI's newly released o3-mini are prime examples of the continuous drive to improve model reasoning. Adding to this trend, the S1 model, which is fine-tuned at relatively low cost on a relatively small dataset, also shows ways to further reduce the fine-tuning cost for reasoning models.

These models might hold significant potential for synergy with this research.

23 of 23

Thanks for listening