1 of 17

Hierarchical Deconstruction of LLM Reasoning:

A Graph-Based Framework for Analyzing Knowledge Utilization

Miyoung Ko1*, Sue Hyun Park1*, Joonsuk Park2,3,4†, Minjoon Seo1†

1KAIST AI, 2NAVER AI Lab, 3NAVER Cloud, 4University of Richmond

*Equal contribution †Equal advising

The 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024

2 of 17

Table of Contents

1. Introduction

2. DepthQA - Graph-based Reasoning

3. Analysis on Depthwise Knowledge Reasoning

4. Memorization in Depthwise Knowledge Reasoning

5. Effect of Explicit Reasoning Process

3 of 17

1. Introduction

Hierarchical Deconstruction of LLM Reasoning

Q: How can we understand how LLMs utilize multiple pieces of knowledge to solve complex questions?

A: Deconstruction based on knowledge depth.

[Figure: deconstructing a complex question by knowledge depth, following Webb's Depth of Knowledge]

4 of 17

1. Introduction

Hierarchical Deconstruction of LLM Reasoning

- Deconstruct depth-3 questions into a graph structure based on knowledge depth.

⇒ Transition to deeper nodes = acquiring and reasoning with knowledge from shallower nodes.

- Forward discrepancy vs. backward discrepancy: failure of reasoning vs. failure of retrieving basic knowledge
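A minimal sketch of how such a depth graph might be represented, assuming a node type that links each question to its shallower prerequisites; the class, field names, and example questions are illustrative, not the paper's implementation.

from dataclasses import dataclass, field

@dataclass
class QuestionNode:
    """A question in a DepthQA-style graph."""
    question: str
    depth: int  # 1 = factual, 2 = procedural, 3 = strategic
    prerequisites: list["QuestionNode"] = field(default_factory=list)  # shallower nodes

# Hypothetical depth-3 question and its deconstruction:
d1 = QuestionNode("What is a gradient?", depth=1)
d2 = QuestionNode("How is gradient descent applied to a loss function?", depth=2, prerequisites=[d1])
d3 = QuestionNode("Why can gradient descent stall on non-convex losses?", depth=3, prerequisites=[d2])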

5 of 17

2. DepthQA - Graph-based Reasoning

Knowledge Depths in Nodes

Following the definition of Webb’s depth of knowledge (Webb, 1997, 1999, 2002)

[Depth 1] Factual and Conceptual Knowledge - what the knowledge entails

Acquisition and recall of information, or following a simple formula

[Depth 2] Procedural Knowledge - how the knowledge can be utilized

Application of concepts through the selection of appropriate procedures and step-by-step engagement

[Depth 3] Strategic Knowledge - why the knowledge is applicable

Analysis, decision-making, or justification to address non-routine problems


Webb, N. L. (1997). Criteria for Alignment of Expectations and Assessments in Mathematics and Science Education. Research Monograph No. 6.

Webb, N. L. (1999). Alignment of Science and Mathematics Standards and Assessments in Four States. Research Monograph No. 18.

Webb, N. L. (2002). Depth-of-knowledge levels for four content areas. Language Arts, 28(March), 1-9.

6 of 17

2. DepthQA - Graph-based Reasoning

Construction of DepthQA

Top-down deconstruction of D3 questions

- From the TutorEval (Chevalier et al., 2024) dataset, collect D3 questions

- For each D3 question, deconstruct it into D2 questions using GPT-4 Turbo (same process from each D2 to D1)


Chevalier, A., Geng, J., Wettig, A., Chen, H., Mizera, S., Annala, T., … & Chen, D. (2024). Language Models as Science Tutors. Proceedings of the 41st International Conference on Machine Learning.
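A sketch of one deconstruction step, assuming the OpenAI Python client; the instruction text is a paraphrase for illustration, not the paper's actual prompt, and the model name mirrors the slide's GPT-4 Turbo.

from openai import OpenAI

client = OpenAI()

def deconstruct(question: str, target_depth: int) -> str:
    # Ask the LLM for the shallower questions one level below `question`.
    prompt = (
        f"Decompose the following question into depth-{target_depth} sub-questions "
        f"that cover the prerequisite knowledge needed to answer it:\n{question}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# D3 -> D2; the same step is then applied to each resulting D2 question to get D1.
d2_questions = deconstruct("Why does PCA maximize variance?", target_depth=2)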

7 of 17


2. DepthQA - Graph-based Reasoning

Criteria for Reasoning in Edges

- Three criteria to ensure that edges accurately represent the reasoning processes

[C1] Comprehensiveness: Questions at lower levels should aim to cover all foundational concepts necessary to answer questions at higher levels.

[C2] Implicitness: Questions at lower levels should avoid directly revealing answers for higher-level questions.

[C3] Non-binary Questioning: Questions should elicit detailed, exploratory responses instead of simple yes/no answers.
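As a toy illustration of C3 only, here is a heuristic screen that flags questions phrased to invite yes/no answers; this is an assumption for illustration, not the paper's validation procedure.

# Questions opening with an auxiliary verb usually elicit yes/no answers (C3 violation).
YES_NO_OPENERS = {"is", "are", "was", "were", "do", "does", "did",
                  "can", "could", "will", "would", "has", "have", "should"}

def violates_c3(question: str) -> bool:
    words = question.strip().lower().split()
    return bool(words) and words[0] in YES_NO_OPENERS

assert violates_c3("Is entropy always non-negative?")
assert not violates_c3("How does entropy relate to information content?")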

8 of 17


2. DepthQA - Graph-based Reasoning

Data Statistics

Reasoning Types Analysis

9 of 17


3. Analysis on Depthwise Knowledge Reasoning

Evaluation Metrics

1) Depthwise Evaluation ([1, 5])

- Factual correctness of answer: {1, 2, 3, 4, 5} by LLM-as-a-judge (GPT-4 Turbo)

- Average factual correctness over questions in each depth

2) Discrepancy Evaluation ([0, 1])

- Forward discrepancy vs. Backward discrepancy

Models

LLaMA 2 {7B, 13B, 70B} Chat, Mistral {7B, 8x7B} Instruct, LLaMA 3 {8B, 70B} Instruct, GPT-3.5-Turbo
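A sketch of the two metric families under stated assumptions: judge scores are integers in [1, 5], normalized to [0, 1], and an edge-level discrepancy is the positive part of the normalized score difference between the shallower and deeper question; the paper's exact aggregation over graph edges may differ.

def normalize(score: int) -> float:
    # Map a 1-5 factual-correctness score to [0, 1].
    return (score - 1) / 4

def forward_discrepancy(shallow: int, deep: int) -> float:
    # Shallow answered well but deep failed: the reasoning step broke down.
    return max(0.0, normalize(shallow) - normalize(deep))

def backward_discrepancy(shallow: int, deep: int) -> float:
    # Deep answered well but shallow failed: basic knowledge was not retrieved.
    return max(0.0, normalize(deep) - normalize(shallow))

def depthwise_score(scores_by_depth: dict[int, list[int]]) -> dict[int, float]:
    # Average factual correctness over all questions at each depth.
    return {d: sum(s) / len(s) for d, s in scores_by_depth.items()}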

10 of 17


3. Analysis on Depthwise Knowledge Reasoning

Depthwise Knowledge Reasoning Results

- Larger models exhibit smaller discrepancies: LLaMA 2 7B Chat vs. LLaMA 3 70B Instruct

- Contrasting patterns of discrepancies: Discrepancy = Intensity × Frequency

⇒ Forward discrepancy: low intensity (0.225) & high frequency (44%)

⇒ Backward discrepancy: high intensity (0.323) & low frequency (23%)
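A sketch of the Intensity × Frequency decomposition, assuming the reported discrepancy is a mean over graph edges: frequency is the share of edges with a nonzero discrepancy, and intensity is the mean magnitude on those edges.

def decompose(edge_discrepancies: list[float]) -> tuple[float, float, float]:
    nonzero = [d for d in edge_discrepancies if d > 0]
    frequency = len(nonzero) / len(edge_discrepancies)
    intensity = sum(nonzero) / len(nonzero) if nonzero else 0.0
    # intensity * frequency equals the plain mean over all edges.
    return intensity, frequency, intensity * frequency

# e.g. intensity 0.225 at frequency 0.44 yields an overall discrepancy of about 0.099.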


12 of 17

4. Memorization in Depthwise Knowledge Reasoning

Depthwise Memorization

Min-K% Probability (Shi et al., 2024): average of the negative log-likelihoods of the K% least probable tokens

⇒ Higher Min-K%: smaller likelihood that the answer appeared in the training data

Models rely less on memorization for complex questions: Min-K% increases from D1 to D2 to D3

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., & Zettlemoyer, L. (2024). Detecting Pretraining Data from Large Language Models. ICLR 2024.
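A minimal sketch of the Min-K% score as defined on this slide, assuming a Hugging Face causal LM; the model choice is illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    # Average negative log-likelihood of the k% least probable tokens.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    nll = -token_log_probs
    num = max(1, int(k * nll.numel()))
    # Least probable tokens have the highest negative log-likelihood.
    return nll.topk(num).values.mean().item()

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained("gpt2")
score = min_k_percent_prob("The derivative of sin(x) is cos(x).", model, tokenizer)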

13 of 17

4. Memorization in Depthwise Knowledge Reasoning

Memorization Gaps between Depths

Memorization Gap: [Accuracy(D3) - Accuracy(D2)] / 4, i.e., the score difference normalized by the width of the 1-5 scale

- Positive gap: higher factual accuracy on the deeper question, representing backward discrepancy

- Negative gap: higher accuracy on the shallower questions, representing forward discrepancy

Larger variance for smaller models: variance of LLaMA 2 7B > LLaMA 2 70B, LLaMA 3 70B

Potential causes of discrepancies for larger models:

- Relatively higher forward discrepancies at the 75% quantile (less memorization)

- Relatively higher backward discrepancies at the 25% quantile (more memorization)
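The slide's gap formula as code, assuming D2/D3 accuracies are on the 1-5 judge scale so that dividing by 4 rescales the difference to [-1, 1]; the example values are hypothetical.

def memorization_gap(acc_d3: float, acc_d2: float) -> float:
    # Positive -> backward discrepancy; negative -> forward discrepancy.
    return (acc_d3 - acc_d2) / 4

gap = memorization_gap(acc_d3=4.2, acc_d2=3.4)  # 0.2, a backward discrepancy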


16 of 17

5. Effect of Explicit Reasoning Process

Providing Explicit Reasoning Process

- Settings

(i) Multiturn: shallower questions are provided as user queries in a multi-turn conversation

(ii) Prompt (Gold): shallower questions and their gold answers are provided in the prompt

(iii) Prompt (Pred.): shallower questions and the model's own predictions are provided in the prompt

- Explicitly providing shallower solutions is beneficial for small models and complex questions.

- Implicitly guiding reasoning via multi-turn interactions improves performance the most.
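A sketch of how the three input formats could be constructed; the templates and the stateful `chat` helper are illustrative assumptions, not the paper's prompts. In Multiturn the model produces the intermediate answers itself, turn by turn.

def multiturn(chat, shallow_qs: list[str], deep_q: str) -> str:
    # `chat` is a hypothetical helper: appends a user turn to the conversation
    # and returns the assistant's reply.
    for q in shallow_qs:
        chat(q)  # the model answers each shallower question in its own turn
    return chat(deep_q)

def single_prompt(shallow_qas: list[tuple[str, str]], deep_q: str) -> str:
    # Prompt (Gold): pair each shallower question with its gold answer;
    # Prompt (Pred.): pair it with the model's own earlier prediction.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in shallow_qas)
    return f"{context}\n\nNow answer: {deep_q}"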

17 of 17


Thank you!