Hierarchical Deconstruction of LLM Reasoning:
A Graph-Based Framework for Analyzing Knowledge Utilization
Miyoung Ko1*, Sue Hyun Park1*, Joonsuk Park2,3,4†, Minjoon Seo1†
1KAIST AI, 2NAVER AI Lab, 3NAVER Cloud, 4University of Richmond
*Equal contribution †Equal advising
The 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024
Table of Contents
1. Introduction
2. DepthQA - Graph-based Reasoning
3. Analysis on Depthwise Knowledge Reasoning
4. Memorization in Depthwise Knowledge Reasoning
5. Effect of Explicit Reasoning Process
1. Introduction
Hierarchical Deconstruction of LLM Reasoning
Q: How can we understand how LLMs utilize multiple pieces of knowledge to solve complex questions?
A: Deconstruction based on knowledge depth.
Webb’s Depth of Knowledge
Deconstruction based on knowledge depth
1. Introduction
Hierarchical Deconstruction of LLM Reasoning
- Deconstruct depth-3 questions into a graph structure based on knowledge depth.
⇒ Transitioning to a deeper node = acquiring and reasoning with knowledge from shallower nodes.
- Forward discrepancy vs. backward discrepancy
: Failure of reasoning (shallower questions solved, deeper question failed) vs. failure of retrieving basic knowledge (deeper question solved, shallower questions failed)
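Under this framing, a per-edge discrepancy can be sketched as a normalized score gap between the shallower and deeper endpoints. The code below is an illustrative formalization, not necessarily the paper's exact implementation, assuming judge scores in [1, 5] so the maximum gap is 4:

```python
# Illustrative sketch: a DepthQA-style graph where each edge links a
# shallower question to the deeper question it supports. Per-edge
# discrepancies are normalized score gaps in [0, 1].

def edge_discrepancies(scores, edges, max_gap=4):
    """scores: question id -> factual correctness in [1, 5].
    edges: list of (shallower_id, deeper_id) pairs."""
    fwd, bwd = [], []
    for shallow, deep in edges:
        gap = scores[shallow] - scores[deep]
        fwd.append(max(0, gap) / max_gap)   # shallower solved, deeper failed
        bwd.append(max(0, -gap) / max_gap)  # deeper solved, shallower failed
    n = len(edges)
    return sum(fwd) / n, sum(bwd) / n

# Toy graph: two D1 questions feed one D2 question, which feeds one D3 question.
scores = {"d1_a": 5, "d1_b": 4, "d2_a": 2, "d3_a": 5}
edges = [("d1_a", "d2_a"), ("d1_b", "d2_a"), ("d2_a", "d3_a")]
fwd, bwd = edge_discrepancies(scores, edges)
```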
Knowledge Depths in Nodes
Following the definition of Webb’s depth of knowledge (Webb, 1997, 1999, 2002)
[Depth 1] Factual and Conceptual Knowledge - What the knowledge entails
Acquisition and recall of information, or following a simple formula
[Depth 2] Procedural Knowledge - How the knowledge can be utilized
Application of concepts through the selection of appropriate procedures and step-by-step engagement
[Depth 3] Strategic Knowledge - Why the knowledge is applicable
Analysis, decision-making, or justification to address non-routine problems
2. DepthQA - Graph-based Reasoning
Webb, N. L. (1997). Criteria for Alignment of Expectations and Assessments in Mathematics and Science Education. Research Monograph No. 6.
Webb, N. L. (1999). Alignment of Science and Mathematics Standards and Assessments in Four States. Research Monograph No. 18.
Webb, N. L. (2002). Depth-of-knowledge levels for four content areas. Language Arts, 28(March), 1-9.
Construction of DepthQA
Top-down deconstruction of D3 questions
- From the TutorEval dataset (Chevalier et al., 2024), collect D3 questions
- For each D3 question, deconstruct it into D2 questions using GPT-4 Turbo (same process from each D2 to D1)
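The top-down loop can be sketched with the GPT-4 Turbo call abstracted behind a `decompose` callable (a hypothetical interface: it maps one question at depth d to its supporting questions at depth d - 1):

```python
# Sketch of the top-down construction loop. `decompose` stands in for the
# GPT-4 Turbo call (hypothetical interface): it maps one question at depth d
# to a list of supporting questions at depth d - 1.

def build_depthqa(d3_questions, decompose):
    nodes, edges = [], []  # nodes: (question, depth); edges: shallower -> deeper

    def expand(question, depth):
        nodes.append((question, depth))
        if depth == 1:  # D1 questions are leaves
            return
        for sub in decompose(question, depth):
            edges.append((sub, question))
            expand(sub, depth - 1)

    for q in d3_questions:
        expand(q, 3)
    return nodes, edges
```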
2. DepthQA - Graph-based Reasoning
Chevalier, A., Geng, J., Wettig, A., Chen, H., Mizera, S., Annala, T., … & Chen, D. (2024). Language Models as Science Tutors. Proceedings of the 41st International Conference on Machine Learning.
2. DepthQA - Graph-based Reasoning
Criteria in Reasoning in Edges
- Three criteria to ensure that edges accurately represent the reasoning processes
[C1] Comprehensiveness: Questions at lower levels should aim to cover all foundational concepts necessary to answer questions at higher levels.
[C2] Implicitness: Questions at lower levels should avoid directly revealing answers for higher-level questions.
[C3] Non-binary Questioning: Questions should elicit detailed, exploratory responses instead of simple yes/no answers.
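As a toy illustration of criterion C3 only, a first-pass heuristic might flag questions that open like yes/no questions (the actual criteria are enforced with model-based checks, not string matching):

```python
# Toy first-pass filter for criterion C3 (non-binary questioning). This is a
# crude surface heuristic for illustration only; the real checks are
# model-based rather than rule-based.
BINARY_OPENERS = ("is ", "are ", "do ", "does ", "did ", "can ", "could ",
                  "will ", "would ", "should ", "has ", "have ", "was ", "were ")

def looks_non_binary(question: str) -> bool:
    # str.startswith accepts a tuple of prefixes
    return not question.strip().lower().startswith(BINARY_OPENERS)
```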
Data Statistics
2. DepthQA - Graph-based Reasoning
Reasoning Types Analysis
3. Analysis on Depthwise Knowledge Reasoning
Evaluation Metrics
1) Depthwise Evaluation (range [1, 5])
- Factual correctness of each answer: scored {1, 2, 3, 4, 5} by an LLM judge (GPT-4 Turbo)
- Average factual correctness over questions at each depth
2) Discrepancy Evaluation (range [0, 1])
- Forward discrepancy vs. backward discrepancy
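A minimal sketch of depthwise evaluation, assuming the judge scores are already collected as (depth, score) pairs:

```python
# Minimal sketch of depthwise evaluation: average the 1-5 judge scores over
# the questions at each depth.
from collections import defaultdict

def depthwise_average(scored):
    """scored: iterable of (depth, score) pairs with score in {1, ..., 5}."""
    buckets = defaultdict(list)
    for depth, score in scored:
        buckets[depth].append(score)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}
```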
Models
LLaMA2 {7B, 13B, 70B} Chat, Mistral {7B, 8x7B} Instruct, LLaMA3 {8B, 70B} Instruct, GPT-3.5-Turbo
3. Analysis on Depthwise Knowledge Reasoning
Depthwise Knowledge Reasoning Results
- Larger models exhibit smaller discrepancies: LLaMA 2 7B Chat vs. LLaMA 3 70B Instruct
- Contrasting patterns of discrepancies: Discrepancy = Intensity × Frequency
⇒ Forward discrepancy: low intensity (0.225) & high frequency (44%)
⇒ Backward discrepancy: high intensity (0.323) & low frequency (23%)
Depthwise Memorization
4. Memorization in Depthwise Knowledge Reasoning
Min-K% Probability (Shi et al., 2024): average negative log-likelihood of the K% least probable tokens
⇒ Higher Min-K%: smaller likelihood that the answer appeared in the training data
Models rely less on memorization for deeper questions: Min-K% increases as D1 < D2 < D3
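A minimal sketch of the Min-K% statistic from per-token log-likelihoods (the default k and the helper name are illustrative):

```python
# Sketch of Min-K% probability: average negative log-likelihood over the K%
# least probable tokens of the answer. Higher values suggest the text was
# less likely to have appeared in the training data.
def min_k_percent(token_log_probs, k=20):
    n = max(1, int(len(token_log_probs) * k / 100))
    lowest = sorted(token_log_probs)[:n]  # the n least probable tokens
    return -sum(lowest) / n
```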
Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., & Zettlemoyer, L. (2024). Detecting Pretraining Data from Large Language Models. ICLR 2024.
Memorization Gaps between Depths
4. Memorization in Depthwise Knowledge Reasoning
Memorization Gap: [Accuracy(D3) - Accuracy(D2)] / 4, normalized by the maximum score difference of 4
- Positive gap: higher factual accuracy on the deeper question, representing backward discrepancy
- Negative gap: higher accuracy on the shallower question, representing forward discrepancy
Larger variance for smaller models: variance of LLaMA 2 7B > LLaMA 2 70B, LLaMA 3 70B
Potential causes of discrepancies for larger models:
- Relatively higher forward discrepancies in the 75% quantile (less memorization)
- Relatively higher backward discrepancies in the 25% quantile (more memorization)
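The gap itself is a one-liner; a sketch with the /4 normalization made explicit:

```python
# Sketch of the memorization gap between adjacent depths: the accuracy
# difference divided by 4, the maximum gap between two scores in [1, 5].
def memorization_gap(acc_deeper, acc_shallower, max_gap=4):
    # > 0: deeper answered better (backward discrepancy)
    # < 0: shallower answered better (forward discrepancy)
    return (acc_deeper - acc_shallower) / max_gap
```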
Providing Explicit Reasoning Process
5. Effect of Explicit Reasoning Process
- Settings
(i) Multiturn: shallower questions are provided as user queries in a multi-turn conversation
(ii) Prompt (Gold): shallower questions and their gold answers are provided in the prompt
(iii) Prompt (Pred.): shallower questions with the model's own predictions are provided in the prompt
- Explicitly providing shallower solutions is beneficial for small models and complex questions.
- Implicitly guiding reasoning via multi-turn interactions best improves performance.
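The three settings can be sketched as input builders (the prompt wording here is illustrative, not the paper's exact templates):

```python
# Sketch of the three input settings; wording is illustrative.

def multiturn(shallower_questions, target_q):
    # (i) Multiturn: each shallower question is its own user turn, so the
    # model answers it itself before facing the deeper target question.
    turns = [{"role": "user", "content": q} for q in shallower_questions]
    turns.append({"role": "user", "content": target_q})
    return turns

def single_prompt(shallower_qas, target_q):
    # (ii) Prompt (Gold): pass gold answers in shallower_qas.
    # (iii) Prompt (Pred.): pass the model's own predictions instead.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in shallower_qas)
    return f"{context}\n\nUsing the above, answer: {target_q}"
```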
Thank you!
Data and Code:
- Code: https://github.com/kaistAI/knowledge-reasoning
- DepthQA dataset: https://huggingface.co/datasets/kaist-ai/DepthQA