1 of 7

Aligning Large Language Models to be Better Medical Reasoners

Ritabrata Maiti

ritabrat001@e.ntu.edu.sg


2 of 7

CoT and Reasoning Capabilities of LLMs

  • Chain-of-thought (CoT) prompting asks large language models to generate intermediate reasoning steps, mimicking human-like thought processes when solving complex problems (Wei et al., 2022).
  • The main goal is to improve the LLM's ability to handle tasks that require sequential thinking and deeper cognitive processes.
  • By utilizing CoT, LLMs demonstrate improved accuracy and efficiency on reasoning benchmarks such as ScienceQA (Zhang et al., 2023).
  • This method enables LLMs to produce more intuitive and interpretable reasoning paths, similar to how humans solve problems (a minimal prompting sketch follows).
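
As a concrete illustration, the sketch below builds a zero-shot CoT prompt in Python. The question text and function name are invented for illustration; note that the trigger-phrase variant shown here is zero-shot CoT, whereas Wei et al. (2022) instead prepend few-shot worked exemplars.

def build_cot_prompt(question: str) -> str:
    """Append a reasoning trigger so the model emits intermediate steps."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )

# The resulting prompt would be sent to any instruction-following LLM;
# the generation should contain reasoning steps before the final answer.
print(build_cot_prompt(
    "A ward assigns 3 nurses per 12 patients. "
    "How many nurses are needed for 48 patients?"
))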

3 of 7

Using Alignment to Guide Desirable Behaviors in LLMs

  • Alignment in LLMs aims to adjust model responses toward human values, steering models to exhibit desirable behaviors and suppress undesirable ones (Wolf et al., 2023).
  • Challenges of alignment: inherent limitations remain, such as adversarial prompting that exploits any undesirable behavior the model retains.
  • Advancements in alignment techniques, such as alignment via synthetic feedback, aim to make LLM alignment more effective and efficient (Kim et al., 2023).
  • Importance of continued research: robust alignment methods are crucial for the safe and ethical application of LLMs across domains.

4 of 7

Using Alignment to Make LLMs Better Reasoners

  • Techniques like alignment fine-tuning (AFT) help optimize large language models by enhancing their reasoning capabilities, ensuring they prioritize high-quality responses during reasoning tasks (Wang et al., 2023).
  • Such techniques train the model to score correct answers above lower-quality ones, adjusting scores based on response quality, which improves both alignment and constraint handling.
  • Supervised fine-tuning on CoT data, followed by aligning the resulting model with feedback, improves the reasoning capabilities of LLMs (a ranking-loss sketch follows).
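
To make the scoring idea concrete, here is a minimal pairwise margin ranking loss over sequence log-probabilities in PyTorch. This is an illustrative simplification rather than the exact constraint alignment objective of Wang et al. (2023); the function name, margin value, and toy scores are assumptions.

import torch
import torch.nn.functional as F

def ranking_alignment_loss(logp_correct: torch.Tensor,
                           logp_incorrect: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    # Inputs are per-example sequence log-probabilities (summed token
    # log-probs) of a correct and an incorrect reasoning chain for the
    # same prompt under the current model.
    # Hinge on the score gap: zero loss once the correct chain
    # outscores the incorrect one by at least `margin`.
    return F.relu(margin - (logp_correct - logp_incorrect)).mean()

# Toy scores for two prompts (made-up numbers).
lp_good = torch.tensor([-12.3, -8.7])
lp_bad = torch.tensor([-11.9, -15.2])
print(ranking_alignment_loss(lp_good, lp_bad).item())  # 0.7

Minimizing such a loss pushes probability mass toward correct reasoning chains without requiring an explicit reward model.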

5 of 7

Diagnostic Reasoning Ability of LLMs

  • LLMs like GPT-4 are already used to write clinical notes, pass medical exams, and respond to patient inquiries [1].
  • Recent studies suggest LLMs can effectively handle complex clinical scenarios, demonstrating advanced diagnostic reasoning.
  • Techniques such as chain-of-thought prompting improve LLM performance by mimicking the sequential reasoning process clinicians follow.
  • Adapting LLMs to explicit clinical reasoning steps provides insight into their decision-making processes, aiding interpretability in healthcare (an illustrative prompt template follows).
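
As an illustration, the template below structures a case prompt around explicit clinical reasoning steps. The wording is hypothetical, loosely inspired by the stepwise diagnostic reasoning prompts studied in [1], and not the exact prompts from that work.

DIAGNOSTIC_TEMPLATE = """You are a physician reasoning through a case.
Case: {vignette}

Reason in explicit steps:
1. Summarize the key findings.
2. Propose a differential diagnosis.
3. Note which findings support or argue against each candidate.
4. State the most likely diagnosis and the next diagnostic step."""

def build_diagnostic_prompt(vignette: str) -> str:
    # Insert the case vignette into the fixed reasoning scaffold.
    return DIAGNOSTIC_TEMPLATE.format(vignette=vignette)

print(build_diagnostic_prompt(
    "A 54-year-old presents with crushing substernal chest pain "
    "radiating to the left arm."
))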

[1] Savage, T., Nayak, A., Gallo, R. et al. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. npj Digit. Med. 7, 20 (2024). https://doi.org/10.1038/s41746-024-01010-1

6 of 7

Proposed methodology

  • We start with a set of GPT-4-generated responses to clinical diagnosis queries, which have been assessed for accuracy by medical experts.
  • We then compile a dataset of correctly reasoned responses for supervised fine-tuning of open-source LLMs (such as Mistral or Llama-2).
  • Next, we build an alignment dataset containing both correct and incorrect reasoning samples for Kahneman-Tversky Optimization (KTO) to further enhance LLM reasoning.
  • We expand these datasets, either synthetically or through manual curation, to scale up our experiments and improve performance (a pipeline sketch follows).
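
A minimal sketch of the proposed two-stage pipeline using the Hugging Face TRL library is shown below. The model name, file paths, and hyperparameters are placeholders, and trainer argument names vary somewhat across TRL versions; TRL's KTO data format expects prompt/completion pairs with a boolean desirability label.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer, SFTConfig, SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # or a Llama-2 checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on expert-verified, correctly reasoned
# responses. Assumed format: a JSON dataset with a single "text" column
# holding the prompt plus its reasoning chain.
sft_data = load_dataset("json", data_files="sft_correct_cot.json", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-out"),
    train_dataset=sft_data,
)
sft_trainer.train()

# Stage 2: KTO alignment on a mix of correct (desirable) and incorrect
# (undesirable) reasoning samples, using "prompt", "completion", and a
# boolean "label" column.
kto_data = load_dataset("json", data_files="kto_mixed.json", split="train")
kto_trainer = KTOTrainer(
    model=sft_trainer.model,
    args=KTOConfig(output_dir="kto-out", beta=0.1),
    train_dataset=kto_data,
    tokenizer=tokenizer,  # newer TRL versions name this `processing_class`
)
kto_trainer.train()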

7 of 7

Steps to be Taken

  • Initial datasets for SFT and KTO are ready
    • These still need to be expanded for the scaling experiments
  • Fine-tuning of LLMs will use LoRA to keep the required GPU compute manageable (a configuration sketch follows)
  • Development of a robust evaluation pipeline and benchmark is essential for assessing performance.
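
For reference, a minimal LoRA setup with the PEFT library might look like the following; the rank, scaling factor, and target modules are illustrative defaults, not tuned values.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # scaling factor for adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable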