1 of 7

Aligning Large Language Models to be Better Medical Reasoners

Ritabrata Maiti

ritabrat001@e.ntu.edu.sg


2 of 7

CoT and Reasoning Capabilities of LLMs

  • Chain-of-thought (CoT) prompting asks large language models to generate intermediate reasoning steps, mimicking human-like thought processes when solving complex problems (Wei et al., 2022).
  • The main goal is to improve the LLM's ability to handle tasks that require sequential thinking and deeper cognitive processes.
  • By utilizing CoT, LLMs demonstrate improved accuracy and efficiency on reasoning benchmarks such as ScienceQA (Zhang et al., 2023).
  • This method enables LLMs to produce more intuitive and interpretable reasoning paths, similar to how humans solve problems (a minimal prompting sketch follows).
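
As a concrete illustration, the sketch below builds a zero-shot CoT prompt in Python. The question text and function name are invented for illustration; note that the trigger-phrase variant shown here is zero-shot CoT, whereas Wei et al. (2022) instead prepend few-shot worked exemplars.

def build_cot_prompt(question: str) -> str:
    """Append a reasoning trigger so the model emits intermediate steps."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )

# The resulting prompt would be sent to any instruction-following LLM;
# the generation should contain reasoning steps before the final answer.
print(build_cot_prompt(
    "A ward assigns 3 nurses per 12 patients. "
    "How many nurses are needed for 48 patients?"
))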

3 of 7

Using Alignment to Guide Desirable Behaviors in LLMs

  • Alignment in LLMs aims to adjust model responses toward human values, steering models to exhibit desirable behaviors and suppress undesirable ones (Wolf et al., 2023).
  • Challenges of alignment: inherent limitations remain, such as adversarial prompting that exploits any undesirable behavior the model retains.
  • Advancements in alignment techniques, such as alignment via synthetic feedback, aim to make LLM alignment more effective and efficient (Kim et al., 2023).
  • Importance of continued research: robust alignment methods are crucial for the safe and ethical application of LLMs across domains.

4 of 7

Using Alignment to Make LLMs Better Reasoners

  • Techniques like alignment fine-tuning (AFT) help optimize large language models by enhancing their reasoning capabilities, ensuring they prioritize high-quality responses during reasoning tasks (Wang et al., 2023).
  • Such techniques train the model to score correct answers above lower-quality ones, adjusting scores based on response quality, which improves both alignment and constraint handling.
  • Supervised fine-tuning on CoT data, followed by aligning the resulting model with feedback, improves the reasoning capabilities of LLMs (a ranking-loss sketch follows).
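
To make the scoring idea concrete, here is a minimal pairwise margin ranking loss over sequence log-probabilities in PyTorch. This is an illustrative simplification rather than the exact constraint alignment objective of Wang et al. (2023); the function name, margin value, and toy scores are assumptions.

import torch
import torch.nn.functional as F

def ranking_alignment_loss(logp_correct: torch.Tensor,
                           logp_incorrect: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    # Inputs are per-example sequence log-probabilities (summed token
    # log-probs) of a correct and an incorrect reasoning chain for the
    # same prompt under the current model.
    # Hinge on the score gap: zero loss once the correct chain
    # outscores the incorrect one by at least `margin`.
    return F.relu(margin - (logp_correct - logp_incorrect)).mean()

# Toy scores for two prompts (made-up numbers).
lp_good = torch.tensor([-12.3, -8.7])
lp_bad = torch.tensor([-11.9, -15.2])
print(ranking_alignment_loss(lp_good, lp_bad).item())  # 0.7

Minimizing such a loss pushes probability mass toward correct reasoning chains without requiring an explicit reward model.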

5 of 7

Diagnostic Reasoning Ability of LLMs

  • LLMs like GPT-4 are already used to write clinical notes, pass medical exams, and respond to patient inquiries [1].
  • Recent studies suggest LLMs can effectively handle complex clinical scenarios, demonstrating advanced diagnostic reasoning.
  • Techniques such as chain-of-thought prompting improve LLM performance by mimicking the sequential reasoning process clinicians follow.
  • Adapting LLMs to explicit clinical reasoning steps provides insight into their decision-making processes, aiding interpretability in healthcare (an illustrative prompt template follows).
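
As an illustration, the template below structures a case prompt around explicit clinical reasoning steps. The wording is hypothetical, loosely inspired by the stepwise diagnostic reasoning prompts studied in [1], and not the exact prompts from that work.

DIAGNOSTIC_TEMPLATE = """You are a physician reasoning through a case.
Case: {vignette}

Reason in explicit steps:
1. Summarize the key findings.
2. Propose a differential diagnosis.
3. Note which findings support or argue against each candidate.
4. State the most likely diagnosis and the next diagnostic step."""

def build_diagnostic_prompt(vignette: str) -> str:
    # Insert the case vignette into the fixed reasoning scaffold.
    return DIAGNOSTIC_TEMPLATE.format(vignette=vignette)

print(build_diagnostic_prompt(
    "A 54-year-old presents with crushing substernal chest pain "
    "radiating to the left arm."
))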

[1] Savage, T., Nayak, A., Gallo, R. et al. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. npj Digit. Med. 7, 20 (2024). https://doi.org/10.1038/s41746-024-01010-1

6 of 7

Proposed methodology

  • We start with a set of GPT-4-generated responses to clinical diagnosis queries, which have been assessed for accuracy by medical experts.
  • We then compile a dataset of correctly reasoned responses for supervised fine-tuning of open-source LLMs (such as Mistral or Llama-2).
  • Next, we build an alignment dataset containing both correct and incorrect reasoning samples for Kahneman-Tversky Optimization (KTO) to further enhance LLM reasoning.
  • We expand these datasets, either synthetically or through manual curation, to scale up our experiments and improve performance (a pipeline sketch follows).
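
A minimal sketch of the proposed two-stage pipeline using the Hugging Face TRL library is shown below. The model name, file paths, and hyperparameters are placeholders, and trainer argument names vary somewhat across TRL versions; TRL's KTO data format expects prompt/completion pairs with a boolean desirability label.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer, SFTConfig, SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # or a Llama-2 checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on expert-verified, correctly reasoned
# responses. Assumed format: a JSON dataset with a single "text" column
# holding the prompt plus its reasoning chain.
sft_data = load_dataset("json", data_files="sft_correct_cot.json", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-out"),
    train_dataset=sft_data,
)
sft_trainer.train()

# Stage 2: KTO alignment on a mix of correct (desirable) and incorrect
# (undesirable) reasoning samples, using "prompt", "completion", and a
# boolean "label" column.
kto_data = load_dataset("json", data_files="kto_mixed.json", split="train")
kto_trainer = KTOTrainer(
    model=sft_trainer.model,
    args=KTOConfig(output_dir="kto-out", beta=0.1),
    train_dataset=kto_data,
    tokenizer=tokenizer,  # newer TRL versions name this `processing_class`
)
kto_trainer.train()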

7 of 7

Steps to be Taken

  • Initial datasets for SFT and KTO are ready
    • These still need to be expanded for the scaling experiments
  • Fine-tuning of LLMs will use LoRA to keep the required GPU compute manageable (a configuration sketch follows)
  • Development of a robust evaluation pipeline and benchmark is essential for assessing performance.
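
For reference, a minimal LoRA setup with the PEFT library might look like the following; the rank, scaling factor, and target modules are illustrative defaults, not tuned values.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # scaling factor for adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable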