1 of 32

LLM fine-tuning

with medical records (Theory & Practice)

2025-10-23, 15:00 ~ 17:00

Seongsu Bae, Sujeong Im

KAIST AI @ Edlab (Advised by Edward Choi)

KoSAIM 2025 Hands-on AI Training for Developers: Train your own medical AI

2 of 32

Speaker Bio

Sujeong Im (임수정)

Education

  • POSTECH Creative IT Engineering, B.Sc. (2018-2022)
  • KAIST Kim Jaechul Graduate School of AI, M.Sc. (2023-2025)
  • KAIST Kim Jaechul Graduate School of AI, Ph.D. (2025-)

Research Interests

  • Foundation Model
  • Natural Language Processing
  • Machine Learning for Healthcare

Seongsu Bae (배성수)

Education

  • Hanyang University Mathematics, B.Sc. (2013-2019)
  • KAIST Kim Jaechul Graduate School of AI, M.Sc. (2020-2022)
  • KAIST Kim Jaechul Graduate School of AI, Ph.D. (2022-)

Research Interests

  • Semantic Machine
  • Multimodal Learning
  • Machine Learning for Healthcare

3 of 32

Table of Contents

  • How to build a clinical domain Large Language Model (LLM)? (40 mins)
    • (Large) Language Model
    • How to build a (large) language model?
    • Building an instruction-following LLM in the clinical domain
    • Asclepius (Kweon and Kim et al., ACL 2024 Findings)
  • Hands-on Session: Fine-tuning a clinical domain LLM (80 mins)
    • Environment Setup & Colab Practice
    • LLM memory layout
    • Parameter-Efficient Fine-Tuning (LoRA/QLoRA)

4 of 32

Language Model

5 of 32

We deal with LMs every day!

6 of 32

How to train an LM?

(Figure: the Next Token Prediction task for the sentence “The sky is blue.”: given each prefix, e.g., “The sky is”, the model is trained to predict the next token, e.g., “blue”.)
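
To make the next-token-prediction objective concrete, here is a minimal sketch using the Hugging Face transformers library. GPT-2 is used purely because it is small enough to run anywhere (the hands-on session uses a different model); the prompt matches the slide's example.

```python
# Minimal next-token-prediction sketch. GPT-2 is used only because it is small;
# any causal LM from the Hub behaves the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The sky is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, seq_len, vocab_size)

# Distribution over the vocabulary for the token that follows "The sky is"
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```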

7 of 32

Text Generation via a Probabilistic Model

(Figure: given the prefix “The sky is”, the (large) language model assigns a probability to every candidate next token: continuations such as “blue” or “clear” are more likely, while “usually” or “the” are less likely.)

8 of 32

How to build a (large) language model?

  • Pre-training and Fine-tuning
    • e.g., BERT (2018), T5 (2019)

(Figure: one pretrained LM is fine-tuned separately on task A, task B, and task C, yielding a dedicated model for inference on each task.)

(-) Task-specific training → One specialized model for each task

9 of 32

How to build a (large) language model?

  • Pre-training and Prompting
    • e.g., GPT-3 (2020)

(Figure: a single pretrained LM performs inference on task A, task B, and task C directly via prompting, with no task-specific training.)

(+) Improve performance via few-shot prompting or prompt engineering

10 of 32

How to build a (large) language model?

  • Pre-training and Prompting

(-) Relies on few-shot exemplars packed into the prompt

(-) Requires manual effort for prompt engineering

(-) Not aligned with natural instructions

11 of 32

How to build a (large) language model?

  • Pre-training and Instruction tuning
    • Supervised Fine-Tuning (SFT) on instruction data
    • e.g., FLAN (2021), LLaMA (2023)

(Figure: a pretrained LM is fine-tuned on many instructions; the resulting single model then performs inference on task A, task B, and task C by following natural language instructions.)

(+) model learns to perform many tasks via natural language instructions

12 of 32

How to build a (large) language model?

  • Pre-training and Alignment tuning
    • Supervised Fine-Tuning (SFT) on instruction data + Alignment learning on preference data (e.g., RLHF, DPO)
    • e.g., InstructGPT (2022), ChatGPT (2022), Llama 2 (2023), Llama 3 (2024)

13 of 32

Building an instruction-following LLM

  • How can we build an instruction-following LLM?
    • Prepare a pre-trained large language model (e.g., LLaMA 7B)
    • Perform supervised fine-tuning on instruction data (e.g., the Alpaca 52K dataset)
  • How can we build an instruction-following LLM in the clinical domain?
    • Prepare a pre-trained large language model
    • Pre-training on clinical corpus for domain adaptation
    • Perform supervised fine-tuning using domain-specific clinical instruction data
      • Today, we will focus on instruction-following data tailored for clinical notes! (a minimal SFT sketch follows below)
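
To make the supervised fine-tuning step concrete, here is a minimal sketch using the Hugging Face Trainer. The prompt template, the dataset column names ("note", "question", "answer"), and the hyperparameters are illustrative assumptions, not the exact Asclepius recipe.

```python
# Minimal SFT sketch: fine-tune a causal LM on (clinical note, instruction, response)
# triples. Prompt template, column names, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("starmpcc/Asclepius-Synthetic-Clinical-Notes", split="train")

def to_features(example):
    # Assumed column names: "note", "question", "answer" (check dataset.column_names).
    prompt = (f"### Clinical note:\n{example['note']}\n\n"
              f"### Instruction:\n{example['question']}\n\n"
              f"### Response:\n{example['answer']}{tokenizer.eos_token}")
    return tokenizer(prompt, truncation=True, max_length=1024)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="clinical-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Causal LM collator: pads the batch and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that full fine-tuning like this needs far more memory than a Colab T4 provides; the hands-on session therefore switches to parameter-efficient fine-tuning (LoRA/QLoRA), sketched later.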

14 of 32

Imagine a clinical LLM

  • Given a clinical note, a clinical LLM can perform these tasks as follows:
    • “What medical procedures were performed on the patient during her hospital course, as mentioned in the discharge summary?” → Named Entity Recognition
    • “What abbreviation was expanded using the acronym ‘ANH’ in the diagnosis section of the discharge summary?” → Abbreviation Expansion
    • “When was the patient started on oral acyclovir and what was the duration of treatment?” → Temporal Information Extraction
    • “Can you summarize the patient’s hospital course, treatment, and diagnoses according to the given discharge summary?” → Summarization
    • “What was the reason for the patient’s transfer to ICU and what was the treatment plan for infection-induced respiratory failure?” → Question Answering

15 of 32

Asclepius: Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes (Kweon and Kim et al., ACL 2024 Findings)

16 of 32

Real clinical note

  • Semi-structured text describing patient activity
  • Properties
    • Semi-structured: organized under section headers
    • Frequent acronyms
    • Typos
  • Problem: Protected Health Information (PHI)
    • Using GPT on the notes: PHI ⇒ impractical
    • Human annotation: requires experts ⇒ costly
    • Machine annotation: PHI ⇒ impractical

17 of 32

Case report

  • Written to share a “case” with the community
    • No PHI ⇒ shareable
  • Properties
    • Plain text
    • Fewer acronyms
    • Well-written
  • Content is similar to clinical notes
  • e.g., PMC (PubMed Central) case report

18 of 32

Synthetic clinical note generation

19 of 32

Clinical instruction/response data generation

20 of 32

Final dataset

  • (clinical note, instruction, response) triples ⇒ all synthetic!

21 of 32

Asclepius-Llama3-8B

  • How can we build an instruction-following LLM in the clinical domain?
    • Prepare a pre-trained large language model
      • use Llama3-8B model
    • Pre-training on clinical corpus for domain adaptation
      • Pre-training (1 epoch): 2h 59m on 4× A100 80GB
      • dataset: synthetic clinical notes
    • Perform supervised fine-tuning using domain-specific clinical instruction data
      • Instruction fine-tuning (3 epochs): 30h 41m on 4× A100 80GB
      • dataset: clinical instruction-response pairs with synthetic clinical notes

22 of 32

Hands-on Session:

Fine-tuning a clinical domain LLM

23 of 32

Environment Setup

colab link

24 of 32

Environment Setup

25 of 32

Environment Setup

26 of 32

Colab Objectives

  • Goal: Fine-tuning a clinical domain LLM
  • Environment: Google Colab
  • Dataset: starmpcc/Asclepius-Synthetic-Clinical-Notes
  • Model: microsoft/phi-2 (2.7B; see the loading sketch after this list)
  • CAUTION
    • Never close Colab while the LLM is training.
      • Do not refresh the page
      • Do not click any other buttons in Colab
      • Do not stop the running cell
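
As referenced above, here is a minimal sketch of what the notebook sets up: checking the assigned GPU and loading the dataset and model named on this slide. This is inspection code only; the actual Colab notebook may differ.

```python
# Inspect the Colab environment and the dataset/model used in this session.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Check which GPU Colab assigned (the free tier typically provides a 16 GB T4).
print(torch.cuda.get_device_name(0))
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

dataset = load_dataset("starmpcc/Asclepius-Synthetic-Clinical-Notes", split="train")
print(dataset)       # number of rows and column names
print(dataset[0])    # one example: a synthetic note with an instruction and response

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto")
print(f"Parameters: {model.num_parameters() / 1e9:.2f}B")  # about 2.7B
```

If the printed column names differ from what the fine-tuning code expects, adjust the prompt formatting accordingly.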

27 of 32

Deep learning memory layout

  • Model size: B (billion) scale
    • x B parameters = x billion floating-point numbers = 2x GB in bf16/fp16 (2 bytes per parameter)
  • Deep learning memory requirements for full fine-tuning
    • model parameters: 2x GB
    • gradients: 2x GB
    • optimizer state: 2x ~ 12x GB
    • Total: 6x ~ 16x GB, plus activations and overhead
  • Our requirements
    • model: phi-2 (2.7B)
    • GPU VRAM: Colab T4 (16 GB)
    • 2.7 × 6 ≈ 16.2 GB > 16 GB ⇒ even the optimistic lower bound does not fit (see the sketch below)
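
The rule of thumb above can be written out explicitly. A rough sketch that ignores activation memory and framework overhead (the "plus alpha"), so real usage is higher still.

```python
# Back-of-envelope memory estimate for full fine-tuning, following the slide's rule of thumb.
def full_finetune_memory_gb(n_params_billion: float) -> tuple[float, float]:
    low = 6 * n_params_billion    # weights (2x) + gradients (2x) + lean optimizer state (2x)
    high = 16 * n_params_billion  # weights (2x) + gradients (2x) + fp32 Adam states (12x)
    return low, high

low, high = full_finetune_memory_gb(2.7)                       # phi-2
print(f"phi-2 full fine-tuning: {low:.1f} ~ {high:.1f} GB")    # 16.2 ~ 43.2 GB
print("Colab T4 VRAM: 16 GB -> even the lower bound does not fit")
```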

28 of 32

Can You Run it?

29 of 32

LoRA (Hu and Shen et al., 2021)
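
A minimal sketch of attaching LoRA adapters with the peft library. The rank, alpha, dropout, and the phi-2 target module names below are illustrative assumptions, not the notebook's exact configuration.

```python
# Minimal LoRA sketch with the peft library. Hyperparameters and the phi-2
# target module names are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices A and B
    lora_alpha=32,     # scaling factor: delta_W = (alpha / r) * B @ A
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed phi-2 module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# The wrapped model is passed to Trainer exactly like before; only the LoRA
# matrices receive gradients, the frozen base weights need no optimizer state.
```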

30 of 32

QLoRA (Dettmers and Pagnoni et al., 2023)
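
A minimal QLoRA sketch: the base model is loaded in 4-bit NF4 via bitsandbytes, then LoRA adapters are trained on top. The quantization settings shown are common defaults, not necessarily the notebook's exact ones.

```python
# Minimal QLoRA sketch: 4-bit NF4 base model (bitsandbytes) + LoRA adapters on top.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in fp16 on a T4
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # layer-norm casts, grad checkpointing, etc.

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed phi-2 module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# 4-bit base weights (~1.5 GB for 2.7B params) plus small LoRA adapters and their
# optimizer states fit comfortably within the T4's 16 GB.
```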

31 of 32

Parameter-Efficient Fine-Tuning (PEFT)

32 of 32

Thank you :D

If you require any further information, feel free to contact us: seongsu@kaist.ac.kr, sujeongim@kaist.ac.kr

KoSAIM 2025 Hands-on AI Training for Developers: Train your own medical AI