1 of 52

2 of 52

Before Ablation: Groundwork for Lexical Alignment Tests

Research group members:

Presentation speakers:

  • Tommie Juzek
  • Xiaoyang Ming

RCC specialist:

  • Jose Hernandez

07 Nov 2025

@SC-AI Seminar

3 of 52

Before Ablation: Groundwork for Lexical Alignment Tests

Three core takeaways:

  • Measuring AI language: document-based frequencies plus windowing allow for automation
  • FSU’s RCC has great infrastructure
  • Lambda is pretty neat

4 of 52

Background

5 of 52

Background

  • LLMs overuse certain words


6 of 52

Background

  • Assumption: exposure → influence (Zajonc, 1968; Hasher et al., 1977)
  • LLM-authored messaging might shift attitudes (Bai et al., 2025).


7 of 52

Background

  • LLMs overuse certain words
  • Relevance:
    • As hundreds of millions of people are exposed, we want good alignment


language alignment & general alignment

8 of 52

Background

  • Human Preference Learning is thought to contribute to this, via fine-tuning methods such as:
    • Learning from Human Feedback (LHF)
    • Reinforcement Learning from Human Feedback (RLHF)
    • Direct Preference Optimisation (DPO)


9 of 52

Background


Typical LHF procedures

10 of 52

Background


Typical LHF procedures

© NYT & WaPo

11 of 52

Background

  • We know:
    • LHF: small preference differences → bigger model behaviour differences
    • Shown for formatting (Zhang et al. 2024/5)
    • But needs to be shown for lexical usage


12 of 52

Background

  • Thus, loosely speaking, the thinking is:
    • relatively few ‘delves’ in the LHF datasets cause rather big changes in model behaviour downstream


13 of 52

Background


14 of 52

Background

  • How to show this?
    • Ablation study
      • experimental procedure in which components of a system are systematically altered or removed to identify their causal contribution to the system’s overall behaviour


15 of 52

Background

  • Thus:
    • vary the dosage of AI-buzzword usage during training →
    • then check model behaviour post-training


16 of 52

Background

17 of 52

Background

  • We want two things:
    • How do we measure “buzzwordiness”? → need a measure
    • One of the criticisms of Zhang et al. (2024/5)’s formatting study was that it used synthetic data → need real data


18 of 52

Big Plan

19 of 52

Big plan

  • 1) score buzzwordiness
  • 2) get real data from experiment
  • 3) run ablation


20 of 52

Big plan

  • 1) score buzzwordiness ← been working on this
  • 2) get real data from experiment
  • 3) run ablation


21 of 52

Big plan

  • 1) score buzzwordiness

→ issue: the literature so far always involves a step of manual curation

21

22 of 52

Big plan

  • 1) score buzzwordiness
    • Mingmeng et al. → some 8 words
    • Kobak et al. → manual checks (e.g., “omicron”)
    • Juzek and Ward (2025) → manual filtering (e.g., “radar”)


23 of 52

Big plan

  • Approach
    • document frequency
    • windows


24 of 52

Another procedure

New-New_Final copy (1).docx

25 of 52

LAS Score

XM

26 of 52

Lexical Alignment Score

Core idea:

  • Obtain the lexical frequency discrepancies between human-authored texts and the texts generated by multiple models.
  • Transform this difference into a metric that quantifies the alignment shift from human expectations.


27 of 52

Lexical Alignment Score

Implementation procedures:

Our research work investigated six public model families:

28 of 52

Lexical Alignment Score

Implementation procedures:

  • Collected 42,000 PubMed scientific article abstracts published between 2012 and 2021, prior to the widespread adoption of LLMs.
  • Following previous work (Juzek and Ward, 2025), we divided each abstract into two halves of equal length: the first half served as the prompt for each model’s generated continuation, the second as the human standard for subsequent computation.
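The halving step can be sketched as follows (a minimal sketch assuming whitespace tokenisation and an even split; `split_abstract` is a hypothetical helper name, not from the slides):

```python
def split_abstract(abstract: str) -> tuple[str, str]:
    """Split an abstract into two halves of equal token length.

    The first half serves as the model prompt, the second half as the
    human-authored reference continuation.
    """
    tokens = abstract.split()
    mid = len(tokens) // 2
    prompt = " ".join(tokens[:mid])
    reference = " ".join(tokens[mid:])
    return prompt, reference


# Toy example with a 10-token "abstract".
prompt, reference = split_abstract(
    "Background methods results one two three four five six seven"
)
```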

29 of 52

Lexical Alignment Score

Estimation:

  • For each continuation S, from the model generations and from the human texts, compute the frequency of each lemmatised token (unique word or punctuation mark) w using Windowed Document Prevalence.

  • Here is an example of Windowed Document Prevalence.


30 of 52

Lexical Alignment Score

Continuation S: In this study, we formulate a clear hypothesis regarding the correlation between social media use and attention span, proposing that frequent exposure to short-form content reduces sustained focus over time.

Lemmatized continuation S: in, this, study, we, formulate, a, clear, hypothesis, regard, the, correlation, between, social, media, use, and, attention, span, propose, that, frequent, exposure, to, short, form, content, reduce, sustain, focus, over, time


31 of 52

Lexical Alignment Score

Lemmatized continuation S using windowed document prevalence (window size = 10; window index 1 covers the first 10 lemmas): in, this, study, we, formulate, a, clear, hypothesis, regard, the, correlation, between, social, media, use, and, attention, span, propose, that, frequent, exposure, to, short, form, content, reduce, sustain, focus, over, time
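One way to sketch Windowed Document Prevalence (a hypothetical reading, assuming a lemma’s score is the share of fixed-size windows, across all continuations, that contain the lemma; the exact definition follows the slides):

```python
from collections import Counter


def windowed_document_prevalence(
    docs: list[list[str]], window_size: int = 10
) -> dict[str, float]:
    """Return, for each lemma, the share of windows containing it.

    Each document's lemma list is cut into consecutive windows of
    `window_size` lemmas; a lemma counts at most once per window.
    """
    window_hits: Counter[str] = Counter()
    n_windows = 0
    for lemmas in docs:
        for start in range(0, len(lemmas), window_size):
            window = set(lemmas[start:start + window_size])
            n_windows += 1
            for lemma in window:
                window_hits[lemma] += 1
    return {lemma: hits / n_windows for lemma, hits in window_hits.items()}


# Toy example: one 12-lemma continuation yields two windows (10 + 2).
docs = [["in", "this", "study", "we", "formulate", "a", "clear",
         "hypothesis", "regard", "the", "correlation", "between"]]
wdp = windowed_document_prevalence(docs, window_size=10)
```

Here every lemma occurs in exactly one of the two windows, so each gets prevalence 0.5.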

32 of 52

Lexical Alignment Score

For a lemma type w and a model M ∈ {B, I}, we define the lemma type-level LAS, LASw:

33 of 52

Lexical Alignment Score

Scoring:

Following the estimation above, we define per-lemmatised-token contributions for a lemmatised token t:


Then we define each token’s Lexical Alignment Score (LAS):

34 of 52

Lexical Alignment Score

Using each token’s LAS as the fundamental unit, we can compute LAS scores at several levels: taking the L2 mean of the token-level scores quantifies the alignment shift at the sentence, document, and corpus (model) levels.
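As a sketch of the aggregation (the token-level formula used here, an absolute prevalence difference, is a stand-in assumption, as are the toy prevalence values; the L2 mean is the root mean square of the token-level scores):

```python
import math


def token_las(p_model: float, p_human: float) -> float:
    """Stand-in token-level LAS: absolute prevalence difference."""
    return abs(p_model - p_human)


def l2_mean(scores: list[float]) -> float:
    """L2 mean (root mean square) of a list of token-level scores."""
    return math.sqrt(sum(s * s for s in scores) / len(scores))


# Hypothetical prevalences for three lemmas in model vs human text.
model_prev = {"delve": 0.30, "the": 0.95, "radar": 0.05}
human_prev = {"delve": 0.02, "the": 0.95, "radar": 0.10}

token_scores = [token_las(model_prev[w], human_prev[w]) for w in model_prev]
corpus_las = l2_mean(token_scores)
```

A perfectly aligned lemma (identical prevalence, like “the” above) contributes zero; overused buzzwords like “delve” dominate the corpus-level score.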

35 of 52

Technical Demo

36 of 52

Demo

  1. Generate continuations
  2. Part-of-speech tag the continuations to obtain lemmatized tokens
  3. Calculate corpus-level LAS scores

Acknowledgement to RCC specialist Jose Hernandez for the assistance!


37 of 52

Outlook on distributed multi-GPU computing

  • Transformer models demand substantial memory and computing power for both training and inference.
  • Our previous work mainly required model inference, which is less resource-intensive. The planned ablation study, however, will require fine-tuning an instruction model.
  • Even partial fine-tuning of a 7B LLM in BF16 precision requires around 14–24 GB of VRAM.
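The 14 GB floor follows directly from the BF16 weights; a back-of-the-envelope check, where the trainable-parameter share and the per-parameter optimizer overhead are hypothetical assumptions, not figures from the slides:

```python
# Back-of-the-envelope VRAM estimate for partially fine-tuning a 7B
# model in BF16 (2 bytes per parameter).

params = 7e9                       # 7B parameters
bytes_per_param_bf16 = 2
weights_gb = params * bytes_per_param_bf16 / 1e9   # 14.0 GB for weights alone

# Hypothetical assumption: ~15% of parameters are trainable in a
# partial fine-tune. Gradients in BF16 (2 bytes) plus two Adam moments
# in FP32 (2 x 4 bytes) add roughly 10 extra bytes per trainable
# parameter.
trainable_share = 0.15
extra_bytes_per_trainable = 10
extra_gb = params * trainable_share * extra_bytes_per_trainable / 1e9

total_gb = weights_gb + extra_gb   # lands in the 14-24 GB ballpark
```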


38 of 52

Outlook on distributed multi-GPU computing


We have been notified that RCC can support the torch-based library accelerate. Since this library is rarely used by RCC users, a series of trials will be necessary to verify its compatibility and functionality.

39 of 52

More Demo

TJ

40 of 52

Programming with .zips

  • projects with zips
  • show


41 of 52

Continuations with Lambda

  • show


42 of 52

Analysis

and

Results

43 of 52

Analysis and Results

  • analysis of document frequencies in AI vs human text
  • show a few lists
    • nice thing: by model


44 of 52

Lists


45 of 52

Interpretation

  • All the function words → A good deal of AI language happens on the syntactic level


46 of 52

Validation

  • “test” set: same results
  • convergence with literature
  • varying parameters


47 of 52

Next Steps

48 of 52

Next steps

Next steps:

  • Ablation study
  • AI language across the world’s languages


49 of 52

Thank You

Contact information

50 of 52

RCC

RCC has great infrastructure:

  • Running on RCC is very doable
  • Outlook for GPU parallelisation
  • And growing: LEEP grant for H100s submitted
  • Currently, vanilla access: up to 4 hrs
  • A100s require special permission

