Before Ablation: Groundwork for Lexical Alignment Tests
Research group members:
Presentation speakers:
RCC specialist:
07 Nov 2025
@SC-AI Seminar
Three core takeaways:
Background
language alignment & general alignment
Typical RLHF procedures
© NYT & WaPo
Big Plan
→ Issue: the literature so far always involves a manual-curation step at some point
Another procedure
LAS Score
XM
Lexical Alignment Score
Core idea:
Implementation procedures:
Our work investigates six public model families:
Estimation:
Continuation S: In this study, we formulate a clear hypothesis regarding the correlation between social media use and attention span, proposing that frequent exposure to short-form content reduces sustained focus over time.
Lemmatized continuation S: in, this, study, we, formulate, a, clear, hypothesis, regard, the, correlation, between, social, media, use, and, attention, span, propose, that, frequent, exposure, to, short, form, content, reduce, sustain, focus, over, time
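The tokenise-and-lemmatise step can be sketched as below. The deck does not name the lemmatiser actually used (a production pipeline would use a POS-aware tool such as spaCy or NLTK); the tiny exception table here covers only the example sentence and is purely illustrative.

```python
import re

# Illustrative exception table for the example sentence only; a real
# pipeline would use a POS-aware lemmatiser (e.g. spaCy), which the
# deck does not specify.
LEMMAS = {"regarding": "regard", "proposing": "propose",
          "reduces": "reduce", "sustained": "sustain"}

def lemmatise_sentence(sentence: str) -> list[str]:
    # Lowercase, then split on non-letters so "short-form" yields
    # the two tokens "short" and "form".
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [LEMMAS.get(t, t) for t in tokens]

S = ("In this study, we formulate a clear hypothesis regarding the "
     "correlation between social media use and attention span, proposing "
     "that frequent exposure to short-form content reduces sustained "
     "focus over time.")
print(", ".join(lemmatise_sentence(S)))
```

Run on the continuation S, this reproduces the lemma list shown above.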
Lemmatized continuation S using windowed document prevalence (window size = 10, window index = 1): in, this, study, we, formulate, a, clear, hypothesis, regard, the, correlation, between, social, media, use, and, attention, span, propose, that, frequent, exposure, to, short, form, content, reduce, sustain, focus, over, time
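Windowed document prevalence can be sketched as follows. The deck gives only the parameters (window size 10, window index 1); the window semantics assumed here (consecutive, non-overlapping blocks of sentences, zero-based indexing) and the relative-frequency definition are our assumptions.

```python
from collections import Counter

def windowed_prevalence(doc_lemmas, window_size=10, window_index=1):
    """Relative frequency of each lemma within one window of a document.

    doc_lemmas: list of per-sentence lemma lists for the whole document.
    Assumption: windows are consecutive, non-overlapping blocks of
    `window_size` sentences, indexed from zero.
    """
    start = window_index * window_size
    window = doc_lemmas[start:start + window_size]
    counts = Counter(lemma for sent in window for lemma in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Toy document of 20 single-lemma "sentences"; window index 1 covers
# sentences 10-19.
doc = [["alpha"]] * 10 + [["beta"], ["beta"], ["gamma"]] + [["beta"]] * 7
prev = windowed_prevalence(doc, window_size=10, window_index=1)
```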
For a lemma type w and a model M ∈ {B, I}, we define the lemma-type-level LAS, LAS_w:
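The formula itself is not reproduced in the slide text. Purely to illustrate the shape a prevalence-comparison score can take, the sketch below assumes LAS_w under model M ∈ {B, I} is a smoothed log-ratio of the lemma's prevalence in model-M continuations to its prevalence in the reference documents; this exact definition is our assumption, not the deck's.

```python
import math

def lemma_type_las(prev_model: float, prev_ref: float, eps: float = 1e-9) -> float:
    """Hypothetical lemma-type-level LAS for a model M in {B, I}.

    Assumed form (not the deck's actual formula): log-ratio of the
    lemma's prevalence in model-M continuations to its prevalence in
    the reference documents; eps smooths away log(0).
    """
    return math.log((prev_model + eps) / (prev_ref + eps))
```

Under this assumed form, a lemma the model over-uses relative to the documents scores above zero, and one it under-uses scores below zero.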
Scoring:
Following the estimation procedure, we define the per-token contribution for each lemmatised token t:
Each token's Lexical Alignment Score (LAS) is then defined as:
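As a sketch of how such token-level scores can be assembled (the combination used here, an instruct-minus-base difference of the lemma-type scores, is our assumption, not the deck's formula):

```python
def token_las(lemma: str, las_base: dict, las_instruct: dict) -> float:
    """Hypothetical token-level LAS: a lemmatised token t inherits its
    lemma's type-level scores, combined here as the instruct-minus-base
    difference (assumed form; lemmas unseen in a model contribute zero)."""
    return las_instruct.get(lemma, 0.0) - las_base.get(lemma, 0.0)
```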
Using each token's LAS as the fundamental unit, we can derive LAS scores at several levels: aggregating the token-level scores via an L2 mean quantifies the alignment shift at the sentence, document, and corpus (model) levels.
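The L2-mean aggregation described above can be sketched as:

```python
import math

def l2_mean(scores):
    """L2 (quadratic) mean of token-level LAS scores: the square root
    of the mean of squares. Applied over a sentence's tokens it yields
    the sentence-level score; over a document's or a corpus's tokens,
    the document- and corpus (model)-level scores."""
    if not scores:
        return 0.0
    return math.sqrt(sum(s * s for s in scores) / len(scores))

sentence_score = l2_mean([0.3, -0.4, 0.0])  # magnitude of shift for one sentence
```

The squaring makes the aggregate sensitive to the magnitude of per-token shifts regardless of their sign.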
Technical Demo
Demo
Thanks to RCC specialist Jose Hernandez for his assistance!
Outlook on distributed multi-GPU computing
We have been notified that the Torch-based library accelerate is supported on RCC. However, this library is rarely used by RCC users, so a series of trials will be necessary to verify its compatibility and functionality.
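As a starting point for those trials, multi-GPU runs with accelerate are typically driven by a config file along these lines (the values below are generic placeholders, not RCC-specific settings; a real config is generated interactively with `accelerate config`):

```yaml
# Sketch of a single-node, multi-GPU (DDP) accelerate config.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4        # one process per GPU; placeholder value
mixed_precision: bf16
```

A training script whose model, optimizer, and dataloaders are wrapped with `Accelerator.prepare(...)` is then started with `accelerate launch train.py`.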
More Demo
TJ
Programming with .zips
Continuations with Lambda
Analysis and Results
Lists
Interpretation
Validation
Next Steps
Next steps:
Thank You
Contact information
RCC
RCC has great infrastructure:
Outlook on distributed multi-GPU computing
Transformer models demand substantial memory and computing power for both training and inference.
Our previous work mostly required model inference, which is less resource-intensive. However, to carry out the planned ablation study, fine-tuning an instruction model is unavoidable.
Even partial fine-tuning of a 7B LLM in BF16 numerical precision requires around 14-24 GB of VRAM.
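That range can be sanity-checked with back-of-the-envelope arithmetic: at 2 bytes per parameter, the BF16 weights of a 7B model alone occupy about 14 GB, and gradients plus optimizer state for the trained parameter subset account for much of the rest. The estimator below uses our own per-component assumptions (BF16 gradients, FP32 Adam moments, activations ignored), so it is a lower bound, not the deck's calculation:

```python
def vram_estimate_gb(n_params_billion: float, trained_frac: float) -> float:
    """Rough VRAM lower bound for BF16 partial fine-tuning.

    Assumptions (ours): 2 bytes/param for BF16 weights over all
    parameters; 2 bytes/param BF16 gradients and 8 bytes/param FP32
    Adam moments over the trained fraction only; activations and
    framework overhead are ignored.
    """
    n = n_params_billion * 1e9
    weights = n * 2                    # BF16 weights, all parameters
    grads = n * trained_frac * 2       # BF16 gradients, trained subset
    adam = n * trained_frac * 8        # FP32 first/second moments
    return (weights + grads + adam) / 1e9

weights_only = vram_estimate_gb(7, 0.0)   # ~14 GB: weights alone
partial = vram_estimate_gb(7, 0.1)        # ~21 GB with 10% of params trained
```

With roughly 10% of parameters trained, the estimate lands inside the 14-24 GB range quoted above.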