1 of 55

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-03-12

Spring 2026

Lecture 11: LLM Decoding / Semantic Parsing

2 of 55

Today’s Plan

  • LLM Decoding Overview
  • Sampling
  • Controllable Generation
  • Semantic Parsing

3 of 55

What is inside an LLM?

  • A model defines a conditional probability distribution over the next token given the context so far
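
In symbols (a standard autoregressive factorization; the notation is assumed here, not taken from the slides), the model scores an output y given input x as

```latex
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid x, y_{<t})
```

where each factor is a locally normalized next-token distribution.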

4 of 55

LMs are locally normalized

  • A sequence can start with low-probability tokens yet still have high overall probability
  • As a result, inference with global constraints is hard (see the worked example below)
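
A hypothetical two-candidate example: suppose candidate A has step probabilities 0.4 × 0.9 = 0.36, while candidate B has 0.6 × 0.5 = 0.30. Greedy decoding commits to B's first token (0.6 > 0.4) and can never recover the globally more probable sequence A, because each step's distribution is normalized on its own.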

5 of 55

Probability distribution -> Hallucination

  • The model generally assigns non-zero probability to any output, including incorrect ones

6 of 55

Our goal: Get “Good” Outputs

  • How do we pick a “good” output given the model’s probability distribution?
  • One lever: changing the decoding algorithm

7 of 55

Today’s Plan

  • LLM Decoding Overview
  • Sampling
  • Controllable Generation
  • Semantic Parsing

8 of 55

Recap: Greedy Decoding

  • Greedy Decoding: compute the argmax over the entire vocabulary at every step (sketch below)
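
A minimal sketch of greedy decoding; `next_token_probs` is an assumed stand-in for one forward pass of the model, returning a probability distribution over the vocabulary:

```python
import numpy as np

def greedy_decode(next_token_probs, prompt_ids, eos_id, max_len=50):
    """Greedy decoding: at each step, append the single most probable
    next token (argmax over the entire vocabulary)."""
    ids = list(prompt_ids)
    for _ in range(max_len):
        probs = next_token_probs(ids)  # shape: (vocab_size,)
        tok = int(np.argmax(probs))    # the most probable token
        ids.append(tok)
        if tok == eos_id:              # stop at end-of-sequence
            break
    return ids
```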

9 of 55

Recap: Beam Search

  • At every step, keep track of the k most probable partial translations
  • Score of each hypothesis = log probability of sequence so far

  • Not guaranteed to be optimal, but more efficient than exhaustive search
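
A simplified beam-search sketch under the same assumed `next_token_probs` interface; real implementations batch these forward passes and often length-normalize the scores:

```python
import numpy as np

def beam_search(next_token_probs, prompt_ids, eos_id, k=4, max_len=50):
    """Keep the k most probable partial hypotheses; each is scored
    by the log probability of the sequence so far."""
    beams = [(0.0, list(prompt_ids), False)]  # (score, tokens, finished)
    for _ in range(max_len):
        candidates = []
        for score, ids, done in beams:
            if done:                           # finished beams carry over
                candidates.append((score, ids, True))
                continue
            logp = np.log(next_token_probs(ids) + 1e-12)
            for tok in np.argsort(logp)[-k:]:  # k best continuations
                candidates.append((score + logp[tok], ids + [int(tok)],
                                   int(tok) == eos_id))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(done for _, _, done in beams):
            break
    return beams[0][1]  # the highest-scoring hypothesis
```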

10 of 55

Highest probability always best?

  • Outputs with low probability tend to be worse

  • The quality difference between the top outputs is unclear…

11 of 55

Highest probability always best?

  • Many outputs are meaningful!

12 of 55

Ancestral Sampling

  • Samples exactly from the model’s distribution!
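
A sketch, again assuming a `next_token_probs` helper: because each token is drawn from the model's full next-token distribution, a completed sequence is an exact sample from the model.

```python
import numpy as np

def ancestral_sample(next_token_probs, prompt_ids, eos_id, max_len=50):
    """Ancestral sampling: draw each next token from the model's
    full distribution instead of taking the argmax."""
    rng = np.random.default_rng()
    ids = list(prompt_ids)
    for _ in range(max_len):
        probs = next_token_probs(ids)
        tok = int(rng.choice(len(probs), p=probs))  # sample, don't argmax
        ids.append(tok)
        if tok == eos_id:
            break
    return ids
```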

13 of 55

Ancestral Sampling

  • Long-tail problem
  • Even if each individual token in the long tail has low probability, these small probabilities add up…

14 of 55

Top-K Sampling

  • Only sample from the k most probable tokens
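
A sketch of the per-step filter; the sampling loop is the same as ancestral sampling, only the distribution is truncated first:

```python
import numpy as np

def top_k_filter(probs, k=50):
    """Keep only the k most probable tokens, then renormalize."""
    kept = np.argsort(probs)[-k:]      # indices of the top-k tokens
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()
```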

15 of 55

Top-p Sampling

  • Also called nucleus sampling
  • Only sample from the smallest set of tokens covering the top p probability mass
  • Ignore the long tail
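
A sketch of the nucleus filter; the cutoff is the smallest prefix of the sorted distribution whose cumulative mass reaches p:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens with total
    mass >= p; zero out the long tail and renormalize."""
    order = np.argsort(probs)[::-1]            # most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # size of the nucleus
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()
```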

16 of 55

Epsilon Sampling

  • Only sample tokens whose probability is at least a fixed threshold ε; prune everything below it (Hewitt et al., 2022)
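
A sketch of the epsilon filter; the default threshold value is illustrative:

```python
import numpy as np

def epsilon_filter(probs, eps=3e-4):
    """Drop every token whose probability is below eps, then
    renormalize; unlike top-k/top-p, the cutoff is absolute."""
    filtered = np.where(probs >= eps, probs, 0.0)
    if filtered.sum() == 0.0:          # fall back if all tokens pruned
        filtered[np.argmax(probs)] = 1.0
    return filtered / filtered.sum()
```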

17 of 55

Contrastive Decoding

  • Smaller (“amateur”) models make different mistakes than larger (“expert”) models

  • Choose outputs that the “expert” finds much more likely than the “amateur”
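
A simplified sketch of one decoding step; the published method (Li et al., 2023) uses an adaptive plausibility cutoff, approximated here with a fixed `alpha`:

```python
import numpy as np

def contrastive_step(expert_probs, amateur_probs, alpha=0.1):
    """Among tokens the expert finds plausible (within a factor
    alpha of its best token), pick the token whose probability
    under the expert most exceeds its probability under the amateur."""
    plausible = expert_probs >= alpha * expert_probs.max()
    scores = np.log(expert_probs + 1e-12) - np.log(amateur_probs + 1e-12)
    scores[~plausible] = -np.inf       # never pick implausible tokens
    return int(np.argmax(scores))
```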

18 of 55

Today’s Plan

  • LLM Decoding Overview
  • Sampling
  • Controllable Generation
  • Semantic Parsing

19 of 55

Different types of Constraints

  • Low-level constraints: structured output, length, etc.


20 of 55

Different types of Constraints

  • High-level constraints: semantics, avoiding hallucination, etc.


21 of 55

Prompting is not enough!


22 of 55

Constrained decoding: Manipulate logits

  • Set P(“climb” | X, y) = 0?
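
A sketch of the mechanism: setting a token's logit to -inf makes its post-softmax probability exactly 0, so the sampler can never choose it at this step:

```python
import numpy as np

def ban_tokens(logits, banned_ids):
    """Constrained decoding via logit manipulation: banned tokens
    get probability 0 after the softmax."""
    logits = logits.copy()
    logits[list(banned_ids)] = -np.inf
    return logits
```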


23 of 55

Constrained decoding: Rejection Sampling

  • Generate many samples, then reject those that violate the constraint
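
A sketch; `generate` and `satisfies_constraint` are assumed callables, not APIs from the slides. The approach is simple but wasteful when valid outputs are rare under the model:

```python
def rejection_sample(generate, satisfies_constraint, max_tries=100):
    """Draw complete unconstrained samples and keep the first one
    that satisfies the constraint."""
    for _ in range(max_tries):
        candidate = generate()         # one full model sample
        if satisfies_constraint(candidate):
            return candidate
    return None                        # no valid sample found
```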


24 of 55

Today’s Plan

  • LLM Decoding Overview
  • Sampling
  • Controllable Generation
  • Semantic Parsing

25 of 55

Semantic Parsing


26 of 55

Semantic Parsing: QA


27 of 55

Semantic Parsing: Instructions


28 of 55

Language to Meaning


29 of 55

Neural Semantic Parsing


30 of 55

Text-to-SQL Semantic Parsing

[Diagram of the task:]

Input: a natural language question (“How many cities have at least 25,000 people?”) together with a database schema (columns: City, Population, Area)

Output: a SQL query, SELECT count(c1) FROM w WHERE c2 >= 25000, whose execution result is 4

31 of 55

Text-to-SQL Semantic Parsing: Evaluation Metrics

[Same diagram: the question “How many cities have at least 25,000 people?” and schema (City, Population, Area) as input; the SQL query SELECT count(c1) FROM w WHERE c2 >= 25000 and its execution result 4 as output.]

  • Logical Form Accuracy: does the predicted SQL query match the gold query?
  • Execution Accuracy: does executing the predicted query return the gold result?
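
A hedged sketch of the two metrics; `execute` stands in for running a query against the database and is an assumption, not an API from the slides:

```python
def logical_form_match(pred_sql, gold_sql):
    """Logical form accuracy: credit only queries that match the
    annotated gold SQL (here, exact string match)."""
    return pred_sql.strip() == gold_sql.strip()

def execution_match(pred_sql, gold_sql, execute):
    """Execution accuracy: credit any query that yields the gold
    result, even if written differently from the gold SQL."""
    return execute(pred_sql) == execute(gold_sql)
```

Note the trade-off: logical form accuracy is stricter (it rejects correct but differently written queries), while execution accuracy can be fooled by a wrong query that happens to return the right value.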

32 of 55

Text-to-SQL Semantic Parsing: Supervision

[Diagram of the available supervision:]

Natural Language Question (input): “How many cities have at least 25,000 people?”, with database schema (City, Population, Area)

SQL Query (supervision): SELECT count(c1) FROM w WHERE c2 >= 25000

Execution Result: 4

33 of 55


34 of 55

On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries

Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Daumé III, Lillian Lee

In Findings of EMNLP (2020)

35 of 55

Text-to-SQL Semantic Parsing: Supervision

[Same diagram, with one more supervision signal:]

Natural Language Question: “How many cities have at least 25,000 people?”, with database schema (City, Population, Area)

SQL Query: SELECT count(c1) FROM w WHERE c2 >= 25000 (execution result: 4)

+ Alignments between question tokens and SQL tokens

36 of 55

Dataset: SQUALL = “SQL + QUestion pairs ALigned Lexically”

  • Built on the existing dataset of WikiTableQuestions (Pasupat and Liang, 2015)

  • Collected expert annotations of logical forms and lexical alignments for more than 11k training instances

  • We also experimented with automatically-derived alignments


37 of 55

Annotation Interface


38 of 55

Annotation Interface


39 of 55

Alignment Annotations

  • ~half of the question tokens and ~90% of the SQL tokens are aligned (excluding basic keywords such as SELECT, FROM, WHERE, …)

  • Examples of frequently aligned segments


40 of 55

Model

  • SEQ2SEQ+
    • Our seq2seq base model with attention and copying mechanisms
    • Competitive with a state-of-the-art text-to-SQL semantic parser (Suhr et al., 2020), evaluated on the Spider dataset (Yu et al., 2018)


41 of 55

Model: Encoder

[Encoder diagram: word-embedding lookup over the natural language question (“How many cities have …”) and the table schema columns (City, Population, Area, Region); a bi-directional LSTM encodes the question, and bi-directional LSTMs encode each column name (taking the Bi-LSTM final states); attention and self-attention layers connect the question and schema representations.]

42 of 55

Model: Encoder w/ BERT

[Encoder diagram with BERT: the question and schema are jointly encoded as “[CLS] How many cities have … [SEP] City [SEP] Population [SEP] Area [SEP] … [SEP]”; the BERT outputs feed bi-directional LSTMs connected by attention.]

43 of 55

Model: Decoder

[Decoder diagram: a decoder LSTM attends over the encoded natural language question and table schema; starting from <START>, it generates the SQL tokens SELECT, count, …]

44 of 55

Model: Decoder

[Decoder diagram: at each step, an MLP over the decoder LSTM state first chooses the output type: a SQL keyword, a string (STR), or a column (COL). Keywords are predicted by an MLP, strings are produced by a copy mechanism over question tokens, and columns by a copy mechanism over columns. Example: <START> → SELECT → count.]

45 of 55

Model

  • SEQ2SEQ+
    • Our seq2seq base model with attention and copying mechanisms
    • Competitive with a state-of-the-art text-to-SQL semantic parser (Suhr et al., 2020), evaluated on the Spider dataset (Yu et al., 2018)

  • ALIGN
    • Same model architecture as SEQ2SEQ+, same inference steps
    • Two training strategies:
      • Supervised attention
      • Column Prediction


46 of 55

Model: Supervised Attention

  • Previously used in machine translation (Liu et al., 2016; Mi et al., 2016)
  • Decoder attention as an example; similar for encoder attention


47 of 55

Model: Supervised Attention

  • Previously used in machine translation (Liu et al., 2016; Mi et al., 2016)
  • Decoder attention as an example; similar for encoder attention

Question tokens: How many cities have at least 25,000 people ?

Target SQL: SELECT count ( c1 ) FROM w WHERE c2 >= 25000

[The diagram marks the SQL token the decoder is about to predict.]

48 of 55

Model: Supervised Attention

  • Previously used in machine translation (Liu et al., 2016; Mi et al., 2016)
  • Decoder attention as an example; similar for encoder attention

Question tokens: How many cities have at least 25,000 people ?

Target SQL: SELECT count ( c1 ) FROM w WHERE c2 >= 25000

[The diagram marks the SQL token the decoder is about to predict.]

Attention weights over the question tokens: 0.3 0.25 0.05 0.1 0.05 0.05 0.05 0.1 0.05

49 of 55

Model: Supervised Attention

  • Previously used in machine translation (Liu et al., 2016; Mi et al., 2016)
  • Decoder attention as an example; similar for encoder attention

Question tokens: How many cities have at least 25,000 people ?

Target SQL: SELECT count ( c1 ) FROM w WHERE c2 >= 25000

[The diagram marks the SQL token the decoder is about to predict.]

Attention weights over the question tokens: 0.3 0.25 0.05 0.1 0.05 0.05 0.05 0.1 0.05

Alignment vector (from the annotations): 0.5 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0

50 of 55

Model: Supervised Attention

  • Previously used in machine translation (Liu et al., 2016; Mi et al., 2016)
  • Decoder attention as an example; similar for encoder attention

Question tokens: How many cities have at least 25,000 people ?

Target SQL: SELECT count ( c1 ) FROM w WHERE c2 >= 25000

[The diagram marks the SQL token the decoder is about to predict.]

Attention weights over the question tokens: 0.3 0.25 0.05 0.1 0.05 0.05 0.05 0.1 0.05

Alignment vector (from the annotations): 0.5 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Loss: penalize the mismatch between the attention weights and the alignment vector (see the formulation below)

Final loss: a linear combination of the attention-supervision loss and the seq2seq loss
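
One common instantiation (an assumption; the slide does not pin down the exact form): treat the alignment vector a*_t as a target distribution for the attention weights α_t, penalize their cross-entropy, and mix it into the usual seq2seq objective:

```latex
\mathcal{L}_{\text{attn}} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{i} a^{*}_{t,i} \log \alpha_{t,i},
\qquad
\mathcal{L} = \mathcal{L}_{\text{seq2seq}} + \lambda \, \mathcal{L}_{\text{attn}}
```

where λ weights the attention-supervision term.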

51 of 55

Results on WikiTableQuestions

  • Unsurprisingly, strong supervision beats previous weakly-supervised models on WTQ’s test set

[Bar chart of execution (EXE) accuracy on the WTQ test set, comparing Previous Best, ALIGN (single), and ALIGN (ensemble), with and without BERT; ALIGN improves over the previous best by +6.2 and +8.4 in the two settings.]

52 of 55

Alignment Annotations Provide Further Improvements

[Bar chart of logical form (LF) accuracy: SEQ2SEQ+ vs. ALIGN trained with automatic alignment, supervised decoder attention, and supervised encoder attention; ALIGN improves over SEQ2SEQ+ by +4.4.]

53 of 55

Analysis

  • Comparing ALIGN with SEQ2SEQ+

Absolute improvements of ALIGN over SEQ2SEQ+:

  • Logical form accuracy: +4.4
  • Template accuracy: +2.0
  • Column accuracy: +4.9

On unseen templates:

  • Logical form accuracy: +10.6
  • Execution accuracy: +12.5

54 of 55

Unrealized Potential

[Bar chart of LF accuracy: ALIGN improves over SEQ2SEQ+ by +4.4, while oracle attention improves by +23.9, leaving much of the potential unrealized.]

55 of 55

Interim Summary

  • We collect and release a large-scale text-to-SQL semantic parsing dataset with lexical alignment annotations

  • Models trained with lexical alignments improve over strong baselines by 4.4% logical form accuracy

  • There is still large unrealized potential
