1 of 25

Towards the identification of joint meaning construal: language and co-speech gestures on focus

Nickil Maveli / University of Edinburgh

Tiago Torrent / Federal University of Juiz de Fora

2 of 25

Both FrameNet Brasil and Red Hen have been investigating how meaning is construed in multimodal communication. While Red Hen has been focusing more on the relation between speech and co-speech gestures, FN-Br has been looking into how frames are evoked by different modalities, especially audio and video.

The Idea

ISGS - Data Science Methods for The Study of Co-Speech Gesture • Jul. 13th, 2022

3 of 25


In a nutshell: identifying joint meaning construal patterns

The Idea


4 of 25

(...) a dynamic process of meaning construction, in which speakers and hearers encode and decode, respectively, some intended meaning in a given communicative context. To do so, they draw on their repertoire of linguistic and conceptual structures, composing and transforming them to build coherent interpretations consistent with the speaker’s lexical, grammatical, and other expressive choices. (Trott et al., 2020)

Construal




6 of 25

Trott et al. (2020) define an incomplete taxonomy of construal dimensions and discuss their implications for how meaning is addressed in NLP.

Dimensions of Construal

an incomplete taxonomy

Langacker, 1993; Talmy, 1988; Croft & Wood, 2000


7 of 25

Does construal really matter in everyday, unpretentious language use?

Construal in Action


8 of 25

Well…

YES!

Construal in Action


9 of 25

A woman figure skater in a blue costume holds her leg in the air by the blade of her skate.

A man and a woman ice skating on a rink.

Mulher patinando em pista de gelo e homem patinando também ao fundo.

Woman skating on ice track and man also skating in the background.

Construal and Multimodality

Source: Flickr 30K dataset


10 of 25

A woman figure skater in a blue costume holds her leg in the air by the blade of her skate.

A man and a woman ice skating on a rink.

Uma patinadora artística vestida a caráter está patinando no gelo segurando a lâmina de um patins.

A figure skater.FEM dressed to character is ice skating while holding the blade of one of her skates

Construal and Multimodality

Source: Flickr 30K dataset


11 of 25

But what about gesture?

12 of 25

Understanding the interactions between lexical prompts and gestures can help us connect human language to hand gestures through multimodal grounding.

Motivation


13 of 25

Perspective

Describes a static scene from a specific vantage point
Essential for understanding spatial language
E.g., “The highway runs along the coast”

Prominence

Denotes the relative attention focused on different components of a static scene
Spatial, temporal, or verbal alternations give different elements of a sentence different salience
E.g., “I rolled the dice” vs. “The dice rolled”

Construal Dimensions


14 of 25

Hand Gesture Types


15 of 25

PATS Dataset

The PATS dataset¹ contains aligned pose, audio, and transcripts.

25 speakers with different styles
15 talk show hosts, 5 lecturers, 3 YouTubers, and 2 televangelists
251 hours of data

¹https://chahuja.com/pats/


16 of 25

Lexical Triggers:

“from-to” mostly denotes a location context
“first-second” mostly denotes a sequential ordering context
“firstly-secondly” mostly denotes a sequential ordering context
“here-then” mostly denotes a positional change context

Lexical Contexts: word tokens from the transcript (transcribed using Google ASR)
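A minimal sketch of how the trigger pairs above could be located in a transcript. The function name `find_triggers`, the whitespace tokenization, and the `max_gap` window are assumptions made for illustration; the slides do not describe the actual matching procedure.

```python
# Trigger pairs listed on the slide.
TRIGGER_PAIRS = [
    ("from", "to"),
    ("first", "second"),
    ("firstly", "secondly"),
    ("here", "then"),
]


def find_triggers(transcript, max_gap=6):
    """Return ((first, second), i, j) for every trigger pair whose second
    word occurs within max_gap tokens after the first word.

    max_gap is a hypothetical window size, not a value from the slides.
    """
    tokens = transcript.lower().split()
    hits = []
    for first, second in TRIGGER_PAIRS:
        for i, token in enumerate(tokens):
            if token != first:
                continue
            # Search only the tokens that immediately follow the first word.
            window = tokens[i + 1 : i + 1 + max_gap]
            if second in window:
                j = i + 1 + window.index(second)
                hits.append(((first, second), i, j))
    return hits


sample = "players can either travel from the u.s. to Mexico by plane"
print(find_triggers(sample))  # [(('from', 'to'), 4, 7)]
```

The token indices returned here would be the lexical context positions of the trigger; a "to" with no preceding "from" inside the window (as in "it's up to you") is not reported.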

Lexical Triggers and Lexical Contexts


17 of 25

Anticipated Hand Gestures

Transcript:

“with Mexico that players can either travel from the u.s. to Mexico by plane or just walked past the wall that still won’t be built it’s up to you you can choose”


18 of 25


True Positive


19 of 25

Anticipated Hand Gestures

Transcript:

“$25,000 do you know how short a flight is from DC to Philadelphia if you tried to watch Thelma and Louise on that flight you wouldn’t meet Louie Susan Sarandon on the bar Tyler so tan prices Medicaid patients should lose their health care but has no problem spending tens of thousands of dollars on private jets and he’s not the only one treasury secretary Steve mnuchin also came”


20 of 25


True Negative


21 of 25

Given a video input, identify whether it contains a hand gesture matching those the speaker produces while uttering the “from-to” lexical trigger in the training video frames. If so, classify the video frame by gesture type (handedness, axis, shape, direction). If not, classify the video frame as “No Gesture”.
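One way to picture the output of this two-step decision is sketched below. The four field names come from the slide; the class name `GestureLabel`, the function `label_frame`, and the example values are hypothetical, since the slides do not list the possible values for each gesture type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GestureLabel:
    """The four gesture attributes named in the objective."""
    handedness: str
    axis: str
    shape: str
    direction: str


def label_frame(gesture: Optional[GestureLabel]) -> str:
    """Return "No Gesture" when nothing was detected, else the attributes."""
    if gesture is None:
        return "No Gesture"
    return " / ".join(
        [gesture.handedness, gesture.axis, gesture.shape, gesture.direction]
    )


# Example values are illustrative only.
example = GestureLabel("right", "horizontal", "open palm", "left-to-right")
print(label_frame(example))
print(label_frame(None))
```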

Objective


22 of 25

True Positives: Video frames corresponding to the start and ending portions of the lexical trigger

True Negatives: Video frames unrelated to the ones found in the True Positives set
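A rough sketch of how the start and ending portions of a trigger might be mapped to video frames. It assumes that word-level timestamps (in seconds) are available for the two trigger words and that the video has a fixed frame rate; the function names and the numbers in the example are hypothetical.

```python
def span_to_frames(start_s, end_s, fps, n_frames):
    """Frame indices covered by the time span [start_s, end_s],
    clipped to the valid range 0 .. n_frames - 1."""
    first = max(0, int(start_s * fps))
    last = min(n_frames - 1, int(end_s * fps))
    return list(range(first, last + 1))


def trigger_frames(first_word_span, second_word_span, fps, n_frames):
    """Frames for the start portion (e.g. "from") and the ending
    portion (e.g. "to") of one lexical trigger."""
    start_part = span_to_frames(*first_word_span, fps, n_frames)
    end_part = span_to_frames(*second_word_span, fps, n_frames)
    return start_part, end_part


# "from" spoken at 0.5-0.75 s, "to" at 1.5-1.75 s, 4 frames per second.
print(trigger_frames((0.5, 0.75), (1.5, 1.75), fps=4, n_frames=10))
# ([2, 3], [6, 7])
```

How the negative frames are sampled is not specified on the slide, so the sketch covers only the frames tied to the trigger.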

Dataset


23 of 25

The classification model comprises three main units:

Positional embedding to give the model access to pixel-order information
Transformer encoder to process the source sequence
Max-pooling layer to keep the most salient features
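The data flow through the three units can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the slides do not say whether the positional embedding is learned or fixed, so a fixed sinusoidal encoding is swapped in here, and the Transformer encoder is left as a pluggable `encoder` function (identity by default) rather than implemented. All function names are hypothetical.

```python
import math


def positional_encoding(n_positions, dim):
    """Fixed sinusoidal position vectors: sine on even feature
    indices, cosine on odd ones (an assumed choice)."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def global_max_pool(sequence):
    """For each feature dimension, keep the largest value over all positions."""
    return [max(column) for column in zip(*sequence)]


def pooled_representation(sequence, encoder=lambda seq: seq):
    """positional embedding -> encoder -> max-pooling.

    `encoder` stands in for the Transformer encoder; the default
    identity function is a placeholder, not the real model.
    """
    dim = len(sequence[0])
    positions = positional_encoding(len(sequence), dim)
    embedded = [
        [x + p for x, p in zip(features, pos_row)]
        for features, pos_row in zip(sequence, positions)
    ]
    encoded = encoder(embedded)
    return global_max_pool(encoded)


# Three positions with four features each yield one 4-dimensional vector.
print(len(pooled_representation([[0.0] * 4 for _ in range(3)])))  # 4
```

The pooled vector is what a final classification layer would consume; that layer is outside the three units listed above and is omitted.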

Model Architecture


24 of 25

Use a frame-semantic parser (SLING, Open-SESAME, etc.) to extract lexical units and the frames they evoke.
Rely on a human-in-the-loop strategy for validation: ask annotators whether the highlighted text surrounding the lexical trigger evokes an ordering of items, positions, temporal information, and so on.

Future Directions


25 of 25

Thank you!

@TorrentTiago

tiagotorrent.com

@nickilmaveli

nickilmaveli.com