1 of 37

Speech2Action: Cross-modal supervision for Action Recognition


CVPR 2020

Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

2 of 37

Goal: Human Centric Action Recognition

answer phone

dig

hug

clink glasses

fall down

serve

3 of 37

Labelling actions in videos is extremely expensive

  • Manual annotation costs millions of $$$

  • Due to the ambiguity and combinatorial space of human activities, it is hard to exhaustively pre-define and collect all actions

4 of 37

Different Domains

Internet videos (Kinetics) vs. edited films (AVA): each domain must be manually labelled separately

Pretrained: Kinetics

Applied to: Movie (edited video)

5 of 37

Can we use speech as a form of supervision?

Sometimes, people speak about their actions…

6 of 37

Come sit down. Are you okay?

Jed: I gotta sit down.

Please sit down.

Uh, well, come here, sit down.

Uh, Niles, please, would you stop hovering and just sit down?

I find a dozen reasons not to sit down and write.

7 of 37

"Come on, come on, let's dance and shout now."

Uh, Cliff, let's dance.

EIGHTBALL, LET'S DANCE.

Come on, everyone, let's dance

AND I WANT TO DANCE WITH CHEAP WOMEN.

Come on, let's dance.

8 of 37

Can I kiss you before I go?

CAN I KISS YOU?

Can I kiss somebody right now?

Can I kiss the bride?

-[ INHALES DEEPLY ] -CAN I KISS YOU?

- CAN I KISS YOU? - NO.

9 of 37

ALEX: All right, let's take a photo here.

[Jeff] Will you take a photo of us?

Can I take a photo? Please?

URS, TAKE A PHOTO OF US.

I want to take a photo. Give me an action pose.

"TAKE A PHOTO WITH A STRANGER IN A PHOTO BOOTH."

10 of 37

Instructional Videos

Lifestyle Vlogs

HowTo100M, Miech et al.

COIN, Tang et al.

VLOG: Fouhey et al.

Visible Actions: Ignat et al.

11 of 37

What about the more general domain of movies and TV shows?

12 of 37

Often, however, the speech is completely unrelated

  • Speak about actions in the past, or actions that will happen in the future
  • Narrate a story about actions that have occurred off screen
  • Many atomic actions like sit, stand, etc. occur frequently with all kinds of speech segments

We want to learn when the speech is indicative of an action, and which action that is likely to be

13 of 37

Movie Screenplays

Contain both speech segments and scene directions with actions

Speech

Actions

Scene Directions

14 of 37

Movie Screenplays

www.imsdb.com: The Internet Movie Script Database

Contain both speech segments and scene directions with actions

  • Diversity in genre: Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Short, Sport, Thriller, War, Western

  • Diversity in time (1980 - 2018)

# scripts: 1.07K
# scene descriptions: 539K
# speech segments: 595K
# total sentences: 2.5M
# words: 21M

15 of 37

Retrieve speech segments near verbs

  • Obtain a list of verbs present in the screenplays
  • Convert each verb to all verb conjugation forms:
    • kiss, kisses, kissed, etc.

  • Retrieve speech segments based on proximity and assign them to the verb class

Hello, precious - kiss
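The retrieval step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact pipeline: the conjugation table, the window size, and the `(kind, text)` screenplay representation are all assumptions for the example.

```python
import re

# Illustrative conjugation table; in the paper every verb found in the
# screenplays is expanded to all of its conjugated forms.
CONJUGATIONS = {
    "kiss": ["kiss", "kisses", "kissed", "kissing"],
    "dance": ["dance", "dances", "danced", "dancing"],
}

def mine_pairs(lines, window=2):
    """Pair speech segments with verbs from nearby scene directions.

    `lines` is an ordered list of ("speech" | "direction", text) tuples.
    A speech line within `window` lines of a scene direction containing
    a verb form is assigned that verb as a weak label.
    """
    pairs = []
    for i, (kind, text) in enumerate(lines):
        if kind != "direction":
            continue
        words = set(re.findall(r"[a-z]+", text.lower()))
        for verb, forms in CONJUGATIONS.items():
            if words & set(forms):
                # collect speech lines within the proximity window
                lo, hi = max(0, i - window), min(len(lines), i + window + 1)
                for kind2, text2 in lines[lo:hi]:
                    if kind2 == "speech":
                        pairs.append((text2, verb))
    return pairs

script = [
    ("direction", "He leans in and kisses her."),
    ("speech", "Hello, precious."),
    ("speech", "I love you."),
]
print(mine_pairs(script))
# → [('Hello, precious.', 'kiss'), ('I love you.', 'kiss')]
```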

16 of 37

This gives us speech-action pairs for different verbs

Screenplays contain both speech segments and scene directions with actions. Mining these yields speech-action pairs, e.g.:

  speech: "Hello, it's me" → action: [answers] phone

  speech: "Thanks for calling so soon" → action: [answers] phone

  speech: "Hello Dad, are you still there?" → action: [answers] phone

17 of 37

Train a Speech2Action model on this paired data

A Speech2Action classifier is trained on the speech-action pairs mined from movie screenplays:

  speech: "Hello, it's me" → action: [answers] phone

  speech: "Thanks for calling so soon" → action: [answers] phone

  speech: "Hello Dad, are you still there?" → action: [answers] phone

18 of 37

Use BERT backbone for Speech2Action

Speech → Verb class

  "Ya nothin' boy!" → push

  "Look over there" → point

BERT model pretrained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)

Fine-tune on our IMSDB dataset of movie scripts

BERT (Bidirectional Encoder Representations from Transformers)

Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." (2018).

19 of 37

Most confident predictions

Class label: PHONE

Hello

Hello

May I have the number for Doctor George Shannan

Hello hello

Honey I asked you not to call unless what why

[message beep] hey, it's me

He’s not answering his phone

Hello it’s me

Class label: KISS

One more kiss

Give me a kiss

Good night my darling

I love you my darling

No one had ever kissed me there before

Goodnight angel my sweet boy

Since when I can't kiss my sister-in-law

I love you

Class label: DANCE

Come on, I'll take a break and we'll all dance

Ladies and Gentlemen the first dance

I don't feel like dancing

Wanna dance

She's a beautiful dancer

I'm about to ask this lady for a dance

Waddaya say you wanna dance

Class label: PUSH

Let go of me

Get the fuck off me

If you're standing strong I can't even push you

Get out

Get out

Get out

Get away from those

Get away from those

20 of 37

Unlabelled Video Clip from a Movie

Caption: Hello, it’s me

Speech2Action

classifier

Weak label: [answer] phone

We apply the Speech2Action model to the closed captions of unlabelled videos

At this stage, the text screenplays are no longer needed

21 of 37

Apply to a large movie corpus

  • Applied the classifier to 220K movies
    • For each movie, obtain closed captions (same as assuming we have perfect ASR)
    • Convert closed captions to sentences using the NLTK toolkit [188M sentences]
    • Apply the BERT classifier to every single sentence
    • Retrieve the most confident segments for each class (softmax as a confidence score)

At this stage, the text screenplays are no longer needed
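The labelling stage above can be sketched as follows. This is a toy illustration under stated assumptions: a regex splitter stands in for NLTK's `sent_tokenize`, `toy_scorer` stands in for the trained BERT classifier's logits, and `CLASSES` is an illustrative subset of the 18 verb classes, not the paper's list.

```python
import math
import re

CLASSES = ["phone", "kiss", "dance", "run"]  # illustrative subset

def split_sentences(captions):
    # Stand-in for the NLTK sentence tokenizer used in the pipeline.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", captions) if s.strip()]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def weak_labels(captions, score_sentence, threshold=0.9):
    """Keep only sentences the classifier labels with high confidence,
    using the softmax score as the confidence measure."""
    labelled = []
    for sent in split_sentences(captions):
        probs = softmax(score_sentence(sent))
        conf = max(probs)
        if conf >= threshold:
            labelled.append((sent, CLASSES[probs.index(conf)], conf))
    return labelled

# Toy scorer: high "phone" logit when the word "hello" appears.
def toy_scorer(sent):
    return [5.0 if "hello" in sent.lower() else 0.0, 0.0, 0.0, 0.0]

# Keeps only the confident "phone" sentence; the second one is dropped.
print(weak_labels("Hello, it's me. They ran away.", toy_scorer))
```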

22 of 37

run

mike, run, run!

he was running after me.

he is running away.

Chase him!

They ran into the woods.

don’t move, hey!

23 of 37

phone

[ beeps ] hello.

rebekah is not answering her calls.

hey, it's me.

(phone line ringing) hello.

[ phone beeping ] dad, are you there ?

skinner's not answering his phone.

24 of 37

phone

next caller on the phone

Can you please connect me to the tip line

next message

call me back please

  • different types of phones

25 of 37

kiss

I will burn for this

[kisses] i love you.

[dawn] i love you.

i love you infinity.

i love you, angel.

26 of 37

hit

i'm gonna smash that camera to bits!

you gotta hit him in the solar plexus!

backhand, snap down, round off reach into the back handspring, and then tuck.

hit him right between the eyes.

27 of 37

drive

he made a u turn on an empty street.

camaro headed east on ocean park.

he got back in his car and chased after her.

they stopped under the brooklyn queens expressway.

my wife gets in the car i start driving down my block to the corner.

28 of 37

point

there's the first guy right there.

it's that one down there.

that one over there.

that one in the corner is clutch.

that is over there.

29 of 37

shoot

you got 10 seconds to come out, or we start shooting.

with the sharps carbine, that is within range.

you need more arc in that shot.

kincaid ordered not to shoot.

30 of 37

Result - many examples of rare actions

Log scale!


  • Long tail of natural distribution of actions
  • We mine 2 orders of magnitude more training examples for rare/mid classes in AVA

AVA dataset

31 of 37

Train a visual classifier on weakly labelled data

  • Train an S3D-G [1] model on the weakly labelled movie clips (visual frames only)
  • 18-way softmax

[1] Xie et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. ECCV, 2018

captions are no longer needed

S3D-G CNN Classifier

[answer] phone

32 of 37

Datasets

AVA

  • 1.62M action labels for 80 ‘atomic’ action classes
  • Movie Data

HMDB-51

  • 6766 realistic and varied video clips from 51 action classes.
  • User-uploaded data (e.g. YouTube)

33 of 37

Results - directly evaluate on AVA (no fine-tuning)

  • For 8 out of 14 classes, exceed fully supervised performance without a single training example

  • With fine-tuning, exceed supervised performance for all classes

  • ‘Hug’ is often confused with ‘kiss’ (‘I love you’)

34 of 37

Results - transfer learning to HMDB51

Method               | Pretraining Data    | Top-1 Acc. (%)
---------------------|---------------------|---------------
Scratch              | -                   | 41.2
Shuffle&Learn        | UCF101              | 35.8
OPN                  | UCF101              | 23.8
ClipOrder            | UCF101              | 30.9
Wang et al.          | Kinetics            | 33.4
3DRotNet             | Kinetics            | 40.0
DPC                  | Kinetics            | 35.7
CBT                  | Kinetics            | 44.6
Korbar et al.        | Kinetics            | 53.0
ImageNet Pretraining | ImageNet            | 54.7
DisInit (RGB)        | Kinetics + ImageNet | 54.8
Ours                 | S2A-mined           | 58.1

Outperforms training from scratch (+18%), self-supervised works (+14%), and works that rely on ImageNet supervision (+3%).

35 of 37

More abstract actions

FOLLOW:

  come right behind me!

  follow me quick!

  after you

COUNT:

  thirty six thousand four hundred, five hundred

  two quarters, three dimes, one nickel, two pennies

  twenty four thousand four hundred

36 of 37

Future Work

  • Extending to more literary works such as TV show screenplays, books, etc.
  • Action Localisation

  • People talk about objects, scenes as well

37 of 37

Thank you!

EAT:

  Since I moved here, I actually like food again!

  have you ever had szechwan cuisine before?

  This chicken is very tasty

DRIVE:

  because if you are just drive

  babe, the speed limit is 120

  i always drive the car on saturday