Speech2Action: Cross-modal supervision for Action Recognition
(phone line ringing) hello.
CVPR 2020
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
Goal: Human Centric Action Recognition
answer phone
dig
hug
clink glasses
fall down
serve
Labelling actions in videos is extremely expensive
Different Domains
Internet videos (Kinetics) and edited films (AVA) are different domains; each must be manually labelled separately
Pretrained: Kinetics
Applied to: Movie (edited video)
Can we use speech as a form of supervision?
Sometimes, people speak about their actions…
Come sit down. Are you okay?
Jed: I gotta sit down.
Please sit down.
Uh, well, come here, sit down.
Uh, Niles, please, would you stop hovering and just sit down?
I find a dozen reasons not to sit down and write.
"Come on, come on, let's dance and shout now."
Uh, Cliff, let's dance.
EIGHTBALL, LET'S DANCE.
Come on, everyone, let's dance
AND I WANT TO DANCE WITH CHEAP WOMEN.
Come on, let's dance.
Can I kiss you before I go?
CAN I KISS YOU?
Can I kiss somebody right now?
Can I kiss the bride?
-[ INHALES DEEPLY ] -CAN I KISS YOU?
- CAN I KISS YOU? - NO.
ALEX: All right, let's take a photo here.
[Jeff] Will you take a photo of us?
Can I take a photo? Please?
URS, TAKE A PHOTO OF US.
I want to take a photo. Give me an action pose.
"TAKE A PHOTO WITH A STRANGER IN A PHOTO BOOTH."
Instructional Videos
Lifestyle Vlogs
HowTo100M, Miech et al.
COIN, Tang et al.
VLOG, Fouhey et al.
Visible Actions, Ignat et al.
What about the more general domain of movies and TV shows?
Often, however, the speech is completely unrelated to the action
We want to learn when the speech is indicative of an action, and which action that is likely to be
Movie Screenplays
Contain both speech segments and scene directions with actions
Speech
Actions
Scene Directions
Movie Screenplays
www.imsdb.com: The Internet Movie Script Database
# scripts | # scene descriptions | # speech segments | # total sentences | # words |
1.07K | 539K | 595K | 2.5M | 21M |
Retrieve speech segments near verbs
Hello, precious - kiss
This gives us speech-action pairs for different verbs
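The retrieval step above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the screenplay has already been split into scene directions and speech segments, and the `VERB_FORMS` table, the `window` parameter, and the toy screenplay are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical pre-parsed screenplay: a list of (kind, text) elements,
# where kind is "scene" (scene direction) or "speech" (dialogue).
Element = Tuple[str, str]

# Hypothetical verb table with hand-written inflections (an assumption;
# a real system would more likely use a lemmatizer).
VERB_FORMS = {
    "kiss": {"kiss", "kisses", "kissed", "kissing"},
    "dance": {"dance", "dances", "danced", "dancing"},
}


def verbs_in(text: str) -> List[str]:
    """Return the verb classes whose surface forms occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [verb for verb, forms in VERB_FORMS.items() if tokens & forms]


def mine_pairs(elements: List[Element], window: int = 1) -> List[Tuple[str, str]]:
    """Pair speech segments with verbs found in scene directions that lie
    at most `window` elements away."""
    pairs = []
    for i, (kind, text) in enumerate(elements):
        if kind != "scene":
            continue
        for verb in verbs_in(text):
            lo, hi = max(0, i - window), min(len(elements), i + window + 1)
            for near_kind, near_text in elements[lo:hi]:
                if near_kind == "speech":
                    pairs.append((near_text, verb))
    return pairs


# Toy screenplay (made up for illustration).
screenplay = [
    ("scene", "She walks into the kitchen."),
    ("speech", "Hello, precious"),
    ("scene", "He kisses her on the cheek."),
    ("speech", "You are late."),
    ("scene", "The room goes quiet."),
    ("speech", "I know."),
]
pairs = mine_pairs(screenplay)
print(pairs)
```

With `window=1`, both speech segments adjacent to the "kisses" scene direction are paired with the verb class `kiss`. The second pair is noisy, which is the kind of unrelated speech the classifier later has to learn to discount.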
speech-action pairs (from movie screenplays)
action: [answers] phone
speech: Hello, it’s me
action: [answers] phone
speech: Thanks for calling so soon
action: [answers] phone
speech: Hello Dad, are you still there?
Train a Speech2Action model on this paired data
Speech2Action
classifier
Use BERT backbone for Speech2Action
Verb class
Speech | Verb Class |
Ya nothin’ boy! | push |
Look over there | point |
BERT model pretrained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)
Fine-tune on our IMSDB dataset of movie scripts
BERT (Bidirectional Encoder Representations from Transformers)
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." (2018).
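Before a text classifier such as BERT can be fine-tuned, the speech-action pairs have to be turned into (text, class id) examples. The sketch below covers only that preparation step, not the model itself. The `min_count` filter and both function names are assumptions made for illustration; the slides do not state how verb classes were selected.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_label_map(pairs: List[Tuple[str, str]], min_count: int = 2) -> Dict[str, int]:
    """Map each verb class with at least `min_count` examples to an integer id.
    The `min_count` filter is an assumption, not a setting from the slides."""
    counts = Counter(verb for _, verb in pairs)
    kept = sorted(verb for verb, n in counts.items() if n >= min_count)
    return {verb: idx for idx, verb in enumerate(kept)}


def encode(pairs: List[Tuple[str, str]], label_map: Dict[str, int]) -> List[Tuple[str, int]]:
    """Turn (speech, verb) pairs into (speech, class id) training examples,
    dropping pairs whose verb class was filtered out."""
    return [(speech, label_map[verb]) for speech, verb in pairs if verb in label_map]


# Speech segments taken from the examples in these slides.
pairs = [
    ("Ya nothin' boy!", "push"),
    ("Let go of me", "push"),
    ("Look over there", "point"),
    ("That one over there.", "point"),
    ("Hello, it's me", "phone"),
]
label_map = build_label_map(pairs)
examples = encode(pairs, label_map)
print(label_map)
print(examples)
```

In this toy run the `phone` class has a single example and is dropped, so only `point` and `push` receive class ids.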
Most confident predictions
Class label | Most confident speech segments |
PHONE | Hello Hello May I have the number for Doctor George Shannan Hello hello Honey I asked you not to call unless what why [message beep] hey, it's me He’s not answering his phone Hello it’s me |
KISS | One more kiss Give me a kiss Good night my darling I love you my darling Noone had ever kissed me there before Goodnight angel my sweet boy Since when I cant kiss my sisterinlaw I love you |
DANCE | Come on Ill take a break and well all dance Ladies and Gentlemen the first dance I dont feel like dancing Wanna dance Shes a beautiful dancer Im about to ask this lady for a dance Waddaya say you wanna dance |
PUSH | Let go of me Get the fuck off me If youre standing strong I cant even push you Get out Get out Get out Get away from those Get away from those |
Unlabelled Video Clip from a Movie
Caption: Hello, it’s me
Speech2Action
classifier
Weak label: [answer] phone
We apply the Speech2Action model to the closed captions of unlabelled videos
At this stage, the text screenplays are no longer needed
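The weak-labelling step can be sketched as below. This is a rough illustration under stated assumptions: `toy_speech2action` is a keyword stub standing in for the trained Speech2Action classifier, its scores are made up, and the `threshold` value and the caption record layout are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A closed caption with start and end time in seconds (hypothetical layout).
Caption = Tuple[float, float, str]
Classifier = Callable[[str], Dict[str, float]]


def toy_speech2action(text: str) -> Dict[str, float]:
    """Keyword stub standing in for the trained Speech2Action classifier.
    Returns a made-up confidence score per verb class."""
    lowered = text.lower()
    return {
        "phone": 0.9 if "hello" in lowered or "it's me" in lowered else 0.1,
        "run": 0.8 if "run" in lowered else 0.1,
    }


def weak_labels(
    captions: List[Caption], classify: Classifier, threshold: float = 0.5
) -> List[Tuple[float, float, str]]:
    """Keep (start, end, verb) for every class scored at or above the threshold.
    Captions with no confident class produce no weak label."""
    labelled = []
    for start, end, text in captions:
        for verb, score in classify(text).items():
            if score >= threshold:
                labelled.append((start, end, verb))
    return labelled


captions = [
    (12.0, 13.5, "Hello, it's me"),
    (40.2, 41.0, "mike, run, run!"),
    (75.0, 77.3, "This chicken is very tasty"),
]
clips = weak_labels(captions, toy_speech2action)
print(clips)
```

The output keeps only time spans and verb classes, so the caption text is not needed once the weak labels exist. Raising `threshold` trades the number of mined clips for label precision.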
Apply to a large movie corpus
run
mike, run, run!
he was running after me.
he is running away.
Chase him!
They ran into the woods.
don’t move, hey!
phone
[ beeps ] hello.
rebekah is not answering her calls.
hey, it's me.
(phone line ringing) hello.
[ phone beeping ] dad, are you there ?
skinner's not answering his phone.
phone
next caller on the phone
Can you please connect me to the tip line
next message
call me back please
kiss
I will burn for this
[kisses] i love you.
[dawn] i love you.
i love you infinity.
i love you, angel.
hit
i'm gonna smash that camera to bits!
you gotta hit him in the solar plexus!
backhand, snap down, round off reach into the back handspring, and then tuck.
hit him right between the eyes.
drive
he made a u turn on an empty street.
camaro headed east on ocean park.
he got back in his car and chased after her.
they stopped under the brooklyn queens expressway.
my wife gets in the car i start driving down my block to the corner.
point
there's the first guy right there.
it's that one down there.
that one over there.
that one in the corner is clutch.
that is over there.
shoot
you got 10 seconds to come out, or we start shooting.
with the sharps carbine, that is within range.
you need more arc in that shot.
kincaid ordered not to shoot.
Result - many examples of rare actions
Log scale!
AVA dataset
Train a visual classifier on weakly labelled data
[1] Xie et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. ECCV, 2018
captions are no longer needed
S3D-G CNN Classifier
[answer] phone
Datasets
AVA
HMDB-51
Results - directly evaluate on AVA (no fine-tuning)
Results - transfer learning to HMDB-51
Method | Pretraining Data | Top 1 % Acc. |
Scratch | - | 41.2 |
Shuffle&Learn | UCF101 | 35.8 |
OPN | UCF101 | 23.8 |
ClipOrder | UCF101 | 30.9 |
Wang et al. | Kinetics | 33.4 |
3DRotNet | Kinetics | 40.0 |
DPC | Kinetics | 35.7 |
CBT | Kinetics | 44.6 |
Korbar et al. | Kinetics | 53.0 |
ImageNet Pretraining | ImageNet | 54.7 |
DisInit (RGB) | Kinetics + ImageNet | 54.8 |
Ours | S2A-mined | 58.1 |
Improvement of ours over:
self-supervised works (+14%)
training from scratch (+18%)
works that rely on ImageNet supervision (+3%)
More abstract actions
FOLLOW
come right behind me!
follow me quick!
after you
COUNT
thirty six thousand four hundred, five hundred
two quarters, three dimes, one nickel, two pennies
twenty four thousand four hundred
Future Work
Thank you!
EAT
Since I moved here, I actually like food again!
have you ever had szechwan cuisine before?
This chicken is very tasty
DRIVE
because if you are just drive
babe, the speed limit is 120
i always drive the car on saturday
Project Page: https://www.robots.ox.ac.uk/~vgg/research/speech2action/
Contact: arsha@robots.ox.ac.uk