1 of 24

BinoSoRAs

AI-generated video detection through Binoculars on Sora

Joshua Bowden, Willy Chan

Stanford University

2 of 24

about us!

Willy

Stanford CS sophomore
AMD Intern

Bell Labs 1st Place Computing Award
Research @ Stanford AI Lab (STAIR group)
etc…

Josh

Stanford CS sophomore
Meta Intern

CS 221: Intro to AI Top Project Winner
Research @ Caltech
Eagle Scout
etc…

11 hours on conception + implementation

3 of 24

BinoSoRAs

BinoSoRAs is a novel algorithm and platform designed to authenticate the origin of videos through automated frame interpolation and deep learning techniques.

5 of 24

inspiration

6 of 24

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

Prompt: A Chinese Lunar New Year celebration video with Chinese Dragon.

Prompt: Historical footage of California during the gold rush.

Prompt: A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat’s orange fur. The shot is clear and sharp, with a shallow depth of field.

7 of 24

inspiration

8 of 24

inspiration

9 of 24

inspiration

10 of 24

inspiration

11 of 24

current solutions

Detecting Deep-Fake Videos

AI Detection Will Never Be Enough

12 of 24

current solutions

Current Solutions

Naive, output-focused, human approach
Focus on overlaid deepfakes, not ground-up generation

Our Algorithm

Adaptive, input-focused, computational approach
Draws from state-of-the-art detection of LLM-generated text

13 of 24

bottom-up analysis

Customer Segment:

Digital Platforms - 200 Globally
Relevant News Organizations - 3,000 Globally
Relevant Educational Institutions - 5,000 Globally
Corporate Entities (HR, content ID, etc.) - 10,000 Globally
etc.

Revenue Per Customer

Digital Platforms: 200 (orgs) * 10,950,000 (hours uploaded / year) * 0.50 ($ / hour verified) ~ $1.1 Billion
News Organizations: 3,000 (orgs) * 700 videos/day * 365 days/year * 0.50 ($ / video verified) ~ $400 Million
Educational Institutions: 5,000 (orgs) * 250 videos/day * 280 days/year * 0.50 ($ / video verified) ~ $180 Million
Corporations: 10,000 (orgs) * 500 videos/day * 365 days/year * 0.50 ($ / video verified) ~ $1 Billion
Etc.
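
The per-segment estimates above are simple products, so they can be checked with a few lines of Python. This is only a sketch of the arithmetic: the segment figures are taken from the slide (reading the platform upload figure as 10,950,000 hours per year, which matches the stated ~$1.1 Billion), and the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# Bottom-up market sizing using the figures quoted on the slide.
# Each entry: (organizations, units verified per org per year, dollars per unit).
SEGMENTS = {
    "Digital Platforms": (200, 10_950_000, 0.50),        # hours uploaded / year
    "News Organizations": (3_000, 700 * 365, 0.50),      # videos / year
    "Educational Institutions": (5_000, 250 * 280, 0.50),
    "Corporations": (10_000, 500 * 365, 0.50),
}


def segment_revenue(orgs, units_per_year, price_per_unit):
    """Annual revenue for one segment: orgs * units per org * price per unit."""
    return orgs * units_per_year * price_per_unit


revenues = {name: segment_revenue(*args) for name, args in SEGMENTS.items()}

for name, revenue in revenues.items():
    print(f"{name}: ${revenue / 1e6:,.2f}M")
```

Running it reproduces the rounded figures on the slide, for example $1,095.00M for digital platforms.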

Source

14 of 24

BinoSoRAs

introducing

AI-generated video detection through Binoculars on Sora

15 of 24

detection results

Evaluation Threshold:

52.87 Fréchet Inception Distance

Overall Accuracy: 91.67%

Sora - 38 images

Real - 17 images

Based on the analysis, we set a candidate FID threshold of 152.75 for distinguishing generated from real videos. The threshold sits midway between the upper bound of the real-video scores and the lower bound of the generated-video scores, so that it separates the majority of the two groups. Videos with FID scores below the threshold are more likely to be real, and videos with scores above it are more likely to be generated. The value depends on the distribution of FID scores in the datasets used, so it should be validated on held-out videos before it is relied on in practice.
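
As a rough illustration of the midpoint rule described above, the sketch below computes a threshold from two lists of FID scores and then classifies by it. It assumes that "upper bound" and "lower bound" mean the maximum real score and the minimum generated score, which the slide does not state explicitly. The scores are hypothetical values invented for the example, not the project's measurements.

```python
def midpoint_threshold(real_scores, fake_scores):
    """Midway between the highest real score and the lowest generated score."""
    return (max(real_scores) + min(fake_scores)) / 2


def classify(fid_score, threshold):
    """Scores above the threshold are treated as generated, the rest as real."""
    return "generated" if fid_score > threshold else "real"


def accuracy(real_scores, fake_scores, threshold):
    """Fraction of videos whose predicted label matches the known label."""
    correct = sum(classify(s, threshold) == "real" for s in real_scores)
    correct += sum(classify(s, threshold) == "generated" for s in fake_scores)
    return correct / (len(real_scores) + len(fake_scores))


# Hypothetical FID scores, made up purely for illustration.
real_scores = [80.0, 100.0, 120.0]
fake_scores = [160.0, 200.0, 240.0]

threshold = midpoint_threshold(real_scores, fake_scores)
print(threshold, accuracy(real_scores, fake_scores, threshold))
```

When the two score ranges overlap, the midpoint rule still returns a value but no longer separates the classes cleanly, which is one reason to validate the threshold on held-out data.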

True Negative: 78.6%

False Positives: 21.4%

16 of 24

Algorithm

Drawing from the state-of-the-art Binoculars framework for detecting text generated by large language models, we extend the idea to video with BinoSoRAs.

17 of 24

LLM Detection

Unknown Text

Generated Text

LLM 1

LLM 2

To me, how surprising is the unknown text compared to the generated text?

‘surprising’: log-perplexity, i.e. how far out of distribution the tokens are on average
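
To make "log-perplexity" concrete, here is a minimal sketch of how it could be computed from the probabilities a model assigns to each observed token, together with a ratio in the spirit of Binoculars that compares it against the cross-entropy between two models' next-token distributions. The function names and toy probabilities are assumptions for illustration. A real detector would obtain these probabilities from two actual language models.

```python
import math


def log_perplexity(token_probs):
    """Average negative log-probability assigned to each observed token.

    Assumes every probability is strictly positive.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def cross_log_perplexity(dists_a, dists_b):
    """Average cross-entropy between two models' next-token distributions.

    dists_a[i] and dists_b[i] are the two models' probability vectors over
    the same vocabulary at position i.
    """
    total = 0.0
    for probs_a, probs_b in zip(dists_a, dists_b):
        total += -sum(a * math.log(b) for a, b in zip(probs_a, probs_b))
    return total / len(dists_a)


def surprise_ratio(token_probs, dists_a, dists_b):
    """How surprising the text is, relative to how much the two models disagree."""
    return log_perplexity(token_probs) / cross_log_perplexity(dists_a, dists_b)


# Toy example: two positions, a two-token vocabulary, both models uniform.
uniform = [[0.5, 0.5], [0.5, 0.5]]
score = surprise_ratio([0.5, 0.5], uniform, uniform)
print(score)
```

Normalizing by the cross term is what lets the score compare "unknown text" against "generated text" rather than judging raw perplexity alone.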

18 of 24

Video Detection

Unknown Video

Interpolated Video

FLAVR CNN

Frechet Inception Distance

To me, how well can I recognize the unknown video compared to the interpolated video?

‘recognize’: Inception v3 image classifier run on each frame, yielding a 2048-dimensional feature vector

19 of 24

BinoSoRAs

We create a model-generated counterpart by feeding the suspect video into FLAVR, a CNN for fast frame interpolation.

We use the Fréchet Inception Distance (computed on Inception v3 features) between the unknown video and its interpolated counterpart as the metric for deciding whether the video is generated or real.
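
The distance itself can be sketched without any deep learning libraries. The example below is a simplified Fréchet distance that assumes each feature dimension is independent (diagonal covariance). The full FID uses complete covariance matrices and a matrix square root, so this is an approximation for illustration, not the project's implementation. In practice the feature vectors would be the 2048-dimensional Inception v3 activations for each frame, and the tiny two-dimensional vectors here are made-up stand-ins.

```python
import math
from statistics import fmean, variance


def column_stats(features):
    """Per-dimension mean and sample variance over equal-length feature vectors."""
    columns = list(zip(*features))
    means = [fmean(column) for column in columns]
    variances = [variance(column) for column in columns]
    return means, variances


def frechet_distance_diagonal(features_a, features_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    d^2 = ||mu_a - mu_b||^2 + sum(var_a + var_b - 2 * sqrt(var_a * var_b))
    """
    mu_a, var_a = column_stats(features_a)
    mu_b, var_b = column_stats(features_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        va + vb - 2 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Made-up per-frame features for an "unknown" and an "interpolated" video.
unknown = [[0.0, 0.0], [2.0, 2.0]]
interpolated = [[1.0, 0.0], [3.0, 2.0]]
print(frechet_distance_diagonal(unknown, interpolated))
```

A distance of zero means the two sets of frame features have identical per-dimension statistics. Larger values mean the interpolated frames look less like the original frames to the feature extractor.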

20 of 24

SoRA vs. Reality

Real Video

Real Video with Generator

SoRA Video

SoRA Video with Generator

SoRA Video

SoRA Video with Generator

21 of 24

BinoSoRAs

Our system's efficiency, scalability, and effectiveness address the evolving challenges of digital content authentication in an increasingly automated world.

Able to quickly, accurately verify large volumes of data

Resistant to adversarial compute

Novel approach compared to traditional methods

22 of 24

What’s next?

Threat: enterprise-level open-source models that let anyone create generated video
Solution: integration with media platforms, companies, etc.
Advice from experts: ___________________________________________

23 of 24

Photo: after 11 hours

24 of 24

BinoSoRAs

AI-generated video detection through Binoculars on Sora

Joshua Bowden, Willy Chan

Stanford University