1 of 24

BinoSoRAs

AI-generated video detection through Binoculars on Sora

Joshua Bowden, Willy Chan

Stanford University

2 of 24

about us!

Willy

  • Stanford CS sophomore
  • AMD Intern

  • Bell Labs 1st Place Computing Award
  • Research @ Stanford AI Lab (STAIR group)
  • etc…

Josh

  • Stanford CS sophomore
  • Meta Intern

  • CS 221: Intro to AI Top Project Winner
  • Research @ Caltech
  • Eagle Scout
  • etc…

11 hours on conception + implementation

3 of 24

BinoSoRAs

BinoSoRAs is a novel algorithm and platform designed to authenticate the origin of videos through automated frame interpolation and deep learning techniques.

5 of 24

inspiration

6 of 24

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

Prompt: A Chinese Lunar New Year celebration video with Chinese Dragon.

Prompt: Historical footage of California during the gold rush.

Prompt: A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat’s orange fur. The shot is clear and sharp, with a shallow depth of field.

7 of 24

inspiration

8 of 24

inspiration

9 of 24

inspiration

10 of 24

inspiration

11 of 24

current solutions

12 of 24

current solutions

Current Solutions

    • Naive, output-focused, human approach
    • Focus on overlaid deepfakes, not ground-up generation

Our Algorithm

    • Adaptive, input-focused, computational approach
    • Draws from state-of-the-art LLM AI detection

13 of 24

bottom-up analysis

Customer Segment:

  • Digital Platforms - 200 Globally
  • Relevant News Organizations - 3,000 Globally
  • Relevant Educational Institutions - 1,000 Globally
  • Corporate Entities (HR, content ID, etc.) - 10,000 Globally
  • etc.

Revenue Per Customer Segment

  • Digital Platforms: 200 (orgs) * 10,950,000 (hours uploaded / year) * 0.50 ($ / hour verified) ~$1.1 Billion
  • News Organizations: 3000 (orgs) * 700 videos/day * 365 days/year * 0.50 ($ / video verified) ~$400 Million
  • Educational Institutions: 5000 (orgs) * 250 videos/day * 280 days/year * 0.50 ($ / video verified) ~$180 Million
  • Corporations: 10,000 (orgs) * 500 videos/day * 365 days/year * 0.50 ($ / video verified) ~$1 Billion
  • Etc.
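The arithmetic behind these estimates can be checked with a short script. This is a minimal sketch that copies the figures from the revenue lines above, reading the platform volume as 10,950,000 hours per year (the value that reproduces the stated $1.1 billion) and using the 5,000 educational institutions that appear in the revenue line. The dictionary layout and variable names are illustrative only.

```python
# Bottom-up estimate: organizations x yearly volume x price per verified unit.
PRICE = 0.50  # dollars per hour (platforms) or per video (other segments)

# Units verified per year in each segment.
segments = {
    "Digital Platforms": 200 * 10_950_000,          # orgs x hours uploaded per year
    "News Organizations": 3_000 * 700 * 365,        # orgs x videos/day x days/year
    "Educational Institutions": 5_000 * 250 * 280,  # orgs x videos/day x days/year
    "Corporations": 10_000 * 500 * 365,             # orgs x videos/day x days/year
}

revenue = {name: units * PRICE for name, units in segments.items()}
total = sum(revenue.values())

for name, dollars in revenue.items():
    print(f"{name}: ${dollars / 1e6:,.0f}M")
print(f"Total: ${total / 1e9:.2f}B")
```

The four segments come to roughly $1,095M, $383M, $175M, and $913M, or about $2.57 billion combined before the "etc." segments.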

14 of 24

introducing

BinoSoRAs

AI-generated video detection through Binoculars on Sora

15 of 24

detection results

Evaluation Threshold:

52.87 Fréchet Inception Distance

Overall Accuracy: 91.67%
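A threshold like this turns a per-video distance into a yes/no label. The sketch below shows one way such a decision and the resulting accuracy could be computed. The function names, the example scores, and the `generated_if_below` flag are all hypothetical: the slide gives the threshold value but not which side of it counts as AI-generated, so the direction is left as a parameter.

```python
def is_generated(fid, threshold=52.87, generated_if_below=True):
    """Label one video from its distance score.

    generated_if_below is an assumption: the slide states the threshold
    but not which side of it is treated as AI-generated.
    """
    return fid < threshold if generated_if_below else fid >= threshold


def accuracy(scores, labels, **kwargs):
    """Fraction of videos whose predicted label matches the true label."""
    hits = sum(is_generated(s, **kwargs) == y for s, y in zip(scores, labels))
    return hits / len(labels)


# Made-up scores for illustration only (True = AI-generated).
scores = [31.0, 48.2, 60.5, 75.1]
labels = [True, True, False, True]
print(accuracy(scores, labels))  # 0.75
```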

16 of 24

Algorithm

Drawing from the state-of-the-art Binoculars framework for detecting AI-generated text from large language models, we extend the approach to video with BinoSoRAs.

17 of 24

LLM Detection

Unknown Text

Generated Text

LLM 1

LLM 2

To me, how surprising is the unknown text compared to the generated text?

‘surprising’: log-perplexity, i.e. how far out of distribution the tokens are on average
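To make the comparison concrete, here is a minimal sketch of a Binoculars-style score: the log-perplexity of the text under one model, divided by the cross-perplexity between the two models' next-token distributions. It follows our reading of the Binoculars idea and is not the reference implementation. The three-token vocabulary and the probability tables are toy values invented for illustration; in practice the distributions would come from two real language models and would depend on the preceding text.

```python
import math


def log_perplexity(token_ids, probs):
    """Average negative log-probability the model assigned to the observed tokens."""
    return -sum(math.log(p[t]) for t, p in zip(token_ids, probs)) / len(token_ids)


def cross_log_perplexity(probs_a, probs_b):
    """Average cross-entropy between the two models' next-token distributions."""
    total = 0.0
    for pa, pb in zip(probs_a, probs_b):
        total -= sum(a * math.log(b) for a, b in zip(pa, pb))
    return total / len(probs_a)


def binoculars_style_score(token_ids, probs_a, probs_b):
    """Ratio of observed surprise to the surprise one model expects of the other."""
    return log_perplexity(token_ids, probs_a) / cross_log_perplexity(probs_a, probs_b)


# Toy next-token distributions over a 3-token vocabulary at two positions.
llm1 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
llm2 = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]

likely_text = [0, 1]      # picks the tokens both models favor
surprising_text = [2, 2]  # picks tokens both models consider unlikely

print(binoculars_style_score(likely_text, llm1, llm2))
print(binoculars_style_score(surprising_text, llm1, llm2))
```

The text built from high-probability tokens gets the lower score, which is the signal a detector of this kind associates with machine-generated text.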

18 of 24

Video Detection

Unknown Video

Interpolated Video

FLAVR CNN

Fréchet Inception Distance

To me, how well can I recognize the unknown video compared to the interpolated video?

‘recognize’: Inception v3 image classifier run on each frame, giving a 2048-dimensional feature vector per frame
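Given one feature vector per frame, the Fréchet distance compares the mean and covariance of the two sets of vectors. The sketch below uses NumPy only and assumes the features have already been extracted; the function name and the random 8-dimensional toy features are illustrative stand-ins for real 2048-dimensional Inception v3 outputs.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_frames, feature_dim).
    FID = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_a C_b, which are real and non-negative for
    # covariance matrices; clip guards against tiny negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Toy features: 200 "frames" with 8 dimensions each (not real Inception output).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))
print(frechet_distance(x, x))        # identical sets: approximately 0
print(frechet_distance(x, x + 3.0))  # same spread, shifted mean: approximately 72
```

Shifting every feature by 3 leaves the covariance unchanged, so the distance reduces to the squared mean difference, 8 × 3² = 72.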

19 of 24

BinoSoRAs

  • We create a model-generated video by feeding the suspect video into a fast frame interpolation model (FLAVR CNN).

  • We use the Fréchet Inception Distance (computed from Inception v3 CNN features) between the unknown video and its interpolated version as the metric for deciding whether the video is generated or real.
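The first step can be pictured as turning a stack of frames into a second, model-produced stack. The sketch below is a stand-in only: it averages neighbouring frames instead of calling FLAVR, whose interface is not shown here, and it assumes the interpolated video interleaves original and predicted frames, which is one possible layout rather than a confirmed detail of our pipeline. All function names are hypothetical.

```python
import numpy as np


def interpolate_midframes(frames):
    """Stand-in interpolator: average each pair of neighbouring frames.

    This is NOT FLAVR; a learned interpolation model would replace it.
    frames: array of shape (num_frames, height, width, channels).
    """
    frames = frames.astype(np.float32)
    return 0.5 * (frames[:-1] + frames[1:])


def build_interpolated_video(frames):
    """Interleave the original frames with the predicted in-between frames."""
    mids = interpolate_midframes(frames)
    out = np.empty((len(frames) + len(mids),) + frames.shape[1:], dtype=np.float32)
    out[0::2] = frames  # even positions: original frames
    out[1::2] = mids    # odd positions: predicted frames
    return out


# Three tiny 2x2 RGB frames with made-up pixel values.
video = np.arange(3 * 2 * 2 * 3).reshape(3, 2, 2, 3)
interpolated = build_interpolated_video(video)
print(interpolated.shape)  # (5, 2, 2, 3)
```

Per-frame features from the unknown and interpolated stacks would then feed the distance metric described in the second bullet.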

20 of 24

Sora vs. Reality

Real Video

Real Video with Generator

Sora Video

Sora Video with Generator

Sora Video

Sora Video with Generator

21 of 24

BinoSoRAs

Our system's efficiency, scalability, and effectiveness prove necessary in addressing the evolving challenges of digital content authentication in an increasingly automated world.

  • Able to quickly, accurately verify large volumes of data

  • Resistant to adversarial compute

  • Novel approach compared to traditional methods

22 of 24

What’s next?

  • Threat: enterprise-level open-source models that anyone can use to create video
  • Solution: integration with media platforms, companies, etc.
  • Advice from experts: ___________________________________________

23 of 24

Photo: after 11 hours

24 of 24

BinoSoRAs

AI-generated video detection through Binoculars on Sora

Joshua Bowden, Willy Chan

Stanford University