1 of 24

Applied Foundation Models

Practical Course -- Kick-Off Meeting

April 28th, 2025

Computer Vision Group

Felix Wimbauer, Dominik Muhle, Christoph Reich, and Daniil Zverev

2 of 24

Outline

  1. Introduction
  2. Course Structure
    1. Sessions
    2. GPU Infrastructure
  3. Grading
  4. Projects
  5. Next Steps


3 of 24

Dominik Muhle

Info

  • 3rd year PhD student
  • MSc. Robotics, Cognition, Intelligence, TUM, KIT

Research Interests

  • (Dynamic) Scene Reconstruction
  • Visual SLAM and Odometry
  • Object-centric learning

Website: dominikmuhle.github.io


4 of 24

Felix Wimbauer

Info

  • 3rd year PhD student
  • MSc. Computer Science, TUM, University of Oxford

Research Interests

  • (Dynamic) 3D Reconstruction
  • Diffusion Models
  • Object-centric learning
  • Bayesian Approaches

Website: fwmb.github.io


5 of 24

Christoph Reich

Info

  • 2nd year PhD student
  • MSc. Autonomous Systems (CS), TU Darmstadt

Research Interests

  • Unsupervised scene understanding
  • Representation learning
  • Motion/depth estimation & reconstruction

Website: christophreich1996.github.io


6 of 24

Daniil Zverev

Info

  • 1st year PhD student
  • MSc. Computer Science, TUM, Yandex Data School

Research Interests

  • Multimodal machine learning

Website: https://akoepke.github.io/mumol.html


7 of 24

Introduction

Foundation Models

  • Models trained on broad datasets for diverse applications
  • Have transformed Computer Vision and NLP
  • Examples: text-to-image diffusion models, CLIP, DINO, Marigold (depth), LLaMA, LLaVA

This Course

  • Get an overview of different foundation models through student presentations.
  • Explore applications of such models via hackathon-style projects.

Relevance

  • If you want to build a product that uses Computer Vision, you will likely get better results from pretrained foundation models than from models trained on your own (much smaller) datasets.
  • Experience with foundation models is / will become an important skill in the job market.


8 of 24

Course Structure


9 of 24

Sessions & Important Dates

  • Today: Kick-Off Session
  • May 5th, 23:59: Project Preference Submission
  • May 19th, 2025 (time TBD): Initial Presentation
  • TBD, (mid/end of July): Final Presentation & Demo
  • TBD, (end of July): Report Submission


10 of 24

Project Assignment

  • Talk to other students in the course to find teammates with similar preferences
    • In person
    • Via the Element group
  • If you don’t have a team, we will group you with other students
  • We will try to accommodate everyone’s preferences
  • Submit the Google Form, which we will share with you after the session
  • Deadline: May 5th, 23:59


11 of 24

GPU Infrastructure

  • We will give you access to CVG’s computing infrastructure
  • You will have access to Titan X GPUs with 12 GB of VRAM (should be enough for all projects)
  • SLURM scheduler:
    • Submit jobs with specific resource requirements (see the example job script below)
    • SLURM assigns resources to jobs
  • We will send you the account details and a tutorial for our infrastructure in the coming days.
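
A minimal sketch of a SLURM job script, written in Python so that the #SBATCH directives live in comment lines (submit with "sbatch job.py"); the resource values are illustrative, and the exact partition/queue names will be in the tutorial we send you:

    #!/usr/bin/env python3
    #SBATCH --job-name=afm-smoke-test
    #SBATCH --gres=gpu:1
    #SBATCH --mem=16G
    #SBATCH --time=01:00:00
    #SBATCH --output=slurm-%j.out

    # Quick smoke test: check that the scheduler assigned us a usable GPU.
    import torch

    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))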


12 of 24

Grading


  • We are happy to provide supervision, but part of the challenge is also to overcome problems as a team.
  • Groups will get the same grade, unless there are significant discrepancies between individual contributions.
  • Late submissions will result in a grade deduction.

13 of 24

Initial Presentation

  • Format
    • Will take place on May 19th, 2025 (time TBD)
    • 15-20 min (it is important to stay within this time!)
    • Each student should contribute equally to the presentation.
    • Submit the slides by the evening before the presentation (May 18th, 23:59).
  • Content
    • Give the other students an overview of your topic and the relevant papers
    • Provide an outline of the specific project you want to work on (no timeline needed)


14 of 24

Projects


15 of 24

#1 Driving Scenario Reconstruction

  • Scene Reconstruction/Segmentation: Combine different foundation models for 3D reconstruction and semantic image understanding
  • Goal: Reconstruct a scene from an autonomous driving dataset so it can be used as an asset in animation:
    1. Use MDE (monocular depth estimation) / SSC (semantic scene completion) models to reconstruct the environment (see the depth sketch below)
    2. Extract information about dynamic objects, semantics, drivable surface, etc.
    3. Re-animate the scene with scene assets in a controllable manner (optionally with LLMs)
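
As a possible starting point for step 1, a minimal monocular depth estimation sketch; it assumes the Hugging Face transformers depth-estimation pipeline, and the checkpoint name is one public option, not a course requirement:

    # Sketch: predict a dense depth map for a single driving frame.
    from PIL import Image
    from transformers import pipeline

    depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

    image = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame
    result = depth_estimator(image)
    result["depth"].save("frame_0001_depth.png")  # depth map as a PIL image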


16 of 24

#2 VLM-based Image or Video Search

  • VLM: fusion of vision and language models, trained on paired images and text.
  • Goal: Build an image search pipeline using VLM foundation models (a CLIP retrieval sketch follows this list):
    • Use CLIP for embedding-based text-to-image and image-to-image search
    • Use GIT to generate searchable image captions
    • Use a vision LLM (like LLaVA) for text-to-image search
    • Optional: Use video captioning and keyframes to enable text search within a video
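
A minimal sketch of the CLIP retrieval step, using the Hugging Face transformers CLIP interface; the file names are placeholders, and GIT captioning or a vision LLM would plug in around this:

    # Sketch: embed a small image collection with CLIP, rank it against a query.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = ["cat.jpg", "street.jpg", "beach.jpg"]  # hypothetical collection
    images = [Image.open(p).convert("RGB") for p in paths]

    with torch.no_grad():
        img_feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
        txt_feats = model.get_text_features(
            **processor(text=["a photo of a beach"], return_tensors="pt", padding=True))

    # Cosine similarity between the query and every image.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (txt_feats @ img_feats.T).squeeze(0)
    print(paths[scores.argmax()])  # best-matching image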


17 of 24

#3 VLM-based AI Tutor

  • VLM Chat:
    • Fusion of vision and language models that learns from provided books and lecture materials.
    • Trained to be interactive in chat
  • Goal: Build a VLM-based chat that can answer questions about a particular TUM course, help students prepare for exams, and create cheat sheets, mind maps, etc. (a minimal chat UI sketch follows this list):
    • Use open-source frameworks to host models such as DeepSeek, LLaMA, or Mistral.
    • Use our compute to fine-tune them on related books/papers.
    • Use cleaned lecture materials as mandatory context in the chat
    • Build your chat UI as a Gradio app
    • (Optional) Connect the VLM chat to a RAG pipeline (open-source framework)
    • (Optional) Fine-tune it to produce artifacts such as PDFs, slides, and mind maps.
  • Future:
    • We will help you host it internally and make it available to all TUM students.
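
A minimal sketch of the Gradio chat front end; it assumes the model is served behind an OpenAI-compatible endpoint (as open-source servers like vLLM provide), and the URL, model name, and system prompt are placeholders:

    # Sketch: Gradio chat UI backed by a locally hosted open-source LLM.
    import gradio as gr
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

    def respond(message, history):
        messages = [{"role": "system", "content": "You are a tutor for a TUM course."}]
        for m in history:  # history arrives as OpenAI-style role/content dicts
            messages.append({"role": m["role"], "content": m["content"]})
        messages.append({"role": "user", "content": message})
        out = client.chat.completions.create(model="my-course-llm", messages=messages)
        return out.choices[0].message.content

    gr.ChatInterface(respond, type="messages").launch()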


18 of 24

#4 Image-Editing with Foundation Models

  • Foundation models power diverse image editing techniques in modern smartphones' camera software.
  • Different models like Stable Diffusion, Segment Anything (SAM), and Depth Anything enable realistic image generation, object segmentation, and depth prediction, respectively.
  • Goal: Develop the image editing features described above (a background-blur sketch follows this list):
    • Implement background blurring
    • Implement object removal (magic eraser)
    • Implement background replacement (backdrop)
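
As a hint for the background-blur feature, a minimal compositing sketch; it assumes you already have a binary foreground mask (e.g., predicted by SAM) and only shows the blend step:

    # Sketch: blur the background given an RGB image and a foreground mask.
    import numpy as np
    from PIL import Image, ImageFilter

    def blur_background(image: Image.Image, mask: np.ndarray, radius: int = 15) -> Image.Image:
        # mask: (H, W) array with 1 = foreground, 0 = background (e.g., from SAM)
        blurred = image.filter(ImageFilter.GaussianBlur(radius))
        alpha = np.repeat(mask[..., None].astype(np.float32), 3, axis=2)
        out = np.asarray(image) * alpha + np.asarray(blurred) * (1.0 - alpha)
        return Image.fromarray(out.astype(np.uint8))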


19 of 24

#5 Video Editing with Foundation Models and Rendering

Setting:

  • Modern segmentation models like SAM2 can track any object throughout a video
  • Modern 6D pose estimation models can track the pose of any object, so we can replace or edit objects with rendered 3D assets
  • Use point tracking to follow individual points through a video

Goal: Build a powerful video editing pipeline with SAM2, FoundationPose, and CoTracker (a SAM2 tracking sketch follows the figure credits below)


[Figures: FoundationPose by NVIDIA, SAM2 by Meta AI, CoTracker by Meta AI]
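
A minimal sketch of promptable video object tracking, following the video-predictor interface documented in Meta's sam2 repository; the config/checkpoint paths, frame directory, and click coordinates are placeholders:

    # Sketch: track one object through a video with SAM2's video predictor.
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor(
        "configs/sam2.1/sam2.1_hiera_s.yaml", "checkpoints/sam2.1_hiera_small.pt")

    with torch.inference_mode():
        state = predictor.init_state(video_path="video_frames/")  # directory of JPEG frames
        # Prompt frame 0 with a single positive click on the target object.
        predictor.add_new_points_or_box(
            state, frame_idx=0, obj_id=1,
            points=np.array([[300, 200]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32))
        # Propagate the mask through the rest of the video.
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            masks = (mask_logits > 0.0).cpu().numpy()  # one binary mask per object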

20 of 24

#6 Search in a 3D Room

Setting:

  • DUSt3R and similar models allow for accurate dense 3D reconstruction.
  • Build the reconstruction with CLIP embeddings or similar.
  • Store VLM descriptions in the 3D reconstruction.

Goal: Build a search engine for 3D scans of your room (see the query sketch below).


[Figure: CLIP by OpenAI]
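
A minimal sketch of the query side; it assumes (hypothetically) that per-point CLIP features have already been back-projected onto the DUSt3R reconstruction, and only shows the text-query ranking:

    # Sketch: rank 3D points against a text query via CLIP text features.
    import numpy as np
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def query_room(text, points, feats, top_k=100):
        # points: (N, 3) reconstruction; feats: (N, 512) L2-normalized CLIP features
        inputs = processor(text=[text], return_tensors="pt", padding=True)
        with torch.no_grad():
            q = model.get_text_features(**inputs)
        q = torch.nn.functional.normalize(q, dim=-1).numpy()[0]
        sims = feats @ q  # cosine similarity, both sides normalized
        return points[np.argsort(-sims)[:top_k]]  # 3D locations of best matches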

21 of 24

#7 Unsupervised Point Cloud Segmentation

Setting:

  • Advances in SSL have enabled expressive self-supervised point cloud representations
  • These representations can be employed for unsupervised scene understanding (e.g., segmentation)

Goal: Build a foundation model approach for unsupervised segmentation of point clouds in the wild (a feature-clustering sketch follows the reference below).


Wu et al., “Sonata: Self-supervised Learning of Reliable Point Representations”, CVPR 2025
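
One possible baseline (our assumption, not a prescribed method): cluster per-point SSL features, e.g. from a pretrained Sonata-style encoder, into pseudo-segments:

    # Sketch: unsupervised segmentation by clustering per-point SSL features.
    import numpy as np
    from sklearn.cluster import KMeans

    def segment_point_cloud(feats, n_segments=8):
        # feats: (N, D) per-point features from a pretrained encoder (assumed given)
        feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        return KMeans(n_clusters=n_segments, n_init=10).fit_predict(feats)  # (N,) labels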

22 of 24

#8 Bringing Unsupervised Whole-Image Segmentation to Videos

Setting:

  • Whole-image segmentation has been driven by large-scale foundation models (cf. Segment Anything)
  • Recently, UnSAM, a large-scale unsupervised approach, has been proposed
  • Unsupervised whole-video segmentation is still largely unexplored

Goal: Finetune UnSAM for unsupervised whole-video segmentation


Wang et al. “Segment Anything without Supervision”, NeurIPS, 2024

23 of 24

#9 Build a Kicker Tracker and Commentator

Setting:

  • A certain research group at TUM has a private kicker (table football) room.
  • This group needs software to track the ball and analyze the game.
  • Techniques like point tracking, SAM2, and others can be used to track the game.
  • Use Whisper, LLMs, and text-to-speech to interact with the players.

Goal: Build software that automatically analyzes the game and generates live commentary (a Whisper transcription sketch follows below).
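
For the speech side, a minimal transcription sketch with OpenAI's open-source whisper package (pip install openai-whisper); the audio file name is a placeholder:

    # Sketch: transcribe player speech to text for the LLM commentator.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("table_mic.wav")  # hypothetical recording
    print(result["text"])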


24 of 24

What’s next?

Teams & Project Preferences

  • Join the Matrix group
  • Find teammates
  • Submit team preferences via the provided Google Form by May 5th, 23:59.

Initial Presentation

  • Reach out to your respective tutor once the projects have been assigned.
  • Submit a draft of the slides by the evening before the presentation (May 18th, 23:59).

Questions?
