1 of 24

Applied Foundation Models

Practical Course -- Kick-Off Meeting

April 28th, 2025

Computer Vision Group

Felix Wimbauer, Dominik Muhle, Christoph Reich, and Daniil Zverev

2 of 24

Outline

  1. Introduction
  2. Course Structure
    1. Sessions
    2. GPU Infrastructure
  3. Grading
  4. Projects
  5. Next Steps


3 of 24

Dominik Muhle

Info

  • 3rd year PhD student
  • MSc. Robotics, Cognition, Intelligence, TUM, KIT

Research Interests

  • (Dynamic) Scene Reconstruction
  • Visual SLAM and Odometry
  • Object-centric learning

Website: dominikmuhle.github.io


4 of 24

Felix Wimbauer

Info

  • 3rd year PhD student
  • MSc. Computer Science, TUM, University of Oxford

Research Interests

  • (Dynamic) 3D Reconstruction
  • Diffusion Models
  • Object-centric learning
  • Bayesian Approaches

Website: fwmb.github.io


5 of 24

Christoph Reich

Info

  • 2nd year PhD student
  • MSc. Autonomous Systems (CS), TU Darmstadt

Research Interests

  • Unsupervised scene understanding
  • Representation learning
  • Motion/depth estimation & reconstruction

Website: christophreich1996.github.io


6 of 24

Daniil Zverev

Info

  • 1st year PhD student
  • MSc. Computer Science, TUM, Yandex Data School

Research Interests

  • Multimodal machine learning

Website: https://akoepke.github.io/mumol.html


7 of 24

Introduction

Foundation Models

  • Models trained on broad datasets for diverse applications
  • Have transformed Computer Vision and NLP
  • Examples: text-to-image diffusion models, CLIP, DINO, Marigold (depth), LLaMA, LLaVA

This Course

  • Get an overview of different foundation models through student presentations.
  • Explore applications of such models via hackathon-style projects.

Relevance

  • If you want to build a product that uses Computer Vision, you will likely get better results from pretrained foundation models than from models trained on your own (much smaller) datasets.
  • Experience with foundation models is / will become an important skill in the job market.


8 of 24

Course Structure


9 of 24

Sessions & Important Dates

  • Today: Kick-Off Session
  • May 5th, 23:59: Project Preference Submission
  • May 19th, 2025 (time TBD): Initial Presentation
  • TBD, (mid/end of July): Final Presentation & Demo
  • TBD, (end of July): Report Submission


10 of 24

Project Assignment

  • Talk to other students in the course to find teammates with similar preferences
    • In person
    • Via the Element group
  • If you don’t have a team, we will group you with other students
  • We will try to accommodate everyone’s preferences
  • Submit the Google Form, which we will share with you after the session
  • Deadline: May 5th, 23:59


11 of 24

GPU Infrastructure

  • We will give you access to CVG’s computing infrastructure
  • You will have access to Titan X GPUs with 12 GB of VRAM (should be enough for all projects)
  • SLURM scheduler:
    • Submit jobs with specific resource requirements (see the example job script below)
    • SLURM assigns resources to jobs
  • We will send you the account details and a tutorial for our infrastructure in the coming days.
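
A minimal sketch of a SLURM job script, written in Python so that the #SBATCH directives live in comment lines (submit with "sbatch job.py"); the resource values are illustrative, and the exact partition/queue names will be in the tutorial we send you:

    #!/usr/bin/env python3
    #SBATCH --job-name=afm-smoke-test
    #SBATCH --gres=gpu:1
    #SBATCH --mem=16G
    #SBATCH --time=01:00:00
    #SBATCH --output=slurm-%j.out

    # Quick smoke test: check that the scheduler assigned us a usable GPU.
    import torch

    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))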


12 of 24

Grading


  • We are happy to provide supervision, but part of the challenge is also to overcome problems as a team.
  • Groups will get the same grade, unless there are significant discrepancies between individual contributions.
  • Late submissions will result in a grade deduction.

13 of 24

Initial Presentation

  • Format
    • Will take place on May 19th, 2025 (time TBD)
    • 15-20 min (it is important to stay within this time!)
    • Each student should contribute equally to the presentation.
    • Submit the slides by the evening before the presentation (May 18th, 23:59).
  • Content
    • Give the other students an overview of your topic and the relevant papers
    • Provide an outline of the specific project you want to work on (no timeline needed)


14 of 24

Projects


15 of 24

#1 Driving Scenario Reconstruction

  • Scene Reconstruction/Segmentation: Combine different foundation models for 3D reconstruction and semantic image understanding
  • Goal: Reconstruct a scene from an autonomous driving dataset so it can be used as an asset in animation:
    1. Use MDE (monocular depth estimation) / SSC (semantic scene completion) models to reconstruct the environment (see the depth sketch below)
    2. Extract information about dynamic objects, semantics, drivable surface, etc.
    3. Re-animate the scene with scene assets in a controllable manner (optionally with LLMs)
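
As a possible starting point for step 1, a minimal monocular depth estimation sketch; it assumes the Hugging Face transformers depth-estimation pipeline, and the checkpoint name is one public option, not a course requirement:

    # Sketch: predict a dense depth map for a single driving frame.
    from PIL import Image
    from transformers import pipeline

    depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

    image = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame
    result = depth_estimator(image)
    result["depth"].save("frame_0001_depth.png")  # depth map as a PIL image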


16 of 24

#2 VLM-based Image or Video Search

  • VLM: fusion of vision and language models, trained on paired images and text.
  • Goal: Build an image search pipeline using VLM foundation models (a CLIP retrieval sketch follows this list):
    • Use CLIP for embedding-based text-to-image and image-to-image search
    • Use GIT to generate searchable image captions
    • Use a vision LLM (like LLaVA) for text-to-image search
    • Optional: Use video captioning and keyframes to enable text search within a video
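
A minimal sketch of the CLIP retrieval step, using the Hugging Face transformers CLIP interface; the file names are placeholders, and GIT captioning or a vision LLM would plug in around this:

    # Sketch: embed a small image collection with CLIP, rank it against a query.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = ["cat.jpg", "street.jpg", "beach.jpg"]  # hypothetical collection
    images = [Image.open(p).convert("RGB") for p in paths]

    with torch.no_grad():
        img_feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
        txt_feats = model.get_text_features(
            **processor(text=["a photo of a beach"], return_tensors="pt", padding=True))

    # Cosine similarity between the query and every image.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (txt_feats @ img_feats.T).squeeze(0)
    print(paths[scores.argmax()])  # best-matching image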


17 of 24

#3 VLM-based AI Tutor

  • VLM Chat:
    • Fusion of vision and language models that learns from provided books and lecture materials.
    • Trained to be interactive in chat
  • Goal: Build a VLM-based chat that can answer questions about a particular TUM course, help students prepare for exams, and create cheat sheets, mind maps, etc. (a minimal chat UI sketch follows this list):
    • Use open-source frameworks to host models such as DeepSeek, LLaMA, or Mistral.
    • Use our compute to fine-tune them on related books/papers.
    • Use cleaned lecture materials as mandatory context in the chat
    • Build your chat UI as a Gradio app
    • (Optional) Connect the VLM chat to a RAG pipeline (open-source framework)
    • (Optional) Fine-tune it to produce artifacts such as PDFs, slides, and mind maps.
  • Future:
    • We will help you host it internally and make it available to all TUM students.
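
A minimal sketch of the Gradio chat front end; it assumes the model is served behind an OpenAI-compatible endpoint (as open-source servers like vLLM provide), and the URL, model name, and system prompt are placeholders:

    # Sketch: Gradio chat UI backed by a locally hosted open-source LLM.
    import gradio as gr
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

    def respond(message, history):
        messages = [{"role": "system", "content": "You are a tutor for a TUM course."}]
        for m in history:  # history arrives as OpenAI-style role/content dicts
            messages.append({"role": m["role"], "content": m["content"]})
        messages.append({"role": "user", "content": message})
        out = client.chat.completions.create(model="my-course-llm", messages=messages)
        return out.choices[0].message.content

    gr.ChatInterface(respond, type="messages").launch()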


18 of 24

#4 Image-Editing with Foundation Models

  • Foundation models power diverse image editing techniques in modern smartphones' camera software.
  • Different models like Stable Diffusion, Segment Anything (SAM), and Depth Anything enable realistic image generation, object segmentation, and depth prediction, respectively.
  • Goal: Develop the image editing features described above (a background-blur sketch follows this list):
    • Implement background blurring
    • Implement object removal (magic eraser)
    • Implement background replacement (backdrop)
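
As a hint for the background-blur feature, a minimal compositing sketch; it assumes you already have a binary foreground mask (e.g., predicted by SAM) and only shows the blend step:

    # Sketch: blur the background given an RGB image and a foreground mask.
    import numpy as np
    from PIL import Image, ImageFilter

    def blur_background(image: Image.Image, mask: np.ndarray, radius: int = 15) -> Image.Image:
        # mask: (H, W) array with 1 = foreground, 0 = background (e.g., from SAM)
        blurred = image.filter(ImageFilter.GaussianBlur(radius))
        alpha = np.repeat(mask[..., None].astype(np.float32), 3, axis=2)
        out = np.asarray(image) * alpha + np.asarray(blurred) * (1.0 - alpha)
        return Image.fromarray(out.astype(np.uint8))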


19 of 24

#5 Video Editing with Foundation Models and Rendering

Setting:

  • Modern segmentation models like SAM2 can track any object throughout a video
  • Modern 6D pose estimation models can track the pose of any object, so we can replace or edit objects with rendered 3D assets
  • Use point tracking to follow individual points through a video

Goal: Build a powerful video editing pipeline with SAM2, FoundationPose, and CoTracker (a SAM2 tracking sketch follows the figure credits below)


[Figures: FoundationPose by NVIDIA, SAM2 by Meta AI, CoTracker by Meta AI]
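
A minimal sketch of promptable video object tracking, following the video-predictor interface documented in Meta's sam2 repository; the config/checkpoint paths, frame directory, and click coordinates are placeholders:

    # Sketch: track one object through a video with SAM2's video predictor.
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor(
        "configs/sam2.1/sam2.1_hiera_s.yaml", "checkpoints/sam2.1_hiera_small.pt")

    with torch.inference_mode():
        state = predictor.init_state(video_path="video_frames/")  # directory of JPEG frames
        # Prompt frame 0 with a single positive click on the target object.
        predictor.add_new_points_or_box(
            state, frame_idx=0, obj_id=1,
            points=np.array([[300, 200]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32))
        # Propagate the mask through the rest of the video.
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            masks = (mask_logits > 0.0).cpu().numpy()  # one binary mask per object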

20 of 24

#6 Search in a 3D Room

Setting:

  • DUSt3R and similar models allow for accurate dense 3D reconstruction.
  • Build the reconstruction with CLIP embeddings or similar.
  • Store VLM descriptions in the 3D reconstruction.

Goal: Build a search engine for 3D scans of your room (see the query sketch below).


[Figure: CLIP by OpenAI]
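
A minimal sketch of the query side; it assumes (hypothetically) that per-point CLIP features have already been back-projected onto the DUSt3R reconstruction, and only shows the text-query ranking:

    # Sketch: rank 3D points against a text query via CLIP text features.
    import numpy as np
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def query_room(text, points, feats, top_k=100):
        # points: (N, 3) reconstruction; feats: (N, 512) L2-normalized CLIP features
        inputs = processor(text=[text], return_tensors="pt", padding=True)
        with torch.no_grad():
            q = model.get_text_features(**inputs)
        q = torch.nn.functional.normalize(q, dim=-1).numpy()[0]
        sims = feats @ q  # cosine similarity, both sides normalized
        return points[np.argsort(-sims)[:top_k]]  # 3D locations of best matches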

21 of 24

#7 Unsupervised Point Cloud Segmentation

Setting:

  • Advances in SSL have enabled expressive self-supervised point cloud representations
  • These representations can be employed for unsupervised scene understanding (e.g., segmentation)

Goal: Build a foundation model approach for unsupervised segmentation of point clouds in the wild (a feature-clustering sketch follows the reference below).


Wu et al., “Sonata: Self-supervised Learning of Reliable Point Representations”, CVPR 2025
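
One possible baseline (our assumption, not a prescribed method): cluster per-point SSL features, e.g. from a pretrained Sonata-style encoder, into pseudo-segments:

    # Sketch: unsupervised segmentation by clustering per-point SSL features.
    import numpy as np
    from sklearn.cluster import KMeans

    def segment_point_cloud(feats, n_segments=8):
        # feats: (N, D) per-point features from a pretrained encoder (assumed given)
        feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        return KMeans(n_clusters=n_segments, n_init=10).fit_predict(feats)  # (N,) labels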

22 of 24

#8 Bringing Unsupervised Whole-Image Segmentation to Videos

Setting:

  • Whole-image segmentation has been driven by large-scale foundation models (cf. Segment Anything)
  • Recently, UnSAM, a large-scale unsupervised approach, has been proposed
  • Unsupervised whole-video segmentation is still largely unexplored

Goal: Finetune UnSAM for unsupervised whole-video segmentation


Wang et al. “Segment Anything without Supervision”, NeurIPS, 2024

23 of 24

#9 Build a Kicker Tracker and Commentator

Setting:

  • A certain research group at TUM has a private kicker (table football) room.
  • This group needs software to track the ball and analyze the game.
  • Techniques like point tracking, SAM2, and others can be used to track the game.
  • Use Whisper, LLMs, and text-to-speech to interact with the players.

Goal: Build software that automatically analyzes the game and generates live commentary (a Whisper transcription sketch follows below).
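
For the speech side, a minimal transcription sketch with OpenAI's open-source whisper package (pip install openai-whisper); the audio file name is a placeholder:

    # Sketch: transcribe player speech to text for the LLM commentator.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("table_mic.wav")  # hypothetical recording
    print(result["text"])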


24 of 24

What’s next?

Teams & Project Preferences

  • Join the Matrix group
  • Find teammates
  • Submit team preferences via the provided Google Form by May 5th, 23:59.

Initial Presentation

  • Reach out to your respective tutor once the projects have been assigned.
  • Submit a draft of the slides by the evening before the presentation (May 18th, 23:59).

Questions?
