End-To-End Finetuning of Diffusion Models
Main idea: Finetune a pretrained diffusion model end-to-end, for instance with reinforcement learning [1], without requiring backpropagation through the sampling chain; see the sketch below.
Motivation:
Related reading: [1] Black et al., "Training Diffusion Models with Reinforcement Learning" (https://rl-diffusion.github.io/)
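As a concrete illustration, here is a minimal sketch of the policy-gradient idea from [1] on a toy 2D diffusion model: trajectories are sampled without tracking gradients through the chain, and only the per-step log-probabilities are reweighted by a reward. The denoiser architecture, noise schedule, and reward_fn are illustrative assumptions, not part of the project definition.

```python
# Minimal REINFORCE-style sketch in the spirit of [1], on a toy 2D diffusion
# model. Trajectories are sampled without gradients flowing through the
# chain; only per-step log-probabilities are reweighted by the reward.
import torch
import torch.nn as nn

T = 50                                    # number of denoising steps
betas = torch.linspace(1e-4, 0.1, T)      # toy noise schedule (assumption)
alphas = 1.0 - betas
alphabars = torch.cumprod(alphas, dim=0)

denoiser = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def reward_fn(x0):
    # Placeholder reward: prefer samples near (1, 1). In practice this would
    # be, e.g., an aesthetic or human-preference score.
    return -((x0 - 1.0) ** 2).sum(dim=-1)

for step in range(100):
    x = torch.randn(256, 2)               # start from pure noise
    log_prob = torch.zeros(256)
    for t in reversed(range(T)):
        t_emb = torch.full((256, 1), t / T)
        eps = denoiser(torch.cat([x, t_emb], dim=-1))
        # DDPM posterior mean of the reverse step.
        mean = (x - betas[t] / (1 - alphabars[t]).sqrt() * eps) / alphas[t].sqrt()
        dist = torch.distributions.Normal(mean, betas[t].sqrt())
        x = dist.sample()                  # detached: no backprop through the chain
        log_prob = log_prob + dist.log_prob(x).sum(dim=-1)
    r = reward_fn(x)
    # REINFORCE: weight each trajectory's log-likelihood by its centered reward.
    loss = -(log_prob * (r - r.mean())).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```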
Contact: Jan Hendrik Metzen (janhendrik.metzen@de.bosch.com)
LLM-based Operational Design Domain Definition
Main idea: Automatically collect a large set of corner-case descriptions from a large language model
Motivation:
Related reading: [1] Metzen et al., "Identification of Systematic Errors of Image Classifiers on Rare Subgroups" (https://arxiv.org/abs/2303.05072)
[2] Tong et al., "Mass-Producing Failures of Multimodal Systems with Language Models" (https://arxiv.org/abs/2306.12105)
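A minimal sketch of how such a collection loop could look, in the spirit of [2], assuming the openai Python client; the model name, prompt, and batch count are illustrative placeholders:

```python
# Hedged sketch: mass-producing corner-case descriptions with an LLM.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "List 20 rare but safety-relevant corner cases for an urban automated "
    "driving perception system. One short description per line."
)

def collect_corner_cases(n_batches: int = 5) -> list[str]:
    cases = []
    for _ in range(n_batches):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # placeholder model choice
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,       # high temperature for diverse outputs
        )
        text = resp.choices[0].message.content
        cases += [line.strip() for line in text.splitlines() if line.strip()]
    return sorted(set(cases))      # deduplicate across batches

print("\n".join(collect_corner_cases()[:10]))
```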
Contact: Jan Hendrik Metzen (janhendrik.metzen@de.bosch.com)
Impact of Generative Data Augmentation with Diffusion Models on Semantic Segmentation
Main Idea:
Diffusion models such as Stable Diffusion are now proficient at creating high-quality, photorealistic images from text prompts and have been shown to enhance image classification via generative data augmentation [1]. However, their impact on other vision tasks, such as semantic segmentation, has not been thoroughly evaluated, since models capable of controlling image layouts, such as ControlNet [2] and FreestyleNet, have only recently become available. Moreover, these methods often deviate spatially from their input layout conditions, which limits their use for generative data augmentation. This project aims to quantitatively assess the alignment fidelity of different methods that condition the diffusion image generation process on layouts, and to examine how generative data augmentation can enhance other vision tasks, particularly semantic segmentation.
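One way to quantify alignment fidelity, sketched below under assumptions: generate an image from each ground-truth layout, segment it with a pretrained segmentation model, and compute the mean IoU against the conditioning layout. `generate_from_layout` and `segment` are hypothetical placeholders for, e.g., a ControlNet pipeline and an off-the-shelf segmenter.

```python
# Hedged sketch of an alignment-fidelity measurement: mIoU between the
# conditioning layout and a segmentation of the generated image.
import numpy as np

def miou(layout: np.ndarray, pred: np.ndarray, num_classes: int) -> float:
    """Mean IoU between two integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(layout == c, pred == c).sum()
        union = np.logical_or(layout == c, pred == c).sum()
        if union > 0:                      # skip classes absent in both maps
            ious.append(inter / union)
    return float(np.mean(ious))

def alignment_fidelity(layouts, prompts, num_classes):
    scores = []
    for layout, prompt in zip(layouts, prompts):
        image = generate_from_layout(layout, prompt)  # hypothetical: ControlNet etc.
        pred = segment(image)                         # hypothetical: pretrained segmenter
        scores.append(miou(layout, pred, num_classes))
    return float(np.mean(scores))
```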
Related reading: [1] Azizi, Shekoofeh, et al. "Synthetic data from diffusion models improves imagenet classification." arXiv preprint arXiv:2304.08466 (2023).
[2] Zhang, Lvmin, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." arXiv preprint arXiv:2302.05543 (2023).
Contact: Julio Borges (Julio.Borges@de.bosch.com)
Self-Diagnostics of Diffusion Models
Main idea:
Not all samples are equal! Although diffusion models can generate stunning images, sometimes they fail comically. Is it possible to automatically detect failures by examining the diffusion process itself? You can help answer that! The goal of the thesis is to investigate whether various mechanisms derived from the latent diffusion process itself [1], such as self-attention, cross-attention, and Score Distillation Sampling [2], can be used for self-diagnostics.
Motivation:
Existing work has shown that diffusion models can self-correct images when provided with a mask covering unnatural-looking parts. This suggests that the model innately understands that parts of the image have a more probable alternative. Making use of this knowledge to automate error detection would be an essential part of any pipeline that aims to create high-quality images at scale.
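As a starting point, the sketch below probes one such signal: re-noise an image and measure how well the model predicts the injected noise, the quantity underlying Score Distillation Sampling [2]. A sample whose prediction error is far above its peers is a candidate failure. The model id, timestep range, and trial count are illustrative assumptions.

```python
# Hedged sketch of a denoising-error self-diagnostic with diffusers.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed model
).to(device)

@torch.no_grad()
def denoising_error(image: torch.Tensor, prompt: str, n_trials: int = 8) -> float:
    """image: (1, 3, 512, 512) in [-1, 1]. Returns mean noise-prediction MSE."""
    tok = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt")
    text_emb = pipe.text_encoder(tok.input_ids.to(device))[0]
    latents = pipe.vae.encode(image.half().to(device)).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    errs = []
    for _ in range(n_trials):
        t = torch.randint(100, 900, (1,), device=device)  # mid-range timesteps
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
        errs.append(torch.mean((pred - noise) ** 2).item())
    return sum(errs) / len(errs)

# A sample whose error is far above the batch average is a candidate failure.
```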
Related reading: [1] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." (CVPR 2022)
[2] Poole, Ben et al. “DreamFusion: Text-to-3D using 2D Diffusion.” (ICLR 2023)
Contact: Jiayi Wang (Jiayi.Wang2@de.bosch.com)
"An astronaut riding a horse in space"
Robust Open-set Learning meets Synthetic Data Augmentation
Motivation:
We have witnessed impressive progress in content generation, with and without control. It is not difficult to generate high-quality images; the challenge is to do so constantly and consistently. While waiting for the next release of generative models, can we try to harvest more from the current ones?
Main idea:
The ultimate goal of using synthetic data is to treat it on a par with human-annotated and curated data. However, while the quality of synthetic data is not yet consistently high and is hard to filter beforehand, how can we make use of it during training? Perhaps we can borrow ideas from robust open-set learning, which aims at learning from a partially labelled dataset that may itself contain unknown outliers [1].
We will practice this idea in the context of improving a model's out-of-distribution generalization performance (e.g., being aware of unknowns when solving computer vision tasks, detecting unknown objects/novel instances, etc.).
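A minimal sketch of this recipe, with a simple max-softmax confidence standing in for the one-vs-all outlier heads of OpenMatch [1]; the threshold tau and the soft weighting are illustrative choices:

```python
# Hedged sketch: synthetic images enter the supervised loss only if an
# outlier score accepts them; accepted samples are weighted by confidence.
import torch
import torch.nn.functional as F

def robust_training_step(model, real_x, real_y, synth_x, synth_y,
                         optimizer, tau: float = 0.7):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(real_x), real_y)      # trusted real data

    with torch.no_grad():
        conf = F.softmax(model(synth_x), dim=1).max(dim=1).values
    keep = conf > tau                                  # reject likely outliers
    if keep.any():
        # Soft variant: weight accepted synthetic samples by confidence
        # instead of a hard 0/1 decision.
        per_sample = F.cross_entropy(model(synth_x[keep]), synth_y[keep],
                                     reduction="none")
        loss = loss + (conf[keep] * per_sample).mean()

    loss.backward()
    optimizer.step()
    return loss.item(), keep.float().mean().item()     # loss and acceptance rate
```

Returning the acceptance rate alongside the loss makes it easy to monitor how much synthetic data actually survives the filter during training.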
Related reading: [1] Kuniaki Saito, Donghyun Kim, Kate Saenko, "OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers", https://arxiv.org/abs/2105.14148
Contact: Dan Zhang (Dan.Zhang2@de.bosch.com or Dan.Zhang@wsii.uni-tuebingen.de)
Geometry-guided Open-World Learning
Motivation:
When performing tasks like classification and object detection, neural networks can make use of many fine-grained, local details. In contrast, humans look at more holistic information, such as geometric cues. Both kinds of cues provide some generalization capability. Can we exploit them jointly?
Main idea:
It has been observed that Stable Diffusion has a good sense of geometry, which enables, for instance, 3D synthesis and viewpoint changes. In parallel, the UNet of Stable Diffusion has been shown to be a strong feature extractor [1] that can serve as the backbone for object detection, semantic segmentation, and even depth estimation. In our prior work [2], we have shown that geometric cues greatly help open-world object detection. Can we unleash more of this potential through generative modelling?
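A minimal sketch of such a joint exploitation, assuming a monocular depth estimator (here MiDaS via torch.hub) as the geometric cue and an arbitrary appearance backbone; the fusion module is an illustrative placeholder rather than the GOOD [2] architecture:

```python
# Hedged sketch: fuse a depth cue into appearance features from any backbone.
import torch
import torch.nn as nn

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # depth estimator
midas.eval()

class GeometryFusion(nn.Module):
    """Fuses a 1-channel depth cue into C-channel appearance features."""
    def __init__(self, feat_channels: int):
        super().__init__()
        self.depth_proj = nn.Conv2d(1, feat_channels, kernel_size=1)
        self.mix = nn.Conv2d(2 * feat_channels, feat_channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Note: proper MiDaS preprocessing (resize/normalization) is omitted
        # here for brevity.
        with torch.no_grad():
            depth = midas(image).unsqueeze(1)            # (B, 1, h, w)
        depth = nn.functional.interpolate(depth, size=feats.shape[-2:])
        return self.mix(torch.cat([feats, self.depth_proj(depth)], dim=1))
```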
Related reading: [1] W. Zhao et al., "Unleashing Text-to-Image Diffusion Models for Visual Perception" (ICCV 2023), https://arxiv.org/abs/2303.02153
[2] Haiwen Huang, Andreas Geiger, and Dan Zhang, “GOOD: Exploring geometric cues for detecting objects in an open world” (ICLR’2023) https://arxiv.org/abs/2212.11720
Contact: Haiwen Huang (Haiwen.Huang@uni-tuebingen.de) and Dan Zhang (Dan.Zhang@wsii.uni-tuebingen.de)
Using Multi-Modal Data to Improve Out-of-Distribution Detection
Motivation:
Computer vision-based solutions at Bosch often need to identify anomalous objects or scenarios. These may include anomalous scenarios inside a building, such as fire, and anomalous objects in driving scenes, such as rare objects.
Main idea:
In this internship, the student will tackle the challenge of pushing beyond the limitations of traditional computer vision techniques by incorporating multi-modal data. Recent frameworks such as ImageBind [1] have been proposed, but much more exploration is needed. The student will explore fusing visual information with other sensory inputs such as audio, depth, heat maps, or motion, helping to pave the way for more robust and accurate anomaly detection systems. This work may contribute to real-world applications across various domains at Bosch, from surveillance to automated driving solutions.
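As a starting point, the sketch below scores anomalies by distance to in-distribution prototypes in a joint multi-modal embedding space such as the one ImageBind [1] provides; `embed` is a hypothetical stand-in for the actual encoder, and the weighting w is an illustrative choice:

```python
# Hedged sketch: prototype-based anomaly scoring in a joint embedding space.
import torch
import torch.nn.functional as F

def build_prototypes(embed, normal_images, normal_audio):
    """Average in-distribution embeddings per modality into one prototype."""
    img = F.normalize(embed("vision", normal_images), dim=-1).mean(dim=0)
    aud = F.normalize(embed("audio", normal_audio), dim=-1).mean(dim=0)
    return F.normalize(img, dim=-1), F.normalize(aud, dim=-1)

def anomaly_score(embed, image, audio, prototypes, w: float = 0.5):
    """Higher score = further from normal data; w balances the modalities."""
    img_p, aud_p = prototypes
    img_e = F.normalize(embed("vision", image), dim=-1)
    aud_e = F.normalize(embed("audio", audio), dim=-1)
    s_img = 1.0 - (img_e @ img_p)      # cosine distance to visual prototype
    s_aud = 1.0 - (aud_e @ aud_p)      # cosine distance to audio prototype
    return w * s_img + (1.0 - w) * s_aud
```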
This master internship presents an extraordinary opportunity to work alongside our team of experts in the area. Collaborating closely with our researchers and engineers, you will gain invaluable mentorship and guidance as you explore this new territory.
Related reading: [1] Girdhar, Rohit, et al. "ImageBind: One Embedding Space To Bind Them All." (CVPR 2023), https://arxiv.org/abs/2305.05665
Contact: Grace Hua (grace.hua@de.bosch.com)